Lessons learned from the Microsoft SOC—Part 2: Organizing people

April 23, 2019

In the second post in our series, we focus on the most valuable resource in the security operations center (SOC)—our people. This series is designed to share our approach and experience with operations, so you can use what we learned to improve your SOC. In Part 1: Organization, we covered the SOC’s organizational role and mission, culture, and metrics.

The lessons in the series come primarily from Microsoft’s corporate IT security operation team, one of several specialized teams in the Microsoft Cyber Defense Operations Center (CDOC). We also include lessons our Detection and Response Team (DART) have learned helping our customers respond to major incidents.

People are the most valuable asset in the SOC—their experience, skill, insight, creativity, and resourcefulness are what makes our SOC effective. Our SOC management team spends a lot of time thinking about how to ensure our people are set up with what they need to succeed and stay engaged. As we’ve improved our processes, we’ve been able to decrease the time it takes to ramp people up and increase employee enjoyment of their jobs.

Today, we cover the first two aspects of how to set up people in the SOC for success:

  • Empower humans with automation.
  • Microsoft SOC teams and tiers model.

Empower humans with automation

Rapidly sorting out the signal (real detections) from the noise (false positives) in the SOC requires investing in both humans and automation. We strongly believe in the power of automation and technology to reduce human toil, but ultimately, we’re dealing with human attack operators and human judgement is critical to the process.

In our SOC, automation is not about using efficiency to remove humans from the process—it is about empowering humans. We continuously think about how we can automate repetitive tasks from the analyst’s job, so they can focus on the complex problems that people are uniquely able to solve.

Automation empowers humans to do more in the SOC by increasing response speed and capturing human expertise. The toil our staff experiences comes mostly from repetitive tasks and repetitive tasks come from either attackers or defenders doing the same things over and over. Repetitive tasks are ideal candidates for automation.

We also found that we need to constantly refine the automation because attackers are creative and persistent, constantly innovating to avoid detections and preventive controls. When an effective attack method is identified (like phishing), they exploit it until it stops working. But they also continually innovate new tactics to evade defenses introduced by the cybersecurity community. Given the profit potential of attacks, we expect the challenges of evolving attacks to continue for the foreseeable future.

When repetitive and boring work is automated, analysts can apply more of their creative minds and energy to solving the new problems that attackers present to them and proactively hunting for attackers that got past the first lines of defense. We’ll discuss areas where we use automation and machine learning in “Part 3: Technology.”

Microsoft SOC teams and tiers model

At Microsoft, we organized our SOC into specialized teams, allowing them to better develop and apply deep expertise, which supports the overall goals of reducing time to acknowledge and remediate.

This diagram represents the key SOC functions: threat intelligence, incident management, and SOC analyst tiers:

Image showing key SOC functions: threat intelligence, incident management, and SOC analysts (tiers 1, 2, and 3).

Threat intelligence—We have several threat intelligence teams at Microsoft that support the SOC and other business functions. Their role is to both inform business stakeholders of risk and provide technical support for incident investigations, hunting operations, and defensive measures for known threats. These strategic (business) and tactical (technical) intelligence goals are related but distinctly different from each other. We task different teams for each goal and ensure processes are in place (such as daily standup meetings) to keep them in close contact.

Incident management—Enterprise-wide coordination of incidents, impact assessment, and related tasks are handled by dedicated personnel separate from technical analyst teams. At Microsoft, these incident response teams work with the SOC and business stakeholders to coordinate actions that may impact services or business units. Additionally, this team brings in legal, compliance, and privacy experts as needed to consult and advise on actions regarding regulatory aspects of incidents. This is particularly important at Microsoft because we’re compliant with a large number of international standards and regulations.

SOC analyst tiers—This three-tier model for SOC analysts will probably look familiar to seasoned SOC professionals, though there are some subtleties in our model we don’t see widely in the industry.

Image showing Microsoft's Corporate IT SOC tiers and tools: alert queue (hot path), proactive hunt (cold path), tiers 3, 2, and 1, and finally, automation.

Our organization uses the term hot path and cold path to describe how we discover adversaries and optimize processes to handle them.

  • Hot path—Reflects detection of confirmed active attack activity that must be investigated and remediated as soon as possible. Managing and remediating these incidents are primarily handled by Tier 1 and Tier 2, though a small percentage (about 4 percent) are escalated to Tier 3. Automation of investigations and remediations are also beginning to help reduce hot path workloads.
  • Cold path—Refers to all other activities including proactively hunting for adversary campaigns that haven’t triggered a hot path alert.

Roles and functions of the SOC analyst tiers

Tier 1—This team is the primary front line for and focuses on high-speed remediation over a large volume of incidents. Tier 1 analysts respond to a very specific set of alert sources and follow prescriptive instructions to investigate, remediate, and document the incidents. The rule of thumb for alerts that Tier 1 handles is that it can be typically remediated within seconds to minutes. The incidents will be escalated to Tier 2 if the incident isn’t covered by a documented Tier 1 procedure or it requires involved/advanced remediation (for example, device isolation and cleanup).

In addition:

  • The Tier 1 function is currently performed by full-time employees in our corporate IT SOC. In the past and in other teams at Microsoft, we staffed contracted employees or managed service agreements for Tier 1 functions.
  • A current initiative for the full-time employee Tier 1 team is to increase the use of automated investigation and remediation for these incidents. One goal of this initiative is to grow the skills of our current Tier 1 employees, so they can shift to proactive work in other security assignments in SOC or across the company.
  • Tier 1 (and Tier 2) SOC analysts may stay involved with an escalated incident until it is remediated. This helps preserve context during and after transferring ownership of an incident and also accelerates their learning and skills growth.
  • The typical ratio of alert volumes is noted in the Tiers and Tools diagram above. (We’ll share more details in “Part 3: Technology.”)

Tier 2—This team is focused on incidents that require deeper analysis and remediation. Many Tier 2 incidents have been escalated from Tier 1 analysts, but Tier 2 also directly monitors alerts for sensitive assets and known attacker campaigns. These incidents are usually more complex and require an approach that is still structured, but much more flexible than Tier 1 procedures. Additionally, some Tier 2 analysts also proactively hunt for adversaries (typically using lower priority alerts from the same Microsoft Threat Protection tools they use to manage reactive incidents).

Tier 3—This team is focused primarily on advanced hunting and sophisticated analysis to identify anomalies that may indicate advanced adversaries. Most incidents are remediated at Tiers 1 and 2 (96 percent) and only unprecedented findings or deviations from norms are escalated to Tier 3 teams. Tier 3 team members have a high degree of freedom to bring their different skills, backgrounds, and approaches to the goal of ferreting out red team/hidden adversaries. Tier 3 team members have backgrounds as security professionals, data scientists, intelligence analysts, and more. These teams use different tools (Microsoft, custom, and third-party) to sift through a number of different datasets to uncover hidden adversary activity. A favorite of many analysts is the use of Kusto Query Language (KQL) queries across Microsoft Threat Protection tool datasets.

The structure of Tier 3 has changed over time, but has recently gravitated to four different functions:

  • Major incident engineering—Handles escalation of incidents from Tier 2. These virtual teams are created as needed to support the duration of the incident and often include both reactive investigations, as well as proactive hunting for additional adversary presence.
  • External adversary research and threat analysis—Focuses on hunting for adversaries using existing tools and data sources, as well as signals from various external intelligence sources. The team is focused on both hunting for undiscovered adversaries as well as creating and refining alerts and automation.
  • Purple team operations—A temporary duty assignment where Tier 3 analysts (blue team) are paired with our in-house attack team members (red team) as they perform authorized attacks. We found this purple (red+blue) activity results in high-value learning by both teams, strengthening our overall security posture and resilience. This team is also responsible for the critical task of coordinating with red team to deconflict whether a detection is an authorized red team or a live attacker. At customer organizations, we’ve seen failure to deconflict red team activity result in our DART teams flying onsite to respond to a false alarm (an avoidable, expensive, and embarrassing mistake).
  • Future operations team—Focuses on future-proofing our technology and processes by building and testing new capabilities.

Learn more

For more insights into Microsoft’s approach to using technology to empower people, watch Ann Johnson’s keynote at RSA 2019 and download our poster. For information on organizational culture and goals, read Lessons learned from the Microsoft SOC—Part 1: Organization. In addition, see our CISO series to learn more.

Stayed tuned for the second segment in “Lessons learned from the Microsoft SOC—Part 2,” where we’ll cover career paths and readiness programs for people in our SOC. And finally, we’ll wrap up this series with “Part 3: Technology,” where we’ll discuss the technology that enables our people to accomplish their mission.

For more discussion on some of these topics, see John and Kristina’s session (starting at 1:05:48) at Microsoft’s recent Virtual Security Summit.