Kill SOC Toil, Do SOC Eng

As you are reading our recent paper “Autonomic Security Operations — 10X Transformation of the Security Operations Center”, some of you may think “Hey, marketing inserted that 10X thing in there.”

Well, 10X thinking is, in fact, an ancient tradition here at Google. We think that it is definitely possible to apply “10X thinking” to many areas of security (at the same link, they say that sometimes it is “easier to make something 10 times better than it is to make it 10 percent better”). However, our beloved domain of cyber is full of skeptics and cynics, as well as well-meaning people who just can’t take the exaggerations anymore…

With this post, I wanted to explore one particular area of 10X possibility. This area is “toil”, an SRE term that is crisply defined in Chapter 5 of the Google SRE book. If you read that short and fun chapter, and then look back at your SOC, you will realize that 100% of what a typical SOC analyst does on a daily basis fits the definition of toil.

Here in this post, I will present the two components of the definition that are the juiciest, in my opinion.

“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

“If your service remains in the same state after you have finished a task, the task was probably toil.”

Does this remind you of SOC analyst work? Well, it is an exact match, no need to write any regexes here…

Now, some of you may say at this point: but Anton, SOC work is inherently like this. Attackers come, alerts trigger, we clear them, adjust, tune, respond, rinse, repeat. If our IT remains in “the same state” after this, it is good, not bad, right?

Well, I bet the sysadmins and IT operations people of the 1990s thought the same when responding to availability incidents: “but our work is inherently like that”, and they were proven wrong by the SREs.

So, let’s talk about how we can make your SOC behave more the way good SRE teams do. But before we go there: where is that 10X?

Well, if you have an increase in attacks, an increase in assets under protection, or an increase in environment complexity, your “toil-based” SOC will need to grow linearly with all those changes. To handle 2X the attacks or a 2X increase in scope (such as cloud added to your SOC coverage), you need 2X the people, and sometimes also 2X the budget to spend on tools.

However, if we really transform the SOC based on the principles we discuss, your effort increase may range from nothing to minimal. Hence, you WILL achieve 10X effectiveness in real life, not on a marketing glossy. The evolution of security operations in general and SOCs in particular is heavily dependent on a drive towards an engineering-first mindset while operating modern, more secure systems at large scale. So, you can’t “ops” your way to SOC success, but you can “dev” your way there, just like we do at Google!
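To make the scaling argument concrete, here is a back-of-the-envelope sketch in Python. All of the numbers (alert volumes, 10 minutes per alert, 240 triage minutes per analyst per day, the automated percentages) are made-up assumptions for illustration, not measurements from any real SOC, and `analysts_needed` is a hypothetical helper.

```python
def analysts_needed(alerts_per_day, minutes_per_alert,
                    automated_percent=0, analyst_minutes_per_day=240):
    """Estimate how many analysts it takes to hand-triage the alerts
    that automation does not absorb. All inputs are illustrative."""
    # Minutes of manual work per day, scaled by 100 to stay in integers.
    manual_minutes_x100 = alerts_per_day * (100 - automated_percent) * minutes_per_alert
    capacity_x100 = 100 * analyst_minutes_per_day
    # Integer ceiling division: a partially loaded analyst is still a hire.
    return -(-manual_minutes_x100 // capacity_x100)


if __name__ == "__main__":
    # Toil-based SOC: every alert is handled by a person.
    print("toil, 1000 alerts/day:", analysts_needed(1000, 10))
    print("toil, 2000 alerts/day:", analysts_needed(2000, 10))
    # Engineering-first SOC: automation coverage grows as the load grows.
    print("90% automated, 1000 alerts/day:", analysts_needed(1000, 10, automated_percent=90))
    print("95% automated, 2000 alerts/day:", analysts_needed(2000, 10, automated_percent=95))
```

With these assumed inputs, doubling the alert volume doubles the toil-based headcount, while raising automation coverage from 90% to 95% as the volume doubles keeps the headcount flat. The engineering work is what moves that percentage.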

So, how can we put these and other SRE lessons to work in your SOC?

First, educate your team on how SRE philosophies can be implemented in the SOC. Find opportunities to do team-building exercises and empower your team to define this cultural transformation. Driving a cultural shift requires an inspired, motivated, and disciplined team, as well as specific skills in this area.

Next, seek to reduce your ops time to 50%, gradually. Try spending the remaining 50% on improving systems and detections with an “automate-first”, engineering mindset. BTW, engineering here is NOT the same as writing code: “Engineering work is novel and intrinsically requires human judgment. It produces a permanent improvement in your service, and is guided by a strategy.”

“Commit to eliminate a bit of toil each week with some good engineering” in your SOC. Here are some SOC examples: tweak that rule that produces non-actionable alerts, write a SOAR playbook to auto-close some alerts, script a test that checks whether log collection is running optimally, etc, etc.
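Two of these examples can be sketched in a few lines of Python. This is a minimal illustration, not a real SOAR playbook: the alert fields (`rule_name`, `src_ip`), the suppression rule, and the 15-minute silence threshold are all hypothetical, and a real version would call your SOAR or SIEM API instead of working on plain dictionaries.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical auto-close rules: a detection name plus known-benign field values.
AUTO_CLOSE_RULES = [
    {"rule_name": "internal_port_scan", "field": "src_ip", "values": {"10.0.0.5"}},
]


def should_auto_close(alert, rules=AUTO_CLOSE_RULES):
    """Return True if the alert matches a known-benign pattern."""
    for rule in rules:
        if (alert.get("rule_name") == rule["rule_name"]
                and alert.get(rule["field"]) in rule["values"]):
            return True
    return False


def triage(alerts, rules=AUTO_CLOSE_RULES):
    """Split alerts into (auto_closed, needs_analyst)."""
    auto_closed, needs_analyst = [], []
    for alert in alerts:
        target = auto_closed if should_auto_close(alert, rules) else needs_analyst
        target.append(alert)
    return auto_closed, needs_analyst


def stale_log_sources(last_seen, now, max_silence=timedelta(minutes=15)):
    """Return the names of log sources that have been silent for too long."""
    return sorted(name for name, ts in last_seen.items() if now - ts > max_silence)


if __name__ == "__main__":
    alerts = [
        {"rule_name": "internal_port_scan", "src_ip": "10.0.0.5"},   # approved scanner
        {"rule_name": "internal_port_scan", "src_ip": "10.0.9.77"},  # unknown host
    ]
    closed, queue = triage(alerts)
    print("auto-closed:", len(closed), "for analysts:", len(queue))

    now = datetime.now(timezone.utc)
    last_seen = {"firewall": now - timedelta(minutes=2),
                 "edr": now - timedelta(hours=3)}
    print("stale sources:", stale_log_sources(last_seen, now))
```

Each rule added to the list is a bit of toil that stays eliminated, which is the difference between closing an alert and engineering it away.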

One route to go is hiring security automation engineers who have operations experience, or who have the ability to ramp up quickly. The right person can set the tone for leading your whole team through the evolution to an “SRE-inspired” SOC.

We think that the largest current and future challenges in Security Operations can be solved with this approach. Otherwise, after 30+ years of SOC work, we will still be facing the same age-old challenges we had in the past (believe it or not, “too many [IDS] alerts” was a SOC challenge in 2002!).

Huge thanks to Iman Ghanizada for his contributions to this post.

Kill SOC Toil, Do SOC Eng was originally published in Anton on Security on Medium, where people are continuing the conversation by highlighting and responding to this story.

*** This is a Security Bloggers Network syndicated blog from Stories by Anton Chuvakin on Medium authored by Anton Chuvakin. Read the original post at: https://medium.com/anton-on-security/kill-soc-toil-do-soc-eng-50f29bfe52bd?source=rss-11065c9e943e——2