Security companies regularly extol the use of data science, machine learning and artificial intelligence in their products. There are important conceptual differences when applying these approaches to different cybersecurity use cases.
To start with, I want to clarify the meaning behind these terms.
- Data science involves using fancy math to extract knowledge from data, either via machine learning or using more straightforward data analytics techniques.
- Machine learning involves employing an algorithm to construct a model that is trained to recognize patterns by feeding (usually large amounts of) data into it. Models may be supervised (fed labeled data and usually tuned by a data scientist) or unsupervised (trained on unlabeled data and identifying anomalies and outliers).
- Artificial intelligence is a more abstract concept that can involve techniques like machine learning but can also include approaches such as interviewing subject matter experts and writing code to approximate their thought processes (we used to call this “expert systems”).
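The supervised/unsupervised distinction above can be sketched in a few lines of code. This is a minimal, illustrative example using toy numeric feature vectors and the standard library only; the data, the nearest-centroid classifier, and the distance-from-mean outlier score are all simplifications chosen for brevity, not a real product's approach.

```python
# Toy sketch of the supervised vs. unsupervised distinction.
# Feature vectors and labels here are hypothetical.

def train_supervised(labeled):
    """Supervised: learn a per-class mean ('centroid') from (features, label) pairs."""
    sums, counts = {}, {}
    for feats, label in labeled:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, v in enumerate(feats):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lbl: [v / counts[lbl] for v in acc] for lbl, acc in sums.items()}

def classify(centroids, feats):
    """Predict the label whose centroid is closest (Euclidean distance)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(centroids, key=lambda lbl: dist(centroids[lbl], feats))

def outlier_scores(unlabeled):
    """Unsupervised: score each point by its distance from the overall mean."""
    n = len(unlabeled)
    mean = [sum(col) / n for col in zip(*unlabeled)]
    return [sum((x - m) ** 2 for x, m in zip(p, mean)) ** 0.5 for p in unlabeled]
```

The supervised path needs labeled examples up front; the unsupervised path only ranks points by how far they sit from the rest, which is why it surfaces anomalies rather than named classes.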
Because artificial intelligence can be a somewhat abstract concept, I will focus on the application of data analytics and machine learning to the threat detection and incident investigation use cases.
Data science is broadly applicable to a number of different threat detection problems:
- Supervised machine learning models can be trained by ingesting data extracted from malware execution and benign software execution in an effort to learn the subtle difference between malware and benign programs.
Most Endpoint Protection Platform (EPP) products have some threat detection features that are constructed in this manner.
- Supervised machine learning models can be trained by ingesting data extracted from network traffic in an effort to identify particular attacker behaviors.
Network traffic analysis (NTA) products can use machine learning models trained on the time series data of packets sent and received to identify when machines are being remotely controlled by an external entity.
- Unsupervised machine learning models can be trained using global or per-customer data and then used to find outliers and anomalies relative to the trained model.
User and entity behavioral analytics (UEBA) products broadly observe patterns of behavior to find anomalies and outliers that may be indicators of a cyberattack.
NTA products can use unsupervised machine learning to identify specific threatening behaviors, like remote execution that looks suspicious given the previously observed patterns of benign remote execution (e.g. system updates).
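The remote-execution example above can be made concrete with a small sketch: learn a baseline of when benign remote executions (such as update windows) occur, then flag executions at hours rarely seen in that baseline. The hour-of-day feature and the threshold are hypothetical choices for illustration; real NTA products model far richer features.

```python
# Hedged sketch: flag remote executions whose hour of day was (almost)
# never observed in the benign baseline. Feature choice and threshold
# are illustrative assumptions, not any product's actual logic.
from collections import Counter

def build_baseline(benign_exec_hours):
    """Count how often benign remote execution occurred in each hour (0-23)."""
    return Counter(benign_exec_hours)

def is_suspicious(baseline, hour, min_seen=2):
    """Flag an execution at an hour seen fewer than min_seen times as benign."""
    return baseline[hour] < min_seen
```

A remote execution at 14:00 on a host whose benign baseline is nightly update windows would be flagged, while one at 02:00 would not.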
Threat detection often presents a trade-off between coverage and noise. Any such product, whether EPP, NTA or UEBA, offers the potential for threat detection but also the potential for noise.
Given the size of the haystack and the relative rarity of needles in it, methods that look for a confluence of multiple suspicious factors before presenting a detection are often employed to improve the signal-to-noise ratio.
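The "confluence of multiple suspicious factors" idea can be sketched as a simple weighted score: each individual indicator is too noisy to alert on alone, so an alert fires only when enough of them co-occur. The indicator names, weights, and threshold below are hypothetical.

```python
# Sketch of confluence-based alerting: several weak indicators must
# co-occur before an alert fires. Names and weights are assumptions.
INDICATORS = {
    "rare_parent_process": 1.0,
    "unsigned_binary": 1.0,
    "beaconing_traffic": 2.0,
    "new_external_domain": 1.5,
}

def alert(observed, threshold=3.0):
    """Alert only when the combined weight of observed indicators is high enough."""
    score = sum(INDICATORS.get(name, 0.0) for name in observed)
    return score >= threshold
```

A single unsigned binary stays below the threshold; beaconing traffic to a newly seen external domain crosses it. Raising the threshold trades coverage for a better signal-to-noise ratio, which is exactly the trade-off described above.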
An incident investigation is inherently different from the problem presented by generalized threat detection.
An incident investigation begins with a particular premise such as a machine that is communicating with an external entity in what appears to be a command-and-control (C&C) channel and progresses from there.
As the inquiry is much more focused, noise is less of a concern and the data science must attempt to (a) highlight data that supports or refutes the premise and (b) if the premise is proven, provide additional avenues the investigation should pursue.
Supporting or refuting the premise of the investigation presents one opportunity for data science. In the example above, if the investigator believes that a machine is communicating on a C&C channel, the natural lines of inquiry lead in two directions:
- What is known about the external entity? When was its domain registered? Where is the IP address located? Which organization does the IP belong to? Is any other machine in the organization's network communicating with the entity? Is there any threat intel available about the entity? Little of this involves data science.
- How can the behavior of the machine around the potential C&C communications be characterized? Beyond the potential C&C communication, is there any other observable change in the machine’s behavior before or after the event in question?
This comes down to characterizing the machine’s behavior in relation to its prior observed baseline or in relation to the behavior observed among a set of similar machines. Data science – both data analytics and machine learning – can help advance this line of inquiry.
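Comparing a machine against its own prior baseline can be sketched with a z-score on a single metric, such as bytes sent per hour; real products baseline many metrics at once, so treat this as a one-dimensional illustration.

```python
# Sketch: characterize current behavior against the machine's own
# baseline via a z-score. The metric (bytes sent per hour) is an
# illustrative assumption.
import statistics

def z_score(baseline_values, current):
    """How many standard deviations the current value sits from the baseline."""
    mean = statistics.mean(baseline_values)
    stdev = statistics.stdev(baseline_values)
    return (current - mean) / stdev if stdev else 0.0
```

A reading several standard deviations above the machine's historical norm, around the time of the suspected C&C communication, supports the premise; a reading within the norm weakens it. The same comparison can be run against a peer group of similar machines instead of the machine's own history.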
Once the premise is proven, the next issue is that the machine in question might just be the tip of the iceberg. What other machines might be involved?
At this stage, the data science needs to find the unusual behaviors of the compromised asset, no matter how innocuous they seem, and look for other machines that might have been targeted by those behaviors or for machines that exhibit “similar” behavior.
While searches for indicators of compromise (IOCs) are often very precise (e.g. look for any machine that uses a particular user-agent), data science can help where the pattern you’re looking for is less precise, namely where the search is for something that is just similar.
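One way to make the "similar, not identical" search concrete is Jaccard similarity over the set of behaviors observed per machine. The behavior names, hostnames, and threshold below are hypothetical; the point is that overlap-based scoring finds near-matches that an exact IOC search would miss.

```python
# Sketch: fuzzy "similar behavior" search via Jaccard similarity over
# per-machine behavior sets. Behavior names and hosts are hypothetical.

def jaccard(a, b):
    """Overlap of two behavior sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def similar_machines(target_behaviors, machines, threshold=0.5):
    """Return names of machines whose behavior sets resemble the target's."""
    return [name for name, behaviors in machines.items()
            if jaccard(target_behaviors, behaviors) >= threshold]
```

A host sharing two of the compromised machine's three unusual behaviors scores 2/3 and is surfaced, even though no single exact indicator matches.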
The rule of thumb I apply to data analytics and machine learning is simple: understand the use case, identify the data available to pursue it, and determine whether it is a needle-in-a-haystack problem or a follow-the-bouncing-ball problem. In doing so, you are less likely to be led astray by someone who has a hammer and insists that it can be used to drive a screw into a piece of wood.
This article is published as part of the IDG Contributor Network.