At AWS, we encourage you to use automation to help quickly detect and respond to security events within your AWS environments. In addition to increasing the speed of detection and response, automation also helps you scale your security operations as you expand your workloads running on AWS. For these reasons, security automation is a key principle outlined in both the Well-Architected and Cloud Adoption frameworks as well as in the AWS Security Incident Response Guide.
In this blog post, you’ll learn to implement automated security response mechanisms within your AWS environments. This post will include common patterns, implementation considerations, and an example solution. Security response automation is a broad topic that spans many areas. The goal of this blog post is to introduce you to core concepts and help you get started.
A word from our lawyers: Please note that you are responsible for making your own independent assessment of the information in this post. This post: (a) is for informational purposes only, (b) represents current AWS product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers, or licensors.
What is security response automation?
Security response automation is a planned and programmed action taken to achieve a desired state for an application or resource based on a condition or event. When you implement security response automation, you should adopt an approach that draws from existing security frameworks. Frameworks are published materials which consist of standards, guidelines, and best practices in order help organizations manage cybersecurity-related risk. Using frameworks helps you achieve consistency and scalability and enables you to focus more on the strategic aspects of your security program. You should work with compliance professionals within your organization to understand any specific security frameworks that may also be relevant for your AWS environment.
Our example solution is based on the NIST Cybersecurity Framework (CSF), which is designed to help organizations assess and improve their ability to prevent, detect, and respond to security events. According to the CSF, “cybersecurity incident response” supports your ability to contain the impact of potential cybersecurity incidents. Although automation is not a CSF requirement, automating responses to events enables you to create repeatable, predictable approaches to monitoring and responding to threats.
The five main steps in the CSF are identify, protect, detect, respond and recover. We’ve expanded the detect and respond steps to include automation and investigation activities.
The following definitions for each step in the diagram above are based on the CSF but have been adapted for our example in this blog post. Although we will focus on the detect, automate and respond steps, it’s important to understand the entire process flow.
- Identify: Identify and understand the resources, applications, and data within your AWS environment.
- Protect: Develop and implement appropriate controls and safeguards to ensure delivery of services.
- Detect: Develop and implement appropriate activities to identify the occurrence of a cybersecurity event. This step includes the implementation of monitoring capabilities which will be discussed further in the next section.
- Automate: Develop and implement planned, programmed actions that will achieve a desired state for an application or resource based on a condition or event.
- Investigate: Perform a systematic examination of the security event to establish the root cause.
- Respond: Develop and implement appropriate activities to take automated or manual actions regarding a detected security event.
- Recover: Develop and implement appropriate activities to maintain plans for resilience and to restore any capabilities or services that were impaired due to a security event.
Security response automation on AWS
AWS CloudTrail, AWS Config, and Amazon EventBridge continuously record details about the resources and configuration changes in your AWS account. You can use this information to automatically detect resource changes and to react to deviations from your desired state.
As shown in the diagram above, an automated remediation flow on AWS has three stages:
- Monitor: Your automated monitoring tools collect information about resources and applications running in your AWS environment. For example, they might collect AWS CloudTrail information about activities performed in your AWS account, usage metrics from your Amazon EC2 instances, or flow log information about the traffic going to and from network interfaces in your Amazon Virtual Private Cloud (VPC).
- Detect: When a monitoring tool detects a predefined condition—such as a breached threshold, anomalous activity, or configuration deviation—it raises a flag within the system. A triggering condition might be an anomalous activity detected by Amazon GuardDuty, a resource becoming out of compliance with an AWS Config Rule, or a high rate of blocked requests on an Amazon VPC security group or AWS WAF web access control list.
- Respond: When a condition is flagged, an automated response is triggered that performs an action you’ve predefined—something intended to remediate or mitigate the flagged condition. Examples of automated response actions might include modifying a VPC security group, patching an Amazon EC2 instance, or rotating credentials.
You can use the event-driven flow described above to achieve many automated response patterns with varying degrees of complexity. Your response pattern could be as simple as invoking a single AWS Lambda function, or it could be a complex series of AWS Step Function tasks with advanced logic. In this blog post, we’ll use two simple Lambda functions in our example solution.
How to define your response automation
Now that we’ve introduced the concept of security response automation, start thinking about security requirements within your environment that you’d like to enforce through automation. These design requirements might come from general best practices you’d like to follow, or they might be specific controls from compliance frameworks relevant for your business. Regardless, your objectives should be quantitative, not qualitative. Here are some examples of quantitative objectives:
- Remote administrative network access to servers should be limited.
- Server storage volumes should be encrypted.
- AWS console logins should be protected by multi-factor authentication.
As an optional step, you can expand these objectives into user stories that define the conditions and remediation actions when there is an event. User stories are informal descriptions that briefly document a feature within a software system. User stories may be global and span across multiple applications or they may be specific to a single application. For example:
“Remote administrative network access to servers should be limited. Remote access ports include SSH TCP port 22 and RDP TCP port 3389. If open remote access ports are detected within the environment, they should be automatically closed and the owner will be notified.”
Once you’ve completed your user story, you can determine how to use automated remediation to help achieve these objectives in your AWS environment. User stories should be stored in a location that provides versioning support and can reference the associated automation code.
You should carefully consider the effect of your remediation mechanisms in order to prevent unintended impact on your resources and applications. Remediation actions such as instance termination, credential revocation, and security group modification can adversely affect application availability. Depending on the level of risk that’s acceptable to your organization, your automated mechanism might only provide a notification which can then be manually investigated prior to remediation. Once you’ve identified an automated remediation mechanism, you can build out the required components and test them in a non-production environment.
Sample response automation walkthrough
In the following section, we’ll walk you through an automated remediation for a simulated event that indicates potential unauthorized activity—the unintended disabling of CloudTrail logging. Outside parties might want to disable logging to prevent detection and recording of their unauthorized activity. Our response is to re-enable the CloudTrail logging and immediately notify the security contact. Here’s the user story for this scenario:
“CloudTrail logging should be enabled for all AWS accounts and regions. If CloudTrail logging is disabled, it will automatically be enabled and the security operations team will be notified.”
Note: The sample response automation below references Amazon EventBridge which extends and builds upon CloudWatch Events. Amazon EventBridge uses the same Amazon CloudWatch Events API, so the event structure and rules configuration are the same. This blog post uses base functionality that is identical in both EventBridge and CloudWatch Events.
In order to use our sample remediation, you will need to enable Amazon GuardDuty and AWS Security Hub in the AWS Region you have selected. Both of these services include a 30-day free trial. See the AWS Security Hub pricing page and the Amazon GuardDuty pricing page for additional details.
Important: You’ll use AWS CloudTrail to test the sample remediation. Running more than one CloudTrail trail in your AWS account will result in charges based on the number of events processed while the trail is running. Charges for additional copies of management events recorded in a Region are applied based on the published pricing plan. To minimize the charges, follow the clean-up steps that we provide later in this post to remove the sample automation and delete the trail.
Deploy the sample response automation
In this section, we’ll show you how to deploy and test the CloudTrail logging remediation sample. Amazon GuardDuty generates the finding Stealth:IAMUser/CloudTrailLoggingDisabled when CloudTrail logging is disabled, and AWS Security Hub collects findings from GuardDuty using the standardized finding format mentioned earlier. We recommend that you deploy this sample into a non-production AWS account.
Select the Launch Stack button below to deploy a CloudFormation template with an automation sample in the us-east-1 Region. You can also download the template and implement it in another Region. The template consists of an Amazon EventBridge rule, an AWS Lambda function and the IAM permissions necessary for both components to execute. It takes several minutes for the CloudFormation stack build to complete.
- In the CloudFormation console, choose the Select Template form, and then select Next.
- On the Specify Details page, provide the email address for a security contact. (For the purpose of this walkthrough, it should be an email address you have access to.) Then select Next.
- On the Options page, accept the defaults, then select Next.
- On the Review page, confirm the details, then select Create.
- While the stack is being created, check the inbox of the email address you provided in step 2. Look for an email message with the subject AWS Notification – Subscription Confirmation. Select the link in the body of the email to confirm your subscription to the Amazon Simple Notification Service (Amazon SNS) topic. You should see a success message similar to the screenshot below:
- Return to the CloudFormation console. Once the Status field for the CloudFormation stack changes to CREATE COMPLETE (as shown in figure 4), the solution is implemented and is ready for testing.
Test the sample automation
You’re now ready to test the automated response by creating a test trail in CloudTrail, then trying to stop it.
- From the AWS Management Console, choose Services > CloudTrail.
- Select Trails, then select Create Trail.
- On the Create Trail form:
- Enter a value for Trail name. We use test-trail in our example below.
- Under Management events, select Write-only (to minimize event volume).
- Under Storage location, choose an existing S3 bucket or create a new one. Note that since S3 bucket names are globally unique, you must add characters (such as a random string) to the name. For example: my-test-trail-bucket-<random-string>.
- On the Trails page of the CloudTrail console, verify that the new trail has started. You should see a green checkmark in the Status column, as shown in figure 6.
- You’re now ready to act like an unauthorized user trying to cover their tracks! Stop the logging for the trail you just created:
- Select the new trail name to display its configuration page.
- Toggle the Logging switch in the top-right corner to OFF.
- When prompted with a warning dialog box, select Continue.
- Verify that the Logging switch is now off, as shown below.
You have now simulated a security event by disabling logging for one of the trails in the CloudTrail service. Within the next few seconds, the near real-time automated response will detect the stopped trail, restart it, and send an email notification. You can refresh the Trails page of the CloudTrail console to verify that the trail’s status is ON again.
Within the next several minutes, the investigatory automated response will also begin. GuardDuty will detect the action that stopped the trail and enrich the data about the source of unexpected behavior. Security Hub will then ingest that information and optionally correlate with other security events.
Following the steps below, you can monitor findings within Security Hub for the finding type TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled to be generated:
- In the AWS Management Console, choose Services > Security Hub
- Select Findings in the left pane.
- Select the Add filters field, then select Type.
- Select EQUALS, paste TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled into the field, then select Apply.
- Refresh your browser periodically until the finding is generated.
While you wait on that detection, let’s dig into the components of automation.
How the sample automation works
This example incorporates two automated responses: a near real-time workflow and an investigatory workflow. The near real-time workflow provides a rapid response to an individual event, in this case the stopping of a trail. The goal is to restore the trail to a functioning state and alert security responders as quickly as possible. The investigatory workflow still includes a response to provide defense in depth and also uses services that support a more in-depth investigation of the incident.
In the near real-time workflow, Amazon EventBridge monitors for the undesired activity. When a trail is stopped, AWS CloudTrail publishes an event on the EventBridge bus. An EventBridge rule detects the trail-stopping event and invokes a Lambda function to respond to the event by restarting the trail and notifying the security contact via an Amazon Simple Notification Service (SNS) topic.
In the investigative workflow, CloudTrail logs are monitored for undesired activities. For example, if a trail is stopped, there will be a corresponding log record. GuardDuty detects this activity and retrieves additional data points about the source IP that executed the API call. Two common examples of those additional data points in GuardDuty findings include whether the API call came from an IP address on a threat list, or whether it came from a network not commonly used in your AWS account. An AWS Lambda function responds by restarting the trail and notifying the security contact. Finally, the finding is imported into AWS Security Hub for additional investigation.
AWS Security Hub imports findings from AWS security services such as GuardDuty, Amazon Macie and Amazon Inspector, plus from any third-party product integrations you’ve enabled. All findings are provided to Security Hub in AWS Security Finding Format, which eliminates the need for data conversion. Security Hub correlates these findings to help you identify related security events and determine a root cause. Security Hub also publishes its findings to Amazon EventBridge to enable further processing by other AWS services such as AWS Lambda.
Respond step deep dive
Amazon EventBridge and AWS Lambda work together to respond to a security finding. Amazon EventBridge is a service that provides real-time access to changes in data in AWS services, your own applications, and Software-as-a-Service (SaaS) applications without writing code. In this example, EventBridge identifies a Security Hub finding that requires action and invokes a Lambda function that performs remediation. As shown in figure 10, the Lambda function both notifies the security operator via SNS and restarts the stopped CloudTrail.
To set this response up, we looked for an event to indicate that a trail had stopped or was disabled. We knew that the GuardDuty finding Stealth:IAMUser/CloudTrailLoggingDisabled is raised when CloudTrail logging is disabled. Therefore, we configured the default event bus to look for this event. You can learn more about all of the available GuardDuty findings in the user guide.
How the code works
When Security Hub publishes a finding to EventBridge, it includes full details of the incident as discovered by GuardDuty. The finding is published in JSON format. If you review the details of the sample finding, note that it has several fields helping you identify the specific events that you’re looking for. Here are some of the relevant details:
You can build an event pattern using these fields, which an EventBridge filtering rule can then use to identify events and to invoke the remediation Lambda function. Below is a snippet from the CloudFormation template we provided earlier that defines that event pattern for the EventBridge filtering rule:
Once the rule is in place, EventBridge continuously scans the event bus for this pattern. When EventBridge finds a match, it invokes the remediating Lambda function and passes the full details of the event to the function. The Lambda function then parses the JSON fields in the event so that it can act as shown in this Python code snippet:
The code also issues a notification to security operators so they can review the findings and insights in Security Hub and other services to better understand the incident and to decide whether further manual actions are warranted. Here’s the code snippet that uses SNS to send out a note to security operators:
While notifications to human operators are important, the Lambda function will not wait to take action. It immediately remediates the condition by restarting the stopped trail in CloudTrail. Here’s a code snippet that restarts the trail to reenable logging:
After the trail has been restarted, API activity is once again logged and can be audited. This can help provide relevant data for the remaining steps in the incident response process. The data is especially important for the post-incident phase, when your team analyzes lessons learned to prevent future incidents. You can also use this phase to identify additional steps to automate in your incident response.
After you’ve completed the sample security response automation, we recommend that you remove the resources created in this walkthrough example from your account in order to minimize any charges associated with the trail in CloudTrail and data stored in S3.
Important: Deleting resources in your account can negatively impact the applications running in your AWS account. Verify that applications and AWS account security do not depend on the resources you’re about to delete.
Here are the clean-up steps:
- Delete the CloudFormation stack.
- Delete the trail you created in CloudTrail.
- If you created an S3 bucket for CloudTrail logs, you can also delete that S3 bucket.
- New accounts can try GuardDuty at no cost for 30 days. You can suspend or disable GuardDuty before the free trial period to avoid charges.
- Security Hub comes with a 30-day free trial. You can avoid charges by disabling the service before the trial period is over.
You’ve learned the basic concepts and considerations behind security response automation on AWS and how to use Amazon EventBridge, Amazon GuardDuty and AWS Security Hub to automatically re-enable AWS CloudTrail when it becomes disabled unexpectedly. As a next step, you may want to start building your own response automations and dive deeper into the AWS Security Incident Response Guide, NIST Cybersecurity Framework (CSF) or the AWS Cloud Adoption Framework (CAF) Security Perspective. You can explore additional automatic remediation solutions on the AWS Security Blog. You can find the code used in this example on GitHub.
If you have feedback about this blog post, submit them in the Comments section below. If you have questions about using this solution, start a thread in the EventBridge, GuardDuty or Security Hub forums, or contact AWS Support.
Want more AWS Security news? Follow us on Twitter.