Secure deployment of Amazon SageMaker resources

Amazon SageMaker, like other services in Amazon Web Services (AWS), includes security-related parameters and configurations that you can use to improve the security posture of resources as you deploy them. However, many of these security-related parameters are optional, allowing you to deploy resources without them. While this might be acceptable in the initial exploration stage, customers want resources to be deployed more securely in production.

In this post I will discuss three approaches for deploying Amazon SageMaker resources more securely and highlight some pros and cons of each approach.

Before you begin

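This post assumes general familiarity with machine learning and Amazon SageMaker. It also assumes knowledge of the services used to implement the security controls discussed below, including AWS Identity and Access Management (IAM), AWS Service Catalog, AWS CloudFormation, AWS Lambda, and Amazon CloudWatch Events.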

Approaches

Amazon SageMaker contains security-related parameters for the secure deployment of resources within it. For example, when creating an Amazon SageMaker notebook instance, root access on the notebook instance can be disabled. As another example, an Amazon SageMaker training job can be set up to access other services like Amazon Simple Storage Service (Amazon S3) through an endpoint in the customer’s Amazon Virtual Private Cloud (Amazon VPC).

However, these and most other security-related parameters and configurations are optional. As examples of less-secure configuration, Amazon SageMaker notebook instances can be created with root access enabled, and training jobs can access Amazon S3 over the public endpoints.

There are two main methods of implementing controls to improve the security of AWS services during deployment. One of them is preventive and uses controls to stop an event from occurring. The other is responsive, and uses controls that are applied in response to events.

Preventive controls protect workloads and mitigate threats and vulnerabilities. A couple of approaches to implement preventive controls are:

  • Use IAM condition keys supported by the service to ensure that resources without necessary security controls cannot be deployed.
  • Use the AWS Service Catalog to invoke AWS CloudFormation templates that deploy resources with all the necessary security controls in place.

Responsive controls drive remediation of potential deviations from security baselines. An approach to implement responsive controls is:

  • Use CloudWatch Events to catch resource creation events, then use a Lambda function to validate that resources were deployed with the necessary security controls, or terminate any resources if the necessary security controls aren’t present.

The next few sections discuss each of these approaches with respect to Amazon SageMaker.

IAM condition keys approach

IAM condition keys can be used to improve security by preventing resources from being created without security controls. When a principal makes an API request to AWS to create a resource, the request information is gathered into a request context. This request context is compared to conditions in the principal’s policy. If the conditions pass, the API request is allowed to proceed and the resource will be created. However, if the conditions fail, the API request is stopped and the resource won’t be created.

The optional Condition element (or block) in an IAM policy is where expressions are built using condition operators (such as StringEquals or NumericLessThan). These condition expressions match the condition keys and values in the policy to the keys and values in the request context. The condition key specified in a condition element can be global or service-specific.
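
The matching described above can be illustrated with a toy evaluator. This is a simplified sketch, not IAM's actual evaluation engine: it assumes only the two operators named above, treats condition keys as exact dictionary keys, and ignores policy variables and set qualifiers. The function names and the sample request contexts are hypothetical.

```python
# Simplified illustration only: real IAM evaluation supports many more
# operators, policy variables, and set qualifiers than this toy model.

def string_equals(actual, expected):
    # StringEquals is an exact, case-sensitive comparison.
    return actual == expected


def numeric_less_than(actual, expected):
    return float(actual) < float(expected)


OPERATORS = {
    "StringEquals": string_equals,
    "NumericLessThan": numeric_less_than,
}


def condition_matches(condition, request_context):
    """Return True when every operator/key pair in the block matches."""
    for operator, key_values in condition.items():
        compare = OPERATORS[operator]
        for key, expected in key_values.items():
            if key not in request_context:
                return False  # a missing key fails these operators
            allowed = expected if isinstance(expected, list) else [expected]
            # Several values for one key are alternatives (logical OR).
            if not any(compare(request_context[key], v) for v in allowed):
                return False
    return True


condition = {"StringEquals": {"sagemaker:RootAccess": "Disabled"}}
secure_request = {"sagemaker:RootAccess": "Disabled"}
insecure_request = {"sagemaker:RootAccess": "Enabled"}

print(condition_matches(condition, secure_request))    # True
print(condition_matches(condition, insecure_request))  # False
```

The second request fails the condition, which corresponds to the API request being stopped before the resource is created.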

A condition element has the following syntax:

 "Condition": { "{condition-operator}": { "{condition-key}": "{condition-value}" }
}

The following condition element contains an Amazon SageMaker service-specific condition key to ensure that when an Amazon SageMaker notebook instance is created, the request must ask that root access on the notebook instance be disabled. If a request is made, and the request doesn’t ask that root access on the notebook instance be disabled, it will be denied.

 "Condition": { "StringEquals": { "sagemaker:RootAccess": "Disabled" }
}

The IAM User Guide topic IAM JSON Policy Elements: Condition provides more information.
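
To show where such a condition element sits, here is a minimal sketch that builds a complete identity-based policy in Python and prints it as JSON. The Sid and the wildcard Resource are illustrative assumptions; a production policy would normally scope the resource more tightly.

```python
import json

# Hypothetical policy: the Sid and the Resource scope are illustrative.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowNotebookCreationWithoutRootAccess",
            "Effect": "Allow",
            "Action": "sagemaker:CreateNotebookInstance",
            "Resource": "*",
            # The request is allowed only when it asks for root access
            # to be disabled on the notebook instance.
            "Condition": {
                "StringEquals": {"sagemaker:RootAccess": "Disabled"}
            },
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The resulting JSON could be pasted into the IAM console or passed as the PolicyDocument argument of the Boto3 IAM client's create_policy call.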

Amazon SageMaker IAM condition keys

Amazon SageMaker supports a few global condition context keys and adds several Amazon SageMaker service-specific condition keys.

Global condition context keys

Global condition context keys are documented in AWS Global Condition Context Keys. Global condition context keys start with an aws: prefix. The following global condition context keys are applicable to Amazon SageMaker.

  • aws:RequestTag/${TagKey} – This key is used to compare the tag key-value pair that was passed in the request with the tag pair specified in the policy.
  • aws:ResourceTag/${TagKey} – This key is used to compare the tag key-value pair that is specified in the policy with the key-value pair attached to the resource.
  • aws:SourceIp – This key is used to compare the requester’s IP address with the IP address specified in the policy.
  • aws:SourceVpc – This key is used to check whether the request comes from the Amazon VPC specified in the policy.
  • aws:SourceVpce – This key is used to compare the Amazon VPC endpoint identifier of the request with the endpoint ID specified in the policy.
  • aws:TagKeys – This key is used to compare the tag keys in the request with the keys specified in the policy.
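
As an illustration of the tag-related global keys, the following sketch builds a statement that allows notebook instance creation only when the request carries a specific tag. The tag key Project and the value fraud-detection are hypothetical, and the statement is an assumption about how these keys might be combined rather than a recommended policy.

```python
import json

REQUIRED_TAG_KEY = "Project"            # hypothetical tag key
REQUIRED_TAG_VALUE = "fraud-detection"  # hypothetical tag value

statement = {
    "Sid": "RequireProjectTagOnNotebookCreation",
    "Effect": "Allow",
    "Action": "sagemaker:CreateNotebookInstance",
    "Resource": "*",
    "Condition": {
        # The request must pass the tag with the expected value.
        "StringEquals": {
            f"aws:RequestTag/{REQUIRED_TAG_KEY}": REQUIRED_TAG_VALUE
        },
        # Every tag key in the request must be in this list.
        "ForAllValues:StringEquals": {"aws:TagKeys": [REQUIRED_TAG_KEY]},
    },
}

print(json.dumps(statement, indent=2))
```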

Amazon SageMaker service-specific condition keys

Amazon SageMaker service-specific condition keys are documented in Actions, Resources, and Condition Keys for Amazon SageMaker and Amazon SageMaker Identity-Based Policy Examples. They have a sagemaker: prefix.

  • sagemaker:AcceleratorTypes – This key is used to require a specific Amazon Elastic Inference accelerator when creating or updating notebook instances and when creating endpoint configurations. Elastic Inference allows addition of inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance.
  • sagemaker:DirectInternetAccess – This key is used to control direct internet access from notebook instances. Allowed values are Enabled and Disabled. The default behavior is to allow direct internet access. Direct internet access should be disabled to prevent unfettered internet access after connecting notebook instances to the customer’s Amazon VPC. This can be done by using the sagemaker:VPCSubnets and sagemaker:VPCSecurityGroupIds parameters.
  • sagemaker:FileSystemAccessMode – Amazon SageMaker can be used in conjunction with Amazon Elastic File System (Amazon EFS) or Amazon FSx file systems for training jobs and hyperparameter tuning jobs. This key is used to specify the access mode of the directory associated with the input data channel. The directory can be mounted either in read-only or read-write mode.
  • sagemaker:FileSystemDirectoryPath – This key is used to specify the file system directory path associated with the resource in the training and hyperparameter tuning request.
  • sagemaker:FileSystemId – This key is used to specify the file system ID associated with the resource in the training and hyperparameter tuning request.
  • sagemaker:FileSystemType – This key is used to specify the file system type associated with the resource in the training and hyperparameter tuning request.
  • sagemaker:InstanceTypes – This key is used to specify the list of all instance types for notebook instances, training jobs, hyperparameter tuning jobs, batch transform jobs, and endpoint configurations for hosting real-time inferencing. Restricting instance types can be done to only allow enhanced-security Nitro instances or to control costs by not allowing GPU instances.
  • sagemaker:InterContainerTrafficEncryption – This key is used to control inter-container traffic encryption for distributed training and hyperparameter tuning jobs. Allowed values are true and false. The default value is false. Distributed machine learning frameworks and algorithms usually transmit information that is directly related to the model such as weights, not the training dataset. This parameter should be set to true to comply with regulatory requirements with the understanding that it could increase the training time of distributed jobs.
  • sagemaker:MaxRuntimeInSeconds – This key is used to control costs by specifying the maximum length of time, in seconds, that the training, hyperparameter tuning, or compilation job can run. If a job doesn’t complete within this time, Amazon SageMaker ends the job. If a value isn’t specified, the default value is 1 day. The maximum value that can be specified is 28 days.
  • sagemaker:ModelArn – This key is used to specify the Amazon Resource Name (ARN) of the model associated with batch transform jobs and endpoint configurations for hosting real-time inferencing. When creating a batch transform job or endpoint configuration, a model name is passed in the API request. The name of the model must be associated with the model ARN specified in the policy.
  • sagemaker:NetworkIsolation – This key is used to enable network isolation when creating training, hyperparameter tuning, and inference jobs. Allowed values are true and false. The default value is false. This parameter should be set to true to prevent containers from making any outbound network calls, even to other AWS services such as Amazon S3. Network isolation is required for training jobs and models run using resources from AWS Marketplace.
  • sagemaker:OutputKmsKey – This key is used to specify the AWS KMS key to encrypt output data stored in Amazon S3. Either the KMS key ID or key ARN can be specified. This key shouldn’t be confused with the key used to encrypt storage volumes, which is specified in sagemaker:VolumeKmsKey.
  • sagemaker:RequestTag/${TagKey} – This key is used to compare the tag key-value pair that was passed in the request with the tag pair that is specified in the policy. This could be used to ensure that a particular tag is always used.
  • sagemaker:ResourceTag/${TagKey} – This key is used to compare the tag key-value pair that is specified in the policy with the key-value pair that is attached to the resource. This could be used to ensure that a particular tag and value pair is always used.
  • sagemaker:RootAccess – This key is used to control root access on the notebook instances. Allowed values are Enabled and Disabled. The default behavior is to allow root access. Root access is usually not a best practice and should be disabled. Disabling root access prevents notebook users from deleting system-level software, installing new software, and modifying essential components.
  • sagemaker:VolumeKmsKey – This key is used to specify an AWS KMS key to encrypt storage volumes when creating notebook instances, training jobs, hyperparameter tuning jobs, batch transform jobs, and endpoint configurations for hosting real-time inferencing. Either the KMS key ID or key ARN can be specified. This key shouldn’t be confused with the key to encrypt output data in Amazon S3 specified in sagemaker:OutputKmsKey.
  • sagemaker:VPCSecurityGroupIds – The list of all Amazon VPC security group IDs associated with the elastic network interface (ENI) that Amazon SageMaker creates in the Amazon VPC subnet.
  • sagemaker:VPCSubnets – The list of all Amazon VPC subnets where Amazon SageMaker creates ENIs to communicate with other resources like Amazon S3.
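
Several of the service-specific keys above can be combined in one policy. The sketch below assembles Deny statements for training and hyperparameter tuning jobs. The statement IDs and the choice of controls are assumptions for illustration, and the key names use the spelling from the list above. How each operator behaves when a request omits the parameter entirely should be verified against your own requests before relying on a policy like this.

```python
import json

TRAINING_ACTIONS = [
    "sagemaker:CreateTrainingJob",
    "sagemaker:CreateHyperParameterTuningJob",
]


def deny(sid, condition):
    """Build one Deny statement covering the training actions."""
    return {
        "Sid": sid,
        "Effect": "Deny",
        "Action": TRAINING_ACTIONS,
        "Resource": "*",
        "Condition": condition,
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [
        # Null with "true" matches when the key is absent from the request.
        deny("DenyJobsOutsideVpc",
             {"Null": {"sagemaker:VPCSubnets": "true"}}),
        deny("DenyUnencryptedVolumes",
             {"Null": {"sagemaker:VolumeKmsKey": "true"}}),
        deny("DenyUnencryptedOutput",
             {"Null": {"sagemaker:OutputKmsKey": "true"}}),
        # Bool with "false" matches requests that turn the control off.
        deny("DenyWithoutNetworkIsolation",
             {"Bool": {"sagemaker:NetworkIsolation": "false"}}),
        deny("DenyPlaintextContainerTraffic",
             {"Bool": {"sagemaker:InterContainerTrafficEncryption": "false"}}),
    ],
}

print(json.dumps(policy, indent=2))
```

Using Deny statements means the restrictions hold regardless of which other policies grant the create actions.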

AWS Service Catalog approach

AWS Service Catalog allows organizations to create and manage catalogs of IT services that are approved for use on AWS. You can use it to create a preventive approach to improving security by invoking templates with security controls already in place. These IT services can include everything from virtual machine images, servers, software, and databases to complete multi-tier application architectures. AWS Service Catalog allows for the central management of commonly deployed IT services, helps achieve consistent governance, and supports you in meeting your compliance requirements. It does this while enabling users to quickly deploy only the IT services they need and that are approved by their organization.

AWS Service Catalog products are created by importing AWS CloudFormation templates that provision the resources in services. CloudFormation provides a common language for the description and provisioning of all the infrastructure resources in a cloud environment. CloudFormation lets you use programming languages or a simple text file to model and provision all the resources needed for applications across all regions and accounts in an automated and secure manner. This gives a single source of truth for the AWS resources.

AWS CloudFormation templates implement resources in various services as resource types. For example, there is a CloudFormation resource type called AWS::SageMaker::NotebookInstance that models an Amazon SageMaker notebook instance. When a CloudFormation stack with this resource type is created, the notebook instance is provisioned based on the template parameters.

Since AWS CloudFormation is typically used to provision infrastructure as opposed to executing workflows, CloudFormation models Amazon SageMaker notebook instances but not Amazon SageMaker training jobs. In situations like this, a custom resource can be used. Custom resource providers, typically implemented as Lambda functions, are invoked when a CloudFormation stack with a custom resource is created. The Lambda function can use the AWS SDKs—which are available in several programming languages—to create the resource. In the case of Amazon SageMaker training jobs, when the CloudFormation stack is created, it will call a Lambda function that can use the Boto3 Python SDK to create a training job.
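
A custom resource provider of this kind might look like the following minimal sketch. It assumes two hypothetical template properties, TrainingJobName and TrainingJobRequestJson (a JSON string holding the remaining CreateTrainingJob arguments), and it forces the security parameters on regardless of what the template passes. Update and Delete requests are simply acknowledged. The SageMaker client is injected so the logic can be exercised without AWS access.

```python
import json
import urllib.request


def build_response(event, status, physical_id, reason=""):
    """Assemble the JSON body CloudFormation expects from a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason,
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }


def send_response(url, body):
    """PUT the response to the pre-signed URL supplied in the event."""
    data = json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, method="PUT", headers={"Content-Type": ""}
    )
    urllib.request.urlopen(request)


def handle(event, sagemaker_client, send=send_response):
    """Create a training job on stack creation and report the outcome."""
    props = event["ResourceProperties"]
    job_name = props.get("TrainingJobName", "unnamed-training-job")
    try:
        if event["RequestType"] == "Create":
            # Hypothetical property: remaining arguments as a JSON string.
            request = json.loads(props["TrainingJobRequestJson"])
            # Security settings are fixed here, overriding the template.
            request.update(
                TrainingJobName=job_name,
                EnableNetworkIsolation=True,
                EnableInterContainerTrafficEncryption=True,
            )
            sagemaker_client.create_training_job(**request)
        body = build_response(event, "SUCCESS", job_name)
    except Exception as error:
        body = build_response(event, "FAILED", job_name, reason=str(error))
    send(event["ResponseURL"], body)
    return body


def lambda_handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    return handle(event, boto3.client("sagemaker"))
```

Always sending a response, even on failure, matters here: without one, the CloudFormation stack waits until the custom resource times out.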

The following Amazon SageMaker resource types are supported by AWS CloudFormation. All other Amazon SageMaker resources need to be created using the custom resource approach.

  • AWS::SageMaker::CodeRepository creates a Git repository that can be used for source control.
  • AWS::SageMaker::Endpoint creates an endpoint for inferencing.
  • AWS::SageMaker::EndpointConfig creates a configuration for endpoints for inferencing.
  • AWS::SageMaker::Model creates a model for inferencing.
  • AWS::SageMaker::NotebookInstance creates a notebook instance for development.
  • AWS::SageMaker::NotebookInstanceLifecycleConfig creates shell scripts that run when notebook instances are created and/or started.
  • AWS::SageMaker::Workteam creates a work team for labeling data.

AWS CloudFormation collects user input in the form of parameters that can be defined in the template. Each security-related parameter should fall into one of the following three categories.

Security parameters that shouldn’t be exposed

In the AWS CloudFormation templates, don’t implement these settings as parameters. Instead, automatically set them without providing a choice to the user:

  • DirectInternetAccess – Set this to Disabled when creating notebook instances after connecting to the customer’s Amazon VPC using VpcConfig.Subnets and VpcConfig.SecurityGroupIds.
  • EnableInterContainerTrafficEncryption – Set this to true when creating distributed training and hyperparameter tuning jobs. Note that it might increase the training time.
  • EnableNetworkIsolation – Set this to true when creating training, hyperparameter tuning, and inference jobs to prevent situations like malicious code being accidentally installed and transferring data to a remote host.
  • MaxRuntimeInSeconds – Set this to a reasonable value.
  • RootAccess – Set this to Disabled when creating notebook instances as it is generally not a best practice to permit root access. Disabling root access prevents notebook users from deleting system-level software, installing new software, and modifying essential environment components.
  • VpcConfig.SecurityGroupIds – Set this to a pre-created security group that has been configured with the necessary controls.

Security parameters that should be restricted

For the following parameters, require the user to select a value from a dropdown list either by hard-coding the list or by using a supported AWS-specific parameter.

  • AcceleratorTypes – The dropdown list has to be hard coded. Elastic Inference accelerators allow the addition of inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance.
  • InstanceTypes – The dropdown list has to be hard coded. Restricting instance types can be used to only allow enhanced-security Nitro instances or to control costs by not allowing GPU instances.
  • VpcConfig.Subnets – The dropdown list can be built by using an AWS::EC2::Subnet::Id parameter type.

Security parameters that should be validated

Require the user to input values for the following parameters by using the AllowedPattern property for the parameter with a regular expression of “.+”, which rejects an empty value:

  • OutputKmsKey
  • VolumeKmsKey
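
The three categories can be seen together in a small template for a notebook instance, sketched here as a Python dictionary that is serialized to JSON. The instance types, security group ID, and role ARN are placeholders, the parameter names are hypothetical, and the pattern .+ (one or more characters) is assumed for the validated parameter.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {
        # Restricted: the user picks from a hard-coded list.
        "InstanceType": {
            "Type": "String",
            "AllowedValues": ["ml.t3.medium", "ml.m5.xlarge"],
        },
        # Restricted: the console offers a dropdown of existing subnets.
        "SubnetId": {"Type": "AWS::EC2::Subnet::Id"},
        # Validated: an empty value is rejected.
        "VolumeKmsKeyId": {"Type": "String", "AllowedPattern": ".+"},
    },
    "Resources": {
        "Notebook": {
            "Type": "AWS::SageMaker::NotebookInstance",
            "Properties": {
                "InstanceType": {"Ref": "InstanceType"},
                "SubnetId": {"Ref": "SubnetId"},
                "KmsKeyId": {"Ref": "VolumeKmsKeyId"},
                # Not exposed: fixed by the template author.
                "RootAccess": "Disabled",
                "DirectInternetAccess": "Disabled",
                "SecurityGroupIds": ["sg-0123456789abcdef0"],  # placeholder
                "RoleArn": "arn:aws:iam::111122223333:role/ExampleRole",  # placeholder
            },
        }
    },
}

print(json.dumps(template, indent=2))
```

Because RootAccess and DirectInternetAccess never appear under Parameters, a user launching the product has no way to change them.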

CloudWatch Events approach

Amazon CloudWatch and CloudWatch Events can be used to implement responsive controls to improve security. CloudWatch is a service that provides data and actionable insights to monitor applications, respond to system-wide performance changes, optimize resource utilization, and provide a unified view of operational health. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events. CloudWatch uses the data to provide a unified view of AWS resources, applications, and services that run on AWS and on-premises servers. CloudWatch can be used to detect anomalous behavior in environments, set alarms, visualize logs and metrics side by side, take automated actions, troubleshoot issues, and discover insights to keep applications running smoothly.

CloudWatch Events—a subsystem within CloudWatch—delivers a near real-time stream of events that describe changes in AWS resources. Using simple rules, events can be matched and routed to one or more target functions or streams. The target of a CloudWatch Events rule might be, for example, a Lambda function that will be invoked with an event every time the rule matches.

The following figure shows how to create a CloudWatch Events rule that will match events pertaining to state changes of Amazon SageMaker training jobs. The steps using the AWS Management Console are:

  1. Go to Amazon CloudWatch and select Rules under Events.
  2. Select SageMaker from the Service Name dropdown.
  3. Select SageMaker Training Job State Change from the Event Type dropdown. The Event Pattern Preview is automatically populated.
  4. Select Lambda function from the Targets dropdown.
  5. Select the AWS Lambda function that you have implemented from the Function dropdown.
    Figure 1: Create a CloudWatch event rule
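
The same rule could also be created programmatically. The sketch below assumes the Boto3 CloudWatch Events client and uses a hypothetical rule name, target ID, and function ARN. The client is passed in as an argument so the function can be tried with a stub.

```python
import json

# Matches every state change of every SageMaker training job.
EVENT_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Training Job State Change"],
}


def create_rule(events_client, rule_name, function_arn):
    """Create the rule and register the Lambda function as its target."""
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(EVENT_PATTERN),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "validate-training-job", "Arn": function_arn}],
    )


# Example call (requires AWS credentials):
#   import boto3
#   create_rule(boto3.client("events"), "sagemaker-training-state",
#               "arn:aws:lambda:us-east-1:111122223333:function:validate")
```

Unlike the console, the API does not grant CloudWatch Events permission to invoke the function, so a resource-based permission on the Lambda function has to be added separately.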

Amazon SageMaker provides the following training job statuses:

  • InProgress – The training is in progress.
  • Completed – The training job has completed.
  • Failed – The training job has failed. To see the reason for the failure, see the FailureReason field in the response to a DescribeTrainingJob call.
  • Stopping – The training job is stopping.
  • Stopped – The training job has stopped.

The Lambda function that is configured as the target for the CloudWatch Events rule should inspect the event and retrieve the Amazon SageMaker training job status. If the status is InProgress, the Lambda function will do the following:

  1. Call the DescribeTrainingJob API, and pass in the training job name.
  2. From the response, check to see if the training job has all of the necessary security controls.
  3. If the training job is deemed insecure, call the StopTrainingJob API to stop it.
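
The steps above can be sketched as a Lambda handler. Which controls count as "necessary" is an assumption here (network isolation, inter-container encryption, VPC subnets, and KMS keys for output and volumes); adjust the checks to your own baseline. The helper names are hypothetical, and the SageMaker client is injected so the logic can be tested without AWS access.

```python
def missing_controls(job):
    """List the security controls absent from a DescribeTrainingJob response."""
    problems = []
    if not job.get("EnableNetworkIsolation", False):
        problems.append("network isolation is disabled")
    if not job.get("EnableInterContainerTrafficEncryption", False):
        problems.append("inter-container traffic encryption is disabled")
    if not job.get("VpcConfig", {}).get("Subnets"):
        problems.append("no VPC subnets configured")
    if not job.get("OutputDataConfig", {}).get("KmsKeyId"):
        problems.append("output data has no KMS key")
    if not job.get("ResourceConfig", {}).get("VolumeKmsKeyId"):
        problems.append("storage volume has no KMS key")
    return problems


def handle(event, sagemaker_client):
    """Stop an in-progress training job that lacks required controls."""
    detail = event["detail"]
    if detail["TrainingJobStatus"] != "InProgress":
        return []
    name = detail["TrainingJobName"]
    job = sagemaker_client.describe_training_job(TrainingJobName=name)
    problems = missing_controls(job)
    if problems:
        # Printed output from a Lambda function ends up in CloudWatch Logs.
        print(f"Stopping {name}: {'; '.join(problems)}")
        sagemaker_client.stop_training_job(TrainingJobName=name)
    return problems


def lambda_handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    return handle(event, boto3.client("sagemaker"))
```

Logging the reasons before stopping the job gives users a place to look when a job ends unexpectedly.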

Similar CloudWatch Events rules can be set up for other Amazon SageMaker events.

Discussing the approaches

The IAM condition keys approach doesn’t involve coding. All it requires is adding condition elements to IAM policies. When this approach is used, users deploying resources are free to choose any deployment method, including the console, CLI, and SDKs. Additionally, Amazon SageMaker has a higher-level Python SDK implemented on top of Boto3 that makes deploying Amazon SageMaker resources easy.

However, there are a few caveats with this approach. First, not all AWS services support IAM condition keys. Fortunately, Amazon SageMaker has comprehensive support for IAM condition keys. Second, since this approach involves IAM policies, IAM service limits need to be taken into consideration. Some of these limits are listed below; a full list can be found at IAM and STS Limits.

  • An IAM user, group, or role can have a maximum of 10 managed policies.
  • The size of each managed policy cannot exceed 6,144 characters (not counting white spaces).
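
A quick way to keep an eye on the size limit is to measure a policy before attaching it. The sketch below is an approximation: it removes white space between JSON tokens by serializing compactly, but leaves any white space inside string values in place. The function names are hypothetical, and the limit is the figure quoted above.

```python
import json

MANAGED_POLICY_LIMIT = 6144  # characters, the limit quoted above


def policy_size(policy):
    """Approximate policy size, ignoring white space between JSON tokens."""
    return len(json.dumps(policy, separators=(",", ":")))


def fits_managed_policy(policy):
    return policy_size(policy) <= MANAGED_POLICY_LIMIT


example = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:CreateNotebookInstance",
            "Resource": "*",
            "Condition": {
                "StringEquals": {"sagemaker:RootAccess": "Disabled"}
            },
        }
    ],
}

print(policy_size(example), fits_managed_policy(example))
```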

All of the conditions should be documented clearly. Otherwise, users deploying resources might have to use trial-and-error to successfully deploy them.

The AWS Service Catalog approach involves coding of AWS CloudFormation templates. When the necessary resource types aren’t supported by CloudFormation, custom resource Lambda functions have to be implemented by the customer. This approach is always available without any special support needed from the service. This approach also takes guesswork out of the equation when deploying resources, as the CloudFormation templates can guide the user in providing proper security parameters.

Finally, the CloudWatch Events approach also involves coding by the customer. Because it’s a responsive control, a resource that users create without the necessary security controls will start to be deployed before it is stopped or terminated. CloudWatch Events are available very soon after resource provisioning starts, whereas an Amazon SageMaker resource typically takes a couple of minutes after it is requested to become available. It should be noted that users don’t get any direct feedback when resources are stopped or terminated in response to CloudWatch Events. They need to review CloudWatch Logs or notifications sent by the Lambda function code to figure out why a resource was terminated. This approach can be used in combination with one of the preventive approaches to enhance security.

Summary

In this post I discussed three different approaches for deploying Amazon SageMaker resources securely: IAM condition keys, Service Catalog, and CloudWatch Events. You can use each of these methods to improve the security of your AWS resources as you deploy them. After reading this post, you should have a better understanding of the pros and cons of each approach and how you can use them to deploy your Amazon SageMaker resources more securely in production.

Contributors

Contributors to this document include:

  • Paco Hope, Principal Security Consultant, AWS ProServe
  • Jeff Puchalski, Technical Training Specialist, AWS Security
  • Kumar Venkateswar, Principal PM, Amazon SageMaker, AI Platforms
  • Ross Warren, Senior Solutions Architect, Security

Author

Rajesh Ramchander

Rajesh is a Senior Big Data Consultant in Professional Services at AWS. He helps customers migrate big data and machine learning/artificial intelligence workloads to AWS using Amazon EMR, AWS Glue, and Amazon SageMaker. Before joining AWS, Rajesh was a member of senior management of software development teams. He holds an MS in Computer Science and an MS in Electrical Engineering.