AWS Machine Learning Specialty Study Guide: SageMaker, Pipelines, and Model Deployment

Comprehensive MLS-C01 study guide covering SageMaker built-in algorithms, feature engineering, hyperparameter tuning, distributed training, inference patterns, MLOps pipelines, and Model Monitor for the AWS Machine Learning Specialty exam.

The AWS Certified Machine Learning - Specialty (MLS-C01) exam validates the ability to design, build, train, tune, and deploy machine learning models on AWS. It requires an understanding both of ML concepts (supervised learning, deep learning, feature engineering, model evaluation) and of the AWS services that implement them. Neither AWS knowledge nor ML knowledge alone is sufficient; you need both.

This guide covers all exam domains with emphasis on SageMaker architecture, model training, deployment patterns, and the supporting data engineering services.

Exam Overview

The MLS-C01 exam contains 65 questions (50 scored, 15 unscored) with a 180-minute time limit. The passing score is 750 out of 1000.

Domain Weights

| Domain | Weight |
| --- | --- |
| Domain 1: Data Engineering | 20% |
| Domain 2: Exploratory Data Analysis | 24% |
| Domain 3: Modeling | 36% |
| Domain 4: Machine Learning Implementation and Operations | 20% |

Domain 3 (Modeling) is by far the most heavily tested area. You must know ML algorithms, their assumptions, their hyperparameters, and when one is preferred over another.

Domain 1: Data Engineering (20%)

Data Ingestion and Storage

ML pipelines require data at scale. The exam tests which AWS service to use at each stage.

| Stage | AWS Service |
| --- | --- |
| Batch data ingestion | AWS Glue, AWS Data Pipeline, S3 batch operations |
| Streaming ingestion | Amazon Kinesis Data Streams, Kinesis Data Firehose |
| Feature storage | Amazon SageMaker Feature Store |
| Data lake | Amazon S3 + AWS Glue Data Catalog |
| Data warehouse | Amazon Redshift |

Kinesis Data Streams vs. Kinesis Data Firehose for ML:

Kinesis Data Streams enables custom consumers (Lambda, custom applications) to process each record with low latency. Use it when you need to invoke a SageMaker endpoint for real-time inference on each event.

Kinesis Data Firehose delivers data to destinations (S3, Redshift, OpenSearch) with optional transformation via Lambda. Use it when the goal is landing streaming data in S3 for batch training.
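
As a minimal sketch of the Streams pattern above, a Lambda consumer can decode each Kinesis record and invoke a real-time SageMaker endpoint. The endpoint name, payload format, and response format here are hypothetical:

import base64
import json
import boto3

runtime = boto3.client('sagemaker-runtime')

def handler(event, context):
    # Kinesis delivers records base64-encoded inside the Lambda event
    for record in event['Records']:
        payload = base64.b64decode(record['kinesis']['data'])
        # Score each event against a deployed endpoint (name is hypothetical)
        response = runtime.invoke_endpoint(
            EndpointName='fraud-detector',
            ContentType='text/csv',
            Body=payload
        )
        prediction = json.loads(response['Body'].read())
        print(prediction)   # in practice, route the result downstream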

AWS Glue for Data Preparation

AWS Glue provides a serverless ETL service:

  • Glue Crawlers: Discover schema automatically from S3, RDS, and other sources; populate the Glue Data Catalog
  • Glue Jobs: PySpark or Python shell scripts for transformation; run on a managed Spark cluster
  • Glue DataBrew: Visual data preparation without code; profile data, detect anomalies, apply transformations

Data Catalog: Central metadata repository. Athena, Redshift Spectrum, and EMR can query data registered in the Data Catalog without moving it.

SageMaker Feature Store

Feature Store provides a centralized repository for ML features:

  • Online store: Low-latency retrieval for real-time inference (milliseconds)
  • Offline store: Historical feature values stored in S3 for training

Because both stores are populated from the same feature definitions, training features (offline store) stay consistent with serving features (online store), preventing training-serving skew.
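
A minimal sketch of online retrieval at inference time, using boto3; the feature group name and record identifier are hypothetical:

import boto3

featurestore = boto3.client('sagemaker-featurestore-runtime')

# Fetch the latest feature values for one entity from the online store
response = featurestore.get_record(
    FeatureGroupName='customer-features',
    RecordIdentifierValueAsString='customer-123'
)
features = {f['FeatureName']: f['ValueAsString'] for f in response['Record']}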

Domain 2: Exploratory Data Analysis (24%)

Data Analysis Tools

Amazon Athena: Serverless SQL queries directly on S3 data. Use for exploring raw datasets before building pipelines. Pay per query (per TB scanned).

Amazon SageMaker Data Wrangler: Visual interface within SageMaker Studio for:

  • Importing data from S3, Athena, Redshift, Feature Store
  • Profiling data (distributions, missing values, correlations)
  • Applying 300+ built-in transformations
  • Generating a feature engineering pipeline exportable as code

Feature Engineering Concepts

The exam tests feature engineering heavily because it is often the most impactful step in improving model quality.

Common transformations:

| Transformation | When to Apply |
| --- | --- |
| Normalization (min-max scaling) | When features have different ranges; required for distance-based algorithms (KNN, SVM) |
| Standardization (z-score) | When features need zero mean and unit variance; for gradient-based algorithms |
| Log transform | When a feature has a right-skewed distribution |
| One-hot encoding | For nominal categorical variables (no ordinal relationship) |
| Ordinal encoding | For ordinal categorical variables (e.g., small/medium/large) |
| Binning | To convert continuous values into discrete categories |
| Imputation | To fill missing values with the mean, median, or a model-based estimate |
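
As a hedged illustration of several transformations from the table, using pandas and scikit-learn on a made-up dataset:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    'income': [40000.0, 85000.0, 120000.0, np.nan],
    'size': ['small', 'large', 'medium', 'small'],
})

# Imputation: fill the missing value with the median
df['income'] = df['income'].fillna(df['income'].median())

# Log transform for a right-skewed feature
df['log_income'] = np.log1p(df['income'])

# Min-max scaling to [0, 1] and z-score standardization
df['income_minmax'] = MinMaxScaler().fit_transform(df[['income']]).ravel()
df['income_z'] = StandardScaler().fit_transform(df[['income']]).ravel()

# One-hot encode the nominal column; ordinal-encode it by explicit mapping
onehot = pd.get_dummies(df['size'], prefix='size')
df['size_ordinal'] = df['size'].map({'small': 0, 'medium': 1, 'large': 2})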

Handling imbalanced datasets:

Class imbalance (e.g., 1% fraud cases, 99% non-fraud) causes models to predict the majority class. Solutions:

  • Oversampling: Duplicate or synthesize minority class samples (SMOTE)
  • Undersampling: Remove majority class samples
  • Class weights: Assign a higher weight to the minority class during training (see the sketch after this list)
  • Evaluation metric: Use F1, AUC-ROC, or precision-recall curve rather than accuracy
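
One hedged example of the class-weight remedy: the SageMaker XGBoost built-in algorithm exposes scale_pos_weight, and a common heuristic sets it to the negative-to-positive ratio (the counts below are made up):

# Hypothetical 99:1 class imbalance
n_negative, n_positive = 99_000, 1_000
scale_pos_weight = n_negative / n_positive   # = 99.0

# Hyperparameters for the SageMaker XGBoost built-in algorithm
hyperparameters = {
    'objective': 'binary:logistic',
    'scale_pos_weight': scale_pos_weight,   # up-weight the minority class
    'eval_metric': 'aucpr',                 # precision-recall AUC suits imbalance
}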

Domain 3: Modeling (36%)

AWS Built-in Algorithms

SageMaker includes built-in algorithms optimized for distributed training. These are among the most tested topics.

| Algorithm | Type | Use Case |
| --- | --- | --- |
| XGBoost | Supervised: classification, regression | Tabular data; frequently the top performer |
| Linear Learner | Supervised: classification, regression | Linear relationships; fast training |
| K-Nearest Neighbors (KNN) | Supervised: classification, regression | Simple; expensive at inference time |
| Factorization Machines | Supervised: classification, regression | Sparse data, recommendation systems |
| DeepAR | Supervised: time series forecasting | Forecasting multiple related time series |
| Object2Vec | Unsupervised / supervised | Embedding pairs (e.g., sentence similarity) |
| BlazingText | NLP: word embeddings, text classification | Fast Word2Vec and text classification |
| Seq2Seq | NLP: sequence-to-sequence | Translation, text summarization |
| LDA (Latent Dirichlet Allocation) | Unsupervised: topic modeling | Discovering topics in documents |
| k-means | Unsupervised: clustering | Grouping similar items |
| PCA | Unsupervised: dimensionality reduction | Reducing feature count before training |
| IP Insights | Unsupervised: anomaly detection | Detecting unusual IP address behavior |
| Random Cut Forest | Unsupervised: anomaly detection | Time series anomaly detection |
| Object Detection | Computer vision | Identifying and locating objects in images |
| Image Classification | Computer vision | Classifying images into categories |
| Semantic Segmentation | Computer vision | Pixel-level image classification |

Hyperparameter Tuning

SageMaker Automatic Model Tuning (AMT):

AMT searches the hyperparameter space to find the best combination:

  • Bayesian optimization: Uses probabilistic model of the objective function to select promising hyperparameter sets; efficient for expensive experiments
  • Grid search: Exhaustive search over defined parameter values; not practical for large spaces
  • Random search: Random sampling; simple but less efficient than Bayesian

Specify the objective metric, hyperparameter ranges, and maximum number of training jobs. AMT runs jobs in parallel (respecting the concurrency limit) and, under the Bayesian strategy, focuses the search based on results from earlier jobs.
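
A minimal AMT sketch using the SageMaker Python SDK; the estimator is assumed to be defined as in Domain 4 below, and the metric name, ranges, and S3 URIs are illustrative:

from sagemaker.tuner import (HyperparameterTuner, ContinuousParameter,
                             IntegerParameter)

tuner = HyperparameterTuner(
    estimator=estimator,                     # an Estimator defined elsewhere
    objective_metric_name='validation:auc',  # metric the algorithm emits
    hyperparameter_ranges={
        'eta': ContinuousParameter(0.01, 0.3),
        'max_depth': IntegerParameter(3, 10),
    },
    strategy='Bayesian',     # 'Random' and 'Grid' are also supported
    max_jobs=20,             # total training jobs to run
    max_parallel_jobs=3,     # concurrency limit
)
tuner.fit({'train': train_s3_uri, 'validation': validation_s3_uri})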

Model Evaluation Metrics

Classification:

| Metric | Formula | When to Use |
| --- | --- | --- |
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes only |
| Precision | TP/(TP+FP) | When false positives are costly |
| Recall (Sensitivity) | TP/(TP+FN) | When false negatives are costly |
| F1 Score | 2 × (Precision × Recall)/(Precision + Recall) | Balancing the precision-recall trade-off |
| AUC-ROC | Area under the ROC curve | Ranking quality; threshold-independent |
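
These definitions can be sanity-checked with scikit-learn; the labels and scores below are made up:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1, 1, 0]                     # ground-truth labels
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]                     # hard predictions
y_score = [0.1, 0.2, 0.6, 0.9, 0.8, 0.4, 0.7, 0.3]    # predicted probabilities

print(accuracy_score(y_true, y_pred))    # (TP+TN)/(TP+TN+FP+FN)
print(precision_score(y_true, y_pred))   # TP/(TP+FP)
print(recall_score(y_true, y_pred))      # TP/(TP+FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))    # threshold-independent ranking quality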

Regression:

  • MSE (Mean Squared Error): Penalizes large errors heavily
  • MAE (Mean Absolute Error): More robust to outliers
  • RMSE: Square root of MSE; same units as target variable
  • R-squared: Proportion of variance explained; 1.0 is perfect

Overfitting and Regularization

Indicators of overfitting: Low training loss, high validation loss (large gap between them).

Remedies:

  • Reduce model complexity (fewer layers, fewer trees)
  • L1 regularization (Lasso): Shrinks some coefficients to exactly zero, performing implicit feature selection
  • L2 regularization (Ridge): Shrinks all coefficients toward zero without eliminating them; both behaviors are sketched after this list
  • Dropout: Randomly disable neurons during training (neural networks)
  • Early stopping: Stop training when validation loss stops improving
  • More training data or data augmentation
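
A brief illustration of the L1 vs. L2 contrast with scikit-learn on synthetic data (the alpha values are arbitrary):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features carry signal; the other eight are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives noise coefficients to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero

print(np.round(lasso.coef_, 2))   # zeros outside the two informative features
print(np.round(ridge.coef_, 2))   # small but nonzero everywhere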

Domain 4: ML Implementation and Operations (20%)

SageMaker Training and Deployment

Training job configuration:

import sagemaker

# container_uri, role, and s3_output_path are assumed to be defined earlier
estimator = sagemaker.estimator.Estimator(
    image_uri=container_uri,          # ECR URI of the training container
    role=role,                        # IAM role SageMaker assumes for the job
    instance_count=2,                 # two instances enable distributed training
    instance_type='ml.p3.8xlarge',    # GPU instance type for deep learning
    volume_size=50,                   # EBS volume size in GB
    max_run=3600,                     # maximum training time in seconds
    output_path=s3_output_path        # S3 location for model artifacts
)

Distributed training strategies:

  • Data parallelism: Split the training data across instances; each instance holds a full copy of the model, and gradients are aggregated. Use the SageMaker distributed data parallelism (SMDDP) library (see the sketch after this list)
  • Model parallelism: Split a model that is too large to fit on one GPU across multiple devices. Use the SageMaker model parallelism (SMP) library
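
A hedged sketch of enabling the SMDDP library on a framework estimator; the entry point script, framework version, and instance choice are assumptions (SMDDP supports only specific GPU instance types):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',           # hypothetical training script
    role=role,                        # IAM role defined elsewhere
    framework_version='2.0',
    py_version='py310',
    instance_count=2,
    instance_type='ml.p4d.24xlarge',
    # Enable SageMaker distributed data parallelism
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}},
)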

SageMaker Inference Options

| Option | Latency | Use Case |
| --- | --- | --- |
| Real-time endpoint | Milliseconds | Interactive applications, low-latency requirements |
| Serverless inference | Variable (cold starts) | Infrequent, variable traffic |
| Batch transform | Minutes to hours | Offline batch scoring |
| Asynchronous inference | Seconds to minutes | Large payloads, long inference times |

Multi-model endpoint: Host multiple models on a single endpoint. SageMaker loads models into memory on demand and caches them. Reduces endpoint costs when hosting many low-traffic models.
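
Invoking a multi-model endpoint selects the model per request via the TargetModel parameter; the endpoint name, artifact path, and payload below are hypothetical:

import boto3

runtime = boto3.client('sagemaker-runtime')

# TargetModel names an artifact under the endpoint's S3 model prefix;
# SageMaker loads it on first use and keeps it cached in memory
response = runtime.invoke_endpoint(
    EndpointName='mme-endpoint',
    TargetModel='customer-42/model.tar.gz',
    ContentType='text/csv',
    Body='5.1,3.5,1.4,0.2',
)
print(response['Body'].read())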

Inference pipelines: Chain preprocessing, ML model, and post-processing containers into a single endpoint invocation. Ensures the same transformations are applied at inference time as during training.

SageMaker Pipelines

SageMaker Pipelines provides a CI/CD-like workflow for ML:

| Step Type | Purpose |
| --- | --- |
| Processing step | Feature engineering, data validation |
| Training step | Model training |
| Evaluation step | Computing model metrics |
| Condition step | Branching based on metric thresholds |
| Register step | Registering the model in the Model Registry if metrics pass |
| Transform step | Batch inference |

Model Registry: Central catalog of trained models with version tracking and approval workflow. Approved models can be deployed to endpoints automatically via pipeline.
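
A condensed sketch of defining and running a pipeline with the SageMaker Python SDK, using the classic TrainingStep signature (newer SDK versions prefer step_args). Only a training step is wired up here; evaluation, condition, and register steps follow the same pattern. The pipeline name, S3 URI, and estimator are assumptions:

from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# Train with an Estimator defined elsewhere (see the configuration above)
train_step = TrainingStep(
    name='TrainModel',
    estimator=estimator,
    inputs={'train': TrainingInput(s3_data=train_s3_uri)},
)

pipeline = Pipeline(name='mls-demo-pipeline', steps=[train_step])
pipeline.upsert(role_arn=role)   # create or update the pipeline definition
pipeline.start()                 # launch an execution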

MLOps: Monitoring and Drift Detection

SageMaker Model Monitor:

  • Data quality monitoring: Compare feature distributions in production against the training baseline
  • Model quality monitoring: Compare predictions against ground truth labels (requires actuals to be captured)
  • Bias drift monitoring: Detect changes in model fairness metrics over time
  • Feature attribution drift: Track changes in which features contribute most to predictions (using SHAP values)

Model Monitor generates violations when drift exceeds configured thresholds. Violations trigger CloudWatch alarms.
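
A hedged sketch of scheduling hourly data quality monitoring; the S3 URIs and endpoint name are hypothetical:

from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(role=role, instance_count=1,
                              instance_type='ml.m5.xlarge')

# Baseline statistics and constraints computed from the training data
monitor.suggest_baseline(
    baseline_dataset=train_s3_uri,
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_s3_uri,
)

# Compare captured endpoint traffic against the baseline every hour
monitor.create_monitoring_schedule(
    monitor_schedule_name='data-quality-hourly',
    endpoint_input='fraud-detector',
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)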

SageMaker Clarify:

Clarify provides:

  • Bias detection: Pre-training (data bias) and post-training (model bias) analysis
  • Explainability: SHAP values for global and per-prediction feature importance

"Feature engineering remains the highest-leverage activity in applied machine learning. A well-engineered feature set with a simple model almost always outperforms a poorly engineered feature set with a complex one. The MLS-C01 exam reflects this reality — it tests feature engineering concepts more heavily than algorithm tuning." — Aurélien Géron, author of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (O'Reilly, 3rd edition, 2022)

Study Timeline

Recommended: 10-14 weeks. Requires basic ML knowledge and Python familiarity.

| Week | Focus |
| --- | --- |
| 1-2 | Data engineering: Glue, Kinesis, Feature Store, S3 data lake |
| 3-4 | EDA: Data Wrangler, feature engineering techniques, imbalanced data |
| 5-6 | SageMaker built-in algorithms: supervised, unsupervised, NLP, CV |
| 7-8 | Training configuration, distributed training, hyperparameter tuning |
| 9-10 | Model evaluation metrics, overfitting, regularization |
| 11-12 | Inference options, SageMaker Pipelines, Model Registry |
| 13-14 | MLOps, Model Monitor, Clarify, practice exams |

See also: AWS Solutions Architect Associate (SAA-C03) Study Guide: Domains, Services, and Scenarios

References

  1. AWS. "AWS Certified Machine Learning - Specialty Exam Guide (MLS-C01)." https://d1.awsstatic.com/training-and-certification/docs-ml/AWS-Certified-Machine-Learning-Specialty_Exam-Guide.pdf
  2. AWS. "Amazon SageMaker Developer Guide." https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
  3. AWS. "Use Amazon SageMaker Built-in Algorithms." https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
  4. AWS. "Amazon SageMaker Pipelines." https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html
  5. Faye Ellis. "AWS Certified Machine Learning Specialty MLS-C01." Udemy, 2023.
  6. AWS. "Amazon SageMaker Model Monitor." https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html
  7. Géron, Aurélien. "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow." O'Reilly Media, 3rd edition, 2022.
  8. AWS. "AWS Machine Learning Blog." https://aws.amazon.com/blogs/machine-learning/

Frequently Asked Questions

What ML background is needed before studying for MLS-C01?

You should understand supervised and unsupervised learning concepts, common algorithms (regression, classification, clustering), model evaluation metrics, and overfitting/regularization before studying AWS-specific services. The exam tests both ML theory and AWS implementation.

Which SageMaker algorithm is best for tabular data classification tasks?

XGBoost is the most widely used built-in algorithm for tabular classification and regression tasks. It is frequently the top performer on structured data and supports distributed training on multiple instances.

What is the difference between data parallelism and model parallelism in SageMaker?

Data parallelism splits the training dataset across multiple instances, each holding a full copy of the model, then aggregates gradients. Model parallelism splits the model itself across devices when it is too large to fit in a single GPU's memory.

What is training-serving skew and how does SageMaker Feature Store prevent it?

Training-serving skew occurs when features used during training differ from features used at inference time. Feature Store prevents this by providing the same feature definitions to both the offline store (training) and the online store (real-time inference).

What does SageMaker Model Monitor detect?

Model Monitor detects data quality drift (distribution changes in incoming features), model quality drift (degradation in prediction accuracy), bias drift (changes in fairness metrics), and feature attribution drift (changes in which features most influence predictions).