The AWS Certified Machine Learning - Specialty (MLS-C01) exam validates the ability to design, build, train, tune, and deploy machine learning models on AWS. It requires understanding both ML concepts — supervised learning, deep learning, feature engineering, model evaluation — and the AWS services that implement them. Neither AWS knowledge nor ML knowledge alone is sufficient; you need both.
This guide covers all exam domains with emphasis on SageMaker architecture, model training, deployment patterns, and the supporting data engineering services.
Exam Overview
The MLS-C01 exam contains 65 questions (50 scored, 15 unscored) with a 180-minute time limit. The passing score is 750 out of 1000.
Domain Weights
| Domain | Weight |
|---|---|
| Domain 1: Data Engineering | 20% |
| Domain 2: Exploratory Data Analysis | 24% |
| Domain 3: Modeling | 36% |
| Domain 4: Machine Learning Implementation and Operations | 20% |
Domain 3 (Modeling) is by far the most heavily tested area. You must know ML algorithms, their assumptions, their hyperparameters, and when one is preferred over another.
Domain 1: Data Engineering (20%)
Data Ingestion and Storage
ML pipelines require data at scale. The exam tests which AWS service to use at each stage.
| Stage | AWS Service |
|---|---|
| Batch data ingestion | AWS Glue, AWS Data Pipeline, S3 batch operations |
| Streaming ingestion | Amazon Kinesis Data Streams, Kinesis Data Firehose |
| Feature storage | Amazon SageMaker Feature Store |
| Data lake | Amazon S3 + AWS Glue Data Catalog |
| Data warehouse | Amazon Redshift |
Kinesis Data Streams vs. Kinesis Data Firehose for ML:
Kinesis Data Streams enables custom consumers (Lambda, custom applications) to process each record with low latency. Use it when you need to invoke a SageMaker endpoint for real-time inference on each event.
Kinesis Data Firehose delivers data to destinations (S3, Redshift, OpenSearch) with optional transformation via Lambda. Use it when the goal is landing streaming data in S3 for batch training.
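To make the Streams-plus-endpoint pattern concrete, here is a minimal sketch of the consumer side: Kinesis record payloads arrive in Lambda events base64-encoded, and each decoded record would then be sent to a SageMaker real-time endpoint (the `sagemaker-runtime` `InvokeEndpoint` call is omitted here to keep the sketch self-contained). The event shape and field names below are illustrative, not taken from a real stream.

```python
# Sketch of a Lambda consumer for Kinesis Data Streams. Payloads arrive
# base64-encoded; each decoded record would be forwarded to a SageMaker
# endpoint via the sagemaker-runtime InvokeEndpoint API (call omitted).
import base64
import json

def extract_payloads(event):
    """Decode the base64 data field of each Kinesis record."""
    payloads = []
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))
    return payloads

# Minimal fabricated event for illustration:
event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(
            json.dumps({"user_id": 42, "amount": 19.99}).encode()).decode()}}
    ]
}
decoded = extract_payloads(event)  # records ready for inference
```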
AWS Glue for Data Preparation
AWS Glue provides a serverless ETL service:
- Glue Crawlers: Discover schema automatically from S3, RDS, and other sources; populate the Glue Data Catalog
- Glue Jobs: PySpark or Python shell scripts for transformation; PySpark jobs run on a managed Spark cluster, while Python shell jobs run on a single node
- Glue DataBrew: Visual data preparation without code; profile data, detect anomalies, apply transformations
Data Catalog: Central metadata repository. Athena, Redshift Spectrum, and EMR can query data registered in the Data Catalog without moving it.
SageMaker Feature Store
Feature Store provides a centralized repository for ML features:
- Online store: Low-latency retrieval for real-time inference (milliseconds)
- Offline store: Historical feature values stored in S3 for training
Feature Store ensures consistency between training features (offline store) and serving features (online store), preventing training-serving skew.
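The skew-prevention idea can be sketched without the Feature Store API itself: a single feature function (hypothetical names below) is the only path by which raw records become features, whether they are materialized to the offline store for training or served from the online store at inference time.

```python
# Illustrative sketch (not the Feature Store API): one feature definition
# feeds both the offline store (training) and the online store (inference),
# so the transformation logic cannot drift apart.
def engineer_features(raw):
    """Single source of truth for feature logic (hypothetical features)."""
    return {
        "amount_bucket": 0 if raw["amount"] < 10 else 1,
        "is_weekend": int(raw["day_of_week"] in ("Sat", "Sun")),
    }

record = {"amount": 25.0, "day_of_week": "Sat"}
offline_row = engineer_features(record)  # written to S3 for training
online_row = engineer_features(record)   # served at low latency for inference
```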
Domain 2: Exploratory Data Analysis (24%)
Data Analysis Tools
Amazon Athena: Serverless SQL queries directly on S3 data. Use for exploring raw datasets before building pipelines. Pay per query (per TB scanned).
Amazon SageMaker Data Wrangler: Visual interface within SageMaker Studio for:
- Importing data from S3, Athena, Redshift, Feature Store
- Profiling data (distributions, missing values, correlations)
- Applying 300+ built-in transformations
- Generating a feature engineering pipeline exportable as code
Feature Engineering Concepts
The exam tests feature engineering heavily because it is the most impactful step in improving model quality.
Common transformations:
| Transformation | When to Apply |
|---|---|
| Normalization (min-max scaling) | When features have different ranges; required for distance-based algorithms (KNN, SVM) |
| Standardization (z-score) | When features need zero mean and unit variance; for gradient-based algorithms |
| Log transform | When a feature has a right-skewed distribution |
| One-hot encoding | For nominal categorical variables (no ordinal relationship) |
| Ordinal encoding | For ordinal categorical variables (e.g., small/medium/large) |
| Binning | Convert continuous values to discrete categories |
| Imputation | Fill missing values with mean, median, or a model-based estimate |
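Three of the transformations above can be written from scratch in a few lines; a real pipeline would use scikit-learn or Data Wrangler, but the arithmetic is worth internalizing for the exam.

```python
# From-scratch versions of min-max scaling, z-score standardization, and
# one-hot encoding (illustrative only; use library implementations in practice).
import math

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

scaled = min_max([10, 20, 30])                       # -> [0.0, 0.5, 1.0]
standardized = z_score([10, 20, 30])                 # zero mean, unit variance
encoded = one_hot("red", ["red", "green", "blue"])   # -> [1, 0, 0]
```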
Handling imbalanced datasets:
Class imbalance (e.g., 1% fraud cases, 99% non-fraud) causes models to predict the majority class. Solutions:
- Oversampling: Duplicate or synthesize minority class samples (SMOTE)
- Undersampling: Remove majority class samples
- Class weights: Assign higher weight to minority class during training
- Evaluation metric: Use F1, AUC-ROC, or precision-recall curve rather than accuracy
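The simplest of these remedies is random oversampling by duplication, sketched below; note this is not SMOTE, which instead interpolates synthetic samples between minority-class neighbors.

```python
# Minimal random oversampling sketch: duplicate minority-class rows until
# the classes are balanced (simpler than SMOTE, which synthesizes new rows).
import random

def oversample(rows, labels, minority_label, seed=0):
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority_count = sum(1 for y in labels if y != minority_label)
    extra_needed = majority_count - len(minority)
    extra = [rng.choice(minority) for _ in range(extra_needed)]
    return rows + extra, labels + [minority_label] * extra_needed

rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]  # 4:1 imbalance
bal_rows, bal_labels = oversample(rows, labels, minority_label=1)
```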
Domain 3: Modeling (36%)
AWS Built-in Algorithms
SageMaker includes built-in algorithms optimized for distributed training. These are among the most tested topics.
| Algorithm | Type | Use Case |
|---|---|---|
| XGBoost | Supervised: classification, regression | Tabular data; frequently top performer |
| Linear Learner | Supervised: classification, regression | Linear relationships; fast training |
| K-Nearest Neighbors (KNN) | Supervised: classification, regression | Simple; expensive at inference time |
| Factorization Machines | Supervised: classification, regression | Sparse data, recommendation systems |
| DeepAR | Supervised: time series forecasting | Forecast multiple related time series |
| Object2Vec | Unsupervised / supervised | Embedding pairs (e.g., sentence similarity) |
| BlazingText | NLP: word embeddings, text classification | Fast word2vec, sentence classification |
| Seq2Seq | NLP: sequence to sequence | Translation, text summarization |
| LDA (Latent Dirichlet Allocation) | Unsupervised: topic modeling | Discover topics in documents |
| k-means | Unsupervised: clustering | Group similar items |
| PCA | Unsupervised: dimensionality reduction | Reduce feature count before training |
| IP Insights | Unsupervised: anomaly detection | Detect unusual IP address behavior |
| Random Cut Forest | Unsupervised: anomaly detection | Time series anomaly detection |
| Object Detection | Computer vision | Identify and locate objects in images |
| Image Classification | Computer vision | Classify images into categories |
| Semantic Segmentation | Computer vision | Pixel-level image classification |
Hyperparameter Tuning
SageMaker Automatic Model Tuning (AMT):
AMT searches the hyperparameter space to find the best combination:
- Bayesian optimization: Uses probabilistic model of the objective function to select promising hyperparameter sets; efficient for expensive experiments
- Grid search: Exhaustive search over defined parameter values; not practical for large spaces
- Random search: Random sampling; simple but less efficient than Bayesian
Specify the objective metric, hyperparameter ranges, and the maximum number of training jobs. AMT runs jobs in parallel (within configured concurrency limits) and, when using Bayesian optimization, focuses the search based on results from earlier jobs.
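The mechanics of random search can be shown in miniature: sample hyperparameters from their ranges, evaluate an objective, keep the best. The objective function below is fabricated purely for illustration; AMT automates this loop as a managed service and adds Bayesian optimization and parallel job execution on top.

```python
# Toy random search over two hyperparameters against a synthetic objective.
import random

def objective(lr, depth):
    # Fabricated validation score that peaks near lr=0.1, depth=6.
    return -((lr - 0.1) ** 2) - 0.01 * (depth - 6) ** 2

rng = random.Random(42)
best = None
for _ in range(50):                     # maximum number of training jobs
    lr = rng.uniform(0.001, 0.5)        # continuous hyperparameter range
    depth = rng.randint(2, 12)          # integer hyperparameter range
    score = objective(lr, depth)
    if best is None or score > best[0]:
        best = (score, lr, depth)

best_score, best_lr, best_depth = best
```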
Model Evaluation Metrics
Classification:
| Metric | Formula | When to Use |
|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes only |
| Precision | TP/(TP+FP) | When false positives are costly |
| Recall (Sensitivity) | TP/(TP+FN) | When false negatives are costly |
| F1 Score | 2*(Precision*Recall)/(Precision+Recall) | Balanced precision-recall trade-off |
| AUC-ROC | Area under ROC curve | Ranking quality; threshold-independent |
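The formulas in the table compute directly from confusion-matrix counts. The counts below are invented to show why accuracy alone misleads on imbalanced data: accuracy looks strong while precision and recall expose the weakness.

```python
# Classification metrics computed from confusion-matrix counts,
# matching the formulas in the table above.
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts: 90% accuracy, but only ~67% precision and recall.
acc, prec, rec, f1 = classification_metrics(tp=10, tn=80, fp=5, fn=5)
```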
Regression:
- MSE (Mean Squared Error): Penalizes large errors heavily
- MAE (Mean Absolute Error): More robust to outliers
- RMSE: Square root of MSE; same units as target variable
- R-squared: Proportion of variance explained; 1.0 is perfect
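The regression metrics follow the same pattern; the sample values below are made up, chosen so each metric is easy to verify by hand.

```python
# MSE, MAE, RMSE, and R-squared computed from scratch.
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n          # penalizes large errors
    mae = sum(abs(e) for e in errors) / n          # robust to outliers
    rmse = math.sqrt(mse)                          # same units as the target
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot                    # variance explained
    return mse, mae, rmse, r2

mse, mae, rmse, r2 = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])
```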
Overfitting and Regularization
Indicators of overfitting: Low training loss, high validation loss (large gap between them).
Remedies:
- Reduce model complexity (fewer layers, fewer trees)
- L1 regularization (Lasso): Shrinks some coefficients to zero; feature selection
- L2 regularization (Ridge): Shrinks all coefficients; reduces magnitude
- Dropout: Randomly disable neurons during training (neural networks)
- Early stopping: Stop training when validation loss stops improving
- More training data or data augmentation
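Early stopping reduces to a small amount of bookkeeping: track the best validation loss seen so far and stop once it has failed to improve for a set number of epochs (the "patience"). The loss sequence below is fabricated for illustration.

```python
# Early-stopping sketch: stop when validation loss fails to improve
# for `patience` consecutive epochs.
def early_stop_epoch(val_losses, patience=2):
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0   # new best: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch             # patience exhausted: stop here
    return len(val_losses) - 1

# Validation loss bottoms out at epoch 2, then rises.
losses = [0.90, 0.70, 0.55, 0.56, 0.57, 0.58]
stop_at = early_stop_epoch(losses, patience=2)  # stops at epoch 4
```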
Domain 4: ML Implementation and Operations (20%)
SageMaker Training and Deployment
Training job configuration:
```python
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()  # IAM role the training job assumes

estimator = sagemaker.estimator.Estimator(
    image_uri=container_uri,        # ECR URI of the training container
    role=role,
    instance_count=2,               # two instances enable distributed training
    instance_type='ml.p3.8xlarge',  # GPU instance for deep learning workloads
    volume_size=50,                 # EBS volume size in GB
    max_run=3600,                   # stop the job after one hour
    output_path=s3_output_path      # S3 location for model artifacts
)
```

(`container_uri` and `s3_output_path` are assumed to be defined earlier.)
Distributed training strategies:
- Data parallelism: Split training data across instances; each instance holds a full copy of the model, and gradients are aggregated across instances. Use the SageMaker distributed data parallel library (SMDDP)
- Model parallelism: Split a model too large to fit on one GPU across multiple devices. Use with SageMaker Distributed Model Parallel library
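Data parallelism in miniature: each worker computes gradients on its own data shard, and an all-reduce-style average produces one shared update. The gradient values below are fabricated; in practice SageMaker's distributed data parallel library performs this aggregation efficiently across instances and GPUs.

```python
# Each worker trains on its own shard; gradients are averaged into one update.
def average_gradients(per_worker_grads):
    n_workers = len(per_worker_grads)
    n_params = len(per_worker_grads[0])
    return [sum(g[i] for g in per_worker_grads) / n_workers
            for i in range(n_params)]

worker_grads = [
    [0.2, -0.4, 0.6],   # gradients from worker 0's data shard
    [0.4, -0.2, 0.2],   # gradients from worker 1's data shard
]
avg = average_gradients(worker_grads)   # one shared update for all copies
```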
SageMaker Inference Options
| Option | Latency | Use Case |
|---|---|---|
| Real-time endpoint | Milliseconds | Interactive applications, low-latency requirements |
| Serverless inference | Variable (cold start) | Infrequent, variable traffic |
| Batch transform | Minutes to hours | Offline batch scoring |
| Asynchronous inference | Seconds to minutes | Large payloads, long inference time |
Multi-model endpoint: Host multiple models on a single endpoint. SageMaker loads models into memory on demand and caches them. Reduces endpoint costs when hosting many low-traffic models.
Inference pipelines: Chain preprocessing, ML model, and post-processing containers into a single endpoint invocation. Ensures the same transformations are applied at inference time as during training.
SageMaker Pipelines
SageMaker Pipelines provides a CI/CD-like workflow for ML:
| Step Type | Purpose |
|---|---|
| Processing step | Feature engineering, data validation |
| Training step | Model training |
| Evaluation step | Compute model metrics |
| Condition step | Branch based on metric thresholds |
| Register step | Register model in Model Registry if metrics pass |
| Transform step | Batch inference |
Model Registry: Central catalog of trained models with version tracking and approval workflow. Approved models can be deployed to endpoints automatically via pipeline.
MLOps: Monitoring and Drift Detection
SageMaker Model Monitor:
- Data quality monitoring: Compare feature distributions in production against the training baseline
- Model quality monitoring: Compare predictions against ground truth labels (requires actuals to be captured)
- Bias drift monitoring: Detect changes in model fairness metrics over time
- Feature attribution drift: Track changes in which features contribute most to predictions (using SHAP values)
Model Monitor generates violations when drift exceeds configured thresholds. Violations trigger CloudWatch alarms.
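The flavor of a data quality check can be sketched with a crude mean-shift test; this is not Model Monitor's actual statistics (it uses richer baseline constraints), but it shows the baseline-versus-production comparison the service automates. All values below are fabricated.

```python
# Illustrative drift check: flag a feature when its production mean shifts
# beyond a threshold relative to the training baseline's standard deviation.
import math

def mean_drift(baseline, production, threshold=2.0):
    mean_b = sum(baseline) / len(baseline)
    std_b = math.sqrt(sum((x - mean_b) ** 2 for x in baseline) / len(baseline))
    mean_p = sum(production) / len(production)
    return abs(mean_p - mean_b) > threshold * std_b

baseline = [10.0, 11.0, 9.0, 10.5, 9.5]             # training-time values
drifted = mean_drift(baseline, [16.0, 17.0, 15.5])  # shifted production data
stable = mean_drift(baseline, [10.2, 9.8, 10.1])    # production matches baseline
```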
SageMaker Clarify:
Clarify provides:
- Bias detection: Pre-training (data bias) and post-training (model bias) analysis
- Explainability: SHAP values for global and per-prediction feature importance
"Feature engineering remains the highest-leverage activity in applied machine learning. A well-engineered feature set with a simple model almost always outperforms a poorly engineered feature set with a complex one. The MLS-C01 exam reflects this reality — it tests feature engineering concepts more heavily than algorithm tuning." — Aurélien Géron, author of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (O'Reilly, 3rd edition, 2022)
Study Timeline
Recommended: 10-14 weeks. Requires basic ML knowledge and Python familiarity.
| Week | Focus |
|---|---|
| 1-2 | Data engineering: Glue, Kinesis, Feature Store, S3 data lake |
| 3-4 | EDA: Data Wrangler, feature engineering techniques, imbalanced data |
| 5-6 | SageMaker built-in algorithms: supervised, unsupervised, NLP, CV |
| 7-8 | Training configuration, distributed training, hyperparameter tuning |
| 9-10 | Model evaluation metrics, overfitting, regularization |
| 11-12 | Inference options, SageMaker Pipelines, Model Registry |
| 13-14 | MLOps, Model Monitor, Clarify, practice exams |
See also: AWS Solutions Architect Associate (SAA-C03) Study Guide: Domains, Services, and Scenarios
References
- AWS. "AWS Certified Machine Learning - Specialty Exam Guide (MLS-C01)." https://d1.awsstatic.com/training-and-certification/docs-ml/AWS-Certified-Machine-Learning-Specialty_Exam-Guide.pdf
- AWS. "Amazon SageMaker Developer Guide." https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
- AWS. "Use Amazon SageMaker Built-in Algorithms." https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
- AWS. "Amazon SageMaker Pipelines." https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html
- Faye Ellis. "AWS Certified Machine Learning Specialty MLS-C01." Udemy, 2023.
- AWS. "Amazon SageMaker Model Monitor." https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html
- Géron, Aurélien. "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow." O'Reilly Media, 3rd edition, 2022.
- AWS. "AWS Machine Learning Blog." https://aws.amazon.com/blogs/machine-learning/
Frequently Asked Questions
What ML background is needed before studying for MLS-C01?
You should understand supervised and unsupervised learning concepts, common algorithms (regression, classification, clustering), model evaluation metrics, and overfitting/regularization before studying AWS-specific services. The exam tests both ML theory and AWS implementation.
Which SageMaker algorithm is best for tabular data classification tasks?
XGBoost is the most widely used built-in algorithm for tabular classification and regression tasks. It is frequently the top performer on structured data and supports distributed training on multiple instances.
What is the difference between data parallelism and model parallelism in SageMaker?
Data parallelism splits the training dataset across multiple instances, each holding a full copy of the model, then aggregates gradients. Model parallelism splits the model itself across devices when it is too large to fit in a single GPU's memory.
What is training-serving skew and how does SageMaker Feature Store prevent it?
Training-serving skew occurs when features used during training differ from features used at inference time. Feature Store prevents this by providing the same feature definitions to both the offline store (training) and the online store (real-time inference).
What does SageMaker Model Monitor detect?
Model Monitor detects data quality drift (distribution changes in incoming features), model quality drift (degradation in prediction accuracy), bias drift (changes in fairness metrics), and feature attribution drift (changes in which features most influence predictions).
