AWS Machine Learning Specialty Study Guide: SageMaker, Pipelines, and Model Deployment

Comprehensive MLS-C01 study guide covering SageMaker built-in algorithms, feature engineering, hyperparameter tuning, distributed training, inference patterns, MLOps pipelines, and Model Monitor for the AWS Machine Learning Specialty exam.

The AWS Certified Machine Learning - Specialty (MLS-C01) exam validates the ability to design, build, train, tune, and deploy machine learning models on AWS. It requires an understanding both of ML concepts (supervised learning, deep learning, feature engineering, model evaluation) and of the AWS services that implement them. Neither AWS knowledge nor ML knowledge alone is sufficient; you need both.

This guide covers all exam domains with emphasis on SageMaker architecture, model training, deployment patterns, and the supporting data engineering services.

Exam Overview

The MLS-C01 exam contains 65 questions (50 scored, 15 unscored) with a 180-minute time limit. The passing score is 750 out of 1000.

Domain Weights

| Domain | Weight |
| --- | --- |
| Domain 1: Data Engineering | 20% |
| Domain 2: Exploratory Data Analysis | 24% |
| Domain 3: Modeling | 36% |
| Domain 4: Machine Learning Implementation and Operations | 20% |

Domain 3 (Modeling) is by far the most heavily tested area. You must know ML algorithms, their assumptions, their hyperparameters, and when one is preferred over another.

Domain 1: Data Engineering (20%)

Data Ingestion and Storage

ML pipelines require data at scale. The exam tests which AWS service to use at each stage.

| Stage | AWS Service |
| --- | --- |
| Batch data ingestion | AWS Glue, AWS Data Pipeline, S3 batch operations |
| Streaming ingestion | Amazon Kinesis Data Streams, Kinesis Data Firehose |
| Feature storage | Amazon SageMaker Feature Store |
| Data lake | Amazon S3 + AWS Glue Data Catalog |
| Data warehouse | Amazon Redshift |

Kinesis Data Streams vs. Kinesis Data Firehose for ML:

Kinesis Data Streams enables custom consumers (Lambda, custom applications) to process each record with low latency. Use it when you need to invoke a SageMaker endpoint for real-time inference on each event.

Kinesis Data Firehose delivers data to destinations (S3, Redshift, OpenSearch) with optional transformation via Lambda. Use it when the goal is landing streaming data in S3 for batch training.
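
As a minimal sketch of the Streams pattern above, a Lambda consumer can decode each Kinesis record and invoke a real-time SageMaker endpoint. The endpoint name, payload format, and response format here are hypothetical:

import base64
import json
import boto3

runtime = boto3.client('sagemaker-runtime')

def handler(event, context):
    # Kinesis delivers records base64-encoded inside the Lambda event
    for record in event['Records']:
        payload = base64.b64decode(record['kinesis']['data'])
        # Score each event against a deployed endpoint (name is hypothetical)
        response = runtime.invoke_endpoint(
            EndpointName='fraud-detector',
            ContentType='text/csv',
            Body=payload
        )
        prediction = json.loads(response['Body'].read())
        print(prediction)   # in practice, route the result downstream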

AWS Glue for Data Preparation

AWS Glue provides a serverless ETL service:

  • Glue Crawlers: Discover schema automatically from S3, RDS, and other sources; populate the Glue Data Catalog
  • Glue Jobs: PySpark or Python shell scripts for transformation; run on a managed Spark cluster
  • Glue DataBrew: Visual data preparation without code; profile data, detect anomalies, apply transformations

Data Catalog: Central metadata repository. Athena, Redshift Spectrum, and EMR can query data registered in the Data Catalog without moving it.

SageMaker Feature Store

Feature Store provides a centralized repository for ML features:

  • Online store: Low-latency retrieval for real-time inference (milliseconds)
  • Offline store: Historical feature values stored in S3 for training

Because both stores are populated from the same feature definitions, training features (offline store) stay consistent with serving features (online store), preventing training-serving skew.
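
A minimal sketch of online retrieval at inference time, using boto3; the feature group name and record identifier are hypothetical:

import boto3

featurestore = boto3.client('sagemaker-featurestore-runtime')

# Fetch the latest feature values for one entity from the online store
response = featurestore.get_record(
    FeatureGroupName='customer-features',
    RecordIdentifierValueAsString='customer-123'
)
features = {f['FeatureName']: f['ValueAsString'] for f in response['Record']}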

Domain 2: Exploratory Data Analysis (24%)

Data Analysis Tools

Amazon Athena: Serverless SQL queries directly on S3 data. Use for exploring raw datasets before building pipelines. Pay per query (per TB scanned).

Amazon SageMaker Data Wrangler: Visual interface within SageMaker Studio for:

  • Importing data from S3, Athena, Redshift, Feature Store
  • Profiling data (distributions, missing values, correlations)
  • Applying 300+ built-in transformations
  • Generating a feature engineering pipeline exportable as code

Feature Engineering Concepts

The exam tests feature engineering heavily because it is often the most impactful step in improving model quality.

Common transformations:

| Transformation | When to Apply |
| --- | --- |
| Normalization (min-max scaling) | When features have different ranges; required for distance-based algorithms (KNN, SVM) |
| Standardization (z-score) | When features need zero mean and unit variance; for gradient-based algorithms |
| Log transform | When a feature has a right-skewed distribution |
| One-hot encoding | For nominal categorical variables (no ordinal relationship) |
| Ordinal encoding | For ordinal categorical variables (e.g., small/medium/large) |
| Binning | To convert continuous values into discrete categories |
| Imputation | To fill missing values with the mean, median, or a model-based estimate |
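
As a hedged illustration of several transformations from the table, using pandas and scikit-learn on a made-up dataset:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    'income': [40000.0, 85000.0, 120000.0, np.nan],
    'size': ['small', 'large', 'medium', 'small'],
})

# Imputation: fill the missing value with the median
df['income'] = df['income'].fillna(df['income'].median())

# Log transform for a right-skewed feature
df['log_income'] = np.log1p(df['income'])

# Min-max scaling to [0, 1] and z-score standardization
df['income_minmax'] = MinMaxScaler().fit_transform(df[['income']]).ravel()
df['income_z'] = StandardScaler().fit_transform(df[['income']]).ravel()

# One-hot encode the nominal column; ordinal-encode it by explicit mapping
onehot = pd.get_dummies(df['size'], prefix='size')
df['size_ordinal'] = df['size'].map({'small': 0, 'medium': 1, 'large': 2})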

Handling imbalanced datasets:

Class imbalance (e.g., 1% fraud cases, 99% non-fraud) causes models to predict the majority class. Solutions:

  • Oversampling: Duplicate or synthesize minority class samples (SMOTE)
  • Undersampling: Remove majority class samples
  • Class weights: Assign a higher weight to the minority class during training (see the sketch after this list)
  • Evaluation metric: Use F1, AUC-ROC, or precision-recall curve rather than accuracy
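
One hedged example of the class-weight remedy: the SageMaker XGBoost built-in algorithm exposes scale_pos_weight, and a common heuristic sets it to the negative-to-positive ratio (the counts below are made up):

# Hypothetical 99:1 class imbalance
n_negative, n_positive = 99_000, 1_000
scale_pos_weight = n_negative / n_positive   # = 99.0

# Hyperparameters for the SageMaker XGBoost built-in algorithm
hyperparameters = {
    'objective': 'binary:logistic',
    'scale_pos_weight': scale_pos_weight,   # up-weight the minority class
    'eval_metric': 'aucpr',                 # precision-recall AUC suits imbalance
}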

Domain 3: Modeling (36%)

AWS Built-in Algorithms

SageMaker includes built-in algorithms optimized for distributed training. These are among the most tested topics.

| Algorithm | Type | Use Case |
| --- | --- | --- |
| XGBoost | Supervised: classification, regression | Tabular data; frequently the top performer |
| Linear Learner | Supervised: classification, regression | Linear relationships; fast training |
| K-Nearest Neighbors (KNN) | Supervised: classification, regression | Simple; expensive at inference time |
| Factorization Machines | Supervised: classification, regression | Sparse data, recommendation systems |
| DeepAR | Supervised: time series forecasting | Forecasting multiple related time series |
| Object2Vec | Unsupervised / supervised | Embedding pairs (e.g., sentence similarity) |
| BlazingText | NLP: word embeddings, text classification | Fast Word2Vec and text classification |
| Seq2Seq | NLP: sequence-to-sequence | Translation, text summarization |
| LDA (Latent Dirichlet Allocation) | Unsupervised: topic modeling | Discovering topics in documents |
| k-means | Unsupervised: clustering | Grouping similar items |
| PCA | Unsupervised: dimensionality reduction | Reducing feature count before training |
| IP Insights | Unsupervised: anomaly detection | Detecting unusual IP address behavior |
| Random Cut Forest | Unsupervised: anomaly detection | Time series anomaly detection |
| Object Detection | Computer vision | Identifying and locating objects in images |
| Image Classification | Computer vision | Classifying images into categories |
| Semantic Segmentation | Computer vision | Pixel-level image classification |

Hyperparameter Tuning

SageMaker Automatic Model Tuning (AMT):

AMT searches the hyperparameter space to find the best combination:

  • Bayesian optimization: Uses probabilistic model of the objective function to select promising hyperparameter sets; efficient for expensive experiments
  • Grid search: Exhaustive search over defined parameter values; not practical for large spaces
  • Random search: Random sampling; simple but less efficient than Bayesian

Specify the objective metric, hyperparameter ranges, and maximum number of training jobs. AMT runs jobs in parallel (respecting the concurrency limit) and, under the Bayesian strategy, focuses the search based on results from earlier jobs.
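
A minimal AMT sketch using the SageMaker Python SDK; the estimator is assumed to be defined as in Domain 4 below, and the metric name, ranges, and S3 URIs are illustrative:

from sagemaker.tuner import (HyperparameterTuner, ContinuousParameter,
                             IntegerParameter)

tuner = HyperparameterTuner(
    estimator=estimator,                     # an Estimator defined elsewhere
    objective_metric_name='validation:auc',  # metric the algorithm emits
    hyperparameter_ranges={
        'eta': ContinuousParameter(0.01, 0.3),
        'max_depth': IntegerParameter(3, 10),
    },
    strategy='Bayesian',     # 'Random' and 'Grid' are also supported
    max_jobs=20,             # total training jobs to run
    max_parallel_jobs=3,     # concurrency limit
)
tuner.fit({'train': train_s3_uri, 'validation': validation_s3_uri})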

Model Evaluation Metrics

Classification:

| Metric | Formula | When to Use |
| --- | --- | --- |
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes only |
| Precision | TP/(TP+FP) | When false positives are costly |
| Recall (Sensitivity) | TP/(TP+FN) | When false negatives are costly |
| F1 Score | 2 × (Precision × Recall)/(Precision + Recall) | Balancing the precision-recall trade-off |
| AUC-ROC | Area under the ROC curve | Ranking quality; threshold-independent |
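
These definitions can be sanity-checked with scikit-learn; the labels and scores below are made up:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 0, 1, 1, 1, 1, 0]                     # ground-truth labels
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]                     # hard predictions
y_score = [0.1, 0.2, 0.6, 0.9, 0.8, 0.4, 0.7, 0.3]    # predicted probabilities

print(accuracy_score(y_true, y_pred))    # (TP+TN)/(TP+TN+FP+FN)
print(precision_score(y_true, y_pred))   # TP/(TP+FP)
print(recall_score(y_true, y_pred))      # TP/(TP+FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))    # threshold-independent ranking quality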

Regression:

  • MSE (Mean Squared Error): Penalizes large errors heavily
  • MAE (Mean Absolute Error): More robust to outliers
  • RMSE: Square root of MSE; same units as target variable
  • R-squared: Proportion of variance explained; 1.0 is perfect

Overfitting and Regularization

Indicators of overfitting: Low training loss, high validation loss (large gap between them).

Remedies:

  • Reduce model complexity (fewer layers, fewer trees)
  • L1 regularization (Lasso): Shrinks some coefficients to exactly zero, performing implicit feature selection
  • L2 regularization (Ridge): Shrinks all coefficients toward zero without eliminating them; both behaviors are sketched after this list
  • Dropout: Randomly disable neurons during training (neural networks)
  • Early stopping: Stop training when validation loss stops improving
  • More training data or data augmentation
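
A brief illustration of the L1 vs. L2 contrast with scikit-learn on synthetic data (the alpha values are arbitrary):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features carry signal; the other eight are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives noise coefficients to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero

print(np.round(lasso.coef_, 2))   # zeros outside the two informative features
print(np.round(ridge.coef_, 2))   # small but nonzero everywhere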

Domain 4: ML Implementation and Operations (20%)

SageMaker Training and Deployment

Training job configuration:

import sagemaker

# container_uri, role, and s3_output_path are assumed to be defined earlier
estimator = sagemaker.estimator.Estimator(
    image_uri=container_uri,          # ECR URI of the training container
    role=role,                        # IAM role SageMaker assumes for the job
    instance_count=2,                 # two instances enable distributed training
    instance_type='ml.p3.8xlarge',    # GPU instance type for deep learning
    volume_size=50,                   # EBS volume size in GB
    max_run=3600,                     # maximum training time in seconds
    output_path=s3_output_path        # S3 location for model artifacts
)

Distributed training strategies:

  • Data parallelism: Split the training data across instances; each instance holds a full copy of the model, and gradients are aggregated. Use the SageMaker distributed data parallelism (SMDDP) library (see the sketch after this list)
  • Model parallelism: Split a model that is too large to fit on one GPU across multiple devices. Use the SageMaker model parallelism (SMP) library
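
A hedged sketch of enabling the SMDDP library on a framework estimator; the entry point script, framework version, and instance choice are assumptions (SMDDP supports only specific GPU instance types):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',           # hypothetical training script
    role=role,                        # IAM role defined elsewhere
    framework_version='2.0',
    py_version='py310',
    instance_count=2,
    instance_type='ml.p4d.24xlarge',
    # Enable SageMaker distributed data parallelism
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}},
)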

SageMaker Inference Options

| Option | Latency | Use Case |
| --- | --- | --- |
| Real-time endpoint | Milliseconds | Interactive applications, low-latency requirements |
| Serverless inference | Variable (cold starts) | Infrequent, variable traffic |
| Batch transform | Minutes to hours | Offline batch scoring |
| Asynchronous inference | Seconds to minutes | Large payloads, long inference times |

Multi-model endpoint: Host multiple models on a single endpoint. SageMaker loads models into memory on demand and caches them. Reduces endpoint costs when hosting many low-traffic models.
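
Invoking a multi-model endpoint selects the model per request via the TargetModel parameter; the endpoint name, artifact path, and payload below are hypothetical:

import boto3

runtime = boto3.client('sagemaker-runtime')

# TargetModel names an artifact under the endpoint's S3 model prefix;
# SageMaker loads it on first use and keeps it cached in memory
response = runtime.invoke_endpoint(
    EndpointName='mme-endpoint',
    TargetModel='customer-42/model.tar.gz',
    ContentType='text/csv',
    Body='5.1,3.5,1.4,0.2',
)
print(response['Body'].read())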

Inference pipelines: Chain preprocessing, ML model, and post-processing containers into a single endpoint invocation. Ensures the same transformations are applied at inference time as during training.

SageMaker Pipelines

SageMaker Pipelines provides a CI/CD-like workflow for ML:

| Step Type | Purpose |
| --- | --- |
| Processing step | Feature engineering, data validation |
| Training step | Model training |
| Evaluation step | Computing model metrics |
| Condition step | Branching based on metric thresholds |
| Register step | Registering the model in the Model Registry if metrics pass |
| Transform step | Batch inference |

Model Registry: Central catalog of trained models with version tracking and approval workflow. Approved models can be deployed to endpoints automatically via pipeline.
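
A condensed sketch of defining and running a pipeline with the SageMaker Python SDK, using the classic TrainingStep signature (newer SDK versions prefer step_args). Only a training step is wired up here; evaluation, condition, and register steps follow the same pattern. The pipeline name, S3 URI, and estimator are assumptions:

from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# Train with an Estimator defined elsewhere (see the configuration above)
train_step = TrainingStep(
    name='TrainModel',
    estimator=estimator,
    inputs={'train': TrainingInput(s3_data=train_s3_uri)},
)

pipeline = Pipeline(name='mls-demo-pipeline', steps=[train_step])
pipeline.upsert(role_arn=role)   # create or update the pipeline definition
pipeline.start()                 # launch an execution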

MLOps: Monitoring and Drift Detection

SageMaker Model Monitor:

  • Data quality monitoring: Compare feature distributions in production against the training baseline
  • Model quality monitoring: Compare predictions against ground truth labels (requires actuals to be captured)
  • Bias drift monitoring: Detect changes in model fairness metrics over time
  • Feature attribution drift: Track changes in which features contribute most to predictions (using SHAP values)

Model Monitor generates violations when drift exceeds configured thresholds. Violations trigger CloudWatch alarms.
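
A hedged sketch of scheduling hourly data quality monitoring; the S3 URIs and endpoint name are hypothetical:

from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(role=role, instance_count=1,
                              instance_type='ml.m5.xlarge')

# Baseline statistics and constraints computed from the training data
monitor.suggest_baseline(
    baseline_dataset=train_s3_uri,
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_s3_uri,
)

# Compare captured endpoint traffic against the baseline every hour
monitor.create_monitoring_schedule(
    monitor_schedule_name='data-quality-hourly',
    endpoint_input='fraud-detector',
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)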

SageMaker Clarify:

Clarify provides:

  • Bias detection: Pre-training (data bias) and post-training (model bias) analysis
  • Explainability: SHAP values for global and per-prediction feature importance

"Feature engineering remains the highest-leverage activity in applied machine learning. A well-engineered feature set with a simple model almost always outperforms a poorly engineered feature set with a complex one. The MLS-C01 exam reflects this reality — it tests feature engineering concepts more heavily than algorithm tuning." — Aurélien Géron, author of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (O'Reilly, 3rd edition, 2022)

Study Timeline

Recommended: 10-14 weeks. Requires basic ML knowledge and Python familiarity.

| Week | Focus |
| --- | --- |
| 1-2 | Data engineering: Glue, Kinesis, Feature Store, S3 data lake |
| 3-4 | EDA: Data Wrangler, feature engineering techniques, imbalanced data |
| 5-6 | SageMaker built-in algorithms: supervised, unsupervised, NLP, CV |
| 7-8 | Training configuration, distributed training, hyperparameter tuning |
| 9-10 | Model evaluation metrics, overfitting, regularization |
| 11-12 | Inference options, SageMaker Pipelines, Model Registry |
| 13-14 | MLOps, Model Monitor, Clarify, practice exams |

See also: AWS Solutions Architect Associate (SAA-C03) Study Guide: Domains, Services, and Scenarios

References

  1. AWS. "AWS Certified Machine Learning - Specialty Exam Guide (MLS-C01)." https://d1.awsstatic.com/training-and-certification/docs-ml/AWS-Certified-Machine-Learning-Specialty_Exam-Guide.pdf
  2. AWS. "Amazon SageMaker Developer Guide." https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
  3. AWS. "Use Amazon SageMaker Built-in Algorithms." https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
  4. AWS. "Amazon SageMaker Pipelines." https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html
  5. Faye Ellis. "AWS Certified Machine Learning Specialty MLS-C01." Udemy, 2023.
  6. AWS. "Amazon SageMaker Model Monitor." https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html
  7. Géron, Aurélien. "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow." O'Reilly Media, 3rd edition, 2022.
  8. AWS. "AWS Machine Learning Blog." https://aws.amazon.com/blogs/machine-learning/

Frequently Asked Questions

What ML background is needed before studying for MLS-C01?

You should understand supervised and unsupervised learning concepts, common algorithms (regression, classification, clustering), model evaluation metrics, and overfitting/regularization before studying AWS-specific services. The exam tests both ML theory and AWS implementation.

Which SageMaker algorithm is best for tabular data classification tasks?

XGBoost is the most widely used built-in algorithm for tabular classification and regression tasks. It is frequently the top performer on structured data and supports distributed training on multiple instances.

What is the difference between data parallelism and model parallelism in SageMaker?

Data parallelism splits the training dataset across multiple instances, each holding a full copy of the model, then aggregates gradients. Model parallelism splits the model itself across devices when it is too large to fit in a single GPU's memory.

What is training-serving skew and how does SageMaker Feature Store prevent it?

Training-serving skew occurs when features used during training differ from features used at inference time. Feature Store prevents this by providing the same feature definitions to both the offline store (training) and the online store (real-time inference).

What does SageMaker Model Monitor detect?

Model Monitor detects data quality drift (distribution changes in incoming features), model quality drift (degradation in prediction accuracy), bias drift (changes in fairness metrics), and feature attribution drift (changes in which features most influence predictions).