AWS SAA-C03 — Complete Study Guide

Domain 1 Overview

Design architectures that protect AWS resources, workloads, data, and network traffic. Covers IAM, multi-account governance, network security, threat protection, and data encryption controls.

⚡ 30% of scored content — Highest weighted domain

Task 1.1

Design Secure Access to AWS Resources

IAM, federated identity, SCPs, multi-account strategy, shared responsibility model.

Knowledge of:

Access controls and management across multiple accounts
AWS federated access and identity services (for example, IAM, AWS IAM Identity Center)
AWS global infrastructure (for example, Availability Zones, AWS Regions)
AWS security best practices (for example, the principle of least privilege)
The AWS shared responsibility model

Skills in:

Applying AWS security best practices to IAM users and root users (for example, multi-factor authentication [MFA])
Designing a flexible authorization model that includes IAM users, groups, roles, and policies
Designing a role-based access control strategy (for example, AWS STS, role switching, cross-account access)
Designing a security strategy for multiple AWS accounts (for example, AWS Control Tower, service control policies [SCPs])
Determining the appropriate use of resource policies for AWS services
Determining when to federate a directory service with IAM roles

🔀 AWS Shared Responsibility Model

Foundational · Exam Fave

AWS and the customer divide security obligations at a clear boundary. The exam tests this boundary constantly.

The Split

AWS — "Security OF the Cloud"	Customer — "Security IN the Cloud"
Physical datacenters, hardware, networking, hypervisor	OS patches, app code, data encryption
Managed service durability & HA (S3, RDS failover)	IAM policies, S3 bucket policies, security groups
Global infrastructure (Regions, AZs, Edge)	Data classification and access management

Scenario

EC2 runs an unpatched Apache web server — who's responsible for the patch? The customer. AWS delivers the hardware and hypervisor; OS-level software is the customer's domain.

💡

Mnemonic: AWS secures OF the cloud (Physical/Infra). You secure IN the cloud (Data/Access). Think: Owned by AWS = OF; Input by Customer = IN.

graph TD
  subgraph Customer ["Customer: Security IN the Cloud"]
    Data["Customer Data"]
    IAM["IAM & Access"]
    OS["OS, Network & Firewall Config"]
    Encrypt["Client/Server Encryption"]
  end
  subgraph AWS ["AWS: Security OF the Cloud"]
    Compute["Compute / Storage"]
    Net["Networking"]
    Infra["Global Infra (Regions, AZs)"]
  end
  Customer --- AWS

🎯

Rule of thumb: The more managed the service (Lambda, DynamoDB), the more AWS owns. You always own your data and access controls regardless of service type.

⚠️

Common traps:

"AWS is responsible for patching RDS OS" — TRUE for RDS (managed), FALSE for EC2.
"AWS encrypts S3 by default so customer doesn't need to manage access" — FALSE; encryption ≠ access control.
"Customers are never responsible for network infrastructure" — FALSE in hybrid architectures; the customer owns and secures the on-premises side of a Direct Connect or VPN link.
Questions often swap "of" and "in" — read carefully.

🔑 IAM — Users, Groups, Roles & Policies

IAM · High Frequency

IAM is the control plane for all AWS access. Every exam scenario touches IAM at some level.

Entity Types

Entity	What It Is	When to Use
User	Long-term credentials for a person or app	Human workforce with permanent access
Group	Collection of users sharing policies	Assign permissions by job function
Role	Short-term STS credentials — no static keys	EC2/Lambda/cross-account/federated access
Policy	JSON Allow/Deny on actions and resources	Attached to any entity to grant/restrict permissions

Policy Layers (broadest scope first)

SCPs — Org-level guardrails; constrain everything below
Permissions Boundaries — Max permissions a delegated entity can have
Identity-based Policies — Attached directly to user/group/role
Resource-based Policies — Attached to the resource (S3, KMS key, etc.)
Session Policies — Temporary scope passed at AssumeRole time

Evaluation Rule

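Default = implicit Deny. An explicit Deny in any applicable policy always wins, even over an explicit Allow. An explicit Allow grants access only when no explicit Deny matches and every applicable guardrail (SCP, permissions boundary, session policy) also permits the action.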
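The evaluation rule can be sketched as a small Python model. This is a simplified illustration, not the real IAM engine: it ignores SCPs, permissions boundaries, conditions, and principals, and the names `matches`, `evaluate`, `identity_policy`, and `bucket_policy` are made up for the example. Wildcards are matched with `fnmatchcase`, which is case-sensitive, whereas real IAM action matching is not.

```python
from fnmatch import fnmatchcase


def matches(statement, action, resource):
    """True if the statement covers this action and resource (wildcards allowed)."""
    return (any(fnmatchcase(action, a) for a in statement["Action"])
            and any(fnmatchcase(resource, r) for r in statement["Resource"]))


def evaluate(policies, action, resource):
    """Return 'Allow' or 'Deny' for one request against several policies."""
    statements = [s for policy in policies for s in policy]
    # 1. Any matching explicit Deny ends the evaluation.
    if any(s["Effect"] == "Deny" and matches(s, action, resource) for s in statements):
        return "Deny"
    # 2. Otherwise a matching explicit Allow grants access.
    if any(s["Effect"] == "Allow" and matches(s, action, resource) for s in statements):
        return "Allow"
    # 3. Nothing matched: implicit Deny.
    return "Deny"


identity_policy = [{"Effect": "Allow", "Action": ["s3:*"], "Resource": ["*"]}]
bucket_policy = [{"Effect": "Deny", "Action": ["s3:DeleteObject"],
                  "Resource": ["arn:aws:s3:::my-bucket/*"]}]
policies = [identity_policy, bucket_policy]

print(evaluate(policies, "s3:GetObject", "arn:aws:s3:::my-bucket/a.txt"))     # Allow
print(evaluate(policies, "s3:DeleteObject", "arn:aws:s3:::my-bucket/a.txt"))  # Deny (explicit)
print(evaluate(policies, "ec2:RunInstances", "*"))                            # Deny (implicit)
```

The second call shows the exam-relevant point: the identity policy allows `s3:*`, yet the Deny in the resource policy still wins.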

Least-Privilege Policy — S3 Read-Only

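Developer can list and read only the dev-team/ prefix of one bucket:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListDevTeamPrefix",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-bucket",
      "Condition": {"StringLike": {"s3:prefix": ["dev-team/*"]}}
    },
    {
      "Sid": "ReadDevTeamObjects",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/dev-team/*"
    }
  ]
}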

🎯

Principle of Least Privilege: Grant only the minimum required. Prefer IAM Roles over long-term access keys. Never embed credentials in code — use instance profiles or Secrets Manager.

💡

Mnemonic: PIRATES evaluate policies: Permissions boundaries, Identity policies, Resource policies, All Together Explicit deny Supersedes.

⚠️

Common traps:

"Groups can be nested inside other groups" — FALSE; IAM groups are flat.
"An explicit Deny in a resource policy is overridden by an explicit Allow in an identity policy" — FALSE; explicit Deny always wins.
"Attaching a policy to a group grants permissions to the group itself" — FALSE; groups are not identities, permissions flow only to member users.
"Permissions Boundaries grant permissions" — FALSE; they only restrict the maximum.

👑 Root User Security & MFA

IAM · Exam Fave

Root User Best Practices

Enable MFA immediately after account creation — use hardware MFA for maximum security
Never create access keys for root — use IAM users/roles for all programmatic access
Lock root credentials; share with nobody; store hardware MFA token securely offline
Use root only for the small set of tasks that only root can perform

Root-Only Tasks

Change account root email/password · Enable MFA Delete on S3 · Activate IAM Billing access · Restore IAM admin when locked out · Change AWS Support plan · Close the AWS account · Register as Reserved Instance Marketplace seller.

MFA Types

Type	Example	Recommended For
Virtual MFA	Google Authenticator, Authy	Standard IAM users
Hardware MFA (TOTP)	Gemalto token	Privileged / root accounts
Hardware MFA (FIDO)	YubiKey	Highest assurance — root, break-glass
SMS MFA	Text message OTP	Not recommended (SIM-swap risk)

🎯

Exam will describe a scenario and ask which MFA type to recommend. Hardware MFA (FIDO/YubiKey) = highest security. For root, always recommend hardware. For general employees, virtual MFA is acceptable.

💡

Mnemonic: Root is MACS: MFA Delete, Account closing, Change support plan, Sign up for GovCloud. Only Root can do MACS.

⚠️

Common traps:

"IAM admin can perform all root tasks" — FALSE; some tasks (like MACS) require root exclusively.
"Enabling MFA on root prevents all unauthorized root access" — partially true, but root access keys bypass MFA on CLI calls — never create root access keys.
"SCPs restrict the root user of the management account" — FALSE; SCPs never affect any user or role in the management account, including its root user. They do restrict the root users of member accounts.

🔄 IAM Roles, AWS STS & Cross-Account Access

IAM · STS · High Frequency

Roles provide temporary credentials via STS — the preferred pattern for granting access to services, cross-account scenarios, and federated users.

How Role Assumption Works

Principal calls sts:AssumeRole → STS issues temporary credentials (AccessKeyId + SecretAccessKey + SessionToken), valid from 15 min up to the role's maximum session duration (1 hr by default, 12 hrs at most)
Trust Policy on the role defines who can assume it (the principal)
Permission Policy on the role defines what they can do
EC2 uses an Instance Profile — SDK auto-fetches and rotates credentials from IMDS

Cross-Account Pattern

Dev account (Account A) needs to read S3 in Prod account (Account B). Solution: Create an IAM Role in Account B with a trust policy allowing Account A's IAM principal. Devs call sts:AssumeRole → get scoped temp creds → access S3. No permanent keys are shared between accounts.

sequenceDiagram
  participant A as Account A (Dev)
  participant STS as AWS STS
  participant B as Account B (Prod S3)
  Note over B: Trust Policy allows Account A principal
  A->>STS: AssumeRole(RoleARN in Account B)
  STS-->>A: Temp Credentials (15 min–12 hr)
  A->>B: Access S3 with Temp Credentials
  Note over A,B: No permanent keys shared across accounts
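A minimal sketch of the trust policy that the role in Account B would carry in the cross-account pattern. The helper name `build_trust_policy` and the account ID `111111111111` are placeholders for illustration. Trusting the account's root principal delegates the decision to Account A's own IAM policies; the optional MFA condition is an extra hardening step, not part of the scenario.

```python
import json


def build_trust_policy(trusted_account_id, require_mfa=False):
    """Trust policy for a role in Account B that Account A principals may assume."""
    statement = {
        "Effect": "Allow",
        # The account root principal means "any principal in that account
        # that Account A's own IAM policies allow to call sts:AssumeRole".
        "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if require_mfa:
        statement["Condition"] = {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
    return {"Version": "2012-10-17", "Statement": [statement]}


trust_policy = build_trust_policy("111111111111")  # placeholder account ID
print(json.dumps(trust_policy, indent=2))
```

The permission policy on the same role (what the caller may do, such as `s3:GetObject`) is a separate document; the trust policy only answers who may assume the role.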

Key STS API Calls

API	Use Case
AssumeRole	Cross-account or service-to-service access
AssumeRoleWithWebIdentity	Federated via OIDC (Cognito, Google, GitHub)
AssumeRoleWithSAML	Federated via corporate IdP (ADFS, Okta)
GetSessionToken	Temporary credentials for an existing IAM user, typically to satisfy MFA-protected API access

🎯

"EC2 needs access to S3" → Attach an IAM Role via Instance Profile. Never store access keys on the instance. The SDK auto-retrieves credentials from http://169.254.169.254/latest/meta-data/.

⚠️

Common traps:

"AssumeRole credentials never expire" — FALSE; they are temporary (max 12 hours).
"A role can only be assumed by one service" — FALSE; the trust policy can list multiple principals.
"Cross-account access requires VPC peering" — FALSE; it uses IAM role assumption via STS, which is an AWS API call with no network dependency.
"Instance Profile = IAM Role" — not exactly; an Instance Profile is the container that holds a role and attaches to EC2.

🏢 Multi-Account Strategy: Organizations, SCPs & Control Tower

Organizations · High Frequency

Key Concepts

Management Account: Creates the Org; cannot be restricted by SCPs
Member Accounts: Subject to SCPs from management account or OUs above them
Organizational Units (OUs): Logical groupings — Production OU, Dev OU, Sandbox OU
SCPs: Allow/Deny policies at Org/OU/Account level — they are guardrails, never grants by themselves

SCP vs. IAM Policy

Feature	SCP	IAM Policy
Scope	Account / OU / Org	User / Role / Group
Grants permissions	❌ No — only restricts	✅ Yes
Applies to root user	✅ Yes (member accounts)	❌ No
Can override	Trumps identity policies	Overridden by SCP

SCP Scenario

SCP on "Dev OU" denies ec2:TerminateInstances. An IAM admin in a member account tries to terminate EC2. Result: DENIED. SCP is an absolute ceiling — even AdministratorAccess cannot exceed what the SCP permits.

graph TD
  Root["Root"] --> Mgmt["Management Account\n(not restricted by SCPs)"]
  Root --> ProdOU["Production OU\nSCP: Deny DeleteBucket"]
  Root --> DevOU["Dev OU\nSCP: Deny EC2 Terminate"]
  Root --> Sandbox["Sandbox OU"]
  ProdOU --> ProdA["Account: Prod-US"]
  ProdOU --> ProdB["Account: Prod-EU"]
  DevOU --> DevA["Account: Dev-Team"]
  Sandbox --> SandA["Account: Sandbox"]

AWS Control Tower

Automates multi-account setup with a Landing Zone (secure baseline)
Guardrails: Preventive (SCPs) + Detective (Config rules) applied across all accounts
Account Factory: Vends new accounts with standard config — ideal for "spin up 50 accounts" scenarios
Integrates with IAM Identity Center for SSO across all accounts

🎯

Prevent account from leaving Org → SCP: Deny organizations:LeaveOrganization. 50 accounts with security baseline → Control Tower Account Factory. Centralized billing → AWS Organizations Consolidated Billing.
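The "prevent account from leaving the Org" guardrail could look roughly like the SCP below, built as a Python dict and serialized to the JSON document you would attach to an OU or account. The `Sid` and variable names are illustrative only.

```python
import json

# SCP that blocks member accounts from leaving the organization.
# An SCP never grants anything; this Deny simply removes the action
# from what IAM policies in the affected accounts can allow.
deny_leave_org_scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyLeavingOrganization",
        "Effect": "Deny",
        "Action": "organizations:LeaveOrganization",
        "Resource": "*",
    }],
}

scp_document = json.dumps(deny_leave_org_scp, indent=2)
print(scp_document)
```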

💡

Mnemonic: SCPs are a Ceiling, not a Grant. They don't Grant permissions, they just set the Ceiling for what's possible.

⚠️

Common traps:

"An SCP with Allow * grants full access" — FALSE; SCPs alone do not grant permissions; IAM policies must also allow the action.
"SCPs apply to the management account" — FALSE; SCPs never restrict the management account.
"Attaching an SCP to an OU immediately affects all child accounts" — TRUE and often tested as a gotcha when students expect a manual rollout.
"Control Tower replaces Organizations" — FALSE; Control Tower runs on top of Organizations.

🔗 Federated Identity — IAM Identity Center, SAML, OIDC

Federation · Exam Fave

Federation lets an external Identity Provider authenticate users and map them to IAM roles — no individual IAM users needed for every employee.

Options Compared

Option	Use Case	Protocol
IAM Identity Center (SSO)	Workforce SSO across many AWS accounts + SaaS apps	SAML 2.0 / OIDC
SAML 2.0 Federation	Corporate IdP (ADFS) → AWS Console or CLI	SAML
OIDC / Web Identity	Mobile/web app users via Cognito, Google, GitHub Actions	OIDC / OAuth 2.0
AWS Directory Service	Extend on-prem AD to AWS; Managed Microsoft AD	Kerberos / LDAP

Scenario

Company has 1,000 AD employees. They need AWS console access without separate IAM users. Solution: Configure IAM Identity Center with AD as identity source → map AD groups to Permission Sets → employees log in with AD credentials and get access to assigned accounts.

🎯

"Employees / SSO / multiple accounts" → IAM Identity Center. "Mobile app / social login / customers" → Amazon Cognito. "GitHub Actions accessing AWS" → OIDC with IAM role (no stored access keys).

⚠️

Common traps:

"IAM Identity Center and Cognito are interchangeable" — FALSE; Identity Center = workforce/employees, Cognito = customer/consumer apps.
"SAML federation creates IAM users for each federated user" — FALSE; federated users assume IAM roles, no IAM users are created.
"You need AWS Directory Service to use IAM Identity Center" — FALSE; you can use an external IdP like Okta directly.
"OIDC tokens from Cognito can directly call AWS APIs" — FALSE; they must be exchanged via an Identity Pool for temporary STS credentials first.

Task 1.2

Design Secure Workloads and Applications

VPC security, endpoint security, WAF, Shield, Cognito, GuardDuty, Secrets Manager, hybrid connectivity.

Knowledge of:

Application configuration and credentials security
AWS service endpoints
Control ports, protocols, and network traffic on AWS
Secure application access
Security services with appropriate use cases (for example, Amazon Cognito, Amazon GuardDuty, Amazon Macie)
Threat vectors external to AWS (for example, DDoS, SQL injection)
Data access and governance
Data recovery
Data retention and classification
Encryption and appropriate key management

Skills in:

Designing VPC architectures with security components (for example, security groups, route tables, network ACLs, NAT gateways)
Determining network segmentation strategies (for example, using public subnets and private subnets)
Integrating AWS services to secure applications (for example, AWS Shield, AWS WAF, IAM Identity Center, AWS Secrets Manager)
Securing external network connections to and from the AWS Cloud (for example, VPN, AWS Direct Connect)
Aligning AWS technologies to meet compliance requirements
Encrypting data at rest (for example, AWS KMS)
Encrypting data in transit (for example, AWS Certificate Manager [ACM] using TLS)
Implementing access policies for encryption keys
Implementing data backups and replications
Implementing policies for data access, lifecycle, and protection
Rotating encryption keys and renewing certificates

🌐 VPC Security — Subnets, Security Groups, NACLs, NAT

Networking · High Frequency

Public vs. Private Subnets

Attribute	Public Subnet	Private Subnet
Route to internet	Via Internet Gateway (IGW)	Via NAT Gateway (outbound only)
Resources here	ALBs, bastion hosts, NAT Gateways	App servers, databases, internal services
Direct inbound from internet	✅ Yes (if SG permits)	❌ No

Security Groups vs. Network ACLs

Feature	Security Group	Network ACL
Level	Instance / ENI	Subnet
State	Stateful — return traffic auto-allowed	Stateless — both directions must be allowed
Rule types	Allow only	Allow AND Deny
Rule evaluation	All rules evaluated together; any matching Allow permits	Rules processed in ascending number order; first match wins
Block specific IP	❌ Cannot deny	✅ Use explicit Deny rule

3-Tier Architecture

ALB (public subnet, SG allows 443 from 0.0.0.0/0) → App servers (private subnet, SG allows 8080 from ALB SG only) → RDS (private subnet, SG allows 5432 from App SG only). NAT GW in public subnet lets private instances pull updates without being internet-reachable.

🎯

Block an IP → NACL Deny rule (SGs can't deny). Stateless NACL reminder: must open ephemeral ports 1024–65535 on outbound rules for return traffic from internet-facing resources.
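NACL ordering is easy to get wrong, so here is a toy model of first-match evaluation. It is a sketch only: the rule dictionaries and the function `nacl_decision` are invented for the example and cover just source CIDR and port, not protocol or direction.

```python
from ipaddress import ip_address, ip_network


def nacl_decision(rules, source_ip, port):
    """Evaluate rules in ascending rule-number order; the first match decides."""
    for rule in sorted(rules, key=lambda r: r["number"]):
        low, high = rule["ports"]
        if ip_address(source_ip) in ip_network(rule["cidr"]) and low <= port <= high:
            return rule["action"]
    return "deny"  # the implicit final '*' rule denies anything unmatched


inbound_rules = [
    {"number": 200, "cidr": "203.0.113.0/24", "ports": (443, 443), "action": "deny"},
    {"number": 100, "cidr": "203.0.113.0/24", "ports": (443, 443), "action": "allow"},
]

print(nacl_decision(inbound_rules, "203.0.113.10", 443))  # allow: rule 100 matches first
print(nacl_decision(inbound_rules, "198.51.100.7", 443))  # deny: no rule matches
```

To actually block `203.0.113.0/24`, the Deny would need a lower rule number than the Allow.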

💡

Mnemonic: SG is Stateful at the Group (Instance) level. NACL is Not Stateful, Applies to Complete Location (Subnet level).

⚠️

Common traps:

"Security groups are stateless" — FALSE; SGs are stateful (return traffic auto-allowed). NACLs are stateless.
"A NACL rule number 100 Allow and rule 200 Deny for the same CIDR — the Deny wins" — FALSE; NACLs process rules in ascending order — rule 100 Allow is evaluated first and traffic is allowed immediately.
"NACLs apply to specific EC2 instances" — FALSE; NACLs apply at the subnet level, affecting all resources in that subnet.
"You can attach multiple NACLs to a subnet" — FALSE; one NACL per subnet only.

🔐 Application Credentials — Secrets Manager vs. Parameter Store

Credentials · Exam Fave

Feature	Secrets Manager	SSM Parameter Store
Cost	$0.40 / secret / month	Free (Standard); $0.05 / adv. param / month
Auto rotation	✅ Built-in (RDS, Redshift, DocumentDB)	❌ Requires custom Lambda
Cross-account	✅ Resource policy	Limited
Encryption	Always KMS-encrypted	SecureString = KMS; String = plaintext
Best for	DB passwords, API keys needing rotation	Config values, feature flags, non-sensitive params

Pattern

Lambda needs RDS password → Store in Secrets Manager with rotation enabled → Lambda execution role gets secretsmanager:GetSecretValue → password never appears in code or env variables, and rotates automatically without application downtime.
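A hedged sketch of the fetch step in that pattern. In a real Lambda the client would be `boto3.client("secretsmanager")`, whose `get_secret_value(SecretId=...)` call returns the secret in `SecretString`; here a local `FakeSecretsClient` stands in so the example runs without AWS access. The secret name `prod/orders/db` and the `username`/`password` keys are assumptions about how the secret is stored.

```python
import json


def get_db_credentials(secrets_client, secret_id):
    """Fetch the current secret version at call time instead of baking it into config."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return secret["username"], secret["password"]


class FakeSecretsClient:
    """Local stand-in that mimics the shape of the Secrets Manager response."""

    def get_secret_value(self, SecretId):
        return {"SecretString": json.dumps(
            {"username": "app_user", "password": "example-only"})}


user, password = get_db_credentials(FakeSecretsClient(), "prod/orders/db")
print(user)
```

Because the value is fetched on each call (or cached briefly), a rotated password is picked up without redeploying the function.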

🎯

Rotation = Secrets Manager. If the question mentions rotating credentials, automatic rotation, or "without application downtime" — Secrets Manager is the answer every time.

💡

Mnemonic: SM = Secrets Manager rotates automatically; PS = Parameter Store is static/cheap.

⚠️

Common traps:

"Parameter Store SecureString values are unencrypted" — FALSE; SecureString uses KMS encryption.
"Secrets Manager rotates secrets in place, so apps need to handle the change" — FALSE; rotation is designed to be seamless; Secrets Manager updates the secret value and the application retrieves the new value on next fetch.
"Parameter Store can automatically rotate RDS passwords" — FALSE; Parameter Store has no built-in rotation for RDS.
"SSM Parameter Store is free" — Partially True; Standard tier is free, but Advanced parameters cost money.

🛡️ DDoS Protection — AWS Shield & WAF

DDoS · High Frequency

AWS Shield

	Shield Standard	Shield Advanced
Cost	Free (automatic)	$3,000/month + data transfer
Protection layers	L3/L4 (SYN floods, UDP reflection)	L3/L4/L7 + financial protection
SRT access	❌	✅ AWS Shield Response Team (SRT, formerly the DDoS Response Team, DRT)
Scope	All AWS customers	EC2, ELB, CloudFront, Route 53, Global Accelerator

AWS WAF — Layer 7

Attaches to: CloudFront, ALB, API Gateway, AppSync
Rules: block SQLi, XSS, bad bots, geo-restriction, IP reputation lists
Managed Rule Groups — pre-built, no authoring required (AWS or marketplace)
Rate-based rules — block IPs sending too many requests per interval

Architecture

Route 53 → CloudFront (WAF attached, blocks SQLi/XSS at edge) → ALB → EC2 in private subnet. Shield Standard protects CloudFront from volumetric DDoS. Shield Advanced adds financial protection and DRT support.

🎯

Layer mapping: Shield = L3/L4 (volumetric/network). WAF = L7 (HTTP). SQL injection, XSS, HTTP flood → WAF. SYN flood, UDP amplification, volumetric → Shield. Both together = full-stack DDoS protection.

💡

Mnemonic: WAF covers the Web (Layer 7). Shield covers the Network/Transport layers (Layer 3/4) against volumetric attacks.

⚠️

Common traps:

"AWS WAF can be attached directly to an EC2 instance" — FALSE; WAF attaches to managed entry points such as CloudFront, ALB, API Gateway, and AppSync, never to an instance. Put an ALB or CloudFront in front of EC2.
"Shield Standard protects against L7 application-layer attacks" — FALSE; Standard only covers L3/L4.
"WAF blocks DDoS automatically without rules" — FALSE; WAF requires explicit rate-based or IP-block rules to act on DDoS.
"Shield Advanced covers all AWS services automatically" — FALSE; it must be explicitly enabled on specific resources (ELB, CloudFront, Route 53, EC2 EIP).

👤 Amazon Cognito — User Pools & Identity Pools

Auth · Medium Frequency

	User Pools	Identity Pools
Purpose	Authentication — sign-up/sign-in	Authorization — AWS credentials
Output	JWT tokens (ID, Access, Refresh)	Temp AWS creds via STS
Integrates with	ALB, API GW, social IdPs (Google, Facebook)	IAM roles, S3, DynamoDB

End-to-End Flow

Mobile app → authenticates with User Pool → receives JWT → exchanges JWT at Identity Pool → Identity Pool calls STS → app receives scoped AWS temp creds → uploads directly to user's S3 prefix. User Pool = who you are; Identity Pool = what you can access in AWS.

🎯

"Mobile / web app / social login / customers" → Cognito. "Employees / workforce / SSO" → IAM Identity Center. The distinction is customer-facing vs. workforce-facing.

💡

Mnemonic: User Pools = User Authentication (Who). Identity Pools = Identity Authorization (What they can do).

⚠️

Common traps:

"Cognito User Pool tokens can directly access AWS services like S3" — FALSE; User Pool JWTs authenticate the user but don't grant AWS permissions. You need an Identity Pool to exchange the JWT for STS credentials.
"Identity Pools require a User Pool" — FALSE; Identity Pools can also accept tokens from social IdPs, SAML, or even unauthenticated (guest) identities.
"Cognito is the right choice for employee workforce SSO" — FALSE; use IAM Identity Center for workforce.

🔍 GuardDuty & Macie — Threat Detection & Data Discovery

Detection · Medium Frequency

Amazon GuardDuty

Intelligent threat detection — no agents, no infrastructure to manage
Data sources: VPC Flow Logs, CloudTrail API events, DNS logs, EKS audit logs, S3 data events
Detects: crypto mining, credential theft, port scans, unusual API calls, malware
Findings routed to EventBridge → Lambda auto-remediation or SNS alerts
Multi-account: delegate GuardDuty admin to a security account via Organizations
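The EventBridge → Lambda routing could be sketched as follows. The event pattern uses the `aws.guardduty` source and `GuardDuty Finding` detail type; the severity threshold of 7, the handler's return shape, and the sample event are illustrative assumptions, and a real remediation step (for example, swapping the instance's security group) is deliberately left out.

```python
# EventBridge rule pattern: match only high-severity GuardDuty findings.
HIGH_SEVERITY_FINDINGS = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 7]}]},
}


def handler(event, context=None):
    """Lambda entry point: pull out the fields a remediation step would need."""
    detail = event["detail"]
    instance = detail.get("resource", {}).get("instanceDetails", {})
    return {
        "finding_type": detail["type"],
        "severity": detail["severity"],
        "instance_id": instance.get("instanceId"),  # None for non-EC2 findings
    }


# Hand-written sample event, trimmed to the fields used above.
sample_event = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Finding",
    "detail": {
        "type": "CryptoCurrency:EC2/BitcoinTool.B!DNS",
        "severity": 8,
        "resource": {"instanceDetails": {"instanceId": "i-0123456789abcdef0"}},
    },
}

summary = handler(sample_event)
print(summary)
```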

Amazon Macie

Discovers and protects sensitive data (PII, financial data, credentials) in S3
Uses ML + pattern matching — flags publicly accessible buckets containing sensitive data
Supports custom data identifiers (regex patterns) for proprietary data types

🎯

GuardDuty = threat/attack detection (compromised instances, unusual API activity). Macie = sensitive data discovery in S3 (PII exposure). If question mentions PII or S3 data exposure → Macie. Compromised EC2, coin mining → GuardDuty.

💡

Mnemonic: GuardDuty is a Guard (looks for bad behavior everywhere: VPC, DNS, CloudTrail). Macie is a Maid (cleans up/finds sensitive stuff in S3 buckets).

⚠️

Common traps:

"GuardDuty requires installing agents on EC2" — FALSE; it analyzes VPC Flow Logs, CloudTrail, and DNS logs without any agents.
"GuardDuty can block threats automatically" — FALSE by itself; it generates findings only. You must wire EventBridge → Lambda to block (e.g., update Security Group).
"Macie scans all AWS services for PII" — FALSE; Macie only analyzes S3 objects.
"Disabling GuardDuty deletes all findings" — TRUE and a common gotcha; disabling the service removes existing findings and configuration. Suspend GuardDuty instead, or export findings to S3 first, if they must be kept.

🔌 Hybrid Connectivity — Site-to-Site VPN & Direct Connect

Hybrid · High Frequency

Feature	Site-to-Site VPN	AWS Direct Connect
Medium	IPsec over public internet	Dedicated private fiber
Setup time	Minutes–hours	Weeks–months
Bandwidth	Up to ~1.25 Gbps	1, 10, or 100 Gbps
Latency	Variable (internet-dependent)	Consistent, low latency
Encrypted	✅ IPsec	❌ Not by default — add VPN on top
Cost	Low	Higher (port-hour + data transfer)

Redundant Hybrid Pattern

Primary: Direct Connect (consistent, low latency). Backup: Site-to-Site VPN over internet. Add VPN on top of DX for encryption when compliance requires it. This gives performance + resilience.

graph LR
  DC["On-Premises\nDatacenter"]
  DC -->|"Primary: Direct Connect\nPrivate fiber, 1–100 Gbps\nNot encrypted by default"| VGW["Virtual Private\nGateway"]
  DC -->|"Backup: Site-to-Site VPN\nIPsec over internet\nEncrypted, variable latency"| VGW
  VGW --> VPC["AWS VPC\n(Private Subnets)"]

🎯

Consistent bandwidth + compliance + data must not traverse internet → Direct Connect. Quick setup + encrypted + lower cost → VPN. DX not encrypted by default — layer VPN over DX when encryption is required.

💡

Mnemonic: DX = Dedicated eXpress (Fast/Private but Unencrypted). VPN = Virtual Private Network (Encrypted but Public/Variable latency).

⚠️

Common traps:

"Direct Connect provides encrypted connectivity" — FALSE by default; DX is a private connection but not encrypted. Add IPsec VPN on top for encryption.
"Site-to-Site VPN is faster and more reliable than Direct Connect" — FALSE; VPN travels the public internet with variable latency.
"Direct Connect instantly fails over to VPN" — FALSE; failover depends on BGP routing (the VPN routes take over only after the DX BGP session drops), so it has to be designed and tested rather than assumed.
"Direct Connect provisioning takes minutes" — FALSE; it takes weeks to months to get a physical fiber connection provisioned.

Task 1.3

Determine Appropriate Data Security Controls

KMS, ACM, S3 encryption, data lifecycle, backup, compliance controls.

🗝️ AWS KMS — Key Management Service

Encryption · High Frequency

Key Types

Type	Managed By	Rotation	Cost	Use Case
AWS Managed Keys	AWS	Auto (annual)	Free	Default for most services
Customer Managed Keys (CMK)	Customer	Optional / on-demand	$1/month/key	Fine-grained control, audit, cross-account
SSE-C (S3 only)	Customer (sent in API)	Customer manages	No KMS cost	Keys managed entirely outside AWS

Envelope Encryption

KMS generates a Data Encryption Key (DEK). Your data is encrypted with the DEK (AES-256, fast). The DEK is then encrypted by the CMK and stored alongside the ciphertext. To decrypt: KMS decrypts the DEK → DEK decrypts data. The CMK never leaves KMS HSMs.
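The envelope flow can be traced end to end with a toy model. Everything here is a stand-in: `toy_cipher` is an XOR keystream used in place of AES-256 and is NOT secure, and `ToyKms` only mimics the shape of the KMS `GenerateDataKey` and `Decrypt` responses (`Plaintext`, `CiphertextBlob`). With boto3 the equivalent calls would be `generate_data_key(KeyId=..., KeySpec="AES_256")` and `decrypt(CiphertextBlob=...)`.

```python
import hashlib
import secrets


def toy_cipher(key, data):
    """XOR with a SHA-256 counter keystream. Stand-in for AES-256, NOT secure."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKms:
    """Holds the root key (the CMK); callers only ever see data keys."""

    def __init__(self):
        self._root_key = secrets.token_bytes(32)

    def generate_data_key(self):
        plaintext_dek = secrets.token_bytes(32)
        return {"Plaintext": plaintext_dek,
                "CiphertextBlob": toy_cipher(self._root_key, plaintext_dek)}

    def decrypt(self, ciphertext_blob):
        return {"Plaintext": toy_cipher(self._root_key, ciphertext_blob)}


def envelope_encrypt(kms, payload):
    key = kms.generate_data_key()
    ciphertext = toy_cipher(key["Plaintext"], payload)
    # Store only the encrypted DEK next to the data; drop the plaintext DEK.
    return {"ciphertext": ciphertext, "encrypted_dek": key["CiphertextBlob"]}


def envelope_decrypt(kms, envelope):
    dek = kms.decrypt(envelope["encrypted_dek"])["Plaintext"]
    return toy_cipher(dek, envelope["ciphertext"])


kms = ToyKms()
payload = b"order 1234: card ending 4242"
envelope = envelope_encrypt(kms, payload)
restored = envelope_decrypt(kms, envelope)
print(restored == payload)  # True
```

Only the small DEK ever travels to the key service; the bulk payload is encrypted locally, which is why envelope encryption scales to large objects.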

KMS Key Policies

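Every CMK must have a key policy. Unlike most resources, IAM policies alone cannot grant access; the key policy must first enable IAM policies by allowing the account (root principal)
Same account: access needs either a key policy statement that names the principal, or a key policy that delegates to the account plus an IAM policy that allows the action
Cross-account: the key policy must name the external account or principal AND an IAM policy in that account must allow the action (for example, kms:Decrypt)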

sequenceDiagram
  participant App as Application
  participant KMS as AWS KMS
  Note over App,KMS: Envelope Encryption Process
  App->>KMS: GenerateDataKey(CMK_ID)
  KMS-->>App: Plaintext DEK + Encrypted DEK
  Note over App: App encrypts payload<br/>using Plaintext DEK
  Note over App: App drops Plaintext DEK from memory
  Note over App: App stores Encrypted Payload<br/>alongside Encrypted DEK

🎯

Audit key usage → CMK (CloudTrail logs every API call). On-demand rotation → CMK only (AWS Managed keys rotate on AWS schedule). BYOK → Import key material into CMK. CloudHSM → single-tenant HSM; you control the hardware security module.

💡

Mnemonic: DEK = Data Encryption Key (Encrypts the Data directly). CMK = Customer Master Key (Encrypts the DEK). This is the Envelope Encryption concept.

⚠️

Common traps:

"Rotating a CMK re-encrypts all existing ciphertext" — FALSE; only new data is encrypted with the new key version. Old ciphertext is decryptable because KMS retains all previous key versions.
"You can use the same CMK across all regions" — FALSE; KMS keys are region-specific. Use multi-region keys (a newer feature) when cross-region decryption is needed.
"Deleting a CMK is immediate" — FALSE; KMS enforces a 7–30 day waiting period before deletion.
"CloudHSM is managed by AWS like KMS" — FALSE; AWS provisions and patches the single-tenant HSM hardware, but you manage the cluster's users and keys, and AWS cannot access or recover your key material.

🪣 S3 Data Security — Encryption, Policies, Object Lock

S3 · High Frequency

Server-Side Encryption Options

Type	Key Managed By	Notes
SSE-S3	AWS (S3 service key)	Default; AES-256; no cost or config
SSE-KMS	AWS KMS CMK	CloudTrail audit + key rotation + cross-account
SSE-C	Customer (in API header)	HTTPS required; AWS does not store key
CSE	Customer (client-side)	Encrypted before upload; AWS never sees plaintext

Access Controls

Block Public Access: Account-level or bucket-level override — prevents any bucket/object ACL or policy from granting public access
Bucket Policies: Resource-based; enforce conditions like aws:SecureTransport (HTTPS-only)
MFA Delete: Requires MFA to delete object versions — enabled only by root user; prevents malicious deletion
Object Lock (WORM): Prevents deletion for a set retention period — Governance mode (admins can override) vs. Compliance mode (nobody can delete, even AWS)
VPC Gateway Endpoint: Private S3 access from VPC without NAT Gateway or internet

Force HTTPS

Bucket policy: Effect: Deny, Action: s3:*, Principal: *, Condition: aws:SecureTransport = false. Denies all non-HTTPS requests to the bucket at the resource level — no IAM Allow can override this Deny.
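Written out in full, the HTTPS-only policy might look like the sketch below. The helper name `https_only_bucket_policy` and the bucket name `my-bucket` are placeholders; the statement covers both the bucket ARN and its objects so that list and object calls are denied alike.

```python
import json


def https_only_bucket_policy(bucket_name):
    """Bucket policy that denies every S3 action sent over plain HTTP."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [bucket_arn, f"{bucket_arn}/*"],
            # aws:SecureTransport is "false" when the request did not use TLS.
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


policy = https_only_bucket_policy("my-bucket")
print(json.dumps(policy, indent=2))
```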

🎯

WORM / immutable data / SEC 17a-4 → S3 Object Lock in Compliance mode. Prevent version deletion → MFA Delete (root only). Private S3 access from Lambda in VPC → VPC Gateway Endpoint (free; no NAT needed).

💡

Mnemonic: SSE-S3 = Simple/Free (AWS managed). SSE-KMS = Key Audit/Control (CloudTrail). SSE-C = Customer provided key (Sent in HTTPS header).

⚠️

Common traps:

"S3 Block Public Access prevents all access to a bucket" — FALSE; it blocks public ACL and policy grants, but authenticated IAM users can still access objects.
"Object Lock in Governance mode prevents all deletion" — FALSE; Governance mode allows users with the s3:BypassGovernanceRetention permission to override. Compliance mode allows NO overrides.
"Versioning and Object Lock are the same thing" — FALSE; versioning keeps historical versions but doesn't prevent deletion of versions. Object Lock adds a WORM protection layer.
"MFA Delete can be enabled by any IAM admin" — FALSE; only the root user can enable MFA Delete.

📜 ACM — Encryption in Transit with TLS

TLS · Certificates

Free public TLS certificates for AWS services (ALB, CloudFront, API Gateway)
Auto-renewal — eliminates certificate expiry incidents
Private key stays in ACM — cannot be exported (use ACM Private CA for on-prem)
Critical: CloudFront certificates must be provisioned in us-east-1 regardless of origin region

Validation Methods

Method	How	Best For
DNS Validation	Add CNAME to Route 53 (ACM can automate)	Automated renewal; preferred
Email Validation	Click link emailed to WHOIS contacts	When DNS is not manageable

🎯

TLS terminates at ALB (ACM cert on the HTTPS listener). Backend EC2s communicate on HTTP within the VPC (acceptable) or HTTPS with self-signed cert. CloudFront + custom domain → provision ACM cert in us-east-1 first — this is a common gotcha.

💡

Mnemonic: ACM = Auto Certificate Management (Free, auto-renews with DNS, stays in AWS).

⚠️

Common traps:

"ACM certificates can be downloaded and installed on EC2" — FALSE; public ACM certs cannot be exported. Use ACM Private CA if you need exportable certs for EC2/on-prem.
"A certificate provisioned in us-west-2 works with CloudFront" — FALSE; CloudFront requires ACM certificates to be in us-east-1 specifically, regardless of where your origin is.
"ACM automatically renews all certificates" — FALSE; ACM only auto-renews if DNS validation is in place. Email-validated certs require manual re-validation.
"ACM certificates work with EC2 directly" — FALSE; ACM integrates with ELB, CloudFront, API Gateway — not directly on EC2.

💾 Data Backups, Replication & AWS Backup

Backups · Medium Frequency

AWS Backup

Centralized policy-driven backup for: EC2, EBS, RDS, Aurora, DynamoDB, EFS, S3, FSx, Storage Gateway
Backup Plans: schedule, retention, lifecycle to cold storage tier
Cross-region and cross-account copies for DR
Backup Vault Lock: WORM on backup vaults — prevents deletion even by admins; Compliance mode = immutable

Service-Specific Patterns

Service	Backup Mechanism	Recovery
EBS	Incremental snapshots (stored in S3)	Restore any snapshot to a new volume
RDS	Automated backups (1–35 days) + manual snapshots	Point-in-time within retention window
DynamoDB	On-demand backups + PITR (35 days)	Restore to new table
S3	Versioning + Cross-Region Replication (CRR)	Any prior version in same or other region

🎯

7-year immutable backup (compliance) → AWS Backup with Vault Lock in Compliance mode. Cross-region DR for S3 → CRR. Point-in-time recovery for DynamoDB → enable PITR (35-day window, continuous).

💡

Mnemonic: PITR = Point In Time Recovery (Creates a NEW table/DB, never overwrites the existing one).

⚠️

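Common traps:

"RDS Multi-AZ standby can serve read traffic" — FALSE; in a Multi-AZ DB instance deployment the standby is passive and only activates on failover. Use read replicas to serve reads. (Multi-AZ DB cluster deployments, which have readable standbys, are the exception.)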
"S3 Cross-Region Replication replicates existing objects automatically" — FALSE; CRR only replicates objects uploaded after CRR is enabled. Use S3 Batch Replication for existing objects.
"EBS snapshots are region-specific" — TRUE and often a trap; you must manually copy snapshots to other regions for DR.
"DynamoDB PITR lets you restore to any second in the last 35 days" — TRUE but the restored table is a new table — it does not overwrite the existing table.

📊 Data Classification, Lifecycle & Compliance Controls

GovernanceMedium Frequency

▾

S3 Lifecycle Transitions

Standard (0–30d) → Standard-IA (30–90d) → Glacier Instant Retrieval (90–180d) → Glacier Deep Archive (180d+). Expire/delete objects automatically after a set age.
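
The transition chain above can be sketched as a lifecycle configuration in the shape S3 expects for a bucket lifecycle rule. The rule ID, the `logs/` prefix, and the 2,555-day expiration are assumptions for illustration; `storage_class_at` is a hypothetical local helper, not an AWS API.

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-down-then-expire",       # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},       # hypothetical prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 2555},        # assumed retention period
        }
    ]
}


def storage_class_at(age_days, rule):
    """Return the storage class an object of the given age would be in."""
    if age_days >= rule["Expiration"]["Days"]:
        return "EXPIRED"
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule = lifecycle_configuration["Rules"][0]
for age in (10, 45, 100, 200):
    print(age, storage_class_at(age, rule))
```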

Compliance Toolchain

Service	Purpose	Key Output
AWS Config	Continuous compliance monitoring; tracks config changes	Config rules, conformance packs
CloudTrail	API audit trail — who did what, when, from where	Log files to S3; EventBridge integration
Audit Manager	Automated evidence collection for audits	SOC2, PCI, HIPAA frameworks
Security Hub	Aggregates findings from GuardDuty, Inspector, Macie	Unified security posture score

🎯

CloudTrail = "who made the API call?" (event history). Config = "is this resource compliant right now?" (current state). Config auto-remediates with SSM Automation. Both feed Security Hub for unified dashboard.

💡

Mnemonic: Config evaluates the Current State. CloudTrail tracks the Trail of API calls (Who/What/When).

⚠️

Common traps:

"CloudTrail is enabled by default in all regions" — PARTLY; Event history records the last 90 days of management events by default, but nothing is delivered to S3 or retained longer until you create a trail. Create a multi-Region or organization trail for durable, account-wide logging.
"AWS Config prevents non-compliant resource creation" — FALSE; Config is detective, not preventive. Use SCPs or IAM policies to prevent; Config detects and reports after the fact.
"S3 lifecycle rules can transition objects from Standard-IA directly to Standard" — FALSE; lifecycle rules only move objects to colder tiers, not back to warmer ones. Objects must also stay in Standard for at least 30 days before a lifecycle rule can move them to Standard-IA, and Standard-IA carries a 30-day minimum storage charge.

Domain 2 Overview

Design architectures that survive failures, scale on demand, and decouple components. Covers microservices, messaging, serverless, containers, HA patterns, disaster recovery, and fault tolerance.

⚡ 26% of scored content

📊 Visual Study Guides — Domain 2

Cheat SheetVisual

▾

Domain 2 — Resilient Architectures

Domain 2: Design Resilient Cloud Architectures Overview

The Two Pillars of Resilient AWS Architecture

Domain 2: Resilient Cloud Architecture Study Guide

Task 2.1

Design Scalable and Loosely Coupled Architectures

Microservices, messaging, serverless, containers, caching, API Gateway, event-driven design.

⚖️ Decoupling with SQS, SNS & EventBridge

MessagingHigh Frequency

▾

Loosely coupled architectures use asynchronous messaging so components can scale and fail independently.

Service Comparison

Service	Model	Use Case	Retention
Amazon SQS	Queue (pull)	Work queues, job decoupling, rate limiting	Up to 14 days
Amazon SNS	Pub/Sub (push)	Fan-out to multiple subscribers simultaneously	No persistence
Amazon EventBridge	Event bus (push)	Event-driven routing, SaaS integration, scheduled rules	Archive optional
Amazon MQ	Queue (AMQP/MQTT)	Migrating existing message brokers (ActiveMQ, RabbitMQ)	Configurable

SQS — Key Concepts

Standard Queue: At-least-once delivery, best-effort ordering, nearly unlimited throughput
FIFO Queue: Exactly-once processing, strict ordering, up to 3,000 msg/s with batching
Visibility Timeout: Hides a message from other consumers while one consumer processes it (default 30s); reduces, but does not eliminate, duplicate processing
Dead Letter Queue (DLQ): Captures messages that fail processing after N attempts
Long Polling: Consumer waits up to 20s for messages — reduces empty API calls and cost
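
To make the interaction between visibility timeout and the dead letter queue concrete, here is a toy in-memory model. `ToyQueue` is a hypothetical stand-in written for this guide, not the SQS API, and time is passed in explicitly so the behavior is easy to follow.

```python
class ToyQueue:
    """In-memory stand-in for a queue with a visibility timeout and a DLQ."""

    def __init__(self, visibility_timeout=30, max_receive_count=3):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.messages = []      # dicts with body, visible_at, receives
        self.dead_letter = []   # bodies that exceeded max_receive_count

    def send(self, body):
        self.messages.append({"body": body, "visible_at": 0, "receives": 0})

    def receive(self, now):
        """Return the first visible message and hide it, or None."""
        for msg in list(self.messages):
            if msg["visible_at"] > now:
                continue  # still being processed by another consumer
            if msg["receives"] >= self.max_receive_count:
                # Too many failed attempts: move to the dead letter queue.
                self.messages.remove(msg)
                self.dead_letter.append(msg["body"])
                continue
            msg["receives"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg
        return None

    def delete(self, msg):
        """A consumer deletes the message after successful processing."""
        self.messages.remove(msg)


q = ToyQueue(visibility_timeout=30, max_receive_count=3)
q.send("order-1")
first = q.receive(now=0)          # consumer A takes the message
hidden = q.receive(now=10)        # consumer B sees nothing: message is hidden
again = q.receive(now=30)         # A never deleted it, so it reappears
print(first["body"], hidden, again["receives"])
```

If a consumer keeps failing without calling `delete`, the message reappears after each timeout until the receive count is exhausted, then lands in `dead_letter`.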

Fan-Out Pattern (SNS + SQS)

Pattern

Order service publishes to SNS topic → SNS fans out to: SQS queue for fulfillment service + SQS queue for billing service + SQS queue for notification service. Each service scales independently and processes at its own rate. No service is blocked by another.
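
The fan-out pattern above can be sketched with a toy topic that copies every published message into each subscriber's own queue. `ToyTopic` and the subscriber names are hypothetical; real SNS and SQS add delivery policies, filtering, and permissions that this ignores.

```python
from collections import deque


class ToyTopic:
    """Copies each published message into every subscribed queue."""

    def __init__(self):
        self.subscribers = {}

    def subscribe(self, name):
        queue = deque()
        self.subscribers[name] = queue
        return queue

    def publish(self, message):
        for queue in self.subscribers.values():
            queue.append(message)


topic = ToyTopic()
fulfillment = topic.subscribe("fulfillment")
billing = topic.subscribe("billing")
notification = topic.subscribe("notification")

topic.publish({"order_id": 1})
topic.publish({"order_id": 2})

# Billing consumes at its own pace; the other queues are unaffected.
billing.popleft()
print(len(fulfillment), len(billing), len(notification))
```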

🎯

Ordered + exactly-once → SQS FIFO. Fan-out to multiple consumers → SNS → SQS. Route events based on content/pattern → EventBridge. Migrating ActiveMQ → Amazon MQ (not SQS — preserves broker protocols).

⚠️

Common traps:

"SQS FIFO guarantees ordering across all message groups" — FALSE; ordering is guaranteed only within a message group ID.
"SNS delivers to SQS in order" — FALSE for standard topics; they give no ordering guarantee. Strict ordering requires an SNS FIFO topic delivering to SQS FIFO queues.
"SQS Standard ensures exactly-once delivery" — FALSE; Standard is at-least-once. Only FIFO is exactly-once.
"Increasing SQS visibility timeout prevents all duplicate processing" — FALSE; if a consumer crashes before deleting the message, it reappears after the timeout and will be processed again.
"A DLQ automatically retries messages" — FALSE; DLQ just stores failed messages. You must manually reprocess or build re-drive logic.

⚡ Serverless — Lambda, Fargate & Step Functions

ServerlessHigh Frequency

▾

AWS Lambda

Event-driven, stateless functions — runs up to 15 minutes per invocation
Memory: 128 MB–10 GB (CPU allocated proportionally)
Triggers: API GW, ALB, SQS, SNS, S3, DynamoDB Streams, EventBridge, Kinesis
Concurrency: Account default 1,000; request increases for high-traffic workloads
Reserved Concurrency: Guarantees capacity; also throttles at that limit
Provisioned Concurrency: Eliminates cold starts — pre-warms execution environments
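
A quick way to reason about the concurrency limit is that steady-state concurrency is roughly requests per second multiplied by average duration. The sketch below applies that estimate; the traffic figures are made-up examples and `required_concurrency` is a hypothetical helper.

```python
import math

ACCOUNT_DEFAULT_LIMIT = 1000  # default account concurrency from the list above


def required_concurrency(requests_per_second, avg_duration_ms):
    """Estimate concurrent executions: arrival rate x average duration."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)


for rps, duration_ms in ((200, 500), (5000, 300)):
    needed = required_concurrency(rps, duration_ms)
    verdict = "fits" if needed <= ACCOUNT_DEFAULT_LIMIT else "needs a quota increase"
    print(rps, duration_ms, needed, verdict)
```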

AWS Fargate vs. EC2 Launch Type (ECS/EKS)

	Fargate	EC2 Launch Type
Server management	Fully serverless	You manage EC2 instances
Cost model	Per vCPU + memory used	Per EC2 instance (even when idle)
Scaling	Per-task scaling	Cluster + service scaling
Best for	Variable workloads, no ops overhead	GPU workloads, custom AMIs, cost at scale

AWS Step Functions

Orchestrates multi-step workflows as state machines (JSON ASL definition)
Standard Workflows: Long-running (up to 1 year), exactly-once, audit history
Express Workflows: High-volume, short-duration (up to 5 min), at-least-once
Handles retries, error catching, parallel branches, and human approval steps
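
For orientation, a minimal state machine with a retry and a catch might look like the following, written as a Python dict that serializes to Amazon States Language JSON. The state names, function ARNs, and account ID are hypothetical; the helpers only inspect the definition locally.

```python
import json

state_machine = {
    "Comment": "Hypothetical order workflow",
    "StartAt": "ChargeCard",
    "States": {
        "ChargeCard": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:charge-card",
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "ShipOrder",
        },
        "ShipOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:ship-order",
            "End": True,
        },
        "NotifyFailure": {"Type": "Fail", "Error": "ChargeFailed"},
    },
}


def retry_delays(retrier):
    """Seconds waited before each retry attempt with exponential backoff."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


def undefined_targets(definition):
    """Return transition targets that do not name an existing state."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return targets - set(states)


print(json.dumps(state_machine, indent=2))
print(retry_delays(state_machine["States"]["ChargeCard"]["Retry"][0]))
```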

🎯

Cold start latency → Provisioned Concurrency. Orchestrate multi-Lambda workflow with retries → Step Functions. Containers without managing EC2 → Fargate. Lambda timeout limit: 15 minutes — long-running tasks need EC2, Batch, or ECS.

⚠️

Common traps:

"Lambda scales infinitely without limits" — FALSE; there is an account-level concurrency limit (default 1,000 per region).
"Provisioned Concurrency eliminates all cold starts" — TRUE for provisioned instances, but if traffic exceeds the provisioned count, new cold instances spin up.
"Lambda can run indefinitely" — FALSE; max 15 minutes per invocation.
"Fargate is always cheaper than EC2" — FALSE; for consistently high utilization, EC2 with Reserved Instances is cheaper. Fargate shines for variable/spiky workloads.
"Step Functions Express Workflows support exactly-once execution" — FALSE; Express is at-least-once. Only Standard Workflows are exactly-once.

🌀 Containers — ECS, EKS & When to Use Them

ContainersMedium Frequency

▾

	Amazon ECS	Amazon EKS
Orchestration	AWS-proprietary	Kubernetes (open standard)
Learning curve	Lower — AWS-native	Higher — requires K8s knowledge
Best for	AWS-native workloads, simpler ops	Kubernetes migrations, multi-cloud portability
Launch types	Fargate + EC2	Fargate + EC2 + Managed Node Groups

Container Migration Drivers

Portability: same container image runs locally, on ECS, EKS, or on-prem
Density: pack more workloads per EC2 instance than VMs
Faster deploys: images are immutable — promotes CI/CD best practices
ECR (Elastic Container Registry): private Docker registry, integrated with ECS/EKS

🎯

AWS-native container workload → ECS. Existing Kubernetes workload or multi-cloud → EKS. No server management → add Fargate. Store container images → ECR (not Docker Hub — keep it in AWS for lower latency and security).

⚠️

Common traps:

"ECS and EKS both require managing EC2 instances" — FALSE; both support Fargate (serverless compute).
"ECS is a Kubernetes service" — FALSE; ECS is AWS-proprietary orchestration. EKS runs actual Kubernetes.
"Containers are always stateless" — FALSE; containers can be stateful using EBS or EFS volumes.
"ECR is only for ECS" — FALSE; ECR stores container images used by ECS, EKS, Lambda, or any Docker-compatible runtime.
"EKS is free" — FALSE; you pay per EKS cluster per hour (~$0.10/hr) plus EC2/Fargate costs.

🔌 API Gateway & Microservice Patterns

APIMedium Frequency

▾

API Gateway Types

Type	Use Case	Protocol
REST API	Standard HTTP APIs; request/response transformation, caching	HTTP/S
HTTP API	Lower cost, lower latency than REST; JWT auth built-in	HTTP/S
WebSocket API	Real-time bidirectional — chat, live dashboards	WebSocket

Key Features

Throttling: Protects backends — default 10,000 RPS per account (configurable)
Caching: Cache responses 0.5 GB–237 GB; reduces backend calls
Usage Plans + API Keys: Tiered rate limiting per client
Authorizers: Lambda authorizer (custom logic) or Cognito User Pool (JWT)
Private APIs: Accessible only within VPC via interface endpoint
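
Throttling of this kind is commonly modeled as a token bucket: a steady refill rate plus a burst allowance. The sketch below is a simplified illustration with made-up numbers, not API Gateway's actual implementation. `TokenBucket` is a hypothetical class, and time is passed in explicitly.

```python
class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate` tokens/second."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill according to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a real gateway would answer 429 Too Many Requests


bucket = TokenBucket(rate=2, burst=3)
print([bucket.allow(0.0) for _ in range(4)])   # burst of 3, then throttled
print([bucket.allow(1.0) for _ in range(3)])   # 2 tokens refilled after 1 second
```

Usage plans apply the same idea per API key, so one noisy client runs out of tokens without affecting others.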

Microservice Design Principles

Stateless workloads: No server-side session state → easy horizontal scaling
Stateful workloads: Session state in ElastiCache or DynamoDB, not in-process
Read replicas: Offload read traffic from primary DB — scale reads independently

🎯

REST API + caching + transformation → REST API GW. Lower cost simple HTTP proxy → HTTP API GW. Real-time push → WebSocket API GW. Throttle specific clients → Usage Plans. Scale reads → RDS read replicas or ElastiCache.

⚠️

Common traps:

"API Gateway HTTP API supports request/response transformation" — FALSE; only REST APIs support mapping templates for transformation.
"API Gateway caches responses globally across all regions" — FALSE; caching is per stage, per region, per API.
"Increasing API GW timeout beyond 29 seconds is possible" — FALSE by default; the default maximum integration timeout is 29 seconds. Regional and private REST APIs can request a higher limit in exchange for a lower account throttle quota, but for longer operations the expected answer is an async pattern (SQS + Lambda).

⚡ Caching Strategies — ElastiCache, CloudFront, DAX

CachingHigh Frequency

▾

Service	Layer	Use Case	Engine
ElastiCache for Redis	In-memory DB cache	Sessions, leaderboards, pub/sub, complex data types	Redis
ElastiCache for Memcached	In-memory cache	Simple object caching, horizontal scaling	Memcached
Amazon DAX	DynamoDB accelerator	Microsecond reads for DynamoDB (no app code change)	Proprietary
Amazon CloudFront	Edge CDN cache	Static/dynamic content, API response caching at edge	Edge network

Caching Patterns

Lazy Loading (Cache-Aside): Check cache → miss → load from DB → write to cache. Stale data risk, but only caches what's requested.
Write-Through: Write to cache and DB simultaneously. Always fresh data but higher write latency.
TTL: Set expiry on cache entries to prevent serving stale data indefinitely.
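
The three patterns can be sketched together in a few lines. `TTLCache` and `CachedStore` are hypothetical in-memory classes (a dict stands in for the database), and the clock is passed in explicitly so the stale-read window is visible.

```python
class TTLCache:
    """Tiny cache whose entries expire ttl_seconds after being written."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, expires_at)

    def get(self, key, now):
        entry = self.entries.get(key)
        if entry is None or entry[1] <= now:
            self.entries.pop(key, None)  # drop expired entry
            return None
        return entry[0]

    def put(self, key, value, now):
        self.entries[key] = (value, now + self.ttl)


class CachedStore:
    def __init__(self, database, ttl_seconds):
        self.database = database
        self.cache = TTLCache(ttl_seconds)
        self.db_reads = 0

    def read_lazy(self, key, now):
        """Lazy loading: check cache, on a miss load from DB and populate."""
        value = self.cache.get(key, now)
        if value is None:
            self.db_reads += 1
            value = self.database[key]
            self.cache.put(key, value, now)
        return value

    def write_through(self, key, value, now):
        """Write-through: update the DB and the cache in the same operation."""
        self.database[key] = value
        self.cache.put(key, value, now)


store = CachedStore({"user:1": "Ada"}, ttl_seconds=60)
store.read_lazy("user:1", now=0)    # miss: goes to the database
store.read_lazy("user:1", now=10)   # hit: served from cache
print(store.db_reads)
```

A write that bypasses the cache stays invisible to readers until the TTL expires, which is the stale-data risk noted for lazy loading.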

🎯

DynamoDB read latency too high → DAX (microseconds, no code change). Session management → ElastiCache Redis. Global content delivery / static assets → CloudFront. Need pub/sub in cache layer → Redis (Memcached has no pub/sub).

⚠️

Common traps:

"DAX can be used with any database" — FALSE; DAX is exclusively for DynamoDB.
"ElastiCache Memcached supports Multi-AZ automatic failover" — FALSE; Memcached has no replication or failover. Only Redis supports Multi-AZ with automatic failover.
"Caching always improves consistency" — FALSE; caching introduces potential stale data; TTL and invalidation strategies must be carefully designed.
"CloudFront caches all content types by default" — FALSE; caching behavior is controlled by Cache-Control and TTL settings. Dynamic content (API responses, authenticated pages) is typically not cached and passes through to origin on every request.

⚙️ Load Balancing — ALB, NLB & Gateway LB

NetworkingHigh Frequency

▾

	ALB (Layer 7)	NLB (Layer 4)	Gateway LB (Layer 3)
Protocol	HTTP, HTTPS, gRPC, WebSocket	TCP, UDP, TLS	IP (GENEVE)
Routing	Path, host, header, query string	IP + port	Pass-through to appliances
Static IP	❌ (use Global Accelerator)	✅ Per AZ	✅
Use case	Microservices, HTTP routing, containers	Ultra-low latency, gaming, financial	Inline security appliances (IDS/IPS, firewalls)

🎯

Route by URL path (/api vs /web) → ALB. Need static IP for whitelist → NLB. Third-party firewall/IDS inspection → Gateway LB. WebSocket support → ALB (NLB also supports TCP WebSocket).

⚠️

Common traps:

"ALB provides a static IP address" — FALSE; ALB uses DNS names that resolve to dynamic IPs. Use Global Accelerator in front of ALB for static IPs.
"NLB supports path-based routing" — FALSE; NLB operates at Layer 4 and routes by IP/port only.
"You can attach a WAF to an NLB" — FALSE; WAF only works with ALB, CloudFront, API Gateway, and AppSync. NLB operates at Layer 4 with no HTTP context, so WAF (a Layer 7 firewall) cannot be applied to it.
"Cross-Zone Load Balancing is enabled by default on all LBs" — FALSE; it's enabled by default on ALB but disabled by default on NLB and Gateway LB.

Task 2.2

Design Highly Available and/or Fault-Tolerant Architectures

Multi-AZ, multi-Region, DR strategies, RTO/RPO, Route 53 routing, immutable infrastructure.

🌍 Disaster Recovery Strategies — RTO, RPO & the 4 Tiers

DRVery High Frequency

▾

RTO vs. RPO

RTO (Recovery Time Objective): Maximum tolerable downtime — how fast must you recover?
RPO (Recovery Point Objective): Maximum tolerable data loss — how much data can you afford to lose?

The 4 DR Strategies (cheapest → fastest)

Strategy	Description	RTO	RPO	Cost
Backup & Restore	Restore from S3/Glacier backup. No live DR resources.	Hours	Hours	Lowest
Pilot Light	Core data replicated; compute is provisioned minimally or switched off. Scale up on event.	Minutes–hours	Minutes	Low
Warm Standby	Scaled-down but functional copy in DR region. Scale up fast.	Minutes	Seconds–minutes	Medium
Active-Active (Multi-site)	Both regions serve traffic simultaneously.	Near-zero	Near-zero	Highest

Scenario

Company requires RPO ≤ 15 min and RTO ≤ 1 hour. Backup & Restore won't meet RTO. Active-Active is too expensive. Best fit: Warm Standby — a scaled-down running stack in DR region with continuous replication; scale up within minutes on failover.
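
The selection logic in this scenario amounts to picking the cheapest strategy whose recovery figures fit the requirement. The sketch below encodes that; the worst-case minute values are illustrative assumptions chosen to match the table's rough ranges, not AWS-published numbers.

```python
# (name, assumed worst-case RTO in minutes, assumed worst-case RPO in minutes),
# ordered cheapest first. The numbers are illustrative assumptions only.
STRATEGIES = [
    ("Backup & Restore", 24 * 60, 24 * 60),
    ("Pilot Light", 4 * 60, 15),
    ("Warm Standby", 30, 5),
    ("Active-Active (Multi-site)", 1, 1),
]


def cheapest_strategy(rto_minutes, rpo_minutes):
    """Return the first (cheapest) strategy that meets both objectives."""
    for name, worst_rto, worst_rpo in STRATEGIES:
        if worst_rto <= rto_minutes and worst_rpo <= rpo_minutes:
            return name
    return None  # requirement is tighter than any modeled strategy


print(cheapest_strategy(rto_minutes=60, rpo_minutes=15))
print(cheapest_strategy(rto_minutes=48 * 60, rpo_minutes=24 * 60))
```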

graph LR
  BR["Backup & Restore\nRPO/RTO: Hours\nCost: $"]
  PL["Pilot Light\nRPO: Minutes\nRTO: Minutes–Hrs\nCost: $$"]
  WS["Warm Standby\nRPO: Seconds\nRTO: Minutes\nCost: $$$"]
  AA["Active-Active\nRPO/RTO: Near-Zero\nCost: $$$$"]
  BR -->|more resilient| PL
  PL -->|more resilient| WS
  WS -->|more resilient| AA

🎯

The exam gives you RPO/RTO requirements and asks which strategy fits. Map: hours/hours → Backup & Restore; minutes RPO → Pilot Light or Warm Standby; near-zero → Active-Active. Cost scales with RTO speed.

⚠️

Common traps:

"Pilot Light means the DR environment is fully running at reduced capacity" — FALSE; that describes Warm Standby. In Pilot Light only core data/services (like DB replication) run; compute is off and must be started and scaled up on failover.
"RPO is about how fast you recover" — FALSE; RPO is about data loss tolerance (time). RTO is recovery time. Swap these and you'll pick the wrong strategy.

🗺️ Route 53 — Routing Policies for HA

DNSHigh Frequency

▾

Policy	Use Case	Health Check?
Simple	Single resource; no health checks	❌ Not supported
Failover	Primary/secondary; fail over on health check failure	✅ Required
Weighted	A/B testing; canary deployments; split traffic by %	Optional
Latency	Route to region with lowest latency for the user	Optional
Geolocation	Route by user's geographic location (country/continent)	Optional
Geoproximity	Route by distance, with bias to shift traffic between regions	Optional
Multivalue Answer	Return up to 8 healthy records; basic load distribution	✅ Recommended
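
For weighted routing, each record's share of traffic is its weight divided by the sum of all weights in the group. A small sketch of that arithmetic follows; the record names are hypothetical, and the all-zero case follows Route 53's documented equal-probability behavior.

```python
from fractions import Fraction


def traffic_share(weights):
    """Map each record name to its expected share of DNS responses."""
    total = sum(weights.values())
    if total == 0:
        # All weights zero: every record is returned with equal probability.
        return {name: Fraction(1, len(weights)) for name in weights}
    return {name: Fraction(weight, total) for name, weight in weights.items()}


canary = traffic_share({"blue": 90, "green": 10})
print(canary["green"])   # share sent to the canary stack
drained = traffic_share({"old": 0, "new": 5})
print(drained["old"])    # weight 0 receives no traffic
```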

🎯

Active-Active failover → Latency or Weighted (both regions serve traffic). Active-Passive failover → Failover routing policy. Legal data residency → Geolocation. Gradually shift traffic to new region → Geoproximity with bias.

⚠️

Common traps:

"Geolocation routing guarantees users always connect to the nearest region" — FALSE; Geolocation routes by geographic location, not latency. Use Latency-based routing for lowest latency.
"Multivalue Answer is a load balancer replacement" — FALSE; it's basic DNS-level health-checked multi-record, not a real load balancer. Use ELB for actual load balancing.
"Route 53 health checks can test private endpoints directly" — FALSE; health checks originate from the internet. Use CloudWatch alarm + Route 53 health check linked to the alarm for private resources.
"Weighted routing with weight 0 removes the record" — FALSE; weight 0 stops traffic to that endpoint but the record remains. Delete the record to remove it. If every record in the group has weight 0, Route 53 routes to all of them with equal probability.

🔁 Multi-AZ Patterns, Auto Scaling & Immutable Infrastructure

HAHigh Frequency

▾

Multi-AZ Key Facts

Always deploy across ≥2 AZs for HA — AZs are isolated failure domains within a Region
RDS Multi-AZ: Synchronous replication to standby; automatic failover (~60–120s); standby is not readable
Aurora Multi-AZ: 6 copies across 3 AZs; read replicas can be promoted; much faster failover than RDS
ELB: Distributes traffic across AZs; Cross-Zone Load Balancing sends traffic to all registered targets

EC2 Auto Scaling

Target Tracking: Maintain a metric value (e.g., CPU at 60%) — simplest, recommended
Step Scaling: Scale in defined steps based on CloudWatch alarms
Scheduled Scaling: Scale at predictable times (e.g., 8 AM every weekday)
Predictive Scaling: ML-based; provisions capacity before load arrives
Cooldown Period: Prevents thrashing — default 300s after a scale action
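
Target tracking can be approximated as proportional scaling: if the metric is 1.5 times the target, capacity grows by about 1.5 times. The function below is a simplified model for intuition, not the exact algorithm the service uses, and the group sizes are example values.

```python
import math


def desired_capacity(current_capacity, metric_value, target_value, min_size, max_size):
    """Scale capacity in proportion to metric/target, clamped to group bounds."""
    proportional = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, proportional))


# 4 instances averaging 90% CPU against a 60% target
print(desired_capacity(4, 90, 60, min_size=2, max_size=10))
# Load drops to 30% CPU: the group shrinks but never below min_size
print(desired_capacity(4, 30, 60, min_size=2, max_size=10))
```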

Immutable Infrastructure

Never modify running instances — replace with new AMI versions
Enables blue/green deployments: stand up new stack → shift traffic → terminate old
CloudFormation / CDK define infrastructure as code — entire stack is replaceable

🎯

Eliminate single points of failure: Multi-AZ ELB + Auto Scaling Group + Multi-AZ RDS. RDS read replica ≠ Multi-AZ standby — read replicas are for scaling reads (asynchronous replication); Multi-AZ standby is for failover (synchronous, not readable).

⚠️

Common traps:

"RDS Multi-AZ standby can handle read queries to reduce load" — FALSE; standby is passive and not accessible for reads. Create read replicas for that.
"Auto Scaling adds instances immediately when alarm fires" — FALSE; there is a warm-up period and cooldown period that delays scaling.
"Scheduled scaling overrides target tracking" — FALSE; they work together. A scheduled action sets min/max/desired capacity at a point in time, and target tracking keeps adjusting within those bounds. When several dynamic policies fire at once, the ASG follows the one that yields the largest capacity.
"EC2 Auto Scaling can replace unhealthy instances across regions" — FALSE; ASG is regional. Use multi-region architecture + Route 53 failover for cross-region HA.
"Cooldown period prevents scale-in and scale-out" — FALSE; the default cooldown applies only to simple scaling policies. Target tracking and step scaling rely on instance warm-up instead.

👁️ Workload Visibility — CloudWatch, X-Ray & Service Quotas

Observability

▾

Amazon CloudWatch

Metrics, logs, alarms, dashboards — central observability platform
Custom Metrics: Push application-level metrics (e.g., orders/min) via PutMetricData API
CloudWatch Logs Insights: Query log groups with SQL-like syntax
Composite Alarms: Combine multiple alarms with AND/OR logic to reduce alert noise
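
A composite alarm is defined by a rule expression that combines the states of other alarms. The helper below builds such an expression as a plain string. The alarm names are hypothetical and `composite_rule` is an illustrative function, not part of any AWS SDK.

```python
def composite_rule(alarm_names, operator="AND"):
    """Join child alarms into a composite alarm rule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


rule = composite_rule(["cpu-high", "latency-high", "errors-high"])
print(rule)
```

Using AND here means the composite alarm fires only when all three child alarms are in ALARM state, which is how it cuts alert noise.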

AWS X-Ray

Distributed tracing for microservices — visualizes request flow across services
Identifies bottlenecks, errors, throttling, and latency hotspots in distributed apps
Integrates with Lambda, API Gateway, ECS, EC2 (via daemon)

Service Quotas & Throttling

Every AWS service has default quotas (e.g., Lambda concurrency: 1,000)
Request quota increases via Service Quotas console before launching high-traffic workloads
Standby environments need their own quota increases — they won't share with primary
Use AWS Trusted Advisor to identify quota risks proactively

🎯

Trace a slow API call across Lambda + DynamoDB → X-Ray service map. Alert when 3 separate metrics breach thresholds simultaneously → CloudWatch Composite Alarm. DR standby needs same throughput as prod → pre-request quota increases in DR region.

⚠️

Common traps:

"CloudWatch monitors applications inside EC2 automatically" — FALSE; by default CloudWatch only gets hypervisor-level metrics (CPU, network, disk I/O). Install the CloudWatch Agent for memory, disk usage, and custom app metrics.
"X-Ray works automatically for all AWS services" — FALSE; you must instrument your code with the X-Ray SDK and configure the X-Ray daemon or Lambda layer.
"A CloudWatch alarm in INSUFFICIENT_DATA state means a breach" — FALSE; INSUFFICIENT_DATA means not enough data points — it does not trigger alarm actions by default.
"CloudWatch Logs retention is infinite by default" — TRUE and often a trap; log groups default to Never Expire. You must set a retention policy to avoid unbounded log storage costs.

🚀 RDS Proxy, Legacy Modernization & Reliability Patterns

Reliability

▾

Amazon RDS Proxy

Connection pooler between Lambda/app and RDS — prevents connection exhaustion
Improves failover time: maintains connections during Multi-AZ failover; app reconnects instantly
Integrates with Secrets Manager for credential rotation without app changes
Ideal when Lambda functions create many short-lived DB connections (connection storms)

Improving Legacy Applications

Add ALB + Auto Scaling in front of legacy monolith without changing app code
Put CloudFront in front to cache static assets and reduce origin load
Use SQS to absorb burst traffic and smooth load on legacy backend
Strangler Fig pattern: gradually replace monolith functionality with microservices behind same domain

🎯

Lambda hitting RDS connection limit → RDS Proxy (pooling). Legacy app can't be changed but needs HA → add ALB + Auto Scaling Group wrapping it. Reduce origin load without code changes → CloudFront for static assets. Reduce DB read load → ElastiCache for query caching (requires application changes).

⚠️

Common traps:

"RDS Proxy works with all RDS database engines" — FALSE; RDS Proxy supports MySQL, PostgreSQL, MariaDB, and SQL Server (plus Aurora MySQL/PostgreSQL). It does not support Oracle.
"RDS Proxy eliminates failover downtime" — FALSE; it reduces failover impact (from ~60s to ~5s) but doesn't eliminate it.
"Adding ElastiCache in front of RDS requires no code changes" — FALSE; you must modify application code to check cache before hitting the DB (lazy loading pattern). DAX for DynamoDB is the only cache that works transparently without code changes.
"Putting an SQS queue in front of a legacy app always improves performance" — FALSE; SQS adds asynchronous processing which can increase latency for synchronous use cases.

Domain 3 Overview

Select optimal AWS services and configurations for storage, compute, databases, networking, and data pipelines to meet performance requirements efficiently at scale.

⚡ 24% of scored content

📊 Visual Study Guides — Domain 3

Cheat SheetVisual

▾

Domain 3 — High-Performing Architectures

Domain 3: Designing High-Performing Architectures

5 Pillars of High-Performing AWS Architectures

Task 3.1

High-Performing Storage Solutions

S3, EBS, EFS, FSx — performance characteristics, hybrid storage, scaling.

Knowledge of:

Hybrid storage solutions to meet business requirements
Storage services with appropriate use cases (for example, Amazon S3, Amazon EFS, Amazon EBS)
Storage types with associated characteristics (for example, object, file, block)

Skills in:

Determining storage services and configurations that meet performance demands
Determining storage services that can scale to accommodate future needs

💿 Storage Type Selection — S3, EBS, EFS, FSx

StorageHigh Frequency

▾

Service	Type	Access	Use Case	Throughput
Amazon S3	Object	HTTP API (any client)	Data lake, backups, static website, media	Effectively unlimited
Amazon EBS	Block	Single EC2 (same AZ)	OS volumes, databases, low-latency random I/O	Up to 256K IOPS (io2 BE)
Amazon EFS	File (NFS)	Thousands of EC2/Lambda across AZs	Shared CMS, home dirs, dev tools, containers	Elastic, bursts to 3+ GB/s
FSx for Windows	File (SMB)	Windows EC2 / on-prem AD	Windows workloads, SQL Server, Active Directory	Up to 2 GB/s
FSx for Lustre	File (parallel)	HPC compute nodes	ML training, genomics, video processing, HPC	Hundreds of GB/s
FSx for NetApp ONTAP	File (multi-protocol)	NFS, SMB, iSCSI	Lift-and-shift enterprise storage apps	High

EBS Volume Types

Type	Class	Max IOPS	Use Case
gp3	SSD	16,000	General purpose — cost-effective for most workloads
io2 Block Express	SSD	256,000	Critical DBs, SAP HANA, lowest latency
st1	HDD	500	Throughput-optimized — big data, log processing
sc1	HDD	250	Cold HDD — infrequent access, lowest cost block

🎯

Shared file storage across multiple EC2 → EFS (Linux) or FSx for Windows (Windows). High-IOPS database → EBS io2. HPC / ML training → FSx for Lustre (can link to S3 as data repository). EBS attaches to a single EC2 instance in the same AZ; io1/io2 Multi-Attach is the limited exception.

⚠️

Common traps:

"EBS volumes can be attached to multiple EC2 instances simultaneously" — FALSE FOR MOST TYPES; only io1/io2 with Multi-Attach enabled (same AZ, up to 16 instances, Linux only with cluster-aware file system).
"EFS can be mounted on Windows EC2" — FALSE; EFS is NFS-based (Linux only). Use FSx for Windows File Server for Windows.
"S3 is a file system" — FALSE; S3 is object storage, not a POSIX-compliant file system. Don't mount it like EFS.
"EBS volumes persist if the EC2 is terminated" — FALSE BY DEFAULT; the root volume is deleted on termination unless you explicitly uncheck DeleteOnTermination. Data volumes persist by default.

🗂️ S3 Performance & Hybrid Storage

S3Hybrid

▾

S3 Performance Patterns

S3 automatically scales to 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix
Use multiple key prefixes (paths) to parallelize across partitions — no random prefixes needed (post-2018)
S3 Transfer Acceleration: Uploads via CloudFront edge → AWS backbone — improves speed over long distances
Multipart Upload: Required for files >5 GB; recommended for files >100 MB; enables parallel upload chunks
Byte-Range Fetches: Parallelize downloads by fetching chunks simultaneously
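
The arithmetic behind these patterns is easy to sketch. The helpers below are hypothetical local functions. The 10,000-part and 5 MiB minimum-part limits are S3 multipart upload limits not stated in the list above, and the object sizes are examples.

```python
import math

MIB = 1024 * 1024
MAX_PARTS = 10_000          # S3 multipart upload part-count limit
MIN_PART_SIZE = 5 * MIB     # minimum size for every part except the last
GET_RATE_PER_PREFIX = 5500  # GET/HEAD requests per second per prefix


def part_count(object_size, part_size):
    """Number of multipart-upload parts needed for an object."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part size below 5 MiB")
    parts = math.ceil(object_size / part_size)
    if parts > MAX_PARTS:
        raise ValueError("too many parts; use a larger part size")
    return parts


def byte_ranges(object_size, chunk_size):
    """HTTP Range header values that cover the object in chunk_size pieces."""
    return [
        f"bytes={start}-{min(start + chunk_size, object_size) - 1}"
        for start in range(0, object_size, chunk_size)
    ]


def max_get_rate(prefix_count):
    """Aggregate GET rate when reads are spread evenly across prefixes."""
    return GET_RATE_PER_PREFIX * prefix_count


print(part_count(1024 * MIB, 100 * MIB))
print(byte_ranges(25, 10))
print(max_get_rate(10))
```

Each range can be fetched by a separate worker, and each part uploaded by a separate worker, which is where the parallel speedup comes from.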

Hybrid Storage

Service	Use Case
AWS Storage Gateway (S3 File GW)	On-prem NFS/SMB → S3 via local cache appliance
AWS Storage Gateway (Volume GW)	iSCSI block volumes backed by S3/Glacier
AWS Storage Gateway (Tape GW)	Virtual tape library → S3 Glacier (replaces physical tape)
AWS DataSync	High-speed online data transfer: on-prem ↔ S3, EFS, FSx
AWS Snow Family	Offline physical transfer for petabyte-scale or no-internet scenarios

🎯

Upload large objects fast over the internet → S3 Transfer Acceleration + Multipart Upload. On-prem file server → S3 → Storage Gateway File GW. Migrate petabytes with limited bandwidth → Snowball Edge. Ongoing sync → DataSync (faster than S3 CLI, handles metadata).

⚠️

Common traps:

"S3 Transfer Acceleration speeds up downloads from S3" — TRUE, not just uploads; it accelerates both.
"You can use random prefixes (hash-based keys) to improve S3 performance" — This was true pre-2018; S3 now automatically partitions on request rate. Random prefixes are no longer needed.
"DataSync is only for one-time migrations" — FALSE; DataSync supports ongoing scheduled synchronization.
"Snowball can transfer data to any AWS region" — FALSE; Snowball ships to and from specific AWS regions; not all regions support all Snow devices.

Task 3.2

High-Performing and Elastic Compute Solutions

EC2 instance types, Auto Scaling, serverless, containers, distributed compute.

Knowledge of:

AWS compute services with appropriate use cases (for example, AWS Batch, Amazon EMR, AWS Fargate)
Distributed computing concepts supported by AWS global infrastructure and edge services
Queuing and messaging concepts (for example, publish/subscribe)
Scalability capabilities with appropriate use cases (for example, Amazon EC2 Auto Scaling, AWS Auto Scaling)
Serverless technologies and patterns (for example, AWS Lambda, Fargate)
The orchestration of containers (for example, Amazon ECS, Amazon EKS)

Skills in:

Decoupling workloads so that components can scale independently
Identifying metrics and conditions to perform scaling actions
Selecting the appropriate compute options and features (for example, EC2 instance types) to meet business requirements
Selecting the appropriate resource type and size (for example, the amount of Lambda memory) to meet business requirements

🖥️ EC2 Instance Families & Sizing

ComputeHigh Frequency

▾

Family	Optimized For	Example Types	Use Cases
General Purpose (M, T)	Balanced CPU/memory/network	m7g, t3a	Web servers, dev environments, small DBs
Compute Optimized (C)	High CPU : memory ratio	c7g, c6i	Batch processing, ML inference, video encoding
Memory Optimized (R, X, u)	High memory : CPU ratio	r7i, x2iedn	In-memory DBs, SAP HANA, real-time analytics
Storage Optimized (I, D, H)	High sequential I/O / NVMe	i4i, d3	NoSQL DBs, data warehousing, distributed file systems
Accelerated (P, G, Inf, Trn)	GPU / custom silicon	p4d, g5, inf2	ML training, graphics rendering, HPC

Decoupling & Independent Scaling

Use SQS between web tier and processing tier — each scales based on its own queue depth or CPU metric
SQS queue depth (ApproximateNumberOfMessagesVisible) → scale Auto Scaling Group for workers
AWS Batch: Managed batch compute — dynamically provisions EC2/Spot for job queues; no manual cluster management
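
One common way to turn queue depth into a worker count is the backlog-per-instance calculation: divide the acceptable latency by the average processing time to get how many messages one worker may have waiting, then divide the queue depth by that figure. The numbers below are example values and `workers_needed` is a hypothetical helper.

```python
import math


def workers_needed(queue_depth, avg_processing_ms, acceptable_latency_ms, min_workers=1):
    """Workers required so that no message waits longer than the target latency."""
    # Messages one worker can have queued and still meet the latency target.
    backlog_per_worker = max(1, acceptable_latency_ms // avg_processing_ms)
    return max(min_workers, math.ceil(queue_depth / backlog_per_worker))


# 1,500 visible messages, 100 ms per message, 10 s acceptable latency
print(workers_needed(1500, avg_processing_ms=100, acceptable_latency_ms=10_000))
```

Scaling on backlog per worker rather than raw queue depth keeps the target meaningful as the fleet grows or shrinks.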

🎯

Identify the right instance family from workload description: "in-memory database" → R-family; "high-compute batch jobs" → C-family; "ML training with GPUs" → P/G-family; "genomics high I/O" → I-family. Lambda memory directly controls allocated CPU too.

⚠️

Common traps:

"T-family instances always deliver full CPU performance" — FALSE; T instances have a CPU credit model. Under sustained load they throttle unless T-unlimited mode is enabled (at extra cost).
"Larger instance size always means better performance" — FALSE; a memory-optimized R-family is better for memory-bound workloads than a larger compute C-family.
"AWS Batch requires EC2 instances you manage" — FALSE; AWS Batch can use Fargate as the compute environment for serverless job execution.
"Spot Instances can be used for RDS" — FALSE; RDS does not use Spot pricing. Spot applies to EC2 capacity, including EC2-backed ECS, EKS, EMR, and Batch (Fargate has its own Fargate Spot option).

Task 3.3

High-Performing Database Solutions

RDS, Aurora, DynamoDB, ElastiCache, database selection and architecture.

Knowledge of:

AWS global infrastructure (for example, Availability Zones, AWS Regions)
Caching strategies and services (for example, Amazon ElastiCache)
Data access patterns (for example, read-intensive compared with write-intensive)
Database capacity planning (for example, capacity units, instance types, Provisioned IOPS)
Database connections and proxies
Database engines with appropriate use cases (for example, heterogeneous migrations, homogeneous migrations)
Database replication (for example, read replicas)
Database types and services (for example, serverless, relational compared with non-relational, in-memory)

Skills in:

Configuring read replicas to meet business requirements
Designing database architectures
Determining an appropriate database engine (for example, MySQL compared with PostgreSQL)
Determining an appropriate database type (for example, Amazon Aurora, Amazon DynamoDB)
Integrating caching to meet business requirements

🗄️ Database Selection — Relational vs. NoSQL vs. Specialty

DatabaseHigh Frequency

▾

Service	Type	Strengths	Best For
Amazon RDS	Relational (OLTP)	Managed MySQL/PostgreSQL/Oracle/SQL Server	Traditional apps, ERP, CRM, e-commerce
Amazon Aurora	Relational (OLTP)	5× MySQL / 3× PostgreSQL perf; 6-copy replication; Global DB	High-throughput relational workloads
Amazon DynamoDB	NoSQL (key-value / document)	Millisecond latency, serverless, virtually unlimited scale	Session stores, gaming, IoT, e-commerce cart
Amazon Redshift	Columnar (OLAP)	Petabyte-scale data warehouse; Redshift Spectrum queries S3	Analytics, BI, large-scale reporting
Amazon Neptune	Graph	Billions of relationships; Gremlin / SPARQL	Social networks, fraud detection, knowledge graphs
Amazon ElastiCache	In-memory	Sub-millisecond; Redis or Memcached	Caching, sessions, leaderboards
Amazon Keyspaces	Wide-column (Cassandra)	Serverless Cassandra-compatible	Migrating Cassandra workloads

Aurora Performance Highlights

Storage auto-grows in 10 GB increments up to 128 TB — no pre-provisioning
Up to 15 read replicas with <10ms replica lag
Aurora Serverless v2: Scales in fine-grained ACU increments; ideal for variable/unpredictable workloads
Aurora Global Database: Replicates across up to 5 regions with <1s lag; secondary region promoted in <1 min

DynamoDB Capacity Modes

	Provisioned	On-Demand
Billing	Per RCU/WCU provisioned	Per request (pay per read/write)
Scaling	Auto Scaling adjusts within limits	Instantly handles any traffic level
Best for	Predictable traffic; cost optimization	Unknown or spiky traffic
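
The capacity units billed in Provisioned mode can be estimated from item size and request rate. The sketch below assumes the standard DynamoDB sizing rules, which are not stated above: one RCU covers one strongly consistent read per second of an item up to 4 KB (eventually consistent reads cost half), and one WCU covers one write per second of an item up to 1 KB. The function names are illustrative.

```python
import math


def read_capacity_units(reads_per_sec: int, item_kb: float,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed for a steady read rate (assumed rule: 4 KB per RCU)."""
    units_per_read = math.ceil(item_kb / 4)
    total = reads_per_sec * units_per_read
    # Eventually consistent reads consume half the capacity.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(writes_per_sec: int, item_kb: float) -> int:
    """WCUs needed for a steady write rate (assumed rule: 1 KB per WCU)."""
    return writes_per_sec * math.ceil(item_kb)


if __name__ == "__main__":
    print(read_capacity_units(100, 6))          # 6 KB items round up to 2 RCUs each
    print(read_capacity_units(100, 6, False))   # eventually consistent
    print(write_capacity_units(50, 2.5))        # 2.5 KB items round up to 3 WCUs each
```

If the resulting numbers are stable from day to day, Provisioned mode with Auto Scaling is usually the cheaper choice; if they swing widely, On-Demand avoids paying for idle capacity.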

🎯

"Need to join tables, ACID transactions" → RDS/Aurora. "Millisecond latency at any scale, no schema" → DynamoDB. "Run SQL analytics on S3 data lake" → Redshift Spectrum or Athena. "Graph relationships" → Neptune. "Migrate Oracle → AWS" → RDS for Oracle or Aurora PostgreSQL with SCT/DMS.

⚠️

Common traps:

"Aurora is just a managed MySQL" — FALSE; Aurora has a completely different storage layer (distributed, 6 copies, auto-growing) with 5× MySQL performance.
"DynamoDB supports complex multi-table joins" — FALSE; DynamoDB is a NoSQL key-value/document store with no native JOIN support. Design your data model to avoid joins (single-table design).
"Aurora Serverless v2 scales to zero" — FALSE; Aurora Serverless v2 scales down to 0.5 ACU minimum, not zero. v1 could scale to zero.
"Redshift is used for OLTP workloads" — FALSE; Redshift is a columnar OLAP data warehouse optimized for analytics queries, not transactional workloads.

📈 Database Performance — Read Replicas, Proxies & Caching

Database · High Frequency

Read Replicas

Asynchronous replication from primary to replica(s)
RDS: up to 15 read replicas for MySQL, MariaDB, and PostgreSQL (5 for Oracle and SQL Server); Aurora: up to 15
Point reporting/analytics workloads to read replicas — reduce primary load
Can promote to standalone DB (breaks replication) for DR or migration
Cross-region read replicas: lower latency for global users + DR capability

Connection Management

RDS Proxy: Pools and multiplexes connections — critical for Lambda → RDS (Lambda can open thousands of concurrent connections)
Reduces DB failover impact: proxy maintains connections, apps reconnect through proxy seamlessly

Caching with ElastiCache

Read-heavy workloads: Cache frequently queried DB results → reduce RDS load by 80%+
Redis Cluster mode: Horizontal sharding for datasets >300 GB
Redis replication groups (Multi-AZ): Primary + replicas for HA (automatic failover)

🎯

Scale reads → Read replicas + route app reads to replica endpoint. Lambda connection storms → RDS Proxy. Offload repetitive read queries → ElastiCache lazy loading. DynamoDB hot-partition reads → DAX (not ElastiCache — DAX is DynamoDB-specific and API-compatible, so it needs only minimal code changes).
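
Lazy loading (cache-aside) means the application checks the cache first and only queries the database on a miss, then stores the result with a TTL. The sketch below uses an in-process dictionary as a stand-in for ElastiCache, and `load_from_db` is a hypothetical callback representing the RDS query; neither is an AWS API.

```python
import time
from typing import Callable, Dict, Tuple


class LazyCache:
    """Cache-aside helper: read the cache, fall back to the database on a miss."""

    def __init__(self, load_from_db: Callable[[str], str], ttl_seconds: float = 60.0):
        self._load = load_from_db          # hypothetical database lookup
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expiry, value)
        self.misses = 0

    def get(self, key: str) -> str:
        entry = self._store.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]                # cache hit: no database call
        # Cache miss or expired entry: query the database and repopulate.
        self.misses += 1
        value = self._load(key)
        self._store[key] = (time.monotonic() + self._ttl, value)
        return value


if __name__ == "__main__":
    cache = LazyCache(lambda key: f"row-for-{key}")
    cache.get("user:1")
    cache.get("user:1")
    print(cache.misses)  # only the first call reached the "database"
```

Only data that is actually requested gets cached, at the cost of one slow read per miss and possible staleness until the TTL expires.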

⚠️

Common traps:

"Read replicas provide synchronous replication for zero data loss" — FALSE; read replicas use asynchronous replication. There can be replication lag. Multi-AZ uses synchronous replication.
"Promoting a read replica to primary breaks the existing primary" — FALSE; promoting creates a standalone DB. The original primary continues to run independently.
"RDS Proxy supports all RDS engines including Oracle" — FALSE; RDS Proxy supports MySQL, MariaDB, PostgreSQL, and SQL Server (plus Aurora MySQL/PostgreSQL), but not Oracle.
"ElastiCache Redis cluster mode disabled means no HA" — FALSE; you can still have replication groups (primary + replicas) with auto-failover without cluster mode. Cluster mode adds sharding.

Task 3.4

High-Performing Network Architectures

CloudFront, Global Accelerator, PrivateLink, Transit Gateway, VPN, network topology design.

Knowledge of:

Edge networking services with appropriate use cases (for example, Amazon CloudFront, AWS Global Accelerator)
How to design network architecture (for example, subnet tiers, routing, IP addressing)
Load balancing concepts (for example, Application Load Balancer)
Network connection options (for example, AWS VPN, AWS Direct Connect, AWS PrivateLink)
Data analytics and visualization services with appropriate use cases (for example, Amazon Athena, AWS Lake Formation, Amazon QuickSight)
Data ingestion patterns (for example, frequency)
Data transfer services with appropriate use cases (for example, AWS DataSync, AWS Storage Gateway)
Data transformation services with appropriate use cases (for example, AWS Glue)
Secure access to ingestion access points
Sizes and speeds needed to meet business requirements
Streaming data services with appropriate use cases (for example, Amazon Kinesis)

Skills in:

Creating a network topology for various architectures (for example, global, hybrid, multi-tier)
Determining network configurations that can scale to accommodate future needs
Determining the appropriate placement of resources to meet business requirements
Selecting the appropriate load balancing strategy
Building and securing data lakes
Designing data streaming architectures
Designing data transfer solutions
Implementing visualization strategies
Selecting appropriate compute options for data processing (for example, Amazon EMR)
Selecting appropriate configurations for ingestion
Transforming data between formats (for example, .csv to .parquet)

🌐 Edge Acceleration — CloudFront & Global Accelerator

Edge · High Frequency

	Amazon CloudFront	AWS Global Accelerator
Protocol	HTTP/HTTPS (content delivery)	TCP/UDP (any protocol)
Caching	✅ Edge caches content	❌ Routes packets, no caching
IP addresses	Dynamic (DNS-based)	✅ 2 static Anycast IPs
Routing	Nearest edge pop (content)	AWS backbone → nearest region endpoint
Use case	CDN — websites, video, APIs, S3 static	Gaming, IoT, VoIP, real-time apps needing static IPs

Network Connection Options

Service	Use Case
AWS PrivateLink	Expose services privately to other VPCs/accounts without VPC peering or internet; uses interface endpoints
VPC Peering	Direct private connectivity between 2 VPCs (same or different account/region); non-transitive
AWS Transit Gateway	Hub-and-spoke: connect 100s of VPCs + on-prem through one gateway; supports transitive routing
AWS Site-to-Site VPN	IPsec-encrypted tunnel over the public internet from on-prem to VPC; minutes to set up

🎯

Static IP requirement + non-HTTP → Global Accelerator. Cache web content globally → CloudFront. Connect many VPCs at scale → Transit Gateway (not peering — peering doesn't scale, no transitive routing). Expose SaaS privately → PrivateLink.

⚠️

Common traps:

"VPC peering allows transitive routing — traffic from VPC A can reach VPC C via VPC B" — FALSE; VPC peering is non-transitive. Use Transit Gateway for hub-and-spoke.
"Global Accelerator caches content at edge locations" — FALSE; it routes traffic via the AWS backbone to the nearest healthy endpoint — no caching.
"PrivateLink requires VPC peering" — FALSE; PrivateLink uses interface endpoints independent of peering.
"CloudFront can only serve content from S3" — FALSE; CloudFront supports any HTTP origin including ALBs, EC2 instances, on-prem web servers, and API Gateway — S3 is just the most common static origin.

🕸️ Network Topology — Subnets, Routing & Placement

VPC · Medium Frequency

Multi-Tier Subnet Design

Public tier: ALB, NAT GW, bastion hosts — has internet route via IGW
Application tier: EC2, ECS tasks — private, outbound via NAT GW
Data tier: RDS, ElastiCache — private, no outbound internet access
Spread each tier across ≥2 AZs for HA — 6 subnets minimum for a 3-tier, 2-AZ design
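
The 3-tier, 2-AZ layout above can be planned with Python's standard `ipaddress` module. This is a minimal sketch: the VPC CIDR, tier names, and AZ labels are illustrative assumptions, and the function only computes address ranges rather than creating anything in AWS.

```python
import ipaddress
from typing import Dict, List, Tuple


def plan_subnets(vpc_cidr: str, tiers: List[str], azs: List[str],
                 new_prefix: int) -> Dict[Tuple[str, str], ipaddress.IPv4Network]:
    """Assign one equally sized subnet to every (tier, AZ) pair."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    plan = {}
    for tier in tiers:
        for az in azs:
            plan[(tier, az)] = next(blocks)  # next free block in the VPC range
    return plan


if __name__ == "__main__":
    plan = plan_subnets("10.0.0.0/16", ["public", "app", "data"], ["a", "b"], 24)
    for (tier, az), net in plan.items():
        # AWS reserves 5 addresses in every subnet.
        print(f"{tier:<6} az-{az}  {net}  usable={net.num_addresses - 5}")
```

Choosing /24 subnets inside a /16 leaves most of the range unallocated, which keeps room for additional tiers or AZs later.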

Scaling Network Capacity

CIDR sizing: plan subnets large enough for future growth — you can't resize a VPC CIDR, only add secondary CIDRs
VPC secondary CIDR blocks: extend IP space without recreating the VPC
Placement Groups: Cluster (low latency, same rack) / Spread (max isolation) / Partition (HDFS, Cassandra)

🎯

HPC requiring low-latency between instances → Cluster Placement Group (single AZ, same rack). Maximize instance isolation for HA → Spread Placement Group. HDFS/Cassandra large clusters → Partition Placement Group.

⚠️

Common traps:

"A public subnet automatically gives EC2 instances a public IP" — FALSE; EC2 gets a public IP only if the subnet's auto-assign public IP setting is enabled OR you explicitly associate an EIP.
"You can resize a VPC CIDR block" — FALSE; you cannot modify the primary CIDR. Add secondary CIDR blocks to extend address space.
"Cluster Placement Groups span multiple AZs for better HA" — FALSE; Cluster Placement Groups are within a single AZ (designed for performance, not HA). Use Spread Placement Groups across AZs for HA.
"Private subnets cannot reach the internet" — FALSE; private subnets can reach the internet for outbound traffic via a NAT Gateway in a public subnet.

Task 3.5

High-Performing Data Ingestion & Transformation

Kinesis, Glue, Athena, Lake Formation, EMR, DataSync — data pipelines and analytics.

🌊 Streaming Data — Kinesis Family

Streaming · High Frequency

Service	Purpose	Key Facts
Kinesis Data Streams (KDS)	Real-time custom ingestion	Shards: 1 MB/s in, 2 MB/s out per shard; 24-hour default retention (extendable up to 365 days)
Amazon Data Firehose	Managed delivery to destinations	Fully managed; delivers to S3, Redshift, OpenSearch, Splunk; no consumers to manage
Kinesis Video Streams	Video ingestion & playback	Ingest video from devices; ML processing
Amazon MSK	Managed Apache Kafka	Lift-and-shift Kafka workloads; standard Kafka API
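
The per-shard limits for Kinesis Data Streams translate directly into a shard count. The sketch below uses the 1 MB/s in and 2 MB/s out figures from the table, plus a 1,000 records/s per-shard write limit that is an added assumption not stated above. It models provisioned mode with shared-throughput consumers.

```python
import math


def shards_needed(ingest_mb_per_s: float, egress_mb_per_s: float,
                  records_per_s: int = 0) -> int:
    """Smallest shard count that satisfies every per-shard limit."""
    by_ingest = math.ceil(ingest_mb_per_s / 1.0)   # 1 MB/s write per shard
    by_egress = math.ceil(egress_mb_per_s / 2.0)   # 2 MB/s read per shard
    by_records = math.ceil(records_per_s / 1000)   # assumed 1,000 records/s per shard
    return max(1, by_ingest, by_egress, by_records)


if __name__ == "__main__":
    # 5 MB/s in, 12 MB/s out: the read side is the bottleneck.
    print(shards_needed(5, 12))
    # Small payloads at a high record rate: the record limit dominates.
    print(shards_needed(0.5, 0.5, records_per_s=3500))
```

The result sizes throughput only; as the traps below note, adding shards does not lower read latency.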

Pattern

IoT devices → Kinesis Data Streams (real-time processing by Lambda) → transform → Data Firehose → S3 data lake → Athena for ad-hoc SQL queries → QuickSight for dashboards.

🎯

Real-time custom processing → KDS. Managed delivery without consumer management → Data Firehose. Existing Kafka infrastructure → Amazon MSK. Firehose can't replay data; KDS can (within retention window).

⚠️

Common traps:

"KDS and Firehose are interchangeable" — FALSE; KDS requires custom consumer code; Firehose is managed delivery to a fixed set of destinations.
"Adding Kinesis shards reduces read latency" — FALSE; shards increase throughput (MB/s), not latency.
"MSK replaces SQS" — FALSE; MSK is Managed Kafka for high-throughput streaming; SQS is a simpler decoupled queue service.
"Kinesis Data Streams retains data for 7 days by default" — FALSE; the default retention period is 24 hours. It can be extended up to 365 days (additional cost).

🔬 Data Lakes, ETL & Analytics — Glue, Athena, EMR, Lake Formation

Analytics · Medium Frequency

Service	Role	Key Facts
AWS Glue	Serverless ETL	Crawlers catalog S3 data; Glue jobs transform and load; Python/Spark
Amazon Athena	Serverless SQL on S3	Pay per query (per TB scanned); use Parquet/ORC to reduce cost 10×
Amazon EMR	Managed Hadoop/Spark	Big data processing; Spot Instances for task nodes save 60–90%
AWS Lake Formation	Data lake governance	Centralized permissions on S3 data lake; column/row-level security
Amazon Redshift	Data warehouse	Columnar storage; Spectrum: query S3 directly without loading
Amazon QuickSight	BI / visualization	Serverless; SPICE in-memory engine; ML insights

Format Optimization

Convert CSV → Parquet or ORC before querying with Athena — columnar formats reduce data scanned by 10–100×
Partition S3 data by date/region/category — Athena skips entire partitions when WHERE clause matches
AWS Glue can automate CSV → Parquet conversion in ETL pipelines
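
Partitioning and columnar formats both work by shrinking the amount of data Athena has to scan. The sketch below builds Hive-style `key=value` S3 keys, which Athena can use for partition pruning, and gives a rough scan-cost estimate. The $5 per TB price, the table name, and the fractions are illustrative assumptions, not figures from this guide.

```python
from datetime import date
from typing import Tuple


def partitioned_key(table: str, day: date, region: str, filename: str) -> str:
    """Build a Hive-style S3 key so queries can skip non-matching partitions."""
    return (
        f"{table}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        f"region={region}/{filename}"
    )


def estimated_scan_cost(total_tb: float, partition_fraction: float = 1.0,
                        column_fraction: float = 1.0,
                        price_per_tb: float = 5.0) -> Tuple[float, float]:
    """Return (TB scanned, cost) for a query; price_per_tb is an assumed rate."""
    scanned_tb = total_tb * partition_fraction * column_fraction
    return scanned_tb, round(scanned_tb * price_per_tb, 2)


if __name__ == "__main__":
    print(partitioned_key("events", date(2024, 1, 5), "eu-west-1", "part-0000.parquet"))
    print(estimated_scan_cost(10))             # unpartitioned CSV: full scan
    print(estimated_scan_cost(10, 0.1, 0.2))   # 10% of partitions, 20% of columns
```

In this toy example, touching a tenth of the partitions and a fifth of the columns cuts the scanned volume fifty-fold, which is the mechanism behind the savings described above.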

🎯

Serverless SQL on S3 → Athena. Managed Spark/Hadoop big data → EMR. Serverless ETL → Glue. Fine-grained data lake permissions → Lake Formation. BI dashboards → QuickSight. Athena cost: convert to Parquet + partition = massive savings.

⚠️

Common traps:

"AWS Glue is a data warehouse" — FALSE; Glue is serverless ETL and a data catalog. Redshift is the data warehouse.
"Athena can directly query DynamoDB tables" — FALSE; Athena natively queries data in S3. Export the table to S3 first, use Athena Federated Query with the DynamoDB connector, or use PartiQL within DynamoDB.
"Lake Formation replaces S3 as storage" — FALSE; Lake Formation is a governance/permissions layer — data still lives in S3.
"EMR master node can run on Spot to save cost" — FALSE; master node interruption kills the entire cluster. Use On-Demand for master; Spot is safe only for task nodes.

Domain 4 Overview

Design architectures that deliver required capabilities at the lowest cost. Covers storage tiering, compute purchasing options, database cost optimization, and network cost reduction strategies.

⚡ 20% of scored content

📊 Visual Study Guides — Domain 4

Cheat Sheet · Visual

Domain 4 — Cost-Optimized Architectures

Domain 4: Designing Cost-Optimized Architectures

Task 4.1

Cost-Optimized Storage Solutions

S3 tiers, lifecycle policies, EBS optimization, storage tool selection, data transfer costs.

📦 S3 Storage Classes & Lifecycle Cost Optimization

Storage Cost · High Frequency

Storage Class	Retrieval	Min Duration	Use Case
S3 Standard	Instant, ms	None	Frequently accessed data
S3 Intelligent-Tiering	Instant (frequent tier)	None	Unknown or changing access patterns
S3 Standard-IA	Instant, ms	30 days	Infrequent access, rapid retrieval (backups)
S3 One Zone-IA	Instant, ms	30 days	IA data that can be recreated if AZ fails
S3 Glacier Instant Retrieval	Instant, ms	90 days	Archives accessed once/quarter
S3 Glacier Flexible Retrieval	Minutes to 12 hours (Expedited 1–5 min, Standard 3–5 hr, Bulk 5–12 hr)	90 days	Compliance archives, not time-sensitive
S3 Glacier Deep Archive	12–48 hours	180 days	Lowest cost; regulatory long-term archives

Cost Reduction Strategies

Lifecycle Policies: Auto-transition objects to cheaper tiers based on age — set-and-forget cost savings
S3 Intelligent-Tiering: AWS monitors access and automatically moves objects between tiers; small monitoring fee per object
Requester Pays: Buckets where the requester (not bucket owner) pays data transfer and request costs — ideal for public datasets
Batch uploads: Aggregate small objects before upload — reduces per-request costs vs. many individual PUTs
Cost Allocation Tags: Tag S3 buckets by team/project for granular billing breakdown
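
A lifecycle policy is expressed as a list of rules with transitions. The sketch below builds a configuration dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The rule ID, prefix, and day thresholds are illustrative assumptions, and the code only constructs the document without calling AWS.

```python
import json
from typing import Dict, List, Optional, Tuple


def lifecycle_config(rule_id: str, prefix: str,
                     transitions: List[Tuple[int, str]],
                     expire_after_days: Optional[int] = None) -> Dict:
    """Build an S3 lifecycle configuration from (days, storage_class) pairs."""
    rule = {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        # Sort by age so the rule reads from warmest to coldest tier.
        "Transitions": [
            {"Days": days, "StorageClass": storage_class}
            for days, storage_class in sorted(transitions)
        ],
    }
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}
    return {"Rules": [rule]}


if __name__ == "__main__":
    config = lifecycle_config(
        "archive-logs", "logs/",
        [(90, "GLACIER"), (30, "STANDARD_IA"), (365, "DEEP_ARCHIVE")],
        expire_after_days=2555,
    )
    print(json.dumps(config, indent=2))
    # To apply (requires boto3 and credentials):
    # boto3.client("s3").put_bucket_lifecycle_configuration(
    #     Bucket="example-bucket", LifecycleConfiguration=config)
```

The bucket name in the commented call is a placeholder. Transitions only move objects toward colder classes, so the thresholds should increase with each step.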

Block Storage Cost Optimization

Right-size EBS volumes — don't over-provision; use CloudWatch to identify underutilized volumes
Delete unattached EBS volumes (common cost leak)
Use gp3 instead of gp2 — gp3 is 20% cheaper and lets you set IOPS independently
Use st1 (HDD) for sequential large file workloads — much cheaper than SSD for throughput-bound access
EBS Snapshots: incremental; store only changed blocks; use Data Lifecycle Manager to automate retention

🎯

Unknown access patterns → S3 Intelligent-Tiering (automated, no retrieval penalty on frequent tier). Long-term compliance archive, lowest cost → Glacier Deep Archive. gp2 vs gp3 → always prefer gp3 (cheaper, independent IOPS tuning). Unattached EBS = wasted spend — Trusted Advisor flags these.

⚠️

Common traps:

"S3 Intelligent-Tiering has retrieval fees" — FALSE; no retrieval fees, only a small per-object monitoring charge (~$0.0025/1k objects).
"Standard-IA is always cheaper than Standard" — FALSE; Standard-IA charges a per-GB retrieval fee — for frequently read data it is MORE expensive than Standard.
"Glacier Deep Archive and Glacier Flexible Retrieval have the same retrieval time" — FALSE; Deep Archive = 12–48 hours; Flexible Retrieval = minutes (Expedited) to 12 hours (Bulk).
"S3 Lifecycle rules can move objects back to warmer tiers" — FALSE; lifecycle only transitions to colder tiers. Manual copy is needed to move back to Standard.

🚚 Data Transfer Costs & Migration Tools

Cost · Transfer

Data Transfer Cost Rules

Inbound to AWS = always FREE (upload to S3, Direct Connect inbound)
Same Region, same AZ, EC2 → EC2 private IP = FREE
Cross-AZ traffic = $0.01/GB each direction — minimize by keeping tiers in same AZ when possible
Cross-Region transfer = varies by region; significant cost at scale
Internet egress = $0.09/GB (first 10 TB/month from most regions)
CloudFront egress = cheaper than direct S3/EC2 internet egress + reduces origin requests
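
The rules above can be turned into a quick monthly estimate. This sketch uses only the per-GB rates quoted in the list (which vary by Region and volume tier); the path names are hypothetical labels chosen for the example.

```python
from typing import Dict

# Per-GB rates taken from the rules above; treat them as approximate.
RATES_PER_GB = {
    "inbound": 0.0,            # data into AWS is free
    "same_az_private": 0.0,    # EC2 to EC2 over private IPs in one AZ
    "cross_az_one_way": 0.01,  # charged in each direction
    "internet_egress": 0.09,   # first pricing tier
}


def monthly_transfer_cost(gb_by_path: Dict[str, float]) -> float:
    """Sum the transfer charges for a month of traffic, keyed by path type."""
    total = sum(RATES_PER_GB[path] * gb for path, gb in gb_by_path.items())
    return round(total, 2)


if __name__ == "__main__":
    usage = {"inbound": 5000, "cross_az_one_way": 2000, "internet_egress": 1000}
    print(monthly_transfer_cost(usage))
```

Here the 5 TB of uploads cost nothing, while 1 TB of internet egress accounts for most of the bill.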

Lowest-Cost Transfer Methods

Scenario	Best Tool	Why
Small regular transfers to S3	AWS CLI / SDK	No overhead
Large ongoing sync (on-prem ↔ AWS)	AWS DataSync	10× faster than rsync; handles metadata
Petabytes, limited bandwidth	AWS Snowball Edge	Physical device; free inbound after delivery
Exabytes	AWS Snowmobile	Truck-sized data transfer unit
Ongoing large files, transfer acceleration	S3 Transfer Acceleration	Uses CloudFront edge network backbone

🎯

Use VPC endpoints (Gateway for S3/DynamoDB) — eliminate NAT Gateway data processing charges for S3 traffic from EC2 in private subnets. A common exam answer to "reduce data transfer costs for S3."

⚠️

Common traps:

"EC2 → S3 in the same region is always free" — FALSE; traffic through a NAT Gateway incurs data processing charges even within the same region. Use a free Gateway VPC Endpoint to avoid it.
"DataSync and Transfer Family serve the same purpose" — FALSE; DataSync is for automated bulk migration; Transfer Family provides managed SFTP/FTP endpoints for ongoing partner file exchange.
"Snow device data ingestion to AWS is charged" — FALSE; data loading after device return is free. Only device rental and shipping are charged.
"Transfer Acceleration always speeds up uploads" — FALSE; AWS only charges you if acceleration is actually faster — if not beneficial, the transfer is not accelerated and not charged.

Task 4.2

Cost-Optimized Compute Solutions

Purchasing options, instance right-sizing, serverless vs. EC2, load balancing strategy.

💰 EC2 Purchasing Options — On-Demand, Reserved, Spot, Savings Plans

Cost · Very High Frequency

Option	Discount vs On-Demand	Commitment	Interruption	Best For
On-Demand	Baseline (no discount)	None	None	Unpredictable, short-term, dev/test
Reserved Instances (1-yr)	Up to 40%	1 year	None	Steady-state, predictable workloads
Reserved Instances (3-yr)	Up to 72%	3 years	None	Long-term committed workloads
Savings Plans (Compute)	Up to 66%	1 or 3 yr ($/hr spend)	None	Flexible: any instance family, region, OS
Spot Instances	Up to 90%	None	✅ 2-min notice	Fault-tolerant, stateless, batch, CI/CD
Dedicated Hosts	Higher cost	On-Demand or Reserved	None	BYOL (per-socket/per-core licensing)
AWS Outposts	—	3–5 yr	None	On-prem workloads needing AWS APIs + low latency

Spot Best Practices

Use Spot for stateless, fault-tolerant workloads: batch jobs, CI/CD agents, ML training, video encoding
Spot Fleet / EC2 Fleet: Automatically diversifies across instance types and AZs to maintain target capacity
Use hibernate option to preserve instance state on interruption
Mix On-Demand (baseline) + Spot (burst) in Auto Scaling Groups for cost + availability balance

Right-Sizing Strategies

AWS Compute Optimizer: ML-based recommendations for EC2, Lambda, EBS, ECS on Fargate
AWS Trusted Advisor: flags low-utilization EC2 instances (<10% CPU for 4+ days)
Use T-family burstable instances for workloads with low baseline + occasional spikes

graph TD
  Q1{"Workload type?"}
  Q1 -->|"Fault-tolerant / batch / CI-CD"| Spot["Spot Instances\nUp to 90% off\n2-min interruption notice"]
  Q1 -->|"Steady-state production"| Q2{"Commitment?"}
  Q1 -->|"Short-term / unpredictable / dev"| OD["On-Demand\nNo commitment, full price"]
  Q2 -->|"1-3 yr, any family/region"| SP["Compute Savings Plans\nUp to 66% off"]
  Q2 -->|"1-3 yr, fixed type"| RI["Reserved Instances\nUp to 72% off"]
  Q2 -->|"BYOL per-socket licensing"| DH["Dedicated Hosts\nPhysical server control"]
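
The same decision flow can be written as a small function, which may be easier to memorize than the diagram. The parameter names are illustrative, and the order of the checks is one reasonable reading of the graph rather than an official AWS rule.

```python
def purchasing_option(fault_tolerant: bool, steady_state: bool,
                      needs_byol_host: bool = False,
                      fixed_instance_type: bool = False) -> str:
    """Pick an EC2 purchasing option following the decision graph."""
    if fault_tolerant:
        return "Spot Instances"            # interruption is acceptable
    if not steady_state:
        return "On-Demand"                 # short-term or unpredictable
    if needs_byol_host:
        return "Dedicated Hosts"           # per-socket / per-core licensing
    if fixed_instance_type:
        return "Reserved Instances"        # committed to one instance type
    return "Compute Savings Plans"         # committed spend, flexible usage


if __name__ == "__main__":
    print(purchasing_option(fault_tolerant=True, steady_state=False))
    print(purchasing_option(fault_tolerant=False, steady_state=True))
    print(purchasing_option(False, True, fixed_instance_type=True))
```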

🎯

Steady, always-on prod workloads → Reserved or Compute Savings Plans. Maximum savings + can tolerate interruption → Spot. BYOL Oracle/Windows → Dedicated Hosts. Mix for ASG → On-Demand base capacity + Spot for scaling. Most flexible discount → Compute Savings Plans (applies to Lambda and Fargate too).

⚠️

Common traps:

"Standard RIs can be moved to a different instance family" — FALSE; Standard RIs lock instance family, OS, and tenancy (regional Linux RIs with default tenancy are size-flexible within the family). Only Convertible RIs can be exchanged for a different family.
"Compute Savings Plans cover RDS" — FALSE; they cover EC2, Lambda, and Fargate only. RDS has its own Reserved Instance program.
"Dedicated Instances and Dedicated Hosts are equivalent" — FALSE; Dedicated Hosts give physical server-level control needed for BYOL licensing; Dedicated Instances just run on dedicated hardware without host-level visibility.
"Spot interruptions always terminate instances" — FALSE; depending on launch configuration, the instance can be stopped or hibernated instead of terminated.

⚖️ Serverless vs. EC2 Cost Trade-offs & Load Balancer Selection

Cost

Serverless vs. EC2

	Lambda / Fargate	EC2
Cost model	Pay per request + duration (Lambda) or per vCPU/memory-second (Fargate)	Pay per second/hour while running; stopped instances still incur EBS storage charges
Idle cost	$0	Full instance cost
Best for	Sporadic, event-driven, variable scale	Steady high-throughput, long-running, GPU workloads
Break-even	<~50% utilization favors serverless	>~50% utilization favors EC2 + Reserved

Load Balancer Cost Comparison

ALB: LCU-based pricing (connections, bandwidth, rules, new connections) — most cost-effective for HTTP/S at moderate scale
NLB: LCU-based but for TCP/UDP — lower cost for simple TCP load balancing than ALB
Classic LB: Legacy; more expensive per feature than ALB/NLB — migrate away

🎯

Low-traffic event-driven API → Lambda + API GW (zero idle cost). High-traffic 24/7 API → EC2 + Reserved Instances (predictable cost). Production + non-production same account → tag-based cost allocation; separate non-prod to dev account with separate budgets.

⚠️

Common traps:

"Lambda is always cheaper than EC2" — FALSE; at sustained high invocation rates Lambda cost exceeds a Reserved EC2 instance. Break-even is roughly 50% average utilization.
"ALB and NLB pricing is identical" — FALSE; ALB LCUs factor in rule evaluations and new connections; NLB LCUs factor in flows and bandwidth. At high concurrent TCP connection counts, NLB is typically cheaper.
"Classic Load Balancer is acceptable for new architectures" — FALSE; CLB is legacy with no new feature development — always use ALB (L7) or NLB (L4).
"Lambda concurrency scales infinitely" — FALSE; account-level concurrency limit is 1,000 by default per region; unreserved concurrency is shared across all functions.

Task 4.3

Cost-Optimized Database Solutions

Database right-sizing, caching, backup policies, serverless databases, migration for cost savings.

🗄️ Database Cost Levers — Sizing, Caching & Serverless

DB Cost · Medium Frequency

Cost Reduction Strategies

Strategy	How	Savings
Caching	ElastiCache/DAX in front of RDS/DynamoDB — reduces read load → smaller DB instance	High
Read Replicas	Offload analytics to replica → downsize primary instance	Medium
Aurora Serverless v2	Pay per ACU consumed; scales down to a 0.5 ACU minimum (not zero)	High for variable workloads
DynamoDB On-Demand → Provisioned	If traffic is predictable, provisioned is cheaper	Medium
RDS Reserved Instances	1-yr or 3-yr commitment for steady-state DBs	Up to 69%
Right-size DB instance	Use CloudWatch to identify underutilized DB → downsize	Medium

Data Retention & Backup Cost

Set RDS automated backup retention to what's actually needed (1–35 days) — don't keep 35 days if 7 suffices
Manual RDS snapshots persist until deleted — automate cleanup with Lambda or AWS Backup lifecycle rules
DynamoDB: On-Demand backups billed per GB; PITR costs ~$0.20/GB-month — enable only where needed

Database Migration for Cost Savings

Heterogeneous migration: Oracle/SQL Server → Aurora PostgreSQL using AWS SCT + DMS → eliminate expensive license costs
DynamoDB vs. RDS: DynamoDB has no per-seat or per-engine licensing; pure consumption billing

🎯

Dev/test databases → stop RDS instances nights/weekends (automated with EventBridge + Lambda). Migrate away from Oracle → Aurora PostgreSQL with DMS saves substantial license cost. Variable DynamoDB traffic → On-Demand mode (no capacity planning); predictable → Provisioned + Auto Scaling.
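
The nightly shutdown can be sketched as a Lambda handler triggered by an EventBridge schedule. The `env=dev` tag convention is an assumption made for this example. The boto3 calls used are `describe_db_instances` and `stop_db_instance`, and pagination is omitted for brevity.

```python
from typing import List


def stop_tagged_instances(rds_client, tag_key: str = "env",
                          tag_value: str = "dev") -> List[str]:
    """Stop every available RDS instance carrying the given tag."""
    stopped = []
    response = rds_client.describe_db_instances()
    for db in response["DBInstances"]:
        tags = {t["Key"]: t["Value"] for t in db.get("TagList", [])}
        # Only running instances with the matching tag are stopped.
        if db["DBInstanceStatus"] == "available" and tags.get(tag_key) == tag_value:
            rds_client.stop_db_instance(
                DBInstanceIdentifier=db["DBInstanceIdentifier"])
            stopped.append(db["DBInstanceIdentifier"])
    return stopped


def lambda_handler(event, context):
    # Imported here so the helper above can be exercised without AWS access.
    import boto3
    return {"stopped": stop_tagged_instances(boto3.client("rds"))}
```

A matching morning schedule would call `start_db_instance`. Stopped instances still accrue storage charges and restart automatically after 7 days, so the schedule needs to run at least weekly.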

⚠️

Common traps:

"Aurora Serverless v2 scales to zero" — FALSE; v2 scales down to 0.5 ACU minimum, not zero. v1 could scale to zero (with cold-start penalty).
"ElastiCache always reduces database costs" — FALSE; ElastiCache adds its own hourly cost. It only saves money if the cache hit ratio is high enough that DB instance downsizing or fewer read replicas offset the cache cost.
"Stopping an RDS instance costs nothing" — FALSE; stopped RDS instances still incur storage charges. After 7 days, stopped instances automatically restart.
"DynamoDB On-Demand mode is always more expensive than Provisioned" — FALSE; for very spiky or unpredictable traffic, On-Demand avoids over-provisioned WCU/RCU waste and can be cheaper overall.

Task 4.4

Cost-Optimized Network Architectures

NAT Gateway cost, VPC endpoints, network topology, CDN strategy, throttling, bandwidth allocation.

🌉 NAT Gateway, VPC Endpoints & Network Cost Reduction

Network Cost · High Frequency

NAT Gateway Cost Considerations

NAT Gateway: $0.045/hr per AZ + $0.045/GB data processed — significant at scale
One NAT GW per AZ: More expensive but prevents cross-AZ data transfer charges; use for production
Single shared NAT GW: Cheaper but cross-AZ traffic incurs $0.01/GB each direction
NAT Instance (legacy): Cheaper for low-traffic but requires management; no HA without scripting
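
The trade-off between one shared NAT Gateway and one per AZ comes down to simple arithmetic. The sketch below uses the prices quoted above and assumes 730 hours per month and $0.02/GB for traffic that crosses an AZ boundary to reach the shared gateway ($0.01 in each direction). Both are simplifying assumptions.

```python
HOURS_PER_MONTH = 730        # assumption: average month
NAT_HOURLY = 0.045           # $/hr per NAT Gateway
NAT_PER_GB = 0.045           # $/GB processed
CROSS_AZ_PER_GB = 0.02       # assumption: $0.01 out + $0.01 in


def single_nat_cost(total_gb: float, cross_az_gb: float) -> float:
    """Monthly cost of one shared NAT Gateway."""
    return (NAT_HOURLY * HOURS_PER_MONTH
            + NAT_PER_GB * total_gb
            + CROSS_AZ_PER_GB * cross_az_gb)


def per_az_nat_cost(total_gb: float, az_count: int) -> float:
    """Monthly cost of one NAT Gateway in every AZ (no cross-AZ hops)."""
    return NAT_HOURLY * HOURS_PER_MONTH * az_count + NAT_PER_GB * total_gb


def break_even_cross_az_gb(az_count: int) -> float:
    """Cross-AZ GB per month at which both designs cost the same."""
    return NAT_HOURLY * HOURS_PER_MONTH * (az_count - 1) / CROSS_AZ_PER_GB


if __name__ == "__main__":
    print(round(break_even_cross_az_gb(3)))        # GB/month for a 3-AZ VPC
    print(round(single_nat_cost(100, 50), 2))      # low traffic, shared gateway
    print(round(per_az_nat_cost(100, 3), 2))       # low traffic, one per AZ
```

Under these assumptions each extra gateway pays for itself once roughly 1.6 TB per month would otherwise cross AZs. Below that, the shared design is cheaper but gives up AZ-level resilience.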

VPC Endpoints — Eliminate NAT/Internet Costs

Type	Services	Cost
Gateway Endpoint	S3, DynamoDB only	Free — no hourly charge or data fee
Interface Endpoint (PrivateLink)	100s of AWS services (SSM, KMS, ECR, etc.)	~$0.01/hr per AZ + $0.01/GB

Network Routing for Cost Optimization

Keep traffic within the same AZ where possible — cross-AZ = ~$0.02/GB round trip
Use VPC peering instead of Transit Gateway for simple two-VPC connections — TGW adds per-attachment + data processing fees
CloudFront origin shield: consolidates origin requests, reduces S3/EC2 egress
Compress API responses before sending — reduces data transfer costs

Direct Connect vs. VPN vs. Internet for Cost

	Internet	VPN	Direct Connect
Setup cost	Lowest	Low	Higher (port fees)
Data transfer	Standard egress rates	Standard egress rates	Reduced egress rates
Break-even	Low volume	Low–medium volume	High volume (10s of TB/month)

Throttling & Bandwidth Optimization

API Gateway throttling prevents backends from being overloaded — reduces compute cost from traffic spikes
Multiple smaller Direct Connect connections vs. one large: same total bandwidth but more resilience; evaluate cost per Gbps
AWS Cost Explorer: enable network cost analysis to identify cross-AZ transfer hotspots

🎯

Biggest network cost wins: (1) S3 Gateway Endpoint — eliminate NAT GW charges for S3 access (free). (2) Keep EC2 → RDS in same AZ — eliminate cross-AZ charges. (3) CloudFront — cheaper egress than direct S3/EC2 + caches content. (4) Review cross-AZ data paths — each cross-AZ byte costs money.

⚠️

Common traps:

"One shared NAT Gateway is always cheaper than one per AZ" — NOT ALWAYS; a single NAT GW saves hourly cost but all cross-AZ traffic to it incurs $0.01/GB each way. At high data volumes, per-AZ NAT GWs are cheaper.
"VPC Interface Endpoints are free like Gateway Endpoints" — FALSE; Interface Endpoints (PrivateLink) cost ~$0.01/hr per AZ plus data processing fees. Only S3 and DynamoDB Gateway Endpoints are free.
"Transit Gateway is cheaper than VPC peering for two VPCs" — FALSE; TGW charges per attachment and per GB processed. For two VPCs, direct peering has no hourly or data charge (only standard EC2 data transfer rates).
"CloudFront eliminates all origin data transfer costs" — FALSE; CloudFront reduces origin load by caching, but on cache misses it still fetches from the origin (incurring data transfer). CloudFront egress is cheaper per GB than direct S3/EC2 egress, but costs are not zero.

📉 Cost Management Tools — Cost Explorer, Budgets & Trusted Advisor

FinOps · Medium Frequency

Tool	Purpose	Key Feature
AWS Cost Explorer	Visualize, analyze, and forecast spend	Right-sizing recommendations; Reserved/Savings Plan utilization
AWS Budgets	Set cost/usage/RI thresholds + alerts	Alert via SNS when spend exceeds budget; forecast-based alerts
Cost & Usage Report (CUR)	Granular billing data to S3	Most detailed billing data; feed to Athena/Redshift for custom analysis
AWS Trusted Advisor	Best practice checks across all pillars	Cost: idle EBS, low-utilization EC2, unused RIs, unassociated EIPs
AWS Compute Optimizer	ML-based resource right-sizing	EC2, Lambda, EBS, ECS on Fargate recommendations
Cost Allocation Tags	Tag resources by team/project/env	Enables per-tag cost breakdown in Cost Explorer and CUR

🎯

Alert when monthly bill exceeds $500 → AWS Budgets. Detailed billing for chargeback analysis → Cost & Usage Report + Athena. Which EC2 to downsize → Compute Optimizer. Unused Reserved Instances → Cost Explorer RI utilization report or Trusted Advisor.

⚠️

Common traps:

"AWS Budgets prevents spending from exceeding the threshold" — FALSE; on its own, Budgets only alerts you and does not stop resources from running. To enforce limits, attach Budgets Actions (apply an IAM policy or an AWS Organizations SCP, or stop targeted EC2/RDS instances).
"Cost Explorer shows real-time spend" — FALSE; Cost Explorer data has an up-to-24-hour delay. Use the Billing Dashboard for near-real-time spend.
"Cost Allocation Tags automatically appear in Cost Explorer" — FALSE; you must activate cost allocation tags in the Billing console before they appear as filterable dimensions.
"Compute Optimizer and Trusted Advisor give the same right-sizing recommendations" — FALSE; Compute Optimizer uses ML and 14 days of CloudWatch metrics for granular recommendations. Trusted Advisor uses simpler 14-day CPU/network averages with coarser thresholds.
