AWS Certified Solutions Architect — Associate · SAA-C03

Complete Study Guide
All Four Domains

📋 Domains: 4 ⚡ Pass Score: 720 / 1000 ❓ Scored Questions: 50

Domain 1 Overview

Design architectures that protect AWS resources, workloads, data, and network traffic. Covers IAM, multi-account governance, network security, threat protection, and data encryption controls.

⚡ 30% of scored content — Highest weighted domain
📊 Visual Study Guides — Exam Overview & Domain 1

[Figure: SAA-C03 Exam Blueprint — domain weights & exam mechanics]
[Figure: 4 Pillars of AWS Solutions Architecture overview]
[Figure: Domain 1 — Secure Architecture Design Study Guide]
[Figure: Domain 1 — Cloud Security Architecture Blueprint]

Task 1.1

Design Secure Access to AWS Resources

IAM, federated identity, SCPs, multi-account strategy, shared responsibility model.

Knowledge of:
  • Access controls and management across multiple accounts
  • AWS federated access and identity services (for example, IAM, AWS IAM Identity Center)
  • AWS global infrastructure (for example, Availability Zones, AWS Regions)
  • AWS security best practices (for example, the principle of least privilege)
  • The AWS shared responsibility model
Skills in:
  • Applying AWS security best practices to IAM users and root users (for example, multi-factor authentication [MFA])
  • Designing a flexible authorization model that includes IAM users, groups, roles, and policies
  • Designing a role-based access control strategy (for example, AWS STS, role switching, cross-account access)
  • Designing a security strategy for multiple AWS accounts (for example, AWS Control Tower, service control policies [SCPs])
  • Determining the appropriate use of resource policies for AWS services
  • Determining when to federate a directory service with IAM roles
🔀 AWS Shared Responsibility Model
Foundational · Exam Fave

AWS and the customer divide security obligations at a clear boundary. The exam tests this boundary constantly.

The Split
AWS — "Security OF the Cloud" | Customer — "Security IN the Cloud"
Physical datacenters, hardware, networking, hypervisor | OS patches, app code, data encryption
Managed service durability & HA (S3, RDS failover) | IAM policies, S3 bucket policies, security groups
Global infrastructure (Regions, AZs, Edge) | Data classification and access management
Scenario

EC2 runs an unpatched Apache web server — who's responsible for the patch? The customer. AWS delivers the hardware and hypervisor; OS-level software is the customer's domain.

💡

Mnemonic: AWS secures OF the cloud (Physical/Infra). You secure IN the cloud (Data/Access). Think: Owned by AWS = OF; Input by Customer = IN.

graph TD
  subgraph Customer ["Customer: Security IN the Cloud"]
    Data["Customer Data"]
    IAM["IAM & Access"]
    OS["OS, Network & Firewall Config"]
    Encrypt["Client/Server Encryption"]
  end
  subgraph AWS ["AWS: Security OF the Cloud"]
    Compute["Compute / Storage"]
    Net["Networking"]
    Infra["Global Infra (Regions, AZs)"]
  end
  Customer --- AWS
🎯

Rule of thumb: The more managed the service (Lambda, DynamoDB), the more AWS owns. You always own your data and access controls regardless of service type.

⚠️

Common traps:

  • "AWS is responsible for patching RDS OS" — TRUE for RDS (managed), FALSE for EC2.
  • "AWS encrypts S3 by default so customer doesn't need to manage access" — FALSE; encryption ≠ access control.
  • "Customers are never responsible for network infrastructure" — FALSE in hybrid setups; the customer owns their on-premises side of a Direct Connect or VPN link.
  • Questions often swap "of" and "in" — read carefully.
🔑 IAM — Users, Groups, Roles & Policies
IAM · High Frequency

IAM is the control plane for all AWS access. Every exam scenario touches IAM at some level.

Entity Types
Entity | What It Is | When to Use
User | Long-term credentials for a person or app | Human workforce with permanent access
Group | Collection of users sharing policies | Assign permissions by job function
Role | Short-term STS credentials — no static keys | EC2/Lambda/cross-account/federated access
Policy | JSON Allow/Deny on actions and resources | Attached to any entity to grant/restrict permissions
Policy Types in Evaluation (broadest guardrail first)
  • SCPs — Org-level guardrails; constrain everything below
  • Permissions Boundaries — Max permissions a delegated entity can have
  • Identity-based Policies — Attached directly to user/group/role
  • Resource-based Policies — Attached to the resource (S3, KMS key, etc.)
  • Session Policies — Temporary scope passed at AssumeRole time
Evaluation Rule

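Default = implicit Deny. Explicit Deny always wins, even over an explicit Allow. An Allow takes effect only when no applicable policy contains an explicit Deny and every guardrail in play (SCP, permissions boundary, session policy) also allows the action.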
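
The evaluation rule can be sketched in a few lines of Python. This is a simplified model, not the real IAM engine: it assumes statements carry only Effect, Action, and Resource, ignores Principal, Condition, and NotAction, and matches wildcards case-sensitively (IAM action names are actually case-insensitive). The function name `evaluate` and the sample ARNs are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """Policy fields may be a single string or a list of strings."""
    return value if isinstance(value, list) else [value]


def evaluate(statements, action, resource):
    """Return 'Allow' or 'Deny' for one request against a list of statements."""
    allowed = False
    for stmt in statements:
        action_match = any(fnmatchcase(action, a) for a in _as_list(stmt["Action"]))
        resource_match = any(fnmatchcase(resource, r) for r in _as_list(stmt["Resource"]))
        if not (action_match and resource_match):
            continue
        if stmt["Effect"] == "Deny":
            return "Deny"  # an explicit Deny wins immediately
        allowed = True  # remember the Allow, keep scanning for a Deny
    # No matching statement at all falls through to the implicit Deny.
    return "Allow" if allowed else "Deny"


statements = [
    {"Effect": "Allow", "Action": "s3:*", "Resource": "arn:aws:s3:::my-bucket/*"},
    {"Effect": "Deny", "Action": "s3:DeleteObject", "Resource": "arn:aws:s3:::my-bucket/*"},
]
print(evaluate(statements, "s3:GetObject", "arn:aws:s3:::my-bucket/a.txt"))
print(evaluate(statements, "s3:DeleteObject", "arn:aws:s3:::my-bucket/a.txt"))
print(evaluate(statements, "ec2:RunInstances", "arn:aws:ec2:::instance/x"))
```

The three calls cover the three outcomes worth memorizing: a matching Allow, an Allow overridden by an explicit Deny, and a request that nothing matches.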

Least-Privilege Policy — S3 Read-Only
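Developer reads only from a specific bucket prefix:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListDevTeamPrefix",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-bucket",
      "Condition": {"StringLike": {"s3:prefix": "dev-team/*"}}
    },
    {
      "Sid": "ReadDevTeamObjects",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-bucket/dev-team/*"
    }
  ]
}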
🎯

Principle of Least Privilege: Grant only the minimum required. Prefer IAM Roles over long-term access keys. Never embed credentials in code — use instance profiles or Secrets Manager.

💡

Mnemonic: PIRATES evaluate policies: Permissions boundaries, Identity policies, Resource policies, All Together Explicit deny Supersedes.

⚠️

Common traps:

  • "Groups can be nested inside other groups" — FALSE; IAM groups are flat.
  • "An explicit Deny in a resource policy is overridden by an explicit Allow in an identity policy" — FALSE; explicit Deny always wins.
  • "Attaching a policy to a group grants permissions to the group itself" — FALSE; a group is not a principal (it cannot sign in or be named in a policy's Principal element), so permissions flow only to its member users.
  • "Permissions Boundaries grant permissions" — FALSE; they only restrict the maximum.
👑 Root User Security & MFA
IAM · Exam Fave
Root User Best Practices
  • Enable MFA immediately after account creation — use hardware MFA for maximum security
  • Never create access keys for root — use IAM users/roles for all programmatic access
  • Lock root credentials; share with nobody; store hardware MFA token securely offline
  • Use root only for the small set of tasks that only root can perform
Root-Only Tasks

Change account root email/password · Enable MFA Delete on S3 · Activate IAM Billing access · Restore IAM admin when locked out · Change AWS Support plan · Close the AWS account · Register as Reserved Instance Marketplace seller.

MFA Types
Type | Example | Recommended For
Virtual MFA | Google Authenticator, Authy | Standard IAM users
Hardware MFA (TOTP) | Gemalto token | Privileged / root accounts
Hardware MFA (FIDO) | YubiKey | Highest assurance — root, break-glass
SMS MFA | Text message OTP | Not recommended (SIM-swap risk)
🎯

Exam will describe a scenario and ask which MFA type to recommend. Hardware MFA (FIDO/YubiKey) = highest security. For root, always recommend hardware. For general employees, virtual MFA is acceptable.

💡

Mnemonic: Root is MACS: MFA Delete, Account closing, Change support plan, Sign up for GovCloud. Only Root can do MACS.

⚠️

Common traps:

  • "IAM admin can perform all root tasks" — FALSE; some tasks (like MACS) require root exclusively.
  • "Enabling MFA on root prevents all unauthorized root access" — partially true, but root access keys bypass MFA on CLI calls — never create root access keys.
  • "SCPs restrict the root user of the management account" — FALSE; SCPs do NOT apply to the management account's root user.
🔄 IAM Roles, AWS STS & Cross-Account Access
IAM · STS · High Frequency

Roles provide temporary credentials via STS — the preferred pattern for granting access to services, cross-account scenarios, and federated users.

How Role Assumption Works
  • Principal calls sts:AssumeRole → STS issues temporary credentials (AccessKeyId + SecretKey + SessionToken), valid 15 min–12 hrs
  • Trust Policy on the role defines who can assume it (the principal)
  • Permission Policy on the role defines what they can do
  • EC2 uses an Instance Profile — SDK auto-fetches and rotates credentials from IMDS
Cross-Account Pattern

Dev account (Account A) needs to read S3 in Prod account (Account B). Solution: Create an IAM Role in Account B with a trust policy allowing Account A's IAM principal. Devs call sts:AssumeRole → get scoped temp creds → access S3. No permanent keys are shared between accounts.

sequenceDiagram
  participant A as Account A (Dev)
  participant STS as AWS STS
  participant B as Account B (Prod S3)
  Note over B: Trust Policy allows Account A principal
  A->>STS: AssumeRole(RoleARN in Account B)
  STS-->>A: Temp Credentials (15 min–12 hr)
  A->>B: Access S3 with Temp Credentials
  Note over A,B: No permanent keys shared across accounts
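
As a rough illustration of the two halves of this pattern, the sketch below builds the trust policy that would sit on the role in Account B and the parameters a caller in Account A would send to STS. The account IDs, role name, and helper names are hypothetical. Trusting the `:root` principal delegates the decision to Account A's own IAM policies, which is one common choice rather than the only one.

```python
import json


def build_trust_policy(trusted_account_id, require_mfa=False):
    """Trust policy for the role in the resource account (who may assume it)."""
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if require_mfa:
        # Optional hardening: only sessions authenticated with MFA may assume.
        statement["Condition"] = {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
    return {"Version": "2012-10-17", "Statement": [statement]}


def assume_role_request(role_arn, session_name, duration_seconds=3600):
    """Keyword arguments for an STS AssumeRole call (15 minutes to 12 hours)."""
    if not 900 <= duration_seconds <= 43200:
        raise ValueError("DurationSeconds must be between 900 and 43200")
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "DurationSeconds": duration_seconds,
    }


# Hypothetical IDs: 111111111111 = Account A (Dev), 222222222222 = Account B (Prod).
trust_policy = build_trust_policy("111111111111")
request = assume_role_request("arn:aws:iam::222222222222:role/ProdS3ReadOnly", "dev-session")
print(json.dumps(trust_policy, indent=2))
```

With boto3 installed and credentials for Account A configured, the request dictionary could be passed as `boto3.client("sts").assume_role(**request)`. The role's maximum session duration setting still caps whatever `DurationSeconds` asks for.
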
Key STS API Calls
API | Use Case
AssumeRole | Cross-account or service-to-service access
AssumeRoleWithWebIdentity | Federated via OIDC (Cognito, Google, GitHub)
AssumeRoleWithSAML | Federated via corporate IdP (ADFS, Okta)
GetSessionToken | Add MFA enforcement to an existing user session
🎯

"EC2 needs access to S3" → Attach an IAM Role via Instance Profile. Never store access keys on the instance. The SDK auto-retrieves credentials from http://169.254.169.254/latest/meta-data/.

⚠️

Common traps:

  • "AssumeRole credentials never expire" — FALSE; they are temporary (max 12 hours).
  • "A role can only be assumed by one service" — FALSE; the trust policy can list multiple principals.
  • "Cross-account access requires VPC peering" — FALSE; it uses IAM role assumption via STS, which is an AWS API call with no network dependency.
  • "Instance Profile = IAM Role" — not exactly; an Instance Profile is the container that holds a role and attaches to EC2.
🏢 Multi-Account Strategy: Organizations, SCPs & Control Tower
Organizations · High Frequency
Key Concepts
  • Management Account: Creates the Org; cannot be restricted by SCPs
  • Member Accounts: Subject to SCPs from management account or OUs above them
  • Organizational Units (OUs): Logical groupings — Production OU, Dev OU, Sandbox OU
  • SCPs: Allow/Deny policies at Org/OU/Account level — they are guardrails, never grants by themselves
SCP vs. IAM Policy
Feature | SCP | IAM Policy
Scope | Account / OU / Org | User / Role / Group
Grants permissions | ❌ No — only restricts | ✅ Yes
Applies to root user | ✅ Yes (member accounts) | ❌ No
Can override | Trumps identity policies | Overridden by SCP
SCP Scenario

SCP on "Dev OU" denies ec2:TerminateInstances. An IAM admin in a member account tries to terminate EC2. Result: DENIED. SCP is an absolute ceiling — even AdministratorAccess cannot exceed what the SCP permits.
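
The "ceiling" behavior in this scenario can be modeled as an intersection: an action goes through only if the SCP layer permits it and an IAM policy grants it. The sketch below is a simplified model with hypothetical names. It tracks action patterns only and ignores resources, conditions, and the OU inheritance chain.

```python
from fnmatch import fnmatchcase


def _matches(patterns, action):
    """True if the action matches any wildcard pattern in the list."""
    return any(fnmatchcase(action, p) for p in patterns)


def is_permitted(action, scp_allow, scp_deny, iam_allow):
    """True only if the SCP layer and the IAM layer both permit the action."""
    if _matches(scp_deny, action):
        return False  # an SCP Deny is a ceiling that IAM cannot exceed
    if not _matches(scp_allow, action):
        return False  # the SCP must also allow the action
    return _matches(iam_allow, action)  # an SCP never grants anything itself


DEV_OU_SCP_ALLOW = ["*"]  # comparable to the default FullAWSAccess SCP
DEV_OU_SCP_DENY = ["ec2:TerminateInstances"]
ADMIN_IAM_ALLOW = ["*"]  # comparable to AdministratorAccess

print(is_permitted("ec2:TerminateInstances", DEV_OU_SCP_ALLOW, DEV_OU_SCP_DENY, ADMIN_IAM_ALLOW))
print(is_permitted("ec2:RunInstances", DEV_OU_SCP_ALLOW, DEV_OU_SCP_DENY, ADMIN_IAM_ALLOW))
print(is_permitted("ec2:RunInstances", DEV_OU_SCP_ALLOW, DEV_OU_SCP_DENY, []))
```

The last call is the reverse trap: an SCP that allows everything still yields no access when no IAM policy grants the action.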

graph TD
  Root["Root"] --> Mgmt["Management Account\n(not restricted by SCPs)"]
  Root --> ProdOU["Production OU\nSCP: Deny DeleteBucket"]
  Root --> DevOU["Dev OU\nSCP: Deny EC2 Terminate"]
  Root --> Sandbox["Sandbox OU"]
  ProdOU --> ProdA["Account: Prod-US"]
  ProdOU --> ProdB["Account: Prod-EU"]
  DevOU --> DevA["Account: Dev-Team"]
  Sandbox --> SandA["Account: Sandbox"]
AWS Control Tower
  • Automates multi-account setup with a Landing Zone (secure baseline)
  • Guardrails: Preventive (SCPs) + Detective (Config rules) applied across all accounts
  • Account Factory: Vends new accounts with standard config — ideal for "spin up 50 accounts" scenarios
  • Integrates with IAM Identity Center for SSO across all accounts
🎯

Prevent account from leaving Org → SCP: Deny organizations:LeaveOrganization. 50 accounts with security baseline → Control Tower Account Factory. Centralized billing → AWS Organizations Consolidated Billing.

💡

Mnemonic: SCPs are a Ceiling, not a Grant. They don't Grant permissions, they just set the Ceiling for what's possible.

⚠️

Common traps:

  • "An SCP with Allow * grants full access" — FALSE; SCPs alone do not grant permissions; IAM policies must also allow the action.
  • "SCPs apply to the management account" — FALSE; SCPs never restrict the management account.
  • "Attaching an SCP to an OU immediately affects all child accounts" — TRUE and often tested as a gotcha when students expect a manual rollout.
  • "Control Tower replaces Organizations" — FALSE; Control Tower runs on top of Organizations.
🔗 Federated Identity — IAM Identity Center, SAML, OIDC
Federation · Exam Fave

Federation lets an external Identity Provider authenticate users and map them to IAM roles — no individual IAM users needed for every employee.

Options Compared
Option | Use Case | Protocol
IAM Identity Center (SSO) | Workforce SSO across many AWS accounts + SaaS apps | SAML 2.0 / OIDC
SAML 2.0 Federation | Corporate IdP (ADFS) → AWS Console or CLI | SAML
OIDC / Web Identity | Mobile/web app users via Cognito, Google, GitHub Actions | OIDC / OAuth 2.0
AWS Directory Service | Extend on-prem AD to AWS; Managed Microsoft AD | Kerberos / LDAP
Scenario

Company has 1,000 AD employees. They need AWS console access without separate IAM users. Solution: Configure IAM Identity Center with AD as identity source → map AD groups to Permission Sets → employees log in with AD credentials and get access to assigned accounts.

🎯

"Employees / SSO / multiple accounts" → IAM Identity Center. "Mobile app / social login / customers" → Amazon Cognito. "GitHub Actions accessing AWS" → OIDC with IAM role (no stored access keys).

⚠️

Common traps:

  • "IAM Identity Center and Cognito are interchangeable" — FALSE; Identity Center = workforce/employees, Cognito = customer/consumer apps.
  • "SAML federation creates IAM users for each federated user" — FALSE; federated users assume IAM roles, no IAM users are created.
  • "You need AWS Directory Service to use IAM Identity Center" — FALSE; you can use an external IdP like Okta directly.
  • "OIDC tokens from Cognito can directly call AWS APIs" — FALSE; they must be exchanged via an Identity Pool for temporary STS credentials first.
Task 1.2

Design Secure Workloads and Applications

VPC security, endpoint security, WAF, Shield, Cognito, GuardDuty, Secrets Manager, hybrid connectivity.

Knowledge of:
  • Application configuration and credentials security
  • AWS service endpoints
  • Control ports, protocols, and network traffic on AWS
  • Secure application access
  • Security services with appropriate use cases (for example, Amazon Cognito, Amazon GuardDuty, Amazon Macie)
  • Threat vectors external to AWS (for example, DDoS, SQL injection)
  • Data access and governance
  • Data recovery
  • Data retention and classification
  • Encryption and appropriate key management
Skills in:
  • Designing VPC architectures with security components (for example, security groups, route tables, network ACLs, NAT gateways)
  • Determining network segmentation strategies (for example, using public subnets and private subnets)
  • Integrating AWS services to secure applications (for example, AWS Shield, AWS WAF, IAM Identity Center, AWS Secrets Manager)
  • Securing external network connections to and from the AWS Cloud (for example, VPN, AWS Direct Connect)
  • Aligning AWS technologies to meet compliance requirements
  • Encrypting data at rest (for example, AWS KMS)
  • Encrypting data in transit (for example, AWS Certificate Manager [ACM] using TLS)
  • Implementing access policies for encryption keys
  • Implementing data backups and replications
  • Implementing policies for data access, lifecycle, and protection
  • Rotating encryption keys and renewing certificates
🌐 VPC Security — Subnets, Security Groups, NACLs, NAT
Networking · High Frequency
Public vs. Private Subnets
Attribute | Public Subnet | Private Subnet
Route to internet | Via Internet Gateway (IGW) | Via NAT Gateway (outbound only)
Resources here | ALBs, bastion hosts, NAT Gateways | App servers, databases, internal services
Direct inbound from internet | ✅ Yes (if SG permits) | ❌ No
Security Groups vs. Network ACLs
Feature | Security Group | Network ACL
Level | Instance / ENI | Subnet
State | Stateful — return traffic auto-allowed | Stateless — both directions must be allowed
Rule types | Allow only | Allow AND Deny
Rule evaluation | All rules evaluated | Rules processed in order (lowest # wins)
Block specific IP | ❌ Cannot deny | ✅ Use explicit Deny rule
3-Tier Architecture

ALB (public subnet, SG allows 443 from 0.0.0.0/0) → App servers (private subnet, SG allows 8080 from ALB SG only) → RDS (private subnet, SG allows 5432 from App SG only). NAT GW in public subnet lets private instances pull updates without being internet-reachable.

graph TD
  subgraph VPC ["VPC"]
    IGW["Internet Gateway"]
    subgraph Public ["Public Subnet"]
      ALB["ALB (SG: 443)"]
      NAT["NAT Gateway"]
    end
    subgraph Private ["Private Subnet"]
      App["App Server (SG: 8080)"]
      DB["RDS (SG: 5432)"]
    end
    IGW -->|Internet Traffic| ALB
    ALB -->|Port 8080| App
    App -->|Port 5432| DB
    App -->|Outbound Updates| NAT
    NAT --> IGW
  end
🎯

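Block an IP → NACL Deny rule (SGs can't deny). Stateless NACL reminder: for a server that accepts inbound connections, the outbound rules must allow ephemeral ports 1024–65535 so replies can reach clients; for connections the subnet initiates, the inbound rules need the same range.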
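
NACL ordering is easier to remember once it is written as a loop. The sketch below is a simplified, hypothetical model of inbound evaluation: rules are (number, CIDR, port range, action) tuples, checked in ascending rule number, and the first match decides. Protocols and the separate outbound rule set are left out.

```python
from ipaddress import ip_address, ip_network


def nacl_decision(rules, source_ip, port):
    """rules: list of (rule_number, cidr, (low_port, high_port), action)."""
    for _number, cidr, (low, high), action in sorted(rules, key=lambda r: r[0]):
        if ip_address(source_ip) in ip_network(cidr) and low <= port <= high:
            return action  # first matching rule wins; later rules are ignored
    return "deny"  # the implicit '*' rule at the end denies everything else


# Rule 100 (Allow) is evaluated before rule 200 (Deny) for the same CIDR.
same_cidr = [
    (200, "203.0.113.0/24", (443, 443), "deny"),
    (100, "203.0.113.0/24", (443, 443), "allow"),
]
print(nacl_decision(same_cidr, "203.0.113.10", 443))

# Blocking one address: give the Deny a lower number than the broad Allow.
block_one_ip = [
    (90, "198.51.100.7/32", (0, 65535), "deny"),
    (100, "0.0.0.0/0", (443, 443), "allow"),
]
print(nacl_decision(block_one_ip, "198.51.100.7", 443))
print(nacl_decision(block_one_ip, "198.51.100.8", 443))
```

A security group has no equivalent loop. It holds only Allow rules, so order does not matter and a single address cannot be carved out.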

💡

Mnemonic: SG is Stateful at the Group (Instance) level. NACL is Not Stateful, Applies to Complete Location (Subnet level).

⚠️

Common traps:

  • "Security groups are stateless" — FALSE; SGs are stateful (return traffic auto-allowed). NACLs are stateless.
  • "A NACL rule number 100 Allow and rule 200 Deny for the same CIDR — the Deny wins" — FALSE; NACLs process rules in ascending order — rule 100 Allow is evaluated first and traffic is allowed immediately.
  • "NACLs apply to specific EC2 instances" — FALSE; NACLs apply at the subnet level, affecting all resources in that subnet.
  • "You can attach multiple NACLs to a subnet" — FALSE; one NACL per subnet only.
🔐 Application Credentials — Secrets Manager vs. Parameter Store
Credentials · Exam Fave
Feature | Secrets Manager | SSM Parameter Store
Cost | $0.40 / secret / month | Free (Standard); $0.05 / adv. param / month
Auto rotation | ✅ Built-in (RDS, Redshift, DocumentDB) | ❌ Requires custom Lambda
Cross-account | ✅ Resource policy | Limited
Encryption | Always KMS-encrypted | SecureString = KMS; String = plaintext
Best for | DB passwords, API keys needing rotation | Config values, feature flags, non-sensitive params
Pattern

Lambda needs RDS password → Store in Secrets Manager with rotation enabled → Lambda execution role gets secretsmanager:GetSecretValue → password never appears in code or env variables, and rotates automatically without application downtime.
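
A minimal sketch of the fetch step follows. It assumes the secret is stored as a JSON string with `username` and `password` keys, the shape used by database secrets. The secret name `prod/app/db` and the helper names are hypothetical. The client is passed in so the example runs without AWS access, and `FakeSecretsClient` only imitates the `get_secret_value` call of a real Secrets Manager client.

```python
import json


def get_db_credentials(client, secret_id):
    """Fetch and parse a JSON secret; `client` is a Secrets Manager client."""
    response = client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return secret["username"], secret["password"]


class FakeSecretsClient:
    """Stand-in for boto3.client('secretsmanager'), for local illustration only."""

    def __init__(self, value):
        self._value = value

    def get_secret_value(self, SecretId):
        return {"Name": SecretId, "SecretString": json.dumps(self._value)}


client = FakeSecretsClient({"username": "app_user", "password": "example-only"})
username, password = get_db_credentials(client, "prod/app/db")
print(username)
```

In a Lambda function the same helper would receive `boto3.client("secretsmanager")`. Fetching at call time, or caching only briefly, lets the function pick up a rotated password on its next retrieval without a redeploy.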

🎯

Rotation = Secrets Manager. If the question mentions rotating credentials, automatic rotation, or "without application downtime" — Secrets Manager is the answer every time.

💡

Mnemonic: SM = Secrets Manager rotates automatically; PS = Parameter Store is static/cheap.

⚠️

Common traps:

  • "Parameter Store SecureString values are unencrypted" — FALSE; SecureString uses KMS encryption.
  • "Secrets Manager rotates secrets in place, so apps need to handle the change" — FALSE; rotation is designed to be seamless; Secrets Manager updates the secret value and the application retrieves the new value on next fetch.
  • "Parameter Store can automatically rotate RDS passwords" — FALSE; Parameter Store has no built-in rotation for RDS.
  • "SSM Parameter Store is free" — Partially True; Standard tier is free, but Advanced parameters cost money.
🛡️ DDoS Protection — AWS Shield & WAF
DDoS · High Frequency
AWS Shield
Feature | Shield Standard | Shield Advanced
Cost | Free (automatic) | $3,000/month + data transfer
Protection layers | L3/L4 (SYN floods, UDP reflection) | L3/L4/L7 + financial protection
DRT access | ❌ | ✅ AWS DDoS Response Team (now the Shield Response Team, SRT)
Scope | All AWS customers | EC2, ELB, CloudFront, Route 53, Global Accelerator
AWS WAF — Layer 7
  • Attaches to: CloudFront, ALB, API Gateway, AppSync
  • Rules: block SQLi, XSS, bad bots, geo-restriction, IP reputation lists
  • Managed Rule Groups — pre-built, no authoring required (AWS or marketplace)
  • Rate-based rules — block IPs sending too many requests per interval
Architecture

Route 53 → CloudFront (WAF attached, blocks SQLi/XSS at edge) → ALB → EC2 in private subnet. Shield Standard protects CloudFront from volumetric DDoS. Shield Advanced adds financial protection and DRT support.

🎯

Layer mapping: Shield = L3/L4 (volumetric/network). WAF = L7 (HTTP). SQL injection, XSS, HTTP flood → WAF. SYN flood, UDP amplification, volumetric → Shield. Both together = full-stack DDoS protection.

💡

Mnemonic: WAF covers the Web (Layer 7). Shield covers the Network/Transport layers (Layer 3/4) against volumetric attacks.

⚠️

Common traps:

  • "AWS WAF can be attached directly to an EC2 instance" — FALSE; WAF attaches to supported front-door services such as CloudFront, ALB, API Gateway, and AppSync, never to an instance itself.
  • "Shield Standard protects against L7 application-layer attacks" — FALSE; Standard only covers L3/L4.
  • "WAF blocks DDoS automatically without rules" — FALSE; WAF requires explicit rate-based or IP-block rules to act on DDoS.
  • "Shield Advanced covers all AWS services automatically" — FALSE; it must be explicitly enabled on specific resources (ELB, CloudFront, Route 53, EC2 EIP).
👤 Amazon Cognito — User Pools & Identity Pools
Auth · Medium Frequency
Feature | User Pools | Identity Pools
Purpose | Authentication — sign-up/sign-in | Authorization — AWS credentials
Output | JWT tokens (ID, Access, Refresh) | Temp AWS creds via STS
Integrates with | ALB, API GW, social IdPs (Google, Facebook) | IAM roles, S3, DynamoDB
End-to-End Flow

Mobile app → authenticates with User Pool → receives JWT → exchanges JWT at Identity Pool → Identity Pool calls STS → app receives scoped AWS temp creds → uploads directly to user's S3 prefix. User Pool = who you are; Identity Pool = what you can access in AWS.

graph LR
  App["Mobile App"] -->|"1. Sign in"| UP["Cognito\nUser Pool"]
  UP -->|"2. JWT token"| App
  App -->|"3. Exchange JWT"| IP["Cognito\nIdentity Pool"]
  IP -->|"4. AssumeRole"| STS["AWS STS"]
  STS -->|"5. Temp AWS creds"| App
  App -->|"6. Direct access"| S3["S3 Bucket\n(user prefix)"]
🎯

"Mobile / web app / social login / customers" → Cognito. "Employees / workforce / SSO" → IAM Identity Center. The distinction is customer-facing vs. workforce-facing.

💡

Mnemonic: User Pools = User Authentication (Who). Identity Pools = Identity Authorization (What they can do).

⚠️

Common traps:

  • "Cognito User Pool tokens can directly access AWS services like S3" — FALSE; User Pool JWTs authenticate the user but don't grant AWS permissions. You need an Identity Pool to exchange the JWT for STS credentials.
  • "Identity Pools require a User Pool" — FALSE; Identity Pools can also accept tokens from social IdPs, SAML, or even unauthenticated (guest) identities.
  • "Cognito is the right choice for employee workforce SSO" — FALSE; use IAM Identity Center for workforce.
🔍 GuardDuty & Macie — Threat Detection & Data Discovery
Detection · Medium Frequency
Amazon GuardDuty
  • Intelligent threat detection — no agents, no infrastructure to manage
  • Data sources: VPC Flow Logs, CloudTrail API events, DNS logs, EKS audit logs, S3 data events
  • Detects: crypto mining, credential theft, port scans, unusual API calls, malware
  • Findings routed to EventBridge → Lambda auto-remediation or SNS alerts
  • Multi-account: delegate GuardDuty admin to a security account via Organizations
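
The EventBridge routing mentioned above could look roughly like the sketch below: an event pattern that matches GuardDuty findings, plus a Lambda-style handler that pulls out the fields a remediation step would need. The severity threshold of 7 and the handler's return shape are illustrative assumptions. The handler only summarizes the finding and does not perform any remediation.

```python
# Event pattern for an EventBridge rule that matches GuardDuty findings
# with severity 7 or higher (the threshold is an illustrative choice).
GUARDDUTY_HIGH_SEVERITY_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 7]}]},
}


def handler(event, context=None):
    """Lambda entry point: summarize the finding for a remediation step."""
    detail = event["detail"]
    instance = detail.get("resource", {}).get("instanceDetails", {})
    return {
        "finding_type": detail["type"],
        "severity": detail["severity"],
        "instance_id": instance.get("instanceId"),  # None for non-EC2 findings
    }


sample_event = {
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Finding",
    "detail": {
        "type": "CryptoCurrency:EC2/BitcoinTool.B!DNS",
        "severity": 8,
        "resource": {"instanceDetails": {"instanceId": "i-0123456789abcdef0"}},
    },
}
summary = handler(sample_event)
print(summary)
```

A real remediation function would act on `instance_id`, for example by swapping the instance onto an isolation security group, and would need IAM permissions scoped to exactly that action.
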
Amazon Macie
  • Discovers and protects sensitive data (PII, financial data, credentials) in S3
  • Uses ML + pattern matching — flags publicly accessible buckets containing sensitive data
  • Supports custom data identifiers (regex patterns) for proprietary data types
🎯

GuardDuty = threat/attack detection (compromised instances, unusual API activity). Macie = sensitive data discovery in S3 (PII exposure). If question mentions PII or S3 data exposure → Macie. Compromised EC2, coin mining → GuardDuty.

💡

Mnemonic: GuardDuty is a Guard (looks for bad behavior everywhere: VPC, DNS, CloudTrail). Macie is a Maid (cleans up/finds sensitive stuff in S3 buckets).

⚠️

Common traps:

  • "GuardDuty requires installing agents on EC2" — FALSE; it analyzes VPC Flow Logs, CloudTrail, and DNS logs without any agents.
  • "GuardDuty can block threats automatically" — FALSE by itself; it generates findings only. You must wire EventBridge → Lambda to block (e.g., update Security Group).
  • "Macie scans all AWS services for PII" — FALSE; Macie only analyzes S3 objects.
  • "Disabling GuardDuty deletes all findings" — TRUE and a common gotcha; findings are not retained after service is disabled.
🔌 Hybrid Connectivity — Site-to-Site VPN & Direct Connect
Hybrid · High Frequency
Feature | Site-to-Site VPN | AWS Direct Connect
Medium | IPsec over public internet | Dedicated private fiber
Setup time | Minutes–hours | Weeks–months
Bandwidth | Up to ~1.25 Gbps | 1, 10, or 100 Gbps
Latency | Variable (internet-dependent) | Consistent, low latency
Encrypted | ✅ IPsec | ❌ Not by default — add VPN on top
Cost | Low | Higher (port-hour + data transfer)
Redundant Hybrid Pattern

Primary: Direct Connect (consistent, low latency). Backup: Site-to-Site VPN over internet. Add VPN on top of DX for encryption when compliance requires it. This gives performance + resilience.

graph LR
  DC["On-Premises\nDatacenter"]
  DC -->|"Primary: Direct Connect\nPrivate fiber, 1–100 Gbps\nNot encrypted by default"| VGW["Virtual Private\nGateway"]
  DC -->|"Backup: Site-to-Site VPN\nIPsec over internet\nEncrypted, variable latency"| VGW
  VGW --> VPC["AWS VPC\n(Private Subnets)"]
🎯

Consistent bandwidth + compliance + data must not traverse internet → Direct Connect. Quick setup + encrypted + lower cost → VPN. DX not encrypted by default — layer VPN over DX when encryption is required.

💡

Mnemonic: DX = Dedicated eXpress (Fast/Private but Unencrypted). VPN = Virtual Private Network (Encrypted but Public/Variable latency).

⚠️

Common traps:

  • "Direct Connect provides encrypted connectivity" — FALSE by default; DX is a private connection but not encrypted. Add IPsec VPN on top for encryption.
  • "Site-to-Site VPN is faster and more reliable than Direct Connect" — FALSE; VPN travels the public internet with variable latency.
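  • "Direct Connect instantly fails over to VPN" — FALSE; failover is driven by BGP. AWS prefers the Direct Connect path and shifts traffic to the VPN only after the DX BGP session is detected as down, so detection timers (or BFD) determine how fast it happens.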
  • "Direct Connect provisioning takes minutes" — FALSE; it takes weeks to months to get a physical fiber connection provisioned.
Task 1.3

Determine Appropriate Data Security Controls

KMS, ACM, S3 encryption, data lifecycle, backup, compliance controls.

🗝️ AWS KMS — Key Management Service
Encryption · High Frequency
Key Types
Type | Managed By | Rotation | Cost | Use Case
AWS Managed Keys | AWS | Auto (annual) | Free | Default for most services
Customer Managed Keys (CMK) | Customer | Optional / on-demand | $1/month/key | Fine-grained control, audit, cross-account
SSE-C (S3 only) | Customer (sent in API) | Customer manages | No KMS cost | Keys managed entirely outside AWS
Envelope Encryption

KMS generates a Data Encryption Key (DEK). Your data is encrypted with the DEK (AES-256, fast). The DEK is then encrypted by the CMK and stored alongside the ciphertext. To decrypt: KMS decrypts the DEK → DEK decrypts data. The CMK never leaves KMS HSMs.
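
The flow above can be traced end to end with a toy model. Everything here is a stand-in: `ToyKMS` is a hypothetical class, not the KMS API, and the cipher is a throwaway SHA-256 keystream XOR used only so the example runs on the standard library. It is not AES and must not be used to protect real data. What the sketch does show faithfully is the structure: the data is encrypted under the DEK, the DEK is wrapped under the master key, and only the wrapped DEK is stored.

```python
import hashlib
import secrets


def _keystream_xor(key, data):
    """Toy stream cipher (SHA-256 in counter mode). NOT AES; illustration only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


class ToyKMS:
    """Hypothetical stand-in for KMS: the master key never leaves this object."""

    def __init__(self):
        self._master_key = secrets.token_bytes(32)

    def generate_data_key(self):
        plaintext_dek = secrets.token_bytes(32)
        return plaintext_dek, _keystream_xor(self._master_key, plaintext_dek)

    def decrypt(self, encrypted_dek):
        return _keystream_xor(self._master_key, encrypted_dek)


def envelope_encrypt(kms, payload):
    plaintext_dek, encrypted_dek = kms.generate_data_key()
    ciphertext = _keystream_xor(plaintext_dek, payload)
    del plaintext_dek  # drop the plaintext DEK; keep only the wrapped copy
    return {"ciphertext": ciphertext, "encrypted_dek": encrypted_dek}


def envelope_decrypt(kms, envelope):
    dek = kms.decrypt(envelope["encrypted_dek"])  # KMS unwraps the DEK
    return _keystream_xor(dek, envelope["ciphertext"])  # the DEK decrypts the data


kms = ToyKMS()
message = b"order-history-export-2024.csv contents"
envelope = envelope_encrypt(kms, message)
recovered = envelope_decrypt(kms, envelope)
```

The design point is that bulk data never travels to the key service. Only the 32-byte DEK does, which is why envelope encryption scales to large objects.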

KMS Key Policies
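  • Every CMK must have a key policy. Unlike most resources, a KMS key is not governed by IAM alone: the key policy must explicitly enable the account (root principal) before IAM policies can grant access to the key
  • Access is allowed when the key policy grants it directly, or when the key policy delegates to IAM and an IAM policy allows the action; an IAM policy by itself is never enough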
  • Cross-account: add external account principal to key policy + IAM in that account grants kms:Decrypt
sequenceDiagram
  participant App as Application
  participant KMS as AWS KMS
  Note over App,KMS: Envelope Encryption Process
  App->>KMS: GenerateDataKey(CMK_ID)
  KMS-->>App: Plaintext DEK + Encrypted DEK
  Note over App: App encrypts payload using Plaintext DEK
  Note over App: App drops Plaintext DEK from memory
  Note over App: App stores Encrypted Payload alongside Encrypted DEK
🎯

Audit key usage → CMK (CloudTrail logs every API call). On-demand rotation → CMK only (AWS Managed keys rotate on AWS schedule). BYOK → Import key material into CMK. CloudHSM → single-tenant HSM; you control the hardware security module.

💡

Mnemonic: DEK = Data Encryption Key (Encrypts the Data directly). CMK = the KMS key, historically called a Customer Master Key (Encrypts the DEK). This is the Envelope Encryption concept.

⚠️

Common traps:

  • "Rotating a CMK re-encrypts all existing ciphertext" — FALSE; only new data is encrypted with the new key version. Old ciphertext is decryptable because KMS retains all previous key versions.
  • "You can use the same CMK across all regions" — FALSE; KMS keys are region-specific. Use multi-region keys (a newer feature) when cross-region decryption is needed.
  • "Deleting a CMK is immediate" — FALSE; KMS enforces a 7–30 day waiting period before deletion.
  • "CloudHSM is managed by AWS like KMS" — FALSE; AWS provisions and patches the single-tenant HSM hardware, but you manage the HSM users and keys, and AWS has no access to your key material.
🪣 S3 Data Security — Encryption, Policies, Object Lock
S3 · High Frequency
Server-Side Encryption Options
Type | Key Managed By | Notes
SSE-S3 | AWS (S3 service key) | Default; AES-256; no cost or config
SSE-KMS | AWS KMS CMK | CloudTrail audit + key rotation + cross-account
SSE-C | Customer (in API header) | HTTPS required; AWS does not store key
CSE | Customer (client-side) | Encrypted before upload; AWS never sees plaintext
Access Controls
  • Block Public Access: Account-level override — prevents any bucket/object ACL or policy from granting public access
  • Bucket Policies: Resource-based; enforce conditions like aws:SecureTransport (HTTPS-only)
  • MFA Delete: Requires MFA to delete object versions — enabled only by root user; prevents malicious deletion
  • Object Lock (WORM): Prevents deletion for a set retention period — Governance mode (admins can override) vs. Compliance mode (nobody can delete, even AWS)
  • VPC Gateway Endpoint: Private S3 access from VPC without NAT Gateway or internet
Force HTTPS

Bucket policy: Effect: Deny, Action: s3:*, Principal: *, Condition: aws:SecureTransport = false. Denies all non-HTTPS requests to the bucket at the resource level — no IAM Allow can override this Deny.
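
Written out in full, the HTTPS-only policy might be generated as in the sketch below. The helper name and the bucket name `my-bucket` are hypothetical. The resource list names both the bucket and its objects so the Deny covers bucket-level and object-level actions.

```python
import json


def https_only_bucket_policy(bucket_name):
    """Bucket policy that denies every S3 action sent over plain HTTP."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
            # aws:SecureTransport is "false" when the request did not use TLS.
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


policy = https_only_bucket_policy("my-bucket")
print(json.dumps(policy, indent=2))
```

Because the statement is a Deny, it adds no access of its own. The Allow statements that grant normal access still have to exist elsewhere.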

🎯

WORM / immutable data / SEC 17a-4 → S3 Object Lock in Compliance mode. Prevent version deletion → MFA Delete (root only). Private S3 access from Lambda in VPC → VPC Gateway Endpoint (free; no NAT needed).

💡

Mnemonic: SSE-S3 = Simple/Free (AWS managed). SSE-KMS = Key Audit/Control (CloudTrail). SSE-C = Customer provided key (Sent in HTTPS header).

⚠️

Common traps:

  • "S3 Block Public Access prevents all access to a bucket" — FALSE; it blocks public ACL and policy grants, but authenticated IAM users can still access objects.
  • "Object Lock in Governance mode prevents all deletion" — FALSE; Governance mode allows users with the s3:BypassGovernanceRetention permission to override. Compliance mode allows NO overrides.
  • "Versioning and Object Lock are the same thing" — FALSE; versioning keeps historical versions but doesn't prevent deletion of versions. Object Lock adds a WORM protection layer.
  • "MFA Delete can be enabled by any IAM admin" — FALSE; only the root user can enable MFA Delete.
📜 ACM — Encryption in Transit with TLS
TLS · Certificates
  • Free public TLS certificates for AWS services (ALB, CloudFront, API Gateway)
  • Auto-renewal — eliminates certificate expiry incidents
  • Private key stays in ACM — cannot be exported (use ACM Private CA for on-prem)
  • Critical: CloudFront certificates must be provisioned in us-east-1 regardless of origin region
Validation Methods
Method | How | Best For
DNS Validation | Add CNAME to Route 53 (ACM can automate) | Automated renewal; preferred
Email Validation | Click link emailed to WHOIS contacts | When DNS is not manageable
🎯

TLS terminates at ALB (ACM cert on the HTTPS listener). Backend EC2s communicate on HTTP within the VPC (acceptable) or HTTPS with self-signed cert. CloudFront + custom domain → provision ACM cert in us-east-1 first — this is a common gotcha.

💡

Mnemonic: ACM = Auto Certificate Management (Free, auto-renews with DNS, stays in AWS).

⚠️

Common traps:

  • "ACM certificates can be downloaded and installed on EC2" — FALSE; public ACM certs cannot be exported. Use ACM Private CA if you need exportable certs for EC2/on-prem.
  • "A certificate provisioned in us-west-2 works with CloudFront" — FALSE; CloudFront requires ACM certificates to be in us-east-1 specifically, regardless of where your origin is.
  • "ACM automatically renews all certificates" — FALSE; ACM only auto-renews if DNS validation is in place. Email-validated certs require manual re-validation.
  • "ACM certificates work with EC2 directly" — FALSE; ACM integrates with ELB, CloudFront, API Gateway — not directly on EC2.
💾 Data Backups, Replication & AWS Backup
Backups · Medium Frequency
AWS Backup
  • Centralized policy-driven backup for: EC2, EBS, RDS, Aurora, DynamoDB, EFS, S3, FSx, Storage Gateway
  • Backup Plans: schedule, retention, lifecycle to cold storage tier
  • Cross-region and cross-account copies for DR
  • Backup Vault Lock: WORM on backup vaults — prevents deletion even by admins; Compliance mode = immutable
Service-Specific Patterns
Service | Backup Mechanism | Recovery
EBS | Incremental snapshots (stored in S3) | Restore to new volume, any point
RDS | Automated backups (1–35 days) + manual snapshots | Point-in-time within retention window
DynamoDB | On-demand backups + PITR (35 days) | Restore to new table
S3 | Versioning + Cross-Region Replication (CRR) | Any prior version in same or other region
🎯

7-year immutable backup (compliance) → AWS Backup with Vault Lock in Compliance mode. Cross-region DR for S3 → CRR. Point-in-time recovery for DynamoDB → enable PITR (35-day window, continuous).

💡

Mnemonic: PITR = Point In Time Recovery (Creates a NEW table/DB, never overwrites the existing one).

⚠️

Common traps:

  • "RDS Multi-AZ standby can serve read traffic" — FALSE; Multi-AZ standby is passive — it only activates on failover. Use read replicas to serve reads.
  • "S3 Cross-Region Replication replicates existing objects automatically" — FALSE; CRR only replicates objects uploaded after CRR is enabled. Use S3 Batch Replication for existing objects.
  • "EBS snapshots are region-specific" — TRUE and often a trap; you must manually copy snapshots to other regions for DR.
  • "DynamoDB PITR lets you restore to any second in the last 35 days" — TRUE but the restored table is a new table — it does not overwrite the existing table.
📊 Data Classification, Lifecycle & Compliance Controls
Governance · Medium Frequency
S3 Lifecycle Transitions

Standard (0–30d) → Standard-IA (30–90d) → Glacier Instant Retrieval (90–180d) → Glacier Deep Archive (180d+). Expire/delete objects automatically after a set age.
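
The tiering schedule above can be written down as a lifecycle configuration. The sketch below only builds the rule as a plain Python dict, in the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`. The rule ID, the `logs/` prefix, and the 2,555-day (roughly 7-year) expiry are illustrative assumptions, not values from this guide.

```python
def build_lifecycle_rule(prefix="logs/", expire_after_days=2555):
    """Return a lifecycle configuration dict mirroring the tiers above.

    prefix and expire_after_days are hypothetical example values.
    """
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = build_lifecycle_rule()
transition_days = [t["Days"] for t in config["Rules"][0]["Transitions"]]
```

Each transition moves objects to a colder class than the one before it, which matches the rule that lifecycle transitions never move data back to a warmer tier.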

Compliance Toolchain
Service | Purpose | Key Output
AWS Config | Continuous compliance monitoring; tracks config changes | Config rules, conformance packs
CloudTrail | API audit trail — who did what, when, from where | Log files to S3; EventBridge integration
Audit Manager | Automated evidence collection for audits | SOC2, PCI, HIPAA frameworks
Security Hub | Aggregates findings from GuardDuty, Inspector, Macie | Unified security posture score
🎯

CloudTrail = "who made the API call?" (event history). Config = "is this resource compliant right now?" (current state). Config auto-remediates with SSM Automation. Both feed Security Hub for unified dashboard.

💡

Mnemonic: Config evaluates the Current State. CloudTrail tracks the Trail of API calls (Who/What/When).

⚠️

Common traps:

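  • "CloudTrail is enabled by default in all regions" — MISLEADING; every account gets 90 days of management-event history (Event History) without any setup, but that is not a trail. For longer retention, data events, or delivery to S3, you must create a trail explicitly (a multi-Region trail or an organization trail).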
  • "AWS Config prevents non-compliant resource creation" — FALSE; Config is detective, not preventive. Use SCPs or IAM policies to prevent; Config detects and reports after the fact.
  • "S3 lifecycle rules can transition objects from Standard-IA directly to Standard" — FALSE; lifecycle rules only move objects to colder tiers, not back to warmer ones. Objects must also spend at least 30 days in Standard before a lifecycle rule can move them to Standard-IA, and Standard-IA bills a 30-day minimum storage duration.

Domain 2 Overview

Design architectures that survive failures, scale on demand, and decouple components. Covers microservices, messaging, serverless, containers, HA patterns, disaster recovery, and fault tolerance.

⚡ 26% of scored content
📊 Visual Study Guides — Domain 2
Domain 2 — Resilient Architectures
Domain 2: Design Resilient Cloud Architectures Overview

Resilient Architectures — overview & DR strategies

The Two Pillars of Resilient AWS Architecture

Two Pillars of Resilient Architecture

Domain 2: Resilient Cloud Architecture Study Guide

Resilient Cloud Architecture — full study guide

Task 2.1

Design Scalable and Loosely Coupled Architectures

Microservices, messaging, serverless, containers, caching, API Gateway, event-driven design.

⚖️ Decoupling with SQS, SNS & EventBridge
Messaging · High Frequency

Loosely coupled architectures use asynchronous messaging so components can scale and fail independently.

Service Comparison
Service | Model | Use Case | Retention
Amazon SQS | Queue (pull) | Work queues, job decoupling, rate limiting | Up to 14 days
Amazon SNS | Pub/Sub (push) | Fan-out to multiple subscribers simultaneously | No persistence
Amazon EventBridge | Event bus (push) | Event-driven routing, SaaS integration, scheduled rules | Archive optional
Amazon MQ | Queue (AMQP/MQTT) | Migrating existing message brokers (ActiveMQ, RabbitMQ) | Configurable
SQS — Key Concepts
  • Standard Queue: At-least-once delivery, best-effort ordering, nearly unlimited throughput
  • FIFO Queue: Exactly-once processing, strict ordering, up to 3,000 msg/s with batching
  • Visibility Timeout: Hides a message while a consumer processes it (default 30s); prevents duplicate processing
  • Dead Letter Queue (DLQ): Captures messages that fail processing after N attempts
  • Long Polling: Consumer waits up to 20s for messages — reduces empty API calls and cost
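
How visibility timeout and the DLQ interact is easier to see in a toy model. The class below is an in-memory sketch, not the SQS API: `TinyQueue`, its method names, and the explicit `now` clock are all hypothetical, and real SQS behaviour has more detail than this.

```python
class TinyQueue:
    """In-memory toy model of visibility timeout and DLQ redrive (not the real SQS API)."""

    def __init__(self, visibility_timeout=30, max_receive_count=3):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.messages = {}     # id -> {"body", "visible_at", "receives"}
        self.dead_letter = []  # bodies moved to the DLQ
        self._next_id = 0

    def send(self, body):
        self.messages[self._next_id] = {"body": body, "visible_at": 0, "receives": 0}
        self._next_id += 1

    def receive(self, now):
        """Return (id, body) of one visible message and hide it for the timeout."""
        for msg_id, msg in list(self.messages.items()):
            if msg["visible_at"] > now:
                continue  # still in flight with another consumer
            if msg["receives"] >= self.max_receive_count:
                # Too many failed attempts: park it in the DLQ instead of redelivering.
                self.dead_letter.append(msg["body"])
                del self.messages[msg_id]
                continue
            msg["receives"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg_id, msg["body"]
        return None

    def delete(self, msg_id):
        """Consumer acknowledges success; the message is removed for good."""
        self.messages.pop(msg_id, None)


q = TinyQueue(visibility_timeout=30, max_receive_count=2)
q.send("order-1")
first = q.receive(now=0)    # consumer A takes the message
hidden = q.receive(now=10)  # None: invisible while A is working
again = q.receive(now=31)   # A crashed without deleting, so it reappears
gone = q.receive(now=62)    # receive count exhausted, moved to the DLQ
```

The second delivery at `now=31` is why consumers should be idempotent: a longer visibility timeout narrows the window for duplicates but does not close it.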
Fan-Out Pattern (SNS + SQS)
Pattern

Order service publishes to SNS topic → SNS fans out to: SQS queue for fulfillment service + SQS queue for billing service + SQS queue for notification service. Each service scales independently and processes at its own rate. No service is blocked by another.

graph TD
  O[Order Service] -->|Publish| SNS[SNS Topic: NewOrders]
  SNS -->|Push| Q1[SQS: Fulfillment]
  SNS -->|Push| Q2[SQS: Billing]
  SNS -->|Push| Q3[SQS: Notifications]
  Q1 -.->|Poll| W1[Fulfillment Workers]
  Q2 -.->|Poll| W2[Billing Workers]
  Q3 -.->|Poll| W3[Notification Workers]
🎯

Ordered + exactly-once → SQS FIFO. Fan-out to multiple consumers → SNS → SQS. Route events based on content/pattern → EventBridge. Migrating ActiveMQ → Amazon MQ (not SQS — preserves broker protocols).

⚠️

Common traps:

  • "SQS FIFO guarantees ordering across all message groups" — FALSE; ordering is guaranteed only within a message group ID.
  • "SNS delivers to SQS in order" — FALSE; SNS is a push/fanout service with no ordering guarantee.
  • "SQS Standard ensures exactly-once delivery" — FALSE; Standard is at-least-once. Only FIFO is exactly-once.
  • "Increasing SQS visibility timeout prevents all duplicate processing" — FALSE; if a consumer crashes before deleting the message, it reappears after the timeout and will be processed again.
  • "A DLQ automatically retries messages" — FALSE; DLQ just stores failed messages. You must manually reprocess or build re-drive logic.
Serverless — Lambda, Fargate & Step Functions
Serverless · High Frequency
AWS Lambda
  • Event-driven, stateless functions — runs up to 15 minutes per invocation
  • Memory: 128 MB–10 GB (CPU allocated proportionally)
  • Triggers: API GW, ALB, SQS, SNS, S3, DynamoDB Streams, EventBridge, Kinesis
  • Concurrency: Account default 1,000; request increases for high-traffic workloads
  • Reserved Concurrency: Guarantees capacity; also throttles at that limit
  • Provisioned Concurrency: Eliminates cold starts — pre-warms execution environments
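
A quick way to sanity-check the concurrency limit is the rule of thumb that concurrent executions are roughly request rate times average duration. The helper below is a back-of-the-envelope sketch; the function names and the sample traffic numbers are assumptions for illustration.

```python
import math

ACCOUNT_CONCURRENCY_LIMIT = 1000  # default limit cited in the list above


def estimate_concurrency(requests_per_second, avg_duration_seconds):
    """Concurrent executions ~= arrival rate x average duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


def will_throttle(requests_per_second, avg_duration_seconds,
                  limit=ACCOUNT_CONCURRENCY_LIMIT):
    """True if the estimated concurrency exceeds the given limit."""
    return estimate_concurrency(requests_per_second, avg_duration_seconds) > limit


steady = estimate_concurrency(200, 0.5)   # 100 concurrent executions
spike = estimate_concurrency(5000, 0.5)   # 2500 concurrent executions
spike_throttles = will_throttle(5000, 0.5)
```

Under these assumed numbers the spike needs 2,500 concurrent executions, so it would be throttled at the default limit unless a quota increase is requested first.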
AWS Fargate vs. EC2 Launch Type (ECS/EKS)
Feature | Fargate | EC2 Launch Type
Server management | Fully serverless | You manage EC2 instances
Cost model | Per vCPU + memory used | Per EC2 instance (even when idle)
Scaling | Per-task scaling | Cluster + service scaling
Best for | Variable workloads, no ops overhead | GPU workloads, custom AMIs, cost at scale
AWS Step Functions
  • Orchestrates multi-step workflows as state machines (JSON ASL definition)
  • Standard Workflows: Long-running (up to 1 year), exactly-once, audit history
  • Express Workflows: High-volume, short-duration (up to 5 min), at-least-once
  • Handles retries, error catching, parallel branches, and human approval steps
graph TD
  S1["Order Received"] --> S2["Validate Payment\n(Lambda)"]
  S2 -->|"Success"| S3["Parallel State"]
  S2 -->|"Failure / Retry"| S4["Notify Customer\n(Error state)"]
  S3 -->|"Branch 1"| S5["Update Inventory\n(Lambda)"]
  S3 -->|"Branch 2"| S6["Send Confirmation\n(SNS)"]
  S5 --> S7["Order Complete"]
  S6 --> S7
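
The order workflow in the diagram can be sketched as an Amazon States Language definition. The function below builds it as a Python dict and serializes it to JSON. State names, retry settings, and the ARNs passed in are hypothetical, and the failure path is simplified to a `Fail` state rather than a real notification step.

```python
import json


def order_state_machine(validate_arn, inventory_arn, topic_arn):
    """Return an ASL definition (as a dict) for the order workflow above."""
    return {
        "Comment": "Illustrative order workflow",
        "StartAt": "ValidatePayment",
        "States": {
            "ValidatePayment": {
                "Type": "Task",
                "Resource": validate_arn,
                # Retry with exponential backoff before giving up.
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }],
                # After retries are exhausted, route to the error state.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyCustomer"}],
                "Next": "FulfilInParallel",
            },
            "FulfilInParallel": {
                "Type": "Parallel",
                "Branches": [
                    {"StartAt": "UpdateInventory",
                     "States": {"UpdateInventory": {
                         "Type": "Task", "Resource": inventory_arn, "End": True}}},
                    {"StartAt": "SendConfirmation",
                     "States": {"SendConfirmation": {
                         "Type": "Task",
                         "Resource": "arn:aws:states:::sns:publish",
                         "Parameters": {"TopicArn": topic_arn,
                                        "Message": "Order confirmed"},
                         "End": True}}},
                ],
                "Next": "OrderComplete",
            },
            "NotifyCustomer": {"Type": "Fail", "Error": "PaymentFailed",
                               "Cause": "Payment validation failed"},
            "OrderComplete": {"Type": "Succeed"},
        },
    }


# Hypothetical ARNs, for illustration only.
definition = order_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:validate-payment",
    "arn:aws:lambda:us-east-1:123456789012:function:update-inventory",
    "arn:aws:sns:us-east-1:123456789012:order-confirmations",
)
asl_json = json.dumps(definition, indent=2)
```

Retries and error routing live in the state machine definition here, so the individual Lambda functions do not need their own retry loops.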
🎯

Cold start latency → Provisioned Concurrency. Orchestrate multi-Lambda workflow with retries → Step Functions. Containers without managing EC2 → Fargate. Lambda timeout limit: 15 minutes — long-running tasks need EC2, Batch, or ECS.

⚠️

Common traps:

  • "Lambda scales infinitely without limits" — FALSE; there is an account-level concurrency limit (default 1,000 per region).
  • "Provisioned Concurrency eliminates all cold starts" — TRUE only for the provisioned environments; if traffic exceeds the provisioned count, the extra requests run on new environments that cold start.
  • "Lambda can run indefinitely" — FALSE; max 15 minutes per invocation.
  • "Fargate is always cheaper than EC2" — FALSE; for consistently high utilization, EC2 with Reserved Instances is cheaper. Fargate shines for variable/spiky workloads.
  • "Step Functions Express Workflows support exactly-once execution" — FALSE; Express is at-least-once. Only Standard Workflows are exactly-once.
🌀 Containers — ECS, EKS & When to Use Them
Containers · Medium Frequency
Feature | Amazon ECS | Amazon EKS
Orchestration | AWS-proprietary | Kubernetes (open standard)
Learning curve | Lower — AWS-native | Higher — requires K8s knowledge
Best for | AWS-native workloads, simpler ops | Kubernetes migrations, multi-cloud portability
Launch types | Fargate + EC2 | Fargate + EC2 + Managed Node Groups
Container Migration Drivers
  • Portability: same container image runs locally, on ECS, EKS, or on-prem
  • Density: pack more workloads per EC2 instance than VMs
  • Faster deploys: images are immutable — promotes CI/CD best practices
  • ECR (Elastic Container Registry): private Docker registry, integrated with ECS/EKS
🎯

AWS-native container workload → ECS. Existing Kubernetes workload or multi-cloud → EKS. No server management → add Fargate. Store container images → ECR (not Docker Hub — keep it in AWS for lower latency and security).

⚠️

Common traps:

  • "ECS and EKS both require managing EC2 instances" — FALSE; both support Fargate (serverless compute).
  • "ECS is a Kubernetes service" — FALSE; ECS is AWS-proprietary orchestration. EKS runs actual Kubernetes.
  • "Containers are always stateless" — FALSE; containers can be stateful using EBS or EFS volumes.
  • "ECR is only for ECS" — FALSE; ECR stores container images used by ECS, EKS, Lambda, or any Docker-compatible runtime.
  • "EKS is free" — FALSE; you pay per EKS cluster per hour (~$0.10/hr) plus EC2/Fargate costs.
🔌 API Gateway & Microservice Patterns
API · Medium Frequency
API Gateway Types
Type | Use Case | Protocol
REST API | Standard HTTP APIs; request/response transformation, caching | HTTP/S
HTTP API | Lower cost, lower latency than REST; JWT auth built-in | HTTP/S
WebSocket API | Real-time bidirectional — chat, live dashboards | WebSocket
Key Features
  • Throttling: Protects backends — default 10,000 RPS per account (configurable)
  • Caching: Cache responses 0.5 GB–237 GB; reduces backend calls
  • Usage Plans + API Keys: Tiered rate limiting per client
  • Authorizers: Lambda authorizer (custom logic) or Cognito User Pool (JWT)
  • Private APIs: Accessible only within VPC via interface endpoint
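
AWS describes API Gateway throttling as a token-bucket scheme: a steady refill rate plus a burst allowance. The class below is a minimal sketch of that idea, not API Gateway's implementation. `TokenBucket`, the explicit `now` clock, and the tiny rate and burst values are assumptions chosen to keep the trace readable.

```python
class TokenBucket:
    """Toy token bucket: `rate` tokens refill per second, capped at `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a real API would answer HTTP 429 Too Many Requests


bucket = TokenBucket(rate=2, burst=3)
burst_results = [bucket.allow(now=0.0) for _ in range(5)]  # 3 admitted, 2 rejected
later = bucket.allow(now=1.0)  # one second later the bucket has refilled by 2
```

Usage plans apply the same idea per API key, which is how one noisy client can be limited without affecting the others.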
Microservice Design Principles
  • Stateless workloads: No server-side session state → easy horizontal scaling
  • Stateful workloads: Session state in ElastiCache or DynamoDB, not in-process
  • Read replicas: Offload read traffic from primary DB — scale reads independently
🎯

REST API + caching + transformation → REST API GW. Lower cost simple HTTP proxy → HTTP API GW. Real-time push → WebSocket API GW. Throttle specific clients → Usage Plans. Scale reads → RDS read replicas or ElastiCache.

⚠️

Common traps:

  • "API Gateway HTTP API supports request/response transformation" — FALSE; only REST APIs support mapping templates for transformation.
  • "API Gateway caches responses globally across all regions" — FALSE; caching is per stage, per region, per API.
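  • "Increasing API GW timeout beyond 29 seconds is possible" — treat as FALSE for the exam; the default maximum integration timeout is 29 seconds. AWS now allows raising it for Regional and private REST APIs through a quota increase, but the expected answer for longer operations is still an async pattern (SQS + Lambda).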
Caching Strategies — ElastiCache, CloudFront, DAX
Caching · High Frequency
Service | Layer | Use Case | Engine
ElastiCache for Redis | In-memory DB cache | Sessions, leaderboards, pub/sub, complex data types | Redis
ElastiCache for Memcached | In-memory cache | Simple object caching, horizontal scaling | Memcached
Amazon DAX | DynamoDB accelerator | Microsecond reads for DynamoDB (no app code change) | Proprietary
Amazon CloudFront | Edge CDN cache | Static/dynamic content, API response caching at edge | Edge network
Caching Patterns
  • Lazy Loading (Cache-Aside): Check cache → miss → load from DB → write to cache. Stale data risk, but only caches what's requested.
  • Write-Through: Write to cache and DB simultaneously. Always fresh data but higher write latency.
  • TTL: Set expiry on cache entries to prevent serving stale data indefinitely.
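
The three patterns above fit in a few lines of code. This is a minimal in-memory sketch in which a dict stands in for the database and `TTLCache` stands in for ElastiCache; the class, the function names, and the explicit `now` clock are hypothetical.

```python
class TTLCache:
    """Tiny cache whose entries expire `ttl_seconds` after they are written."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def get(self, key, now):
        entry = self.store.get(key)
        if entry is None or entry[1] <= now:
            return None  # miss, or entry has expired
        return entry[0]

    def put(self, key, value, now):
        self.store[key] = (value, now + self.ttl)


def read_lazy(cache, db, key, now):
    """Cache-aside: try the cache, fall back to the DB on a miss, then populate."""
    value = cache.get(key, now)
    if value is None:
        value = db[key]
        cache.put(key, value, now)
    return value


def write_through(cache, db, key, value, now):
    """Write-through: update the DB and the cache in the same operation."""
    db[key] = value
    cache.put(key, value, now)


db = {"user:1": "Ada"}
cache = TTLCache(ttl_seconds=60)
loaded = read_lazy(cache, db, "user:1", now=0)   # miss, loads "Ada" from the DB
db["user:1"] = "Grace"                           # DB updated behind the cache's back
stale = read_lazy(cache, db, "user:1", now=30)   # still "Ada": stale until TTL expires
fresh = read_lazy(cache, db, "user:1", now=61)   # TTL passed, reloads "Grace"
write_through(cache, db, "user:1", "Linus", now=70)
current = read_lazy(cache, db, "user:1", now=71)  # "Linus" straight from the cache
```

The `stale` read is the trade-off of lazy loading: the TTL bounds how long stale data can be served, and write-through removes the staleness at the cost of a slower write path.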
🎯

DynamoDB read latency too high → DAX (microseconds, no code change). Session management → ElastiCache Redis. Global content delivery / static assets → CloudFront. Need pub/sub in cache layer → Redis (Memcached has no pub/sub).

⚠️

Common traps:

  • "DAX can be used with any database" — FALSE; DAX is exclusively for DynamoDB.
  • "ElastiCache Memcached supports Multi-AZ automatic failover" — FALSE; Memcached has no replication or failover. Only Redis supports Multi-AZ with automatic failover.
  • "Caching always improves consistency" — FALSE; caching introduces potential stale data; TTL and invalidation strategies must be carefully designed.
  • "CloudFront caches all content types by default" — FALSE; caching behavior is controlled by Cache-Control and TTL settings. Dynamic content (API responses, authenticated pages) is typically not cached and passes through to origin on every request.
⚙️ Load Balancing — ALB, NLB & Gateway LB
Networking · High Frequency
Feature | ALB (Layer 7) | NLB (Layer 4) | Gateway LB (Layer 3)
Protocol | HTTP, HTTPS, gRPC, WebSocket | TCP, UDP, TLS | IP (GENEVE)
Routing | Path, host, header, query string | IP + port | Pass-through to appliances
Static IP | ❌ (use Global Accelerator) | ✅ Per AZ |
Use case | Microservices, HTTP routing, containers | Ultra-low latency, gaming, financial | Inline security appliances (IDS/IPS, firewalls)
🎯

Route by URL path (/api vs /web) → ALB. Need static IP for whitelist → NLB. Third-party firewall/IDS inspection → Gateway LB. WebSocket support → ALB (NLB also supports TCP WebSocket).

⚠️

Common traps:

  • "ALB provides a static IP address" — FALSE; ALB uses DNS names that resolve to dynamic IPs. Use Global Accelerator in front of ALB for static IPs.
  • "NLB supports path-based routing" — FALSE; NLB operates at Layer 4 and routes by IP/port only.
  • "You can attach a WAF to an NLB" — FALSE; WAF only works with ALB, CloudFront, API Gateway, and AppSync. NLB operates at Layer 4 with no HTTP context, so WAF (a Layer 7 firewall) cannot be applied to it.
  • "Cross-Zone Load Balancing is enabled by default on all LBs" — FALSE; it's enabled by default on ALB but disabled by default on NLB and Gateway LB.
Task 2.2

Design Highly Available and/or Fault-Tolerant Architectures

Multi-AZ, multi-Region, DR strategies, RTO/RPO, Route 53 routing, immutable infrastructure.

🌍 Disaster Recovery Strategies — RTO, RPO & the 4 Tiers
DR · Very High Frequency
RTO vs. RPO
  • RTO (Recovery Time Objective): Maximum tolerable downtime — how fast must you recover?
  • RPO (Recovery Point Objective): Maximum tolerable data loss — how much data can you afford to lose?
The 4 DR Strategies (cheapest → fastest)
Strategy | Description | RTO | RPO | Cost
Backup & Restore | Restore from S3/Glacier backup. No live DR resources. | Hours | Hours | Lowest
Pilot Light | Core data replicated; minimal compute off. Scale up on event. | Minutes–hours | Minutes | Low
Warm Standby | Scaled-down but functional copy in DR region. Scale up fast. | Minutes | Seconds–minutes | Medium
Active-Active (Multi-site) | Both regions serve traffic simultaneously. | Near-zero | Near-zero | Highest
Scenario

Company requires RPO ≤ 15 min and RTO ≤ 1 hour. Backup & Restore won't meet RTO. Active-Active is too expensive. Best fit: Warm Standby — a scaled-down running stack in DR region with continuous replication; scale up within minutes on failover.

graph LR
  BR["Backup & Restore\nRPO/RTO: Hours\nCost: $"]
  PL["Pilot Light\nRPO: Minutes\nRTO: Minutes–Hrs\nCost: $$"]
  WS["Warm Standby\nRPO: Seconds\nRTO: Minutes\nCost: $$$"]
  AA["Active-Active\nRPO/RTO: Near-Zero\nCost: $$$$"]
  BR -->|more resilient| PL
  PL -->|more resilient| WS
  WS -->|more resilient| AA
🎯

The exam gives you RPO/RTO requirements and asks which strategy fits. Map: hours/hours → Backup & Restore; minutes RPO → Pilot Light or Warm Standby; near-zero → Active-Active. Cost scales with RTO speed.
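
The mapping above amounts to "pick the cheapest strategy that still meets both targets". The sketch below encodes that rule. The RTO/RPO minutes attached to each strategy are illustrative assumptions standing in for "hours", "minutes", and "near-zero"; they are not AWS-published figures.

```python
# (strategy, typical RTO in minutes, typical RPO in minutes), cheapest first.
# The numbers are illustrative assumptions, not AWS-published figures.
DR_STRATEGIES = [
    ("Backup & Restore", 24 * 60, 24 * 60),
    ("Pilot Light", 120, 10),
    ("Warm Standby", 15, 1),
    ("Active-Active", 0, 0),
]


def cheapest_strategy(rto_minutes, rpo_minutes):
    """Return the first (cheapest) strategy that satisfies both targets."""
    if rto_minutes < 0 or rpo_minutes < 0:
        raise ValueError("RTO and RPO must be non-negative")
    for name, rto, rpo in DR_STRATEGIES:
        if rto <= rto_minutes and rpo <= rpo_minutes:
            return name
    # Unreachable for non-negative inputs: Active-Active always matches.
    raise ValueError("no strategy satisfies the targets")


scenario = cheapest_strategy(rto_minutes=60, rpo_minutes=15)      # "Warm Standby"
relaxed = cheapest_strategy(rto_minutes=48 * 60, rpo_minutes=48 * 60)  # "Backup & Restore"
strict = cheapest_strategy(rto_minutes=1, rpo_minutes=0)          # "Active-Active"
```

With these assumed figures, the scenario above (RPO ≤ 15 min, RTO ≤ 1 hour) skips Pilot Light on RTO and lands on Warm Standby.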

⚠️

Common traps:

  • "Pilot Light means the DR environment is fully running at reduced capacity" — FALSE; Pilot Light means only core data/services (like DB replication) run. Compute is off and must be scaled up on failover. That's Warm Standby.
  • "RPO is about how fast you recover" — FALSE; RPO is about data loss tolerance (time). RTO is recovery time. Swap these and you'll pick the wrong strategy.
🗺️ Route 53 — Routing Policies for HA
DNS · High Frequency
Policy | Use Case | Health Check?
Simple | Single resource; no health checks | Optional
Failover | Primary/secondary; fail over on health check failure | ✅ Required
Weighted | A/B testing; canary deployments; split traffic by % | Optional
Latency | Route to region with lowest latency for the user | Optional
Geolocation | Route by user's geographic location (country/continent) | Optional
Geoproximity | Route by distance, with bias to shift traffic between regions | Optional
Multivalue Answer | Return up to 8 healthy records; basic load distribution | ✅ Recommended
🎯

Active-Active failover → Latency or Weighted (both regions serve traffic). Active-Passive failover → Failover routing policy. Legal data residency → Geolocation. Gradually shift traffic to new region → Geoproximity with bias.

⚠️

Common traps:

  • "Geolocation routing guarantees users always connect to the nearest region" — FALSE; Geolocation routes by geographic location, not latency. Use Latency-based routing for lowest latency.
  • "Multivalue Answer is a load balancer replacement" — FALSE; it's basic DNS-level health-checked multi-record, not a real load balancer. Use ELB for actual load balancing.
  • "Route 53 health checks can test private endpoints directly" — FALSE; health checks originate from the internet. Use CloudWatch alarm + Route 53 health check linked to the alarm for private resources.
  • "Weighted routing with weight 0 removes the record" — FALSE; weight 0 stops traffic to that endpoint but the record remains. To remove it, delete the record. If every record in the group has weight 0, Route 53 treats them all as equally weighted and distributes traffic evenly.
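
Weighted routing, including the weight-0 cases in the last trap, can be simulated in a few lines. This is a sketch of the selection rule only: it ignores health checks and DNS caching, and `pick_weighted` and the endpoint names are hypothetical.

```python
import random


def pick_weighted(records, rng=random):
    """Pick one endpoint from a dict of endpoint -> non-negative weight."""
    endpoints = sorted(records)
    total = sum(records.values())
    if total == 0:
        # All weights zero: every record is chosen with equal probability.
        return rng.choice(endpoints)
    weights = [records[e] for e in endpoints]
    return rng.choices(endpoints, weights=weights, k=1)[0]


rng = random.Random(7)  # seeded so the run is repeatable

# Canary: send roughly 10% of lookups to the new stack.
canary = [pick_weighted({"blue": 90, "green": 10}, rng) for _ in range(10_000)]
green_share = canary.count("green") / len(canary)

# Weight 0 drains an endpoint without deleting its record.
drained = {pick_weighted({"blue": 100, "green": 0}, rng) for _ in range(500)}
```

Each record's share of traffic is its weight divided by the sum of all weights, which is why a canary is usually expressed as a small weight next to a large one rather than as a percentage.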
🔁 Multi-AZ Patterns, Auto Scaling & Immutable Infrastructure
HA · High Frequency
Multi-AZ Key Facts
  • Always deploy across ≥2 AZs for HA — AZs are isolated failure domains within a Region
  • RDS Multi-AZ: Synchronous replication to standby; automatic failover (~60–120s); standby is not readable
  • Aurora Multi-AZ: 6 copies across 3 AZs; read replicas can be promoted; much faster failover than RDS
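  • ELB: Distributes traffic across AZs; Cross-Zone Load Balancing spreads requests evenly across registered targets in every enabled AZ, not just the targets in the AZ of the node that received the request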
EC2 Auto Scaling
  • Target Tracking: Maintain a metric value (e.g., CPU at 60%) — simplest, recommended
  • Step Scaling: Scale in defined steps based on CloudWatch alarms
  • Scheduled Scaling: Scale at predictable times (e.g., 8 AM every weekday)
  • Predictive Scaling: ML-based; provisions capacity before load arrives
  • Cooldown Period: Prevents thrashing — default 300s after a scale action
Immutable Infrastructure
  • Never modify running instances — replace with new AMI versions
  • Enables blue/green deployments: stand up new stack → shift traffic → terminate old
  • CloudFormation / CDK define infrastructure as code — entire stack is replaceable
🎯

Eliminate single points of failure: Multi-AZ ELB + Auto Scaling Group + Multi-AZ RDS. RDS read replica ≠ Multi-AZ standby — read replicas are for scaling reads (asynchronous replication); Multi-AZ standby is for failover (synchronous, not readable).

⚠️

Common traps:

  • "RDS Multi-AZ standby can handle read queries to reduce load" — FALSE; standby is passive and not accessible for reads. Create read replicas for that.
  • "Auto Scaling adds instances immediately when alarm fires" — FALSE; there is a warm-up period and cooldown period that delays scaling.
  • "Scheduled scaling overrides target tracking" — FALSE; they work together — ASG uses whichever produces the largest desired capacity.
  • "EC2 Auto Scaling can replace unhealthy instances across regions" — FALSE; ASG is regional. Use multi-region architecture + Route 53 failover for cross-region HA.
  • "Cooldown period prevents scale-in and scale-out" — FALSE; cooldown only applies to the same scaling policy type that triggered it.
👁️ Workload Visibility — CloudWatch, X-Ray & Service Quotas
Observability
Amazon CloudWatch
  • Metrics, logs, alarms, dashboards — central observability platform
  • Custom Metrics: Push application-level metrics (e.g., orders/min) via PutMetricData API
  • CloudWatch Logs Insights: Query log groups with SQL-like syntax
  • Composite Alarms: Combine multiple alarms with AND/OR logic to reduce alert noise
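
A composite alarm is defined by a rule expression that references other alarms by name. The helper below assembles such an expression as a string, in the `ALARM("name")` form used by the `AlarmRule` parameter of `PutCompositeAlarm`. The helper and the alarm names are hypothetical.

```python
def composite_rule(alarm_names, operator="AND"):
    """Join child alarms into a composite alarm rule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


# Fire only when all three child alarms are in ALARM state at once.
rule = composite_rule(["high-cpu", "high-latency", "high-5xx"])
any_rule = composite_rule(["high-cpu", "high-latency"], operator="OR")
```

Using `AND` is what cuts alert noise: a single metric breaching on its own no longer pages anyone.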
AWS X-Ray
  • Distributed tracing for microservices — visualizes request flow across services
  • Identifies bottlenecks, errors, throttling, and latency hotspots in distributed apps
  • Integrates with Lambda, API Gateway, ECS, EC2 (via daemon)
Service Quotas & Throttling
  • Every AWS service has default quotas (e.g., Lambda concurrency: 1,000)
  • Request quota increases via Service Quotas console before launching high-traffic workloads
  • Standby environments need their own quota increases — they won't share with primary
  • Use AWS Trusted Advisor to identify quota risks proactively
🎯

Trace a slow API call across Lambda + DynamoDB → X-Ray service map. Alert when 3 separate metrics breach thresholds simultaneously → CloudWatch Composite Alarm. DR standby needs same throughput as prod → pre-request quota increases in DR region.

⚠️

Common traps:

  • "CloudWatch monitors applications inside EC2 automatically" — FALSE; by default CloudWatch only gets hypervisor-level metrics (CPU, network, disk I/O). Install the CloudWatch Agent for memory, disk usage, and custom app metrics.
  • "X-Ray works automatically for all AWS services" — FALSE; you must instrument your code with the X-Ray SDK and configure the X-Ray daemon or Lambda layer.
  • "A CloudWatch alarm in INSUFFICIENT_DATA state means a breach" — FALSE; INSUFFICIENT_DATA means not enough data points — it does not trigger alarm actions by default.
  • "CloudWatch Logs retention is infinite by default" — TRUE and often a trap; the default retention setting is Never Expire. You must set a retention policy to avoid unbounded log storage costs.
🚀 RDS Proxy, Legacy Modernization & Reliability Patterns
Reliability
Amazon RDS Proxy
  • Connection pooler between Lambda/app and RDS — prevents connection exhaustion
  • Improves failover time: maintains connections during Multi-AZ failover; app reconnects instantly
  • Integrates with Secrets Manager for credential rotation without app changes
  • Ideal when Lambda functions create many short-lived DB connections (connection storms)
Improving Legacy Applications
  • Add ALB + Auto Scaling in front of legacy monolith without changing app code
  • Put CloudFront in front to cache static assets and reduce origin load
  • Use SQS to absorb burst traffic and smooth load on legacy backend
  • Strangler Fig pattern: gradually replace monolith functionality with microservices behind same domain
🎯

Lambda hitting RDS connection limit → RDS Proxy (pooling). Legacy app can't be changed but needs HA → add ALB + Auto Scaling Group wrapping it. Reduce origin and DB load → CloudFront for static assets (no code changes) + ElastiCache for DB query caching (needs application changes).

⚠️

Common traps:

  • "RDS Proxy works with all RDS database engines" — FALSE; RDS Proxy supports MySQL, PostgreSQL, MariaDB, and SQL Server (plus Aurora MySQL and Aurora PostgreSQL). It does not support Oracle.
  • "RDS Proxy eliminates failover downtime" — FALSE; it reduces failover impact (from ~60s to ~5s) but doesn't eliminate it.
  • "Adding ElastiCache in front of RDS requires no code changes" — FALSE; you must modify application code to check cache before hitting the DB (lazy loading pattern). DAX for DynamoDB is the only cache that works transparently without code changes.
  • "Putting an SQS queue in front of a legacy app always improves performance" — FALSE; SQS adds asynchronous processing which can increase latency for synchronous use cases.

Domain 3 Overview

Select optimal AWS services and configurations for storage, compute, databases, networking, and data pipelines to meet performance requirements efficiently at scale.

⚡ 24% of scored content
📊 Visual Study Guides — Domain 3
Domain 3 — High-Performing Architectures
Domain 3: Designing High-Performing Architectures

High-Performing Architectures — domain guide

5 Pillars of High-Performing AWS Architectures


Task 3.1

High-Performing Storage Solutions

S3, EBS, EFS, FSx — performance characteristics, hybrid storage, scaling.

Knowledge of:
  • Hybrid storage solutions to meet business requirements
  • Storage services with appropriate use cases (for example, Amazon S3, Amazon EFS, Amazon EBS)
  • Storage types with associated characteristics (for example, object, file, block)
Skills in:
  • Determining storage services and configurations that meet performance demands
  • Determining storage services that can scale to accommodate future needs
💿 Storage Type Selection — S3, EBS, EFS, FSx
Storage · High Frequency
Service | Type | Access | Use Case | Throughput
Amazon S3 | Object | HTTP API (any client) | Data lake, backups, static website, media | Effectively unlimited
Amazon EBS | Block | Single EC2 (same AZ) | OS volumes, databases, low-latency random I/O | Up to 256K IOPS (io2 BE)
Amazon EFS | File (NFS) | Thousands of EC2/Lambda across AZs | Shared CMS, home dirs, dev tools, containers | Elastic, bursts to 3+ GB/s
FSx for Windows | File (SMB) | Windows EC2 / on-prem AD | Windows workloads, SQL Server, Active Directory | Up to 2 GB/s
FSx for Lustre | File (parallel) | HPC compute nodes | ML training, genomics, video processing, HPC | Hundreds of GB/s
FSx for NetApp ONTAP | File (multi-protocol) | NFS, SMB, iSCSI | Lift-and-shift enterprise storage apps | High
EBS Volume Types
Type | Class | Max IOPS | Use Case
gp3 | SSD | 16,000 | General purpose — cost-effective for most workloads
io2 Block Express | SSD | 256,000 | Critical DBs, SAP HANA, lowest latency
st1 | HDD | 500 | Throughput-optimized — big data, log processing
sc1 | HDD | 250 | Cold HDD — infrequent access, lowest cost block
🎯

Shared file storage across multiple EC2 → EFS (Linux) or FSx for Windows (Windows). High-IOPS database → EBS io2. HPC / ML training → FSx for Lustre (can link to S3 as data repository). EBS normally attaches to only one EC2 instance in the same AZ; io1/io2 Multi-Attach is the limited exception.

⚠️

Common traps:

  • "EBS volumes can be attached to multiple EC2 instances simultaneously" — FALSE for most types; only io1/io2 with Multi-Attach enabled (same AZ, up to 16 instances, Linux only with cluster-aware file system).
  • "EFS can be mounted on Windows EC2" — FALSE; EFS is NFS-based (Linux only). Use FSx for Windows File Server for Windows.
  • "S3 is a file system" — FALSE; S3 is object storage, not a POSIX-compliant file system. Don't mount it like EFS.
  • "EBS volumes persist if the EC2 is terminated" — FALSE by default for the root volume; it is deleted on termination unless you explicitly uncheck DeleteOnTermination. Data volumes persist by default.
🗂️ S3 Performance & Hybrid Storage
S3 · Hybrid
S3 Performance Patterns
  • S3 automatically scales to 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix
  • Use multiple key prefixes (paths) to parallelize across partitions — no random prefixes needed (post-2018)
  • S3 Transfer Acceleration: Uploads via CloudFront edge → AWS backbone — improves speed over long distances
  • Multipart Upload: Required for files >5 GB; recommended for files >100 MB; enables parallel upload chunks
  • Byte-Range Fetches: Parallelize downloads by fetching chunks simultaneously
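
Byte-range fetches come down to splitting an object into HTTP `Range` headers that can be requested in parallel. The helper below only computes the header values; the function name and the sample sizes are assumptions, and each value would be passed as the `Range` argument of a GetObject call.

```python
def byte_ranges(object_size, chunk_size):
    """Return HTTP Range header values covering an object of object_size bytes."""
    if object_size <= 0 or chunk_size <= 0:
        raise ValueError("sizes must be positive")
    ranges = []
    for start in range(0, object_size, chunk_size):
        end = min(start + chunk_size, object_size) - 1  # Range ends are inclusive
        ranges.append(f"bytes={start}-{end}")
    return ranges


small = byte_ranges(25, 10)
# ['bytes=0-9', 'bytes=10-19', 'bytes=20-24']

MIB = 1024 * 1024
parts = byte_ranges(100 * MIB, 8 * MIB)  # 13 ranges: 12 full chunks plus a 4 MiB tail
```

The same arithmetic applies to planning multipart upload parts, where each part except the last must be at least 5 MiB and an upload can have at most 10,000 parts.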
Hybrid Storage
Service | Use Case
AWS Storage Gateway (S3 File GW) | On-prem NFS/SMB → S3 via local cache appliance
AWS Storage Gateway (Volume GW) | iSCSI block volumes backed by S3/Glacier
AWS Storage Gateway (Tape GW) | Virtual tape library → S3 Glacier (replaces physical tape)
AWS DataSync | High-speed online data transfer: on-prem ↔ S3, EFS, FSx
AWS Snow Family | Offline physical transfer for petabyte-scale or no-internet scenarios
🎯

Upload large objects fast over the internet → S3 Transfer Acceleration + Multipart Upload. On-prem file server → S3 → Storage Gateway File GW. Migrate petabytes with limited bandwidth → Snowball Edge. Ongoing sync → DataSync (faster than S3 CLI, handles metadata).

⚠️

Common traps:

  • "S3 Transfer Acceleration speeds up downloads from S3" — TRUE; it accelerates both uploads and downloads, not just uploads.
  • "You can use random prefixes (hash-based keys) to improve S3 performance" — OUTDATED; this was true pre-2018. S3 now automatically partitions on request rate, so random prefixes are no longer needed.
  • "DataSync is only for one-time migrations" — FALSE; DataSync supports ongoing scheduled synchronization.
  • "Snowball can transfer data to any AWS region" — FALSE; Snowball ships to and from specific AWS regions; not all regions support all Snow devices.
Task 3.2

High-Performing and Elastic Compute Solutions

EC2 instance types, Auto Scaling, serverless, containers, distributed compute.

Knowledge of:
  • AWS compute services with appropriate use cases (for example, AWS Batch, Amazon EMR, AWS Fargate)
  • Distributed computing concepts supported by AWS global infrastructure and edge services
  • Queuing and messaging concepts (for example, publish/subscribe)
  • Scalability capabilities with appropriate use cases (for example, Amazon EC2 Auto Scaling, AWS Auto Scaling)
  • Serverless technologies and patterns (for example, AWS Lambda, Fargate)
  • The orchestration of containers (for example, Amazon ECS, Amazon EKS)
Skills in:
  • Decoupling workloads so that components can scale independently
  • Identifying metrics and conditions to perform scaling actions
  • Selecting the appropriate compute options and features (for example, EC2 instance types) to meet business requirements
  • Selecting the appropriate resource type and size (for example, the amount of Lambda memory) to meet business requirements
🖥️ EC2 Instance Families & Sizing
Compute · High Frequency
Family | Optimized For | Example Types | Use Cases
General Purpose (M, T) | Balanced CPU/memory/network | m7g, t3a | Web servers, dev environments, small DBs
Compute Optimized (C) | High CPU : memory ratio | c7g, c6i | Batch processing, ML inference, video encoding
Memory Optimized (R, X, u) | High memory : CPU ratio | r7i, x2iedn | In-memory DBs, SAP HANA, real-time analytics
Storage Optimized (I, D, H) | High sequential I/O / NVMe | i4i, d3 | NoSQL DBs, data warehousing, distributed file systems
Accelerated (P, G, Inf, Trn) | GPU / custom silicon | p4d, g5, inf2 | ML training, graphics rendering, HPC
Decoupling & Independent Scaling
  • Use SQS between web tier and processing tier — each scales based on its own queue depth or CPU metric
  • SQS queue depth (ApproximateNumberOfMessagesVisible) → scale Auto Scaling Group for workers
  • AWS Batch: Managed batch compute — dynamically provisions EC2/Spot for job queues; no manual cluster management
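
Scaling workers on queue depth is usually framed as "backlog per instance": how many queued messages one worker can clear within the acceptable latency. The function below is a sketch of that calculation; its name, the min/max bounds, and the sample numbers are assumptions for illustration.

```python
import math


def desired_workers(queue_depth, msgs_per_worker_per_sec, max_latency_sec,
                    min_size=1, max_size=20):
    """Workers needed so each one's backlog can be cleared within max_latency_sec."""
    acceptable_backlog = msgs_per_worker_per_sec * max_latency_sec
    needed = math.ceil(queue_depth / acceptable_backlog)
    return max(min_size, min(max_size, needed))  # clamp to the group's bounds


# 1,500 visible messages, 10 msg/s per worker, 10 s acceptable latency
# -> each worker can carry a backlog of 100, so 15 workers are needed.
normal = desired_workers(1500, 10, 10)
idle = desired_workers(0, 10, 10)          # clamped up to min_size
flood = desired_workers(100_000, 10, 10)   # clamped down to max_size
```

Here `queue_depth` stands for the `ApproximateNumberOfMessagesVisible` metric; dividing it by the current worker count gives a per-instance value that a target tracking policy can hold steady.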
🎯

Identify the right instance family from workload description: "in-memory database" → R-family; "high-compute batch jobs" → C-family; "ML training with GPUs" → P/G-family; "genomics high I/O" → I-family. Lambda memory directly controls allocated CPU too.

⚠️

Common traps:

  • "T-family instances always deliver full CPU performance" — FALSE; T instances use a CPU credit model. In standard mode they are throttled to their baseline once credits run out under sustained load; unlimited mode (the default on T3/T3a/T4g) lets them keep bursting at extra cost.
  • "Larger instance size always means better performance" — FALSE; a memory-optimized R-family is better for memory-bound workloads than a larger compute C-family.
  • "AWS Batch requires EC2 instances you manage" — FALSE; AWS Batch can use Fargate as the compute environment for serverless job execution.
  • "Spot Instances can be used for RDS" — FALSE; RDS does not offer Spot pricing. Spot applies to EC2 capacity and the services that run on it (for example ECS, EKS, EMR, and Batch), plus Fargate Spot.
Task 3.3

High-Performing Database Solutions

RDS, Aurora, DynamoDB, ElastiCache, database selection and architecture.

Knowledge of:
  • AWS global infrastructure (for example, Availability Zones, AWS Regions)
  • Caching strategies and services (for example, Amazon ElastiCache)
  • Data access patterns (for example, read-intensive compared with write-intensive)
  • Database capacity planning (for example, capacity units, instance types, Provisioned IOPS)
  • Database connections and proxies
  • Database engines with appropriate use cases (for example, heterogeneous migrations, homogeneous migrations)
  • Database replication (for example, read replicas)
  • Database types and services (for example, serverless, relational compared with non-relational, in-memory)
Skills in:
  • Configuring read replicas to meet business requirements
  • Designing database architectures
  • Determining an appropriate database engine (for example, MySQL compared with PostgreSQL)
  • Determining an appropriate database type (for example, Amazon Aurora, Amazon DynamoDB)
  • Integrating caching to meet business requirements
🗄️ Database Selection — Relational vs. NoSQL vs. Specialty
Database · High Frequency
Service | Type | Strengths | Best For
Amazon RDS | Relational (OLTP) | Managed MySQL/PostgreSQL/Oracle/SQL Server | Traditional apps, ERP, CRM, e-commerce
Amazon Aurora | Relational (OLTP) | Up to 5× MySQL / 3× PostgreSQL throughput; 6-copy replication; Global DB | High-throughput relational workloads
Amazon DynamoDB | NoSQL (key-value / document) | Millisecond latency, serverless, virtually unlimited scale | Session stores, gaming, IoT, e-commerce cart
Amazon Redshift | Columnar (OLAP) | Petabyte-scale data warehouse; Redshift Spectrum queries S3 | Analytics, BI, large-scale reporting
Amazon Neptune | Graph | Billions of relationships; Gremlin / SPARQL | Social networks, fraud detection, knowledge graphs
Amazon ElastiCache | In-memory | Sub-millisecond; Redis or Memcached | Caching, sessions, leaderboards
Amazon Keyspaces | Wide-column (Cassandra) | Serverless Cassandra-compatible | Migrating Cassandra workloads
Aurora Performance Highlights
  • Storage auto-grows in 10 GB increments up to 128 TB — no pre-provisioning
  • Up to 15 read replicas, with replica lag typically well under 100 ms
  • Aurora Serverless v2: Scales in fine-grained ACU increments; ideal for variable/unpredictable workloads
  • Aurora Global Database: Replicates to up to 5 secondary Regions with typical lag under 1 s; a secondary Region can usually be promoted in under 1 min
DynamoDB Capacity Modes
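Attribute | Provisioned | On-Demand
Billing | Per RCU/WCU provisioned | Per request (pay per read/write)
Scaling | Auto Scaling adjusts within limits | Adapts automatically to traffic with no capacity planning (a sudden spike beyond roughly double the previous peak may be throttled briefly)
Best for | Predictable traffic; cost optimization | Unknown or spiky traffic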
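For Provisioned mode, capacity planning comes down to a little arithmetic. The sketch below assumes DynamoDB's published sizing rules (one RCU covers one strongly consistent read per second of an item up to 4 KB, or two eventually consistent reads; one WCU covers one write per second of an item up to 1 KB). The function names are illustrative, and transactional requests, which cost more, are not modeled.

```python
import math


def read_capacity_units(reads_per_sec: int, item_kb: float,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed: item size is rounded up to the next 4 KB block."""
    units_per_read = math.ceil(item_kb / 4)
    total = reads_per_sec * units_per_read
    # Eventually consistent reads cost half as much.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(writes_per_sec: int, item_kb: float) -> int:
    """WCUs needed: item size is rounded up to the next 1 KB block."""
    return writes_per_sec * math.ceil(item_kb)


print(read_capacity_units(100, 6))                             # 200
print(read_capacity_units(100, 6, strongly_consistent=False))  # 100
print(write_capacity_units(50, 2.5))                           # 150
```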
🎯

"Need to join tables, ACID transactions" → RDS/Aurora. "Millisecond latency at any scale, no schema" → DynamoDB. "Run SQL analytics on S3 data lake" → Redshift Spectrum or Athena. "Graph relationships" → Neptune. "Migrate Oracle → AWS" → RDS for Oracle or Aurora PostgreSQL with SCT/DMS.

⚠️

Common traps:

  • "Aurora is just a managed MySQL" — FALSE; Aurora has a completely different storage layer (distributed, 6 copies, auto-growing) with 5× MySQL performance.
  • "DynamoDB supports complex multi-table joins" — FALSE; DynamoDB is a NoSQL key-value/document store with no native JOIN support. Design your data model to avoid joins (single-table design).
  • "Aurora Serverless v2 scales to zero" — FALSE; Aurora Serverless v2 scales down to 0.5 ACU minimum, not zero. v1 could scale to zero.
  • "Redshift is used for OLTP workloads" — FALSE; Redshift is a columnar OLAP data warehouse optimized for analytics queries, not transactional workloads.
📈 Database Performance — Read Replicas, Proxies & Caching
Database · High Frequency
Read Replicas
  • Asynchronous replication from primary to replica(s)
  • RDS: up to 15 read replicas for MySQL, MariaDB, and PostgreSQL (5 for Oracle and SQL Server); Aurora: up to 15
  • Point reporting/analytics workloads to read replicas — reduce primary load
  • Can promote to standalone DB (breaks replication) for DR or migration
  • Cross-region read replicas: lower latency for global users + DR capability
Connection Management
  • RDS Proxy: Pools and multiplexes connections — critical for Lambda → RDS (Lambda can open thousands of concurrent connections)
  • Reduces DB failover impact: proxy maintains connections, apps reconnect through proxy seamlessly
Caching with ElastiCache
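  • Read-heavy workloads: Cache frequently queried DB results to offload a large share of reads from RDS (the saving depends on the cache hit ratio)
  • Redis Cluster mode: Horizontal sharding across node groups for datasets or write throughput that outgrow a single node
  • Redis replication group (cluster mode disabled): Primary + replicas for HA with Multi-AZ automatic failover (ElastiCache manages failover itself; it does not use Redis Sentinel)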
🎯

Scale reads → Read replicas + route app reads to replica endpoint. Lambda connection storms → RDS Proxy. Offload repetitive read queries → ElastiCache lazy loading. DynamoDB hot-partition reads → DAX (not ElastiCache; DAX is DynamoDB-specific and API-compatible, so it needs only a client swap rather than caching logic in the application).
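Lazy loading (cache-aside) is easy to picture in code: check the cache, fall back to the database on a miss, then store the result with a TTL. The sketch below uses an in-process dictionary as a stand-in for Redis so it runs anywhere; `LazyCache`, `load_from_db`, and the TTL value are hypothetical names chosen for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class LazyCache:
    """Cache-aside wrapper: populate the cache only when a key is requested."""

    def __init__(self, load_from_db: Callable[[str], Any], ttl_seconds: float = 300.0):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (expiry, value)

    def get(self, key: str) -> Any:
        entry = self._store.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]                      # cache hit: no database call
        value = self._load(key)                  # cache miss: read from the database
        self._store[key] = (time.monotonic() + self._ttl, value)
        return value


db_calls = []


def fake_db_lookup(key: str) -> str:
    """Stand-in for an RDS query; records each call so hits are visible."""
    db_calls.append(key)
    return key.upper()


cache = LazyCache(fake_db_lookup, ttl_seconds=60)
cache.get("user:1")
cache.get("user:1")
print(len(db_calls))  # 1
```

The trade-off to remember for the exam: lazy loading can serve stale data until the TTL expires, and the first request for each key always pays the database round trip.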

⚠️

Common traps:

  • "Read replicas provide synchronous replication for zero data loss" — FALSE; read replicas use asynchronous replication. There can be replication lag. Multi-AZ uses synchronous replication.
  • "Promoting a read replica to primary breaks the existing primary" — FALSE; promoting creates a standalone DB. The original primary continues to run independently.
  • "RDS Proxy supports all RDS engines including Oracle" — FALSE; RDS Proxy supports MySQL, MariaDB, PostgreSQL, and SQL Server (plus Aurora MySQL and Aurora PostgreSQL), but not Oracle.
  • "ElastiCache Redis cluster mode disabled means no HA" — FALSE; you can still have replication groups (primary + replicas) with auto-failover without cluster mode. Cluster mode adds sharding.
Task 3.4

High-Performing Network Architectures

CloudFront, Global Accelerator, PrivateLink, Transit Gateway, VPN, network topology design.

Knowledge of:
  • Edge networking services with appropriate use cases (for example, Amazon CloudFront, AWS Global Accelerator)
  • How to design network architecture (for example, subnet tiers, routing, IP addressing)
  • Load balancing concepts (for example, Application Load Balancer)
  • Network connection options (for example, AWS VPN, AWS Direct Connect, AWS PrivateLink)
  • Data analytics and visualization services with appropriate use cases (for example, Amazon Athena, AWS Lake Formation, Amazon QuickSight)
  • Data ingestion patterns (for example, frequency)
  • Data transfer services with appropriate use cases (for example, AWS DataSync, AWS Storage Gateway)
  • Data transformation services with appropriate use cases (for example, AWS Glue)
  • Secure access to ingestion access points
  • Sizes and speeds needed to meet business requirements
  • Streaming data services with appropriate use cases (for example, Amazon Kinesis)
Skills in:
  • Creating a network topology for various architectures (for example, global, hybrid, multi-tier)
  • Determining network configurations that can scale to accommodate future needs
  • Determining the appropriate placement of resources to meet business requirements
  • Selecting the appropriate load balancing strategy
  • Building and securing data lakes
  • Designing data streaming architectures
  • Designing data transfer solutions
  • Implementing visualization strategies
  • Selecting appropriate compute options for data processing (for example, Amazon EMR)
  • Selecting appropriate configurations for ingestion
  • Transforming data between formats (for example, .csv to .parquet)
🌐 Edge Acceleration — CloudFront & Global Accelerator
Edge · High Frequency
Attribute | Amazon CloudFront | AWS Global Accelerator
Protocol | HTTP/HTTPS (content delivery) | TCP/UDP (any protocol)
Caching | ✅ Edge caches content | ❌ Routes packets, no caching
IP addresses | Dynamic (DNS-based) | ✅ 2 static Anycast IPs
Routing | Nearest edge pop (content) | AWS backbone → nearest region endpoint
Use case | CDN: websites, video, APIs, S3 static | Gaming, IoT, VoIP, real-time apps needing static IPs
Network Connection Options
Service | Use Case
AWS PrivateLink | Expose services privately to other VPCs/accounts without VPC peering or internet; uses interface endpoints
VPC Peering | Direct private connectivity between 2 VPCs (same or different account/region); non-transitive
AWS Transit Gateway | Hub-and-spoke: connect 100s of VPCs + on-prem through one gateway; supports transitive routing
AWS Site-to-Site VPN | IPsec-encrypted tunnel over the public internet from on-prem to VPC; minutes to set up
🎯

Static IP requirement + non-HTTP → Global Accelerator. Cache web content globally → CloudFront. Connect many VPCs at scale → Transit Gateway (not peering; peering doesn't scale and has no transitive routing). Expose SaaS privately → PrivateLink.

⚠️

Common traps:

  • "VPC peering allows transitive routing — traffic from VPC A can reach VPC C via VPC B" — FALSE; VPC peering is non-transitive. Use Transit Gateway for hub-and-spoke.
  • "Global Accelerator caches content at edge locations" — FALSE; it routes traffic via the AWS backbone to the nearest healthy endpoint — no caching.
  • "PrivateLink requires VPC peering" — FALSE; PrivateLink uses interface endpoints independent of peering.
  • "CloudFront can only serve content from S3" — FALSE; CloudFront supports any HTTP origin including ALBs, EC2 instances, on-prem web servers, and API Gateway — S3 is just the most common static origin.
🕸️ Network Topology — Subnets, Routing & Placement
VPC · Medium Frequency
Multi-Tier Subnet Design
  • Public tier: ALB, NAT GW, bastion hosts — has internet route via IGW
  • Application tier: EC2, ECS tasks — private, outbound via NAT GW
  • Data tier: RDS, ElastiCache — private, no outbound internet access
  • Spread each tier across ≥2 AZs for HA — 6 subnets minimum for a 3-tier, 2-AZ design
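A quick way to sanity-check the 3-tier, 2-AZ layout is to carve the subnets with Python's standard `ipaddress` module. The VPC CIDR, the /24 subnet size, and the tier and AZ labels below are assumptions picked for the illustration.

```python
import ipaddress
from typing import Dict, Sequence


def plan_subnets(vpc_cidr: str,
                 tiers: Sequence[str] = ("public", "app", "data"),
                 azs: Sequence[str] = ("a", "b"),
                 new_prefix: int = 24) -> Dict[str, str]:
    """Assign one non-overlapping subnet per (tier, AZ) pair from the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of consecutive subnets
    plan = {}
    for tier in tiers:
        for az in azs:
            # Raises StopIteration if the VPC range is too small for the plan.
            plan[f"{tier}-{az}"] = str(next(blocks))
    return plan


plan = plan_subnets("10.0.0.0/16")
print(len(plan))       # 6
print(plan["data-b"])  # 10.0.5.0/24
```

Leaving most of the /16 unallocated, as this plan does, is deliberate: it keeps room for extra tiers or a third AZ without needing a secondary CIDR.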
Scaling Network Capacity
  • CIDR sizing: plan subnets large enough for future growth — you can't resize a VPC CIDR, only add secondary CIDRs
  • VPC secondary CIDR blocks: extend IP space without recreating the VPC
  • Placement Groups: Cluster (low latency, same rack) / Spread (max isolation) / Partition (HDFS, Cassandra)
🎯

HPC requiring low-latency between instances → Cluster Placement Group (single AZ, same rack). Maximize instance isolation for HA → Spread Placement Group. HDFS/Cassandra large clusters → Partition Placement Group.

⚠️

Common traps:

  • "A public subnet automatically gives EC2 instances a public IP" — FALSE; EC2 gets a public IP only if the subnet's auto-assign public IP setting is enabled OR you explicitly associate an EIP.
  • "You can resize a VPC CIDR block" — FALSE; you cannot modify the primary CIDR. Add secondary CIDR blocks to extend address space.
  • "Cluster Placement Groups span multiple AZs for better HA" — FALSE; Cluster Placement Groups are within a single AZ (designed for performance, not HA). Use Spread Placement Groups across AZs for HA.
  • "Private subnets cannot reach the internet" — FALSE; private subnets can reach the internet for outbound traffic via a NAT Gateway in a public subnet.
Task 3.5

High-Performing Data Ingestion & Transformation

Kinesis, Glue, Athena, Lake Formation, EMR, DataSync — data pipelines and analytics.

🌊 Streaming Data — Kinesis Family
Streaming · High Frequency
Service | Purpose | Key Facts
Kinesis Data Streams (KDS) | Real-time custom ingestion | Shards: 1 MB/s in, 2 MB/s out per shard; 24-hour default retention (extendable up to 365 days)
Amazon Data Firehose | Managed delivery to destinations | Fully managed; delivers to S3, Redshift, OpenSearch, Splunk; no consumers to manage
Kinesis Video Streams | Video ingestion & playback | Ingest video from devices; ML processing
Amazon MSK | Managed Apache Kafka | Lift-and-shift Kafka workloads; standard Kafka API
Pattern

IoT devices → Kinesis Data Streams (real-time processing by Lambda) → transform → Data Firehose → S3 data lake → Athena for ad-hoc SQL queries → QuickSight for dashboards.

graph LR
  IoT["IoT Devices /\nApp Events"] -->|"records"| KDS["Kinesis Data Streams\n(shards, 24hr retention)"]
  KDS -->|"real-time"| Lambda["Lambda\n(transform/filter)"]
  Lambda -->|"processed"| KDF["Data Firehose\n(managed delivery)"]
  KDF -->|"batched"| S3["S3 Data Lake"]
  S3 -->|"SQL"| Athena["Athena\n(pay per TB scanned)"]
  Athena --> QS["QuickSight\n(dashboards)"]
🎯

Real-time custom processing → KDS. Managed delivery without consumer management → Data Firehose. Existing Kafka infrastructure → Amazon MSK. Firehose can't replay data; KDS can (within retention window).
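Shard sizing follows directly from the per-shard limits quoted in the table (1 MB/s in, 2 MB/s out). A minimal sketch, with an illustrative function name; it ignores the separate per-shard limit on records per second, so treat the result as a lower bound.

```python
import math


def required_shards(ingest_mb_per_sec: float, egress_mb_per_sec: float) -> int:
    """Smallest shard count that satisfies both throughput limits."""
    by_ingest = math.ceil(ingest_mb_per_sec / 1.0)  # 1 MB/s write per shard
    by_egress = math.ceil(egress_mb_per_sec / 2.0)  # 2 MB/s read per shard
    return max(1, by_ingest, by_egress)


# 5 MB/s of producers, and consumers that together read 14 MB/s.
print(required_shards(5, 14))  # 7
```

Here the read side drives the answer, which is the usual reason a stream with several consumers needs more shards than its ingest rate alone suggests.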

⚠️

Common traps:

  • "KDS and Firehose are interchangeable" — FALSE; KDS requires custom consumer code; Firehose is managed delivery to a fixed set of destinations.
  • "Adding Kinesis shards reduces read latency" — FALSE; shards increase throughput (MB/s), not latency.
  • "MSK replaces SQS" — FALSE; MSK is Managed Kafka for high-throughput streaming; SQS is a simpler decoupled queue service.
  • "Kinesis default retention is 24 hours" — TRUE; the default retention period for Kinesis Data Streams is 24 hours. It can be raised to 7 days with extended retention, or up to 365 days with long-term retention, both at additional cost. There is no "7-day default"; that is a common misconception.
🔬 Data Lakes, ETL & Analytics — Glue, Athena, EMR, Lake Formation
Analytics · Medium Frequency
Service | Role | Key Facts
AWS Glue | Serverless ETL | Crawlers catalog S3 data; Glue jobs transform and load; Python/Spark
Amazon Athena | Serverless SQL on S3 | Pay per query (per TB scanned); use Parquet/ORC to reduce cost 10×
Amazon EMR | Managed Hadoop/Spark | Big data processing; Spot Instances for task nodes save 60–90% (keep the master node On-Demand)
AWS Lake Formation | Data lake governance | Centralized permissions on S3 data lake; column/row-level security
Amazon Redshift | Data warehouse | Columnar storage; Spectrum: query S3 directly without loading
Amazon QuickSight | BI / visualization | Serverless; SPICE in-memory engine; ML insights
Format Optimization
  • Convert CSV → Parquet or ORC before querying with Athena — columnar formats reduce data scanned by 10–100×
  • Partition S3 data by date/region/category — Athena skips entire partitions when WHERE clause matches
  • AWS Glue can automate CSV → Parquet conversion in ETL pipelines
graph LR
  Raw["S3\nRaw CSV/JSON"] -->|"Glue Crawler"| Cat["Glue Data Catalog\n(schema discovery)"]
  Raw -->|"Glue ETL Job"| Parquet["S3\nParquet/ORC\n(10× cheaper queries)"]
  Parquet -->|"SQL queries"| Athena["Athena\n(serverless)"]
  Parquet -->|"COPY"| RS["Redshift\n(data warehouse)"]
  Athena --> QS["QuickSight\n(dashboards)"]
  RS --> QS
  Cat --> Athena
🎯

Serverless SQL on S3 → Athena. Managed Spark/Hadoop big data → EMR. Serverless ETL → Glue. Fine-grained data lake permissions → Lake Formation. BI dashboards → QuickSight. Athena cost: convert to Parquet + partition = massive savings.
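To see why Parquet plus partitioning matters, it helps to put numbers on "pay per TB scanned". The sketch below is a rough estimate only: the $5 per TB price is an assumption (check current regional pricing), the 10× columnar reduction is the guide's lower-bound figure, and the fraction of partitions read is a made-up example value.

```python
def scanned_tb_after_optimization(raw_tb: float,
                                  columnar_reduction: float,
                                  partitions_read_fraction: float) -> float:
    """Estimate data scanned once the data is columnar and partition-pruned."""
    return raw_tb / columnar_reduction * partitions_read_fraction


def athena_query_cost(scanned_tb: float, price_per_tb_usd: float = 5.0) -> float:
    """Cost of one query; price_per_tb_usd is an assumed list price."""
    return scanned_tb * price_per_tb_usd


raw_cost = athena_query_cost(2.0)  # full scan of 2 TB of CSV
optimized_tb = scanned_tb_after_optimization(2.0, columnar_reduction=10,
                                             partitions_read_fraction=0.25)
print(f"CSV full scan:       ${raw_cost:.2f}")
print(f"Parquet + partition: ${athena_query_cost(optimized_tb):.2f}")
```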

⚠️

Common traps:

  • "AWS Glue is a data warehouse" — FALSE; Glue is serverless ETL and a data catalog. Redshift is the data warehouse.
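  • "Athena can directly query DynamoDB tables" — FALSE; Athena natively queries data in S3. To query DynamoDB data, export the table to S3 first, use the Athena Federated Query DynamoDB connector (which runs in Lambda), or use PartiQL within DynamoDB.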
  • "Lake Formation replaces S3 as storage" — FALSE; Lake Formation is a governance/permissions layer — data still lives in S3.
  • "EMR master node can run on Spot to save cost" — FALSE; master node interruption kills the entire cluster. Use On-Demand for master; Spot is safe only for task nodes.

Domain 4 Overview

Design architectures that deliver required capabilities at the lowest cost. Covers storage tiering, compute purchasing options, database cost optimization, and network cost reduction strategies.

⚡ 20% of scored content
📊 Visual Study Guides — Domain 4
[Figure: Domain 4: Designing Cost-Optimized Architectures (domain guide)]

[Figure: 4 Pillars of Cost Optimization Strategy]

Task 4.1

Cost-Optimized Storage Solutions

S3 tiers, lifecycle policies, EBS optimization, storage tool selection, data transfer costs.

📦 S3 Storage Classes & Lifecycle Cost Optimization
Storage Cost · High Frequency
Storage Class | Retrieval | Min Duration | Use Case
S3 Standard | Instant, ms | None | Frequently accessed data
S3 Intelligent-Tiering | Instant (frequent tier) | None | Unknown or changing access patterns
S3 Standard-IA | Instant, ms | 30 days | Infrequent access, rapid retrieval (backups)
S3 One Zone-IA | Instant, ms | 30 days | IA data that can be recreated if AZ fails
S3 Glacier Instant Retrieval | Instant, ms | 90 days | Archives accessed once/quarter
S3 Glacier Flexible Retrieval | 1–12 hours | 90 days | Compliance archives, not time-sensitive
S3 Glacier Deep Archive | 12–48 hours | 180 days | Lowest cost; regulatory long-term archives
graph LR
  Std["S3 Standard\nFrequent access"] -->|"≥30 days"| IA["Standard-IA\nInfrequent access"]
  Std -->|"auto-managed"| IT["Intelligent-Tiering\nUnknown patterns"]
  IA -->|"≥90 days"| GIR["Glacier Instant\nQuarterly access"]
  GIR -->|"≥90 days"| GFR["Glacier Flexible\n1–12hr retrieval"]
  GFR -->|"≥180 days"| DA["Deep Archive\n12–48hr retrieval\nLowest cost"]
Cost Reduction Strategies
  • Lifecycle Policies: Auto-transition objects to cheaper tiers based on age — set-and-forget cost savings
  • S3 Intelligent-Tiering: AWS monitors access and automatically moves objects between tiers; small monitoring fee per object
  • Requester Pays: Buckets where the requester (not bucket owner) pays data transfer and request costs — ideal for public datasets
  • Batch uploads: Aggregate small objects before upload — reduces per-request costs vs. many individual PUTs
  • Cost Allocation Tags: Tag S3 buckets by team/project for granular billing breakdown
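A lifecycle policy is just a small JSON-style document. The sketch below builds one as a Python dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts; the rule ID, the `logs/` prefix, and the day thresholds are illustrative assumptions, and the actual API call is left as a comment so the example runs without AWS credentials.

```python
def build_lifecycle_configuration(prefix: str = "logs/") -> dict:
    """Return a lifecycle configuration that tiers objects down as they age."""
    return {
        "Rules": [
            {
                "ID": "tier-down-by-age",          # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},        # Flexible Retrieval
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


config = build_lifecycle_configuration()
# With credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-example-bucket", LifecycleConfiguration=config)
for step in config["Rules"][0]["Transitions"]:
    print(step["Days"], step["StorageClass"])
```

Day counts are measured from object creation, so each later transition must use a larger number than the one before it.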
Block Storage Cost Optimization
  • Right-size EBS volumes — don't over-provision; use CloudWatch to identify underutilized volumes
  • Delete unattached EBS volumes (common cost leak)
  • Use gp3 instead of gp2 — gp3 is 20% cheaper and lets you set IOPS independently
  • Use st1 (HDD) for sequential large file workloads — much cheaper than SSD for throughput-bound access
  • EBS Snapshots: incremental; store only changed blocks; use Data Lifecycle Manager to automate retention
🎯

Unknown access patterns → S3 Intelligent-Tiering (automated, no retrieval penalty on frequent tier). Long-term compliance archive, lowest cost → Glacier Deep Archive. gp2 vs gp3 → always prefer gp3 (cheaper, independent IOPS tuning). Unattached EBS = wasted spend — Trusted Advisor flags these.

⚠️

Common traps:

  • "S3 Intelligent-Tiering has retrieval fees" — FALSE; no retrieval fees, only a small per-object monitoring charge (~$0.0025 per 1,000 objects per month).
  • "Standard-IA is always cheaper than Standard" — FALSE; Standard-IA charges a per-GB retrieval fee — for frequently read data it is MORE expensive than Standard.
  • "Glacier Deep Archive and Glacier Flexible Retrieval have the same retrieval time" — FALSE; Deep Archive = 12–48 hours; Flexible Retrieval = 1–12 hours.
  • "S3 Lifecycle rules can move objects back to warmer tiers" — FALSE; lifecycle only transitions to colder tiers. Manual copy is needed to move back to Standard.
🚚 Data Transfer Costs & Migration Tools
Cost · Transfer
Data Transfer Cost Rules
  • Inbound to AWS = always FREE (upload to S3, Direct Connect inbound)
  • Same Region, same AZ, EC2 → EC2 private IP = FREE
  • Cross-AZ traffic = $0.01/GB each direction — minimize by keeping tiers in same AZ when possible
  • Cross-Region transfer = varies by region; significant cost at scale
  • Internet egress = $0.09/GB (first 10 TB/month from most regions)
  • CloudFront egress = cheaper than direct S3/EC2 internet egress + reduces origin requests
Lowest-Cost Transfer Methods
Scenario | Best Tool | Why
Small regular transfers to S3 | AWS CLI / SDK | No overhead
Large ongoing sync (on-prem ↔ AWS) | AWS DataSync | 10× faster than rsync; handles metadata
Petabytes, limited bandwidth | AWS Snowball Edge | Physical device; free inbound after delivery
Exabytes | AWS Snowmobile | Truck-sized data transfer unit
Ongoing large files, transfer acceleration | S3 Transfer Acceleration | Uses CloudFront edge network backbone
🎯

Use VPC endpoints (Gateway for S3/DynamoDB) — eliminate NAT Gateway data processing charges for S3 traffic from EC2 in private subnets. A common exam answer to "reduce data transfer costs for S3."

⚠️

Common traps:

  • "EC2 → S3 in the same region is always free" — FALSE; traffic through a NAT Gateway incurs data processing charges even within the same region. Use a free Gateway VPC Endpoint to avoid it.
  • "DataSync and Transfer Family serve the same purpose" — FALSE; DataSync is for automated bulk migration; Transfer Family provides managed SFTP/FTP endpoints for ongoing partner file exchange.
  • "Snow device data ingestion to AWS is charged" — FALSE; data loading after device return is free. Only device rental and shipping are charged.
  • "Transfer Acceleration always speeds up uploads" — FALSE; AWS only charges you if acceleration is actually faster — if not beneficial, the transfer is not accelerated and not charged.
Task 4.2

Cost-Optimized Compute Solutions

Purchasing options, instance right-sizing, serverless vs. EC2, load balancing strategy.

💰 EC2 Purchasing Options — On-Demand, Reserved, Spot, Savings Plans
Cost · Very High Frequency
Option | Discount vs On-Demand | Commitment | Interruption | Best For
On-Demand | Baseline (no discount) | None | None | Unpredictable, short-term, dev/test
Reserved Instances (1-yr) | Up to 40% | 1 year | None | Steady-state, predictable workloads
Reserved Instances (3-yr) | Up to 72% | 3 years | None | Long-term committed workloads
Savings Plans (Compute) | Up to 66% | 1 or 3 yr ($/hr spend) | None | Flexible: any instance family, region, OS
Spot Instances | Up to 90% | None | ✅ 2-min notice | Fault-tolerant, stateless, batch, CI/CD
Dedicated Hosts | Higher cost | On-Demand or Reserved | None | BYOL (per-socket/per-core licensing)
AWS Outposts | n/a | 3–5 yr | None | On-prem workloads needing AWS APIs + low latency
Spot Best Practices
  • Use Spot for stateless, fault-tolerant workloads: batch jobs, CI/CD agents, ML training, video encoding
  • Spot Fleet / EC2 Fleet: Automatically diversifies across instance types and AZs to maintain target capacity
  • Use hibernate option to preserve instance state on interruption
  • Mix On-Demand (baseline) + Spot (burst) in Auto Scaling Groups for cost + availability balance
Right-Sizing Strategies
  • AWS Compute Optimizer: ML-based recommendations for EC2, Lambda, EBS, ECS on Fargate
  • AWS Trusted Advisor: flags low-utilization EC2 instances (<10% CPU for 4+ days)
  • Use T-family burstable instances for workloads with low baseline + occasional spikes
graph TD
  Q1{"Workload type?"}
  Q1 -->|"Fault-tolerant / batch / CI-CD"| Spot["Spot Instances\nUp to 90% off\n2-min interruption notice"]
  Q1 -->|"Steady-state production"| Q2{"Commitment?"}
  Q1 -->|"Short-term / unpredictable / dev"| OD["On-Demand\nNo commitment, full price"]
  Q2 -->|"1-3 yr, any family/region"| SP["Compute Savings Plans\nUp to 66% off"]
  Q2 -->|"1-3 yr, fixed type"| RI["Reserved Instances\nUp to 72% off"]
  Q2 -->|"BYOL per-socket licensing"| DH["Dedicated Hosts\nPhysical server control"]
🎯

Steady, always-on prod workloads → Reserved or Compute Savings Plans. Maximum savings + can tolerate interruption → Spot. BYOL Oracle/Windows → Dedicated Hosts. Mix for ASG → On-Demand base capacity + Spot for scaling. Most flexible discount → Compute Savings Plans (applies to Lambda and Fargate too).
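The purchasing decision can be drilled as a tiny rule function. This is a study aid rather than an official AWS decision procedure: the parameter names are invented, and checking the BYOL requirement first is a simplification chosen for the sketch.

```python
def pick_purchase_option(fault_tolerant: bool,
                         steady_state: bool,
                         needs_byol_host: bool = False,
                         fixed_instance_type: bool = False) -> str:
    """Map workload traits to the purchasing option this guide recommends."""
    if needs_byol_host:
        return "Dedicated Hosts"        # per-socket/per-core licensing
    if fault_tolerant:
        return "Spot Instances"         # can survive a 2-minute interruption notice
    if not steady_state:
        return "On-Demand"              # short-term or unpredictable
    # Steady-state: commit for 1-3 years; flexibility decides which discount.
    return "Reserved Instances" if fixed_instance_type else "Compute Savings Plans"


print(pick_purchase_option(fault_tolerant=True, steady_state=False))
print(pick_purchase_option(fault_tolerant=False, steady_state=True))
```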

⚠️

Common traps:

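  • "Standard RIs apply to any size within the same instance family" — NOT ALWAYS; instance size flexibility applies only to Regional RIs for Linux/Unix with default tenancy. Zonal RIs and other platforms are locked to the exact instance size. Changing instance family, OS, or tenancy requires a Convertible RI.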
  • "Compute Savings Plans cover RDS" — FALSE; they cover EC2, Lambda, and Fargate only. RDS has its own Reserved Instance program.
  • "Dedicated Instances and Dedicated Hosts are equivalent" — FALSE; Dedicated Hosts give physical server-level control needed for BYOL licensing; Dedicated Instances just run on dedicated hardware without host-level visibility.
  • "Spot interruptions always terminate instances" — FALSE; depending on launch configuration, the instance can be stopped or hibernated instead of terminated.
⚖️ Serverless vs. EC2 Cost Trade-offs & Load Balancer Selection
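Cost
Serverless vs. EC2
Attribute | Lambda / Fargate | EC2
Cost model | Lambda: pay per request + duration; Fargate: pay per vCPU and GB of memory per second while tasks run | Pay per second or hour while running (a stopped instance stops compute billing but still pays for attached EBS storage)
Idle cost | $0 for Lambda when not invoked; Fargate bills for any task left running | Full instance cost while running
Best for | Sporadic, event-driven, variable scale | Steady high-throughput, long-running, GPU workloads
Break-even (rule of thumb) | <~50% utilization favors serverless | >~50% utilization favors EC2 + Reserved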
Load Balancer Cost Comparison
  • ALB: LCU-based pricing (connections, bandwidth, rules, new connections) — most cost-effective for HTTP/S at moderate scale
  • NLB: LCU-based but for TCP/UDP — lower cost for simple TCP load balancing than ALB
  • Classic LB: Legacy; more expensive per feature than ALB/NLB — migrate away
🎯

Low-traffic event-driven API → Lambda + API GW (zero idle cost). High-traffic 24/7 API → EC2 + Reserved Instances (predictable cost). Production + non-production same account → tag-based cost allocation; separate non-prod to dev account with separate budgets.
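The Lambda versus EC2 trade-off becomes concrete once you price a month of invocations. The sketch below assumes commonly quoted x86 list prices ($0.20 per million requests and $0.0000166667 per GB-second) and ignores the free tier; both prices and the $60 per month EC2 figure are assumptions to replace with current numbers for your Region.

```python
def lambda_monthly_cost(requests_per_month: int,
                        avg_duration_ms: float,
                        memory_mb: int,
                        price_per_million_requests: float = 0.20,
                        price_per_gb_second: float = 0.0000166667) -> float:
    """Estimate a month of Lambda charges (request fee plus compute fee)."""
    gb_seconds = requests_per_month * (avg_duration_ms / 1000) * (memory_mb / 1024)
    request_fee = requests_per_month / 1_000_000 * price_per_million_requests
    return request_fee + gb_seconds * price_per_gb_second


EC2_MONTHLY_USD = 60.0  # hypothetical always-on instance, for comparison only

for requests in (1_000_000, 100_000_000):
    cost = lambda_monthly_cost(requests, avg_duration_ms=200, memory_mb=512)
    cheaper = "Lambda" if cost < EC2_MONTHLY_USD else "EC2"
    print(f"{requests:>11,} requests: Lambda ${cost:,.2f} -> {cheaper} is cheaper")
```

Under these assumptions the low-traffic case costs a couple of dollars on Lambda, while the high-traffic case costs several times the fixed instance price, which is the pattern the exam expects you to recognize.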

⚠️

Common traps:

  • "Lambda is always cheaper than EC2" — FALSE; at sustained high invocation rates Lambda cost exceeds a Reserved EC2 instance. Break-even is roughly 50% average utilization.
  • "ALB and NLB pricing is identical" — FALSE; ALB LCUs factor in rule evaluations and new connections; NLB LCUs factor in flows and bandwidth. At high concurrent TCP connection counts, NLB is typically cheaper.
  • "Classic Load Balancer is acceptable for new architectures" — FALSE; CLB is legacy with no new feature development — always use ALB (L7) or NLB (L4).
  • "Lambda concurrency scales infinitely" — FALSE; account-level concurrency limit is 1,000 by default per region; unreserved concurrency is shared across all functions.
Task 4.3

Cost-Optimized Database Solutions

Database right-sizing, caching, backup policies, serverless databases, migration for cost savings.

🗄️ Database Cost Levers — Sizing, Caching & Serverless
DB Cost · Medium Frequency
Cost Reduction Strategies
Strategy | How | Savings
Caching | ElastiCache/DAX in front of RDS/DynamoDB reduces read load, allowing a smaller DB instance | High
Read Replicas | Offload analytics to a replica, then downsize the primary instance | Medium
Aurora Serverless v2 | Pay per ACU consumed; scales down to a 0.5 ACU minimum rather than to zero | High for variable workloads
DynamoDB On-Demand → Provisioned | If traffic is predictable, provisioned is cheaper | Medium
RDS Reserved Instances | 1-yr or 3-yr commitment for steady-state DBs | Up to 69%
Right-size DB instance | Use CloudWatch to identify underutilized DB instances, then downsize | Medium
Data Retention & Backup Cost
  • Set RDS automated backup retention to what's actually needed (1–35 days) — don't keep 35 days if 7 suffices
  • Manual RDS snapshots persist until deleted — automate cleanup with Lambda or AWS Backup lifecycle rules
  • DynamoDB: On-Demand backups billed per GB; PITR is billed at roughly $0.20 per GB-month, so enable it only where needed
Database Migration for Cost Savings
  • Heterogeneous migration: Oracle/SQL Server → Aurora PostgreSQL using AWS SCT + DMS → eliminate expensive license costs
  • DynamoDB vs. RDS: DynamoDB has no per-seat or per-engine licensing; pure consumption billing
🎯

Dev/test databases → stop RDS instances nights/weekends (automated with EventBridge + Lambda). Migrate away from Oracle → Aurora PostgreSQL with DMS saves substantial license cost. Variable DynamoDB traffic → On-Demand mode (no capacity planning); predictable → Provisioned + Auto Scaling.
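The nights-and-weekends shutdown can be sketched as a function that an EventBridge-scheduled Lambda would call. It uses the boto3 RDS client methods `describe_db_instances` and `stop_db_instance`; the `env=dev` tag convention and the function names are assumptions, pagination is omitted for brevity, and Aurora clusters (which are stopped at the cluster level) are not handled.

```python
def stop_tagged_instances(rds_client, tag_key: str = "env", tag_value: str = "dev") -> list:
    """Stop every available RDS instance carrying the given tag; return their IDs."""
    stopped = []
    for db in rds_client.describe_db_instances()["DBInstances"]:
        tags = {t["Key"]: t["Value"] for t in db.get("TagList", [])}
        if tags.get(tag_key) == tag_value and db["DBInstanceStatus"] == "available":
            rds_client.stop_db_instance(DBInstanceIdentifier=db["DBInstanceIdentifier"])
            stopped.append(db["DBInstanceIdentifier"])
    return stopped


def handler(event, context):
    """Lambda entry point; boto3 is imported here so the module loads without it."""
    import boto3
    return {"stopped": stop_tagged_instances(boto3.client("rds"))}
```

Because a stopped instance restarts automatically after 7 days, a schedule like this should run at least daily rather than as a one-off.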

⚠️

Common traps:

  • "Aurora Serverless v2 scales to zero" — FALSE; v2 scales down to 0.5 ACU minimum, not zero. v1 could scale to zero (with cold-start penalty).
  • "ElastiCache always reduces database costs" — FALSE; ElastiCache adds its own hourly cost. It only saves money if the cache hit ratio is high enough that DB instance downsizing or fewer read replicas offset the cache cost.
  • "Stopping an RDS instance costs nothing" — FALSE; stopped RDS instances still incur storage charges. After 7 days, stopped instances automatically restart.
  • "DynamoDB On-Demand mode is always more expensive than Provisioned" — FALSE; for very spiky or unpredictable traffic, On-Demand avoids over-provisioned WCU/RCU waste and can be cheaper overall.
Task 4.4

Cost-Optimized Network Architectures

NAT Gateway cost, VPC endpoints, network topology, CDN strategy, throttling, bandwidth allocation.

🌉 NAT Gateway, VPC Endpoints & Network Cost Reduction
Network Cost · High Frequency
NAT Gateway Cost Considerations
  • NAT Gateway: $0.045/hr per AZ + $0.045/GB data processed — significant at scale
  • One NAT GW per AZ: More expensive but prevents cross-AZ data transfer charges; use for production
  • Single shared NAT GW: Cheaper but cross-AZ traffic incurs $0.01/GB each direction
  • NAT Instance (legacy): Cheaper for low-traffic but requires management; no HA without scripting
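The single versus per-AZ NAT Gateway choice is a break-even calculation. The sketch below plugs in the rates quoted in this guide ($0.045/hr, $0.045/GB processed, $0.01/GB cross-AZ) and assumes a 730-hour month; the function names are illustrative and the rates should be replaced with current regional pricing.

```python
HOURS_PER_MONTH = 730  # assumed average month length


def nat_monthly_cost(num_gateways: int, total_gb: float, cross_az_gb: float = 0.0,
                     hourly: float = 0.045, per_gb: float = 0.045,
                     cross_az_per_gb: float = 0.01) -> float:
    """Monthly cost: gateway hours + data processing + cross-AZ transfer."""
    return (num_gateways * hourly * HOURS_PER_MONTH
            + total_gb * per_gb
            + cross_az_gb * cross_az_per_gb)


def break_even_cross_az_gb(num_azs: int, hourly: float = 0.045,
                           cross_az_per_gb: float = 0.01) -> float:
    """Cross-AZ GB per month above which one NAT Gateway per AZ is cheaper."""
    extra_gateway_cost = (num_azs - 1) * hourly * HOURS_PER_MONTH
    return extra_gateway_cost / cross_az_per_gb


single = nat_monthly_cost(1, total_gb=20_000, cross_az_gb=13_000)
per_az = nat_monthly_cost(3, total_gb=20_000)
print(f"single NAT: ${single:,.2f}  per-AZ NAT: ${per_az:,.2f}")
print(f"break-even cross-AZ traffic: {break_even_cross_az_gb(3):,.0f} GB/month")
```

The data processing charge is identical in both designs, so only the extra gateway hours and the cross-AZ transfer decide the outcome.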
VPC Endpoints — Eliminate NAT/Internet Costs
Type | Services | Cost
Gateway Endpoint | S3, DynamoDB only | Free: no hourly charge or data fee
Interface Endpoint (PrivateLink) | 100s of AWS services (SSM, KMS, ECR, etc.) | ~$0.01/hr per AZ + $0.01/GB
Network Routing for Cost Optimization
  • Keep traffic within the same AZ where possible — cross-AZ = ~$0.02/GB round trip
  • Use VPC peering instead of Transit Gateway for simple two-VPC connections — TGW adds per-attachment + data processing fees
  • CloudFront origin shield: consolidates origin requests, reduces S3/EC2 egress
  • Compress API responses before sending — reduces data transfer costs
Direct Connect vs. VPN vs. Internet for Cost
Attribute | Internet | VPN | Direct Connect
Setup cost | Lowest | Low | Higher (port fees)
Data transfer | Standard egress rates | Standard egress rates | Reduced egress rates
Break-even | Low volume | Low–medium volume | High volume (10s of TB/month)
Throttling & Bandwidth Optimization
  • API Gateway throttling prevents backends from being overloaded — reduces compute cost from traffic spikes
  • Multiple smaller Direct Connect connections vs. one large: same total bandwidth but more resilience; evaluate cost per Gbps
  • AWS Cost Explorer: enable network cost analysis to identify cross-AZ transfer hotspots
🎯

Biggest network cost wins: (1) S3 Gateway Endpoint — eliminate NAT GW charges for S3 access (free). (2) Keep EC2 → RDS in same AZ — eliminate cross-AZ charges. (3) CloudFront — cheaper egress than direct S3/EC2 + caches content. (4) Review cross-AZ data paths — each cross-AZ byte costs money.

⚠️

Common traps:

  • "One shared NAT Gateway is always cheaper than one per AZ" — NOT ALWAYS; a single NAT GW saves hourly cost but all cross-AZ traffic to it incurs $0.01/GB each way. At high data volumes, per-AZ NAT GWs are cheaper.
  • "VPC Interface Endpoints are free like Gateway Endpoints" — FALSE; Interface Endpoints (PrivateLink) cost ~$0.01/hr per AZ plus data processing fees. Only S3 and DynamoDB Gateway Endpoints are free.
  • "Transit Gateway is cheaper than VPC peering for two VPCs" — FALSE; TGW charges per attachment and per GB processed. For two VPCs, direct peering has no hourly or data charge (only standard EC2 data transfer rates).
  • "CloudFront eliminates all origin data transfer costs" — FALSE; CloudFront reduces origin load by caching, but on cache misses it still fetches from the origin (incurring data transfer). CloudFront egress is cheaper per GB than direct S3/EC2 egress, but costs are not zero.
📉 Cost Management Tools — Cost Explorer, Budgets & Trusted Advisor
FinOps · Medium Frequency
Tool | Purpose | Key Feature
AWS Cost Explorer | Visualize, analyze, and forecast spend | Right-sizing recommendations; Reserved/Savings Plan utilization
AWS Budgets | Set cost/usage/RI thresholds + alerts | Alert via SNS when spend exceeds budget; forecast-based alerts
Cost & Usage Report (CUR) | Granular billing data to S3 | Most detailed billing data; feed to Athena/Redshift for custom analysis
AWS Trusted Advisor | Best practice checks across all pillars | Cost: idle EBS, low-utilization EC2, unused RIs, unassociated EIPs
AWS Compute Optimizer | ML-based resource right-sizing | EC2, Lambda, EBS, ECS on Fargate recommendations
Cost Allocation Tags | Tag resources by team/project/env | Enables per-tag cost breakdown in Cost Explorer and CUR
🎯

Alert when monthly bill exceeds $500 → AWS Budgets. Detailed billing for chargeback analysis → Cost & Usage Report + Athena. Which EC2 to downsize → Compute Optimizer. Unused Reserved Instances → Cost Explorer RI utilization report or Trusted Advisor.
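The "$500 monthly alert" scenario maps onto a single AWS Budgets request. The sketch below builds the keyword arguments for boto3's `create_budget` call on the `budgets` client; the budget name, account ID, and SNS topic ARN are placeholders, and the call itself is left as a comment so the example runs without credentials.

```python
def build_budget_request(account_id: str, sns_topic_arn: str,
                         limit_usd: int = 500, threshold_pct: float = 100.0) -> dict:
    """Build create_budget arguments for a monthly cost budget with an SNS alert."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": f"monthly-cost-{limit_usd}",   # hypothetical name
            "BudgetLimit": {"Amount": str(limit_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": [
            {
                "Notification": {
                    "NotificationType": "ACTUAL",        # or "FORECASTED"
                    "ComparisonOperator": "GREATER_THAN",
                    "Threshold": threshold_pct,
                    "ThresholdType": "PERCENTAGE",
                },
                "Subscribers": [{"SubscriptionType": "SNS", "Address": sns_topic_arn}],
            }
        ],
    }


request = build_budget_request("111122223333",
                               "arn:aws:sns:us-east-1:111122223333:billing-alerts")
# With credentials configured: boto3.client("budgets").create_budget(**request)
print(request["Budget"]["BudgetLimit"])
```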

⚠️

Common traps:

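  • "AWS Budgets prevents spending from exceeding the threshold" — FALSE; by default Budgets only alerts you and does not stop resources from running. To enforce limits, configure Budgets actions, which can attach an IAM policy or an AWS Organizations SCP, or stop specific EC2/RDS instances, when a threshold is crossed.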
  • "Cost Explorer shows real-time spend" — FALSE; Cost Explorer data has an up-to-24-hour delay. Use the Billing Dashboard for near-real-time spend.
  • "Cost Allocation Tags automatically appear in Cost Explorer" — FALSE; you must activate cost allocation tags in the Billing console before they appear as filterable dimensions.
  • "Compute Optimizer and Trusted Advisor give the same right-sizing recommendations" — FALSE; Compute Optimizer uses ML and 14 days of CloudWatch metrics for granular recommendations. Trusted Advisor uses simpler 14-day CPU/network averages with coarser thresholds.
