AWS Well-Architected Framework · 6 Pillars

Well-Architected Framework
Exam Reference Guide

🏛️ Pillars: 6 📐 Design Principles: 30+ 🎯 Exam Relevance: SAA · DVA · SAP

Pillar 1 — Operational Excellence

The ability to support development and run workloads effectively, gain insight into their operations, and continually improve processes and procedures to deliver business value.

🎯 Focus: Run, monitor, and continuously improve operations
Design Principles

Operational Excellence Design Principles

Six guiding principles that shift operations from reactive, manual, and siloed to proactive, automated, and collaborative.

⚙️ 6 Design Principles of Operational Excellence
Exam Fave · Operational Excellence
  1. Perform operations as code — Define your entire workload (infrastructure, config, procedures) as code. Use IaC (CloudFormation, CDK) and runbooks as code to reduce human error and enable consistent, repeatable operations.
  2. Make frequent, small, reversible changes — Design workloads for incremental updates. Small changes reduce blast radius and enable faster rollback. Use CI/CD pipelines, blue/green deployments, and feature flags.
  3. Refine operations procedures frequently — Evolve runbooks, playbooks, and processes as the workload changes. Schedule game days to exercise procedures and validate they work before an incident.
  4. Anticipate failure — Perform pre-mortem exercises — imagine what could fail and design mitigations. Test failure scenarios regularly (chaos engineering, DR drills). Never assume; inject faults to verify resilience.
  5. Learn from all operational failures — Run post-incident reviews (blameless post-mortems). Share learnings across teams. Drive improvements from every event, not just outages.
  6. Use managed services — Remove operational burden by using managed AWS services. AWS manages patching, availability, and scaling; you focus on business logic.
💡

Mnemonic — FMRALU: Frequent small changes · Managed services · Refine procedures · Anticipate failure · Learn from failures · Use ops as code.

🎯

On the exam, "runbooks as code" and "automate operational procedures" point to Operational Excellence. If a scenario says "manual process prone to human error" → the OE principle is Perform operations as code. If it says "big-bang releases" → Make frequent, small, reversible changes.

Best Practices

Operational Excellence Best Practices

Organize, Prepare, Operate, Evolve — the four areas of operational best practices.

📋 Organize, Prepare, Operate & Evolve
High Frequency
Organize
  • Understand business priorities — each team member understands their role in delivering business outcomes
  • Structure teams around products/services (two-pizza teams) not infrastructure layers
  • Define and publish operational runbooks (how to handle known events) and playbooks (how to investigate unknowns)
Prepare
  • Design for operations: Instrument workloads with telemetry (logs, metrics, traces). Design dashboards before launch.
  • Operational readiness reviews: Validate a workload is ready to be operated before launch
  • Game days: Simulate production events (failures, traffic spikes) to test runbooks and team response. Run on a schedule, not just before major launches.
  • Config management: Use AWS Config, Systems Manager Parameter Store / AppConfig for validated, versioned configuration
Operate
  • Use CloudWatch dashboards for real-time workload health visibility
  • Define KPIs and SLOs; alert on breaches — not just technical metrics but business outcomes (e.g., order success rate)
  • Respond to events with runbooks: Automate responses where possible (EventBridge → Lambda for auto-remediation)
  • Escalate events through defined severity levels with on-call rotations (e.g., PagerDuty, OpsGenie)
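The "EventBridge → Lambda for auto-remediation" bullet above can be sketched as a Lambda handler. This is a minimal illustration, not a reference implementation: the RUNBOOKS mapping, the alarm names, and the returned action strings are hypothetical, and a real function would call an AWS API where the comment indicates. Only the event fields (detail-type, detail.alarmName, detail.state.value) follow the CloudWatch alarm state-change event that EventBridge delivers.

```python
# Hypothetical mapping from alarm name to an automated runbook step.
RUNBOOKS = {
    "disk-full": "expand-volume",
    "high-5xx-rate": "roll-back-release",
}


def lambda_handler(event, context):
    """Decide how to react to a CloudWatch alarm event routed by EventBridge."""
    if event.get("detail-type") != "CloudWatch Alarm State Change":
        return {"action": "ignore", "reason": "unexpected event type"}

    detail = event.get("detail", {})
    alarm_name = detail.get("alarmName", "unknown")
    new_state = detail.get("state", {}).get("value")

    if new_state != "ALARM":
        return {"action": "ignore", "reason": f"{alarm_name} is {new_state}"}

    # A real handler would start the remediation here (for example, by
    # calling an AWS API); this sketch only returns the decision.
    action = RUNBOOKS.get(alarm_name, "page-on-call")
    return {"action": action, "alarm": alarm_name}


sample_event = {
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "disk-full", "state": {"value": "ALARM"}},
}
print(lambda_handler(sample_event, None))
```

Unknown alarms fall back to paging a human, which keeps the automation conservative: only events with a rehearsed runbook are handled without intervention.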
Evolve
  • Run blameless post-mortems after every significant event — document timeline, root cause, contributing factors, action items
  • Share learnings across teams via internal wikis, post-mortem databases, and all-hands reviews
  • Continuously improve runbooks, architecture, and tooling based on operational experience
  • Track improvement work as backlog items with the same priority as feature work
Key AWS Tool — Well-Architected Tool
  • Free in the AWS console — run workload reviews against all 6 pillars
  • Produces a list of identified risks (High, Medium) and improvement plan
  • Supports custom Lenses (industry-specific or org-specific questions)
🎯

"Validate operational readiness before launch" → Operational Readiness Review (OE: Prepare). "Simulate failure to test runbooks" → Game Day (OE: Prepare). "Learn from incidents" → Post-mortem / Evolve. "Automate response to CloudWatch alarm" → EventBridge rule + Lambda (OE: Operate as code).

⚠️

Common traps:

  • "Operational Excellence is only about monitoring" — FALSE; it covers the full ops lifecycle: organization, preparation, day-to-day operations, and continuous evolution.
  • "Post-mortems should identify who caused the outage" — FALSE; Well-Architected recommends blameless post-mortems that focus on system/process failures, not individual blame.
Key Services

Operational Excellence — Key AWS Services

🛠️ Services That Implement Operational Excellence
Reference
Service | OE Area | How It Applies
AWS CloudFormation / CDK / SAM | Operations as Code | Define and version all infrastructure as templates; repeatable, consistent deployments
AWS CodePipeline / CodeBuild / CodeDeploy | Small, reversible changes | Automate build → test → deploy pipeline; enable frequent, safe releases
Amazon CloudWatch | Operate | Metrics, logs, alarms, dashboards, anomaly detection for workload health visibility
AWS X-Ray | Operate | Distributed tracing to identify latency bottlenecks and errors across microservices
AWS CloudTrail | Operate / Evolve | Audit log of all API calls — who did what, when. Essential for post-mortem analysis.
AWS Systems Manager | Prepare / Operate | Run Command, Patch Manager, Parameter Store, OpsCenter for operational automation
AWS Config | Organize / Operate | Record and evaluate resource configuration changes; compliance rules and auto-remediation
Amazon EventBridge | Operate as Code | Route operational events to automated Lambda remediation without human intervention
AWS Well-Architected Tool | Evolve | Guided review against all 6 pillars; produces improvement plan and risk register
CloudFormation · CDK · CodePipeline · CloudWatch · X-Ray · CloudTrail · Systems Manager · AWS Config · EventBridge

Pillar 2 — Security

The ability to protect data, systems, and assets while delivering business value through risk assessments and mitigation strategies. Encompasses identity management, detective controls, infrastructure protection, data protection, and incident response.

🎯 Focus: Protect data, systems, and assets — at every layer
Design Principles

Security Design Principles

🔐 7 Design Principles of the Security Pillar
Exam Fave · Security
  1. Implement a strong identity foundation — Apply the principle of least privilege. Centralize identity management. Eliminate long-term static credentials wherever possible (use IAM roles, short-lived STS credentials).
  2. Maintain traceability — Log and monitor all actions and changes. Enable CloudTrail, VPC Flow Logs, and CloudWatch. Integrate logs into SIEM systems. Ensure every action by a human or system is auditable.
  3. Apply security at all layers — Defense in depth. Apply controls at the edge (CloudFront + WAF), VPC (security groups, NACLs), compute (OS hardening, instance metadata), and data (encryption). Never rely on a single security control.
  4. Automate security best practices — Use AWS Config rules for compliance. Automate vulnerability scanning (Amazon Inspector). Use AWS Security Hub to aggregate findings. Automate remediation via EventBridge + Lambda.
  5. Protect data in transit and at rest — Classify data by sensitivity. Use TLS for all in-transit data (enforce via bucket policy aws:SecureTransport). Encrypt at rest using KMS. Apply SSE-KMS for auditability.
  6. Keep people away from data — Reduce direct human access to sensitive data. Use automated processes for data access. When humans must access, use temporary, audited credentials. Limit production access to break-glass scenarios.
  7. Prepare for security events — Define incident response processes before an incident occurs. Run game days simulating security events. Use AWS Config, GuardDuty, and Security Hub for automated detection and pre-built runbooks for response.
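Principle 5 mentions enforcing TLS with the aws:SecureTransport condition key. A minimal sketch of such a bucket policy, built as a Python dictionary, is shown below. The function name and the bucket name "example-bucket" are placeholders chosen for illustration; the policy elements themselves are standard IAM policy grammar.

```python
import json


def deny_insecure_transport_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies any S3 request not sent over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object inside it.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                # The key is "false" whenever the request arrived over plain HTTP.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# "example-bucket" is a placeholder name.
print(json.dumps(deny_insecure_transport_policy("example-bucket"), indent=2))
```

An explicit Deny is used because it overrides any Allow elsewhere, so the TLS requirement holds regardless of what other statements grant.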
💡

Mnemonic — ITAAPKP: Identity foundation · Traceability · All layers · Automate · Protect data · Keep people away · Prepare for events.

Best Practices

Security Best Practice Areas

The 7 security focus areas from the Well-Architected Security whitepaper.

🛡️ 7 Security Focus Areas — Identity, Detection, Infrastructure, Data & Incident Response
High Frequency
Focus Area | Key Practices | Primary AWS Services
Security Foundations | AWS account structure, multi-account strategy with Organizations, Trusted Advisor checks | AWS Organizations, Control Tower, Trusted Advisor
Identity & Access Management | Least privilege, no root access keys, IAM roles over users, MFA everywhere, permissions boundaries | IAM, IAM Identity Center, STS, Cognito
Detection | Enable CloudTrail in all regions, GuardDuty for threat detection, Config for compliance, Security Hub for aggregation | CloudTrail, GuardDuty, AWS Config, Security Hub, Macie
Infrastructure Protection | VPC with private subnets, security groups (allow-only), NACLs for subnet-level deny, WAF on ALBs/CloudFront | VPC, WAF, Shield, Network Firewall, Inspector
Data Protection | Classify data by sensitivity (PII, PHI), encrypt at rest (KMS) and in transit (TLS), S3 Block Public Access, MFA Delete | KMS, ACM, S3, Macie, CloudHSM
Incident Response | Pre-defined runbooks, isolate compromised resources via Security Groups, use forensic accounts, automate containment | GuardDuty → EventBridge → Lambda, Security Hub, Systems Manager
Application Security | Threat modeling, SAST/DAST in CI/CD pipeline, dependency scanning, secrets management | Inspector, CodeGuru Security, Secrets Manager
Exam Scenario — "Detect and auto-remediate"

GuardDuty detects an EC2 instance communicating with a known C2 server. The recommended automated remediation is: GuardDuty finding → EventBridge rule → Lambda function that modifies the instance's Security Group to isolate it (removing all inbound/outbound rules except the forensics team's IP). This is "Automate security best practices" + "Prepare for security events" in action.
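The isolation step in this scenario can be sketched as follows. It is an illustration under stated assumptions: the quarantine security group ID is a placeholder, RecordingClient is a stand-in used here so the example runs without AWS credentials, and inside Lambda you would pass boto3.client("ec2") instead. The call modify_instance_attribute(InstanceId=..., Groups=[...]) is the boto3 EC2 operation that replaces an instance's security groups.

```python
QUARANTINE_SG_ID = "sg-0123456789abcdef0"  # placeholder: a group allowing only forensics access


def isolate_instance(event, ec2_client, quarantine_sg=QUARANTINE_SG_ID):
    """Replace the security groups of the instance named in a GuardDuty finding."""
    resource = event.get("detail", {}).get("resource", {})
    instance_id = resource.get("instanceDetails", {}).get("instanceId")
    if not instance_id:
        return None  # the finding does not concern an EC2 instance

    # Replaces every security group on the instance with the quarantine group.
    ec2_client.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_sg])
    return instance_id


class RecordingClient:
    """Stand-in for boto3.client("ec2") that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def modify_instance_attribute(self, **kwargs):
        self.calls.append(kwargs)


finding = {"detail": {"resource": {"instanceDetails": {"instanceId": "i-0abc123"}}}}
client = RecordingClient()
print(isolate_instance(finding, client), client.calls)
```

Passing the client in as a parameter keeps the handler testable and makes it explicit that GuardDuty itself performs no remediation; the Lambda function does.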

🎯

Exam pattern — match the service to the security area: Threat detection → GuardDuty. PII discovery in S3 → Macie. Compliance rules → AWS Config. Vulnerability scanning → Inspector. Aggregate findings → Security Hub. Audit log → CloudTrail. Web exploit protection → WAF.

⚠️

Common traps:

  • "Security and Operational Excellence are separate concerns" — FALSE; Well-Architected explicitly requires both — security must be built into operations, not bolted on.
  • "GuardDuty auto-remediates security threats" — FALSE; GuardDuty only generates findings. Remediation requires a separate automation (EventBridge + Lambda).
  • "AWS Shield Standard must be purchased" — FALSE; Shield Standard is free and automatically enabled for all AWS customers.
Key Services

Security Pillar — Key AWS Services

🔑 Security Services Map
Reference
Category | Service | Purpose
Identity | IAM, IAM Identity Center, Cognito, STS | Authentication, authorization, federation, temporary credentials
Detection | GuardDuty, Macie, Inspector, CloudTrail, Security Hub | Threat detection, data classification, vulnerability scanning, audit logs, aggregation
Infrastructure | VPC, WAF, Shield, Network Firewall, Firewall Manager | Network isolation, web exploit protection, DDoS protection
Data | KMS, ACM, CloudHSM, Secrets Manager | Encryption keys, certificates, HSMs, secret storage
Compliance | AWS Config, Audit Manager, Artifact | Compliance rules, audit reports, regulatory documentation
IAM · GuardDuty · Macie · Inspector · Security Hub · CloudTrail · KMS · ACM · WAF · Shield · AWS Config · Secrets Manager

Pillar 3 — Reliability

The ability of a workload to perform its intended function correctly and consistently when it's expected to. Includes the ability to operate and test the workload through its total lifecycle — from design, through operations, to decommission.

🎯 Focus: Recover quickly from failures, scale to meet demand
Design Principles

Reliability Design Principles

🔄 5 Design Principles of the Reliability Pillar
Exam Fave · Reliability
  1. Automatically recover from failure — Monitor KPIs and trigger automation when thresholds are breached. Use Auto Scaling, Route 53 health checks, and ALB health checks to replace unhealthy resources without human intervention.
  2. Test recovery procedures — Use automation to simulate component failures. Don't wait for real incidents to find out your DR plan doesn't work. Run regular failover drills — test Multi-AZ failover, backup restoration, and cross-region DR.
  3. Scale horizontally to increase aggregate workload availability — Replace one large resource with many smaller resources distributed across multiple AZs. If one fails, the remaining capacity absorbs load. Avoid single large instances; prefer Auto Scaling groups.
  4. Stop guessing capacity — Use Auto Scaling to provision the right amount of capacity at any given time. Add and remove resources automatically based on demand — never over-provision for peak or under-provision and throttle.
  5. Manage change through automation — Use IaC (CloudFormation) and CI/CD to make changes in a consistent, tested, automated way. Unmanaged manual changes are a leading cause of reliability incidents.
Availability Targets — Know Your Numbers
Availability | Downtime per year | Downtime per month
99% ("Two nines") | ~87.6 hours | ~7.3 hours
99.9% ("Three nines") | ~8.76 hours | ~43.8 minutes
99.99% ("Four nines") | ~52.6 minutes | ~4.38 minutes
99.999% ("Five nines") | ~5.26 minutes | ~26.3 seconds
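The figures in the table follow from one line of arithmetic: allowed downtime = period × (1 − availability). A small sketch, assuming a 365-day year and a month of one twelfth of that (the function and constant names are illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def allowed_downtime_hours(availability_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in a period at the given availability."""
    return period_hours * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_hours(pct)
    per_month = allowed_downtime_hours(pct, HOURS_PER_YEAR / 12)
    print(f"{pct}%: {per_year * 60:.2f} min/year, {per_month * 60:.2f} min/month")
```

Each extra nine divides the downtime budget by ten, which is why moving from three nines to four usually requires automated failover rather than faster manual response.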
🎯

"Eliminate single points of failure" → Scale horizontally + Multi-AZ. "Never test DR only during a real incident" → Test recovery procedures principle. "Unpredictable traffic causes outages" → Stop guessing capacity → Auto Scaling.

Best Practices

Reliability Best Practices

Foundations, workload architecture, change management, and failure management.

🏗️ Foundations, Workload Architecture & Failure Management
High Frequency
Foundations
  • Service quotas: Understand and proactively request increases for limits that could affect workload availability (e.g., Lambda concurrency, EC2 vCPUs, VPC subnets)
  • Network topology: Design VPCs with multiple AZs. Use private subnets for compute; public only for load balancers and NAT. Plan IP address space to avoid exhaustion.
Workload Architecture
  • Service-oriented architecture: Each component deployed independently. Failures in one component don't cascade to others.
  • Bulkhead pattern: Isolate components so failure of one doesn't exhaust resources for others (e.g., separate thread pools, separate queues per tier)
  • Circuit breaker: Detect downstream failures and stop sending requests to a failing service; fail fast instead of queuing up and causing cascading timeouts
  • Idempotency: Design operations that can be retried without side effects — critical for at-least-once delivery systems (SQS, Lambda async)
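The circuit breaker bullet above can be made concrete with a small sketch. This is one simple variant, not a canonical implementation: the class name, the failure threshold, and the cool-down period are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

While the circuit is open, callers get an immediate CircuitOpenError instead of waiting on a timeout, which is what stops one failing dependency from tying up threads and connections upstream.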
Change Management
  • Deploy changes using CI/CD with automated tests and rollback capabilities
  • Use CloudWatch alarms as CloudFormation rollback triggers — automatically roll back a stack update if alarm fires post-deployment
  • Feature flags (AppConfig) allow disabling problematic features without redeployment
Failure Management
  • Backup and restore: AWS Backup for centralized, policy-based backup across services. Test restores regularly.
  • Multi-AZ: Deploy across ≥2 AZs. RDS Multi-AZ, ALB across AZs, ECS service across AZs. Synchronous replication for zero-data-loss failover.
  • Multi-Region DR strategies (RTO/RPO tradeoffs):
Strategy | RTO | RPO | Cost | Description
Backup & Restore | Hours | Hours | Lowest | Restore from backups in DR region when disaster occurs
Pilot Light | ~10 min | Minutes | Low | Minimal infrastructure (DB replicas) running; scale up on disaster
Warm Standby | Minutes | Seconds | Medium | Scaled-down copy fully running in DR region; scale to full on disaster
Multi-Site Active-Active | Near-zero | Near-zero | Highest | Full production load in ≥2 regions simultaneously; instant failover
💡

DR Cost vs. Speed: B&R (cheapest, slowest) → Pilot Light → Warm Standby → Active-Active (costliest, fastest). Exam will describe RTO/RPO requirements and ask which strategy fits.
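The exam-style reasoning (pick the cheapest strategy that still meets the stated RTO and RPO) can be sketched as a lookup. The minute values below are illustrative assumptions standing in for the qualitative "hours", "minutes", "seconds", and "near-zero" ratings; they are not AWS-published figures.

```python
# (strategy, assumed typical RTO in minutes, assumed typical RPO in minutes),
# ordered from cheapest to most expensive. Numbers are illustrative only.
DR_STRATEGIES = [
    ("Backup & Restore", 24 * 60, 24 * 60),
    ("Pilot Light", 10, 10),
    ("Warm Standby", 5, 1),
    ("Multi-Site Active-Active", 0, 0),
]


def cheapest_strategy(required_rto_min: float, required_rpo_min: float) -> str:
    """Return the lowest-cost strategy whose assumed RTO and RPO meet the requirement."""
    for name, rto, rpo in DR_STRATEGIES:
        if rto <= required_rto_min and rpo <= required_rpo_min:
            return name
    # Unreachable for non-negative inputs, since the last entry is (0, 0).
    return DR_STRATEGIES[-1][0]


print(cheapest_strategy(required_rto_min=30, required_rpo_min=2))
```

Because the list is ordered by cost, the first match is also the cheapest acceptable option, mirroring how the exam expects you to avoid over-engineering the DR tier.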

⚠️

Common traps:

  • "Multi-AZ and Multi-Region are the same" — FALSE; Multi-AZ provides HA within a region. Multi-Region is for disaster recovery and geographic distribution across regions.
  • "RTO is how much data you can afford to lose" — FALSE; RTO is Recovery Time Objective (how long recovery takes). RPO is Recovery Point Objective (how much data loss is acceptable).
  • "Pilot Light means no resources are running in DR" — FALSE; Pilot Light keeps minimal critical resources (like DB replicas) running. Backup & Restore has nothing running.
Key Services

Reliability Pillar — Key AWS Services

Reliability Services Map
Reference
Category | Service | Reliability Role
Compute HA | Auto Scaling Groups, ALB | Replace unhealthy instances; distribute load across AZs
Database HA | RDS Multi-AZ, Aurora Global DB, DynamoDB Global Tables | Synchronous replication; automatic failover; cross-region HA
DNS / Traffic | Route 53 health checks, failover routing, latency routing | Automatic traffic cutover to healthy endpoints
Backup | AWS Backup, S3 Cross-Region Replication, EBS snapshots | Centralized backup policies; cross-region data durability
Decoupling | SQS, SNS, EventBridge | Buffer requests; prevent cascade failures between components
Serverless HA | Lambda, DynamoDB, S3 | Built-in HA across AZs by design — no customer configuration needed
Auto Scaling · ALB · RDS Multi-AZ · Aurora Global · Route 53 · AWS Backup · SQS · S3 CRR · CloudFormation

Pillar 4 — Performance Efficiency

The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.

🎯 Focus: Use the right resource type, right size, and continuously measure
Design Principles

Performance Efficiency Design Principles

🚀 5 Design Principles of Performance Efficiency
Exam Fave · Performance
  1. Democratize advanced technologies — Use managed services for complex capabilities (ML with SageMaker, search with OpenSearch, graph with Neptune). Don't build what AWS manages; consume it as a service to reduce effort and accelerate innovation.
  2. Go global in minutes — Deploy to multiple AWS Regions with minimal effort. Use CloudFront to serve users from edge locations worldwide. Aurora Global Database replicates with <1s lag. Route 53 latency routing directs users to the nearest region.
  3. Use serverless architectures — Remove the need to run and maintain servers. Lambda, Fargate, DynamoDB, Aurora Serverless, S3 — these services eliminate undifferentiated heavy lifting and scale automatically.
  4. Experiment more often — The low cost of cloud resources enables A/B testing of architecture choices (instance types, storage classes, caching strategies) to find the best-performing option without long-term commitment.
  5. Consider mechanical sympathy — Understand how services work internally and use them in alignment with their design. Example: DynamoDB performs best with high-cardinality partition keys; use keys that distribute load evenly across partitions.
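The partition-key example in principle 5 can be illustrated with a small simulation. This is a sketch only: DynamoDB's real hashing and partition management are internal to the service, so MD5 and the fixed count of 8 partitions are stand-ins chosen for the demonstration, and the key values are made up.

```python
import hashlib
from collections import Counter


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition with a stable hash (illustrative stand-in)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partitions


def hottest_share(keys, partitions: int = 8) -> float:
    """Fraction of all items that land on the busiest partition."""
    counts = Counter(partition_for(k, partitions) for k in keys)
    return max(counts.values()) / len(keys)


# Low cardinality: only three distinct key values (an order status).
low_cardinality = [("PENDING", "SHIPPED", "DELIVERED")[i % 3] for i in range(9_000)]
# High cardinality: one distinct key per item (an order ID).
high_cardinality = [f"order-{i}" for i in range(9_000)]

print("status as key:  ", round(hottest_share(low_cardinality), 3))
print("order ID as key:", round(hottest_share(high_cardinality), 3))
```

With three distinct values, at most three partitions ever receive traffic, so the busiest one carries at least a third of the load no matter how many partitions exist. Unique order IDs spread close to evenly.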
Best Practices

Performance Efficiency Best Practices

Architecture selection, compute, data management, networking, and continuous review.

Architecture Selection, Compute, Data & Networking
High Frequency
Architecture Selection
Workload Type | Recommended Architecture | AWS Services
Event-driven / variable load | Serverless | Lambda, DynamoDB, API GW, S3
Containerized microservices | Container platform | ECS Fargate, EKS, ECR
Stateful long-running | Compute instances | EC2 with right-sized instance family
High-performance compute (HPC) | Cluster placement groups | C/P/G instance families, EFA networking
Data analytics | Purpose-built data services | Redshift, Athena, EMR, Kinesis
Compute & Hardware Selection
  • Right-size instances: Use CloudWatch metrics (CPU, memory via custom metric) + AWS Compute Optimizer to find underutilized instances and resize to the correct family and size
  • Instance families: General (M/T) · Compute-optimized (C) · Memory-optimized (R/X) · Storage-optimized (I/D) · Accelerated (P/G/Inf/Trn)
  • Graviton (ARM): Up to 40% better price/performance for many workloads (Lambda, EC2, RDS). Cost optimization AND performance simultaneously.
  • Auto Scaling: Predictive Scaling anticipates traffic; Target Tracking maintains a metric at a target value
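The idea behind Target Tracking (hold a metric at a target by resizing the fleet) can be approximated with a proportional rule. This is a simplified illustration and an assumption on my part, not the algorithm AWS actually runs; the function name and the min/max bounds are hypothetical.

```python
import math


def desired_capacity(current: int, metric_value: float, target_value: float,
                     min_size: int = 1, max_size: int = 10) -> int:
    """Proportional estimate of the capacity needed to bring the metric back to target."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    # If each instance sits at 75% CPU and the target is 50%, the fleet needs 1.5x the capacity.
    proposed = math.ceil(current * metric_value / target_value)
    # Never leave the bounds configured on the Auto Scaling group.
    return max(min_size, min(max_size, proposed))


print(desired_capacity(current=4, metric_value=75.0, target_value=50.0))
```

Rounding up biases the estimate toward availability: slightly too much capacity is cheaper than throttled requests.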
Data Management
  • Choose the right storage: S3 for objects, EBS for block (EC2), EFS for shared file, FSx for HPC/Windows
  • Use caching to reduce read latency: ElastiCache, DAX (DynamoDB), CloudFront (CDN), API GW cache
  • Partition data to enable parallel reads — DynamoDB partition key design, Kinesis shard count
Networking & Content Delivery
  • Proximity: Deploy resources in regions close to users. Use Route 53 latency routing or geolocation routing.
  • CloudFront: Cache static assets and API responses at 400+ edge locations globally — dramatically reduces latency and origin load
  • Enhanced Networking: For HPC and low-latency networking — use ENA (Elastic Network Adapter) and EFA (Elastic Fabric Adapter) on supported instance types
  • Global Accelerator: Routes traffic over the AWS backbone to the nearest healthy endpoint — reduces hops and jitter for non-cacheable traffic (dynamic APIs, gaming)
🎯

Exam shortcut — Performance Efficiency service matching: "Low latency for global users" → CloudFront. "Low latency for global dynamic API" → Global Accelerator. "Database query latency" → ElastiCache / DAX. "Wrong EC2 instance size" → Compute Optimizer. "Faster Lambda on Java" → SnapStart / Graviton.

Key Services

Performance Efficiency — Key AWS Services

📊 Performance Services Map
Reference
CloudFront · Global Accelerator · ElastiCache · DAX · Auto Scaling · Lambda · Graviton EC2 · Compute Optimizer · Kinesis · Aurora Serverless · SageMaker · EFA

Pillar 5 — Cost Optimization

The ability to run systems to deliver business value at the lowest price point. Covers understanding spending, controlling fund allocation, selecting the right resource type and quantity, and scaling to meet business needs without overspending.

🎯 Focus: Eliminate waste, right-size resources, use pricing models strategically
Design Principles

Cost Optimization Design Principles

💰 5 Design Principles of Cost Optimization
Exam Fave · Cost
  1. Implement cloud financial management — Build organizational capability (FinOps practice). Establish a Cloud Center of Excellence. Finance and engineering must collaborate on cost goals. Tag all resources for chargeback/showback reporting.
  2. Adopt a consumption model — Pay only for what you use. Use on-demand and serverless resources. Shut down non-production environments outside business hours. Use Auto Scaling to match supply to demand — never idle capacity.
  3. Measure overall efficiency — Track cost per unit of business outcome (e.g., cost per order, cost per active user). Use AWS Cost Explorer, Cost & Usage Reports, and cost allocation tags.
  4. Stop spending money on undifferentiated heavy lifting — Use managed services (RDS instead of self-managed MySQL on EC2, Lambda instead of always-on EC2). AWS manages patching, HA, and scaling — you stop paying for that operational work.
  5. Analyze and attribute expenditure — Use cost allocation tags and AWS Organizations to attribute costs to teams/products. Enable cost visibility so engineers see the impact of their architectural decisions.
Best Practices

Cost Optimization Best Practices

Cloud financial management, usage awareness, cost-effective resources, demand management, and continuous optimization.

📉 Pricing Models, Right-Sizing, and Waste Elimination
High Frequency
EC2 Pricing Models
Model | Discount vs On-Demand | Commitment | Best For
On-Demand | Baseline | None | Variable, unpredictable workloads; dev/test
Reserved Instances (Standard) | Up to 72% | 1 or 3 years, specific type | Steady-state production workloads with known instance type
Reserved Instances (Convertible) | Up to 54% | 1 or 3 years, flexible type | Steady-state but may change instance family
Savings Plans (Compute) | Up to 66% | 1 or 3 years, $/hr commitment | Flexible — applies to EC2, Lambda, Fargate, any region/family
Savings Plans (EC2 Instance) | Up to 72% | 1 or 3 years, specific family+region | Specific instance family in specific region
Spot Instances | Up to 90% | None (can be interrupted) | Fault-tolerant, stateless, batch, CI/CD, ML training
Dedicated Hosts | On-demand or Reserved | Optional | Licensing compliance (per-socket, per-core) or regulatory isolation
Right-Sizing & Waste Elimination
  • AWS Compute Optimizer: ML-based right-sizing recommendations for EC2, Lambda, ECS tasks, Auto Scaling groups, EBS volumes — based on 14 days of CloudWatch metrics
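  • AWS Trusted Advisor: Cost optimization checks identify idle EC2 instances, underutilized EBS volumes, unassociated Elastic IPs, and low-utilization RDS instances. The full set of checks requires a Business Support plan or higher; the basic (free) tier covers only service limits and a small set of security checks.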
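  • Idle resources: Terminate stopped EC2 instances that are no longer needed (their attached EBS volumes still incur charges), delete unattached EBS volumes, and release unassociated Elastic IPs (public IPv4 addresses are billed at $0.005/hr; since February 2024 the charge applies whether or not the address is in use)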
  • S3 lifecycle policies: Move objects to cheaper storage classes over time (Standard → Standard-IA → Glacier → Glacier Deep Archive)
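The lifecycle bullet above can be sketched as a configuration document. The dictionary follows the shape accepted by boto3's put_bucket_lifecycle_configuration (passed as LifecycleConfiguration); the rule ID, the prefix, and the 30/90/365-day thresholds are assumptions chosen for illustration, not recommendations.

```python
def lifecycle_configuration(prefix: str = "") -> dict:
    """Lifecycle rules that move objects to colder storage classes as they age."""
    return {
        "Rules": [
            {
                "ID": "tier-down-with-age",     # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},   # empty prefix applies to every object
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


config = lifecycle_configuration("logs/")
for step in config["Rules"][0]["Transitions"]:
    print(f"after {step['Days']} days -> {step['StorageClass']}")
```

To apply it you would pass the dictionary to an S3 client, for example s3.put_bucket_lifecycle_configuration(Bucket="example-bucket", LifecycleConfiguration=config), where the bucket name is a placeholder.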
Expenditure & Usage Awareness
Tool | Purpose
AWS Cost Explorer | Visualize and analyze historical spend; forecast future costs; RI utilization and coverage reports
AWS Budgets | Set cost/usage/coverage thresholds; alert via SNS when threshold breached; take automated actions
Cost & Usage Report (CUR) | Most granular billing data — exported to S3, queryable with Athena; gold standard for chargeback
Cost Allocation Tags | Must be activated in Billing console; tag resources with Team/Project/Environment; filter Cost Explorer by tag
💡

Savings Plans vs. Reserved Instances: Compute Savings Plans are the most flexible option (they apply across EC2, Lambda, and Fargate, in any region and any instance family). Standard RIs are less flexible but offer a slightly higher maximum discount for a specific instance type. Compute Savings Plans = best all-round for most organizations.
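Whether a commitment pays off depends on utilization: a commitment is billed for every hour, while On-Demand is billed only for hours used. A worked sketch follows. The $0.10 hourly rate and the 40% discount are hypothetical inputs, the function names are illustrative, and the "up to" percentages in the table are maxima that real purchases may not reach.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year divided by 12


def monthly_cost_on_demand(hourly_rate: float, utilization: float) -> float:
    """On-Demand: pay only for the fraction of the month the resource runs."""
    return hourly_rate * HOURS_PER_MONTH * utilization


def monthly_cost_committed(hourly_rate: float, discount: float) -> float:
    """Commitment: discounted rate, billed for every hour whether used or not."""
    return hourly_rate * (1 - discount) * HOURS_PER_MONTH


def break_even_utilization(discount: float) -> float:
    """Utilization above which the commitment is cheaper than On-Demand."""
    return 1 - discount


rate, discount = 0.10, 0.40  # hypothetical values
print("break-even utilization:", break_even_utilization(discount))
print("committed:", round(monthly_cost_committed(rate, discount), 2))
print("on-demand at 90%:", round(monthly_cost_on_demand(rate, 0.90), 2))
print("on-demand at 50%:", round(monthly_cost_on_demand(rate, 0.50), 2))
```

At a 40% discount the commitment wins only if the resource runs more than 60% of the time, which is why commitments suit steady-state workloads and On-Demand suits spiky or short-lived ones.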

🎯

"Alert when monthly spend exceeds $500" → AWS Budgets. "Find underutilized EC2 instances" → Compute Optimizer + Trusted Advisor. "Maximize discount for predictable 3-year workload" → Standard Reserved Instances. "Fault-tolerant batch processing cheaply" → Spot Instances. "Granular per-team billing data" → Cost & Usage Report + Athena + Cost Allocation Tags.

⚠️

Common traps:

  • "AWS Budgets stops spending when the threshold is hit" — FALSE; by default a budget only alerts. To actually curb spend, configure a budget action (apply an IAM policy or SCP, or stop targeted EC2/RDS instances) or trigger your own automation, such as SNS → Lambda.
  • "Reserved Instances can be used for any instance type" — FALSE; Standard RIs are tied to a specific instance family, region, OS, and tenancy (regional Linux RIs apply across sizes within the family). Convertible RIs can be exchanged for a different family, OS, or tenancy, at a lower maximum discount.
  • "Spot Instances are suitable for production databases" — FALSE; Spot Instances can be interrupted with a 2-minute notice. Avoid them for stateful workloads that require persistence.
Key Services

Cost Optimization — Key AWS Services

📊 Cost Services Map
Reference
Cost Explorer · AWS Budgets · Savings Plans · Reserved Instances · Spot Instances · Compute Optimizer · Trusted Advisor · Cost & Usage Report · S3 Lifecycle · Lambda (serverless) · Aurora Serverless · Graviton

Pillar 6 — Sustainability

The ability to continually improve sustainability impacts by reducing energy consumption and increasing efficiency across all components of a workload by maximizing the benefits from provisioned resources and minimizing the total resources required.

🎯 Focus: Minimize environmental impact — maximize utilization, minimize waste
Design Principles

Sustainability Design Principles

🌱 6 Design Principles of the Sustainability Pillar
Medium Frequency · Sustainability
  1. Understand your impact — Measure the environmental impact of your workload. Use the Customer Carbon Footprint Tool to track your AWS carbon emissions. Establish KPIs for sustainability (e.g., carbon per API call).
  2. Establish sustainability goals — Set long-term goals for carbon reduction aligned with your organization's sustainability commitments. Prioritize workloads with the highest impact.
  3. Maximize utilization — Right-size resources to maximize utilization and reduce idle waste. Use Auto Scaling so resources only run when needed. Consolidate workloads onto fewer, larger instances vs. many small ones.
  4. Anticipate and adopt new, more efficient hardware and software — Monitor AWS for new, more energy-efficient instance types (e.g., Graviton processors use up to 60% less energy than equivalent x86 instances). Adopt them as they become available.
  5. Use managed services — AWS managed services are optimized for energy efficiency at scale (shared infrastructure, higher utilization). Moving from self-managed to managed (e.g., RDS vs MySQL on EC2) reduces your environmental footprint.
  6. Reduce the downstream impact of your cloud workloads — Minimize data transfer, storage, and processing that end users perform. Use efficient data formats (Parquet vs CSV), compress data, deliver assets from edge (CloudFront) to reduce client-side energy use.
Best Practices

Sustainability Best Practices

Region selection, behavior patterns, software architecture, data patterns, hardware, and development process.

♻️ Sustainability Best Practice Areas & Exam Scenarios
Medium Frequency
Area | Practice | AWS Implementation
Region Selection | Choose regions powered by higher % renewable energy | Check AWS sustainability region data; consider carbon footprint per region
User Behavior | Reduce unnecessary data transfer and client-side compute | CloudFront edge caching; efficient pagination; compression (gzip/brotli)
Software Architecture | Use serverless to eliminate idle compute; adopt event-driven | Lambda (no idle servers), Fargate (no idle EC2), DynamoDB On-Demand
Data Patterns | Minimize data stored and processed; use efficient formats | S3 Intelligent-Tiering; archive to Glacier; use Parquet/ORC over CSV; delete unused data
Hardware Patterns | Use energy-efficient processor architectures | Graviton (ARM) instances — up to 60% less energy for same workload
Development Process | Include sustainability in architecture reviews; measure and improve | Well-Architected Tool sustainability lens; Customer Carbon Footprint Tool
Shared Responsibility for Sustainability
  • AWS responsibility: Optimize the physical data center, networking, and hardware (energy-efficient cooling, renewable energy procurement, efficient hardware design)
  • Customer responsibility: Optimize their workloads — right-size, remove idle resources, choose efficient architectures, use efficient data formats, leverage managed services
Sustainability vs. Cost Optimization

Sustainability and Cost Optimization are closely aligned — both benefit from eliminating waste and maximizing utilization. Key difference: Sustainability focuses on minimizing environmental impact (carbon emissions, energy use); Cost Optimization focuses on minimizing financial spend. In practice, actions like right-sizing, using Graviton, and switching to serverless achieve both goals simultaneously.

🎯

On the exam, Sustainability scenarios often involve: Graviton instances (energy-efficient compute), S3 Intelligent-Tiering (reduce storage waste), serverless architectures (no idle compute), and data lifecycle policies (delete unused data). If the question mentions "reduce carbon footprint" or "minimize environmental impact" → Sustainability pillar.

⚠️

Common traps:

  • "Sustainability is only relevant for large enterprises with ESG requirements" — FALSE; it's one of the 6 pillars and tested on all associate and professional exams.
  • "Using more powerful instances improves sustainability because jobs finish faster" — CONTEXT-DEPENDENT; over-provisioning wastes energy. Right-sizing is the key — use the minimum power needed to meet performance requirements.
Key Services

Sustainability — Key AWS Services

🌿 Sustainability Services Map
Reference
Graviton EC2 · Lambda (serverless) · Fargate · S3 Intelligent-Tiering · S3 Glacier · Aurora Serverless · CloudFront · Auto Scaling · Carbon Footprint Tool · Spot Instances · Compute Optimizer
