AWS Well-Architected Framework

Pillar 1 — Operational Excellence

The ability to support development and run workloads effectively, gain insight into their operations, and continually improve processes and procedures to deliver business value.

🎯 Focus: Run, monitor, and continuously improve operations

Design Principles

Operational Excellence Design Principles

Six guiding principles that shift operations from reactive, manual, and siloed to proactive, automated, and collaborative.

⚙️ 6 Design Principles of Operational Excellence

Exam Fave · Operational Excellence

Perform operations as code — Define your entire workload (infrastructure, config, procedures) as code. Use IaC (CloudFormation, CDK) and runbooks as code to reduce human error and enable consistent, repeatable operations.
Make frequent, small, reversible changes — Design workloads for incremental updates. Small changes reduce blast radius and enable faster rollback. Use CI/CD pipelines, blue/green deployments, and feature flags.
Refine operations procedures frequently — Evolve runbooks, playbooks, and processes as the workload changes. Schedule game days to exercise procedures and validate they work before an incident.
Anticipate failure — Perform pre-mortem exercises — imagine what could fail and design mitigations. Test failure scenarios regularly (chaos engineering, DR drills). Never assume; inject faults to verify resilience.
Learn from all operational failures — Run post-incident reviews (blameless post-mortems). Share learnings across teams. Drive improvements from every event, not just outages.
Use managed services — Remove operational burden by using managed AWS services. AWS manages patching, availability, and scaling; you focus on business logic.

💡

Mnemonic — FMRALU: Frequent small changes · Managed services · Refine procedures · Anticipate failure · Learn from failures · Use ops as code.

🎯

On the exam, "runbooks as code" and "automate operational procedures" point to Operational Excellence. If a scenario says "manual process prone to human error" → the OE principle is Perform operations as code. If it says "big-bang releases" → Make frequent, small, reversible changes.

Best Practices

Operational Excellence Best Practices

Organize, Prepare, Operate, Evolve — the four areas of operational best practices.

📋 Organize, Prepare, Operate & Evolve

High Frequency

Organize

Understand business priorities — each team member understands their role in delivering business outcomes
Structure teams around products/services (two-pizza teams) not infrastructure layers
Define and publish operational runbooks (how to handle known events) and playbooks (how to investigate unknowns)

Prepare

Design for operations: Instrument workloads with telemetry (logs, metrics, traces). Design dashboards before launch.
Operational readiness reviews: Validate a workload is ready to be operated before launch
Game days: Simulate production events (failures, traffic spikes) to test runbooks and team response. Run on a schedule, not just before major launches.
Config management: Use AWS Config, Systems Manager Parameter Store / AppConfig for validated, versioned configuration

Operate

Use CloudWatch dashboards for real-time workload health visibility
Define KPIs and SLOs; alert on breaches — not just technical metrics but business outcomes (e.g., order success rate)
Respond to events with runbooks: Automate responses where possible (EventBridge → Lambda for auto-remediation)
Escalate events through defined severity levels with on-call rotations (e.g., PagerDuty, OpsGenie)
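
The business-outcome alerting described above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `OrderWindow` shape, the function names, and the 99.5% target are all hypothetical, and in practice the counts would come from a metrics source such as CloudWatch.

```python
from dataclasses import dataclass


@dataclass
class OrderWindow:
    """Order counts for one evaluation window (hypothetical shape)."""
    succeeded: int
    failed: int


def order_success_rate(window: OrderWindow) -> float:
    """Return the fraction of orders that succeeded; 1.0 if there was no traffic."""
    total = window.succeeded + window.failed
    return 1.0 if total == 0 else window.succeeded / total


def slo_breached(window: OrderWindow, target: float = 0.995) -> bool:
    """True when the measured success rate falls below the SLO target (assumed 99.5%)."""
    return order_success_rate(window) < target
```

Alerting on `slo_breached` rather than on CPU or memory alone ties the alarm to the outcome the business cares about.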

Evolve

Run blameless post-mortems after every significant event — document timeline, root cause, contributing factors, action items
Share learnings across teams via internal wikis, post-mortem databases, and all-hands reviews
Continuously improve runbooks, architecture, and tooling based on operational experience
Track improvement work as backlog items with the same priority as feature work

Key AWS Tool — Well-Architected Tool

Free in the AWS console — run workload reviews against all 6 pillars
Produces a list of identified risks (High, Medium) and improvement plan
Supports custom Lenses (industry-specific or org-specific questions)

🎯

"Validate operational readiness before launch" → Operational Readiness Review (OE: Prepare). "Simulate failure to test runbooks" → Game Day (OE: Prepare). "Learn from incidents" → Post-mortem / Evolve. "Automate response to CloudWatch alarm" → EventBridge rule + Lambda (OE: Operate as code).

⚠️

Common traps:

"Operational Excellence is only about monitoring" — FALSE; it covers the full ops lifecycle: organization, preparation, day-to-day operations, and continuous evolution.
"Post-mortems should identify who caused the outage" — FALSE; Well-Architected recommends blameless post-mortems that focus on system/process failures, not individual blame.

Key Services

Operational Excellence — Key AWS Services

🛠️ Services That Implement Operational Excellence

Reference

Service	OE Area	How It Applies
AWS CloudFormation / CDK / SAM	Operations as Code	Define and version all infrastructure as templates; repeatable, consistent deployments
AWS CodePipeline / CodeBuild / CodeDeploy	Small, reversible changes	Automate build → test → deploy pipeline; enable frequent, safe releases
Amazon CloudWatch	Operate	Metrics, logs, alarms, dashboards, anomaly detection for workload health visibility
AWS X-Ray	Operate	Distributed tracing to identify latency bottlenecks and errors across microservices
AWS CloudTrail	Operate / Evolve	Audit log of all API calls — who did what, when. Essential for post-mortem analysis.
AWS Systems Manager	Prepare / Operate	Run Command, Patch Manager, Parameter Store, OpsCenter for operational automation
AWS Config	Organize / Operate	Record and evaluate resource configuration changes; compliance rules and auto-remediation
Amazon EventBridge	Operate as Code	Route operational events to automated Lambda remediation without human intervention
AWS Well-Architected Tool	Evolve	Guided review against all 6 pillars; produces improvement plan and risk register

CloudFormation · CDK · CodePipeline · CloudWatch · X-Ray · CloudTrail · Systems Manager · AWS Config · EventBridge

Pillar 2 — Security

The ability to protect data, systems, and assets while delivering business value through risk assessments and mitigation strategies. Encompasses identity management, detective controls, infrastructure protection, data protection, and incident response.

🎯 Focus: Protect data, systems, and assets — at every layer

Design Principles

Security Design Principles

🔐 7 Design Principles of the Security Pillar

Exam Fave · Security

Implement a strong identity foundation — Apply the principle of least privilege. Centralize identity management. Eliminate long-term static credentials wherever possible (use IAM roles, short-lived STS credentials).
Maintain traceability — Log and monitor all actions and changes. Enable CloudTrail, VPC Flow Logs, and CloudWatch. Integrate logs into SIEM systems. Ensure every action by a human or system is auditable.
Apply security at all layers — Defense in depth. Apply controls at the edge (CloudFront + WAF), VPC (security groups, NACLs), compute (OS hardening, instance metadata), and data (encryption). Never rely on a single security control.
Automate security best practices — Use AWS Config rules for compliance. Automate vulnerability scanning (Amazon Inspector). Use AWS Security Hub to aggregate findings. Automate remediation via EventBridge + Lambda.
Protect data in transit and at rest — Classify data by sensitivity. Use TLS for all in-transit data (enforce via bucket policy aws:SecureTransport). Encrypt at rest using KMS. Apply SSE-KMS for auditability.
Keep people away from data — Reduce direct human access to sensitive data. Use automated processes for data access. When humans must access, use temporary, audited credentials. Limit production access to break-glass scenarios.
Prepare for security events — Define incident response processes before an incident occurs. Run game days simulating security events. Use AWS Config, GuardDuty, and Security Hub for automated detection and pre-built runbooks for response.
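
The TLS enforcement mentioned under "Protect data in transit and at rest" is usually expressed as a bucket policy that denies any request where `aws:SecureTransport` is false. The sketch below builds such a policy document; the bucket name `example-bucket`, the function name, and the `Sid` are placeholders chosen for illustration.

```python
import json


def deny_insecure_transport_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies any S3 request not sent over TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport evaluates to "false" for plain-HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Serialized form, ready to attach to the bucket as its policy document.
policy_json = json.dumps(deny_insecure_transport_policy("example-bucket"), indent=2)
```

Because the statement is an explicit Deny, it overrides any Allow that would otherwise permit an unencrypted request.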

💡

Mnemonic — ITAPKPP: Identity foundation · Traceability · All layers · Automate · Protect data · Keep people away · Prepare for events.

Best Practices

Security Best Practice Areas

The 7 security focus areas from the Well-Architected Security whitepaper.

🛡️ 7 Security Focus Areas — Identity, Detection, Infrastructure, Data & Incident Response

High Frequency

Focus Area	Key Practices	Primary AWS Services
Security Foundations	AWS account structure, multi-account strategy with Organizations, Trusted Advisor checks	AWS Organizations, Control Tower, Trusted Advisor
Identity & Access Management	Least privilege, no root access keys, IAM roles over users, MFA everywhere, permissions boundaries	IAM, IAM Identity Center, STS, Cognito
Detection	Enable CloudTrail in all regions, GuardDuty for threat detection, Config for compliance, Security Hub for aggregation	CloudTrail, GuardDuty, AWS Config, Security Hub, Macie
Infrastructure Protection	VPC with private subnets, security groups (allow-only), NACLs for subnet-level deny, WAF on ALBs/CloudFront	VPC, WAF, Shield, Network Firewall, Inspector
Data Protection	Classify data by sensitivity (PII, PHI), encrypt at rest (KMS) and in transit (TLS), S3 Block Public Access, MFA Delete	KMS, ACM, S3, Macie, CloudHSM
Incident Response	Pre-defined runbooks, isolate compromised resources via Security Groups, use forensic accounts, automate containment	GuardDuty → EventBridge → Lambda, Security Hub, Systems Manager
Application Security	Threat modeling, SAST/DAST in CI/CD pipeline, dependency scanning, secrets management	Inspector, CodeGuru Security, Secrets Manager

Exam Scenario — "Detect and auto-remediate"

GuardDuty detects an EC2 instance communicating with a known C2 server. The recommended automated remediation is: GuardDuty finding → EventBridge rule → Lambda function that modifies the instance's Security Group to isolate it (removing all inbound/outbound rules except the forensics team's IP). This is "Automate security best practices" + "Prepare for security events" in action.
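
A rough sketch of the Lambda side of this flow is shown below. It only works out what to change and makes no AWS calls. The function name `plan_isolation` and the idea of a pre-created quarantine security group (one that allows only the forensics team's IP) are assumptions for illustration. A real handler would pass the result to the EC2 API, for example boto3's `modify_instance_attribute`.

```python
from typing import Optional

# EventBridge event pattern that matches every GuardDuty finding.
GUARDDUTY_FINDING_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
}


def plan_isolation(event: dict, quarantine_sg_id: str) -> Optional[dict]:
    """Return the arguments for moving an instance onto a quarantine security group.

    Returns None when the finding does not concern an EC2 instance.
    """
    resource = event.get("detail", {}).get("resource", {})
    instance_id = resource.get("instanceDetails", {}).get("instanceId")
    if not instance_id:
        return None
    # Replacing the instance's security groups with the quarantine group
    # cuts it off from everything that group does not explicitly allow.
    return {"InstanceId": instance_id, "Groups": [quarantine_sg_id]}
```

Keeping the decision logic separate from the API call makes it easy to unit-test the remediation before a real incident.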

🎯

Exam pattern — match the service to the security area: Threat detection → GuardDuty. PII discovery in S3 → Macie. Compliance rules → AWS Config. Vulnerability scanning → Inspector. Aggregate findings → Security Hub. Audit log → CloudTrail. Web exploit protection → WAF.

⚠️

Common traps:

"Security and Operational Excellence are separate concerns" — FALSE; Well-Architected explicitly requires both — security must be built into operations, not bolted on.
"GuardDuty auto-remediates security threats" — FALSE; GuardDuty only generates findings. Remediation requires a separate automation (EventBridge + Lambda).
"AWS Shield Standard must be purchased" — FALSE; Shield Standard is free and automatically enabled for all AWS customers.

Key Services

Security Pillar — Key AWS Services

🔑 Security Services Map

Reference

Category	Service	Purpose
Identity	IAM, IAM Identity Center, Cognito, STS	Authentication, authorization, federation, temporary credentials
Detection	GuardDuty, Macie, Inspector, CloudTrail, Security Hub	Threat detection, data classification, vulnerability scanning, audit logs, aggregation
Infrastructure	VPC, WAF, Shield, Network Firewall, Firewall Manager	Network isolation, web exploit protection, DDoS protection
Data	KMS, ACM, CloudHSM, Secrets Manager	Encryption keys, certificates, HSMs, secret storage
Compliance	AWS Config, Audit Manager, Artifact	Compliance rules, audit reports, regulatory documentation

IAM · GuardDuty · Macie · Inspector · Security Hub · CloudTrail · KMS · ACM · WAF · Shield · AWS Config · Secrets Manager

Pillar 3 — Reliability

The ability of a workload to perform its intended function correctly and consistently when it's expected to. Includes the ability to operate and test the workload through its total lifecycle — from design, through operations, to decommission.

🎯 Focus: Recover quickly from failures, scale to meet demand

Design Principles

Reliability Design Principles

🔄 5 Design Principles of the Reliability Pillar

Exam Fave · Reliability

Automatically recover from failure — Monitor KPIs and trigger automation when thresholds are breached. Use Auto Scaling, Route 53 health checks, and ALB health checks to replace unhealthy resources without human intervention.
Test recovery procedures — Use automation to simulate component failures. Don't wait for real incidents to find out your DR plan doesn't work. Run regular failover drills — test Multi-AZ failover, backup restoration, and cross-region DR.
Scale horizontally to increase aggregate workload availability — Replace one large resource with many smaller resources distributed across multiple AZs. If one fails, the remaining capacity absorbs load. Avoid single large instances; prefer Auto Scaling groups.
Stop guessing capacity — Use Auto Scaling to provision the right amount of capacity at any given time. Add and remove resources automatically based on demand — never over-provision for peak or under-provision and throttle.
Manage change through automation — Use IaC (CloudFormation) and CI/CD to make changes in a consistent, tested, automated way. Unmanaged manual changes are a leading cause of reliability incidents.
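
To see why horizontal scaling raises aggregate availability, consider the textbook formula for redundant components: if each copy is available with probability a and any single copy can serve the workload, the combined availability is 1 - (1 - a)^n. The sketch below assumes failures are independent and that one surviving copy has enough capacity. Both are simplifying assumptions, and the function name is illustrative.

```python
def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent components when any one of them suffices."""
    if copies < 1:
        raise ValueError("at least one copy is required")
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    # The workload is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies
```

Under these assumptions, two 99% instances in different AZs give roughly 99.99% combined availability, which is the intuition behind preferring many small resources over one large one.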

Availability Targets — Know Your Numbers

Availability	Downtime per year	Downtime per month
99% ("Two nines")	~87.6 hours	~7.3 hours
99.9% ("Three nines")	~8.76 hours	~43.8 minutes
99.99% ("Four nines")	~52.6 minutes	~4.38 minutes
99.999% ("Five nines")	~5.26 minutes	~26.3 seconds
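
The figures in this table follow from simple arithmetic: allowed downtime is the period length multiplied by (1 - availability). A minimal sketch, assuming a 365-day year and an average month of 730 hours (the function name is illustrative):

```python
HOURS_PER_YEAR = 365 * 24            # 8,760 hours, ignoring leap years
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730 hours on average


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in the period at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_hours * (1 - availability_percent / 100)
```

For example, `allowed_downtime_hours(99.9)` gives about 8.76 hours per year, and `allowed_downtime_hours(99.99, HOURS_PER_MONTH) * 60` gives about 4.38 minutes per month.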

🎯

"Eliminate single points of failure" → Scale horizontally + Multi-AZ. "Never test DR only during a real incident" → Test recovery procedures principle. "Unpredictable traffic causes outages" → Stop guessing capacity → Auto Scaling.

Best Practices

Reliability Best Practices

Foundations, workload architecture, change management, and failure management.

🏗️ Foundations, Workload Architecture & Failure Management

High Frequency

Foundations

Service quotas: Understand and proactively request increases for limits that could affect workload availability (e.g., Lambda concurrency, EC2 vCPUs, VPC subnets)
Network topology: Design VPCs with multiple AZs. Use private subnets for compute; public only for load balancers and NAT. Plan IP address space to avoid exhaustion.

Workload Architecture

Service-oriented architecture: Each component deployed independently. Failures in one component don't cascade to others.
Bulkhead pattern: Isolate components so failure of one doesn't exhaust resources for others (e.g., separate thread pools, separate queues per tier)
Circuit breaker: Detect downstream failures and stop sending requests to a failing service; fail fast instead of queuing up and causing cascading timeouts
Idempotency: Design operations that can be retried without side effects — critical for at-least-once delivery systems (SQS, Lambda async)
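
The circuit breaker pattern listed above can be sketched as a small wrapper around calls to a downstream service. This is a simplified, single-threaded illustration. The class name, thresholds, and injectable clock are assumptions, and production code would normally use an established resilience library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of queuing behind a failing dependency.
                raise CircuitOpenError("downstream marked unhealthy")
            # Timeout elapsed: allow one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error they can handle with a fallback, rather than tying up threads waiting on timeouts.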

Change Management

Deploy changes using CI/CD with automated tests and rollback capabilities
Use CloudWatch alarms as CloudFormation rollback triggers — automatically roll back a stack update if alarm fires post-deployment
Feature flags (AppConfig) allow disabling problematic features without redeployment
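
As an illustration of alarm-based rollback, the sketch below builds the `RollbackConfiguration` structure that CloudFormation stack operations accept (for example via boto3's `update_stack`). The helper name, the 15-minute default, and the alarm ARN in the usage note are placeholders, and no AWS call is made here.

```python
def rollback_configuration(alarm_arns, monitoring_minutes=15):
    """Build a RollbackConfiguration that rolls back if any listed alarm fires."""
    alarm_arns = list(alarm_arns)
    if len(alarm_arns) > 5:
        raise ValueError("CloudFormation accepts at most five rollback triggers")
    if not 0 <= monitoring_minutes <= 180:
        raise ValueError("monitoring time must be between 0 and 180 minutes")
    return {
        "RollbackTriggers": [
            {"Arn": arn, "Type": "AWS::CloudWatch::Alarm"} for arn in alarm_arns
        ],
        # How long CloudFormation keeps watching the alarms after the update.
        "MonitoringTimeInMinutes": monitoring_minutes,
    }
```

A call such as `rollback_configuration(["arn:aws:cloudwatch:us-east-1:123456789012:alarm:example-5xx"], 10)` would monitor a hypothetical 5xx alarm for ten minutes after deployment.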

Failure Management

Backup and restore: AWS Backup for centralized, policy-based backup across services. Test restores regularly.
Multi-AZ: Deploy across ≥2 AZs. RDS Multi-AZ, ALB across AZs, ECS service across AZs. Synchronous replication for zero-data-loss failover.
Multi-Region DR strategies (RTO/RPO tradeoffs):

Strategy	RTO	RPO	Cost	Description
Backup & Restore	Hours	Hours	Lowest	Restore from backups in DR region when disaster occurs
Pilot Light	Tens of minutes	Minutes	Low	Minimal infrastructure (DB replicas) running; scale up on disaster
Warm Standby	Minutes	Seconds	Medium	Scaled-down copy fully running in DR region; scale to full on disaster
Multi-Site Active-Active	Near-zero	Near-zero	Highest	Full production load in ≥2 regions simultaneously; instant failover

💡

DR Cost vs. Speed: B&R (cheapest, slowest) → Pilot Light → Warm Standby → Active-Active (costliest, fastest). Exam will describe RTO/RPO requirements and ask which strategy fits.
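
The exam reasoning here is "pick the cheapest strategy that still meets the stated RTO and RPO". A minimal sketch of that selection follows. The numeric ceilings are illustrative assumptions chosen to reflect the ordering above, not AWS-published figures, and the function name is hypothetical.

```python
# (strategy, achievable RTO in minutes, achievable RPO in minutes), cheapest first.
# All numbers are illustrative assumptions for the sake of the example.
STRATEGIES = [
    ("Backup & Restore", 240, 240),
    ("Pilot Light", 20, 5),
    ("Warm Standby", 5, 1),
    ("Multi-Site Active-Active", 0, 0),
]


def cheapest_strategy(required_rto_min: float, required_rpo_min: float) -> str:
    """Return the lowest-cost strategy whose RTO and RPO both meet the requirement."""
    if required_rto_min < 0 or required_rpo_min < 0:
        raise ValueError("RTO and RPO requirements cannot be negative")
    for name, rto, rpo in STRATEGIES:
        if rto <= required_rto_min and rpo <= required_rpo_min:
            return name
    # Unreachable for non-negative inputs, because the last entry is (0, 0).
    return STRATEGIES[-1][0]
```

With these assumed ceilings, a requirement of "RTO 1 hour, RPO 10 minutes" maps to Pilot Light, while "RTO 10 minutes, RPO 2 minutes" needs Warm Standby.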

⚠️

Common traps:

"Multi-AZ and Multi-Region are the same" — FALSE; Multi-AZ provides HA within a region. Multi-Region is for disaster recovery and geographic distribution across regions.
"RTO is how much data you can afford to lose" — FALSE; RTO is Recovery Time Objective (how long recovery takes). RPO is Recovery Point Objective (how much data loss is acceptable).
"Pilot Light means no resources are running in DR" — FALSE; Pilot Light keeps minimal critical resources (like DB replicas) running. Backup & Restore has nothing running.

Key Services

Reliability Pillar — Key AWS Services

⚡ Reliability Services Map

Reference

Category	Service	Reliability Role
Compute HA	Auto Scaling Groups, ALB	Replace unhealthy instances; distribute load across AZs
Database HA	RDS Multi-AZ, Aurora Global DB, DynamoDB Global Tables	Synchronous replication; automatic failover; cross-region HA
DNS / Traffic	Route 53 health checks, failover routing, latency routing	Automatic traffic cutover to healthy endpoints
Backup	AWS Backup, S3 Cross-Region Replication, EBS snapshots	Centralized backup policies; cross-region data durability
Decoupling	SQS, SNS, EventBridge	Buffer requests; prevent cascade failures between components
Serverless HA	Lambda, DynamoDB, S3	Built-in HA across AZs by design — no customer configuration needed

Auto Scaling · ALB · RDS Multi-AZ · Aurora Global · Route 53 · AWS Backup · SQS · S3 CRR · CloudFormation

Pillar 4 — Performance Efficiency

The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.

🎯 Focus: Use the right resource type, right size, and continuously measure

Design Principles

Performance Efficiency Design Principles

🚀 5 Design Principles of Performance Efficiency

Exam Fave · Performance

Democratize advanced technologies — Use managed services for complex capabilities (ML with SageMaker, search with OpenSearch, graph with Neptune). Don't build what AWS manages; consume it as a service to reduce effort and accelerate innovation.
Go global in minutes: Deploy to multiple AWS Regions with minimal effort. Use CloudFront to serve users from edge locations worldwide. Aurora Global Database typically replicates across Regions with under one second of lag. Route 53 latency routing directs users to the Region that gives them the lowest latency.
Use serverless architectures — Remove the need to run and maintain servers. Lambda, Fargate, DynamoDB, Aurora Serverless, S3 — these services eliminate undifferentiated heavy lifting and scale automatically.
Experiment more often — The low cost of cloud resources enables A/B testing of architecture choices (instance types, storage classes, caching strategies) to find the best-performing option without long-term commitment.
Consider mechanical sympathy — Understand how services work internally and use them in alignment with their design. Example: DynamoDB performs best with high-cardinality partition keys; use keys that distribute load evenly across partitions.
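
The partition-key example under mechanical sympathy can be illustrated with a small simulation. The sketch below uses SHA-256 as a stand-in for the service's internal hash function, which is not public, and an arbitrary count of eight partitions. All names and numbers are assumptions for demonstration only.

```python
import hashlib
from collections import Counter


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition with a stable hash (a stand-in for the real one)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def busiest_share(keys, partitions: int = 8) -> float:
    """Fraction of requests that land on the most heavily loaded partition."""
    counts = Counter(partition_for(k, partitions) for k in keys)
    return max(counts.values()) / len(keys)


# High cardinality: 10,000 requests keyed by distinct user IDs.
by_user = [f"user-{i}" for i in range(10_000)]
# Low cardinality: the same number of requests keyed by one of two status values.
by_status = ["ACTIVE" if i % 2 else "INACTIVE" for i in range(10_000)]
```

Comparing `busiest_share(by_user)` with `busiest_share(by_status)` shows the effect: distinct keys spread load close to evenly, while two key values can never use more than two partitions, so at least half of all traffic hits a single one.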

Best Practices

Performance Efficiency Best Practices

Architecture selection, compute, data management, networking, and continuous review.

⚡ Architecture Selection, Compute, Data & Networking

High Frequency

Architecture Selection

Workload Type	Recommended Architecture	AWS Services
Event-driven / variable load	Serverless	Lambda, DynamoDB, API GW, S3
Containerized microservices	Container platform	ECS Fargate, EKS, ECR
Stateful long-running	Compute instances	EC2 with right-sized instance family
High-performance compute (HPC)	Cluster placement groups	C/P/G instance families, EFA networking
Data analytics	Purpose-built data services	Redshift, Athena, EMR, Kinesis

Compute & Hardware Selection

Right-size instances: Use CloudWatch metrics (CPU, memory via custom metric) + AWS Compute Optimizer to find underutilized instances and resize to the correct family and size
Instance families: General (M/T) · Compute-optimized (C) · Memory-optimized (R/X) · Storage-optimized (I/D) · Accelerated (P/G/Inf/Trn)
Graviton (ARM): Up to 40% better price/performance for many workloads (Lambda, EC2, RDS). Cost optimization AND performance simultaneously.
Auto Scaling: Predictive Scaling anticipates traffic; Target Tracking maintains a metric at a target value

Data Management

Choose the right storage: S3 for objects, EBS for block (EC2), EFS for shared file, FSx for HPC/Windows
Use caching to reduce read latency: ElastiCache, DAX (DynamoDB), CloudFront (CDN), API GW cache
Partition data to enable parallel reads — DynamoDB partition key design, Kinesis shard count

Networking & Content Delivery

Proximity: Deploy resources in regions close to users. Use Route 53 latency routing or geolocation routing.
CloudFront: Cache static assets and API responses at hundreds of edge locations worldwide, which reduces both user-facing latency and load on the origin
Enhanced Networking: For HPC and low-latency networking — use ENA (Elastic Network Adapter) and EFA (Elastic Fabric Adapter) on supported instance types
Global Accelerator: Routes traffic over the AWS backbone to the nearest healthy endpoint — reduces hops and jitter for non-cacheable traffic (dynamic APIs, gaming)

🎯

Exam shortcut — Performance Efficiency service matching: "Low latency for global users" → CloudFront. "Low latency for global dynamic API" → Global Accelerator. "Database query latency" → ElastiCache / DAX. "Wrong EC2 instance size" → Compute Optimizer. "Faster Lambda on Java" → SnapStart / Graviton.

Key Services

Performance Efficiency — Key AWS Services

📊 Performance Services Map

Reference

CloudFront · Global Accelerator · ElastiCache · DAX · Auto Scaling · Lambda · Graviton EC2 · Compute Optimizer · Kinesis · Aurora Serverless · SageMaker · EFA

Pillar 5 — Cost Optimization

The ability to run systems to deliver business value at the lowest price point. Covers understanding spending, controlling fund allocation, selecting the right resource type and quantity, and scaling to meet business needs without overspending.

🎯 Focus: Eliminate waste, right-size resources, use pricing models strategically

Design Principles

Cost Optimization Design Principles

💰 5 Design Principles of Cost Optimization

Exam Fave · Cost

Implement cloud financial management — Build organizational capability (FinOps practice). Establish a Cloud Center of Excellence. Finance and engineering must collaborate on cost goals. Tag all resources for chargeback/showback reporting.
Adopt a consumption model — Pay only for what you use. Use on-demand and serverless resources. Shut down non-production environments outside business hours. Use Auto Scaling to match supply to demand — never idle capacity.
Measure overall efficiency — Track cost per unit of business outcome (e.g., cost per order, cost per active user). Use AWS Cost Explorer, Cost & Usage Reports, and cost allocation tags.
Stop spending money on undifferentiated heavy lifting — Use managed services (RDS instead of self-managed MySQL on EC2, Lambda instead of always-on EC2). AWS manages patching, HA, and scaling — you stop paying for that operational work.
Analyze and attribute expenditure — Use cost allocation tags and AWS Organizations to attribute costs to teams/products. Enable cost visibility so engineers see the impact of their architectural decisions.

Best Practices

Cost Optimization Best Practices

Cloud financial management, usage awareness, cost-effective resources, demand management, and continuous optimization.

📉 Pricing Models, Right-Sizing, and Waste Elimination

High Frequency

EC2 Pricing Models

Model	Discount vs On-Demand	Commitment	Best For
On-Demand	Baseline	None	Variable, unpredictable workloads; dev/test
Reserved Instances (Standard)	Up to 72%	1 or 3 years, specific type	Steady-state production workloads with known instance type
Reserved Instances (Convertible)	Up to 54%	1 or 3 years, flexible type	Steady-state but may change instance family
Savings Plans (Compute)	Up to 66%	1 or 3 years, $/hr commitment	Flexible — applies to EC2, Lambda, Fargate, any region/family
Savings Plans (EC2 Instance)	Up to 72%	1 or 3 years, specific family+region	Specific instance family in specific region
Spot Instances	Up to 90%	None (can be interrupted)	Fault-tolerant, stateless, batch, CI/CD, ML training
Dedicated Hosts	On-demand or Reserved	Optional	Licensing compliance (per-socket, per-core) or regulatory isolation
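
A quick way to reason about commitment-based discounts: a commitment is paid for every hour whether or not the instance runs, so it only beats On-Demand once utilization exceeds (1 - discount). The sketch below is a simplified model. It ignores upfront payment options, treats the "up to" percentages as the actual discount, and assumes a 730-hour month. The function names and rates are illustrative.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of hours an instance must run before a commitment beats On-Demand."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in the range [0, 1)")
    return 1.0 - discount


def monthly_cost(on_demand_hourly: float, hours_used: float,
                 discount: float = 0.0, committed: bool = False,
                 hours_in_month: float = 730.0) -> float:
    """Monthly cost of one instance under On-Demand or a full-month commitment."""
    if committed:
        # The discounted rate is charged for every hour in the month.
        return on_demand_hourly * (1.0 - discount) * hours_in_month
    # On-Demand is charged only for the hours actually used.
    return on_demand_hourly * hours_used
```

With a 72% discount the break-even point is 28% utilization (about 204 of 730 hours), which is why commitments suit steady-state workloads and On-Demand suits instances that run only occasionally.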

Right-Sizing & Waste Elimination

AWS Compute Optimizer: ML-based right-sizing recommendations for EC2, Lambda, ECS tasks, Auto Scaling groups, EBS volumes — based on 14 days of CloudWatch metrics
AWS Trusted Advisor: Free cost optimization checks — identifies idle EC2 instances, underutilized EBS volumes, unassociated Elastic IPs, low-utilization RDS instances
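Idle resources: Terminate stopped EC2 instances you no longer need (their attached EBS volumes still incur charges), delete unattached EBS volumes, and release unassociated Elastic IPs (public IPv4 addresses are billed at $0.005/hr, including Elastic IPs that are not associated with a resource)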
S3 lifecycle policies: Move objects to cheaper storage classes over time (Standard → Standard-IA → Glacier → Glacier Deep Archive)
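
The storage-class progression above maps onto an S3 lifecycle configuration. The sketch below builds the structure that the S3 API expects (for example via boto3's `put_bucket_lifecycle_configuration`). The rule ID, the empty prefix, and the 30/90/180-day thresholds are assumptions chosen for illustration.

```python
def lifecycle_rule(rule_id: str = "archive-aging-objects", prefix: str = "") -> dict:
    """Build one lifecycle rule that steps objects down to cheaper storage classes."""
    return {
        "ID": rule_id,
        "Status": "Enabled",
        # An empty prefix applies the rule to every object in the bucket.
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
            {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }


# Complete configuration containing the single rule above.
LIFECYCLE_CONFIGURATION = {"Rules": [lifecycle_rule()]}
```

The right day counts depend on real access patterns. For data with unpredictable access, S3 Intelligent-Tiering may be a simpler choice than hand-tuned transitions.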

Expenditure & Usage Awareness

Tool	Purpose
AWS Cost Explorer	Visualize and analyze historical spend; forecast future costs; RI utilization and coverage reports
AWS Budgets	Set cost/usage/coverage thresholds; alert via SNS when threshold breached; take automated actions
Cost & Usage Report (CUR)	Most granular billing data — exported to S3, queryable with Athena; gold standard for chargeback
Cost Allocation Tags	Must be activated in Billing console; tag resources with Team/Project/Environment; filter Cost Explorer by tag

💡

Savings Plans vs. Reserved Instances: Savings Plans are more flexible (apply across EC2 + Lambda + Fargate, any region, any family). Standard RIs are less flexible but slightly higher discount for specific instance types. Compute Savings Plans = best all-round for most organizations.

🎯

"Alert when monthly spend exceeds $500" → AWS Budgets. "Find underutilized EC2 instances" → Compute Optimizer + Trusted Advisor. "Maximize discount for predictable 3-year workload" → Standard Reserved Instances. "Fault-tolerant batch processing cheaply" → Spot Instances. "Granular per-team billing data" → Cost & Usage Report + Athena + Cost Allocation Tags.

⚠️

Common traps:

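"AWS Budgets stops spending when the threshold is hit": FALSE. By default, Budgets only sends alerts. To actually restrict or stop resources, configure a Budget Action (apply an IAM policy or SCP, or stop targeted EC2 or RDS instances) or trigger your own automation, such as a Lambda function subscribed to the budget's SNS topic.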
"Reserved Instances can be used for any instance type": FALSE. Standard RIs are locked to a specific instance family, Region, and OS (size is flexible only for regional Linux RIs). Convertible RIs can be exchanged for a different instance family, OS, or tenancy, in return for a smaller discount.
"Spot Instances are suitable for production databases" — FALSE; Spot Instances can be interrupted with 2 minutes notice. Never use for stateful workloads requiring persistence.

Key Services

Cost Optimization — Key AWS Services

📊 Cost Services Map

Reference

Cost Explorer · AWS Budgets · Savings Plans · Reserved Instances · Spot Instances · Compute Optimizer · Trusted Advisor · Cost & Usage Report · S3 Lifecycle · Lambda (serverless) · Aurora Serverless · Graviton

Pillar 6 — Sustainability

The ability to continually improve sustainability impacts by reducing energy consumption and increasing efficiency across all components of a workload by maximizing the benefits from provisioned resources and minimizing the total resources required.

🎯 Focus: Minimize environmental impact — maximize utilization, minimize waste

Design Principles

Sustainability Design Principles

🌱 6 Design Principles of the Sustainability Pillar

Medium Frequency · Sustainability

Understand your impact — Measure the environmental impact of your workload. Use the Customer Carbon Footprint Tool to track your AWS carbon emissions. Establish KPIs for sustainability (e.g., carbon per API call).
Establish sustainability goals — Set long-term goals for carbon reduction aligned with your organization's sustainability commitments. Prioritize workloads with the highest impact.
Maximize utilization — Right-size resources to maximize utilization and reduce idle waste. Use Auto Scaling so resources only run when needed. Consolidate workloads onto fewer, larger instances vs. many small ones.
Anticipate and adopt new, more efficient hardware and software — Monitor AWS for new, more energy-efficient instance types (e.g., Graviton processors use up to 60% less energy than equivalent x86 instances). Adopt them as they become available.
Use managed services — AWS managed services are optimized for energy efficiency at scale (shared infrastructure, higher utilization). Moving from self-managed to managed (e.g., RDS vs MySQL on EC2) reduces your environmental footprint.
Reduce the downstream impact of your cloud workloads — Minimize data transfer, storage, and processing that end users perform. Use efficient data formats (Parquet vs CSV), compress data, deliver assets from edge (CloudFront) to reduce client-side energy use.

Best Practices

Sustainability Best Practices

Region selection, behavior patterns, software architecture, data patterns, hardware, and development process.

♻️ Sustainability Best Practice Areas & Exam Scenarios

Medium Frequency

Area	Practice	AWS Implementation
Region Selection	Choose regions powered by higher % renewable energy	Check AWS sustainability region data; consider carbon footprint per region
User Behavior	Reduce unnecessary data transfer and client-side compute	CloudFront edge caching; efficient pagination; compression (gzip/brotli)
Software Architecture	Use serverless to eliminate idle compute; adopt event-driven	Lambda (no idle servers), Fargate (no idle EC2), DynamoDB On-Demand
Data Patterns	Minimize data stored and processed; use efficient formats	S3 Intelligent-Tiering; archive to Glacier; use Parquet/ORC over CSV; delete unused data
Hardware Patterns	Use energy-efficient processor architectures	Graviton (ARM) instances — up to 60% less energy for same workload
Development Process	Include sustainability in architecture reviews; measure and improve	Well-Architected Tool sustainability lens; Customer Carbon Footprint Tool

Shared Responsibility for Sustainability

AWS responsibility: Optimize the physical data center, networking, and hardware (energy-efficient cooling, renewable energy procurement, efficient hardware design)
Customer responsibility: Optimize their workloads — right-size, remove idle resources, choose efficient architectures, use efficient data formats, leverage managed services

Sustainability vs. Cost Optimization

Sustainability and Cost Optimization are closely aligned — both benefit from eliminating waste and maximizing utilization. Key difference: Sustainability focuses on minimizing environmental impact (carbon emissions, energy use); Cost Optimization focuses on minimizing financial spend. In practice, actions like right-sizing, using Graviton, and switching to serverless achieve both goals simultaneously.

🎯

On the exam, Sustainability scenarios often involve: Graviton instances (energy-efficient compute), S3 Intelligent-Tiering (reduce storage waste), serverless architectures (no idle compute), and data lifecycle policies (delete unused data). If the question mentions "reduce carbon footprint" or "minimize environmental impact" → Sustainability pillar.

⚠️

Common traps:

"Sustainability is only relevant for large enterprises with ESG requirements" — FALSE; it's one of the 6 pillars and tested on all associate and professional exams.
"Using more powerful instances improves sustainability because jobs finish faster" — CONTEXT-DEPENDENT; over-provisioning wastes energy. Right-sizing is the key — use the minimum power needed to meet performance requirements.

Key Services

Sustainability — Key AWS Services

🌿 Sustainability Services Map

Reference

Graviton EC2 · Lambda (serverless) · Fargate · S3 Intelligent-Tiering · S3 Glacier · Aurora Serverless · CloudFront · Auto Scaling · Carbon Footprint Tool · Spot Instances · Compute Optimizer
