Pillar 1 — Operational Excellence
The ability to support development and run workloads effectively, gain insight into their operations, and continually improve processes and procedures to deliver business value.
🎯 Focus: Run, monitor, and continuously improve operations
Operational Excellence Design Principles
Six guiding principles that shift operations from reactive, manual, and siloed to proactive, automated, and collaborative.
- Perform operations as code — Define your entire workload (infrastructure, config, procedures) as code. Use IaC (CloudFormation, CDK) and runbooks as code to reduce human error and enable consistent, repeatable operations.
- Make frequent, small, reversible changes — Design workloads for incremental updates. Small changes reduce blast radius and enable faster rollback. Use CI/CD pipelines, blue/green deployments, and feature flags.
- Refine operations procedures frequently — Evolve runbooks, playbooks, and processes as the workload changes. Schedule game days to exercise procedures and validate they work before an incident.
- Anticipate failure — Perform pre-mortem exercises — imagine what could fail and design mitigations. Test failure scenarios regularly (chaos engineering, DR drills). Never assume; inject faults to verify resilience.
- Learn from all operational failures — Run post-incident reviews (blameless post-mortems). Share learnings across teams. Drive improvements from every event, not just outages.
- Use managed services — Remove operational burden by using managed AWS services. AWS manages patching, availability, and scaling; you focus on business logic.
Mnemonic — FMRALU: Frequent small changes · Managed services · Refine procedures · Anticipate failure · Learn from failures · Use ops as code.
On the exam, "runbooks as code" and "automate operational procedures" point to Operational Excellence. If a scenario says "manual process prone to human error" → the OE principle is Perform operations as code. If it says "big-bang releases" → Make frequent, small, reversible changes.
Operational Excellence Best Practices
Organize, Prepare, Operate, Evolve — the four areas of operational best practices.
- Understand business priorities — each team member understands their role in delivering business outcomes
- Structure teams around products/services (two-pizza teams) not infrastructure layers
- Define and publish operational runbooks (how to handle known events) and playbooks (how to investigate unknowns)
- Design for operations: Instrument workloads with telemetry (logs, metrics, traces). Design dashboards before launch.
- Operational readiness reviews: Validate a workload is ready to be operated before launch
- Game days: Simulate production events (failures, traffic spikes) to test runbooks and team response. Run on a schedule, not just before major launches.
- Config management: Use AWS Config, Systems Manager Parameter Store / AppConfig for validated, versioned configuration
- Use CloudWatch dashboards for real-time workload health visibility
- Define KPIs and SLOs; alert on breaches — not just technical metrics but business outcomes (e.g., order success rate)
- Respond to events with runbooks: Automate responses where possible (EventBridge → Lambda for auto-remediation)
- Escalate events through defined severity levels with on-call rotations (e.g., PagerDuty, OpsGenie)
- Run blameless post-mortems after every significant event — document timeline, root cause, contributing factors, action items
- Share learnings across teams via internal wikis, post-mortem databases, and all-hands reviews
- Continuously improve runbooks, architecture, and tooling based on operational experience
- Track improvement work as backlog items with the same priority as feature work
- AWS Well-Architected Tool: free in the AWS console; run workload reviews against all 6 pillars
- Produces a list of identified risks (High, Medium) and improvement plan
- Supports custom Lenses (industry-specific or org-specific questions)
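The "EventBridge → Lambda for auto-remediation" bullet above can be sketched roughly as follows. This is a minimal illustration, not a production runbook: the alarm names and action names in `RUNBOOK_ACTIONS` are hypothetical, and the handler only decides which runbook step to take rather than calling any AWS API. The event pattern uses the `source` and `detail-type` values that CloudWatch alarm state-change events carry.

```python
# EventBridge event pattern matching CloudWatch alarms that enter the ALARM state.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

# Hypothetical mapping from alarm name to a runbook step (assumption for illustration).
RUNBOOK_ACTIONS = {
    "orders-api-high-5xx": "restart_service",
    "queue-depth-high": "scale_out_workers",
}


def handler(event, context=None):
    """Lambda-style entry point: choose a runbook action for the alarm event."""
    detail = event.get("detail", {})
    alarm_name = detail.get("alarmName", "")
    state = detail.get("state", {}).get("value")
    if state != "ALARM":
        # Ignore OK / INSUFFICIENT_DATA transitions.
        return {"alarm": alarm_name, "action": "none"}
    # Unknown alarms fall back to paging a human instead of guessing.
    action = RUNBOOK_ACTIONS.get(alarm_name, "page_on_call")
    return {"alarm": alarm_name, "action": action}
```

Falling back to `page_on_call` for unmapped alarms keeps the automation conservative: only known events with a tested runbook are handled without a human.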
"Validate operational readiness before launch" → Operational Readiness Review (OE: Prepare). "Simulate failure to test runbooks" → Game Day (OE: Prepare). "Learn from incidents" → Post-mortem / Evolve. "Automate response to CloudWatch alarm" → EventBridge rule + Lambda (OE: Operate as code).
Common traps:
- "Operational Excellence is only about monitoring" — FALSE; it covers the full ops lifecycle: organization, preparation, day-to-day operations, and continuous evolution.
- "Post-mortems should identify who caused the outage" — FALSE; Well-Architected recommends blameless post-mortems that focus on system/process failures, not individual blame.
Operational Excellence — Key AWS Services
| Service | OE Area | How It Applies |
|---|---|---|
| AWS CloudFormation / CDK / SAM | Operations as Code | Define and version all infrastructure as templates; repeatable, consistent deployments |
| AWS CodePipeline / CodeBuild / CodeDeploy | Small, reversible changes | Automate build → test → deploy pipeline; enable frequent, safe releases |
| Amazon CloudWatch | Operate | Metrics, logs, alarms, dashboards, anomaly detection for workload health visibility |
| AWS X-Ray | Operate | Distributed tracing to identify latency bottlenecks and errors across microservices |
| AWS CloudTrail | Operate / Evolve | Audit log of all API calls — who did what, when. Essential for post-mortem analysis. |
| AWS Systems Manager | Prepare / Operate | Run Command, Patch Manager, Parameter Store, OpsCenter for operational automation |
| AWS Config | Organize / Operate | Record and evaluate resource configuration changes; compliance rules and auto-remediation |
| Amazon EventBridge | Operate as Code | Route operational events to automated Lambda remediation without human intervention |
| AWS Well-Architected Tool | Evolve | Guided review against all 6 pillars; produces improvement plan and risk register |
Pillar 2 — Security
The ability to protect data, systems, and assets while delivering business value through risk assessments and mitigation strategies. Encompasses identity management, detective controls, infrastructure protection, data protection, and incident response.
🎯 Focus: Protect data, systems, and assets at every layer
Security Design Principles
- Implement a strong identity foundation — Apply the principle of least privilege. Centralize identity management. Eliminate long-term static credentials wherever possible (use IAM roles, short-lived STS credentials).
- Maintain traceability — Log and monitor all actions and changes. Enable CloudTrail, VPC Flow Logs, and CloudWatch. Integrate logs into SIEM systems. Ensure every action by a human or system is auditable.
- Apply security at all layers — Defense in depth. Apply controls at the edge (CloudFront + WAF), VPC (security groups, NACLs), compute (OS hardening, instance metadata), and data (encryption). Never rely on a single security control.
- Automate security best practices — Use AWS Config rules for compliance. Automate vulnerability scanning (Amazon Inspector). Use AWS Security Hub to aggregate findings. Automate remediation via EventBridge + Lambda.
- Protect data in transit and at rest: Classify data by sensitivity. Use TLS for all in-transit data (for S3, enforce it with a bucket policy that denies requests where aws:SecureTransport is false). Encrypt at rest using KMS. Apply SSE-KMS for auditability.
- Keep people away from data: Reduce direct human access to sensitive data. Use automated processes for data access. When humans must access data, use temporary, audited credentials. Limit production access to break-glass scenarios.
- Prepare for security events — Define incident response processes before an incident occurs. Run game days simulating security events. Use AWS Config, GuardDuty, and Security Hub for automated detection and pre-built runbooks for response.
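The aws:SecureTransport enforcement mentioned in the principles above is usually expressed as a Deny statement in the bucket policy. The sketch below builds such a policy document as a Python dict; the function name and the bucket name `example-bucket` are placeholders chosen for illustration.

```python
import json


def tls_only_bucket_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies any S3 request not sent over TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                # aws:SecureTransport is "false" when the request used plain HTTP.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(tls_only_bucket_policy("example-bucket"), indent=2))
```

An explicit Deny is used because it overrides any Allow elsewhere, so the control holds even if another policy grants broad access.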
Mnemonic (ITAAPKP): Identity foundation · Traceability · All layers · Automate · Protect data · Keep people away · Prepare for events.
Security Best Practice Areas
The 7 security focus areas from the Well-Architected Security whitepaper.
| Focus Area | Key Practices | Primary AWS Services |
|---|---|---|
| Security Foundations | AWS account structure, multi-account strategy with Organizations, Trusted Advisor checks | AWS Organizations, Control Tower, Trusted Advisor |
| Identity & Access Management | Least privilege, no root access keys, IAM roles over users, MFA everywhere, permissions boundaries | IAM, IAM Identity Center, STS, Cognito |
| Detection | Enable CloudTrail in all regions, GuardDuty for threat detection, Config for compliance, Security Hub for aggregation | CloudTrail, GuardDuty, AWS Config, Security Hub, Macie |
| Infrastructure Protection | VPC with private subnets, security groups (allow-only), NACLs for subnet-level deny, WAF on ALBs/CloudFront | VPC, WAF, Shield, Network Firewall, Inspector |
| Data Protection | Classify data by sensitivity (PII, PHI), encrypt at rest (KMS) and in transit (TLS), S3 Block Public Access, MFA Delete | KMS, ACM, S3, Macie, CloudHSM |
| Incident Response | Pre-defined runbooks, isolate compromised resources via Security Groups, use forensic accounts, automate containment | GuardDuty → EventBridge → Lambda, Security Hub, Systems Manager |
| Application Security | Threat modeling, SAST/DAST in CI/CD pipeline, dependency scanning, secrets management | Inspector, CodeGuru Security, Secrets Manager |
GuardDuty detects an EC2 instance communicating with a known C2 server. The recommended automated remediation is: GuardDuty finding → EventBridge rule → Lambda function that modifies the instance's Security Group to isolate it (removing all inbound/outbound rules except the forensics team's IP). This is "Automate security best practices" + "Prepare for security events" in action.
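A minimal sketch of the Lambda step in that flow follows. It takes a slightly different route from the rule-editing approach described above: instead of rewriting rules, it swaps the instance onto a pre-created isolation security group, which is a common variant. The function name, the `isolation_sg_id` parameter, and the injected `ec2_client` are assumptions for illustration; in a real Lambda the client would typically be `boto3.client("ec2")`.

```python
def isolate_instance(event, ec2_client, isolation_sg_id):
    """Move the EC2 instance named in a GuardDuty finding onto an isolation group.

    Returns the instance ID that was isolated, or None if the event is not
    a GuardDuty finding about an EC2 instance.
    """
    if event.get("source") != "aws.guardduty":
        return None
    instance_id = (
        event.get("detail", {})
        .get("resource", {})
        .get("instanceDetails", {})
        .get("instanceId")
    )
    if not instance_id:
        return None
    # Replace the instance's security groups with the isolation group only.
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id, Groups=[isolation_sg_id]
    )
    return instance_id
```

Passing the client in as an argument keeps the function testable with a stub, so the containment logic can be exercised in a game day without touching real instances.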
Exam pattern — match the service to the security area: Threat detection → GuardDuty. PII discovery in S3 → Macie. Compliance rules → AWS Config. Vulnerability scanning → Inspector. Aggregate findings → Security Hub. Audit log → CloudTrail. Web exploit protection → WAF.
Common traps:
- "Security and Operational Excellence are separate concerns" — FALSE; Well-Architected explicitly requires both — security must be built into operations, not bolted on.
- "GuardDuty auto-remediates security threats" — FALSE; GuardDuty only generates findings. Remediation requires a separate automation (EventBridge + Lambda).
- "AWS Shield Standard must be purchased" — FALSE; Shield Standard is free and automatically enabled for all AWS customers.
Security Pillar — Key AWS Services
| Category | Service | Purpose |
|---|---|---|
| Identity | IAM, IAM Identity Center, Cognito, STS | Authentication, authorization, federation, temporary credentials |
| Detection | GuardDuty, Macie, Inspector, CloudTrail, Security Hub | Threat detection, data classification, vulnerability scanning, audit logs, aggregation |
| Infrastructure | VPC, WAF, Shield, Network Firewall, Firewall Manager | Network isolation, web exploit protection, DDoS protection |
| Data | KMS, ACM, CloudHSM, Secrets Manager | Encryption keys, certificates, HSMs, secret storage |
| Compliance | AWS Config, Audit Manager, Artifact | Compliance rules, audit reports, regulatory documentation |
Pillar 3 — Reliability
The ability of a workload to perform its intended function correctly and consistently when it's expected to. Includes the ability to operate and test the workload through its total lifecycle — from design, through operations, to decommission.
🎯 Focus: Recover quickly from failures, scale to meet demand
Reliability Design Principles
- Automatically recover from failure — Monitor KPIs and trigger automation when thresholds are breached. Use Auto Scaling, Route 53 health checks, and ALB health checks to replace unhealthy resources without human intervention.
- Test recovery procedures — Use automation to simulate component failures. Don't wait for real incidents to find out your DR plan doesn't work. Run regular failover drills — test Multi-AZ failover, backup restoration, and cross-region DR.
- Scale horizontally to increase aggregate workload availability — Replace one large resource with many smaller resources distributed across multiple AZs. If one fails, the remaining capacity absorbs load. Avoid single large instances; prefer Auto Scaling groups.
- Stop guessing capacity — Use Auto Scaling to provision the right amount of capacity at any given time. Add and remove resources automatically based on demand — never over-provision for peak or under-provision and throttle.
- Manage change through automation — Use IaC (CloudFormation) and CI/CD to make changes in a consistent, tested, automated way. Unmanaged manual changes are a leading cause of reliability incidents.
| Availability | Downtime per year | Downtime per month |
|---|---|---|
| 99% ("Two nines") | ~87.6 hours | ~7.3 hours |
| 99.9% ("Three nines") | ~8.76 hours | ~43.8 minutes |
| 99.99% ("Four nines") | ~52.6 minutes | ~4.38 minutes |
| 99.999% ("Five nines") | ~5.26 minutes | ~26.3 seconds |
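The figures in the table follow from simple arithmetic: allowed downtime is the length of the period multiplied by (1 - availability). The short sketch below reproduces them, assuming a 365-day year and a month of one twelfth of a year; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours(availability_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of allowed downtime in a period for a given availability percentage."""
    return period_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        per_year = downtime_hours(pct)
        per_month = downtime_hours(pct, HOURS_PER_YEAR / 12)
        print(f"{pct}%: {per_year:.2f} h/year, {per_month * 60:.2f} min/month")
```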
"Eliminate single points of failure" → Scale horizontally + Multi-AZ. "Never test DR only during a real incident" → Test recovery procedures principle. "Unpredictable traffic causes outages" → Stop guessing capacity → Auto Scaling.
Reliability Best Practices
Foundations, workload architecture, change management, and failure management.
- Service quotas: Understand and proactively request increases for limits that could affect workload availability (e.g., Lambda concurrency, EC2 vCPUs, VPC subnets)
- Network topology: Design VPCs with multiple AZs. Use private subnets for compute; public only for load balancers and NAT. Plan IP address space to avoid exhaustion.
- Service-oriented architecture: Each component deployed independently. Failures in one component don't cascade to others.
- Bulkhead pattern: Isolate components so failure of one doesn't exhaust resources for others (e.g., separate thread pools, separate queues per tier)
- Circuit breaker: Detect downstream failures and stop sending requests to a failing service; fail fast instead of queuing up and causing cascading timeouts
- Idempotency: Design operations that can be retried without side effects — critical for at-least-once delivery systems (SQS, Lambda async)
- Deploy changes using CI/CD with automated tests and rollback capabilities
- Use CloudWatch alarms as CloudFormation rollback triggers — automatically roll back a stack update if alarm fires post-deployment
- Feature flags (AppConfig) allow disabling problematic features without redeployment
- Backup and restore: AWS Backup for centralized, policy-based backup across services. Test restores regularly.
- Multi-AZ: Deploy across ≥2 AZs. RDS Multi-AZ, ALB across AZs, ECS service across AZs. Synchronous replication for zero-data-loss failover.
- Multi-Region DR strategies (RTO/RPO tradeoffs):
| Strategy | RTO | RPO | Cost | Description |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours | Lowest | Restore from backups in DR region when disaster occurs |
| Pilot Light | Tens of minutes | Minutes | Low | Minimal infrastructure (DB replicas) running; scale up on disaster |
| Warm Standby | Minutes | Seconds | Medium | Scaled-down copy fully running in DR region; scale to full on disaster |
| Multi-Site Active-Active | Near-zero | Near-zero | Highest | Full production load in ≥2 regions simultaneously; instant failover |
DR Cost vs. Speed: B&R (cheapest, slowest) → Pilot Light → Warm Standby → Active-Active (costliest, fastest). Exam will describe RTO/RPO requirements and ask which strategy fits.
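The circuit breaker bullet in the workload-architecture practices above can be illustrated with a small sketch. This is one simple way to implement the pattern, not a reference implementation: the class names, the default threshold of 3 failures, and the 30-second reset timeout are all assumptions for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the downstream call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of queuing behind a failing dependency.
                raise CircuitOpenError("downstream marked unhealthy")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers wrap each downstream request in `breaker.call(...)` and treat `CircuitOpenError` as a signal to serve a fallback or return an error immediately, which prevents cascading timeouts.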
Common traps:
- "Multi-AZ and Multi-Region are the same" — FALSE; Multi-AZ provides HA within a region. Multi-Region is for disaster recovery and geographic distribution across regions.
- "RTO is how much data you can afford to lose" — FALSE; RTO is Recovery Time Objective (how long recovery takes). RPO is Recovery Point Objective (how much data loss is acceptable).
- "Pilot Light means no resources are running in DR" — FALSE; Pilot Light keeps minimal critical resources (like DB replicas) running. Backup & Restore has nothing running.
Reliability Pillar — Key AWS Services
| Category | Service | Reliability Role |
|---|---|---|
| Compute HA | Auto Scaling Groups, ALB | Replace unhealthy instances; distribute load across AZs |
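| Database HA | RDS Multi-AZ, Aurora Global DB, DynamoDB Global Tables | Synchronous replication and automatic failover within a Region (RDS Multi-AZ); cross-Region replication, typically asynchronous (Aurora Global DB, DynamoDB Global Tables) |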
| DNS / Traffic | Route 53 health checks, failover routing, latency routing | Automatic traffic cutover to healthy endpoints |
| Backup | AWS Backup, S3 Cross-Region Replication, EBS snapshots | Centralized backup policies; cross-region data durability |
| Decoupling | SQS, SNS, EventBridge | Buffer requests; prevent cascade failures between components |
| Serverless HA | Lambda, DynamoDB, S3 | Built-in HA across AZs by design — no customer configuration needed |
Pillar 4 — Performance Efficiency
The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.
🎯 Focus: Use the right resource type, right size, and continuously measure
Performance Efficiency Design Principles
- Democratize advanced technologies — Use managed services for complex capabilities (ML with SageMaker, search with OpenSearch, graph with Neptune). Don't build what AWS manages; consume it as a service to reduce effort and accelerate innovation.
- Go global in minutes: Deploy to multiple AWS Regions with minimal effort. Use CloudFront to serve users from edge locations worldwide. Aurora Global Database typically replicates with under 1 second of lag. Route 53 latency routing directs users to the Region with the lowest latency.
- Use serverless architectures — Remove the need to run and maintain servers. Lambda, Fargate, DynamoDB, Aurora Serverless, S3 — these services eliminate undifferentiated heavy lifting and scale automatically.
- Experiment more often — The low cost of cloud resources enables A/B testing of architecture choices (instance types, storage classes, caching strategies) to find the best-performing option without long-term commitment.
- Consider mechanical sympathy — Understand how services work internally and use them in alignment with their design. Example: DynamoDB performs best with high-cardinality partition keys; use keys that distribute load evenly across partitions.
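The partition-key point in the mechanical sympathy principle can be seen in a small simulation. The sketch below hashes keys into a fixed number of buckets and reports how much of the data lands on the busiest one. It is only an illustration: the SHA-256 hash, the 8 partitions, and the sample keys are assumptions and do not reflect DynamoDB's internal hashing.

```python
import hashlib
from collections import Counter


def max_partition_share(keys, partitions=8):
    """Fraction of items that land on the busiest partition."""
    counts = Counter(
        int(hashlib.sha256(k.encode()).hexdigest(), 16) % partitions for k in keys
    )
    return max(counts.values()) / len(keys)


# Low cardinality: only three distinct key values across 9,000 items.
low = [status for status in ("NEW", "PAID", "SHIPPED") for _ in range(3000)]
# High cardinality: one distinct key per item.
high = [f"user-{i}" for i in range(9000)]

if __name__ == "__main__":
    print(f"status key:  {max_partition_share(low):.2f}")
    print(f"user-id key: {max_partition_share(high):.2f}")
```

With only three distinct values, at most three partitions ever receive traffic, so the busiest one carries at least a third of the load; the high-cardinality key spreads items close to evenly.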
Performance Efficiency Best Practices
Architecture selection, compute, data management, networking, and continuous review.
| Workload Type | Recommended Architecture | AWS Services |
|---|---|---|
| Event-driven / variable load | Serverless | Lambda, DynamoDB, API GW, S3 |
| Containerized microservices | Container platform | ECS Fargate, EKS, ECR |
| Stateful long-running | Compute instances | EC2 with right-sized instance family |
| High-performance compute (HPC) | Cluster placement groups | C/P/G instance families, EFA networking |
| Data analytics | Purpose-built data services | Redshift, Athena, EMR, Kinesis |
- Right-size instances: Use CloudWatch metrics (CPU, memory via custom metric) + AWS Compute Optimizer to find underutilized instances and resize to the correct family and size
- Instance families: General (M/T) · Compute-optimized (C) · Memory-optimized (R/X) · Storage-optimized (I/D) · Accelerated (P/G/Inf/Trn)
- Graviton (ARM): Up to 40% better price/performance for many workloads (Lambda, EC2, RDS). Cost optimization AND performance simultaneously.
- Auto Scaling: Predictive Scaling anticipates traffic; Target Tracking maintains a metric at a target value
- Choose the right storage: S3 for objects, EBS for block (EC2), EFS for shared file, FSx for HPC/Windows
- Use caching to reduce read latency: ElastiCache, DAX (DynamoDB), CloudFront (CDN), API GW cache
- Partition data to enable parallel reads — DynamoDB partition key design, Kinesis shard count
- Proximity: Deploy resources in regions close to users. Use Route 53 latency routing or geolocation routing.
- CloudFront: Cache static assets and API responses at 400+ edge locations globally — dramatically reduces latency and origin load
- Enhanced Networking: For HPC and low-latency networking — use ENA (Elastic Network Adapter) and EFA (Elastic Fabric Adapter) on supported instance types
- Global Accelerator: Routes traffic over the AWS backbone to the nearest healthy endpoint — reduces hops and jitter for non-cacheable traffic (dynamic APIs, gaming)
Exam shortcut — Performance Efficiency service matching: "Low latency for global users" → CloudFront. "Low latency for global dynamic API" → Global Accelerator. "Database query latency" → ElastiCache / DAX. "Wrong EC2 instance size" → Compute Optimizer. "Faster Lambda on Java" → SnapStart / Graviton.
Performance Efficiency — Key AWS Services
Pillar 5 — Cost Optimization
The ability to run systems to deliver business value at the lowest price point. Covers understanding spending, controlling fund allocation, selecting the right resource type and quantity, and scaling to meet business needs without overspending.
🎯 Focus: Eliminate waste, right-size resources, use pricing models strategically
Cost Optimization Design Principles
- Implement cloud financial management — Build organizational capability (FinOps practice). Establish a Cloud Center of Excellence. Finance and engineering must collaborate on cost goals. Tag all resources for chargeback/showback reporting.
- Adopt a consumption model — Pay only for what you use. Use on-demand and serverless resources. Shut down non-production environments outside business hours. Use Auto Scaling to match supply to demand — never idle capacity.
- Measure overall efficiency — Track cost per unit of business outcome (e.g., cost per order, cost per active user). Use AWS Cost Explorer, Cost & Usage Reports, and cost allocation tags.
- Stop spending money on undifferentiated heavy lifting — Use managed services (RDS instead of self-managed MySQL on EC2, Lambda instead of always-on EC2). AWS manages patching, HA, and scaling — you stop paying for that operational work.
- Analyze and attribute expenditure — Use cost allocation tags and AWS Organizations to attribute costs to teams/products. Enable cost visibility so engineers see the impact of their architectural decisions.
Cost Optimization Best Practices
Cloud financial management, usage awareness, cost-effective resources, demand management, and continuous optimization.
| Model | Discount vs On-Demand | Commitment | Best For |
|---|---|---|---|
| On-Demand | Baseline | None | Variable, unpredictable workloads; dev/test |
| Reserved Instances (Standard) | Up to 72% | 1 or 3 years, specific type | Steady-state production workloads with known instance type |
| Reserved Instances (Convertible) | Up to 66% | 1 or 3 years, flexible type | Steady-state but may change instance family |
| Savings Plans (Compute) | Up to 66% | 1 or 3 years, $/hr commitment | Flexible — applies to EC2, Lambda, Fargate, any region/family |
| Savings Plans (EC2 Instance) | Up to 72% | 1 or 3 years, specific family+region | Specific instance family in specific region |
| Spot Instances | Up to 90% | None (can be interrupted) | Fault-tolerant, stateless, batch, CI/CD, ML training |
| Dedicated Hosts | On-demand or Reserved | Optional | Licensing compliance (per-socket, per-core) or regulatory isolation |
- AWS Compute Optimizer: ML-based right-sizing recommendations for EC2, Lambda, ECS services on Fargate, Auto Scaling groups, and EBS volumes, based by default on the last 14 days of CloudWatch metrics
- AWS Trusted Advisor: Free cost optimization checks — identifies idle EC2 instances, underutilized EBS volumes, unassociated Elastic IPs, low-utilization RDS instances
- Idle resources: Terminate stopped EC2 instances with attached EBS, delete unattached EBS volumes, release unused Elastic IPs (public IPv4 addresses are billed at $0.005/hr whether or not they are associated)
- S3 lifecycle policies: Move objects to cheaper storage classes over time (Standard → Standard-IA → Glacier → Glacier Deep Archive)
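The S3 lifecycle bullet above maps to a lifecycle configuration with one transition per storage class. The sketch below builds such a configuration as a Python dict in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the rule ID and the 30/90/365-day thresholds are assumptions for illustration, not recommended values.

```python
def tiering_lifecycle(prefix: str = "") -> dict:
    """Lifecycle configuration: Standard -> Standard-IA -> Glacier -> Deep Archive."""
    return {
        "Rules": [
            {
                "ID": "tier-down-over-time",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to all objects
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }
```

It could then be applied with something like `s3.put_bucket_lifecycle_configuration(Bucket="example-bucket", LifecycleConfiguration=tiering_lifecycle())`, where `s3` is a boto3 S3 client and the bucket name is a placeholder.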
| Tool | Purpose |
|---|---|
| AWS Cost Explorer | Visualize and analyze historical spend; forecast future costs; RI utilization and coverage reports |
| AWS Budgets | Set cost/usage/coverage thresholds; alert via SNS when threshold breached; take automated actions |
| Cost & Usage Report (CUR) | Most granular billing data — exported to S3, queryable with Athena; gold standard for chargeback |
| Cost Allocation Tags | Must be activated in Billing console; tag resources with Team/Project/Environment; filter Cost Explorer by tag |
Savings Plans vs. Reserved Instances: Compute Savings Plans are more flexible (they apply across EC2, Lambda, and Fargate, in any region and any family). Standard RIs are less flexible but offer a slightly higher maximum discount for specific instance types. Compute Savings Plans = best all-round for most organizations.
"Alert when monthly spend exceeds $500" → AWS Budgets. "Find underutilized EC2 instances" → Compute Optimizer + Trusted Advisor. "Maximize discount for predictable 3-year workload" → Standard Reserved Instances. "Fault-tolerant batch processing cheaply" → Spot Instances. "Granular per-team billing data" → Cost & Usage Report + Athena + Cost Allocation Tags.
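Common traps:
- "AWS Budgets stops spending when the threshold is hit": FALSE; by default, Budgets only alerts. To actually restrict or stop resources, configure a budget action (apply an IAM policy or SCP, or stop targeted EC2/RDS instances) or trigger custom automation such as a Lambda function from the SNS notification.
- "Reserved Instances can be used for any instance type": FALSE; Standard RIs are tied to a specific instance family, region, OS, and tenancy (regional Linux RIs with default tenancy are size-flexible within the family). Convertible RIs can be exchanged for a different instance family, OS, or tenancy.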
- "Spot Instances are suitable for production databases" — FALSE; Spot Instances can be interrupted with 2 minutes notice. Never use for stateful workloads requiring persistence.
Cost Optimization — Key AWS Services
Pillar 6 — Sustainability
The ability to continually improve sustainability impacts by reducing energy consumption and increasing efficiency across all components of a workload by maximizing the benefits from provisioned resources and minimizing the total resources required.
🎯 Focus: Minimize environmental impact (maximize utilization, minimize waste)
Sustainability Design Principles
- Understand your impact — Measure the environmental impact of your workload. Use the Customer Carbon Footprint Tool to track your AWS carbon emissions. Establish KPIs for sustainability (e.g., carbon per API call).
- Establish sustainability goals — Set long-term goals for carbon reduction aligned with your organization's sustainability commitments. Prioritize workloads with the highest impact.
- Maximize utilization — Right-size resources to maximize utilization and reduce idle waste. Use Auto Scaling so resources only run when needed. Consolidate workloads onto fewer, larger instances vs. many small ones.
- Anticipate and adopt new, more efficient hardware and software — Monitor AWS for new, more energy-efficient instance types (e.g., Graviton processors use up to 60% less energy than equivalent x86 instances). Adopt them as they become available.
- Use managed services — AWS managed services are optimized for energy efficiency at scale (shared infrastructure, higher utilization). Moving from self-managed to managed (e.g., RDS vs MySQL on EC2) reduces your environmental footprint.
- Reduce the downstream impact of your cloud workloads — Minimize data transfer, storage, and processing that end users perform. Use efficient data formats (Parquet vs CSV), compress data, deliver assets from edge (CloudFront) to reduce client-side energy use.
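The "compress data" advice in the last principle is easy to quantify. The sketch below gzips a synthetic CSV payload using only the standard library and compares sizes; the sample rows are made up for illustration, and real-world ratios depend heavily on the data.

```python
import gzip

# Synthetic, repetitive CSV payload (illustrative data only).
rows = ["order_id,status,amount"] + [
    f"{i},SHIPPED,{i % 100}.00" for i in range(10_000)
]
raw = "\n".join(rows).encode("utf-8")
compressed = gzip.compress(raw)

ratio = len(compressed) / len(raw)

if __name__ == "__main__":
    print(f"raw={len(raw)} bytes, gzip={len(compressed)} bytes, ratio={ratio:.2f}")
```

Fewer bytes stored and transferred means less storage, network, and client-side work for the same information, which is the downstream saving the principle refers to.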
Sustainability Best Practices
Region selection, behavior patterns, software architecture, data patterns, hardware, and development process.
| Area | Practice | AWS Implementation |
|---|---|---|
| Region Selection | Choose regions powered by higher % renewable energy | Check AWS sustainability region data; consider carbon footprint per region |
| User Behavior | Reduce unnecessary data transfer and client-side compute | CloudFront edge caching; efficient pagination; compression (gzip/brotli) |
| Software Architecture | Use serverless to eliminate idle compute; adopt event-driven | Lambda (no idle servers), Fargate (no idle EC2), DynamoDB On-Demand |
| Data Patterns | Minimize data stored and processed; use efficient formats | S3 Intelligent-Tiering; archive to Glacier; use Parquet/ORC over CSV; delete unused data |
| Hardware Patterns | Use energy-efficient processor architectures | Graviton (ARM) instances — up to 60% less energy for same workload |
| Development Process | Include sustainability in architecture reviews; measure and improve | Well-Architected Tool sustainability lens; Customer Carbon Footprint Tool |
- AWS responsibility: Optimize the physical data center, networking, and hardware (energy-efficient cooling, renewable energy procurement, efficient hardware design)
- Customer responsibility: Optimize their workloads — right-size, remove idle resources, choose efficient architectures, use efficient data formats, leverage managed services
Sustainability and Cost Optimization are closely aligned — both benefit from eliminating waste and maximizing utilization. Key difference: Sustainability focuses on minimizing environmental impact (carbon emissions, energy use); Cost Optimization focuses on minimizing financial spend. In practice, actions like right-sizing, using Graviton, and switching to serverless achieve both goals simultaneously.
On the exam, Sustainability scenarios often involve: Graviton instances (energy-efficient compute), S3 Intelligent-Tiering (reduce storage waste), serverless architectures (no idle compute), and data lifecycle policies (delete unused data). If the question mentions "reduce carbon footprint" or "minimize environmental impact" → Sustainability pillar.
Common traps:
- "Sustainability is only relevant for large enterprises with ESG requirements" — FALSE; it's one of the 6 pillars and tested on all associate and professional exams.
- "Using more powerful instances improves sustainability because jobs finish faster" — CONTEXT-DEPENDENT; over-provisioning wastes energy. Right-sizing is the key — use the minimum power needed to meet performance requirements.