Domain 1 Overview
Develop, test, and maintain applications on AWS. Covers architectural patterns, AWS SDKs and APIs, messaging and event-driven services, Lambda development, and data store integration. The highest-weighted domain at 32%.
⚡ 32% of scored content: highest weighted domain
Develop Code for Applications Hosted on AWS
Architectural patterns, SDK usage, API design, messaging services, streaming, and event-driven patterns.
- Describe architectural patterns (event-driven, microservices, monolithic, choreography, orchestration, fanout)
- Describe differences between stateful/stateless and tightly/loosely coupled components
- Describe differences between synchronous and asynchronous patterns
- Write and run unit tests; write code that interacts with AWS services via APIs and SDKs
- Use messaging services, handle streaming data, implement resilient application code
- Use Amazon EventBridge for event-driven patterns; use Amazon Q Developer
The DVA-C02 exam tests your ability to choose the right architectural pattern for a given scenario. Understand the tradeoffs deeply.
| Pattern | Description | AWS Services | Best For |
|---|---|---|---|
| Monolithic | Single deployable unit; tightly coupled | EC2, Elastic Beanstalk | Simple apps, small teams, early stage |
| Microservices | Independent services with own data stores | ECS, EKS, Lambda, API GW | Scale independently, team autonomy |
| Event-Driven | Components react to events asynchronously | EventBridge, SNS, SQS, Lambda | Decoupled, scalable workflows |
| Fanout | One publisher → multiple subscribers | SNS → SQS queues | Parallel processing of same event |
| Aspect | Choreography | Orchestration |
|---|---|---|
| Control | Decentralized — each service reacts to events | Centralized — one service directs the flow |
| Coupling | Loose — services don't know each other | Tighter — orchestrator knows all services |
| AWS Service | EventBridge, SNS, SQS | AWS Step Functions |
| Visibility | Harder to trace the full workflow | Full execution history & audit trail |
| Best for | Simple, independent workflows | Complex multi-step processes with error handling |
An e-commerce order event is published to an SNS topic. Three SQS queues subscribe: one for inventory, one for shipping, one for email notifications. All three process the order in parallel. This is the classic SNS fanout pattern — one message, multiple independent consumers.
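A minimal in-memory sketch of that fanout can make the "one message, many independent copies" idea concrete. The `Topic` class is a hypothetical stand-in for an SNS topic, and plain deques stand in for the SQS queues. None of it is the real SNS/SQS API.

```python
from collections import deque


class Topic:
    """Hypothetical in-memory stand-in for an SNS topic (not the SNS API)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, queue):
        self.subscribers.append(queue)

    def publish(self, message):
        # Fanout: every subscribed queue receives its own copy of the message.
        for queue in self.subscribers:
            queue.append(dict(message))


orders_topic = Topic()
inventory_q, shipping_q, email_q = deque(), deque(), deque()
for q in (inventory_q, shipping_q, email_q):
    orders_topic.subscribe(q)

orders_topic.publish({"orderId": "o-123", "total": 42.0})

# Each consumer drains its own queue independently of the others.
inventory_q.popleft()  # the inventory service handles its copy
print(len(inventory_q), len(shipping_q), len(email_q))  # 0 1 1
```

Because each queue holds its own copy, a slow or failing email consumer does not delay inventory or shipping.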
- Stateless: No session data stored in the application instance — any request can go to any instance. Store state externally (DynamoDB, ElastiCache, S3). Lambda functions are stateless by design.
- Stateful: Session data in memory — the same client must hit the same instance. Requires sticky sessions (ALB) or session replication. Harder to scale horizontally.
- Tightly coupled: Service A calls Service B synchronously and waits. If B fails, A fails. Direct function calls or synchronous HTTP are examples.
- Loosely coupled: Service A sends a message to a queue or topic and continues. B processes when ready. SQS, SNS, and EventBridge enable loose coupling.
| Synchronous | Asynchronous |
|---|---|
| Caller waits for response | Caller fires and continues |
| REST API calls, Lambda invoked by API GW | SQS, SNS, EventBridge, Lambda async invoke |
| Immediate feedback | Better scalability and resilience |
| Timeout risk under load | Backpressure handled by queue depth |
Mnemonic — CHESS: Choreography = each service reacts. Hard to trace. EventBridge/SNS. Step Functions = State machine for orchestration.
"Long-running multi-step workflow" → Step Functions Standard. "Multiple services react to same event" → SNS fanout to SQS. "Decouple producer from consumer" → SQS. "Route events based on rules" → EventBridge.
Common traps:
- "Microservices are always better than monolithic" — FALSE; microservices add operational complexity. For simple apps, a monolith is appropriate.
- "Lambda functions are stateful within their execution environment" — PARTIALLY TRUE; the execution context can be reused between warm invocations, but you cannot rely on it. Store durable state externally.
- "Choreography is better than orchestration because it's loosely coupled" — CONTEXT-DEPENDENT; orchestration (Step Functions) is preferred when you need visibility, error handling, and complex branching logic.
Every interaction with AWS from code goes through the SDK. Understanding the credential resolution chain and retry behavior is critical for DVA-C02.
1. Explicit in code: hardcoded (never do this in production)
2. Environment variables: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`
3. AWS credentials file: `~/.aws/credentials` (default profile)
4. AWS config file: `~/.aws/config`
5. Container credentials: ECS task role via the container metadata endpoint
6. Instance profile: EC2 IAM role via IMDS (`169.254.169.254`)
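- AWS SDKs automatically retry retryable errors (e.g., throttling errors such as `ThrottlingException`, and 5xx server errors)
- Exponential backoff: the wait before retry n grows roughly as 2^n × a base delay, with random jitter added; jitter keeps many clients from retrying in lockstep (the thundering herd)
- Non-retryable errors (4xx such as `AccessDeniedException`) are NOT retried by the SDK
- If throttling persists after the SDK's built-in retries are exhausted (e.g., DynamoDB `ProvisionedThroughputExceededException` or other service throttling errors), implement backoff in your own code or reduce the request rate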
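The SDKs apply backoff internally. For retries you add yourself (for example, after SDK retries are exhausted), here is a minimal sketch of exponential backoff with "full jitter". `RetryableError`, the attempt limit, and the delay values are assumptions for illustration, not SDK defaults. Any other exception propagates immediately, mirroring how non-retryable errors are treated.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical stand-in for throttling / 5xx errors."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Retry `operation` with exponential backoff and full jitter.

    The delay after failed attempt n (n = 1, 2, ...) is a random value in
    [0, min(max_delay, base_delay * 2**n)].
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


# Usage: an operation that is throttled twice, then succeeds.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RetryableError("throttled")
    return "ok"


delays = []  # record delays instead of actually sleeping
result = call_with_backoff(flaky, sleep=delays.append)
print(result, len(delays))  # ok 2
```

Injecting `sleep` keeps the sketch testable. Randomizing the whole interval ("full jitter") spreads retries from many clients more evenly than adding a small random offset.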
| Pattern | Description | Example |
|---|---|---|
| Paginator | Auto-handles NextToken pagination | list_objects_v2 on large S3 buckets |
| Waiter | Poll until resource reaches target state | Wait for EC2 instance to be running |
| Presigned URL | Temporary, signed URL for S3 access | Upload/download without credentials |
| Batch operations | Reduce API calls by batching | DynamoDB BatchWriteItem, SQS SendMessageBatch |
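The Paginator row can be illustrated with the loop a paginator runs for you. In boto3 you would normally call `s3.get_paginator("list_objects_v2").paginate(Bucket=...)`. The sketch below instead uses a hypothetical `fake_list_objects` function, with its page size and string tokens, so that it runs without AWS access.

```python
def fake_list_objects(continuation_token=None, page_size=2):
    """Hypothetical paginated API returning (items, next_token), loosely
    modeled on a list call that hands back a continuation token."""
    keys = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]
    start = int(continuation_token or 0)
    page = keys[start:start + page_size]
    next_start = start + page_size
    next_token = str(next_start) if next_start < len(keys) else None
    return page, next_token


def paginate(list_page):
    """Yield every item by following next tokens until none is returned."""
    token = None
    while True:
        items, token = list_page(token)
        yield from items
        if token is None:
            break


all_keys = list(paginate(fake_list_objects))
print(all_keys)  # ['a.txt', 'b.txt', 'c.txt', 'd.txt', 'e.txt']
```

A generator keeps memory flat for very large listings, because items are yielded one page at a time.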
```python
# Example: S3 presigned URL generation (Python boto3)
import boto3

s3 = boto3.client("s3")
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "file.pdf"},
    ExpiresIn=3600,  # URL valid for 1 hour
)
```
Never hardcode credentials. On EC2 → use IAM Role. On Lambda → use execution role. On ECS → use task role. The SDK resolves credentials in the chain order above automatically.
Common traps:
- "SDK retries all errors automatically" — FALSE; only retryable errors (5xx, throttling). AccessDenied (403) is not retried.
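- "Presigned URLs can be used to grant permanent access to S3 objects" — FALSE; they are time-limited. The maximum is 7 days when signed with SigV4 and long-term IAM user credentials. URLs signed with temporary credentials (e.g., an IAM role's) stop working when those credentials expire, even if `ExpiresIn` is longer.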
- "Environment variable credentials override instance profile credentials" — TRUE; they are checked earlier in the chain (position 2 vs 6).
API Gateway is the front door for serverless applications. The DVA-C02 exam tests all three API types, authorization methods, stages, and caching.
| Feature | REST API | HTTP API | WebSocket API |
|---|---|---|---|
| Protocol | HTTP/HTTPS | HTTP/HTTPS | WebSocket |
| Cost | Higher | ~70% cheaper | Per message/connection |
| Features | Full-featured (caching, WAF, usage plans) | Lightweight, low-latency | Persistent connections |
| Auth | Lambda, Cognito, IAM, API Keys | Lambda, JWT (incl. Cognito), IAM | Lambda authorizer, IAM |
| Best for | APIs needing advanced features | Simple proxies, microservices | Real-time (chat, notifications) |
| Authorizer Type | How It Works | Use Case |
|---|---|---|
| IAM Authorization | Caller signs request with Sig V4 | Internal AWS service-to-service |
| Cognito User Pool | Validates JWT from Cognito | Web/mobile apps with Cognito login |
| Lambda Authorizer (Token) | Lambda receives Bearer token, returns IAM policy | Custom JWT, OAuth, 3rd-party IdP |
| Lambda Authorizer (Request) | Lambda receives full request (headers, query) | Complex auth logic, IP allow-listing |
| API Key | Key sent in header; not authentication — use for rate limiting | Usage plans, partner throttling |
- Stages are named snapshots of your API (e.g., `dev`, `staging`, `prod`)
- Stage variables are key-value pairs, similar to environment variables, referenced as `${stageVariables.lambdaAlias}` in integrations
- Use stage variables to point different stages to different Lambda aliases or backends without changing the API definition
- Canary deployments: Route a % of traffic to new deployment for testing before full release
- API Gateway cache: Caches responses per stage; configurable TTL (0–3600s); reduces Lambda invocations for repeated identical requests
- Throttling: Default 10,000 RPS per account per Region (burst 5,000); per-method throttling is possible via stage settings or usage plans
- Returns `429 Too Many Requests` when throttled
Your API has three stages. The Lambda integration URL uses ${stageVariables.lambdaAlias}. In dev stage, the variable = dev. In prod, = live. Deploying a new Lambda version to the dev alias doesn't affect production — the same API definition routes to the correct Lambda alias per stage.
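The substitution API Gateway performs per stage can be mimicked in a few lines. This sketch is simplified: it only resolves the function ARN portion of a Lambda integration URI. The function name `orders-api` and the placeholder account ID are hypothetical.

```python
import re

# Hypothetical stage configuration mirroring the example above.
STAGE_VARIABLES = {
    "dev": {"lambdaAlias": "dev"},
    "prod": {"lambdaAlias": "live"},
}

# Function ARN with a stage-variable placeholder (function name is hypothetical).
INTEGRATION_URI = "arn:aws:lambda:us-east-1:123456789012:function:orders-api:${stageVariables.lambdaAlias}"


def resolve(template, stage):
    """Replace each ${stageVariables.NAME} with the value configured for `stage`."""
    variables = STAGE_VARIABLES[stage]
    return re.sub(r"\$\{stageVariables\.(\w+)\}", lambda m: variables[m.group(1)], template)


print(resolve(INTEGRATION_URI, "dev"))   # ends with ":orders-api:dev"
print(resolve(INTEGRATION_URI, "prod"))  # ends with ":orders-api:live"
```

One API definition serves every stage. Only the variable table differs, which is why promoting a new Lambda version never requires editing the API itself.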
Auth choice: Internal AWS service → IAM. Mobile/web app users → Cognito User Pool. Custom/3rd-party token → Lambda Authorizer. Rate-limiting partners → API Keys + Usage Plans.
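For the "custom/3rd-party token" case, a TOKEN-type Lambda authorizer has roughly this shape. API Gateway passes `authorizationToken` and `methodArn` in the event and expects a `principalId` plus an IAM policy back. The token check against `"Bearer allow-me"` and the `principalId` value are hypothetical placeholders. A real authorizer would verify a JWT's signature, issuer, audience, and expiry.

```python
def lambda_handler(event, context):
    """Sketch of a TOKEN-type Lambda authorizer that returns an IAM policy."""
    token = event.get("authorizationToken", "")
    # Hypothetical check; replace with real JWT/OAuth validation.
    effect = "Allow" if token == "Bearer allow-me" else "Deny"
    return {
        "principalId": "user-123",  # hypothetical caller identity
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": event["methodArn"],
                }
            ],
        },
    }


sample = {
    "type": "TOKEN",
    "authorizationToken": "Bearer allow-me",
    "methodArn": "arn:aws:execute-api:us-east-1:123456789012:abc123/prod/GET/orders",
}
policy = lambda_handler(sample, None)
print(policy["policyDocument"]["Statement"][0]["Effect"])  # Allow
```

A Deny policy results in a 403 for the caller. API Gateway can also cache the returned policy per token for a configurable TTL, which avoids invoking the authorizer on every request.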
Common traps:
- "API Keys authenticate and authorize users" — FALSE; API Keys identify the calling application, not the user. They are used for throttling/metering, not security authorization.
- "HTTP API supports caching and WAF integration" — FALSE; HTTP API is lightweight and does NOT support response caching or AWS WAF integration. Use REST API for these features.
- "Caching is free" — FALSE; API Gateway caching is an additional charge based on cache size.
Understanding when to use SQS vs SNS vs EventBridge is a core DVA-C02 skill. Know the delivery model, ordering guarantees, and failure handling for each.
| Feature | SQS | SNS | EventBridge |
|---|---|---|---|
| Model | Pull (consumer polls) | Push (fan-out) | Push (rule-based routing) |
| Consumers | One consumer per message | Multiple subscribers | Multiple targets per rule |
| Persistence | Up to 14 days | No storage (fire & forget) | No storage |
| Ordering | FIFO queues only | FIFO topics only | None |
| Exactly-once | FIFO queues only | FIFO topics only (deduplication; with SQS FIFO subscribers) | No |
| Message size | 256 KB | 256 KB | 256 KB |
- Visibility Timeout: How long a message is hidden after being received. If processing doesn't finish & delete within this window, the message becomes visible again → potential duplicate processing
- Dead-Letter Queue (DLQ): After `maxReceiveCount` failed processing attempts, the message is moved to the DLQ for investigation
- Long Polling: `WaitTimeSeconds` up to 20s; waits for messages instead of returning empty responses. Reduces API costs and false empty responses
- FIFO Queues: Exactly-once processing, strict ordering, 300 API calls/s (up to 3,000 messages/s with batching; high-throughput mode raises this). Use Message Group ID for ordered processing per group
- Message Retention: Default 4 days, max 14 days
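The interaction between visibility timeout and `maxReceiveCount` can be traced with a toy model. `SimulatedQueue` is hypothetical: it uses an integer clock so the run is deterministic, and it only imitates the redrive behavior described above, not the real SQS API.

```python
class SimulatedQueue:
    """Hypothetical in-memory model of an SQS queue with a redrive policy."""

    def __init__(self, visibility_timeout, max_receive_count, dlq):
        self.visibility_timeout = visibility_timeout
        self.max_receive_count = max_receive_count
        self.dlq = dlq
        self.messages = {}  # message id -> {"body", "receive_count", "visible_at"}

    def send(self, msg_id, body):
        self.messages[msg_id] = {"body": body, "receive_count": 0, "visible_at": 0}

    def receive(self, now):
        """Return one visible message (or None) and hide it for the visibility timeout."""
        for msg_id, msg in list(self.messages.items()):
            if msg["visible_at"] > now:
                continue  # still in flight for another consumer
            if msg["receive_count"] >= self.max_receive_count:
                # Redrive: receive count exhausted, move the message to the DLQ.
                self.dlq.append((msg_id, msg["body"]))
                del self.messages[msg_id]
                continue
            msg["receive_count"] += 1
            msg["visible_at"] = now + self.visibility_timeout
            return msg_id, msg["body"]
        return None

    def delete(self, msg_id):
        """Successful processing: the consumer deletes the message."""
        self.messages.pop(msg_id, None)


dlq = []
q = SimulatedQueue(visibility_timeout=30, max_receive_count=3, dlq=dlq)
q.send("m1", "poison pill")
# The consumer keeps failing (never calls q.delete), so the message reappears
# after each visibility timeout until it is redriven to the DLQ.
receipts = [q.receive(now=t) for t in (0, 10, 30, 60, 90)]
print(dlq)  # [('m1', 'poison pill')]
```

At `t=10` the message is still invisible, so the receive returns nothing. This hidden window is why a visibility timeout shorter than your processing time leads to duplicate processing.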
- Serverless event bus — routes events based on content-based filtering rules
- Event sources: AWS services, SaaS partners (Zendesk, Shopify), custom apps
- Targets: Lambda, SQS, SNS, Step Functions, API GW, Kinesis, etc.
- Event Patterns: Filter events by source, detail-type, and any field in the event JSON
- Scheduled Rules (cron): Replace CloudWatch Events scheduled rules; run tasks on a schedule
- Event Buses: Default (AWS services), custom (your app events), partner (SaaS events)
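Event-pattern filtering can be pictured with a simplified matcher. This sketch models only exact-value matching. Each leaf in the pattern is a list of allowed values, nested objects recurse, and an array field in the event matches if any element is allowed. Content filters such as prefix, numeric, or exists are deliberately left out. The rule and event are hypothetical.

```python
def matches(pattern, event):
    """Simplified EventBridge-style matching (exact values only)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif isinstance(actual, list):
            # An array field matches if any of its elements is an allowed value.
            if not any(item in expected for item in actual):
                return False
        elif actual not in expected:
            return False
    return True  # fields not mentioned in the pattern are ignored


# Hypothetical rule: route only gold/platinum orders from a custom source.
pattern = {
    "source": ["com.example.orders"],
    "detail-type": ["OrderPlaced"],
    "detail": {"tier": ["gold", "platinum"]},
}
event = {
    "source": "com.example.orders",
    "detail-type": "OrderPlaced",
    "detail": {"orderId": "o-123", "tier": "gold"},
}
print(matches(pattern, event))  # True
```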
Route events to multiple targets based on rules → EventBridge. One message, multiple parallel consumers → SNS → SQS fanout. Ensure exactly-once ordered processing → SQS FIFO. Decouple and buffer with retry → SQS Standard + DLQ.
Common traps:
- "SQS Standard guarantees at-most-once delivery" — FALSE; Standard provides at-least-once (duplicates can occur). Only FIFO provides exactly-once.
- "If a Lambda function fails to process an SQS message, the message is automatically sent to the DLQ" — FALSE; it goes to the DLQ only after `maxReceiveCount` failed receives. The message is retried first.
- "SNS stores messages if a subscriber is unavailable" — FALSE; SNS does not persist messages. It retries delivery according to the subscription's delivery policy, and once retries are exhausted the message is discarded unless a subscription DLQ is configured. Use an SQS subscriber to persist messages.
Kinesis enables real-time data streaming. Know the difference between the Kinesis products and when to use each vs SQS.
| Service | Purpose | Consumers | Key Feature |
|---|---|---|---|
| Kinesis Data Streams (KDS) | Custom real-time processing | Multiple (Lambda, KCL, Analytics) | Configurable retention 1–365 days; ordered per shard |
| Kinesis Data Firehose | Load streaming data to destinations | S3, Redshift, OpenSearch, Splunk | Fully managed, no custom consumer code needed |
| Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) | Real-time SQL/Flink on streams | KDS or Firehose as source | Detect anomalies, aggregate, filter in real-time |
- Shard: Unit of capacity — 1 MB/s write, 2 MB/s read per shard. Add shards to scale.
- Partition Key: Determines which shard receives the record. Use high-cardinality keys for even distribution
- Sequence Number: Unique per record per shard; ordered within a shard
- Enhanced Fan-Out: Dedicated 2 MB/s per consumer per shard using HTTP/2 push — for multiple high-throughput consumers
- Lambda + KDS: Lambda polls via event source mapping; batch size configurable; bisect on error for failed batches
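The partition-key rule follows from how Kinesis places records: the partition key is MD5-hashed to a 128-bit integer, and each shard owns a range of that hash space. The sketch assumes the ranges are split evenly (as after creating a stream or after uniform resharding; real shard ranges may differ). It shows why high-cardinality keys spread load while a single repeated key pins every record to one shard.

```python
import hashlib
from collections import Counter

MAX_HASH_KEY = 2 ** 128 - 1  # Kinesis hash keys are 128-bit integers


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = (MAX_HASH_KEY + 1) // shard_count
    return min(hash_key // range_size, shard_count - 1)


high = Counter(shard_for(f"user-{i}", 4) for i in range(10_000))
low = Counter(shard_for(country, 4) for country in ["US"] * 10_000)
print(sorted(high.values()))  # roughly 2,500 per shard
print(low)                    # every record lands on a single shard
```

The mapping is deterministic, so all records with the same key stay on one shard. That is what gives per-key ordering.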
| Feature | Kinesis Data Streams | SQS |
|---|---|---|
| Multiple consumers | Yes — same data read by multiple consumers | No — message consumed by one consumer |
| Ordering | Ordered per shard | FIFO only (SQS FIFO) |
| Replay | Yes — replay up to retention period | No — messages deleted after processing |
| Use case | Real-time analytics, event replay, audit logs | Task queue, job distribution, decoupling |
"Multiple consumers process same data stream" → Kinesis. "Real-time clickstream analytics" → Kinesis. "Deliver streaming data to S3 without code" → Kinesis Firehose. "Decouple service X from service Y" → SQS.
Common traps:
- "Kinesis Firehose delivers data in real-time with sub-second latency" — FALSE; Firehose buffers data (minimum 60-second buffer interval or 1 MB) before delivering. It is near-real-time, not sub-second.
- "Adding more shards to KDS increases read throughput for all consumers equally" — FALSE without Enhanced Fan-Out. With standard consumers, total read throughput is shared across consumers on the same shard (2 MB/s shared).
Develop Code for AWS Lambda
Lambda fundamentals, configuration, event sources, error handling, VPC access, and performance tuning.
- Describe access to private resources in VPCs from Lambda code
- Configure Lambda functions: environment variables, memory, concurrency, timeout, runtime, handler, layers, extensions, triggers, destinations
- Handle the event lifecycle and errors using code (Lambda Destinations, dead-letter queues)
- Write and run test code; integrate Lambda with AWS services
- Tune Lambda functions for optimal performance; process and transform data in near real time
Lambda is the core of serverless development on AWS. Understanding its execution lifecycle, concurrency model, and cold start behavior is essential.
- Init phase: Download code, start runtime, run initialization code outside handler. Happens on cold start. Keep this fast.
- Invoke phase: Run the handler function. Called for every invocation.
- Shutdown phase: Runtime receives shutdown signal; run cleanup extensions.
| Aspect | Cold Start | Warm Start |
|---|---|---|
| When | First invocation or after scale-out/idle | Reuse of existing execution environment |
| Latency | Higher (100ms–several seconds, depending on runtime) | Near-zero overhead |
| Init code runs? | Yes | No |
| Global variables cached? | No (fresh environment) | Yes (reused from previous invocation) |
```python
# Python handler: event is the trigger payload, context has runtime metadata
def lambda_handler(event, context):
    print(context.function_name)
    print(context.get_remaining_time_in_millis())
    return {
        "statusCode": 200,
        "body": "Hello from Lambda"
    }
```
| Parameter | Range / Default | Impact |
|---|---|---|
| Memory | 128 MB – 10,240 MB | CPU and network bandwidth scale proportionally with memory |
| Timeout | 3s default, max 15 min | Function killed if exceeded; set higher than p99 duration |
| Ephemeral Storage (/tmp) | 512 MB – 10,240 MB | Temporary file storage within execution environment |
| Concurrency | Default account limit: 1,000 per Region | Max simultaneous executions across all functions in the Region |
- Initialize outside handler: DB connections, SDK clients, config loading — cached across warm invocations
- Right-size memory: More memory = more CPU = faster execution. AWS Lambda Power Tuning tool finds optimal setting
- Minimize deployment package size: Faster cold starts; use Layers for shared dependencies
- Avoid recursive invocations: Lambda calling Lambda in a loop can rapidly exhaust concurrency and cause runaway costs
Cold start optimization rule: Move everything you can outside the handler. Connections, clients, and config load once on cold start, then are reused across warm invocations.
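A sketch of the rule in action follows. `create_client` is a hypothetical stand-in for an expensive setup step such as building an SDK client or a DB pool. The module-level call runs once per execution environment during the Init phase, and warm invocations reuse the result.

```python
import time

INIT_CALLS = 0


def create_client():
    """Hypothetical stand-in for expensive setup (SDK client, DB pool, config)."""
    global INIT_CALLS
    INIT_CALLS += 1
    time.sleep(0.01)  # simulate connection setup cost
    return {"created_at": time.time()}


# Module scope: runs once per execution environment (cold start / Init phase).
CLIENT = create_client()


def lambda_handler(event, context):
    # Handler scope: runs on every invocation and reuses CLIENT while warm.
    return {"statusCode": 200, "clientCreatedAt": CLIENT["created_at"]}


# Two warm invocations in the same environment share one client.
first = lambda_handler({}, None)
second = lambda_handler({}, None)
print(INIT_CALLS)  # 1
```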
Eliminate cold starts entirely → Provisioned Concurrency (pre-warms N execution environments). Java cold starts are long → use Lambda SnapStart (snapshots initialized environment). "Function needs more CPU" → increase memory allocation.
Common traps:
- "Lambda timeout is 15 minutes by default" — FALSE; default is 3 seconds. 15 minutes is the maximum.
- "Increasing Lambda memory only helps memory-bound functions" — FALSE; memory also increases allocated vCPU, so CPU-bound functions also benefit.
- "Global variables persist indefinitely between invocations" — FALSE; they persist only within the same execution environment instance. New instances start fresh.
| Type | Behavior | Retry on failure | Triggered by |
|---|---|---|---|
| Synchronous | Caller waits for response | No (caller handles errors) | API GW, ALB, SDK invoke, Cognito triggers |
| Asynchronous | Lambda queues event, returns 202 immediately | Up to 2 retries (auto) | S3 events, SNS, EventBridge, SES |
| Event Source Mapping (poll-based) | Lambda polls the source on your behalf | Retries until success or expiry | SQS, Kinesis, DynamoDB Streams, MSK |
| Type | Purpose | Behavior |
|---|---|---|
| Unreserved Concurrency | Default — shared pool across all functions | Throttles when account limit reached |
| Reserved Concurrency | Set aside N from the account pool for one function; also caps that function at N | Function cannot exceed N; other functions cannot use those N; no extra charge |
| Provisioned Concurrency | Pre-initialize N environments to eliminate cold starts | Costs money even when idle; enables smooth scaling |
You have a Lambda that processes SQS messages and writes to RDS. RDS max connections = 100. Setting Lambda's reserved concurrency to 50 ensures at most 50 concurrent DB connections, protecting RDS from connection storms during traffic spikes. The RDS Proxy is also a solution but reserved concurrency is simpler to configure.
- Lambda polls the source and invokes your function with a batch of records
- Batch size: SQS Standard (1–10,000; SQS FIFO max 10), Kinesis (1–10,000), DDB Streams (1–10,000)
- Batch window: Lambda waits up to N seconds to fill a batch before invoking
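- On failure: For SQS, the whole batch returns to the queue and becomes visible again after the visibility timeout. For Kinesis and DynamoDB Streams, use bisect on error to split failing batches and isolate poison pill records
- Partial Batch Response: Return `batchItemFailures` (with `ReportBatchItemFailures` enabled on the event source mapping) to tell Lambda which items failed; only those are retried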
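Here is a minimal SQS handler that reports partial batch failures. It assumes `ReportBatchItemFailures` is enabled on the event source mapping; without it, a normal return marks the whole batch as successful. The `process` function and its validation rule are hypothetical.

```python
import json


def process(body):
    """Hypothetical business logic; raises on a malformed order."""
    order = json.loads(body)
    if "orderId" not in order:
        raise ValueError("missing orderId")


def lambda_handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except ValueError:  # also covers json.JSONDecodeError
            # Report only this message; the rest of the batch is treated as done.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


event = {"Records": [
    {"messageId": "m1", "body": '{"orderId": "o-1"}'},
    {"messageId": "m2", "body": '{"total": 5}'},
    {"messageId": "m3", "body": "not json"},
]}
result = lambda_handler(event, None)
print(result)  # {'batchItemFailures': [{'itemIdentifier': 'm2'}, {'itemIdentifier': 'm3'}]}
```

Catching errors per record (rather than letting one exception fail the invocation) is what keeps `m1` from being redelivered.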
"Lambda hammering RDS with connections" → Set reserved concurrency OR use RDS Proxy. "Cold start latency for critical path" → Provisioned Concurrency. "Lambda should not exceed X invocations" → Reserved Concurrency.
Common traps:
- "Reserved Concurrency eliminates cold starts" — FALSE; Provisioned Concurrency eliminates cold starts. Reserved only caps concurrency.
- "S3 event notifications invoke Lambda synchronously" — FALSE; S3 invokes Lambda asynchronously. The call returns immediately and Lambda retries on failure.
- "SQS FIFO queues support exactly-once Lambda processing automatically" — PARTIALLY TRUE; FIFO provides exactly-once delivery, but your Lambda handler must still be idempotent for robustness.
- On failure, Lambda retries async invocations up to 2 times (3 total attempts)
- Events can wait in the async event queue for up to 6 hours
- DLQ: Configure an SQS queue or SNS topic as the DLQ. Failed events (after all retries) are sent here for investigation
- DLQ only captures the final failure; Lambda Destinations can route both successful and failed invocation records (a failure record is sent once all retries are exhausted)
| Destination Type | On Success | On Failure |
|---|---|---|
| SQS Queue | ✅ | ✅ |
| SNS Topic | ✅ | ✅ |
| EventBridge Bus | ✅ | ✅ |
| Another Lambda Function | ✅ | ✅ |
DLQ: only captures the event payload when all retries are exhausted. Lambda Destinations: sends a richer record (original event + function response/error + execution context) for both success and failure conditions. Destinations are more flexible and are the modern recommendation.
- Kinesis/DDB Streams: If a batch fails, Lambda retries the entire batch until it succeeds or the records expire from the stream, which can block the shard. Configure `BisectBatchOnFunctionError` to split the batch and isolate poison pills, and set maximum retry attempts / maximum record age with an on-failure destination to stop retrying.
- SQS: Failed messages return to the queue and become visible after the visibility timeout. After `maxReceiveCount`, they are moved to the DLQ.
- Partial Batch Response (SQS, Kinesis, DynamoDB Streams): Return `{"batchItemFailures": [{"itemIdentifier": "..."}]}` to retry only specific messages, not the whole batch
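Bisect on error can be pictured as recursive halving. The sketch below is a simplified model that ignores retry limits and record age. A failing batch is split in two and each half is retried, until the bad record is alone. The `handler` and record names are hypothetical. Note that records in a half that succeeds may already have been partly processed in the earlier failed attempt, so real handlers must be idempotent.

```python
def bisect_process(batch, handler, failed):
    """Simplified model of bisect-on-error: split failing batches in half until
    a failing record is isolated on its own."""
    try:
        handler(batch)
    except Exception:
        if len(batch) == 1:
            failed.append(batch[0])  # isolated poison pill
            return
        mid = len(batch) // 2
        bisect_process(batch[:mid], handler, failed)
        bisect_process(batch[mid:], handler, failed)


def handler(records):
    # Hypothetical consumer: the whole batch fails if any record is corrupt.
    if any(r == "corrupt" for r in records):
        raise ValueError("bad record in batch")
    processed.extend(records)


processed, failed = [], []
bisect_process(["r1", "r2", "corrupt", "r4", "r5"], handler, failed)
print(failed)     # ['corrupt']
print(processed)  # ['r1', 'r2', 'r4', 'r5']
```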
Modern pattern: Use Lambda Destinations over DLQ for async functions — you get richer metadata and can route success and failure independently. Use DLQ for simplicity or SQS event source mapping failures.
Common traps:
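- "DLQ works for all Lambda invocation types" — FALSE; a function-level DLQ only works for async invocations. For SQS event source mappings, configure a DLQ (redrive policy) on the source queue; for Kinesis and DynamoDB Streams, configure an on-failure destination on the event source mapping.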
- "Lambda automatically retries synchronous invocations" — FALSE; with synchronous invocations, the caller receives the error and must implement its own retry logic.
- By default, Lambda runs in an AWS-managed VPC with internet access but no access to your private VPC resources (RDS, ElastiCache, internal ALBs)
- Configure Lambda with VPC subnets + security groups → Lambda creates an ENI (Elastic Network Interface) in your subnet
- Internet access in VPC: Lambda in a private subnet needs a NAT Gateway to reach the internet. Lambda in a VPC does NOT get internet access via a public subnet without NAT
- VPC Endpoints: Use to privately access AWS services (S3, DynamoDB) from Lambda-in-VPC without NAT Gateway
- Layers are ZIP archives containing libraries, a custom runtime, or data — shared across multiple functions
- Up to 5 layers per function; combined unzipped size ≤ 250 MB
- Common use: Share boto3, numpy, pandas, or common utility code across Lambda functions
- Layers are versioned — reference a specific layer version ARN in your function config
- Key-value pairs available to function code via `os.environ` (Python) or `process.env` (Node.js)
- Encrypted at rest using KMS (Lambda service key by default, or your CMK)
- Do NOT store plaintext secrets in environment variables — use Secrets Manager or SSM Parameter Store and fetch at init time
- Max 4 KB total for all environment variables combined
- Run alongside the Lambda function in the same execution environment
- Use cases: monitoring agents, security agents, secret rotation, log forwarding (e.g., Datadog, New Relic agents)
- Internal extensions: run in the same process. External extensions: run as separate processes
"Lambda needs to access RDS in private subnet" → Configure Lambda with same VPC + security group allowing DB port. "Lambda needs to call external API but also access RDS in VPC" → Lambda in private subnet + NAT Gateway for external internet. "Share common library across 10 Lambda functions" → Lambda Layer.
Common traps:
- "Putting Lambda in a public subnet gives it internet access" — FALSE; Lambda in a VPC (even public subnet) does NOT get internet access automatically. You need a NAT Gateway in a public subnet, with Lambda in a private subnet routing to it.
- "Lambda can have unlimited layers" — FALSE; maximum 5 layers per function.
- "Lambda environment variables are always secure" — PARTIALLY TRUE; they are encrypted at rest with KMS, but plaintext secrets are visible to anyone with IAM GetFunctionConfiguration permission. Use Secrets Manager for true secrets.
Use Data Stores in Application Development
DynamoDB, S3, RDS/Aurora, ElastiCache, and specialized data stores — selection, access patterns, and caching.
- Describe high-cardinality partition keys, database consistency models, and differences between query and scan operations
- Define Amazon DynamoDB keys and indexing (LSI, GSI)
- Serialize and deserialize data; manage data lifecycles
- Use data caching services; use specialized data stores based on access patterns (e.g., Amazon OpenSearch Service)
DynamoDB is a fully managed NoSQL database. The DVA-C02 exam heavily tests data modeling, index selection, and read/write patterns.
| Key Type | Components | Uniqueness |
|---|---|---|
| Simple Primary Key | Partition Key (PK) only | PK must be unique across all items |
| Composite Primary Key | Partition Key (PK) + Sort Key (SK) | PK+SK combo must be unique |
| Aspect | LSI (Local Secondary Index) | GSI (Global Secondary Index) |
|---|---|---|
| Partition Key | Same as base table | Can be any attribute |
| Sort Key | Different from base table | Can be any attribute |
| Creation time | At table creation only | Anytime |
| Consistency | Strongly consistent reads supported | Eventually consistent only |
| Storage | Within same partition as base table | Separate partition space |
| Max per table | 5 | 20 |
A social media app stores users and posts. PK = USER#userId, SK = PROFILE for user items. PK = USER#userId, SK = POST#timestamp for post items. One table, multiple entity types — query all posts for a user with a single Query operation.
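The single-table access pattern can be emulated in memory to show what the Query does. With boto3 the real call would look like `table.query(KeyConditionExpression=Key("PK").eq("USER#u1") & Key("SK").begins_with("POST#"), ScanIndexForward=False)`. The items and the `query` helper below are hypothetical. ISO-8601 timestamps are used in the sort key because they sort lexicographically in time order.

```python
# Hypothetical items using the key design from the example above.
ITEMS = [
    {"PK": "USER#u1", "SK": "PROFILE", "name": "Ana"},
    {"PK": "USER#u1", "SK": "POST#2024-01-05T10:00:00Z", "text": "first"},
    {"PK": "USER#u1", "SK": "POST#2024-02-11T08:30:00Z", "text": "second"},
    {"PK": "USER#u2", "SK": "POST#2024-01-20T12:00:00Z", "text": "other user"},
]


def query(pk, sk_prefix="", newest_first=False):
    """Emulate Query: PK = :pk AND begins_with(SK, :prefix), sorted by sort key."""
    hits = [i for i in ITEMS if i["PK"] == pk and i["SK"].startswith(sk_prefix)]
    return sorted(hits, key=lambda i: i["SK"], reverse=newest_first)


posts = query("USER#u1", "POST#", newest_first=True)
print([p["text"] for p in posts])  # ['second', 'first']
```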
- High-cardinality keys = even data distribution across partitions = avoid hot partitions
- Good: User ID, Order ID, Session ID (high cardinality, random distribution)
- Bad: Country, Status, Boolean flag (low cardinality = hot partition)
- Write sharding: If you must use a low-cardinality key, append a random suffix (1-N) to distribute writes, then use a GSI or Query across shards to read
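The write-sharding trade-off is easiest to see in code. Writes pick one of N suffixed keys at random, and reads must query all N keys and merge the results. `SHARD_COUNT = 10` and the `PENDING` status are hypothetical choices.

```python
import random

SHARD_COUNT = 10  # hypothetical N


def sharded_key(status):
    """Append a random suffix 1..N so writes for one hot value spread over N partition keys."""
    return f"{status}#{random.randint(1, SHARD_COUNT)}"


def keys_to_query(status):
    """Reads must fan out over every suffix and merge the results."""
    return [f"{status}#{n}" for n in range(1, SHARD_COUNT + 1)]


random.seed(7)
written = {sharded_key("PENDING") for _ in range(1000)}
print(sorted(written))
```

A larger N spreads writes further but multiplies the number of Query calls per read. A calculated suffix (for example, a hash of the item ID) lets you target a single shard when you know the ID.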
| Aspect | Provisioned Mode | On-Demand Mode |
|---|---|---|
| Billing | Per RCU/WCU provisioned | Per request (read/write) |
| Predictability | Predictable traffic patterns | Unknown or variable traffic |
| Auto Scaling | Supported (but lags behind spikes) | Instant — no pre-planning |
LSI vs. GSI: Local = same partition key, Limited to creation time. Global = any key, Great flexibility. If you might need different access patterns later, plan for GSIs at design time.
Common traps:
- "GSIs support strongly consistent reads" — FALSE; GSIs are always eventually consistent. LSIs support both.
- "You can add an LSI to an existing table" — FALSE; LSIs must be created with the table. GSIs can be added anytime.
- "DynamoDB automatically distributes data evenly across partitions" — TRUE only if your partition key has high cardinality. A poorly chosen PK leads to hot partitions and throttling.
| Aspect | Query | Scan |
|---|---|---|
| Requires | Partition key value (mandatory) | Nothing (reads entire table) |
| Optional filter | Sort key condition + filter expression | Filter expression |
| Efficiency | Efficient — reads only matching items | Expensive — reads all items, filters after read |
| Cost | Low (reads proportional to results) | High (reads entire table) |
| Use | Production access pattern | Avoid in production; admin/migration use |
| Aspect | Eventually Consistent | Strongly Consistent |
|---|---|---|
| Latency | Lower | Slightly higher |
| Cost | 0.5 RCU per 4 KB | 1 RCU per 4 KB |
| Reads latest data? | Not guaranteed (may see stale data) | Always latest committed data |
| Default | Yes | Opt-in via ConsistentRead=true |
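The cost row translates into a simple calculation. Item size rounds up to the next 4 KB per read. A strongly consistent read costs 1 RCU per 4 KB, and an eventually consistent read costs half that. The write rule (1 WCU per 1 KB, rounded up) is included for comparison. Transactional reads and writes cost double and are not modeled here.

```python
import math


def read_capacity_units(item_size_kb, strongly_consistent):
    """RCUs to read one item: size rounds up to the next 4 KB."""
    units = math.ceil(item_size_kb / 4)
    return units if strongly_consistent else units / 2


def write_capacity_units(item_size_kb):
    """WCUs to write one item: 1 WCU per 1 KB, rounded up."""
    return math.ceil(item_size_kb)


print(read_capacity_units(6, strongly_consistent=True))   # 2
print(read_capacity_units(6, strongly_consistent=False))  # 1.0
print(write_capacity_units(3.5))                          # 4
```

For example, 100 strongly consistent reads per second of 6 KB items need 100 × 2 = 200 RCUs.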
- Ordered log of item-level changes (INSERT, MODIFY, REMOVE) — retained 24 hours
- View types: KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, NEW_AND_OLD_IMAGES
- Trigger Lambda functions for change-data-capture patterns (e.g., update search index, send notification)
- In-memory cache specifically for DynamoDB — microsecond read latency (vs milliseconds)
- API-compatible with DynamoDB: swap the DynamoDB client for the DAX client pointed at the cluster endpoint; application logic stays the same
- Cache hit: response from DAX. Cache miss: DAX reads from DynamoDB and caches result
- Ideal for: read-heavy workloads, hot items, gaming leaderboards, social feeds
- Does NOT help for write-heavy or strongly consistent read workloads
- Automatically deletes expired items based on a timestamp attribute — no charge for TTL deletes
- Use for session data, temporary tokens, cart items with expiry
- TTL deletions are eventually consistent — items may be visible for up to 48 hours after expiry
"DynamoDB reads are slow" → Add DAX. "Need to trigger Lambda on table changes" → DynamoDB Streams + Lambda event source mapping. "Auto-expire session data" → TTL attribute. "Read only latest price for a product" → Query with ConsistentRead=true (pay double RCU).
Common traps:
- "Scan with FilterExpression only reads filtered items" — FALSE; FilterExpression applies AFTER reading the data. You still pay for all read items before filtering — use Query instead.
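- "DAX caches every read, including strongly consistent and transactional reads" — FALSE; DAX caches eventually consistent GetItem/BatchGetItem results (item cache) and Query/Scan results (query cache). Strongly consistent reads and TransactGetItems are passed through to DynamoDB without being cached.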
| Feature | Description | Use Case |
|---|---|---|
| Presigned URL (GET) | Temporary URL to download object; no AWS credentials needed by caller | Share private files with users |
| Presigned URL (PUT) | Temporary URL to upload directly to S3 without going through your server | Client-side file uploads (browser, mobile) |
| Multipart Upload | Upload large files in parts (required >5 GB, recommended >100 MB) | Large file uploads with resume capability |
| S3 Select | SQL query to retrieve subset of data from CSV/JSON/Parquet object | Reduce data transfer for log analysis |
| Event Notifications | Trigger SNS, SQS, or Lambda on object events | Thumbnail generation, virus scanning |
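Multipart part sizing has to respect S3's limits: at most 10,000 parts, and every part except the last must be at least 5 MiB. The sketch picks a part size under those limits. The 100 MiB preferred part size is an assumption. In practice boto3's managed transfer (`TransferConfig` with `multipart_threshold` and `multipart_chunksize`) handles this for you.

```python
import math

MIB = 1024 * 1024
MIN_PART = 5 * MIB   # every part except the last must be at least 5 MiB
MAX_PARTS = 10_000   # S3 allows at most 10,000 parts per upload


def plan_parts(object_size, preferred_part=100 * MIB):
    """Return (part_size, part_count) respecting S3's multipart limits."""
    part_size = max(MIN_PART, preferred_part, math.ceil(object_size / MAX_PARTS))
    return part_size, math.ceil(object_size / part_size)


print(plan_parts(50 * 1024 * MIB))  # (104857600, 512) for a 50 GiB object
```

For very large objects the `object_size / MAX_PARTS` term dominates and grows the part size, so the upload never exceeds 10,000 parts.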
- S3 provides strong read-after-write consistency for new objects and overwrites (as of December 2020)
- After a successful PUT, any GET of that key returns the latest version
- Applies to both objects and object listings
- Trigger on: s3:ObjectCreated:*, s3:ObjectRemoved:*, s3:Replication:*, etc.
- Destinations: SQS, SNS, Lambda, EventBridge
- For complex routing (multiple targets, filtering), use EventBridge as the destination — then route to multiple targets via EventBridge rules
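```javascript
// Presigned URL generation (SDK v3, JavaScript)
const { S3Client, GetObjectCommand } = require("@aws-sdk/client-s3");
const { getSignedUrl } = require("@aws-sdk/s3-request-presigner");

const s3Client = new S3Client({ region: "us-east-1" });

async function presignDownload(Bucket, Key) {
  // URL valid for 1 hour
  return getSignedUrl(s3Client, new GetObjectCommand({ Bucket, Key }), {
    expiresIn: 3600,
  });
}
```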
"User uploads large file directly to S3" → Presigned PUT URL (server generates URL, client uploads directly — bypasses your server). "Trigger image resizing Lambda on upload" → S3 Event Notification → Lambda. "Multiple services need to react to the same S3 event" → S3 Event Notification → EventBridge → multiple targets.
Common traps:
- "S3 is eventually consistent for new PUTs" — FALSE (post-2020); S3 now provides strong consistency for all operations including new object creation.
- "Multipart upload can be used for objects less than 5 MB" — technically possible but not meaningful. Parts must be at least 5 MB (except last part).
- "S3 event notifications can trigger multiple Lambda functions directly" — FALSE; each event configuration has one destination. Use SNS fanout or EventBridge for multiple targets.
| Feature | Redis | Memcached |
|---|---|---|
| Data structures | Rich (lists, sets, sorted sets, hashes, bitmaps) | Simple key-value strings only |
| Persistence | Yes (RDB snapshots, AOF) | No (in-memory only) |
| Replication/HA | Yes (primary-replica, Multi-AZ) | No |
| Pub/Sub | Yes | No |
| Cluster mode | Yes (sharding) | Yes (multi-node, no replication) |
| Best for | Sessions, leaderboards, queues, complex caching | Simple caching, multi-threaded scale-out |
| Pattern | Read Flow | Write Flow | Drawback |
|---|---|---|---|
| Lazy Loading (Cache-Aside) | Check cache → miss → read DB → store in cache | Write to DB only (cache updated on next miss) | Cache miss = 3 trips; potential stale data |
| Write-Through | Always read from cache | Write to DB AND cache simultaneously | Cache bloat (items written but never read) |
| Write-Behind (Write-Back) | Read from cache | Write to cache, async write to DB | Data loss risk if cache fails before DB write |
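The read and write flows in the table can be sketched in plain Python. This is a minimal model under stated assumptions: a dict stands in for the database, and a small `TTLCache` class (hypothetical, not a Redis or Memcached client) stands in for ElastiCache. Adding a TTL is a common way to limit how long lazy-loaded data can stay stale.

```python
import time


class TTLCache:
    """In-memory stand-in for ElastiCache (hypothetical; not a Redis client)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._items[key]  # expired entries behave like misses
            return None
        return value

    def set(self, key, value):
        self._items[key] = (value, time.monotonic() + self.ttl)


def read_lazy(key, cache, db):
    """Lazy loading (cache-aside): check cache, on miss read DB and populate."""
    value = cache.get(key)          # trip 1: cache lookup
    if value is None:
        value = db.get(key)         # trip 2: database read
        if value is not None:
            cache.set(key, value)   # trip 3: populate cache
    return value


def write_through(key, value, cache, db):
    """Write-through: update the DB and the cache in the same write path."""
    db[key] = value
    cache.set(key, value)
```

The sequence in the usage below shows the lazy-loading drawback from the table. A DB-only write leaves the old value in the cache until its TTL expires, while a write-through keeps both in step.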
- RDS Proxy sits between your application and RDS, pooling and multiplexing database connections
- Critical for Lambda → RDS: Lambda can open thousands of concurrent connections during spikes; RDS Proxy caps the actual number of DB connections
- Reduces failover time for Aurora and RDS Multi-AZ by keeping client connections open while the proxy reconnects to the new primary
- Supports MySQL, PostgreSQL, MariaDB, SQL Server, and Aurora MySQL/PostgreSQL (not Oracle)
"Leaderboard with real-time rank" → Redis sorted set. "Lambda connecting to RDS causes too many connections" → RDS Proxy. "Cache frequently read DB results" → ElastiCache lazy loading (cache-aside). "Need HA cache with failover" → Redis with Multi-AZ enabled.
Common traps:
- "Memcached supports HA with primary-replica failover" — FALSE; Memcached is a distributed cache with no replication. If a node fails, that data is lost. Use Redis for HA.
- "DAX and ElastiCache serve the same purpose" — FALSE; DAX is DynamoDB-specific and API-compatible. ElastiCache is a general-purpose cache for any backend. Use DAX for DynamoDB, ElastiCache for RDS/MySQL/etc.
Domain 2 Overview
Secure applications and data on AWS. Covers authentication and authorization (Cognito, IAM, STS), encryption at rest and in transit (KMS, ACM), and management of sensitive data (Secrets Manager, Parameter Store).
⚡ 26% of scored content
Implement Authentication and/or Authorization
IAM roles, Cognito User & Identity Pools, bearer tokens, STS, programmatic access, and cross-service auth.
- Use an identity provider to implement federated access (Amazon Cognito, IAM)
- Secure applications by using bearer tokens; configure programmatic access to AWS
- Assume an IAM role; define permissions for IAM principals
- Implement application-level authorization for fine-grained access control
- Handle cross-service authentication in microservice architectures
Developers interact with IAM primarily through roles and the STS credential chain. Understanding how services assume roles and how temporary credentials flow is fundamental to DVA-C02.
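Credential provider chain (simplified; the exact order varies slightly by SDK):
- 1. Credentials passed explicitly in code (never hard-code them in production)
- 2. Environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN (Lambda injects the execution role's temporary credentials here)
- 3. Shared credentials file: ~/.aws/credentials
- 4. Shared config file: ~/.aws/config
- 5. ECS container credentials (task role via the container credentials endpoint)
- 6. EC2 instance profile (role credentials from the IMDS endpoint)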
| API | Use Case |
|---|---|
| AssumeRole | Cross-account or service-to-service access through a role |
| AssumeRoleWithWebIdentity | Federation via OIDC web identity tokens (Google, Login with Amazon, GitHub Actions) |
| AssumeRoleWithSAML | Federation via an enterprise SAML 2.0 IdP (ADFS, Okta) |
| GetSessionToken | Temporary credentials for an IAM user, typically to satisfy an MFA requirement |
- Default = implicit Deny
- Explicit Deny always wins — overrides any Allow
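- An action is allowed only when three conditions hold: no explicit Deny applies in any policy, every guardrail in effect (SCPs, permissions boundary, session policy) also allows it, and an identity-based or resource-based policy grants it. Cross-account access needs an Allow on both sides.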
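The evaluation rules above can be modeled as a short function. This is a deliberately simplified sketch with several assumptions: a single account, action matching only (no resources, principals, or conditions), and policies given as lists of `{"Effect", "Action"}` statements. The real IAM engine handles much more, including session policies and cross-account checks.

```python
from fnmatch import fnmatchcase


def _matches(statement: dict, action: str) -> bool:
    # IAM action names are case-insensitive and support * and ? wildcards
    return any(fnmatchcase(action.lower(), pattern.lower())
               for pattern in statement["Action"])


def _has(policy, effect: str, action: str) -> bool:
    return any(s["Effect"] == effect and _matches(s, action) for s in policy)


def evaluate(action, identity=(), resource=(), scp=None, boundary=None) -> str:
    """Simplified same-account evaluation. Each policy is a list of statements
    like {"Effect": "Allow", "Action": ["s3:GetObject"]}; None = not in effect."""
    guardrails = [g for g in (scp, boundary) if g is not None]
    # 1. An explicit Deny in any policy wins
    if any(_has(p, "Deny", action) for p in [identity, resource, *guardrails]):
        return "ExplicitDeny"
    # 2. Every guardrail in effect must also Allow (guardrails never grant)
    if not all(_has(g, "Allow", action) for g in guardrails):
        return "ImplicitDeny"
    # 3. An identity-based or resource-based Allow grants access
    if _has(identity, "Allow", action) or _has(resource, "Allow", action):
        return "Allow"
    return "ImplicitDeny"  # the default when nothing allows it
```

Calling `evaluate("s3:GetObject", boundary=[...])` with no identity policy returns `"ImplicitDeny"`. That mirrors the exam point that a permissions boundary alone never grants anything.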
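Service A (ECS task) calls Service B (Lambda behind API Gateway). Pattern: the ECS task has an IAM task role and signs each request with Signature Version 4 using that role's credentials. Alternatively, it first calls STS AssumeRole to get temporary credentials for a role that Service B trusts. With IAM authorization on the API method, the calling role needs execute-api:Invoke on the API, and the Lambda function's resource-based policy must allow API Gateway to invoke the function.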
Lambda execution role: the role the Lambda function assumes when it runs, set in the function's Role configuration. Lambda resource-based policy: who can invoke the function (e.g., API Gateway, S3). Both must be correct for cross-service invocations.
Common traps:
- "A Lambda function with an execution role can call any AWS API" — FALSE; the execution role must explicitly allow the specific API actions needed.
- "IAM permissions boundaries grant additional permissions" — FALSE; boundaries only restrict the maximum permissions — they never grant by themselves.
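- "ECS task role and EC2 instance role are the same concept": conceptually similar, but configured differently. ECS uses taskRoleArn in the task definition for the application's own AWS calls. executionRoleArn is separate: it lets ECS pull images from ECR and send logs to CloudWatch.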
Cognito is the primary AWS service for application-level auth. The exam frequently tests the distinction between User Pools (authentication) and Identity Pools (authorization for AWS resources).
| | User Pools | Identity Pools (Federated Identities) |
|---|---|---|
| Purpose | User directory + authentication | Exchange any token for AWS credentials |
| Returns | JWT tokens (id, access, refresh) | Temporary STS credentials (IAM role) |
| Supported IdPs | Username/password, Google, Facebook, SAML, OIDC | Cognito User Pool, Google, Facebook, SAML, unauthenticated |
| AWS API access | No (tokens for app-level auth only) | Yes (STS creds for S3, DynamoDB, etc.) |
| Use case | Login for web/mobile app | Let users directly call AWS APIs (e.g., upload to S3) |
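- ID Token: Contains user identity claims (sub, email, custom attributes). Your API should verify its signature and expiry before trusting the claims.
- Access Token: Authorizes calls to Cognito user APIs (e.g., update user attributes) and carries OAuth scopes. An API Gateway Cognito authorizer expects it as the bearer token when the method has OAuth scopes configured; without scopes, the authorizer expects the ID token.
- Refresh Token: Long-lived (default 30 days, configurable). Exchange it for new ID/access tokens without re-login.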
- Pre-signup: Validate or auto-confirm users during registration
- Pre-token generation: Add custom claims to tokens
- Post-authentication: Log or audit successful logins
- Custom message: Customize verification/MFA messages
- User migration: Migrate users from legacy system on first login
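Below is a minimal pre-token generation trigger. It assumes the V1 trigger event shape (`request.userAttributes` in, `response.claimsOverrideDetails` out), which adds claims to the ID token. The `custom:department` attribute is hypothetical.

```python
def lambda_handler(event, context):
    """Pre-token generation trigger (V1 event shape): copy a user attribute into
    the ID token as a plain claim. 'custom:department' is a hypothetical attribute."""
    attributes = event["request"]["userAttributes"]
    department = attributes.get("custom:department", "unassigned")
    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {"department": department}
    }
    # Cognito triggers must return the (modified) event object
    return event
```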
User Pool = Who are you? (AuthN) → returns JWT. Identity Pool = What can you do on AWS? (AuthZ) → returns STS credentials. Need both for "login with Google then upload to S3."
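To see what a User Pool JWT actually carries, you can decode its payload segment. This sketch only base64url-decodes the claims and deliberately skips signature verification, so use it for debugging, not for authorization. Production code should verify the token against the user pool's JWKS (for example with a maintained JWT library) and check `exp`, `iss`, and `token_use`.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Return the claims of a JWT WITHOUT verifying its signature (debugging only)."""
    _header_b64, payload_b64, _signature = token.split(".")
    # JWTs use base64url without padding; restore the padding before decoding
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```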
"Mobile app users log in and upload to S3 directly" → User Pool (authenticate) + Identity Pool (get STS creds for S3). "API Gateway validates JWT from Cognito" → Cognito User Pool authorizer on API GW. "Add user's department to JWT claims" → Pre-token generation Lambda trigger.
Common traps:
- "Cognito User Pool tokens grant direct access to AWS services like S3" — FALSE; User Pool tokens are JWTs for app-level auth. You need Identity Pools to exchange them for STS credentials.
- "Identity Pools require Cognito User Pools as the IdP": FALSE. Identity Pools also accept social IdPs (Google, Facebook, Apple, Login with Amazon), SAML and OIDC providers, developer-authenticated identities, and unauthenticated (guest) access.
| Scenario | Recommended Authorizer |
|---|---|
| Internal AWS service calling API | IAM Authorization (Sig V4) |
| Users authenticated via Cognito User Pool | Cognito User Pool Authorizer |
| Custom JWT from 3rd-party IdP (Auth0, Okta) | Lambda Authorizer (Token type) |
| Complex rules: IP allowlist, multi-header logic | Lambda Authorizer (Request type) |
| Throttle / meter partner integrations | API Keys + Usage Plans |
- API GW invokes your Lambda with the Bearer token (or full request)
- Lambda validates the token (e.g., verifies JWT signature, checks expiry)
- Lambda returns an IAM policy document, e.g. {"principalId": "user123", "policyDocument": {...}}
- Policy caching: API GW caches the returned policy, keyed by the token (or by the identity sources for REQUEST authorizers). The TTL is configurable: default 300 seconds, maximum 3600, and 0 disables caching. Caching reduces Lambda invocations per request.

Lambda authorizer response structure:
```json
{
  "principalId": "user-123",
  "policyDocument": {
    "Version": "2012-10-17",
    "Statement": [{
      "Action": "execute-api:Invoke",
      "Effect": "Allow",
      "Resource": "arn:aws:execute-api:*:*:*"
    }]
  },
  "context": { "userId": "user-123" }
}
```
Pass context from authorizer to Lambda backend: Add context field in authorizer response. The backend Lambda receives it via event.requestContext.authorizer.userId. Avoids redundant token parsing in every backend function.
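Below is a minimal TOKEN-type Lambda authorizer in Python. It produces the response structure above, including a `context` map for the backend. `is_valid_token` is a hypothetical stand-in; a real authorizer would verify a JWT signature and expiry.

```python
def is_valid_token(token: str):
    """Hypothetical validator: returns a user ID, or None if the token is bad.
    A real one would verify a JWT's signature and expiry."""
    if token == "Bearer allow-user-123":
        return "user-123"
    return None


def build_policy(principal_id, effect, resource, context=None):
    response = {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {"Action": "execute-api:Invoke", "Effect": effect, "Resource": resource}
            ],
        },
    }
    if context:
        # Context values must be strings, numbers, or booleans (no nested objects)
        response["context"] = context
    return response


def lambda_handler(event, context):
    # TOKEN authorizers receive the header value and the ARN of the called method
    user_id = is_valid_token(event.get("authorizationToken", ""))
    if user_id is None:
        return build_policy("anonymous", "Deny", event["methodArn"])
    return build_policy(user_id, "Allow", event["methodArn"], {"userId": user_id})
```

One design caveat applies when caching is enabled. The cached policy is reused for other methods called with the same token, so scoping `Resource` to the single `methodArn` can cause unexpected 403s on other routes. A broader ARN covering the stage (for example ending in `/stage/*`) is a common alternative. Raising an exception with the message `Unauthorized` returns 401 instead of 403.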
Common traps:
- "Lambda authorizer is called for every API request" — FALSE when caching is enabled. The policy is cached by token value for the configured TTL.
- "API Keys provide strong security authentication" — FALSE; API Keys are for usage tracking and throttling, not for authentication. Never use them as a security mechanism.
Implement Encryption by Using AWS Services
KMS, envelope encryption, S3 encryption options, in-transit TLS, ACM, and key rotation.
- Define encryption at rest and in transit; describe differences between client-side and server-side encryption
- Describe certificate management (AWS Private CA); use encryption keys to encrypt or decrypt data
- Generate certificates and SSH keys for development; use encryption across account boundaries
- Enable and disable key rotation
AWS KMS is the foundation of encryption on AWS. Every developer must understand the three key types, envelope encryption, and how to call KMS APIs from code.
| Key Type | Who Manages | Key Policy | Cost | Use Case |
|---|---|---|---|---|
| AWS Managed Key | AWS (auto-rotated every year) | Managed by AWS | No monthly fee (KMS request charges may apply) | Default for S3, RDS, EBS (e.g., aws/s3) |
| Customer Managed Key (CMK) | You (optional auto-rotation) | You control it | $1/month + API calls | Custom encryption, cross-account, key policy control |
| Imported Key Material | You (bring your own key) | You control it | $1/month | Compliance: must own key material |
- KMS can only encrypt data up to 4 KB directly. For larger data, use envelope encryption:
- Step 1: Call GenerateDataKey → KMS returns a plaintext data key (DEK) AND an encrypted DEK
- Step 2: Encrypt your data locally with the plaintext DEK (AES-256)
- Step 3: Store the encrypted DEK alongside the encrypted data. Discard the plaintext DEK.
- Decryption: Call Decrypt with the encrypted DEK → get the plaintext DEK → decrypt the data locally
- Encryption context: optional additional authenticated data (AAD), given as non-secret key-value pairs sent with Encrypt/Decrypt calls
- If you encrypt with context {"purpose":"payments"}, you must provide the same context to decrypt
- The context appears in plaintext in CloudTrail logs. This makes it useful for auditing which application or service used a key, so never put secrets in it.
- Every KMS key has a key policy (resource-based policy). Unlike most AWS resources, IAM policies only take effect for a KMS key if its key policy allows the AWS account. Without that, no principal has access, not even the root user.
- Cross-account: the key policy must allow the external account, and an IAM policy in that account must also allow the needed actions (e.g., kms:Decrypt)
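The envelope encryption steps above can be traced end to end in stdlib-only Python. This is a toy sketch under loud assumptions. `FakeKms` is a hypothetical stand-in for KMS. `_keystream_xor` (a SHA-256 counter keystream) replaces AES-256-GCM purely so the example runs without third-party packages; it offers no integrity protection and must not protect real data. With real KMS, boto3's `generate_data_key(KeyId=..., KeySpec="AES_256")` returns the `Plaintext` and `CiphertextBlob` fields that the two return values mimic.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """TOY cipher: XOR with a SHA-256 counter keystream. Stands in for AES-256-GCM
    only to stay stdlib-only; it provides no integrity. Do not use for real data."""
    out = bytearray()
    for start in range(0, len(data), 32):
        counter = (start // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ k for b, k in zip(data[start:start + 32], block))
    return bytes(out)


class FakeKms:
    """Hypothetical stand-in for KMS: the master key never leaves this object."""

    def __init__(self):
        self._master_key = secrets.token_bytes(32)

    def generate_data_key(self):
        plaintext_dek = secrets.token_bytes(32)
        nonce = secrets.token_bytes(16)
        wrapped = nonce + _keystream_xor(self._master_key, nonce, plaintext_dek)
        return plaintext_dek, wrapped  # like Plaintext + CiphertextBlob

    def decrypt(self, wrapped_dek: bytes) -> bytes:
        nonce, body = wrapped_dek[:16], wrapped_dek[16:]
        return _keystream_xor(self._master_key, nonce, body)


def envelope_encrypt(kms: FakeKms, data: bytes) -> dict:
    dek, wrapped_dek = kms.generate_data_key()       # Step 1: get DEK + wrapped DEK
    nonce = secrets.token_bytes(16)
    ciphertext = _keystream_xor(dek, nonce, data)    # Step 2: encrypt locally
    del dek                                          # Step 3: drop the plaintext DEK reference
    return {"wrapped_dek": wrapped_dek, "nonce": nonce, "ciphertext": ciphertext}


def envelope_decrypt(kms: FakeKms, envelope: dict) -> bytes:
    dek = kms.decrypt(envelope["wrapped_dek"])       # only the small DEK goes to "KMS"
    return _keystream_xor(dek, envelope["nonce"], envelope["ciphertext"])
```

Note that the 5 KB payload in the usage below never reaches `FakeKms`. Only the 32-byte key is wrapped and unwrapped there, which is why the 4 KB KMS limit does not constrain envelope encryption.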
| | Automatic Rotation | Manual Rotation |
|---|---|---|
| Frequency | Every 365 days by default; since 2024 configurable from 90 to 2,560 days (on-demand rotation also available) | Any schedule you choose |
| Old key material | Retained automatically to decrypt old ciphertext | Keep the old key enabled to decrypt old ciphertext |
| Application change? | No (same key ID) | Yes, unless the app references an alias (update the alias to point to the new key) |
| Available for | Symmetric CMKs with KMS-generated key material (not imported key material) | All customer managed keys |
Envelope Encryption = DEK inside KMS envelope: Your data is encrypted with a local DEK. The DEK is "wrapped" (encrypted) by KMS. Only KMS can unwrap it. The data never goes to KMS.
"Encrypt a 500 MB file with KMS" → Use GenerateDataKey + encrypt locally (envelope encryption). "Compliance requires 90-day rotation" → Set a 90-day automatic rotation period (older exam material expects manual rotation, from when automatic rotation was fixed at 365 days). "Audit which app used KMS" → Add Encryption Context + query CloudTrail.
Common traps:
- "AWS Managed Keys can be disabled or deleted by the customer" — FALSE; you have no control over AWS Managed Keys. Use CMKs for control.
- "KMS automatic rotation changes the Key ID" — FALSE; rotation only changes the key material. The Key ID and alias stay the same — no application changes needed.
- "Imported key material supports automatic rotation" — FALSE; automatic rotation is only for CMKs with KMS-generated key material.
| Method | Key Management | Key Access | Use Case |
|---|---|---|---|
| SSE-S3 | S3 manages keys entirely | Transparent to user | Default encryption; no key management needed |
| SSE-KMS | AWS KMS CMK | You control key policy; CloudTrail logs access | Audit who accessed, cross-account encryption |
| SSE-C | Customer provides key in request header | AWS never stores the key | Compliance: must own key material; key not on AWS |
| Client-Side | Encrypted before sending to S3 | AWS never sees plaintext | Maximum control; AWS cannot access data even with account compromise |
- All AWS SDK and console calls use HTTPS (TLS) by default
- Enforce HTTPS on S3: bucket policy with Deny when aws:SecureTransport = false
- ACM (AWS Certificate Manager): Free public TLS certificates for ALB, CloudFront, API GW. Auto-renews.
- CloudFront certificates must be in the us-east-1 region (hard exam fact)
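Below is a small helper that builds the HTTPS-only bucket policy described above. It is a sketch, and the bucket name is a placeholder. The policy covers both the bucket ARN and `bucket/*`, so bucket-level and object-level requests over plain HTTP are both denied.

```python
import json


def deny_insecure_transport_policy(bucket: str) -> str:
    """Bucket policy that denies every S3 request not made over HTTPS."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover the bucket itself and every object in it
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

You can apply the resulting JSON string with boto3's `s3.put_bucket_policy(Bucket=..., Policy=...)`.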
| | Server-Side (SSE) | Client-Side |
|---|---|---|
| Who encrypts | AWS service (S3, RDS, etc.) | Your application code |
| AWS sees plaintext? | Briefly, while the service encrypts/decrypts (SSE-S3, SSE-KMS, SSE-C) | Never |
| Complexity | Low — checkbox / config | Higher — you manage encryption library |
| Performance overhead | None to you | CPU on client |
"AWS must NEVER see the key" → SSE-C or Client-Side encryption. "Audit every access to S3 objects" → SSE-KMS (every decrypt call appears in CloudTrail). "Simplest encryption with no management" → SSE-S3 (enabled by default on all new buckets since Jan 2023).
Common traps:
- "SSE-KMS means AWS cannot read your data" — FALSE; with SSE-KMS, AWS KMS holds the key. With SSE-C or client-side encryption, AWS cannot read plaintext. Compliance may require one over the other.
- "ACM certificates can be used in any region for CloudFront": FALSE; CloudFront requires ACM certificates in us-east-1.
Manage Sensitive Data in Application Code
Secrets Manager, Parameter Store, environment variable encryption, data classification, and masking.
- Describe data classification (PII, PHI); encrypt environment variables containing sensitive data
- Use secret management services to secure sensitive data
- Sanitize sensitive data; implement application-level data masking and sanitization
- Implement data access patterns for multi-tenant applications
Choosing between Secrets Manager and Parameter Store is a common DVA-C02 scenario. Secrets Manager is purpose-built for secrets with rotation; Parameter Store is for configuration and simple secrets.
| Feature | Secrets Manager | SSM Parameter Store |
|---|---|---|
| Purpose | Secrets (DB passwords, API keys) | Config + secrets |
| Auto rotation | ✅ Native (uses Lambda) | ❌ No native rotation |
| RDS integration | ✅ Native DB credential rotation | ❌ Manual only |
| Encryption | Always KMS-encrypted | SecureString uses KMS; String is plaintext |
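| Cost | $0.40/secret/month + $0.05/10K API calls | Standard parameters free; Advanced $0.05/parameter/month; higher-throughput API calls $0.05/10K |
| Max value size | 64 KB | 4 KB (Standard), 8 KB (Advanced) |
| Hierarchy / paths | Flat names; / is allowed as a naming convention (e.g., prod/myapp/db) | ✅ Path-based: /app/prod/db-url (retrieve a subtree with GetParametersByPath) |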
| Best for | Database passwords, API keys needing rotation | App config, feature flags, non-rotating secrets |
- Rotation uses a Lambda function (AWS provides templates for RDS, Redshift, DocumentDB)
- Rotation schedule: interval in days or cron expression
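- On rotation: a new secret version is created and labeled AWSCURRENT. Applications get it on their next GetSecretValue call (which defaults to the AWSCURRENT label); values cached in memory stay stale until refreshed.
- Versions: AWSCURRENT (active), AWSPREVIOUS (the prior version), AWSPENDING (being created during rotation)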
```python
# Fetch secret once at Lambda cold start; cache it for warm invocations
import boto3
import json

client = boto3.client('secretsmanager')
secret = json.loads(client.get_secret_value(
    SecretId='prod/myapp/db'
)['SecretString'])
DB_PASSWORD = secret['password']  # cached globally
```
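Caching the secret globally, as above, never picks up a rotated value in a long-lived execution environment. Below is a minimal sketch of a time-bounded cache. It assumes any zero-argument `fetch` callable, for example one wrapping `get_secret_value`. The class name and the 300-second default are illustrative choices. AWS also offers ready-made options such as the Parameters and Secrets Lambda Extension.

```python
import json
import time


class CachedSecret:
    """Re-fetch a cached secret after max_age seconds so rotated values are picked up.

    fetch: zero-argument callable returning the SecretString, e.g.
    lambda: client.get_secret_value(SecretId="prod/myapp/db")["SecretString"]
    """

    def __init__(self, fetch, max_age: float = 300.0, clock=time.monotonic):
        self._fetch = fetch
        self._max_age = max_age
        self._clock = clock
        self._value = None
        self._fetched_at = float("-inf")  # forces a fetch on first use

    def get(self, force_refresh: bool = False) -> dict:
        if force_refresh or self._clock() - self._fetched_at >= self._max_age:
            self._value = json.loads(self._fetch())
            self._fetched_at = self._clock()
        return self._value
```

Calling `get(force_refresh=True)` after a database authentication failure is one way to recover immediately after a rotation, without waiting for `max_age` to pass.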
"DB password needs auto-rotation every 30 days" → Secrets Manager. "App config values like feature flags" → Parameter Store (free). "Multiple apps share same config path" → Parameter Store with path hierarchy. "Secrets Manager or Env Vars?" → Always Secrets Manager for production secrets.
Common traps:
- "SSM Parameter Store SecureString auto-rotates" — FALSE; you must write your own rotation logic. Secrets Manager has this built-in.
- "Storing secrets in Lambda environment variables is secure": PARTIALLY. Env vars are KMS-encrypted at rest but are visible in the Lambda console and to any principal with lambda:GetFunctionConfiguration access. Use Secrets Manager for sensitive production credentials.
| Type | Examples | Regulation |
|---|---|---|
| PII (Personally Identifiable Information) | Name, email, SSN, phone, IP address | GDPR, CCPA |
| PHI (Protected Health Information) | Medical records, diagnoses, insurance IDs | HIPAA |
| Cardholder data | Credit card numbers (PANs), CVVs | PCI-DSS |
- Masking: Replace sensitive data with a partially hidden representative value (e.g., 4111-****-****-1234 for a credit card)
- Tokenization: Replace sensitive data with a non-sensitive token; the original data is stored in a secure vault
- Sanitization: Remove or neutralize dangerous input to prevent injection. For SQL injection, prefer parameterized queries; for XSS, encode output.
- Never log PII/PHI; use structured logging with field-level exclusion
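The masking, tokenization, and log-hygiene ideas above can be sketched in Python. `TokenVault` is a hypothetical in-memory stand-in for a secure, access-controlled token store. The email regex is a simple illustration, not a complete PII detector.

```python
import re
import secrets


def mask_card(number: str) -> str:
    """Keep the first 4 and last 4 digits, mask the rest, group in fours."""
    digits = re.sub(r"\D", "", number)
    if len(digits) < 12:
        raise ValueError("not a card number")
    masked = digits[:4] + "*" * (len(digits) - 8) + digits[-4:]
    return "-".join(masked[i:i + 4] for i in range(0, len(masked), 4))


class TokenVault:
    """Hypothetical in-memory vault; a real one is an encrypted, access-controlled store."""

    def __init__(self):
        self._store = {}

    def tokenize(self, value: str) -> str:
        token = "tok_" + secrets.token_hex(8)  # random, so it reveals nothing about value
        self._store[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._store[token]


EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_for_logs(message: str) -> str:
    """Scrub email addresses before a message reaches the logs."""
    return EMAIL.sub("[REDACTED_EMAIL]", message)
```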
| Pattern | Isolation Level | AWS Implementation |
|---|---|---|
| Silo (per-tenant DB) | High (physical isolation) | Separate RDS instances; separate DynamoDB tables |
| Pool (shared DB, tenant ID column) | Medium (logical isolation) | DynamoDB partition key = tenantId; RDS with row-level security |
| Bridge (hybrid) | Medium-High | Shared infrastructure, per-tenant encryption keys (KMS per tenant) |
"Scan S3 for PII/PHI" → Amazon Macie. "Multi-tenant app — isolate data per tenant in DynamoDB" → Partition key = tenantId. Use IAM condition dynamodb:LeadingKeys to enforce each user can only access their own partitions via Cognito Identity Pool role.
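Below is a sketch of the dynamodb:LeadingKeys pattern, as an IAM policy built in Python. It assumes the role is assumed through a Cognito Identity Pool and that the table's partition key stores the caller's Cognito identity ID. The table ARN in the usage is hypothetical. To key items by a tenantId instead, one option is a principal tag such as `${aws:PrincipalTag/tenantId}` mapped from token claims; treat that as an assumption to verify for your setup.

```python
def tenant_scoped_policy(table_arn: str) -> dict:
    """IAM policy for a Cognito Identity Pool role: callers may only read/write
    items whose partition key equals their own Cognito identity ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:PutItem",
                       "dynamodb:UpdateItem", "dynamodb:DeleteItem", "dynamodb:Query"],
            "Resource": table_arn,
            "Condition": {
                # ForAllValues: every partition key touched by the request must match
                "ForAllValues:StringEquals": {
                    # IAM substitutes the caller's identity ID at request time
                    "dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"]
                }
            },
        }],
    }
```

Scan is intentionally not granted, because a Scan is not limited to one partition.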
Domain 3 Overview
Package, deploy, and automate the release of AWS applications. Covers AWS SAM, CloudFormation, CDK, CI/CD pipeline services (CodeBuild, CodeDeploy, CodePipeline), deployment strategies, and testing approaches.
⚡ 24% of scored content
Prepare Application Artifacts to be Deployed to AWS
Lambda packaging, AWS SAM, CloudFormation, CDK, AppConfig, and IaC templates.
- Manage dependencies within the package (env vars, config files, container images)
- Organize files and directory structure for deployment
- Use code repositories in deployment environments
- Apply application requirements for resources (memory, cores)
- Prepare application configurations for specific environments (using AWS AppConfig)
AWS SAM is a shorthand extension of CloudFormation for serverless apps. The exam tests SAM template syntax, local testing, and the build/deploy workflow.
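```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31  # Declares this is a SAM template
Globals:
  Function:
    Timeout: 30
    Runtime: python3.12
    Environment:
      Variables:
        TABLE_NAME: !Ref MyTable
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.lambda_handler
      CodeUri: ./src/
      Policies:
        - DynamoDBCrudPolicy:          # SAM policy template scoped to one table
            TableName: !Ref MyTable
      Events:
        ApiEvent:
          Type: Api
          Properties:
            Path: /hello
            Method: get
  MyTable:                              # referenced above, so it must be defined
    Type: AWS::Serverless::SimpleTable
```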
| SAM Type | Expands To |
|---|---|
| AWS::Serverless::Function | Lambda Function + IAM Role + optional Event Source Mapping |
| AWS::Serverless::Api | API Gateway REST API + Stage + Deployment |
| AWS::Serverless::HttpApi | API Gateway HTTP API |
| AWS::Serverless::SimpleTable | DynamoDB table (single-attribute primary key) |
| AWS::Serverless::StateMachine | Step Functions state machine |
- sam init: scaffold a new SAM app from a template
- sam build: package code and dependencies into .aws-sam/build
- sam local invoke: run a Lambda function locally with a test event JSON
- sam local start-api: start a local API Gateway emulator
- sam deploy --guided: deploy to AWS; creates/updates a CloudFormation stack
- sam logs: fetch Lambda logs from CloudWatch (add --tail to stream them)
SAM Transform: The line Transform: AWS::Serverless-2016-10-31 is what makes a template a SAM template. During deployment, CloudFormation calls the SAM transform to expand SAM-specific resource types into standard CloudFormation resources.
Common traps:
- "SAM is a completely separate service from CloudFormation": FALSE; SAM is a superset/extension of CloudFormation. sam deploy creates a CloudFormation stack.
- "sam local invoke tests the same environment as AWS Lambda": FALSE. Local invocation runs in a Docker container that emulates Lambda but is not identical: no VPC, your local AWS credentials instead of the execution role, and different resource limits.
- Stack: A collection of AWS resources managed as a single unit. Create/update/delete together.
- Template: JSON or YAML file defining the desired state. Sections: Parameters, Mappings, Conditions, Resources (required), Outputs
- Change Set: Preview the impact of a template update before executing it. Shows what will be Added/Modified/Removed.
- Drift Detection: Identify resources that have been manually modified outside CloudFormation
- Stack Policy: Protects stack resources from unintended updates during stack update operations
| Function | Purpose | Example |
|---|---|---|
| !Ref | Reference a parameter or resource | !Ref MyBucket |
| !GetAtt | Get an attribute of a resource | !GetAtt MyTable.Arn |
| !Sub | String substitution | !Sub "arn:aws:s3:::${BucketName}/*" |
| !ImportValue | Import an exported Output from another stack (cross-stack reference) | !ImportValue SharedVpcId |
| !If | Pick a value based on a condition (whole resources are made conditional with the Condition attribute) | !If [IsProd, ProdConfig, DevConfig] |
- Nested Stacks: Reusable template components — a root stack references child stack templates via S3 URLs. Used to break up large templates.
- StackSets: Deploy a single template to multiple accounts and regions simultaneously — essential for multi-account governance
- AWS AppConfig: managed feature flags and configuration, separating configuration changes from code deployments
- Supports validation (JSON schema) before config goes live
- Supports gradual deployment of config changes (canary, linear, all-at-once)
- Applications fetch config at runtime without code redeployment
"Preview what changes before updating stack" → Change Set. "Deploy same template to 50 accounts" → StackSets. "Share VPC ID between stacks" → Outputs + !ImportValue. "Separate feature flags from Lambda code" → AppConfig.
Common traps:
- "CloudFormation can import existing resources into a stack" — TRUE via Resource Import, but only for supported resource types.
- "You can delete or change an exported Output while other stacks import it": FALSE. !ImportValue creates a hard dependency, so updating or deleting the exporting stack's Output fails until no stack imports the value.
Test Applications in Development Environments
API Gateway stages and stage variables, Lambda testing strategies, integration tests, mock APIs.
- Test deployed code using AWS services and tools
- Write integration tests and mock APIs for external dependencies
- Test applications by using development endpoints (configuring stages in Amazon API Gateway)
- Deploy application stack updates to existing environments (e.g., deploying SAM template to a different staging environment)
- Test event-driven applications
- Deploy the same API to dev, test, and prod stages; each stage is an independent deployment
- Stage Variables: Use ${stageVariables.lambdaAlias} in the Lambda integration ARN to route each stage to a different Lambda alias. Each alias needs a resource-based permission that lets API Gateway invoke it.
- Canary deployments on stages: Route X% of traffic to a new deployment while the rest stays on the stable one; built into API GW stage settings
| Test Type | Tool | What It Tests |
|---|---|---|
| Unit test | pytest, Jest, JUnit | Individual functions in isolation |
| Local integration | sam local invoke / start-api | Lambda handler with real event payloads |
| Integration test | Deploy to dev stage | Full stack — API GW → Lambda → DynamoDB |
| Mock API | API GW mock integration | Frontend dev without backend |
| Event test | Lambda console test, SAM CLI | Event-driven functions (S3, SQS, DDB Streams) |
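Below is a unit-test sketch in pytest style. The handler receives its DynamoDB table as a parameter, so tests can pass a fake and run without AWS access. The handler, `FakeTable`, and the `orderId` field are hypothetical. In production the table would typically be created once at module scope (e.g., with `boto3.resource("dynamodb").Table(...)`) and passed in. Libraries such as moto or `unittest.mock` are common alternatives to a hand-written fake.

```python
import json


def lambda_handler(event, context, table=None):
    """Handler under test; 'table' is injected so unit tests need no AWS access."""
    body = json.loads(event.get("body") or "{}")
    if "orderId" not in body:
        return {"statusCode": 400, "body": json.dumps({"error": "orderId required"})}
    table.put_item(Item={"pk": body["orderId"], "status": "NEW"})
    return {"statusCode": 201, "body": json.dumps({"orderId": body["orderId"]})}


class FakeTable:
    """Records put_item calls instead of writing to DynamoDB."""

    def __init__(self):
        self.items = []

    def put_item(self, Item):
        self.items.append(Item)


def test_creates_order():
    table = FakeTable()
    response = lambda_handler({"body": json.dumps({"orderId": "o-1"})}, None, table)
    assert response["statusCode"] == 201
    assert table.items == [{"pk": "o-1", "status": "NEW"}]


def test_rejects_missing_order_id():
    response = lambda_handler({"body": "{}"}, None, FakeTable())
    assert response["statusCode"] == 400
```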
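- Use SAM CLI event templates: sam local generate-event s3 put generates a sample S3 event JSON
- For SQS: sam local generate-event sqs receive-message
- sam local start-lambda runs a local endpoint that emulates the Lambda Invoke API, so tests or the AWS CLI/SDK can invoke functions locally. A real SQS event source mapping cannot trigger it, so test the actual Lambda → SQS → Lambda wiring against a queue in a dev account.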
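Generated SQS events carry `Records[].messageId` and `Records[].body`. Below is a minimal sketch of a resilient SQS consumer that reports only the failed messages. It assumes ReportBatchItemFailures is enabled on the event source mapping. `process` is hypothetical business logic.

```python
import json


def process(message: dict) -> None:
    """Hypothetical business logic; raises on bad input."""
    if "orderId" not in message:
        raise ValueError("missing orderId")


def lambda_handler(event, context):
    """SQS batch handler that returns only the failed message IDs.

    Requires ReportBatchItemFailures on the event source mapping; without it,
    a normal return marks the whole batch as successful.
    """
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only these messages become visible on the queue again for retry
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```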
"Test new Lambda version without affecting prod traffic" → Deploy to Lambda alias pointing to new version. Use API GW stage variable to route dev stage to dev alias. "Frontend team needs API without backend" → API GW Mock Integration returns hardcoded response.
Automate Deployment Testing
Lambda versions and aliases, container image tags, Amplify branches, and IaC deployment automation.
- Create application test events (JSON payloads for Lambda, API GW, SAM)
- Deploy API resources to various environments
- Create application environments using approved versions (Lambda aliases, container image tags, Amplify branches)
- Implement and deploy IaC templates (SAM, CloudFormation)
- Manage environments in individual AWS services (dev/test/prod in API GW)
- Use Amazon Q Developer to generate automated tests
- Publishing a version creates an immutable snapshot of the function code + config
- Versions are numbered: :1, :2, :3…
- $LATEST is the mutable, unpublished version; it always points to the latest code
- You cannot update a published version; it is read-only
- Aliases are named pointers to specific versions (e.g., live → v5, beta → v6)
- Weighted routing: An alias can split traffic between two versions, e.g., live = 90% v5 + 10% v6. Used for blue/green and canary testing.
- API Gateway stage variables or event source mappings reference the alias, so you can swap the version behind the alias without changing integrations
- Lambda supports container images up to 10 GB (vs 250 MB for ZIP)
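- Images stored in Amazon ECR; tag with :latest, :v1.2.0, or commit SHA for traceability
- CI/CD pipeline: build → push to ECR with version tag → update Lambda to use new image URI → run tests → promote tag to :stable
- Lambda resolves the image tag to a digest when the function is created or updated, so re-pointing a tag (e.g., :stable) does not change a deployed function until you update its code again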
"Test 10% of prod traffic on new Lambda version" → Lambda alias with weighted routing (90/10). "Roll back to previous version instantly" → Update alias to point back to previous version number. "Prevent accidental deploy of untested code" → Require published version (not $LATEST) in prod alias.
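Weighted alias routing can be simulated to build intuition for a 90/10 split. This models the behavior, not Lambda's internal implementation, and the version numbers are illustrative. On AWS the same split is configured with boto3's `update_alias(FunctionName=..., Name="live", FunctionVersion="5", RoutingConfig={"AdditionalVersionWeights": {"6": 0.1}})`. An alias can route to at most two versions.

```python
import random
from collections import Counter


def pick_version(primary: str, additional: dict, rng: random.Random) -> str:
    """Choose the version for one invocation. 'additional' maps version -> weight
    (0.0-1.0), like an alias RoutingConfig; the primary gets the remainder."""
    roll = rng.random()
    for version, weight in additional.items():
        if roll < weight:
            return version
        roll -= weight
    return primary


def simulate(primary: str, additional: dict, n: int = 10_000, seed: int = 0) -> Counter:
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    return Counter(pick_version(primary, additional, rng) for _ in range(n))
```

Rolling back is the same as setting the weights to `{}`. Every invocation then goes to the primary version.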
Common traps:
- "Updating an alias to point to a new version causes downtime" — FALSE; alias update is near-instant and atomic.
- "$LATEST is a publishable version" — FALSE; $LATEST is always the unpublished current state. Publishing creates numbered versions from $LATEST.
- "Lambda aliases can point to another alias" — FALSE; aliases can only point to specific version numbers or $LATEST, not to other aliases.
Deploy Code by Using AWS CI/CD Services
CodeCommit, CodeBuild, CodeDeploy, CodePipeline, deployment strategies, rollback.
- Describe Lambda deployment packaging options and API Gateway stages with custom domains
- Update existing IaC templates; manage application environments using AWS services
- Deploy an application version using deployment strategies; commit code to invoke build/test/deploy actions
- Use orchestrated workflows to deploy code to different environments
- Perform application rollbacks; use labels and branches for version and release management
- Configure deployment strategies (blue/green, canary, rolling) for application releases
The Code* services form the AWS-native CI/CD toolchain. Know each service's role and how they integrate in a pipeline.
| Service | Purpose | Key Concept |
|---|---|---|
| AWS CodeCommit | Git repository (source control) | Private Git repos; triggers CodePipeline on push |
| AWS CodeBuild | Fully managed build & test service | Runs buildspec.yml; Docker-based; pay-per-minute |
| AWS CodeDeploy | Automated deployment to compute targets | EC2, Lambda, ECS; uses appspec.yml |
| AWS CodePipeline | Orchestrates the full CI/CD workflow | Stages: Source → Build → Test → Deploy; integrates with GitHub/Bitbucket too |
| AWS CodeArtifact | Artifact repository (npm, Maven, PyPI) | Cache and share packages; pull-through cache from public repos |
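buildspec.yml (CodeBuild):
```yaml
version: 0.2
phases:
  install:
    commands:
      - pip install -r requirements.txt
  build:
    commands:
      - python -m pytest tests/
      - sam build
  post_build:
    commands:
      # $ARTIFACT_BUCKET is assumed to be set as a CodeBuild environment variable
      - sam package --s3-bucket $ARTIFACT_BUCKET --output-template-file packaged-template.yaml
artifacts:
  files:
    - packaged-template.yaml
```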
appspec.yml (CodeDeploy, Lambda target):
```yaml
version: 0.0
Resources:
  - MyLambdaFunction:
      Type: AWS::Lambda::Function
      Properties:
        Name: MyFunction
        Alias: live
        CurrentVersion: "1"   # version the alias points to now
        TargetVersion: "2"    # version traffic shifts to
Hooks:
  - BeforeAllowTraffic: PreTrafficHookFunction
  - AfterAllowTraffic: PostTrafficHookFunction
```
CodeBuild vs. CodeDeploy: CodeBuild compiles, tests, packages. CodeDeploy deploys the artifact to a target. CodePipeline orchestrates both. Common exam trap: mixing up which service does what.
Common traps:
- "CodePipeline can only use CodeCommit as source" — FALSE; CodePipeline supports GitHub, GitHub Enterprise, GitLab, Bitbucket, S3, and ECR as sources.
- "CodeBuild replaces CodeDeploy" — FALSE; they serve different purposes. CodeBuild builds and tests; CodeDeploy deploys the built artifact.
| Strategy | How It Works | Downtime? | Rollback Speed | Cost |
|---|---|---|---|---|
| All-at-once | Deploy to all instances simultaneously | Yes (brief) | Redeploy (slow) | Cheapest |
| Rolling | Deploy to one batch at a time | Reduced capacity briefly | Redeploy | Low |
| Rolling with additional batch | Add new batch before removing old | No | Redeploy | Medium |
| Immutable | Launch new ASG with new version; swap on success | No | Terminate new instances (fast) | High (double capacity) |
| Blue/Green | Full duplicate environment; switch DNS/ALB | No | Instant (swap back) | Highest (2× environment) |
| Canary | X% traffic to new, rest to old | No | Shift traffic back | Medium |
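- CodeDeployDefault.LambdaAllAtOnce: Instantly shifts all traffic to the new version
- CodeDeployDefault.LambdaCanary10Percent5Minutes: 10% to the new version → wait 5 min → the remaining 90% shifts unless a configured CloudWatch alarm triggers a rollback
- CodeDeployDefault.LambdaLinear10PercentEvery1Minute: Increase by 10% every minute until 100%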
- Pre/Post traffic hooks: Lambda functions that run before/after traffic shift to validate the deployment (e.g., run smoke tests)
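Below is a minimal BeforeAllowTraffic hook sketch. CodeDeploy passes `DeploymentId` and `LifecycleEventHookExecutionId` in the event and waits until the hook reports a status through `put_lifecycle_event_hook_execution_status`; reporting `Failed` fails the deployment. `run_smoke_tests` is hypothetical, and the hook's role needs `codedeploy:PutLifecycleEventHookExecutionStatus`. The injectable client exists only so the sketch can be exercised without AWS.

```python
def run_smoke_tests() -> bool:
    """Hypothetical check, e.g. invoke the new version with a known payload."""
    return True


def lambda_handler(event, context, codedeploy=None):
    """BeforeAllowTraffic hook: report Succeeded or Failed back to CodeDeploy."""
    if codedeploy is None:
        import boto3  # deferred so the sketch can run without AWS credentials
        codedeploy = boto3.client("codedeploy")
    status = "Succeeded" if run_smoke_tests() else "Failed"
    # CodeDeploy blocks the traffic shift until this status arrives
    codedeploy.put_lifecycle_event_hook_execution_status(
        deploymentId=event["DeploymentId"],
        lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
        status=status,
    )
    return status
```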
| Policy | Downtime | Best For |
|---|---|---|
| All at once | Yes | Dev/test rapid deployments |
| Rolling | No (reduced capacity) | Production with some risk tolerance |
| Rolling with additional batch | No (full capacity maintained) | Production zero-downtime |
| Immutable | No | Production; quick, low-risk rollback (terminate the new instances) |
| Blue/Green (swap URLs) | No | Production — full environment isolation |
Speed of rollback (fastest to slowest): Blue/Green (instant DNS swap) → Canary (shift traffic back) → Immutable (terminate ASG) → Rolling (redeploy previous version).
"Zero downtime with fastest rollback for Lambda" → CodeDeploy LambdaCanary + pre/post traffic hooks. "Zero downtime for EC2 with no capacity impact" → Rolling with additional batch. "Instant full rollback for production" → Blue/Green. "Cheapest, accepts downtime" → All-at-once.
Common traps:
- "Rolling deployment maintains 100% capacity" — FALSE; standard rolling takes instances offline during update, reducing capacity. Use Rolling with additional batch to maintain full capacity.
- "Blue/Green deployment is always cheaper than rolling" — FALSE; Blue/Green requires double the infrastructure temporarily — it is the most expensive strategy.
Domain 4 Overview
Debug, monitor, and optimize AWS applications. Covers CloudWatch Logs and Insights, AWS X-Ray distributed tracing, custom metrics (including EMF), structured logging, and performance optimization with caching, concurrency tuning, and messaging efficiency.
⚡ 18% of scored content
Assist in a Root Cause Analysis
CloudWatch Logs, Logs Insights queries, X-Ray traces, custom metrics, dashboards, and deployment failure debugging.
- Debug code to identify defects; interpret application metrics, logs, and traces
- Query logs to find relevant data; implement custom metrics (CloudWatch EMF)
- Review application health using dashboards and insights
- Troubleshoot deployment failures using service output logs
- Debug service integration issues in applications
CloudWatch Logs is the primary log aggregation service on AWS. Knowing how to query, filter, and act on logs is a key DVA-C02 skill.
- Log Group: Container for related log streams (e.g., one per Lambda function or ECS service). Retention configured here (1 day – 10 years, or never expire).
- Log Stream: Sequence of log events from a single source (e.g., one Lambda execution environment, one EC2 instance).
- Log Event: A single timestamped record.
Logs Insights queries (three separate queries; run one at a time):
```
# Find Lambda errors in the last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20

# Calculate p99 Lambda duration
filter @type = "REPORT"
| stats pct(@duration, 99) as p99Duration by bin(5m)

# Count errors by error type
filter @message like /Exception/
| parse @message "* Exception: *" as exType, exMsg
| stats count(*) by exType
```
- Metric filters extract values from log events (or count matching events) and publish them as CloudWatch custom metrics
- Example: count occurrences of `ERROR` in logs → create metric `ErrorCount` → set an alarm on it
- Pattern: `{ $.level = "ERROR" }` (JSON log format) or the simple text pattern `ERROR`
- Each metric filter belongs to one log group and publishes to one CloudWatch metric
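Creating the `ErrorCount` metric filter with boto3 might look like the minimal sketch below, assuming a CloudWatch Logs client is passed in; the log group, namespace, filter name, and helper name are hypothetical. `defaultValue` reports 0 for log events that do not match, so quiet periods with normal traffic show up as 0 rather than as gaps.

```python
def create_error_count_filter(logs, log_group, namespace="MyApp", metric_name="ErrorCount"):
    """Count JSON log events whose "level" field is "ERROR" as a custom metric."""
    logs.put_metric_filter(
        logGroupName=log_group,
        filterName=f"{metric_name}-filter",      # hypothetical naming scheme
        filterPattern='{ $.level = "ERROR" }',   # JSON filter pattern
        metricTransformations=[{                 # exactly one transformation per filter
            "metricName": metric_name,
            "metricNamespace": namespace,
            "metricValue": "1",                  # add 1 per matching event
            "defaultValue": 0.0,                 # emit 0 for non-matching events
        }],
    )
```

A threshold alarm on `MyApp/ErrorCount` can then notify an SNS topic, as described in the alarm section.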
- Lambda automatically sends stdout/stderr to CloudWatch Logs (`/aws/lambda/functionName`)
- Lambda log format: START and END lines carry the request ID; the REPORT line adds Duration, Billed Duration, Memory Size, Max Memory Used, and Init Duration (cold starts only)
- CloudWatch Lambda Insights: Enhanced monitoring — CPU, memory, disk, network utilization via a Lambda layer
| Symptom | Likely Cause | Where to Look |
|---|---|---|
| Lambda 502 errors from API GW | Lambda function error, timeout, or malformed proxy response | Lambda log group for error detail |
| Lambda throttling (429) | Concurrency limit reached | CloudWatch metric: Throttles |
| Lambda timeout | Function runs longer than its configured timeout | `Task timed out after N seconds` message; REPORT Duration ≈ configured timeout |
| DynamoDB ProvisionedThroughputExceededException | Hot partition or insufficient capacity | CW metric: ConsumedWriteCapacityUnits, ThrottledRequests |
| API GW 5xx errors | Backend Lambda or integration error | API GW execution logs; enable in stage settings |
Enable API Gateway execution logging in the stage settings to see the full request/response cycle, including integration errors. Access logs alone record one summary entry per request, which is insufficient for debugging integration errors. Set the log level to INFO (full detail) or ERROR (errors only).
Common traps:
- "CloudWatch Logs Insights queries run in real-time as logs arrive" — FALSE; Logs Insights queries historical log data. For real-time filtering, use CloudWatch Live Tail or Subscription Filters.
- "Lambda always writes logs to CloudWatch automatically": TRUE only if the execution role has `logs:CreateLogGroup`, `logs:CreateLogStream`, and `logs:PutLogEvents` permissions (granted by the AWSLambdaBasicExecutionRole managed policy).
X-Ray provides end-to-end distributed tracing across microservices. It is the go-to tool for identifying latency bottlenecks and errors across a request chain on the DVA-C02 exam.
| Concept | Description |
|---|---|
| Trace | Complete end-to-end request across all services — identified by unique Trace ID |
| Segment | Data from a single service (e.g., Lambda function, EC2 app) for one request |
| Subsegment | Granular unit within a segment (e.g., DynamoDB call, HTTP downstream call, custom code block) |
| Annotation | Indexed key-value pair — filterable in X-Ray console (e.g., userId, tenantId) |
| Metadata | Non-indexed key-value pair — visible in trace but not searchable |
| Service Map | Visual graph of all services in the request chain with latency and error rates |
- Lambda: Enable Active Tracing in function config. Lambda sends trace data automatically. Add X-Ray SDK for subsegments.
- API Gateway: Enable X-Ray Tracing on the stage settings. Adds trace header to requests.
- ECS/EC2: Run X-Ray Daemon as a sidecar or daemon process. SDK sends trace data to daemon → daemon batches and sends to X-Ray API.
- SDK instrumentation: Wrap AWS SDK calls and HTTP clients with X-Ray SDK to automatically create subsegments
```python
from aws_xray_sdk.core import xray_recorder

# Note: in Lambda the function segment is managed by Lambda and cannot be
# modified; make these calls inside a subsegment (as in the last block).

# Annotation (indexed: use for filtering)
xray_recorder.put_annotation("userId", user_id)
xray_recorder.put_annotation("plan", "premium")

# Metadata (not indexed: debugging context only)
xray_recorder.put_metadata("requestPayload", request_body)

# Custom subsegment
with xray_recorder.in_subsegment("database-query") as sub:
    result = table.query(...)
    sub.put_annotation("recordCount", len(result["Items"]))
```
- X-Ray does not record every request — it samples to reduce cost and noise
- Default rule: Record first request per second + 5% of additional requests per service host
- Custom sampling rules: Target specific services, URLs, or HTTP methods with different rates
- Sampling rules are centrally configured — no code change needed to adjust sampling
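The default rule (first request each second, then 5% of the rest) is a reservoir plus a fixed rate. The class below is a local illustration of that idea for a single host, written for this guide; it is not the X-Ray SDK's sampler and ignores centralized rule matching.

```python
import random


class ReservoirSampler:
    """Illustrative reservoir + fixed-rate sampler (not the X-Ray SDK implementation).

    Records the first `reservoir` requests in each one-second window, then a
    `fixed_rate` fraction of the remaining requests in that window.
    """

    def __init__(self, reservoir=1, fixed_rate=0.05, rng=None):
        self.reservoir = reservoir
        self.fixed_rate = fixed_rate
        self.rng = rng or random.Random()
        self._window = None
        self._taken = 0

    def should_sample(self, now):
        second = int(now)
        if second != self._window:  # new one-second window: refill the reservoir
            self._window = second
            self._taken = 0
        if self._taken < self.reservoir:
            self._taken += 1
            return True
        return self.rng.random() < self.fixed_rate
```

With 1,000 requests in one second, this records about 1 + 5% × 999 ≈ 51 of them, which is why low-traffic paths still get at least one trace per second.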
Annotation vs. Metadata: Annotation = Actionable / searchable (like a database index). Metadata = More context (like a comment — descriptive but not queryable).
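Because annotations are indexed, they can be used in X-Ray filter expressions. A minimal sketch, assuming a boto3 X-Ray client and a `tenantId` annotation recorded by the application (the helper name is hypothetical), that finds slow traces for one tenant:

```python
from datetime import datetime, timedelta, timezone


def slow_traces_for_tenant(xray, tenant_id, min_seconds=2, hours=1):
    """Return IDs of traces annotated with tenantId=<tenant_id> that took > min_seconds."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    # Filter expressions can reference annotations, not metadata.
    expr = f'annotation.tenantId = "{tenant_id}" AND duration > {min_seconds}'
    ids, token = [], None
    while True:
        kwargs = {"StartTime": start, "EndTime": end, "FilterExpression": expr}
        if token:
            kwargs["NextToken"] = token
        page = xray.get_trace_summaries(**kwargs)
        ids.extend(summary["Id"] for summary in page.get("TraceSummaries", []))
        token = page.get("NextToken")  # results are paginated
        if not token:
            return ids
```

The same expression string can be pasted into the X-Ray console's trace search box.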
"Find all requests from tenantId=ABC that took >2s" → Add tenantId as X-Ray annotation → filter in X-Ray console. "Which downstream service is causing latency" → X-Ray Service Map — look for high response time subsegments. "Lambda function not sending traces" → Check Active Tracing is on AND execution role has xray:PutTraceSegments + xray:PutTelemetryRecords.
Common traps:
- "X-Ray traces every single request by default" — FALSE; X-Ray uses sampling. To trace every request, set a custom sampling rule with 100% fixed rate (not recommended at high volume).
- "X-Ray metadata is searchable in the console like annotations" — FALSE; only annotations are indexed and filterable. Metadata is visible within a specific trace detail but cannot be used to search/filter across traces.
- "X-Ray is enabled automatically on Lambda functions" — FALSE; you must explicitly enable Active Tracing in the Lambda function configuration.
Instrument Code for Observability
Logging vs. monitoring vs. observability, custom metrics, EMF, CloudWatch Alarms, X-Ray SDK, structured logging, health checks.
- Describe differences between logging, monitoring, and observability
- Implement an effective logging strategy to record application behavior and state
- Implement code that emits custom metrics; add annotations for tracing services
- Implement notification alerts for specific actions (quota limits, deployment completions)
- Implement tracing using AWS services and tools; implement structured logging
- Configure application health checks and readiness probes
Publishing custom metrics lets you monitor business-level KPIs alongside infrastructure metrics. EMF is the modern, cost-effective approach for Lambda functions.
| Pillar | What It Provides | AWS Service |
|---|---|---|
| Logging | Discrete event records — what happened | CloudWatch Logs, CloudTrail |
| Monitoring | Time-series metrics — how things are performing | CloudWatch Metrics & Alarms |
| Tracing | Request flow across services — why it happened | AWS X-Ray |
| Observability | Combine all three to understand system state | CloudWatch Application Signals, X-Ray, OTel |
```python
import boto3

# Emit custom metric: order processing time
cw = boto3.client('cloudwatch')
cw.put_metric_data(
    Namespace='MyApp/Orders',
    MetricData=[{
        'MetricName': 'ProcessingDuration',
        'Value': processing_ms,
        'Unit': 'Milliseconds',
        'Dimensions': [{'Name': 'Environment', 'Value': 'prod'}],
    }],
)
```
- Embed metric values in structured log JSON — CloudWatch automatically extracts them as metrics
- Zero additional API calls — metrics are extracted from your existing logs
- No PutMetricData API calls to pay for; you pay for log ingestion plus the standard custom-metric charges
- Use the `aws-embedded-metrics` library (Python/Node.js/Java)
```python
# Python EMF: metrics extracted automatically from logs
from aws_embedded_metrics import metric_scope

@metric_scope
def lambda_handler(event, context, metrics):
    metrics.set_namespace("MyApp")
    metrics.put_dimensions({"Service": "OrderProcessor"})
    metrics.put_metric("OrderCount", 1, "Count")
    metrics.put_metric("ProcessingTime", elapsed_ms, "Milliseconds")
```
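What the library writes is an ordinary JSON log line with an `_aws` metadata block. The hand-rolled sketch below, which does not use the library, builds that line so the mechanism is visible; the helper name `emf_record` is hypothetical and only a single dimension set is supported.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, **properties):
    """Build one CloudWatch Embedded Metric Format (EMF) log line.

    dimensions: {"Service": "OrderProcessor"}
    metrics:    {"OrderCount": (1, "Count"), "ProcessingTime": (45, "Milliseconds")}
    properties: extra fields that are logged but not turned into metrics
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],  # a single dimension set
                "Metrics": [{"Name": name, "Unit": unit}
                            for name, (_, unit) in metrics.items()],
            }],
        },
    }
    record.update(dimensions)                                              # dimension values
    record.update({name: value for name, (value, _) in metrics.items()})  # metric values
    record.update(properties)
    return json.dumps(record)


# In Lambda, printing the line is enough: it lands in CloudWatch Logs and
# CloudWatch extracts OrderCount and ProcessingTime as metrics.
print(emf_record("MyApp", {"Service": "OrderProcessor"},
                 {"OrderCount": (1, "Count"), "ProcessingTime": (45, "Milliseconds")},
                 requestId="abc-123"))
```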
| Alarm Type | Triggers When | Use Case |
|---|---|---|
| Threshold Alarm | Metric crosses a static threshold | CPU > 80%, Error count > 5 |
| Anomaly Detection | Metric deviates from ML-predicted band | Traffic drops/spikes without fixed threshold |
| Composite Alarm | Combination of multiple alarms (AND/OR) | Alert only when both high CPU AND high errors |
| Math Expression Alarm | Metric math result crosses threshold | Error rate = errors / requests > 1% |
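The math-expression and composite rows can be sketched with boto3 as follows, assuming a CloudWatch client and an existing SNS topic; the alarm names and helper names are hypothetical. Only the expression has `ReturnData=True`; the raw metrics are inputs.

```python
def create_error_rate_alarm(cw, function_name, topic_arn, threshold_pct=1.0):
    """Alarm when Lambda Errors / Invocations exceeds threshold_pct over 5 minutes."""
    dims = [{"Name": "FunctionName", "Value": function_name}]

    def stat(metric_id, name):
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {"Namespace": "AWS/Lambda", "MetricName": name, "Dimensions": dims},
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,  # input only; the alarm evaluates the expression
        }

    cw.put_metric_alarm(
        AlarmName=f"{function_name}-error-rate",
        Metrics=[
            stat("errors", "Errors"),
            stat("invocations", "Invocations"),
            {"Id": "rate", "Expression": "100 * errors / invocations",
             "Label": "ErrorRatePercent", "ReturnData": True},
        ],
        ComparisonOperator="GreaterThanThreshold",
        Threshold=threshold_pct,
        EvaluationPeriods=1,
        TreatMissingData="notBreaching",  # periods without data are treated as OK
        AlarmActions=[topic_arn],         # e.g. SNS topic → email/PagerDuty
    )


def create_composite_alarm(cw, name, cpu_alarm, error_alarm, topic_arn):
    """Fire only when both existing alarms are in ALARM state."""
    cw.put_composite_alarm(
        AlarmName=name,
        AlarmRule=f'ALARM("{cpu_alarm}") AND ALARM("{error_alarm}")',
        AlarmActions=[topic_arn],
    )
```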
- OK: Metric is within the threshold
- ALARM: Metric has breached the threshold for the configured datapoints-to-alarm
- INSUFFICIENT_DATA: Not enough data points yet (common with new metrics or long evaluation periods)
"Notify on Lambda throttling" → CloudWatch Alarm on Throttles metric → SNS → email/PagerDuty. "Track business KPI in Lambda" → Use EMF (no extra API calls, extracted from logs automatically). "Alert only when multiple conditions are true" → Composite Alarm.
Common traps:
- "An alarm in INSUFFICIENT_DATA state means something is wrong" — FALSE; insufficient data is normal for new alarms or when a metric has gaps (e.g., Lambda not invoked recently).
- "CloudWatch custom metrics are retained indefinitely" — FALSE; CloudWatch retains data at different resolutions: 1-second for 3 hours, 1-minute for 15 days, 5-minute for 63 days, 1-hour for 15 months.
- Log as JSON instead of plain text — machine-parseable, queryable with Logs Insights
- Always include: `requestId`, `userId`, `timestamp`, `level`, `message`, `duration`
- Never include PII, credentials, or sensitive data in logs
- Use a correlation ID (requestId or traceId) to link logs across multiple services for the same request
Structured log entry (queryable with Logs Insights):

```json
{
  "level": "INFO",
  "requestId": "abc-123",
  "userId": "u-456",
  "action": "processOrder",
  "orderId": "ord-789",
  "durationMs": 45,
  "status": "success"
}
```
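One way to produce entries like this with only the Python standard library is a custom `logging.Formatter`. This is a minimal sketch; the `fields` convention for passing structured data and the helper names are assumptions of this example, and `orderId` is a hypothetical event field.

```python
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields passed as logger.info(..., extra={"fields": {...}})
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def get_logger(name="app", stream=sys.stdout):
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on warm starts
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        logger.propagate = False  # don't also emit through the root logger
    return logger


log = get_logger()


def lambda_handler(event, context):
    # Correlation ID: the Lambda request ID (the X-Ray trace ID works too)
    log.info("order processed", extra={"fields": {
        "requestId": context.aws_request_id,
        "orderId": event.get("orderId"),  # hypothetical event field
        "status": "success",
    }})
```

Each call produces one JSON line, so Logs Insights can filter directly on fields such as `requestId` or `status`.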
| Type | What It Checks | Used By |
|---|---|---|
| ALB Target Health Check | Health-check path returns the configured success code (default 200) | ALB removes unhealthy targets from rotation |
| ECS Container Health Check | Health-check command inside the container exits 0 (e.g., `curl -f http://localhost/health`) | ECS replaces unhealthy tasks |
| Route 53 Health Check | Endpoint responds over HTTP/HTTPS/TCP to Route 53 health checkers (or a CloudWatch alarm's state) | DNS failover routing policies |
| CloudWatch Synthetics Canary | Scripted browser/API test runs on a schedule | Proactive endpoint monitoring |
- Stream filtered log data from a CloudWatch log group to Lambda, Kinesis, or Firehose in real time
- Use case: Forward ERROR logs from all Lambda functions to a centralized Kinesis stream → Firehose → OpenSearch for analysis
- Up to 2 subscription filters per log group, regardless of destination type
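When the destination is Lambda, CloudWatch Logs delivers a base64-encoded, gzip-compressed JSON batch under `event["awslogs"]["data"]`. A minimal decoding sketch follows; the forwarding step is left as a `print` placeholder and the helper name is hypothetical.

```python
import base64
import gzip
import json


def decode_subscription_event(event):
    """Decode the payload CloudWatch Logs delivers to a Lambda subscription target."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    # CONTROL_MESSAGE checks that the destination is reachable; it carries no log data.
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    return [
        {"logGroup": payload["logGroup"], "timestamp": e["timestamp"], "message": e["message"]}
        for e in payload["logEvents"]
    ]


def lambda_handler(event, context):
    for entry in decode_subscription_event(event):
        print(entry["logGroup"], entry["message"])  # placeholder: forward to your sink here
```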
"Real-time log analysis pipeline" → CloudWatch Logs Subscription Filter → Kinesis Data Firehose → S3 / OpenSearch. "Proactive endpoint test every 5 min" → CloudWatch Synthetics Canary. "Correlate logs across Lambda, API GW, DynamoDB for one request" → Use X-Ray trace ID as correlation ID in all log statements.
Optimize Applications by Using AWS Services and Features
Lambda concurrency, caching strategies (ElastiCache, DAX, CloudFront, API GW), messaging optimization, and application profiling.
- Define concurrency; profile application performance
- Determine minimum memory and compute power for an application
- Use subscription filter policies to optimize messaging
- Cache content based on request headers; implement application-level caching
- Optimize application resource usage; analyze application performance issues
- Use application logs to identify performance bottlenecks
Concurrency management is one of the most tested Lambda optimization topics. Understanding reserved vs. provisioned concurrency and throttling behavior is essential.
- Each in-flight request uses one concurrent execution
- Account default limit: 1,000 concurrent executions (soft limit — can be increased via Service Quotas)
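- Scaling rate: each function can add up to 1,000 concurrent executions every 10 seconds until the account limit is reached (older material describes a regional burst of 500–3,000 followed by +500 per minute)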
- When the concurrency limit is hit → throttling → `TooManyRequestsException` (HTTP 429)
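A common back-of-the-envelope estimate of how much concurrency a workload needs is requests per second × average duration in seconds (Little's law). A tiny sketch, with a hypothetical helper name:

```python
import math


def required_concurrency(requests_per_second, avg_duration_ms):
    """Estimate steady-state concurrency: requests/second × average duration (seconds)."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)


print(required_concurrency(100, 500))   # 50: fits easily in the default 1,000
print(required_concurrency(2000, 800))  # 1600: exceeds the default 1,000 → throttling
```

The second case is exactly the "throttled during traffic spike" scenario: either shorten the duration or request a higher account limit.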
| Type | Purpose | Cost when idle | Eliminates cold start? |
|---|---|---|---|
| Unreserved | Default — shared pool | No | No |
| Reserved | Sets aside N from the account pool and caps the function at N; protects downstream | No (just a limit) | No |
| Provisioned | Pre-warm N environments | Yes (charged continuously) | Yes |
| Invocation Type | Throttle Behavior |
|---|---|
| Synchronous (API GW → Lambda) | Returns 429 to the caller immediately; Lambda does not retry (the client must) |
| Asynchronous (S3, SNS, EventBridge) | Lambda retries for up to 6 hours; backs off automatically |
| Event Source Mapping (SQS, Kinesis) | Messages stay in queue/stream; Lambda retries when concurrency available |
- Open-source tool (AWS Lambda Power Tuning) runs your function at multiple memory sizes
- Finds the optimal memory setting for cost, speed, or both
- More memory ≠ always more expensive; Lambda allocates CPU in proportion to memory, so duration often drops enough to offset the higher per-ms price
- Rule: compute cost = duration × memory (GB-seconds). Higher memory may run fast enough to be cheaper overall.
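The cost comparison that Power Tuning automates can be sketched by hand. The durations below are hypothetical measurements for a CPU-bound function; multiply GB-seconds by the per-GB-second price from the Lambda pricing page for your Region and architecture. The per-request charge is the same at every memory size, so it does not affect the ranking.

```python
def gb_seconds(memory_mb, duration_ms):
    """Compute usage of one invocation in GB-seconds."""
    return (memory_mb / 1024) * (duration_ms / 1000)


def cheapest(measurements):
    """Pick the memory size with the lowest GB-seconds from {memory_mb: duration_ms}."""
    return min(measurements, key=lambda mb: gb_seconds(mb, measurements[mb]))


# Hypothetical measurements: {memory_mb: average duration_ms}
runs = {128: 1000, 512: 240, 1024: 100, 2048: 95}
for mb, ms in runs.items():
    print(mb, "MB", round(gb_seconds(mb, ms), 4), "GB-s")
print("cheapest:", cheapest(runs), "MB")
```

Here 1,024 MB is both faster and cheaper than 128 MB, while 2,048 MB barely shortens the run and so costs more.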
Reserved Concurrency = Ceiling (caps usage). Provisioned Concurrency = Floor (pre-warms environments). Set both: Reserved prevents runaway scale; Provisioned ensures no cold starts up to that level.
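Setting both with boto3 might look like this minimal sketch, assuming a Lambda client and a published alias; the function, alias, and helper names are hypothetical. Provisioned concurrency is configured on a version or alias (not `$LATEST`) and cannot exceed the function's reserved concurrency.

```python
def configure_concurrency(lam, function_name, alias, reserved, provisioned):
    """Cap a function at `reserved` and keep `provisioned` environments warm on `alias`."""
    if provisioned > reserved:
        raise ValueError("provisioned concurrency cannot exceed reserved concurrency")
    # Ceiling: applies to the whole function (all versions and aliases)
    lam.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )
    # Floor: pre-initialized environments on a published version or alias
    lam.put_provisioned_concurrency_config(
        FunctionName=function_name,
        Qualifier=alias,
        ProvisionedConcurrentExecutions=provisioned,
    )
```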
"Prevent Lambda from overwhelming RDS" → Reserved Concurrency (cap Lambda). "Eliminate cold starts for latency-sensitive function" → Provisioned Concurrency. "Lambda throttled during traffic spike" → Request concurrency limit increase via Service Quotas. "Find cheapest memory setting" → Lambda Power Tuning tool.
Common traps:
- "Setting reserved concurrency to 0 disables the function" — TRUE — this is actually a useful technique to temporarily disable a function by throttling all executions.
- "Provisioned concurrency guarantees no throttling" — FALSE; Provisioned Concurrency pre-warms execution environments but the function can still be throttled if it exceeds the reserved or account concurrency limit.
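- "Lambda always scales to handle any load instantly": FALSE; scaling is rate-limited. Each function scales by up to 1,000 concurrent executions every 10 seconds (older material cites a regional burst of 500–3,000 followed by +500 per minute), up to the account concurrency limit.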
Caching at the right layer dramatically reduces latency and cost. The DVA-C02 exam tests which cache layer to choose for a given scenario.
| Cache Layer | Service | What It Caches | Best For |
|---|---|---|---|
| CDN / Edge | CloudFront | Static assets, API responses | Global users; reduce origin load; low latency worldwide |
| API Layer | API Gateway cache | API endpoint responses | Repeated identical API requests; reduce Lambda invocations |
| Application Layer | ElastiCache (Redis/Memcached) | DB query results, session data | RDS read offload; session store; complex data structures |
| Database Layer | DynamoDB DAX | DynamoDB GetItem/Query results | DynamoDB hot reads; microsecond latency |
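The application-layer row usually means lazy loading (cache-aside): read the cache first and only query the database on a miss. A minimal sketch follows; `DictTTLCache` is an in-memory stand-in for ElastiCache written for this example (with redis-py you would call `set(key, value, ex=ttl)` instead), and `load_from_db` is a hypothetical loader.

```python
import time


class DictTTLCache:
    """In-memory stand-in for ElastiCache: get(key) and set(key, value, ttl)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        item = self._data.get(key)
        if item is None or item[1] <= self._clock():  # missing or expired
            return None
        return item[0]

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


class CacheAside:
    """Lazy loading: populate the cache only when a read misses."""

    def __init__(self, cache, load_from_db, ttl_seconds=300):
        self.cache = cache
        self.load_from_db = load_from_db
        self.ttl = ttl_seconds

    def get(self, key):
        value = self.cache.get(key)
        if value is not None:
            return value                      # cache hit
        value = self.load_from_db(key)        # cache miss: read from the database
        self.cache.set(key, value, self.ttl)  # TTL bounds how stale the data can get
        return value
```

The TTL is the design lever: a shorter TTL reduces staleness but sends more reads to RDS.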
- Cache based on request headers: add headers to the cache key to vary cached responses (e.g., `Accept-Language` for localized content). More headers = fewer cache hits.
- TTL: set via the `Cache-Control: max-age=N` response header from the origin, or override in the CloudFront cache policy (min/max/default TTL)
- Cache invalidation: `CreateInvalidation` API or console; invalidate specific paths. Cost: first 1,000 paths/month free, then $0.005 per path.
- Origin Shield: additional caching layer between CloudFront edge locations and the origin; reduces origin load further
- Cache enabled per stage; capacity from 0.5 GB to 237 GB
- TTL: 0–3600 seconds (default 300s)
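- Cache invalidation: a client sends the `Cache-Control: max-age=0` header (callers need the `execute-api:InvalidateCache` permission when authorization is required), or you flush the entire stage cache
- Cache hit metrics: `CacheHitCount` / `CacheMissCount` in CloudWatch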
- Filter policies on SNS subscriptions prevent unwanted messages from being delivered to a subscriber
- Filtering happens at SNS — subscribers only receive messages matching their policy
- Reduces downstream processing costs (fewer Lambda invocations, fewer SQS messages)
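- Filter on message attributes, e.g. `{"eventType": ["ORDER_PLACED", "ORDER_SHIPPED"]}`; to filter on the message body instead, set the subscription's `FilterPolicyScope` to `MessageBody`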
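The matching rule behind that policy can be illustrated locally: every key in the policy must match (AND), and any listed value for a key may match (OR). The function below is a simplified emulation written for this guide; it handles exact string matching only, whereas SNS also supports prefix, numeric, anything-but, exists, and other operators.

```python
def matches_filter_policy(policy, attributes):
    """Simplified SNS filter-policy check (exact string matching only).

    Every policy key must be present in the attributes (AND); within a key,
    any of the listed values may match (OR).
    """
    for key, allowed in policy.items():
        if attributes.get(key) not in allowed:
            return False
    return True


# Attaching the policy to a real subscription with boto3 would look like:
# sns.set_subscription_attributes(
#     SubscriptionArn=subscription_arn,
#     AttributeName="FilterPolicy",
#     AttributeValue=json.dumps(policy),
# )
```

A message without an `eventType` attribute never matches this policy, so it is not delivered to the subscriber.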
"Reduce Lambda invocations for repeated API calls" → API GW cache. "Reduce RDS load from read-heavy app" → ElastiCache lazy loading. "Serve static content globally with low latency" → CloudFront. "Only process ORDER events from SNS topic" → SNS subscription filter policy on your SQS subscriber.
Common traps:
- "CloudFront caches POST requests": FALSE; CloudFront caches responses only to GET and HEAD requests (and optionally OPTIONS). POST, PUT, and DELETE are always forwarded to the origin and never cached.
- "Adding more headers to the CloudFront cache key improves cache hit rate" — FALSE; more headers in the cache key means more unique cache entries → lower cache hit rate. Use only headers that genuinely vary the response.
- "SNS filter policies are applied at the publisher": FALSE; filter policies are configured on each subscription (subscriber side). The publisher sends the message to the topic once; SNS evaluates each subscription's filter policy before delivering.
| Technique | How | Benefit |
|---|---|---|
| Long Polling | Set WaitTimeSeconds = 1–20 | Reduces empty API calls; reduces cost |
| Batch Processing | Receive up to 10 messages per call (MaxNumberOfMessages) | Fewer API calls; higher throughput |
| Message Batching (send) | Use SendMessageBatch (up to 10 messages) | 10× throughput per API call |
| Visibility Timeout Tuning | Set to > max processing time | Prevent duplicate processing |
| Dead-Letter Queue | Set maxReceiveCount | Isolate poison-pill messages; prevent infinite retry loop |
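The first three rows combine naturally in a consumer loop. A minimal sketch, assuming a boto3 SQS client and a caller-supplied `handle` function (both the helper and handler names are hypothetical); production code should also inspect the `Failed` list returned by `delete_message_batch`.

```python
def poll_once(sqs, queue_url, handle):
    """Long-poll one batch from SQS, process it, and batch-delete the successes.

    Messages whose handler raises are not deleted: they reappear after the
    visibility timeout and move to the DLQ once maxReceiveCount is exceeded.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # batch receive (max 10)
        WaitTimeSeconds=20,      # long polling (max 20 seconds)
    )
    done = []
    for i, msg in enumerate(resp.get("Messages", [])):  # key absent on empty receive
        try:
            handle(msg["Body"])
        except Exception:
            continue  # leave it on the queue for retry
        done.append({"Id": str(i), "ReceiptHandle": msg["ReceiptHandle"]})
    if done:
        # One API call deletes up to 10 messages
        sqs.delete_message_batch(QueueUrl=queue_url, Entries=done)
    return len(done)
```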
- CloudWatch Metrics: Identify CPU/memory saturation, DB connection counts, queue depth trends over time
- X-Ray Traces: Find which subsegment (DB call, downstream HTTP, internal code) contributes most to latency
- CloudWatch Logs Insights: query p99 latency with `filter @type = "REPORT" | stats pct(@duration, 99) by bin(5m)`
- Lambda REPORT logs: `Init Duration` reveals cold-start cost; `Max Memory Used` compared with `Memory Size` reveals over-provisioned memory
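Reading REPORT lines programmatically (for example in a subscription-filter consumer) can be sketched with a regular expression. The sample line and its values are illustrative; the helper names are hypothetical.

```python
import re

# Longer names first so "Billed Duration" / "Init Duration" are not read as "Duration"
_FIELD = re.compile(
    r"(Billed Duration|Init Duration|Duration|Max Memory Used|Memory Size): ([\d.]+) (?:ms|MB)"
)


def parse_report(line):
    """Extract numeric fields (ms or MB) from a Lambda REPORT log line."""
    return {name: float(value) for name, value in _FIELD.findall(line)}


def summarize(line):
    fields = parse_report(line)
    return {
        "cold_start": "Init Duration" in fields,  # only present on cold starts
        "unused_memory_mb": fields["Memory Size"] - fields["Max Memory Used"],
    }


sample = ("REPORT RequestId: 3604209a-e9a3-11e6-939a-754dd98c7be3\tDuration: 12.34 ms\t"
          "Billed Duration: 13 ms\tMemory Size: 1024 MB\tMax Memory Used: 96 MB\t"
          "Init Duration: 250.50 ms")
print(summarize(sample))
```

A consistently large `unused_memory_mb` suggests trying a smaller memory size, but confirm with Power Tuning, since less memory also means less CPU.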
| Symptom | Root Cause | Fix |
|---|---|---|
| High Lambda duration | Slow downstream DB/API call | Add DAX/ElastiCache; use async pattern |
| Lambda cold starts | No pre-warmed environment | Provisioned Concurrency; reduce package size |
| DynamoDB throttling | Hot partition or under-provisioned | Better partition key; On-Demand mode; DAX |
| RDS connection errors under load | Connection pool exhausted | RDS Proxy; reduce Lambda reserved concurrency |
| High API GW latency | Repeated identical requests hitting Lambda | Enable API GW caching |
| SQS high message age | Consumers too slow | Increase Lambda concurrency; increase batch size |
Full observability stack for Lambda: X-Ray (tracing) + CloudWatch Logs (structured JSON) + EMF (custom metrics) + CloudWatch Alarms (notification) + Lambda Insights layer (system metrics). This combination gives you complete visibility without external tools.
Common traps:
- "Long polling always returns messages faster": FALSE; long polling waits up to `WaitTimeSeconds` and returns as soon as a message is available, or returns empty when the wait expires. It reduces empty responses and cost; it is not necessarily faster.
- "Increasing Lambda memory always reduces cost": FALSE; if the function is not compute-bound, increasing memory just costs more with no duration benefit. Use Power Tuning to find the optimum.