Design an order fulfillment Step Functions workflow: (1) Validate payment, (2) Reserve inventory, (3) Prepare shipment, (4) Notify customer. If any step fails, compensate: release reserved inventory, refund payment. Requirements: retry on transient failures, timeout if any step takes >5 min. Draw this out and explain retry + compensation logic.
Step Functions order fulfillment with compensation (saga pattern): (1) Workflow definition (ASL - Amazon States Language): ```json { "Comment": "Order fulfillment with compensation", "StartAt": "ValidatePayment", "States": { "ValidatePayment": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Parameters": { "FunctionName": "validate-payment", "Payload.$": "$" }, "Next": "ReserveInventory", "Retry": [{ "ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2 }], "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "PaymentFailed", "ResultPath": "$.error" }], "TimeoutSeconds": 300 }, "ReserveInventory": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Parameters": { "FunctionName": "reserve-inventory", "Payload.$": "$" }, "Next": "PrepareShipment", "Retry": [{ "ErrorEquals": ["OutOfStock"], "IntervalSeconds": 3, "MaxAttempts": 2, "BackoffRate": 1 }], "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "RefundAndFail", "ResultPath": "$.error" }], "TimeoutSeconds": 300 }, "PrepareShipment": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Parameters": { "FunctionName": "prepare-shipment", "Payload.$": "$" }, "Next": "NotifyCustomer", "Retry": [{ "ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2 }], "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "RefundAndFail", "ResultPath": "$.error" }], "TimeoutSeconds": 300 }, "NotifyCustomer": { "Type": "Task", "Resource": "arn:aws:states:::sns:publish", "Parameters": { "TopicArn": "arn:aws:sns:...", "Message.$": "$.order" }, "Next": "Success", "TimeoutSeconds": 300 }, "Success": { "Type": "Succeed" }, "PaymentFailed": { "Type": "Pass", "Result": "Payment validation failed", "Next": "OrderFailed" }, "RefundAndFail": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Parameters": { "FunctionName": "refund-payment", "Payload.$": "$" }, "Next": "ReleaseInventory" }, "ReleaseInventory": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Parameters": { "FunctionName": "release-inventory", "Payload.$": "$" }, "Next": "OrderFailed" }, "OrderFailed": { "Type": "Fail", "Error": "OrderProcessingFailed", "Cause": "See execution history" } } } ``` (2) Key patterns: (a) Retry: 
exponential backoff. First failure waits 2 sec, second 4 sec, third 8 sec. Handles transient failures (Lambda cold start, DynamoDB throttle). (b) Timeout: each task max 300 sec (5 min). If a task runs longer, Step Functions raises States.Timeout and fails the task; compensation runs only if the task's Catch covers that error (a States.ALL catcher does). (c) Catch: when an error occurs, the next state is the compensation path (RefundAndFail). (3) Compensation logic: (a) On ReserveInventory failure (OutOfStock): skip PrepareShipment, jump to RefundAndFail. (b) RefundAndFail: call refund Lambda (returns money to customer). (c) ReleaseInventory: call release Lambda (put inventory back in stock). (d) OrderFailed: mark order as failed, notify customer (no money taken, no inventory consumed). (4) Idempotency safeguard: (a) Each Lambda (validate, reserve, refund, release) must be idempotent. Calling twice = same result as once. (b) Payment Lambda: check if order_id already charged. If yes, return success (don't charge again). (c) Inventory Lambda: check if order_id already reserved. If yes, return existing reservation. (5) Execution example: (a) Happy path: validate → reserve → prepare → notify → success (4-5 min). (b) Sad path (out of stock): validate → reserve [FAIL] → refund → release → fail (1-2 min). (6) Monitoring: (a) CloudWatch metric: Step Functions execution success rate. Goal: >99% (retries handle transient failures). (b) Track compensation invocations: how many refunds/releases per hour? Alert if >5% of orders fail at reserve step (indicates inventory issue). (c) Execution history: Step Functions Console shows every step, timing, errors. Invaluable for debugging.
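The idempotency safeguard in (4) can be sketched as a pure function. Here the "already charged" record is a plain dict for illustration; in production it would be a DynamoDB item written with a conditional PutItem (`attribute_not_exists(order_id)`). The names (`charge_payment`, `charged`) are illustrative, not from a real API:

```python
# Sketch: charge an order exactly once; repeat calls return the cached result.
# The `charged` dict stands in for a DynamoDB idempotency table.

def charge_payment(charged: dict, order_id: str, amount: int) -> dict:
    if order_id in charged:            # already charged -> don't charge again
        return charged[order_id]
    result = {"order_id": order_id, "amount": amount, "status": "charged"}
    charged[order_id] = result         # record before returning
    return result
```

A retried Lambda invocation then becomes harmless: the second call finds the record and returns it without touching the payment gateway.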
Follow-up: Payment is charged but reserve-inventory fails (out of stock). Refund is called but fails (payment gateway timeout). Order is stuck in limbo (money taken, no inventory reserved). How do you handle refund failure in the saga?
Your Step Functions workflow processes 1M orders/day. Each order takes 5 minutes (4 sequential Lambda calls). Currently, workflows are triggered by SQS. But if Lambda cold-starts, total latency balloons to 10 minutes. You're considering: (1) keep SQS + add provisioned concurrency, (2) switch to EventBridge + parallel Lambda, (3) use Step Functions Map state for batch processing. Which is best?
Optimize Step Functions throughput for 1M orders/day: (1) Current state: (a) SQS → Lambda (trigger) → Step Functions. (b) Step Functions orchestrates 4 sequential Lambdas (5 min total). (c) Problem: 1M orders/day = 11 orders/sec. If cold start = 5 sec per Lambda, 4 sequential cold starts add 20 sec, so total latency = 20 sec + 5 min logic = 5:20 min. (d) At high volume, cold starts are rare (Lambda instances stay warm). But at low volume (night hours), cold starts hurt. (2) Option 1 (provisioned concurrency): (a) Configure Lambda with 50 provisioned concurrency (always keep 50 warm instances). Cost at 1 GB memory: 50 × ~$0.015/hour ≈ $0.75/hour ≈ $550/month (rate varies by region and memory size). (b) Eliminates cold start penalty. Latency: 5 min + 1 sec execution = 5:01 min. (c) Pros: simple, minimal code change. Cons: fixed cost regardless of traffic. (3) Option 2 (EventBridge + parallel Lambda): (a) EventBridge fans events out to multiple Lambda targets in parallel. (b) Instead of sequential: Lambda1 → Lambda2 → Lambda3 → Lambda4. Do: Lambda1 || Lambda2 || Lambda3 (parallel where possible). (c) Problem: order processing is sequential (payment → reserve → ship → notify). Can't truly parallelize without dependency changes. (d) ROI: low for this use case. (4) Option 3 (Step Functions Map state): (a) Batch processing: receive 100 orders in array, process via Step Functions Map state. (b) Map state executes same state machine for each order in parallel (configurable concurrency). (c) Example: ```json { "Type": "Map", "ItemsPath": "$.orders", "MaxConcurrency": 10, "Iterator": { "StartAt": "ProcessOrder", "States": { "ProcessOrder": {...} } }, "End": true } ``` (d) Pros: process 100 orders with 10 parallel workers = ~10x throughput. Cons: requires batch ingestion (might add latency if waiting for batch to fill). (5) Recommendation for 1M orders/day: (a) Short-term: provisioned concurrency (~$550/month). Eliminates cold start. Latency = 5:01 min. (b) Long-term: redesign for parallelizable steps (payment + reservation in parallel if possible). 
(c) Even better: measure the bottleneck. If 5 min is split equally (1:15 min per step), all 4 are equally slow. Optimize each with standard Lambda tuning (lazy dependency imports, connection pooling). Could reduce 5 min to 2 min without architectural change. (d) If order volume forecast increases to 10M/day, use Map state with batching + queuing. (e) Cost-benefit: provisioned concurrency (~$550/month at 1 GB) removes the ~5 sec cold-start penalty per affected order; at 1M affected orders/month that is ~5M sec ≈ 83K person-min/month. At $0.50/min of customer time, value ≈ $41K/month — far above the cost. (6) Decision: provisioned concurrency for now (quick win). Re-evaluate after profiling Lambda execution times.
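The cost side of (e) is easy to get wrong by an order of magnitude, so it helps to compute it mechanically. Provisioned concurrency is billed per GB-hour of memory kept warm; the ~$0.015/hour-per-1-GB-instance rate below is an assumption (it varies by region), not a quote:

```python
# Back-of-envelope cost of keeping N Lambda instances warm for a month.
# rate_per_gb_hour is an assumed figure; check current regional pricing.

def provisioned_cost_per_month(instances: int, memory_gb: float = 1.0,
                               rate_per_gb_hour: float = 0.015,
                               hours: int = 730) -> float:
    return instances * memory_gb * rate_per_gb_hour * hours

# 50 warm 1-GB instances: provisioned_cost_per_month(50) -> 547.5 (~$550/month)
```

Running this before enabling provisioned concurrency avoids the surprise bill described in the follow-up.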
Follow-up: You enabled provisioned concurrency but bill jumped to $4K/month (unexpected). You're paying for 50 concurrent instances even when traffic is 0.5 orders/sec (1 instance needed). How do you optimize provisioned concurrency for variable load?
Your Step Functions workflow has a known issue: if a task returns within 299 seconds but network timeout occurs at exactly 300 seconds, the function might retry or continue incorrectly. You want to implement a robust timeout strategy for production. What's your approach?
Robust timeout strategy for Step Functions + Lambda: (1) Problem: a timeout enforced only at the task boundary (the 300 sec limit) is risky. The task might succeed at 299 sec while the timeout fires anyway. (2) Solution: implement timeouts at multiple layers: (a) Lambda level (inner timeout): each Lambda has a function timeout (e.g., 60 sec). (b) Step Functions task timeout (middle): 120 sec (2x Lambda timeout for safety). (c) Step Functions workflow timeout (outer): 3600 sec (1 hour max). (3) Implementation (heartbeats require the .waitForTaskToken integration, which passes a task token to the Lambda): ```json { "States": { "PaymentTask": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken", "Parameters": { "FunctionName": "process-payment", "Payload": { "input.$": "$", "taskToken.$": "$$.Task.Token" } }, "TimeoutSeconds": 120, "HeartbeatSeconds": 30, "Retry": [{ "ErrorEquals": ["States.TaskFailed", "States.Timeout"], "IntervalSeconds": 2, "MaxAttempts": 2, "BackoffRate": 2 }], "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "TaskFailed", "ResultPath": "$.timeoutError" }] } } } ``` (4) Heartbeat pattern (key for long-running tasks): (a) Heartbeat = Lambda sends a progress signal to Step Functions at least every 30 sec (HeartbeatSeconds). (b) If Lambda crashes and stops sending heartbeats, Step Functions raises the timeout within 30 sec (not waiting the full 120 sec). (c) Reduces detection latency significantly. (d) Lambda code:

```python
import json
import time

import boto3

sfn = boto3.client('stepfunctions')

def process_payment_with_heartbeat(event, context):
    # Task token is injected via "taskToken.$": "$$.Task.Token" in the state's Parameters
    task_token = event['taskToken']
    try:
        for i in range(10):  # 10-step process
            time.sleep(5)    # do work
            sfn.send_task_heartbeat(taskToken=task_token)  # signal progress every ~5 sec
        # With .waitForTaskToken, the task completes only when success is reported
        sfn.send_task_success(taskToken=task_token,
                              output=json.dumps({"status": "success"}))
    except Exception as e:
        sfn.send_task_failure(taskToken=task_token,
                              error=type(e).__name__, cause=str(e))
```

(5) Cascading timeouts (safety net): (a) Lambda timeout: 60 sec. If Step Functions sees no completion within the 120-sec task timeout → retries. (b) Retry logic: exponential backoff. First retry after 2 sec, second after 4 sec. After 2 retries, fail. (c) Total time budget: 60 + 2 + 60 + 4 + 60 = 186 sec (well under 300 sec). 
(d) This prevents runaway tasks from blocking the entire workflow. (6) Idempotency (critical for retries): (a) When a task is retried, the Lambda may run twice: the first invocation may have actually charged the payment even though Step Functions saw a timeout, and the retry would charge again. (b) Implement idempotency key: every Lambda stores request_id in DynamoDB. Before processing, check if request_id already processed. (c) If yes, return cached result (don't re-charge, don't re-process). (d) TTL on DynamoDB entry: 24 hours (retain idempotency for 1 day). (7) Monitoring: (a) CloudWatch metric: Lambda duration (99th percentile). If consistently approaching timeout, increase Lambda timeout. (b) Track retry rate: if >5% of executions retry, investigate Lambda performance (too slow) or external service (throttling). (c) Timeout alerts: alert if any Step Functions task hits timeout (might indicate infrastructure issue). (8) Cost-benefit: heartbeat adds ~5% overhead (extra API calls to send heartbeat). But reduces incident response from 5 min (noticing task hung) to 30 sec (heartbeat detected missing). ROI: clear for production systems.
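The cascading-timeout budget in (5c) can be checked mechanically rather than by hand: worst case is every attempt running to the Lambda timeout, with exponential-backoff waits between attempts. A minimal sketch, using the retry parameters above (IntervalSeconds=2, BackoffRate=2, MaxAttempts=2):

```python
# Worst-case wall-clock time for a task with retries:
# first attempt + (backoff wait + attempt) per retry.

def worst_case_budget(lambda_timeout: int, interval: int,
                      backoff: float, max_attempts: int) -> float:
    total = lambda_timeout            # first attempt runs to its timeout
    wait = interval
    for _ in range(max_attempts):     # each retry: wait, then run again
        total += wait + lambda_timeout
        wait *= backoff               # exponential backoff
    return total

# worst_case_budget(60, 2, 2, 2) -> 186, matching the 60 + 2 + 60 + 4 + 60 budget
```

Running this for any proposed timeout/retry combination confirms the total stays under the enclosing workflow budget before the config ships.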
Follow-up: Heartbeat is working, but Lambda is crashing after sending heartbeat (out of memory). Step Functions sees "heartbeat received, still alive" but Lambda is actually dead. Task times out after 120 sec. How do you detect Lambda crashes mid-heartbeat?
Your Step Functions workflow processes customer orders. Some orders are high-priority (VIP customers, >$10K). You want: high-priority orders complete in <2 min, standard orders in <5 min. Currently, all orders go through same workflow (no priority). How do you implement priority routing in Step Functions?
Priority-based routing in Step Functions: (1) Architecture: (a) Input enrichment: SQS message includes customer_tier (VIP or standard) + order_amount. (b) Step Functions Choice state: if customer_tier == VIP, take fast-path. Else, take standard-path. (c) Fast-path: skip optional steps (detailed fraud checks, shipment optimization). Focus on speed. (d) Standard-path: full process (fraud checks, optimization). (2) Workflow definition (timeouts live on the Task states; Parallel states don't support TimeoutSeconds): ```json { "StartAt": "CheckPriority", "States": { "CheckPriority": { "Type": "Choice", "Choices": [ { "Variable": "$.customer_tier", "StringEquals": "VIP", "Next": "VIPPath" }, { "Variable": "$.order_amount", "NumericGreaterThan": 10000, "Next": "VIPPath" } ], "Default": "StandardPath" }, "VIPPath": { "Type": "Parallel", "Branches": [ { "StartAt": "FastValidatePayment", "States": { "FastValidatePayment": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Parameters": { "FunctionName": "validate-payment-fast", "Payload.$": "$" }, "TimeoutSeconds": 20, "End": true } } }, { "StartAt": "FastReserveInventory", "States": { "FastReserveInventory": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Parameters": { "FunctionName": "reserve-inventory-fast", "Payload.$": "$" }, "TimeoutSeconds": 20, "End": true } } } ], "Next": "ShipmentVIP" }, "StandardPath": { "Type": "Parallel", "Branches": [ { "StartAt": "FullValidatePayment", "States": { "FullValidatePayment": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Parameters": { "FunctionName": "validate-payment-full", "Payload": { "fraud_check_level": "strict" } }, "TimeoutSeconds": 60, "End": true } } }, { "StartAt": "FraudAnalysis", "States": { "FraudAnalysis": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Parameters": { "FunctionName": "fraud-detection", "Payload.$": "$" }, "TimeoutSeconds": 45, "End": true } } } ], "Next": "ShipmentStandard" }, "ShipmentVIP": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Parameters": { "FunctionName": "expedite-shipment" }, 
"TimeoutSeconds": 30, "Next": "NotifyCustomer" }, "ShipmentStandard": { "Type": "Task", "Resource": "arn:aws:lambda:invoke", "Parameters": { "FunctionName": "standard-shipment" }, "TimeoutSeconds": 60, "Next": "NotifyCustomer" }, "NotifyCustomer": { "Type": "Task", "Resource": "arn:aws:sns:invoke", "Next": "Success" }, "Success": { "Type": "Succeed" } } } ``` (3) Key techniques: (a) Choice state: conditional routing based on customer_tier or order_amount. (b) Parallel branches: VIP path runs payment + inventory in parallel (10 + 20 sec = 20 sec total, vs sequential 30 sec). (c) Timeouts: VIP tasks have shorter timeouts (20-30 sec). Standard tasks 45-60 sec. Fails fast if service degraded. (d) Fast functions: fast-payment validates PCI only (10 sec). Full-payment also does velocity checks (60 sec). (4) Operational optimization: (a) VIP Lambdas provisioned concurrency: keep 10 warm. Cost: $4.50/month. (b) Standard Lambdas: on-demand (cold starts OK for 5-min SLA). (c) SLA monitoring: CloudWatch metric for P99 latency by tier. VIP: alert if >2 min. Standard: alert if >5 min. (5) Cost-benefit: (a) VIP customers pay premium (50% higher margin). Worth spending $5/month on provisioned concurrency for faster service. (b) Standard customers: save money by not provisioning. (c) Revenue impact: if VIP customers churn due to slow orders, cost >> $5/month provisioning. (6) Monitoring dashboard: (a) Step Functions execution duration by customer_tier. (b) Parallel branch timing (to identify which path is slower). (c) Error rates by path (if VIP path has 0.1% errors, adjust timeouts/retry). (7) Testing: load test both paths separately. VIP path should complete in 1-2 min (room for network/Lambda jitter). Standard path in 3-5 min.
Follow-up: VIP path is meeting 2-min SLA, but expedite-shipment Lambda is slow (takes 45 sec instead of expected 15 sec). No errors, just slow. How do you debug Lambda performance without touching production?
Your Step Functions workflow has a Map state processing 1M items (customer records, batch job). The Map iterates through each record, calls a Lambda to enrich it, stores result in S3. But Map state has a concurrency limit (~1000). After 1000 concurrent items, remaining queue waits. Total job takes 8 hours. How do you optimize throughput?
Optimize Step Functions Map state for batch processing (1M items): (1) Bottleneck analysis: (a) Step Functions Map state capped at ~1000 concurrent executions in this setup (a quota, not a fixed law). (b) Processing speed: if each Lambda takes 5 sec, throughput = 1000 concurrent ÷ 5 sec per task = 200 tasks/sec = 720K tasks/hour. (c) For 1M items at 200 tasks/sec = 5,000 sec = 1.4 hours theoretical. But in practice, cold starts, network jitter, etc. push it to ~8 hours. (d) Root cause: sequential batching. Process first 1000, wait for all to complete, process next 1000. (2) Solution: increase concurrency. (a) Step Functions Map state has `MaxConcurrency` parameter. Raise from 1000 to 2000+ (request a quota increase from AWS, or move to Distributed Map mode, which supports up to 10,000 concurrent child executions). (b) Example: ```json { "Type": "Map", "ItemsPath": "$.items", "MaxConcurrency": 2000, "Iterator": { "StartAt": "EnrichRecord", "States": { "EnrichRecord": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Parameters": { "FunctionName": "enrich-customer", "Payload.$": "$" }, "TimeoutSeconds": 30, "End": true } } } } ``` (c) Increase to 2000: throughput doubles. 8 hours → 4 hours. (3) Alternative: parallel Step Functions (if increased concurrency rejected): (a) Instead of 1 Map state with 1M items, spawn 10 Step Functions executions, each processing 100K items. (b) Use a Step Functions Parallel state (or a dispatcher Lambda) to kick off 10 sub-workflows in parallel. (c) Each processes 100K items within its own Map state. (d) Total time: time to process 100K items (1M ÷ 10) ≈ 50 min per executor, all 10 running in parallel = ~50 min total — roughly 10x faster than 8 hours. (4) Even better: distributed batch processing via SQS + Lambda pool: (a) Put 1M items in SQS (FIFO for ordering, if needed, or standard for speed). (b) Deploy Lambda fleet (100 concurrent Lambdas, provisioned). Each Lambda: pull from SQS, enrich record, push to S3, repeat. (c) Throughput: 100 Lambdas × 1 item/5 sec = 20 items/sec = 72K items/hour. 1M items = 14 hours (slower). BUT parallelizable across accounts/regions. 
(d) Advantage: Step Functions has account-level quotas on open executions; SQS can queue effectively unlimited items. (5) Recommended approach: (a) Request AWS to increase Step Functions Map concurrency to 2000+ (or adopt Distributed Map). (b) Use Parallel state to fan out multiple Step Functions (10x) for additional parallelism. (c) Combined: 2000 concurrency × 10 parallel executions = 20K theoretical parallelism. At ~5 sec per item, 1M items ÷ 20K in flight ≈ 50 waves × 5 sec ≈ 250 sec (~4 min) — versus 8 hours. (d) Cost: Step Functions ≈ $0.000025 per state transition; 1M items × ~2 transitions each ≈ 2M transitions ≈ $50 for the entire 1M-item batch. Cheap. (6) Monitoring: (a) Step Functions execution duration. (b) Lambda concurrency utilization (CloudWatch metric). (c) S3 write throughput (should be high, not throttled). (7) Timeline: implement Parallel state + Map increase = 1 week dev + testing. ROI: 8 hours reduced to well under 1 hour = 7+ hour speedup per run. If batch runs daily, saves 5+ hours/day = $20K/month productivity gain (rough estimate).
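The fan-out in (3) needs the 1M-item list split into N contiguous slices, one per sub-workflow execution. A minimal sketch of the dispatcher's chunking step (the actual StartExecution calls, omitted here, would pass each slice as that execution's input):

```python
# Split items into `parts` contiguous, near-equal slices for fan-out.
# Remainder items are spread one-per-slice across the first slices.

def split_into(items: list, parts: int) -> list:
    size, rem = divmod(len(items), parts)
    out, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < rem else 0)
        out.append(items[start:end])
        start = end
    return out
```

For 1M items and 10 sub-workflows this yields ten 100K-item slices, matching the 1M ÷ 10 partitioning described above.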
Follow-up: You increased concurrency to 2000 and parallelized. Job now runs in 50 min. But S3 write throughput is throttled (1K requests/sec cap). S3 writes are blocking, total time balloons to 3 hours. How do you optimize S3 writes?
Your Step Functions workflow manages complex order state transitions: pending → payment-processing → inventory-reserved → shipped → delivered. If an order gets stuck in payment-processing (Lambda crashes, doesn't move to inventory-reserved), it's orphaned. You want: auto-recovery for stuck orders. How do you detect and recover?
Detect and auto-recover stuck Step Functions executions: (1) Problem: Lambda crashes mid-execution. Order state = "payment-processing" but execution never reaches next state. No automatic recovery. (2) Detection mechanism: (a) EventBridge: a scheduled rule runs a sweeper Lambda every few minutes (Step Functions execution-status change events can also feed it). (b) Sweeper Lambda: query DynamoDB for all orders in "payment-processing" state. Check Step Functions: is the execution still running? (c) If execution not running + order stuck >1 hour → likely orphaned. (3) Implementation (IAM policy for the sweeper): ```json { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "states:ListExecutions", "states:DescribeExecution", "states:StartExecution" ], "Resource": "arn:aws:states:...:stateMachine:order-fulfillment" }, { "Effect": "Allow", "Action": [ "dynamodb:Query" ], "Resource": "arn:aws:dynamodb:...:table/orders" } ] } ``` Lambda function:

```python
import json
import time

import boto3

dynamodb = boto3.client('dynamodb')
sfn = boto3.client('stepfunctions')

def detect_stuck_orders(event, context):
    # Find orders stuck in payment-processing for more than 1 hour
    stuck_time = int((time.time() - 3600) * 1000)  # epoch millis, 1 hour ago
    response = dynamodb.query(
        TableName='orders',
        IndexName='status-timestamp',
        KeyConditionExpression='#status = :status AND #ts < :ts',
        ExpressionAttributeNames={'#status': 'status', '#ts': 'timestamp'},
        ExpressionAttributeValues={
            ':status': {'S': 'payment-processing'},  # client API needs typed values
            ':ts': {'N': str(stuck_time)},
        },
    )
    recovered = 0
    for order in response['Items']:
        execution_arn = order['execution_arn']['S']
        exec_info = sfn.describe_execution(executionArn=execution_arn)
        # If the execution died (failed, aborted, or timed out), auto-recover
        if exec_info['status'] in ('FAILED', 'ABORTED', 'TIMED_OUT'):
            sfn.start_execution(
                stateMachineArn=order['state_machine_arn']['S'],
                name=f"retry-{order['order_id']['S']}-{int(time.time())}",
                input=json.dumps({'order_id': order['order_id']['S']}),
            )
            dynamodb.update_item(
                TableName='orders',
                Key={'order_id': order['order_id']},
                UpdateExpression='SET #status = :status, #attempt = #attempt + :inc',
                ExpressionAttributeNames={'#status': 'status', '#attempt': 'retry_attempt'},
                ExpressionAttributeValues={':status': {'S': 'retrying'}, ':inc': {'N': '1'}},
            )
            recovered += 1
    return {"recovered": recovered}
```

(4) Execution flow: (a) EventBridge rule: trigger every 5 minutes. (b) Lambda runs query: find stuck orders. (c) For each stuck order: check execution status. If FAILED, start new execution. (d) Update order status to "retrying". (e) Notify ops team: "5 stuck orders recovered". (5) Retry strategy: (a) Max retries: 3. After 3rd retry, escalate to manual review (create Jira ticket). (b) Backoff: 1st retry after 1 hour, 2nd after 2 hours, 3rd after 4 hours. (c) Track retry_attempt in order DynamoDB table. If retry_attempt >= 3, stop auto-recovery. (6) Monitoring: (a) CloudWatch metric: stuck orders per hour. Alert if >5 (indicates systemic issue, not transient). (b) Retry success rate: % of stuck orders that successfully complete after retry. Goal: >90%. (c) Manual review queue: orders requiring escalation. Alert ops team. (7) Cost: (a) EventBridge scheduled rule: effectively free at this volume. (b) Lambda: one run every 5 min = 288 invocations/day × ~0.2 sec each → well under $1/month. (c) Total: negligible. (8) Alternative: use Step Functions Execution History API + CloudTrail to audit stuck orders (more detailed, but complex).
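The retry strategy in (5) — retry after 1 h, 2 h, 4 h, then escalate — can be a pure function the sweeper calls with the order's retry_attempt counter. A sketch (the function name is illustrative):

```python
# Escalation backoff for stuck-order recovery: doubling delay per attempt,
# None after max_retries means "stop auto-recovery, open a manual-review ticket".

def next_retry_delay_seconds(retry_attempt, max_retries=3):
    if retry_attempt >= max_retries:
        return None                        # escalate to manual review
    return 3600 * (2 ** retry_attempt)     # 1 h, 2 h, 4 h
```

Keeping the schedule as data like this makes the "alert if retry_attempt >= 3" monitoring rule and the recovery Lambda agree by construction.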
Follow-up: Auto-recovery worked, stuck order retried and succeeded. But customer was charged twice (original + retry). Idempotency key wasn't preserved across retry. How do you ensure idempotent retries?