Your AWS bill jumped from $15K/month to $21K (40% spike). Finance wants answers in 24 hours. You log into Cost Explorer but see only high-level views (EC2, RDS, etc.). Walk me through your investigation step-by-step: what data do you pull, which services do you suspect, and how do you isolate the culprit?
Production investigation flow:
(1) Cost Explorer → last 3 months, daily granularity, grouped by "Service", sorted descending. Identify the top 3 services by delta and the day the spike started. If it's EC2 (typical), drill deeper.
(2) EC2 Console → Instances, sort by launch time, filter to the last 30 days, and count new instances. A handful of forgotten GPU instances is a common culprit: 5 p3.2xlarge at ~$3.06/hr on-demand is ~$11K/month if left running 24/7. Check whether they're dev/test instances that were never stopped.
(3) CloudTrail → filter "RunInstances" events over the last 30 days. Identify the IAM user/role that launched the expensive instances and contact them immediately.
(4) Cost allocation tags: if your org tags by environment/project, filter Cost Explorer by tag. If tag coverage is under ~60%, that blind spot is the real problem.
(5) AWS Trusted Advisor → look for idle RDS instances, unattached EBS volumes, underutilized EC2. Typical quick wins: $500-2K/month from deleting orphaned resources.
(6) CloudWatch → check whether a Lambda/ECS scaling event triggered. If Lambda invocations jumped 10x, that's your spike source.
(7) S3: if it spiked, check data-transfer metrics (egress to the internet is ~$0.09/GB; cross-region transfer adds up; S3-to-CloudFront origin transfer is free, though CloudFront's own internet egress starts around $0.085/GB). Rough per-account size estimate: `aws s3api list-buckets --query 'Buckets[].Name' --output text | tr '\t' '\n' | xargs -I {} aws s3api list-objects-v2 --bucket {} --query 'Contents[].Size' --output text | tr '\t' '\n' | awk '{sum+=$1} END {print sum/1024/1024/1024 " GiB"}'` — note that list-objects-v2 returns at most 1,000 keys per call without pagination, so this undercounts large buckets; CloudWatch's BucketSizeBytes metric is the reliable source.
(8) Reservation analysis: if you're on-demand only, Reserved Instances or Savings Plans would cut 30-60% of steady-state cost.
Document findings in a spreadsheet with service, delta, cause, and recommended action.
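To sanity-check step (2) during the investigation, a quick back-of-the-envelope model (rates are illustrative on-demand prices, not quoted AWS numbers) shows how fast forgotten GPU instances dominate a delta:

```python
# Rough monthly cost of instances accidentally left running 24/7.
# Hourly rates are illustrative on-demand prices; check the current price list.
HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate: float, count: int, hours: int = HOURS_PER_MONTH) -> float:
    """Monthly on-demand cost for `count` instances at `hourly_rate` $/hr."""
    return hourly_rate * count * hours

p3_delta = monthly_cost(3.06, 5)    # 5 forgotten p3.2xlarge (GPU)
m5_delta = monthly_cost(0.096, 5)   # 5 forgotten m5.large, for contrast
print(f"5x p3.2xlarge: ${p3_delta:,.0f}/month")  # ~$11,169
print(f"5x m5.large:   ${m5_delta:,.0f}/month")  # ~$350
```

Five GPU instances alone can exceed the entire $6K delta, which is why step (2) sorts by launch time before anything else.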
Follow-up: You found 8 p3.2xlarge instances (dev/test, launched by intern). But removing them only explains $5K of the $6K delta. Where's the other $1K?
Your company runs 100 microservices on ECS/Fargate. Compute costs are $80K/month (steady state). Your cloud architect says: "Buy 3-year Reserved Instances for baseline, Savings Plans for variable load." Your team is skeptical: "What if we need to scale down or migrate?" How do you make the reservation decision with 70% confidence in 3-year stability?
Decision framework:
(1) Baseline analysis: pull 12 months of CloudWatch metrics and take the P10 (10th percentile) of CPU/memory usage across all services. That "always-on" floor is typically 40-50% of current capacity, i.e. $32-40K/month of the $80K at on-demand rates. Note: since the fleet runs on ECS/Fargate, EC2 Reserved Instances only cover EC2-backed capacity; for Fargate the equivalent commitment is a Compute Savings Plan at the same 1- or 3-year terms.
(2) Commit to the baseline: a 3-year, all-upfront commitment at up to ~70% discount on $32-40K/month of on-demand spend saves roughly $22-28K/month on that slice. Three years of on-demand baseline is $1.15-1.44M; the committed equivalent is roughly $350-430K upfront.
(3) Savings Plans (1-year) for the predictable slice of the variable load (~20% of capacity, ~$16K/month): at a ~30-40% discount that's another $5-6K/month, with flexibility if workloads shift.
(4) Combined: roughly $27-34K/month saved versus on-demand, on the order of $1M over the 3-year term. The real question is not whether the math works but whether the workload survives 3 years.
(5) Risk mitigation: (a) if the company might pivot infrastructure (e.g., hybrid cloud in 18 months), buy 1-year commitments instead: a smaller discount (~50% vs ~70%) but an exit every 12 months. (b) If a hiring surge could change the instance mix, prefer size-flexible RIs within a family, convertible RIs across families, or Compute Savings Plans over instance-type-specific standard RIs. (c) If the product is B2B with customer-specific peaks, Compute Savings Plans (which cover any instance type, any region, and Fargate/Lambda) are the safer commitment.
(6) Recommendation at 70% confidence: split the baseline — 50% on 3-year commitments (safest), 30% on 1-year (flexibility), 20% on-demand (spikes). That yields roughly a 50% aggregate discount on the baseline instead of the optimal ~70%, but hedges the 3-year-stability risk.
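The 50/30/20 split can be sanity-checked numerically. A minimal sketch, assuming illustrative discounts of 70% (3-year) and 50% (1-year):

```python
# Aggregate discount on the baseline for a 50/30/20 commitment split.
# Discount rates and the baseline figure are illustrative assumptions.
SPLIT = {"3yr_commit": 0.50, "1yr_commit": 0.30, "on_demand": 0.20}
DISCOUNT = {"3yr_commit": 0.70, "1yr_commit": 0.50, "on_demand": 0.0}

baseline = 36_000  # $/month, midpoint of the $32-40K baseline estimate

aggregate_discount = sum(SPLIT[k] * DISCOUNT[k] for k in SPLIT)
effective_cost = baseline * (1 - aggregate_discount)
print(f"aggregate discount: {aggregate_discount:.0%}")            # 50%
print(f"baseline cost after split: ${effective_cost:,.0f}/month")
```

Swapping in different term discounts or split ratios makes it easy to show stakeholders what each risk posture costs in foregone discount.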
Follow-up: You committed to 3-year RIs but 6 months later, a major customer churns and load drops 30%. You're now paying for 30% unused capacity. How do you recover?
Your company's AWS bill is $200K/month. Finance says: "Save 15% ($30K/month) within 6 months or we move to Azure." You audit and find: (1) 12 RDS databases (multi-AZ), of which only 2 are production; (2) 50 unattached EBS volumes accumulating monthly; (3) 200 Lambda functions, 150 never invoked in 30 days; (4) 8TB of old S3 logs at $0.023/GB (not in Glacier). Prioritize savings opportunities. Which do you tackle first?
Prioritization by ROI and implementation risk:
(1) Multi-AZ RDS downgrade: Multi-AZ roughly doubles instance cost, so converting the 10 non-production databases to single-AZ cuts their instance spend about 50% (not 95%). If those 10 run $8-12K/month, that's $4-6K/month saved; stopping dev/staging databases outside business hours saves more. Risk: low if those environments already tolerate downtime.
(2) Delete the 150 never-invoked Lambda functions: idle functions cost almost nothing (you pay per invocation), so savings are minimal (~$50-100/month at most, e.g. log storage or provisioned concurrency), but it's zero-risk hygiene.
(3) Delete unattached EBS volumes: at gp3's ~$0.08/GB-month, 50 volumes averaging 50-100GB is roughly $200-400/month. Risk: near zero; snapshot before deleting. Savings: immediate.
(4) Move old S3 logs to Glacier: S3 storage is billed per GB-month, so 8TB in Standard is 8,192GB × $0.023 ≈ $188/month (not $5,520 — there is no ×30-days factor). Glacier Flexible Retrieval (~$0.0036/GB-month) is ~$30/month and Deep Archive (~$0.001/GB-month) ~$8/month, so archiving saves ~$160-180/month plus retrieval fees when logs are pulled. Worth doing, but it will not carry the target. Risk: medium (retrieval latency and cost if logs are needed for audit).
(5) Savings Plans: the item that actually hits the target. If most of the $200K/month is on-demand compute, a 1-year Compute Savings Plan at a 25-30% discount on, say, $80-100K/month of steady compute is $20-30K/month. Risk: low if the workload is predictable.
Cumulative: $4-6K (RDS) + ~$0.5K (Lambda/EBS/S3) + $20-30K (Savings Plans) ≈ $25-36K/month, which clears the 15% ($30K) target. Timeline: ~6 weeks (RDS conversion 2 weeks, S3 lifecycle rules 1 week, Savings Plan analysis/commitment 2 weeks, testing 1 week).
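The storage-class comparison in item (4) is worth getting right in an interview, because the per-GB-month billing basis is a common trip-up. A sketch with illustrative us-east-1 rates:

```python
# S3 storage is billed per GB-month (no x30-days factor): compare classes for 8 TB.
# Per-GB rates are illustrative us-east-1 prices; retrieval fees not modeled.
GB = 8 * 1024  # 8 TB expressed in GB

RATES = {  # $/GB-month
    "standard": 0.023,
    "glacier_flexible": 0.0036,
    "glacier_deep_archive": 0.00099,
}

costs = {cls: GB * rate for cls, rate in RATES.items()}
for cls, cost in costs.items():
    print(f"{cls:>22}: ${cost:,.2f}/month")
```

At these rates Standard is ~$188/month and Deep Archive ~$8/month, so archiving is a hygiene win on the order of $180/month, not thousands.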
Follow-up: You moved logs to Glacier but compliance audit now requires 30-day availability. Your retrieval costs spike. How do you rebalance?
Your organization has 50 AWS accounts (dev, staging, prod, data, analytics, etc.). Each account autonomously buys Reserved Instances, leading to massive waste: some accounts buy 3-year RIs they never fully utilize, others run entirely on-demand because they don't know about RIs. How do you centralize purchasing and save 25%?
Implement AWS Organizations + Consolidated Billing + centralized commitment purchasing:
(1) Consolidate billing: with Consolidated Billing in Organizations, RI and Savings Plans discounts are shared across all linked accounts by default. Example: if Account-A buys a 3-year RI for 50 m5.large but only runs 30, the discount floats to matching usage in Account-B. Far less RI waste; pooling alone typically saves 5-10% immediately.
(2) Centralized purchasing: move RI/Savings Plans buying from individual accounts to a central FinOps/finance team. Approach: (a) baseline capacity per account = CloudWatch P10 over 12 months; (b) sum baselines across all 50 accounts; (c) buy 3-year commitments for the summed baseline at the organization level (~60-70% discount). Unneeded legacy standard RIs can be sold on the RI Marketplace during the transition.
(3) Savings Plans (1-year) for variable capacity: Compute Savings Plans at the organization level (~30-40% discount, flexible across instance types, regions, and Fargate/Lambda).
(4) Monitoring: build a Lambda + CloudWatch dashboard that queries Cost Explorer by account/service/day. Auto-alert when organization-wide reserved-capacity utilization drops below ~60% (flag for investigation).
(5) Showback: report each account's share of the pooled commitments monthly (showback rather than hard chargeback, to encourage usage without blocking teams). Example: "Account-A: RI share $5K/month; on-demand usage $500/month; total $5.5K/month." Visibility drives optimization.
Result: baseline costs drop 25-30% via centralized commitments, plus bonus savings from accountability. Implementation: 4-6 weeks (Org setup, commitment re-purchasing, dashboard, training).
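The showback step (5) is just proportional allocation. A minimal sketch with hypothetical account names and usage figures:

```python
# Showback sketch: allocate a pooled monthly commitment cost to accounts
# in proportion to the covered usage each one consumed.
# Account names, usage hours, and the pooled cost are hypothetical.
def showback(pooled_cost: float, usage_hours: dict) -> dict:
    """Split `pooled_cost` across accounts proportional to their usage hours."""
    total = sum(usage_hours.values())
    return {acct: pooled_cost * hours / total for acct, hours in usage_hours.items()}

bills = showback(50_000, {"prod": 6_000, "staging": 2_500, "dev": 1_500})
for acct, share in bills.items():
    print(f"{acct}: ${share:,.0f}/month")
```

In practice the usage numbers would come from Cost Explorer's per-account amortized-cost view; the point of the sketch is that each account's share tracks what it actually consumed, which is what makes the showback defensible in disputes.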
Follow-up: One account (engineering) complains their RI share is too high because product team launched a new feature with 5x compute demand. How do you handle inter-account disputes over RI allocation?
Your data team runs daily Spark jobs on EC2 (self-managed cluster). Jobs take 4 hours and cost about $1.2K/month across EC2, network, and EBS. You're considering: (1) AWS EMR (managed Spark), (2) Glue (serverless ETL), (3) keep EC2 but buy spot instances. Your data team says "We control the cluster and don't want to rewrite queries." How do you justify migration with hard ROI numbers?
Build a cost model for each option:
(1) Current self-managed EC2 (assuming the cluster exists only for the 4-hour job): 20 × m5.2xlarge at $0.384/hr × 4h = $30.72/day, plus an m5.4xlarge driver at $0.768/hr × 4h = $3.07/day, plus 500GB gp3 (~$0.08/GB-month ≈ $40/month ≈ $1.3/day) and ~100GB/month egress at $0.09/GB (~$0.3/day) ≈ $35/day ≈ $1.05K/month, which roughly matches the claimed $1.2K. (If the cluster actually runs 24/7, costs are ~6x higher and the first fix is scheduling, not migration.)
(2) EMR (managed): the EMR surcharge is billed per instance-hour, roughly 25% of the EC2 price for the m5 family ($0.096/hr per m5.2xlarge, $0.192/hr for the m5.4xlarge driver): (20 × 0.096 + 0.192) × 4 ≈ $8.4/day. Total ≈ $43/day ≈ $1.3K/month. Verdict: ~25% more than self-managed on-demand, in exchange for managed provisioning and failover.
(3) EMR + Spot: m5.2xlarge Spot averages ~$0.12/hr (~70% off): workers 20 × 0.12 × 4 = $9.6/day, driver kept on-demand to survive interruptions ($3.1/day), plus the EMR fee ($8.4/day) ≈ $21/day ≈ $0.64K/month. Verdict: roughly 40% cheaper than today with no query rewrite, at the cost of handling Spot interruptions (Spark task retries absorb most of this for batch jobs).
(4) Glue (serverless): billed per DPU-hour, where 1 standard DPU = 4 vCPUs + 16GB (not 0.25 vCore). The current cluster is ~160 vCPUs ≈ 40 DPUs; at full equivalence, 40 DPU × 4h × $0.44 ≈ $70/day, i.e. more expensive than EC2. Glue only wins if the self-managed cluster is overprovisioned: a job that fits in ~10 DPUs is ~$18/day ≈ $0.55K/month plus catalog/crawler costs. Glue runs standard Spark, so the effort is porting job setup and dependencies, not rewriting every query.
(5) Honest assessment: EMR + Spot is the low-friction win (~40% savings, team keeps cluster control, no rewrite). Glue is worth a 1-week POC on one job to measure its real DPU needs; commit only if the measured DPU count makes it cheaper. Recommendation: move to EMR + Spot now, POC Glue in parallel.
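The four options reduce to one small daily-cost model. A sketch with the illustrative rates above (the 10-DPU Glue figure is an assumption about how small the job could run, not a measurement):

```python
# Daily cost model for a 4-hour Spark job under four deployment options.
# All rates are illustrative; EMR surcharge is per instance-hour, Glue per DPU-hour.
HOURS = 4
WORKERS, WORKER_OD, WORKER_SPOT = 20, 0.384, 0.12   # m5.2xlarge $/hr, on-demand / Spot
DRIVER_OD = 0.768                                   # m5.4xlarge driver $/hr
EMR_FEE_WORKER, EMR_FEE_DRIVER = 0.096, 0.192       # EMR surcharge $/instance-hr
GLUE_DPU_RATE, GLUE_DPUS = 0.44, 10                 # assumes the job fits in 10 DPUs

ec2 = (WORKERS * WORKER_OD + DRIVER_OD) * HOURS
emr_fee = (WORKERS * EMR_FEE_WORKER + EMR_FEE_DRIVER) * HOURS
emr = ec2 + emr_fee                                  # on-demand EC2 + EMR surcharge
emr_spot = (WORKERS * WORKER_SPOT + DRIVER_OD) * HOURS + emr_fee  # Spot workers, OD driver
glue = GLUE_DPUS * GLUE_DPU_RATE * HOURS

for name, cost in [("EC2 self-managed", ec2), ("EMR on-demand", emr),
                   ("EMR + Spot", emr_spot), ("Glue (10 DPU)", glue)]:
    print(f"{name:>17}: ${cost:6.2f}/day")
```

Rerunning the model with the DPU count measured in a Glue POC (instead of the assumed 10) is exactly the hard-ROI artifact to bring back to the data team.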
Follow-up: Glue job runs fine in dev but fails in prod due to OutOfMemory. Scaling Glue workers is manual and hits budget threshold. How do you debug and fix?
Your startup has 3 AWS accounts: dev (1 account), prod (1 account), and data (1 account). You currently spend $5K/month. Your CFO asks: "Can we get enterprise volume discounts?" Your current AWS rep says: "You'd need $100K+ annual commitment to unlock better rates." You have ARR of $2M but AWS is only 2.5% of costs. Is enterprise discount worth pursuing? How do you negotiate?
Enterprise discount math:
(1) Current spend: $5K/month = $60K/year. As a rough illustration of how volume pricing scales (these are illustrative tiers, not published AWS rates): <$5K/month (~0%), $5-10K (~5%), $10-25K (~10%), $25-50K (~15%), $50K+ (20-30%). You're at the bottom tier.
(2) The rep's "$100K+ commitment" is the entry point for a private-pricing conversation, not an automatic discount; formal Enterprise Discount Program agreements typically start far higher (on the order of $1M+/year). At $60K/year, not worth pursuing.
(3) True savings path today: AWS Organizations + Consolidated Billing (automatic, no negotiation) unlocks RI/Savings Plans pooling (5-10% efficiency) plus Savings Plans on committed capacity (25-30%). Realistic: a ~10% blended discount on $60K = ~$6K/year saved, entirely DIY.
(4) Negotiation lever: if you forecast 3x growth (ARR $2M → $6M) AND can commit to tripling AWS spend to $15K+/month within 18 months, an account manager might offer: (a) a modest volume discount on on-demand (rare), (b) better Savings Plan rates, or (c) ~5% in AWS credits (worth roughly 3% as a true discount, since credits expire and carry service restrictions).
(5) ROI calculation: negotiation effort (~20 hours at $200/hr engineering time plus management time) ≈ $4K. Expected savings: $6-9K/year, and only if you actually hit the committed spend growth. Break-even around 12 months in the growth scenario.
(6) Verdict: negotiate only if (a) you forecast 2x+ cloud-spend growth in the next 18 months, (b) you have an executive sponsor (CFO), and (c) you can commit to spend targets. Otherwise, optimize via Savings Plans alone (DIY, no negotiation).
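The go/no-go in (5)-(6) is an expected-value calculation. A sketch with illustrative inputs (effort cost, discount size, and the success probability are all assumptions):

```python
# Expected first-year net gain from pursuing a negotiated discount.
# All inputs are illustrative assumptions, not AWS figures.
def negotiation_roi(effort_cost: float, annual_spend: float,
                    extra_discount: float, p_success: float) -> float:
    """Expected savings from the extra discount, minus the cost of negotiating."""
    return p_success * annual_spend * extra_discount - effort_cost

today = negotiation_roi(effort_cost=4_000, annual_spend=60_000,
                        extra_discount=0.10, p_success=0.5)
grown = negotiation_roi(effort_cost=4_000, annual_spend=180_000,
                        extra_discount=0.10, p_success=0.5)
print(f"at current spend:   ${today:,.0f}")   # negative: don't bother
print(f"if spend triples:   ${grown:,.0f}")   # positive: worth the meeting
```

At current spend the expected value is negative, which is the quantitative version of the verdict: the negotiation only pays if the growth forecast is real.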
Follow-up: You committed to $150K annual AWS spend, and product launches 3 months early. Actual spend is tracking $12K/month ≈ $144K projected, just short of the commit. Meanwhile a new service is overprovisioned and could save $2K/month, which would widen the shortfall. Do you optimize it anyway, or leave it running to hit the commit?
Your organization uses AWS in 4 regions (us-east-1, eu-west-1, ap-southeast-1, ca-central-1). Each region has separate billing, and AWS Cost Explorer shows 20% cost variance between regions for equivalent workloads. Your CTO suspects pricing differences or inefficient resource placement. How do you investigate and standardize costs across regions?
Regional cost variance is usually driven by:
(1) On-demand price differences: m5.large is $0.096/hr in us-east-1 versus roughly $0.107/hr in eu-west-1 (~11% higher), ~$0.12/hr in ap-southeast-1 (~25% higher), and ~$0.107/hr in ca-central-1 (check the current price list). A 5-15% variance for equivalent workloads is natural pricing, not waste.
(2) Reserved Instance discounts applied unevenly: if you bought 3-year RIs in us-east-1 (~60-70% discount) but run eu-west-1 entirely on-demand, the EU bill for an equivalent workload can be 2-3x the discounted us-east-1 bill, so uneven coverage easily explains a 20% aggregate variance. Fix: match RI/Savings Plans coverage to each region's baseline.
(3) Data transfer costs: if eu-west-1 pushes data to us-east-1 (cross-region transfer at $0.02/GB), that cost is easy to miss. Check CloudWatch network metrics and the data-transfer line items in Cost Explorer.
(4) Inefficient resource sizing: run AWS Compute Optimizer in every region, e.g. `aws compute-optimizer get-ec2-instance-recommendations --region eu-west-1`, compare rightsizing recommendations across regions, and apply one sizing policy everywhere.
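Separating natural price variance from waste starts with normalizing each region against a reference. A sketch using the illustrative m5.large rates above (pull live numbers from the AWS Pricing API for a real report):

```python
# On-demand price variance for m5.large relative to us-east-1.
# Rates are illustrative; fetch current prices via the AWS Pricing API.
PRICES = {  # $/hr, m5.large on-demand
    "us-east-1": 0.096,
    "eu-west-1": 0.107,
    "ap-southeast-1": 0.120,
    "ca-central-1": 0.107,
}

base = PRICES["us-east-1"]
variance = {region: price / base - 1 for region, price in PRICES.items()}
for region, v in variance.items():
    print(f"{region:>15}: {v:+.1%} vs us-east-1")
```

Any observed cost gap beyond these list-price deltas is the part worth investigating (uneven RI coverage, transfer, or sizing), which is how you answer the CTO's "pricing or placement?" question with data.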
Follow-up: After standardization, you discover eu-west-1 is still 12% more expensive due to higher data transfer costs. AWS bills show $150K/month transfer. How do you architect around expensive cross-region data movement?