AWS Interview Questions

Security Hub, GuardDuty, and Threat Detection


GuardDuty flagged an EC2 instance (prod-api-7) communicating with 198.51.100.0/24 on port 443, a range identified as known C2 (Command & Control) infrastructure. The instance is a production API server. The security team alerts you immediately. You have 5 minutes to decide: isolate the instance, investigate first, or dismiss it as a false positive? Walk through your response.

Production incident response for C2 communication (5-minute decision window):

(1) Immediate action (within 60 sec):
    (a) Enable VPC Flow Logs on the instance's ENI to capture the full traffic stream (do this first, not after isolation).
    (b) Snapshot the instance's EBS volumes (forensics, and in case you need to restore).
    (c) Do NOT terminate or isolate yet (you might destroy evidence).

(2) Investigation (minutes 2-3):
    (a) Query the GuardDuty finding details: finding type, severity, confidence. Confidence >80% = likely real; <50% = likely false positive.
    (b) Check VPC Flow Logs for the specific communication:
        (i) Flow direction (outbound from prod-api-7?).
        (ii) Byte count (bulk data exfiltration = suspicious; small keep-alives = possibly legitimate).
        (iii) Destination port (443 = HTTPS, which could be a legitimate SaaS call).
    (c) Check Systems Manager Session Manager logs: when was the instance accessed, and by whom? This bounds the compromise window.
    (d) Check the IAM role attached to the instance: has it made unusual API calls (ListBuckets, DescribeInstances)? Use CloudTrail.

(3) Decision point (minutes 4-5):
    (a) High confidence + outbound bulk data + unusual IAM calls = ISOLATE:
        (i) Use AWS Systems Manager Automation to isolate the instance (swap its SG for one allowing only inbound SSH from the bastion).
        (ii) Redirect traffic to a standby instance (the load balancer flips to the backup).
        (iii) Alert the incident response team.
    (b) Low confidence + small traffic + normal IAM activity = INVESTIGATE without isolation:
        (i) Keep the instance running but monitored.
        (ii) Block 198.51.100.0/24 with a network ACL deny rule (security groups are allow-only and cannot block a destination) to prevent further C2 communication.
        (iii) Schedule forensic analysis within 1 hour (after the incident commander confirms).

(4) Post-isolation (minutes 5-10):
    (a) Build the incident timeline: when did the compromise occur? Check CloudTrail for anomalous actions in the past 24 hours.
    (b) Affected scope: are other instances communicating with the same C2 destination? Query all GuardDuty findings for that destination.
    (c) Root cause: how did the attacker gain access? Check CloudTrail for IAM key compromise, SSH brute force, or an exploited vulnerability.

(5) Communication: notify immediately (in parallel):
    (a) Incident commander (organizes the response).
    (b) Customers (if data was accessed).
    (c) Executive team (the board may need notification if this is a data breach).
    (d) AWS Support (create a P1 ticket for forensics guidance).

(6) Automation: this entire flow should be triggerable via Lambda (GuardDuty → SNS → Lambda → auto-isolate if confidence >80%). Testing: run a monthly drill with simulated C2 findings.
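The evidence-preservation and decision steps above can be sketched with boto3. A minimal sketch, assuming configured AWS credentials; the region default and the 10 MB bulk-transfer cutoff are illustrative assumptions, not from the runbook:

```python
def preserve_evidence(instance_id, region="us-east-1"):
    """Snapshot every EBS volume attached to the instance BEFORE isolating,
    so forensic evidence survives any later termination or rebuild."""
    import boto3  # imported lazily so the pure decision helper below is testable offline
    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    volume_ids = [
        bdm["Ebs"]["VolumeId"]
        for r in reservations
        for inst in r["Instances"]
        for bdm in inst.get("BlockDeviceMappings", [])
        if "Ebs" in bdm
    ]
    return [
        ec2.create_snapshot(
            VolumeId=v, Description=f"forensics-{instance_id}"
        )["SnapshotId"]
        for v in volume_ids
    ]

def should_isolate(confidence, outbound_bytes, unusual_iam_calls,
                   bulk_threshold=10_000_000):
    """Decision rule from step (3): high confidence + bulk outbound data +
    unusual IAM activity means isolate; otherwise investigate first."""
    return confidence > 80 and outbound_bytes > bulk_threshold and unusual_iam_calls
```

The same `should_isolate` predicate can later gate the Lambda automation in step (6), keeping the human runbook and the automated path consistent.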

Follow-up: After isolation, forensics shows the attacker stole AWS credentials from the instance and used them to describe and read 500 RDS backups containing customer PII. The breach scope is massive. How do you contain and audit it?

Your organization enabled GuardDuty 3 months ago. You now have 2,000 active findings (high/medium severity). Many are false positives: EC2 port scans from intentional pen testing, Lambda functions calling third-party APIs flagged as suspicious, etc. Your team is drowning in alerts. How do you prioritize without missing real threats?

Reduce GuardDuty noise via suppression rules + threat prioritization:

(1) Diagnose finding types:
    (a) Export all 2,000 findings to CSV: finding type, severity, instance/service, description.
    (b) Identify the top 10 finding types by frequency. Likely candidates: port-probe/recon findings (port scanning), cryptocurrency-mining findings, and unusual-network-traffic findings that are false positives for dev/test.
    (c) Estimate the false-positive rate: spot-check 50 findings. If 40/50 are false positives, you're at 80% noise.

(2) Suppression rules (GuardDuty, surfaced via the Security Hub integration):
    (a) Create filters to suppress known false positives:
        (i) Instances tagged environment=test or purpose=pentesting → suppress their findings.
        (ii) RDS instances in the dev environment → suppress unusual-network-traffic findings.
        (iii) Lambda functions calling known third-party APIs (Stripe, Twilio) → suppress cryptocurrency findings.
    (b) Implement ~50 rules targeting the top false-positive types (Security Hub supports up to 100 such rules). Result: 2,000 findings → ~200 actionable findings (a 90% reduction).

(3) Severity-based triage:
    (a) Critical findings: act immediately; create a PagerDuty alert.
    (b) High findings: reviewed daily, not immediately.
    (c) Medium findings: reviewed weekly.
    (d) Low findings: monthly review (usually archived).

(4) Automation via Lambda:
    (a) GuardDuty finding → SNS → Lambda.
    (b) The Lambda reads the finding and checks the suppression list; if suppressed, auto-archive it in Security Hub.
    (c) If actionable (critical and non-suppressed): create a Jira ticket and page on-call.
    (d) For certain findings (unencrypted S3 bucket, public RDS), auto-remediate (enable encryption, restrict the SG).

(5) Tuning process:
    (a) Weekly review: take the actionable findings from the past week. Were they real threats or still false positives? Adjust the suppression rules.
    (b) Track metrics: findings reviewed, findings actioned, findings auto-remediated. Goal: <20 actionable per day (reviewable by one person in 2 hours).

(6) Implementation: 2 weeks (suppression rules, Lambda, Jira integration, metrics dashboard). Cost: negligible (CloudWatch + Lambda <$10/month). Result: the incident-response load drops from 2,000 findings/week to 40/week (a 50x reduction).
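The tag-based suppressions above map onto GuardDuty's CreateFilter API, where Action=ARCHIVE auto-archives matching findings. A minimal sketch; the detector ID, rule name, and tag values are placeholders:

```python
def suppression_criteria(tag_key, tag_value):
    """Build GuardDuty FindingCriteria matching EC2 instances carrying a
    given tag (e.g. environment=test), so their findings get archived."""
    return {
        "Criterion": {
            "resource.instanceDetails.tags.key": {"Equals": [tag_key]},
            "resource.instanceDetails.tags.value": {"Equals": [tag_value]},
        }
    }

def create_suppression_rule(detector_id, name, tag_key, tag_value):
    """Create one suppression rule: new findings matching the criteria are
    auto-archived, keeping them out of the active queue."""
    import boto3  # lazy import keeps suppression_criteria testable offline
    guardduty = boto3.client("guardduty")
    return guardduty.create_filter(
        DetectorId=detector_id,
        Name=name,
        Action="ARCHIVE",
        FindingCriteria=suppression_criteria(tag_key, tag_value),
    )
```

One such call per top false-positive type, managed in Terraform/CDK rather than by hand, keeps the rule set reviewable during the weekly tuning pass.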

Follow-up: You suppressed 1,800 findings. A week later, a developer's laptop is compromised. The attacker uses the stolen AWS credentials to describe instances. GuardDuty flags it, but the finding is suppressed due to a "test environment" tag. The breach goes undetected for 6 hours. How do you prevent suppression from hiding real compromises?

Security Hub is now integrated with GuardDuty, Config, and IAM Access Analyzer. You have findings from all of these services reporting on the same resource. Example: an S3 bucket flagged by (1) Config (not encrypted), (2) GuardDuty (unusual access), and (3) IAM Access Analyzer (overly permissive policy). How do you correlate and prioritize cross-service findings?

Centralize and correlate findings via Security Hub + custom automation:

(1) The correlation problem: three independent services report on the same S3 bucket, each with different context, and the ops team has to connect the dots manually.
    (a) Config: "S3 encryption disabled" (infrastructure misconfiguration).
    (b) IAM Access Analyzer: "bucket policy allows s3:* to *" (access misconfiguration).
    (c) GuardDuty: "unusual GetObject patterns from IP 203.0.113.0" (behavioral anomaly).
    (d) Independently, each is medium severity. Together they are critical (data-exfiltration risk).

(2) Security Hub correlation:
    (a) Enable all integrations (Config, GuardDuty, IAM Access Analyzer, CloudTrail, Macie).
    (b) Security Hub consolidates findings and performs basic deduplication (same resource, same type = one finding across integrations).
    (c) Create a custom insight (a Security Hub feature) for multi-source correlations:
```
Finding source: Config AND GuardDuty AND IAM Access Analyzer
Resource type:  S3 bucket
Severity:       HIGH or above
```
        This surfaces only buckets flagged by 3+ services (high-confidence threats).
    (d) Result: 10K raw findings → 200 multi-source correlations (top priority).

(3) Automated response via EventBridge:
    (a) Security Hub finding → EventBridge rule (if severity = critical AND multi-source = true).
    (b) Trigger a Lambda that:
        (i) Queries all findings for the resource.
        (ii) Builds an incident summary merging the context from every service.
        (iii) Creates a Jira ticket with an auto-remediation suggestion (e.g., "Enable S3 encryption + restrict the bucket policy + investigate IP 203.0.113.0").
        (iv) Pages the security on-call.

(4) Remediation automation (risk-aware):
    (a) Low-risk remediations: auto-fix (enable encryption, tighten the SG).
    (b) High-risk remediations: require approval (bucket-policy changes need security sign-off).
    (c) Use AWS Systems Manager Automation for conditional remediation: if the resource is non-prod and the fix is low-risk, auto-fix; if prod, create a ticket and wait for approval.

(5) Metrics dashboard:
    (a) Total findings by source and severity.
    (b) Multi-source findings (critical).
    (c) Findings auto-remediated.
    (d) MTTR (mean time to remediate) by finding type. Goal: <1 day for critical, <7 days for high.

(6) Implementation: 3 weeks (Security Hub integration, EventBridge rules, Lambda automation, Jira integration). Cost: Security Hub ~$50/month + Lambda <$10/month. Result: incident response goes from 10K findings to 200 actionable plus auto-remediation (a 50x efficiency gain).
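The multi-source insight can also be computed programmatically from the Security Hub GetFindings API. A minimal sketch; the 3-source threshold mirrors the custom insight above, and the grouping key (resource ARN + product name) is an assumption about how you want to correlate:

```python
def multi_source_resources(findings, min_sources=3):
    """Group Security Hub findings by resource ARN and keep only resources
    flagged by at least min_sources distinct products (Config, GuardDuty,
    IAM Access Analyzer, ...)."""
    sources_by_resource = {}
    for f in findings:
        for resource in f.get("Resources", []):
            arn = resource["Id"]
            sources_by_resource.setdefault(arn, set()).add(f["ProductName"])
    return {arn: sorted(s) for arn, s in sources_by_resource.items()
            if len(s) >= min_sources}

def fetch_active_findings():
    """Pull all active findings from Security Hub (paginated)."""
    import boto3  # lazy import keeps multi_source_resources testable offline
    hub = boto3.client("securityhub")
    paginator = hub.get_paginator("get_findings")
    filters = {"RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}]}
    findings = []
    for page in paginator.paginate(Filters=filters):
        findings.extend(page["Findings"])
    return findings
```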

Follow-up: Auto-remediation disabled an overly permissive S3 bucket policy, but one legitimate service suddenly couldn't read objects. MTTR was 5 min (auto-fix), but the resulting incident response took 8 hours. How do you balance auto-remediation speed against validation?

GuardDuty found 50 EC2 instances connecting to a botnet (Mirai). All instances are in the staging environment, the development team's playground. Remediation normally takes 2 hours (rebuilding the instances). But your on-call engineer is asleep, and the decision is yours. Isolate (breaking developer workflow for 30 min)? Or leave them running (security risk)? Give your decision and rationale.

Decision: isolate the staging instances immediately (an SG change), and notify the dev team. Rationale:

(1) Risk assessment:
    (a) A botnet in staging is less severe than in production, but still serious.
    (b) Mirai is a DDoS botnet. If the compromised instances are used to attack AWS infrastructure or other customers, AWS could throttle or suspend your account.
    (c) Attackers could pivot to prod (lateral movement via shared IAM roles or cross-environment credentials).
    (d) Impact: 30 minutes of staging isolation < days of account suspension.

(2) Isolation mechanism (zero downtime for everything else):
    (a) Swap the SG for a quarantine SG whose only outbound rules allow the internal VPC CIDR (takes ~30 sec). Security groups are allow-only, so omitting broader egress rules blocks the botnet C2 while internal services keep working.
    (b) Notify the dev team via Slack: "Staging instances isolated due to botnet detection. ETA: 2 hours to rebuild. Use the dev/local environment meanwhile."
    (c) Auto-trigger remediation (rebuild via Terraform/CDK): terminate the compromised instances, relaunch fresh from the AMI, rerun tests.

(3) Investigation (in parallel):
    (a) Snapshot 5 instances (EBS volumes) for forensics. Don't snapshot all 50 (storage cost).
    (b) Query CloudTrail + VPC Flow Logs: when did the C2 communication start? Did the attacker make any other API calls?
    (c) Check the IAM roles on the staging instances: if they have overly permissive permissions, assume the attacker stole credentials. Rotate IAM keys immediately.

(4) Notification (immediate):
    (a) Page the on-call security engineer (even if they were asleep, this is a P0).
    (b) Notify the dev team lead (staging rebuild incoming).
    (c) Alert AWS Support (file a P1 ticket; AWS will monitor for further activity).
    (d) If any customer-facing data was in staging, notify the compliance team (potential breach notification).

(5) Post-incident (next 24 hours):
    (a) Root cause: how did Mirai infect staging?
        (i) Vulnerable AMI? Audit the AMI for outdated packages.
        (ii) Overly open SG? If staging exposed SSH to the internet, restrict it to VPN/bastion only.
        (iii) Lateral movement from dev? If dev and staging share an IAM role, split it into separate roles.
    (b) Implement permanent preventions:
        (i) GuardDuty findings → auto-isolate (SG change) for staging/dev (low risk).
        (ii) AWS Config rule: detect instances with public SSH access, then auto-remediate.
        (iii) Patch management: deploy automated patching for all EC2 instances (Systems Manager Patch Manager).

(6) Dev team recovery: rebuilding from a clean AMI takes ~30 min. Staging is back online before end of day. No prod impact.
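The SG-swap isolation and the snapshot-5-of-50 sampling above can be sketched with boto3. A minimal sketch; it assumes a pre-created quarantine security group (security groups are allow-only, so the quarantine SG simply omits broad egress rules), and the sample-selection rule is an illustrative choice:

```python
def forensics_sample(instance_ids, sample_size=5):
    """Pick a small, deterministic sample for EBS snapshots (snapshotting
    all 50 instances would be needless storage cost, per the runbook)."""
    return sorted(instance_ids)[:sample_size]

def quarantine_instances(instance_ids, quarantine_sg_id):
    """Replace each instance's security groups with the quarantine SG.
    modify_instance_attribute overwrites the full SG list, cutting the
    botnet C2 path in one call per instance."""
    import boto3  # lazy import keeps forensics_sample testable offline
    ec2 = boto3.client("ec2")
    for instance_id in instance_ids:
        ec2.modify_instance_attribute(
            InstanceId=instance_id, Groups=[quarantine_sg_id]
        )
```

Because the SG list is overwritten, record each instance's original groups first if you want a fast rollback path.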

Follow-up: Forensics shows the attacker entered staging via an unpatched SSH vulnerability in the AMI. The dev team built the AMI 6 months ago and never patched it. Your company now mandates "all AMIs patched within 7 days of a CVE." How do you enforce this?

Your company faces a compliance audit: "All critical findings from Security Hub must be resolved within 30 days or the account faces suspension." You have 15 critical findings (RDS unencrypted, S3 public, IAM overly permissive). Some are in prod and require architectural changes (6+ weeks); others are quick fixes (1 hour). Prioritize and plan remediation within the 30-day window.

30-day remediation plan for the critical findings:

(1) Finding inventory + effort estimates:
```
Critical findings:
 1. RDS unencrypted (prod-warehouse)           — 2 weeks (snapshots, migration, downtime planning)
 2. S3 public (prod-backups)                   — 1 hour  (restrict policy + ACL)
 3. IAM overly permissive (cross-account role) — 3 days  (audit dependencies, refactor)
 4. ElastiCache no encryption                  — 1 week  (snapshot, relaunch)
 5. Elasticsearch no auth                      — 1 hour  (enable fine-grained access control)
 6. Lambda execution role too permissive       — 2 days  (audit usage, restrict permissions)
 7-15. [8 more findings, ~3 weeks combined]
```

(2) Resource allocation:
    (a) Critical path: RDS encryption (2 weeks; must start on day 1).
    (b) Low-effort quick wins: S3 public access and Elasticsearch auth (1 hour each). Finish by day 1 (show progress to the auditors).
    (c) Parallel workstreams:
        (i) Team A: RDS migration (weeks 1-2).
        (ii) Team B: IAM refactoring (weeks 1-3).
        (iii) Team C: ElastiCache + Lambda (weeks 1-2).

(3) Remediation without downtime (where possible):
    (a) RDS (encryption can't be enabled in place):
        (i) Copy a snapshot with encryption enabled and restore it as a new encrypted instance, then replicate changes across (near-zero downtime, 1-2 hours).
        (ii) Cut over to the encrypted instance (~30 sec of downtime, off-peak).
        (iii) Repoint the application at the new endpoint (or use a DNS alias so no app changes are needed).
    (b) S3 public access: change the policy (instant, zero impact; add a policy audit to prevent accidental reversal).
    (c) IAM: create new restricted roles, test with canary services, and gradually shift traffic.

(4) Audit trail:
    (a) Track remediation in a spreadsheet: finding, severity, start date, completion date, verification method.
    (b) For each finding, document "Resolved via [method]" plus CloudTrail evidence and a Security Hub re-scan confirmation.
    (c) The auditor will re-verify via Security Hub after remediation (make sure the finding is actually gone).

(5) Communication:
    (a) Week 1: update the exec team on the plan (15 critical findings, 30-day timeline, on track).
    (b) Week 2: complete the quick wins (show 5 findings resolved).
    (c) Week 3: RDS migration midway, IAM refactoring 60% done.
    (d) Week 4: final push; target completion by day 28 (a 2-day buffer before the 30-day deadline).

(6) Risk: if remediation hits a blocker (e.g., the RDS migration runs slower than expected), escalate to the architecture team immediately (week 2, not week 4). Request a finding waiver from the auditor (rare, but possible with evidence of good-faith effort).

Implementation: 4 weeks of concurrent (not sequential) work, 5-10 FTEs. Cost: zero beyond internal time, plus potential prod downtime (1-2 hours for the RDS cutover). ROI: maintained compliance and no account suspension (invaluable).
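The 30-day clock behind the audit trail can be tracked with a small helper instead of spreadsheet formulas. A minimal sketch using only the standard library; the 7-day escalation buffer and the finding records in the usage below are illustrative:

```python
from datetime import date

def days_remaining(created_on, today, window_days=30):
    """Days left before a finding breaches the compliance window
    (negative means the deadline has already passed)."""
    return window_days - (today - created_on).days

def at_risk(findings, today, buffer_days=7):
    """Return (title, days_left) pairs inside the escalation buffer,
    most urgent first. Each finding is a (title, created_on) tuple."""
    flagged = [(title, days_remaining(created, today))
               for title, created in findings
               if days_remaining(created, today) <= buffer_days]
    return sorted(flagged, key=lambda pair: pair[1])
```

Running `at_risk` daily and piping the output into the week-2 escalation path makes "escalate early, not in week 4" mechanical rather than a judgment call.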

Follow-up: The RDS migration completed successfully, and a Security Hub re-scan shows the finding as resolved. But 2 weeks later the finding re-appeared: someone had manually encrypted only the RDS backup, not the active DB. How do you enforce continuous compliance (not one-time fixes)?

You want to implement automated incident response: GuardDuty finding (critical) → auto-isolate the instance (SG change) → notify Slack → create a Jira ticket → trigger forensics (snapshot + VPC Flow Logs). But you're worried that (1) auto-isolation might break legitimate services, and (2) false positives could waste time. How do you safely automate?

Automated incident response with guardrails:

(1) Architecture: GuardDuty finding → EventBridge → Lambda → conditional automation.
    (a) The Lambda evaluates severity, confidence, finding type, resource tags, and environment.
    (b) If safe, auto-isolate. If risky, notify and let a human decide.
    (c) Logging: every decision is logged to DynamoDB + CloudTrail for audit.

(2) Safety conditions (define these before automating):
    (a) Auto-isolate only if:
        (i) Severity is critical (high confidence from AWS).
        (ii) The resource is tagged auto_remediate=true (opt-in).
        (iii) The environment is staging/dev (not prod).
        (iv) The finding type is known malware/C2 (low false-positive rate).
    (b) Require approval if:
        (i) Prod environment (page on-call first).
        (ii) Finding confidence <70%.
        (iii) The resource provides a critical service (flagged in DynamoDB).

(3) Implementation (Lambda pseudocode):
```python
SAFE_AUTO_REMEDIATE_TYPES = {...}  # known malware/C2 finding types

def handle_guardduty_finding(event):
    finding = event['detail']
    # GuardDuty severity is numeric; 7.0+ corresponds to the highest tier.
    if finding['severity'] >= 7.0:
        if should_auto_isolate(finding):
            isolate_instance(finding['resource_id'])
            notify_slack(f"Auto-isolated {finding['resource_id']}")
            decision = 'auto-isolated'
        else:
            page_oncall(finding)
            decision = 'paged on-call'
        log_decision(finding, decision)

def should_auto_isolate(finding):
    return (finding['confidence'] > 80
            and finding['environment'] in ('dev', 'staging')
            and finding['type'] in SAFE_AUTO_REMEDIATE_TYPES
            and finding['resource_tags'].get('auto_remediate') == 'true')
```

(4) Monitoring + rollback:
    (a) Track auto-isolation events: e.g., 10 auto-isolations last week, 2 of which were user-reported false positives.
    (b) If the false-positive rate exceeds 10%, disable auto-isolation temporarily (fall back to manual review).
    (c) Rollback: if a service alerts (e.g., the dev team says "our app is down"), the Lambda has an auto-undo path: restore the original SG and bring the instance back within 5 minutes.
    (d) Slack alert to on-call: "Instance prod-api-5 auto-isolated 30 min ago. Still isolated? React :+1: to confirm or :-1: to auto-undo."

(5) Fine-tuning:
    (a) Week 1: auto-isolation enabled for dev/staging only, Mon-Fri business hours.
    (b) Week 2: collect data on auto-isolation accuracy (accuracy = no false positives).
    (c) Week 3: if accuracy >95%, enable for prod with an approval requirement (the Lambda pages on-call, waits 5 min for approval, then isolates).
    (d) Week 4+: graduate the automation and tune the confidence threshold.

(6) Compliance: document every auto-isolation decision for auditors. Show: "Critical finding detected, auto-isolated in 30 sec, investigated 5 min later, malware confirmed, kept isolated, incident closed." The audit trail is proof of effective incident response.
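The auto-undo in step (4) only works if the original security groups were recorded before isolation. A minimal sketch; the DynamoDB table name and item shape are assumptions:

```python
def snapshot_sg_state(instance_id, current_sg_ids):
    """Build the DynamoDB item recording the pre-isolation SG list, so
    the auto-undo path can restore it later."""
    return {
        "instance_id": {"S": instance_id},
        "original_sgs": {"SS": sorted(current_sg_ids)},
    }

def undo_isolation(instance_id, table_name="isolation-state"):
    """Auto-undo: read the recorded SG list and reattach it, reversing
    the quarantine within minutes if the isolation was a false positive."""
    import boto3  # lazy import keeps snapshot_sg_state testable offline
    dynamodb = boto3.client("dynamodb")
    ec2 = boto3.client("ec2")
    item = dynamodb.get_item(
        TableName=table_name, Key={"instance_id": {"S": instance_id}}
    )["Item"]
    ec2.modify_instance_attribute(
        InstanceId=instance_id, Groups=item["original_sgs"]["SS"]
    )
```

Writing the state item before the SG swap, in the same Lambda invocation, guarantees that every isolation is reversible even if the invocation later fails.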

Follow-up: Auto-isolation worked, but one service depended on the isolated instance and failed silently (no alert) for 2 hours. How do you detect rapid service degradation after an auto-isolation?
