A junior developer in the "dev" AWS account launches a p4d.24xlarge EC2 instance (8 GPUs, roughly $32K/month on-demand) without approval, not realizing the cost. Finance alerts you. Your company has a $1M annual cloud budget; a single month of this instance consumes over 3% of it. How do you prevent this organization-wide without blocking legitimate GPU workloads (the ML team needs p3 instances)?
Use AWS Organizations + Service Control Policies (SCPs) to enforce instance type limits: (1) SCP approach: attach a policy to the dev/staging OUs that denies `ec2:RunInstances` for expensive instance families (p4, p4d, x2) unless the request is tagged `cost_center=ml_approved`. Because the SCP is attached only to those OUs, the designated ML account (in its own OU) is unaffected. (2) Policy structure:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyExpensiveGpuInstances",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringLike": { "ec2:InstanceType": ["p4*", "x2*"] },
        "StringNotEquals": { "aws:RequestTag/cost_center": "ml_approved" }
      }
    }
  ]
}
```

(3) Implementation: (a) Attach the policy to the "dev" and "staging" OUs (organizational units); the "ml" OU is exempt. (b) Require an approval workflow: EC2 instance requests in dev must go through change control (Jira ticket + approval). (c) Cost alerts: AWS Budgets alerts if any account exceeds $500/month for GPU instances (auto-escalate). (4) Legitimate use case handling: the ML team gets a dedicated account with p3 access. They submit a monthly forecast of GPU needs (e.g., "2x p3.2xlarge for March"). Finance approves; the SCP allows it only in the ml_prod account. (5) Cost recovery: charge the developer's team (backend-services) for the errant p4d instance's actual runtime (at ~$32K/month on-demand, even a few days is significant). This creates accountability. (6) Alternative to SCP (less restrictive): use AWS Config rules to auto-detect expensive instances and trigger a Lambda to either (a) terminate (risky) or (b) stop the instance and send a Slack notification to DevOps; an operator reviews and decides. Less blocking, more flexible.
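The Config-rule alternative in (6) boils down to a small decision function. A minimal sketch of that logic follows; the instance-family list, tag name, and function name are illustrative assumptions, and the actual stop/notify AWS API calls are left out:

```python
# Sketch of the remediation decision from point (6): stop-and-notify
# rather than terminate. Families and the approval tag are assumptions.
EXPENSIVE_FAMILIES = ("p4", "p4d", "x2")
APPROVED_TAG = ("cost_center", "ml_approved")

def classify_instance(instance_type: str, tags: dict) -> str:
    """Return 'allow', or 'stop_and_notify' for unapproved expensive instances."""
    family = instance_type.split(".")[0]
    if not family.startswith(EXPENSIVE_FAMILIES):
        return "allow"
    key, value = APPROVED_TAG
    if tags.get(key) == value:
        return "allow"
    # Option (b) from the text: stop the instance, then page DevOps via Slack.
    return "stop_and_notify"
```

A real Lambda would call `ec2.stop_instances` and post to Slack when this returns `stop_and_notify`; keeping the decision pure makes it easy to unit-test the policy separately from the side effects.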
Follow-up: Your SCP blocks p4 instances, but a dev team circumvents it by launching 8x p3.2xlarge instances (similar cost, different instance family). How do you close this loophole?
Your company has 30 AWS accounts across 5 organizational units (OUs): dev, staging, prod, ml, data. You want to enforce: (1) prod: must use CloudTrail, GuardDuty, KMS encryption. (2) dev/staging: optional (cost-sensitive). (3) All accounts: must tag resources with team/cost-center. Design SCPs for this compliance model.
Multi-OU SCP design for differentiated governance: (1) Account structure:

```
org/
├── prod-ou (4 accounts: prod-web, prod-api, prod-data, prod-mobile)
├── staging-ou (3 accounts)
├── dev-ou (10 accounts)
├── ml-ou (5 accounts)
├── data-ou (3 accounts)
├── security-ou (1 account, centralized logging)
├── shared-services-ou (2 accounts: ssm, networking)
└── backup-ou (2 accounts, snapshots)
```

(2) Policy #1 (prod-ou, mandatory): SCPs cannot check whether CloudTrail or GuardDuty is enabled, but they can prevent anyone in prod from disabling them:

```json
{
  "Sid": "DenyDisablingGuardrails",
  "Effect": "Deny",
  "Action": [
    "cloudtrail:StopLogging",
    "cloudtrail:DeleteTrail",
    "guardduty:DeleteDetector",
    "kms:ScheduleKeyDeletion"
  ],
  "Resource": "*"
}
```

Pair this with an AWS Config rule + Lambda that auto-enables CloudTrail and GuardDuty in any prod account where they are missing. (3) Policy #2 (all OUs, tagging): enforce resource tagging.

```json
{
  "Sid": "DenyUntaggedEC2",
  "Effect": "Deny",
  "Action": ["ec2:RunInstances", "ec2:CreateVolume"],
  "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:ec2:*:*:volume/*"],
  "Condition": { "Null": { "aws:RequestTag/team": "true" } }
}
```

(4) Implementation: (a) Attach the tagging SCP at the org root so it applies to every account (requirement 3). Attach the "prod-mandatory" SCP only to prod-ou; dev-ou/staging-ou skip the CloudTrail/GuardDuty enforcement (cost-sensitive). (b) Also attach at root a "protect-guardrails" policy that prevents tampering with the roles that manage SCPs (e.g., deny iam:PutUserPolicy on those service roles); note that SCPs never apply to the management account itself. (c) Testing: use the IAM Policy Simulator to test policies before deployment. (5) Deployment process: (a) SCP changes require finance + security sign-off. (b) Pilot on one dev account first. (c) Monitor for one week (CloudTrail shows all AccessDenied API calls). (d) Roll out to the remaining accounts. (6) Cost: SCPs are free. AWS Config bills per rule evaluation (roughly $0.001 each), so ~50 rules across 30 accounts typically lands in the low hundreds of dollars per month. Cheap insurance.
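The tagging enforcement in (3) is usually backstopped by a custom Config rule, since the SCP only covers actions that carry request tags. A minimal sketch of the evaluation logic — the required-tag list comes from the question's requirement, the function name is an assumption, and the Config `put_evaluations` call is omitted:

```python
# Required tags per requirement (3) in the question: team and cost-center.
REQUIRED_TAGS = ("team", "cost-center")

def evaluate_tags(resource_tags: dict) -> tuple:
    """Return (compliance, missing_tags) using AWS Config's
    COMPLIANT/NON_COMPLIANT vocabulary."""
    missing = [t for t in REQUIRED_TAGS if not resource_tags.get(t)]
    status = "COMPLIANT" if not missing else "NON_COMPLIANT"
    return status, missing
```

The `missing` list is useful in the Config annotation so teams see exactly which tag to add.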
Follow-up: Dev team says SCP tagging requirement broke their automation. They're launching instances via Terraform but tags aren't applied. How do you update SCP to allow Terraform automation while keeping enforcement?
Your organization has 50 AWS accounts and a data governance requirement: "Only 3 people (security team) can delete S3 objects; dev teams can only read/write." You're currently using IAM policies per account, which is manual and error-prone. Can you enforce this org-wide with SCPs?
Yes — use SCPs combined with IAM permission boundaries for fine-grained access: (1) SCP approach (org-wide deny): attach a policy to the root that denies s3:DeleteObject for every principal except the security team's role (e.g., security-data-admin):

```json
{
  "Sid": "DenyDeleteObjectExceptSecurity",
  "Effect": "Deny",
  "Action": "s3:DeleteObject",
  "Resource": "*",
  "Condition": {
    "StringNotLike": { "aws:PrincipalArn": "arn:aws:iam::*:role/security-data-admin" }
  }
}
```

(2) This is too broad — dev teams can't even delete their own test objects. Refinement: scope the Resource to the buckets that actually need protection:

```json
{
  "Effect": "Deny",
  "Action": "s3:DeleteObject",
  "Resource": "arn:aws:s3:::prod-critical-data/*",
  "Condition": {
    "StringNotLike": { "aws:PrincipalArn": "arn:aws:iam::*:role/security-data-admin" }
  }
}
```

(3) Separate buckets by sensitivity: (a) prod-critical-data (financial records): deny deletes for all but security-data-admin. (b) dev-test-data (ephemeral): allow all devs to delete. (c) product-data (customer-facing): allow deletion only with MFA (add an aws:MultiFactorAuthPresent condition). (4) Implementation in a single account: (a) Create the security-data-admin role in the security account. (b) Create a dev-access role in each account with a trust relationship allowing dev users to assume it. (c) Attach a permission boundary to the dev role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::prod-critical-data/*"
    },
    {
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::prod-critical-data/*"
    }
  ]
}
```

(5) Across the org: (a) Create SCPs per OU — prod-ou: strong deny; dev-ou: allow deletes in dev buckets only. (b) Grant dev accounts read-only cross-account access to prod buckets via bucket policies (S3 buckets cannot be shared through AWS RAM). (c) Central logging: CloudTrail logs all DeleteObject attempts org-wide; alert if denied deletions exceed 5/day (suggests a compromise attempt). (6) Governance: the security team reviews deletion logs monthly. If a dev needs to delete, they file a ticket → security approves → temporary IAM policy grant for 1 hour.
Cost: minimal — SCPs are free; CloudTrail runs roughly $2K/month across 50 accounts (mostly data events; the first copy of management events per account is free).
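The tiered delete rules in (3) mirror how the SCP conditions evaluate. A sketch of that evaluation as a pure function — bucket names are from the text; the `mfa` flag stands in for `aws:MultiFactorAuthPresent`, and the function name is an assumption:

```python
import fnmatch

# Pattern from the SCP's aws:PrincipalArn condition.
SECURITY_ADMIN_PATTERN = "arn:aws:iam::*:role/security-data-admin"

def may_delete(principal_arn: str, bucket: str, mfa: bool = False) -> bool:
    """Evaluate the bucket-tiering rules from point (3)."""
    if fnmatch.fnmatch(principal_arn, SECURITY_ADMIN_PATTERN):
        return True                # security-data-admin may delete anywhere
    if bucket == "prod-critical-data":
        return False               # (3a) deny all other principals
    if bucket == "product-data":
        return mfa                 # (3c) deletes require MFA
    if bucket == "dev-test-data":
        return True                # (3b) ephemeral, devs may delete
    return False                   # default-deny unknown buckets
```

Encoding the rules as a table-driven function like this also makes a good unit test for the real SCP: run the same cases through the IAM Policy Simulator and compare.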
Follow-up: A legitimate developer can't delete their test S3 objects in dev because the SCP is too broad. You exempt their IAM role, but then they use the role to delete critical prod data. How do you fix this without exempting roles?
Your organization prohibits public AWS resources (S3 Block Public Access enabled, VPC endpoints enforced). You use SCPs to deny public access. But a team wants to host a static website on S3, which must be public. Your SCP blocks it. How do you allow intentional public resources while blocking accidental exposure?
Allow intentional public resources via an approved-bucket whitelist + tagging: (1) Current SCP (too strict) — it denies any change to the Block Public Access settings, so no bucket can ever become public:

```json
{
  "Sid": "DenyPublicAccessBlockChanges",
  "Effect": "Deny",
  "Action": "s3:PutBucketPublicAccessBlock",
  "Resource": "*"
}
```

Better approach: (2) Revised SCP — allow bucket policy/ACL changes only if tags specify a purpose (e.g., purpose=website) and an owner (email):

```json
{
  "Sid": "AllowPublicOnlyIfTagged",
  "Effect": "Deny",
  "Action": ["s3:PutBucketPolicy", "s3:PutBucketAcl"],
  "Resource": "*",
  "Condition": {
    "Null": {
      "aws:RequestTag/purpose": "true",
      "aws:RequestTag/owner": "true"
    }
  }
}
```

(Tag condition-key support varies by S3 action, so the Config rule in (3) is the backstop, not an afterthought.) (3) Additional guardrails: (a) S3 bucket policy template: restrict to a CloudFront distribution or specific IPs (not the whole internet). Config rule: auto-remediate if a bucket becomes publicly accessible (auto-revert to private). (b) Approved-buckets list: maintain in a DynamoDB table (bucket_name, owner_team, purpose, created_date). The Config rule queries the table; if a bucket is not on the approved list, auto-block it. (4) Workflow for a public bucket request: (a) Dev files a ticket: "We need to make s3://mywebsite-public public for website hosting." (b) Security team approves and adds it to the DynamoDB approved list. (c) The SCP now allows the bucket (based on tag + approval). (d) Monitoring: CloudWatch alerts if any S3 bucket becomes public without the tag. (5) Risk mitigation: (a) Approved buckets get an expiration tag (e.g., purpose=website, expires=2026-06-30); a Config rule auto-resets the bucket to private after expiration. (b) Quarterly review: audit all public buckets; if expired, mark for deletion. (c) Cross-account prevention: the SCP denies s3:PutBucketPolicy across all member accounts (multi-account exposure risk). (6) Implementation: 2 weeks (SCP design, Config rule, Lambda auto-remediation, approval workflow). Cost: AWS Config bills per rule evaluation (a few dollars per rule per month at typical volume), Lambda <$1/month (minimal invocations).
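The expiration handling in (5a) — the remediation Lambda compares each approved-list entry against today's date — reduces to one comparison. A sketch, where the record shape follows the DynamoDB approved list in (3b), the `expires` field is the assumed addition from (5a), and the actual S3 ACL-reset call is omitted:

```python
from datetime import date

def bucket_action(record: dict, today: date) -> str:
    """Decide what the remediation Lambda should do for one
    approved-list entry. 'expires' is an ISO date string."""
    expires = date.fromisoformat(record["expires"])
    if today <= expires:
        return "keep_public"
    # Past expiration: per (5a), auto-reset the bucket to private.
    return "reset_to_private"
```

Passing `today` as a parameter (rather than calling `date.today()` inside) keeps the decision deterministic and testable.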
Follow-up: A web developer tags their S3 bucket as purpose=website to bypass SCP, then deletes the CloudFront distribution and makes the bucket fully public. Bucket contains PII. How do you detect this drift?
Your company uses AWS Organizations with 20 accounts. You set an SCP that denies all EC2 instances outside us-east-1 (cost control, compliance requires US-only). But a product team needs to run instances in us-west-2 for latency testing. Currently, they can't. Your VP says: "Grant them exception for Q2 only." Design an SCP exception mechanism.
SCP exceptions via tagging + time-based policies: (1) Base SCP (denies non-us-east-1):

```json
{
  "Effect": "Deny",
  "Action": "ec2:RunInstances",
  "Resource": "*",
  "Condition": {
    "StringNotEquals": { "aws:RequestedRegion": "us-east-1" }
  }
}
```

(2) Exception mechanism: narrow the deny so it only fires for untagged requests — both conditions must match for the deny to apply:

```json
{
  "Effect": "Deny",
  "Action": "ec2:RunInstances",
  "Resource": "*",
  "Condition": {
    "StringNotEquals": { "aws:RequestedRegion": "us-east-1" },
    "Null": { "aws:RequestTag/temp-exception": "true" }
  }
}
```

This allows instances in us-west-2 if (and only if) they carry the temp-exception tag. (3) Time-based validation: store exceptions in a DynamoDB table:

```json
{
  "exception_id": "prod-team-q2-west2",
  "account_id": "123456789",
  "region": "us-west-2",
  "approved_by": "vpeng",
  "valid_from": "2026-04-01",
  "valid_until": "2026-06-30",
  "tags_required": ["temp-exception", "expiration-date"],
  "max_instances": 5
}
```

(4) Approval workflow: (a) Team requests the exception via a form (Jira ticket with region/duration/justification). (b) VP/Finance approves; the entry is added to DynamoDB. (c) A scheduled Lambda (every 4 hours) scans DynamoDB and identifies expired exceptions. (d) For any account running instances with the temp-exception tag: validate against DynamoDB; if expired, trigger an SNS alert → DevOps → auto-stop the instance. (5) Implementation: (a) SCP + Config rule in parallel: the SCP blocks up front; the Config rule catches drift. (b) The scheduled Lambda from (4c) runs in the security account. (c) Dashboard: show all active exceptions and their expiration dates. (6) Cost: DynamoDB on-demand (negligible at this volume), Lambda <$1/month. Personnel: 1-2 weeks dev. (7) Alternative (lighter weight): manual expiration — attach a time-limited exception SCP, then remove it after Q2. Requires discipline but works for infrequent exceptions.
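The sweep in (4c) can be sketched as a pure function over the DynamoDB items — field names follow the record in (3); the function name is an assumption and the actual DynamoDB scan and SNS/stop calls are left out:

```python
from datetime import date

def expired_exceptions(items: list, today: date) -> list:
    """Step (4c): return the exception_ids whose valid_until has passed."""
    expired = []
    for item in items:
        if date.fromisoformat(item["valid_until"]) < today:
            expired.append(item["exception_id"])
    return expired
```

The scheduled Lambda would feed this function the result of a DynamoDB scan, then auto-stop any instance whose temp-exception tag maps to an id in the returned list.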
Follow-up: A team member creates an instance with temp-exception tag but it's now expired. The Config rule detects it 12 hours later. During those 12 hours, the instance incurred $200 cost. Who pays and how do you prevent delays?
Your organization has a "developer access" account used by 200 contractors + full-time devs. You want to enforce: (1) devs can launch EC2, RDS, Lambda, (2) but cannot delete production data (snapshots, backups), (3) cannot modify cross-account IAM roles. Currently, you have 20 individual IAM policies per dev (manual nightmare). Can SCPs + IAM roles simplify this?
Yes — use SCPs + centralized IAM roles to simplify permissions: (1) Instead of 20 individual policies: create a single "developer-base" role with standard permissions, and attach a permissions boundary that blocks the dangerous actions. (2) Permissions boundary for the developer role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:*", "rds:*", "lambda:*"],
      "Resource": "*"
    },
    {
      "Effect": "Deny",
      "Action": [
        "rds:DeleteDBSnapshot",
        "ec2:DeleteSnapshot",
        "iam:*",
        "backup:DeleteBackupVault"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Deny",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::*:role/cross-account-*"
    },
    {
      "Effect": "Deny",
      "Action": ["kms:ScheduleKeyDeletion", "s3:DeleteBucket"],
      "Resource": "*"
    }
  ]
}
```

(3) Defense in depth: the permissions boundary caps the role at the IAM level, and an org-level SCP caps the whole account — even if the developer role is accidentally granted admin permissions, its effective permissions stay restricted. (4) Implementation: (a) Create the `developer-base` role (with the boundary attached) in the central account. (b) All 200 developers simply assume this role — no individual policies. (c) Onboarding: 5 min (add email to the allowlist; they assume the role). Offboarding: remove from the allowlist; their access ends. (5) Fine-grained exceptions: if a dev needs to delete a snapshot (legitimate backup cleanup), they request it through an approval workflow: (a) File a ticket: "Need to delete snapshot snap-12345 (6 months old)." (b) Security verifies the snapshot is indeed old and not production. (c) Grant temporary permission: add a specific resource-ARN exception for 1 hour. (d) After the hour, the permission is revoked. (6) Monitoring: CloudTrail logs all actions and permission denials. Weekly report: "Developers attempted 45 denied actions this week." Investigate if a pattern suggests compromise (10x normal). (7) Cost savings: 20 IAM policies maintained by ops → 1 centralized role + boundary managed by security. Personnel reduction: 50% (from 1 FTE to 0.5 FTE of ongoing management). Implementation: 2 weeks (design, testing, onboarding docs).
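The boundary in (2) relies on IAM's evaluation order: an explicit Deny always beats an Allow, and anything not explicitly allowed is implicitly denied. A toy evaluator — not the real IAM engine, just the precedence rule with wildcard action matching — makes that behavior concrete:

```python
import fnmatch

def is_allowed(action: str, statements: list) -> bool:
    """Toy IAM-style evaluation: explicit Deny wins, then any
    matching Allow, else implicit deny. Ignores Resource/Condition."""
    def matches(stmt):
        acts = stmt["Action"]
        acts = [acts] if isinstance(acts, str) else acts
        return any(fnmatch.fnmatch(action, pattern) for pattern in acts)

    if any(s["Effect"] == "Deny" and matches(s) for s in statements):
        return False
    return any(s["Effect"] == "Allow" and matches(s) for s in statements)

# Simplified version of the developer boundary from point (2).
boundary = [
    {"Effect": "Allow", "Action": ["ec2:*", "rds:*", "lambda:*"]},
    {"Effect": "Deny", "Action": ["rds:DeleteDBSnapshot", "ec2:DeleteSnapshot", "iam:*"]},
]
```

Note how `s3:DeleteBucket` is blocked even without a matching Deny statement: the boundary never allows s3 at all, so the implicit deny applies.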
Follow-up: A developer uses permission boundary loophole: they list all snapshots, identify one they created earlier, and "accidentally" delete it while claiming it was their own. How do you audit ownership and prevent this abuse?