Multi-tenant SaaS: 1000 customers sharing same DB. Design architecture so customer A never sees customer B's data—even with SQL injection. Row-Level Security (RLS): setup, policies, testing for data isolation.
RLS: Database-enforced policy; queries automatically filtered by tenant_id. Setup: (1) Enable RLS: ALTER TABLE customers ENABLE ROW LEVEL SECURITY; ALTER TABLE orders ENABLE ROW LEVEL SECURITY. (2) Create policy: CREATE POLICY tenant_policy ON orders USING (tenant_id = current_setting('app.current_tenant_id')::INT) WITH CHECK (tenant_id = current_setting('app.current_tenant_id')::INT). USING = filter on rows read (SELECT, plus the target rows of UPDATE/DELETE); WITH CHECK = constraint on rows written (INSERT and the new row of an UPDATE). (3) Set tenant context: App sets a session variable after login: psql -c "SET app.current_tenant_id = '123'; SELECT * FROM orders" returns only customer 123's orders. (4) Prevent policy bypass: Remove public access: REVOKE ALL ON orders FROM PUBLIC; GRANT SELECT, INSERT, UPDATE, DELETE ON orders TO app_user. app_user's queries always pass through the policy; no raw access. (5) Test isolation: (a) Login as customer 123: SET app.current_tenant_id = '123'; SELECT * FROM orders returns customer 123 data. (b) Attempt bypass: SELECT * FROM orders WHERE tenant_id = 456 returns nothing (RLS filters first). (c) Direct SQL: UPDATE orders SET amount=999 WHERE tenant_id=456 affects zero rows (USING hides them; no error is raised). Gotchas: (1) Superusers and roles with BYPASSRLS skip RLS entirely; table owners skip it too unless ALTER TABLE orders FORCE ROW LEVEL SECURITY. (2) Forgetting to set the context makes current_setting() raise an error (or, with current_setting(..., true), return NULL and filter out every row); it does not silently allow all. Beware connection poolers reusing sessions with a stale tenant context. (3) RLS slow if policy complex (evaluated per row). Optimization: (1) Index on tenant_id: CREATE INDEX idx_orders_tenant ON orders(tenant_id); the policy filter can use the index. (2) Partition by tenant: PARTITION BY HASH (tenant_id) for extreme isolation. (3) Monitor policy overhead: EXPLAIN (ANALYZE) SELECT * FROM orders shows the policy qual added as a Filter or index condition. Recommended: RLS essential for SaaS; default deny, explicit grant per policy.
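The setup steps above, as one runnable sketch (assumes an orders table with a tenant_id INT column and an existing app_user role; names are illustrative):

```sql
-- 1. Enable RLS (and FORCE so even the table owner goes through policies)
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE orders FORCE ROW LEVEL SECURITY;

-- 2. One policy covering reads and writes
CREATE POLICY tenant_policy ON orders
    USING      (tenant_id = current_setting('app.current_tenant_id')::INT)
    WITH CHECK (tenant_id = current_setting('app.current_tenant_id')::INT);

-- 3. Lock down raw access: everything flows through app_user + the policy
REVOKE ALL ON orders FROM PUBLIC;
GRANT SELECT, INSERT, UPDATE, DELETE ON orders TO app_user;

-- 4. Per-connection, after the app authenticates the tenant
SET app.current_tenant_id = '123';
SELECT * FROM orders;                         -- only tenant 123's rows
SELECT * FROM orders WHERE tenant_id = 456;   -- zero rows: RLS filters first
```

Even an injected `WHERE tenant_id = 456` predicate returns nothing, because the policy qual is ANDed into every query before user predicates apply.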
Follow-up: Explain RLS PERMISSIVE vs. RESTRICTIVE policies. How do multiple policies interact?
RLS performance issue: 10k policies (one per customer) or 1M policies (one per user). Policy evaluation slow. Redesign for scalable multi-tenant access control.
Individual policies don't scale; all applicable policy quals are combined (permissive policies OR together) and evaluated per row. Scalable design: (1) Tenant/customer-level RLS (not per-user): CREATE POLICY customer_policy ON orders USING (customer_id = current_setting('app.current_customer_id')::INT). Single policy per table. (2) Consolidate policies: Instead of 10k policies, 1 policy + context variable. (3) Attribute-based access control (ABAC): Set multiple attributes: SET app.current_tenant_id = '123'; SET app.current_user_role = 'admin'; SET app.current_region = 'US'. Policy: CREATE POLICY abac_policy ON orders USING (tenant_id = current_setting('app.current_tenant_id')::INT AND (current_setting('app.current_user_role')='admin' OR user_id = current_setting('app.current_user_id')::INT)). Complex but single policy. (4) Alternative: Application-level filtering instead of RLS. App filters after query (less secure but faster). (5) Database role per customer: CREATE ROLE customer_123 WITH LOGIN PASSWORD 'pwd'; GRANT SELECT, INSERT ON orders TO customer_123. Authenticate as customer_123 directly; policies can target roles via TO customer_123 (fewer context variables). Downside: 1000+ roles are painful to create and manage. (6) Materialized role membership: cache role/permission lookups in a table instead of re-deriving them per query. (7) Lateral joins: SELECT o.* FROM orders o, LATERAL (SELECT current_setting('app.role')='admin' OR o.user_id = current_setting('app.user_id')::INT) AS p(allowed) WHERE p.allowed. Expensive. Performance tuning: (1) Index the columns the policy filters on: CREATE INDEX idx_orders_user ON orders(user_id). (A partial index cannot reference current_setting(): index predicates must use immutable functions.) (2) Use prepared statements: current_setting() is STABLE, so the planner treats it as a per-statement constant rather than re-evaluating it per row. (3) Monitor: EXPLAIN ANALYZE SELECT ... should show minimal Filter overhead. Recommended: Customer-level RLS for SaaS, not per-user; use application auth for per-user access.
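A sketch of the consolidated ABAC policy from point (3): one policy driven entirely by session attributes, so adding a tenant or user requires no DDL (assumes the app sets these settings on every connection; the missing_ok=true form returns NULL instead of erroring when a setting is absent, which fails closed):

```sql
-- Single policy replacing thousands of per-user policies
CREATE POLICY abac_policy ON orders
    USING (
        tenant_id = current_setting('app.current_tenant_id')::INT
        AND (
            current_setting('app.current_user_role', true) = 'admin'  -- tenant admin sees all tenant rows
            OR user_id = current_setting('app.current_user_id')::INT  -- others see only their own
        )
    );

-- Application session setup (run after authenticating the request):
SET app.current_tenant_id = '123';
SET app.current_user_id   = '42';
SET app.current_user_role = 'member';
```

The policy count stays constant as tenants and users grow; only the session context changes per connection.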
Follow-up: Explain ABAC (Attribute-Based Access Control). How does it scale better than role-based?
Bug in RLS policy logic: condition was `tenant_id = 123 OR admin = true` instead of `AND`. Customer A (admin) saw all data. Audit: how would you detect this? Prevention going forward.
RLS policy bugs are silent and catastrophic. Detection: (1) Manual audit: SELECT tablename, policyname, qual, with_check FROM pg_policies WHERE schemaname='public'. Review logic manually. (2) Automated query: SELECT * FROM pg_policies WHERE qual LIKE '%OR%' to flag patterns worth review (an OR condition is exactly how this bug widened access). (3) Test suite: Query as different users/roles, verify results match expectations. (4) Logging: Enable query logging: ALTER SYSTEM SET log_statement = 'all'; ALTER SYSTEM SET log_min_duration_statement = 0; SELECT pg_reload_conf(). Search logs for anomalous queries. (5) Access audit: Compare per-tenant row counts over time: SELECT tenant_id, COUNT(*) FROM orders GROUP BY tenant_id; -- repeat hourly, check for anomalies. Prevention: (1) Code review: Every RLS policy change reviewed by security team. (2) Automated testing: pgTAP or pgunit to unit-test policies. Example: SELECT plan(2); SELECT results_eq('SELECT id FROM orders', $$VALUES (1)$$, 'customer 123 sees its own order'); SELECT is_empty('SELECT * FROM orders WHERE tenant_id = 456', 'customer 123 cannot see other customers'' orders'). (3) Policy versioning: Git-tracked policy definitions. (4) Dry-run environment: Test policy changes on staging with real (sanitized) data. (5) Monitoring: Alert if RLS behaves abnormally (many rows returned despite filter). Recovery: (1) Forensics: Identify affected customers from query/audit logs (requires access tracking, e.g. pgAudit, to already be in place). (2) Notify affected: Legal/compliance requirement. (3) Preserve, don't erase, the access logs (forensic evidence); purge any cross-tenant data that leaked into caches or exports. (4) Policy rollback: Revert to previous policy version, re-apply filter. (5) Compliance: Document incident, remediation, and prevention for SOC2/HIPAA audits.
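A minimal pgTAP isolation test, expanding the example above (assumes the pgtap extension is installed and a seeded orders table where tenant 123 owns orders 1 and 2; table/column names are illustrative):

```sql
-- Run with pg_prove or psql; ROLLBACK keeps the test side-effect free.
BEGIN;
SELECT plan(2);

-- Act as tenant 123
SET app.current_tenant_id = '123';

-- Positive case: tenant sees exactly its own rows
SELECT results_eq(
    'SELECT id FROM orders ORDER BY id',
    $$VALUES (1), (2)$$,
    'tenant 123 sees exactly its own orders'
);

-- Negative case: even an explicit cross-tenant predicate returns nothing
SELECT is_empty(
    'SELECT * FROM orders WHERE tenant_id = 456',
    'tenant 123 cannot see tenant 456 rows'
);

SELECT * FROM finish();
ROLLBACK;
```

Running this suite as every role (tenant user, admin, app_user) after each policy change would have caught the OR-for-AND bug before deploy.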
Follow-up: Design a unit-test framework for RLS policies. How to validate correctness?
Multi-tenant: Customers rent isolated databases (1 DB per customer) vs. shared DB with RLS. Compare architectures: ops complexity, security, cost, performance.
Architecture decision: 1 DB per customer vs. shared DB with RLS. Comparison: 1 DB per customer: Pros: Maximum isolation (DB breach affects 1 customer). Separate backup/recovery per customer. Different schema per customer possible. Cons: 1000+ DBs = massive ops overhead. Replication/backup effort scales O(N) with customer count. Upgrading/patching all 1000 DBs is slow and risky. Shared DB + RLS: Pros: Single DB to operate. Economies of scale (1 large DB costs less than 1000 small ones). Unified backups. Cons: RLS breach = all customers. Must ensure RLS doesn't fail. Database outage = all customers down. Hybrid: Pool of 10-20 DBs, ~50 customers per DB. Middle ground between ops burden and blast radius. Recommendation: Choose by customer base: (1) Startup (10-100 customers): Shared DB + RLS (simple ops, cost-effective). (2) Growth (100-1000 customers): Hybrid pool (balance cost + isolation). (3) Enterprise (1000+ customers with SLAs): 1 DB per tier (Premium = dedicated, Basic = shared). Security-wise, shared DB + RLS requires: (1) Strict RLS testing. (2) Network isolation (PgBouncer + firewall per DB). (3) Audit logging (all queries logged). (4) Encryption (at rest + TLS in transit). (5) Compliance: SOC2, HIPAA, GDPR requirements. Cost (illustrative): 1 DB per customer: $50/month × 1000 = $50k/mo. Shared DB: $5k/mo (1 large instance). Hybrid: $15k/mo (15 medium instances). Ops: 1 DB/customer = weeks per deploy; shared = hours. Recommended: Start shared + RLS; migrate to hybrid pool at 500+ customers.
Follow-up: Explain blast radius and security implications of shared vs. isolated DBs. When should you choose isolated?
Customer A has orders from 10 years ago; RLS policy filters them out (stale data, wrong tenant_id). Query returns no results but data exists. Debug and fix without data loss.
RLS filtering out old data indicates: (1) Policy changed (old records have a different tenant_id format). (2) Data migration bug (tenant_id not updated correctly). (3) Policy condition wrong (e.g., checks != instead of =). Debug: (1) Check policy: SELECT policyname, qual FROM pg_policies WHERE tablename='orders'. Inspect condition logic. (2) Compare data: SELECT tenant_id, COUNT(*) FROM orders WHERE customer_id='123' GROUP BY tenant_id. If multiple tenant_ids for same customer, migration bug. (3) View actual data: Query as the table owner or a role with BYPASSRLS rather than disabling RLS in production; if you must disable, do it in a maintenance window: ALTER TABLE orders DISABLE ROW LEVEL SECURITY. SELECT * FROM orders WHERE customer_id='123' ORDER BY created_at DESC LIMIT 10. Check tenant_id values. (4) Re-enable RLS, retry: ALTER TABLE orders ENABLE ROW LEVEL SECURITY; SELECT * FROM orders WHERE customer_id='123' should now return results if tenant_id matches context. Fix: (1) If migration bug: UPDATE orders SET tenant_id = customer_id WHERE customer_id='123' AND tenant_id IS NULL. Batch in small chunks. (2) If policy changed: Review policy history (git). Revert or update to match old data. (3) If multiple tenant_ids for same customer: Consolidate: UPDATE orders SET tenant_id = primary_tenant_id WHERE customer_id='123' AND tenant_id != primary_tenant_id. (4) Re-validate through RLS: SELECT COUNT(*) FROM orders WHERE customer_id='123' should return the expected count. Recovery: (1) Snapshot the unfiltered data into a temp table first (for verification against after-state). (2) Fix tenant_id values. (3) Re-apply RLS. (4) Audit trail: Document what was fixed and why. Prevention: (1) RLS testing per customer: Verify each customer sees expected data. (2) Migration validation: Post-migration, run data quality checks. (3) Monitoring: Alert if an RLS-filtered query returns 0 rows when the customer claims data exists.
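The debug-and-fix flow above as one sketch (run as the table owner or a BYPASSRLS role so RLS never has to be disabled; 123 as the canonical tenant_id and the batch size are illustrative assumptions):

```sql
-- 1. Inspect the active policy logic
SELECT policyname, qual FROM pg_policies WHERE tablename = 'orders';

-- 2. Find inconsistent tenant_id values for the customer
SELECT tenant_id, COUNT(*)
  FROM orders
 WHERE customer_id = '123'
 GROUP BY tenant_id;

-- 3. Snapshot before touching anything (verification + rollback aid)
CREATE TEMP TABLE orders_123_before AS
SELECT * FROM orders WHERE customer_id = '123';

-- 4. Repair stale rows in small batches (repeat until 0 rows updated)
UPDATE orders
   SET tenant_id = 123
 WHERE id IN (
       SELECT id FROM orders
        WHERE customer_id = '123'
          AND tenant_id IS DISTINCT FROM 123   -- also catches NULLs
        LIMIT 1000);

-- 5. Validate through RLS as the app would see it
SET app.current_tenant_id = '123';
SELECT COUNT(*) FROM orders WHERE customer_id = '123';
```

IS DISTINCT FROM is used instead of != so NULL tenant_ids (the usual migration casualty) are included in the repair.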
Follow-up: Design data quality checks post-migration to catch RLS filtering issues early.
RLS performance degrades over time as table grows (100k rows -> 10M rows). SELECT queries slow despite simple policy. Root cause and optimization.
RLS performance degrades if: (1) Policy qual evaluated row-by-row in a sequential scan, O(N). (2) Index not used for the policy filter. (3) Policy is complex (join, subquery). Diagnose: EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id='123'. Compare: (a) With RLS: plan shows the policy qual as a Filter (with "Rows Removed by Filter"). (b) Without RLS (disabled on a test copy): plan shows an Index Scan, no Filter. The difference is the RLS overhead. Optimization: (1) Index on tenant_id: CREATE INDEX idx_orders_tenant ON orders(tenant_id). RLS uses the index to prune. Verify: EXPLAIN ... shows "Index Scan", not "Seq Scan". (2) Partial index: CREATE INDEX idx_orders_tenant_active ON orders(tenant_id) WHERE status='active' (smaller index for a common filter). (3) Simplify policy: Replace complex conditions with a simple one. Bad: USING (tenant_id = current_setting(...) OR (user_role='admin' AND EXISTS(SELECT 1 FROM audit WHERE...))). Good: USING (tenant_id = current_setting(...)) (RLS is the baseline; the app handles admin override). (4) Partition by tenant: PARTITION BY HASH(tenant_id). Queries scan a single partition (faster). (5) Stable settings: current_setting() is STABLE, so the planner treats it as a per-statement constant instead of recalculating per row; keep any functions in policies STABLE or IMMUTABLE. (6) Covering index: CREATE INDEX idx_orders_tenant_id ON orders(tenant_id, id) so the policy column leads the index. Tuning: (1) Monitor RLS filter time: EXPLAIN (ANALYZE, FORMAT JSON) SELECT ... and read the Filter node's actual time from the JSON output (EXPLAIN cannot be cast inside a SQL expression; parse it client-side). Should be <1% of total query time. (2) Query planner cost: ALTER SYSTEM SET random_page_cost = 1.1 (SSD assumed; lower = index preferred). (3) work_mem: increase cautiously, e.g. ALTER SYSTEM SET work_mem = '64MB' (allocated per sort/hash operation, so large values multiply across a query). Recommended: Index on tenant_id mandatory; partition if >100M rows.
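A sketch of the diagnose-then-index loop above (table and setting names as used throughout; exact plan shapes vary with data distribution, statistics, and PostgreSQL version):

```sql
-- Index the column the policy filters on, leading with tenant_id
CREATE INDEX IF NOT EXISTS idx_orders_tenant ON orders (tenant_id, customer_id);
ANALYZE orders;  -- refresh statistics so the planner sees the new index

-- Reproduce the slow query under the tenant context
SET app.current_tenant_id = '123';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = '123';
-- Before: Seq Scan on orders, policy qual in "Filter:", large
--         "Rows Removed by Filter" count.
-- After:  Index Scan using idx_orders_tenant with the tenant_id qual in
--         "Index Cond:" — rows are pruned in the index, not per-row.
```

The key signal is where the policy qual appears: in "Index Cond:" the index prunes other tenants' rows cheaply; in "Filter:" every row is fetched and then discarded, which is the O(N) degradation the question describes.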
Follow-up: Explain how partitioning improves RLS performance. What's the relationship between partition key and RLS filter?