Authorization Architecture Selection Playbook (RBAC/ABAC/ReBAC, Zanzibar, Cedar, OPA)
Date: 2026-03-24
Category: knowledge / software
TL;DR
If your product mostly needs attribute-heavy business rules (time, IP, MFA, risk score, object attributes), start with a policy engine (Cedar or OPA).
If your product mostly needs share graphs and delegated access ("A can view doc because A is in team T, team T owns folder F, doc D is in F"), start with a relationship graph engine (OpenFGA/SpiceDB-style ReBAC).
At scale, many teams end up with a hybrid: ReBAC for who is connected to what, policy engine for contextual constraints.
1) The core problem you are really solving
Authorization systems usually fail not because of syntax, but because teams mix three concerns:
- Relationship truth — who belongs where; ownership; sharing graph.
- Decision logic — contextual policy checks (MFA, geography, time, risk, tenant state).
- Consistency model — how fresh decisions must be after permission changes.
Choose tools by which concern dominates your incident history.
2) Three practical models
A) Policy-centric (RBAC + ABAC): Cedar / OPA style
Best for: API/business-rule-heavy systems, moderate sharing graph complexity.
- Expressive conditions on principal/resource/context.
- Good decoupling of auth logic from app code.
- Easier to reason about explicit deny/allow precedence.
Watch-outs:
- Graph traversal can become awkward if you force deep sharing hierarchies into pure policy conditions.
- Data loading (entity/context hydration) often becomes the hidden complexity.
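As a rough illustration of the policy-centric model, the sketch below evaluates attribute conditions with forbid-overrides-permit precedence and default deny. The `Policy` class, the policy IDs, and the request shape are hypothetical names made up for this example; this is not the Cedar or OPA API.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Mapping, Tuple

Request = Mapping[str, Any]


@dataclass(frozen=True)
class Policy:
    policy_id: str
    effect: str  # "permit" or "forbid"
    condition: Callable[[Request], bool]


def evaluate(policies: List[Policy], request: Request) -> Tuple[bool, List[str]]:
    """Return (allowed, determining policy IDs).

    Any matching forbid wins; with no matching permit the result is deny.
    """
    matched = [p for p in policies if p.condition(request)]
    forbids = [p.policy_id for p in matched if p.effect == "forbid"]
    if forbids:
        return False, forbids
    permits = [p.policy_id for p in matched if p.effect == "permit"]
    return bool(permits), permits


# Hypothetical policies: editors may edit, but never without MFA.
POLICIES = [
    Policy("permit-editors-edit", "permit",
           lambda r: r["action"] == "edit" and "editor" in r["principal"]["roles"]),
    Policy("forbid-without-mfa", "forbid",
           lambda r: not r["context"]["mfa"]),
]
```

Note that the request must already carry the principal's roles and the MFA flag, which is the data-hydration cost mentioned above.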
B) Relationship-centric (ReBAC): Zanzibar/OpenFGA/SpiceDB style
Best for: collaborative apps with deep sharing, inherited permissions, group nesting, delegated access.
- Native "tuple graph" reasoning and usersets.
- Naturally models transitive permissions.
- Strong conceptual fit for documents/projects/folders/org trees.
Watch-outs:
- Contextual constraints (time-window, risk score, IP) often require additional policy/context logic.
- Consistency mode choices become a first-class product decision.
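To make the "tuple graph" idea concrete, here is a deliberately simplified sketch of the TL;DR example (A is in team T, team T owns folder F, doc D is in F). It treats every tuple as an edge from subject to object and runs a depth-bounded breadth-first search. This ignores relation names entirely, which is an assumption for brevity: real engines such as OpenFGA or SpiceDB evaluate per-relation rules from a schema. The tuple layout and `can_reach` are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

RelTuple = Tuple[str, str, str]  # (object, relation, subject)

TUPLES: Set[RelTuple] = {
    ("team:T", "member", "user:A"),
    ("folder:F", "owner", "team:T"),
    ("doc:D", "parent", "folder:F"),
}


def can_reach(tuples: Iterable[RelTuple], subject: str, target: str,
              max_depth: int = 10) -> bool:
    """True if `target` is reachable from `subject` within `max_depth` hops."""
    edges: Dict[str, Set[str]] = {}
    for obj, _relation, subj in tuples:
        edges.setdefault(subj, set()).add(obj)

    seen = {subject}
    queue = deque([(subject, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return True
        if depth == max_depth:
            continue  # bound the expansion instead of walking the whole graph
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return False
```

With these tuples, `can_reach(TUPLES, "user:A", "doc:D")` follows user:A, team:T, folder:F, doc:D. The depth bound is one simple guard against unbounded relation expansion.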
C) Hybrid
Best for: most mature SaaS products.
- ReBAC engine answers graph question: "does candidate relation exist?"
- Policy engine answers context question: "is this request allowed now under runtime conditions?"
- Combines explainability of tuples with expressiveness of policy conditions.
Trade-off: More moving parts, but clearer boundaries.
3) Decision matrix (operator view)
| Signal in your system | Prefer | Why |
|---|---|---|
| Frequent "shared with team/org/folder chain" bugs | ReBAC | Graph semantics are primary complexity |
| Frequent compliance/context rules (MFA/IP/time/device) | Policy engine | Context evaluation is primary complexity |
| Need both, and incidents come from integration seams | Hybrid | Separate graph truth from contextual gates |
| Very tight p99 latency budget with local data | Embedded policy engine | Lowest hop count possible |
| Cross-service, multi-product permission reuse | Centralized auth service | Shared decision authority |
4) Consistency: the part teams under-specify
Zanzibar lesson
Google’s Zanzibar paper reports production operation at massive scale with decisions that respect causal ordering, and states that the system maintained a 95th-percentile latency under 10 ms and availability above 99.999% over 3 years of production use.
Practical implication
"Permission changed" does not equal "all caches globally coherent immediately."
You need a consistency contract per endpoint class.
ReBAC consistency modes (SpiceDB-style concepts)
- minimize_latency: fastest, may temporarily serve stale reads.
- at_least_as_fresh(token): read not older than a causal point (great for read-after-write UX).
- at_exact_snapshot(token): exact snapshot semantics (pagination/reporting windows).
- fully_consistent: freshest, usually highest latency and cache bypass.
Design rule: default to low latency for feed-like reads, and require at-least-as-fresh for security-sensitive post-change checks.
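One way to write down a per-endpoint-class consistency contract is a small lookup table, sketched below. The mode names follow the list above; the endpoint class names, `ENDPOINT_POLICY`, and `choose_consistency` are hypothetical, and falling back to `fully_consistent` when no token is available (or the endpoint class is unknown) is an assumed fail-safe choice, not something the engines mandate.

```python
from enum import Enum
from typing import Optional, Tuple


class Consistency(Enum):
    MINIMIZE_LATENCY = "minimize_latency"
    AT_LEAST_AS_FRESH = "at_least_as_fresh"
    AT_EXACT_SNAPSHOT = "at_exact_snapshot"
    FULLY_CONSISTENT = "fully_consistent"


# Hypothetical endpoint classes mapped to their documented mode.
ENDPOINT_POLICY = {
    "feed": Consistency.MINIMIZE_LATENCY,
    "post_change_check": Consistency.AT_LEAST_AS_FRESH,
    "report_page": Consistency.AT_EXACT_SNAPSHOT,
}

TOKEN_MODES = (Consistency.AT_LEAST_AS_FRESH, Consistency.AT_EXACT_SNAPSHOT)


def choose_consistency(endpoint_class: str,
                       token: Optional[str] = None) -> Tuple[Consistency, Optional[str]]:
    """Pick the mode for an endpoint class and the token to send with it."""
    mode = ENDPOINT_POLICY.get(endpoint_class, Consistency.FULLY_CONSISTENT)
    if mode in TOKEN_MODES:
        if token is None:
            # No causal point to anchor on: pay latency rather than risk a stale read.
            return Consistency.FULLY_CONSISTENT, None
        return mode, token
    return mode, None
```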
5) Performance budgeting (before coding policy)
Set an explicit budget:
- End-to-end API p99 target (e.g., 80ms)
- Max auth decision budget inside it (e.g., 2–5ms local, 5–15ms remote)
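OPA's policy performance documentation discusses low-latency use cases with a budget on the order of 1 ms per decision, along with techniques such as keeping policies within the linear-time fragment of the language and relying on rule indexing.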
If your auth budget is unclear, architecture debates are noise.
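A budget is only useful if it is checked against measurements. The sketch below computes a nearest-rank percentile over recorded decision latencies and compares it to the budget. The function names and sample values are illustrative assumptions; in practice the numbers would come from your metrics system.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample set."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(samples_ms: Sequence[float], budget_ms: float, pct: int = 99) -> bool:
    """True if the chosen percentile of auth latency fits the budget."""
    return percentile(samples_ms, pct) <= budget_ms


# Made-up remote-check latencies, compared against a 15 ms p99 budget.
remote_samples = [4.0, 5.5, 6.1, 7.2, 9.8, 12.0, 14.9, 6.6, 5.0, 30.0]
fits = within_budget(remote_samples, budget_ms=15.0)
```

Here a single 30 ms outlier in ten samples is enough to blow a p99 budget, which is the kind of tail behavior the budget is meant to expose.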
6) Recommended reference architecture (hybrid)
- PEP (Policy Enforcement Point) in each API service.
- Auth decision facade (single SDK/client abstraction).
- ReBAC service/store for tuples/relations.
- Policy evaluator for contextual checks.
- Token/consistency propagation from write path to sensitive read path.
- Decision logs + explanation IDs for audit/debug.
Request flow
- API receives request + identity + request context.
- Check graph permission in ReBAC.
- Evaluate context policy (MFA/time/IP/risk/resource attrs).
- Combine with deny-overrides strategy.
- Return decision + reason tuple/policy IDs.
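The request flow above can be sketched as a single function. The graph check and the context rules are passed in as plain callables so the example stays self-contained; `Decision`, `authorize`, and the reason ID strings are hypothetical, and modeling context policy purely as forbid rules is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Mapping, Sequence, Tuple

Context = Mapping[str, Any]
GraphCheck = Callable[[str, str, str], bool]
# (rule_id, predicate that returns True when the request must be denied)
ForbidRule = Tuple[str, Callable[[Context], bool]]


@dataclass
class Decision:
    allowed: bool
    reason_ids: List[str] = field(default_factory=list)


def authorize(graph_check: GraphCheck, forbid_rules: Sequence[ForbidRule],
              principal: str, action: str, resource: str,
              context: Context) -> Decision:
    # Graph question first: without a relation there is nothing to gate.
    if not graph_check(principal, action, resource):
        return Decision(False, ["graph:no-relation"])
    # Context question second: any matching forbid rule overrides the graph allow.
    denied_by = [rule_id for rule_id, denies in forbid_rules if denies(context)]
    if denied_by:
        return Decision(False, denied_by)
    return Decision(True, ["graph:relation-found"])
```

Running the graph check first is a design choice in this sketch; a service with very cheap context rules could reasonably evaluate those first to skip a remote call.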
7) Migration path (RBAC → ReBAC/Hybrid) without drama
Phase 1: Inventory
- Enumerate all auth checks in codebase.
- Classify each as role, attribute, or relationship.
Phase 2: Externalize decision point
- Introduce a single authorize() boundary in services.
- Keep existing logic behind it initially.
Phase 3: Move graph checks first
- Migrate sharing/inheritance logic to ReBAC tuples.
- Keep context checks in app code/policy side.
Phase 4: Add consistency tokens
- Persist causal token with critical content updates.
- Use at-least-as-fresh checks for security-sensitive reads.
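The toy model below shows why persisting a token next to the content matters. `FakeRelationStore` is an in-memory stand-in written for this example, with integer revisions playing the role of causal tokens (real engines return opaque tokens), and `cached_rev` simulates whatever revision a cache happens to hold. All names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

RelTuple = Tuple[str, str, str]  # (object, relation, subject)


class FakeRelationStore:
    """Keeps a full snapshot of the tuple set per revision."""

    def __init__(self) -> None:
        self.head = 0
        self._snapshots: Dict[int, FrozenSet[RelTuple]] = {0: frozenset()}

    def write(self, tuples: Iterable[RelTuple]) -> int:
        """Replace the tuple set and return the new revision as the token."""
        self.head += 1
        self._snapshots[self.head] = frozenset(tuples)
        return self.head

    def check(self, tup: RelTuple, cached_rev: int, at_least: int = 0) -> bool:
        """Answer from the cached revision unless the caller demands a fresher one."""
        rev = min(max(cached_rev, at_least), self.head)
        return tup in self._snapshots[rev]


store = FakeRelationStore()
grant = ("doc:D", "viewer", "user:A")
rev_granted = store.write({grant})
rev_revoked = store.write(set())  # access removed

# The write path persists the token alongside the content it protects.
documents = {"doc:D": {"body": "v2", "authz_token": rev_revoked}}

stale = store.check(grant, cached_rev=rev_granted)
fresh = store.check(grant, cached_rev=rev_granted,
                    at_least=documents["doc:D"]["authz_token"])
```

In this run `stale` still reports access from the cached revision, while `fresh` sees the revocation because the read carries the token stored with the document.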
Phase 5: Cut over by endpoint class
- Low-risk endpoints first.
- Diff old/new decisions in shadow mode.
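Shadow mode can be sketched as a wrapper that always returns the legacy decision while recording where the candidate engine disagrees. `make_shadow_authorize` and the mismatch record shape are hypothetical; a real deployment would more likely emit metrics or structured logs than append to a list.

```python
import logging
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger("authz.shadow")

Check = Callable[[str, str, str], bool]
Mismatch = Tuple[str, str, str, bool, Optional[bool]]


def make_shadow_authorize(legacy: Check, candidate: Check,
                          mismatches: List[Mismatch]) -> Check:
    """Wrap two checks; the legacy result stays authoritative."""

    def authorize(principal: str, action: str, resource: str) -> bool:
        old = legacy(principal, action, resource)
        new: Optional[bool]
        try:
            new = candidate(principal, action, resource)
        except Exception:
            # A failing candidate must never affect production decisions.
            logger.exception("candidate check failed")
            new = None
        if new != old:
            mismatches.append((principal, action, resource, old, new))
        return old

    return authorize
```

Reviewing the collected mismatches per endpoint class gives a concrete cutover criterion, for example zero unexplained diffs over an agreed observation window.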
8) Common failure modes
- Auth as side-effect of ORM joins → impossible to audit.
- No explicit deny semantics → accidental privilege creep.
- No consistency tiering → either stale security decisions or overpaying latency everywhere.
- No explainability artifacts → on-call cannot debug why user lost/gained access.
- Tuple cardinality surprises → storage/query blowups from unbounded relation expansion.
9) Minimal production checklist
- One canonical authorize() call path per service.
- Decision logs contain principal/action/resource/context hash.
- Deny decision includes reason IDs (policy/tuple path).
- Consistency mode selected per endpoint class (documented).
- Post-permission-change read path has causal freshness strategy.
- Shadow tests against historical access scenarios.
- Incident runbook: stale-read vs policy-bug vs data-hydration-bug triage.
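For the decision-log items in the checklist, a minimal sketch of a log record might look like the following. The field names and `decision_log_entry` are assumptions rather than a standard schema, and the context is assumed to be JSON-serializable. Hashing a canonical (key-sorted) JSON form keeps the hash stable regardless of key order.

```python
import hashlib
import json
from typing import Any, Dict, Mapping, Sequence


def context_hash(context: Mapping[str, Any]) -> str:
    """SHA-256 over a canonical JSON encoding of the request context."""
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def decision_log_entry(principal: str, action: str, resource: str,
                       context: Mapping[str, Any], allowed: bool,
                       reason_ids: Sequence[str]) -> Dict[str, Any]:
    """Build one audit record; the raw context is not stored, only its hash."""
    return {
        "principal": principal,
        "action": action,
        "resource": resource,
        "context_hash": context_hash(context),
        "decision": "allow" if allowed else "deny",
        "reason_ids": list(reason_ids),
    }
```

Storing only the hash avoids logging sensitive context values while still letting on-call confirm whether two decisions saw the same context.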
10) Quick selection guidance
- Choose Cedar/OPA-first when context policy complexity dominates.
- Choose ReBAC-first when sharing graph complexity dominates.
- Choose Hybrid-first when you already know both are true (most collaborative SaaS at scale).
If uncertain: run a 2-week spike with 20 representative authorization scenarios and compare:
1) implementation complexity, 2) p95 latency, 3) explainability quality, 4) migration blast radius.
References (starting points)
- Zanzibar: Google’s Consistent, Global Authorization System (Google Research / USENIX ATC 2019)
https://research.google/pubs/zanzibar-googles-consistent-global-authorization-system/
- OpenFGA Concepts
https://openfga.dev/docs/concepts
- SpiceDB Consistency Concepts
https://authzed.com/docs/spicedb/concepts/consistency
- Cedar Policy Language Reference
https://docs.cedarpolicy.com/
- OPA Policy Performance
https://www.openpolicyagent.org/docs/policy-performance
- AWS Verified Permissions terminology (permit/forbid determining policy behavior)
https://docs.aws.amazon.com/verifiedpermissions/latest/userguide/terminology.html