Authorization Architecture Selection Playbook (RBAC/ABAC/ReBAC, Zanzibar, Cedar, OPA)

2026-03-24 · software

Date: 2026-03-24
Category: knowledge / software

TL;DR

If your product mostly needs attribute-heavy business rules (time, IP, MFA, risk score, object attributes), start with a policy engine (Cedar or OPA).
If your product mostly needs share graphs and delegated access ("A can view doc because A is in team T, team T owns folder F, doc D is in F"), start with a relationship graph engine (OpenFGA/SpiceDB-style ReBAC).
At scale, many teams end up with a hybrid: ReBAC for who is connected to what, policy engine for contextual constraints.
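
As a rough illustration of the share-graph example above, the sketch below stores relationships as (subject, relation, object) tuples and answers "can A view D?" by walking them. The tuple names and the `can_view` helper are hypothetical, and treating every edge as view-granting is a deliberate simplification; real ReBAC engines evaluate per-relation schema rules rather than plain reachability.

```python
from collections import deque

# Hypothetical relationship tuples: (subject, relation, object).
TUPLES = {
    ("user:A", "member", "team:T"),
    ("team:T", "owner", "folder:F"),
    ("folder:F", "parent", "doc:D"),
}


def can_view(user: str, resource: str, tuples=TUPLES) -> bool:
    """Breadth-first walk over the tuples, starting from the user.

    Simplifying assumption: any chain of relations grants view access.
    """
    frontier = deque([user])
    seen = {user}
    while frontier:
        node = frontier.popleft()
        if node == resource:
            return True
        for subject, _relation, obj in tuples:
            if subject == node and obj not in seen:
                seen.add(obj)
                frontier.append(obj)
    return False
```

Here `can_view("user:A", "doc:D")` succeeds through the team and folder chain, while a user with no tuples has no path to the document.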


1) The core problem you are really solving

Authorization systems usually fail not because of syntax, but because teams mix three concerns:

  1. Relationship truth — who belongs where; ownership; sharing graph.
  2. Decision logic — contextual policy checks (MFA, geography, time, risk, tenant state).
  3. Consistency model — how fresh decisions must be after permission changes.

Choose tools by which concern dominates your incident history.


2) Three practical models

A) Policy-centric (RBAC + ABAC): Cedar / OPA style

Best for: API/business-rule-heavy systems, moderate sharing graph complexity.

Watch-outs:


B) Relationship-centric (ReBAC): Zanzibar/OpenFGA/SpiceDB style

Best for: collaborative apps with deep sharing, inherited permissions, group nesting, delegated access.

Watch-outs:


C) Hybrid

Best for: most mature SaaS products.

Trade-off: More moving parts, but clearer boundaries.


3) Decision matrix (operator view)

Signal in your system                                  | Prefer                   | Why
Frequent "shared with team/org/folder chain" bugs      | ReBAC                    | Graph semantics are primary complexity
Frequent compliance/context rules (MFA/IP/time/device) | Policy engine            | Context evaluation is primary complexity
Need both, and incidents come from integration seams   | Hybrid                   | Separate graph truth from contextual gates
Very tight p99 latency budget with local data          | Embedded policy engine   | Lowest hop count possible
Cross-service, multi-product permission reuse          | Centralized auth service | Shared decision authority

4) Consistency: the part teams under-specify

Zanzibar lesson

Google’s Zanzibar paper reports production operation at massive scale with authorization decisions that respect the causal ordering of user actions, and cites 95th-percentile latency under 10 ms and availability above 99.999% over 3 years of production use.

Practical implication

"Permission changed" does not equal "all caches globally coherent immediately."
You need a consistency contract per endpoint class.

ReBAC consistency modes (SpiceDB-style concepts)

Design rule: default to low latency for feed-like reads, require at-least-as-fresh for security-sensitive post-change checks.
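
The design rule above can be sketched as a small mapping from endpoint class to a consistency request. The mode names loosely follow SpiceDB's documented consistency options; the `EndpointClass` values, the `ConsistencyRequest` type, and the fallback to full consistency when no write token is available are assumptions made for illustration, not part of any library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Consistency(Enum):
    MINIMIZE_LATENCY = "minimize_latency"    # may serve slightly stale data
    AT_LEAST_AS_FRESH = "at_least_as_fresh"  # no older than a given write token
    FULLY_CONSISTENT = "fully_consistent"    # bypass staleness entirely


class EndpointClass(Enum):  # hypothetical endpoint tiers
    FEED_READ = "feed_read"
    SENSITIVE_READ = "sensitive_read"


@dataclass(frozen=True)
class ConsistencyRequest:
    mode: Consistency
    token: Optional[str] = None


def choose_consistency(endpoint: EndpointClass,
                       write_token: Optional[str]) -> ConsistencyRequest:
    """Pick a consistency mode per endpoint class."""
    if endpoint is EndpointClass.FEED_READ:
        return ConsistencyRequest(Consistency.MINIMIZE_LATENCY)
    if write_token is not None:
        # Security-sensitive read after a permission change: honor the token.
        return ConsistencyRequest(Consistency.AT_LEAST_AS_FRESH, write_token)
    # Assumption: with no token to anchor freshness, pay for full consistency.
    return ConsistencyRequest(Consistency.FULLY_CONSISTENT)
```

Keeping this choice in one function makes the consistency contract reviewable in one place instead of being scattered across call sites.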


5) Performance budgeting (before coding policy)

Set an explicit budget:

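The OPA documentation on policy performance discusses low-latency use cases with an evaluation budget on the order of 1 ms, and techniques such as keeping policies within the linear-time fragment of Rego so that rule indexing can apply.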

If your auth budget is unclear, architecture debates are noise.
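
One way to make a budget concrete is to check measured decision latencies against it at a chosen percentile. The sketch below is a minimal example using the nearest-rank percentile definition; the function names, the sample data, and the 1 ms threshold in the usage note are illustrative assumptions.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(samples_ms: Sequence[float], budget_ms: float,
                  pct: float = 95.0) -> bool:
    """True if the chosen percentile of decision latency fits the budget."""
    return percentile(samples_ms, pct) <= budget_ms
```

For example, with 95 decisions at 0.4 ms and 5 at 5.0 ms, a 1 ms budget holds at p95 but fails at p99, which is exactly the kind of distinction the budget should state up front.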


6) Recommended reference architecture (hybrid)

  1. PEP (Policy Enforcement Point) in each API service.
  2. Auth decision facade (single SDK/client abstraction).
  3. ReBAC service/store for tuples/relations.
  4. Policy evaluator for contextual checks.
  5. Token/consistency propagation from write path to sensitive read path.
  6. Decision logs + explanation IDs for audit/debug.

Request flow

  1. API receives request + identity + request context.
  2. Check graph permission in ReBAC.
  3. Evaluate context policy (MFA/time/IP/risk/resource attrs).
  4. Combine with deny-overrides strategy.
  5. Return decision + reason tuple/policy IDs.
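
The request flow above can be sketched as a single facade function. Everything here is hypothetical: the `AuthRequest` and `Decision` types, the callback signatures, and the policy IDs are stand-ins for whatever ReBAC client and policy evaluator are actually in use. The combination rule is deny-overrides: a missing relationship or any denying policy blocks the request.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class AuthRequest:
    subject: str
    action: str
    resource: str
    context: dict


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reasons: Tuple[str, ...]  # tuple IDs or policy IDs explaining the outcome


# Returns the ID of the tuple that grants access, or None if there is none.
GraphCheck = Callable[[AuthRequest], Optional[str]]
# Returns the ID of the policy that denies the request, or None to pass.
ContextPolicy = Callable[[AuthRequest], Optional[str]]


def decide(req: AuthRequest, graph_check: GraphCheck,
           policies: Sequence[ContextPolicy]) -> Decision:
    """Graph check first, then context policies, combined with deny-overrides."""
    granting_tuple = graph_check(req)
    denials = tuple(pid for pid in (p(req) for p in policies) if pid is not None)
    if granting_tuple is None:
        return Decision(False, ("rebac:no-relationship",) + denials)
    if denials:
        return Decision(False, denials)
    return Decision(True, (granting_tuple,))


def require_mfa(req: AuthRequest) -> Optional[str]:
    """Example context policy: deny unless the session passed MFA."""
    return None if req.context.get("mfa") else "policy:require-mfa"


def business_hours_only(req: AuthRequest) -> Optional[str]:
    """Example context policy: deny outside 09:00-18:00 (hour supplied by caller)."""
    hour = req.context.get("hour", -1)
    return None if 9 <= hour < 18 else "policy:business-hours"
```

Returning the reason IDs alongside the boolean is what lets decision logs explain why access was granted or refused.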

7) Migration path (RBAC → ReBAC/Hybrid) without drama

Phase 1: Inventory

Phase 2: Externalize decision point

Phase 3: Move graph checks first

Phase 4: Add consistency tokens

Phase 5: Cut over by endpoint class


8) Common failure modes

  1. Auth as side-effect of ORM joins → impossible to audit.
  2. No explicit deny semantics → accidental privilege creep.
  3. No consistency tiering → either stale security decisions or overpaying latency everywhere.
  4. No explainability artifacts → on-call cannot debug why user lost/gained access.
  5. Tuple cardinality surprises → storage/query blowups from unbounded relation expansion.
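
For failure mode 5, one possible guard is to cap how many nodes a single relation expansion may visit, so a runaway fan-out fails loudly instead of exhausting the store. This is a minimal sketch over an in-memory adjacency map; the `expand` function, the exception name, and the default limit are assumptions, not the behavior of any particular engine.

```python
from collections import deque
from typing import Dict, Iterable, Set


class ExpansionLimitExceeded(Exception):
    """Raised when an expansion would visit more nodes than allowed."""


def expand(start: str, edges: Dict[str, Iterable[str]],
           max_nodes: int = 1000) -> Set[str]:
    """Return every node reachable from start, visiting at most max_nodes."""
    seen = {start}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        for nxt in edges.get(node, ()):
            if nxt in seen:
                continue  # already visited; also protects against cycles
            if len(seen) >= max_nodes:
                raise ExpansionLimitExceeded(
                    f"expansion from {start} exceeded {max_nodes} nodes")
            seen.add(nxt)
            frontier.append(nxt)
    return seen
```

A limit like this turns a silent query blowup into an explicit error that can be logged and alerted on.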

9) Minimal production checklist


10) Quick selection guidance

If uncertain: run a 2-week spike with 20 representative authorization scenarios and compare:

  1. implementation complexity
  2. p95 latency
  3. explainability quality
  4. migration blast radius

References (starting points)