# Sujeet Jaiswal - Technical Blog (Full Content)
> Complete technical blog content for LLM consumption.
Source: https://sujeet.pro
Generated: 2026-04-21T21:25:44.934Z
Total content: 188 (185 articles, 0 blogs, 3 projects)
---
# BROWSER & RUNTIME INTERNALS
Deep dives into browser rendering and JavaScript runtime internals.
---
## Critical Rendering Path: Rendering Pipeline Overview
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/browser-runtime-internals/critical-rendering-path/crp-rendering-pipeline-overview
**Category:** Browser & Runtime Internals / Critical Rendering Path
**Description:** The browser’s rendering pipeline transforms HTML, CSS, and JavaScript into visual pixels through a series of discrete, highly optimized stages. Modern browser engines like Chromium employ the RenderingNG architecture—a next-generation rendering system developed between 2014 and 2021—which decouples the main thread from the compositor and GPU processes to ensure 60fps+ performance and minimize interaction latency.
# Critical Rendering Path: Rendering Pipeline Overview
The browser's rendering pipeline transforms HTML, CSS, and JavaScript into visual pixels through a series of discrete, highly optimized stages. Modern browser engines like Chromium employ the **RenderingNG** architecture—a next-generation rendering system developed between 2014 and 2021—which decouples the main thread from the compositor and GPU processes to ensure 60fps+ performance and minimize interaction latency.

The RenderingNG pipeline: Main thread stages (DOM → Paint) produce immutable outputs committed to the compositor thread, which handles rasterization and compositing independently.
## Abstract
The rendering pipeline is fundamentally a **producer-consumer architecture** split across threads and processes:
- **Main Thread**: Produces structured data (DOM, CSSOM, computed styles, layout geometry, property trees, display lists). Each stage's output is immutable once complete.
- **Compositor Thread**: Consumes committed data to handle scrolling, animations (transform/opacity), and frame assembly without blocking the main thread.
- **Viz Process**: Aggregates compositor frames from all sources and issues GPU draw calls.
The key insight: **Property Trees** (transform, clip, effect, scroll) replaced monolithic layer trees, reducing animation updates from O(layers) to O(affected nodes). This enables compositor-driven animations that bypass the main thread entirely—the architectural foundation for responsive scrolling and 60fps animations even when JavaScript is busy.
Performance impact flows from this split: Interaction to Next Paint (INP) measures how quickly the pipeline can present a frame after user input. Each pipeline stage that runs on the main thread directly contributes to input delay and processing time.
## Pipeline Stages: Inputs and Outputs
Each stage has well-defined inputs and outputs. Understanding this data flow is essential for debugging performance issues.
| Stage | Input | Output | Consumed By |
| ------------------------------------------------------------- | ---------------------------------- | -------------------------------------------------------------------------- | ---------------- |
| [**DOM Construction**](../crp-dom-construction/README.md) | HTML bytes | DOM Tree | Style Recalc |
| [**CSSOM Construction**](../crp-cssom-construction/README.md) | CSS bytes | CSSOM Tree | Style Recalc |
| [**Style Recalc**](../crp-style-recalculation/README.md) | DOM + CSSOM | ComputedStyle (per node) + **LayoutObject Tree** | Layout |
| [**Layout**](../crp-layout/README.md) | LayoutObject Tree + ComputedStyle | **Fragment Tree** (immutable geometry) | Prepaint |
| [**Prepaint**](../crp-prepaint/README.md) | LayoutObject Tree + Fragment Tree | **Property Trees** (transform, clip, effect, scroll) + paint invalidations | Paint |
| [**Paint**](../crp-paint/README.md) | LayoutObject Tree + Property Trees | **Display Lists** (drawing commands) | Commit |
| [**Commit**](../crp-commit/README.md) | Property Trees + Display Lists | Copied data on compositor thread | Layerize, Raster |
| **Layerize** | Display Lists | Composited layer list | Raster |
| [**Raster**](../crp-raster/README.md) | Display Lists + Tiles | GPU texture tiles (bitmaps) | Composite |
| [**Composite**](../crp-composit/README.md) | Texture tiles + Property Trees | Compositor Frame (DrawQuads) | Draw |
| [**Draw**](../crp-draw/README.md) | Compositor Frame | Pixels on screen | Display |
**Key distinction**: The **LayoutObject Tree** is created during Style Recalc, not Layout. Layout _annotates_ the LayoutObject tree and produces the immutable **Fragment Tree** as its output. Prepaint then traverses the LayoutObject tree (using Fragment Tree data) to build Property Trees.
## The Critical Rendering Path (CRP)
The **Critical Rendering Path (CRP)** is the sequence of steps the browser undergoes to convert code into a visual frame. While traditionally viewed as a linear flow (DOM → CSSOM → Render Tree → Layout → Paint), modern engines employ a granular multi-threaded architecture designed around a core constraint: the main thread handles both JavaScript execution and rendering pipeline stages, so any work on the main thread delays both script responsiveness and visual updates.
### Design Rationale: Why Multi-Threading Matters
> **Prior to RenderingNG (pre-2021)**: Rendering was deeply coupled. Scrolling could trigger expensive style recalculations. Animation of transforms required full layer tree walks. The single-threaded assumption baked into the original WebKit codebase (dating to 1998) meant rendering work blocked JavaScript and vice versa.
RenderingNG's design addresses this by:
1. **Separating concerns**: Each pipeline stage produces well-defined, immutable outputs
2. **Enabling skip logic**: Stages that aren't needed can be bypassed (e.g., transform animations skip layout and paint)
3. **Offloading work**: Compositor-driven operations don't require main thread involvement
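The skip logic in point 2 can be pictured as a lookup from the kind of change to the stages that must rerun. The sketch below is a deliberate simplification: `stagesToRun`, `touchesMainThread`, and the three change categories are illustrative assumptions, not Chromium's actual invalidation rules.

```javascript
// Simplified model of stage skipping. The change categories and the mapping
// below are illustrative assumptions, not Chromium's real invalidation rules.
const MAIN_THREAD_STAGES = new Set(["style", "layout", "prepaint", "paint"])

const STAGES_BY_CHANGE = {
  // e.g. width or font-size: geometry changes, so everything reruns
  geometry: ["style", "layout", "prepaint", "paint", "commit", "raster", "composite", "draw"],
  // e.g. color: no geometry change, so layout is skipped
  paintOnly: ["style", "prepaint", "paint", "commit", "raster", "composite", "draw"],
  // e.g. a compositor-driven transform or opacity animation
  compositorOnly: ["composite", "draw"],
}

function stagesToRun(changeKind) {
  const stages = STAGES_BY_CHANGE[changeKind]
  if (!stages) throw new Error(`Unknown change kind: ${changeKind}`)
  return stages
}

function touchesMainThread(changeKind) {
  return stagesToRun(changeKind).some((stage) => MAIN_THREAD_STAGES.has(stage))
}

console.log(stagesToRun("compositorOnly")) // [ 'composite', 'draw' ]
console.log(touchesMainThread("compositorOnly")) // false
```

In this model a compositor-only change never touches the main thread, which is why such animations can stay smooth while JavaScript is busy.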
## The RenderingNG Pipeline Stages
The pipeline comprises 12 stages, though several can be skipped when unnecessary. The first six run on the main thread; the remainder run on the compositor thread and Viz process.
### DOM Construction
The browser parses HTML bytes into the Document Object Model (DOM) tree. This process is incremental—the browser starts building the tree before the entire document downloads.
**Blocking Behavior**:
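- **JS is Parser Blocking**: Synchronous `<script>` elements pause the HTML parser until the script has been fetched and executed.

| … | … | `</script>` detection in strings |
| **Character references** | Entity decoding | `&amp;` → `&` |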
The tokenizer emits five token types: DOCTYPE, start tag, end tag, comment, and character tokens. Each token triggers tree construction actions.
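As a rough illustration of the five token types, here is a toy tokenizer. It is a sketch only: the `tokenize` function is hypothetical, it ignores attributes, character references, and every error-recovery rule, and it groups text runs into one character token where the real tokenizer emits one token per character.

```javascript
// Toy HTML tokenizer: recognizes only well-formed comments, DOCTYPEs,
// start tags, end tags, and text. Not a spec-compliant implementation.
function tokenize(html) {
  const tokens = []
  const pattern =
    /<!--([\s\S]*?)-->|<!doctype\s+([^>\s]+)[^>]*>|<\/([a-z][a-z0-9]*)\s*>|<([a-z][a-z0-9]*)[^>]*>|([^<]+)/gi
  let match
  while ((match = pattern.exec(html)) !== null) {
    const [, comment, doctype, endTag, startTag, text] = match
    if (comment !== undefined) tokens.push({ type: "comment", data: comment })
    else if (doctype !== undefined) tokens.push({ type: "doctype", name: doctype.toLowerCase() })
    else if (endTag !== undefined) tokens.push({ type: "endTag", name: endTag.toLowerCase() })
    else if (startTag !== undefined) tokens.push({ type: "startTag", name: startTag.toLowerCase() })
    else tokens.push({ type: "character", data: text })
  }
  return tokens
}

const types = tokenize("<!DOCTYPE html><p>Hello <b>web</b></p>").map((t) => t.type)
console.log(types.join(" "))
// doctype startTag character startTag character endTag endTag
```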
### Stage 2: Tree Construction
Tree construction uses **insertion modes** to determine how tokens modify the DOM. The specification defines 23 insertion modes including:
- `initial`, `before html`, `before head`, `in head`, `after head`
- `in body` (handles most content)
- `in table`, `in row`, `in cell` (table-specific rules)
- `in template` (for `<template>` elements)
The parser maintains two critical data structures:
1. **Stack of open elements**: Tracks the current nesting context
2. **List of active formatting elements**: Handles formatting elements such as `<b>`, `<i>`, and `<a>` across misnested tags
```html collapse={1-5,8-12}
<p>Hello <span>web performance</span> students!</p>
```

DOM tree construction from HTML parsing: each element becomes a node with parent-child relationships preserved.
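The role of the stack of open elements can be sketched with a minimal tree builder. This is an assumption-laden toy: `buildTree` and its token shape are hypothetical, and it omits insertion modes, the list of active formatting elements, and all error recovery.

```javascript
// Toy tree construction driven by a stack of open elements.
// Tokens are assumed to look like { type, name } or { type, data }.
function buildTree(tokens) {
  const root = { name: "#document", children: [] }
  const openElements = [root] // the current node is always the top of the stack
  for (const token of tokens) {
    const current = openElements[openElements.length - 1]
    if (token.type === "startTag") {
      const node = { name: token.name, children: [] }
      current.children.push(node)
      openElements.push(node)
    } else if (token.type === "endTag") {
      // Pop back to the nearest matching open element; ignore stray end tags
      const index = openElements.map((n) => n.name).lastIndexOf(token.name)
      if (index > 0) openElements.length = index
    } else if (token.type === "character") {
      current.children.push({ name: "#text", data: token.data })
    }
  }
  return root
}

const tree = buildTree([
  { type: "startTag", name: "p" },
  { type: "character", data: "Hello " },
  { type: "startTag", name: "span" },
  { type: "character", data: "web" },
  { type: "endTag", name: "span" },
  { type: "endTag", name: "p" },
])
console.log(tree.children[0].children.map((n) => n.name)) // [ '#text', 'span' ]
```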
## HTML5 Error Recovery
Unlike XML parsers that reject malformed input, the HTML parser **always produces a DOM tree**. The specification defines exact error recovery behavior for every malformed pattern, ensuring consistent results across browsers.
### The Adoption Agency Algorithm
When formatting elements like `<b>` or `<i>` are improperly nested, the adoption agency algorithm restructures the DOM into a well-formed tree that approximates the author's intent:
```html
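<!-- Misnested input: </b> closes while <i> is still open -->
<p>One <b>two <i>three</b> four</i> five</p>
<!-- Resulting DOM, serialized: the <i> is split and reparented -->
<p>One <b>two <i>three</i></b><i> four</i> five</p>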
```
The algorithm earned its name because "elements change parents"—nodes are reparented to produce valid structure. The spec notes this was chosen over alternatives including the "incest algorithm" and "Heisenberg algorithm."
### Foster Parenting
When content appears inside `<table>` where it's not allowed, the parser uses **foster parenting** to place it before the table:
```html
<!-- Stray text directly inside a table is moved to just before the <table> -->
<table>Some text</table>
```
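The foster parent is normally the table's parent element, and the misplaced content is inserted immediately before the table. If the table has no parent, the parser falls back to the element before the table in the stack of open elements. This explains why stray text inside tables appears above them in the rendered output.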
### Common Parse Errors
The specification defines 70+ parse error codes. Common scenarios:
| Error | Input | Recovery Behavior |
| --------------------------- | --------------------- | ------------------------ |
| `duplicate-attribute` | `<div id="a" id="b">` | Second attribute ignored |
| `end-tag-with-attributes` | `</div class="x">` | Attributes ignored |
| `missing-end-tag-name` | `</>` | Sequence ignored entirely |
| `unexpected-null-character` | `<p>\0</p>` | Ignored or replaced with U+FFFD, depending on context |
---
## Why Incremental Parsing Matters
Unlike [CSSOM construction](../crp-cssom-construction/README.md), DOM construction doesn't require the complete document. The browser parses and builds incrementally, enabling:
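- **Early resource discovery**: The preload scanner finds `<link>` and `<script>` resources in markup that has already arrived, before the rest of the document downloads.

Because the parser cannot predict what a script will write with `document.write()`, it must pause, execute the script, then continue with any newly injected content. This is why scripts are **parser-blocking** by default.
### The Full Blocking Chain
1. HTML parser encounters a `<script>` element and pauses.
2. Script execution waits until every pending stylesheet has downloaded and been parsed into the CSSOM.
3. The script executes, and only then does the parser resume.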
The browser must wait for CSS to finish to build the CSSOM, so it can safely execute the script, which in turn blocks the parser. This indirect blocking is a common performance bottleneck.
**Design rationale**: Scripts might call `getComputedStyle()` or access `element.offsetWidth`, which require resolved styles. Running a script before CSSOM completion could return incorrect values, leading to layout bugs.
---
## JavaScript Loading Strategies

Timeline comparison: default scripts block parsing; async/defer enable parallel download.
### Default (Parser-Blocking)
```html
<script src="app.js"></script>
```
- Blocks HTML parsing until download and execution complete
- Preserves document order
- **Use for**: Legacy scripts that require `document.write()` (avoid if possible)
### Async
```html
<script src="analytics.js" async></script>
```
- Downloads in parallel with parsing
- Executes immediately upon download (interrupts parser briefly)
- **Order NOT preserved**—whichever script downloads first runs first
- **Use for**: Independent third-party scripts (analytics, ads, widgets)
**Edge case**: If an async script downloads before parsing reaches it, execution still interrupts the parser. This is why async scripts can cause unpredictable layout shifts if they modify the DOM.
### Defer
```html
<script src="app.js" defer></script>
```
- Downloads in parallel with parsing
- Executes after the DOM is fully parsed but before `DOMContentLoaded`
- **Order preserved**—scripts execute in document order regardless of download completion order
- **Use for**: Primary application scripts
**The DOMContentLoaded timing**: Deferred scripts execute in the gap between DOM completion and `DOMContentLoaded` firing. Event listeners for `DOMContentLoaded` will not run until all deferred scripts complete.
### Module Scripts
```html
<script type="module" src="main.js"></script>
```
- **Deferred by default** (no need to add `defer`)
- Supports ES Module features: `import`/`export`, top-level `await`
- Executes once per URL (singleton behavior)—importing the same module twice returns the same instance
- **Strict mode always enabled**
- **CORS required** for cross-origin modules (unlike classic scripts)
Adding `async` to a module script makes it execute immediately when ready, like async classic scripts:
```html
<script type="module" src="analytics.js" async></script>
```
### Modulepreload
> **Browser support (2023+)**: Chrome 66+, Firefox 115+, Safari 17+
`<link rel="modulepreload">` fetches ES modules ahead of time and also parses and compiles them:
```html
<link rel="modulepreload" href="main.js" />
```
Unlike `rel="preload"`, modulepreload:
- **Parses and compiles** the module ahead of time (preload only caches bytes)
- **Uses correct credentials mode** (`same-origin` by default for module scripts)
- **Can optionally preload the dependency tree** (browser-dependent behavior)
**Best practice**: List all dependencies explicitly rather than relying on browser tree-walking, which varies by implementation.
### Summary Table
| Mode | Parser Blocking | Order Preserved | When Executes | Best For |
| -------------- | --------------- | --------------- | --------------------- | ---------------------- |
| Default | Yes | Yes | Immediately | Legacy scripts (avoid) |
| `async` | No | No | When downloaded | Analytics, ads |
| `defer` | No | Yes | After DOM, before DCL | App scripts |
| `module` | No | Yes | After DOM, before DCL | Modern apps |
| `module async` | No | No | When downloaded | Independent ES modules |
---
## The Preload Scanner
The **preload scanner** (also called "speculative parser" or "lookahead pre-parser") is one of the most significant browser optimizations ever implemented. When Mozilla, WebKit, and IE added preload scanners in 2008, they measured **~20% improvement** in page load times.
### How It Works
When the main parser blocks on a script, a lightweight secondary parser scans ahead through the remaining HTML to discover external resources. It doesn't build a DOM—it only extracts resource URLs and initiates fetches.
```
Main Parser:
```
The declarative approach allows parallel loading with CSS; the injected approach waits for preceding resources.
**Above-the-fold lazy loading**: Using JavaScript lazy-loading on viewport-visible images defeats the scanner and delays Largest Contentful Paint (LCP):
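```html
<!-- Anti-pattern: the real URL is hidden in data-src until a script runs -->
<img data-src="hero.jpg" class="lazyload" alt="Hero" />
<!-- Better: a plain src the preload scanner can discover -->
<img src="hero.jpg" fetchpriority="high" alt="Hero" />
```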
### When to Use `rel="preload"` Hints
Use preload hints only when resources are genuinely hidden from the scanner:
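```html
<!-- A font referenced only from CSS is invisible to the scanner -->
<link rel="preload" href="font.woff2" as="font" type="font/woff2" crossorigin />
```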
**Caution**: Overusing preload can backfire. Preloaded resources compete for bandwidth with scanner-discovered resources. Only preload what's truly critical and invisible to the scanner.
---
## Edge Cases and Gotchas
### Script Execution Order Complexities
When mixing loading strategies, execution order can be surprising:
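```html
<script src="a.js" defer></script>
<script src="b.js" async></script>
<script src="c.js"></script>
<script src="d.js" defer></script>
```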
The synchronous script (`c.js`) executes first because it blocks parsing. Deferred scripts maintain their order relative to each other. Async scripts race independently.
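The ordering rules can be checked with a small simulation. This is a sketch under strong assumptions: `executionOrder` is a hypothetical helper, every download is assumed to start at time zero, and parsing and script execution are treated as instantaneous.

```javascript
// Toy model of script execution order. Each script is
// { src, mode: "sync" | "async" | "defer", downloadMs }.
function executionOrder(scripts) {
  const events = []
  let parserTime = 0
  scripts.forEach((script, index) => {
    if (script.mode === "sync") {
      // The parser waits here until the script has downloaded, then runs it
      parserTime = Math.max(parserTime, script.downloadMs)
      events.push({ src: script.src, time: parserTime, rank: 0, index })
    } else if (script.mode === "async") {
      // Runs as soon as its own download finishes
      events.push({ src: script.src, time: script.downloadMs, rank: 1, index })
    }
  })
  // Deferred scripts run after parsing ends, in document order
  let deferTime = parserTime
  scripts.forEach((script, index) => {
    if (script.mode === "defer") {
      deferTime = Math.max(deferTime, script.downloadMs)
      events.push({ src: script.src, time: deferTime, rank: 2, index })
    }
  })
  return events
    .sort((a, b) => a.time - b.time || a.rank - b.rank || a.index - b.index)
    .map((event) => event.src)
}

const page = [
  { src: "a.js", mode: "defer", downloadMs: 30 },
  { src: "b.js", mode: "async", downloadMs: 80 },
  { src: "c.js", mode: "sync", downloadMs: 50 },
  { src: "d.js", mode: "defer", downloadMs: 10 },
]
console.log(executionOrder(page)) // [ 'c.js', 'a.js', 'd.js', 'b.js' ]
```

Changing the async script's `downloadMs` to a value below 50 moves it ahead of the synchronous script in this model, which is the race described above.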
### Inline Scripts Cannot Be Deferred
`defer` and `async` have no effect on inline scripts:
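```html
<!-- defer is ignored here: the script runs as soon as the parser reaches it -->
<script defer>
  console.log("runs immediately")
</script>
```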
Module scripts are the exception—inline modules are deferred:
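```html
<!-- Inline module: deferred until parsing completes -->
<script type="module">
  console.log("runs after parsing")
</script>
```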
### DOMContentLoaded Timing
`DOMContentLoaded` fires after:
1. HTML parsing completes
2. All deferred scripts execute (in order)
3. All module scripts execute (in order)
It does **not** wait for:
- Async scripts (may fire before or after)
- Stylesheets (unless they block a script that blocks DOMContentLoaded)
- Images, iframes, or other subresources
### Parser Reentrancy
A script can call `document.write()` during parsing, which injects tokens into the current position. This creates reentrancy:
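```html
<script>
  document.write("<script>console.log('inner')<\/script>")
  console.log("outer") // logs after "inner"
</script>
```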
If the injected content includes a script, that script runs before the outer script completes. The parser maintains a **script nesting level** to handle this complexity.
### Template Element Parsing
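`<template>` elements have special parsing rules. Their content is parsed but not rendered; it lives in a separate **document fragment**:
```html
<template id="card">
  <p>Parsed, but inert until cloned into the document</p>
  <script>console.log("runs only after the clone is inserted")</script>
</template>
```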
Scripts inside templates only execute when the template is cloned and inserted into the main document.
---
## Conclusion
DOM construction is a highly optimized but sensitive process. The HTML5 parser's error-tolerant design ensures every document produces a DOM, but this comes with complex algorithms like adoption agency and foster parenting that can produce surprising results from malformed markup.
Parser-blocking scripts and their indirect dependency on CSSOM remain the primary bottlenecks in the Critical Rendering Path. The preload scanner mitigates these delays by discovering resources early, but only for resources visible in the initial HTML markup.
Modern loading strategies should be the default:
- Use `defer` for application scripts that need ordered execution
- Use `type="module"` for ES Module-based applications
- Use `async` only for truly independent third-party scripts
- Avoid `document.write()` entirely—Chrome actively blocks it on slow connections
The goal is to keep the parser unblocked so DOM construction and resource discovery can proceed as quickly as possible.
---
## Appendix
### Prerequisites
- Understanding of HTTP request-response cycle
- Familiarity with the DOM and how JavaScript interacts with it
- Basic knowledge of CSS and the cascade
### Terminology
- **DOM (Document Object Model)**: Tree representation of HTML structure; nodes have properties and methods for manipulation
- **CSSOM (CSS Object Model)**: Tree representation of parsed CSS rules; required for style calculation
- **CRP (Critical Rendering Path)**: Sequence of steps from bytes to pixels: DOM → CSSOM → Style → Layout → Paint → Composite
- **FOUC (Flash of Unstyled Content)**: Visual artifact when content renders before CSS loads
- **Preload Scanner**: Secondary parser that discovers resources while the primary parser is blocked
- **Parser-Blocking**: Resource that halts HTML parsing (synchronous scripts)
- **Render-Blocking**: Resource that halts first paint (CSS)
- **Insertion Mode**: Parser state determining how tokens are processed during tree construction
- **Adoption Agency Algorithm**: Error recovery for misnested formatting elements like `<b>` and `<i>`
- **Foster Parenting**: Error recovery for content misplaced inside `<table>` elements
- **DCL (DOMContentLoaded)**: Event fired when HTML parsing and deferred scripts complete
### Summary
- The HTML parser is a state machine with 80+ tokenization states and 23 insertion modes
- Error recovery algorithms (adoption agency, foster parenting) ensure malformed HTML produces consistent DOM trees
- Scripts block parsing by default because `document.write()` can modify the token stream
- CSS blocks script execution (not parsing) to ensure correct computed style queries
- The preload scanner achieves ~20% faster loads by discovering resources during parser blocks
- `defer` and `type="module"` are the preferred loading strategies for application scripts
- Chrome blocks `document.write()` on slow connections (since Chrome 55)
### References
- [WHATWG HTML Spec: Parsing HTML documents](https://html.spec.whatwg.org/multipage/parsing.html) — Canonical parsing algorithm, tokenization states, tree construction rules
- [WHATWG HTML Spec: Scripting](https://html.spec.whatwg.org/multipage/scripting.html) — Script element behavior, async/defer/module semantics
- [Chrome DevRel: Intervening against document.write()](https://developer.chrome.com/blog/removing-document-write) — Chrome's document.write intervention details
- [web.dev: Don't fight the browser preload scanner](https://web.dev/articles/preload-scanner) — Preload scanner patterns and anti-patterns
- [web.dev: Modulepreload](https://web.dev/articles/modulepreload) — ES module preloading mechanics
- [MDN: Critical Rendering Path](https://developer.mozilla.org/en-US/docs/Web/Performance/Critical_rendering_path) — Overview of DOM and CSSOM construction
- [MDN: rel="modulepreload"](https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Attributes/rel/modulepreload) — Modulepreload attribute reference
- [Andy Davies: How the Browser Pre-loader Makes Pages Load Faster](https://andydavies.me/blog/2013/10/22/how-the-browser-pre-loader-makes-pages-load-faster/) — Historical context on preload scanner implementation
---
## Critical Rendering Path: CSSOM Construction
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/browser-runtime-internals/critical-rendering-path/crp-cssom-construction
**Category:** Browser & Runtime Internals / Critical Rendering Path
**Description:** The CSS Object Model (CSSOM) is the browser engine’s internal representation of all CSS rules—a tree structure of stylesheets, rule objects, and declaration blocks. Unlike DOM construction, which is incremental, CSSOM construction must complete entirely before rendering can proceed. This render-blocking behavior exists because the CSS cascade requires the full rule set to resolve which declarations win.
# Critical Rendering Path: CSSOM Construction
The **CSS Object Model (CSSOM)** is the browser engine's internal representation of all CSS rules—a tree structure of stylesheets, rule objects, and declaration blocks. Unlike [DOM construction](../crp-dom-construction/README.md), which is incremental, CSSOM construction must complete entirely before rendering can proceed. This render-blocking behavior exists because the CSS cascade requires the full rule set to resolve which declarations win.

CSSOM tree structure: CSSStyleSheet objects contain ordered lists of CSSRule objects. The browser must process all rules before computing styles because later rules can override earlier ones via the cascade.
## Abstract
CSSOM construction transforms CSS bytes into a queryable object model through a two-stage pipeline:
1. **Tokenization** — CSS bytes become tokens (identifiers, numbers, strings, delimiters) via a state machine defined in CSS Syntax Level 3
2. **Parsing** — Tokens become a tree of `CSSStyleSheet` → `CSSRule` → declaration objects
**Why render-blocking?** The cascade algorithm requires all rules to determine the winning declaration for each property. Rendering with partial CSS would cause Flash of Unstyled Content (FOUC) as later rules override earlier ones.
**Key interactions:**
| Resource | Blocks Parsing? | Blocks Rendering? | Blocks JS Execution? |
| --------------------- | --------------- | ------------------- | --------------------------- |
| CSS (default) | No | **Yes** | **Yes** (if script follows) |
| CSS (`media="print"`) | No | No | No |
| CSS (`@import`) | No | **Yes** (waterfall) | **Yes** |
**The critical constraint**: Scripts can query styles via `getComputedStyle()`, so the browser blocks script execution until pending stylesheets complete. This creates a dependency chain where CSS indirectly blocks DOM construction when scripts are involved.
## The CSS Parsing Pipeline
### Stage 1: Tokenization
The CSS tokenizer converts a stream of code points into tokens. The [W3C CSS Syntax Module Level 3](https://www.w3.org/TR/css-syntax-3/) defines this process:
> "CSS syntax describes how to correctly transform a stream of Unicode code points into a sequence of CSS tokens (tokenization)."
**Preprocessing** (before tokenization):
- CR, FF, and CR+LF sequences normalize to single LF
- NULL characters and surrogate code points become U+FFFD (replacement character)
- Encoding detected from: BOM → HTTP header → `@charset` → encoding of the referring document → UTF-8 default
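The first two preprocessing rules are simple enough to sketch directly. The `preprocessCss` name is hypothetical, and the function assumes the bytes have already been decoded into a JavaScript string.

```javascript
// Sketch of CSS input preprocessing on an already-decoded string.
function preprocessCss(input) {
  return input
    // CR+LF, lone CR, and FF all normalize to a single LF
    .replace(/\r\n|\r|\f/g, "\n")
    // NULL and lone surrogates become U+FFFD; the `u` flag keeps valid
    // surrogate pairs (such as emoji) intact
    .replace(/\u0000|[\uD800-\uDFFF]/gu, "\uFFFD")
}

console.log(JSON.stringify(preprocessCss("a\r\nb\rc\fd"))) // "a\nb\nc\nd"
```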
**Token types:**
| Token | Example | Purpose |
| -------------------- | ----------------------------- | ------------------------ |
| `<ident-token>` | `color`, `background-image` | Property names, keywords |
| `<function-token>` | `rgb(`, `calc(`, `var(` | Function calls |
| `<at-keyword-token>` | `@media`, `@import`, `@layer` | At-rules |
| `<hash-token>` | `#header`, `#fff` | IDs and hex colors |
| `<string-token>` | `"Open Sans"`, `'icon.png'` | Quoted values |
| `<number-token>` | `42`, `3.14`, `-1` | Numeric values |
| `<dimension-token>` | `16px`, `2em`, `100vh` | Numbers with units |
| `<percentage-token>` | `50%`, `100%` | Percentage values |
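To make the token types concrete, here is a toy tokenizer that classifies input using the categories in the table. It is a rough sketch, not the CSS Syntax algorithm: `tokenizeCss` is hypothetical, escapes and comments are not handled, and all punctuation is lumped into a single `delim` type.

```javascript
// Toy CSS tokenizer: first matching pattern wins, so order matters
// (percentage and dimension must be tried before plain number).
const TOKEN_PATTERNS = [
  ["whitespace", /^\s+/],
  ["string", /^"[^"]*"|^'[^']*'/],
  ["hash", /^#[\w-]+/],
  ["at-keyword", /^@[a-zA-Z-]+/],
  ["percentage", /^[+-]?(\d+\.?\d*|\.\d+)%/],
  ["dimension", /^[+-]?(\d+\.?\d*|\.\d+)[a-zA-Z]+/],
  ["number", /^[+-]?(\d+\.?\d*|\.\d+)/],
  ["function", /^-?[a-zA-Z_][\w-]*\(/],
  ["ident", /^-?-?[a-zA-Z_][\w-]*/],
]

function tokenizeCss(css) {
  const tokens = []
  let rest = css
  while (rest.length > 0) {
    let type = "delim"
    let value = rest[0] // fallback: a single punctuation character
    for (const [candidateType, pattern] of TOKEN_PATTERNS) {
      const match = pattern.exec(rest)
      if (match) {
        type = candidateType
        value = match[0]
        break
      }
    }
    if (type !== "whitespace") tokens.push({ type, value })
    rest = rest.slice(value.length)
  }
  return tokens
}

console.log(tokenizeCss("color: 16px 50% #fff").map((t) => t.type).join(" "))
// ident delim dimension percentage hash
```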
### Stage 2: Parsing
The parser transforms tokens into CSS structures following the grammar:
- **At-rules**: `@` + name + prelude + optional block (e.g., `@media screen { ... }`)
- **Qualified rules**: prelude (selector) + declaration block (e.g., `.nav { color: blue; }`)
- **Declarations**: property + `:` + value + optional `!important`
**Error recovery** is a defining characteristic:
> "When errors occur in CSS, the parser attempts to recover gracefully, throwing away only the minimum amount of content before returning to parsing as normal."
This design choice ensures forward compatibility—new CSS features are invalid syntax to older parsers, but the stylesheet continues functioning:
```css collapse={1-3,9-11}
/* Modern browser: uses container query */
/* Older browser: ignores @container, uses .card default */
@container (min-width: 400px) {
  .card {
    grid-template-columns: 1fr 2fr;
  }
}
.card {
  display: grid;
  gap: 1rem;
}
```
### The Resulting CSSOM Structure
The parser produces a tree of JavaScript-accessible objects defined in the [W3C CSSOM specification](https://www.w3.org/TR/cssom-1/):
```
document.styleSheets (StyleSheetList)
└── CSSStyleSheet
├── cssRules (CSSRuleList)
│ ├── CSSStyleRule (selector + declarations)
│ ├── CSSMediaRule (condition + nested rules)
│ ├── CSSImportRule (href + optional media)
│ └── CSSLayerBlockRule (@layer + nested rules)
├── media (MediaList)
└── disabled (boolean)
```
**Key interfaces:**
- `CSSStyleSheet.cssRules` — live collection of rules; modifications update the CSSOM immediately
- `CSSStyleSheet.insertRule(rule, index)` — insert rule at position
- `CSSStyleSheet.deleteRule(index)` — remove rule at position
- `CSSStyleRule.selectorText` — the selector string
- `CSSStyleRule.style` — `CSSStyleDeclaration` with property access
---
## Why CSSOM Must Be Complete Before Rendering
The design choice to make CSS render-blocking trades initial latency for visual stability. The cascade algorithm cannot produce correct results with partial input.
### The Cascade Requires All Rules
Consider this scenario:
```css
/* Rule 1: loaded first */
.button {
  background: red;
}
/* ... hundreds of rules ... */
/* Rule 2: loaded later */
.button {
  background: blue;
}
```
If the browser rendered with only Rule 1 present, the button would flash red before turning blue when Rule 2 arrives. The cascade resolution depends on:
1. **Origin** — User agent vs. user vs. author stylesheets
2. **Importance** — `!important` inverts normal precedence
3. **Cascade Layers** — `@layer` ordering (CSS Cascade Level 5)
4. **Specificity** — ID > class > type selector weight
5. **Source Order** — Later declarations win at equal specificity
Without the complete rule set, the browser cannot determine the winning declaration.
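A reduced version of the cascade can be written as a comparison over candidate declarations. This sketch covers only importance, specificity, and source order; origin and cascade layers are left out for brevity, and the `winningDeclaration` helper and its input shape are assumptions made for illustration.

```javascript
// Each declaration: { value, important, specificity: [ids, classes, types], order }
function compareSpecificity(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i]
  }
  return 0
}

// Expects a non-empty array of declarations for one property on one element.
function winningDeclaration(declarations) {
  return declarations.reduce((winner, candidate) => {
    if (candidate.important !== winner.important) {
      return candidate.important ? candidate : winner
    }
    const bySpecificity = compareSpecificity(candidate.specificity, winner.specificity)
    if (bySpecificity !== 0) return bySpecificity > 0 ? candidate : winner
    // Equal importance and specificity: the later declaration wins
    return candidate.order > winner.order ? candidate : winner
  })
}

const partial = [{ value: "red", important: false, specificity: [0, 1, 0], order: 1 }]
const complete = [...partial, { value: "blue", important: false, specificity: [0, 1, 0], order: 2 }]
console.log(winningDeclaration(partial).value) // red
console.log(winningDeclaration(complete).value) // blue
```

The two calls give different answers for the same element, which is the flash the browser avoids by waiting for the full rule set.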
### Layout Stability Concerns
CSS properties like `display`, `position`, and `float` fundamentally change element geometry. Rendering with partial CSS would cause:
- **Content reflow** — Text wrapping changes as `width` constraints arrive
- **Layout shifts** — Elements repositioning as positioning rules load
- **Visual jank** — Accumulated shifts creating a poor user experience
The browser's choice: delay First Contentful Paint (FCP) until CSSOM completes rather than present an unstable initial frame.
---
## Interaction with JavaScript
CSSOM construction creates a synchronization point between stylesheets and scripts. This happens because JavaScript can read computed styles.
### Why Scripts Wait for CSSOM
Scripts frequently query style information:
```javascript collapse={1-2,8-10}
// These all require resolved styles
const element = document.querySelector(".sidebar")
const width = element.offsetWidth // Layout property
const color = getComputedStyle(element).color // Computed style
const rect = element.getBoundingClientRect() // Geometry
// If CSSOM isn't complete, these return wrong values
```
The browser enforces a rule: **script execution is blocked while there are pending stylesheets in the document**.
This creates the blocking chain:
```
HTML Parser → <link rel="stylesheet"> → CSS download starts
            → continues parsing...
            → <script> → execution blocked until the stylesheet loads
```
**Use case**: When a script dynamically inserts stylesheets, the browser doesn't block rendering by default. Adding `blocking="render"` to the inserted stylesheet prevents FOUC.
---
## The `@import` Problem
`@import` rules create a request waterfall that defeats browser optimizations.
### Why @import Hurts Performance
```css
/* main.css - browser must download this first */
@import url("reset.css");
@import url("components.css");
/* ... other rules ... */
```
The browser cannot discover `reset.css` and `components.css` until `main.css` downloads and parses. This creates a sequential waterfall:
```
[========= main.css =========]
                              [======= reset.css =======]
                              [=== components.css ===]
```
With `<link>` elements, the preload scanner discovers all stylesheets immediately:
```
[========= main.css =========]
[======= reset.css =======] (parallel)
[=== components.css ===] (parallel)
```
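The difference between the two timelines reduces to simple arithmetic. The sketch below uses made-up download times and assumes unlimited bandwidth and zero parse time; `importWaterfallMs` and `parallelLinksMs` are hypothetical helpers.

```javascript
// With @import, imported files start only after the parent finishes;
// the imports then download in parallel with each other.
function importWaterfallMs(parentMs, importedMs) {
  return parentMs + Math.max(0, ...importedMs)
}

// With <link> elements, every stylesheet starts at the same time.
function parallelLinksMs(allMs) {
  return Math.max(0, ...allMs)
}

// Hypothetical download times in milliseconds
const mainCss = 300
const resetCss = 250
const componentsCss = 200

console.log(importWaterfallMs(mainCss, [resetCss, componentsCss])) // 550
console.log(parallelLinksMs([mainCss, resetCss, componentsCss])) // 300
```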
### Real-World Impact
[HTTP Archive data](https://calendar.perfplanet.com/2024/the-curious-performance-case-of-css-import/) (16.27 million websites):
- 18.86% of sites use `@import` (3.06 million)
- WooCommerce sites using `@import` showed **37% worse mobile P75 LCP**
- Removing `@import` from Vipio.com improved FCP by **32.7%** (2782ms → 1872ms)
**Recommendation**: Replace `@import` with `<link>` elements. The only valid use case is dynamically loading stylesheets based on conditions the HTML cannot express.
### @import Evaluation Timing
`@import` rules must appear before any other rules in a stylesheet (except `@charset` and `@layer`). The browser:
1. Parses the parent stylesheet
2. Encounters `@import`
3. Initiates fetch for imported stylesheet
4. **Blocks CSSOM completion** until imported stylesheet loads and parses
5. Continues with remaining parent rules
Nested `@import` (imported file contains another `@import`) compounds the waterfall.
---
## Developer Optimizations
### Critical CSS Inlining
Extract CSS required for above-the-fold content and inline it in the HTML:
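```html collapse={1-2,14-16}
<head>
  <style>
    /* Critical above-the-fold rules, inlined */
    .hero { min-height: 60vh; }
  </style>
  <!-- Load the full stylesheet without blocking the first render -->
  <link rel="preload" href="styles.css" as="style" onload="this.onload=null;this.rel='stylesheet'" />
  <noscript><link rel="stylesheet" href="styles.css" /></noscript>
</head>
```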
**Target**: Keep critical CSS under **14 KB compressed**—the maximum data in the first TCP roundtrip.
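The 14 KB figure can be derived from two common defaults, both assumptions here rather than guarantees: an initial congestion window of 10 segments and a maximum segment size of 1460 bytes. The `fitsInFirstRoundTrip` helper is hypothetical, and in practice response headers and the HTML itself share the same budget.

```javascript
// Assumed defaults: initial congestion window of 10 segments, 1460-byte MSS
const INITIAL_CWND_SEGMENTS = 10
const MSS_BYTES = 1460
const FIRST_ROUND_TRIP_BYTES = INITIAL_CWND_SEGMENTS * MSS_BYTES

function fitsInFirstRoundTrip(compressedBytes) {
  return compressedBytes <= FIRST_ROUND_TRIP_BYTES
}

console.log(FIRST_ROUND_TRIP_BYTES) // 14600
console.log((FIRST_ROUND_TRIP_BYTES / 1024).toFixed(2)) // 14.26
console.log(fitsInFirstRoundTrip(12000)) // true
```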
**Trade-off**: Inlined CSS cannot be cached separately. For repeat visits, external stylesheets (cached) may perform better.
### Preload for CSS
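`<link rel="preload">` elevates resource priority and enables early discovery:
```html
<!-- Fetch a stylesheet the scanner cannot see, e.g. one pulled in via @import -->
<link rel="preload" href="components.css" as="style" />
```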
**When to use preload for CSS**:
- Stylesheets in `@import` chains (defeats waterfall)
- Stylesheets added dynamically by JavaScript
- Stylesheets in Shadow DOM (not discoverable by parser)
**When NOT to use preload**:
- Stylesheets already in `<head>` as `<link rel="stylesheet">` (the preload scanner has already discovered them)
### Constructable Stylesheets
Modern browsers support creating stylesheets programmatically without DOM manipulation:
```javascript collapse={1-2,10-14}
// Create and populate stylesheet
const sheet = new CSSStyleSheet()
sheet.replaceSync(`
.component { padding: 1rem; }
.component--active { background: #e0e0e0; }
`)
// Apply to document
document.adoptedStyleSheets = [...document.adoptedStyleSheets, sheet]
// Apply to Shadow DOM
const shadow = element.attachShadow({ mode: "open" })
shadow.adoptedStyleSheets = [sheet]
```
**Benefits**:
- **Shared styles**: One stylesheet instance across multiple shadow roots
- **No FOUC**: Styles apply synchronously after `replaceSync()`
- **No DOM nodes**: No `<style>` elements need to be inserted into the document

```typescript
  ${initialResults}
  ${deferredContent}
  `
}
```
**Performance optimizations:**
| Technique | Impact | Implementation |
| --------------------- | ------------------------ | ----------------------------------- |
| Server-side rendering | FCP < 500ms | Render first 3 results on server |
| Critical CSS inlining | No render blocking | Extract above-fold styles |
| Lazy loading | Reduced initial payload | Load images/rich snippets on scroll |
| Prefetching | Faster result clicks | Prefetch top result on hover |
| Service worker | Offline + instant repeat | Cache static assets, query history |
### Autocomplete UX
```typescript collapse={1-8, 45-55}
// autocomplete.ts
class AutocompleteController {
private debounceMs = 100
private minChars = 2
private cache: Map = new Map()
async handleInput(query: string): Promise {
if (query.length < this.minChars) {
return []
}
// Check cache first
const cached = this.cache.get(query)
if (cached) {
return cached
}
// Debounce rapid keystrokes
await this.debounce()
// Fetch suggestions
const suggestions = await this.fetchSuggestions(query)
// Cache for repeat queries
this.cache.set(query, suggestions)
// Prefetch likely next queries
this.prefetchNextCharacter(query)
return suggestions
}
private prefetchNextCharacter(query: string): void {
// Prefetch common next characters
const commonNextChars = ["a", "e", "i", "o", "s", "t", " "]
for (const char of commonNextChars) {
const nextQuery = query + char
if (!this.cache.has(nextQuery)) {
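// Low-priority background fetch; store the result so the cache check above can hit
requestIdleCallback(() => {
this.fetchSuggestions(nextQuery)
.then((suggestions) => this.cache.set(nextQuery, suggestions))
.catch(() => {}) // Prefetch is best-effort; ignore failures
})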
}
}
}
}
```
**Autocomplete latency budget:**
```
Total: 100ms target
├── Network RTT: 30ms (edge servers)
├── Server processing: 20ms
├── Trie lookup: 5ms
├── Ranking: 10ms
├── Response serialization: 5ms
└── Client rendering: 30ms
```
### Infinite Scroll vs Pagination
Google uses traditional pagination rather than infinite scroll. Design rationale:
| Factor | Pagination | Infinite Scroll |
| ----------------- | ------------------------------ | ------------------------ |
| User mental model | Clear position in results | Lost context |
| Sharing results | "Page 2" is meaningful | No way to share position |
| Back button | Works as expected | Loses scroll position |
| Performance | Bounded DOM size | Unbounded growth |
| SEO results | Users evaluate before clicking | Scroll past quickly |
## Infrastructure Design
### Cloud-Agnostic Components
| Component | Purpose | Requirements |
| ------------------- | ---------------------------- | ----------------------------------- |
| Distributed storage | Page content, index | Petabyte scale, strong consistency |
| Distributed compute | Index building, ranking | Horizontal scaling, fault tolerance |
| Message queue | Crawl job distribution | At-least-once, priority queues |
| Cache layer | Query results, posting lists | Sub-ms latency, high throughput |
| CDN | Static assets, edge serving | Global distribution |
| DNS | Geographic routing | Low latency, health checking |
### Google's Internal Infrastructure
| Component | Google Service | Purpose |
| ---------- | -------------------------------- | ----------------------------------------- |
| Storage | Bigtable + Colossus | Structured data + distributed file system |
| Compute | Borg | Container orchestration |
| Batch | MapReduce / Flume | Batch processing |
| RPC | Stubby (gRPC predecessor) | Service communication |
| Monitoring | Borgmon (Prometheus inspiration) | Metrics and alerting |
| Consensus | Chubby (ZooKeeper inspiration) | Distributed locking |
### AWS Reference Architecture

**Service sizing (for ~10K QPS, 1B documents):**
| Service | Configuration | Cost Estimate |
| ----------- | -------------------------- | ------------- |
| OpenSearch | 20 × i3.2xlarge data nodes | ~$50K/month |
| ECS Fargate | 50 × 4vCPU/8GB tasks | ~$15K/month |
| ElastiCache | 10 × r6g.xlarge nodes | ~$5K/month |
| DynamoDB | On-demand, ~100K WCU | ~$10K/month |
| S3 | 100TB storage | ~$2K/month |
**Note:** This is a simplified reference. Google's actual infrastructure is orders of magnitude larger and relies on custom hardware and software that are not commercially available.
### Self-Hosted Open Source Stack
| Component | Technology | Notes |
| ------------- | --------------------------- | ------------------------------- |
| Search engine | Elasticsearch / Solr | Proven at billion-doc scale |
| Storage | Cassandra / ScyllaDB | Wide-column store like Bigtable |
| Crawler | Apache Nutch / StormCrawler | Distributed web crawling |
| Queue | Kafka | Crawl job distribution |
| Compute | Kubernetes | Container orchestration |
| Cache | Redis Cluster | Query and posting list cache |
## Variations
### News Search (Freshness-Critical)
News search prioritizes freshness over traditional ranking signals.
```typescript
// news-ranking.ts
function computeNewsScore(doc: NewsDocument, query: Query): number {
const baseRelevance = computeTextRelevance(doc, query)
const authorityScore = doc.sourceAuthority // CNN > random blog
const freshnessScore = computeFreshnessDecay(doc.publishedAt)
// Freshness dominates for news queries
return baseRelevance * 0.3 + authorityScore * 0.2 + freshnessScore * 0.5
}
function computeFreshnessDecay(publishedAt: Date): number {
const ageHours = (Date.now() - publishedAt.getTime()) / (1000 * 60 * 60)
// Exponential decay with an 8-hour time constant (half-life = 8 × ln 2 ≈ 5.5 hours) for breaking news
return Math.exp(-ageHours / 8)
}
```
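To make the decay curve concrete, the sketch below evaluates the same formula at a few article ages. It assumes the 8-hour time constant from the code above; the `freshnessAt` name and the injected `now` parameter are additions made here so the output is deterministic.

```typescript
// Same formula as computeFreshnessDecay above, with `now` injected
// (an assumption for this example, so results do not depend on the wall clock).
const TIME_CONSTANT_HOURS = 8

function freshnessAt(publishedAt: Date, now: Date): number {
  const ageHours = (now.getTime() - publishedAt.getTime()) / (1000 * 60 * 60)
  return Math.exp(-ageHours / TIME_CONSTANT_HOURS)
}

// Half-life implied by the time constant: 8 * ln(2) hours
const halfLifeHours = TIME_CONSTANT_HOURS * Math.LN2

const referenceNow = new Date("2024-02-03T12:00:00Z")
const hoursAgo = (h: number): Date => new Date(referenceNow.getTime() - h * 60 * 60 * 1000)

for (const h of [0, 1, 6, 24]) {
  console.log(`${h}h old -> ${freshnessAt(hoursAgo(h), referenceNow).toFixed(3)}`)
}
console.log(`half-life: ${halfLifeHours.toFixed(2)} hours`)
```

With a 0.5 weight on freshness, a day-old article keeps only about 5% of its freshness score, so it needs substantially higher relevance or authority to outrank a fresh one.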
**News-specific infrastructure:**
- Dedicated "fresh" index updated in real-time
- RSS/Atom feed crawling every few minutes
- Publisher push APIs for instant indexing
- Separate ranking model trained on news engagement
### Image Search
Image search combines visual features with text signals.
```typescript
// image-search.ts
interface ImageDocument {
imageUrl: string
pageUrl: string
altText: string
surroundingText: string
visualFeatures: number[] // CNN embeddings
safeSearchScore: number
}
function rankImageResult(image: ImageDocument, query: Query): number {
// Text signals from alt text and page context
const textScore = computeTextRelevance(image.altText + " " + image.surroundingText, query)
// Visual similarity to query (if query has image)
const visualScore = query.hasImage ? cosineSimilarity(image.visualFeatures, query.imageFeatures) : 0
// Page authority
const pageScore = getPageRank(image.pageUrl)
return textScore * 0.4 + visualScore * 0.3 + pageScore * 0.3
}
```
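The ranking function above calls `cosineSimilarity` without defining it. A minimal sketch follows, assuming embeddings are plain `number[]` arrays of equal length; returning 0 for a zero vector is a choice made for this example, not something the design specifies.

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Returns 0 for a zero vector instead of NaN (a choice made for this sketch).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions must match")
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

console.log(cosineSimilarity([1, 2, 3], [2, 4, 6])) // parallel vectors score close to 1
console.log(cosineSimilarity([1, 0], [0, 1])) // orthogonal vectors score 0
```

Cosine similarity ranges over [-1, 1], so a real ranker would likely clamp or rescale it before mixing it with signals that live in [0, 1].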
### Local Search
Location-aware search requires geographic indexing.
```typescript
// local-search.ts
interface LocalBusiness {
id: string
name: string
category: string
location: { lat: number; lng: number }
rating: number
reviewCount: number
}
function rankLocalResult(business: LocalBusiness, query: Query, userLocation: Location): number {
const relevanceScore = computeTextRelevance(business.name + " " + business.category, query)
// Distance decay: closer is better
const distance = haversineDistance(userLocation, business.location)
const distanceScore = 1 / (1 + distance / 5) // 5km reference distance
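// Quality signals, normalized to roughly 0-1 so the weights below are comparable
// (the 5-star scale and the 10,000-review saturation point are assumptions)
const qualityScore = (business.rating / 5) * Math.min(1, Math.log10(business.reviewCount + 1) / 4)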
return relevanceScore * 0.3 + distanceScore * 0.4 + qualityScore * 0.3
}
```
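`haversineDistance` is used above but not shown. The sketch below is one standard way to compute it, assuming coordinates in decimal degrees and a mean Earth radius of 6371 km; the `LatLng` type is a stand-in for the `Location` type referenced in the ranking function.

```typescript
// Stand-in for the Location type used above (assumption)
interface LatLng {
  lat: number
  lng: number
}

const EARTH_RADIUS_KM = 6371 // Mean Earth radius

// Great-circle distance in kilometres between two points given in degrees
function haversineDistance(a: LatLng, b: LatLng): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180
  const dLat = toRad(b.lat - a.lat)
  const dLng = toRad(b.lng - a.lng)
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h))
}

// Distance decay from the ranking function above (5 km reference distance)
const distanceScore = (km: number): number => 1 / (1 + km / 5)

const oneDegreeAtEquator = haversineDistance({ lat: 0, lng: 0 }, { lat: 0, lng: 1 })
console.log(oneDegreeAtEquator.toFixed(1), "km, score", distanceScore(oneDegreeAtEquator).toFixed(3))
```

The decay halves the score at the 5 km reference distance and never reaches zero, so distant businesses can still rank when relevance and quality are strong.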
**Local search infrastructure:**
- Geospatial index (R-tree or geohash-based)
- Business database integration (Google My Business)
- Real-time hours/availability from APIs
- User location from GPS, IP, or explicit setting
## Conclusion
Web search design requires solving four interconnected problems at planetary scale:
1. **Crawling**: Discovering and fetching content from billions of URLs while respecting server limits. Prioritization determines which pages stay fresh; adaptive politeness prevents overloading origin servers. The crawler is never "done" because the web changes continuously.
2. **Indexing**: Building data structures that enable sub-second query response. Inverted indexes map terms to documents; sharding distributes the index across thousands of machines. Compression (delta encoding) reduces storage 5-10x while maintaining query speed.
3. **Ranking**: Combining hundreds of signals to surface relevant results. PageRank provides baseline authority from link structure; BERT understands semantic meaning; RankBrain matches queries to documents in embedding space. No single signal dominates; the combination matters.
4. **Serving**: Processing 100K+ QPS with sub-second latency. Each query fans out to all shards, aggregates results, and applies final ranking, all within 200ms. Caching absorbs popular (head) queries, while early termination stops scanning once enough good results are found, which bounds the cost of long-tail queries that miss the cache.
**What this design optimizes for:**
- Query latency: p50 < 200ms through caching, early termination, and parallel shard queries
- Index freshness: Minutes for news, hours for regular content through tiered crawling
- Result relevance: Multiple ranking systems (PageRank + BERT + RankBrain) cover different relevance aspects
- Horizontal scale: Sharded architecture scales to 400B+ documents
**What it sacrifices:**
- Simplicity: Thousands of components, multiple ranking systems, complex coordination
- Cost: Massive infrastructure (estimated millions of servers)
- Real-time indexing: Minutes to hours delay for most content (news excepted)
**Known limitations:**
- Long-tail queries may have poor results (insufficient training data)
- Adversarial SEO requires constant ranking updates
- Fresh content from new sites may take weeks to surface
- Personalization creates filter bubbles
## Appendix
### Prerequisites
- Information retrieval fundamentals (TF-IDF, inverted indexes)
- Distributed systems concepts (sharding, replication, consensus)
- Basic machine learning (embeddings, neural networks)
- Graph algorithms (PageRank, link analysis)
### Terminology
- **Inverted Index** — Data structure mapping terms to documents containing them
- **Posting List** — List of documents (with positions/frequencies) for a single term
- **PageRank** — Algorithm measuring page importance based on link structure
- **BERT** — Bidirectional Encoder Representations from Transformers; understands word context
- **RankBrain** — Google's ML system for query-document matching via embeddings
- **Crawl Budget** — Maximum pages a crawler will fetch from a domain in a time period
- **robots.txt** — File specifying crawler access rules for a website
- **QDF** — Query Deserves Freshness; flag indicating time-sensitive queries
- **SERP** — Search Engine Results Page
- **Canonical URL** — Preferred URL when multiple URLs have duplicate content
### Summary
- Web search processes **8.5B queries/day** across **400B+ indexed pages** with **sub-second latency**
- **Inverted indexes** enable O(1) term lookup; **sharding** distributes load across thousands of machines
- **PageRank** measures page authority via link analysis; **BERT/RankBrain** add semantic understanding
- **Crawl prioritization** balances freshness vs. coverage; politeness respects server limits
- **Query processing** includes spell correction (680M param DNN), intent understanding, and query expansion
- **Tiered indexing** keeps hot data in memory; cold data on disk for cost efficiency
- **Early termination** and **caching** reduce tail latency; hedged requests handle slow shards
### References
- [The Anatomy of a Large-Scale Hypertextual Web Search Engine](http://infolab.stanford.edu/~backrub/google.html) - Brin & Page (1998), original Google paper
- [Bigtable: A Distributed Storage System for Structured Data](https://research.google/pubs/bigtable-a-distributed-storage-system-for-structured-data/) - Google (2006), storage architecture
- [How Search Works](https://www.google.com/search/howsearchworks/) - Google official documentation
- [Google Search Ranking Systems Guide](https://developers.google.com/search/docs/appearance/ranking-systems-guide) - Official ranking system documentation
- [A Peek Behind Colossus](https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system) - Google's distributed file system
- [BERT: Pre-training of Deep Bidirectional Transformers](https://arxiv.org/abs/1810.04805) - Devlin et al., BERT paper
- [The PageRank Citation Ranking](http://ilpubs.stanford.edu:8090/422/) - Page et al., original PageRank paper
- [Web Search for a Planet](https://research.google/pubs/web-search-for-a-planet-the-google-cluster-architecture/) - Google (2003), cluster architecture
- [Crawl Budget Management](https://developers.google.com/search/docs/crawling-indexing/large-site-managing-crawl-budget) - Google crawl documentation
---
## Design Collaborative Document Editing (Google Docs)
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/system-design-problems/design-google-docs-collaboration
**Category:** System Design / System Design Problems
**Description:** A comprehensive system design for real-time collaborative document editing covering synchronization algorithms, presence broadcasting, conflict resolution, storage patterns, and offline support. This design addresses sub-second convergence for concurrent edits while maintaining document history and supporting 10-50 simultaneous editors.
# Design Collaborative Document Editing (Google Docs)
A comprehensive system design for real-time collaborative document editing covering synchronization algorithms, presence broadcasting, conflict resolution, storage patterns, and offline support. This design addresses sub-second convergence for concurrent edits while maintaining document history and supporting 10-50 simultaneous editors.

High-level architecture: WebSocket-based real-time sync with operation log persistence and periodic snapshots.
## Abstract
Collaborative document editing requires solving three interrelated problems: **real-time synchronization** (all users see changes within milliseconds), **conflict resolution** (concurrent edits don't corrupt the document), and **durability** (no edit is ever lost).
**Core architectural decisions:**
| Decision | Choice | Rationale |
| -------------- | ----------------------------------- | --------------------------------------------------- |
| Sync algorithm | OT with server ordering | Avoids TP2 complexity; proven at Google scale |
| Transport | WebSocket | Full-duplex, 1-10ms latency after handshake |
| Persistence | Event-sourced operation log | Enables revision history, undo, and conflict replay |
| Presence | Ephemeral broadcast | Cursors don't need durability; memory-only |
| Offline | Operation queue with reconciliation | Local-first editing, sync on reconnect |
**Key trade-offs accepted:**
- Server dependency for ordering (no true P2P) in exchange for correctness guarantees
- Unbounded operation log growth requiring periodic snapshots
- Higher memory on collaboration servers (one process per active document)
**What this design optimizes:**
- Sub-100ms operation propagation to all connected clients
- Guaranteed convergence regardless of network conditions
- Full revision history with efficient retrieval
## Requirements
### Functional Requirements
| Requirement | Priority | Notes |
| ----------------------------- | -------- | ----------------------------------- |
| Real-time text editing | Core | Character-level granularity |
| Concurrent multi-user editing | Core | 10-50 simultaneous editors |
| Live cursor/selection display | Core | See where others are editing |
| Revision history | Core | View/restore any previous version |
| Rich text formatting | Core | Bold, italic, headings, lists |
| Comments and suggestions | Extended | Anchored to text ranges |
| Offline editing | Extended | Queue operations, sync on reconnect |
| Tables, images, embeds | Extended | Block-level elements |
### Non-Functional Requirements
| Requirement | Target | Rationale |
| ------------------------ | --------------- | ----------------------------------------- |
| Availability | 99.9% (3 nines) | User-facing, but brief outages acceptable |
| Edit propagation latency | p99 < 200ms | Real-time feel requires sub-second |
| Document load time | p99 < 2s | Cold start with full history |
| Concurrent editors | 50 per document | Google Sheets supports ~50; Docs ~10 |
| Operation durability | 99.999% | No edit should ever be lost |
| Revision retention | Indefinite | Full history for compliance |
### Scale Estimation
**Users:**
- Monthly Active Users (MAU): 500M (Google Docs scale)
- Daily Active Users (DAU): 100M (20% of MAU)
- Peak concurrent users: 10M
**Documents:**
- Total documents: 5B
- Active documents (edited in last 30 days): 500M (10%)
- Documents open concurrently at peak: 50M
**Traffic:**
- Operations per active editor: 1-5 per second (typing)
- Average editing session: 15 minutes
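- Peak editing sessions (worst case): 50M open documents × 3 editors avg = 150M. This is a deliberately pessimistic ceiling for capacity planning, since it exceeds the 10M peak concurrent users estimated above
- Operations per second at peak: 150M × 2 ops/sec = 300M ops/sec globally (upper bound)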
**Storage:**
- Average operation size: 100 bytes (insert/delete + metadata)
- Operations per document per day: 10,000 (active document)
- Daily operation storage: 500M docs × 10K ops × 100B = 500TB/day
- With snapshots (daily): 500M × 50KB = 25TB/day
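The arithmetic behind these estimates can be reproduced in a few lines. The sketch below uses only the figures listed above and decimal units (1 TB = 10^12 bytes); the constant names are illustrative.

```typescript
// Back-of-envelope estimates from the figures above (decimal units)
const OPEN_DOCS_AT_PEAK = 50e6
const EDITORS_PER_DOC = 3
const OPS_PER_EDITOR_PER_SEC = 2

const ACTIVE_DOCS = 500e6
const OPS_PER_DOC_PER_DAY = 10_000
const BYTES_PER_OP = 100
const SNAPSHOT_BYTES = 50e3 // 50 KB per daily snapshot

const BYTES_PER_TB = 1e12

const peakOpsPerSec = OPEN_DOCS_AT_PEAK * EDITORS_PER_DOC * OPS_PER_EDITOR_PER_SEC
const opLogTBPerDay = (ACTIVE_DOCS * OPS_PER_DOC_PER_DAY * BYTES_PER_OP) / BYTES_PER_TB
const snapshotTBPerDay = (ACTIVE_DOCS * SNAPSHOT_BYTES) / BYTES_PER_TB
const opLogPBPerYear = (opLogTBPerDay * 365) / 1000

console.log({ peakOpsPerSec, opLogTBPerDay, snapshotTBPerDay, opLogPBPerYear })
```

At roughly 180 PB of raw operation log per year, the log dominates storage, which is why snapshot-based compaction and tiering of old operations matter more than the snapshots themselves.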
## Design Paths
### Path A: Operational Transformation (Server-Ordered)
**Best when:**
- Always-online with reliable connectivity
- Central infrastructure already exists
- Correctness is paramount (financial, legal documents)
- Team has OT implementation experience or uses existing library
**Architecture:**

**Key characteristics:**
- Server assigns canonical operation order
- Clients transform incoming ops against pending local ops
- Single source of truth eliminates TP2 requirement
**Trade-offs:**
- ✅ Proven in production (Google Docs, Wave, CKEditor)
- ✅ Simpler transformation functions (only TP1 needed)
- ✅ Efficient wire format (operations are small)
- ❌ Server round-trip required for each operation batch
- ❌ Limited offline capability (buffer only)
- ❌ Server is single point of failure per document
**Real-world example:** Google Docs has used Jupiter-derived OT since 2010. Every character change is saved as an event in a revision log. The document renders by replaying the log from the start (with periodic checkpoints for performance).
### Path B: CRDT-Based (Decentralized)
**Best when:**
- Offline-first is a critical requirement
- P2P scenarios (no server available)
- Multi-device sync with unreliable networks
- Mathematical convergence proofs required
**Architecture:**

**Key characteristics:**
- Operations commute without server coordination
- Each device maintains full CRDT state
- Convergence guaranteed by mathematical properties
**Trade-offs:**
- ✅ True offline support
- ✅ P2P synchronization possible
- ✅ No server bottleneck
- ❌ Higher memory (tombstones, metadata)
- ❌ Slower document loading (replay history)
- ❌ More complex intent preservation for rich text
**Real-world example:** Figma uses a CRDT-inspired approach with server reconciliation. They deliberately stop short of full CRDT to truncate history and reduce overhead. Yjs and Automerge are pure CRDT implementations used by many collaborative editors.
### Path C: Hybrid (Server-Ordered with CRDT Properties)
**Best when:**
- Need offline support but have server infrastructure
- Want CRDT convergence guarantees with OT efficiency
- Building on modern algorithms (Eg-walker, Fugue)
**Architecture:**
- Store append-only operation DAG (like CRDT)
- Use server for canonical ordering (like OT)
- Merge branches using CRDT-like algorithms
- Free memory when not actively merging
**Trade-offs:**
- ✅ Best of both: efficient steady-state, robust merging
- ✅ Order of magnitude better performance than pure CRDT
- ✅ Supports true offline with branch merging
- ❌ Newest approach, less production validation
- ❌ More complex implementation
**Real-world example:** Figma adopted Eg-walker for their code layers feature (2024). Joseph Gentle and Martin Kleppmann proved it achieves O(n log n) merge complexity versus O(n²) for traditional OT.
### Path Comparison
| Factor | OT (Server) | CRDT | Hybrid |
| ------------------- | --------------- | -------------- | ------------ |
| Correctness proof | Straightforward | Mathematical | Mathematical |
| Offline support | Buffer only | Native | Native |
| Server dependency | Required | Optional | Optional |
| Memory overhead | Low | High | Medium |
| Implementation | Moderate | Complex | Complex |
| Production examples | Google Docs | Notion, Linear | Figma Code |
### This Article's Focus
This article focuses on **Path A (OT with server ordering)** because:
1. It's the most battle-tested approach (15+ years in production at Google)
2. Most use cases have reliable connectivity
3. Simpler to implement correctly
4. Existing libraries (ShareDB, ot.js) provide solid foundations
Path B (CRDT) details are covered in [CRDTs for Collaborative Systems](../../core-distributed-patterns/crdt-for-collaborative-systems/README.md).
## High-Level Design
### Component Overview

### WebSocket Gateway
Manages persistent connections between clients and collaboration servers.
**Responsibilities:**
- Connection lifecycle (connect, heartbeat, disconnect)
- Route messages to document processors
- Broadcast presence updates
- Handle reconnection and state recovery
**Design decisions:**
| Decision | Choice | Rationale |
| ---------------- | ------------------- | ----------------------------------------- |
| Protocol | WebSocket | Full-duplex; 2-14 byte frame header vs. full HTTP headers per request |
| Session affinity | Sticky by document | All editors of a document hit same server |
| Heartbeat | 30 second interval | Detect dead connections |
| Reconnection | Exponential backoff | Prevent thundering herd |
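The reconnection row above can be sketched as exponential backoff with "full jitter", which spreads retries so that clients dropped by the same server failure do not reconnect in lockstep. The function name, the option names, and the 1 s / 30 s values are assumptions for illustration, not values taken from the design.

```typescript
interface BackoffOptions {
  baseMs: number // Delay ceiling for the first retry
  maxMs: number // Cap on any single delay
  random?: () => number // Injectable for tests; defaults to Math.random
}

// Full jitter: pick a uniform delay in [0, min(maxMs, baseMs * 2^attempt))
function reconnectDelayMs(attempt: number, opts: BackoffOptions): number {
  const random = opts.random ?? Math.random
  const ceiling = Math.min(opts.maxMs, opts.baseMs * 2 ** attempt)
  return Math.floor(random() * ceiling)
}

// Example: ceilings of 1s, 2s, 4s, ... capped at 30s (illustrative values)
for (let attempt = 0; attempt < 6; attempt++) {
  console.log(`attempt ${attempt}: wait ${reconnectDelayMs(attempt, { baseMs: 1000, maxMs: 30_000 })}ms`)
}
```

A client would typically reset `attempt` to 0 after a connection stays healthy for a while, so a brief blip does not leave it on a long delay.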
**Scaling approach:**
- Horizontal scaling with consistent hashing by document ID
- One server "owns" each active document
- Ownership transfers on server failure via distributed lock
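A minimal sketch of document-to-server routing with a consistent hash ring follows. It assumes an in-process ring with virtual nodes and a dependency-free FNV-1a hash; the class name, server names, and virtual-node count are hypothetical, and the distributed lock used for ownership transfer is out of scope here.

```typescript
// FNV-1a 32-bit hash: small and dependency-free. A production ring would
// likely use a stronger hash (assumption for this sketch).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash
}

class HashRing {
  private ring: { point: number; server: string }[] = []
  private readonly virtualNodes: number

  constructor(servers: string[], virtualNodes = 100) {
    this.virtualNodes = virtualNodes
    for (const server of servers) this.add(server)
  }

  add(server: string): void {
    for (let v = 0; v < this.virtualNodes; v++) {
      this.ring.push({ point: fnv1a(`${server}#${v}`), server })
    }
    this.ring.sort((a, b) => a.point - b.point)
  }

  remove(server: string): void {
    this.ring = this.ring.filter((node) => node.server !== server)
  }

  // Owner is the first virtual node clockwise from the document's hash
  ownerOf(documentId: string): string {
    if (this.ring.length === 0) throw new Error("No servers in ring")
    const h = fnv1a(documentId)
    let lo = 0
    let hi = this.ring.length
    while (lo < hi) {
      const mid = (lo + hi) >>> 1
      if (this.ring[mid].point < h) lo = mid + 1
      else hi = mid
    }
    return this.ring[lo % this.ring.length].server // Wrap around past the last node
  }
}

const collabRing = new HashRing(["collab-1", "collab-2", "collab-3"])
console.log(collabRing.ownerOf("doc_abc123"))
```

When a server is removed, only the documents it owned move to a new owner; every other document keeps its server, which limits how many in-memory document states must be rebuilt after a failure.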
### Document Processor (OT Engine)
The core synchronization component that transforms and orders operations.
**State per active document:**
```typescript
interface DocumentState {
documentId: string
revision: number // Monotonic operation counter
content: DocumentContent // Current document state
pendingOps: Map<string, Operation[]> // Ops awaiting transform (assumed keyed by client ID)
clients: Map<string, ClientState> // Connected clients, keyed by client ID
}
interface ClientState {
clientId: string
lastAckedRevision: number
cursor: CursorPosition | null
color: string // For presence display
}
```
**Operation flow:**
1. **Receive**: Client sends operation with base revision
2. **Validate**: Check revision is not stale beyond buffer
3. **Transform**: Transform against all operations since base revision
4. **Apply**: Update document state
5. **Persist**: Write to operation log
6. **Broadcast**: Send transformed operation to all clients
**Memory management:**
- Keep document state in memory while active
- Evict after 5 minutes of inactivity
- Reload from latest snapshot + recent operations
### Presence Service
Handles ephemeral state: cursors, selections, user indicators.
**Design decisions:**
- **No persistence**: Presence is reconstructed on reconnect
- **Throttled broadcast**: Max 20 updates/second per client
- **Coalesced updates**: Batch cursor movements before broadcast
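The throttling and coalescing decisions above can be sketched as a small per-client buffer that keeps only the latest cursor position and releases at most one update per interval. The class and method names are hypothetical, and the clock is passed in explicitly so the behavior is easy to test.

```typescript
interface CursorUpdate {
  anchor: number
  head: number
}

class PresenceThrottle {
  private lastSentAt = -Infinity
  private pending: CursorUpdate | null = null
  private readonly minIntervalMs: number

  constructor(maxPerSecond = 20) {
    this.minIntervalMs = 1000 / maxPerSecond // 20/s gives one update per 50ms
  }

  // Record a cursor move. Returns the update to broadcast now, or null if it
  // was coalesced into the pending slot (only the latest position is kept).
  offer(update: CursorUpdate, nowMs: number): CursorUpdate | null {
    this.pending = update
    return this.flush(nowMs)
  }

  // Call from a timer; emits the pending update once the interval has elapsed.
  flush(nowMs: number): CursorUpdate | null {
    if (this.pending === null || nowMs - this.lastSentAt < this.minIntervalMs) {
      return null
    }
    const out = this.pending
    this.pending = null
    this.lastSentAt = nowMs
    return out
  }
}

const throttle = new PresenceThrottle(20)
console.log(throttle.offer({ anchor: 1, head: 1 }, 0)) // Sent immediately
console.log(throttle.offer({ anchor: 2, head: 2 }, 10)) // Coalesced
console.log(throttle.flush(50)) // Latest position released after 50ms
```

Dropping intermediate positions is safe here because presence is ephemeral: only the most recent cursor location matters to other editors.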
**Data structure:**
```typescript
interface PresenceUpdate {
clientId: string
documentId: string
cursor: {
anchor: number // Selection start (character position)
head: number // Selection end (cursor position)
} | null
user: {
id: string
name: string
avatar: string
color: string // Assigned per-document
}
timestamp: number
}
```
### Document API
Handles document CRUD, access control, and version retrieval.
**Endpoints:**
| Endpoint | Method | Purpose |
| ----------------------------- | ------ | ------------------------------------------- |
| `/documents` | POST | Create document |
| `/documents/{id}` | GET | Load document (latest or specific revision) |
| `/documents/{id}/operations` | GET | Fetch operation range for history |
| `/documents/{id}/snapshot` | POST | Create manual snapshot |
| `/documents/{id}/revisions` | GET | List revision metadata |
| `/documents/{id}/permissions` | PUT | Update access control |
## API Design
### WebSocket Protocol
#### Client → Server Messages
**Send Operation:**
```json
{
"type": "operation",
"documentId": "doc_abc123",
"clientId": "client_xyz",
"baseRevision": 142,
"operation": {
"ops": [{ "retain": 50 }, { "insert": "Hello, " }, { "retain": 100 }, { "delete": 5 }]
},
"timestamp": 1706886400000
}
```
**Update Presence:**
```json
{
"type": "presence",
"documentId": "doc_abc123",
"cursor": { "anchor": 150, "head": 150 },
"selection": null
}
```
#### Server → Client Messages
**Operation Acknowledgment:**
```json
{
"type": "ack",
"documentId": "doc_abc123",
"revision": 143,
"transformedOp": { ... }
}
```
**Broadcast Operation (to other clients):**
```json
{
"type": "remote_operation",
"documentId": "doc_abc123",
"clientId": "client_other",
"revision": 143,
"operation": { ... },
"user": {
"id": "user_123",
"name": "Alice"
}
}
```
**Presence Broadcast:**
```json
{
"type": "remote_presence",
"documentId": "doc_abc123",
"presences": [
{
"clientId": "client_other",
"cursor": { "anchor": 200, "head": 210 },
"user": { "id": "user_123", "name": "Alice", "color": "#4285f4" }
}
]
}
```
### REST API
#### Create Document
**Endpoint:** `POST /api/v1/documents`
**Request:**
```json
{
"title": "Untitled Document",
"content": "",
"folderId": "folder_abc",
"templateId": "template_xyz"
}
```
**Response (201 Created):**
```json
{
"id": "doc_abc123",
"title": "Untitled Document",
"revision": 0,
"createdAt": "2024-02-03T10:00:00Z",
"createdBy": {
"id": "user_123",
"name": "Alice"
},
"permissions": {
"owner": "user_123",
"editors": [],
"viewers": []
},
"wsEndpoint": "wss://collab.example.com/ws/doc_abc123"
}
```
#### Load Document
**Endpoint:** `GET /api/v1/documents/{id}?revision={optional}`
**Response (200 OK):**
```json
{
"id": "doc_abc123",
"title": "Project Proposal",
"revision": 1542,
"content": {
"type": "doc",
"content": [
{
"type": "heading",
"attrs": { "level": 1 },
"content": [{ "type": "text", "text": "Introduction" }]
},
{
"type": "paragraph",
"content": [{ "type": "text", "text": "..." }]
}
]
},
"snapshot": {
"revision": 1500,
"createdAt": "2024-02-03T09:00:00Z"
},
"pendingOperations": 42,
"collaborators": [{ "id": "user_456", "name": "Bob", "online": true }]
}
```
#### List Revisions
**Endpoint:** `GET /api/v1/documents/{id}/revisions?limit=50&before={revision}`
**Response (200 OK):**
```json
{
"revisions": [
{
"revision": 1542,
"timestamp": "2024-02-03T10:30:00Z",
"user": { "id": "user_123", "name": "Alice" },
"summary": "Edited section 3",
"operationCount": 15
},
{
"revision": 1500,
"timestamp": "2024-02-03T09:00:00Z",
"user": { "id": "user_456", "name": "Bob" },
"summary": "Added introduction",
"operationCount": 203,
"isSnapshot": true
}
],
"hasMore": true,
"nextCursor": "rev_1499"
}
```
### Error Responses
| Code | Error | When |
| ---- | ------------------- | --------------------------- |
| 400 | `INVALID_OPERATION` | Operation format invalid |
| 409 | `REVISION_CONFLICT` | Base revision too old |
| 410 | `DOCUMENT_DELETED` | Document was deleted |
| 423 | `DOCUMENT_LOCKED` | Document temporarily locked |
| 429 | `RATE_LIMITED` | Too many operations |
**Revision conflict handling:**
```json
{
"error": "REVISION_CONFLICT",
"message": "Base revision 100 is too old. Current: 150",
"currentRevision": 150,
"missingOperations": "/api/v1/documents/doc_abc/operations?from=100&to=150"
}
```
The client must fetch the missing operations, transform its local pending operations against them, and retry.
## Data Modeling
### Document Metadata (PostgreSQL)
```sql
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
title TEXT NOT NULL,
owner_id UUID NOT NULL REFERENCES users(id),
folder_id UUID REFERENCES folders(id),
current_revision BIGINT DEFAULT 0,
latest_snapshot_revision BIGINT,
content_type VARCHAR(50) DEFAULT 'rich_text',
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
deleted_at TIMESTAMPTZ,
-- Denormalized for read performance
collaborator_count INT DEFAULT 0,
word_count INT DEFAULT 0,
last_edited_by UUID REFERENCES users(id),
last_edited_at TIMESTAMPTZ
);
-- Access control
CREATE TABLE document_permissions (
document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
user_id UUID REFERENCES users(id),
role VARCHAR(20) NOT NULL, -- 'owner', 'editor', 'commenter', 'viewer'
granted_at TIMESTAMPTZ DEFAULT NOW(),
granted_by UUID REFERENCES users(id),
PRIMARY KEY (document_id, user_id)
);
CREATE INDEX idx_documents_owner ON documents(owner_id, updated_at DESC);
CREATE INDEX idx_documents_folder ON documents(folder_id, updated_at DESC);
CREATE INDEX idx_permissions_user ON document_permissions(user_id);
```
### Operation Log (DynamoDB)
**Table design for append-heavy workload:**
| Partition Key | Sort Key | Attributes |
| ------------- | ---------- | ------------------------------------------------------------ |
| `document_id` | `revision` | `operation`, `client_id`, `user_id`, `timestamp`, `checksum` |
**Schema:**
```json
{
"document_id": "doc_abc123",
"revision": 1542,
"operation": {
"ops": [{ "retain": 50 }, { "insert": "Hello" }]
},
"client_id": "client_xyz",
"user_id": "user_123",
"timestamp": 1706886400000,
"checksum": "sha256:abc123...",
"ttl": null
}
```
**Why DynamoDB:**
- Append-only workload (write-optimized)
- Predictable latency at scale
- Built-in TTL for old operations (after snapshot)
- Efficient range queries by revision (the sort key)
**Capacity planning:**
- Write capacity: 300M ops/sec globally → partition across documents
- Single document: 100 ops/sec max (50 editors × 2 ops/sec)
- Read capacity: Burst on document load, then minimal
### Snapshots (S3)
**Naming convention:**
```
s3://doc-snapshots/{document_id}/{revision}.json.gz
```
**Snapshot content:**
```json
{
"documentId": "doc_abc123",
"revision": 1500,
"createdAt": "2024-02-03T09:00:00Z",
"content": {
"type": "doc",
"content": [...]
},
"metadata": {
"wordCount": 5420,
"characterCount": 32150,
"imageCount": 12
},
"checksum": "sha256:..."
}
```
**Snapshot strategy:**
- Create snapshot every 1000 operations
- Or every 1 hour of activity
- Or on manual request (revision history view)
- Keep all snapshots for compliance
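The snapshot triggers above reduce to a small predicate. The sketch below assumes the processor tracks the operation count and elapsed time since the last snapshot; the field names are hypothetical, and skipping a snapshot when no new operations exist is a choice made for this example.

```typescript
interface SnapshotInput {
  opsSinceSnapshot: number // Operations applied since the last snapshot
  msSinceSnapshot: number // Time elapsed since the last snapshot
  manualRequest: boolean // e.g. the user opened the revision history view
}

const SNAPSHOT_EVERY_OPS = 1000
const SNAPSHOT_EVERY_MS = 60 * 60 * 1000 // 1 hour

function shouldSnapshot(input: SnapshotInput): boolean {
  // Nothing new to capture: the latest snapshot is already current
  if (input.opsSinceSnapshot === 0) return false
  return (
    input.manualRequest ||
    input.opsSinceSnapshot >= SNAPSHOT_EVERY_OPS ||
    input.msSinceSnapshot >= SNAPSHOT_EVERY_MS
  )
}

console.log(shouldSnapshot({ opsSinceSnapshot: 1000, msSinceSnapshot: 5_000, manualRequest: false }))
```

Because every trigger requires at least one new operation, an idle document never produces duplicate snapshots, which keeps the "keep all snapshots" retention rule from inflating storage.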
### Active Document Cache (Redis)
**Data structures:**
```redis
# Document state (hash)
HSET doc:{id}:state
revision 1542
content "{serialized_content}"
last_updated 1706886400000
# Connected clients (sorted set by last activity)
ZADD doc:{id}:clients {timestamp} {client_id}
# Pending operations queue (list)
RPUSH doc:{id}:pending "{operation_json}"
# Presence (hash with TTL per client)
HSET doc:{id}:presence:{client_id}
cursor_anchor 150
cursor_head 150
user_name "Alice"
user_color "#4285f4"
EXPIRE doc:{id}:presence:{client_id} 60
```
**Eviction policy:**
- Documents evicted after 5 minutes of no activity
- Presence entries auto-expire after 60 seconds without refresh
## Low-Level Design
### OT Transformation Engine
#### Operation Format
Operations use a format similar to Quill Delta / Google Wave:
```typescript
type Operation = {
ops: (RetainOp | InsertOp | DeleteOp)[]
}
type RetainOp = {
retain: number
attributes?: Record<string, unknown> // For formatting changes
}
type InsertOp = {
insert: string | { image: string } | { embed: any }
attributes?: Record<string, unknown>
}
type DeleteOp = {
delete: number
}
```
**Example operations:**
```typescript
// Insert "Hello" at position 0
{
ops: [{ insert: "Hello" }]
}
// Delete 3 characters at position 10
{
ops: [{ retain: 10 }, { delete: 3 }]
}
// Bold 5 characters starting at position 5 (positions 5-9)
{
ops: [{ retain: 5 }, { retain: 5, attributes: { bold: true } }]
}
```
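To show how such an operation is interpreted, here is a minimal sketch that applies one to a plain string. It is a simplification: it handles text only, ignores `attributes` and embeds, and uses its own `TextOp` type and `applyTextOperation` name rather than the full `Operation` type above.

```typescript
// Plain-text subset of the operation format (assumption: no attributes or embeds)
type TextOp = { retain: number } | { insert: string } | { delete: number }

function applyTextOperation(doc: string, ops: TextOp[]): string {
  let cursor = 0 // Position in the original document
  let out = ""
  for (const op of ops) {
    if ("retain" in op) {
      if (cursor + op.retain > doc.length) throw new Error("retain past end of document")
      out += doc.slice(cursor, cursor + op.retain)
      cursor += op.retain
    } else if ("insert" in op) {
      out += op.insert // Inserts do not advance the cursor in the original
    } else {
      if (cursor + op.delete > doc.length) throw new Error("delete past end of document")
      cursor += op.delete // Skip the deleted span
    }
  }
  return out + doc.slice(cursor) // Implicit trailing retain
}

console.log(applyTextOperation("Hello world", [{ retain: 6 }, { insert: "big " }]))
console.log(applyTextOperation("0123456789abcdef", [{ retain: 10 }, { delete: 3 }]))
```

The implicit trailing retain is what lets short operations such as `[{ retain: 10 }, { delete: 3 }]` omit the rest of the document.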
#### Transformation Functions
```typescript collapse={1-10}
function transform(op1: Operation, op2: Operation, priority: "left" | "right"): [Operation, Operation] {
// op1' = transform(op1, op2) - op1 transformed against op2
// op2' = transform(op2, op1) - op2 transformed against op1
// Guarantee: apply(apply(doc, op1), op2') === apply(apply(doc, op2), op1')
const ops1 = [...op1.ops]
const ops2 = [...op2.ops]
const result1: Op[] = []
const result2: Op[] = []
let i1 = 0,
i2 = 0
while (i1 < ops1.length || i2 < ops2.length) {
const o1 = ops1[i1]
const o2 = ops2[i2]
// Case: insert vs anything - inserts go first
if (o1 && "insert" in o1) {
if (priority === "left") {
result2.push({ retain: insertLength(o1) })
result1.push(o1)
i1++
continue
}
}
if (o2 && "insert" in o2) {
result1.push({ retain: insertLength(o2) })
result2.push(o2)
i2++
continue
}
// Case: retain vs retain
if (o1 && "retain" in o1 && o2 && "retain" in o2) {
const len = Math.min(o1.retain, o2.retain)
result1.push({ retain: len, attributes: o1.attributes })
result2.push({ retain: len, attributes: o2.attributes })
consumeLength(ops1, i1, len)
consumeLength(ops2, i2, len)
continue
}
// Case: delete vs retain
if (o1 && "delete" in o1 && o2 && "retain" in o2) {
const len = Math.min(o1.delete, o2.retain)
result1.push({ delete: len })
// o2 doesn't produce output - deleted content
consumeLength(ops1, i1, len)
consumeLength(ops2, i2, len)
continue
}
// Case: retain vs delete
if (o1 && "retain" in o1 && o2 && "delete" in o2) {
const len = Math.min(o1.retain, o2.delete)
// o1 doesn't produce output - deleted content
result2.push({ delete: len })
consumeLength(ops1, i1, len)
consumeLength(ops2, i2, len)
continue
}
// Case: delete vs delete - both delete same content
if (o1 && "delete" in o1 && o2 && "delete" in o2) {
const len = Math.min(o1.delete, o2.delete)
// Neither produces output - already deleted
consumeLength(ops1, i1, len)
consumeLength(ops2, i2, len)
continue
}
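// No case matched: one side is exhausted while the other still retains or deletes,
// so the two operations were not built against the same document length
throw new Error("transform: operations have mismatched base lengths")
}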
return [{ ops: result1 }, { ops: result2 }]
}
```
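The convergence guarantee stated in the comments above (both application orders reach the same document) can be checked on a reduced case. The sketch below is written in Python for brevity and assumes a much simpler operation model than the retain/insert/delete lists used here: each operation is a single insert `(position, text)`. The names `transform_insert`, `apply_insert`, and `converge` are illustrative, not part of any library.

```python
def transform_insert(op, other, op_wins_ties):
    """Rewrite `op` so it can be applied after `other` has already been applied."""
    pos, text = op
    other_pos, other_text = other
    # `other` landed before us (or at the same spot and wins the tie): shift right
    if other_pos < pos or (other_pos == pos and not op_wins_ties):
        return (pos + len(other_text), text)
    return op


def apply_insert(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]


def converge(doc, a, b):
    """Apply a and b in both orders, transforming whichever op goes second."""
    a_after_b = transform_insert(a, b, op_wins_ties=True)
    b_after_a = transform_insert(b, a, op_wins_ties=False)
    via_a = apply_insert(apply_insert(doc, a), b_after_a)
    via_b = apply_insert(apply_insert(doc, b), a_after_b)
    return via_a, via_b
```

The two calls to `transform_insert` use opposite tie-break flags. If both sides claimed the tie, two inserts at the same position would end up in different orders on different replicas.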
#### Server-Side Processing
```typescript collapse={1-15}
class DocumentProcessor {
private state: DocumentState
private opLog: OperationLog
private broadcaster: Broadcaster
async processOperation(clientId: string, baseRevision: number, operation: Operation): Promise<{ revision: number; transformedOp: Operation }> {
// 1. Validate base revision
if (baseRevision < this.state.revision - MAX_REVISION_LAG) {
throw new RevisionConflictError(this.state.revision)
}
// 2. Transform against all operations since base
let transformedOp = operation
for (let rev = baseRevision + 1; rev <= this.state.revision; rev++) {
const serverOp = await this.opLog.getOperation(this.state.documentId, rev)
;[transformedOp] = transform(transformedOp, serverOp, "right")
}
// 3. Apply to document state
const newContent = applyOperation(this.state.content, transformedOp)
const newRevision = this.state.revision + 1
// 4. Persist operation (async, but before ack)
await this.opLog.append({
documentId: this.state.documentId,
revision: newRevision,
operation: transformedOp,
clientId,
timestamp: Date.now(),
})
// 5. Update in-memory state
this.state.content = newContent
this.state.revision = newRevision
// 6. Broadcast to other clients
this.broadcaster.broadcastOperation(
this.state.documentId,
clientId, // Exclude sender
newRevision,
transformedOp,
)
// 7. Return acknowledgment
return {
revision: newRevision,
transformedOp,
}
}
}
```
### Client-Side State Machine
```typescript collapse={1-12}
type ClientOTState =
| { type: "synchronized"; serverRevision: number }
| { type: "awaitingAck"; serverRevision: number; pending: Operation }
| { type: "awaitingWithBuffer"; serverRevision: number; pending: Operation; buffer: Operation }
class ClientOT {
private state: ClientOTState = { type: "synchronized", serverRevision: 0 }
private document: DocumentContent
onLocalEdit(operation: Operation): void {
switch (this.state.type) {
case "synchronized":
// Send immediately
this.sendToServer(operation, this.state.serverRevision)
this.state = {
type: "awaitingAck",
serverRevision: this.state.serverRevision,
pending: operation,
}
break
case "awaitingAck":
// Buffer - compose with existing buffer or create new
this.state = {
type: "awaitingWithBuffer",
serverRevision: this.state.serverRevision,
pending: this.state.pending,
buffer: operation,
}
break
case "awaitingWithBuffer":
// Compose into buffer
this.state = {
...this.state,
buffer: compose(this.state.buffer, operation),
}
break
}
// Apply locally immediately
this.document = applyOperation(this.document, operation)
}
onServerAck(revision: number): void {
switch (this.state.type) {
case "awaitingAck":
this.state = { type: "synchronized", serverRevision: revision }
break
case "awaitingWithBuffer":
// Send buffered operations
this.sendToServer(this.state.buffer, revision)
this.state = {
type: "awaitingAck",
serverRevision: revision,
pending: this.state.buffer,
}
break
}
}
onRemoteOperation(revision: number, operation: Operation): void {
// Transform the remote op past our unacknowledged local ops (pending, then buffer),
// and transform those local ops past the remote op in the same call.
// Priority is "right" so the server's op wins ties, matching the server-side transform.
let transformedRemote = operation
if (this.state.type === "awaitingAck" || this.state.type === "awaitingWithBuffer") {
const [newPending, remoteAfterPending] = transform(this.state.pending, transformedRemote, "right")
transformedRemote = remoteAfterPending
this.state = { ...this.state, pending: newPending }
}
if (this.state.type === "awaitingWithBuffer") {
// The buffer sits on top of pending, so it is transformed against the remote op
// as it looks after pending, not against the original remote op
const [newBuffer, remoteAfterBuffer] = transform(this.state.buffer, transformedRemote, "right")
transformedRemote = remoteAfterBuffer
this.state = { ...this.state, buffer: newBuffer }
}
// The server has advanced to this revision
this.state = { ...this.state, serverRevision: revision }
// Apply transformed remote operation
this.document = applyOperation(this.document, transformedRemote)
}
}
```
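The key property of this state machine is that a client never has more than one unacknowledged operation on the wire. The sketch below isolates just that rule, in Python for brevity. It assumes operations are plain lists and uses list concatenation as a stand-in for `compose`; the class and method names are hypothetical.

```python
class OneInFlightClient:
    """Tracks what a client may send: at most one unacknowledged operation."""

    def __init__(self):
        self.pending = None  # operation sent, not yet acknowledged
        self.buffer = None   # local edits composed while waiting
        self.sent = []       # what actually went over the wire, in order

    @property
    def state(self):
        if self.pending is None:
            return "synchronized"
        return "awaitingAck" if self.buffer is None else "awaitingWithBuffer"

    def local_edit(self, op):
        if self.pending is None:
            self._send(op)
        elif self.buffer is None:
            self.buffer = list(op)
        else:
            self.buffer = self.buffer + list(op)  # stand-in for compose()

    def server_ack(self):
        buffered, self.pending, self.buffer = self.buffer, None, None
        if buffered is not None:
            self._send(buffered)

    def _send(self, op):
        self.pending = list(op)
        self.sent.append(self.pending)
```

However many edits arrive while waiting, they leave the client as a single composed operation after the acknowledgment, which keeps the server-side transform loop short.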
### Snapshot and Compaction
#### Snapshot Worker
```typescript collapse={1-8}
class SnapshotWorker {
private readonly SNAPSHOT_THRESHOLD = 1000 // Operations since last snapshot
private readonly SNAPSHOT_INTERVAL_MS = 3600000 // 1 hour
async processDocument(documentId: string): Promise<void> {
const doc = await this.documentStore.getMetadata(documentId)
const latestSnapshot = await this.snapshotStore.getLatest(documentId)
const opsSinceSnapshot = doc.currentRevision - (latestSnapshot?.revision ?? 0)
const timeSinceSnapshot = Date.now() - (latestSnapshot?.createdAt ?? 0)
if (opsSinceSnapshot < this.SNAPSHOT_THRESHOLD && timeSinceSnapshot < this.SNAPSHOT_INTERVAL_MS) {
return // No snapshot needed
}
// Build document state
let content = latestSnapshot?.content ?? emptyDocument()
const operations = await this.opLog.getRange(documentId, (latestSnapshot?.revision ?? 0) + 1, doc.currentRevision)
for (const op of operations) {
content = applyOperation(content, op.operation)
}
// Store snapshot
await this.snapshotStore.create({
documentId,
revision: doc.currentRevision,
content,
createdAt: Date.now(),
})
// Mark old operations for TTL expiry (keep last 100 for recent history)
await this.opLog.setTTL(documentId, 0, doc.currentRevision - 100, TTL_30_DAYS)
}
}
```
## Frontend Considerations
### Editor Integration
**Rich text editors with OT support:**
| Editor | OT/CRDT Support | Notes |
| ----------- | ----------------- | -------------------------- |
| ProseMirror | Steps (OT-like)   | Used by Atlassian, NYT     |
| Slate | Plugin-based | Flexible, needs OT library |
| Quill | Delta format | Native OT support |
| TipTap | ProseMirror-based | Modern API |
**Integration pattern (ProseMirror example):**
```typescript collapse={1-15}
class CollaborativeEditor {
private view: EditorView
private otClient: ClientOT
private ws: WebSocket
constructor(container: HTMLElement, documentId: string) {
// Initialize OT client
this.otClient = new ClientOT()
// Connect WebSocket
this.ws = new WebSocket(`wss://collab.example.com/ws/${documentId}`)
this.ws.onmessage = this.handleServerMessage.bind(this)
// Initialize editor with collaboration plugin
this.view = new EditorView(container, {
state: EditorState.create({
plugins: [collab({ version: 0 }), this.cursorPlugin(), this.presencePlugin()],
}),
dispatchTransaction: this.handleLocalChange.bind(this),
})
}
private handleLocalChange(tr: Transaction): void {
const newState = this.view.state.apply(tr)
this.view.updateState(newState)
if (tr.docChanged) {
// Convert ProseMirror steps to OT operations
const steps = sendableSteps(newState)
if (steps) {
const operation = stepsToOperation(steps.steps)
// ClientOT owns the wire protocol: it sends immediately when synchronized and
// buffers while an operation is in flight, so the editor must not send a second copy
this.otClient.onLocalEdit(operation)
}
}
}
}
```
### Presence Rendering
**Cursor overlay approach:**
```typescript collapse={1-20}
interface RemoteCursor {
clientId: string
user: { name: string; color: string }
anchor: number
head: number
}
class CursorOverlay {
private cursors: Map<string, RemoteCursor> = new Map()
updateCursor(cursor: RemoteCursor): void {
this.cursors.set(cursor.clientId, cursor)
this.render()
}
removeCursor(clientId: string): void {
this.cursors.delete(clientId)
this.render()
}
private render(): void {
// Convert character positions to screen coordinates
for (const [clientId, cursor] of this.cursors) {
const coords = this.positionToCoords(cursor.head)
// Render cursor caret
this.renderCaret(clientId, coords, cursor.user.color)
// Render selection highlight if anchor !== head
if (cursor.anchor !== cursor.head) {
this.renderSelection(clientId, cursor.anchor, cursor.head, cursor.user.color)
}
// Render name label
this.renderNameLabel(clientId, coords, cursor.user)
}
}
}
```
**Performance optimizations:**
| Technique | Purpose | Implementation |
| ------------------------- | ----------------------- | --------------------------- |
| Throttle cursor updates | Reduce network traffic | Max 20 updates/sec |
| Batch presence broadcasts | Reduce message count | Collect 50ms, send batch |
| Use CSS transforms | Avoid layout thrashing | `transform: translate()` |
| Virtual cursor layer | Don't modify editor DOM | Absolute positioned overlay |
### Offline Support
**Operation queue for offline editing:**
```typescript collapse={1-10}
class OfflineQueue {
private db: IDBDatabase
private queueName = "pendingOperations"
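async enqueue(documentId: string, operation: Operation): Promise<void> {
const tx = this.db.transaction(this.queueName, "readwrite")
const store = tx.objectStore(this.queueName)
store.add({
documentId,
operation,
timestamp: Date.now(),
id: crypto.randomUUID(),
})
// IndexedDB requests are event-based, not promises: resolve once the transaction commits
await new Promise<void>((resolve, reject) => {
tx.oncomplete = () => resolve()
tx.onerror = () => reject(tx.error)
})
}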
async syncPending(documentId: string): Promise<void> {
const pending = await this.getPending(documentId)
for (const item of pending) {
try {
await this.sendOperation(item)
await this.remove(item.id)
} catch (e) {
if (e instanceof RevisionConflictError) {
// Fetch missing ops, transform, retry
await this.handleConflict(documentId, item)
} else {
throw e
}
}
}
}
}
```
## Infrastructure
### Cloud-Agnostic Components
| Component | Purpose | Options |
| ----------------- | --------------------- | ----------------------------- |
| WebSocket Gateway | Real-time connections | Nginx, HAProxy, Envoy |
| Message Queue | Operation streaming | Kafka, RabbitMQ, NATS |
| KV Store | Active document state | Redis, Memcached, KeyDB |
| Document Store | Operation log | Cassandra, ScyllaDB, DynamoDB |
| Object Store | Snapshots, media | MinIO, Ceph, S3-compatible |
| Relational DB | Metadata, ACL | PostgreSQL, CockroachDB |
### AWS Reference Architecture

**Service configurations:**
| Service | Configuration | Rationale |
| ---------------------- | ------------------------------ | -------------------------------- |
| WebSocket (Fargate) | 4 vCPU, 8GB RAM | Memory for active documents |
| API (Fargate) | 2 vCPU, 4GB RAM | Stateless, scale on traffic |
| Workers (Fargate Spot) | 2 vCPU, 4GB RAM | Cost optimization for async work |
| ElastiCache | r6g.xlarge cluster | Sub-ms latency for hot documents |
| RDS PostgreSQL | db.r6g.2xlarge Multi-AZ | Metadata queries, ACL |
| DynamoDB | On-demand | Predictable per-op pricing |
| S3 | Standard + Intelligent-Tiering | Hot snapshots, cold history |
### Scaling Considerations
**WebSocket connection limits:**
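- Single server: ~65K connections is a conservative planning figure; the actual ceiling comes from the file descriptor limit (`ulimit -n`) and per-connection memory, both of which can be raised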
- Solution: Consistent hashing by document ID across server pool
- Active documents per server: ~10K (memory constrained)
**Document processor memory:**
- Average document state: 100KB
- Active document with history buffer: 500KB
- 8GB server → ~16K active documents max
**Operation log partitioning:**
- DynamoDB partition key: document_id
- Hot partition: a single partition sustains at most 1,000 WCU (and 3,000 RCU)
- Solution: Document sharding if single doc exceeds limits (rare)
## Conclusion
This design provides real-time collaborative document editing with:
1. **Sub-200ms operation propagation** via WebSocket and server-ordered OT
2. **Strong convergence guarantees** without P2P complexity
3. **Full revision history** through event-sourced operation log
4. **Offline resilience** with IndexedDB operation queue and conflict resolution
**Key architectural decisions:**
- Server-ordered OT eliminates TP2 correctness concerns
- Periodic snapshots bound operation replay cost
- Ephemeral presence avoids persistence overhead for cursors
- Per-document process isolation simplifies scaling
**Known limitations:**
- Server dependency for real-time sync (no true P2P)
- Memory pressure at high concurrent editor counts
- Snapshot creation adds latency for very active documents
**Future enhancements:**
- Hybrid OT/CRDT for better offline support (Eg-walker approach)
- Incremental snapshot deltas to reduce storage
- Smarter presence coalescing for large collaborator counts
## Appendix
### Prerequisites
- Distributed systems fundamentals (eventual consistency, vector clocks)
- Real-time communication patterns (WebSocket, SSE)
- Event sourcing concepts
- Understanding of OT or CRDTs (see related articles)
### Terminology
| Term | Definition |
| ------------- | ----------------------------------------------------------------------------- |
| **OT** | Operational Transformation - algorithm for transforming concurrent operations |
| **TP1/TP2** | Transformation properties ensuring convergence |
| **Revision** | Monotonic counter representing document state version |
| **Operation** | Atomic change to document (insert, delete, format) |
| **Snapshot** | Full document state at a specific revision |
| **Presence** | Ephemeral state like cursors and selections |
| **Tombstone** | Marker for deleted content in CRDT systems |
### Summary
- Real-time collaborative editing requires **synchronization algorithms** (OT or CRDT), **presence broadcasting**, and **event-sourced persistence**
- **Server-ordered OT** dominates production use (Google Docs, CKEditor) because it avoids TP2 correctness issues
- **WebSocket** provides full-duplex communication with 1-10ms latency after handshake
- **Operation log + periodic snapshots** enables full revision history while bounding replay cost
- **Presence is ephemeral**—cursors and selections stored in memory only, reconstructed on reconnect
- Scale to 50 concurrent editors per document with ~200ms operation propagation latency
### References
**Architecture and Implementation:**
- [How Figma's Multiplayer Technology Works](https://www.figma.com/blog/how-figmas-multiplayer-technology-works/) - Figma Engineering Blog
- [Making Multiplayer More Reliable](https://www.figma.com/blog/making-multiplayer-more-reliable/) - Figma transaction journal design
- [Realtime Editing of Ordered Sequences](https://www.figma.com/blog/realtime-editing-of-ordered-sequences/) - Fractional indexing at Figma
- [The Data Model Behind Notion](https://www.notion.com/blog/data-model-behind-notion) - Block-based architecture
- [Sharding Postgres at Notion](https://www.notion.com/blog/sharding-postgres-at-notion) - Database scaling patterns
- [Scaling the Linear Sync Engine](https://linear.app/now/scaling-the-linear-sync-engine) - Local-first sync architecture
**Operational Transformation:**
- [Apache Wave OT Whitepaper](https://svn.apache.org/repos/asf/incubator/wave/whitepapers/operational-transform/operational-transform.html) - Detailed protocol specification
- [Google Drive Blog: What's Different About New Google Docs](https://drive.googleblog.com/2010/09/whats-different-about-new-google-docs.html) - Architecture overview
- [Lessons Learned from CKEditor 5](https://ckeditor.com/blog/lessons-learned-from-creating-a-rich-text-editor-with-real-time-collaboration/) - Production OT for rich text
**Algorithms and Research:**
- [Eg-walker: Collaborative Text Editing](https://arxiv.org/abs/2409.14252) - Gentle & Kleppmann, EuroSys 2025
- [Real Differences between OT and CRDT](https://dl.acm.org/doi/10.1145/3375186) - ACM 2020 comparison
- [Performance of Real-Time Collaborative Editors at Large Scale](https://inria.hal.science/hal-01351229v1/document) - Scaling analysis
**Related Articles:**
- [Operational Transformation](../../core-distributed-patterns/operational-transformation/README.md) - Deep dive into OT algorithms
- [CRDTs for Collaborative Systems](../../core-distributed-patterns/crdt-for-collaborative-systems/README.md) - Alternative approach for offline-first
---
## Design Dropbox File Sync
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/system-design-problems/design-dropbox-file-sync
**Category:** System Design / System Design Problems
**Description:** A system design for a file synchronization service that keeps files consistent across multiple devices. This design addresses the core challenges of efficient data transfer, conflict resolution, and real-time synchronization at scale—handling 500+ petabytes of data across 700 million users.
# Design Dropbox File Sync
A system design for a file synchronization service that keeps files consistent across multiple devices. This design addresses the core challenges of efficient data transfer, conflict resolution, and real-time synchronization at scale—handling 500+ petabytes of data across 700 million users.

High-level architecture: clients sync through API gateway to metadata and block services, with real-time notifications via message queue.
## Abstract
File sync is fundamentally a **distributed state reconciliation** problem with three key insights:
1. **Content-defined chunking** breaks files at content-determined boundaries (not fixed offsets), so insertions don't invalidate all subsequent chunks—enabling ~90% deduplication across file versions
2. **Three-tree model** (local, remote, synced) provides an unambiguous merge base to determine change direction without conflicts
3. **Block-level addressing** (content hash as ID) makes upload idempotent and enables cross-user deduplication at petabyte scale
The critical tradeoff: **eventual consistency with conflict preservation**. Rather than complex merge algorithms, create "conflicted copies" when concurrent edits occur—simple, predictable, and avoids data loss.
## Requirements
### Functional Requirements
| Feature | Priority | Scope |
| -------------------- | -------- | ---------------------- |
| File upload/download | Core | Full implementation |
| Cross-device sync | Core | Full implementation |
| Selective sync | Core | Full implementation |
| File versioning | Core | 30-day history |
| Conflict handling | Core | Conflicted copies |
| Shared folders | High | Full implementation |
| Link sharing | High | Read/write permissions |
| LAN sync | Medium | P2P optimization |
| Offline access | Medium | Client-side |
| Search | Low | Metadata only |
### Non-Functional Requirements
| Requirement | Target | Rationale |
| -------------------------- | ------------------------- | ---------------------------- |
| Availability | 99.99% | Business-critical data |
| Sync latency (same region) | p50 < 2s, p99 < 10s | User perception threshold |
| Upload throughput | 100 MB/s per client | Saturate typical connections |
| Storage durability | 99.9999999999% (12 nines) | Data loss is unacceptable |
| Consistency | Eventual (< 5s typical) | Acceptable for file sync |
| Deduplication ratio | > 2:1 cross-user | Storage cost optimization |
### Scale Estimation
**Users:**
- Registered users: 700M
- DAU: 70M (10% of registered)
- Peak concurrent: 7M
**Files:**
- Files per user: 5,000 average
- Total files: 3.5 trillion
- New files per day: 1.2B
**Storage:**
- Average file size: 150KB
- Total storage: 500PB+
- Daily ingress: 180TB (1.2B × 150KB)
**Traffic:**
- Metadata operations: 10M RPS (reads dominate)
- Block uploads: 500K RPS
- Block downloads: 2M RPS
- Notification connections: 7M concurrent WebSockets
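The estimates above follow from a handful of inputs, and the arithmetic can be reproduced directly. A small sketch, assuming decimal units (1KB = 1,000 bytes) and treating "peak concurrent = 10% of DAU" as an inference from the listed numbers:

```python
REGISTERED_USERS = 700_000_000
FILES_PER_USER = 5_000
NEW_FILES_PER_DAY = 1_200_000_000
AVG_FILE_BYTES = 150_000  # 150KB, decimal units (assumption)

dau = REGISTERED_USERS // 10   # 10% of registered users
peak_concurrent = dau // 10    # inferred ratio: 7M out of 70M

total_files = REGISTERED_USERS * FILES_PER_USER
total_storage_pb = total_files * AVG_FILE_BYTES / 1e15
daily_ingress_tb = NEW_FILES_PER_DAY * AVG_FILE_BYTES / 1e12
```

This yields 3.5 trillion files, about 525PB of logical storage before deduplication, and 180TB of daily ingress, consistent with the figures listed.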
## Design Paths
### Path A: Block-Based Sync (Chosen)
**Best when:**
- Large files that change incrementally (documents, code repositories)
- Cross-user deduplication is valuable
- Bandwidth optimization is critical
**Architecture:** Files split into content-defined chunks (~4MB blocks), each identified by content hash. Only changed blocks transfer.
**Key characteristics:**
- Deduplication at block level
- Delta sync requires only changed blocks
- Block storage can be globally deduplicated
**Trade-offs:**
- ✅ Up to ~2,500x bandwidth reduction for incremental changes (for example, re-uploading one 4MB block of a 10GB file)
- ✅ Cross-user deduplication (identical blocks stored once)
- ✅ Resumable uploads (block-level checkpointing)
- ❌ Complexity in chunking algorithms
- ❌ Small file overhead (metadata > content for tiny files)
- ❌ Client CPU cost for hashing
**Real-world example:** Dropbox uses 4MB blocks with SHA-256 hashing. Magic Pocket stores 500+ PB with 600K+ drives, achieving significant deduplication across their user base.
### Path B: Whole-File Sync
**Best when:**
- Small files only (< 1MB average)
- Simple implementation required
- Files rarely modified (write-once)
**Architecture:** Files uploaded/downloaded atomically. No chunking.
**Trade-offs:**
- ✅ Simple implementation
- ✅ Lower client CPU (no chunking)
- ❌ No delta sync (full re-upload on any change)
- ❌ No cross-file deduplication
- ❌ Large files block on full transfer
**Real-world example:** Simple cloud storage for photos (Google Photos initially) where files are write-once and relatively small.
### Path Comparison
| Factor | Block-Based | Whole-File |
| -------------------- | ----------------------- | ------------------- |
| Delta sync | Yes (block-level) | No |
| Deduplication | Cross-user, cross-file | None |
| Bandwidth efficiency | High (2,500x for edits) | Low |
| Client complexity | High | Low |
| Small file overhead | Higher | None |
| Best for | Documents, code | Photos, small media |
### This Article's Focus
This article focuses on **Path A (Block-Based Sync)** because file sync services primarily handle documents and files that change incrementally, where bandwidth optimization provides the most value.
## High-Level Design
### Client Architecture: Three Trees Model
The sync engine maintains three tree structures representing file state:
```
┌─────────────────────────────────────────────────────────────┐
│ Client Sync Engine │
├─────────────────┬─────────────────┬─────────────────────────┤
│ Local Tree │ Synced Tree │ Remote Tree │
│ (disk state) │ (merge base) │ (server state) │
├─────────────────┼─────────────────┼─────────────────────────┤
│ file.txt (v3) │ file.txt (v2) │ file.txt (v2) │
│ new.txt │ - │ - │
│ - │ deleted.txt │ - │
└─────────────────┴─────────────────┴─────────────────────────┘
```
**Why three trees?** Without a synced tree (merge base), you cannot distinguish:
- "User deleted file locally" vs "File was never synced here"
- "User created file locally" vs "File deleted on server"
**Sync algorithm:**
1. Compare Local vs Synced → detect local changes
2. Compare Remote vs Synced → detect remote changes
3. Apply non-conflicting changes bidirectionally
4. Handle conflicts (see Conflict Resolution section)
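The four steps above can be sketched as a single pass over the union of paths in the three trees. This is a minimal illustration in Python, assuming each tree is a flat mapping from path to content hash (real trees are keyed by file ID and hold directories too); `plan_sync` and the action names are hypothetical.

```python
def plan_sync(local, synced, remote):
    """Each tree maps path -> content hash; a missing key means the file is absent."""
    plan = {"upload": [], "download": [], "delete_remote": [], "delete_local": [], "conflict": []}
    for path in sorted(set(local) | set(synced) | set(remote)):
        l, s, r = local.get(path), synced.get(path), remote.get(path)
        local_changed, remote_changed = l != s, r != s
        if local_changed and remote_changed:
            # Both sides moved away from the merge base; identical results need no action
            if l != r:
                plan["conflict"].append(path)
        elif local_changed:
            plan["delete_remote" if l is None else "upload"].append(path)
        elif remote_changed:
            plan["delete_local" if r is None else "download"].append(path)
    return plan
```

Applied to the diagram above, `file.txt` and `new.txt` are uploaded and `deleted.txt` needs no action, because both sides already agree it is gone. Without the synced tree, `new.txt` would be indistinguishable from a file that was deleted on the server.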
**Node identification:** Files identified by unique ID (not path), enabling O(1) atomic directory renames instead of O(n) path updates.
### Chunking: Content-Defined Chunking (CDC)
Fixed-size chunking fails catastrophically when content is inserted:
```
Fixed chunking (4-byte blocks):
Before: [ABCD][EFGH][IJKL]
Insert X at position 2:
After: [ABXC][DEFG][HIJK][L...] ← All blocks change!
Content-defined chunking:
Before: [ABC|DEF|GHIJ] (boundaries at content patterns)
Insert X at position 2:
After: [ABXC|DEF|GHIJ] ← Only first block changes
```
#### Gear Hash Algorithm
Modern CDC implementations such as FastCDC use **Gear hash** for chunk boundary detection:
```typescript
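// Gear hash: FP_i = (FP_{i-1} << 1) + GearTable[byte]
// The table holds 256 pseudo-random 32-bit values and must be identical on every
// client, so it is filled from a fixed seed (this xorshift32 generator and seed
// are illustrative, not a standard table). A zero-filled table would cut at minSize every time.
const GEAR_TABLE: Uint32Array = new Uint32Array(256)
let seed = 0x9e3779b9
for (let i = 0; i < GEAR_TABLE.length; i++) {
seed ^= seed << 13
seed ^= seed >>> 17
seed ^= seed << 5
GEAR_TABLE[i] = seed >>> 0
}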
function findChunkBoundary(
data: Uint8Array,
minSize: number,
maxSize: number,
mask: number, // e.g., 0x1FFF for ~8KB average
): number {
let fp = 0
// Skip minimum chunk size (cut-point skipping optimization)
for (let i = 0; i < Math.min(minSize, data.length); i++) {
fp = ((fp << 1) + GEAR_TABLE[data[i]]) >>> 0
}
// Search for boundary
for (let i = minSize; i < Math.min(maxSize, data.length); i++) {
fp = ((fp << 1) + GEAR_TABLE[data[i]]) >>> 0
if ((fp & mask) === 0) {
return i + 1 // Boundary found
}
}
return Math.min(maxSize, data.length) // Force boundary at max
}
```
**Performance:** Gear hash performs 1 ADD + 1 SHIFT + 1 array lookup per byte, vs Rabin's 2 XORs + 2 SHIFTs + 2 lookups. FastCDC is reported to be about **10x faster** than Rabin-based CDC.
**Chunk parameters:**
| Parameter | Typical Value | Rationale |
|-----------|---------------|-----------|
| Min chunk | 2KB | Avoid tiny chunks |
| Average chunk | 8KB (small files) / 4MB (large files) | Balance dedup vs overhead |
| Max chunk | 64KB / 16MB | Bound worst-case |
| Mask bits | 13 (8KB avg) | 2^13 = 8192 expected bytes between boundaries |
**Dropbox specifics:** 4MB blocks, SHA-256 content hash as block identifier.
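To see why content-defined boundaries survive an insertion, the sketch below chunks a buffer, inserts one byte near the start, and measures how many chunks of the edited buffer already exist. It is written in Python and makes several assumptions for the sake of a quick demo: a gear table seeded with `random.Random(2024)`, and much smaller parameters than the table above (256B minimum, 8KB maximum, 10 mask bits). The function names are illustrative.

```python
import hashlib
import random

_rng = random.Random(2024)  # fixed seed: every client must derive the same table
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data, min_size=256, max_size=8192, mask=0x3FF):
    """Split data at content-defined boundaries; returns a list of byte chunks."""
    chunks, start = [], 0
    while start < len(data):
        end = min(start + max_size, len(data))
        fp, cut = 0, end  # default: forced cut at max_size or end of data
        for i in range(start, end):
            fp = ((fp << 1) + GEAR[data[i]]) & 0xFFFFFFFF
            if i - start + 1 >= min_size and (fp & mask) == 0:
                cut = i + 1
                break
        chunks.append(data[start:cut])
        start = cut
    return chunks


def shared_fraction(old, new):
    """Fraction of new chunks whose hash already exists among the old chunks."""
    known = {hashlib.sha256(c).digest() for c in chunk(old)}
    new_chunks = chunk(new)
    return sum(hashlib.sha256(c).digest() in known for c in new_chunks) / len(new_chunks)
```

Because the boundary test depends only on the most recent bytes, the chunker falls back into step with the old boundaries shortly after the edit, so only the chunks around the insertion need to be uploaded.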
### Block Storage Architecture

Block storage with three-zone replication. Blocks stored in at least 2 zones within 1 second, third zone async.
**Block addressing:** Content hash (SHA-256) as block ID. Two identical blocks anywhere in the system share storage.
**Storage hierarchy:**
1. **Block** (4MB max): Unit of upload/download, content-addressed
2. **Bucket** (1GB): Aggregation of blocks for efficient disk I/O
3. **Cell** (~50-100PB): Failure domain, independent replication
4. **Zone**: Geographic region
**Durability math:**
- 3-zone replication
- Within each zone: erasure coding or replication
- Target: 99.9999999999% annual durability (< 1 block lost per trillion per year)
### Metadata Service
Metadata operations dominate traffic (10M RPS against roughly 2.5M RPS of block operations in the estimates above, about 4:1). Design for high read throughput.
#### Schema Design
```sql
-- File metadata (sharded by namespace_id)
CREATE TABLE files (
namespace_id BIGINT NOT NULL, -- User/shared folder
file_id UUID NOT NULL, -- Stable identifier
path TEXT NOT NULL, -- Current path (mutable)
blocklist BYTEA[] NOT NULL, -- Ordered list of block hashes (SHA-256, 32 bytes each)
size BIGINT NOT NULL,
content_hash BYTEA NOT NULL, -- Hash of concatenated blocks
modified_at TIMESTAMPTZ NOT NULL,
revision BIGINT NOT NULL, -- Monotonic version
is_deleted BOOLEAN DEFAULT FALSE,
PRIMARY KEY (namespace_id, file_id)
);
-- Enables path lookups within namespace
CREATE INDEX idx_files_path ON files(namespace_id, path)
WHERE NOT is_deleted;
-- Journal for sync cursor (append-only)
CREATE TABLE journal (
namespace_id BIGINT NOT NULL,
journal_id BIGINT NOT NULL, -- Monotonic cursor
file_id UUID NOT NULL,
operation VARCHAR(10) NOT NULL, -- 'create', 'modify', 'delete', 'move'
timestamp TIMESTAMPTZ NOT NULL,
PRIMARY KEY (namespace_id, journal_id)
);
```
**Sharding strategy:** By `namespace_id` (user account or shared folder). This co-locates all of a user's files on the same shard.
**Journal pattern:** Clients track sync position via `journal_id`. On reconnect, fetch all changes since last cursor—O(changes) not O(files).
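A minimal in-memory sketch of the journal cursor, in Python. It assumes one journal per namespace with dense, 1-based `journal_id` values, so the cursor can double as a list offset; a real store would use an indexed range query on `(namespace_id, journal_id)` instead. The `Journal` class is hypothetical.

```python
class Journal:
    """Append-only change log for one namespace; journal_id is the sync cursor."""

    def __init__(self):
        self.entries = []  # (journal_id, file_id, operation), ordered by journal_id

    def append(self, file_id, operation):
        journal_id = len(self.entries) + 1
        self.entries.append((journal_id, file_id, operation))
        return journal_id

    def changes_since(self, cursor, limit=100):
        """Return (entries after cursor, new cursor, has_more)."""
        # journal_ids are dense and 1-based here, so the cursor is also a list offset
        page = self.entries[cursor:cursor + limit]
        new_cursor = page[-1][0] if page else cursor
        return page, new_cursor, new_cursor < len(self.entries)
```

A client that reconnects with cursor `2` receives only entry `3` onward, regardless of how many files the namespace holds.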
#### Caching Strategy
```
┌────────────────────────────────────────────────────────────┐
│ Cache Hierarchy │
├──────────────────┬──────────────────┬──────────────────────┤
│ Client Cache │ Edge Cache │ Origin Cache │
│ (SQLite) │ (Regional) │ (Global) │
├──────────────────┼──────────────────┼──────────────────────┤
│ Full tree state │ Hot metadata │ Frequently accessed │
│ Block cache │ TTL: 5 seconds │ TTL: 30 seconds │
│ Offline access │ Namespace-keyed │ File-id keyed │
└──────────────────┴──────────────────┴──────────────────────┘
```
**Invalidation:** Write-through to cache on metadata mutations. Short TTL acceptable because clients reconcile via journal cursor.
### Notification Service
Real-time sync requires push notifications when remote changes occur.
**Options:**
| Mechanism | Latency | Connections | Use Case |
| ------------ | ------- | -------------------- | ---------------- |
| Polling | 5-30s | Stateless | Simple, legacy |
| Long polling | 1-5s | Semi-persistent | Moderate scale |
| WebSocket | < 100ms | Persistent | Real-time sync |
| SSE | < 100ms | Persistent (one-way) | Server push only |
**Chosen: WebSocket with fallback to long polling**
**Connection scaling:**
- 7M concurrent connections at peak
- Each connection: ~10KB memory
- Total: ~70GB memory for connection state
- Horizontal scaling via connection affinity (consistent hashing on user_id)
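Connection affinity by consistent hashing can be sketched as a hash ring. The Python below is an illustration under assumptions: SHA-256 as the ring hash, 100 virtual nodes per server, and hypothetical server names such as `ws-1`. It is not a description of any particular gateway's implementation.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hashing: each server owns many points on a ring of hash values."""

    def __init__(self, servers, vnodes=100):
        self.points = sorted(
            (self._hash(f"{server}#{i}"), server) for server in servers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.points]

    @staticmethod
    def _hash(value):
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def server_for(self, user_id):
        # First point clockwise from the user's hash, wrapping around at the end
        idx = bisect.bisect_right(self.keys, self._hash(user_id)) % len(self.keys)
        return self.points[idx][1]
```

When a server leaves the ring, only the users that were mapped to it reconnect elsewhere; everyone else keeps their existing connection, which matters when each node holds hundreds of thousands of sockets.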
**Notification payload (minimal):**
```json
{
"namespace_id": "ns_abc123",
"journal_id": 158293,
"hint": "file_changed" // Client fetches details via API
}
```
Keep payloads minimal—notification triggers sync, doesn't contain data.
## API Design
### Sync Flow APIs
#### List Changes (Cursor-Based)
```
POST /api/v2/files/list_folder/continue
```
**Request:**
```json
{
"cursor": "AAGvR5..." // Opaque cursor encoding (namespace_id, journal_id)
}
```
**Response:**
```json
{
"entries": [
{
"tag": "file",
"id": "id:abc123",
"path_display": "/Documents/report.pdf",
"rev": "015a3e0c4b650000000",
"size": 1048576,
"content_hash": "e3b0c44298fc1c149afbf4c8996fb924...",
"server_modified": "2024-01-15T10:30:00Z"
},
{
"tag": "deleted",
"id": "id:def456",
"path_display": "/old-file.txt"
}
],
"cursor": "AAGvR6...",
"has_more": false
}
```
**Why cursor-based:**
- Stable under concurrent modifications
- Client can disconnect/reconnect and resume exactly
- Indexed seek to the cursor position vs O(n) offset skip
#### Upload Session (Block-Based)
**Phase 1: Start session**
```
POST /api/v2/files/upload_session/start
```
```json
{
"session_type": "concurrent", // Allows parallel block uploads
"content_hash": "e3b0c44..." // Optional: skip if file unchanged
}
```
**Phase 2: Append blocks (parallelizable)**
```
POST /api/v2/files/upload_session/append_v2
Content-Type: application/octet-stream
Dropbox-API-Arg: {"cursor": {"session_id": "...", "offset": 0}}
[4MB binary block data]
```
**Phase 3: Finish and commit**
```
POST /api/v2/files/upload_session/finish
```
```json
{
"cursor": { "session_id": "...", "offset": 12582912 },
"commit": {
"path": "/Documents/large-file.zip",
"mode": "overwrite",
"mute": false // true = don't notify other clients immediately
}
}
```
**Response includes block deduplication:**
```json
{
"id": "id:abc123",
"size": 12582912,
"blocks_reused": 2, // Blocks already existed
"blocks_uploaded": 1, // New blocks stored
"bytes_transferred": 4194304 // Only new block data
}
```
### Block Sync Protocol

Upload flow: commit blocklist first, upload only missing blocks, then finalize. Streaming sync allows downloads to begin before upload completes.
**Streaming sync optimization:** Clients can prefetch blocks from partial blocklists before the commit finalizes, which can reduce end-to-end sync time by up to 2x for large files.
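The "commit blocklist first, upload only missing blocks" flow can be sketched with an in-memory server. This Python example assumes fixed 4MB blocks and SHA-256 block IDs as described earlier; `BlockServer`, `missing`, and `upload` are hypothetical names, and the returned fields mirror the example response above.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4MB blocks


def split_blocks(data):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


class BlockServer:
    def __init__(self):
        self.blocks = {}  # content hash -> block bytes

    def missing(self, blocklist):
        """Client commits the blocklist; server answers with the hashes it lacks."""
        return [h for h in blocklist if h not in self.blocks]

    def put(self, block):
        """Store a block under its own hash; repeating the call is harmless."""
        self.blocks[hashlib.sha256(block).hexdigest()] = block


def upload(server, data):
    blocks = split_blocks(data)
    blocklist = [hashlib.sha256(b).hexdigest() for b in blocks]
    needed = set(server.missing(blocklist))
    uploaded, sent = 0, 0
    for block_hash, block in zip(blocklist, blocks):
        if block_hash in needed:
            server.put(block)
            needed.discard(block_hash)  # identical blocks within one file go once
            uploaded += 1
            sent += len(block)
    return {
        "size": len(data),
        "blocks_reused": len(blocks) - uploaded,
        "blocks_uploaded": uploaded,
        "bytes_transferred": sent,
    }
```

Because a block's ID is derived from its content, retrying an interrupted upload simply asks `missing` again and resumes with whatever is still absent.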
## Low-Level Design: Conflict Resolution
### Conflict Detection
Conflicts occur when both local and remote trees changed the same file since the synced tree state:
```
Timeline:
t0: Synced tree = {file.txt, rev=5}
t1: Local edit → Local tree = {file.txt, rev=5, modified}
t2: Remote edit → Remote tree = {file.txt, rev=6}
t3: Sync attempt → CONFLICT (local modified, remote also modified)
```
**Detection algorithm:**
```python
def detect_conflict(local: Node, remote: Node, synced: Node) -> ConflictType:
    local_changed = local != synced
    remote_changed = remote != synced
    if not local_changed:
        return ConflictType.NONE  # Apply remote
    if not remote_changed:
        return ConflictType.NONE  # Apply local
    if local == remote:
        return ConflictType.NONE  # Same change, no conflict
    # Both changed differently
    if local.is_delete and remote.is_delete:
        return ConflictType.NONE  # Both deleted, no conflict
    if local.is_delete or remote.is_delete:
        return ConflictType.EDIT_DELETE
    return ConflictType.EDIT_EDIT
```
### Conflict Resolution Strategy
**Chosen approach: Conflicted copies**
When conflict detected:
1. Keep the remote version at original path
2. Create local version as `filename (Computer Name's conflicted copy YYYY-MM-DD).ext`
3. User manually resolves by keeping preferred version
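Step 2 is a pure string transformation and can be sketched in a few lines of Python. The function name is hypothetical and the computer name in the usage note is made up for illustration; the output follows the naming pattern given above.

```python
from datetime import date
from pathlib import PurePosixPath


def conflicted_copy_path(path, computer_name, day):
    """Build the sibling path that keeps the local version after a conflict."""
    p = PurePosixPath(path)
    name = f"{p.stem} ({computer_name}'s conflicted copy {day.isoformat()}){p.suffix}"
    return str(p.with_name(name))
```

For example, `conflicted_copy_path("/Documents/report.pdf", "Alice-Laptop", date(2024, 1, 15))` yields `/Documents/report (Alice-Laptop's conflicted copy 2024-01-15).pdf`. The copy stays next to the original and keeps its extension, so it opens in the same application.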
**Why not auto-merge:**
- File formats are opaque (binary, proprietary)
- Wrong merge = data corruption (worse than duplicate)
- User knows intent; algorithm cannot
- Simple, predictable behavior
**Alternative strategies (not chosen):**
| Strategy | Pros | Cons | Use Case |
| -------------------------- | ---------------- | -------------------------- | ------------------ |
| Last-write-wins | Simple | Data loss | Logs, non-critical |
| Vector clocks | Tracks causality | Complex, metadata overhead | Distributed DBs |
| CRDTs | Auto-merge | Limited data types | Collaborative text |
| OT (Operational Transform) | Real-time collab | Extreme complexity | Google Docs |
### Edge Cases
**Edit-delete conflict:**
- Remote deleted, local edited → Restore file with local edits
- Local deleted, remote edited → Keep remote version, local delete is lost
**Directory conflicts:**
- Move vs edit: Apply move, file content syncs to new location
- Move vs move: Create conflicted folder name
- Delete folder with unsynced children: Preserve unsynced files in special recovery folder
**Rename loops:**
- A renames folder X→Y, B renames Y→X simultaneously
- Resolution: Arbitrary tiebreaker (lexicographic on device ID)
## Low-Level Design: Delta Sync
For files that change incrementally (e.g., appending to logs, editing documents), transferring only the diff provides massive bandwidth savings.
### Block-Level Delta
When a file changes, recompute chunk boundaries:
```
Before: [Block A][Block B][Block C]
(hash_a) (hash_b) (hash_c)
Edit middle of Block B:
After: [Block A][Block B'][Block C]
(hash_a) (hash_b') (hash_c)
Blocks to upload: only Block B' (hash_b')
Bandwidth saved: ~67% (2 of 3 blocks reused)
```
**Content-defined chunking is critical:** Fixed-size chunks would shift all boundaries after an insertion, invalidating all subsequent blocks.
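To make the boundary-shift argument concrete, here is a minimal sketch that compares fixed-size chunking with a simplified gear-hash chunker after a one-byte insertion. The gear table, the 10-bit mask (roughly 1 KiB average chunks), and all function names are assumptions chosen for illustration. A production chunker such as FastCDC adds minimum and maximum chunk sizes and normalized cut points.

```python
import hashlib
import random

rng = random.Random(42)
GEAR = [rng.getrandbits(32) for _ in range(256)]  # assumed per-byte random table
MASK = (1 << 10) - 1  # cut when the low 10 bits are zero (~1 KiB average chunk)


def cdc_chunks(data: bytes) -> list[bytes]:
    """Cut wherever the gear hash of the most recent bytes matches the mask."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h & MASK == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 1024) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def reused(old: list[bytes], new: list[bytes]) -> int:
    """Count chunks in `new` whose hash is already present in `old`."""
    known = {hashlib.sha256(c).digest() for c in old}
    return sum(hashlib.sha256(c).digest() in known for c in new)


original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
edited = b"!" + original  # one-byte insertion at the front

fixed_reuse = reused(fixed_chunks(original), fixed_chunks(edited))
cdc_reuse = reused(cdc_chunks(original), cdc_chunks(edited))
```

With fixed-size chunks every boundary moves by one byte, so no chunk hash survives. The content-defined boundaries depend only on the nearby bytes, so they realign shortly after the insertion and most chunks are reused.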
### Sub-Block Delta (Binary Diff)
For further optimization within changed blocks, use rsync-style rolling checksums:
```python
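import hashlib
import zlib
from dataclasses import dataclass

BLOCK_SIZE = 700  # Rolling checksum window


@dataclass
class Copy:
    source_offset: int
    length: int


@dataclass
class Literal:
    data: bytes


def compute_delta(old_block: bytes, new_block: bytes) -> list:
    """Rsync-style delta: find matching regions, send only diffs."""
    # Index the old block's non-overlapping windows by weak checksum.
    # A list per key keeps windows whose weak checksums collide.
    old_checksums: dict[int, list[tuple[int, bytes]]] = {}
    for i in range(0, len(old_block) - BLOCK_SIZE + 1, BLOCK_SIZE):
        window = old_block[i:i + BLOCK_SIZE]
        entry = (i, hashlib.sha256(window).digest())
        old_checksums.setdefault(zlib.adler32(window), []).append(entry)

    # Scan the new block one byte at a time
    delta = []
    i = 0
    while i + BLOCK_SIZE <= len(new_block):
        window = new_block[i:i + BLOCK_SIZE]
        # Recomputed per position for clarity; a real implementation
        # slides the weak checksum in O(1) instead.
        candidates = old_checksums.get(zlib.adler32(window), [])
        strong = hashlib.sha256(window).digest() if candidates else None
        match = next((off for off, s in candidates if s == strong), None)
        if match is not None:
            # Match found - emit COPY instruction
            delta.append(Copy(source_offset=match, length=BLOCK_SIZE))
            i += BLOCK_SIZE
        else:
            # No match - emit one literal byte
            delta.append(Literal(new_block[i:i + 1]))
            i += 1
    # The tail is shorter than one window, so send it as a literal
    if i < len(new_block):
        delta.append(Literal(new_block[i:]))
    return delta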
```
**Rolling checksum:** rsync's weak checksum is Adler-32 based and updates in O(1) as the window slides by one byte (all arithmetic is modulo 2^16 in rsync's formulation):
```
a(i+1, i+n) = a(i, i+n-1) - old_byte + new_byte
b(i+1, i+n) = b(i, i+n-1) - n*old_byte + a(i+1, i+n)
checksum = b * 65536 + a
```
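The recurrences above can be checked with a short sketch that rolls the checksum across a buffer and compares each step against a from-scratch computation. The function names are illustrative, and the modulus of 2^16 follows rsync's formulation rather than standard Adler-32, which uses the prime 65521 and an initial `a` of 1.

```python
M = 1 << 16  # modulus used by rsync's weak checksum


def weak_checksum(window: bytes) -> tuple[int, int]:
    """Compute (a, b) from scratch over a full window."""
    n = len(window)
    a = sum(window) % M
    # Each byte is weighted by its distance from the end of the window.
    b = sum((n - k) * byte for k, byte in enumerate(window)) % M
    return a, b


def roll(a: int, b: int, n: int, old_byte: int, new_byte: int) -> tuple[int, int]:
    """Slide the window one byte: drop old_byte, append new_byte."""
    a_next = (a - old_byte + new_byte) % M
    b_next = (b - n * old_byte + a_next) % M
    return a_next, b_next


def combine(a: int, b: int) -> int:
    return b * 65536 + a
```

Rolling costs two additions and one multiplication per byte, independent of the window size, which is what makes scanning every byte offset of the new block affordable.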
**Real-world impact:** Binary diff on a 100MB database file with 1KB change: ~2KB transfer instead of 100MB (**50,000x reduction**).
## Low-Level Design: Bandwidth Optimization
### Compression Pipeline
Dropbox's **Broccoli** (a modified Brotli encoder) saves about 30% of upload bandwidth:
```
┌─────────────────────────────────────────────────────┐
│ Compression Pipeline │
├─────────────────────────────────────────────────────┤
│ 1. Chunk file (CDC) │
│ 2. Compress each chunk independently (parallel) │
│ 3. Concatenate compressed streams │
│ 4. Upload concatenated result │
├─────────────────────────────────────────────────────┤
│ Broccoli modifications: │
│ - Uncompressed meta-block header for context │
│ - Disabled dictionary references across blocks │
│ - Enables parallel compression + concatenation │
└─────────────────────────────────────────────────────┘
```
**Performance impact:**
| Metric | Before Broccoli | After Broccoli |
|--------|-----------------|----------------|
| Upload bandwidth | 100% | ~70% (30% savings) |
| Download bandwidth | 100% | ~85% (15% savings) |
| p50 upload latency | baseline | 35% faster |
| p50 download latency | baseline | 50% faster |
### LAN Sync
When a peer on the same LAN already holds a needed block, transfer it locally instead of through the cloud:
**Discovery:** UDP broadcast on port 17500 (IANA-reserved for Dropbox)
**Protocol:**
```
GET https:///blocks/{namespace_id}/{block_hash}
Authorization: Bearer
```
**Security:** Per-namespace SSL certificates issued by Dropbox servers, rotated when shared folder membership changes. Prevents unauthorized block access even on local network.
**Bandwidth savings:** Blocks already held by a LAN peer are fetched locally and consume no internet bandwidth, which matters most for shared team folders.
### Upload Prioritization
Not all files are equal. Prioritize based on:
1. **User-initiated actions** (explicit upload) > Background sync
2. **Small files** > Large files (faster perceived completion)
3. **Recently modified** > Old files
4. **Active documents** > Archives
**Implementation:** Priority queue with aging to prevent starvation.
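A minimal sketch of such a queue follows. The weights, the aging rate, and the `UploadQueue` and `base_priority` names are all assumptions for illustration. The sketch relies on one observation: if every queued item ages at the same rate, ordering by `priority + AGING_RATE * enqueue_time` is equivalent to ordering by the aged priority, so the heap never needs to be rebuilt.

```python
import heapq
import itertools

AGING_RATE = 1.0              # assumed: priority levels gained per second waited
LARGE_FILE_BYTES = 50 << 20   # assumed threshold for a "large" file


def base_priority(user_initiated: bool, size_bytes: int) -> float:
    """Lower is more urgent. The weights are illustrative, not tuned values."""
    score = 0.0 if user_initiated else 10.0
    if size_bytes >= LARGE_FILE_BYTES:
        score += 5.0
    return score


class UploadQueue:
    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order among equals

    def push(self, path: str, priority: float, now: float) -> None:
        # An item enqueued earlier gets a smaller key, so it overtakes newer
        # high-priority work once it has waited long enough.
        key = priority + AGING_RATE * now
        heapq.heappush(self._heap, (key, next(self._seq), path))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]
```

With these assumed weights, a background upload that has waited more than 10 seconds is served before a newly arrived user-initiated upload, which bounds how long low-priority work can be starved.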
## Frontend Considerations
### Desktop Client Architecture
```
┌────────────────────────────────────────────────────────────┐
│ Desktop Client │
├──────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ File System │ │ Sync Engine │ │ Network Layer │ │
│ │ Watcher │ │ │ │ │ │
│ │ │ │ Three Trees │ │ HTTP/2 multiplexing │ │
│ │ inotify/ │──│ Reconciler │──│ WebSocket notify │ │
│ │ FSEvents │ │ │ │ Block upload/down │ │
│ │ │ │ Conflict │ │ │ │
│ │ │ │ Handler │ │ Retry + backoff │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Local Database ││
│ │ SQLite: tree state, block cache index, sync cursor ││
│ └─────────────────────────────────────────────────────────┘│
└──────────────────────────────────────────────────────────────┘
```
**File system watching:**
- macOS: FSEvents (coalesced, efficient)
- Linux: inotify (non-recursive, one watch per directory; `fs.inotify.max_user_watches` historically defaults to 8192)
- Windows: ReadDirectoryChangesW
**Watch limit handling:** For large folders exceeding inotify limits, fall back to periodic polling with checksums.
### Sync Status UI
Users need visibility into sync state:
```typescript
interface SyncStatus {
state: "synced" | "syncing" | "paused" | "offline" | "error"
pendingFiles: number
pendingBytes: number
currentFile?: {
path: string
progress: number // 0-100
speed: number // bytes/sec
}
errors: SyncError[]
}
```
**Status indicators:**
- ✓ Green checkmark: Fully synced
- ↻ Blue arrows: Syncing in progress
- ⏸ Gray pause: Paused (user-initiated or bandwidth limit)
- ⚠ Yellow warning: Conflicts or errors need attention
- ✕ Red X: Critical error (auth failed, storage full)
### Selective Sync
Large Dropbox accounts may exceed local disk capacity. Allow users to choose which folders sync locally:
```typescript
interface SelectiveSyncConfig {
// Folders to sync (whitelist approach)
includedPaths: string[]
// Or folders to exclude (blacklist approach for "sync everything except")
excludedPaths: string[]
// Smart sync: files appear in the file browser but download on demand
smartSyncEnabled: boolean
smartSyncPolicy: "local" | "online-only" | "mixed"
}
```
**Smart Sync (virtual files):**
- Files appear in file browser with cloud icon
- Open file → triggers download
- Configurable: keep local after access vs evict after N days
- Requires OS integration (Windows: Cloud Files API, macOS: File Provider)
## Infrastructure Design
### Cloud-Agnostic Concepts
| Component | Concept | Requirements |
| -------------- | ------------------------------------ | ---------------------------------------- |
| Block storage | Object store with content addressing | High durability, geo-replication |
| Metadata store | Sharded relational DB | Strong consistency, high read throughput |
| Cache | Distributed key-value | Sub-ms latency, TTL support |
| Notification | Pub/sub with persistent connections | Millions of concurrent connections |
| Compute | Container orchestration | Auto-scaling, rolling deploys |
### AWS Reference Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ AWS Infrastructure │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Route 53 │────▶│ CloudFront │────▶│ ALB │ │
│ │ (DNS) │ │ (CDN) │ │ │ │
│ └──────────────┘ └──────────────┘ └──────┬──────┘ │
│ │ │
│ ┌────────────────────────────────────────────────▼──────┐ │
│ │ ECS / EKS │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ API │ │ Metadata │ │ Block │ │ Notify │ │ │
│ │ │ Gateway │ │ Service │ │ Service │ │ Service │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐│
│ │ Data Layer ││
│ │ ││
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ││
│ │ │ Aurora │ │ ElastiCache │ │ S3 │ ││
│ │ │ PostgreSQL │ │ Redis │ │ (Block Store)│ ││
│ │ │ (Metadata) │ │ (Cache) │ │ │ ││
│ │ └──────────────┘ └──────────────┘ └──────────────┘ ││
│ │ ││
│ │ ┌──────────────┐ ┌──────────────┐ ││
│ │ │ DynamoDB │ │ SQS/SNS │ ││
│ │ │ (Journal) │ │ (Events) │ ││
│ │ └──────────────┘ └──────────────┘ ││
│ └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
```
| Component | AWS Service | Configuration |
| ------------- | ------------------------------ | --------------------------------------------- |
| Metadata DB | Aurora PostgreSQL | Multi-AZ, read replicas |
| Block storage | S3 | Cross-region replication, 11 nines durability |
| Cache | ElastiCache Redis | Cluster mode, 100+ nodes |
| Notifications | API Gateway WebSocket + Lambda | 7M concurrent connections |
| Queue | SQS FIFO | Deduplication, ordering |
| CDN | CloudFront | Edge caching for static assets |
### Self-Hosted Alternatives
| Managed Service | Self-Hosted | When to Consider |
| --------------- | -------------------- | ----------------------- |
| Aurora | PostgreSQL + Patroni | Cost at 100+ TB scale |
| S3 | MinIO / Ceph | Data sovereignty, cost |
| ElastiCache | Redis Cluster | Specific Redis modules |
| API Gateway WS | Custom WS server | Connection limits, cost |
**Dropbox's approach:** Built a custom storage system (Magic Pocket) at 500+ PB scale. Dropbox's S-1 filing reported roughly $75M in infrastructure cost savings over the two years after migrating off AWS.
## Conclusion
File sync at scale requires:
1. **Content-defined chunking** for efficient delta sync and deduplication
2. **Three-tree model** for unambiguous conflict detection
3. **Content-addressed blocks** for idempotent uploads and cross-user deduplication
4. **Conflicted copies** for safe conflict resolution (no data loss)
5. **Real-time notifications** via WebSocket for low sync latency
**Key tradeoffs accepted:**
- Eventual consistency (acceptable for file sync, not for transactional data)
- Client complexity (chunking, hashing, tree reconciliation)
- Storage overhead for deduplication metadata
**Not covered:** Team administration, audit logging, compliance features (HIPAA, SOC 2), mobile-specific optimizations.
## Appendix
### Prerequisites
- Understanding of content-addressable storage
- Familiarity with eventual consistency models
- Basic knowledge of compression algorithms
### Summary
- Content-defined chunking (Gear hash/FastCDC) enables delta sync with only changed blocks transferred
- Three-tree model (local, synced, remote) provides unambiguous merge base for bidirectional sync
- Block-level content addressing enables cross-user deduplication at petabyte scale
- Conflicted copy strategy avoids data loss without complex merge algorithms
- WebSocket notifications + cursor-based APIs enable sub-second sync latency
### References
- [Dropbox: Rewriting the heart of our sync engine](https://dropbox.tech/infrastructure/rewriting-the-heart-of-our-sync-engine) - Three-tree model and Rust rewrite
- [Dropbox: Streaming File Synchronization](https://dropbox.tech/infrastructure/streaming-file-synchronization) - Block sync protocol details
- [Dropbox: Inside the Magic Pocket](https://dropbox.tech/infrastructure/inside-the-magic-pocket) - Storage infrastructure at 500+ PB scale
- [Dropbox: Broccoli - Syncing faster by syncing less](https://dropbox.tech/infrastructure/-broccoli--syncing-faster-by-syncing-less) - Compression optimization
- [Dropbox: Inside LAN Sync](https://dropbox.tech/infrastructure/inside-lan-sync) - P2P sync protocol
- [FastCDC: A Fast and Efficient Content-Defined Chunking Approach](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf) - USENIX ATC 2016
- [LBFS: A Low-bandwidth Network File System](https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf) - Rabin fingerprinting for chunking
- [The rsync algorithm](https://rsync.samba.org/tech_report/) - Rolling checksum delta sync
---
## Design Google Calendar
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/system-design-problems/design-google-calendar
**Category:** System Design / System Design Problems
**Description:** A comprehensive system design for a calendar and scheduling application handling recurring events, timezone complexity, and real-time collaboration. This design addresses event recurrence at scale (RRULE expansion), global timezone handling across DST boundaries, availability aggregation for meeting scheduling, and multi-client synchronization with conflict resolution.
# Design Google Calendar
A comprehensive system design for a calendar and scheduling application handling recurring events, timezone complexity, and real-time collaboration. This design addresses event recurrence at scale (RRULE expansion), global timezone handling across DST boundaries, availability aggregation for meeting scheduling, and multi-client synchronization with conflict resolution.

High-level architecture: Clients connect through an API gateway to core services backed by a hybrid data layer with async processing for notifications and recurrence expansion.
## Abstract
Calendar systems solve three interconnected problems: **temporal data modeling** (representing events, recurrence rules, and exceptions), **timezone arithmetic** (displaying the same event correctly across global participants), and **availability computation** (finding meeting slots across multiple calendars).
The core data model stores **recurring event masters** with RRULE strings (RFC 5545) rather than individual instances. Expansion happens in a **hybrid approach**: materialize instances 30-90 days ahead for query performance, expand dynamically beyond that window. Exceptions (cancellations, single-instance modifications) are stored separately and merged at read time.
**Timezone handling** requires storing events in local time with named IANA timezone identifiers—never raw UTC offsets. This ensures a "9 AM daily standup" remains at 9 AM local time across DST transitions.
**Conflict-free synchronization** uses sync-tokens (RFC 6578) for incremental updates. Each calendar has a monotonically increasing token; clients send their last token and receive only changes since that state. For concurrent edits, the server maintains the event history and uses last-write-wins with user notification for conflicts.
## Requirements
### Functional Requirements
| Feature | Priority | Scope |
| ------------------------------------------------ | -------- | ------------ |
| Single events (create, read, update, delete) | Core | Full |
| Recurring events (RRULE support) | Core | Full |
| Event exceptions (cancel/modify single instance) | Core | Full |
| Time zone handling with DST | Core | Full |
| Meeting invitations (RSVP workflow) | Core | Full |
| Free/busy queries | Core | Full |
| Calendar sharing and delegation | Core | Full |
| Reminders and notifications | Core | Full |
| Multi-client sync (CalDAV) | Core | Full |
| Calendar search | High | Full |
| Meeting room/resource booking | High | Overview |
| Video conferencing integration | Medium | Brief |
| Task management (VTODO) | Low | Out of scope |
### Non-Functional Requirements
| Requirement | Target | Rationale |
| ------------------------------ | --------------- | ----------------------------------------------------------- |
| Availability | 99.99% | Calendar access is mission-critical for business operations |
| Read latency (calendar view) | p99 < 200ms | Month view may expand hundreds of recurring events |
| Write latency (event creation) | p99 < 500ms | Acceptable for user-initiated actions |
| Sync latency | < 5 seconds | Changes should propagate across devices quickly |
| Data consistency | Eventual (< 5s) | Strong consistency not required for calendar data |
| Data retention | 10+ years | Historical calendar data has legal/compliance value |
### Scale Estimation
**Users:**
- MAU: 500M (Google Workspace + consumer Gmail)
- DAU: 100M
- Peak concurrent: 10M (10% of DAU)
**Events:**
- Average events per user: 50 active recurring + 200 single events
- Total events: 500M users × 250 events = 125B event records
- But with recurrence masters (not instances): ~25B records
**Traffic:**
- Calendar loads: 100M DAU × 10 loads/day = 1B/day = ~12K RPS
- Event writes: 100M DAU × 2 writes/day = 200M/day = ~2.3K RPS
- Free/busy queries: 100M DAU × 0.5/day = 50M/day = ~580 RPS
- Peak multiplier: 3x → 36K RPS reads, 7K RPS writes
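The traffic figures above follow from dividing daily volume by 86,400 seconds. A short sketch of that arithmetic, using only the DAU and per-user rates stated in the list (the helper name `avg_rps` is illustrative):

```python
SECONDS_PER_DAY = 86_400
DAU = 100_000_000
PEAK_MULTIPLIER = 3


def avg_rps(per_user_per_day: float) -> float:
    """Average requests per second for a given per-user daily rate."""
    return DAU * per_user_per_day / SECONDS_PER_DAY


loads = avg_rps(10)       # calendar loads, about 11.6K RPS
writes = avg_rps(2)       # event writes, about 2.3K RPS
free_busy = avg_rps(0.5)  # free/busy queries, about 580 RPS

peak_reads = PEAK_MULTIPLIER * loads    # about 34.7K RPS
peak_writes = PEAK_MULTIPLIER * writes  # about 6.9K RPS
```

The peak read figure comes out slightly below the 36K quoted above because the list rounds the average up to 12K before applying the multiplier. Either value is fine for capacity planning at this level of precision.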
**Storage:**
- Event master: ~2KB average (metadata, description, RRULE, attendees)
- 25B events × 2KB = 50TB primary storage
- With indexes, replicas, and history: ~200TB total
## Design Paths
### Path A: RRULE-Centric (Store Rules, Expand on Read)
**Best when:**
- Events have long or infinite recurrence (daily standups forever)
- Storage cost is a primary concern
- Updates to recurring series are frequent
**Key characteristics:**
- Store only the recurrence rule in the events table
- Expand instances dynamically when querying a date range
- Cache expansion results in Redis for frequently accessed calendars
**Trade-offs:**
- ✅ Minimal storage (one record per recurring series)
- ✅ Updating series changes all future instances instantly
- ✅ Supports infinite recurrence naturally
- ❌ CPU-intensive expansion for complex RRULEs
- ❌ Slow queries spanning long date ranges
- ❌ Exception handling adds query complexity
**Real-world example:** Many open-source CalDAV servers (Radicale, DAViCal) use this approach because storage efficiency matters more than query speed for personal calendars.
### Path B: Instance-Centric (Materialize All Instances)
**Best when:**
- Queries span arbitrary date ranges frequently
- Meeting scheduling and free/busy aggregation are critical
- Most events have bounded recurrence (end dates)
**Key characteristics:**
- Pre-expand all instances into a separate table
- Recurring series modifications trigger batch updates to instances
- Indexes on start_time enable fast range queries
**Trade-offs:**
- ✅ Fast indexed range queries: filter by date, O(log n + k) with a B-tree index
- ✅ Simple free/busy aggregation (merge overlapping busy intervals)
- ✅ Exception instances are just rows with modified fields
- ❌ Storage explosion (daily event for 10 years = 3,650 rows)
- ❌ Series updates require updating thousands of rows
- ❌ Cannot support infinite recurrence
**Real-world example:** Microsoft Exchange, the server behind Outlook, is often cited as using materialization for corporate calendars, where meeting scheduling performance is paramount.
### Path C: Hybrid (Chosen Approach)
**Best when:**
- Mix of short-term and long-term recurring events
- Need both fast queries and storage efficiency
- Workload varies (view calendar vs. schedule meetings)
**Key characteristics:**
- Store recurrence rules in the master events table
- Materialize instances for a rolling window (30-90 days)
- Expand dynamically beyond the materialized window
- Background jobs refresh materialized instances nightly
**Trade-offs:**
- ✅ Fast queries within the materialized window
- ✅ Reasonable storage (at most ~90 instances for a daily series, not thousands)
- ✅ Can support infinite recurrence (expand on demand)
- ✅ Series updates only touch instances within window
- ❌ More complex architecture (two code paths)
- ❌ Stale data possible if background jobs lag
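The two code paths can be pictured as splitting each query range at the edge of the materialized window. The sketch below is an illustration under assumptions: a fixed 90-day window measured from today, a table that covers everything before the window's end, and the hypothetical helper name `split_query`.

```python
from datetime import date, timedelta
from typing import Optional

WINDOW_DAYS = 90  # assumed upper end of the 30-90 day window

Range = tuple[date, date]


def split_query(start: date, end: date, today: date) -> tuple[Optional[Range], Optional[Range]]:
    """Split [start, end) into a materialized part and a dynamically expanded part."""
    window_end = today + timedelta(days=WINDOW_DAYS)
    # Portion served from the instances table
    materialized = (start, min(end, window_end)) if start < window_end else None
    # Portion expanded from RRULEs at query time
    dynamic = (max(start, window_end), end) if end > window_end else None
    return materialized, dynamic
```

A week or month view near today returns `None` for the dynamic part and never touches the expansion path, which is the common case the hybrid design optimizes for.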
### Path Comparison
| Factor | Path A (RRULE) | Path B (Instance) | Path C (Hybrid) |
| ------------------- | ------------------ | --------------------- | ------------------ |
| Storage | Minimal | High | Moderate |
| Read latency | High (expansion) | Low | Low within window |
| Write complexity | Low | High (batch updates) | Moderate |
| Infinite recurrence | Yes | No | Yes |
| Free/busy speed | Slow | Fast | Fast within window |
| Best for | Personal calendars | Enterprise scheduling | General-purpose |
### This Article's Focus
This article implements **Path C (Hybrid)** because Google Calendar serves both consumer users (long-running personal recurring events) and enterprise users (meeting-heavy scheduling). The hybrid approach optimizes for the common case (viewing this week/month) while supporting edge cases (events repeating forever).
## High-Level Design
### Service Architecture
#### Event Service
Handles CRUD operations for events and recurring masters:
- Create/update/delete single events
- Create/update/delete recurring series (stores RRULE)
- Create exceptions (modified or cancelled instances)
- Query events by date range (calls Recurrence Service for expansion)
#### Recurrence Service
Expands RRULE strings into concrete instances:
- Parse RRULE using RFC 5545 grammar
- Generate instances within a date range
- Apply EXDATE (exclusions) and RDATE (additions)
- Merge with exception instances from database
- Cache expansions in Redis (TTL = 1 hour)
#### Scheduling Service
Handles meeting coordination:
- Aggregate free/busy across attendees
- Find available meeting slots
- Send invitations (iTIP REQUEST method)
- Process RSVPs (iTIP REPLY method)
- Resource (room) availability and booking
#### Sync Service
Manages multi-client synchronization:
- Implement CalDAV protocol (RFC 4791)
- Maintain sync-tokens per calendar
- Push notifications for real-time updates (WebSocket/FCM)
- Handle conflict detection and resolution
#### Notification Service
Delivers reminders and alerts:
- Schedule reminders based on event VALARM
- Deliver via push notification, email, SMS
- Handle timezone-aware scheduling (reminder at 9 AM local time)
- Batch notification delivery for efficiency
### Data Flow: Creating a Recurring Event

### Data Flow: Querying Calendar View

## API Design
### Event Resource
#### Create Event
**Endpoint:** `POST /api/v1/calendars/{calendarId}/events`
```json collapse={1-3, 29-35}
// Headers
Authorization: Bearer {access_token}
Content-Type: application/json
// Request body
{
"summary": "Weekly Team Standup",
"description": "Discuss blockers and priorities",
"start": {
"dateTime": "2024-01-15T09:00:00",
"timeZone": "America/New_York"
},
"end": {
"dateTime": "2024-01-15T09:30:00",
"timeZone": "America/New_York"
},
"recurrence": ["RRULE:FREQ=WEEKLY;BYDAY=MO,WE,FR"],
"attendees": [
{"email": "alice@example.com"},
{"email": "bob@example.com", "optional": true}
],
"reminders": {
"useDefault": false,
"overrides": [
{"method": "popup", "minutes": 10},
{"method": "email", "minutes": 60}
]
},
"conferenceData": {
"createRequest": {"requestId": "unique-request-id"}
},
"visibility": "default",
"transparency": "opaque"
}
```
**Response (201 Created):**
```json collapse={1-5, 35-50}
{
"kind": "calendar#event",
"etag": "\"3148476458000000\"",
"id": "abc123xyz",
"status": "confirmed",
"htmlLink": "https://calendar.example.com/event?eid=abc123xyz",
"created": "2024-01-10T15:30:00.000Z",
"updated": "2024-01-10T15:30:00.000Z",
"summary": "Weekly Team Standup",
"description": "Discuss blockers and priorities",
"creator": {
"email": "organizer@example.com",
"self": true
},
"organizer": {
"email": "organizer@example.com",
"self": true
},
"start": {
"dateTime": "2024-01-15T09:00:00-05:00",
"timeZone": "America/New_York"
},
"end": {
"dateTime": "2024-01-15T09:30:00-05:00",
"timeZone": "America/New_York"
},
"recurrence": ["RRULE:FREQ=WEEKLY;BYDAY=MO,WE,FR"],
"iCalUID": "abc123xyz@calendar.example.com",
"sequence": 0,
"attendees": [
{ "email": "alice@example.com", "responseStatus": "needsAction" },
{ "email": "bob@example.com", "responseStatus": "needsAction", "optional": true }
],
"reminders": {
"useDefault": false,
"overrides": [
{ "method": "popup", "minutes": 10 },
{ "method": "email", "minutes": 60 }
]
},
"conferenceData": {
"conferenceId": "meet123",
"conferenceSolution": {
"name": "Google Meet",
"iconUri": "https://..."
},
"entryPoints": [{ "entryPointType": "video", "uri": "https://meet.example.com/meet123" }]
}
}
```
**Error Responses:**
- `400 Bad Request`: Invalid RRULE syntax, missing required fields
- `401 Unauthorized`: Missing or invalid auth token
- `403 Forbidden`: No write access to calendar
- `409 Conflict`: Event conflicts with existing event (if strict mode)
- `429 Too Many Requests`: Rate limit exceeded
**Rate Limits:** 600 requests/minute per user, 10,000/minute per project
#### Query Events
**Endpoint:** `GET /api/v1/calendars/{calendarId}/events`
**Query Parameters:**
| Parameter | Type | Description |
| -------------- | ------- | ----------------------------------------------------- |
| `timeMin` | ISO8601 | Lower bound (inclusive) for event end time |
| `timeMax` | ISO8601 | Upper bound (exclusive) for event start time |
| `singleEvents` | boolean | If true, expand recurring events into instances |
| `orderBy` | string | `startTime` (requires singleEvents=true) or `updated` |
| `maxResults` | integer | Maximum entries returned (default: 250, max: 2500) |
| `pageToken` | string | Token for pagination |
| `syncToken` | string | Token from previous sync for incremental updates |
| `showDeleted` | boolean | Include cancelled events (for sync) |
**Design Decision: Pagination Strategy**
**Why cursor-based (pageToken/syncToken), not offset-based:**
- Calendar data is highly dynamic (events created/deleted constantly)
- Offset pagination breaks when data changes between pages
- Sync tokens enable efficient incremental sync (only fetch changes)
**Sync flow:**
1. Initial full sync: `GET /events?timeMin=...&timeMax=...` → returns `nextSyncToken`
2. Incremental sync: `GET /events?syncToken={token}` → returns changed items + a new `nextSyncToken`
3. If sync token expires (410 Gone): perform full sync again
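The sync flow can be sketched as a client loop that pages through results and falls back to a full sync when the token has expired. Everything here is a sketch under assumptions: `fetch` is a hypothetical transport callable that takes query parameters and returns a decoded response, `SyncTokenExpired` stands in for a `410 Gone` response, and the `items` and `nextPageToken` field names are assumed (only `nextSyncToken` appears in the flow above).

```python
from typing import Callable, Optional


class SyncTokenExpired(Exception):
    """Raised by the transport when the server answers 410 Gone."""


def _drain(fetch: Callable[[dict], dict], params: dict) -> tuple[list, str]:
    """Follow page tokens until the final page, which carries the sync token."""
    items: list = []
    while True:
        page = fetch(params)
        items.extend(page["items"])
        if "nextPageToken" in page:
            params = {**params, "pageToken": page["nextPageToken"]}
        else:
            return items, page["nextSyncToken"]


def sync_calendar(
    fetch: Callable[[dict], dict],
    sync_token: Optional[str],
    time_min: str,
    time_max: str,
) -> tuple[list, str]:
    """Return (changed events, token to store for the next sync)."""
    full_sync = {"timeMin": time_min, "timeMax": time_max}
    if sync_token is None:
        return _drain(fetch, full_sync)
    try:
        return _drain(fetch, {"syncToken": sync_token})
    except SyncTokenExpired:
        # The client must discard its local state and start over.
        return _drain(fetch, full_sync)
```

The caller persists the returned token only after it has applied all returned items, so a crash mid-sync repeats the work instead of skipping changes.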
#### Modify Single Instance of Recurring Event
**Endpoint:** `PUT /api/v1/calendars/{calendarId}/events/{recurringEventId}/instances/{instanceId}`
This creates an **exception instance** that overrides the recurring pattern for one occurrence.
```json
{
"start": {
"dateTime": "2024-01-17T10:00:00",
"timeZone": "America/New_York"
},
"end": {
"dateTime": "2024-01-17T10:30:00",
"timeZone": "America/New_York"
}
}
```
The `instanceId` encodes the original instance date (e.g., `abc123xyz_20240117T140000Z`).
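One way to derive such an identifier is to convert the occurrence's original local start time to UTC and append it to the series ID. The helper below is a sketch of that assumed format, not a documented encoding. It uses Python's standard `zoneinfo` module (3.9+), which requires a time zone database on the host.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def instance_id(event_id: str, original_local_start: datetime, tz_name: str) -> str:
    """Build an instance ID from the series ID and the original occurrence start."""
    # Interpret the naive wall-clock time in the event's zone, then convert to UTC.
    start_utc = original_local_start.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)
    return f"{event_id}_{start_utc:%Y%m%dT%H%M%SZ}"
```

For the standup above, `instance_id("abc123xyz", datetime(2024, 1, 17, 9, 0), "America/New_York")` yields `abc123xyz_20240117T140000Z`. The ID uses the original 9:00 start, not the rescheduled 10:00, so it stays stable however often the exception is edited.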
**Design Decision: How Exceptions Are Stored**
The exception is stored as a separate row linked to the recurring master via `recurring_event_id` with the `original_start_time` preserved. This allows:
- Querying the modified instance by its new time
- Reverting to the original time by deleting the exception
- Identifying which instance was modified (via `original_start_time`)
### Free/Busy Query
**Endpoint:** `POST /api/v1/freeBusy`
```json
{
"timeMin": "2024-01-15T00:00:00Z",
"timeMax": "2024-01-22T00:00:00Z",
"items": [
{ "id": "alice@example.com" },
{ "id": "bob@example.com" },
{ "id": "conference-room-a@resource.example.com" }
]
}
```
**Response:**
```json collapse={1-3, 25-30}
{
"kind": "calendar#freeBusy",
"timeMin": "2024-01-15T00:00:00Z",
"timeMax": "2024-01-22T00:00:00Z",
"calendars": {
"alice@example.com": {
"busy": [
{ "start": "2024-01-15T14:00:00Z", "end": "2024-01-15T15:00:00Z" },
{ "start": "2024-01-16T09:00:00Z", "end": "2024-01-16T10:00:00Z" }
]
},
"bob@example.com": {
"busy": [{ "start": "2024-01-15T14:00:00Z", "end": "2024-01-15T14:30:00Z" }]
},
"conference-room-a@resource.example.com": {
"busy": [{ "start": "2024-01-15T10:00:00Z", "end": "2024-01-15T11:00:00Z" }],
"errors": []
}
},
"groups": {}
}
```
**Design Decision: Free/Busy Privacy**
Free/busy queries return only time intervals, not event details. This allows users to share availability without exposing meeting contents. The `transparency` field on events controls whether they appear as busy:
- `opaque` (default): Shows as busy
- `transparent`: Doesn't block time (e.g., "Working from home" all-day event)
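A minimal sketch of how a server might reduce a calendar's events to busy intervals: drop transparent events, keep only start and end times, and merge overlaps. The event dictionaries and the `busy_intervals` name are assumptions for illustration.

```python
from datetime import datetime

Interval = tuple[datetime, datetime]


def busy_intervals(events: list[dict]) -> list[Interval]:
    """Merge opaque events into non-overlapping busy intervals, dropping details."""
    spans = sorted(
        (e["start"], e["end"])
        for e in events
        if e.get("transparency", "opaque") == "opaque"
    )
    merged: list[Interval] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent: extend the previous interval
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Merging also helps privacy: two back-to-back meetings appear as one busy block, so the response does not reveal how many events a person has.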
## Data Modeling
### Event Schema
**Primary Store:** PostgreSQL (ACID for writes, complex queries for recurrence)
```sql collapse={1-5, 45-55}
-- Users and calendars (simplified)
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email VARCHAR(255) UNIQUE NOT NULL,
timezone VARCHAR(50) DEFAULT 'UTC',
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE calendars (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
owner_id UUID NOT NULL REFERENCES users(id),
name VARCHAR(255) NOT NULL,
timezone VARCHAR(50) NOT NULL,
sync_token BIGINT DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Event master table (stores both single and recurring events)
CREATE TABLE events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
calendar_id UUID NOT NULL REFERENCES calendars(id),
ical_uid VARCHAR(255) NOT NULL, -- RFC 5545 UID for iCal interop
summary VARCHAR(500),
description TEXT,
location VARCHAR(500),
-- Time fields stored in local time with timezone
start_datetime TIMESTAMP NOT NULL,
end_datetime TIMESTAMP NOT NULL,
start_timezone VARCHAR(50) NOT NULL,
end_timezone VARCHAR(50) NOT NULL,
is_all_day BOOLEAN DEFAULT FALSE,
-- Recurrence (NULL for single events)
recurrence_rule TEXT, -- RRULE string, e.g., "FREQ=WEEKLY;BYDAY=MO,WE,FR"
recurrence_exceptions TEXT[], -- EXDATE array
recurrence_additions TEXT[], -- RDATE array
-- Metadata
status VARCHAR(20) DEFAULT 'confirmed', -- confirmed, tentative, cancelled
visibility VARCHAR(20) DEFAULT 'default', -- default, public, private
transparency VARCHAR(20) DEFAULT 'opaque', -- opaque, transparent
sequence INTEGER DEFAULT 0, -- Increment on updates (iCal SEQUENCE)
-- Organizer and creator
organizer_email VARCHAR(255),
creator_email VARCHAR(255),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
deleted_at TIMESTAMPTZ, -- Soft delete
UNIQUE(calendar_id, ical_uid)
);
-- Indexes for common query patterns
CREATE INDEX idx_events_calendar_time ON events(calendar_id, start_datetime, end_datetime)
WHERE deleted_at IS NULL;
CREATE INDEX idx_events_updated ON events(calendar_id, updated_at)
WHERE deleted_at IS NULL;
CREATE INDEX idx_events_recurring ON events(calendar_id)
WHERE recurrence_rule IS NOT NULL AND deleted_at IS NULL;
```
**Design Decision: Local Time Storage**
Why store `start_datetime` as local time with a separate `start_timezone` instead of UTC?
1. **DST correctness**: A "9 AM daily standup" should always be at 9 AM local time. If stored as UTC, it would shift by an hour during DST transitions.
2. **RRULE expansion**: The RRULE `BYDAY=MO` means Monday in the event's timezone, not UTC Monday.
3. **Display simplicity**: No conversion needed when displaying in the organizer's timezone.
**Trade-off**: Queries that span multiple timezones require conversion. The materialized instances table stores computed UTC times for efficient range queries.
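The DST argument is easy to verify with Python's standard `zoneinfo` module. The sketch below converts the same 9 AM wall-clock time to UTC on either side of the 2024 US spring-forward transition (March 10). The helper name `to_utc` is illustrative.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(local_wall_time: datetime, tz_name: str) -> datetime:
    """Interpret a naive local time in the named zone and convert it to UTC."""
    return local_wall_time.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)


# Same wall-clock time, different UTC instants
before_dst = to_utc(datetime(2024, 3, 8, 9, 0), "America/New_York")   # EST, UTC-5
after_dst = to_utc(datetime(2024, 3, 11, 9, 0), "America/New_York")   # EDT, UTC-4
```

Here `before_dst` falls at 14:00 UTC and `after_dst` at 13:00 UTC. A series stored as a fixed UTC time would show up at 10 AM local after the transition, which is why the master row keeps local time plus the zone name and only the materialized instances carry UTC.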
### Materialized Instances
```sql collapse={1-3, 30-35}
-- Materialized instances for query performance
-- Regenerated nightly for rolling 90-day window
CREATE TABLE event_instances (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_id UUID NOT NULL REFERENCES events(id) ON DELETE CASCADE,
calendar_id UUID NOT NULL REFERENCES calendars(id),
-- Instance timing (UTC for efficient range queries)
instance_start_utc TIMESTAMPTZ NOT NULL,
instance_end_utc TIMESTAMPTZ NOT NULL,
-- Original occurrence date (for exception matching)
original_start_utc TIMESTAMPTZ NOT NULL,
-- Instance-specific overrides (NULL = inherit from master)
summary_override VARCHAR(500),
description_override TEXT,
location_override VARCHAR(500),
start_override TIMESTAMP,
end_override TIMESTAMP,
timezone_override VARCHAR(50),
-- Exception status
status VARCHAR(20) NOT NULL DEFAULT 'confirmed', -- confirmed, cancelled
is_exception BOOLEAN DEFAULT FALSE,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Primary query index: calendar + date range
CREATE INDEX idx_instances_calendar_range
ON event_instances(calendar_id, instance_start_utc, instance_end_utc)
WHERE status != 'cancelled';
-- Free/busy aggregation index
CREATE INDEX idx_instances_freebusy
ON event_instances(calendar_id, instance_start_utc, instance_end_utc)
WHERE status = 'confirmed';
-- Exception lookup (find if this occurrence has been modified)
CREATE INDEX idx_instances_exception
ON event_instances(event_id, original_start_utc)
WHERE is_exception = TRUE;
```
### Attendees and RSVPs
```sql collapse={1-3, 25-30}
-- Attendees for meetings
CREATE TABLE event_attendees (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
event_id UUID NOT NULL REFERENCES events(id) ON DELETE CASCADE,
email VARCHAR(255) NOT NULL,
display_name VARCHAR(255),
-- Response status (RFC 5545 PARTSTAT)
response_status VARCHAR(20) DEFAULT 'needsAction',
-- needsAction, declined, tentative, accepted
-- Role
is_organizer BOOLEAN DEFAULT FALSE,
is_optional BOOLEAN DEFAULT FALSE,
is_resource BOOLEAN DEFAULT FALSE, -- Conference room, equipment
-- Response metadata
response_comment TEXT,
responded_at TIMESTAMPTZ,
UNIQUE(event_id, email)
);
CREATE INDEX idx_attendees_email ON event_attendees(email, event_id);
CREATE INDEX idx_attendees_event ON event_attendees(event_id);
```
### Database Selection Matrix
| Data Type | Store | Rationale |
| -------------------- | --------------------- | ------------------------------------------------- |
| Events and instances | PostgreSQL | ACID, complex RRULE queries, date range filtering |
| Free/busy cache | Redis Sorted Sets | Sub-ms latency, TTL, efficient range queries |
| Full-text search | Elasticsearch | Event content search, attendee search |
| Attachments | Object Storage (S3) | Large files, CDN delivery |
| Notification queue | Redis Streams / Kafka | High throughput, at-least-once delivery |
| Sync tokens | PostgreSQL | Transactional consistency with events |
### Sharding Strategy
**Primary shard key:** `calendar_id`
**Rationale:**
- Co-locates all events for a calendar (most queries filter by calendar)
- Calendar view queries hit single shard
- Cross-calendar queries (free/busy) require scatter-gather, but these are less frequent
**Shard distribution:**
- Hash-based sharding on `calendar_id`
- 256 logical shards, distributed across physical nodes
- Rebalancing via consistent hashing
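As a rough sketch of the mapping described above, the following hashes a `calendar_id` to one of 256 logical shards and then looks up a physical node. The FNV-1a hash, the function names (`shardFor`, `buildShardMap`), and the node names are assumptions for illustration only; the static lookup table stands in for whatever consistent-hashing ring a real deployment would use.

```typescript
const LOGICAL_SHARDS = 256

// FNV-1a 32-bit hash over UTF-16 code units.
// Assumption: any stable, well-distributed hash would work here.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash
}

// calendar_id -> logical shard in [0, 255]
function shardFor(calendarId: string): number {
  return fnv1a(calendarId) % LOGICAL_SHARDS
}

// Hypothetical lookup table: logical shard -> physical node.
// Rebalancing reassigns entries in this table; calendars are never rehashed.
function buildShardMap(nodes: string[]): string[] {
  const map: string[] = []
  for (let shard = 0; shard < LOGICAL_SHARDS; shard++) {
    map.push(nodes[shard % nodes.length])
  }
  return map
}

const shardMap = buildShardMap(["pg-node-a", "pg-node-b", "pg-node-c"])
const shard = shardFor("7d9f3c1e-5b2a-4c8e-9f10-2a6b7c8d9e0f")
const node = shardMap[shard]
```

Keeping the logical shard count fixed means a calendar's shard number never changes; only the shard-to-node assignment moves when capacity is added.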
## Low-Level Design
### Recurrence Expansion Algorithm
The recurrence service expands RRULE strings into concrete instances. RFC 5545 defines the algorithm, but edge cases require careful handling.
#### RRULE Parsing and Expansion
```typescript collapse={1-10, 45-60}
// Using rrule.js for parsing and expansion (python-dateutil is the Python equivalent)
import { RRule, RRuleSet } from "rrule"
interface RecurrenceExpansionRequest {
rruleString: string // e.g., "FREQ=WEEKLY;BYDAY=MO,WE,FR"
dtstart: Date // Series start in local time
timezone: string // IANA timezone
rangeStart: Date // Query range start (UTC)
rangeEnd: Date // Query range end (UTC)
exdates?: Date[] // Excluded dates
rdates?: Date[] // Additional dates
}
function expandRecurrence(req: RecurrenceExpansionRequest): Date[] {
// Parse the RRULE and attach DTSTART and TZID. RRule.fromString() alone would
// ignore req.dtstart and req.timezone and default dtstart to the current time.
const rule = new RRule({
...RRule.parseString(req.rruleString),
dtstart: req.dtstart,
tzid: req.timezone,
})
const rruleSet = new RRuleSet()
rruleSet.rrule(rule)
// Add exclusions (EXDATE)
for (const exdate of req.exdates ?? []) {
rruleSet.exdate(exdate)
}
// Add additional dates (RDATE)
for (const rdate of req.rdates ?? []) {
rruleSet.rdate(rdate)
}
// Expand within range (third argument makes both bounds inclusive)
// CRITICAL: DST handling depends on the tzid set on the rule above
const instances = rruleSet.between(req.rangeStart, req.rangeEnd, true)
return instances
}
// Example: Weekly standup at 9 AM, Mon/Wed/Fri
const instances = expandRecurrence({
rruleString: "FREQ=WEEKLY;BYDAY=MO,WE,FR",
dtstart: new Date("2024-01-15T09:00:00"),
timezone: "America/New_York",
rangeStart: new Date("2024-01-01T00:00:00Z"),
rangeEnd: new Date("2024-03-31T23:59:59Z"),
exdates: [new Date("2024-01-17T09:00:00")], // Skip Jan 17
})
// Returns: [Jan 15, Jan 19, Jan 22, Jan 24, Jan 26, ...]
```
#### DST Edge Cases
**Spring Forward (2 AM → 3 AM):**
When an event is scheduled at 2:30 AM on the night clocks spring forward, the time doesn't exist.
```typescript collapse={1-5}
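// Handling non-existent times during spring forward
import { DateTime } from "luxon"
type WallClockTime = { year: number; month: number; day: number; hour: number; minute: number }
function adjustForDST(local: WallClockTime, timezone: string): Date {
// Start from wall-clock fields: a JS Date is already an absolute instant,
// so it can never represent a local time that was skipped
const dt = DateTime.fromObject(local, { zone: timezone })
// Luxon (v2+) resolves a skipped local time by moving it forward past the gap
// (2:30 AM becomes 3:30 AM), so a changed hour means the requested time did not exist
if (dt.hour !== local.hour) {
console.warn(`${timezone}: requested local time does not exist, using ${dt.toISO()}`)
}
return dt.toJSDate()
}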
```
**Fall Back (1 AM to 2 AM occurs twice):**
When clocks fall back, the 1:00-2:00 AM hour repeats. RFC 5545 specifies that a local DATE-TIME value that occurs twice refers to the first occurrence.
**Design Decision:** Follow the VTIMEZONE specification by storing and expanding in local time with TZID. The TZID references the IANA database, which contains the complete DST rules. Libraries like Luxon, date-fns-tz, and moment-timezone handle this correctly.
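To make both DST rules concrete, here is a minimal sketch that resolves a wall-clock time to a UTC instant using only the built-in `Intl` API. The names `WallClock`, `offsetMinutes`, and `resolveWallClock` are hypothetical, and the sketch assumes the zone's offset changes at most once within a day either side of the requested time. For a skipped time it applies the offset in effect before the gap; for a repeated time it picks the first occurrence.

```typescript
interface WallClock { year: number; month: number; day: number; hour: number; minute: number }

// UTC offset in minutes (local minus UTC) that `zone` observes at the given instant
function offsetMinutes(utcMs: number, zone: string): number {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone: zone, hour12: false,
    year: "numeric", month: "numeric", day: "numeric",
    hour: "numeric", minute: "numeric", second: "numeric",
  }).formatToParts(new Date(utcMs))
  const get = (type: string): number => {
    const part = parts.filter((p) => p.type === type)[0]
    return part ? Number(part.value) : 0
  }
  // Some engines print midnight as hour 24, hence the modulo
  const asUtc = Date.UTC(get("year"), get("month") - 1, get("day"), get("hour") % 24, get("minute"), get("second"))
  return Math.round((asUtc - utcMs) / 60000)
}

function resolveWallClock(local: WallClock, zone: string): Date {
  // Treat the wall-clock fields as if they were UTC, then subtract a candidate offset
  const wall = Date.UTC(local.year, local.month - 1, local.day, local.hour, local.minute)
  const DAY_MS = 24 * 60 * 60 * 1000
  const offsetBefore = offsetMinutes(wall - DAY_MS, zone)
  const offsetAfter = offsetMinutes(wall + DAY_MS, zone)
  const candidates = [wall - offsetBefore * 60000, wall - offsetAfter * 60000]
  // A candidate is real only if the zone actually observes that offset at that instant
  const valid = candidates.filter((c) => offsetMinutes(c, zone) * 60000 === wall - c)
  if (valid.length > 0) {
    return new Date(Math.min(...valid)) // repeated time: earliest instant is the first occurrence
  }
  return new Date(wall - offsetBefore * 60000) // skipped time: use the offset before the gap
}

// 2:30 AM on 2024-03-10 does not exist in New York; it resolves to 3:30 AM EDT
const springForward = resolveWallClock({ year: 2024, month: 3, day: 10, hour: 2, minute: 30 }, "America/New_York")
// 1:30 AM on 2024-11-03 occurs twice; the first (EDT) occurrence is chosen
const fallBack = resolveWallClock({ year: 2024, month: 11, day: 3, hour: 1, minute: 30 }, "America/New_York")
```

In production a maintained library is the safer choice; the sketch only shows what "first occurrence" and "shift forward" mean in terms of offsets.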
### Free/Busy Aggregation
Free/busy aggregation is the core of meeting scheduling. It must be fast (< 100ms for 10 attendees over 1 week) and respect privacy.
#### Redis-Based Free/Busy Cache
```typescript collapse={1-8, 40-50}
import { Redis } from "ioredis"
interface BusyInterval {
start: number // Unix timestamp
end: number
eventId?: string // Only for the calendar owner
}
// Store busy intervals as sorted set members
// Key: freebusy:{calendarId}
// Score: start timestamp
// Member: JSON { start, end, eventId }
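async function updateFreeBusy(redis: Redis, calendarId: string, instances: EventInstance[]): Promise<void> {
if (instances.length === 0) return // Math.min/Math.max over an empty list would yield ±Infinity
const key = `freebusy:${calendarId}`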
const pipeline = redis.pipeline()
// Clear existing entries in the affected range
const rangeStart = Math.min(...instances.map((i) => i.startUtc.getTime() / 1000))
const rangeEnd = Math.max(...instances.map((i) => i.endUtc.getTime() / 1000))
pipeline.zremrangebyscore(key, rangeStart, rangeEnd)
// Add new busy intervals
for (const instance of instances) {
if (instance.status === "confirmed" && instance.transparency === "opaque") {
const interval: BusyInterval = {
start: instance.startUtc.getTime() / 1000,
end: instance.endUtc.getTime() / 1000,
eventId: instance.eventId,
}
pipeline.zadd(key, interval.start, JSON.stringify(interval))
}
}
// Set TTL to 7 days (refresh weekly)
pipeline.expire(key, 7 * 24 * 60 * 60)
await pipeline.exec()
}
async function queryFreeBusy(
redis: Redis,
calendarId: string,
rangeStart: Date,
rangeEnd: Date,
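): Promise<BusyInterval[]> {
const key = `freebusy:${calendarId}`
const start = rangeStart.getTime() / 1000
const end = rangeEnd.getTime() / 1000
// Scores are start times, so fetch every interval that starts before the range ends.
// Querying from `start` would miss meetings already in progress when the range begins.
// (A bounded lookback, e.g. start minus the longest allowed event, avoids scanning all history.)
const members = await redis.zrangebyscore(key, "-inf", end)
// Keep only intervals that end after the range start, i.e. that actually overlap the range
return members.map((m) => JSON.parse(m) as BusyInterval).filter((interval) => interval.end > start)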
}
```
#### Finding Available Slots
```typescript collapse={1-5, 50-60}
interface TimeSlot {
start: Date
end: Date
}
function findAvailableSlots(
busyIntervalsByAttendee: Map<string, BusyInterval[]>,
rangeStart: Date,
rangeEnd: Date,
duration: number, // minutes
workingHours?: { start: number; end: number }, // e.g., { start: 9, end: 17 }
): TimeSlot[] {
// Merge all busy intervals
const allBusy: BusyInterval[] = []
for (const intervals of busyIntervalsByAttendee.values()) {
allBusy.push(...intervals)
}
// Sort by start time
allBusy.sort((a, b) => a.start - b.start)
// Merge overlapping intervals
const merged: BusyInterval[] = []
for (const interval of allBusy) {
if (merged.length === 0 || merged[merged.length - 1].end < interval.start) {
merged.push({ ...interval })
} else {
merged[merged.length - 1].end = Math.max(merged[merged.length - 1].end, interval.end)
}
}
// Find gaps that fit the duration
const durationSec = duration * 60
const available: TimeSlot[] = []
let cursor = rangeStart.getTime() / 1000
for (const busy of merged) {
if (busy.start - cursor >= durationSec) {
available.push({
start: new Date(cursor * 1000),
end: new Date(busy.start * 1000),
})
}
cursor = Math.max(cursor, busy.end)
}
// Check final gap
const endSec = rangeEnd.getTime() / 1000
if (endSec - cursor >= durationSec) {
available.push({
start: new Date(cursor * 1000),
end: rangeEnd,
})
}
// Filter by working hours if specified. Caveats: getHours() reads the server's local timezone, and only the slot's start hour is checked
if (workingHours) {
return available.filter((slot) => {
const startHour = slot.start.getHours()
return startHour >= workingHours.start && startHour < workingHours.end
})
}
return available
}
```
**Time Complexity:** O(N log N) for sorting, O(N) for merging, where N = total busy intervals across all attendees.
### Sync Token Implementation
Sync tokens enable efficient incremental sync for CalDAV clients and mobile apps.
```sql collapse={1-5}
-- Track changes for sync
CREATE TABLE calendar_changes (
id BIGSERIAL PRIMARY KEY,
calendar_id UUID NOT NULL REFERENCES calendars(id),
event_id UUID NOT NULL,
change_type VARCHAR(10) NOT NULL, -- 'created', 'updated', 'deleted'
changed_at TIMESTAMPTZ DEFAULT NOW(),
sync_token BIGINT NOT NULL -- Matches calendars.sync_token at time of change
);
CREATE INDEX idx_changes_sync ON calendar_changes(calendar_id, sync_token);
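-- On event change, record it
CREATE OR REPLACE FUNCTION record_event_change()
RETURNS TRIGGER AS $$
DECLARE
affected events%ROWTYPE;
new_token BIGINT;
BEGIN
-- NEW is NULL on DELETE, so read the affected row from OLD in that case
IF TG_OP = 'DELETE' THEN
affected := OLD;
ELSE
affected := NEW;
END IF;
-- Increment calendar's sync token and capture the new value
UPDATE calendars SET sync_token = sync_token + 1
WHERE id = affected.calendar_id
RETURNING sync_token INTO new_token;
-- Record the change, mapping TG_OP (INSERT/UPDATE/DELETE) to change_type
INSERT INTO calendar_changes (calendar_id, event_id, change_type, sync_token)
VALUES (
affected.calendar_id,
affected.id,
CASE TG_OP WHEN 'INSERT' THEN 'created' WHEN 'UPDATE' THEN 'updated' ELSE 'deleted' END,
new_token
);
RETURN affected;
END;
$$ LANGUAGE plpgsql;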
```
**Sync flow:**
1. **Initial sync:** Client receives all events + current `syncToken` (e.g., 15)
2. **Incremental sync:** Client sends `syncToken=15`, server returns changes where `sync_token > 15` + new token (e.g., 23)
3. **Token expiration:** If changes for token 15 have been purged (older than 30 days), return 410 Gone → client performs full sync
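The three steps above can be sketched as a single server-side function over an in-memory change log. The `incrementalSync` name, the `oldestRetainedToken` parameter, and the response shape are assumptions for illustration; a real handler would read these values from `calendar_changes` and `calendars`.

```typescript
type ChangeType = "created" | "updated" | "deleted"

interface CalendarChange {
  eventId: string
  changeType: ChangeType
  syncToken: number
}

type SyncResponse =
  | { status: 200; changes: CalendarChange[]; nextSyncToken: number }
  | { status: 410 } // token too old: client must fall back to a full sync

// `log` holds the retained change rows for one calendar.
// `currentToken` mirrors calendars.sync_token.
// `oldestRetainedToken` (hypothetical) is the smallest token not yet purged.
function incrementalSync(
  log: CalendarChange[],
  clientToken: number,
  currentToken: number,
  oldestRetainedToken: number,
): SyncResponse {
  // If any change newer than the client's token was purged, the delta would be incomplete
  if (clientToken + 1 < oldestRetainedToken) {
    return { status: 410 }
  }
  const changes = log
    .filter((change) => change.syncToken > clientToken)
    .sort((a, b) => a.syncToken - b.syncToken)
  return { status: 200, changes, nextSyncToken: currentToken }
}
```

A client that receives `status: 410` discards its local copy and repeats the initial sync; otherwise it applies `changes` in order and stores `nextSyncToken`.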
### Invitation Workflow (iTIP/iMIP)
When an organizer invites attendees, the system generates iTIP REQUEST messages:

**iMIP Email Format:**
```text
Content-Type: multipart/alternative; boundary="boundary"
--boundary
Content-Type: text/plain
You've been invited to: Weekly Team Standup
When: Monday, January 15, 2024 9:00 AM - 9:30 AM (EST)
--boundary
Content-Type: text/calendar; method=REQUEST
BEGIN:VCALENDAR
VERSION:2.0
METHOD:REQUEST
BEGIN:VEVENT
UID:abc123xyz@calendar.example.com
DTSTART;TZID=America/New_York:20240115T090000
DTEND;TZID=America/New_York:20240115T093000
SUMMARY:Weekly Team Standup
ORGANIZER:mailto:organizer@example.com
ATTENDEE;PARTSTAT=NEEDS-ACTION:mailto:attendee@example.com
END:VEVENT
END:VCALENDAR
--boundary--
```
## Frontend Considerations
### Calendar View Performance
**Problem:** A month view showing 30+ days with recurring events may need to display hundreds of event instances.
**Solution: Virtual Scrolling + Batched Loading**
```typescript collapse={1-10, 35-45}
// Load events in batches as user scrolls
interface CalendarViewState {
visibleRange: { start: Date; end: Date }
loadedRanges: Array<{ start: Date; end: Date }>
events: Map<string, CalendarEvent>
}
function useCalendarEvents(calendarId: string) {
const [state, setState] = useState<CalendarViewState>({
visibleRange: getCurrentWeek(),
loadedRanges: [],
events: new Map(),
})
// Load events for visible range + buffer
useEffect(() => {
const rangeToLoad = expandRange(state.visibleRange, { days: 7 }) // ±1 week buffer
if (!isRangeCovered(rangeToLoad, state.loadedRanges)) {
fetchEvents(calendarId, rangeToLoad).then((newEvents) => {
setState((prev) => ({
...prev,
loadedRanges: mergeRanges([...prev.loadedRanges, rangeToLoad]),
events: new Map([...prev.events, ...newEvents.map((e) => [e.id, e] as const)]),
}))
})
}
}, [state.visibleRange, calendarId])
return state.events
}
```
**Key optimizations:**
- Request `singleEvents=true` from API to get pre-expanded instances
- Cache responses by date range (events within a range don't change often)
- Use `ETag` / `If-None-Match` for conditional requests
- Virtualize day cells in month view (render only visible weeks)
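The hook above calls `mergeRanges` and `isRangeCovered` without defining them. One possible implementation is sketched below; the `DateRange` type name is an assumption, and the helpers treat ranges that touch or overlap as one continuous loaded range.

```typescript
interface DateRange {
  start: Date
  end: Date
}

// Collapse overlapping or touching ranges into a minimal sorted list
function mergeRanges(ranges: DateRange[]): DateRange[] {
  const sorted = ranges.slice().sort((a, b) => a.start.getTime() - b.start.getTime())
  const merged: DateRange[] = []
  for (const range of sorted) {
    const last = merged[merged.length - 1]
    if (last && range.start.getTime() <= last.end.getTime()) {
      // Extend the previous range if this one reaches further
      if (range.end.getTime() > last.end.getTime()) {
        last.end = range.end
      }
    } else {
      merged.push({ start: range.start, end: range.end })
    }
  }
  return merged
}

// True when a single merged range fully contains the target
function isRangeCovered(target: DateRange, loaded: DateRange[]): boolean {
  return mergeRanges(loaded).some(
    (range) =>
      range.start.getTime() <= target.start.getTime() && range.end.getTime() >= target.end.getTime(),
  )
}
```

Merging before the coverage check matters: two adjacent loaded weeks should count as covering a range that spans both.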
### Real-Time Updates
**Strategy:** WebSocket for active browser tabs, push notifications for background/mobile.
```typescript collapse={1-5, 25-35}
// Real-time sync via WebSocket
const useCalendarSync = (calendarId: string) => {
const queryClient = useQueryClient()
useEffect(() => {
const ws = new WebSocket(`wss://api.calendar.com/sync/${calendarId}`)
ws.onmessage = (event) => {
const change = JSON.parse(event.data)
switch (change.type) {
case "event.created":
case "event.updated":
// `old` is undefined until the query has data, so default to an empty list
queryClient.setQueryData(["events", calendarId], (old: CalendarEvent[] = []) => upsertEvent(old, change.event))
break
case "event.deleted":
queryClient.setQueryData(["events", calendarId], (old: CalendarEvent[] = []) =>
old.filter((e) => e.id !== change.eventId),
)
break
}
}
return () => ws.close()
}, [calendarId, queryClient])
}
```
### Timezone Display
**User expectations:**
- Event times shown in user's local timezone by default
- Option to view in event's original timezone
- All-day events should span the full day in any timezone
```typescript collapse={1-5}
// Convert and display event times
function formatEventTime(event: CalendarEvent, userTimezone: string): string {
const { DateTime } = require("luxon")
if (event.isAllDay) {
// All-day events: show date only, no timezone conversion
return DateTime.fromISO(event.start.date).toLocaleString(DateTime.DATE_MED)
}
// Timed events: convert to user's timezone
const start = DateTime.fromISO(event.start.dateTime, { zone: event.start.timeZone })
const userStart = start.setZone(userTimezone)
// Show the time in the event's original timezone alongside if it differs from the user's
if (event.start.timeZone !== userTimezone) {
return `${userStart.toLocaleString(DateTime.TIME_SIMPLE)} (${start.toLocaleString(DateTime.TIME_SIMPLE)} ${start.toFormat("ZZZZ")})`
}
return userStart.toLocaleString(DateTime.TIME_SIMPLE)
}
```
### Drag-and-Drop Rescheduling
**Optimistic updates with rollback:**
```typescript collapse={1-5, 30-40}
// Drag event to new time slot
async function handleEventDrop(eventId: string, newStart: Date, newEnd: Date) {
const previousEvent = queryClient.getQueryData(["event", eventId])
// Optimistic update
queryClient.setQueryData(["event", eventId], (old: CalendarEvent) => ({
...old,
start: { dateTime: newStart.toISOString(), timeZone: old.start.timeZone },
end: { dateTime: newEnd.toISOString(), timeZone: old.end.timeZone },
}))
try {
await updateEvent(eventId, { start: newStart, end: newEnd })
} catch (error) {
// Rollback on failure
queryClient.setQueryData(["event", eventId], previousEvent)
toast.error("Failed to reschedule event")
}
}
// For recurring event instance: prompt user for scope
function handleRecurringEventDrop(eventId: string, instanceDate: Date, newTime: Date) {
showDialog({
title: "Edit recurring event",
options: [
{ label: "This event only", action: () => updateInstance(eventId, instanceDate, newTime) },
{ label: "This and future events", action: () => splitSeries(eventId, instanceDate, newTime) },
{ label: "All events", action: () => updateSeries(eventId, newTime) },
],
})
}
```
## Infrastructure Design
### Cloud-Agnostic Concepts
| Component | Requirement | Options |
| -------------------- | ------------------------ | ------------------------------ |
| **Primary Database** | ACID, complex queries | PostgreSQL, MySQL |
| **Cache** | Sub-ms reads, TTL | Redis, Memcached |
| **Search** | Full-text, aggregations | Elasticsearch, OpenSearch |
| **Message Queue** | At-least-once, ordering | Kafka, RabbitMQ, Redis Streams |
| **Object Storage** | Attachments, large files | S3-compatible (MinIO) |
| **Job Scheduler** | Cron, delayed jobs | Temporal, Celery, pg-boss |
### AWS Reference Architecture

| Component | AWS Service | Configuration |
| ------------------ | ----------------- | ---------------------------------------- |
| API Service | ECS Fargate | 2-50 tasks, 1 vCPU / 2GB each |
| Background Workers | ECS Fargate Spot | 5-20 tasks, Spot for cost |
| Primary Database | RDS PostgreSQL | db.r6g.xlarge, Multi-AZ, 1TB gp3 |
| Read Replicas | RDS Read Replicas | 2 replicas across AZs |
| Cache | ElastiCache Redis | cache.r6g.large, 3-node cluster |
| Search | OpenSearch | m6g.large.search, 3-node |
| Message Queue | Amazon SQS / MSK | SQS for simplicity, MSK for ordering |
| Object Storage | S3 + CloudFront | Intelligent-Tiering, CDN for attachments |
| Notifications | Lambda + SNS | Push via FCM/APNs |
### Self-Hosted Alternatives
| Managed Service | Self-Hosted | When to Self-Host |
| --------------- | -------------------- | -------------------------------------------- |
| RDS PostgreSQL | PostgreSQL on EC2 | Cost at scale, specific extensions (pg_cron) |
| ElastiCache | Redis on EC2 | Redis modules (RedisJSON, RediSearch) |
| OpenSearch | Elasticsearch on EC2 | Cost, specific plugins |
| MSK | Kafka on EC2 | Cost at scale, Kafka Streams |
## Conclusion
This design prioritizes the **hybrid approach** for recurring events—materializing instances within a rolling window while supporting on-demand expansion for arbitrary ranges. This balances storage efficiency with query performance for the most common use cases (viewing this week/month).
Key architectural decisions:
1. **Local time + TZID storage**: Events stored in local time with named timezones, ensuring DST correctness for recurring events.
2. **Sync tokens for incremental sync**: Monotonically increasing tokens per calendar let CalDAV and mobile clients fetch only what changed instead of re-downloading the full calendar.
3. **Redis-cached free/busy**: Pre-computed busy intervals in sorted sets provide sub-100ms scheduling queries.
4. **iTIP/iMIP for interoperability**: Standards-based invitation workflow ensures email-based RSVP works across calendar providers.
**Limitations and future improvements:**
- **Conflict detection**: Current design uses last-write-wins; could implement operational transforms for real-time collaborative editing.
- **AI scheduling**: Could add ML-based suggestions for optimal meeting times based on attendee patterns.
- **Calendar federation**: Cross-organization free/busy queries require additional privacy controls and federation protocols.
## Appendix
### Prerequisites
- Distributed systems fundamentals (CAP theorem, eventual consistency)
- Database design (indexing, sharding, replication)
- REST API design principles
- Basic understanding of timezone concepts (UTC, offsets, DST)
### Terminology
- **RRULE**: Recurrence Rule—RFC 5545 syntax for defining repeating patterns (e.g., `FREQ=WEEKLY;BYDAY=MO`)
- **EXDATE**: Exception Date—dates excluded from a recurring series
- **iTIP**: iCalendar Transport-Independent Interoperability Protocol—defines methods for scheduling (REQUEST, REPLY, CANCEL)
- **iMIP**: iCalendar Message-Based Interoperability Protocol—iTIP over email
- **CalDAV**: Calendaring Extensions to WebDAV—protocol for calendar access and sync
- **Sync Token**: Opaque string representing calendar state for incremental synchronization
- **TZID**: Timezone Identifier—IANA timezone name (e.g., `America/New_York`)
### Summary
- Calendar systems require a **hybrid recurrence model**: store RRULE masters, materialize instances for a rolling window (30-90 days), expand dynamically beyond
- **Time storage must be local time with TZID**, not UTC, to handle DST transitions correctly for recurring events
- **Free/busy aggregation** is optimized via Redis sorted sets with pre-computed busy intervals
- **Sync tokens** enable efficient incremental sync—clients receive only changes since their last sync
- **iTIP/iMIP** provide interoperability with other calendar systems via standardized invitation workflows
- Scaling to 500M users requires PostgreSQL sharding by `calendar_id`, Redis caching, and async notification delivery
### References
- [RFC 5545 - iCalendar Specification](https://datatracker.ietf.org/doc/html/rfc5545) - Core data format for calendar interchange
- [RFC 4791 - CalDAV](https://datatracker.ietf.org/doc/html/rfc4791) - Calendar access protocol
- [RFC 5546 - iTIP](https://datatracker.ietf.org/doc/html/rfc5546) - Scheduling protocol (REQUEST, REPLY, CANCEL)
- [RFC 6047 - iMIP](https://datatracker.ietf.org/doc/html/rfc6047) - Email transport for calendar invitations
- [RFC 6578 - Collection Synchronization](https://datatracker.ietf.org/doc/html/rfc6578) - Sync token mechanism for WebDAV
- [IANA Time Zone Database](https://www.iana.org/time-zones) - Authoritative timezone data
- [Google Calendar API Documentation](https://developers.google.com/calendar/api) - Reference implementation patterns
- [rrule.js](https://github.com/jakubroztocil/rrule) - JavaScript library for RRULE expansion
---
## Design an Issue Tracker (Jira/Linear)
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/system-design-problems/design-issue-tracker
**Category:** System Design / System Design Problems
**Description:** A comprehensive system design for an issue tracking and project management tool covering API design for dynamic workflows, efficient kanban board pagination, drag-and-drop ordering without full row updates, concurrent edit handling, and real-time synchronization. This design addresses the challenges of project-specific column configurations while maintaining consistent user-defined ordering across views.
# Design an Issue Tracker (Jira/Linear)
A comprehensive system design for an issue tracking and project management tool covering API design for dynamic workflows, efficient kanban board pagination, drag-and-drop ordering without full row updates, concurrent edit handling, and real-time synchronization. This design addresses the challenges of project-specific column configurations while maintaining consistent user-defined ordering across views.

High-level architecture: API gateway routing to domain services, with WebSocket-based real-time sync and Redis pub/sub for broadcast.
## Abstract
Issue tracking systems solve three interconnected problems: **flexible workflows** (each project defines its own statuses and transitions), **efficient ordering** (issues maintain user-defined positions without expensive reindexing), and **concurrent editing** (multiple users can update the same issue simultaneously).
**Core architectural decisions:**
| Decision | Choice | Rationale |
| ------------------ | ------------------------------------- | ----------------------------------------------- |
| Ordering algorithm | Fractional indexing (LexoRank) | O(1) insertions without updating neighboring rows |
| API style | GraphQL with REST fallback | Flexible field selection for varied board views |
| Pagination | Per-column cursor-based | Ensures all columns load incrementally |
| Concurrency | Optimistic locking with version field | Low conflict rate in practice |
| Real-time sync | WebSocket + last-write-wins | Sub-300ms propagation, simple conflict model |
| Workflow storage | Polymorphic per-project | Projects own their status definitions |
**Key trade-offs accepted:**
- Denormalized board state in Redis for fast reads, with async consistency
- LexoRank strings grow unbounded, requiring periodic rebalancing
- Last-write-wins may lose concurrent edits (acceptable for most fields)
**What this design optimizes:**
- Drag-and-drop reordering updates exactly one row
- Board loads show issues across all columns immediately
- Workflow changes don't require schema migrations
## Requirements
### Functional Requirements
| Requirement | Priority | Notes |
| ----------------------------- | -------- | -------------------------------------------- |
| Create/edit/delete issues | Core | Title, description, assignee, type, priority |
| Project-specific workflows | Core | Custom statuses and transitions per project |
| Kanban board view | Core | Drag-drop between columns and within columns |
| Issue ordering within columns | Core | Persist user-defined order |
| Real-time updates | Core | See changes from other users immediately |
| Search and filter | Core | Full-text search, JQL-style queries |
| Comments and activity | Extended | Threaded comments, activity timeline |
| Attachments | Extended | File upload and preview |
| Sprints/iterations | Extended | Time-boxed groupings |
| Custom fields | Extended | Project-specific metadata |
### Non-Functional Requirements
| Requirement | Target | Rationale |
| -------------------------- | --------------- | ---------------------------------- |
| Availability | 99.9% (3 nines) | User-facing, productivity critical |
| Board load time | p99 < 500ms | Must feel instant |
| Issue update latency | p99 < 200ms | Drag-drop must be responsive |
| Real-time propagation | p99 < 300ms | Collaborative editing feel |
| Search latency | p99 < 100ms | Autocomplete responsiveness |
| Concurrent users per board | 100 | Team collaboration scenario |
### Scale Estimation
**Users:**
- Total users: 10M (Jira-scale)
- Daily Active Users (DAU): 2M (20%)
- Peak concurrent users: 500K
**Projects and Issues:**
- Projects: 1M
- Issues per project (active): 1,000 avg, 100,000 max
- Total issues: 1B
- Issues per board view: 200-500 typical
**Traffic:**
- Board loads: 2M DAU × 10 loads/day = 20M/day = ~230 RPS
- Issue updates: 2M DAU × 20 updates/day = 40M/day = ~460 RPS
- Peak multiplier: 3x → 700 RPS board loads, 1,400 RPS updates
**Storage:**
- Issue size: 5KB avg (metadata + description)
- Total issue storage: 1B × 5KB = 5TB
- Attachments: 50TB (separate object storage)
- Activity log: 20TB (append-only)
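The traffic and storage figures above follow from simple arithmetic, reproduced here as a small sketch so the rounding is visible. The constant and function names are illustrative; the inputs are the ones stated in the lists above.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60 // 86,400
const DAILY_ACTIVE_USERS = 2000000
const PEAK_MULTIPLIER = 3

// Average requests per second for an action each active user performs N times a day
function averageRps(actionsPerUserPerDay: number): number {
  return (DAILY_ACTIVE_USERS * actionsPerUserPerDay) / SECONDS_PER_DAY
}

const boardLoadRps = averageRps(10) // about 231, rounded to ~230 in the text
const updateRps = averageRps(20) // about 463, rounded to ~460 in the text

const peakBoardLoadRps = boardLoadRps * PEAK_MULTIPLIER // about 694, rounded to 700
const peakUpdateRps = updateRps * PEAK_MULTIPLIER // about 1,389, rounded to 1,400

// 1B issues at 5 KB each, using 1 TB = 1e9 KB
const TOTAL_ISSUES = 1e9
const ISSUE_SIZE_KB = 5
const issueStorageTB = (TOTAL_ISSUES * ISSUE_SIZE_KB) / 1e9 // 5
```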
## Design Paths
### Path A: Server-Authoritative with REST API
**Best when:**
- Team familiar with REST patterns
- Simpler infrastructure requirements
- Offline support not critical
- Moderate real-time requirements
**Architecture:**

**Trade-offs:**
- ✅ Simple mental model
- ✅ Standard tooling and caching
- ✅ Easy to debug
- ❌ Over-fetching/under-fetching without careful design
- ❌ Multiple round trips for complex operations
- ❌ Real-time requires separate WebSocket layer
**Real-world example:** Jira Cloud uses REST API with LexoRank for ordering and WebSocket for real-time updates.
### Path B: Local-First with Sync Engine
**Best when:**
- Offline support is critical
- Sub-100ms UI responsiveness required
- Team can invest in sync infrastructure
- Users on unreliable networks
**Architecture:**

**Trade-offs:**
- ✅ Instant UI response (local-first)
- ✅ Full offline support
- ✅ Minimal network traffic (deltas only)
- ❌ Complex sync logic
- ❌ Conflict resolution complexity
- ❌ Larger client-side footprint
**Real-world example:** Linear loads all issues into IndexedDB on startup, so search runs locally with no network round trip. Their sync engine uses last-write-wins for most fields with CRDTs for rich text descriptions.
### Path C: GraphQL with Optimistic Updates
**Best when:**
- Varied client needs (web, mobile, integrations)
- Complex data relationships
- Need flexibility without over-fetching
- Subscriptions for real-time
**Architecture:**
```graphql
mutation MoveIssue($input: MoveIssueInput!) {
moveIssue(input: $input) {
issue {
id
status {
id
name
}
rank
updatedAt
}
}
}
subscription OnBoardUpdate($boardId: ID!) {
boardUpdated(boardId: $boardId) {
issue {
id
status {
id
}
rank
}
action
}
}
```
**Trade-offs:**
- ✅ Flexible queries for different views
- ✅ Built-in subscriptions for real-time
- ✅ Single endpoint simplifies client
- ❌ Caching more complex
- ❌ Rate limiting harder
- ❌ Learning curve for teams
**Real-world example:** Linear uses GraphQL for all API operations—the same schema powers their web app, mobile app, and public API.
### Path Comparison
| Factor | REST | Local-First | GraphQL |
| ------------------------- | -------- | ----------- | -------- |
| Implementation complexity | Low | High | Medium |
| UI responsiveness | Medium | Excellent | Good |
| Offline support | Limited | Native | Limited |
| Client flexibility | Low | Low | High |
| Real-time complexity | Separate | Built-in | Built-in |
| Caching | Simple | Complex | Medium |
### This Article's Focus
This article focuses on **Path C (GraphQL with REST fallback)** because:
1. Flexible field selection suits varied board configurations
2. Subscriptions provide native real-time support
3. REST endpoints can coexist for webhooks and simple integrations
4. Modern issue trackers such as Linear use this approach
## High-Level Design
### Component Overview

### Issue Service
Handles core issue CRUD operations and ordering.
**Responsibilities:**
- Create, read, update, delete issues
- Rank calculation for ordering
- Status transitions with workflow validation
- Optimistic locking for concurrent updates
**Key design decisions:**
| Decision | Choice | Rationale |
| ----------- | ----------------------- | ------------------------------------------ |
| Primary key | UUID | Distributed ID generation, no coordination |
| Ordering | LexoRank string | O(1) reordering without cascading updates |
| Versioning | Monotonic version field | Optimistic locking for concurrent edits |
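To illustrate why a string rank allows O(1) reordering, here is a simplified sketch of generating a key that sorts between two neighbors. It is not Jira's actual LexoRank format; `rankBetween` is a hypothetical name, and the sketch assumes keys use only lowercase `a` to `z`, never end in `a`, and that `prev` sorts before `next` (an empty string means "no neighbor on that side").

```typescript
const BEFORE_A = 96 // char code just below "a": no character on the lower side
const AFTER_Z = 123 // char code just above "z": no character on the upper side
const A = 97
const B = 98
const Z = 122

function rankBetween(prev: string, next: string): string {
  let p = 0
  let n = 0
  let pos = 0
  // Find the first position where the two keys differ
  for (; p === n; pos++) {
    p = pos < prev.length ? prev.charCodeAt(pos) : BEFORE_A
    n = pos < next.length ? next.charCodeAt(pos) : AFTER_Z
  }
  let rank = prev.slice(0, pos - 1) // shared prefix
  if (p === BEFORE_A) {
    // prev is a prefix of next: copy leading "a"s so the result stays below next
    while (n === A) {
      n = pos < next.length ? next.charCodeAt(pos++) : AFTER_Z
      rank += "a"
    }
    if (n === B) {
      rank += "a"
      n = AFTER_Z
    }
  } else if (p + 1 === n) {
    // Adjacent characters: keep prev's character and look for room further right
    rank += String.fromCharCode(p)
    n = AFTER_Z
    do {
      p = pos < prev.length ? prev.charCodeAt(pos++) : BEFORE_A
      if (p === Z) rank += "z"
    } while (p === Z)
  }
  // Append the character halfway between the two bounds
  return rank + String.fromCharCode(Math.ceil((p + n) / 2))
}

const first = rankBetween("", "") // key for the first issue in an empty column
const betweenBandC = rankBetween("b", "c") // a key that sorts after "b" and before "c"
```

Moving an issue writes one new rank for that issue only. Repeated insertions at the same spot make keys longer, which is why the trade-offs mention periodic rebalancing.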
### Project Service
Manages project configuration including workflows.
**Responsibilities:**
- Project CRUD
- Workflow definition per project
- Status and transition management
- Board configuration (columns, filters)
**Design decision:** Each project owns its workflow definition. Statuses are project-scoped, not global. This allows teams to customize without affecting others.
### Board Service
Optimizes board view queries by maintaining denormalized state.
**Responsibilities:**
- Cache board state in Redis
- Compute issue counts per column
- Handle board-level operations (collapse column, set WIP limits)
**Why a separate service:** Board queries require joining issues, statuses, and users. Denormalizing into Redis achieves sub-50ms board loads.
### Workflow Service
Enforces workflow rules and transitions.
**Responsibilities:**
- Validate status transitions
- Execute transition side effects (webhooks, automations)
- Maintain workflow history
**Transition validation flow:**
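A minimal sketch of the validation step, assuming each transition records a source and target status and that a transition with no source status is allowed from any status. The `WorkflowTransition` shape and the `checkTransition` name are illustrative, not the service's actual interface.

```typescript
interface WorkflowTransition {
  id: string
  name: string
  fromStatusId: string | null // null: allowed from any status (assumption)
  toStatusId: string
}

type TransitionCheck =
  | { allowed: true; transition: WorkflowTransition }
  | { allowed: false; code: "INVALID_TRANSITION" }

// Look for a transition in the project's workflow that permits this move
function checkTransition(
  transitions: WorkflowTransition[],
  fromStatusId: string,
  toStatusId: string,
): TransitionCheck {
  const match = transitions.filter(
    (t) => t.toStatusId === toStatusId && (t.fromStatusId === null || t.fromStatusId === fromStatusId),
  )[0]
  if (match === undefined) {
    return { allowed: false, code: "INVALID_TRANSITION" }
  }
  return { allowed: true, transition: match }
}

// Example workflow: To Do -> In Progress -> Done, plus "Reopen" from anywhere
const exampleWorkflow: WorkflowTransition[] = [
  { id: "t1", name: "Start", fromStatusId: "todo", toStatusId: "in-progress" },
  { id: "t2", name: "Finish", fromStatusId: "in-progress", toStatusId: "done" },
  { id: "t3", name: "Reopen", fromStatusId: null, toStatusId: "todo" },
]
```

Side effects such as webhooks and automations would run only after this check passes and the status change has been persisted.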

## API Design
### GraphQL Schema (Core Types)
```graphql
type Issue {
id: ID!
key: String! # e.g., "PROJ-123"
title: String!
description: String
status: Status!
assignee: User
reporter: User!
priority: Priority!
issueType: IssueType!
rank: String! # LexoRank for ordering
version: Int! # Optimistic locking
project: Project!
comments(first: Int, after: String): CommentConnection!
activity(first: Int, after: String): ActivityConnection!
createdAt: DateTime!
updatedAt: DateTime!
}
type Status {
id: ID!
name: String!
category: StatusCategory! # TODO, IN_PROGRESS, DONE
color: String!
position: Int! # Column order
}
type Project {
id: ID!
key: String!
name: String!
workflow: Workflow!
statuses: [Status!]!
issueTypes: [IssueType!]!
}
type Workflow {
id: ID!
name: String!
statuses: [Status!]!
transitions: [Transition!]!
}
type Transition {
id: ID!
name: String!
fromStatus: Status
toStatus: Status!
conditions: [TransitionCondition!]
}
enum StatusCategory {
TODO
IN_PROGRESS
DONE
}
enum Priority {
LOWEST
LOW
MEDIUM
HIGH
HIGHEST
}
```
### Board Query with Per-Column Pagination
The key challenge: fetch issues across multiple columns where each column can have different numbers of issues.
**Naive approach (problematic):**
```graphql
# BAD: Fetches one flat page of issues, client groups by status
query {
issues(projectId: "proj-1", first: 100) {
nodes {
id
status {
id
}
}
}
}
# Problem: If 90 of the 100 returned issues are in "To Do", the other columns look nearly empty
```
**Per-column pagination approach:**
```graphql
type BoardColumn {
status: Status!
issues(first: Int!, after: String): IssueConnection!
totalCount: Int!
}
type Board {
id: ID!
project: Project!
columns: [BoardColumn!]!
}
query GetBoard($projectId: ID!, $issuesPerColumn: Int!) {
board(projectId: $projectId) {
columns {
status {
id
name
color
}
totalCount
issues(first: $issuesPerColumn) {
nodes {
id
key
title
assignee {
id
name
avatar
}
priority
rank
}
pageInfo {
hasNextPage
endCursor
}
}
}
}
}
```
**Response structure:**
```json
{
"data": {
"board": {
"columns": [
{
"status": { "id": "status-1", "name": "To Do", "color": "#grey" },
"totalCount": 45,
"issues": {
"nodes": [
/* first 20 issues */
],
"pageInfo": { "hasNextPage": true, "endCursor": "cursor-abc" }
}
},
{
"status": { "id": "status-2", "name": "In Progress", "color": "#blue" },
"totalCount": 12,
"issues": {
"nodes": [
/* first 12 issues - no more pages */
],
"pageInfo": { "hasNextPage": false, "endCursor": "cursor-xyz" }
}
},
{
"status": { "id": "status-3", "name": "Done", "color": "#green" },
"totalCount": 89,
"issues": {
"nodes": [
/* first 20 issues */
],
"pageInfo": { "hasNextPage": true, "endCursor": "cursor-def" }
}
}
]
}
}
}
```
**Load more for specific column:**
```graphql
query LoadMoreIssues($statusId: ID!, $after: String!) {
column(statusId: $statusId) {
issues(first: 20, after: $after) {
nodes {
id
key
title
rank
}
pageInfo {
hasNextPage
endCursor
}
}
}
}
```
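On the server, each column's `issues(first, after)` field can be resolved independently. The sketch below pages an in-memory list ordered by `rank` with `id` as a tie-breaker; the function names and the cursor encoding (URL-encoded JSON here, where a real API would more likely use base64) are assumptions for illustration.

```typescript
interface IssueRow {
  id: string
  statusId: string
  rank: string
}

interface IssuePage {
  nodes: IssueRow[]
  pageInfo: { hasNextPage: boolean; endCursor: string | null }
}

// Opaque cursor carrying the sort key of the last row returned
function encodeCursor(row: IssueRow): string {
  return encodeURIComponent(JSON.stringify([row.rank, row.id]))
}

function decodeCursor(cursor: string): [string, string] {
  return JSON.parse(decodeURIComponent(cursor)) as [string, string]
}

function compareRows(a: IssueRow, b: IssueRow): number {
  if (a.rank !== b.rank) return a.rank < b.rank ? -1 : 1
  if (a.id !== b.id) return a.id < b.id ? -1 : 1
  return 0
}

function pageColumn(issues: IssueRow[], statusId: string, first: number, after?: string): IssuePage {
  // filter() copies, so sorting does not mutate the caller's array
  let column = issues.filter((issue) => issue.statusId === statusId).sort(compareRows)
  if (after) {
    const [rank, id] = decodeCursor(after)
    // Keep only rows strictly after the cursor position
    column = column.filter((row) => row.rank > rank || (row.rank === rank && row.id > id))
  }
  const nodes = column.slice(0, first)
  const last = nodes[nodes.length - 1]
  return {
    nodes,
    pageInfo: {
      hasNextPage: column.length > first,
      endCursor: last ? encodeCursor(last) : null,
    },
  }
}
```

Against a SQL store the same idea becomes a keyset query on `(rank, id)` filtered by status, fetching `first + 1` rows to determine `hasNextPage`.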
### Issue Mutations
**Move Issue (status change + reorder):**
```graphql
input MoveIssueInput {
issueId: ID!
toStatusId: ID!
rankAfterId: ID # Issue to position after (null = top)
rankBeforeId: ID # Issue to position before (null = bottom)
version: Int! # For optimistic locking
}
type MoveIssuePayload {
issue: Issue
error: MoveIssueError
}
type MoveIssueError {
code: MoveIssueErrorCode!
message: String!
}
enum MoveIssueErrorCode {
ISSUE_NOT_FOUND
INVALID_TRANSITION
VERSION_CONFLICT
PERMISSION_DENIED
}
mutation MoveIssue($input: MoveIssueInput!) {
moveIssue(input: $input) {
issue {
id
status {
id
name
}
rank
version
updatedAt
}
error {
code
message
}
}
}
```
**Update Issue:**
```graphql
input UpdateIssueInput {
issueId: ID!
title: String
description: String
assigneeId: ID
priority: Priority
version: Int!
}
mutation UpdateIssue($input: UpdateIssueInput!) {
updateIssue(input: $input) {
issue {
id
title
description
assignee {
id
name
}
priority
version
updatedAt
}
error {
code
message
}
}
}
```
### Real-time Subscriptions
```graphql
type BoardEvent {
issue: Issue!
action: BoardAction!
previousStatusId: ID # For status changes
previousRank: String # For reorders
}
enum BoardAction {
CREATED
UPDATED
MOVED
DELETED
}
subscription OnBoardChange($projectId: ID!) {
boardChanged(projectId: $projectId) {
issue {
id
key
title
status {
id
}
rank
assignee {
id
name
}
version
}
action
previousStatusId
}
}
```
### REST API Fallback
For webhooks and simple integrations:
**Move Issue:**
```http
PATCH /api/v1/issues/{issueId}/move
Content-Type: application/json
If-Match: "version-5"
{
"statusId": "status-3",
"rankAfterId": "issue-456",
"rankBeforeId": null
}
```
**Response:**
```http
HTTP/1.1 200 OK
ETag: "version-6"
{
"id": "issue-123",
"key": "PROJ-123",
"status": { "id": "status-3", "name": "Done" },
"rank": "0|i002bc",
"version": 6,
"updatedAt": "2024-02-03T10:00:00Z"
}
```
**Error Responses:**
| Code | Error | When |
| ---- | --------------------- | ----------------------------------------- |
| 400 | `INVALID_TRANSITION` | Workflow doesn't allow this status change |
| 404 | `NOT_FOUND` | Issue or target status doesn't exist |
| 409 | `VERSION_CONFLICT` | Version mismatch (concurrent edit) |
| 412 | `PRECONDITION_FAILED` | ETag mismatch |
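
The version check behind `If-Match` and the status mapping in the table can be sketched as below. This is illustrative only: the helper names are hypothetical, the `"version-N"` tag format is taken from the example request, and treating a missing header as a failed precondition is an assumption.

```typescript
type MoveErrorCode = "INVALID_TRANSITION" | "NOT_FOUND" | "VERSION_CONFLICT" | "PRECONDITION_FAILED"

// Status codes as listed in the error table
const HTTP_STATUS: Record<MoveErrorCode, number> = {
  INVALID_TRANSITION: 400,
  NOT_FOUND: 404,
  VERSION_CONFLICT: 409,
  PRECONDITION_FAILED: 412,
}

// Parses a tag such as "version-5" (including the quotes) into 5; null if absent or malformed
function parseVersionTag(header: string | undefined): number | null {
  if (!header) return null
  const match = /^"version-(\d+)"$/.exec(header.trim())
  return match ? Number(match[1]) : null
}

// Builds the ETag value returned after a successful write
function formatVersionTag(version: number): string {
  return `"version-${version}"`
}

// Returns null when the precondition holds, otherwise the error to send.
// Assumption: a missing or malformed If-Match is rejected like a mismatch.
function checkPrecondition(
  ifMatch: string | undefined,
  currentVersion: number,
): { status: number; code: MoveErrorCode } | null {
  const expected = parseVersionTag(ifMatch)
  if (expected === null || expected !== currentVersion) {
    return { status: HTTP_STATUS.PRECONDITION_FAILED, code: "PRECONDITION_FAILED" }
  }
  return null
}
```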
## Data Modeling
### Core Schema (PostgreSQL)
```sql
-- Projects with embedded workflow reference
CREATE TABLE projects (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
key VARCHAR(10) UNIQUE NOT NULL, -- e.g., "PROJ"
name VARCHAR(255) NOT NULL,
description TEXT,
owner_id UUID NOT NULL REFERENCES users(id),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Statuses are project-scoped
CREATE TABLE statuses (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
name VARCHAR(100) NOT NULL,
category VARCHAR(20) NOT NULL, -- 'todo', 'in_progress', 'done'
color VARCHAR(7) DEFAULT '#808080',
position INT NOT NULL, -- Column order
is_initial BOOLEAN DEFAULT FALSE, -- Default for new issues
UNIQUE (project_id, name)
);
CREATE INDEX idx_statuses_project ON statuses(project_id, position);
-- Workflow transitions define allowed status changes
CREATE TABLE workflow_transitions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
from_status_id UUID REFERENCES statuses(id) ON DELETE CASCADE, -- NULL = any
to_status_id UUID NOT NULL REFERENCES statuses(id) ON DELETE CASCADE,
name VARCHAR(100) NOT NULL,
opsbar_sequence INT DEFAULT 10, -- UI ordering
UNIQUE (project_id, from_status_id, to_status_id)
);
-- Issue types (Epic, Story, Task, Bug)
CREATE TABLE issue_types (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
name VARCHAR(50) NOT NULL,
icon VARCHAR(50),
color VARCHAR(7),
UNIQUE (project_id, name)
);
-- Issues with LexoRank ordering
CREATE TABLE issues (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID NOT NULL REFERENCES projects(id),
issue_type_id UUID NOT NULL REFERENCES issue_types(id),
status_id UUID NOT NULL REFERENCES statuses(id),
-- Issue key: computed from project key + sequence
issue_number INT NOT NULL,
title VARCHAR(500) NOT NULL,
description TEXT,
assignee_id UUID REFERENCES users(id),
reporter_id UUID NOT NULL REFERENCES users(id),
priority VARCHAR(20) DEFAULT 'medium',
-- LexoRank for ordering within status
-- Format: "0|hzzzzz" (bucket | alphanumeric)
rank VARCHAR(255) NOT NULL,
-- Optimistic locking
version INT DEFAULT 1,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE (project_id, issue_number)
);
-- Primary query: issues by status, ordered by rank
CREATE INDEX idx_issues_board ON issues(project_id, status_id, rank);
-- Secondary: issues by assignee
CREATE INDEX idx_issues_assignee ON issues(assignee_id, updated_at DESC);
-- Issue key lookup: already served by the index behind UNIQUE (project_id, issue_number),
-- so no separate index is needed
-- Comments
CREATE TABLE comments (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
issue_id UUID NOT NULL REFERENCES issues(id) ON DELETE CASCADE,
author_id UUID NOT NULL REFERENCES users(id),
body TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_comments_issue ON comments(issue_id, created_at);
-- Activity log (append-only)
CREATE TABLE activity_log (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
issue_id UUID NOT NULL REFERENCES issues(id) ON DELETE CASCADE,
user_id UUID NOT NULL REFERENCES users(id),
action_type VARCHAR(50) NOT NULL, -- 'status_change', 'assignment', etc.
old_value JSONB,
new_value JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_activity_issue ON activity_log(issue_id, created_at DESC);
```
### Database Selection Rationale
| Data Type | Store | Rationale |
| ---------------- | ------------------ | -------------------------------------- |
| Issues, Projects | PostgreSQL | ACID, complex queries, JOIN capability |
| Board cache | Redis | Sub-ms reads, TTL for staleness |
| Search index | Elasticsearch | Full-text search, faceted filtering |
| Activity log | PostgreSQL → Kafka | Append-only, stream processing |
| Attachments | S3 | Cost-effective blob storage |
### Denormalized Board Cache (Redis)
**Why cache:** Board queries join issues, statuses, and users. Caching avoids expensive JOINs on every load.
**Structure:**
```redis
# Board metadata
HSET board:{project_id}:meta
columns_json "[{\"status_id\":\"s1\",\"name\":\"To Do\"}...]"
total_issues 156
last_updated 1706886400000
# Per-column issue list (sorted set by rank)
ZADD board:{project_id}:column:{status_id} {rank_score} {issue_id}
# Issue card data (hash - denormalized for fast read)
HSET issue:{issue_id}:card
key "PROJ-123"
title "Implement login"
status_id "status-2"
assignee_id "user-456"
assignee_name "Alice"
priority "high"
rank "0|i000ab"
version 5
```
**Cache invalidation strategy:**
- Write-through: Update cache immediately after DB write
- TTL: 5 minutes as safety net
- Pub/Sub: Broadcast invalidation to all service instances
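
How the three rules interact can be sketched with an in-memory stand-in for Redis. Everything here is hypothetical (`CardCache`, `CardData`, the injected clock); it only illustrates write-through on update, TTL expiry as a safety net, and explicit invalidation when another instance broadcasts a change.

```typescript
interface CardData {
  key: string
  title: string
  version: number
}

// In-memory stand-in for the Redis card cache (illustrative only)
class CardCache {
  private entries = new Map<string, { card: CardData; expiresAt: number }>()
  private ttlMs: number
  private now: () => number

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs
    this.now = now
  }

  // Write-through: call right after the database write succeeds
  set(issueId: string, card: CardData): void {
    this.entries.set(issueId, { card, expiresAt: this.now() + this.ttlMs })
  }

  // Returns undefined on a miss or once the TTL has expired the entry
  get(issueId: string): CardData | undefined {
    const entry = this.entries.get(issueId)
    if (!entry) return undefined
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(issueId)
      return undefined
    }
    return entry.card
  }

  // Called when an invalidation message arrives from another instance
  invalidate(issueId: string): void {
    this.entries.delete(issueId)
  }
}
```

Injecting the clock keeps the TTL behavior testable without waiting five minutes.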
## Low-Level Design: LexoRank Ordering
### Why LexoRank?
Traditional integer-based ordering has a fundamental problem:
```
Before: [A:1, B:2, C:3, D:4]
Insert X between B and C:
After: [A:1, B:2, X:3, C:4, D:5] ← Must update C, D
```
With N items and frequent reorders, this is O(N) updates per operation.
**LexoRank solution:** Use lexicographically sortable strings where you can always find a value between any two existing values.
```
Before: [A:"aaa", B:"bbb", C:"ccc"]
Insert X between B and C:
After: [A:"aaa", B:"bbb", X:"bbc", C:"ccc"] ← Only X updated
```
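The same example as a small runnable sketch, using the sample ranks from above. `Ranked` and `sortByRank` are illustrative names; the point is that ordering falls out of a plain string comparison and that inserting X writes exactly one new row.

```typescript
interface Ranked {
  id: string
  rank: string
}

// Compare by code unit (not localeCompare) so the order matches a byte-wise sort
function sortByRank(items: Ranked[]): Ranked[] {
  return [...items].sort((a, b) => (a.rank < b.rank ? -1 : a.rank > b.rank ? 1 : 0))
}

const rankedColumn: Ranked[] = [
  { id: "A", rank: "aaa" },
  { id: "B", rank: "bbb" },
  { id: "C", rank: "ccc" },
]

// Insert X between B and C: one new row, no existing rank changes
const withInsertedX = sortByRank([...rankedColumn, { id: "X", rank: "bbc" }])
const resultingOrder = withInsertedX.map((item) => item.id).join(",")
```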
### LexoRank Format
Jira's LexoRank uses the format: `bucket|value`
```
0|hzzzzz
│ └─ Alphanumeric value (base-36)
└── Bucket (0, 1, or 2)
```
**Bucket rotation:** Three buckets enable rebalancing without locking. While bucket 0 rebalances, new operations use bucket 1.
### Rank Calculation Algorithm
```typescript collapse={1-15}
// Simplified LexoRank implementation
const LEXORANK_CHARS = "0123456789abcdefghijklmnopqrstuvwxyz"
const BASE = LEXORANK_CHARS.length // 36
interface LexoRank {
bucket: number
value: string
}
function parseLexoRank(rank: string): LexoRank {
const [bucket, value] = rank.split("|")
return { bucket: parseInt(bucket), value }
}
function formatLexoRank(rank: LexoRank): string {
return `${rank.bucket}|${rank.value}`
}
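function getMidpoint(a: string, b: string): string {
// Ensure same length by padding with '0's
const maxLen = Math.max(a.length, b.length)
const aPadded = a.padEnd(maxLen, "0")
const bPadded = b.padEnd(maxLen, "0")
// Step 1: add the two values digit by digit, right to left (base-36 with carry)
const sum: number[] = new Array(maxLen + 1).fill(0)
let carry = 0
for (let i = maxLen - 1; i >= 0; i--) {
const digitSum = LEXORANK_CHARS.indexOf(aPadded[i]) + LEXORANK_CHARS.indexOf(bPadded[i]) + carry
sum[i + 1] = digitSum % BASE
carry = Math.floor(digitSum / BASE)
}
sum[0] = carry
// Step 2: halve the sum left to right, carrying the remainder into the next digit
let result = ""
let remainder = 0
for (let i = 0; i <= maxLen; i++) {
const current = remainder * BASE + sum[i]
remainder = current % 2
// The leading (overflow) digit always halves to 0, so skip it
if (i > 0) result += LEXORANK_CHARS[Math.floor(current / 2)]
}
// An odd sum means the exact midpoint needs one more digit: half of BASE, i.e. 'i'.
// This also covers adjacent values such as "a" and "b" → "ai".
if (remainder === 1) {
result += LEXORANK_CHARS[Math.floor(BASE / 2)] // 'i'
}
return result.replace(/0+$/, "") // Trim trailing zeros
}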
function calculateNewRank(before: string | null, after: string | null, bucket: number = 0): string {
// `before` is the rank of the item directly above the slot, `after` the one directly below
const beforeRank = before ? parseLexoRank(before) : null
const afterRank = after ? parseLexoRank(after) : null
// Reuse the neighbors' bucket so the new rank still sorts with them after a rebalance
const targetBucket = beforeRank?.bucket ?? afterRank?.bucket ?? bucket
if (!beforeRank && !afterRank) {
// First item - use middle of range
return formatLexoRank({ bucket: targetBucket, value: "i" })
}
// Missing neighbor: fall back to the start ("0") or end ("z") of the range.
// Simplification: assumes existing values lie strictly between "0" and "z".
const lower = beforeRank?.value ?? "0"
const upper = afterRank?.value ?? "z"
return formatLexoRank({ bucket: targetBucket, value: getMidpoint(lower, upper) })
}
```
### Rebalancing Strategy
LexoRank strings grow as items are repeatedly inserted between adjacent values:
```
Initial: "i"
After 1: "ii"
After 2: "iii"
...
After 50: "iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii"
```
**Jira's rebalancing thresholds:**
1. Rank length > 64 chars → Schedule rebalance within 12 hours
2. Second trigger within 12 hours → Immediate rebalance
3. Rank length > 254 chars → Disable ranking until complete
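
These thresholds translate into a small decision function. The sketch below is hypothetical (`rebalanceAction` and its return values are illustrative names), and it assumes "rank length" means the length of the full stored rank string.

```typescript
type RebalanceAction = "none" | "schedule" | "immediate" | "disable-ranking"

const TWELVE_HOURS_MS = 12 * 60 * 60 * 1000

// maxRankLength: longest rank currently stored in the column
// lastTriggerAt: timestamp (ms) of the previous threshold trigger, or null if none
function rebalanceAction(maxRankLength: number, lastTriggerAt: number | null, now: number): RebalanceAction {
  // Threshold 3: ranks are close to the column limit, stop ranking until rebalanced
  if (maxRankLength > 254) return "disable-ranking"
  if (maxRankLength <= 64) return "none"
  // Threshold 2: a second trigger inside the 12-hour window escalates to immediate
  const triggeredRecently = lastTriggerAt !== null && now - lastTriggerAt < TWELVE_HOURS_MS
  // Threshold 1: otherwise schedule a rebalance
  return triggeredRecently ? "immediate" : "schedule"
}
```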
**Rebalancing algorithm:**
```typescript collapse={1-5}
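async function rebalanceColumn(projectId: string, statusId: string): Promise<void> {
// 1. Lock column for writes (or use different bucket)
const lockKey = `rebalance:${projectId}:${statusId}`
// NX: only set if the key does not exist, so a second rebalance cannot take the lock
const acquired = await redis.set(lockKey, "1", "EX", 300, "NX") // 5 min lock
if (!acquired) return // Another rebalance is already running
try {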
// 2. Fetch all issues ordered by current rank
const { rows: issues } = await db.query(
`
SELECT id, rank
FROM issues
WHERE project_id = $1 AND status_id = $2
ORDER BY rank
`,
[projectId, statusId],
)
// 3. Assign evenly-spaced new ranks across the 6-character value space (36^6 slots)
const newBucket = (parseInt(issues[0]?.rank?.split("|")[0] || "0") + 1) % 3
const step = Math.floor(BASE ** 6 / (issues.length + 1))
const updates = issues.map((issue, index) => {
const position = step * (index + 1)
const newValue = position.toString(36).padStart(6, "0")
return {
id: issue.id,
newRank: `${newBucket}|${newValue}`,
}
})
// 4. Batch update
await db.transaction(async (tx) => {
for (const { id, newRank } of updates) {
await tx.query("UPDATE issues SET rank = $1 WHERE id = $2", [newRank, id])
}
})
// 5. Invalidate cache
await invalidateBoardCache(projectId)
} finally {
await redis.del(lockKey)
}
}
```
## Low-Level Design: Concurrent Edit Handling
### Optimistic Locking Flow

### Implementation
```typescript collapse={1-20}
interface UpdateIssueInput {
issueId: string
title?: string
description?: string
assigneeId?: string
version: number
}
interface UpdateResult {
success: boolean
issue?: Issue
error?: { code: string; message: string }
}
async function updateIssue(input: UpdateIssueInput): Promise<UpdateResult> {
const { issueId, version, ...updates } = input
// Build dynamic UPDATE query
const setClause = Object.entries(updates)
.filter(([_, v]) => v !== undefined)
.map(([k, _], i) => `${toSnakeCase(k)} = $${i + 3}`)
.join(", ")
const values = Object.values(updates).filter((v) => v !== undefined)
const result = await db.query(
`
UPDATE issues
SET ${setClause}, version = version + 1, updated_at = NOW()
WHERE id = $1 AND version = $2
RETURNING *
`,
[issueId, version, ...values],
)
if (result.rowCount === 0) {
// Check if issue exists
const exists = await db.query("SELECT version FROM issues WHERE id = $1", [issueId])
if (exists.rowCount === 0) {
return {
success: false,
error: { code: "NOT_FOUND", message: "Issue not found" },
}
}
const currentVersion = exists.rows[0].version
return {
success: false,
error: {
code: "VERSION_CONFLICT",
message: `Version mismatch. Expected ${version}, current is ${currentVersion}`,
},
}
}
// Broadcast change
await publishBoardEvent(result.rows[0].project_id, {
action: "UPDATED",
issue: result.rows[0],
})
return { success: true, issue: result.rows[0] }
}
```
### Conflict Resolution Strategies
| Strategy | Use Case | Trade-off |
| --------------------- | --------------------------------------- | ------------------------------- |
| **Last-Write-Wins** | Most fields (title, assignee, priority) | May lose edits, but simple |
| **Field-Level Merge** | Non-conflicting field updates | More complex, preserves more |
| **Manual Resolution** | Description (rich text) | Best fidelity, worst UX |
| **CRDT** | Concurrent rich text editing | Complex, best for collaboration |
**Field-level merge example:**
```typescript
// Client 1 updates title (version 5 → 6)
// Client 2 updates assignee (version 5 → conflict)
// Instead of rejecting, merge if fields don't overlap
async function mergeUpdate(input: UpdateIssueInput, currentIssue: Issue): Promise<UpdateResult> {
const { version, ...updates } = input
// Find which fields changed since client's version
const changedFields = await getChangedFieldsSince(input.issueId, version, currentIssue.version)
// Check for conflicts
const conflictingFields = Object.keys(updates).filter((f) => changedFields.includes(f))
if (conflictingFields.length > 0) {
return {
success: false,
error: {
code: "FIELD_CONFLICT",
message: `Conflicting fields: ${conflictingFields.join(", ")}`,
},
}
}
// No conflicts - apply update to latest version
return updateIssue({
...input,
version: currentIssue.version,
})
}
```
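The merge relies on `getChangedFieldsSince`, which the article does not define. One possible shape, shown as a synchronous sketch over an in-memory change list: `FieldChange`, `changedFieldsSince`, and `findConflicts` are hypothetical, and the sketch assumes each logged change records the issue version it produced (the `activity_log` table above has no such column).

```typescript
interface FieldChange {
  version: number // Issue version produced by this change (assumed to be recorded)
  field: string // Field name, e.g. "title" or "assigneeId"
}

// Fields modified after the client's version, up to and including the current one
function changedFieldsSince(changes: FieldChange[], clientVersion: number, currentVersion: number): string[] {
  const fields = new Set<string>()
  for (const change of changes) {
    if (change.version > clientVersion && change.version <= currentVersion) {
      fields.add(change.field)
    }
  }
  return Array.from(fields)
}

// Fields the client wants to write that someone else already changed
function findConflicts(updates: Record<string, unknown>, changed: string[]): string[] {
  return Object.keys(updates).filter((field) => updates[field] !== undefined && changed.includes(field))
}
```

In the scenario from the comments above, client 2's assignee update does not overlap with client 1's title change, so it can be re-applied on top of version 6.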
### Move Operation (Status + Rank)
Moving an issue changes two fields, status and rank, and both must be written atomically in a single transaction.
```typescript collapse={1-10}
interface MoveIssueInput {
issueId: string
toStatusId: string
rankAfterId?: string
rankBeforeId?: string
version: number
}
async function moveIssue(input: MoveIssueInput): Promise<UpdateResult> {
const { issueId, toStatusId, rankAfterId, rankBeforeId, version } = input
return db.transaction(async (tx) => {
// 1. Lock and fetch current issue
const issue = await tx.query("SELECT * FROM issues WHERE id = $1 FOR UPDATE", [issueId])
if (!issue.rows[0]) {
return { success: false, error: { code: "NOT_FOUND", message: "Issue not found" } }
}
if (issue.rows[0].version !== version) {
return {
success: false,
error: { code: "VERSION_CONFLICT", message: "Concurrent modification" },
}
}
const currentIssue = issue.rows[0]
// 2. Validate transition
const transitionValid = await validateTransition(tx, currentIssue.project_id, currentIssue.status_id, toStatusId)
if (!transitionValid) {
return {
success: false,
error: { code: "INVALID_TRANSITION", message: "Workflow does not allow this transition" },
}
}
// 3. Calculate new rank
let newRank: string
if (rankAfterId) {
const afterIssue = await tx.query("SELECT rank FROM issues WHERE id = $1", [rankAfterId])
const beforeIssue = rankBeforeId ? await tx.query("SELECT rank FROM issues WHERE id = $1", [rankBeforeId]) : null
newRank = calculateNewRank(afterIssue.rows[0]?.rank, beforeIssue?.rows[0]?.rank)
} else if (rankBeforeId) {
const beforeIssue = await tx.query("SELECT rank FROM issues WHERE id = $1", [rankBeforeId])
newRank = calculateNewRank(null, beforeIssue.rows[0]?.rank)
} else {
// Default: bottom of column
const lastInColumn = await tx.query(
`
SELECT rank FROM issues
WHERE project_id = $1 AND status_id = $2
ORDER BY rank DESC LIMIT 1
`,
[currentIssue.project_id, toStatusId],
)
newRank = calculateNewRank(lastInColumn.rows[0]?.rank, null)
}
// 4. Update issue
const result = await tx.query(
`
UPDATE issues
SET status_id = $1, rank = $2, version = version + 1, updated_at = NOW()
WHERE id = $3
RETURNING *
`,
[toStatusId, newRank, issueId],
)
// 5. Log activity
await tx.query(
`
INSERT INTO activity_log (issue_id, user_id, action_type, old_value, new_value)
VALUES ($1, $2, 'status_change', $3, $4)
`,
[
issueId,
getCurrentUserId(),
JSON.stringify({ status_id: currentIssue.status_id }),
JSON.stringify({ status_id: toStatusId }),
],
)
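// 6. Broadcast. Caveat: setImmediate only defers to the next event-loop turn and
// does not wait for the commit, so a rollback could still emit this event. A stricter
// version publishes after db.transaction() resolves, or uses an outbox table.
setImmediate(() => {
publishBoardEvent(currentIssue.project_id, {
action: "MOVED",
issue: result.rows[0],
previousStatusId: currentIssue.status_id,
})
})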
return { success: true, issue: result.rows[0] }
})
}
```
## Low-Level Design: Workflow and Status Management
### Workflow Data Model
Each project has its own workflow, defined by statuses and transitions.

### Fetching Workflow Configuration
```graphql
query GetProjectWorkflow($projectId: ID!) {
project(id: $projectId) {
workflow {
statuses {
id
name
category
color
position
}
transitions {
id
name
fromStatus {
id
}
toStatus {
id
}
}
}
}
}
```
**Response structure:**
```json
{
"project": {
"workflow": {
"statuses": [
{ "id": "s1", "name": "To Do", "category": "TODO", "color": "#808080", "position": 1 },
{ "id": "s2", "name": "In Progress", "category": "IN_PROGRESS", "color": "#0052cc", "position": 2 },
{ "id": "s3", "name": "In Review", "category": "IN_PROGRESS", "color": "#8777d9", "position": 3 },
{ "id": "s4", "name": "Done", "category": "DONE", "color": "#36b37e", "position": 4 }
],
"transitions": [
{ "id": "t1", "name": "Start Progress", "fromStatus": { "id": "s1" }, "toStatus": { "id": "s2" } },
{ "id": "t2", "name": "Submit for Review", "fromStatus": { "id": "s2" }, "toStatus": { "id": "s3" } },
{ "id": "t3", "name": "Approve", "fromStatus": { "id": "s3" }, "toStatus": { "id": "s4" } },
{ "id": "t4", "name": "Reject", "fromStatus": { "id": "s3" }, "toStatus": { "id": "s2" } },
{ "id": "t5", "name": "Reopen", "fromStatus": { "id": "s4" }, "toStatus": { "id": "s1" } }
]
}
}
}
```
### Workflow Mutation API
```graphql
# Add a new status
mutation AddStatus($input: AddStatusInput!) {
addStatus(input: $input) {
status {
id
name
category
position
}
}
}
# Add a transition
mutation AddTransition($input: AddTransitionInput!) {
addTransition(input: $input) {
transition {
id
name
fromStatus {
id
}
toStatus {
id
}
}
}
}
# Reorder statuses (columns)
mutation ReorderStatuses($input: ReorderStatusesInput!) {
reorderStatuses(input: $input) {
statuses {
id
position
}
}
}
```
### Client-Side Workflow Validation
To provide instant feedback, clients cache workflow rules:
```typescript collapse={1-10}
interface WorkflowCache {
statuses: Map<string, Status>
transitions: Map<string, Set<string>> // fromStatusId → Set<toStatusId>
}
class WorkflowValidator {
private cache: WorkflowCache
constructor(workflow: Workflow) {
this.cache = {
statuses: new Map(workflow.statuses.map((s) => [s.id, s])),
transitions: new Map(),
}
// Build transition map
for (const t of workflow.transitions) {
const fromId = t.fromStatus?.id || "*" // null = any status
if (!this.cache.transitions.has(fromId)) {
this.cache.transitions.set(fromId, new Set())
}
this.cache.transitions.get(fromId)!.add(t.toStatus.id)
}
}
canTransition(fromStatusId: string, toStatusId: string): boolean {
// Check specific transition
if (this.cache.transitions.get(fromStatusId)?.has(toStatusId)) {
return true
}
// Check wildcard (from any status)
if (this.cache.transitions.get("*")?.has(toStatusId)) {
return true
}
return false
}
getAvailableTransitions(fromStatusId: string): Status[] {
const specific = this.cache.transitions.get(fromStatusId) || new Set<string>()
const wildcard = this.cache.transitions.get("*") || new Set<string>()
const available = new Set<string>([...specific, ...wildcard])
return Array.from(available)
.map((id) => this.cache.statuses.get(id)!)
.filter(Boolean)
}
}
```
## Frontend Considerations
### Board State Management
**Normalized data structure:**
```typescript
interface BoardState {
// Entities by ID
issues: Record<string, Issue>
statuses: Record<string, Status>
users: Record<string, User>
// Ordering
columnOrder: string[] // Status IDs in display order
issueOrder: Record<string, string[]> // statusId → issueIds in rank order
// Pagination
columnCursors: Record<string, string | null>
columnHasMore: Record<string, boolean>
// UI state
draggingIssueId: string | null
dropTargetColumn: string | null
dropTargetIndex: number | null
}
```
**Why normalized:**
- Moving an issue updates two arrays, not nested objects
- React reference equality works for memoization
- Easier to apply real-time updates
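
Turning the nested `GetBoard` response into this shape is a single pass over the columns. The sketch below is an assumption about how that could look: `ColumnResponse`, `NormalizedBoard`, and `normalizeBoard` are hypothetical names, and the types are trimmed to the fields the normalization touches.

```typescript
interface ColumnResponse {
  status: { id: string; name: string }
  issues: {
    nodes: { id: string; title: string; rank: string }[]
    pageInfo: { hasNextPage: boolean; endCursor: string | null }
  }
}

interface NormalizedBoard {
  issues: Record<string, { id: string; title: string; rank: string; statusId: string }>
  statuses: Record<string, { id: string; name: string }>
  columnOrder: string[]
  issueOrder: Record<string, string[]>
  columnCursors: Record<string, string | null>
  columnHasMore: Record<string, boolean>
}

function normalizeBoard(columns: ColumnResponse[]): NormalizedBoard {
  const board: NormalizedBoard = {
    issues: {},
    statuses: {},
    columnOrder: [],
    issueOrder: {},
    columnCursors: {},
    columnHasMore: {},
  }
  for (const column of columns) {
    const statusId = column.status.id
    board.statuses[statusId] = column.status
    board.columnOrder.push(statusId)
    // The server returns nodes in rank order, so the ID list preserves it
    board.issueOrder[statusId] = column.issues.nodes.map((node) => node.id)
    board.columnCursors[statusId] = column.issues.pageInfo.endCursor
    board.columnHasMore[statusId] = column.issues.pageInfo.hasNextPage
    for (const node of column.issues.nodes) {
      // Store each issue once, keyed by ID, with a back-reference to its column
      board.issues[node.id] = { ...node, statusId }
    }
  }
  return board
}
```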
### Optimistic Updates for Drag-and-Drop
```typescript collapse={1-20}
function useMoveIssue() {
const [boardState, setBoardState] = useState<BoardState>(initialState)
const pendingMoves = useRef<Map<string, { previousState: BoardState }>>(new Map())
const moveIssue = async (issueId: string, toStatusId: string, toIndex: number) => {
const issue = boardState.issues[issueId]
const fromStatusId = issue.statusId
// 1. Save previous state for rollback
const previousState = structuredClone(boardState)
pendingMoves.current.set(issueId, { previousState })
// 2. Optimistic update
setBoardState((state) => {
const newState = { ...state }
// Remove from old column
newState.issueOrder = {
...state.issueOrder,
[fromStatusId]: state.issueOrder[fromStatusId].filter((id) => id !== issueId),
}
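// Add to new column at index. Read from newState (card already removed) so a
// reorder within the same column does not insert the card twice.
const newColumnOrder = [...(newState.issueOrder[toStatusId] || [])]
newColumnOrder.splice(toIndex, 0, issueId)
newState.issueOrder[toStatusId] = newColumnOrder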
// Update issue status
newState.issues = {
...state.issues,
[issueId]: { ...issue, statusId: toStatusId },
}
return newState
})
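// 3. Server request. Neighbors come from the target column without the dragged card,
// otherwise a same-column move could pick the card itself as a neighbor.
const targetColumn = (boardState.issueOrder[toStatusId] || []).filter((id) => id !== issueId)
const rankAfterId = toIndex > 0 ? (targetColumn[toIndex - 1] ?? null) : null
const rankBeforeId = targetColumn[toIndex] ?? null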
try {
const result = await api.moveIssue({
issueId,
toStatusId,
rankAfterId,
rankBeforeId,
version: issue.version,
})
if (!result.success) {
throw new Error(result.error?.message || "Move failed")
}
// 4. Update with server-assigned rank and version
setBoardState((state) => ({
...state,
issues: {
...state.issues,
[issueId]: { ...state.issues[issueId], ...result.issue },
},
}))
pendingMoves.current.delete(issueId)
} catch (error) {
// 5. Rollback on failure
const pending = pendingMoves.current.get(issueId)
if (pending) {
setBoardState(pending.previousState)
pendingMoves.current.delete(issueId)
}
toast.error("Failed to move issue. Please try again.")
}
}
return { boardState, moveIssue }
}
```
### Real-time Update Handling
```typescript collapse={1-15}
function useBoardSubscription(projectId: string) {
const [boardState, setBoardState] = useState<BoardState>(initialState)
useEffect(() => {
const subscription = graphqlClient
.subscribe({
query: BOARD_CHANGED_SUBSCRIPTION,
variables: { projectId },
})
.subscribe({
next: ({ data }) => {
const event = data.boardChanged
setBoardState((state) => {
// Skip if this is our own optimistic update
// (pendingMoves is the ref created in useMoveIssue; this assumes it is shared, e.g. via context)
if (pendingMoves.current.has(event.issue.id)) {
return state
}
switch (event.action) {
case "MOVED":
return handleRemoteMove(state, event)
case "UPDATED":
return handleRemoteUpdate(state, event)
case "CREATED":
return handleRemoteCreate(state, event)
case "DELETED":
return handleRemoteDelete(state, event)
default:
return state
}
})
},
})
return () => subscription.unsubscribe()
}, [projectId])
return boardState
}
function handleRemoteMove(state: BoardState, event: BoardEvent): BoardState {
const { issue, previousStatusId } = event
// Copy the ordering map up front so the previous state is never mutated
const issueOrder = { ...state.issueOrder }
// Remove from previous column
if (previousStatusId && issueOrder[previousStatusId]) {
issueOrder[previousStatusId] = issueOrder[previousStatusId].filter((id) => id !== issue.id)
}
// Add to new column in correct position based on rank.
// Filter first so a reorder within the same column does not duplicate the card.
const targetColumn = (issueOrder[issue.statusId] || []).filter((id) => id !== issue.id)
const insertIndex = findInsertIndex(targetColumn, issue.rank, state.issues)
targetColumn.splice(insertIndex, 0, issue.id)
issueOrder[issue.statusId] = targetColumn
// Update issue data
return {
...state,
issueOrder,
issues: {
...state.issues,
[issue.id]: issue,
},
}
}
```
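`findInsertIndex` is used above but not shown. Because each column is already sorted by rank, a binary search is one reasonable way to implement it. The version below is a sketch with a trimmed issue type; treating an ID that is missing from the issue map as sorting first is an assumption.

```typescript
// Returns the index at which an issue with `rank` should be inserted so that
// `columnOrder` (issue IDs sorted by rank) stays sorted.
function findInsertIndex(columnOrder: string[], rank: string, issues: Record<string, { rank: string }>): number {
  let low = 0
  let high = columnOrder.length
  while (low < high) {
    const mid = (low + high) >>> 1
    // Assumption: an ID missing from the map is treated as sorting first
    const midRank = issues[columnOrder[mid]]?.rank ?? ""
    if (midRank < rank) {
      low = mid + 1
    } else {
      high = mid
    }
  }
  return low
}
```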
### Column Virtualization
For boards with many issues per column, virtualize the issue list:
```typescript collapse={1-10}
import { useVirtualizer } from '@tanstack/react-virtual';
function VirtualizedColumn({
statusId,
issueIds
}: {
statusId: string;
issueIds: string[]
}) {
const parentRef = useRef<HTMLDivElement>(null);
const virtualizer = useVirtualizer({
count: issueIds.length,
getScrollElement: () => parentRef.current,
estimateSize: () => 80, // Estimated card height
overscan: 5 // Render 5 extra items for smooth scrolling
});
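return (
<div ref={parentRef} style={{ height: '100%', overflow: 'auto' }}>
<div style={{ height: `${virtualizer.getTotalSize()}px`, position: 'relative' }}>
{virtualizer.getVirtualItems().map((virtualItem) => (
<div
key={virtualItem.key}
style={{
position: 'absolute',
top: 0,
left: 0,
width: '100%',
transform: `translateY(${virtualItem.start}px)`
}}
>
{/* IssueCard is a placeholder name; the original markup was lost in extraction */}
<IssueCard issueId={issueIds[virtualItem.index]} />
</div>
))}
</div>
</div>
);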
}
```
## Infrastructure
### Cloud-Agnostic Components
| Component | Purpose | Options |
| -------------- | --------------------- | ------------------------------------- |
| API Gateway | Request routing, auth | Kong, Nginx, Traefik |
| GraphQL Server | Query execution | Apollo Server, Mercurius |
| Message Queue | Event streaming | Kafka, RabbitMQ, NATS |
| Cache | Board state, sessions | Redis, Memcached, KeyDB |
| Search | Full-text search | Elasticsearch, Meilisearch, Typesense |
| Object Storage | Attachments | MinIO, Ceph, S3-compatible |
| Database | Primary store | PostgreSQL, CockroachDB |
### AWS Reference Architecture

**Service configurations:**
| Service | Configuration | Rationale |
| ------------------- | ---------------------- | -------------------------------------- |
| GraphQL (Fargate) | 2 vCPU, 4GB RAM | Stateless, scale on request rate |
| WebSocket (Fargate) | 2 vCPU, 4GB RAM | Connection-bound, ~10K per instance |
| Workers (Spot) | 1 vCPU, 2GB RAM | Cost optimization for async |
| RDS PostgreSQL | db.r6g.xlarge Multi-AZ | Primary store, read replicas for scale |
| ElastiCache | r6g.large cluster | Board cache, pub/sub |
| OpenSearch | m6g.large.search × 3 | Search index, 3 nodes for HA |
### Scaling Considerations
**Read-heavy workload:**
- Read replicas for PostgreSQL
- Redis caching for board state
- CDN for static assets
**WebSocket connections:**
- Sticky sessions to WebSocket servers
- Redis pub/sub for cross-instance broadcast
- ~10K connections per 4GB instance
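
The per-instance figure gives a quick capacity estimate. The helper below is a back-of-the-envelope sketch: the function name and the 30% headroom default are assumptions, not numbers from this design.

```typescript
// Estimates how many WebSocket instances are needed for a given number of
// concurrent connections, leaving some headroom for reconnect bursts.
function websocketInstancesNeeded(
  concurrentConnections: number,
  perInstance: number = 10000, // ~10K connections per 4GB instance (from the list above)
  headroomPercent: number = 30, // Assumed safety margin
): number {
  // Integer arithmetic avoids floating-point rounding at the boundaries
  const usablePerInstance = Math.floor((perInstance * (100 - headroomPercent)) / 100)
  return Math.max(1, Math.ceil(concurrentConnections / usablePerInstance))
}
```

With the defaults, 50,000 concurrent connections work out to 8 instances (7,000 usable connections each).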
**Search indexing:**
- Async indexing via Kafka
- Dedicated OpenSearch domain
- Index aliases for zero-downtime reindexing
## Conclusion
This design provides a flexible issue tracking system with:
1. **O(1) reordering** via LexoRank, which eliminates cascading updates
2. **Per-column pagination**, so every column loads incrementally
3. **Optimistic locking** that detects concurrent edits instead of silently overwriting them
4. **Project-scoped workflows** that allow team customization without global impact
5. **Real-time sync** via GraphQL subscriptions with sub-300ms propagation
**Key architectural decisions:**
- LexoRank for ordering trades storage (growing strings) for write efficiency
- Per-column pagination over global pagination ensures balanced board views
- Last-write-wins is acceptable for most fields; CRDTs reserved for rich text
- Denormalized Redis cache trades consistency for read performance
**Known limitations:**
- LexoRank requires periodic rebalancing (background job)
- Last-write-wins may lose concurrent edits on same field
- Large boards (>1000 issues) need virtualization
**Future enhancements:**
- Local-first architecture for offline support (Linear-style sync engine)
- Field-level CRDTs for conflict-free concurrent editing
- GraphQL federation for microservices decomposition
## Appendix
### Prerequisites
- Distributed systems fundamentals (eventual consistency, optimistic locking)
- GraphQL basics (queries, mutations, subscriptions)
- React state management patterns
- SQL and database design
### Terminology
| Term | Definition |
| --------------------------- | ------------------------------------------------------------------------ |
| **LexoRank** | Lexicographically sortable string for ordering without cascading updates |
| **Optimistic locking** | Concurrency control using version numbers to detect conflicts |
| **Workflow** | Set of statuses and allowed transitions between them |
| **Fractional indexing** | Using real numbers (or strings) for ordering with O(1) insertions |
| **Cursor-based pagination** | Using opaque cursors instead of offsets for stable pagination |
| **Last-write-wins (LWW)** | Conflict resolution where the latest timestamp wins |
### Summary
- **LexoRank ordering** enables O(1) drag-and-drop without updating other rows
- **Per-column pagination** with cursor-based approach ensures balanced board loading
- **Optimistic locking** with version field detects concurrent modifications
- **Project-scoped workflows** allow custom statuses without schema changes
- **GraphQL subscriptions** provide real-time updates with sub-300ms propagation
- **Denormalized Redis cache** trades consistency for fast board reads
### References
**Issue Tracker APIs:**
- [Jira Software Cloud REST API](https://developer.atlassian.com/cloud/jira/software/rest/intro/) - Board and agile endpoints
- [Jira Cloud Platform REST API](https://developer.atlassian.com/cloud/jira/platform/rest/v3/) - Issue and workflow endpoints
- [Linear Developers - GraphQL API](https://linear.app/developers/graphql) - GraphQL schema and patterns
- [Asana Developers](https://developers.asana.com/reference/) - Task and section ordering
**Ordering Algorithms:**
- [Figma Blog - Realtime Editing of Ordered Sequences](https://www.figma.com/blog/realtime-editing-of-ordered-sequences/) - Fractional indexing at scale
- [Understanding LexoRank](https://support.atlassian.com/jira/kb/understanding-and-managing-lexorank-in-jira-server/) - Jira's ranking system
- [LexoRank Explained](https://tmcalm.nl/blog/lexorank-jira-ranking-system-explained/) - Detailed algorithm walkthrough
- [rocicorp/fractional-indexing](https://github.com/rocicorp/fractional-indexing) - Reference implementation
**Sync and Real-time:**
- [Scaling the Linear Sync Engine](https://linear.app/now/scaling-the-linear-sync-engine) - Local-first architecture
- [Reverse Engineering Linear's Sync Engine](https://github.com/wzhudev/reverse-linear-sync-engine) - Technical deep-dive
- [Conflict-free Replicated Data Types](https://crdt.tech/) - CRDT resources
**System Design:**
- [Optimistic Concurrency Control](https://en.wikipedia.org/wiki/Optimistic_concurrency_control) - Concurrency patterns
- [Cursor-based Pagination](https://relay.dev/graphql/connections.htm) - Relay connection specification
---
## Design a Payment System
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/system-design-problems/design-payment-system
**Category:** System Design / System Design Problems
**Description:** Building a payment processing platform that handles card transactions, bank transfers, and digital wallets with PCI DSS compliance, idempotent processing, and real-time fraud detection. Payment systems operate under unique constraints: zero tolerance for duplicate charges, regulatory mandates (PCI DSS), and sub-second fraud decisions. This design covers the complete payment lifecycle—authorization, capture, settlement—plus reconciliation, refunds, and multi-gateway routing.
# Design a Payment System
Building a payment processing platform that handles card transactions, bank transfers, and digital wallets with PCI DSS compliance, idempotent processing, and real-time fraud detection. Payment systems operate under unique constraints: zero tolerance for duplicate charges, regulatory mandates (PCI DSS), and sub-second fraud decisions. This design covers the complete payment lifecycle—authorization, capture, settlement—plus reconciliation, refunds, and multi-gateway routing.

Payment system architecture: Client apps submit payments through an idempotent API, fraud engine scores in real-time, smart router selects optimal processor, authorization flows through card networks, and all movements are recorded in a double-entry ledger.
## Abstract
Payment system design revolves around four competing constraints:
1. **Exactly-once processing** — Network failures and retries must never result in duplicate charges. Idempotency keys + request fingerprinting make every operation safely retriable.
2. **PCI compliance scope reduction** — Cardholder data (PAN, CVV) must never touch your servers if avoidable. Tokenization at the edge (via Stripe Elements, Adyen Web Components) keeps sensitive data out of your environment.
3. **Latency under fraud scrutiny** — Fraud decisions must complete in <100ms to avoid checkout abandonment, while evaluating 1000+ signals per transaction.
4. **Financial accuracy** — Every fund movement (authorization hold, capture, refund, chargeback) must be recorded in a double-entry ledger. Reconciliation ensures external settlements match internal records.
The mental model: **tokenize → authorize → capture → settle → reconcile**. Each stage has distinct timing, failure modes, and rollback procedures.
| Design Decision | Trade-off |
| ------------------- | ------------------------------------------------------- |
| Edge tokenization | Removes PAN from scope; adds client SDK complexity |
| Idempotency keys | Safe retries; requires key management and storage |
| Smart routing | Higher auth rates; multi-processor operational overhead |
| Async settlement | Handles scale; delayed confirmation visibility |
| Double-entry ledger | Audit-ready; write amplification |
## Requirements
### Functional Requirements
| Feature | Scope | Notes |
| ----------------------- | -------- | ---------------------------------------- |
| Card payments | Core | Visa, Mastercard, Amex via card networks |
| Bank transfers | Core | ACH (US), SEPA (EU), wire transfers |
| Digital wallets | Core | Apple Pay, Google Pay (tokenized) |
| Authorization + Capture | Core | Separate or combined (auth-capture) |
| Refunds | Core | Full and partial, with reason codes |
| Recurring payments | Core | Subscription billing with retry logic |
| Multi-currency | Extended | FX conversion at capture time |
| Split payments | Extended | Marketplace payouts |
| Disputes/Chargebacks | Extended | Evidence submission, representment |
| 3D Secure | Core | SCA compliance for EU/PSD2 |
### Non-Functional Requirements
| Requirement | Target | Rationale |
| ---------------------- | ----------- | ------------------------------------------------ |
| Availability | 99.99% | Revenue-critical; Stripe maintains 99.999% |
| Authorization latency | p99 < 2s | Card network round-trip + fraud scoring |
| Fraud decision latency | p99 < 100ms | Inline with authorization; cannot delay checkout |
| Duplicate charge rate | 0% | Non-negotiable; idempotency required |
| Data consistency | Strong | Financial data requires ACID guarantees |
| PCI DSS compliance | Level 1 | Required for >6M transactions/year |
| Settlement accuracy | 100% | Reconciliation must match external records |
### Scale Estimation
**Traffic Profile:**
| Metric | Typical | Peak (Black Friday) |
| ----------------- | --------- | ------------------- |
| Transactions/day | 10M | 50M |
| TPS (average) | 115 TPS | 580 TPS |
| TPS (peak) | 500 TPS | 2,000 TPS |
| Auth requests/sec | 1,000 RPS | 5,000 RPS |
**Reference: Visa processes 1,700-8,500 TPS average, with peak capacity of 65,000+ TPS.**
**Storage:**
```
Transactions: 10M/day × 2KB = 20GB/day
Yearly: 7.3TB
With 7-year retention: ~50TB
Ledger entries: 10M × 4 entries (avg) × 500B = 20GB/day
Event stream: 10M × 1KB = 10GB/day
```
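The estimates above are plain arithmetic. A minimal sketch that reproduces them, assuming decimal units (1 GB = 10^9 bytes) and the record sizes stated above; the function and variable names are illustrative only:

```typescript
// Back-of-envelope capacity math using the article's inputs.
const SECONDS_PER_DAY = 86_400

function averageTps(transactionsPerDay: number): number {
  return transactionsPerDay / SECONDS_PER_DAY
}

function dailyBytes(recordsPerDay: number, bytesPerRecord: number): number {
  return recordsPerDay * bytesPerRecord
}

const txPerDay = 10_000_000
const typicalTps = averageTps(txPerDay) // roughly 115.7
const peakDayTps = averageTps(50_000_000) // roughly 578.7

const txGbPerDay = dailyBytes(txPerDay, 2_000) / 1e9 // 2 KB per transaction
const ledgerGbPerDay = dailyBytes(txPerDay * 4, 500) / 1e9 // 4 entries of 500 B each
const eventGbPerDay = dailyBytes(txPerDay, 1_000) / 1e9 // 1 KB per event
const txTbPerYear = (txGbPerDay * 365) / 1_000
const txTbSevenYears = txTbPerYear * 7 // the "~50TB" retention figure
```

The average TPS figures assume traffic is spread evenly across the day, which is why the traffic table carries a separate peak row.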
**Latency Budget:**
```
Total authorization: 2000ms budget
├── API processing: 50ms
├── Fraud scoring: 100ms
├── Tokenization lookup: 20ms
├── Network to processor: 50ms
├── Processor to card network: 500ms
├── Issuer decision: 800ms
└── Response path: 480ms
```
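A budget like this is only useful if the stages actually sum to the target and each downstream call gets a timeout derived from what is left. A small sketch under those assumptions; the stage keys and helper names are hypothetical:

```typescript
// Stage budgets in milliseconds, copied from the latency breakdown above.
const STAGE_BUDGET_MS: Record<string, number> = {
  api_processing: 50,
  fraud_scoring: 100,
  tokenization_lookup: 20,
  network_to_processor: 50,
  processor_to_card_network: 500,
  issuer_decision: 800,
  response_path: 480,
}

function totalBudgetMs(stages: Record<string, number>): number {
  return Object.values(stages).reduce((sum, ms) => sum + ms, 0)
}

// Time left for the remaining stages once `elapsedMs` has been spent.
// Never negative, so it can be passed directly as a client timeout.
function remainingBudgetMs(totalMs: number, elapsedMs: number): number {
  return Math.max(0, totalMs - elapsedMs)
}
```

For example, after API processing, fraud scoring, and the tokenization lookup (170ms), the processor call would be given at most `remainingBudgetMs(2000, 170)` milliseconds.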
## Design Paths
### Path A: Integrated Payment Gateway (Build In-House)
**Best when:**
- High transaction volume (>$1B annually)—interchange savings justify engineering cost
- Regulatory requirements demand data residency
- Unique payment flows that don't fit third-party APIs
**Architecture:**

**Key characteristics:**
- Direct acquiring bank relationships
- Full control over routing decisions
- In-house tokenization and vault
- Custom fraud rules and ML models
**Trade-offs:**
- ✅ Lower per-transaction cost at scale (save 0.1-0.3%)
- ✅ Full customization of payment flows
- ✅ Data residency control
- ❌ PCI DSS Level 1 scope (audit, penetration testing, quarterly scans)
- ❌ 12-18 month build time minimum
- ❌ Requires dedicated security and compliance team
**Real-world example:** Shopify built Shop Pay in-house, processing $12B+ in GMV (2023). The investment was justified by volume and unique merchant financing features, and it required a dedicated payments engineering team of 50+.
### Path B: Third-Party Payment Platform (Stripe, Adyen)
**Best when:**
- Speed to market is critical
- Transaction volume <$500M annually
- Engineering focus should be on product, not payments infrastructure
**Architecture:**

**Key characteristics:**
- PCI scope reduced to SAQ-A (minimal questionnaire)
- Built-in fraud detection (Stripe Radar)
- Payment method coverage (cards, wallets, BNPL)
- Automatic card network compliance updates
**Trade-offs:**
- ✅ Days to integrate, not months
- ✅ PCI compliance handled by provider
- ✅ Built-in fraud detection
- ✅ Global payment method coverage
- ❌ Higher per-transaction fees (2.9% + $0.30 typical)
- ❌ Less control over routing and failover
- ❌ Vendor lock-in risk
**Real-world example:** Figma uses Stripe for all payments. At their scale (~$600M ARR), the 2.9% fee is acceptable given engineering leverage—zero payment engineers needed on staff.
### Path C: Hybrid with Payment Orchestration
**Best when:**
- Multiple payment providers needed (regional coverage, redundancy)
- Authorization rate optimization is critical
- Gradual migration from one provider to another
**Architecture:**

**Key characteristics:**
- Single API, multiple backend processors
- Intelligent routing based on card type, geography, cost
- Automatic failover on processor outage
- A/B testing payment flows
**Trade-offs:**
- ✅ Redundancy and failover
- ✅ Route optimization (Adyen reports 26% cost savings with smart routing)
- ✅ Gradual provider migration
- ❌ Additional integration layer complexity
- ❌ Token portability challenges between providers
- ❌ Orchestration platform cost
**Real-world example:** eBay uses Adyen's intelligent payment routing, achieving 26% average cost savings on US debit transactions and 0.22% uplift in authorization rates.
### Path Comparison
| Factor | Path A (Build) | Path B (Third-Party) | Path C (Orchestration) |
| -------------------- | ------------------------- | -------------------- | ------------------------ |
| Time to market | 12-18 months | Days-weeks | 1-3 months |
| PCI scope | Level 1 (full) | SAQ-A (minimal) | SAQ-A (minimal) |
| Per-transaction cost | Lowest at scale | Highest (2.9%+) | Middle |
| Engineering effort | High (50+ FTEs) | Low (1-2 FTEs) | Medium (5-10 FTEs) |
| Customization | Full | Limited | Medium |
| Best for | High-volume, unique needs | Startups, SMBs | Enterprise, multi-region |
### This Article's Focus
This article focuses on **Path B (Third-Party) with elements of Path C (Smart Routing)** because:
1. Most engineering teams should not build payment infrastructure
2. Third-party platforms handle PCI compliance, fraud, and card network changes
3. Smart routing concepts apply regardless of implementation
The architecture sections show how to integrate third-party providers while maintaining control over critical concerns like idempotency, reconciliation, and ledger accuracy.
## High-Level Design
### Component Overview
| Component | Responsibility | Technology |
| ---------------------- | ----------------------------- | -------------------------------- |
| Payment API | Idempotent payment operations | REST API + Idempotency keys |
| Token Service | Map payment methods to tokens | Stripe.js / Adyen Web Components |
| Smart Router | Select optimal processor | Rule engine + ML routing |
| Fraud Engine | Real-time risk scoring | ML model (Stripe Radar) |
| Authorization Service | Card network communication | Processor SDK |
| Capture Service | Settlement initiation | Async job processor |
| Ledger Service | Double-entry bookkeeping | PostgreSQL + event sourcing |
| Reconciliation Service | Match internal vs external | Batch jobs + anomaly detection |
| Webhook Handler | Process async events | Idempotent consumer |
### Payment Lifecycle

### Authorization vs Capture Timing
| Pattern | Use Case | Auth Window |
| ------------------------- | ------------------------------- | ------------------------------- |
| Auth + immediate capture | Digital goods, subscriptions | N/A (single request) |
| Auth then capture | E-commerce (ship then charge) | 7 days (Visa), 30 days (others) |
| Auth with delayed capture | Hotels, car rentals | Up to 31 days |
| Incremental auth | Hotels (room service additions) | Within original auth window |
**Design note:** Visa shortened online Merchant-Initiated Transaction (MIT) windows from 7 to 5 days as of April 2024. Always capture within the network's window or risk auth expiration.
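Because network windows change, the capture deadline is better treated as configuration than as a constant. A minimal sketch, assuming a hypothetical policy table whose day counts are taken from the table above for illustration and must be checked against current network rules:

```typescript
interface AuthWindowPolicy {
  network: string
  windowDays: number
}

// Hypothetical policy table; values are illustrative, not authoritative.
const AUTH_WINDOWS: AuthWindowPolicy[] = [
  { network: "visa", windowDays: 7 },
  { network: "default", windowDays: 30 },
]

const MS_PER_DAY = 24 * 60 * 60 * 1000

// Latest moment at which an authorization can still be captured.
function captureDeadline(authorizedAt: Date, network: string): Date {
  const policy =
    AUTH_WINDOWS.find((p) => p.network === network) ??
    AUTH_WINDOWS.find((p) => p.network === "default")!
  return new Date(authorizedAt.getTime() + policy.windowDays * MS_PER_DAY)
}

function canCapture(authorizedAt: Date, network: string, now: Date): boolean {
  return now.getTime() < captureDeadline(authorizedAt, network).getTime()
}
```

A value like this is what an API would surface to clients as a `capture_before` timestamp, so that capture jobs can be scheduled ahead of expiry.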
## API Design
### Create Payment
```http
POST /api/v1/payments
Idempotency-Key: pay_abc123_user_456
Authorization: Bearer {api_key}
Content-Type: application/json
{
"amount": 9999,
"currency": "usd",
"payment_method_token": "pm_tok_visa_4242",
"capture_method": "automatic",
"description": "Order #12345",
"metadata": {
"order_id": "ord_789",
"customer_email": "user@example.com"
}
}
```
**Response (201 Created):**
```json
{
"id": "pay_xyz789",
"object": "payment",
"amount": 9999,
"currency": "usd",
"status": "succeeded",
"payment_method": {
"id": "pm_tok_visa_4242",
"type": "card",
"card": {
"brand": "visa",
"last4": "4242",
"exp_month": 12,
"exp_year": 2025
}
},
"captured": true,
"receipt_url": "https://pay.example.com/receipts/pay_xyz789",
"created_at": "2024-03-15T10:00:00Z",
"metadata": {
"order_id": "ord_789"
}
}
```
**Error Responses:**
| Code | Condition | Response |
| ----------------------- | -------------------------------------------- | ---------------------------------------------------------------------------- |
| `400 Bad Request` | Invalid amount, currency | `{"error": {"code": "invalid_amount"}}` |
| `402 Payment Required` | Card declined | `{"error": {"code": "card_declined", "decline_code": "insufficient_funds"}}` |
| `409 Conflict` | Idempotency key reused with different params | `{"error": {"code": "idempotency_conflict"}}` |
| `429 Too Many Requests` | Rate limit exceeded | `{"error": {"code": "rate_limited"}}` |
### Authorize Only (Separate Capture)
```http
POST /api/v1/payments
Idempotency-Key: auth_abc123
{
"amount": 9999,
"currency": "usd",
"payment_method_token": "pm_tok_visa_4242",
"capture_method": "manual"
}
```
**Response:**
```json
{
"id": "pay_xyz789",
"status": "requires_capture",
"amount_capturable": 9999,
"capture_before": "2024-03-22T10:00:00Z"
}
```
### Capture Payment
```http
POST /api/v1/payments/{payment_id}/capture
Idempotency-Key: cap_abc123
{
"amount_to_capture": 9999
}
```
**Partial capture:** Capture less than the authorized amount by sending a smaller `amount_to_capture`. The remaining authorization is released automatically.
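The partial-capture rule can be sketched as a pure validation step. This assumes amounts are integer cents, as in the API examples; `planCapture` and `CaptureResult` are hypothetical names:

```typescript
interface CaptureResult {
  captured_cents: number
  released_cents: number // portion of the hold returned to the cardholder
}

function planCapture(amountCapturableCents: number, amountToCaptureCents: number): CaptureResult {
  if (!Number.isInteger(amountToCaptureCents) || amountToCaptureCents <= 0) {
    throw new RangeError("amount_to_capture must be a positive integer number of cents")
  }
  if (amountToCaptureCents > amountCapturableCents) {
    throw new RangeError("amount_to_capture exceeds the authorized amount")
  }
  return {
    captured_cents: amountToCaptureCents,
    released_cents: amountCapturableCents - amountToCaptureCents,
  }
}
```

Capturing 7500 of a 9999 authorization would capture $75.00 and release the remaining $24.99.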
### Refund Payment
```http
POST /api/v1/payments/{payment_id}/refunds
Idempotency-Key: ref_abc123
{
"amount": 2500,
"reason": "customer_request",
"metadata": {
"support_ticket": "TKT-456"
}
}
```
**Response:**
```json
{
"id": "ref_abc456",
"payment_id": "pay_xyz789",
"amount": 2500,
"status": "pending",
"reason": "customer_request",
"estimated_arrival": "2024-03-20"
}
```
**Refund timing:** Card refunds take 5-10 business days to appear on customer statement. ACH refunds take 3-5 business days.
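Partial refunds need a guard so that the cumulative refunded amount never exceeds what was captured. A minimal sketch, assuming the `amount_captured_cents` and `amount_refunded_cents` fields from the payment schema; the helper names are hypothetical:

```typescript
interface RefundablePayment {
  amount_captured_cents: number
  amount_refunded_cents: number
}

function refundableCents(p: RefundablePayment): number {
  return p.amount_captured_cents - p.amount_refunded_cents
}

// Returns an updated copy; the caller would persist it in the same
// database transaction that records the refund.
function applyRefund(p: RefundablePayment, refundCents: number): RefundablePayment {
  if (!Number.isInteger(refundCents) || refundCents <= 0) {
    throw new RangeError("refund amount must be a positive integer number of cents")
  }
  if (refundCents > refundableCents(p)) {
    throw new RangeError("refund exceeds the remaining refundable amount")
  }
  return { ...p, amount_refunded_cents: p.amount_refunded_cents + refundCents }
}
```

In a real service this check has to run under a row lock or an equivalent conditional update, otherwise two concurrent refunds can both pass the guard.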
### Webhook Events
```http
POST /webhooks/payments
Stripe-Signature: t=1234567890,v1=abc123...
{
"id": "evt_123",
"type": "payment_intent.succeeded",
"data": {
"object": {
"id": "pi_xyz",
"amount": 9999,
"status": "succeeded"
}
},
"created": 1234567890
}
```
**Critical webhook events:**
| Event | Action Required |
| ------------------------------- | --------------------------------------- |
| `payment_intent.succeeded` | Mark order as paid, trigger fulfillment |
| `payment_intent.payment_failed` | Notify customer, retry logic |
| `charge.refunded` | Update order status, adjust inventory |
| `charge.dispute.created` | Alert fraud team, gather evidence |
| `payout.paid` | Reconcile settlement |
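Before acting on any of these events, the handler should verify the `Stripe-Signature` header. Stripe documents the scheme as an HMAC-SHA256 over `{timestamp}.{raw body}`, keyed with the endpoint's signing secret. A minimal sketch using Node's built-in `crypto` module; the function names are hypothetical, and in practice Stripe's official library offers `webhooks.constructEvent` for the same purpose:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto"

function parseSignatureHeader(header: string): { timestamp: number; signatures: string[] } {
  let timestamp = NaN
  const signatures: string[] = []
  for (const part of header.split(",")) {
    const [k, v] = part.split("=", 2)
    if (k === "t") timestamp = Number(v)
    if (k === "v1" && v) signatures.push(v)
  }
  return { timestamp, signatures }
}

function sign(secret: string, timestamp: number, rawBody: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${rawBody}`, "utf8").digest("hex")
}

// rawBody must be the exact bytes received, not a re-serialized object.
function verifyWebhook(
  rawBody: string,
  header: string,
  secret: string,
  nowSeconds: number,
  toleranceSeconds = 300, // assumed replay window
): boolean {
  const { timestamp, signatures } = parseSignatureHeader(header)
  if (!Number.isFinite(timestamp)) return false
  if (Math.abs(nowSeconds - timestamp) > toleranceSeconds) return false
  const expected = Buffer.from(sign(secret, timestamp, rawBody), "hex")
  return signatures.some((sig) => {
    const candidate = Buffer.from(sig, "hex")
    return candidate.length === expected.length && timingSafeEqual(candidate, expected)
  })
}
```

The timestamp check limits replay of captured requests, and the constant-time comparison avoids leaking how many signature bytes matched.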
## Data Modeling
### Payment Schema (PostgreSQL)
```sql collapse={1-5, 45-55}
-- Core payment record
CREATE TABLE payments (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
external_id VARCHAR(100) UNIQUE NOT NULL,
idempotency_key VARCHAR(255) UNIQUE NOT NULL,
-- Amount
amount_cents BIGINT NOT NULL CHECK (amount_cents > 0),
currency VARCHAR(3) NOT NULL,
amount_captured_cents BIGINT DEFAULT 0,
amount_refunded_cents BIGINT DEFAULT 0,
-- Status
status VARCHAR(30) NOT NULL DEFAULT 'pending',
capture_method VARCHAR(20) NOT NULL,
-- Payment method (tokenized reference)
payment_method_id UUID REFERENCES payment_methods(id),
payment_method_type VARCHAR(20) NOT NULL,
-- Customer
customer_id UUID REFERENCES customers(id),
-- Processor details
processor VARCHAR(30) NOT NULL,
processor_payment_id VARCHAR(100),
auth_code VARCHAR(20),
decline_code VARCHAR(50),
-- Risk
risk_score INTEGER,
risk_level VARCHAR(20),
-- Metadata
description TEXT,
metadata JSONB DEFAULT '{}',
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
authorized_at TIMESTAMPTZ,
captured_at TIMESTAMPTZ,
canceled_at TIMESTAMPTZ,
-- Constraints
CONSTRAINT valid_status CHECK (status IN (
'pending', 'requires_action', 'requires_capture',
'processing', 'succeeded', 'failed', 'canceled'
))
);
-- Indexes for common queries
CREATE INDEX idx_payments_customer ON payments(customer_id, created_at DESC);
CREATE INDEX idx_payments_status ON payments(status, created_at DESC);
CREATE INDEX idx_payments_processor ON payments(processor_payment_id);
-- idempotency_key needs no separate index: its UNIQUE constraint already creates one
```
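The `valid_status` constraint restricts which values a payment can hold, but not which transitions are legal. One way to enforce that in application code is an explicit transition table. The graph below is an assumption for illustration, not something the schema defines:

```typescript
type PaymentStatus =
  | "pending"
  | "requires_action"
  | "requires_capture"
  | "processing"
  | "succeeded"
  | "failed"
  | "canceled"

// Assumed transition graph; terminal states have no outgoing edges.
const ALLOWED_TRANSITIONS: Record<PaymentStatus, PaymentStatus[]> = {
  pending: ["requires_action", "requires_capture", "processing", "succeeded", "failed", "canceled"],
  requires_action: ["requires_capture", "processing", "succeeded", "failed", "canceled"],
  requires_capture: ["processing", "succeeded", "canceled"],
  processing: ["succeeded", "failed"],
  succeeded: [],
  failed: [],
  canceled: [],
}

function canTransition(from: PaymentStatus, to: PaymentStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to)
}

function transition(from: PaymentStatus, to: PaymentStatus): PaymentStatus {
  if (!canTransition(from, to)) {
    throw new Error(`illegal payment status transition: ${from} -> ${to}`)
  }
  return to
}
```

Rejecting illegal transitions matters most for out-of-order webhooks, where a late `processing` event must not overwrite a payment that has already `succeeded`.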
### Double-Entry Ledger Schema
```sql
-- Accounts in the chart of accounts
CREATE TABLE accounts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
code VARCHAR(20) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
type VARCHAR(20) NOT NULL, -- asset, liability, revenue, expense
currency VARCHAR(3) NOT NULL,
is_active BOOLEAN DEFAULT true
);
-- Ledger entries (immutable)
CREATE TABLE ledger_entries (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
transaction_id UUID NOT NULL,
account_id UUID NOT NULL REFERENCES accounts(id),
entry_type VARCHAR(10) NOT NULL, -- debit or credit
amount_cents BIGINT NOT NULL CHECK (amount_cents > 0),
currency VARCHAR(3) NOT NULL,
description TEXT,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ DEFAULT NOW(),
-- Reference to source
payment_id UUID REFERENCES payments(id),
refund_id UUID REFERENCES refunds(id),
payout_id UUID REFERENCES payouts(id)
);
CREATE INDEX idx_ledger_transaction ON ledger_entries(transaction_id);
CREATE INDEX idx_ledger_account ON ledger_entries(account_id, created_at DESC);
CREATE INDEX idx_ledger_payment ON ledger_entries(payment_id);
```
### Ledger Entry Examples
**Authorization (hold funds):**
```
Transaction: AUTH-001
├── DEBIT accounts_receivable $100.00
└── CREDIT authorization_hold $100.00
```
**Capture (recognize revenue):**
```
Transaction: CAP-001
├── DEBIT authorization_hold $100.00
└── CREDIT revenue $100.00
```
**Settlement (receive cash):**
```
Transaction: SET-001
├── DEBIT cash $97.10 (after fees)
├── DEBIT processing_fees $2.90
└── CREDIT accounts_receivable $100.00
```
**Refund:**
```
Transaction: REF-001
├── DEBIT revenue $50.00
└── CREDIT accounts_receivable $50.00
```
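Every transaction above satisfies the double-entry invariant: total debits equal total credits. A small sketch of that check, assuming amounts in integer cents as in the `ledger_entries` schema; `LedgerLine` and `isBalanced` are hypothetical names:

```typescript
interface LedgerLine {
  account: string
  entry_type: "debit" | "credit"
  amount_cents: number
}

// A transaction is valid only if it has at least two lines, every amount
// is a positive integer, and debits equal credits.
function isBalanced(lines: LedgerLine[]): boolean {
  let debits = 0
  let credits = 0
  for (const line of lines) {
    if (!Number.isInteger(line.amount_cents) || line.amount_cents <= 0) return false
    if (line.entry_type === "debit") debits += line.amount_cents
    else credits += line.amount_cents
  }
  return lines.length >= 2 && debits === credits
}

// The settlement example (SET-001) expressed as ledger lines.
const settlement: LedgerLine[] = [
  { account: "cash", entry_type: "debit", amount_cents: 9710 },
  { account: "processing_fees", entry_type: "debit", amount_cents: 290 },
  { account: "accounts_receivable", entry_type: "credit", amount_cents: 10000 },
]
```

Running this check before inserting a transaction's lines, inside the same database transaction, keeps an unbalanced entry from ever reaching the ledger.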
### Token Vault Schema
```sql
-- Minimal schema for tokenized payment methods
-- Actual PAN/CVV stored in PCI-compliant vault (Stripe, external)
CREATE TABLE payment_methods (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID NOT NULL REFERENCES customers(id),
processor VARCHAR(30) NOT NULL,
processor_token VARCHAR(100) NOT NULL, -- Stripe pm_xxx
-- Non-sensitive metadata
type VARCHAR(20) NOT NULL, -- card, bank_account, wallet
card_brand VARCHAR(20), -- visa, mastercard, amex
card_last4 VARCHAR(4),
card_exp_month INTEGER,
card_exp_year INTEGER,
card_funding VARCHAR(20), -- credit, debit, prepaid
-- Billing
billing_name VARCHAR(100),
billing_country VARCHAR(2),
billing_postal_code VARCHAR(20),
-- Status
is_default BOOLEAN DEFAULT false,
is_active BOOLEAN DEFAULT true,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(customer_id, processor_token)
);
```
**PCI scope note:** This schema stores only tokens and non-sensitive metadata. The actual card numbers (PAN) are stored by Stripe/Adyen in their PCI-compliant vaults. Your database never sees or stores raw card data.
### Database Selection Matrix
| Data | Store | Rationale |
| ------------------------ | ------------------ | ----------------------------------------- |
| Payments | PostgreSQL | ACID, complex queries, audit requirements |
| Ledger entries | PostgreSQL | Strong consistency, immutable append-only |
| Idempotency keys | Redis + PostgreSQL | Fast lookup (Redis), durable record (PG) |
| Payment method tokens | PostgreSQL | Referential integrity with customers |
| Event stream | Kafka | High throughput, replay capability |
| Reconciliation snapshots | S3 + Parquet | Cost-effective analytics storage |
| Rate limiting | Redis | Sub-ms counters |
## Low-Level Design
### Idempotency Implementation
Idempotency prevents duplicate charges when clients retry failed requests. Stripe's implementation serves as the industry reference.
**Request flow:**

**Implementation:**
```typescript collapse={1-15, 65-80}
// idempotency-service.ts
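import { Redis } from "ioredis"
import { createHash } from "crypto"
import { db } from "./db" // Assumed data-access module (not shown); persists records to PostgreSQL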
interface IdempotencyRecord {
key: string
request_hash: string
response: any
status: "processing" | "complete" | "error"
created_at: Date
expires_at: Date
}
const redis = new Redis(process.env.REDIS_URL)
const IDEMPOTENCY_TTL = 24 * 60 * 60 // 24 hours (Stripe's window)
export async function checkIdempotency(
key: string,
requestBody: object,
): Promise<{ exists: boolean; response?: any; conflict?: boolean }> {
const requestHash = hashRequest(requestBody)
// Check Redis first (fast path)
const cached = await redis.get(`idem:${key}`)
if (!cached) {
// Key doesn't exist, allow processing
return { exists: false }
}
const record: IdempotencyRecord = JSON.parse(cached)
// Key exists, check if parameters match
if (record.request_hash !== requestHash) {
// Same key, different request = conflict
return { exists: true, conflict: true }
}
// Same key, same request
if (record.status === "processing") {
// Request in flight, return 409 to trigger retry
return { exists: true, conflict: true }
}
// Return cached response
return { exists: true, response: record.response }
}
export async function startIdempotentRequest(key: string, requestBody: object): Promise<boolean> {
const requestHash = hashRequest(requestBody)
// Atomic set-if-not-exists
const result = await redis.set(
`idem:${key}`,
JSON.stringify({
key,
request_hash: requestHash,
status: "processing",
created_at: new Date(),
}),
"EX",
IDEMPOTENCY_TTL,
"NX", // Only set if not exists
)
return result === "OK"
}
export async function completeIdempotentRequest(
key: string,
response: any,
status: "complete" | "error",
): Promise<void> {
const cached = await redis.get(`idem:${key}`)
if (!cached) return
const record: IdempotencyRecord = JSON.parse(cached)
record.response = response
record.status = status
await redis.setex(`idem:${key}`, IDEMPOTENCY_TTL, JSON.stringify(record))
// Also persist to PostgreSQL for durability
await db.idempotency_records.upsert({
key,
request_hash: record.request_hash,
response: JSON.stringify(response),
status,
expires_at: new Date(Date.now() + IDEMPOTENCY_TTL * 1000),
})
}
function hashRequest(body: object): string {
return createHash("sha256").update(JSON.stringify(body)).digest("hex")
}
```
**Payment controller with idempotency:**
```typescript collapse={1-8, 50-60}
// payment-controller.ts
import { checkIdempotency, startIdempotentRequest, completeIdempotentRequest } from "./idempotency-service"
// errorResponse, successResponse, and processPayment are assumed helpers defined elsewhere
export async function createPayment(req: Request): Promise<Response> {
  const idempotencyKey = req.headers.get("Idempotency-Key")
  if (!idempotencyKey) {
    return errorResponse(400, "idempotency_key_required")
  }
  // Parse the body once: on a Fetch API Request, req.body is a stream, not the parsed payload
  const body = await req.json()
  // Check for existing request
  const check = await checkIdempotency(idempotencyKey, body)
  if (check.conflict) {
    return errorResponse(409, "idempotency_conflict")
  }
  if (check.exists && check.response) {
    // Return cached response (including errors)
    return new Response(JSON.stringify(check.response), {
      status: check.response.status_code || 200,
      headers: { "Idempotent-Replayed": "true" },
    })
  }
  // Start processing (atomic lock)
  const acquired = await startIdempotentRequest(idempotencyKey, body)
  if (!acquired) {
    // Another request is processing, retry later
    return errorResponse(409, "request_in_progress")
  }
  try {
    // Process payment
    const payment = await processPayment(body)
    // Cache successful response
    await completeIdempotentRequest(idempotencyKey, payment, "complete")
    return successResponse(201, payment)
  } catch (error) {
    // Cache error response (prevents retry storms)
    const message = error instanceof Error ? error.message : String(error)
    await completeIdempotentRequest(idempotencyKey, { error: message }, "error")
    throw error
  }
}
```
**Design decisions:**
| Decision | Rationale |
| ------------------ | ------------------------------------------------------------------------- |
| 24-hour TTL | Matches Stripe; long enough for debugging, short enough to not accumulate |
| Hash request body | Detects different requests with same key |
| Cache errors too | Prevents retry storms for permanent failures |
| Redis + PostgreSQL | Redis for speed, PostgreSQL for durability and audit |
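One caveat with hashing `JSON.stringify(body)` directly: two bodies with the same fields in a different key order produce different hashes, so a client that serializes differently on retry would see a spurious conflict. A sketch of an order-insensitive fingerprint, assuming request bodies are plain JSON values; `canonicalize` and `fingerprint` are hypothetical names:

```typescript
import { createHash } from "node:crypto"

// Recursively sort object keys so that semantically equal bodies
// serialize to the same string. Array order is preserved.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize)
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>
    const sorted: Record<string, unknown> = {}
    for (const key of Object.keys(obj).sort()) {
      sorted[key] = canonicalize(obj[key])
    }
    return sorted
  }
  return value
}

function fingerprint(body: unknown): string {
  return createHash("sha256").update(JSON.stringify(canonicalize(body))).digest("hex")
}
```

With this in place, `{ amount: 9999, currency: "usd" }` and `{ currency: "usd", amount: 9999 }` map to the same fingerprint, while any change in a value still produces a different one.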
### Fraud Detection Pipeline
Real-time fraud scoring must complete within 100ms to avoid impacting checkout latency.
**Scoring architecture:**

**Feature examples (1000+ per transaction):**
| Category | Features |
| ----------- | ------------------------------------------------------ |
| Velocity | Transactions/hour from this card, IP, device |
| Historical | Days since first transaction, avg transaction amount |
| Geolocation | Distance from billing address, IP geolocation mismatch |
| Card | BIN country, funding type (credit/debit/prepaid) |
| Device | Browser fingerprint, screen resolution, timezone |
| Behavioral | Time on page before purchase, mouse movement patterns |
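Velocity features are typically sliding-window counts keyed by card, IP, or device. A minimal in-memory sketch of the idea; a production system would more likely keep these counters in Redis or a feature store, and the class and key format here are hypothetical:

```typescript
class VelocityCounter {
  private readonly windowMs: number
  private readonly events = new Map<string, number[]>()

  constructor(windowMs: number) {
    this.windowMs = windowMs
  }

  // Records an event at `nowMs` and returns how many events for `key`
  // fall inside the window, including this one.
  record(key: string, nowMs: number): number {
    const cutoff = nowMs - this.windowMs
    const recent = (this.events.get(key) ?? []).filter((t) => t > cutoff)
    recent.push(nowMs)
    this.events.set(key, recent)
    return recent.length
  }
}
```

A call such as `counter.record("card:4242", Date.now())` with a one-hour window yields the "transactions/hour from this card" feature at scoring time.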
**Stripe Radar reference:**
- 0-99 risk score (0 = lowest risk)
- Default block threshold: 75
- Elevated risk threshold: 65
- Decision time: <100ms
- False positive rate: ~0.1%
- Evaluates 1000+ characteristics per transaction
- Models retrained monthly (0.5% recall improvement per retraining)
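The thresholds above translate into a simple decision function. This sketch assumes scores are integers in the 0-99 range and that the elevated band maps to manual review; the action names are hypothetical rather than Radar's own API:

```typescript
type RiskAction = "allow" | "review" | "block"

interface RiskThresholds {
  review: number // elevated risk threshold
  block: number // default block threshold
}

const DEFAULT_THRESHOLDS: RiskThresholds = { review: 65, block: 75 }

function decide(score: number, thresholds: RiskThresholds = DEFAULT_THRESHOLDS): RiskAction {
  if (!Number.isInteger(score) || score < 0 || score > 99) {
    throw new RangeError("risk score must be an integer between 0 and 99")
  }
  if (score >= thresholds.block) return "block"
  if (score >= thresholds.review) return "review"
  return "allow"
}
```

Keeping the thresholds as a parameter lets a merchant trade false positives against fraud losses without touching the scoring model.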
**3D Secure integration:**
```typescript collapse={1-10, 45-55}
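// 3ds-service.ts
import Stripe from "stripe"
// STRIPE_SECRET_KEY is an assumed environment variable name
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!)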
interface ThreeDSResult {
authentication_status: "success" | "failed" | "attempted" | "not_supported"
liability_shift: boolean
eci: string // Electronic Commerce Indicator
}
export async function handle3DSChallenge(
paymentIntentId: string,
returnUrl: string,
): Promise<{ requires_action: boolean; redirect_url?: string }> {
const pi = await stripe.paymentIntents.retrieve(paymentIntentId)
if (pi.status === "requires_action") {
// 3DS challenge required
return {
requires_action: true,
redirect_url: pi.next_action?.redirect_to_url?.url,
}
}
return { requires_action: false }
}
```
**3D Secure 2.0 flows:**
| Flow | Description | User Experience |
| ------------ | ---------------------------------- | ------------------- |
| Frictionless | Risk-based auth, no user input | Instant (invisible) |
| Challenge | Biometrics, OTP, push notification | 10-30 seconds |
**3DS 2.0 benefits:** Visa research shows an 85% reduction in checkout times and a 75% reduction in cart abandonment compared to 3DS 1.0. 3DS is also the primary mechanism for meeting PSD2 SCA requirements in the EU.
### Reconciliation Service
Reconciliation ensures internal ledger matches external settlements. Discrepancies indicate bugs, fraud, or processor errors.
**Reconciliation process:**

**Implementation:**
```typescript collapse={1-15, 70-85}
// reconciliation-service.ts
interface SettlementRecord {
external_id: string
amount_cents: number
currency: string
type: "charge" | "refund" | "chargeback"
settled_at: Date
fees_cents: number
}
interface ReconciliationResult {
matched: number
unmatched_internal: SettlementRecord[]
unmatched_external: SettlementRecord[]
amount_discrepancy_cents: number
}
export async function reconcileSettlement(date: Date, processor: string): Promise<ReconciliationResult> {
// 1. Fetch processor settlement report
const externalRecords = await fetchSettlementReport(processor, date)
// 2. Fetch internal ledger entries
const internalRecords = await fetchLedgerEntries(date, processor)
// 3. Match by external ID
const matched: string[] = []
const unmatchedExternal: SettlementRecord[] = []
const unmatchedInternal: SettlementRecord[] = []
const internalMap = new Map(internalRecords.map((r) => [r.external_id, r]))
for (const ext of externalRecords) {
const internal = internalMap.get(ext.external_id)
if (!internal) {
// Transaction in processor report but not in our ledger
unmatchedExternal.push(ext)
continue
}
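// Verify amounts match
if (internal.amount_cents !== ext.amount_cents) {
unmatchedExternal.push(ext)
unmatchedInternal.push(internal)
// Remove from the map so the sweep below does not report this record a second time
internalMap.delete(ext.external_id)
continue
}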
matched.push(ext.external_id)
internalMap.delete(ext.external_id)
}
// Remaining internal records not in processor report
for (const [, record] of internalMap) {
unmatchedInternal.push(record)
}
// Calculate total discrepancy
const externalTotal = externalRecords.reduce((sum, r) => sum + r.amount_cents, 0)
const internalTotal = internalRecords.reduce((sum, r) => sum + r.amount_cents, 0)
return {
matched: matched.length,
unmatched_internal: unmatchedInternal,
unmatched_external: unmatchedExternal,
amount_discrepancy_cents: externalTotal - internalTotal,
}
}
// Alert on discrepancies
export async function handleReconciliationBreaks(result: ReconciliationResult): Promise<void> {
if (result.unmatched_external.length > 0) {
await alertFinanceTeam({
type: "unmatched_external",
count: result.unmatched_external.length,
transactions: result.unmatched_external,
})
}
if (Math.abs(result.amount_discrepancy_cents) > 100) {
// $1 threshold
await alertFinanceTeam({
type: "amount_discrepancy",
amount_cents: result.amount_discrepancy_cents,
})
}
}
```
**Common reconciliation breaks:**
| Break Type | Cause | Resolution |
| -------------------- | ------------------------------ | ---------------------------- |
| Missing in ledger | Webhook missed, race condition | Replay webhook, manual entry |
| Missing in processor | Auth expired, voided | Close internal record |
| Amount mismatch | Partial capture, FX | Verify capture amount |
| Duplicate | Idempotency failure | Refund duplicate |
| Timing | Settlement date rollover | Verify dates |
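Most of these break types can be assigned mechanically before a human looks at them. A simplified sketch, assuming both sides are keyed by the same `external_id` and that a duplicate shows up as a repeated ID on the internal side; timing breaks are left out because they need settlement dates. All names are hypothetical:

```typescript
type BreakType = "missing_in_ledger" | "missing_in_processor" | "amount_mismatch" | "duplicate"

interface ReconRow {
  external_id: string
  amount_cents: number
}

function classifyBreaks(internal: ReconRow[], external: ReconRow[]): Map<string, BreakType> {
  const breaks = new Map<string, BreakType>()
  const internalById = new Map<string, ReconRow>()
  for (const row of internal) {
    if (internalById.has(row.external_id)) breaks.set(row.external_id, "duplicate")
    internalById.set(row.external_id, row)
  }
  const seenExternal = new Set<string>()
  for (const row of external) {
    seenExternal.add(row.external_id)
    const match = internalById.get(row.external_id)
    if (!match) {
      breaks.set(row.external_id, "missing_in_ledger")
    } else if (match.amount_cents !== row.amount_cents && !breaks.has(row.external_id)) {
      breaks.set(row.external_id, "amount_mismatch")
    }
  }
  for (const id of internalById.keys()) {
    if (!seenExternal.has(id) && !breaks.has(id)) breaks.set(id, "missing_in_processor")
  }
  return breaks
}
```

Records that match on both ID and amount never enter the map, so its size is the number of breaks to triage.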
### Smart Routing Implementation
Smart routing optimizes for authorization rate, cost, or both by selecting the best processor per transaction.
```typescript collapse={1-12, 55-70}
// smart-router.ts
interface RoutingDecision {
processor: "stripe" | "adyen" | "paypal"
reason: string
expected_auth_rate: number
expected_cost_bps: number // basis points
}
interface TransactionContext {
card_brand: string
card_country: string
card_funding: "credit" | "debit" | "prepaid"
amount_cents: number
currency: string
merchant_country: string
}
export function routeTransaction(ctx: TransactionContext): RoutingDecision {
// Rule 1: US debit cards - route to lowest-cost network
if (ctx.card_country === "US" && ctx.card_funding === "debit") {
return {
processor: "adyen", // Intelligent routing for US debit
reason: "us_debit_cost_optimization",
expected_auth_rate: 0.96,
expected_cost_bps: 50, // vs 150+ for credit
}
}
// Rule 2: European cards - route to Adyen for local acquiring
if (["DE", "FR", "GB", "NL", "ES", "IT"].includes(ctx.card_country)) {
return {
processor: "adyen",
reason: "eu_local_acquiring",
expected_auth_rate: 0.94,
expected_cost_bps: 180, // Lower cross-border fees
}
}
// Rule 3: High-value transactions - route to processor with best auth rate
if (ctx.amount_cents > 100000) {
// > $1000
return {
processor: "stripe",
reason: "high_value_auth_optimization",
expected_auth_rate: 0.92,
expected_cost_bps: 290,
}
}
// Default: Stripe
return {
processor: "stripe",
reason: "default",
expected_auth_rate: 0.9,
expected_cost_bps: 290,
}
}
// Failover on processor error
// PaymentData and PaymentResult are assumed types defined elsewhere
export async function processWithFailover(ctx: TransactionContext, paymentData: PaymentData): Promise<PaymentResult> {
const primary = routeTransaction(ctx)
const processors = [primary.processor, ...getFailoverProcessors(primary.processor)]
for (const processor of processors) {
try {
return await processPayment(processor, paymentData)
} catch (error) {
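if (isRetryableError(error)) {
// Only fail over when the processor definitely did not process the request
// (e.g., connection refused); an ambiguous timeout may already have charged the card
continue // Try next processor
}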
throw error // Non-retryable (e.g., card declined)
}
}
throw new Error("All processors failed")
}
```
**Routing optimization results (Adyen reference):**
- US debit routing: 26% average cost savings
- Authorization rate uplift: 0.22%
- Some merchants: 55% cost savings
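Routing costs in the router are expressed in basis points (1 bps = 0.01%). A short sketch of the arithmetic behind comparing two routes, using the illustrative `expected_cost_bps` values from the router code; the helper names are hypothetical:

```typescript
// Cost in cents for a transaction at a given rate in basis points.
function costCents(amountCents: number, bps: number): number {
  return Math.round((amountCents * bps) / 10_000)
}

// Relative saving of the routed path versus a baseline, as a percentage.
function savingsPercent(baselineBps: number, routedBps: number): number {
  return ((baselineBps - routedBps) / baselineBps) * 100
}

const amountCents = 10_000 // a $100.00 transaction
const defaultCost = costCents(amountCents, 290) // default route in the router
const debitCost = costCents(amountCents, 50) // US debit route in the router
```

On a $100.00 transaction this is $2.90 versus $0.50. These are the sketch's own sample rates, not Adyen's published results, which depend on each merchant's card mix.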
## Frontend Considerations
### PCI Scope Reduction
The primary frontend concern is keeping card data out of your servers entirely.
**Tokenization at edge:**
```typescript collapse={1-8, 40-50}
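// payment-form.tsx (React example)
import { useState, type FormEvent } from "react";
import { loadStripe } from "@stripe/stripe-js";
import { Elements, CardElement, useStripe, useElements } from "@stripe/react-stripe-js";
const stripePromise = loadStripe(process.env.NEXT_PUBLIC_STRIPE_KEY!);
function CheckoutForm() {
  const stripe = useStripe();
  const elements = useElements();
  const [error, setError] = useState<string | null>(null);
  const handleSubmit = async (e: FormEvent) => {
    e.preventDefault();
    const card = elements?.getElement(CardElement);
    if (!stripe || !card) return; // Stripe.js has not finished loading
    // Card data goes directly to Stripe, never touches your server
    const { error: stripeError, paymentMethod } = await stripe.createPaymentMethod({
      type: "card",
      card,
      billing_details: {
        name: "Customer Name",
      },
    });
    if (stripeError) {
      setError(stripeError.message ?? "Payment failed. Please try again.");
      return;
    }
    // Only the token (pm_xxx) goes to your backend
    const response = await fetch("/api/payments", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        payment_method_token: paymentMethod.id,
        amount: 9999,
      }),
    });
    if (!response.ok) setError("Payment failed. Please try again.");
  };
  // Minimal markup: the card field is rendered inside an iframe hosted by Stripe
  return (
    <form onSubmit={handleSubmit}>
      <CardElement />
      {error && <p role="alert">{error}</p>}
      <button type="submit" disabled={!stripe}>Pay</button>
    </form>
  );
}
export function CheckoutPage() {
  return (
    <Elements stripe={stripePromise}>
      <CheckoutForm />
    </Elements>
  );
}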
```
**PCI scope impact:**
| Approach | SAQ Type | Effort |
| ----------------------- | ---------------------- | ------------------------- |
| Direct card handling | SAQ-D (400+ questions) | Months of compliance work |
| Stripe.js iframe | SAQ-A (22 questions) | Hours |
| Redirect to hosted page | SAQ-A | Minimal |
### 3D Secure Challenge Handling
```typescript collapse={1-10, 45-55}
// 3ds-handler.ts
import { loadStripe } from "@stripe/stripe-js"
export async function handle3DSChallenge(
  clientSecret: string,
  paymentMethodId: string,
): Promise<{ success: boolean; error?: string }> {
  const stripe = await loadStripe(process.env.NEXT_PUBLIC_STRIPE_KEY!)
  if (!stripe) {
    return { success: false, error: "Stripe.js failed to load" }
  }
  // This opens the 3DS challenge in a modal/redirect
  const { error, paymentIntent } = await stripe.confirmCardPayment(clientSecret, {
    payment_method: paymentMethodId,
  })
  if (error) {
    // 3DS failed or user canceled
    return { success: false, error: error.message }
  }
  if (paymentIntent.status === "succeeded") {
    return { success: true }
  }
  if (paymentIntent.status === "requires_action" && paymentIntent.client_secret) {
    // Additional action needed (rare edge case)
    return await handle3DSChallenge(paymentIntent.client_secret, paymentMethodId)
  }
  return { success: false, error: "Unexpected payment status" }
}
```
### Error Handling UX
```typescript
// payment-errors.ts
const ERROR_MESSAGES: Record<string, string> = {
card_declined: "Your card was declined. Please try a different card.",
insufficient_funds: "Insufficient funds. Please try a different card.",
expired_card: "Your card has expired. Please update your payment method.",
incorrect_cvc: "The CVC code is incorrect. Please check and try again.",
processing_error: "A processing error occurred. Please try again.",
rate_limited: "Too many attempts. Please wait a moment and try again.",
}
export function getErrorMessage(declineCode: string): string {
return ERROR_MESSAGES[declineCode] || "Payment failed. Please try again."
}
```
## Infrastructure Design
### Cloud-Agnostic Components
| Component | Purpose | Requirements |
| ------------------- | ------------------------- | ---------------------------------- |
| API Gateway | Rate limiting, auth | High availability, DDoS protection |
| Application servers | Payment processing | Horizontal scaling, idempotent |
| Primary database | Payments, ledger | ACID, strong consistency |
| Cache | Idempotency, sessions | Sub-ms latency |
| Message queue | Async processing | Exactly-once, durable |
| Event stream | Audit, analytics | High throughput, retention |
| Secrets manager | API keys, encryption keys | HSM-backed, audit logging |
### AWS Reference Architecture

**Service configuration:**
| Service | Configuration | Rationale |
| ----------------- | ---------------------------------- | ------------------------ |
| RDS PostgreSQL | db.r6g.xlarge, Multi-AZ, encrypted | ACID, HA, compliance |
| ElastiCache Redis | r6g.large, cluster mode, 3 nodes | Idempotency, low latency |
| ECS Fargate | 2 vCPU, 4GB, auto-scaling 2-20 | Predictable performance |
| SQS FIFO | 3,000 msg/sec (with batching) | Exactly-once processing for webhooks |
| KMS | Customer-managed keys | Encryption key control |
| CloudWatch | 1-minute metrics, alarms | Observability |
### Self-Hosted Alternatives
| Managed Service | Self-Hosted | Trade-off |
| --------------- | -------------------- | ---------------------------------- |
| RDS PostgreSQL | PostgreSQL on EC2 | More control, operational burden |
| ElastiCache | Redis Cluster on EC2 | Cost at scale |
| SQS FIFO | Kafka | Higher throughput, complexity |
| Secrets Manager | HashiCorp Vault | Full control, operational overhead |
## Variations
### ACH/Bank Transfer Processing
ACH has fundamentally different timing and failure modes than cards.
```typescript
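// ach-payment.ts
import Stripe from "stripe"
import { addBusinessDays } from "date-fns"
// STRIPE_SECRET_KEY is an assumed environment variable name
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!)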
interface ACHPayment {
routing_number: string
account_number: string // tokenized
account_type: "checking" | "savings"
}
export async function initiateACHPayment(
payment: ACHPayment,
amount: number,
): Promise<{ status: string; estimated_settlement: Date }> {
// ACH is batch-processed, not real-time
const transfer = await stripe.paymentIntents.create({
amount,
currency: "usd",
payment_method_types: ["us_bank_account"],
payment_method_data: {
type: "us_bank_account",
us_bank_account: {
routing_number: payment.routing_number,
account_number: payment.account_number,
account_holder_type: "individual",
},
},
})
return {
status: "processing",
estimated_settlement: addBusinessDays(new Date(), 3), // Standard ACH estimate; Same-Day ACH settles on the submission day
}
}
```
**ACH timing:**
| Type | Submission Deadline | Settlement |
| ------------ | ------------------- | ----------------- |
| Same-Day ACH | 4:45 PM ET (last window) | Same day by 6:00 PM ET |
| Next-Day ACH | 2:15 PM ET | Next business day |
| Standard ACH | Varies | 2-3 business days |
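The `addBusinessDays` call in the example above could come from a date library. As a rough, dependency-free sketch of what such a helper might do, the function below skips Saturdays and Sundays only. It is an illustration under stated assumptions: it works in UTC rather than Eastern Time and ignores bank holidays, both of which a real settlement estimate would need to handle.

```typescript
// Adds business days to a date, skipping weekends.
// Assumptions: UTC calendar days, no bank-holiday calendar.
function addBusinessDays(start: Date, days: number): Date {
  const result = new Date(start.getTime())
  let remaining = days
  while (remaining > 0) {
    result.setUTCDate(result.getUTCDate() + 1)
    const day = result.getUTCDay() // 0 = Sunday, 6 = Saturday
    if (day !== 0 && day !== 6) {
      remaining -= 1
    }
  }
  return result
}

// A payment initiated on Friday 2024-03-15 with a 3-business-day estimate
const settlement = addBusinessDays(new Date("2024-03-15T12:00:00Z"), 3)
console.log(settlement.toISOString().slice(0, 10))
```

The weekend skip is why a Friday submission lands on the following Wednesday rather than Monday.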
**ACH failure modes:**
- NSF (Non-Sufficient Funds): Returns after 2-3 days
- Account closed: Returns after settlement attempt
- Invalid account: Returns within 24 hours
### Subscription/Recurring Billing
```typescript collapse={1-12, 50-65}
// subscription-service.ts
interface Subscription {
id: string
customer_id: string
plan_id: string
status: "active" | "past_due" | "canceled"
current_period_end: Date
payment_method_id: string
}
const RETRY_SCHEDULE = [1, 3, 5, 7] // Days after failure
export async function processSubscriptionRenewal(subscription: Subscription, attemptNumber = 0): Promise<void> {
  const plan = await getPlan(subscription.plan_id)
  try {
    const payment = await createPayment({
      amount: plan.amount_cents,
      currency: plan.currency,
      customer_id: subscription.customer_id,
      payment_method_id: subscription.payment_method_id,
      // The attempt number keeps a retry from replaying the cached result of the failed attempt
      idempotency_key: `sub_${subscription.id}_${subscription.current_period_end.toISOString()}_${attemptNumber}`,
    })
    if (payment.status === "succeeded") {
      await extendSubscription(subscription.id)
    }
  } catch (error) {
    // Non-decline errors (network, processor outage) propagate to the caller
    if (!isDeclinedError(error)) throw error
    if (attemptNumber === 0) {
      await markSubscriptionPastDue(subscription.id)
      await notifyCustomerPaymentFailed(subscription.customer_id)
    }
    if (attemptNumber < RETRY_SCHEDULE.length) {
      await scheduleRetry(subscription.id, RETRY_SCHEDULE[attemptNumber])
    } else {
      // Final attempt failed
      await cancelSubscription(subscription.id, "payment_failed")
      await notifyCustomerCanceled(subscription.customer_id)
    }
  }
}
// Invoked by the retry scheduler; attemptNumber is 1 for the first retry
async function handleRetry(subscriptionId: string, attemptNumber: number): Promise<void> {
  const subscription = await getSubscription(subscriptionId)
  await processSubscriptionRenewal(subscription, attemptNumber)
}
```
### Multi-Currency and FX
```typescript
// fx-service.ts
interface FXQuote {
from_currency: string
to_currency: string
rate: number
expires_at: Date
}
export async function getQuote(fromCurrency: string, toCurrency: string, amount: number): Promise<FXQuote> {
// FX rates are volatile, quote expires quickly
const rate = await fetchCurrentRate(fromCurrency, toCurrency)
return {
from_currency: fromCurrency,
to_currency: toCurrency,
rate,
expires_at: new Date(Date.now() + 60 * 1000), // 60 second quote
}
}
export async function captureWithFX(paymentId: string, quote: FXQuote): Promise<void> {
if (new Date() > quote.expires_at) {
throw new Error("FX quote expired")
}
// Lock in rate at capture time
await capturePayment(paymentId, {
fx_rate: quote.rate,
settlement_currency: quote.to_currency,
})
}
```
## Conclusion
Payment system design requires balancing competing concerns across multiple dimensions:
1. **Idempotency is non-negotiable** — Every payment operation must be safely retriable. Idempotency keys with request hashing prevent duplicate charges during network failures or client retries.
2. **PCI scope reduction saves engineering effort** — Edge tokenization (Stripe.js, Adyen Web Components) keeps card data off your servers, reducing PCI questionnaire from 400+ questions to 22.
3. **Fraud decisions must be fast** — Sub-100ms scoring using ML models (Stripe Radar processes 1000+ features per transaction) balances security with checkout conversion.
4. **Double-entry ledger ensures audit readiness** — Every fund movement (auth, capture, refund, settlement) recorded with debits equaling credits enables reconciliation and compliance.
5. **Smart routing optimizes cost and auth rates** — Multi-processor setups with intelligent routing can achieve 26%+ cost savings (Adyen US debit routing) while improving authorization rates.
**What this design optimizes for:**
- Zero duplicate charges (idempotency)
- PCI scope minimization (edge tokenization)
- High authorization rates (smart routing)
- Financial accuracy (double-entry ledger)
- Operational resilience (failover, reconciliation)
**What it sacrifices:**
- Simplicity (multi-processor complexity)
- Latency (fraud scoring adds ~100ms)
- Cost (third-party processor fees vs in-house)
**Known limitations:**
- Webhook reliability depends on processor—implement webhook replay mechanisms
- FX rates are volatile—quote expiration must be enforced
- Chargeback handling requires manual evidence gathering
- ACH failures surface days after initiation
## Appendix
### Prerequisites
- RESTful API design principles
- Database transactions and ACID guarantees
- Distributed systems basics (idempotency, exactly-once delivery)
- Basic understanding of card payment networks
### Terminology
- **PAN** (Primary Account Number): The card number, typically 16 digits (ISO/IEC 7812 allows up to 19)
- **PCI DSS** (Payment Card Industry Data Security Standard): Security standard for card data handling
- **SAQ** (Self-Assessment Questionnaire): PCI compliance verification form
- **Interchange**: Fee paid by acquirer to issuer on each transaction (1.4-2.6% typical)
- **Authorization**: Hold placed on cardholder's available credit
- **Capture**: Finalization of authorized transaction for settlement
- **Settlement**: Transfer of funds from issuer to acquirer to merchant
- **3D Secure**: Protocol for authenticating card-not-present transactions
- **SCA** (Strong Customer Authentication): EU PSD2 requirement for two-factor auth
- **ACH** (Automated Clearing House): US bank-to-bank transfer network
- **Chargeback**: Disputed transaction reversed by the issuing bank through the card network
### Summary
- Payment systems require **idempotent operations** with 24-hour key retention (Stripe's standard)
- **Edge tokenization** (Stripe.js/Adyen Web Components) reduces PCI scope from SAQ-D to SAQ-A
- **Fraud scoring** must complete in <100ms, evaluating 1000+ features per transaction
- **Double-entry ledger** tracks every fund movement: authorization holds, captures, refunds, settlements
- **Smart routing** across multiple processors can achieve 26% cost savings on US debit
- **Reconciliation** matches internal ledger against processor settlements daily
- **3D Secure 2.0** enables frictionless authentication with 85% checkout time reduction
### References
- [Stripe API Documentation](https://docs.stripe.com/api) - Comprehensive payment API reference
- [Stripe: Designing robust APIs with idempotency](https://stripe.com/blog/idempotency) - Idempotency key implementation patterns
- [Stripe Radar: Machine learning for fraud detection](https://stripe.com/guides/primer-on-machine-learning-for-fraud-protection) - ML fraud scoring architecture
- [PCI Security Standards Council](https://www.pcisecuritystandards.org/) - PCI DSS 4.0 requirements
- [Adyen: Intelligent Payment Routing](https://www.adyen.com/press-and-media/adyens-intelligent-payment-routing-usdebit) - Smart routing cost savings data
- [Stripe: 3D Secure 2 Guide](https://stripe.com/guides/3d-secure-2) - 3DS implementation patterns
- [Nacha: How ACH Payments Work](https://www.nacha.org/content/how-ach-payments-work) - ACH network specifications
- [Visa: Credit Card Processing](https://usa.visa.com/support/small-business/regulations-fees.html) - Card network processing details
- [Martin Fowler: Patterns of Enterprise Application Architecture](https://martinfowler.com/eaaCatalog/) - Double-entry accounting patterns
---
## Design a Flash Sale System
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/system-design-problems/design-flash-sale-system
**Category:** System Design / System Design Problems
**Description:** Building a system to handle millions of concurrent users competing for limited inventory during time-bounded sales events. Flash sales present a unique challenge: extreme traffic spikes (10-100x normal) concentrated in seconds, with zero tolerance for inventory errors. This design covers virtual waiting rooms, atomic inventory management, and asynchronous order processing.
# Design a Flash Sale System
Building a system to handle millions of concurrent users competing for limited inventory during time-bounded sales events. Flash sales present a unique challenge: extreme traffic spikes (10-100x normal) concentrated in seconds, with zero tolerance for inventory errors. This design covers virtual waiting rooms, atomic inventory management, and asynchronous order processing.

Flash sale system architecture: CDN-based waiting room absorbs traffic spike, queue service manages admission, Redis handles atomic inventory, message queue decouples order processing.
## Abstract
Flash sale design centers on three constraints working against each other:
1. **Traffic absorption** — Millions of users arriving in seconds cannot hit your database directly. A CDN-hosted waiting room absorbs the spike; a queue service meters admission at backend capacity.
2. **Inventory accuracy** — Overselling destroys trust. Redis Lua scripts provide atomic "check-and-decrement" operations. Pre-allocation (tokens = inventory) bounds the problem.
3. **Order durability under load** — Synchronous order processing cannot scale to 500K+ TPS. Asynchronous queues decouple order receipt from processing, with guaranteed delivery.
The mental model: **waiting room → token gate → atomic inventory → async order queue**. Each layer handles one constraint and shields the next.
| Design Decision | Tradeoff |
| ---------------------- | --------------------------------------------------------- |
| CDN waiting room | Absorbs traffic cheaply; adds user-facing latency |
| Token-based admission | Prevents overselling; requires pre-allocation |
| Redis atomic counters | Sub-millisecond inventory checks; single point of failure |
| Async order processing | Handles 100x normal load; delayed confirmation |
## Requirements
### Functional Requirements
| Feature | Scope | Notes |
| --------------------------- | -------- | ------------------------------------- |
| Virtual waiting room | Core | Absorbs traffic spike before backend |
| Queue management | Core | FIFO admission with position tracking |
| Inventory reservation | Core | Atomic decrement, no overselling |
| Order placement | Core | Async processing with durability |
| Bot detection | Core | Multi-layer defense |
| Payment processing | Core | Idempotent, timeout-aware |
| Order confirmation | Core | Email/push notification |
| Purchase limits | Extended | 1-2 units per customer |
| VIP early access | Extended | Tiered queue priority |
| Real-time inventory display | Extended | Eventually consistent display |
### Non-Functional Requirements
| Requirement | Target | Rationale |
| ----------------------- | --------- | ------------------------------------------------------------------- |
| Availability | 99.99% | Revenue impact; Alibaba achieved "zero downtime" during Singles Day |
| Waiting room latency | < 100ms | Static CDN, must feel instant |
| Inventory check latency | < 50ms | Critical path, Redis required |
| Checkout latency | < 5s | User-acceptable; async processing hides backend |
| Queue position accuracy | Real-time | Trust requires visible progress |
| Inventory accuracy | 100% | Zero tolerance for overselling |
| Order durability | Zero loss | Queued orders must survive failures |
### Scale Estimation
**Traffic Profile:**
| Metric | Normal | Flash Sale Peak | Multiplier |
| -------------------- | ------- | --------------- | ---------- |
| Concurrent users | 100K | 10M | 100x |
| Page requests/sec | 10K RPS | 1M RPS | 100x |
| Inventory checks/sec | 1K RPS | 500K RPS | 500x |
| Orders/sec | 100 TPS | 10K TPS | 100x |
**Back-of-envelope (1M users, 10K inventory):**
```
Users arriving in first minute: 1,000,000
Waiting room page views: 1M × 3 refreshes = 3M requests/min = 50K RPS
Queue status checks: 1M × 1 check/5sec = 200K RPS
Inventory checks (admitted users): 50K users admitted × 1 check = 50K RPS spike
Orders attempted: 50K (not all convert)
Orders completed: 10K (inventory limit)
```
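The same arithmetic can be kept as a small function so the inputs are easy to vary. This is a sketch only: the function and field names are illustrative, and the inputs mirror the assumptions above (3 refreshes per user in the first minute, one status poll every 5 seconds).

```typescript
// Back-of-envelope load estimate for the first minute of a flash sale.
interface LoadInputs {
  users: number // users arriving in the first minute
  refreshesPerUser: number // waiting-room page views per user in that minute
  pollIntervalSeconds: number // seconds between queue status checks
}

interface LoadEstimate {
  waitingRoomRps: number
  queueStatusRps: number
}

function estimateLoad(inputs: LoadInputs): LoadEstimate {
  return {
    // Page views are spread over 60 seconds
    waitingRoomRps: (inputs.users * inputs.refreshesPerUser) / 60,
    // Every user polls once per interval
    queueStatusRps: inputs.users / inputs.pollIntervalSeconds,
  }
}

const loadEstimate = estimateLoad({ users: 1000000, refreshesPerUser: 3, pollIntervalSeconds: 5 })
console.log(loadEstimate) // waitingRoomRps: 50000, queueStatusRps: 200000
```

Queue status polling, not page views, dominates request volume here, which is why the poll interval is the first knob to turn.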
**Storage:**
```
Queue state: 1M users × 100 bytes = 100MB (Redis)
Order records: 10K orders × 5KB = 50MB (PostgreSQL)
Event logs: 10M events × 200 bytes = 2GB/sale
```
## Design Paths
### Path A: Pre-Allocation Model (Token-Based)
**Best when:**
- Fixed, known inventory quantity
- Fairness is paramount (ticketing, limited editions)
- High-value items where overselling is catastrophic
**Architecture:**

**Key characteristics:**
- Tokens generated equal to inventory before sale starts
- Each admitted user receives one token
- Token guarantees checkout opportunity (not purchase—user may abandon)
- Token expires if unused (returns to pool)
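These characteristics can be sketched as a small in-memory token pool. This is an illustration under simplifying assumptions (single process, caller-supplied clock, hypothetical class and method names), not how a distributed deployment would store tokens.

```typescript
// In-memory sketch of a token pool sized to inventory.
interface IssuedToken {
  userId: string
  expiresAt: number // epoch milliseconds
}

class TokenPool {
  private available: number
  private ttlMs: number
  private issued: Record<string, IssuedToken> = {}
  private nextId = 0

  constructor(inventory: number, ttlMs: number) {
    this.available = inventory // tokens generated equal to inventory
    this.ttlMs = ttlMs
  }

  // Returns a token id, or null when every token is currently held or consumed.
  issue(userId: string, now: number): string | null {
    this.reclaimExpired(now)
    if (this.available === 0) return null
    this.available -= 1
    const tokenId = `ct_${this.nextId}`
    this.nextId += 1
    this.issued[tokenId] = { userId, expiresAt: now + this.ttlMs }
    return tokenId
  }

  // A completed purchase consumes the token permanently.
  redeem(tokenId: string, now: number): boolean {
    const token = this.issued[tokenId]
    if (!token || token.expiresAt <= now) return false
    delete this.issued[tokenId]
    return true
  }

  // Unused tokens return to the pool once their TTL has passed.
  private reclaimExpired(now: number): void {
    for (const tokenId of Object.keys(this.issued)) {
      const token = this.issued[tokenId]
      if (token && token.expiresAt <= now) {
        delete this.issued[tokenId]
        this.available += 1
      }
    }
  }
}
```

Because `issue` never hands out more live tokens than there is inventory, overselling is ruled out before checkout even begins; the cost is that an abandoned token blocks one unit until its TTL passes.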
**Trade-offs:**
- :white_check_mark: Zero overselling by construction
- :white_check_mark: Predictable admission rate
- :white_check_mark: Fair (FIFO or randomized entry)
- :x: Requires accurate inventory count pre-sale
- :x: Token management complexity (expiration, reclaim)
- :x: Abandoned tokens reduce conversion
**Real-world example:** SeatGeek uses token-based admission for concert ticket sales. Lambda + DynamoDB manages token lifecycle; tokens expire on purchase or 15-minute timeout, returning to the pool for the next user in queue.
### Path B: Real-Time Inventory Model (Counter-Based)
**Best when:**
- Dynamic inventory (multiple warehouses, restocking)
- E-commerce flash sales with variable stock
- Lower-stakes items where occasional overselling is recoverable
**Architecture:**

**Key characteristics:**
- No pre-allocation; inventory checked in real-time
- Atomic decrement at checkout (not admission)
- Rate limiting protects backend; doesn't guarantee purchase
- Inventory can be restocked mid-sale
**Trade-offs:**
- :white_check_mark: Handles dynamic inventory
- :white_check_mark: Simpler pre-sale setup
- :white_check_mark: Can restock mid-sale
- :x: Overselling risk if counter and order processing desync
- :x: Users admitted without guarantee (frustration)
- :x: Thundering herd on inventory service if rate limiting fails
**Real-world example:** Alibaba Singles Day uses Redis atomic counters with Lua scripts. Product ID = key, inventory = value. Lua script performs atomic `GET + DECR` in single operation. Handles 583K operations/second with careful sharding.
### Path Comparison
| Factor | Path A (Token) | Path B (Counter) |
| ----------------- | --------------------------- | ----------------------------- |
| Overselling risk | Zero | Low (with proper atomicity) |
| Setup complexity | Higher | Lower |
| Dynamic inventory | Difficult | Native |
| User expectation | Guaranteed opportunity | Best effort |
| Fairness | Explicit (token order) | Implicit (first to checkout) |
| Best for | Ticketing, limited releases | E-commerce, restockable goods |
### This Article's Focus
This article implements **Path A (Token-Based)** for the core flow because:
1. Flash sales typically have fixed, high-value inventory
2. Fairness is a differentiator (users accept waiting if fair)
3. Zero overselling is non-negotiable for most use cases
Path B implementation details are covered in the [Variations](#variations) section.
## High-Level Design
### Component Overview
| Component | Responsibility | Technology |
| -------------------- | -------------------------------------------- | ------------------------ |
| Virtual Waiting Room | Absorb traffic spike, display queue position | Static HTML on CDN |
| Queue Service | Manage admission, assign tokens | Lambda + DynamoDB |
| Inventory Service | Atomic inventory operations | Redis Cluster |
| Order Service | Process orders asynchronously | ECS + SQS |
| Payment Service | Handle payments idempotently | Stripe/Adyen integration |
| Notification Service | Send confirmations | SES + SNS |
| Bot Detection | Filter non-human traffic | WAF + Custom rules |
### Request Flow

## API Design
### Queue Service APIs
#### Join Queue
```http
POST /api/v1/queue/join
Authorization: Bearer {user_token}
X-Device-Fingerprint: {fingerprint}
{
"sale_id": "flash-sale-2024-001",
"product_ids": ["sku-001", "sku-002"]
}
```
**Response (202 Accepted):**
```json
{
"queue_ticket": "qt_abc123xyz",
"position": 15234,
"estimated_wait_seconds": 180,
"status_url": "/api/v1/queue/status/qt_abc123xyz"
}
```
**Error responses:**
- `400 Bad Request`: Invalid sale_id or product not in flash sale
- `403 Forbidden`: Bot detected or user already in queue
- `429 Too Many Requests`: Rate limit exceeded
#### Check Queue Status
```http
GET /api/v1/queue/status/{queue_ticket}
```
**Response (200 OK):**
```json
{
"queue_ticket": "qt_abc123xyz",
"status": "waiting",
"position": 8234,
"estimated_wait_seconds": 90,
"poll_interval_seconds": 5
}
```
**Status values:**
- `waiting`: In queue, not yet admitted
- `admitted`: Token assigned, can proceed to checkout
- `expired`: Waited too long, removed from queue
- `completed`: Purchased or abandoned checkout
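The four status values imply a small state machine. A minimal sketch of a transition check follows; the allowed transitions are an assumption inferred from the descriptions above (for example, that an admitted user whose token times out moves to `expired`), and the names are illustrative.

```typescript
type QueueStatus = "waiting" | "admitted" | "expired" | "completed"

// Assumed transition table: terminal states have no outgoing transitions.
const ALLOWED_TRANSITIONS: Record<QueueStatus, QueueStatus[]> = {
  waiting: ["admitted", "expired"],
  admitted: ["completed", "expired"],
  expired: [],
  completed: [],
}

function canTransition(from: QueueStatus, to: QueueStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1
}

console.log(canTransition("waiting", "admitted")) // true
console.log(canTransition("completed", "waiting")) // false
```

Enforcing the table on every status write keeps a late poll or a duplicate admission event from moving a ticket backwards.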
#### Token Admission (Internal)
When user reaches front of queue:
```json
{
"queue_ticket": "qt_abc123xyz",
"status": "admitted",
"checkout_token": "ct_xyz789abc",
"checkout_url": "/checkout?token=ct_xyz789abc",
"token_expires_at": "2024-03-15T10:05:00Z"
}
```
### Checkout Service APIs
#### Start Checkout Session
```http
POST /api/v1/checkout/start
Authorization: Bearer {user_token}
{
"checkout_token": "ct_xyz789abc",
"product_id": "sku-001",
"quantity": 1
}
```
**Response (201 Created):**
```json
{
"session_id": "cs_def456",
"reserved_until": "2024-03-15T10:05:00Z",
"product": {
"id": "sku-001",
"name": "Limited Edition Sneaker",
"price": 299.0,
"currency": "USD"
},
"next_step": "payment"
}
```
**Error responses:**
- `400 Bad Request`: Invalid token or product
- `409 Conflict`: Token already used
- `410 Gone`: Token expired
- `422 Unprocessable`: Inventory exhausted (token invalid)
#### Submit Order
```http
POST /api/v1/orders
Authorization: Bearer {user_token}
{
"session_id": "cs_def456",
"shipping_address": {
"line1": "123 Main St",
"city": "San Francisco",
"state": "CA",
"postal_code": "94102",
"country": "US"
},
"payment_method_id": "pm_card_visa"
}
```
**Response (202 Accepted):**
```json
{
"order_id": "ord_789xyz",
"status": "processing",
"estimated_confirmation": "< 60 seconds",
"tracking_url": "/api/v1/orders/ord_789xyz"
}
```
**Design note:** Returns 202 (not 201) because order processing is asynchronous. The order is durably queued but not yet confirmed.
### Pagination Strategy
Queue status uses cursor-based polling, not traditional pagination:
```json
{
"position": 1234,
"poll_interval_seconds": 5,
"next_poll_after": "2024-03-15T10:01:05Z"
}
```
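**Rationale:** Queue position changes continuously, so the server returns a recommended poll interval with each response. The interval shortens as the user nears the front, where admission is imminent, and lengthens further back to reduce total request volume.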
## Data Modeling
### Queue State (DynamoDB)
```
Table: FlashSaleQueue
Partition Key: sale_id
Sort Key: queue_ticket
Attributes:
- user_id: string
- position: number (GSI for ordering)
- status: enum [waiting, admitted, expired, completed]
- joined_at: ISO8601
- admitted_at: ISO8601 | null
- checkout_token: string | null
- token_expires_at: ISO8601 | null
- device_fingerprint: string
- ip_address: string
```
**GSI:** `sale_id-position-index` for efficient position lookups.
**Why DynamoDB:** Single-digit millisecond latency at any scale, automatic scaling, TTL for expired entries.
### Inventory Counter (Redis)
```redis
# Inventory count per product
SET inventory:sku-001 10000
# Atomic decrement with Lua script
EVAL "
local count = tonumber(redis.call('GET', KEYS[1]) or 0)
if count > 0 then
return redis.call('DECR', KEYS[1])
else
return -1
end
" 1 inventory:sku-001
```
**Why Lua script:** `GET` and `DECR` must be atomic. Without Lua, two concurrent requests could both see `count=1` and both decrement, causing overselling.
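The interleaving can be made concrete without Redis. The sketch below is a deterministic simulation rather than real concurrency, and the function names are illustrative: the first function models two requests that both read the counter before either writes, and the second models a check-and-decrement that runs as one indivisible step.

```typescript
interface RaceOutcome {
  sold: number
  remaining: number
}

// Both requests read the stock before either one writes it back.
function nonAtomicRace(initialStock: number): RaceOutcome {
  let stock = initialStock
  const seenByA = stock
  const seenByB = stock
  let sold = 0
  if (seenByA > 0) {
    stock -= 1
    sold += 1
  }
  if (seenByB > 0) {
    stock -= 1
    sold += 1
  }
  return { sold, remaining: stock }
}

// Each request performs its check and decrement as a single step.
function atomicRace(initialStock: number): RaceOutcome {
  let stock = initialStock
  let sold = 0
  const checkAndDecrement = (): boolean => {
    if (stock > 0) {
      stock -= 1
      return true
    }
    return false
  }
  if (checkAndDecrement()) sold += 1
  if (checkAndDecrement()) sold += 1
  return { sold, remaining: stock }
}

console.log(nonAtomicRace(1)) // { sold: 2, remaining: -1 }
console.log(atomicRace(1)) // { sold: 1, remaining: 0 }
```

With one unit left, the non-atomic version sells two and drives the counter negative, which is exactly the oversell the script is there to prevent.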
### Token Registry (Redis)
```redis
# Token → user mapping with TTL
SETEX token:ct_xyz789abc 300 "user_123"
# Used tokens (prevent replay)
SADD used_tokens:flash-sale-2024-001 ct_xyz789abc
```
**TTL:** 5 minutes for checkout tokens. Expired tokens return to the pool.
### Order Schema (PostgreSQL)
```sql
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
sale_id VARCHAR(50) NOT NULL,
checkout_token VARCHAR(100) NOT NULL UNIQUE,
status VARCHAR(20) DEFAULT 'pending',
-- Order details
product_id VARCHAR(50) NOT NULL,
quantity INT NOT NULL DEFAULT 1,
unit_price DECIMAL(10,2) NOT NULL,
total_amount DECIMAL(10,2) NOT NULL,
currency VARCHAR(3) DEFAULT 'USD',
-- Shipping
shipping_address JSONB NOT NULL,
-- Payment
payment_intent_id VARCHAR(100),
payment_status VARCHAR(20),
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
confirmed_at TIMESTAMPTZ,
-- Idempotency
idempotency_key VARCHAR(100) UNIQUE
);
CREATE INDEX idx_orders_user ON orders(user_id, created_at DESC);
CREATE INDEX idx_orders_sale ON orders(sale_id, status);
CREATE INDEX idx_orders_payment ON orders(payment_intent_id);
```
**Idempotency key:** Prevents duplicate orders if user retries during network issues. Typically `{user_id}:{checkout_token}`.
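The effect of the `UNIQUE` constraint on `idempotency_key` can be illustrated with an in-memory stand-in for the table. The class below is a hypothetical sketch, not the article's order service: a retry with the same user and checkout token gets the original order back instead of creating a second one.

```typescript
interface StoredOrder {
  id: string
  userId: string
  checkoutToken: string
}

class OrderStore {
  private byIdempotencyKey: Record<string, StoredOrder> = {}
  private count = 0

  // Mirrors a unique index: the first write wins, later writes with the same key are no-ops.
  create(userId: string, checkoutToken: string): { order: StoredOrder; created: boolean } {
    const key = `${userId}:${checkoutToken}`
    if (Object.prototype.hasOwnProperty.call(this.byIdempotencyKey, key)) {
      return { order: this.byIdempotencyKey[key], created: false }
    }
    this.count += 1
    const order: StoredOrder = { id: `ord_${this.count}`, userId, checkoutToken }
    this.byIdempotencyKey[key] = order
    return { order, created: true }
  }
}

const store = new OrderStore()
const firstAttempt = store.create("user_123", "ct_xyz789abc")
const retryAttempt = store.create("user_123", "ct_xyz789abc") // e.g. client retried after a timeout
console.log(firstAttempt.order.id === retryAttempt.order.id) // true
```

In PostgreSQL the same outcome would come from the unique-violation path rather than a lookup, but the contract seen by the client is the same: one order per key.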
### Database Selection Matrix
| Data | Store | Rationale |
| ------------------ | ------------- | ---------------------------------------- |
| Queue state | DynamoDB | Single-digit ms latency, auto-scale, TTL |
| Inventory counters | Redis Cluster | Sub-ms atomic operations |
| Tokens | Redis | TTL, fast lookup |
| Orders | PostgreSQL | ACID, complex queries, durability |
| Event logs | Kinesis → S3 | High throughput, analytics |
| User sessions | Redis | Fast auth checks |
## Low-Level Design
### Virtual Waiting Room
The waiting room is the first line of defense. It must:
1. Absorb millions of requests without backend load
2. Provide fair queue positioning
3. Communicate progress transparently
**Architecture:**

**Static HTML design:**
```html collapse={1-10, 25-30}
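<!doctype html>
<html lang="en">
<head><title>Flash Sale - Please Wait</title></head>
<body>
<h1>You're in the queue</h1>
<p>Position: --</p>
<p>Estimated wait: --</p>
<p>Please keep this tab open</p>
<p hidden>Redirecting to checkout...</p>
</body>
</html>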
```
**Queue polling logic:**
```typescript collapse={1-8, 40-50}
// queue-client.ts
interface QueueStatus {
status: "waiting" | "admitted" | "expired"
position?: number
estimated_wait_seconds?: number
checkout_url?: string
poll_interval_seconds: number
}
async function pollQueueStatus(ticket: string): Promise<void> {
  const response = await fetch(`/api/v1/queue/status/${ticket}`)
  if (!response.ok) {
    // Transient failure: keep polling rather than dropping the user
    setTimeout(() => pollQueueStatus(ticket), 5000)
    return
  }
  const status: QueueStatus = await response.json()
  switch (status.status) {
    case "waiting": {
      updateUI(status.position, status.estimated_wait_seconds)
      // The server sets the interval and shortens it near the front of the queue
      const interval = status.poll_interval_seconds * 1000
      setTimeout(() => pollQueueStatus(ticket), interval)
      break
    }
    case "admitted": {
      showRedirectNotice()
      const checkoutUrl = status.checkout_url
      // Small delay for user to see the message
      setTimeout(() => {
        if (checkoutUrl) window.location.href = checkoutUrl
      }, 1500)
      break
    }
    case "expired":
      showExpiredMessage()
      break
  }
}
// Start polling on page load
const ticket = new URLSearchParams(window.location.search).get("ticket")
if (ticket) {
pollQueueStatus(ticket)
}
```
**Design decisions:**
| Decision | Rationale |
| ------------------- | ----------------------------------------------------------------------- |
| Static HTML on CDN | Millions of users hitting origin would saturate it; CDN absorbs at edge |
| Client-side polling | Push (WebSocket) at this scale requires massive connection management |
| Adaptive poll interval | Users near front poll more frequently, users further back less often; reduces total requests |
| No refresh needed | Single-page polling prevents users from losing position by refreshing |
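How the server picks `poll_interval_seconds` is not shown above. One possible policy, sketched here with assumed bounds (2 to 30 seconds) and an assumed target of roughly ten polls over the remaining wait, derives the interval from queue position and the current admission rate. The function name and constants are hypothetical.

```typescript
// Hypothetical server-side policy: poll faster as the user nears the front of the queue.
function pollIntervalSeconds(position: number, admissionsPerSecond: number): number {
  const MIN_INTERVAL = 2
  const MAX_INTERVAL = 30
  if (admissionsPerSecond <= 0) return MAX_INTERVAL // admission paused: poll slowly
  const estimatedWaitSeconds = position / admissionsPerSecond
  // Aim for about ten polls over the remaining wait, within the bounds
  const interval = Math.round(estimatedWaitSeconds / 10)
  return Math.min(MAX_INTERVAL, Math.max(MIN_INTERVAL, interval))
}

console.log(pollIntervalSeconds(15234, 83)) // 18
console.log(pollIntervalSeconds(100, 83)) // 2
```

Clamping at both ends keeps users at the very back from going silent for minutes and users at the front from hammering the status endpoint.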
### Queue Service (Token Management)
The queue service manages the FIFO queue and token assignment.
**Lambda handler:**
```typescript collapse={1-15, 60-75}
// queue-service.ts
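import { DynamoDB } from "@aws-sdk/client-dynamodb"
import { DynamoDBDocument } from "@aws-sdk/lib-dynamodb"
import Redis from "ioredis"
const ddb = DynamoDBDocument.from(new DynamoDB({}))
// Redis client for checkout-token lookups; REDIS_URL is an assumed environment variable
const redis = new Redis(process.env.REDIS_URL!)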
interface QueueEntry {
sale_id: string
queue_ticket: string
user_id: string
position: number
status: "waiting" | "admitted" | "expired" | "completed"
checkout_token?: string
}
export async function joinQueue(
saleId: string,
userId: string,
deviceFingerprint: string,
): Promise<{ ticket: string; position: number }> {
// Check if user already in queue
const existing = await findUserInQueue(saleId, userId)
if (existing) {
return { ticket: existing.queue_ticket, position: existing.position }
}
// Get current queue length (approximate, for position)
const position = await getNextPosition(saleId)
const ticket = generateTicket()
await ddb.put({
TableName: "FlashSaleQueue",
Item: {
sale_id: saleId,
queue_ticket: ticket,
user_id: userId,
position: position,
status: "waiting",
joined_at: new Date().toISOString(),
device_fingerprint: deviceFingerprint,
ttl: Math.floor(Date.now() / 1000) + 3600, // 1 hour TTL
},
ConditionExpression: "attribute_not_exists(queue_ticket)",
})
return { ticket, position }
}
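export async function admitNextUsers(saleId: string, count: number): Promise<void> {
  // Invoked on a fixed schedule (e.g., every second); admits up to 'count' users from the front of the queue.
  // DynamoDB applies Limit before FilterExpression, so page through the index
  // until enough waiting entries have been collected.
  const toAdmit: QueueEntry[] = []
  let startKey: Record<string, any> | undefined
  do {
    const page = await ddb.query({
      TableName: "FlashSaleQueue",
      IndexName: "sale_id-position-index",
      KeyConditionExpression: "sale_id = :sid",
      FilterExpression: "#status = :waiting",
      ExpressionAttributeNames: { "#status": "status" },
      ExpressionAttributeValues: {
        ":sid": saleId,
        ":waiting": "waiting",
      },
      Limit: count,
      ScanIndexForward: true, // Ascending by position (FIFO)
      ExclusiveStartKey: startKey,
    })
    toAdmit.push(...((page.Items ?? []) as QueueEntry[]))
    startKey = page.LastEvaluatedKey
  } while (toAdmit.length < count && startKey)
  for (const entry of toAdmit.slice(0, count)) {
    await admitUser(entry)
  }
}
async function admitUser(entry: QueueEntry): Promise<void> {
  const token = generateCheckoutToken()
  const expiresAt = new Date(Date.now() + 5 * 60 * 1000) // 5 min
  try {
    await ddb.update({
      TableName: "FlashSaleQueue",
      Key: { sale_id: entry.sale_id, queue_ticket: entry.queue_ticket },
      UpdateExpression: "SET #status = :admitted, checkout_token = :token, token_expires_at = :exp",
      // Guard against admitting the same entry twice from overlapping invocations
      ConditionExpression: "#status = :waiting",
      ExpressionAttributeNames: { "#status": "status" },
      ExpressionAttributeValues: {
        ":admitted": "admitted",
        ":waiting": "waiting",
        ":token": token,
        ":exp": expiresAt.toISOString(),
      },
    })
  } catch (err) {
    if ((err as Error).name === "ConditionalCheckFailedException") return // already admitted elsewhere
    throw err
  }
  // Also store token in Redis for fast lookup during checkout
  await redis.setex(`token:${token}`, 300, entry.user_id)
}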
```
**Admission rate control:**
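The admission rate must match backend capacity, with `admitNextUsers` running once per second. EventBridge schedules have a one-minute minimum granularity, so a per-second cadence needs a loop inside the scheduled invocation or a long-running worker: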
```
Admission rate = min(backend_capacity, remaining_inventory / expected_checkout_time)
Example:
- Backend can handle 1000 checkouts/sec
- Remaining inventory: 5000
- Average checkout time: 60 seconds
- Admission rate: min(1000, 5000/60) = min(1000, 83) = 83 users/sec
```
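The formula translates directly into a function. This is a sketch with illustrative parameter names; the choice to round down and to admit at least one user per tick while stock remains is an assumption added here, not part of the formula above.

```typescript
// Users to admit per second, bounded by backend capacity and by how fast inventory can be consumed.
function admissionRate(
  backendCapacityPerSec: number,
  remainingInventory: number,
  avgCheckoutSeconds: number,
): number {
  if (remainingInventory <= 0) return 0 // sold out: stop admitting
  const inventoryBound = Math.floor(remainingInventory / avgCheckoutSeconds)
  // Assumption: keep admitting at least one user per tick while stock remains
  return Math.max(1, Math.min(backendCapacityPerSec, inventoryBound))
}

console.log(admissionRate(1000, 5000, 60)) // 83
```

Recomputing the rate on every tick lets admission slow down naturally as inventory runs out, rather than admitting thousands of users to compete for the last few units.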
**Design decisions:**
| Decision | Rationale |
| ------------------------- | ----------------------------------------------------------- |
| DynamoDB for queue | Handles millions of entries with single-digit ms latency |
| Position as GSI | Enables efficient "next N users" query |
| EventBridge for admission | Decouples admission rate from user requests |
| Token in Redis + DynamoDB | Redis for fast checkout validation; DynamoDB for durability |
### Inventory Service (Atomic Counters)
The inventory service prevents overselling through atomic operations.
**Redis Lua script for atomic reservation:**
```lua
-- reserve_inventory.lua
-- KEYS[1] = inventory key (e.g., "inventory:sku-001")
-- KEYS[2] = reserved set key (e.g., "reserved:sku-001")
-- ARGV[1] = user_id
-- ARGV[2] = quantity
-- ARGV[3] = reservation_id
-- ARGV[4] = ttl_seconds
local inventory_key = KEYS[1]
local reserved_key = KEYS[2]
local user_id = ARGV[1]
local quantity = tonumber(ARGV[2])
local reservation_id = ARGV[3]
local ttl = tonumber(ARGV[4])
-- Check current inventory
local available = tonumber(redis.call('GET', inventory_key) or 0)
if available < quantity then
  return redis.error_reply('insufficient_inventory')
end
-- Decrement; the script runs atomically, so nothing can interleave between the check and the write
local new_count = redis.call('DECRBY', inventory_key, quantity)
-- Track reservation for expiration
redis.call('HSET', reserved_key, reservation_id,
  cjson.encode({ user_id = user_id, quantity = quantity, created_at = redis.call('TIME')[1] }))
-- Note: EXPIRE applies to the whole hash, so each new reservation resets it
redis.call('EXPIRE', reserved_key, ttl)
-- Replies drop string-keyed table fields, so return a plain array
return { new_count, reservation_id }
```
**Inventory service implementation:**
```typescript collapse={1-12, 50-65}
// inventory-service.ts
import Redis from "ioredis"
import { readFileSync } from "fs"
const redis = new Redis.Cluster([
{ host: "redis-1.example.com", port: 6379 },
{ host: "redis-2.example.com", port: 6379 },
{ host: "redis-3.example.com", port: 6379 },
])
const reserveScript = readFileSync("./reserve_inventory.lua", "utf-8")
interface ReservationResult {
success: boolean
reservation_id?: string
remaining?: number
error?: string
}
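export async function reserveInventory(
  productId: string,
  userId: string,
  quantity: number,
  ttlSeconds: number = 300,
): Promise<ReservationResult> {
  const reservationId = `res_${Date.now()}_${userId}`
  try {
    // On Redis Cluster both keys must hash to the same slot; a shared hash tag
    // such as inventory:{sku-001} / reserved:{sku-001} is needed for that.
    const result = await redis.eval(
      reserveScript,
      2, // number of keys
      `inventory:${productId}`,
      `reserved:${productId}`,
      userId,
      quantity.toString(),
      reservationId,
      ttlSeconds.toString(),
    )
    // Scripts reply with arrays or scalars; read the remaining count when one is returned
    const remaining = Array.isArray(result) && result.length > 0 ? Number(result[0]) : undefined
    return { success: true, reservation_id: reservationId, remaining }
  } catch (err) {
    // Script error replies (e.g. insufficient_inventory) surface as rejected promises
    return { success: false, error: (err as Error).message }
  }
}
export async function releaseReservation(productId: string, reservationId: string): Promise<void> {
  // Called when checkout times out or user abandons
  const reserved = await redis.hget(`reserved:${productId}`, reservationId)
  if (!reserved) return
  const { quantity } = JSON.parse(reserved)
  // Only the caller that actually removes the entry restores stock, so a double release cannot double-credit
  const removed = await redis.hdel(`reserved:${productId}`, reservationId)
  if (removed === 1) {
    await redis.incrby(`inventory:${productId}`, quantity)
  }
}
export async function confirmReservation(productId: string, reservationId: string): Promise<void> {
  // Called after successful payment - just remove from reserved set
  await redis.hdel(`reserved:${productId}`, reservationId)
}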
```
**Reservation lifecycle:**

**Design decisions:**
| Decision | Rationale |
| --------------------- | ---------------------------------------------------- |
| Lua script | Atomic read-check-decrement prevents race conditions |
| Redis Cluster | Horizontal scaling for high throughput |
| Reservation with TTL | Prevents inventory lock-up from abandoned checkouts |
| Hash for reservations | O(1) lookup/delete by reservation ID |
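The reservation lifecycle can be modeled in memory to make the state transitions explicit. The sketch below is not the Redis implementation: `InventoryModel` and its method names are hypothetical, and the clock is passed in so that expiry is testable. It assumes a periodic sweep returns expired reservations to the pool, because a TTL on the reservation hash removes the bookkeeping without restoring the counter.

```typescript
// In-memory model of reserve -> confirm | release | expire (names are illustrative).
interface Reservation {
  quantity: number
  expiresAt: number // epoch milliseconds
}

class InventoryModel {
  private available: number
  private reservations = new Map<string, Reservation>()

  constructor(initialStock: number) {
    this.available = initialStock
  }

  // Check and decrement in one step; returns false when stock is insufficient.
  reserve(id: string, quantity: number, ttlMs: number, now: number): boolean {
    if (quantity <= 0 || this.reservations.has(id) || this.available < quantity) return false
    this.available -= quantity
    this.reservations.set(id, { quantity, expiresAt: now + ttlMs })
    return true
  }

  // Payment succeeded: drop the hold, stock stays decremented.
  confirm(id: string): boolean {
    return this.reservations.delete(id)
  }

  // Checkout abandoned or payment failed: return the units exactly once.
  release(id: string): boolean {
    const r = this.reservations.get(id)
    if (!r) return false
    this.reservations.delete(id)
    this.available += r.quantity
    return true
  }

  // Periodic sweep: release every reservation whose TTL has elapsed.
  sweepExpired(now: number): number {
    let released = 0
    this.reservations.forEach((r, id) => {
      if (r.expiresAt <= now && this.release(id)) released += r.quantity
    })
    return released
  }

  get remaining(): number {
    return this.available
  }
}
```

A second `release` for the same ID returns `false` and leaves the counter unchanged, which is the property that keeps retries from inflating stock.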
### Order Processing (Async Queue)
Orders are placed on a durable queue for async processing. This decouples order receipt from processing, preventing database overwhelm.
**Order submission flow:**
```typescript collapse={1-10, 55-70}
// order-service.ts
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs"
import { v4 as uuid } from "uuid"
const sqs = new SQSClient({})
const ORDER_QUEUE_URL = process.env.ORDER_QUEUE_URL!
interface OrderRequest {
session_id: string
user_id: string
product_id: string
quantity: number
shipping_address: Address
payment_method_id: string
}
export async function submitOrder(request: OrderRequest): Promise<{ order_id: string }> {
const orderId = uuid()
const idempotencyKey = `${request.user_id}:${request.session_id}`
// Check for duplicate submission
const existing = await db.orders.findOne({ idempotency_key: idempotencyKey })
if (existing) {
return { order_id: existing.id }
}
// Create order record in pending state
await db.orders.insert({
id: orderId,
user_id: request.user_id,
product_id: request.product_id,
quantity: request.quantity,
status: "pending",
idempotency_key: idempotencyKey,
created_at: new Date(),
})
// Queue for async processing
await sqs.send(
new SendMessageCommand({
QueueUrl: ORDER_QUEUE_URL,
MessageBody: JSON.stringify({
order_id: orderId,
...request,
}),
MessageDeduplicationId: idempotencyKey,
MessageGroupId: request.user_id, // Ensures per-user ordering
}),
)
return { order_id: orderId }
}
```
**Order processor (worker):**
```typescript collapse={1-15, 70-85}
// order-processor.ts
import { SQSEvent, SQSRecord } from "aws-lambda"
import Stripe from "stripe"
const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!)
interface OrderMessage {
order_id: string
user_id: string
product_id: string
quantity: number
shipping_address: Address
payment_method_id: string
session_id: string
}
export async function handler(event: SQSEvent): Promise<void> {
for (const record of event.Records) {
await processOrder(record)
}
}
async function processOrder(record: SQSRecord): Promise<void> {
const message: OrderMessage = JSON.parse(record.body)
try {
// 1. Verify reservation still valid
const reservation = await getReservation(message.product_id, message.session_id)
if (!reservation) {
await markOrderFailed(message.order_id, "reservation_expired")
return
}
// 2. Process payment
const paymentIntent = await stripe.paymentIntents.create(
{
amount: calculateTotal(message.product_id, message.quantity),
currency: "usd",
payment_method: message.payment_method_id,
confirm: true,
},
// In stripe-node the idempotency key is a request option, not a PaymentIntent parameter
{ idempotencyKey: `payment_${message.order_id}` },
)
if (paymentIntent.status !== "succeeded") {
await releaseReservation(message.product_id, message.session_id)
await markOrderFailed(message.order_id, "payment_failed")
return
}
// 3. Confirm inventory (remove from reserved set)
await confirmReservation(message.product_id, message.session_id)
// 4. Update order status
await db.orders.update(message.order_id, {
status: "confirmed",
payment_intent_id: paymentIntent.id,
confirmed_at: new Date(),
})
// 5. Send confirmation
await sendOrderConfirmation(message.order_id)
} catch (error) {
// Rethrow so the message returns to the queue after its visibility timeout and is retried
throw error
}
}
async function markOrderFailed(orderId: string, reason: string): Promise<void> {
await db.orders.update(orderId, {
status: "failed",
failure_reason: reason,
})
// Notify user
await sendOrderFailureNotification(orderId, reason)
}
```
**Dead letter queue handling:**
Orders that fail after max retries go to a Dead Letter Queue (DLQ) for manual review:
```typescript
// dlq-processor.ts
import type { SQSRecord } from "aws-lambda"

export async function handleDeadLetter(record: SQSRecord): Promise<void> {
  const message = JSON.parse(record.body)
  // Log for investigation
  console.error("Order failed permanently", {
    order_id: message.order_id,
    attempts: record.attributes.ApproximateReceiveCount,
    // ARN of the queue the message came from; the failure cause is not carried on the record
    source_queue: record.attributes.DeadLetterQueueSourceArn,
  })
  // Alert ops team
  await pagerduty.createIncident({
    title: `Flash sale order failed: ${message.order_id}`,
    severity: "high",
  })
  // Release inventory back to pool
  await releaseReservation(message.product_id, message.session_id)
}
```
**Design decisions:**
| Decision | Rationale |
| --------------------------- | -------------------------------------------------- |
| SQS FIFO queue | Exactly-once processing, per-user ordering |
| Idempotency key | Prevents duplicate orders on retry |
| Payment before confirmation | Never confirm inventory without successful payment |
| DLQ for failures | Ensures no order is silently lost |
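The idempotency-key row can be illustrated with an in-memory sketch. `IdempotentExecutor` is a hypothetical helper, not part of the order service above; a production system would more likely rely on a unique index or a conditional write than on a process-local map.

```typescript
// Runs `create` at most once per key; concurrent and repeated calls share the first result.
class IdempotentExecutor<T> {
  private results = new Map<string, Promise<T>>()

  run(key: string, create: () => Promise<T>): Promise<T> {
    const existing = this.results.get(key)
    if (existing) return existing
    const pending = create().catch((err) => {
      // Forget failed attempts so that a later retry can run again.
      this.results.delete(key)
      throw err
    })
    this.results.set(key, pending)
    return pending
  }
}
```

Storing the promise rather than the resolved value is what lets two overlapping submissions with the same key collapse into one order.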
## Bot Detection and Fairness
### Multi-Layer Bot Defense

**Layer 1: Edge defense (WAF)**
```yaml
# AWS WAF rules for flash sale
Rules:
  - Name: RateLimitPerIP
    Action: Block
    Statement:
      RateBasedStatement:
        Limit: 100 # requests per 5 minutes per IP
        AggregateKeyType: IP
  - Name: BlockKnownBots
    Action: Block
    Statement:
      IPSetReferenceStatement:
        ARN: arn:aws:wafv2:....:ipset/known-bots
  - Name: GeoRestriction
    Action: Block
    Statement:
      NotStatement:
        Statement:
          GeoMatchStatement:
            CountryCodes: [US, CA, GB, DE] # Allowed countries
```
**Layer 2: Application-level detection**
```typescript collapse={1-5, 35-45}
// bot-detection.ts
interface BotSignals {
score: number
signals: string[]
}
export async function detectBot(request: Request): Promise<BotSignals> {
const signals: string[] = []
let score = 0
// Device fingerprint consistency
const fp = request.headers.get("x-device-fingerprint")
if (!fp || fp.length < 32) {
signals.push("missing_fingerprint")
score += 30
}
// Behavioral signals
const timing = parseTimingHeader(request)
if (timing.pageLoadToAction < 500) {
// < 500ms is suspicious
signals.push("fast_interaction")
score += 25
}
// Browser consistency
const ua = request.headers.get("user-agent")
const acceptLang = request.headers.get("accept-language")
if (isHeadlessBrowser(ua) || !acceptLang) {
signals.push("headless_indicators")
score += 40
}
// Known residential proxy detection
const ip = getClientIP(request)
if (await isResidentialProxy(ip)) {
signals.push("residential_proxy")
score += 20
}
return { score, signals }
}
export function shouldChallenge(signals: BotSignals): boolean {
return signals.score >= 50
}
export function shouldBlock(signals: BotSignals): boolean {
return signals.score >= 80
}
```
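Because a score of 80 or more satisfies both thresholds, callers need to test the block threshold first. A minimal sketch of that ordering follows; `decideAction` and `BotAction` are hypothetical names layered on the thresholds above.

```typescript
type BotAction = "allow" | "challenge" | "block"

const CHALLENGE_THRESHOLD = 50
const BLOCK_THRESHOLD = 80

// Check the stricter threshold first; otherwise high scores would only be challenged.
function decideAction(score: number): BotAction {
  if (score >= BLOCK_THRESHOLD) return "block"
  if (score >= CHALLENGE_THRESHOLD) return "challenge"
  return "allow"
}
```

With the weights used above, a missing fingerprint (30) plus a fast interaction (25) scores 55 and triggers a challenge, while adding headless indicators (40) reaches 95 and blocks the request.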
**Layer 3: Queue-level protection**
```typescript
// queue-protection.ts
export async function validateQueueJoin(
userId: string,
deviceFingerprint: string,
saleId: string,
): Promise<{ allowed: boolean; reason?: string }> {
// Check for duplicate user
const existingEntry = await findUserInQueue(saleId, userId)
if (existingEntry) {
return { allowed: false, reason: "already_in_queue" }
}
// Check for fingerprint reuse (same device, different accounts)
const fpCount = await countFingerprintInQueue(saleId, deviceFingerprint)
if (fpCount >= 2) {
return { allowed: false, reason: "device_limit_exceeded" }
}
// Velocity check: how many queues has this user joined recently?
const recentJoins = await countRecentQueueJoins(userId, 3600) // last hour
if (recentJoins >= 5) {
return { allowed: false, reason: "velocity_exceeded" }
}
return { allowed: true }
}
```
### Fairness Mechanisms
**1. FIFO queue with randomized entry window**
Users who arrive before sale start are randomized when the sale begins (prevents "refresh at exactly 10:00:00" advantage):
```typescript
export async function openSaleQueue(saleId: string): Promise<void> {
// Get all users who arrived in pre-sale window (e.g., last 15 minutes)
const earlyArrivals = await getEarlyArrivals(saleId)
// Shuffle positions randomly
const shuffled = shuffleArray(earlyArrivals)
// Assign positions 1, 2, 3, ...
for (let i = 0; i < shuffled.length; i++) {
await updatePosition(shuffled[i].queue_ticket, i + 1)
}
// Users arriving after sale start get position = current_max + 1 (true FIFO)
}
```
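The `shuffleArray` helper is not shown in the article. One possible version is a Fisher-Yates shuffle, sketched here with an injectable random source so the result can be tested; `assignPositions` is a hypothetical pure-function stand-in for the database updates above.

```typescript
// Fisher-Yates shuffle; `random` returns a float in [0, 1) and is injectable for testing.
function shuffleArray<T>(items: readonly T[], random: () => number = Math.random): T[] {
  const out = items.slice()
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1))
    const tmp = out[i]
    out[i] = out[j]
    out[j] = tmp
  }
  return out
}

// Early arrivals get shuffled positions 1..n; later arrivals keep arrival order after them.
function assignPositions(
  early: string[],
  late: string[],
  random: () => number = Math.random,
): Map<string, number> {
  const positions = new Map<string, number>()
  const ordered = shuffleArray(early, random).concat(late)
  ordered.forEach((ticket, index) => positions.set(ticket, index + 1))
  return positions
}
```

`Math.random` is adequate for spreading early arrivals but is not a cryptographic source; a sale with meaningful financial stakes may want a stronger generator.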
**2. Per-customer purchase limits**
```typescript
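export async function validatePurchaseLimit(userId: string, productId: string, quantity: number): Promise<boolean> {
// Note: count() returns the number of orders, so each prior order is treated as one unit here
const existingOrders = await db.orders.count({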
user_id: userId,
product_id: productId,
status: { $in: ["confirmed", "pending"] },
})
const LIMIT_PER_USER = 2
return existingOrders + quantity <= LIMIT_PER_USER
}
```
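The check above adds a requested quantity to a count of orders. If an order can contain more than one unit, a variant that sums units may be closer to the intent. The sketch below is a pure function with hypothetical names (`ExistingOrder`, `withinPurchaseLimit`); it assumes the caller has already loaded the user's orders for the product.

```typescript
interface ExistingOrder {
  quantity: number
  status: "pending" | "confirmed" | "failed"
}

const LIMIT_PER_USER = 2

// Sums units across pending and confirmed orders, then checks the request against the cap.
function withinPurchaseLimit(
  orders: ExistingOrder[],
  requested: number,
  limit: number = LIMIT_PER_USER,
): boolean {
  if (!Number.isInteger(requested) || requested <= 0) return false
  const alreadyHeld = orders
    .filter((o) => o.status === "pending" || o.status === "confirmed")
    .reduce((sum, o) => sum + o.quantity, 0)
  return alreadyHeld + requested <= limit
}
```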
## Frontend Considerations
### Waiting Room UX
**Critical UX decisions:**
| Decision | Implementation | Rationale |
| ------------------------- | ---------------------------------------- | ----------------------------------------------- |
| Progress indicator | Position + estimated time + progress bar | Reduces anxiety; users know they're progressing |
| No refresh needed | SPA with polling | Prevents users from losing position |
| Transparent communication | Show exact position | Trust requires honesty |
| Graceful degradation | Static HTML | Must work even if JS fails |
**Optimistic UI for checkout:**
```typescript
// checkout-ui.ts
async function submitOrder(orderData: OrderData): Promise<void> {
// Optimistic: show "Processing..." immediately
setOrderStatus("processing")
showConfirmationPreview(orderData)
try {
const { order_id } = await api.submitOrder(orderData)
// Poll for confirmation (async processing)
pollOrderStatus(order_id, (status) => {
if (status === "confirmed") {
setOrderStatus("confirmed")
showSuccessAnimation()
} else if (status === "failed") {
setOrderStatus("failed")
showRetryOption()
}
})
} catch (error) {
// Revert optimistic UI
setOrderStatus("error")
showErrorMessage(error)
}
}
```
### Real-Time Queue Updates
**Polling vs WebSocket decision:**
| Factor | Polling | WebSocket |
| -------------- | ---------------- | ---------------------------- |
| Scale | Easy (stateless) | Hard (connection management) |
| Latency | 5-10s | Sub-second |
| Infrastructure | Simple | Complex |
| Battery impact | Higher | Lower |
**Chosen: Adaptive polling** — Poll every 5s when far from front; every 1s when close.
```typescript
function calculatePollInterval(position: number, totalAhead: number): number {
const progressPercent = 1 - position / totalAhead
if (progressPercent > 0.9) return 1000 // Top 10%: 1s
if (progressPercent > 0.7) return 2000 // Top 30%: 2s
if (progressPercent > 0.5) return 3000 // Top 50%: 3s
return 5000 // Back 50%: 5s
}
```
### Client State Management
```typescript
// flash-sale-state.ts
interface FlashSaleState {
// Queue state
queueTicket: string | null
position: number | null
status: "idle" | "queued" | "admitted" | "checkout" | "completed" | "expired"
// Checkout state
checkoutToken: string | null
checkoutExpiresAt: Date | null
reservationId: string | null
// Order state
orderId: string | null
orderStatus: "pending" | "processing" | "confirmed" | "failed" | null
}
// State persisted to localStorage for tab recovery
function persistState(state: FlashSaleState): void {
localStorage.setItem("flash-sale-state", JSON.stringify(state))
}
// Restore on page load (handles accidental tab close)
function restoreState(): FlashSaleState | null {
const saved = localStorage.getItem("flash-sale-state")
if (!saved) return null
const state = JSON.parse(saved)
// Check if checkout token is still valid
if (state.checkoutExpiresAt && new Date(state.checkoutExpiresAt) < new Date()) {
return null // Expired, start fresh
}
return state
}
```
## Infrastructure Design
### Cloud-Agnostic Components
| Component | Purpose | Requirements |
| ------------------ | --------------------------- | --------------------------------- |
| CDN | Waiting room, static assets | Edge caching, high throughput |
| Serverless compute | Queue service, APIs | Auto-scale, pay-per-use |
| Key-value store | Inventory counters, tokens | Sub-ms latency, atomic operations |
| Document store | Queue state | Single-digit ms, auto-scale |
| Message queue | Order processing | Durability, exactly-once |
| Relational DB | Orders, users | ACID, complex queries |
### AWS Reference Architecture

**Service configuration:**
| Service | Configuration | Rationale |
| ----------- | -------------------------------------------- | ---------------------------------------- |
| CloudFront | Origin: S3 (static), Cache: 1 year | Waiting room must survive origin failure |
| API Gateway | Throttling: 10K RPS, Burst: 5K | Protects backend during spike |
| Lambda | Memory: 1024MB, Timeout: 30s, Reserved: 1000 | Predictable latency under load |
| ElastiCache | Redis Cluster, 3 nodes, r6g.large | Sub-ms latency, failover |
| DynamoDB | On-demand capacity mode | Handles unpredictable load |
| SQS FIFO | 3000 msg/sec (with batching), 14-day retention | Order durability |
| RDS | Multi-AZ, db.r6g.xlarge, Read replicas | ACID + read scaling |
### Self-Hosted Alternatives
| Managed Service | Self-Hosted Option | Trade-off |
| --------------- | -------------------- | ----------------------------------------- |
| ElastiCache | Redis Cluster on EC2 | More control, operational burden |
| DynamoDB | Cassandra/ScyllaDB | Cost at scale, complexity |
| SQS FIFO | Kafka | Higher throughput, operational complexity |
| Lambda | Kubernetes + KEDA | Fine-grained control, cold starts |
## Variations
### Path B Implementation: Real-Time Counter Model
For e-commerce with dynamic inventory, replace token-based admission with real-time inventory checks:
```typescript
// real-time-inventory.ts
export async function attemptPurchase(
productId: string,
userId: string,
quantity: number,
): Promise<{ success: boolean; orderId?: string }> {
// Rate limit first (protect backend)
const allowed = await rateLimiter.check(userId, "purchase")
if (!allowed) {
return { success: false }
}
// Atomic inventory check + decrement
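const result = (await redis.eval(
`
-- A missing key counts as zero stock instead of raising a Lua error
local count = tonumber(redis.call('GET', KEYS[1]) or 0)
local requested = tonumber(ARGV[1])
if count >= requested then
return redis.call('DECRBY', KEYS[1], requested)
else
return -1
end
`,
1,
`inventory:${productId}`,
quantity,
)) as number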
if (result < 0) {
return { success: false } // Sold out
}
// Proceed to order (inventory already decremented)
const orderId = await createOrder(productId, userId, quantity)
return { success: true, orderId }
}
```
**Key difference:** Inventory decremented at purchase attempt, not at queue admission. Higher risk of "sold out after waiting" but supports dynamic restocking.
### VIP Early Access
Add priority tiers to queue service:
```typescript
// vip-queue.ts
interface QueueEntry {
// ... existing fields
tier: "vip" | "member" | "standard"
tierJoinedAt: Date
}
export async function getNextPosition(saleId: string, tier: QueueEntry["tier"]): Promise<number> {
// VIPs get positions 1-1000, members 1001-10000, standard 10001+
const tierOffsets = { vip: 0, member: 1000, standard: 10000 }
const offset = tierOffsets[tier]
const countInTier = await ddb.query({
TableName: "FlashSaleQueue",
KeyConditionExpression: "sale_id = :sid",
FilterExpression: "tier = :tier",
ExpressionAttributeValues: { ":sid": saleId, ":tier": tier },
})
return offset + (countInTier.Count || 0) + 1
}
```
### Raffle-Based Allocation
For extremely limited inventory (e.g., 100 items, 1M users), replace queue with raffle:
```typescript
// raffle-mode.ts
export async function enterRaffle(saleId: string, userId: string): Promise<void> {
// Entry window: 1 hour before draw
await ddb.put({
TableName: "FlashSaleRaffle",
Item: {
sale_id: saleId,
user_id: userId,
entry_id: uuid(),
entered_at: new Date().toISOString(),
},
})
}
export async function drawWinners(saleId: string, count: number): Promise<string[]> {
// Get all entries
const entries = await getAllEntries(saleId)
// Cryptographically random selection
const shuffled = cryptoShuffle(entries)
const winners = shuffled.slice(0, count)
// Grant checkout tokens to winners
for (const winner of winners) {
await grantCheckoutToken(winner.user_id, saleId)
}
return winners.map((w) => w.user_id)
}
```
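With a million entries and a hundred winners, a full shuffle does more work than needed. The sketch below selects winners with a partial Fisher-Yates pass and gives each user a single chance regardless of how many entries they submitted. `pickWinners` and `RaffleEntry` are hypothetical names, and the random source is supplied by the caller: in Node.js, `randomInt` from `node:crypto` is one CSPRNG-backed candidate.

```typescript
interface RaffleEntry {
  user_id: string
  entry_id: string
}

// Picks up to `count` distinct winners.
// `randomInt(n)` must return a uniformly distributed integer in [0, n).
function pickWinners(
  entries: RaffleEntry[],
  count: number,
  randomInt: (n: number) => number,
): string[] {
  // One chance per user: keep only the first entry for each user_id.
  const seen = new Set<string>()
  const pool: string[] = []
  for (const entry of entries) {
    if (!seen.has(entry.user_id)) {
      seen.add(entry.user_id)
      pool.push(entry.user_id)
    }
  }
  const k = Math.min(Math.max(Math.floor(count), 0), pool.length)
  // Partial Fisher-Yates: only the first k slots are randomized.
  for (let i = 0; i < k; i++) {
    const j = i + randomInt(pool.length - i)
    const tmp = pool[i]
    pool[i] = pool[j]
    pool[j] = tmp
  }
  return pool.slice(0, k)
}
```

Recording the entry list and the draw inputs alongside the result makes the outcome auditable if a participant disputes it.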
## Conclusion
Flash sale systems require coordinated defense at every layer:
1. **Traffic absorption**: CDN-hosted waiting room prevents backend overwhelm. Static HTML plus client-side polling scales with CDN capacity rather than origin capacity.
2. **Fair admission**: Token-based queue management (Path A) guarantees purchase opportunity. FIFO with randomized early arrival prevents "refresh race."
3. **Inventory accuracy**: Redis Lua scripts provide atomic check-and-decrement. Zero overselling through construction, not hope.
4. **Order durability**: Async processing via SQS decouples order receipt from processing. DLQ ensures no order is silently lost.
5. **Bot defense**: Multi-layer detection (WAF → behavioral → queue-level) raises the bar for attackers without blocking legitimate users.
**What this design optimizes for:**
- Zero overselling (100% inventory accuracy)
- Fairness (transparent queue position)
- Durability (no lost orders)
- Scalability (1M+ concurrent users)
**What it sacrifices:**
- Latency (queue wait time)
- Simplicity (multiple coordinated services)
- Dynamic inventory (pre-allocation model)
**Known limitations:**
- Token expiration requires careful tuning (too short: frustrated users; too long: wasted inventory)
- Sophisticated bots with residential proxies remain challenging
- VIP tiers can feel unfair to standard users
## Appendix
### Prerequisites
- Distributed systems fundamentals (CAP theorem, consistency models)
- Queue theory basics (FIFO, rate limiting)
- Redis data structures and Lua scripting
- Message queue patterns (at-least-once, exactly-once)
- Payment processing (idempotency, webhooks)
### Summary
- Flash sales require a **waiting room → token gate → atomic inventory → async order queue** architecture
- **CDN-hosted waiting room** absorbs traffic spikes cheaply and reliably
- **Token-based admission** (Path A) guarantees purchase opportunity and prevents overselling by construction
- **Redis Lua scripts** provide atomic inventory operations at 500K+ ops/second
- **Async order processing** via message queues decouples order receipt from fulfillment
- **Multi-layer bot defense** (WAF + behavioral + queue-level) raises attack cost without blocking legitimate users
### References
- [Alibaba Cloud: System Stability for Large-Scale Flash Sales](https://www.alibabacloud.com/blog/system-stability-assurance-for-large-scale-flash-sales_596968) - Alibaba Singles Day architecture
- [AWS Prime Day 2025 Metrics](https://aws.amazon.com/blogs/aws/aws-services-scale-to-new-heights-for-prime-day-2025-key-metrics-and-milestones/) - Scale benchmarks
- [SeatGeek Virtual Waiting Room Architecture](https://aws.amazon.com/blogs/architecture/build-a-virtual-waiting-room-with-amazon-dynamodb-and-aws-lambda-at-seatgeek/) - Token-based queue implementation
- [Shopify Flash Sales Architecture](https://www.infoq.com/presentations/shopify-architecture-flash-sale/) - Multi-tenant SaaS approach
- [Ticketmaster Queue System](https://blog.ticketmaster.com/how-ticketmaster-queue-works/) - Virtual waiting room UX
- [Redis Distributed Locks](https://redis.io/docs/latest/develop/clients/patterns/distributed-locks/) - Atomic operations patterns
- [Martin Kleppmann: Designing Data-Intensive Applications](https://dataintensive.net/) - Distributed systems fundamentals
---
## Design Amazon Shopping Cart
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/system-design-problems/design-amazon-shopping-cart
**Category:** System Design / System Design Problems
**Description:** A system design for an e-commerce shopping cart handling millions of concurrent users, real-time inventory, dynamic pricing, and distributed checkout. This design focuses on high availability during flash sales, consistent inventory management, and seamless guest-to-user cart transitions.
# Design Amazon Shopping Cart
A system design for an e-commerce shopping cart handling millions of concurrent users, real-time inventory, dynamic pricing, and distributed checkout. This design focuses on high availability during flash sales, consistent inventory management, and seamless guest-to-user cart transitions.

High-level shopping cart architecture showing client apps, gateway layer, application services, and data stores with async processing for cart expiration and abandoned cart recovery.
## Abstract
Shopping cart systems must solve three interconnected problems: **cart state persistence** (guest vs authenticated users, cross-device sync, cart merge on login), **inventory accuracy** (preventing overselling during concurrent access while maintaining availability), and **checkout atomicity** (coordinating payment, inventory deduction, and order creation across distributed services).
The core architectural decisions:
1. **Two-tier cart storage**: Redis for sub-millisecond reads during browsing; persistent database for durability and cross-device sync
2. **Soft reservations with expiry**: Reserve inventory on add-to-cart (5-minute TTL), convert to hard reservation only after payment confirmation—prevents inventory lock-up from abandoned carts
3. **Saga pattern for checkout**: Orchestrated sequence of payment authorization → inventory hard reservation → order creation, with compensating transactions on failure
4. **Eventually consistent inventory display, strongly consistent checkout**: Accept stale counts on product pages for availability; enforce consistency only during checkout
| Dimension | Optimizes For | Sacrifices |
| ------------------- | -------------------------- | ---------------------------------------- |
| Redis cart cache | Read latency (<1ms) | Durability (requires DB backup) |
| Soft reservations | Inventory turnover | Checkout may fail if reservation expires |
| Saga orchestration | Reliability, debuggability | Latency (sequential steps) |
| Hash-based sharding | Even distribution | Cross-user queries |
## Requirements
### Functional Requirements
| Feature | Scope | Notes |
| ---------------------------- | -------- | ------------------------------------------------------ |
| Add/remove/update cart items | Core | Real-time quantity validation against inventory |
| Guest cart with persistence | Core | Survives browser close, 30-day expiry |
| Cart merge on login | Core | Combine guest + user cart, resolve quantity conflicts |
| Real-time price updates | Core | Price at checkout reflects current price, not add-time |
| Inventory soft reservation | Core | Prevent checkout failures from out-of-stock |
| Coupon/promotion application | Core | Stackable rules, exclusivity handling |
| Multi-step checkout | Core | Address → Payment → Review → Confirm |
| Abandoned cart recovery | Extended | Email/push notifications with cart link |
| Wishlist/Save for Later | Extended | Move items between cart and wishlist |
### Non-Functional Requirements
| Requirement | Target | Rationale |
| ------------------ | ------------------------------------- | ---------------------------------------------------------- |
| Availability | 99.99% (4 nines) | Revenue-critical; 1 hour downtime = millions in lost sales |
| Cart read latency | p99 < 50ms | Instant feedback on cart interactions |
| Cart write latency | p99 < 200ms | Acceptable for add/remove operations |
| Checkout latency | p99 < 3s | End-to-end including payment processing |
| Peak throughput | 100K cart ops/sec | Flash sale scenarios |
| Data durability | 99.999999999% | Cart data must not be lost |
| Consistency | Eventual (display), Strong (checkout) | Hybrid model per operation type |
### Scale Estimation
**Users:**
- DAU: 50M
- Peak concurrent users: 5M (10% of DAU during flash sales)
- Carts per user: 1 active
**Traffic:**
- Cart views: 50M DAU × 10 views/day = 500M/day = ~6K RPS average
- Cart modifications: 50M DAU × 3 ops/day = 150M/day = ~1.7K RPS average
- Peak multiplier (flash sale): 50x → 300K cart views/sec, 85K modifications/sec
- Checkouts: 5M/day = ~60 checkouts/sec average, 3K/sec peak
**Storage:**
- Cart record: ~2KB (metadata + 10 items average)
- Active carts: 50M × 2KB = 100GB
- Historical carts (90-day retention): 100GB × 3 = 300GB
- With replication (3x): (100GB + 300GB) × 3 = ~1.2TB total
**Inventory:**
- SKUs: 100M products
- Inventory checks: 500M/day (1 per cart view)
- Reservation writes: 150M/day
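The traffic and storage figures follow from simple division, and a short sketch makes the rounding visible. The constant and function names are illustrative; the inputs are the ones listed above.

```typescript
const SECONDS_PER_DAY = 86400
const DAU = 50000000
const PEAK_MULTIPLIER = 50

// Converts a per-day volume into average requests per second.
function perSecond(perDay: number): number {
  return perDay / SECONDS_PER_DAY
}

const average = {
  cartViewsRps: perSecond(DAU * 10), // 10 views per user per day
  cartWritesRps: perSecond(DAU * 3), // 3 modifications per user per day
  checkoutsRps: perSecond(5000000), // 5M checkouts per day
}

const peak = {
  cartViewsRps: average.cartViewsRps * PEAK_MULTIPLIER,
  cartWritesRps: average.cartWritesRps * PEAK_MULTIPLIER,
  checkoutsRps: average.checkoutsRps * PEAK_MULTIPLIER,
}

// 2 KB per cart, using decimal units (1 GB = 1,000,000 KB).
const activeCartsGB = (DAU * 2) / 1000000
```

The unrounded averages are about 5,787 views/sec and 1,736 modifications/sec, so the 50x peaks come to roughly 289K and 87K per second; the 300K and 85K figures above result from rounding the averages before multiplying.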
## Design Paths
### Path A: Consistency-First (Financial/Limited-Inventory)
**Best when:**
- High-value items where overselling has significant cost (electronics, luxury goods)
- Limited inventory flash sales (concert tickets, limited editions)
- Regulatory requirements for inventory accuracy
**Architecture:**
- Strong consistency for all inventory operations
- Synchronous inventory checks before cart add
- Pessimistic locking during checkout
**Trade-offs:**
- ✅ Zero overselling
- ✅ Accurate inventory counts everywhere
- ❌ Higher latency (lock contention)
- ❌ Lower throughput under load
- ❌ Checkout failures increase during traffic spikes
**Real-world example:** Ticketmaster uses strong consistency for seat inventory. During high-demand events, this leads to "waiting room" queuing to serialize access and prevent overselling reserved seating.
### Path B: Availability-First (High-Volume Retail)
**Best when:**
- Large inventory buffers make overselling rare
- Customer experience (speed) more important than perfect accuracy
- Compensation for overselling is acceptable (refund + coupon)
**Architecture:**
- Eventually consistent inventory reads
- Optimistic updates with conflict resolution at checkout
- Accept occasional overselling, handle via backorder/compensation
**Trade-offs:**
- ✅ Sub-millisecond cart operations
- ✅ Handles extreme traffic spikes gracefully
- ✅ Better user experience (no waiting)
- ❌ Occasional overselling (typically <0.1%)
- ❌ Requires robust compensation workflow
**Real-world example:** Amazon accepts minor overselling on most products. When it occurs, customers receive an apology email with a discount code and the option to wait for restock or cancel. The revenue from faster checkout far exceeds compensation costs.
### Path Comparison
| Factor | Path A: Consistency-First | Path B: Availability-First |
| ---------------------- | ------------------------- | --------------------------- |
| Inventory accuracy | 100% | 99.9%+ |
| Cart add latency | 50-200ms | <10ms |
| Peak throughput | 10K ops/sec | 100K+ ops/sec |
| Checkout failure rate | Higher (lock timeouts) | Lower (optimistic) |
| Operational complexity | Lower | Higher (compensation flows) |
| Best for | Tickets, luxury, limited | General retail, commodities |
### This Article's Focus
This article focuses on **Path B (Availability-First)** because it represents the architecture of most large-scale e-commerce systems (Amazon, Shopify, Walmart). Path A patterns are noted where inventory criticality requires them.
## High-Level Design
### Service Architecture

### Cart Service
Manages cart lifecycle: creation, item management, persistence, and merge operations.
**Responsibilities:**
- CRUD operations on cart items
- Guest cart token generation and management
- Cart merge on user authentication
- Price/availability validation coordination
- Cart expiration scheduling
**Data Flow - Add to Cart:**

### Inventory Service
Manages stock levels, reservations, and availability across warehouses.
**Key Concepts:**
- **Available For Sale (AFS):** Physical stock minus hard reservations
- **Available For Reservation (AFR):** AFS minus soft reservations
- **Soft Reservation:** Temporary hold with TTL, automatically releases
- **Hard Reservation:** Committed hold after payment, triggers fulfillment
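The two availability figures can be written as small functions. This is a sketch with hypothetical names (`StockLevel`, `availableForSale`, `availableForReservation`); clamping at zero is an assumption added here so that over-reservation never yields a negative count.

```typescript
interface StockLevel {
  physical: number
  hardReserved: number
  softReserved: number
}

// AFS: units that can still be sold (physical stock minus committed holds).
function availableForSale(stock: StockLevel): number {
  return Math.max(0, stock.physical - stock.hardReserved)
}

// AFR: units a new cart can still soft-reserve.
function availableForReservation(stock: StockLevel): number {
  return Math.max(0, availableForSale(stock) - stock.softReserved)
}
```

For example, 100 physical units with 30 hard-reserved and 25 soft-reserved gives an AFS of 70 and an AFR of 45.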
**Reservation State Machine:**

### Checkout Orchestrator
Coordinates the multi-step checkout process using the Saga pattern.
**Saga Steps:**
1. **Validate Cart:** Confirm items still in stock at current prices
2. **Authorize Payment:** Place hold on payment method
3. **Convert Reservations:** Soft → Hard for all cart items
4. **Create Order:** Generate order record
5. **Confirm Payment:** Capture authorized amount
6. **Trigger Fulfillment:** Send to warehouse
**Compensation Actions (on failure):**
- Payment capture failed → Cancel order, release hard reservations, void authorization
- Reservation conversion failed → Void payment authorization
- Order creation failed → Release reservations, void authorization
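The steps and compensations above can be sketched as a small orchestrator that undoes completed steps in reverse order. `runSaga`, `SagaStep`, and `SagaResult` are hypothetical names; retries, timeouts, and a durable saga log are omitted, and a failing compensation is not handled.

```typescript
interface SagaStep {
  name: string
  run: () => Promise<void>
  compensate?: () => Promise<void>
}

interface SagaResult {
  ok: boolean
  failedStep?: string
  compensated: string[]
}

// Runs steps in order; on failure, compensates the completed steps in reverse order.
async function runSaga(steps: SagaStep[]): Promise<SagaResult> {
  const completed: SagaStep[] = []
  for (const step of steps) {
    try {
      await step.run()
      completed.push(step)
    } catch {
      const compensated: string[] = []
      for (const done of completed.reverse()) {
        if (done.compensate) {
          await done.compensate()
          compensated.push(done.name)
        }
      }
      return { ok: false, failedStep: step.name, compensated }
    }
  }
  return { ok: true, compensated: [] }
}
```

If payment capture fails after the order has been created, this ordering cancels the order first, then releases the reservations, then voids the authorization.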
### Pricing Service
Evaluates pricing rules, promotions, and coupons in real-time.
**Rule Evaluation Order:**
1. Base price (from catalog)
2. Sale price (time-based overrides)
3. Quantity discounts (buy 3, get 10% off)
4. Coupon codes (user-applied)
5. Cart-level promotions (free shipping over $50)
6. Loyalty discounts (member pricing)
**Conflict Resolution:**
- Exclusive promotions marked in rule metadata
- Priority field determines evaluation order
- "Best for customer" mode: apply combination yielding maximum discount
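One way to read these rules is sketched below. It assumes that stackable rules compound in priority order on the running total, that an exclusive rule applies alone, and that "best for customer" picks whichever option yields the lowest total. All names (`PricingRule`, `applyStack`, `bestForCustomer`) are hypothetical, and amounts are integer cents.

```typescript
interface PricingRule {
  id: string
  priority: number // lower values are evaluated first
  exclusive: boolean
  // Returns the discount in cents for the given running total.
  discount: (runningTotalCents: number) => number
}

interface PricingResult {
  totalCents: number
  applied: string[]
}

// Applies rules in priority order; each discount is rounded and capped at the running total.
function applyStack(baseCents: number, rules: PricingRule[]): PricingResult {
  let total = baseCents
  const applied: string[] = []
  const ordered = rules.slice().sort((a, b) => a.priority - b.priority)
  for (const rule of ordered) {
    const amount = Math.min(total, Math.max(0, Math.round(rule.discount(total))))
    if (amount > 0) {
      total -= amount
      applied.push(rule.id)
    }
  }
  return { totalCents: total, applied }
}

// Compares the stackable combination with each exclusive rule applied on its own.
function bestForCustomer(baseCents: number, rules: PricingRule[]): PricingResult {
  let best = applyStack(baseCents, rules.filter((r) => !r.exclusive))
  for (const rule of rules.filter((r) => r.exclusive)) {
    const candidate = applyStack(baseCents, [rule])
    if (candidate.totalCents < best.totalCents) best = candidate
  }
  return best
}
```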
## API Design
### Cart Endpoints
#### Get Cart
```http
GET /api/v1/cart
Authorization: Bearer {token} | X-Guest-Token: {guest_token}
```
**Response (200 OK):**
```json collapse={16-45}
{
"cart_id": "cart_abc123",
"user_id": "user_xyz789",
"items": [
{
"item_id": "item_001",
"product_id": "prod_12345",
"product_name": "Wireless Headphones",
"variant_id": "var_black_medium",
"quantity": 2,
"unit_price": 79.99,
"line_total": 159.98,
"image_url": "https://cdn.example.com/headphones.jpg",
"availability": {
"status": "in_stock",
"quantity_available": 45,
"reservation_expires_at": "2024-01-15T10:35:00Z"
},
"applied_promotions": [
{
"promotion_id": "promo_winter_sale",
"name": "Winter Sale 20% Off",
"discount_amount": 31.99
}
]
}
],
"summary": {
"subtotal": 159.98,
"discount_total": 31.99,
"shipping_estimate": 0.0,
"tax_estimate": 10.24,
"total": 138.23
},
"applied_coupons": [],
"created_at": "2024-01-15T09:00:00Z",
"updated_at": "2024-01-15T10:30:00Z",
"expires_at": "2024-02-14T09:00:00Z"
}
```
**Design Decisions:**
- `availability` embedded per item: Frontend can show stock warnings without additional calls
- `reservation_expires_at` exposed: Client can show countdown timer encouraging checkout
- `summary` pre-calculated: Avoids client-side price calculation errors
- Pagination not needed: Carts rarely exceed 50 items; full payload < 10KB
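The pre-calculated `summary` can be derived on the server along these lines. The sketch uses integer cents internally to avoid floating-point drift, which is an assumption and not the wire format shown above; `LineItem`, `CartSummary`, and `summarize` are hypothetical names.

```typescript
interface LineItem {
  quantity: number
  unitPriceCents: number
  discountCents: number
}

interface CartSummary {
  subtotalCents: number
  discountTotalCents: number
  shippingCents: number
  taxCents: number
  totalCents: number
}

// total = subtotal - discounts + shipping + tax, all in integer cents.
function summarize(items: LineItem[], shippingCents: number, taxCents: number): CartSummary {
  const subtotalCents = items.reduce((sum, i) => sum + i.quantity * i.unitPriceCents, 0)
  const discountTotalCents = items.reduce((sum, i) => sum + i.discountCents, 0)
  const totalCents = subtotalCents - discountTotalCents + shippingCents + taxCents
  return { subtotalCents, discountTotalCents, shippingCents, taxCents, totalCents }
}
```

Feeding in the sample cart (two units at 7999 cents, a 3199-cent discount, no shipping, 1024 cents of tax) gives a subtotal of 15998 and a total of 13823, matching the 159.98 and 138.23 in the response.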
#### Add Item to Cart
```http
POST /api/v1/cart/items
Authorization: Bearer {token} | X-Guest-Token: {guest_token}
Content-Type: application/json
Idempotency-Key: {uuid}
```
**Request:**
```json
{
"product_id": "prod_12345",
"variant_id": "var_black_medium",
"quantity": 2
}
```
**Response (201 Created):**
```json collapse={5-30}
{
"item_id": "item_001",
"product_id": "prod_12345",
"quantity": 2,
"unit_price": 79.99,
"line_total": 159.98,
"reservation": {
"reservation_id": "res_abc123",
"expires_at": "2024-01-15T10:35:00Z"
},
"cart_summary": {
"item_count": 2,
"subtotal": 159.98,
"total": 138.23
}
}
```
**Error Responses:**
| Status | Condition | Body |
| ------ | --------------------------- | ----------------------------------------------------------------- |
| 400 | Invalid product/variant ID | `{"error": "INVALID_PRODUCT", "message": "Product not found"}` |
| 409 | Insufficient inventory | `{"error": "INSUFFICIENT_STOCK", "available": 1, "requested": 2}` |
| 409 | Duplicate add (idempotency) | Returns original response |
| 429 | Rate limit exceeded | `{"error": "RATE_LIMITED", "retry_after": 60}` |
**Rate Limits:** 60 requests/minute per user (prevents cart bombing attacks)
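The 60 requests/minute limit could be enforced with a per-user window counter. The sketch below is an in-memory, single-process illustration only; `FixedWindowRateLimiter` is a hypothetical name, the fixed-window strategy is an assumption (the text does not say which algorithm is used), and a real deployment would more likely keep the counters in a shared store such as Redis.

```typescript
interface RateLimitDecision {
  allowed: boolean
  retryAfterSeconds: number
}

class FixedWindowRateLimiter {
  private readonly windows = new Map<string, { windowStart: number; count: number }>()
  private readonly limit: number
  private readonly windowMs: number

  constructor(limit: number = 60, windowMs: number = 60000) {
    this.limit = limit
    this.windowMs = windowMs
  }

  // nowMs is passed in so the limiter stays deterministic and easy to test.
  check(userId: string, nowMs: number): RateLimitDecision {
    const current = this.windows.get(userId)
    if (!current || nowMs - current.windowStart >= this.windowMs) {
      // First request, or the previous window has elapsed: start a new window.
      this.windows.set(userId, { windowStart: nowMs, count: 1 })
      return { allowed: true, retryAfterSeconds: 0 }
    }
    if (current.count < this.limit) {
      current.count += 1
      return { allowed: true, retryAfterSeconds: 0 }
    }
    // Over the limit: report how long until the window resets.
    const msUntilReset = current.windowStart + this.windowMs - nowMs
    return { allowed: false, retryAfterSeconds: Math.ceil(msUntilReset / 1000) }
  }
}
```

A denied decision maps onto the `429` row above, with `retryAfterSeconds` feeding the `retry_after` field.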
#### Update Item Quantity
```http
PATCH /api/v1/cart/items/{item_id}
```
**Request:**
```json
{
"quantity": 3
}
```
Behavior:
- `quantity: 0` removes the item
- Validates against available inventory
- Updates soft reservation (extends TTL if increasing, releases delta if decreasing)
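The behavior rules above can be expressed as a pure decision function that runs before any reservation call. This is a sketch under assumptions: `planQuantityUpdate`, the plan shape, and the `availableToReserve` parameter (units still reservable beyond what this cart already holds) are hypothetical names introduced for illustration.

```typescript
type QuantityUpdatePlan =
  | { action: "remove"; releaseUnits: number }
  | { action: "noop" }
  | { action: "decrease"; releaseUnits: number }
  | { action: "increase"; reserveUnits: number; extendTtl: true }
  | { action: "reject"; error: "INSUFFICIENT_STOCK"; available: number; requested: number }

function planQuantityUpdate(
  currentQty: number,
  requestedQty: number,
  availableToReserve: number,
): QuantityUpdatePlan {
  if (!Number.isInteger(requestedQty) || requestedQty < 0) {
    throw new RangeError("quantity must be a non-negative integer")
  }
  // quantity: 0 removes the item and releases everything it held.
  if (requestedQty === 0) return { action: "remove", releaseUnits: currentQty }

  const delta = requestedQty - currentQty
  if (delta === 0) return { action: "noop" }
  // Decreasing releases only the difference from the soft reservation.
  if (delta < 0) return { action: "decrease", releaseUnits: -delta }
  // Increasing must fit within what inventory can still reserve.
  if (delta > availableToReserve) {
    return {
      action: "reject",
      error: "INSUFFICIENT_STOCK",
      available: currentQty + availableToReserve,
      requested: requestedQty,
    }
  }
  return { action: "increase", reserveUnits: delta, extendTtl: true }
}
```

Keeping the decision separate from the side effects makes the branch logic unit-testable without an inventory service.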
#### Apply Coupon
```http
POST /api/v1/cart/coupons
```
**Request:**
```json
{
"code": "SAVE20"
}
```
**Response (200 OK):**
```json
{
"coupon": {
"code": "SAVE20",
"description": "20% off your order",
"discount_type": "percentage",
"discount_value": 20,
"applied_discount": 27.99
},
"cart_summary": {
"subtotal": 159.98,
"discount_total": 59.98,
"total": 110.24
}
}
```
**Error Responses:**
| Status | Condition |
| ------ | ---------------------------------------------- |
| 400 | Invalid/expired coupon |
| 409 | Coupon not combinable with existing promotions |
| 409 | Minimum order value not met |
### Checkout Endpoints
#### Initialize Checkout
```http
POST /api/v1/checkout
Authorization: Bearer {token}
```
**Request:**
```json
{
"cart_id": "cart_abc123"
}
```
**Response (201 Created):**
```json collapse={7-35}
{
"checkout_id": "checkout_xyz789",
"status": "pending",
"cart_snapshot": {
"items": [...],
"summary": {...}
},
"required_steps": ["address", "payment", "review"],
"completed_steps": [],
"expires_at": "2024-01-15T11:00:00Z"
}
```
**Design Decisions:**
- `cart_snapshot` captured at checkout init: Prices locked for checkout duration
- `expires_at` enforced: 30-minute checkout session prevents indefinite reservation holds
- Steps returned by server: Enables A/B testing checkout flows without client changes
#### Submit Shipping Address
```http
PUT /api/v1/checkout/{checkout_id}/address
```
**Request:**
```json
{
"shipping_address": {
"name": "John Doe",
"line1": "123 Main St",
"line2": "Apt 4B",
"city": "Seattle",
"state": "WA",
"postal_code": "98101",
"country": "US",
"phone": "+1-206-555-0123"
},
"billing_same_as_shipping": true
}
```
**Response includes:**
- Validated/normalized address
- Updated shipping options with real costs
- Tax calculation based on destination
#### Submit Payment and Complete
```http
POST /api/v1/checkout/{checkout_id}/complete
Idempotency-Key: {uuid}
```
**Request:**
```json
{
"payment_method_id": "pm_card_visa_4242",
"accept_terms": true
}
```
**Response (201 Created):**
```json collapse={8-25}
{
"order_id": "order_abc123",
"status": "confirmed",
"confirmation_number": "AMZ-2024-ABC123",
"estimated_delivery": "2024-01-18",
"total_charged": 138.23,
"payment": {
"method": "Visa ending in 4242",
"transaction_id": "txn_xyz789"
}
}
```
**Idempotency Behavior:**
- Same `Idempotency-Key` within 24 hours returns cached response
- Prevents duplicate charges on network retries or user double-clicks
## Data Modeling
### Cart Schema
**Primary Store:** PostgreSQL (ACID guarantees, complex merge queries)
```sql collapse={1-5, 25-40}
-- Cart table
CREATE TABLE carts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id),
guest_token VARCHAR(64) UNIQUE,
status VARCHAR(20) DEFAULT 'active',
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ,
merged_into_cart_id UUID REFERENCES carts(id),
CONSTRAINT user_or_guest CHECK (
(user_id IS NOT NULL AND guest_token IS NULL) OR
(user_id IS NULL AND guest_token IS NOT NULL)
)
);
-- Cart items table
CREATE TABLE cart_items (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
cart_id UUID NOT NULL REFERENCES carts(id) ON DELETE CASCADE,
product_id UUID NOT NULL,
variant_id UUID NOT NULL,
quantity INT NOT NULL CHECK (quantity > 0),
unit_price_at_add DECIMAL(10,2) NOT NULL,
reservation_id UUID,
added_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE (cart_id, product_id, variant_id)
);
-- Indexes for common access patterns
CREATE INDEX idx_carts_user ON carts(user_id) WHERE user_id IS NOT NULL;
CREATE INDEX idx_carts_guest ON carts(guest_token) WHERE guest_token IS NOT NULL;
CREATE INDEX idx_carts_expires ON carts(expires_at) WHERE status = 'active';
CREATE INDEX idx_cart_items_cart ON cart_items(cart_id);
CREATE INDEX idx_cart_items_reservation ON cart_items(reservation_id)
WHERE reservation_id IS NOT NULL;
```
**Design Decisions:**
- `user_id` vs `guest_token` mutual exclusion: Clean separation of authenticated vs guest carts
- `unit_price_at_add`: Audit trail for price changes between add and checkout
- `reservation_id` nullable: Not all cart items require reservation (digital goods)
- Soft delete via `merged_into_cart_id`: Preserves guest cart history for analytics
### Cart Cache Structure (Redis)
```redis
# Cart metadata (Hash)
HSET cart:{cart_id}
user_id "user_xyz789"
item_count 3
subtotal 259.97
updated_at 1705312200
# Cart items (Hash - one per item)
HSET cart:{cart_id}:item:{item_id}
product_id "prod_12345"
variant_id "var_black_medium"
quantity 2
unit_price 79.99
reservation_id "res_abc123"
reservation_expires 1705312500
# Guest token to cart mapping
SET guest:{guest_token} cart_abc123 EX 2592000 # 30 days
# Cart expiration sorted set (for cleanup workers)
ZADD cart_expirations 1705312500 cart_abc123
```
**TTL Strategy:**
- Cart metadata: 30 days (matches business retention policy)
- Reservation entries: 5 minutes (aligned with the soft reservation TTL)
- Guest token mapping: 30 days
### Inventory Schema
**Primary Store:** PostgreSQL with read replicas
```sql collapse={1-3, 20-35}
-- Inventory by location
CREATE TABLE inventory_entries (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
product_id UUID NOT NULL,
variant_id UUID NOT NULL,
location_id UUID NOT NULL,
quantity_on_hand INT NOT NULL DEFAULT 0,
quantity_reserved INT NOT NULL DEFAULT 0,
quantity_available INT GENERATED ALWAYS AS
(quantity_on_hand - quantity_reserved) STORED,
updated_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE (product_id, variant_id, location_id),
CHECK (quantity_reserved <= quantity_on_hand)
);
-- Reservations table
CREATE TABLE inventory_reservations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
inventory_entry_id UUID NOT NULL REFERENCES inventory_entries(id),
cart_id UUID NOT NULL,
quantity INT NOT NULL,
type VARCHAR(10) NOT NULL CHECK (type IN ('soft', 'hard')),
created_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ, -- NULL for hard reservations
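order_id UUID -- Set when converted to hard reservation
);
-- PostgreSQL has no inline INDEX clause in CREATE TABLE; indexes are created separately
CREATE INDEX idx_reservations_entry ON inventory_reservations(inventory_entry_id);
CREATE INDEX idx_reservations_cart ON inventory_reservations(cart_id);
CREATE INDEX idx_reservations_expires ON inventory_reservations(expires_at) WHERE type = 'soft';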
```
**Consistency Approach:**
- `quantity_available` as generated column: Always consistent with underlying values
- Reservation updates use `SELECT FOR UPDATE`: Prevents race conditions
- Read replicas used for availability display (eventual consistency acceptable)
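The `SELECT FOR UPDATE` step can be sketched as below. The `Tx` interface is a hypothetical stand-in for whatever transaction handle the database driver provides (real drivers differ in shape), and `reserveStock` and `InsufficientStockError` are illustrative names. The caller is assumed to have opened the transaction and to commit or roll back afterwards.

```typescript
// Hypothetical minimal transaction handle; not a specific driver's API.
interface Tx {
  query(sql: string, params: unknown[]): Promise<Array<Record<string, unknown>>>
}

class InsufficientStockError extends Error {
  readonly available: number
  readonly requested: number
  constructor(available: number, requested: number) {
    super(`Only ${available} available, ${requested} requested`)
    this.name = "InsufficientStockError"
    this.available = available
    this.requested = requested
  }
}

async function reserveStock(tx: Tx, inventoryEntryId: string, quantity: number): Promise<void> {
  // Lock the inventory row so concurrent reservations for it serialize here.
  const rows = await tx.query(
    "SELECT quantity_on_hand, quantity_reserved FROM inventory_entries WHERE id = $1 FOR UPDATE",
    [inventoryEntryId],
  )
  if (rows.length === 0) {
    throw new Error(`Unknown inventory entry ${inventoryEntryId}`)
  }
  const available = Number(rows[0].quantity_on_hand) - Number(rows[0].quantity_reserved)
  if (quantity > available) {
    throw new InsufficientStockError(available, quantity)
  }
  // Safe to increment: no other transaction can change this row until we commit.
  await tx.query(
    "UPDATE inventory_entries SET quantity_reserved = quantity_reserved + $1, updated_at = NOW() WHERE id = $2",
    [quantity, inventoryEntryId],
  )
}
```

The application-level check mirrors the table's `CHECK (quantity_reserved <= quantity_on_hand)` constraint, which remains the last line of defense if a code path skips the lock.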
### Reservation Cache (Redis)
```redis
# Soft reservation with automatic expiry
SET reservation:{res_id}
'{"inventory_entry_id":"inv_123","cart_id":"cart_abc","quantity":2}'
EX 300 # 5 minutes
# Fast lookup: cart → reservations
SADD cart_reservations:{cart_id} res_001 res_002
EXPIRE cart_reservations:{cart_id} 300
# Fast lookup: inventory → reservations (for availability calculation)
SADD inventory_reservations:{inventory_entry_id} res_001 res_002
EXPIRE inventory_reservations:{inventory_entry_id} 300
```
**Why Redis for reservations:**
- Automatic TTL expiration handles cleanup without background jobs
- Sub-millisecond lookups for availability checks
- SADD operations for atomic reservation tracking
### Database Selection Summary
| Data Type | Store | Rationale |
| ----------------- | --------------------- | ------------------------------------------ |
| Cart (persistent) | PostgreSQL | ACID for merge operations, complex queries |
| Cart (cache) | Redis Cluster | Sub-ms reads, automatic expiration |
| Inventory | PostgreSQL + replicas | Strong consistency writes, scaled reads |
| Reservations | Redis + PostgreSQL | Redis for speed, PG for durability |
| Orders | PostgreSQL | ACID required for financial records |
| Price rules | PostgreSQL + cache | Complex queries, Redis for hot paths |
## Low-Level Design: Cart Merge
Cart merge occurs when a guest user authenticates. The system must combine items from both carts while handling conflicts.
### Merge Algorithm

### Merge Implementation
```typescript collapse={1-15, 45-65}
interface CartItem {
productId: string
variantId: string
quantity: number
reservationId?: string
}
interface MergeResult {
mergedCart: Cart
addedItems: CartItem[]
updatedItems: Array<{ item: CartItem; previousQty: number }>
conflicts: Array<{ guestItem: CartItem; reason: string }>
}
async function mergeGuestCart(
userId: string,
guestToken: string,
strategy: "sum" | "max" | "keep_user" = "sum",
): Promise<MergeResult> {
return await db.transaction(async (tx) => {
// Load both carts with row-level locks
const [userCart, guestCart] = await Promise.all([
tx.query("SELECT * FROM carts WHERE user_id = $1 FOR UPDATE", [userId]),
tx.query("SELECT * FROM carts WHERE guest_token = $1 FOR UPDATE", [guestToken]),
])
if (!guestCart) {
return { mergedCart: userCart, addedItems: [], updatedItems: [], conflicts: [] }
}
const result: MergeResult = {
mergedCart: userCart || (await createUserCart(tx, userId)),
addedItems: [],
updatedItems: [],
conflicts: [],
}
for (const guestItem of guestCart.items) {
const existingItem = result.mergedCart.items.find(
(i) => i.productId === guestItem.productId && i.variantId === guestItem.variantId,
)
if (!existingItem) {
// Transfer item to user cart
await transferItem(tx, guestItem, result.mergedCart.id)
result.addedItems.push(guestItem)
} else {
// Handle conflict based on strategy
const newQty = resolveQuantity(existingItem.quantity, guestItem.quantity, strategy)
const maxAllowed = await getMaxQuantity(guestItem.productId, guestItem.variantId)
if (newQty > maxAllowed) {
result.conflicts.push({
guestItem,
reason: `Quantity capped at ${maxAllowed} (max per order)`,
})
}
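// Compare against the capped value so a no-op update is skipped when the cap equals the current quantity
const cappedQty = Math.min(newQty, maxAllowed)
if (cappedQty !== existingItem.quantity) {
await updateItemQuantity(tx, existingItem.id, cappedQty)
result.updatedItems.push({ item: existingItem, previousQty: existingItem.quantity })
}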
// Release guest item's reservation (will be replaced by user cart's)
if (guestItem.reservationId) {
await releaseReservation(guestItem.reservationId)
}
}
}
// Mark guest cart as merged
await tx.query("UPDATE carts SET status = $1, merged_into_cart_id = $2 WHERE id = $3", [
"merged",
result.mergedCart.id,
guestCart.id,
])
return result
})
}
function resolveQuantity(userQty: number, guestQty: number, strategy: "sum" | "max" | "keep_user"): number {
switch (strategy) {
case "sum":
return userQty + guestQty
case "max":
return Math.max(userQty, guestQty)
case "keep_user":
return userQty
}
}
```
### Merge Edge Cases
| Scenario | Handling |
| ------------------------------- | ------------------------------------------------------------------ |
| Guest item now out of stock | Add to cart with `unavailable` flag; notify user |
| Price changed since guest add | Use current price; show price change notice |
| Guest item discontinued | Add to "saved items" instead; notify user |
| Combined quantity exceeds limit | Cap at limit; show conflict message |
| Guest cart has applied coupon | Validate coupon for user; may not transfer (user-specific coupons) |
## Low-Level Design: Checkout Saga
The checkout process spans multiple services that must coordinate atomically despite having independent databases.
### Saga Orchestration

### Saga State Machine
```typescript collapse={1-10, 40-60}
enum CheckoutState {
INITIATED = "initiated",
CART_VALIDATED = "cart_validated",
PAYMENT_AUTHORIZED = "payment_authorized",
INVENTORY_RESERVED = "inventory_reserved",
ORDER_CREATED = "order_created",
PAYMENT_CAPTURED = "payment_captured",
COMPLETED = "completed",
COMPENSATION_REQUIRED = "compensation_required",
FAILED = "failed",
}
interface CheckoutSaga {
id: string
cartId: string
state: CheckoutState
authorizationId?: string
orderId?: string
failedStep?: string
compensationSteps: string[]
createdAt: Date
updatedAt: Date
}
async function executeCheckoutSaga(checkoutId: string): Promise<Order> {
const saga = await loadSaga(checkoutId)
try {
// Each step is idempotent and checks current state before executing
if (saga.state === CheckoutState.INITIATED) {
await validateCart(saga)
await transitionState(saga, CheckoutState.CART_VALIDATED)
}
if (saga.state === CheckoutState.CART_VALIDATED) {
saga.authorizationId = await authorizePayment(saga)
await transitionState(saga, CheckoutState.PAYMENT_AUTHORIZED)
}
if (saga.state === CheckoutState.PAYMENT_AUTHORIZED) {
await convertReservations(saga)
await transitionState(saga, CheckoutState.INVENTORY_RESERVED)
}
if (saga.state === CheckoutState.INVENTORY_RESERVED) {
saga.orderId = await createOrder(saga)
await transitionState(saga, CheckoutState.ORDER_CREATED)
}
if (saga.state === CheckoutState.ORDER_CREATED) {
await capturePayment(saga)
await transitionState(saga, CheckoutState.PAYMENT_CAPTURED)
}
if (saga.state === CheckoutState.PAYMENT_CAPTURED) {
await clearCart(saga)
await transitionState(saga, CheckoutState.COMPLETED)
}
return await loadOrder(saga.orderId)
} catch (error) {
saga.failedStep = saga.state
await transitionState(saga, CheckoutState.COMPENSATION_REQUIRED)
await executeCompensation(saga)
throw error
}
}
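async function executeCompensation(saga: CheckoutSaga): Promise<void> {
// saga.state is already COMPENSATION_REQUIRED here, so use failedStep (the last state reached).
// String enum values cannot be ordered with >=, so compare positions in declaration order.
const order = Object.values(CheckoutState)
const reached = order.indexOf(saga.failedStep as CheckoutState)
// Compensate in reverse order of completed steps
if (saga.orderId && reached < order.indexOf(CheckoutState.PAYMENT_CAPTURED)) {
await markOrderFailed(saga.orderId)
}
if (reached >= order.indexOf(CheckoutState.INVENTORY_RESERVED)) {
await releaseHardReservations(saga.cartId)
}
if (saga.authorizationId) {
await voidAuthorization(saga.authorizationId)
}
await transitionState(saga, CheckoutState.FAILED)
}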
```
### Idempotency Implementation
```typescript collapse={1-8, 30-45}
interface IdempotencyRecord {
key: string
requestHash: string
response: any
statusCode: number
createdAt: Date
expiresAt: Date
}
async function withIdempotency<T>(
key: string,
request: any,
handler: () => Promise<T>,
): Promise<{ result: T; statusCode: number; cached: boolean }> {
const requestHash = hashRequest(request)
// Check for existing response
const existing = await redis.get(`idempotency:${key}`)
if (existing) {
const record = JSON.parse(existing)
if (record.requestHash !== requestHash) {
// Same key, different request = error
throw new ConflictError("Idempotency key reused with different request")
}
if (record.status === "processing") {
// Only the lock placeholder exists so far; there is no response to replay yet
throw new ConflictError("Request already in progress")
}
return { result: record.response, statusCode: record.statusCode, cached: true }
}
// Lock the key during processing
const lockAcquired = await redis.set(
`idempotency:${key}`,
JSON.stringify({ requestHash, status: "processing" }),
"NX",
"EX",
300, // 5 minute lock
)
if (!lockAcquired) {
// Another request is processing with same key
throw new ConflictError("Request already in progress")
}
try {
const result = await handler()
const record: IdempotencyRecord = {
key,
requestHash,
response: result,
statusCode: 201,
createdAt: new Date(),
expiresAt: new Date(Date.now() + 24 * 60 * 60 * 1000), // 24 hours
}
await redis.set(`idempotency:${key}`, JSON.stringify(record), "EX", 86400)
return { result, statusCode: 201, cached: false }
} catch (error) {
await redis.del(`idempotency:${key}`)
throw error
}
}
```
## Frontend Considerations
### Cart Data Structure
**Naive approach:**
```typescript
// ❌ Array-based: O(n) lookups for quantity updates
interface Cart {
items: CartItem[]
}
```
**Optimized approach:**
```typescript
// ✅ Normalized: O(1) lookups, efficient updates
interface CartState {
items: Record<string, CartItem> // itemId → CartItem
itemOrder: string[] // Maintains display order
summary: CartSummary
appliedCoupons: Coupon[]
reservationTimers: Record<string, string> // itemId → expiresAt (ISO 8601 timestamp)
}
```
**Why normalized:**
- Quantity update: Update single object, no array scan
- Remove item: Delete from `items`, filter `itemOrder`
- Reorder: Modify `itemOrder` only
- React renders: Reference equality checks work correctly
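The points above can be illustrated with two small immutable update helpers. This is a sketch with a trimmed-down state shape; `updateQuantity` and `removeItem` are hypothetical helper names, and the `CartItem` fields shown are assumptions chosen to keep the example short.

```typescript
interface CartItem {
  itemId: string
  quantity: number
  unitPrice: number
}

interface CartState {
  items: Record<string, CartItem>
  itemOrder: string[]
}

// Quantity update: replace one entry by key, no array scan.
function updateQuantity(state: CartState, itemId: string, quantity: number): CartState {
  const item = state.items[itemId]
  if (!item) return state
  return {
    ...state,
    items: { ...state.items, [itemId]: { ...item, quantity } },
  }
}

// Remove item: drop the key from `items` and filter it out of `itemOrder`.
function removeItem(state: CartState, itemId: string): CartState {
  if (!(itemId in state.items)) return state
  const items = { ...state.items }
  delete items[itemId]
  return {
    ...state,
    items,
    itemOrder: state.itemOrder.filter((id) => id !== itemId),
  }
}
```

Untouched items keep their object identity across updates, which is what lets memoized row components skip re-rendering.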
### Optimistic Updates with Rollback
```typescript collapse={1-10, 40-60}
import { useMutation, useQueryClient } from "@tanstack/react-query"
function useAddToCart() {
const queryClient = useQueryClient()
return useMutation({
mutationFn: addItemToCart,
onMutate: async (newItem) => {
// Cancel outgoing refetches
await queryClient.cancelQueries({ queryKey: ["cart"] })
// Snapshot previous state
const previousCart = queryClient.getQueryData(["cart"])
// Optimistically update
queryClient.setQueryData(["cart"], (old: CartState) => ({
...old,
items: {
...old.items,
[newItem.itemId]: {
...newItem,
status: "pending", // Visual indicator
},
},
itemOrder: [...old.itemOrder, newItem.itemId],
summary: recalculateSummary(old, newItem),
}))
return { previousCart }
},
onError: (err, newItem, context) => {
// Rollback on error (context is undefined if onMutate itself threw)
queryClient.setQueryData(["cart"], context?.previousCart)
showToast({
type: "error",
message: err.code === "INSUFFICIENT_STOCK" ? `Only ${err.available} available` : "Failed to add item",
})
},
onSuccess: (data, newItem) => {
// Update with server response (includes reservation info)
queryClient.setQueryData(["cart"], (old: CartState) => ({
...old,
items: {
...old.items,
[newItem.itemId]: {
...data.item,
status: "confirmed",
},
},
summary: data.cartSummary,
}))
},
onSettled: () => {
// Refetch to ensure consistency
queryClient.invalidateQueries({ queryKey: ["cart"] })
},
})
}
```
### Reservation Countdown Timer
```typescript collapse={1-5, 25-35}
import { useEffect, useState } from "react";
function useReservationTimer(expiresAt: string | null) {
const [timeLeft, setTimeLeft] = useState<number | null>(null);
const [isExpired, setIsExpired] = useState(false);
useEffect(() => {
if (!expiresAt) return;
const updateTimer = () => {
const remaining = new Date(expiresAt).getTime() - Date.now();
if (remaining <= 0) {
setIsExpired(true);
setTimeLeft(0);
} else {
setTimeLeft(Math.ceil(remaining / 1000));
}
};
updateTimer();
const interval = setInterval(updateTimer, 1000);
return () => clearInterval(interval);
}, [expiresAt]);
return { timeLeft, isExpired };
}
// Usage in component
function CartItem({ item }: { item: CartItemData }) {
const { timeLeft, isExpired } = useReservationTimer(item.reservationExpiresAt);
return (
<div>
{/* ... item display ... */}
{!isExpired && timeLeft !== null && timeLeft < 300 && (
<p>Reserved for {formatTime(timeLeft)} - complete checkout soon</p>
)}
{isExpired && (
<p>Reservation expired - item may become unavailable</p>
)}
</div>
);
}
```
### Real-Time Price Updates
```typescript collapse={1-10, 30-45}
function useCartPriceSync(cartId: string) {
const queryClient = useQueryClient()
useEffect(() => {
const ws = new WebSocket(`wss://api.example.com/cart/${cartId}/updates`)
ws.onmessage = (event) => {
const update = JSON.parse(event.data)
switch (update.type) {
case "price_change":
queryClient.setQueryData(["cart"], (old: CartState) => {
const item = old.items[update.itemId]
if (!item) return old
const priceDiff = update.newPrice - item.unitPrice
return {
...old,
items: {
...old.items,
[update.itemId]: {
...item,
unitPrice: update.newPrice,
lineTotal: update.newPrice * item.quantity,
priceChanged: priceDiff !== 0,
priceDiff,
},
},
summary: recalculateSummary(old, update),
}
})
break
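case "item_unavailable": {
// Read the item name up front; `old` only exists inside the updater callback
const itemName = queryClient.getQueryData<CartState>(["cart"])?.items[update.itemId]?.name
queryClient.setQueryData(["cart"], (old: CartState) => ({
...old,
items: {
...old.items,
[update.itemId]: {
...old.items[update.itemId],
available: false,
availableQuantity: update.availableQuantity,
},
},
}))
showToast({
type: "warning",
message: `${itemName ?? "An item in your cart"} is now out of stock`,
})
break
}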
case "reservation_expired":
queryClient.invalidateQueries({ queryKey: ["cart"] })
break
}
}
return () => ws.close()
}, [cartId, queryClient])
}
```
## Infrastructure Design
### Cloud-Agnostic Architecture
#### Compute
| Component | Concept | Requirements |
| --------------------- | ------------------------ | ----------------------------------- |
| Cart Service | Stateless API servers | Auto-scaling, health checks |
| Checkout Orchestrator | Stateful workflow engine | Durable execution, retry support |
| Background Workers | Job processors | At-least-once delivery, idempotency |
#### Data Stores
| Data | Concept | Requirements |
| ----------------- | ----------------- | ------------------------------------- |
| Cart (hot) | In-memory cache | Sub-ms reads, TTL support, clustering |
| Cart (persistent) | Relational DB | ACID, complex queries, replication |
| Inventory | Relational DB | Strong consistency, row-level locking |
| Reservations | KV store with TTL | Automatic expiration, high throughput |
| Orders | Relational DB | ACID, audit trail |
#### Messaging
| Use Case | Concept | Requirements |
| ----------------- | ------------- | -------------------------------- |
| Cart events | Message queue | At-least-once, ordering per cart |
| Inventory updates | Event stream | Fan-out to multiple consumers |
| Abandoned cart | Delayed queue | Scheduled delivery |
### AWS Reference Architecture

### AWS Service Mapping
| Component | AWS Service | Configuration |
| ------------------ | ----------------- | ---------------------------------------- |
| Cart API | ECS Fargate | 2-50 tasks, auto-scaling on CPU |
| Cart cache | ElastiCache Redis | r6g.large, cluster mode, 3 shards |
| Cart DB | RDS PostgreSQL | db.r6g.xlarge, Multi-AZ, 2 read replicas |
| Reservations | DynamoDB | On-demand, TTL enabled |
| Checkout saga | Step Functions | Standard workflow (Express workflows cap at 5 minutes; checkout sessions last 30) |
| Event bus | SQS + EventBridge | Standard queue, 14-day retention |
| Background workers | Lambda | 1024MB, 15-min timeout |
| CDN | CloudFront | Price class 100 (US/EU) |
| WAF | AWS WAF | Rate limiting, SQL injection protection |
### Multi-Region Deployment
For high availability during peak events:

**Failover Strategy:**
- Route 53 health checks detect primary failure
- Global Accelerator reroutes traffic to secondary
- RDS replica promoted to primary (RPO: ~1 minute)
- Redis cache rebuilt from database (acceptable for carts)
### Self-Hosted Alternatives
| Managed Service | Self-Hosted Option | When to Self-Host |
| --------------- | ------------------------ | ---------------------------------------- |
| ElastiCache | Redis Cluster on EC2 | Specific modules (RedisGraph, RedisJSON) |
| RDS PostgreSQL | PostgreSQL on EC2 | Cost at scale, specific extensions |
| DynamoDB | ScyllaDB / Cassandra | Multi-cloud, cost optimization |
| Step Functions | Temporal.io | Complex workflows, long-running sagas |
| SQS | RabbitMQ / Redis Streams | Specific routing needs |
## Conclusion
This shopping cart design prioritizes **availability and user experience** over perfect consistency, accepting that:
1. **Eventual consistency for display** is acceptable when strong consistency is enforced at checkout
2. **Soft reservations with expiry** prevent inventory lock-up while providing reasonable purchase assurance
3. **Saga orchestration** provides reliable distributed transactions with clear compensation paths
4. **Multi-tier caching** delivers sub-millisecond cart reads while maintaining durability
**Key tradeoffs accepted:**
- Occasional checkout failures when reservations expire (mitigated by clear countdown UX)
- Rare overselling on flash sales (handled via backorder/compensation)
- Higher operational complexity from distributed architecture (justified by scale requirements)
**What this design does NOT address:**
- Multi-currency pricing (requires additional currency service)
- Subscription/recurring purchases (different cart model)
- B2B bulk ordering (different quantity/pricing rules)
- Marketplace multi-seller carts (complex checkout splitting)
## Appendix
### Prerequisites
- Distributed systems fundamentals (CAP theorem, eventual consistency)
- Database concepts (ACID, sharding, replication)
- API design principles (REST, idempotency)
- Basic understanding of payment processing flows
### Terminology
| Term | Definition |
| ----------------------------------- | ---------------------------------------------------------------------- |
| **AFS (Available For Sale)** | Physical inventory minus hard reservations |
| **AFR (Available For Reservation)** | AFS minus soft reservations |
| **Soft Reservation** | Temporary inventory hold with automatic expiry (typically 5 minutes) |
| **Hard Reservation** | Committed inventory allocation after payment confirmation |
| **Saga** | Pattern for distributed transactions using compensating actions |
| **Idempotency Key** | Client-generated UUID ensuring duplicate requests return same response |
| **Cart Merge** | Process of combining guest cart with authenticated user's cart |
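The AFS and AFR definitions reduce to two subtractions. The sketch below assumes hard and soft reservation counts are tracked separately; the `InventoryCounts` shape and function names are hypothetical and used only to make the arithmetic concrete.

```typescript
interface InventoryCounts {
  onHand: number // physical inventory
  hardReserved: number // committed to orders
  softReserved: number // temporary cart holds
}

// AFS: physical inventory minus hard reservations.
function availableForSale(counts: InventoryCounts): number {
  return counts.onHand - counts.hardReserved
}

// AFR: AFS minus soft reservations, floored at zero.
function availableForReservation(counts: InventoryCounts): number {
  return Math.max(0, availableForSale(counts) - counts.softReserved)
}

const exampleCounts: InventoryCounts = { onHand: 100, hardReserved: 20, softReserved: 15 }
const exampleAfs = availableForSale(exampleCounts)
const exampleAfr = availableForReservation(exampleCounts)
```

For 100 units on hand with 20 hard-reserved and 15 soft-reserved, AFS is 80 and AFR is 65.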
### Summary
- **Two-tier storage** (Redis + PostgreSQL) balances latency and durability for cart operations
- **Soft/hard reservation model** prevents inventory lock-up while providing checkout assurance
- **Saga orchestration** with compensation ensures reliable multi-service checkout
- **Eventually consistent inventory reads** with strongly consistent checkout enables scale
- **Normalized frontend state** with optimistic updates delivers responsive UX
- **Multi-region deployment** with failover supports the 99.99% availability target
### References
- [Shopify Engineering: BFCM Readiness](https://shopify.engineering/bfcm-readiness-2025) - How Shopify handles 80K+ checkouts/minute
- [Microservices.io: Saga Pattern](https://microservices.io/patterns/data/saga.html) - Distributed transaction patterns
- [AWS: Serverless E-commerce Architecture](https://aws.amazon.com/blogs/architecture/architecting-a-highly-available-serverless-microservices-based-ecommerce-site/) - Reference implementation
- [InfoQ: Shopify Flash Sale Architecture](https://www.infoq.com/presentations/shopify-architecture-flash-sale/) - Pod-based scaling
- [Modern Treasury: Idempotency in Payments](https://www.moderntreasury.com/journal/why-idempotency-matters-in-payments) - Payment idempotency patterns
- [Baymard Institute: Checkout Usability](https://baymard.com/research/checkout-usability) - UX research with 70% abandonment baseline
- [Microsoft: Soft Reservation Capabilities](https://learn.microsoft.com/en-us/dynamics365/supply-chain/inventory/inventory-visibility-reservations) - Inventory reservation patterns
- [Queue-it: Overselling Prevention](https://queue-it.com/blog/overselling/) - Flash sale inventory challenges
---
## Design a URL Shortener: IDs, Storage, and Scale
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/system-design-problems/url-shortener-design
**Category:** System Design / System Design Problems
**Description:** A comprehensive system design for a URL shortening service covering ID generation strategies, database sharding, caching layers, analytics pipelines, and abuse prevention. This design addresses sub-50ms redirect latency at Bitly scale (6 billion clicks/month) with 99.99% availability and real-time click tracking.
# Design a URL Shortener: IDs, Storage, and Scale
A comprehensive system design for a URL shortening service covering ID generation strategies, database sharding, caching layers, analytics pipelines, and abuse prevention. This design addresses sub-50ms redirect latency at Bitly scale (6 billion clicks/month) with 99.99% availability and real-time click tracking.

High-level architecture: CDN handles cached redirects, core services manage shortening and redirection, analytics collected asynchronously to avoid blocking redirects.
## Abstract
URL shorteners solve a deceptively simple problem—mapping short codes to long URLs—but at scale, the design choices compound: ID generation must be collision-free across distributed nodes, redirects must complete in milliseconds globally, and analytics must not block the critical path.
**Core architectural decisions:**
| Decision | Choice | Rationale |
| ------------- | ------------------------------------ | ----------------------------------------------------------- |
| ID generation | Snowflake + Base62 | Decentralized, time-ordered, no coordination overhead |
| Storage | NoSQL key-value (Cassandra/DynamoDB) | O(1) lookups, horizontal scaling, 100:1 read-to-write ratio |
| Caching | Multi-tier (CDN → Redis → DB) | Sub-50ms global latency, hot-key protection |
| Redirect type | 302 (Temporary) with CDN caching | Enables click tracking while CDN absorbs traffic |
| Analytics | Async pipeline (Kafka → ClickHouse) | Never blocks redirect path |
| Sharding | Consistent hashing on short code | Minimal data movement on cluster changes |
**Key trade-offs accepted:**
- 302 redirects increase server load vs. 301, but enable accurate analytics
- Pre-generated keys (KGS) require storage overhead but guarantee no collisions
- Multi-tier caching adds complexity but essential for viral link handling
- Async analytics means real-time dashboards have 1-5 second delay
**What this design optimizes:**
- Sub-50ms redirect latency globally via CDN edge caching
- 10,000+ redirects/second per node with Redis caching
- Zero collision guarantee via Key Generation Service
- Real-time abuse detection without blocking legitimate traffic
## Requirements
### Functional Requirements
| Requirement | Priority | Notes |
| ------------------- | -------- | ----------------------------------------------- |
| Shorten long URLs | Core | Generate unique short code for any valid URL |
| Redirect short URLs | Core | 301/302 redirect to original URL |
| Custom short codes | Core | User-specified aliases (e.g., `suj.ee/my-link`) |
| Link expiration | Core | TTL-based or click-limit expiration |
| Click analytics | Core | Count, geo, device, referrer tracking |
| Link management | Extended | Edit destination, enable/disable links |
| Bulk shortening | Extended | API for batch URL processing |
| QR code generation | Extended | QR codes for shortened URLs |
### Non-Functional Requirements
| Requirement | Target | Rationale |
| ------------------- | ------------------------ | ----------------------------------------------------- |
| Availability | 99.99% (4 nines) | Links embedded everywhere; downtime = broken internet |
| Redirect latency | p99 < 50ms | User experience, SEO impact |
| Write latency | p99 < 200ms | Acceptable for link creation |
| Throughput (reads) | 100K RPS | Peak viral traffic handling |
| Throughput (writes) | 1K RPS | Link creation is infrequent |
| Data durability | 99.999999999% (11 nines) | Links must never be lost |
| URL lifetime | 5+ years default | Permalinks for content |
### Scale Estimation
**Traffic patterns:**
- Read-to-write ratio: 100:1 (redirects dominate)
- Viral amplification: Single link can generate 80% of daily traffic in minutes
**Users and URLs:**
- Monthly Active Users: 10M
- New URLs/day: 1M
- Total URLs (5 years): 1M × 365 × 5 = 1.825B URLs
**Traffic:**
- Redirects/day: 1M URLs × 100 clicks avg = 100M redirects/day
- Average RPS: 100M / 86400 ≈ 1,157 RPS
- Peak multiplier (10x): 11,570 RPS
- Viral spike (100x single link): 100K+ RPS burst
**Storage:**
- URL record: ~500 bytes (short_code, long_url, metadata, timestamps)
- Daily growth: 1M × 500B = 500MB/day
- 5-year storage: 500MB × 365 × 5 ≈ 912GB URLs
- Analytics (per click): ~200 bytes
- Analytics daily: 100M × 200B = 20GB/day
- Analytics 1-year retention: 7.3TB
**Key insight:** Storage is manageable; the challenge is latency at the read path and handling viral traffic spikes.
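The estimates above are plain arithmetic and can be reproduced in a few lines, which makes it easy to rerun them with different assumptions. The constant names are illustrative; the input values are the ones stated in this section.

```typescript
const SECONDS_PER_DAY = 86400

// Inputs stated in the estimation above.
const newUrlsPerDay = 1000000
const clicksPerUrl = 100
const urlRecordBytes = 500
const clickRecordBytes = 200
const retentionYears = 5

// URLs and traffic
const totalUrls = newUrlsPerDay * 365 * retentionYears
const redirectsPerDay = newUrlsPerDay * clicksPerUrl
const averageRps = redirectsPerDay / SECONDS_PER_DAY
const peakRps = averageRps * 10

// Storage, using decimal units (1 GB = 1e9 bytes, 1 TB = 1e12 bytes)
const urlStorageGb = (totalUrls * urlRecordBytes) / 1e9
const analyticsDailyGb = (redirectsPerDay * clickRecordBytes) / 1e9
const analyticsYearlyTb = (analyticsDailyGb * 365) / 1000

console.log({
  totalUrls,
  averageRps: Math.round(averageRps),
  peakRps: Math.round(peakRps),
  urlStorageGb,
  analyticsDailyGb,
  analyticsYearlyTb,
})
```

Changing `clicksPerUrl` or `retentionYears` shows quickly that storage stays in the low terabytes while request rate is the figure that moves.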
## Design Paths
### Path A: Counter-Based ID Generation
**Best when:**
- Single datacenter deployment
- Moderate scale (< 10M URLs)
- Simplicity is paramount
- Sequential IDs acceptable
**Architecture:**

**Key characteristics:**
- Auto-increment database ID
- Base62 encode the numeric ID
- Single source of truth for ID generation
**Trade-offs:**
- ✅ Simplest implementation
- ✅ Guaranteed unique (database constraint)
- ✅ Short codes (sequential = compact)
- ❌ Single point of failure (database)
- ❌ Predictable URLs (security concern)
- ❌ Doesn't scale horizontally
**Real-world example:** Early TinyURL used this approach before scaling challenges emerged.
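Base62-encoding an auto-increment ID can be sketched as follows. The alphabet ordering (digits, then uppercase, then lowercase) is an assumption; any fixed ordering of the 62 characters works as long as encoder and decoder agree. `encodeBase62` and `decodeBase62` are illustrative names.

```typescript
const BASE62_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// Encode a non-negative integer ID (for example a database sequence value).
function encodeBase62(id: number): string {
  if (!Number.isSafeInteger(id) || id < 0) {
    throw new RangeError("id must be a non-negative safe integer")
  }
  if (id === 0) return BASE62_ALPHABET[0]
  let remaining = id
  let code = ""
  while (remaining > 0) {
    // Prepend the least significant base-62 digit on each pass.
    code = BASE62_ALPHABET[remaining % 62] + code
    remaining = Math.floor(remaining / 62)
  }
  return code
}

// Decode a short code back to the numeric ID used for the lookup.
function decodeBase62(code: string): number {
  let id = 0
  for (const ch of code.split("")) {
    const digit = BASE62_ALPHABET.indexOf(ch)
    if (digit === -1) throw new RangeError(`invalid Base62 character: ${ch}`)
    id = id * 62 + digit
  }
  return id
}

console.log(encodeBase62(1000000)) // "4C92"
```

Because IDs are sequential, so are the codes: the neighbor of `4C92` is `4C93`, which is the predictability concern listed in the trade-offs.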
### Path B: Hash-Based Generation
**Best when:**
- Idempotent shortening required (same URL → same short code)
- Deduplication is important
- Moderate collision risk acceptable
**Architecture:**

**Key characteristics:**
- Hash the long URL
- Truncate hash to desired length
- Handle collisions with rehashing or appending
**Trade-offs:**
- ✅ Same URL always produces same short code
- ✅ No central coordinator needed
- ✅ Natural deduplication
- ❌ Collision probability increases with scale
- ❌ Collision handling adds latency
- ❌ Cannot support custom codes easily
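A minimal sketch of the hash, truncate, and rehash-on-collision flow described above. It assumes SHA-256 as the hash and an in-memory `Map` standing in for the URL table; `hashToCode` and `shortenByHash` are hypothetical names, and a real service would need an atomic insert-if-absent in the database.

```typescript
import { createHash } from "node:crypto"

const CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
const CODE_LENGTH = 7

// Derive a fixed-length Base62 code from the SHA-256 digest of the input.
function hashToCode(input: string): string {
  const digest = createHash("sha256").update(input).digest()
  let code = ""
  for (let i = 0; i < CODE_LENGTH; i++) {
    // One digest byte per character. Modulo 62 is slightly biased,
    // which is acceptable for a sketch.
    code += CHARSET[digest[i] % 62]
  }
  return code
}

// `store` stands in for the URL-mapping table (short code → long URL).
function shortenByHash(longUrl: string, store: Map<string, string>, maxAttempts = 5): string {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Attempt 0 hashes the URL itself; later attempts append a counter (rehash).
    const candidate = hashToCode(attempt === 0 ? longUrl : `${longUrl}#${attempt}`)
    const existing = store.get(candidate)
    if (existing === undefined) {
      store.set(candidate, longUrl)
      return candidate
    }
    if (existing === longUrl) return candidate // Idempotent: same URL, same code
  }
  throw new Error(`No free code found after ${maxAttempts} attempts`)
}
```

The retry loop is where the extra latency listed in the trade-offs comes from: every collision costs one more lookup.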
**Collision math (Birthday Paradox):**
- 7-character Base62: 62^7 = 3.5 trillion combinations
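- With 1 billion codes stored, a new random code collides with probability 10^9 / 3.5 × 10^12 ≈ 0.028%
- With 10 billion codes stored: ≈ 0.28%
- The chance that at least one pair collides somewhere in the table reaches 50% after only ~2.2 million codes, so collision handling is mandatory rather than an edge case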
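Two different questions hide behind "collision probability": how likely is the next insert to collide, and how likely is it that any pair has collided so far. A minimal sketch of both, assuming codes are uniformly random over the keyspace (function names are illustrative):

```typescript
const KEYSPACE_7 = 62 ** 7 // 3,521,614,606,208 possible 7-character codes

// Probability that one new random code hits any of `existing` stored codes.
function perInsertCollisionProbability(existing: number, keyspace: number): number {
  return existing / keyspace
}

// Birthday approximation: probability that at least one pair among `n`
// random codes collides, 1 - exp(-n(n-1) / (2N)).
function anyCollisionProbability(n: number, keyspace: number): number {
  return 1 - Math.exp((-n * (n - 1)) / (2 * keyspace))
}

// Number of codes at which the chance of at least one collision reaches `p`.
function codesForCollisionProbability(p: number, keyspace: number): number {
  return Math.sqrt(2 * keyspace * Math.log(1 / (1 - p)))
}

console.log(perInsertCollisionProbability(1e9, KEYSPACE_7))
console.log(codesForCollisionProbability(0.5, KEYSPACE_7))
```

The per-insert figure drives the expected retry rate, while the birthday figure shows why a collision check cannot be skipped even for a small table.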
### Path C: Distributed ID Generation (Snowflake + Base62)
**Best when:**
- Multi-datacenter deployment
- High scale (billions of URLs)
- Time-ordered IDs beneficial
- No central coordinator acceptable
**Architecture:**

**Snowflake ID structure (64-bit):**
| Bits | Field | Purpose |
| ---- | --------- | ----------------------------------- |
| 41 | Timestamp | Milliseconds since epoch (69 years) |
| 10 | Node ID | 1024 unique nodes |
| 12 | Sequence | 4096 IDs/ms/node |
**Trade-offs:**
- ✅ No coordination between nodes
- ✅ Time-ordered (useful for analytics)
- ✅ 4M IDs/second/node capacity
- ❌ Longer codes (64-bit → 11 Base62 chars)
- ❌ Clock synchronization required
- ❌ Node ID management overhead
**Real-world example:** Twitter uses Snowflake for tweet IDs; Discord adopted it for message IDs.
### Path D: Pre-Generated Key Service (KGS)
**Best when:**
- Guaranteed zero collisions required
- Predictable short code length needed
- Can tolerate key pre-generation overhead
**Architecture:**

**Key characteristics:**
- Offline process generates all possible short codes
- Application servers fetch batches of unused keys
- Atomic move from unused → used on assignment
**Trade-offs:**
- ✅ Zero collision guarantee
- ✅ O(1) key retrieval
- ✅ Predictable key length
- ❌ Storage for pre-generated keys (~341GB for all 56.8B 6-char Base62 keys at 6 bytes each, before indexes)
- ❌ KGS becomes critical dependency
- ❌ Key exhaustion planning required
**Real-world example:** Production URL shorteners at scale often use KGS for reliability.
### Path Comparison
| Factor | Counter | Hash | Snowflake | KGS |
| ---------------- | -------- | ------ | --------- | ------------ |
| Collision risk | None | Medium | None | None |
| Coordination | Required | None | None | Batch fetch |
| Code length | Shortest | Fixed | 11 chars | Configurable |
| Predictability | High | Low | Medium | Low |
| Horizontal scale | Poor | Good | Excellent | Good |
| Complexity | Low | Medium | Medium | High |
### This Article's Focus
This article focuses on **Path D (KGS) + Snowflake hybrid** because:
1. Zero collision guarantee is critical for production reliability
2. Enables both random (KGS) and time-ordered (Snowflake) IDs
3. Proven at Bitly scale (6B clicks/month)
4. Supports custom short codes naturally
## High-Level Design
### Component Overview

### Shortening Service
Receives long URLs, validates, scans for malware, assigns short codes, and persists mappings.
**Responsibilities:**
- URL validation (scheme, format, reachability)
- Malware/phishing scanning
- Short code assignment (KGS or custom)
- Duplicate detection (optional)
- Metadata extraction (title, favicon)
**Design decisions:**
| Decision | Choice | Rationale |
| ------------------ | ------------------------------------------ | ----------------------------------------- |
| Duplicate handling | Optional dedup | Some users want unique links for tracking |
| URL validation | Async HEAD request | Don't block on slow destinations |
| Scanning | Sync for new domains, async for known-good | Balance security with latency |
| Custom codes | Reserve from KGS pool | Prevents collision with random codes |
### Redirect Service
The most critical service: it handles ~99% of traffic (100:1 read-to-write ratio) and must meet the p99 < 50ms redirect target.
**Flow:**

**Critical optimizations:**
- Redis stores hot URLs (LRU eviction)
- Bloom filter rejects lookups for non-existent codes before they reach Redis or the database (cache penetration protection)
- Connection pooling to database (PGBouncer pattern)
- Analytics logging is fire-and-forget (async)
### Key Generation Service (KGS)
Pre-generates unique short codes for zero-collision guarantee.
**Architecture:**

**Key allocation strategy:**
1. Each app server requests batch of 1000 keys
2. Keys moved atomically to "allocated" state
3. App server caches keys in memory
4. On use, key marked as "used"
5. Unused allocated keys returned on graceful shutdown
**Failure handling:**
- App server crash: Allocated keys orphaned (acceptable loss at scale)
- KGS unavailable: App servers have local cache buffer
- Key exhaustion: Alert at 80% usage, generate new batch
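The allocation and failure-handling steps above can be sketched as a small in-memory key buffer on each app server. This is an illustration under stated assumptions: `FetchBatch` is a hypothetical transport to the KGS, and the `lowWatermark` refill threshold is not specified in the design.

```typescript
// Hypothetical transport to the KGS (an RPC or a call to a batch-allocation query).
type FetchBatch = (batchSize: number) => Promise<string[]>

class LocalKeyCache {
  private keys: string[] = []
  private refill: Promise<void> | null = null

  constructor(
    private readonly fetchBatch: FetchBatch,
    private readonly batchSize = 1000,
    private readonly lowWatermark = 200, // assumed refill threshold
  ) {}

  async nextKey(): Promise<string> {
    if (this.keys.length === 0) {
      // Buffer empty: wait for the KGS; a failure propagates to the caller.
      await this.startRefill()
    } else if (this.keys.length <= this.lowWatermark) {
      // Background top-up; a KGS outage is tolerated while the buffer lasts.
      this.startRefill().catch(() => {})
    }
    const key = this.keys.pop()
    if (key === undefined) throw new Error("KGS returned an empty batch")
    return key
  }

  // On graceful shutdown, hand unused keys back to the KGS.
  drain(): string[] {
    return this.keys.splice(0)
  }

  private startRefill(): Promise<void> {
    if (this.refill) return this.refill // Only one fetch in flight at a time
    const pending = this.fetchBatch(this.batchSize)
      .then((batch) => {
        this.keys.push(...batch)
      })
      .finally(() => {
        this.refill = null
      })
    this.refill = pending
    return pending
  }
}
```

Keys still in the buffer when a process crashes are simply lost, which matches the "orphaned keys" trade-off listed above.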
### Analytics Collector
Captures click data without blocking redirects.
**Click event structure:**
```typescript
interface ClickEvent {
shortCode: string
timestamp: number
// Client info
ipHash: string // Salted hash (pseudonymized to reduce GDPR exposure)
userAgent: string
referer: string | null
// Derived fields
country: string
city: string
deviceType: "mobile" | "desktop" | "tablet"
browser: string
os: string
// Bot detection
isBot: boolean
botType: string | null
}
```
**Pipeline:**
1. Redirect service sends event to Kafka (fire-and-forget)
2. Stream processor enriches (geo-IP, device parsing, bot detection)
3. Aggregated into ClickHouse for querying
4. Real-time counters updated in Redis (for API responses)
## API Design
### Create Short URL
**Endpoint:** `POST /api/v1/urls`
**Request:**
```json
{
"url": "https://example.com/very/long/path?with=params",
"customCode": "my-link",
"expiresAt": "2025-12-31T23:59:59Z",
"password": "optional-password",
"maxClicks": 1000,
"tags": ["campaign-2024", "social"]
}
```
**Response (201 Created):**
```json
{
"id": "url_abc123def456",
"shortCode": "my-link",
"shortUrl": "https://suj.ee/my-link",
"longUrl": "https://example.com/very/long/path?with=params",
"createdAt": "2024-02-03T10:00:00Z",
"expiresAt": "2025-12-31T23:59:59Z",
"isPasswordProtected": true,
"maxClicks": 1000,
"clickCount": 0,
"qrCode": "https://suj.ee/api/v1/urls/url_abc123def456/qr"
}
```
**Error Responses:**
| Code | Error | When |
| ---- | --------------------- | -------------------------------- |
| 400 | `INVALID_URL` | Malformed or unreachable URL |
| 400 | `INVALID_CUSTOM_CODE` | Code contains invalid characters |
| 409 | `CODE_TAKEN` | Custom code already exists |
| 403 | `URL_BLOCKED` | Destination flagged as malicious |
| 429 | `RATE_LIMITED` | Too many requests |
**Rate Limits:**
| Plan | Create/hour | Create/day |
| ---------- | ----------- | ---------- |
| Free | 50 | 500 |
| Pro | 500 | 5,000 |
| Enterprise | 5,000 | Unlimited |
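A minimal sketch of the synchronous checks behind `INVALID_URL` and `INVALID_CUSTOM_CODE`. The custom-code pattern (3 to 32 letters, digits, hyphens, or underscores) is an assumption for illustration; the API above does not specify the allowed character set. Reachability and malware checks are out of scope here.

```typescript
type CreateError = "INVALID_URL" | "INVALID_CUSTOM_CODE"

// Assumed format: 3-32 characters from letters, digits, hyphen, underscore.
const CUSTOM_CODE_PATTERN = /^[A-Za-z0-9_-]{3,32}$/

function validateCreateRequest(body: { url?: unknown; customCode?: unknown }): CreateError | null {
  if (typeof body.url !== "string") return "INVALID_URL"
  let parsed: URL
  try {
    parsed = new URL(body.url)
  } catch {
    return "INVALID_URL" // Not parseable as an absolute URL
  }
  // Only web schemes are accepted; this rejects javascript:, data:, file:, etc.
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") return "INVALID_URL"
  if (body.customCode !== undefined) {
    if (typeof body.customCode !== "string" || !CUSTOM_CODE_PATTERN.test(body.customCode)) {
      return "INVALID_CUSTOM_CODE"
    }
  }
  return null // Request passes the format checks
}
```

Running these checks before any I/O keeps malformed requests from consuming scanner or KGS capacity.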
### Redirect (Read)
**Endpoint:** `GET /{shortCode}`
**Response:** `302 Found` with `Location` header
**Headers:**
```http
HTTP/1.1 302 Found
Location: https://example.com/very/long/path
Cache-Control: private, max-age=60
X-Robots-Tag: noindex
```
**Error Responses:**
| Code | When |
| ---- | ------------------------------------- |
| 404 | Short code not found |
| 410 | Link expired or disabled |
| 429 | Click limit exceeded |
| 403 | Password required (returns HTML form) |
### Get URL Analytics
**Endpoint:** `GET /api/v1/urls/{id}/analytics`
**Query Parameters:**
| Param | Type | Default | Description |
| --------- | ------- | ------- | -------------------------------------- |
| period | string | 7d | Time range (24h, 7d, 30d, 90d, custom) |
| startDate | ISO8601 | - | Custom range start |
| endDate | ISO8601 | - | Custom range end |
| groupBy | string | day | Aggregation (hour, day, week, month) |
**Response:**
```json
{
"urlId": "url_abc123def456",
"period": {
"start": "2024-01-27T00:00:00Z",
"end": "2024-02-03T23:59:59Z"
},
"summary": {
"totalClicks": 15420,
"uniqueClicks": 12350,
"botClicks": 1230
},
"timeSeries": [
{ "date": "2024-01-27", "clicks": 2100, "unique": 1800 },
{ "date": "2024-01-28", "clicks": 2450, "unique": 2100 }
],
"topReferrers": [
{ "referrer": "twitter.com", "clicks": 5200, "percentage": 33.7 },
{ "referrer": "facebook.com", "clicks": 3100, "percentage": 20.1 }
],
"topCountries": [
{ "country": "US", "clicks": 6800, "percentage": 44.1 },
{ "country": "UK", "clicks": 2300, "percentage": 14.9 }
],
"devices": {
"mobile": { "clicks": 9200, "percentage": 59.7 },
"desktop": { "clicks": 5800, "percentage": 37.6 },
"tablet": { "clicks": 420, "percentage": 2.7 }
}
}
```
### Bulk Create URLs
**Endpoint:** `POST /api/v1/urls/bulk`
**Request:**
```json
{
"urls": [{ "url": "https://example.com/page1" }, { "url": "https://example.com/page2", "customCode": "page2" }],
"defaultExpiry": "2025-12-31T23:59:59Z",
"tags": ["bulk-import"]
}
```
**Response (202 Accepted):**
```json
{
"jobId": "job_xyz789",
"status": "processing",
"totalUrls": 2,
"statusUrl": "/api/v1/jobs/job_xyz789"
}
```
### List User URLs
**Endpoint:** `GET /api/v1/urls?cursor=xxx&limit=50`
**Response:**
```json
{
"urls": [
{
"id": "url_abc123",
"shortCode": "abc123",
"shortUrl": "https://suj.ee/abc123",
"longUrl": "https://example.com/...",
"clickCount": 1542,
"createdAt": "2024-01-15T10:00:00Z",
"status": "active"
}
],
"pagination": {
"nextCursor": "cursor_def456",
"hasMore": true,
"total": 234
}
}
```
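One way to implement the opaque `cursor` parameter is to encode the sort key of the last row returned. The sketch below assumes rows are ordered newest-first by `createdAt` and then `shortCode`, and uses an in-memory array in place of the database; the cursor format is an assumption, not the API's actual encoding.

```typescript
interface PageCursor {
  createdAt: string
  shortCode: string
}

function encodeCursor(cursor: PageCursor): string {
  return Buffer.from(JSON.stringify(cursor), "utf8").toString("base64url")
}

function decodeCursor(token: string): PageCursor {
  const parsed = JSON.parse(Buffer.from(token, "base64url").toString("utf8"))
  if (typeof parsed?.createdAt !== "string" || typeof parsed?.shortCode !== "string") {
    throw new Error("Invalid cursor")
  }
  return { createdAt: parsed.createdAt, shortCode: parsed.shortCode }
}

// `sortedDesc` stands in for a query result ordered newest-first.
function paginate<T extends PageCursor>(sortedDesc: T[], limit: number, token?: string) {
  let start = 0
  if (token) {
    const c = decodeCursor(token)
    // Resume strictly after the last row of the previous page.
    // An unknown cursor yields -1 + 1 = 0, i.e. the listing restarts.
    start = sortedDesc.findIndex((r) => r.createdAt === c.createdAt && r.shortCode === c.shortCode) + 1
  }
  const items = sortedDesc.slice(start, start + limit)
  const hasMore = start + limit < sortedDesc.length
  const last = items[items.length - 1]
  const nextCursor = hasMore && last ? encodeCursor({ createdAt: last.createdAt, shortCode: last.shortCode }) : null
  return { items, hasMore, nextCursor }
}
```

Against a real store the `findIndex` scan would become a range condition on the sort key, so page cost does not grow with offset.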
## Data Modeling
### URL Mapping (Cassandra)
**Primary table optimized for redirect lookups:**
```sql
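CREATE TABLE url_mappings (
short_code TEXT,
long_url TEXT,
user_id UUID,
created_at TIMESTAMP,
expires_at TIMESTAMP,
is_active BOOLEAN,
password_hash TEXT,
max_clicks INT,
metadata MAP<TEXT, TEXT>,
PRIMARY KEY (short_code)
) WITH default_time_to_live = 157680000 -- 5 years
AND compaction = {'class': 'LeveledCompactionStrategy'}
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'};
-- Cassandra does not allow COUNTER columns alongside regular columns,
-- and counter tables cannot use TTLs, so click counts live in their own table.
CREATE TABLE url_click_counts (
short_code TEXT PRIMARY KEY,
click_count COUNTER
);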
```
**Secondary table for user queries:**
```sql
CREATE TABLE urls_by_user (
user_id UUID,
created_at TIMESTAMP,
short_code TEXT,
long_url TEXT,
click_count BIGINT,
is_active BOOLEAN,
PRIMARY KEY ((user_id), created_at, short_code)
) WITH CLUSTERING ORDER BY (created_at DESC);
```
**Why Cassandra:**
- O(1) lookups by short_code (partition key)
- Horizontal scaling for billions of URLs
- Tunable consistency (ONE for reads, QUORUM for writes)
- Built-in TTL for expiration
- 100:1 read-to-write ratio matches Cassandra strengths
### Click Analytics (ClickHouse)
```sql
CREATE TABLE clicks (
short_code String,
clicked_at DateTime64(3),
-- Client data
ip_hash FixedString(32), -- 32 hex characters (truncated salted SHA-256)
country LowCardinality(String),
city String,
-- Device data
device_type Enum8('mobile' = 1, 'desktop' = 2, 'tablet' = 3),
browser LowCardinality(String),
os LowCardinality(String),
-- Referrer
referrer_domain LowCardinality(String),
referrer_path String,
-- Bot detection
is_bot UInt8,
bot_type LowCardinality(String),
-- Aggregation keys
date Date MATERIALIZED toDate(clicked_at),
hour UInt8 MATERIALIZED toHour(clicked_at)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(clicked_at)
ORDER BY (short_code, clicked_at)
TTL clicked_at + INTERVAL 1 YEAR;
-- Materialized view for real-time aggregates.
-- Unique counts are not additive across parts, so store aggregate states
-- and read them back with countMerge(clicks) / uniqExactMerge(unique_clicks).
CREATE MATERIALIZED VIEW clicks_daily_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (short_code, date, country, device_type)
AS SELECT
short_code,
date,
country,
device_type,
countState() AS clicks,
uniqExactState(ip_hash) AS unique_clicks
FROM clicks
GROUP BY short_code, date, country, device_type;
```
**Why ClickHouse:**
- Columnar storage for analytics queries
- 10-100x compression on click data
- Sub-second aggregations on billions of rows
- Materialized views for real-time dashboards
### User and Configuration (PostgreSQL)
```sql
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email TEXT UNIQUE NOT NULL,
password_hash TEXT NOT NULL,
plan TEXT DEFAULT 'free',
api_key_hash TEXT UNIQUE,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE custom_domains (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id),
domain TEXT UNIQUE NOT NULL,
is_verified BOOLEAN DEFAULT false,
ssl_status TEXT DEFAULT 'pending',
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE api_keys (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id),
key_hash TEXT UNIQUE NOT NULL,
name TEXT,
permissions JSONB DEFAULT '["read", "write"]',
last_used_at TIMESTAMPTZ,
expires_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- users.api_key_hash is already indexed by its UNIQUE constraint
CREATE INDEX idx_api_keys_user ON api_keys(user_id);
CREATE INDEX idx_domains_user ON custom_domains(user_id);
```
### Key Generation Service (PostgreSQL)
```sql
CREATE TABLE keys_unused (
short_code TEXT PRIMARY KEY,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE keys_allocated (
short_code TEXT PRIMARY KEY,
allocated_to TEXT NOT NULL, -- Server instance ID
allocated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE keys_used (
short_code TEXT PRIMARY KEY,
used_at TIMESTAMPTZ DEFAULT NOW()
);
-- Batch allocation function
CREATE OR REPLACE FUNCTION allocate_keys(
server_id TEXT,
batch_size INT
) RETURNS TABLE(short_code TEXT) AS $$
BEGIN
RETURN QUERY
WITH allocated AS (
DELETE FROM keys_unused
-- Qualified: a bare short_code is ambiguous with the function's output column
WHERE keys_unused.short_code IN (
SELECT ku.short_code
FROM keys_unused ku
LIMIT batch_size
FOR UPDATE SKIP LOCKED
)
RETURNING keys_unused.short_code
)
INSERT INTO keys_allocated (short_code, allocated_to)
SELECT a.short_code, server_id
FROM allocated a
RETURNING keys_allocated.short_code;
END;
$$ LANGUAGE plpgsql;
```
### Database Selection Matrix
| Data Type | Store | Rationale |
| -------------- | ---------- | ---------------------------------------------------- |
| URL mappings | Cassandra | O(1) lookups, horizontal scale, high read throughput |
| Click events | ClickHouse | Columnar analytics, compression, fast aggregations |
| User accounts | PostgreSQL | ACID transactions, relational queries |
| KGS keys | PostgreSQL | Transactional batch allocation |
| Hot URLs cache | Redis | Sub-ms latency, TTL support |
| Rate limits | Redis | Atomic counters, sliding windows |
| Bloom filter | Redis | Memory-efficient existence checks |
## Low-Level Design
### Base62 Encoder
```typescript collapse={1-5}
const CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
const BASE = BigInt(62)
export function encodeBase62(num: bigint): string {
if (num === 0n) return CHARSET[0]
let result = ""
while (num > 0n) {
result = CHARSET[Number(num % BASE)] + result
num = num / BASE
}
return result
}
export function decodeBase62(str: string): bigint {
let result = 0n
for (const char of str) {
const index = CHARSET.indexOf(char)
if (index === -1) throw new Error(`Invalid character: ${char}`)
result = result * BASE + BigInt(index)
}
return result
}
// Pad to fixed length for consistent URLs
export function encodeBase62Padded(num: bigint, length: number): string {
const encoded = encodeBase62(num)
return encoded.padStart(length, "0")
}
```
**Code length capacity:**
| Length | Combinations | Sufficient for |
| ------ | ------------ | --------------------- |
| 6 | 56.8B | Small-medium services |
| 7 | 3.5T | Large services |
| 8 | 218T | All URLs on internet |
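The table can be derived, and extended to other lengths, with a few lines. A minimal sketch; the function names are illustrative, and the 1M codes/day rate is the figure used in the scale estimate.

```typescript
// Number of distinct Base62 codes of exactly `length` characters.
function keyspace(length: number): number {
  let total = 1
  for (let i = 0; i < length; i++) total *= 62 // exact up to length 8 in a double
  return total
}

// Shortest code length whose keyspace covers `requiredCodes`.
function minCodeLength(requiredCodes: number): number {
  let length = 1
  while (keyspace(length) < requiredCodes) length++
  return length
}

// Years until every code of the given length has been issued.
function yearsUntilExhausted(length: number, newCodesPerDay: number): number {
  return keyspace(length) / newCodesPerDay / 365
}

console.log(keyspace(6), keyspace(7), keyspace(8))
console.log(minCodeLength(1_825_000_000))
console.log(yearsUntilExhausted(6, 1_000_000))
```

In practice the usable keyspace is smaller than the raw count, since reserved words and custom codes draw from the same pool.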
### Snowflake ID Generator
```typescript collapse={1-15}
const EPOCH = 1609459200000n // 2021-01-01 00:00:00 UTC
const NODE_BITS = 10n
const SEQUENCE_BITS = 12n
const MAX_NODE_ID = (1n << NODE_BITS) - 1n
const MAX_SEQUENCE = (1n << SEQUENCE_BITS) - 1n
const NODE_SHIFT = SEQUENCE_BITS
const TIMESTAMP_SHIFT = SEQUENCE_BITS + NODE_BITS
export class SnowflakeGenerator {
private nodeId: bigint
private sequence: bigint = 0n
private lastTimestamp: bigint = -1n
constructor(nodeId: number) {
if (nodeId < 0 || BigInt(nodeId) > MAX_NODE_ID) {
throw new Error(`Node ID must be between 0 and ${MAX_NODE_ID}`)
}
this.nodeId = BigInt(nodeId)
}
generate(): bigint {
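let timestamp = BigInt(Date.now()) - EPOCH
// Guard against the clock moving backwards (e.g. an NTP adjustment), which could repeat IDs
if (timestamp < this.lastTimestamp) {
throw new Error(`Clock moved backwards by ${this.lastTimestamp - timestamp}ms`)
}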
if (timestamp === this.lastTimestamp) {
this.sequence = (this.sequence + 1n) & MAX_SEQUENCE
if (this.sequence === 0n) {
// Sequence exhausted, wait for next millisecond
timestamp = this.waitNextMillis(this.lastTimestamp)
}
} else {
this.sequence = 0n
}
this.lastTimestamp = timestamp
return (timestamp << TIMESTAMP_SHIFT) | (this.nodeId << NODE_SHIFT) | this.sequence
}
private waitNextMillis(lastTimestamp: bigint): bigint {
let timestamp = BigInt(Date.now()) - EPOCH
while (timestamp <= lastTimestamp) {
timestamp = BigInt(Date.now()) - EPOCH
}
return timestamp
}
}
```
**Capacity:** 4,096 IDs per millisecond per node ≈ 4.1M IDs/second per node; across 1,024 nodes that is roughly 4.2 billion IDs/second in total.
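Because the layout is fixed, an ID can be taken apart again, which is useful for debugging and for checking the encoded length. A minimal sketch assuming the same 41/10/12 bit split and the 2021-01-01 epoch used by the generator; `composeSnowflake`, `decodeSnowflake`, and `base62Length` are illustrative helpers.

```typescript
const EPOCH_MS = 1609459200000n // 2021-01-01T00:00:00Z, same epoch as the generator
const NODE_ID_BITS = 10n
const SEQ_BITS = 12n

interface SnowflakeParts {
  timestampMs: number // Unix time in milliseconds
  nodeId: number
  sequence: number
}

function composeSnowflake(timestampMs: number, nodeId: number, sequence: number): bigint {
  if (nodeId < 0 || nodeId > 1023) throw new Error("nodeId out of range")
  if (sequence < 0 || sequence > 4095) throw new Error("sequence out of range")
  const elapsed = BigInt(timestampMs) - EPOCH_MS
  return (elapsed << (NODE_ID_BITS + SEQ_BITS)) | (BigInt(nodeId) << SEQ_BITS) | BigInt(sequence)
}

function decodeSnowflake(id: bigint): SnowflakeParts {
  return {
    timestampMs: Number((id >> (NODE_ID_BITS + SEQ_BITS)) + EPOCH_MS),
    nodeId: Number((id >> SEQ_BITS) & 0x3ffn), // low 10 bits after the shift
    sequence: Number(id & 0xfffn), // low 12 bits
  }
}

// Number of Base62 digits needed to represent `id`.
function base62Length(id: bigint): number {
  let length = 1
  let rest = id / 62n
  while (rest > 0n) {
    length++
    rest /= 62n
  }
  return length
}
```

`base62Length` on the largest 63-bit value shows where the 11-character upper bound for Snowflake-based codes comes from; IDs generated soon after the epoch encode shorter.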
### Redirect Service with Bloom Filter
```typescript collapse={1-20}
import { BloomFilter } from "bloom-filters"
interface RedirectResult {
found: boolean
longUrl?: string
isExpired?: boolean
requiresPassword?: boolean
}
class RedirectService {
private readonly redis: RedisCluster
private readonly cassandra: CassandraClient
private readonly bloomFilter: BloomFilter
private readonly analytics: AnalyticsCollector
constructor() {
// Bloom filter: 1B items, 0.1% false positive rate
// Memory: ~1.8GB (about 14.4 bits per item)
this.bloomFilter = BloomFilter.create(1_000_000_000, 0.001)
}
async redirect(shortCode: string, context: RequestContext): Promise<RedirectResult> {
// Step 1: Bloom filter check (rejects non-existent codes before any cache or DB I/O)
if (!this.bloomFilter.has(shortCode)) {
return { found: false }
}
// Step 2: Redis cache check
const cached = await this.redis.hgetall(`url:${shortCode}`)
if (cached && cached.long_url) {
this.logClick(shortCode, context) // Async, non-blocking
return this.buildResult(cached)
}
// Step 3: Database lookup
const row = await this.cassandra.execute("SELECT * FROM url_mappings WHERE short_code = ?", [shortCode])
if (!row || row.length === 0) {
// False positive from bloom filter
return { found: false }
}
const url = row[0]
// Step 4: Cache the result
await this.redis.hset(`url:${shortCode}`, {
long_url: url.long_url,
expires_at: url.expires_at?.toISOString() || "",
password_hash: url.password_hash || "",
is_active: url.is_active ? "1" : "0",
})
await this.redis.expire(`url:${shortCode}`, 3600) // 1 hour TTL
this.logClick(shortCode, context)
return this.buildResult(url)
}
private buildResult(data: any): RedirectResult {
if (data.is_active === "0" || data.is_active === false) {
return { found: false }
}
if (data.expires_at && new Date(data.expires_at) < new Date()) {
return { found: true, isExpired: true }
}
if (data.password_hash) {
return { found: true, requiresPassword: true, longUrl: data.long_url }
}
return { found: true, longUrl: data.long_url }
}
private logClick(shortCode: string, context: RequestContext): void {
// Fire and forget - don't await
this.analytics
.log({
shortCode,
timestamp: Date.now(),
ip: context.ip,
userAgent: context.userAgent,
referer: context.referer,
})
.catch((err) => console.error("Analytics error:", err))
}
}
```
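The memory footprint of the Bloom filter follows from the standard sizing formulas: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A minimal sketch; `bloomFilterSize` is an illustrative helper, not part of the `bloom-filters` package.

```typescript
interface BloomSizing {
  bits: number // total bit array size (m)
  hashes: number // number of hash functions (k)
  megabytes: number
}

function bloomFilterSize(expectedItems: number, falsePositiveRate: number): BloomSizing {
  // m = -n * ln(p) / (ln 2)^2
  const bits = Math.ceil((-expectedItems * Math.log(falsePositiveRate)) / Math.LN2 ** 2)
  // k = (m / n) * ln 2, rounded to the nearest whole hash function
  const hashes = Math.round((bits / expectedItems) * Math.LN2)
  return { bits, hashes, megabytes: bits / 8 / 1e6 }
}

console.log(bloomFilterSize(1_000_000_000, 0.001))
console.log(bloomFilterSize(1_000_000_000, 0.01))
```

Relaxing the false positive rate from 0.1% to 1% saves roughly a third of the memory, at the cost of ten times as many wasted lookups for non-existent codes.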
### Rate Limiter (Sliding Window)
```typescript collapse={1-12}
interface RateLimitResult {
allowed: boolean
remaining: number
resetAt: number
}
class SlidingWindowRateLimiter {
private readonly redis: RedisCluster
async checkLimit(key: string, limit: number, windowMs: number): Promise<RateLimitResult> {
const now = Date.now()
const windowStart = now - windowMs
// Lua script for atomic sliding window
const result = await this.redis.eval(
`
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window_start = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local window_ms = tonumber(ARGV[4])
-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
-- Count current entries
local count = redis.call('ZCARD', key)
if count < limit then
-- Add new entry
redis.call('ZADD', key, now, now .. ':' .. math.random())
redis.call('PEXPIRE', key, window_ms)
return {1, limit - count - 1, now + window_ms}
else
-- Get oldest entry for reset time
local oldest = redis.call('ZRANGE', key, 0, 0, 'WITHSCORES')
local reset_at = oldest[2] + window_ms
return {0, 0, reset_at}
end
`,
[key],
[now, windowStart, limit, windowMs],
)
return {
allowed: result[0] === 1,
remaining: result[1],
resetAt: result[2],
}
}
}
```
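The same sliding-window logic can be mirrored in memory, which is handy for unit tests and for reasoning about the Lua script without a Redis instance. This sketch is single-process only and takes `now` as a parameter so that tests are deterministic; the class name is illustrative.

```typescript
interface WindowResult {
  allowed: boolean
  remaining: number
  resetAt: number
}

class InMemorySlidingWindow {
  private readonly hits = new Map<string, number[]>()

  check(key: string, limit: number, windowMs: number, now: number = Date.now()): WindowResult {
    const windowStart = now - windowMs
    // Drop timestamps at or before the window start (the ZREMRANGEBYSCORE step).
    const recent = (this.hits.get(key) ?? []).filter((t) => t > windowStart)
    if (recent.length < limit) {
      recent.push(now) // The ZADD step
      this.hits.set(key, recent)
      return { allowed: true, remaining: limit - recent.length, resetAt: now + windowMs }
    }
    this.hits.set(key, recent)
    // The oldest surviving hit decides when the next slot frees up.
    return { allowed: false, remaining: 0, resetAt: (recent[0] ?? now) + windowMs }
  }
}

const limiter = new InMemorySlidingWindow()
console.log(limiter.check("user:1", 2, 1000, 0))
```

Unlike a fixed window, a client cannot burst twice the limit around a window boundary, because every request is checked against the trailing `windowMs`.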
### URL Scanner (Security)
```typescript collapse={1-15}
interface ScanResult {
isSafe: boolean
threats: string[]
scanTime: number
}
class URLScanner {
private readonly blocklist: BlocklistService
private readonly googleSafeBrowsing: GoogleSafeBrowsingClient
private readonly virusTotal: VirusTotalClient
private readonly redis: RedisCluster
async scan(url: string): Promise<ScanResult> {
const urlHash = this.hashUrl(url)
// Check cache first
const cached = await this.redis.get(`scan:${urlHash}`)
if (cached) {
return JSON.parse(cached)
}
const domain = new URL(url).hostname
const threats: string[] = []
// Step 1: Local blocklist (fast)
if (await this.blocklist.contains(domain)) {
return this.cacheResult(urlHash, { isSafe: false, threats: ["blocklist"], scanTime: Date.now() })
}
// Step 2: Known-good allowlist
if (await this.isKnownGood(domain)) {
return this.cacheResult(urlHash, { isSafe: true, threats: [], scanTime: Date.now() })
}
// Step 3: Google Safe Browsing API
const gsbResult = await this.googleSafeBrowsing.lookup(url)
if (gsbResult.threats.length > 0) {
threats.push(...gsbResult.threats)
}
// Step 4: VirusTotal (for suspicious domains)
if (await this.isSuspicious(domain)) {
const vtResult = await this.virusTotal.scan(url)
if (vtResult.positives > 2) {
threats.push("malware")
}
}
const result: ScanResult = {
isSafe: threats.length === 0,
threats,
scanTime: Date.now(),
}
return this.cacheResult(urlHash, result)
}
private async cacheResult(hash: string, result: ScanResult): Promise<ScanResult> {
// Cache safe results longer than unsafe
const ttl = result.isSafe ? 86400 : 3600 // 24h vs 1h
await this.redis.setex(`scan:${hash}`, ttl, JSON.stringify(result))
return result
}
private async isKnownGood(domain: string): Promise<boolean> {
// Top 10K domains by traffic
return this.redis.sismember("domains:allowlist", domain)
}
private async isSuspicious(domain: string): Promise<boolean> {
// New domain, unusual TLD, etc.
const domainAge = await this.getDomainAge(domain)
return domainAge < 30 // Less than 30 days old
}
}
```
### Analytics Pipeline
```typescript collapse={1-10}
import * as crypto from "node:crypto"
interface ClickEvent {
shortCode: string
timestamp: number
ip: string
userAgent: string
referer: string | null
}
class AnalyticsCollector {
private readonly kafka: KafkaProducer
private readonly buffer: ClickEvent[] = []
private readonly BUFFER_SIZE = 100
private readonly FLUSH_INTERVAL = 1000 // 1 second
constructor() {
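// Catch here so a failed periodic flush does not become an unhandled rejection
setInterval(() => this.flush().catch((err) => console.error("Flush error:", err)), this.FLUSH_INTERVAL)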
}
async log(event: ClickEvent): Promise<void> {
this.buffer.push(event)
if (this.buffer.length >= this.BUFFER_SIZE) {
await this.flush()
}
}
private async flush(): Promise<void> {
if (this.buffer.length === 0) return
const events = this.buffer.splice(0)
await this.kafka.sendBatch({
topic: "clicks",
messages: events.map((e) => ({
key: e.shortCode,
value: JSON.stringify(e),
timestamp: e.timestamp.toString(),
})),
})
}
}
// Stream processor (Kafka Consumer → ClickHouse)
class ClickProcessor {
private readonly clickhouse: ClickHouseClient
private readonly geoIP: GeoIPService
private readonly deviceParser: DeviceParser
private readonly botDetector: BotDetector
async process(event: ClickEvent): Promise<Record<string, unknown>> {
const geo = await this.geoIP.lookup(event.ip)
const device = this.deviceParser.parse(event.userAgent)
const isBot = this.botDetector.detect(event.userAgent, event.ip)
return {
short_code: event.shortCode,
clicked_at: new Date(event.timestamp),
ip_hash: this.hashIP(event.ip), // GDPR compliance
country: geo.country,
city: geo.city,
device_type: device.type,
browser: device.browser,
os: device.os,
referrer_domain: this.extractDomain(event.referer),
referrer_path: this.extractPath(event.referer),
is_bot: isBot.isBot ? 1 : 0,
bot_type: isBot.type,
}
}
private hashIP(ip: string): string {
// Salted, truncated hash: pseudonymization rather than full anonymization under GDPR
return crypto
.createHash("sha256")
.update(ip + process.env.IP_SALT)
.digest("hex")
.substring(0, 32)
}
}
```
## Frontend Considerations
### Redirect Performance
**Critical path optimization:**
The redirect is the most latency-sensitive operation. Every millisecond matters.
```typescript collapse={1-8}
// Server-side redirect handler (minimal processing)
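export async function handleRedirect(req: Request): Promise<Response> {
// Parse the path so query strings and trailing slashes do not end up in the code
const shortCode = new URL(req.url).pathname.split("/").filter(Boolean).pop() ?? ""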
// Validate format before any I/O
if (!isValidCode(shortCode)) {
return new Response(null, { status: 404 })
}
const result = await redirectService.redirect(shortCode, {
ip: req.headers.get("x-forwarded-for"),
userAgent: req.headers.get("user-agent"),
referer: req.headers.get("referer"),
})
if (!result.found) {
return new Response(null, { status: 404 })
}
if (result.isExpired) {
return new Response("Link expired", { status: 410 })
}
if (result.requiresPassword) {
return renderPasswordPage(shortCode)
}
return new Response(null, {
status: 302,
headers: {
Location: result.longUrl,
"Cache-Control": "private, max-age=60",
"X-Robots-Tag": "noindex",
},
})
}
```
**Why 302 over 301:**
| Factor | 301 (Permanent) | 302 (Temporary) |
| --------------- | ----------------------- | ---------------------- |
| Browser caching | Aggressive (forever) | Respects Cache-Control |
| Click tracking | Misses cached clicks | Tracks all clicks |
| URL updates | Cached version persists | Changes reflected |
| CDN caching | Long TTL safe | Short TTL recommended |
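**Recommendation:** Use 302 with a short TTL. `Cache-Control: private, max-age=60` lets browsers cache briefly but forbids shared caches, so the CDN will not store the redirect. To have the CDN absorb spikes for 60s, send a shared-cache directive such as `public, max-age=60, s-maxage=60` instead, and count those clicks from CDN logs, because redirects served from the edge never reach the origin.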
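Whether a CDN may store the redirect depends on the `Cache-Control` directives: `private` restricts storage to the browser, while `public` and `s-maxage` address shared caches. A minimal sketch of a header builder that makes the choice explicit; `RedirectCachePolicy` and `redirectCacheControl` are hypothetical names.

```typescript
interface RedirectCachePolicy {
  browserSeconds: number // how long the browser may reuse the redirect
  cdnSeconds: number // 0 disables shared (CDN) caching
}

function redirectCacheControl(policy: RedirectCachePolicy): string {
  if (policy.cdnSeconds <= 0) {
    // `private` forbids shared caches from storing the response.
    return `private, max-age=${policy.browserSeconds}`
  }
  // `s-maxage` applies only to shared caches and takes precedence over max-age there.
  return `public, max-age=${policy.browserSeconds}, s-maxage=${policy.cdnSeconds}`
}

console.log(redirectCacheControl({ browserSeconds: 60, cdnSeconds: 0 }))
console.log(redirectCacheControl({ browserSeconds: 60, cdnSeconds: 60 }))
```

A per-link policy is one option: keep `private` for links whose owners need exact click counts, and enable CDN caching only for links that are currently spiking.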
### Dashboard State Management
```typescript collapse={1-12}
// URL analytics dashboard state
interface DashboardState {
urls: Map
selectedUrl: string | null
analytics: AnalyticsData | null
dateRange: DateRange
isLoading: boolean
}
// Normalized store for efficient updates
const useDashboardStore = create((set, get) => ({
urls: new Map(),
selectedUrl: null,
analytics: null,
dateRange: { start: subDays(new Date(), 7), end: new Date() },
isLoading: false,
// Actions
fetchUrls: async () => {
set({ isLoading: true })
const urls = await api.getUrls()
set({
urls: new Map(urls.map((u) => [u.id, u])),
isLoading: false,
})
},
selectUrl: async (urlId: string) => {
set({ selectedUrl: urlId, isLoading: true })
const analytics = await api.getAnalytics(urlId, get().dateRange)
set({ analytics, isLoading: false })
},
updateDateRange: async (range: DateRange) => {
set({ dateRange: range })
const { selectedUrl } = get()
if (selectedUrl) {
set({ isLoading: true })
const analytics = await api.getAnalytics(selectedUrl, range)
set({ analytics, isLoading: false })
}
},
}))
```
### Real-Time Click Counter
```typescript collapse={1-10}
// WebSocket for live click updates
class ClickStreamClient {
private ws: WebSocket | null = null
private subscriptions = new Set<string>()
connect(authToken: string): void {
this.ws = new WebSocket(`wss://api.suj.ee/ws?token=${authToken}`)
this.ws.onmessage = (event) => {
const data = JSON.parse(event.data)
if (data.type === "click") {
this.handleClick(data.shortCode, data.count)
}
}
}
subscribe(shortCode: string): void {
this.subscriptions.add(shortCode)
this.ws?.send(
JSON.stringify({
action: "subscribe",
shortCode,
}),
)
}
private handleClick(shortCode: string, count: number): void {
// Update local state
useDashboardStore.getState().updateClickCount(shortCode, count)
}
}
```
## Infrastructure
### Cloud-Agnostic Components
| Component | Purpose | Options |
| -------------- | ----------------------------- | ------------------------------ |
| CDN | Edge caching, DDoS protection | Cloudflare, Fastly, CloudFront |
| Load balancer | Traffic distribution | HAProxy, NGINX, ALB |
| Application | Redirect service, API | Node.js, Go, Rust |
| KV cache | Hot URLs, rate limits | Redis, KeyDB, Dragonfly |
| Primary DB | URL mappings | Cassandra, ScyllaDB, DynamoDB |
| Analytics DB | Click data | ClickHouse, Druid, TimescaleDB |
| Message queue | Analytics pipeline | Kafka, Pulsar, Redpanda |
| Object storage | Exports, backups | S3, GCS, MinIO |
### AWS Reference Architecture

**Service configurations:**
| Service | Configuration | Rationale |
| ---------------------------- | ---------------------------- | ------------------------------------------ |
| CloudFront | 200+ edge locations | Global low-latency redirects |
| Redirect Service (Fargate) | 2 vCPU, 4GB, 50 tasks | High throughput, stateless |
| Shortening Service (Fargate) | 2 vCPU, 4GB, 10 tasks | Lower traffic than redirects |
| ElastiCache Redis | r6g.xlarge cluster, 3 shards | Hot URL cache, rate limits |
| Amazon Keyspaces | On-demand | Serverless Cassandra, scales automatically |
| RDS PostgreSQL | db.r6g.large Multi-AZ | Users, KGS, configuration |
| MSK | kafka.m5.large × 3 | Click event streaming |
### Self-Hosted Alternatives
| Managed Service | Self-Hosted Option | When to Self-Host |
| ---------------- | -------------------- | ------------------------------- |
| Amazon Keyspaces | ScyllaDB | Cost at scale, lower latency |
| ElastiCache | Redis Cluster on EC2 | Specific modules, cost |
| CloudFront | Cloudflare | Better DDoS protection, cheaper |
| MSK | Redpanda | Lower latency, simpler ops |
### Monitoring
**Key metrics:**
| Metric | Alert Threshold | Action |
| -------------------- | --------------- | -------------------------- |
| Redirect latency p99 | > 100ms | Scale Redis, check CDN |
| CDN cache hit ratio | < 80% | Review cache headers |
| 404 rate | > 5% | Check for scanning attacks |
| KGS key inventory | < 20% | Generate new batch |
| Click pipeline lag | > 60s | Scale analytics processors |
**Distributed tracing:**
```
Request → CDN → Load Balancer → Redirect Service → Redis/Cassandra → Analytics
│ │
└─────────────────── Trace ID propagated ──────────────────────────────┘
```
- Each redirect gets unique trace ID
- Propagate through all services
- 100% sampling for errors, 1% for normal traffic
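The sampling rule can be made deterministic by hashing the trace ID, so that every service reaches the same keep-or-drop decision for a given request. A minimal sketch using an FNV-1a hash; `shouldSampleTrace` is an illustrative helper and not tied to any tracing library.

```typescript
function shouldSampleTrace(traceId: string, isError: boolean, normalRate = 0.01): boolean {
  if (isError) return true // Errors are always kept

  // 32-bit FNV-1a hash of the trace ID.
  let hash = 0x811c9dc5
  for (let i = 0; i < traceId.length; i++) {
    hash ^= traceId.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  // Map the hash to [0, 1) and keep the trace if it falls below the rate.
  return hash / 0x100000000 < normalRate
}

console.log(shouldSampleTrace("trace-abc", false))
console.log(shouldSampleTrace("trace-abc", true))
```

Keeping every error trace implies the decision is made after the outcome is known (tail-based sampling) or that error spans are exported unconditionally; the sketch assumes the caller already knows `isError`.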
## Conclusion
This design provides a production-ready URL shortener with:
1. **Zero collision guarantee** via Key Generation Service
2. **Sub-50ms global redirect latency** through CDN edge caching and Redis
3. **100K+ RPS capacity** with horizontal scaling
4. **Real-time analytics** without blocking redirects
5. **Security-first approach** with URL scanning and rate limiting
**Key architectural decisions:**
- KGS + Snowflake hybrid for ID generation balances simplicity and scale
- 302 redirects with short CDN TTL enable analytics while absorbing traffic spikes
- Cassandra for URL mappings provides O(1) lookups and horizontal scaling
- Async analytics pipeline (Kafka → ClickHouse) keeps redirect path fast
**Known limitations:**
- KGS requires pre-generation overhead and key exhaustion monitoring
- 302 redirects increase server load vs. 301 caching
- Analytics have 1-5 second delay due to async processing
- Bloom filter false positives cause unnecessary DB lookups (~0.1% of lookups for non-existent codes)
**Future enhancements:**
- Link preview generation (OpenGraph scraping)
- A/B testing for destinations
- Geographic routing (different destination per region)
- Deep link support for mobile apps
## Appendix
### Prerequisites
- Distributed systems fundamentals (consistent hashing, replication)
- Database selection trade-offs (SQL vs. NoSQL)
- Caching strategies (TTL, eviction policies)
- HTTP redirect semantics (301 vs. 302)
### Terminology
| Term | Definition |
| ---------------------- | --------------------------------------------------------------------------- |
| **Base62** | Encoding using 62 characters (0-9, A-Z, a-z) for URL-safe short codes |
| **Snowflake ID** | Twitter's distributed ID generation algorithm (timestamp + node + sequence) |
| **KGS** | Key Generation Service - pre-generates unique short codes |
| **Bloom filter** | Probabilistic data structure for fast membership testing |
| **Consistent hashing** | Sharding technique that minimizes data movement on cluster changes |
| **CDN** | Content Delivery Network - edge caching for global low latency |
| **Hot key** | Cache key receiving disproportionate traffic (viral links) |
### Summary
- **ID generation** uses KGS for random fixed-length codes and Snowflake for time-ordered IDs (up to 11 Base62 characters); both are URL-safe
- **Storage** uses Cassandra for URL mappings (O(1) lookups, horizontal scaling) and ClickHouse for analytics (columnar, fast aggregations)
- **Caching** is multi-tier: CDN edge (global) → Redis (regional) → Database, with Bloom filters preventing cache penetration (lookups for non-existent codes never reach the database)
- **Redirects** use 302 status with short CDN TTL to balance analytics accuracy with traffic absorption
- **Analytics** collected asynchronously via Kafka to never block the redirect critical path
- **Security** includes URL scanning (Google Safe Browsing, VirusTotal), rate limiting, and bot detection
### References
**Real-World Implementations:**
- [Bitly: Lessons Learned Building a Distributed System that Handles 6 Billion Clicks a Month](http://highscalability.com/blog/2014/7/14/bitly-lessons-learned-building-a-distributed-system-that-han.html) - SOA architecture, monitoring at scale
- [Twitter t.co Documentation](https://developer.x.com/en/docs/tco) - Built-in security scanning, analytics integration
- [Snowflake ID - Wikipedia](https://en.wikipedia.org/wiki/Snowflake_ID) - Distributed ID generation algorithm
**Technical Deep Dives:**
- [System Design: URL Shortening](https://systemdesign.one/url-shortening-system-design/) - Comprehensive system design walkthrough
- [URL Shortener Using Snowflake IDs and Base62 Encoding](https://dev.to/speaklouder/url-shortener-using-snowflake-ids-and-base62-encoding-4179) - Implementation details
- [301 vs 302 Redirects in URL Shorteners](https://url-shortening.com/blog/301-vs-302-redirects-in-shorteners-speed-seo-and-caching) - Redirect strategy trade-offs
**Security:**
- [URL Shortening Allows Threats to Evade Traditional Tools](https://www.menlosecurity.com/blog/url-shortening-allows-threats-to-evade-url-filtering-and-categorization-tools) - Security challenges
- [How URL Shortener Services Handle Abuse and Spam](https://on4t.net/blog/url-shortener-handle-abuse-spam/) - Abuse prevention strategies
**Privacy and Compliance:**
- [GDPR Compliance in URL Shorteners](https://iplogger.org/gdpr/) - Privacy requirements
- [Best URL Shorteners for Privacy](https://blog.choto.co/best-url-shorteners-for-privacy/) - GDPR-compliant implementations
---
## Design Pastebin: Text Sharing, Expiration, and Abuse Prevention
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/system-design-problems/design-pastebin
**Category:** System Design / System Design Problems
**Description:** A comprehensive system design for a text-sharing service like Pastebin covering URL generation strategies, content storage at scale, expiration policies, syntax highlighting, access control, and abuse prevention. This design addresses sub-100ms paste retrieval at a 10:1 read-to-write ratio with content deduplication and multi-tier storage.
# Design Pastebin: Text Sharing, Expiration, and Abuse Prevention
A comprehensive system design for a text-sharing service like Pastebin covering URL generation strategies, content storage at scale, expiration policies, syntax highlighting, access control, and abuse prevention. This design addresses sub-100ms paste retrieval at a 10:1 read-to-write ratio with content deduplication and multi-tier storage.

High-level architecture: CDN caches immutable paste content at the edge, core services handle creation and retrieval, object storage holds paste bodies separately from metadata for independent scaling.
## Abstract
A paste service maps short unique URLs to text blobs—conceptually simple, but the design space branches around three axes: how you generate collision-free IDs at scale, where you store potentially large text content cost-effectively, and how you handle the lifecycle of ephemeral vs. permanent pastes.
**Core architectural decisions:**
| Decision | Choice | Rationale |
| --------------- | ------------------------------------------------------- | -------------------------------------------------------------- |
| ID generation | KGS (Key Generation Service) with Base62 | Zero collisions, O(1) key retrieval, decoupled from write path |
| Content storage | Object storage (S3) for bodies, PostgreSQL for metadata | Independent scaling of blobs and queryable metadata |
| Caching | Multi-tier (CDN → Redis → S3) | Sub-100ms reads globally, pastes are immutable after creation |
| Compression | zstd at write time | 60-70% reduction on text; fast decompression for reads |
| Deduplication | SHA-256 content hash for internal dedup | Saves storage without leaking content existence via URL |
| Expiration | Hybrid lazy check + active sweep | Reads stay correct; cleanup runs in the background, off the request path |
**Key trade-offs accepted:**
- Object storage adds a network hop vs. inline database storage, but pastes can be multi-MB and S3 scales independently
- KGS pre-generation wastes some keys on server crashes, but guarantees zero write-path collisions
- zstd compression adds CPU at write time but reduces storage cost and CDN egress by 60-70%
- Lazy expiration means expired pastes consume storage until the next sweep, but reads are always correct
**What this design optimizes:**
- Sub-100ms paste retrieval via CDN edge caching of immutable content
- Cost-effective storage with compression + tiering (hot → warm → cold)
- Zero-collision URL generation without write-path coordination
- Graceful abuse prevention without blocking legitimate traffic
## Requirements
### Functional Requirements
| Requirement | Priority | Notes |
| -------------------- | -------- | -------------------------------------------------- |
| Create paste | Core | Accept text content, return unique short URL |
| Read paste | Core | Retrieve paste content by short URL |
| Paste expiration | Core | Time-based (10min to never) and burn-after-read |
| Syntax highlighting | Core | Server-side rendering with language detection |
| Access control | Core | Public, unlisted, private (password-protected) |
| Raw content endpoint | Core | Plain-text retrieval for CLI/API consumers |
| Content size limits | Extended | Configurable max size (default 512 KB, paid 10 MB) |
| Paste editing | Extended | Create new version, previous URL remains immutable |
| Paste forking | Extended | Copy and modify another user's paste |
| API access | Extended | RESTful API with key-based auth |
### Non-Functional Requirements
| Requirement | Target | Rationale |
| ------------------- | --------------------------- | ----------------------------------------------------- |
| Availability | 99.9% (3 nines) | Text sharing is useful but not mission-critical |
| Read latency | p99 < 100ms | Fast rendering for developer workflows |
| Write latency | p99 < 500ms | Acceptable for paste creation (compression + storage) |
| Throughput (reads) | 5K RPS | Headroom over the ~3K RPS viral burst estimated below |
| Throughput (writes) | 500 RPS | Write-light workload; far above the ~18 RPS estimated peak |
| Max paste size | 512 KB (free), 10 MB (paid) | Prevents abuse while supporting real use cases |
| Data durability | 99.999999999% (11 nines) | S3-grade durability for paste content |
| Paste URL length | 7-8 characters | Short enough for sharing, large enough keyspace |
### Scale Estimation
**Users and traffic (Pastebin.com-scale reference):**
- Monthly Active Users (MAU): 10M
- Daily Active Users (DAU): 1M (10% of MAU)
- New pastes/day: 500K
- Read-to-write ratio: 10:1 (reads dominate, but not as extreme as URL shorteners)
**Traffic:**
- Reads/day: 500K × 10 = 5M reads/day
- Average read RPS: 5M / 86,400 ≈ 58 RPS
- Peak multiplier (3x): ~174 RPS
- Viral spike (50x single paste): ~3K RPS burst
- Writes: 500K / 86,400 ≈ 6 RPS average, ~18 RPS peak
**Storage:**
- Average paste size: 5 KB (code snippets, logs, config)
- Daily raw content: 500K × 5 KB = 2.5 GB/day
- After zstd compression (~65% reduction): 875 MB/day
- Yearly content: ~320 GB compressed
- 5-year retention: ~1.6 TB compressed content
- Metadata per paste: ~200 bytes (IDs, timestamps, flags)
- Yearly metadata: 500K × 365 × 200B ≈ 36 GB
**Key insight:** Storage is modest even at scale. The real challenges are ID generation without collisions, efficient expiration of hundreds of millions of pastes, and abuse prevention for a service that accepts arbitrary text from the internet.
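The arithmetic above can be reproduced with a small capacity model. This is a minimal TypeScript sketch: the names `estimateScale` and `ScaleInputs` are illustrative, and the inputs are the same assumptions used in the estimates (500K pastes/day, 10:1 reads, 5 KB average size, ~65% compression savings, 200 bytes of metadata, 3x peak multiplier).

```typescript
// Back-of-envelope capacity model. Every input is an assumption from the estimates above.
const SECONDS_PER_DAY = 86_400;

interface ScaleInputs {
  pastesPerDay: number;       // new pastes per day
  readWriteRatio: number;     // reads per write
  avgPasteBytes: number;      // average raw paste size in bytes
  compressionSavings: number; // fraction removed by compression (0.65 = 65%)
  metadataBytes: number;      // metadata row size per paste
  peakMultiplier: number;     // peak-to-average traffic ratio
}

function estimateScale(i: ScaleInputs) {
  const avgReadRps = (i.pastesPerDay * i.readWriteRatio) / SECONDS_PER_DAY;
  const avgWriteRps = i.pastesPerDay / SECONDS_PER_DAY;
  const compressedBytesPerDay = i.pastesPerDay * i.avgPasteBytes * (1 - i.compressionSavings);
  return {
    avgReadRps,                                  // ≈ 58
    peakReadRps: avgReadRps * i.peakMultiplier,  // ≈ 174
    avgWriteRps,                                 // ≈ 6
    compressedGbPerDay: compressedBytesPerDay / 1e9,
    compressedGbPerYear: (compressedBytesPerDay * 365) / 1e9,
    metadataGbPerYear: (i.pastesPerDay * 365 * i.metadataBytes) / 1e9,
  };
}

const estimate = estimateScale({
  pastesPerDay: 500_000,
  readWriteRatio: 10,
  avgPasteBytes: 5_000,
  compressionSavings: 0.65,
  metadataBytes: 200,
  peakMultiplier: 3,
});
```

Changing a single input, such as the average paste size, shows how sensitive the yearly totals are without redoing the arithmetic by hand.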
## Design Paths
### Path A: Monolithic Storage (Database-Only)
**Best when:**
- Small to moderate scale (< 1M pastes)
- Simple operational requirements
- Paste size consistently small (< 10 KB)
**Architecture:**

**Key characteristics:**
- Paste content stored inline in database as `TEXT` column
- Single data store for metadata and content
- Simpler operational model (one system to back up, monitor, scale)
**Trade-offs:**
- ✅ Simplest deployment and operations
- ✅ Transactional consistency between metadata and content
- ✅ No additional network hop for content retrieval
- ❌ Database bloat as paste volume grows (impacts query performance on metadata)
- ❌ Expensive storage (RDS per-GB cost is roughly 5x S3, and higher still with Multi-AZ replication)
- ❌ Backup and replication transfer large blobs unnecessarily
- ❌ Cannot independently scale storage and compute
**Real-world example:** dpaste stores content directly in the database via Django ORM. Works well at dpaste's scale but would strain at Pastebin's millions.
### Path B: Split Storage (Metadata DB + Object Storage)
**Best when:**
- Moderate to large scale (1M+ pastes)
- Variable paste sizes (1 KB to 10 MB)
- Cost optimization matters
- Independent scaling of storage and query layers needed
**Architecture:**

**Key characteristics:**
- Metadata (paste_id, created_at, expires_at, visibility, content_hash) in PostgreSQL
- Paste content stored as compressed objects in S3, keyed by paste_id
- Content hash stored in metadata for deduplication lookups
- Redis caches deserialized hot pastes
**Trade-offs:**
- ✅ S3 storage cost is ~$0.023/GB vs. ~$0.115/GB for RDS (5x cheaper)
- ✅ S3 scales to exabytes without provisioning
- ✅ Database stays lean—fast metadata queries, small backups
- ✅ CDN can serve S3 objects directly for raw content
- ❌ Extra network hop on cache miss (API → S3)
- ❌ No transactional consistency between metadata and content writes
- ❌ Two systems to operate
**Real-world example:** GitHub Gists use Git repositories (effectively object storage) for content with a relational layer for metadata and discovery.
### Path C: Content-Addressable Storage
**Best when:**
- High duplication rate (logs, error dumps, config files)
- Storage cost is the dominant concern
- Acceptable to trade write complexity for storage savings
**Architecture:**

**Key characteristics:**
- Content stored by SHA-256 hash, not by paste_id
- Multiple paste_ids can reference the same content blob
- Deduplication is automatic—identical pastes share storage
- Paste URL is a separate opaque ID (not the content hash)
**Trade-offs:**
- ✅ Automatic deduplication (significant savings if many identical pastes)
- ✅ Content integrity verification built-in
- ❌ Deletion complexity: cannot delete a blob until all referencing pastes expire
- ❌ Reference counting adds write-path complexity
- ❌ Content existence leakage if paste URL were the hash (mitigated by using opaque IDs)
- ❌ Negligible savings if duplication rate is low
### Path Comparison
| Factor | Monolithic (A) | Split Storage (B) | Content-Addressable (C) |
| ------------------------- | ---------------- | ----------------- | ----------------------- |
| Operational complexity | Low | Medium | High |
| Storage cost at scale | High | Low | Lowest (with dedup) |
| Read latency (cache miss) | Lowest (one hop) | Medium (two hops) | Medium (two hops) |
| Write consistency | ACID | Eventual | Eventual |
| Independent scaling | No | Yes | Yes |
| Max practical scale | ~10M pastes | Billions | Billions |
| Deduplication | None | Optional | Native |
### This Article's Focus
This article focuses on **Path B (Split Storage)** with optional content-addressable deduplication because:
1. It matches the scale profile of real paste services (hundreds of millions of pastes)
2. Cost-effective storage is critical when accepting arbitrary content from the internet
3. Split architecture allows CDN to serve raw paste content directly from S3
4. Deduplication can be layered on without architectural changes
## High-Level Design
### Component Overview

### Paste Write Service
Accepts text content, compresses it, stores it in S3, and persists metadata.
**Write flow:**

**Design decisions:**
| Decision | Choice | Rationale |
| ------------------- | ---------------------------- | -------------------------------------------------------------------------------------------------------- |
| Compression | zstd (level 3) at write time | 65% reduction on text; fast decompression; dictionary support for similar content |
| Write ordering | S3 first, then DB | If DB write fails, orphaned S3 object is cleaned up by periodic sweep—cheaper than inconsistent metadata |
| Syntax highlighting | Async via queue | Highlighting large pastes (10 MB) can take seconds; don't block the write response |
| Content hash | SHA-256, stored in metadata | Enables optional dedup without coupling to content-addressable storage |
### Paste Read Service
The hot path. Retrieves paste content via multi-tier cache.
**Read flow:**

**Critical optimizations:**
- **Immutable content = aggressive caching.** Paste content never changes after creation. CDN can cache with long TTLs; cache invalidation only needed on deletion or expiration.
- **Burn-after-read bypasses all caches.** These pastes are served directly from origin with `Cache-Control: no-store` and atomically deleted after the first read.
- **Bloom filter on Redis** prevents cache penetration by non-existent paste IDs: a lookup the filter reports as absent returns 404 without touching the database, and only rare false positives fall through.
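The Bloom filter check on non-existent IDs can be illustrated with a small in-process version. This is a sketch, not the production setup: the `BloomFilter` class, its sizing, and the FNV-1a double-hashing scheme are assumptions chosen to keep the example dependency-free, whereas the design above keeps the filter in Redis.

```typescript
// Minimal Bloom filter sketch. Bit positions come from two seeded 32-bit FNV-1a
// hashes combined by double hashing; sizes and hash counts are illustrative.
class BloomFilter {
  private readonly bits: Uint8Array;
  private readonly sizeBits: number;
  private readonly hashCount: number;

  constructor(sizeBits: number, hashCount: number) {
    this.sizeBits = sizeBits;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(Math.ceil(sizeBits / 8));
  }

  private fnv1a(value: string, seed: number): number {
    let hash = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < value.length; i++) {
      hash ^= value.charCodeAt(i);
      hash = Math.imul(hash, 0x01000193) >>> 0;
    }
    return hash;
  }

  private positions(value: string): number[] {
    const h1 = this.fnv1a(value, 0);
    const h2 = (this.fnv1a(value, 0x9e3779b9) | 1) >>> 0; // odd step size
    const out: number[] = [];
    for (let i = 0; i < this.hashCount; i++) {
      out.push((h1 + i * h2) % this.sizeBits);
    }
    return out;
  }

  add(value: string): void {
    for (const p of this.positions(value)) {
      this.bits[p >> 3] |= 1 << (p & 7);
    }
  }

  // false means "definitely absent"; true means "possibly present".
  mightContain(value: string): boolean {
    return this.positions(value).every((p) => (this.bits[p >> 3] & (1 << (p & 7))) !== 0);
  }
}

const knownIds = new BloomFilter(8192, 4);
knownIds.add("aB3kF9x");
const shouldQueryStore = knownIds.mightContain("aB3kF9x"); // added IDs are never reported absent
```

The read service would return 404 immediately when `mightContain` is false and only continue to Redis or PostgreSQL when it is true.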
### Key Generation Service (KGS)
Pre-generates Base62-encoded 7-character keys for zero-collision paste URL assignment.
**Keyspace math:**
- 7-character Base62: 62^7 = 3.52 trillion possible keys
- At 500K pastes/day: 182.5M pastes/year
- Keyspace exhaustion: ~19,000 years
**Key allocation:**
1. Offline generator produces random 7-character Base62 strings in batches
2. Stored in a dedicated key pool table (or DynamoDB) with `status = 'available'`
3. Each app server fetches a batch of 1,000 keys on startup
4. Keys assigned from local cache—no database round-trip per paste
5. On graceful shutdown, unused keys are returned to the pool
**Failure handling:**
- **App server crash:** Allocated batch (~1,000 keys) is lost. At 3.52 trillion keyspace, this is negligible.
- **KGS unavailable:** App servers have local buffer. Alert at < 100 remaining local keys.
- **Duplicate prevention:** Keys are generated randomly and checked for uniqueness against the used-keys set before entering the pool.
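The allocation and failure-handling rules above can be sketched as a local key buffer. Everything named here (`LocalKeyBuffer`, `fetchFromPool`, `randomKey`) is hypothetical, and the in-memory `Set` stands in for the key pool table. `Math.random` is used only to keep the sketch dependency-free; a real generator would need a cryptographically secure source so keys stay unpredictable.

```typescript
const BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

// Generates one random Base62 key of the given length.
function randomKey(length = 7): string {
  let key = "";
  for (let i = 0; i < length; i++) {
    key += BASE62.charAt(Math.floor(Math.random() * BASE62.length));
  }
  return key;
}

// Hypothetical in-process buffer: refills from the pool in batches and
// reports when the local supply drops below the alert threshold.
class LocalKeyBuffer {
  private keys: string[] = [];
  private fetchBatch: (size: number) => string[];
  private batchSize: number;
  private lowWatermark: number;

  constructor(fetchBatch: (size: number) => string[], batchSize = 1000, lowWatermark = 100) {
    this.fetchBatch = fetchBatch;
    this.batchSize = batchSize;
    this.lowWatermark = lowWatermark;
  }

  nextKey(): string {
    if (this.keys.length === 0) this.keys = this.fetchBatch(this.batchSize);
    const key = this.keys.pop();
    if (key === undefined) throw new Error("key pool exhausted or KGS unavailable");
    return key;
  }

  // True when the alerting threshold from the failure-handling rules is crossed.
  isLow(): boolean {
    return this.keys.length < this.lowWatermark;
  }

  // Called on graceful shutdown so unused keys can be returned to the pool.
  drain(): string[] {
    const unused = this.keys;
    this.keys = [];
    return unused;
  }
}

// Stand-in for the pool table: hands out keys that were never issued before.
const issued = new Set<string>();
function fetchFromPool(size: number): string[] {
  const batch: string[] = [];
  while (batch.length < size) {
    const candidate = randomKey();
    if (!issued.has(candidate)) {
      issued.add(candidate);
      batch.push(candidate);
    }
  }
  return batch;
}

const buffer = new LocalKeyBuffer(fetchFromPool);
const pasteId = buffer.nextKey();
```

A crash simply discards whatever `drain()` would have returned, which is the key loss the failure-handling notes describe as negligible.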
### Expiration Service
Handles time-based paste expiration and burn-after-read.
**Hybrid expiration strategy:**
1. **Lazy check on read:** Every read checks `expires_at` before serving content. Expired pastes return `410 Gone`. This guarantees readers never see expired content.
2. **Active sweep:** A background cron job runs every hour, querying `SELECT id FROM pastes WHERE expires_at < NOW() AND deleted_at IS NULL LIMIT 10000`. Deletes S3 objects and marks metadata as deleted in batches.
**Burn-after-read implementation:**

**Race condition handling:** The `SELECT ... FOR UPDATE` acquires a row-level lock. If two concurrent readers hit the same burn-after-read paste, only the first gets the content; the second sees `deleted_at IS NOT NULL` and receives `410 Gone`. This is the correct behavior—exactly one reader sees the content.
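The lazy expiration check and the burn-after-read rule can be sketched together. This is an in-memory illustration under stated assumptions: the `Map` stands in for the `pastes` table, `readPaste` and `PasteRow` are hypothetical names, and the check-and-mark step is atomic here only because JavaScript runs it on a single thread. In PostgreSQL the same step would run inside one transaction holding the row lock from `SELECT ... FOR UPDATE`.

```typescript
interface PasteRow {
  content: string;
  burnAfterRead: boolean;
  expiresAt: number | null; // epoch milliseconds, null = never expires
  deletedAt: number | null; // epoch milliseconds, null = live
}

type ReadResult = { status: 200; content: string } | { status: 404 | 410 };

function readPaste(table: Map<string, PasteRow>, id: string, now: number): ReadResult {
  const row = table.get(id);
  if (!row) return { status: 404 };

  // Lazy expiration: never serve a paste past its expiry, even before the sweep runs.
  const expired = row.expiresAt !== null && row.expiresAt <= now;
  if (row.deletedAt !== null || expired) return { status: 410 };

  // Burn-after-read: the first successful reader marks the paste as deleted,
  // so every later reader falls into the 410 branch above.
  if (row.burnAfterRead) row.deletedAt = now;

  return { status: 200, content: row.content };
}
```

The response for a burn-after-read paste would still carry `Cache-Control: no-store`, otherwise an intermediary cache could replay the content to a second reader.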
### Content Scanner
Asynchronous scanning for malware signatures, credential dumps, and PII (Personally Identifiable Information) patterns.
**Scanning pipeline:**
1. On paste creation, content hash is checked against a known-bad-content blocklist
2. Regex patterns detect credential dumps (email:password patterns), API keys, and private keys
3. Flagged pastes are quarantined—visible only to the creator until manual review
4. Confirmed malicious content is deleted and the creator's account is flagged
**Rate limiting tiers:**
| Tier | Limit | Scope |
| -------------------- | ------------------- | ----------- |
| Anonymous | 10 pastes/hour | Per IP |
| Authenticated (free) | 60 pastes/hour | Per API key |
| Authenticated (paid) | 600 pastes/hour | Per API key |
| Read (all tiers) | 300 requests/minute | Per IP |
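One way to enforce limits like these is a sliding-window counter. The sketch below is a minimal in-memory version: `SlidingWindowLimiter` is a hypothetical name, and the per-key timestamp array plays the role that a Redis sorted set would play in a shared deployment.

```typescript
// Sliding-window rate limiter sketch. Timestamps are epoch milliseconds and are
// passed in explicitly so the behaviour is deterministic and easy to test.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(key: string, now: number): boolean {
    const cutoff = now - this.windowMs;
    // Drop timestamps that have slid out of the window.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.hits.set(key, recent);
    return allowed;
  }
}

const HOUR_MS = 60 * 60 * 1000;
const anonymousWrites = new SlidingWindowLimiter(10, HOUR_MS); // 10 pastes/hour per IP
const firstAllowed = anonymousWrites.allow("203.0.113.7", 0);
```

A rejected request would map to the `429 Too Many Requests` response with a `Retry-After` header derived from the oldest timestamp still inside the window.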
## API Design
### Create Paste
**Endpoint:** `POST /api/v1/pastes`
**Request:**
```json
{
"content": "string (max 512KB / 10MB for paid)",
"title": "string | null (max 100 chars)",
"language": "string | null (e.g., 'python', 'json')",
"expiration": "10m | 1h | 1d | 1w | 1m | 6m | 1y | never | burn_after_read",
"visibility": "public | unlisted | private",
"password": "string | null (required if visibility = 'private')"
}
```
**Response (201 Created):**
```json
{
"id": "aB3kF9x",
"url": "https://paste.example.com/aB3kF9x",
"raw_url": "https://paste.example.com/raw/aB3kF9x",
"title": "My Snippet",
"language": "python",
"created_at": "2025-01-15T10:30:00Z",
"expires_at": "2025-01-22T10:30:00Z",
"visibility": "unlisted",
"size_bytes": 2048
}
```
**Error responses:**
- `400 Bad Request` — Content too large, invalid expiration, missing required fields
- `401 Unauthorized` — Invalid or missing API key (for authenticated endpoints)
- `429 Too Many Requests` — Rate limit exceeded. Response includes `Retry-After` header
### Read Paste
**Endpoint:** `GET /api/v1/pastes/{id}`
**Response (200 OK):**
```json
{
"id": "aB3kF9x",
"title": "My Snippet",
"content": "def hello():\n print('world')",
"language": "python",
"highlighted_html": "... ",
"created_at": "2025-01-15T10:30:00Z",
"expires_at": "2025-01-22T10:30:00Z",
"visibility": "unlisted",
"size_bytes": 2048,
"views": 42
}
```
**Error responses:**
- `404 Not Found` — Paste does not exist
- `410 Gone` — Paste has expired or been burned
- `403 Forbidden` — Private paste, password required (provide via `X-Paste-Password` header)
### Read Raw Content
**Endpoint:** `GET /api/v1/pastes/{id}/raw`
Returns plain text (`Content-Type: text/plain; charset=utf-8`). No JSON wrapping. Designed for `curl`, piping, and CLI tooling.
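**Cache headers for raw content** (public and unlisted pastes; for pastes with an expiry, `max-age` should not exceed the time remaining until `expires_at`, otherwise caches can serve content past expiration):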
```http
Cache-Control: public, max-age=86400, immutable
ETag: "sha256:"
```
For burn-after-read pastes:
```http
Cache-Control: no-store
```
### List User's Pastes
**Endpoint:** `GET /api/v1/users/me/pastes?cursor={cursor}&limit=20`
**Pagination:** Cursor-based using `created_at` timestamp. Offset-based pagination degrades at high page numbers because the database must scan and discard all preceding rows.
**Response (200 OK):**
```json
{
  "pastes": [
    {
      "id": "aB3kF9x",
      "title": "My Snippet",
      "language": "python",
      "created_at": "2025-01-15T10:30:00Z",
      "expires_at": "2025-01-22T10:30:00Z",
      "visibility": "unlisted",
      "size_bytes": 2048,
      "views": 42
    }
  ],
  "next_cursor": "2025-01-14T08:00:00Z",
  "has_more": true
}
```
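The cursor logic behind this response can be sketched as keyset pagination over rows sorted newest first. The names `paginate`, `PasteSummary`, and `PasteListPage` are illustrative. The sketch relies on ISO 8601 UTC timestamps of identical format comparing correctly as strings, and it fetches `limit + 1` rows to compute `has_more` without a separate count.

```typescript
interface PasteSummary {
  id: string;
  created_at: string; // ISO 8601 UTC, e.g. "2025-01-15T10:30:00Z"
}

interface PasteListPage {
  pastes: PasteSummary[];
  next_cursor: string | null;
  has_more: boolean;
}

// `rows` must already be sorted by created_at descending, as the
// (user_id, created_at DESC) index would return them.
function paginate(rows: PasteSummary[], cursor: string | null, limit: number): PasteListPage {
  const pageSize = Math.max(1, limit);
  const older = cursor === null ? rows : rows.filter((r) => r.created_at < cursor);
  const fetched = older.slice(0, pageSize + 1); // one extra row signals another page
  const pastes = fetched.slice(0, pageSize);
  const hasMore = fetched.length > pageSize;
  const last = pastes[pastes.length - 1];
  return {
    pastes,
    next_cursor: hasMore && last ? last.created_at : null,
    has_more: hasMore,
  };
}
```

Two pastes created in the same instant could straddle a page boundary and be skipped, so a production cursor would likely combine `created_at` with `id` as a tie-breaker.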
### Delete Paste
**Endpoint:** `DELETE /api/v1/pastes/{id}`
**Response:** `204 No Content`
Soft-deletes the paste (sets `deleted_at`). S3 object cleanup happens in the background via the expiration service.
## Data Modeling
### Paste Metadata Schema
**Primary store:** PostgreSQL (ACID guarantees for metadata, rich querying for user dashboards and admin tooling)
```sql title="schema.sql" collapse={1-2}
-- Paste metadata (content stored in S3)
-- Indexes designed for the three hot query patterns: by ID, by user, by expiration
CREATE TABLE pastes (
    id              VARCHAR(8) PRIMARY KEY,       -- KGS-generated Base62 key
    user_id         UUID REFERENCES users(id),    -- NULL for anonymous pastes
    title           VARCHAR(100),
    language        VARCHAR(30),                  -- Detected or user-specified
    visibility      VARCHAR(10) DEFAULT 'unlisted'
                    CHECK (visibility IN ('public', 'unlisted', 'private')),
    password_hash   VARCHAR(60),                  -- bcrypt hash, NULL if not private
    -- Content metadata (content itself lives in S3)
    content_hash    CHAR(64) NOT NULL,            -- SHA-256 of raw content
    size_bytes      INT NOT NULL,
    compressed_size INT NOT NULL,
    -- Lifecycle
    burn_after_read BOOLEAN DEFAULT FALSE,
    read_count      INT DEFAULT 0,
    expires_at      TIMESTAMPTZ,                  -- NULL = never expires
    created_at      TIMESTAMPTZ DEFAULT NOW(),
    deleted_at      TIMESTAMPTZ                   -- Soft delete
);
-- Lookup by ID (primary key handles this)
-- User's pastes, newest first (for dashboard)
CREATE INDEX idx_pastes_user
    ON pastes(user_id, created_at DESC)
    WHERE deleted_at IS NULL;
-- Expiration sweep: find expired pastes efficiently
CREATE INDEX idx_pastes_expiry
    ON pastes(expires_at)
    WHERE expires_at IS NOT NULL AND deleted_at IS NULL;
-- Deduplication lookup: find pastes with same content
CREATE INDEX idx_pastes_content_hash
    ON pastes(content_hash);
```
### S3 Object Layout
```
s3://paste-content/
├── pastes/
│   ├── aB3kF9x.zst          # Compressed paste content
│   ├── kL9mP2q.zst
│   └── ...
└── highlighted/
    ├── aB3kF9x.html         # Pre-rendered syntax-highlighted HTML
    └── ...
```
**Object naming:** Using paste_id as the S3 key. The random distribution of Base62 IDs avoids S3 hot-partition issues (S3 partitions by key prefix, and random prefixes distribute evenly).
### Database Selection Matrix
| Data Type | Store | Rationale |
| ------------------- | ------------------------ | ------------------------------------------------------------------------------------- |
| Paste metadata | PostgreSQL | ACID, complex queries (user dashboards, admin search), partial indexes for expiration |
| Paste content | S3 | Unlimited scale, $0.023/GB, 11 nines durability, CDN-friendly |
| Highlighted HTML | S3 | Large generated content, immutable, cacheable |
| Hot paste cache | Redis Cluster | Sub-ms reads, TTL-based eviction, LRU for memory management |
| Rate limit counters | Redis | Atomic increments, sliding window via sorted sets |
| KGS key pool | PostgreSQL (or DynamoDB) | Atomic batch allocation with row-level locking |
| User accounts | PostgreSQL | Relational data, auth queries |
### Sharding Strategy
At Pastebin scale (~180M pastes/year), a single PostgreSQL instance handles the metadata comfortably (~36 GB of new metadata per year). Vertical scaling with read replicas is sufficient.
**When to shard:** If metadata exceeds ~500 GB or write throughput exceeds single-node capacity (~10K TPS for PostgreSQL):
- **Shard key:** `paste_id` (hash-based). Distributes uniformly because KGS generates random keys.
- **User-scoped queries:** User dashboard queries (`WHERE user_id = ?`) would span all shards. Mitigate with a denormalized user → paste_ids mapping table or application-level scatter-gather.
- **Expiration sweep:** Each shard runs its own sweep independently.
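Hash-based routing on `paste_id` can be sketched in a few lines. The function name `shardFor`, the choice of 32-bit FNV-1a, and the shard count are assumptions for illustration; any stable hash with good distribution would serve.

```typescript
// Maps a paste ID to a shard index in [0, shardCount) using 32-bit FNV-1a.
function shardFor(pasteId: string, shardCount: number): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < pasteId.length; i++) {
    hash ^= pasteId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash % shardCount;
}

const shardIndex = shardFor("aB3kF9x", 8); // stable for a given ID and shard count
```

Plain modulo routing remaps most keys when the shard count changes, so a deployment that expects to add shards would typically use consistent hashing or a fixed number of virtual shards instead.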
## Low-Level Design
### URL Generation: Key Generation Service
The KGS is the critical component that decouples ID generation from the write path.
#### Approach Comparison
**Option 1: Auto-increment + Base62**
- Database assigns sequential ID, application Base62-encodes it
- Pros: Simplest, guaranteed unique
- Cons: Single point of failure, sequential = predictable (enumerable), doesn't scale horizontally
- Best when: Single-server deployment
**Option 2: MD5/SHA hash truncation**
- Hash(content + salt), take first 7 Base62 characters
- Pros: Content-derived (same content = same hash if desired), no coordinator
- Cons: Birthday problem. With 62^7 ≈ 3.52 trillion possible values, the probability of at least one collision is already ~24% at ~1.4M pastes and passes 50% near 2.2M, so collision handling is unavoidable and adds write-path latency
- Best when: Deduplication is the primary goal
**Option 3: Snowflake ID + Base62**
- 64-bit Snowflake (timestamp + node + sequence), Base62-encoded
- Pros: Time-ordered, no coordination, 4M IDs/second/node
- Cons: 11-character Base62 output (longer URLs), leaks creation timestamp
- Best when: Time-ordering is valuable, longer URLs acceptable
**Option 4: Pre-generated Key Service (KGS)**
- Offline process generates random 7-character Base62 strings, stores in pool
- Pros: Zero collision, O(1) retrieval, decoupled, predictable key length
- Cons: Requires separate service, wastes keys on crashes
- Best when: Short predictable-length URLs, high write throughput
**Chosen approach:** KGS (Option 4)
**Rationale:** Paste URLs must be short (7 characters) and unpredictable (no enumeration). KGS achieves both while eliminating collision handling from the write path entirely. The keyspace (62^7 = 3.52 trillion) is practically inexhaustible.
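The birthday-problem risk for hash truncation and the keyspace figure for KGS can be checked numerically. This is a small sketch using the standard approximation p ≈ 1 − exp(−n(n−1)/2N); the constant and function names are illustrative.

```typescript
// Probability of at least one collision among n uniformly random keys drawn
// from a keyspace of size N, using the approximation 1 - exp(-n(n-1) / 2N).
function collisionProbability(n: number, keyspace: number): number {
  return 1 - Math.exp(-(n * (n - 1)) / (2 * keyspace));
}

const KEYSPACE_7 = 62 ** 7; // 3,521,614,606,208 possible 7-character Base62 keys

// Random truncated hashes: chance that some pair collides after 1.4M pastes.
const pAt1_4M = collisionProbability(1_400_000, KEYSPACE_7); // ≈ 0.24

// KGS: years until the keyspace is used up at 500K pastes/day.
const yearsToExhaust = KEYSPACE_7 / (500_000 * 365); // ≈ 19,300
```

The two numbers answer different questions: random assignment without a uniqueness check hits its first collision early, while a pool that hands out each key exactly once is limited only by total keyspace.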
#### KGS Implementation Details

**Batch allocation query (PostgreSQL):**
```sql title="allocate_keys.sql"
-- Atomically claim a batch of keys for an app server
WITH batch AS (
    SELECT key FROM key_pool
    WHERE status = 'available'
    LIMIT 1000
    FOR UPDATE SKIP LOCKED
)
UPDATE key_pool
SET status = 'allocated', allocated_to = 'server-1', allocated_at = NOW()
WHERE key IN (SELECT key FROM batch)
RETURNING key;
```
`FOR UPDATE SKIP LOCKED` ensures multiple app servers can fetch key batches concurrently without blocking each other.
### Content Storage and Compression
#### Write Path
1. **Validate content:** Check size against tier limit (512 KB free, 10 MB paid)
2. **Compute SHA-256 hash:** Used for deduplication check and integrity verification
3. **Optional dedup check:** If enabled, look up `content_hash` in the metadata index (`idx_pastes_content_hash`). On a match, skip the S3 write and point the new paste_id at the existing blob.
4. **Compress with zstd:** Level 3 balances compression ratio (~65% on text) with CPU cost. Below 256 bytes, skip compression (overhead exceeds savings).
5. **Upload to S3:** Key = `pastes/{paste_id}.zst`, metadata = `{content_hash, original_size}`
6. **Persist metadata:** Insert into PostgreSQL
#### Read Path
1. **Check Redis:** Full deserialized paste (metadata + decompressed content) cached with TTL
2. **On miss — metadata from PostgreSQL:** Check expiration, visibility, burn-after-read
3. **Content from S3:** Download compressed object, decompress with zstd
4. **Populate Redis:** Cache the decompressed content for subsequent reads
#### Compression Benchmarks on Text Content
| Algorithm | Ratio (5 KB text) | Compress speed | Decompress speed | Notes |
| ---------------- | ----------------- | -------------- | ---------------- | -------------------------------- |
| gzip (level 6) | 65% reduction | 150 MB/s | 400 MB/s | Universal support |
| zstd (level 3) | 67% reduction | 500 MB/s | 1,700 MB/s | Best balance for dynamic content |
| Brotli (level 4) | 70% reduction | 80 MB/s | 400 MB/s | Best ratio, but slow compression |
**Why zstd over Brotli:** Write latency matters for paste creation. zstd at level 3 compresses about 6x faster than Brotli at level 4 (500 MB/s vs. 80 MB/s) while giving up only 3 percentage points of size reduction (67% vs. 70%). For a write-path operation, this trade-off strongly favors zstd.
### Syntax Highlighting
#### Async Highlighting Pipeline
Syntax highlighting is CPU-intensive for large pastes. Running it synchronously on the write path would spike p99 write latency.
**Flow:**
1. Paste created → metadata and raw content stored
2. Message enqueued: `{paste_id, language, size_bytes}`
3. Highlight worker dequeues, retrieves raw content from S3
4. Runs highlighting (tree-sitter or Pygments/Chroma depending on language support)
5. Stores rendered HTML to `s3://paste-content/highlighted/{paste_id}.html`
6. Updates metadata: `highlighted_at = NOW()`
**First-read before highlighting completes:** If a user reads a paste before highlighting finishes, the read service returns raw content with client-side highlighting as a fallback (using a JavaScript library like Highlight.js or Shiki). Once the server-rendered HTML is available, subsequent reads serve it directly.
**Language detection:** If the user doesn't specify a language, the worker attempts detection using file extension heuristics, shebang lines, and statistical classifiers (similar to GitHub's Linguist). Fallback: plain text.
## Frontend Considerations
### Performance-Critical Decisions
#### Paste Rendering Strategy
**Problem:** Pastes can be up to 10 MB of text. Rendering this as syntax-highlighted HTML in the browser generates a massive DOM tree (millions of nodes for large files).
**Solution: Virtualized rendering for large pastes**
- Pastes < 100 KB: Render full highlighted HTML (reasonable DOM size)
- Pastes 100 KB–1 MB: Virtual scrolling—only render visible lines plus buffer (~100 lines visible, ~200 rendered)
- Pastes > 1 MB: Show first 1,000 lines with "Load more" or "Download raw" option
**Implementation:**
- Virtual scrolling using a library like `@tanstack/virtual`
- Each "row" is a highlighted line of code
- Line numbers are `position: sticky` for scroll synchronization
- Search within paste uses a web worker to avoid blocking the main thread
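The core of virtual scrolling is computing which lines to render for the current scroll position. The sketch below assumes a fixed line height and uses an overscan of 50 lines on each side, which matches the "~100 visible, ~200 rendered" figure above; `visibleRange` is an illustrative helper, not part of `@tanstack/virtual`.

```typescript
interface VisibleRange {
  start: number; // first line index to render (inclusive)
  end: number;   // last line index to render (exclusive)
}

// `overscan` is the number of extra lines kept above and below the viewport
// so fast scrolling does not reveal unrendered gaps.
function visibleRange(
  scrollTop: number,
  viewportHeight: number,
  lineHeight: number,
  totalLines: number,
  overscan = 50,
): VisibleRange {
  const first = Math.floor(scrollTop / lineHeight);
  const visibleCount = Math.ceil(viewportHeight / lineHeight);
  return {
    start: Math.max(0, first - overscan),
    end: Math.min(totalLines, first + visibleCount + overscan),
  };
}

// A 2000px viewport with 20px lines shows 100 lines; scrolled to line 500,
// lines 450 through 649 are rendered.
const range = visibleRange(10_000, 2_000, 20, 100_000);
```

Only the lines in `[start, end)` are mounted; the rest of the scroll height is supplied by a spacer element sized to `totalLines * lineHeight`.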
#### Data Structure for Paste Viewer
```typescript title="paste-viewer-state.ts" collapse={1-3, 18-25}
// State management for the paste viewer component
// Separates server data from ephemeral UI state
interface PasteViewerState {
  // Server data (from API response)
  paste: {
    id: string
    content: string
    highlightedHtml: string | null // null = highlighting in progress
    language: string
    lineCount: number
  }
  // UI state (ephemeral, not persisted)
  ui: {
    wordWrap: boolean
    showLineNumbers: boolean
    selectedLines: Set<number> // For line range selection (e.g., #L5-L10)
    searchQuery: string
    searchMatches: number[] // Line numbers with matches
    currentMatchIndex: number
  }
}
```
**Why this separation:** Server data is immutable after fetch (paste content never changes). UI state is ephemeral and driven entirely by user interaction. Separating them prevents unnecessary re-renders when toggling UI options.
#### Line Range Selection
Paste services commonly support linking to specific lines (e.g., `paste.example.com/aB3kF9x#L5-L10`).
**Implementation:**
- Parse URL fragment on load to determine initial selection
- Click on line number selects that line, Shift+Click extends selection
- Update URL fragment without triggering navigation (using `history.replaceState`)
- Scroll to selected line on initial load
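Parsing and formatting the line fragment can be sketched as two small pure functions. The names `parseLineFragment` and `formatLineFragment` are illustrative, and the accepted syntax is assumed to be exactly `#L<n>` or `#L<n>-L<m>` with 1-based line numbers.

```typescript
interface LineRange {
  start: number; // 1-based, inclusive
  end: number;   // 1-based, inclusive
}

// Parses fragments such as "#L5" or "#L5-L10"; returns null for anything else.
function parseLineFragment(fragment: string): LineRange | null {
  const match = /^#L(\d+)(?:-L(\d+))?$/.exec(fragment);
  if (!match) return null;
  const a = Number(match[1]);
  const b = match[2] !== undefined ? Number(match[2]) : a;
  if (a < 1 || b < 1) return null;
  // Normalize reversed ranges, e.g. after Shift+Click above the anchor line.
  return { start: Math.min(a, b), end: Math.max(a, b) };
}

// Builds the fragment to pass to history.replaceState after a selection change.
function formatLineFragment(range: LineRange): string {
  return range.start === range.end
    ? `#L${range.start}`
    : `#L${range.start}-L${range.end}`;
}
```

On load the viewer would call `parseLineFragment(location.hash)`, and after each click it would write `formatLineFragment(selection)` back with `history.replaceState` so the URL stays shareable without triggering navigation.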
#### API Response Optimization
**Initial page load returns metadata + content in a single response** (no separate fetch for highlighted HTML):
```json
{
"id": "aB3kF9x",
"content": "raw text...",
"highlighted_html": "... ",
"language": "python",
"line_count": 42
}
```
If `highlighted_html` is `null` (highlighting still in progress), the frontend falls back to client-side highlighting with Shiki or Highlight.js. In the common case highlighting completes before the user loads the page; the fallback covers the remaining cases without a loading spinner.
**Raw content endpoint** (`/raw/aB3kF9x`) returns `text/plain` directly—no JSON parsing overhead for CLI consumers.
## Infrastructure Design
### Cloud-Agnostic Architecture
#### Object Storage
**Concept:** Durable blob storage for paste content
**Requirements:**
- 11 nines durability (cannot lose paste content)
- Low cost per GB (most content is cold)
- CDN integration for direct edge serving
- Lifecycle policies for storage tiering
**Open-source options:**
- MinIO — S3-compatible, self-hosted, battle-tested
- Ceph RADOS Gateway — S3-compatible, complex operations
**Managed options:**
- AWS S3, GCS (Google Cloud Storage), Azure Blob Storage
#### Cache Layer
**Concept:** In-memory cache for hot paste content
**Requirements:**
- Sub-millisecond reads
- TTL-based eviction
- LRU eviction when memory is full
- Cluster mode for horizontal scaling
**Options:**
- Redis Cluster — Rich data structures, Lua scripting for atomic operations
- Memcached — Simpler, multi-threaded, no persistence
- KeyDB — Redis-compatible, multi-threaded
#### Message Queue
**Concept:** Async job processing for syntax highlighting and expiration cleanup
**Requirements:**
- At-least-once delivery
- Dead letter queue for failed jobs
- Visibility timeout (prevent duplicate processing)
**Options:**
- Redis Streams — Simple, good for moderate throughput
- RabbitMQ — Feature-rich, moderate scale
- Apache Kafka — High throughput, overkill for this use case unless analytics pipeline is added
### AWS Reference Architecture
#### Compute
| Component | Service | Configuration |
| ----------------- | -------------------- | ------------------------------------------- |
| API servers | ECS Fargate | Auto-scaling 2-20 tasks, 1 vCPU / 2 GB each |
| Highlight workers | ECS Fargate (Spot) | Cost-optimized, tolerant of interruption |
| Expiration cron | EventBridge + Lambda | Hourly trigger, 15-min timeout |
| KGS generator | Lambda (scheduled) | Daily batch generation |
#### Data Stores
| Data | Service | Rationale |
| -------------- | ---------------------------- | --------------------------------------- |
| Paste metadata | RDS PostgreSQL (Multi-AZ) | ACID, managed backups, read replicas |
| Paste content | S3 Standard | Durability, cost, CDN integration |
| Warm content | S3 Infrequent Access | 40% cheaper, 30-day minimum storage duration |
| Cold content | S3 Glacier Instant Retrieval | Up to 68% cheaper than Standard-IA, ms retrieval |
| Hot cache | ElastiCache Redis Cluster | Sub-ms reads, 3 shards |
| Rate limits | ElastiCache Redis | Atomic counters, sorted sets |
#### Storage Tiering with S3 Lifecycle
```json title="s3-lifecycle-policy.json" collapse={1-2, 21-22}
{
  "Rules": [
    {
      "ID": "paste-content-tiering",
      "Status": "Enabled",
      "Filter": { "Prefix": "pastes/" },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        }
      ],
      "Expiration": {
        "Days": 1825
      }
    }
  ]
}
```
**Cost impact at 1.6 TB (5-year accumulated):**
| Tier | Data Volume | Monthly Cost |
| ------------------------- | ----------- | ---------------- |
| S3 Standard (< 30 days) | ~26 GB | $0.60 |
| S3 IA (30-90 days) | ~52 GB | $0.65 |
| S3 Glacier IR (> 90 days) | ~1.5 TB | $6.00 |
| **Total** | **1.6 TB** | **~$7.25/month** |
Storage cost is negligible. The dominant costs are compute (API servers) and Redis.
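The table's arithmetic can be reproduced with a short sketch. The per-GB prices below are assumptions (approximate us-east-1 list prices); the volumes come from the table, and `monthly_cost_by_tier` is a hypothetical helper name.

```python
# Assumed per-GB-month storage prices in USD (approximate list prices).
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
}

# Data volume per tier in GB, taken from the cost table.
VOLUME_GB = {
    "STANDARD": 26,
    "STANDARD_IA": 52,
    "GLACIER_IR": 1500,
}


def monthly_cost_by_tier(volume_gb, price_per_gb):
    """Return {tier: monthly storage cost in USD}, rounded to cents."""
    return {tier: round(gb * price_per_gb[tier], 2) for tier, gb in volume_gb.items()}


costs = monthly_cost_by_tier(VOLUME_GB, PRICE_PER_GB_MONTH)
total = round(sum(costs.values()), 2)

for tier, cost in costs.items():
    print(f"{tier:<12} ${cost:.2f}")
print(f"{'TOTAL':<12} ${total:.2f}")
```

Request and retrieval charges are left out of this sketch; it covers storage only.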
#### Self-Hosted Alternatives
| Managed Service | Self-Hosted Option | When to self-host |
| --------------- | -------------------- | ----------------------------------------------------- |
| RDS PostgreSQL | PostgreSQL on EC2 | Cost at scale, specific extensions (e.g., pg_partman) |
| ElastiCache | Redis on EC2 | Specific Redis modules, cost optimization |
| S3 | MinIO on EC2 | Multi-cloud portability, data sovereignty |
| ECS Fargate | Kubernetes (EKS/k3s) | Existing K8s expertise, multi-cloud |
### Production Deployment

## Variations
### Encrypted Pastes (PrivateBin Model)
For maximum privacy, client-side encryption ensures the server never sees plaintext:
1. Browser generates a random 256-bit AES-GCM (Advanced Encryption Standard, Galois/Counter Mode) key
2. Content is compressed (zlib), then encrypted client-side
3. Encrypted blob is sent to the server
4. Decryption key is placed in the URL fragment (`#key=...`), which browsers never send to the server
5. On read, the browser fetches the encrypted blob and decrypts locally
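Step 4 is the part that keeps the key away from the server. A minimal sketch of that property, using only the standard library: the host `paste.example` and both helper names are hypothetical, and base64url is used here for brevity where PrivateBin uses Base58. Encryption itself is omitted.

```python
import base64
import secrets
from urllib.parse import urlsplit


def build_share_url(base_url: str, paste_id: str, key: bytes) -> str:
    """Place the key in the URL fragment so it stays on the client."""
    encoded_key = base64.urlsafe_b64encode(key).decode("ascii").rstrip("=")
    return f"{base_url}/{paste_id}#key={encoded_key}"


def request_target(url: str) -> str:
    """Return the path and query that a browser puts on the HTTP request line."""
    parts = urlsplit(url)
    return parts.path + (f"?{parts.query}" if parts.query else "")


key = secrets.token_bytes(32)  # random 256-bit key, as in step 1
share_url = build_share_url("https://paste.example", "aB3kF9x", key)

print(request_target(share_url))      # what the server sees
print(urlsplit(share_url).fragment)   # what stays in the browser
```

The fragment is parsed separately from the path and query, which is why the server only ever receives the paste ID.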
**Trade-offs:**
- ✅ True zero-knowledge: server cannot read paste content even under subpoena
- ❌ No server-side syntax highlighting (server cannot read content)
- ❌ No server-side search or content scanning
- ❌ URL is much longer (paste_id + encryption key)
- ❌ Key loss = permanent content loss
PrivateBin implements this model with AES-256-GCM encryption, PBKDF2-HMAC-SHA256 key derivation (100K iterations), and Base58-encoded keys in the URL fragment.
### Multi-File Pastes (GitHub Gists Model)
GitHub Gists extend the paste concept by backing each gist with a full Git repository:
- Each gist is a bare Git repo on disk
- Supports multiple files, full revision history, forking, and cloning
- URLs use hexadecimal IDs (Git-style)
- Discovery layer (starring, search) is a separate relational service
**Trade-off:** Dramatically more storage and compute per paste (Git object overhead), but enables collaboration workflows that simple paste services cannot.
## Conclusion
Pastebin's design centers on three decisions that cascade through the architecture:
1. **Split storage (PostgreSQL metadata + S3 content)** enables independent scaling, cost-effective tiering, and CDN-friendly raw content serving. The extra network hop on cache miss is justified by 5x storage cost reduction and operational simplicity.
2. **KGS for URL generation** eliminates collision handling from the write path entirely. The 3.52 trillion keyspace (7-char Base62) is practically inexhaustible, and batch allocation to app servers removes per-request coordination.
3. **Hybrid expiration (lazy read check + active sweep)** guarantees readers never see expired content while reclaiming storage in the background. Burn-after-read requires row-level locking for atomicity but stays off the hot read path.
**What this design sacrifices:**
- Write-path latency includes compression + S3 upload (~100-200ms added vs. database-only)
- Syntax highlighting is eventually consistent (async), requiring client-side fallback
- Burn-after-read pastes cannot use CDN caching
**Future improvements worth considering:**
- Content-addressable storage for automatic deduplication (Path C) if duplication rate exceeds 10%
- WebSocket-based collaborative editing for real-time multi-user pastes
- Differential compression (zstd dictionaries trained on common paste types) for further storage reduction
## Appendix
### Prerequisites
- Distributed storage concepts (object stores, caching tiers)
- Database indexing strategies (partial indexes, covering indexes)
- HTTP caching semantics (`Cache-Control`, `ETag`, immutable resources)
- Basic cryptographic hashing (SHA-256, collision resistance)
### Terminology
| Term | Definition |
| --------------- | ---------------------------------------------------------------------------------------------------- |
| KGS | Key Generation Service — pre-generates unique short codes offline |
| Base62 | Encoding using `[A-Za-z0-9]` (62 characters), producing URL-safe strings |
| Burn-after-read | Paste that self-destructs after a single read |
| CAS | Content-Addressable Storage — storage where the key is derived from the content's cryptographic hash |
| zstd | Zstandard — Facebook-developed compression algorithm balancing ratio and speed |
| PII | Personally Identifiable Information — data that can identify an individual |
### Summary
- **Split storage architecture:** PostgreSQL for metadata (36 GB/year), S3 for content (320 GB/year compressed). Independent scaling, 5x storage cost reduction vs. database-only.
- **KGS with Base62 7-character keys:** 3.52 trillion keyspace, zero write-path collisions, batch allocation eliminates per-request coordination.
- **Multi-tier caching (CDN → Redis → S3):** Immutable paste content enables aggressive caching. Sub-100ms reads globally.
- **Hybrid expiration strategy:** Lazy read checks guarantee correctness; hourly active sweeps reclaim storage. Burn-after-read uses row-level locking for atomicity.
- **zstd compression at write time:** 65-67% reduction on text content, 6x faster than Brotli at comparable ratios.
- **Async syntax highlighting:** CPU-intensive work decoupled from write path; client-side fallback until server rendering completes.
### References
- [Pastebin.com](https://pastebin.com) — Original paste service (2002), PHP + MySQL, burn-after-read since 2020
- [GitHub Gists](https://docs.github.com/en/get-started/writing-on-github/editing-and-sharing-content-with-gists) — Git-backed paste service with revision history and forking
- [PrivateBin](https://github.com/PrivateBin/PrivateBin) — Zero-knowledge paste service, AES-256-GCM client-side encryption
- [dpaste](https://docs.dpaste.org) — Django-based paste service, collision-handling slug generation
- [Haste-server](https://github.com/SparkUniverse/haste-server) — Node.js paste service with pluggable storage backends
- [RFC 8878 — Zstandard Compression](https://datatracker.ietf.org/doc/html/rfc8878) — IETF specification for zstd data format
- [Twitter Snowflake](https://github.com/twitter-archive/snowflake) — Distributed ID generation (archived)
- [AWS S3 Storage Classes](https://aws.amazon.com/s3/storage-classes/) — Tiered storage pricing and access patterns
- [System Design Primer — Pastebin](https://github.com/donnemartin/system-design-primer/blob/master/solutions/system_design/pastebin/README.md) — Reference design with scale estimation
---
## Design an API Rate Limiter: Distributed Throttling, Multi-Tenant Quotas, and Graceful Degradation
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/system-design-problems/design-api-rate-limiter
**Category:** System Design / System Design Problems
**Description:** A comprehensive system design for a distributed API rate limiting service covering algorithm selection, Redis-backed counting, multi-tenant quota management, rate limit header communication, and graceful degradation under failure. This design addresses sub-millisecond rate check latency at 500K+ decisions per second with configurable per-tenant policies and fail-open resilience.
# Design an API Rate Limiter: Distributed Throttling, Multi-Tenant Quotas, and Graceful Degradation
A comprehensive system design for a distributed API rate limiting service covering algorithm selection, Redis-backed counting, multi-tenant quota management, rate limit header communication, and graceful degradation under failure. This design addresses sub-millisecond rate check latency at 500K+ decisions per second with configurable per-tenant policies and fail-open resilience.

High-level architecture: API Gateway consults the Rate Limit Service before forwarding requests. The decision engine checks counters in Redis, applies rules from configuration, and returns allow/deny with remaining quota. Backend services are shielded from overload.
## Abstract
A rate limiter maps request identifiers (user ID, API key, IP) to counters and enforces thresholds—conceptually trivial, but the design space branches around three axes: which counting algorithm balances accuracy against memory, how to maintain consistent counts across distributed nodes, and how the limiter itself degrades when its backing store fails.
**Core architectural decisions:**
| Decision | Choice | Rationale |
| -------------------- | -------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| Algorithm | Sliding window counter | Best accuracy-to-memory ratio; Cloudflare measured 0.003% error across 400M requests |
| Counting store | Redis Cluster with Lua scripts | Atomic operations, sub-ms latency, horizontal scaling via hash slots |
| Rule storage | YAML config with hot reload | Declarative, version-controlled, no database dependency for rule evaluation |
| Multi-tenancy | Hierarchical quotas (global → tenant → endpoint) | Prevents noisy neighbors while allowing per-tier customization |
| Failure mode | Fail-open with circuit breaker | Availability over strictness—a rate limiter outage must not become an API outage |
| Client communication | RFC 9110 `Retry-After` + IETF `RateLimit`/`RateLimit-Policy` headers | Standards-based; clients can implement backoff without guessing |
**Key trade-offs accepted:**
- Sliding window counter is an approximation (not exact like sliding window log), but uses O(1) memory per key vs. O(n) for the log approach
- Centralized Redis adds a network hop per request, but guarantees globally consistent counts across all API gateway nodes
- Fail-open means a Redis outage temporarily disables rate limiting, but the alternative (fail-closed = reject all traffic) is worse for availability
- Lua scripts add operational complexity (debugging, versioning) but are necessary for atomic check-and-increment
**What this design optimizes:**
- Sub-millisecond rate check latency via Redis pipelining and local rule caching
- Zero-coordination counting across API gateway nodes (Redis is the single source of truth)
- Operational safety through fail-open, shadow mode, and gradual rollout
- Multi-tenant fairness without per-tenant infrastructure
## Requirements
### Functional Requirements
| Requirement | Priority | Notes |
| --------------------------- | -------- | ---------------------------------------------------------------------- |
| Per-user rate limiting | Core | Enforce request quotas per authenticated user or API key |
| Per-IP rate limiting | Core | Protect against unauthenticated abuse |
| Per-endpoint rate limiting | Core | Different limits for different API operations (e.g., writes vs. reads) |
| Multi-tier quotas | Core | Free, Pro, Enterprise tiers with different limits |
| Rate limit headers | Core | Communicate remaining quota and reset time to clients |
| Burst allowance | Core | Allow short bursts above steady-state rate |
| Global rate limiting | Extended | Protect backend services from total aggregate overload |
| Concurrent request limiting | Extended | Cap in-flight requests (Stripe-style) |
| Quota management API | Extended | CRUD operations for tenant quotas |
| Rate limit dashboard | Extended | Real-time visibility into rate limit decisions |
### Non-Functional Requirements
| Requirement | Target | Rationale |
| ---------------- | ---------------------------------------------- | --------------------------------------------------------------------------- |
| Decision latency | p99 < 1ms (local cache hit), p99 < 5ms (Redis) | Rate check is on the critical path of every API request |
| Availability | 99.99% (4 nines) | Rate limiter failure should not cause API failure (fail-open) |
| Throughput | 500K decisions/second per cluster | Based on scale estimation below |
| Accuracy | < 1% false positive rate | Wrongly blocking legitimate users is worse than letting some excess through |
| Consistency | Approximate (eventual) | Exact global consistency is too expensive; small overcounting is acceptable |
| Rule propagation | < 10 seconds | New or updated rules take effect within seconds |
### Scale Estimation
**API platform scale (mid-size SaaS reference):**
- Registered API consumers: 100K
- Daily Active API keys: 10K
- Average requests per active key: 5K/day
**Traffic:**
- Daily requests: 10K keys x 5K requests = 50M requests/day
- Average RPS: 50M / 86,400 ≈ 580 RPS
- Peak multiplier (5x): ~2,900 RPS
- Spike (viral event, 20x): ~11,600 RPS
- Rate limit checks per request: 2-3 (user + endpoint + global), so a 20x spike produces up to 11,600 x 3 ≈ 35K decisions/s
**For large-scale platform (GitHub/Stripe-class):**
- 10M+ API consumers, 100K+ concurrent
- 500K+ rate limit decisions per second
- 10M+ active rate limit keys in Redis
**Storage (Redis):**
- Per rate limit key: ~100 bytes (key + counter + TTL metadata)
- Active keys: 10M x 100B = 1 GB
- With sliding window (2 counters per key): ~2 GB
- Redis cluster with 3 copies of each shard (1 primary + 2 replicas): ~6 GB total
**Key insight:** The rate limiter itself is a high-throughput, low-latency service that processes more requests than the APIs it protects. The dominant challenge is keeping decision latency under 1ms while maintaining globally consistent counts.
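The mid-size estimate above can be checked with a few lines of arithmetic. This is a sketch with hypothetical function names; the inputs are the figures from the lists above, and the unrounded results come out slightly below the rounded values quoted in the text.

```python
SECONDS_PER_DAY = 86_400


def estimate_traffic(active_keys, requests_per_key_per_day, checks_per_request,
                     peak_multiplier, spike_multiplier):
    """Derive request and decision rates from daily usage figures."""
    daily_requests = active_keys * requests_per_key_per_day
    average_rps = daily_requests / SECONDS_PER_DAY
    return {
        "daily_requests": daily_requests,
        "average_rps": average_rps,
        "peak_rps": average_rps * peak_multiplier,
        "spike_rps": average_rps * spike_multiplier,
        "spike_decisions_per_s": average_rps * spike_multiplier * checks_per_request,
    }


def redis_memory_gb(active_keys, bytes_per_key, counters_per_key, copies):
    """Total Redis memory across all copies, in decimal GB."""
    return active_keys * bytes_per_key * counters_per_key * copies / 1e9


mid_size = estimate_traffic(10_000, 5_000, checks_per_request=3,
                            peak_multiplier=5, spike_multiplier=20)
large_memory_gb = redis_memory_gb(10_000_000, 100, counters_per_key=2, copies=3)

for name, value in mid_size.items():
    print(f"{name}: {value:,.0f}")
print(f"redis_memory_gb: {large_memory_gb}")
```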
## Design Paths
### Path A: Sidecar / Local Rate Limiting
**Best when:**
- Services are deployed in a service mesh (Istio, Linkerd)
- Approximate limits are acceptable
- Latency budget is extremely tight (< 100μs)
**Architecture:**

**Key characteristics:**
- Each sidecar maintains local token buckets in memory
- No network hop for rate decisions
- Limits are per-node, not global (a 1000 RPS limit across 10 nodes allows up to 10,000 RPS total)
**Trade-offs:**
- ✅ Lowest possible latency (in-process)
- ✅ No external dependency—immune to Redis failures
- ✅ Simple operations (no separate service to maintain)
- ❌ Limits are per-node: actual global rate = limit x node_count
- ❌ Cannot enforce accurate per-user quotas across nodes
- ❌ Auto-scaling changes effective limits (more nodes = higher actual rate)
**Real-world example:** Envoy's local rate limiting filter maintains per-connection and per-route token buckets with no external coordination.
### Path B: Centralized Rate Limit Service
**Best when:**
- Accurate per-user quotas required (billing, SLA enforcement)
- Multi-tenant API platform with tiered pricing
- Globally consistent rate limits across all nodes
**Architecture:**

**Key characteristics:**
- Dedicated rate limit service accessed via gRPC (Envoy's `rls.proto` pattern)
- Redis stores all counters—single source of truth
- Rules loaded from configuration, hot-reloadable
**Trade-offs:**
- ✅ Globally accurate counts (all nodes check same Redis)
- ✅ Per-user quotas enforced correctly regardless of which node handles the request
- ✅ Centralized rule management and observability
- ❌ Network hop per request (1-5ms latency added)
- ❌ Redis is a single point of failure (mitigated by fail-open)
- ❌ Additional infrastructure to operate
**Real-world example:** Envoy's global rate limiting uses a separate Go/gRPC service (`envoyproxy/ratelimit`) backed by Redis, accessed via the `ShouldRateLimit` RPC.
### Path C: Hybrid (Local + Global Synchronization)
**Best when:**
- Need both low latency and reasonable global accuracy
- Willing to accept slightly higher implementation complexity
- Scale is large enough that Redis round-trips are measurable
**Architecture:**

**Key characteristics:**
- Local counters handle rate decisions in-process (token bucket per key)
- Periodically sync local counts to Redis and pull global totals
- Sync interval trades accuracy for latency (e.g., every 1-5 seconds)
**Trade-offs:**
- ✅ Near-zero latency for rate decisions (local memory)
- ✅ Reasonably accurate global counts (within sync interval)
- ✅ Resilient to Redis failures (local counters keep working)
- ❌ Accuracy gap during sync intervals (can overshoot by sync_interval x node_count x rate)
- ❌ Complex state management (merge conflicts, counter drift)
- ❌ Two code paths to maintain and debug
**Real-world example:** Cloudflare's rate limiting uses per-PoP (Point of Presence) memcache counters with a sliding window algorithm. Anycast routing ensures a single IP's requests reach the same PoP, avoiding cross-PoP synchronization.
### Path Comparison
| Factor | Local (A) | Centralized (B) | Hybrid (C) |
| ---------------------- | ----------------- | -------------------- | ------------------- |
| Decision latency | < 100μs | 1-5ms | < 100μs |
| Global accuracy | Poor (per-node) | Exact | Approximate |
| Redis dependency | None | Hard dependency | Soft dependency |
| Operational complexity | Low | Medium | High |
| Multi-tenant quotas | No | Yes | Partial |
| Failure mode | Always works | Fail-open needed | Degrades gracefully |
| Best for | Internal services | Public API platforms | High-scale edge |
### This Article's Focus
This article focuses on **Path B (Centralized Rate Limit Service)** because:
1. Public API platforms require accurate per-user quotas for billing and SLA enforcement
2. Multi-tenant fairness demands globally consistent counts
3. The 1-5ms latency overhead is acceptable for most API gateways
4. Fail-open with circuit breaker adequately mitigates the Redis dependency risk
## Rate Limiting Algorithms
Before diving into the system design, a deep understanding of the available algorithms is essential—the algorithm choice cascades through data modeling, Redis memory usage, and accuracy guarantees.
### Token Bucket
**How it works:** Each rate limit key (user, API key) gets a bucket with capacity _b_ (burst size) and refill rate _r_ (tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected. Tokens are refilled at rate _r_ up to maximum _b_.
**Implementation trick:** No background refill process needed. On each request, compute elapsed time since last check and add `elapsed × r` tokens (capped at _b_). This is called a "lazy refill"—O(1) time, O(1) space per key.
**Parameters:**
- `bucket_size` (b): Maximum burst size. A bucket of 100 allows 100 requests in a burst.
- `refill_rate` (r): Steady-state rate. A rate of 10/s means 10 tokens added per second.
| Aspect | Value |
| ---------------- | -------------------------------------------------- |
| Time complexity | O(1) per request |
| Space complexity | O(1) per key (token count + last_refill_timestamp) |
| Accuracy | Exact for single-key decisions |
| Burst behavior | Allows bursts up to bucket size |
**Who uses it:** AWS API Gateway (token bucket for throttling), Stripe (token bucket via Redis for request rate limiting), Amazon (internally for many services).
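The lazy refill described above fits in a few lines. This is a single-process sketch, not a distributed implementation; the class and method names are illustrative, and the caller passes `now` explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float     # b: maximum burst size
    refill_rate: float  # r: tokens added per second
    tokens: float       # tokens currently available
    last_refill: float  # time of the last refill, in seconds

    @classmethod
    def full(cls, capacity, refill_rate, now):
        """Create a bucket that starts full."""
        return cls(capacity, refill_rate, tokens=capacity, last_refill=now)

    def allow(self, now, cost=1.0):
        """Refill lazily based on elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket.full(capacity=100, refill_rate=10, now=0.0)
burst_allowed = sum(bucket.allow(0.0) for _ in range(150))   # burst is capped at b
after_one_second = sum(bucket.allow(1.0) for _ in range(20)) # r tokens were refilled
print(burst_allowed, after_one_second)
```

Only two values per key are stored (`tokens` and `last_refill`), which matches the O(1) space noted in the table.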
### Leaky Bucket
**How it works:** Conceptually a FIFO queue that processes requests at a fixed output rate. Incoming requests enter the queue; if full, they are rejected. Requests "leak" out at a constant rate. First proposed by Jonathan S. Turner in 1986, standardized in ITU-T Recommendation I.371 (1993) for ATM networks as the Generic Cell Rate Algorithm (GCRA).
**Key distinction from token bucket:** The leaky bucket smooths output to a constant rate—it shapes traffic. The token bucket permits bursts up to bucket capacity—it polices traffic. In practice, many implementations conflate the two; the GCRA variant is mathematically equivalent to an inverted token bucket.
| Aspect | Value |
| ---------------- | ------------------------------------------------ |
| Time complexity | O(1) per request |
| Space complexity | O(1) per key (queue depth + last_leak_timestamp) |
| Accuracy | Exact |
| Burst behavior | No bursts—output is constant rate |
**Who uses it:** NGINX (`limit_req` module uses leaky bucket), HAProxy.
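The shaping behavior can be sketched without an actual queue by tracking only the next free departure slot, in the spirit of GCRA. This is an illustrative single-process model with hypothetical names, not how NGINX or HAProxy implement it.

```python
class LeakyBucketShaper:
    """Schedules departures at a fixed rate; rejects when the backlog is full."""

    def __init__(self, queue_capacity, leak_rate):
        self.queue_capacity = queue_capacity  # max requests waiting or in service
        self.interval = 1.0 / leak_rate       # seconds between departures
        self.next_free_slot = 0.0             # earliest time the next request may leave

    def offer(self, now):
        """Return the departure time for this request, or None if rejected."""
        departure = max(now, self.next_free_slot)
        queued_ahead = (departure - now) / self.interval
        if queued_ahead >= self.queue_capacity:
            return None
        self.next_free_slot = departure + self.interval
        return departure


shaper = LeakyBucketShaper(queue_capacity=4, leak_rate=2.0)
departures = [shaper.offer(0.0) for _ in range(6)]  # six requests arrive at once
print(departures)
```

A burst of arrivals is spread into evenly spaced departures, and anything beyond the backlog limit is rejected, which is the contrast with the token bucket drawn above.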
### Fixed Window Counter
**How it works:** Divide time into fixed windows (e.g., 1-minute intervals). Maintain a counter per key per window. Increment on each request. Reject when counter exceeds limit.
**The boundary burst problem:** A user can send the full limit at the end of window N and again at the start of window N+1, effectively doubling their rate in a short period. For a 100 req/min limit, a user could send 200 requests in a 2-second span straddling the window boundary.
| Aspect | Value |
| ---------------- | ----------------------------- |
| Time complexity | O(1) per request |
| Space complexity | O(1) per key (single counter) |
| Accuracy | Inexact at window boundaries |
| Burst behavior | Up to 2x limit at boundary |
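**Implementation advantage:** Trivially implemented with Redis `INCR` + `EXPIRE`. Each command is atomic on its own, but the pair is not: if the client dies between the two calls, the counter key never expires. Sending both in a `MULTI`/`EXEC` transaction (or a short Lua script) closes that gap.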
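The boundary burst is easy to reproduce with an in-memory model. This sketch uses a dictionary where a real deployment would use Redis keys with a TTL; the class name is illustrative.

```python
from collections import defaultdict


class FixedWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        # (key, window_start) -> count; never cleaned up in this sketch,
        # whereas Redis would expire old windows via TTL.
        self.counts = defaultdict(int)

    def allow(self, key, now):
        window_start = int(now // self.window) * self.window
        bucket = (key, window_start)
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True


limiter = FixedWindowCounter(limit=100, window_seconds=60)
end_of_window = sum(limiter.allow("user-1", 59.0) for _ in range(150))
start_of_next = sum(limiter.allow("user-1", 60.5) for _ in range(150))
print(end_of_window + start_of_next)  # requests admitted within 1.5 seconds
```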
### Sliding Window Log
**How it works:** Store the timestamp of every request in a sorted set. On each new request, remove entries older than the window, count remaining entries. If count exceeds limit, reject.
**The memory problem:** At 500 requests/day per user across 10K users, this stores 5M timestamps in Redis. Figma estimated ~20 MB for this scenario alone, making it impractical at scale.
| Aspect | Value |
| ---------------- | -------------------------------------------------------------- |
| Time complexity | O(n) worst case (removing expired entries), amortized O(log n) |
| Space complexity | O(n) per key (one entry per request within window) |
| Accuracy | Exact—no boundary artifacts |
| Burst behavior | Exact enforcement, no boundary issues |
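For comparison with the fixed window, here is a minimal in-memory sketch of the log approach for a single key. A `deque` stands in for the Redis sorted set, and the class name is illustrative.

```python
from collections import deque


class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now):
        # Drop entries that have fallen out of the window.
        cutoff = now - self.window
        while self.log and self.log[0] <= cutoff:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True


log_limiter = SlidingWindowLog(limit=100, window_seconds=60)
end_of_window = sum(log_limiter.allow(59.0) for _ in range(150))
just_after_boundary = sum(log_limiter.allow(60.5) for _ in range(150))
print(end_of_window, just_after_boundary, len(log_limiter.log))
```

The boundary burst disappears, but the log holds one entry per accepted request, which is the O(n) memory cost noted in the table.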
### Sliding Window Counter (Chosen)
**How it works:** Combines fixed window counters with weighted interpolation. Maintain counters for the current window and the previous window. The effective count is:
$$
\text{rate} = \text{prev\_count} \times \frac{\text{window\_size} - \text{elapsed}}{\text{window\_size}} + \text{current\_count}
$$
This approximation smooths the boundary burst problem while maintaining O(1) memory.
**Figma's variant (2017):** Instead of two large windows, use many small sub-windows (1/60th of the rate limit window). For an hourly limit, increment per-minute counters and sum the last 60. This reduces approximation error further.
| Aspect | Value |
| ---------------- | ----------------------------------------------------------- |
| Time complexity | O(1) per request |
| Space complexity | O(1) per key (two counters) or O(k) for k sub-windows |
| Accuracy | ~0.003% error (Cloudflare measurement across 400M requests) |
| Burst behavior | Smoothed—no boundary doubling |
**Why this algorithm:** The sliding window counter is the best trade-off for a distributed rate limiter. It eliminates the boundary burst problem of fixed windows, uses constant memory (unlike sliding window log), and Cloudflare's production measurement of 0.003% error across 400M requests validates its accuracy at scale.
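A worked example of the weighted formula, followed by an in-memory sketch of the check for a single key. The names are illustrative and the numbers in the worked example are made up for the arithmetic.

```python
def estimated_count(prev_count, current_count, window_size, elapsed):
    """Weighted count: the previous window's share shrinks linearly with elapsed time."""
    weight = (window_size - elapsed) / window_size
    return prev_count * weight + current_count


class SlidingWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # window_start -> count (Redis keys with TTL in practice)

    def allow(self, now):
        current_start = int(now // self.window) * self.window
        previous_start = current_start - self.window
        estimate = estimated_count(
            self.counts.get(previous_start, 0),
            self.counts.get(current_start, 0),
            self.window,
            now - current_start,
        )
        if estimate >= self.limit:
            return False
        self.counts[current_start] = self.counts.get(current_start, 0) + 1
        return True


# 15 s into a 60 s window: 80 * 0.75 + 30 = 90
worked_example = estimated_count(prev_count=80, current_count=30, window_size=60, elapsed=15)

counter = SlidingWindowCounter(limit=100, window_seconds=60)
end_of_window = sum(counter.allow(59.0) for _ in range(150))
just_after_boundary = sum(counter.allow(60.5) for _ in range(150))
print(worked_example, end_of_window, just_after_boundary)
```

Just after the boundary the previous window still carries almost full weight, so only one more request is admitted instead of another hundred.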
### Algorithm Comparison
| Algorithm | Memory per key | Accuracy | Burst handling | Distributed complexity | Best for |
| -------------------------- | --------------------- | ------------------- | ------------------ | ---------------------- | ------------------------- |
| Token bucket | O(1) — 2 values | Exact (single node) | Configurable burst | Lua script needed | Burst-tolerant APIs |
| Leaky bucket | O(1) — 2 values | Exact | No bursts (smooth) | Lua script needed | Traffic shaping |
| Fixed window | O(1) — 1 counter | Boundary artifacts | 2x at boundary | Atomic INCR | Simple, high-scale |
| Sliding window log | O(n) — all timestamps | Exact | Exact | Sorted set ops | Low-volume, exact billing |
| **Sliding window counter** | **O(1) — 2 counters** | **~99.997%** | **Smoothed** | **INCR (Lua for atomic check-and-increment)** | **General-purpose** |
## High-Level Design
### Component Overview

### Request Flow

**When rate limited:**

### Rate Limit Service (Decision Engine)
The core component. Stateless—all state lives in Redis and configuration.
**Decision flow per request:**
1. Receive descriptors from gateway (domain, API key, endpoint, IP)
2. Match descriptors against rule configuration
3. For each matching rule, execute sliding window counter check in Redis
4. Return the most restrictive result (if any rule says deny, deny)
5. Emit metrics (allowed/denied, remaining quota, latency)
**Design decisions:**
| Decision | Choice | Rationale |
| --------------- | --------------------------------------------------- | ------------------------------------------------------------------------- |
| Protocol | gRPC (Envoy `rls.proto`) | Low overhead, schema-enforced, streaming support, ecosystem compatibility |
| Statefulness | Stateless (counters in Redis, rules in config) | Horizontal scaling without coordination between instances |
| Rule evaluation | All matching rules evaluated, most restrictive wins | Prevents circumventing a per-endpoint limit via a generous per-user limit |
| Failure mode | Fail-open (return ALLOW on any error) | Stripe's approach: catch exceptions at all levels so errors fail open |
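The "most restrictive wins" rule and the fail-open choice can be sketched together. Everything here is hypothetical scaffolding: `RuleResult`, `ALLOW_UNKNOWN`, and the `-1` sentinel are assumptions of this sketch, and `run_checks` stands in for the per-rule Redis checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RuleResult:
    rule: str
    allowed: bool
    remaining: int
    reset_after_s: int


# Returned when no rule matches or the check itself fails (fail-open).
# remaining=-1 is a sentinel for "unknown" in this sketch.
ALLOW_UNKNOWN = RuleResult(rule="(none)", allowed=True, remaining=-1, reset_after_s=0)


def most_restrictive(results: List[RuleResult]) -> RuleResult:
    if not results:
        return ALLOW_UNKNOWN
    denied = [r for r in results if not r.allowed]
    if denied:
        # Any deny wins; report the denial with the longest wait.
        return max(denied, key=lambda r: r.reset_after_s)
    # All rules allow: surface the tightest remaining quota.
    return min(results, key=lambda r: r.remaining)


def decide(run_checks: Callable[[], List[RuleResult]]) -> RuleResult:
    try:
        return most_restrictive(run_checks())
    except Exception:
        # Fail-open: a limiter or Redis error must not block the API request.
        return ALLOW_UNKNOWN


per_key = RuleResult("api_key", allowed=True, remaining=47, reset_after_s=42)
per_endpoint = RuleResult("endpoint", allowed=False, remaining=0, reset_after_s=42)
print(decide(lambda: [per_key, per_endpoint]))
```

A production service would also count fail-open events, since a silent stream of them means rate limiting is effectively off.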
### Rule Configuration
Rules are defined in YAML, loaded at startup, and hot-reloaded on change. This follows Envoy's `ratelimit` service pattern.
```yaml title="rate-limit-rules.yaml" collapse={1-2}
# Rate limit configuration
# Domain groups related rules; descriptors match request attributes
domain: api_platform
descriptors:
  # Per-API-key rate limit (tiered)
  - key: api_key
    rate_limit:
      unit: minute
      requests_per_unit: 100 # Default (free tier)
    descriptors:
      # Per-endpoint within API key
      - key: endpoint
        value: "POST /api/v1/orders"
        rate_limit:
          unit: minute
          requests_per_unit: 20 # Write-heavy endpoint, lower limit
  # Per-IP rate limit (unauthenticated)
  - key: remote_address
    rate_limit:
      unit: minute
      requests_per_unit: 30
  # Global aggregate limit (protect backend)
  - key: global
    value: "aggregate"
    rate_limit:
      unit: second
      requests_per_unit: 10000
```
**Shadow mode:** Rules can be deployed in shadow mode where the check executes (updating counters, emitting metrics) but always returns ALLOW. This enables safe rollout—observe the impact of new rules before enforcement.
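A minimal sketch of how a shadow-mode rule differs from an enforced one at decision time. The function name, the `shadow_rules` set, and the metric labels are all hypothetical; counter updates in Redis happen before this step and are not shown.

```python
from collections import Counter


def apply_rule(rule_name, over_limit, shadow_rules, metrics):
    """Return True if the request may proceed under this rule."""
    if not over_limit:
        metrics[(rule_name, "ok")] += 1
        return True
    if rule_name in shadow_rules:
        # Shadow mode: record what would have been blocked, but allow.
        metrics[(rule_name, "shadow_over_limit")] += 1
        return True
    metrics[(rule_name, "over_limit")] += 1
    return False


metrics = Counter()
shadow_rules = {"new_ip_limit"}

shadow_outcome = apply_rule("new_ip_limit", True, shadow_rules, metrics)
enforced_outcome = apply_rule("orders_write_limit", True, shadow_rules, metrics)
print(shadow_outcome, enforced_outcome, dict(metrics))
```

Comparing the `shadow_over_limit` count against real traffic shows how many requests a rule would reject before it is switched to enforcement.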
## API Design
### Rate Limit Check (Internal gRPC)
**Service:** `envoy.service.ratelimit.v3.RateLimitService`
**RPC:** `ShouldRateLimit`
**Request** (the response returns one status per descriptor, in the same order):
```json
{
  "domain": "api_platform",
  "descriptors": [
    {
      "entries": [{ "key": "api_key", "value": "abc123" }]
    },
    {
      "entries": [
        { "key": "api_key", "value": "abc123" },
        { "key": "endpoint", "value": "POST /api/v1/orders" }
      ]
    }
  ],
  "hits_addend": 1
}
```
**Response:**
```json
{
  "overall_code": "OVER_LIMIT",
  "statuses": [
    {
      "code": "OK",
      "current_limit": {
        "requests_per_unit": 100,
        "unit": "MINUTE"
      },
      "limit_remaining": 47,
      "duration_until_reset": "42s"
    },
    {
      "code": "OVER_LIMIT",
      "current_limit": {
        "requests_per_unit": 20,
        "unit": "MINUTE"
      },
      "limit_remaining": 0,
      "duration_until_reset": "42s"
    }
  ]
}
```
### Rate Limit Response Headers
The gateway translates the rate limit decision into HTTP headers. Two conventions coexist:
**Legacy (widely adopted):**
```http
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 47
X-RateLimit-Reset: 1706000060
```
**IETF draft (draft-ietf-httpapi-ratelimit-headers):**
```http
RateLimit-Policy: "default";q=100;w=60, "writes";q=20;w=60
RateLimit: "writes";r=0;t=42
Retry-After: 42
```
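The IETF `RateLimit-Policy` and `RateLimit` fields use Structured Fields (RFC 8941, since obsoleted by RFC 9651). `RateLimit-Policy` carries `q` and `w`; `RateLimit` carries `r` and `t`: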
- `q`: quota allocated (requests per window)
- `w`: window size in seconds
- `r`: remaining requests
- `t`: seconds until reset
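A sketch of how a gateway might render both header styles from one decision. The function name and its parameters are hypothetical, and the draft syntax may change before publication; the output mirrors the examples above.

```python
def rate_limit_headers(policy_name, quota, window_s, remaining,
                       reset_after_s, allowed, now_unix):
    """Build legacy and IETF-draft rate limit headers for one policy."""
    headers = {
        # IETF draft fields (Structured Fields syntax)
        "RateLimit-Policy": f'"{policy_name}";q={quota};w={window_s}',
        "RateLimit": f'"{policy_name}";r={remaining};t={reset_after_s}',
        # Legacy de facto fields; Reset is an absolute Unix timestamp here
        "X-RateLimit-Limit": str(quota),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(now_unix + reset_after_s),
    }
    if not allowed:
        # Retry-After only makes sense on a rejected request.
        headers["Retry-After"] = str(reset_after_s)
    return headers


denied = rate_limit_headers("writes", quota=20, window_s=60, remaining=0,
                            reset_after_s=42, allowed=False, now_unix=1706000018)
for name, value in denied.items():
    print(f"{name}: {value}")
```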
**429 response (RFC 6585):**
```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 42
RateLimit-Policy: "writes";q=20;w=60
RateLimit: "writes";r=0;t=42

{
  "error": {
    "code": "RATE_LIMITED",
    "message": "Rate limit exceeded for POST /api/v1/orders. Retry after 42 seconds.",
    "retry_after": 42,
    "limit": 20,
    "window": "1m",
    "scope": "api_key:abc123:endpoint:POST /api/v1/orders"
  }
}
```
### Quota Management API (Admin)
**Create/Update Quota Override:**
**Endpoint:** `PUT /admin/v1/quotas/{api_key}`
```json
{
  "tier": "enterprise",
  "overrides": [
    {
      "descriptor": "endpoint:POST /api/v1/orders",
      "requests_per_unit": 500,
      "unit": "minute"
    }
  ],
  "effective_at": "2025-02-01T00:00:00Z",
  "expires_at": null
}
```
**Response (200 OK):**
```json
{
  "api_key": "abc123",
  "tier": "enterprise",
  "quotas": {
    "default": { "requests_per_unit": 5000, "unit": "minute" },
    "overrides": [
      {
        "descriptor": "endpoint:POST /api/v1/orders",
        "requests_per_unit": 500,
        "unit": "minute"
      }
    ]
  },
  "updated_at": "2025-01-15T10:30:00Z"
}
```
**Get Current Usage:**
**Endpoint:** `GET /admin/v1/usage/{api_key}`
```json
{
  "api_key": "abc123",
  "current_window": {
    "start": "2025-01-15T10:30:00Z",
    "end": "2025-01-15T10:31:00Z"
  },
  "usage": {
    "default": { "used": 53, "limit": 5000, "remaining": 4947 },
    "endpoint:POST /api/v1/orders": { "used": 18, "limit": 500, "remaining": 482 }
  }
}
```
## Data Modeling
### Redis Key Structure
**Sliding window counter keys:**
```
ratelimit:{domain}:{descriptor_hash}:{window_id}
```
**Example:**
```
ratelimit:api_platform:api_key=abc123:endpoint=POST_orders:1706000040
ratelimit:api_platform:api_key=abc123:endpoint=POST_orders:1705999980
```
- `1706000040` = Unix timestamp truncated to the current 60-second window start
- `1705999980` = previous window start (current start minus 60 seconds)
- Key TTL = 2 × window_size (auto-cleanup of expired windows)
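The window IDs follow from integer division of the current timestamp. A small sketch for a 60-second window; the function name is hypothetical and the descriptor string is passed in unhashed for readability.

```python
def window_keys(domain, descriptor, now_unix, window_s):
    """Return (current_key, previous_key, ttl_seconds) for a sliding window counter."""
    current_start = (now_unix // window_s) * window_s  # truncate to window start
    previous_start = current_start - window_s
    prefix = f"ratelimit:{domain}:{descriptor}"
    return (
        f"{prefix}:{current_start}",
        f"{prefix}:{previous_start}",
        2 * window_s,  # keep the previous window alive while it is still weighted
    )


current_key, previous_key, ttl = window_keys(
    "api_platform", "api_key=abc123:endpoint=POST_orders",
    now_unix=1706000058, window_s=60,
)
print(current_key)
print(previous_key)
print(ttl)
```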
**Sliding window counter with sub-windows (Figma variant):**
```
ratelimit:api_platform:api_key=abc123:min:28433334
ratelimit:api_platform:api_key=abc123:min:28433335
```
Where `28433334` = Unix minutes (`timestamp / 60`). For a 1-hour limit, sum the last 60 minute-buckets.
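For the sub-window variant, the keys to sum are the current minute bucket and the 59 before it. A sketch with hypothetical helper names; a plain dictionary stands in for the values a Redis `MGET` would return.

```python
def minute_bucket_keys(domain, descriptor, now_unix, buckets=60):
    """Return the keys of the last `buckets` per-minute counters, oldest first."""
    current_minute = now_unix // 60
    prefix = f"ratelimit:{domain}:{descriptor}:min"
    first_minute = current_minute - buckets + 1
    return [f"{prefix}:{minute}" for minute in range(first_minute, current_minute + 1)]


def hourly_count(counts_by_key, keys):
    """Sum the counters that exist; missing buckets count as zero."""
    return sum(counts_by_key.get(key, 0) for key in keys)


keys = minute_bucket_keys("api_platform", "api_key=abc123", now_unix=1706000058)
stored = {keys[-1]: 7, keys[-2]: 12}  # only two minutes saw traffic
print(len(keys), keys[-1], hourly_count(stored, keys))
```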
### Quota Overrides Schema
**Primary store:** PostgreSQL (ACID for quota management, admin queries)
```sql title="schema.sql" collapse={1-2}
-- Tenant quota overrides
-- Base tier limits are in YAML config; overrides are per-tenant exceptions
CREATE TABLE quota_overrides (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    api_key VARCHAR(64) NOT NULL,
    tier VARCHAR(20) NOT NULL DEFAULT 'free'
        CHECK (tier IN ('free', 'pro', 'enterprise', 'custom')),
    descriptor VARCHAR(256) NOT NULL, -- e.g., "endpoint:POST /api/v1/orders"
    requests_per_unit INT NOT NULL,
    unit VARCHAR(10) NOT NULL
        CHECK (unit IN ('second', 'minute', 'hour', 'day')),
    effective_at TIMESTAMPTZ DEFAULT NOW(),
    expires_at TIMESTAMPTZ, -- NULL = no expiration
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(api_key, descriptor)
);
-- Fast lookup by API key. Expiry is filtered at query time: a partial-index
-- predicate cannot call NOW() because it is STABLE, not IMMUTABLE.
CREATE INDEX idx_quota_api_key ON quota_overrides(api_key, expires_at);
-- Tier-level queries (admin dashboard)
CREATE INDEX idx_quota_tier ON quota_overrides(tier);
```
### Data Store Selection
| Data Type | Store | Rationale |
| ------------------- | ------------------------------ | ------------------------------------------------------- |
| Rate limit counters | Redis Cluster | Atomic increments, sub-ms latency, TTL-based expiration |
| Rule configuration | YAML files (mounted volume) | Version-controlled, no DB dependency, hot-reloadable |
| Quota overrides | PostgreSQL | ACID for billing-critical quota changes, admin queries |
| Quota cache | In-memory (rate limit service) | Avoids DB lookup per request; refresh every 30s |
| Metrics | Prometheus TSDB | Time-series optimized, Grafana integration |
| Audit log | Append-only log (Kafka → S3) | Compliance, debugging quota disputes |
## Low-Level Design
### Sliding Window Counter in Redis (Lua Script)
The core algorithm, implemented as a Lua script for atomicity. Redis executes Lua scripts in a single-threaded context—no other commands interleave during execution.
```lua title="sliding_window.lua" collapse={1-3, 28-30}
-- Sliding window counter rate limiter
-- Atomic: no race conditions between check and increment
-- Returns: {allowed (0/1), remaining, reset_at_unix}
local key_prefix = KEYS[1] -- e.g., "ratelimit:api_platform:api_key=abc123"
local limit = tonumber(ARGV[1]) -- e.g., 100
local window = tonumber(ARGV[2]) -- e.g., 60 (seconds)
local now = tonumber(ARGV[3]) -- Current Unix timestamp
-- Compute window boundaries
local current_window = math.floor(now / window) * window
local previous_window = current_window - window
local elapsed = now - current_window
-- Keys for current and previous windows
local current_key = key_prefix .. ":" .. current_window
local previous_key = key_prefix .. ":" .. previous_window
-- Get counts (returns 0 if key doesn't exist)
local previous_count = tonumber(redis.call("GET", previous_key) or 0)
local current_count = tonumber(redis.call("GET", current_key) or 0)
-- Weighted count: previous window's contribution decreases linearly
local weight = (window - elapsed) / window
local estimated_count = previous_count * weight + current_count
if estimated_count >= limit then
-- Over limit: return deny with remaining=0
local reset_at = current_window + window
return {0, 0, reset_at}
end
-- Under limit: increment current window and allow
redis.call("INCR", current_key)
redis.call("EXPIRE", current_key, window * 2) -- TTL = 2x window for overlap
local remaining = math.max(0, math.floor(limit - estimated_count - 1))
local reset_at = current_window + window
return {1, remaining, reset_at}
```
**Why Lua over Redis transactions (MULTI/EXEC):** The sliding window check requires reading the previous window's count to compute the weighted estimate, then conditionally incrementing the current window. A plain MULTI/EXEC transaction cannot branch on a read result, because commands are queued and executed as a batch. The `WATCH`-based optimistic alternative works, but it adds round trips and retries under contention. Lua scripts can read, compute, and write in a single atomic operation.
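**Redis Cluster consideration:** Both `current_key` and `previous_key` must hash to the same slot. Use Redis hash tags: `{api_key=abc123}:1706000040` and `{api_key=abc123}:1706000000`. The `{...}` portion determines the hash slot, ensuring both keys land on the same shard. Redis also expects every key a script accesses to be passed through `KEYS`, so in cluster mode it is safer to pass both window keys explicitly than to derive them from a prefix inside the script.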
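To make the weighted estimate concrete, here is a small sketch that mirrors the arithmetic of the Lua script for a single check, with no Redis involved. The function name `slidingWindowDecision` and the sample numbers are illustrative assumptions.

```typescript
interface WindowDecision {
  allowed: boolean
  estimated: number
  remaining: number
  resetAt: number
}

// Same arithmetic as the Lua script, applied to counts supplied by the caller.
function slidingWindowDecision(
  previousCount: number,
  currentCount: number,
  limit: number,
  windowSeconds: number,
  nowSeconds: number,
): WindowDecision {
  const currentWindow = Math.floor(nowSeconds / windowSeconds) * windowSeconds
  const elapsed = nowSeconds - currentWindow
  // The previous window's contribution shrinks linearly as the current window progresses
  const weight = (windowSeconds - elapsed) / windowSeconds
  const estimated = previousCount * weight + currentCount
  const resetAt = currentWindow + windowSeconds
  if (estimated >= limit) {
    return { allowed: false, estimated, remaining: 0, resetAt }
  }
  const remaining = Math.max(0, Math.floor(limit - estimated - 1))
  return { allowed: true, estimated, remaining, resetAt }
}

// 15 seconds into a 60-second window: the previous window contributes 75% of its 80 requests.
const decision = slidingWindowDecision(80, 30, 100, 60, 1706000055)
```

With these inputs the estimate is 80 × 0.75 + 30 = 90, so the request is allowed and 9 requests remain. Raising the current count to 40 brings the estimate to exactly 100, which the `>=` comparison rejects.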
### Token Bucket in Redis (Lua Script)
For APIs requiring configurable burst allowance, a token bucket variant:
```lua title="token_bucket.lua" collapse={1-3, 25-27}
-- Token bucket rate limiter with lazy refill
-- Parameters: bucket_size (burst), refill_rate (tokens/sec)
-- Returns: {allowed (0/1), remaining_tokens, retry_after_ms}
local key = KEYS[1]
local bucket_size = tonumber(ARGV[1]) -- Max burst
local refill_rate = tonumber(ARGV[2]) -- Tokens per second
local now = tonumber(ARGV[3]) -- Current time in milliseconds
-- Get current state
local state = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(state[1])
local last_refill = tonumber(state[2])
if tokens == nil then
-- First request: initialize full bucket
tokens = bucket_size
last_refill = now
end
-- Lazy refill: compute tokens earned since last check
local elapsed_ms = now - last_refill
local new_tokens = elapsed_ms * refill_rate / 1000
tokens = math.min(bucket_size, tokens + new_tokens)
if tokens < 1 then
-- Empty bucket: compute retry delay
local deficit = 1 - tokens
local retry_after_ms = math.ceil(deficit / refill_rate * 1000)
return {0, 0, retry_after_ms}
end
-- Consume one token
tokens = tokens - 1
redis.call("HSET", key, "tokens", tokens, "last_refill", now) -- HMSET is deprecated since Redis 4.0
redis.call("PEXPIRE", key, math.ceil(bucket_size / refill_rate * 1000) + 1000)
return {1, math.floor(tokens), 0}
```
### Multi-Tier Quota Resolution
When a request arrives, multiple rules may match. Resolution follows a hierarchical evaluation:

**Rule precedence:** All matching rules are evaluated. The most restrictive (lowest remaining quota) determines the response. This prevents a user from bypassing a 20 req/min write limit by pointing to their 5000 req/min general limit.
**Tier resolution order:**
1. Check for per-key quota override in cache (PostgreSQL-backed)
2. Fall back to tier-level defaults from YAML config
3. Apply per-endpoint overrides if they exist
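The "most restrictive rule wins" precedence described above can be sketched as a reduction over per-rule results. The `RuleResult` shape and the function name `mostRestrictive` are assumptions made for illustration, not part of the original design.

```typescript
interface RuleResult {
  rule: string
  allowed: boolean
  remaining: number
  resetAt: number
}

// Combine per-rule results: a denying rule always beats an allowing one;
// among rules with the same verdict, the lowest remaining quota wins.
function mostRestrictive(results: RuleResult[]): RuleResult {
  if (results.length === 0) {
    throw new Error("at least one rule result is required")
  }
  return results.reduce((chosen, candidate) => {
    const stricter =
      (!candidate.allowed && chosen.allowed) ||
      (candidate.allowed === chosen.allowed && candidate.remaining < chosen.remaining)
    return stricter ? candidate : chosen
  })
}

// A healthy 5000 req/min general limit does not rescue an exhausted 20 req/min write limit.
const verdict = mostRestrictive([
  { rule: "general", allowed: true, remaining: 4200, resetAt: 1706000100 },
  { rule: "writes", allowed: false, remaining: 0, resetAt: 1706000100 },
])
```

The chosen result is also the natural source for the response headers, since it reflects the limit the client will hit first.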
### Concurrent Request Limiting
Stripe's approach: in addition to rate-based limits, cap the number of simultaneously in-flight requests per API key. This catches pathological patterns (slow endpoints consuming all connection pool slots) that rate limiting alone misses.
**Implementation:** Use a Redis sorted set where the score is the request start timestamp and the member is a unique request ID. On request start, `ZADD`. On request completion (or timeout), `ZREM`. Check the set cardinality against the concurrent limit.
```lua title="concurrent_limiter.lua"
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local request_id = ARGV[2]
local now = tonumber(ARGV[3])
local timeout = tonumber(ARGV[4]) -- e.g., 60 seconds
-- Remove stale entries (requests that didn't clean up)
redis.call("ZREMRANGEBYSCORE", key, 0, now - timeout)
-- Check current in-flight count
local current = redis.call("ZCARD", key)
if current >= limit then
return {0, 0}
end
-- Register this request
redis.call("ZADD", key, now, request_id)
redis.call("EXPIRE", key, timeout)
return {1, limit - current - 1}
```
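The script only covers the acquire side. The sketch below models the same steps in memory to show the full acquire/release lifecycle, including the release in a `finally` block. `InFlightSet` and `runLimited` are hypothetical names, and a plain object stands in for the Redis sorted set.

```typescript
// In-memory stand-in for the sorted set: member = request ID, score = start time (seconds).
class InFlightSet {
  private entries: Record<string, number> = {}
  private limit: number
  private timeoutSeconds: number

  constructor(limit: number, timeoutSeconds: number) {
    this.limit = limit
    this.timeoutSeconds = timeoutSeconds
  }

  // Same steps as the Lua script: evict stale entries, check cardinality, register.
  acquire(requestId: string, now: number): boolean {
    for (const id of Object.keys(this.entries)) {
      if (this.entries[id] <= now - this.timeoutSeconds) delete this.entries[id]
    }
    if (Object.keys(this.entries).length >= this.limit) return false
    this.entries[requestId] = now
    return true
  }

  // Equivalent of ZREM on request completion.
  release(requestId: string): void {
    delete this.entries[requestId]
  }
}

// Returns null when the concurrency limit is hit (the caller would respond with 429).
function runLimited<T>(limiter: InFlightSet, requestId: string, now: number, work: () => T): T | null {
  if (!limiter.acquire(requestId, now)) return null
  try {
    return work()
  } finally {
    limiter.release(requestId) // always release, even if the handler throws
  }
}
```

The stale-entry eviction is what keeps a crashed worker from holding a slot forever: its entry simply ages out after the timeout.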
### Failure Handling and Graceful Degradation
**What happens when Redis is down?**
The rate limiter is on the critical path of every API request. A Redis failure must not become an API outage.
**Strategy: Fail-open with circuit breaker**

**Stripe's fail-open principle:** Catch exceptions at all levels so that any coding or operational errors fail open. Feature flags enable rapid disabling of individual limiters. Clear HTTP status codes distinguish rate limiting (429) from load shedding (503).
**Degradation hierarchy:**
1. **Redis cluster partial failure** (1 shard down): Rate limiting continues on other shards. Keys hashed to the down shard fail open.
2. **Redis cluster full failure:** Circuit breaker opens. All requests are allowed. Alert fires.
3. **Rate limit service crash:** API gateway skips rate check (configurable behavior). Alert fires.
4. **Configuration load failure:** Service continues with last-known-good config. Alert fires.
**Monitoring during degradation:**
- `ratelimit_redis_errors_total` — triggers circuit breaker
- `ratelimit_failopen_total` — tracks how many requests bypassed rate limiting
- `ratelimit_circuit_state` — current circuit breaker state per shard
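A minimal sketch of the fail-open circuit breaker, assuming a consecutive-failure threshold and a fixed cooldown before a probe. The class name `FailOpenBreaker`, the threshold semantics, and the synchronous `limiterCheck` callback are illustrative assumptions; a real integration would wrap an asynchronous Redis call.

```typescript
type CircuitState = "closed" | "open" | "half_open"

class FailOpenBreaker {
  private state: CircuitState = "closed"
  private failures = 0
  private openedAtMs = 0
  private failureThreshold: number
  private cooldownMs: number

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold
    this.cooldownMs = cooldownMs
  }

  // Runs the rate limit check; an open circuit or any error results in "allow".
  check(limiterCheck: () => boolean, nowMs: number): boolean {
    let probing = false
    if (this.state === "open") {
      if (nowMs - this.openedAtMs < this.cooldownMs) return true // fail open, skip the store
      this.state = "half_open" // cooldown elapsed: let one probe through
      probing = true
    }
    try {
      const allowed = limiterCheck()
      this.state = "closed"
      this.failures = 0
      return allowed
    } catch {
      this.failures += 1
      if (probing || this.failures >= this.failureThreshold) {
        this.state = "open"
        this.openedAtMs = nowMs
      }
      return true // fail open on any error
    }
  }

  currentState(): CircuitState {
    return this.state
  }
}
```

Each `return true` on the failure paths is where a counter such as `ratelimit_failopen_total` would be incremented, so that bypassed traffic stays visible during an outage.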
## Frontend Considerations
### Rate Limit Dashboard
**Problem:** Operations teams need real-time visibility into rate limit decisions, top offenders, and quota utilization across tenants.
**Architecture:**
- Prometheus scrapes metrics from all rate limit service instances
- Grafana dashboards show aggregate decisions/sec, top-10 rate-limited keys, quota utilization by tier
- AlertManager fires on anomalies (sudden spike in 429s, circuit breaker open)
**Key dashboard panels:**
| Panel | Data Source | Purpose |
| ----------------------------- | ---------------------------------- | ----------------- |
| Decisions/sec (allow vs deny) | `ratelimit_decisions_total` | Traffic health |
| Top 10 rate-limited keys | `ratelimit_denied_total` by key | Identify abusers |
| Quota utilization by tier | `ratelimit_remaining` | Capacity planning |
| p99 decision latency | `ratelimit_check_duration_seconds` | SLA monitoring |
| Circuit breaker state | `ratelimit_circuit_state` | Failure detection |
### Client-Side Rate Limit Handling
For API consumers building integrations, the rate limit headers enable client-side backoff:
**Recommended client pattern:**
```typescript title="rate-limit-client.ts" collapse={1-3, 25-35}
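// Client-side rate limit handler
// Reads standard headers and implements exponential backoff
interface RateLimitInfo {
  limit: number
  remaining: number
  resetAt: Date
  retryAfter: number | null
}

function parseRateLimitHeaders(headers: Headers): RateLimitInfo {
  // Assumes Retry-After uses the delay-seconds form; an HTTP-date value is treated as absent
  const retryAfterSeconds = parseInt(headers.get("Retry-After") ?? "", 10)
  return {
    limit: parseInt(headers.get("X-RateLimit-Limit") ?? "0", 10),
    remaining: parseInt(headers.get("X-RateLimit-Remaining") ?? "0", 10),
    resetAt: new Date(parseInt(headers.get("X-RateLimit-Reset") ?? "0", 10) * 1000),
    retryAfter: Number.isNaN(retryAfterSeconds) ? null : retryAfterSeconds,
  }
}

const sleep = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms))

async function fetchWithRateLimit(
  url: string,
  options?: RequestInit,
  retryCount = 0,
  maxRetries = 5,
): Promise<Response> {
  const response = await fetch(url, options)
  if (response.status === 429 && retryCount < maxRetries) {
    const info = parseRateLimitHeaders(response.headers)
    // Prefer the server's Retry-After; otherwise back off exponentially, capped at 60s
    const delay = info.retryAfter !== null ? info.retryAfter * 1000 : Math.min(60000, 1000 * Math.pow(2, retryCount))
    await sleep(delay)
    return fetchWithRateLimit(url, options, retryCount + 1, maxRetries) // Retry
  }
  return response // Success, non-429 error, or retries exhausted
}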
```
**Pre-emptive throttling:** Clients can read `X-RateLimit-Remaining` and proactively slow down before hitting the limit, avoiding 429s entirely. This is particularly valuable for batch operations.
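One way to pace requests pre-emptively is to spread the remaining quota evenly over the time left in the window. The helper below is a sketch under that assumption; `pacingDelayMs` is a hypothetical name, and its inputs correspond to `X-RateLimit-Remaining` and `X-RateLimit-Reset` (converted to milliseconds).

```typescript
// Hypothetical pacing helper: how long to wait before the next request
// so that the remaining quota lasts until the window resets.
function pacingDelayMs(remaining: number, resetAtMs: number, nowMs: number): number {
  const msUntilReset = Math.max(0, resetAtMs - nowMs)
  if (remaining <= 0) return msUntilReset // out of quota: wait for the window to reset
  return Math.floor(msUntilReset / remaining)
}

// 10 requests left with 30 seconds until reset: send one request every 3 seconds.
const batchDelay = pacingDelayMs(10, 30000, 0)
```

A batch job would call this after each response and sleep for the returned duration, which keeps throughput close to the limit without triggering 429s.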
## Infrastructure Design
### Cloud-Agnostic Architecture
#### Counting Store
**Concept:** Low-latency key-value store for atomic counter operations
**Requirements:**
- Sub-millisecond reads and writes
- Atomic increment operations
- TTL-based key expiration
- Lua/scripting support for complex atomic operations
- Cluster mode for horizontal scaling
**Options:**
- Redis Cluster — Rich scripting, sorted sets, hash types. Industry standard for rate limiting.
- KeyDB — Redis-compatible, multi-threaded. Drop-in replacement with better multi-core utilization.
- DragonflyDB — Redis-compatible, vertically scaled. Higher single-node throughput than Redis.
- Memcached — Simpler, atomic increment, but no scripting or sorted sets.
#### Service Framework
**Concept:** Stateless gRPC service for rate limit decisions
**Options:**
- Go (Envoy's reference implementation) — Low latency, efficient concurrency, gRPC-native
- Rust — Lowest latency, highest throughput, steeper learning curve
- Java/Kotlin — Familiar ecosystem, slightly higher latency from GC
#### Configuration Management
**Concept:** Declarative rule storage with hot reload
**Options:**
- Filesystem (YAML/JSON) + file watcher — Simplest, version-controlled via Git
- etcd / Consul — Distributed consensus, immediate propagation
- PostgreSQL — Relational queries for complex rule management
### AWS Reference Architecture
#### Compute
| Component | Service | Configuration |
| -------------------- | ------------------ | ------------------------------------------- |
| Rate limit service | ECS Fargate | Auto-scaling 3-20 tasks, 1 vCPU / 2 GB each |
| API Gateway nodes | ECS Fargate / EKS | Co-located with application services |
| Quota management API | ECS Fargate | 2-4 tasks (low traffic admin API) |
| Config sync | Lambda (scheduled) | Pull config from S3, validate, deploy |
#### Data Stores
| Data | Service | Rationale |
| ------------------- | ------------------------- | --------------------------------------------------- |
| Rate limit counters | ElastiCache Redis Cluster | Sub-ms latency, cluster mode, Multi-AZ |
| Quota overrides | RDS PostgreSQL (Multi-AZ) | ACID, managed backups, admin queries |
| Rule configuration | S3 + EFS mount | Version-controlled, mounted into service containers |
| Metrics | Amazon Managed Prometheus | Managed TSDB, Grafana integration |
| Audit logs | Kinesis → S3 | Streaming ingest, long-term archival |
#### Self-Hosted Alternatives
| Managed Service | Self-Hosted Option | When to self-host |
| ------------------ | -------------------- | -------------------------------------------- |
| ElastiCache Redis | Redis Cluster on EC2 | Cost at scale, specific modules (redis-cell) |
| RDS PostgreSQL | PostgreSQL on EC2 | Specific extensions, cost optimization |
| ECS Fargate | Kubernetes (EKS/k3s) | Existing K8s infrastructure, multi-cloud |
| Managed Prometheus | Victoria Metrics | Cost at high cardinality, long retention |
### Production Deployment

## Variations
### Load Shedding (Beyond Rate Limiting)
Stripe operates four layers of traffic management, each progressively more aggressive:
1. **Request rate limiter** — Standard per-user rate limiting (token bucket). Handles millions of rejections monthly.
2. **Concurrent request limiter** — Caps in-flight requests to 20 per user. Catches retry storms from slow endpoints.
3. **Fleet usage load shedder** — Reserves capacity for critical operations (e.g., charge creation) by shedding non-critical traffic (e.g., listing charges) during system strain. Returns 503, not 429.
4. **Worker utilization load shedder** — Last resort. When individual workers are overloaded, shed traffic by priority: test mode first, then GETs, then POSTs, then critical methods. Rarely triggered (~100 rejections/month).
The distinction matters: rate limiting (429) protects per-user fairness. Load shedding (503) protects system survival.
### Edge Rate Limiting (Cloudflare Model)
Cloudflare's architecture avoids centralized coordination entirely:
- **Anycast routing** ensures a single IP's traffic reaches the same PoP consistently
- **Per-PoP counters** using Twemproxy + memcache clusters within each PoP
- **Sliding window counter** with asynchronous increments (counting doesn't block the request)
- **Mitigation bit propagation:** Once a source exceeds the threshold, a flag is set. All servers in the PoP check this flag—no further counter queries needed during active mitigation.
**Result:** 0.003% error rate across 400M requests, handling attacks up to 400K RPS per domain.
### GCRA (Generic Cell Rate Algorithm)
A mathematically elegant alternative used by the `redis-cell` Redis module. Instead of tracking counters, GCRA tracks a single value: the TAT (Theoretical Arrival Time)—the earliest time the next request should arrive.
**How it differs from token bucket:** Functionally equivalent, but stores only one value per key (TAT) instead of two (tokens + last_refill). The `redis-cell` module implements GCRA as a native Redis command (`CL.THROTTLE`) with atomic semantics, eliminating the need for Lua scripts.
**Trade-off:** Requires installing a Redis module, which may not be available in managed Redis services (ElastiCache supports limited module sets).
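The single-value bookkeeping is easier to see in code. The following is a simplified sketch of the GCRA decision, not the `redis-cell` implementation; the function name `gcraCheck`, the millisecond units, and the burst parameterization are assumptions.

```typescript
interface GcraResult {
  allowed: boolean
  retryAfterMs: number
  tat: number // value to persist for the next call
}

// emissionIntervalMs = period / limit (e.g., 1000ms for 1 request per second).
// burst = how many requests may arrive back to back.
function gcraCheck(storedTat: number | null, nowMs: number, emissionIntervalMs: number, burst: number): GcraResult {
  const tat = storedTat === null ? nowMs : Math.max(storedTat, nowMs)
  const newTat = tat + emissionIntervalMs
  const allowAt = newTat - emissionIntervalMs * burst
  if (nowMs < allowAt) {
    // Too early: the stored TAT is left unchanged
    return { allowed: false, retryAfterMs: allowAt - nowMs, tat }
  }
  return { allowed: true, retryAfterMs: 0, tat: newTat }
}

// 1 request per second with a burst of 3, all arriving at t=0.
const g1 = gcraCheck(null, 0, 1000, 3)
const g2 = gcraCheck(g1.tat, 0, 1000, 3)
const g3 = gcraCheck(g2.tat, 0, 1000, 3)
const g4 = gcraCheck(g3.tat, 0, 1000, 3) // fourth back-to-back request is rejected
```

The only state carried between calls is `tat`, which is what allows the algorithm to store one value per key where the token bucket stores two.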
## Conclusion
The rate limiter design centers on three decisions that cascade through the architecture:
1. **Sliding window counter algorithm** provides the best accuracy-to-memory trade-off for distributed rate limiting. Cloudflare's production measurement (0.003% error across 400M requests) validates that the approximation is accurate enough for real-world enforcement, while using O(1) memory per key eliminates the scaling problem of exact approaches like sliding window log.
2. **Centralized Redis with Lua scripts** ensures globally consistent counts across all API gateway nodes without per-node coordination. The Lua script atomically reads the previous window's count, computes the weighted estimate, and conditionally increments—a sequence that cannot be expressed with Redis transactions alone. Hash tags ensure multi-key operations land on the same shard.
3. **Fail-open with circuit breaker** makes the rate limiter self-healing. A Redis outage opens the circuit (allowing all traffic), and probe requests automatically close it when Redis recovers. This follows Stripe's principle: a rate limiter failure must never become an API outage.
**What this design sacrifices:**
- Sliding window counter is an approximation—for exact billing-grade counting, sliding window log is more precise (at O(n) memory cost)
- Centralized Redis adds 1-5ms per request vs. in-process local rate limiting
- Fail-open means short periods of unenforced rate limits during Redis failures
**Future improvements worth considering:**
- Hybrid local + global counting for latency-sensitive paths (Path C)
- Machine learning-based anomaly detection for adaptive rate limits
- Per-tenant Redis clusters for isolation at extreme scale (noisy neighbor elimination)
- Integration with billing systems for real-time quota adjustment based on usage patterns
## Appendix
### Prerequisites
- Distributed systems concepts (consistency models, CAP trade-offs)
- Redis data structures (strings, sorted sets, hashes) and Lua scripting
- HTTP semantics (status codes, headers, caching)
- API gateway patterns (reverse proxy, middleware chains)
### Terminology
| Term | Definition |
| ---------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| Token bucket | Rate limiting algorithm where tokens are added at a fixed rate and consumed per request; allows bursts up to bucket capacity |
| Leaky bucket | Traffic shaping algorithm that processes requests at a constant output rate, smoothing bursts |
| GCRA | Generic Cell Rate Algorithm—mathematically equivalent to token bucket, tracks a single Theoretical Arrival Time (TAT) value |
| Sliding window counter | Approximation algorithm using weighted interpolation between two fixed window counters to eliminate boundary burst artifacts |
| Fail-open | Failure mode where the rate limiter allows all traffic when its backing store is unavailable |
| Circuit breaker | Pattern that detects failures and temporarily bypasses a failing dependency, with automatic recovery probes |
| Descriptor | Key-value pair identifying a rate limit dimension (e.g., `api_key=abc123`, `endpoint=POST /orders`) |
| Shadow mode | Deployment mode where rate limit rules execute (update counters, emit metrics) but always return ALLOW |
| PoP | Point of Presence—a geographic location where edge servers process traffic |
| TAT | Theoretical Arrival Time—the core tracking value in GCRA; represents the earliest acceptable time for the next request |
### Summary
- **Sliding window counter** is the optimal algorithm for distributed rate limiting: O(1) memory, ~99.997% accuracy (Cloudflare-validated), no boundary burst artifacts.
- **Redis Cluster with Lua scripts** provides atomic check-and-increment without race conditions. Hash tags ensure multi-key operations are shard-local.
- **Fail-open with circuit breaker** prevents rate limiter failures from cascading into API outages. Shadow mode enables safe rollout of new rules.
- **Hierarchical rule evaluation** (global → tier → endpoint → concurrent) prevents circumventing narrow limits via broader ones.
- **IETF `RateLimit`/`RateLimit-Policy` headers** (structured fields) are replacing legacy `X-RateLimit-*` headers for standards-based client communication.
- **Stripe's four-layer model** (rate limit → concurrent limit → fleet load shed → worker load shed) distinguishes per-user fairness (429) from system survival (503).
### References
- [RFC 6585 — Additional HTTP Status Codes](https://datatracker.ietf.org/doc/html/rfc6585) — Defines 429 Too Many Requests; Section 4
- [IETF draft-ietf-httpapi-ratelimit-headers](https://ietf-wg-httpapi.github.io/ratelimit-headers/draft-ietf-httpapi-ratelimit-headers.html) — Standardized `RateLimit` and `RateLimit-Policy` header fields
- [ITU-T Recommendation I.371](https://www.itu.int/rec/T-REC-I.371) — GCRA specification for ATM traffic management (1993)
- [Stripe — Scaling your API with rate limiters](https://stripe.com/blog/rate-limiters) — Four-layer approach: request rate, concurrent, fleet load shed, worker load shed
- [Cloudflare — How we built rate limiting capable of scaling to millions of domains](https://blog.cloudflare.com/counting-things-a-lot-of-different-things/) — Sliding window counter at edge, 0.003% error rate
- [Figma — An alternative approach to rate limiting](https://www.figma.com/blog/an-alternative-approach-to-rate-limiting/) — Sliding window counter variant with sub-window buckets (2017)
- [Envoy Proxy — Global rate limiting](https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/other_features/global_rate_limiting) — Architecture overview for centralized rate limiting
- [envoyproxy/ratelimit](https://github.com/envoyproxy/ratelimit) — Go/gRPC reference implementation with Redis backend
- [Brandur — Rate limiting](https://brandur.org/rate-limiting) — GCRA explanation and comparison with token bucket
- [AWS Builders' Library — Fairness in multi-tenant systems](https://aws.amazon.com/builders-library/fairness-in-multi-tenant-systems/) — Quota-based admission control and noisy neighbor prevention
- [GitHub REST API — Rate limits](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api) — Tiered rate limiting: 60/hr unauthenticated, 5000/hr authenticated, secondary limits
- [Token Bucket — Wikipedia](https://en.wikipedia.org/wiki/Token_bucket) — Algorithm overview and formal properties
---
## Design a Cookie Consent Service
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/system-design-problems/design-cookie-consent-service
**Category:** System Design / System Design Problems
**Description:** Building a multi-tenant consent management platform that handles regulatory compliance (GDPR, CCPA, LGPD) at scale. Cookie consent services face unique challenges: read-heavy traffic patterns (every page load queries consent status), strict latency requirements (consent checks block page rendering), regulatory complexity across jurisdictions, and the need to merge anonymous visitor consent with authenticated user profiles. This design covers edge-cached consent delivery, anonymous-to-authenticated identity migration, and a multi-tenant architecture serving thousands of websites.
# Design a Cookie Consent Service
Building a multi-tenant consent management platform that handles regulatory compliance (GDPR, CCPA, LGPD) at scale. Cookie consent services face unique challenges: read-heavy traffic patterns (every page load queries consent status), strict latency requirements (consent checks block page rendering), regulatory complexity across jurisdictions, and the need to merge anonymous visitor consent with authenticated user profiles. This design covers edge-cached consent delivery, anonymous-to-authenticated identity migration, and a multi-tenant architecture serving thousands of websites.

Cookie consent service architecture: Edge-cached SDK for sub-50ms consent checks, read replicas for global distribution, immutable audit log for regulatory compliance, multi-tenant configuration per website.
## Abstract
Cookie consent design balances three competing forces:
1. **Latency vs. compliance** — Consent checks happen on every page load and block tracking scripts. Sub-50ms response times require edge caching, but regulations demand audit trails and user-specific consent records.
2. **Multi-tenancy vs. isolation** — Thousands of websites share infrastructure, but each has unique privacy policies, cookie categories, and regulatory requirements. Tenant configuration must be cacheable yet instantly updatable.
3. **Anonymous vs. authenticated** — Users browse anonymously before logging in, accumulating consent choices tied to device fingerprints. On authentication, these choices must merge with any existing profile consent without data loss or conflicting states.
The mental model: **edge-cached SDK → regional read replica → primary write path → immutable audit log**. Consent reads never touch the primary database; writes go through a single path with guaranteed auditability.
| Design Decision | Tradeoff |
| --------------------------------- | ------------------------------------------------------------------- |
| Edge-cached consent SDK | Sub-50ms reads; stale consent possible for up to 60s |
| Device fingerprint for anonymous | Tracks consent pre-login; privacy concerns, regulatory gray area |
| Read replicas per region | Low latency globally; eventual consistency (acceptable for consent) |
| Immutable audit log | Regulatory proof; storage costs, query complexity |
| Tenant-specific cookie categories | Flexible compliance; configuration explosion |
## Requirements
### Functional Requirements
| Feature | Scope | Notes |
| ------------------------------------ | -------- | ------------------------------------------------------------------- |
| Consent banner rendering | Core | Customizable per tenant, geo-aware |
| Consent collection | Core | Granular per category (essential, functional, analytics, marketing) |
| Consent storage | Core | Persisted with audit trail |
| Consent check API | Core | Called on every page load, must be fast |
| Regulation detection | Core | Auto-detect GDPR, CCPA, LGPD based on user location |
| Multi-tenant configuration | Core | Each website has unique settings |
| Anonymous consent tracking | Core | Device-based consent before login |
| Anonymous-to-authenticated migration | Core | Merge consent on user login |
| Consent withdrawal | Core | As easy as giving consent (GDPR requirement) |
| Consent proof/audit | Core | Immutable record for regulatory audits |
| A/B testing for banners | Extended | Test banner designs, measure consent rates |
| TCF 2.2 support | Extended | IAB Transparency & Consent Framework |
| Google Consent Mode | Extended | Integration with Google services |
### Non-Functional Requirements
| Requirement | Target | Rationale |
| ---------------------- | -------------- | ---------------------------------------------------------- |
| Availability | 99.99% | Consent blocks page functionality; downtime = broken sites |
| Consent check latency | p99 < 50ms | Consent check is render-blocking |
| Consent update latency | p99 < 200ms | User-triggered, less time-sensitive |
| Read/write ratio | 100:1 | Every page load reads; only banner interactions write |
| Tenant count | 100K+ websites | Multi-tenant SaaS model |
| Daily transactions | 500M+ | Based on OneTrust scale (450M+/day) |
| Data retention | 7 years | Proof of consent for audits (GDPR sets no fixed period; 7 years is a conservative choice) |
| Consent accuracy | 100% | Must never serve wrong consent status |
### Scale Estimation
**Traffic Profile:**
| Metric | Value | Calculation |
| --------------------------------- | ------- | ----------------------------------- |
| Websites served | 100,000 | Multi-tenant SaaS |
| Average daily page views per site | 10,000 | Mix of small and large sites |
| Total daily page views | 1B | 100K × 10K |
| Consent checks/day | 1B | 1:1 with page views |
| Consent updates/day | 10M | 1% of visitors interact with banner |
| Peak RPS (reads) | ~50K | 1B / 86,400 × 4 ≈ 46K, rounded up |
| Peak RPS (writes) | ~500 | 10M / 86,400 × 4 ≈ 463, rounded up |
**Storage:**
```
Consent records: 1B unique visitors × 500 bytes = 500GB
Audit logs: 10M updates/day × 1KB × 365 days × 7 years = 25TB
Tenant configurations: 100K × 50KB = 5GB
SDK assets: 100K variants × 100KB = 10GB CDN
```
**Bandwidth:**
```
Consent checks: 50K RPS × 200 bytes = 10MB/s
SDK delivery: 10K RPS × 50KB = 500MB/s (CDN handles most)
```
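The estimates above can be reproduced with a few lines of arithmetic. This sketch uses decimal units (1 KB = 1000 bytes) and a hypothetical `estimateScale` helper; the inputs are the figures from the traffic profile.

```typescript
interface ScaleInputs {
  sites: number
  dailyPageViewsPerSite: number
  bannerInteractionRate: number // fraction of page views that write consent
  peakMultiplier: number
  auditRecordBytes: number
  retentionYears: number
}

function estimateScale(input: ScaleInputs) {
  const secondsPerDay = 86400
  const dailyChecks = input.sites * input.dailyPageViewsPerSite
  const dailyUpdates = dailyChecks * input.bannerInteractionRate
  return {
    dailyChecks,
    dailyUpdates,
    peakReadRps: Math.round((dailyChecks / secondsPerDay) * input.peakMultiplier),
    peakWriteRps: Math.round((dailyUpdates / secondsPerDay) * input.peakMultiplier),
    auditLogTerabytes: (dailyUpdates * input.auditRecordBytes * 365 * input.retentionYears) / 1e12,
  }
}

const scaleEstimate = estimateScale({
  sites: 100000,
  dailyPageViewsPerSite: 10000,
  bannerInteractionRate: 0.01,
  peakMultiplier: 4,
  auditRecordBytes: 1000,
  retentionYears: 7,
})
```

The unrounded results are about 46K peak reads per second, 463 peak writes per second, and roughly 25.5 TB of audit log over seven years, which the tables round to 50K, 500, and 25TB.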
## Design Paths
### Path A: Edge-First Architecture (Latency-Optimized)
**Best when:**
- Consent check latency is critical (advertising, analytics-heavy sites)
- Global audience with low tolerance for slow consent
- High page view volume per user session
**Architecture:**

**Key characteristics:**
- Consent SDK served from CDN edge
- Consent status cached at edge with short TTL (60s)
- Cache key: `{tenant_id}:{device_fingerprint}:{regulation}`
- Edge computes regulation based on request geo
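As a sketch of what the edge does per request, the snippet below maps request geo to a regulation and builds the cache key in the format listed above. The country list is a deliberately incomplete illustration, and `detectRegulation`, `consentCacheKey`, and the `region` parameter are assumptions rather than part of the original design.

```typescript
type Regulation = "gdpr" | "ccpa" | "lgpd" | "none"

// Illustrative subset only; a real rules engine would cover every applicable jurisdiction.
const GDPR_COUNTRIES = ["DE", "FR", "IT", "ES", "NL", "IE"]

// country: ISO 3166-1 alpha-2 code; region: optional subdivision code (e.g., "CA" for California).
function detectRegulation(country: string, region?: string): Regulation {
  if (GDPR_COUNTRIES.indexOf(country) !== -1) return "gdpr"
  if (country === "US" && region === "CA") return "ccpa"
  if (country === "BR") return "lgpd"
  return "none"
}

// Cache key format: {tenant_id}:{device_fingerprint}:{regulation}
function consentCacheKey(tenantId: string, deviceFingerprint: string, regulation: Regulation): string {
  return `${tenantId}:${deviceFingerprint}:${regulation}`
}

const edgeKey = consentCacheKey("tenant_abc123", "fp_xyz789", detectRegulation("DE"))
```

Including the regulation in the key means a visitor who travels between jurisdictions gets a separate cache entry, so a consent decision made under one regime is never served for another.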
**Trade-offs:**
- :white_check_mark: Sub-20ms consent checks from edge cache
- :white_check_mark: Scales infinitely at edge
- :white_check_mark: Origin protected from read traffic
- :x: Stale consent for up to 60 seconds after update
- :x: Cache invalidation complexity on consent change
- :x: Edge compute costs for SDK execution
**Real-world example:** OneTrust serves 450M+ consent transactions daily across 250+ CDN locations. Edge caching enables sub-50ms response times globally.
### Path B: Server-Side Rendering (Compliance-First)
**Best when:**
- Regulatory compliance is paramount (financial services, healthcare)
- Real-time consent accuracy required
- Lower traffic volume, higher value per interaction
**Architecture:**

**Key characteristics:**
- Consent fetched server-side before page render
- Consent status embedded in initial HTML
- No client-side consent check needed
- Server has full control over what scripts load
**Trade-offs:**
- :white_check_mark: Always accurate consent (no stale cache)
- :white_check_mark: Full control over script loading
- :white_check_mark: Simpler client-side implementation
- :x: Higher latency (adds to server response time)
- :x: Server must handle all consent checks
- :x: Caching more complex (varies by user)
**Real-world example:** Banking applications requiring strict PCI-DSS compliance use server-side consent to ensure no unauthorized tracking scripts ever load.
### Path Comparison
| Factor | Path A (Edge-First) | Path B (Server-Side) |
| --------------------- | ----------------------------- | -------------------- |
| Consent check latency | 10-50ms | 50-200ms |
| Consent accuracy | Eventual (60s stale) | Real-time |
| Infrastructure cost | Higher (edge compute) | Lower (centralized) |
| Client complexity | Higher (SDK logic) | Lower |
| Server load | Lower | Higher |
| Best for | High-traffic media/e-commerce | Regulated industries |
### This Article's Focus
This article implements **Path A (Edge-First)** because:
1. Most websites prioritize user experience (fast consent checks)
2. 60-second staleness is acceptable for most consent use cases
3. Read-heavy traffic pattern (100:1) demands edge caching
4. Multi-tenant SaaS requires infrastructure efficiency
Path B details are covered in the [Variations](#variations) section.
## High-Level Design
### Component Overview
| Component | Responsibility | Technology |
| ------------------ | ------------------------------------ | ---------------------------- |
| Consent SDK | Client-side consent management | JavaScript, edge-cached |
| Consent Service | Read/write consent operations | Node.js/Go + Redis |
| Regulation Service | Geo-based regulation detection | MaxMind GeoIP + rules engine |
| Tenant Service | Multi-tenant configuration | PostgreSQL + Redis cache |
| Audit Service | Immutable consent logging | Append-only log + S3 |
| Identity Service | Anonymous-to-authenticated migration | Redis + PostgreSQL |
| Banner Service | A/B testing and rendering | Static CDN + configuration |
### Request Flow: Consent Check

### Request Flow: Consent Update

## API Design
### Consent Check API
```http
GET /api/v1/consent
X-Tenant-ID: tenant_abc123
X-Device-ID: fp_xyz789
X-Geo-Country: DE
```
**Response (200 OK):**
```json
{
"consent_id": "con_abc123xyz",
"device_id": "fp_xyz789",
"user_id": null,
"regulation": "gdpr",
"status": "partial",
"categories": {
"essential": { "consented": true, "required": true },
"functional": { "consented": true, "required": false },
"analytics": { "consented": false, "required": false },
"marketing": { "consented": false, "required": false }
},
"consent_timestamp": "2024-03-15T10:30:00Z",
"policy_version": "v2.3",
"expires_at": "2025-03-15T10:30:00Z",
"banner_config": {
"show_banner": false,
"banner_version": "v1.2"
}
}
```
**Cache headers:**
```http
Cache-Control: private, max-age=60
ETag: "abc123"
Vary: X-Tenant-ID, X-Device-ID, X-Geo-Country
```
**Error responses:**
- `400 Bad Request`: Missing tenant ID or device ID
- `404 Not Found`: Tenant not configured
- `429 Too Many Requests`: Rate limit exceeded
### Consent Update API
```http
POST /api/v1/consent
X-Tenant-ID: tenant_abc123
X-Device-ID: fp_xyz789
X-Idempotency-Key: idem_123456
{
"categories": {
"functional": true,
"analytics": true,
"marketing": false
},
"policy_version": "v2.3",
"user_agent": "Mozilla/5.0...",
"consent_method": "banner_button",
"banner_version": "v1.2"
}
```
**Response (201 Created):**
```json
{
"consent_id": "con_abc123xyz",
"status": "updated",
"categories": {
"essential": { "consented": true },
"functional": { "consented": true },
"analytics": { "consented": true },
"marketing": { "consented": false }
},
"audit_id": "aud_789xyz",
"next_renewal": "2025-03-15T10:30:00Z"
}
```
**Idempotency:** Duplicate requests with the same idempotency key return the cached response instead of writing again.
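One way to picture the idempotency check is a keyed response cache in front of the write path. The sketch below uses an in-memory `Map` as a stand-in for Redis; `IdempotencyCache`, the 24-hour TTL, and the tenant-scoped key format are illustrative assumptions rather than details from the design.

```typescript
interface StoredResponse {
  status: number
  body: unknown
}

interface Entry {
  response: StoredResponse
  expiresAt: number
}

// Hypothetical in-memory stand-in for a Redis-backed idempotency store.
class IdempotencyCache {
  private entries = new Map<string, Entry>()
  private ttlMs: number

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs
  }

  // Runs `handler` at most once per key within the TTL;
  // duplicates receive the stored response instead of writing again.
  execute(key: string, handler: () => StoredResponse, now: number = Date.now()): StoredResponse {
    const hit = this.entries.get(key)
    if (hit && hit.expiresAt > now) return hit.response
    const response = handler()
    this.entries.set(key, { response, expiresAt: now + this.ttlMs })
    return response
  }
}

let writes = 0
const cache = new IdempotencyCache(24 * 60 * 60 * 1000)
const handleUpdate = (): StoredResponse => {
  writes += 1
  return { status: 201, body: { consent_id: "con_abc123xyz", status: "updated" } }
}

// Scope the key by tenant so two tenants cannot collide on the same key.
const key = "tenant_abc123:idem_123456"
const first = cache.execute(key, handleUpdate)
const retry = cache.execute(key, handleUpdate)
console.log(writes, first === retry) // 1 true
```

A production version would need an atomic check-and-set (for example Redis `SET` with `NX`) so that two concurrent duplicates cannot both run the handler.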
### Consent Withdrawal API
```http
DELETE /api/v1/consent/categories/marketing
X-Tenant-ID: tenant_abc123
X-Device-ID: fp_xyz789
```
**Response (200 OK):**
```json
{
"consent_id": "con_abc123xyz",
"withdrawn_category": "marketing",
"withdrawn_at": "2024-03-15T11:00:00Z",
"audit_id": "aud_790xyz"
}
```
**GDPR requirement:** Under Article 7(3), withdrawing consent must be as easy as giving it. A single API call withdraws a specific category.
### Identity Migration API
```http
POST /api/v1/consent/migrate
X-Tenant-ID: tenant_abc123
Content-Type: application/json

{
"device_id": "fp_xyz789",
"user_id": "user_456",
"migration_strategy": "device_wins_recent"
}
```
**Response (200 OK):**
```json
{
"migration_id": "mig_123abc",
"source": {
"device_id": "fp_xyz789",
"consent_timestamp": "2024-03-15T10:30:00Z"
},
"target": {
"user_id": "user_456",
"consent_timestamp": "2024-03-10T08:00:00Z"
},
"result": {
"strategy_applied": "device_wins_recent",
"merged_categories": {
"functional": true,
"analytics": true,
"marketing": false
},
"conflicts_resolved": [
{
"category": "analytics",
"device_value": true,
"user_value": false,
"resolved_value": true,
"reason": "device consent more recent"
}
]
},
"audit_id": "aud_791xyz"
}
```
**Migration strategies:**
- `device_wins_recent`: More recent consent wins (default)
- `user_wins`: Authenticated user consent always wins
- `most_restrictive`: Use most privacy-preserving choice
- `prompt_user`: Flag conflicts for user resolution
### Tenant Configuration API
```http
GET /api/v1/tenants/{tenant_id}/config
```
**Response (200 OK):**
```json
{
"tenant_id": "tenant_abc123",
"domain": "example.com",
"subdomains": ["shop.example.com", "blog.example.com"],
"categories": [
{
"id": "essential",
"name": "Essential Cookies",
"description": "Required for basic website functionality",
"required": true,
"cookies": ["session_id", "csrf_token"]
},
{
"id": "analytics",
"name": "Analytics Cookies",
"description": "Help us understand how visitors use our site",
"required": false,
"cookies": ["_ga", "_gid", "_gat"],
"vendors": ["Google Analytics"]
}
],
"regulations": {
"default": "gdpr",
"overrides": {
"US-CA": "ccpa",
"BR": "lgpd"
}
},
"banner": {
"position": "bottom",
"theme": "light",
"show_reject_all": true,
"consent_renewal_days": 365
},
"tcf_enabled": true,
"google_consent_mode": true
}
```
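The `consent_renewal_days` setting presumably drives the `expires_at` and `next_renewal` timestamps returned by the consent APIs. A small sketch of that arithmetic, assuming expiry is simply the consent timestamp plus the renewal window; `computeExpiry` and `isExpired` are hypothetical helpers.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Adds the renewal window to the consent timestamp (all arithmetic in UTC).
function computeExpiry(consentTimestamp: string, renewalDays: number): string {
  const start = Date.parse(consentTimestamp)
  if (Number.isNaN(start)) throw new Error(`Invalid timestamp: ${consentTimestamp}`)
  // toISOString() always emits milliseconds; drop a zero suffix to match the API format.
  return new Date(start + renewalDays * MS_PER_DAY).toISOString().replace(".000Z", "Z")
}

// Expired consent should trigger the banner again.
function isExpired(expiresAt: string, now: Date = new Date()): boolean {
  return now.getTime() >= Date.parse(expiresAt)
}

console.log(computeExpiry("2024-03-15T10:30:00Z", 365)) // "2025-03-15T10:30:00Z"
```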
## Data Modeling
### Consent Record (PostgreSQL)
```sql
CREATE TABLE consent_records (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id VARCHAR(50) NOT NULL,
device_id VARCHAR(100) NOT NULL,
user_id VARCHAR(100), -- NULL for anonymous
-- Consent state
regulation VARCHAR(20) NOT NULL, -- gdpr, ccpa, lgpd
policy_version VARCHAR(20) NOT NULL,
categories JSONB NOT NULL,
status VARCHAR(20) DEFAULT 'partial', -- none, partial, full
-- Metadata
ip_country VARCHAR(2),
user_agent TEXT,
consent_method VARCHAR(50), -- banner_accept, banner_reject, api
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ,
-- Constraints
UNIQUE (tenant_id, device_id)
);
-- PostgreSQL does not accept a WHERE clause on an inline UNIQUE constraint,
-- so per-user uniqueness is enforced with a partial unique index.
CREATE UNIQUE INDEX idx_consent_tenant_user ON consent_records(tenant_id, user_id)
WHERE user_id IS NOT NULL;
-- Indexes for common queries
-- (tenant_id, device_id) is already indexed by the UNIQUE constraint above.
CREATE INDEX idx_consent_expires ON consent_records(expires_at)
WHERE expires_at IS NOT NULL;
```
**Sharding strategy:** Shard by `tenant_id` to co-locate all consent for a website. High-volume tenants may need dedicated shards.
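A rough sketch of how tenant-based shard routing could look, assuming a fixed pool of shared shards plus an explicit map of dedicated shards for high-volume tenants. The shard names, `ShardMap`, and `shardForTenant` are hypothetical, and the djb2-style hash is only one of many stable hashes that would work.

```typescript
// djb2-style string hash kept in unsigned 32-bit range; any stable hash would do.
function stableHash(input: string): number {
  let hash = 5381
  for (let i = 0; i < input.length; i++) {
    hash = (Math.imul(hash, 33) + input.charCodeAt(i)) >>> 0
  }
  return hash
}

interface ShardMap {
  sharedShards: string[]
  dedicated: Map<string, string> // tenantId -> dedicated shard
}

function shardForTenant(tenantId: string, map: ShardMap): string {
  // High-volume tenants are pinned to their own shard.
  const dedicated = map.dedicated.get(tenantId)
  if (dedicated !== undefined) return dedicated
  if (map.sharedShards.length === 0) throw new Error("No shared shards configured")
  // Everyone else is spread across the shared pool by hash.
  return map.sharedShards[stableHash(tenantId) % map.sharedShards.length]
}

const shardMap: ShardMap = {
  sharedShards: ["shard-0", "shard-1", "shard-2", "shard-3"],
  dedicated: new Map([["tenant_bigco", "shard-bigco"]]),
}
console.log(shardForTenant("tenant_bigco", shardMap)) // "shard-bigco"
console.log(shardForTenant("tenant_abc123", shardMap)) // a shared shard, stable across calls
```

Plain modulo hashing moves most tenants when the shard count changes; a lookup table or consistent hashing avoids that at the cost of extra bookkeeping.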
### Audit Log (Append-Only)
```sql
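CREATE TABLE consent_audit (
id BIGSERIAL,
consent_id UUID NOT NULL REFERENCES consent_records(id),
tenant_id VARCHAR(50) NOT NULL,
-- What changed
action VARCHAR(20) NOT NULL, -- create, update, withdraw, migrate
old_categories JSONB,
new_categories JSONB,
-- Context
policy_version VARCHAR(20),
ip_address INET,
user_agent TEXT,
consent_method VARCHAR(50),
idempotency_key VARCHAR(100),
-- Immutable timestamp
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-- On a partitioned table the primary key must include the partition key
PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
-- A CHECK constraint cannot prevent updates. Enforce append-only behavior by
-- revoking UPDATE and DELETE from the application role (app_role is a placeholder).
REVOKE UPDATE, DELETE ON consent_audit FROM app_role;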
-- Partition by month for efficient archival
CREATE TABLE consent_audit_2024_03 PARTITION OF consent_audit
FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
-- Index for regulatory queries
CREATE INDEX idx_audit_consent ON consent_audit(consent_id, created_at DESC);
CREATE INDEX idx_audit_tenant_time ON consent_audit(tenant_id, created_at DESC);
```
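**Retention policy:** Archive to S3 Glacier after 1 year and retain for 7 years in total. GDPR sets no fixed retention period for consent records; the 7-year window is a conservative policy choice so the controller can demonstrate consent (Article 7(1)) for as long as a challenge is plausible.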
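Monthly partitions such as `consent_audit_2024_03` have to exist before rows for that month arrive, so the DDL is usually generated ahead of time by a scheduled job. A minimal sketch of such a generator, assuming the naming scheme shown above; `partitionDdl` is a hypothetical helper.

```typescript
// Builds the CREATE TABLE statement for one monthly partition (month is 1-12).
function partitionDdl(year: number, month: number): string {
  // Date.UTC normalizes month overflow, so December rolls into January of the next year.
  const start = new Date(Date.UTC(year, month - 1, 1))
  const end = new Date(Date.UTC(year, month, 1))
  const day = (d: Date): string => d.toISOString().slice(0, 10)
  const suffix = `${year}_${String(month).padStart(2, "0")}`
  return (
    `CREATE TABLE consent_audit_${suffix} PARTITION OF consent_audit\n` +
    `FOR VALUES FROM ('${day(start)}') TO ('${day(end)}');`
  )
}

console.log(partitionDdl(2024, 3))
// CREATE TABLE consent_audit_2024_03 PARTITION OF consent_audit
// FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');
```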
### Tenant Configuration (PostgreSQL + Redis)
```sql
CREATE TABLE tenants (
id VARCHAR(50) PRIMARY KEY,
domain VARCHAR(255) NOT NULL UNIQUE,
subdomains TEXT[],
-- Configuration
config JSONB NOT NULL,
banner_config JSONB,
-- SDK versioning
sdk_version VARCHAR(20) DEFAULT 'latest',
custom_sdk_url TEXT,
-- Status
status VARCHAR(20) DEFAULT 'active',
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_tenants_domain ON tenants(domain);
CREATE INDEX idx_tenants_subdomains ON tenants USING GIN(subdomains);
```
**Redis cache structure:**
```redis
# Tenant config (cached for 5 minutes)
SETEX tenant:config:tenant_abc123 300 "{...json config...}"
# Consent record (cached for 60 seconds)
SETEX consent:tenant_abc123:fp_xyz789 60 "{...consent...}"
# Invalidation on update
DEL consent:tenant_abc123:fp_xyz789
```
### Database Selection Matrix
| Data | Store | Rationale |
| ------------------- | -------------------------- | -------------------------------------------- |
| Consent records | PostgreSQL + Read replicas | ACID, complex queries, regional distribution |
| Consent cache | Redis Cluster | Sub-ms reads, TTL support |
| Audit log | PostgreSQL (partitioned) | Immutable, time-series queries |
| Audit archive | S3 Glacier | Cost-effective long-term storage |
| Tenant config | PostgreSQL + Redis | Infrequent updates, high read frequency |
| SDK assets | S3 + CloudFront | Global distribution, versioning |
| Device fingerprints | Redis | Ephemeral, fast lookup |
## Low-Level Design
### Consent SDK (Client-Side)
The SDK is the first point of contact for consent management. It must:
1. Generate and persist device fingerprints
2. Fetch and apply consent status
3. Block non-consented scripts
4. Handle consent updates
**SDK initialization:**
```typescript collapse={1-15, 55-70}
// consent-sdk.ts
interface ConsentConfig {
tenantId: string
apiEndpoint: string
categories: CategoryConfig[]
regulation?: "auto" | "gdpr" | "ccpa" | "lgpd"
onConsentChange?: (consent: ConsentStatus) => void
}
interface ConsentStatus {
categories: Record<string, boolean>
regulation: string
timestamp: string
showBanner: boolean
}
class ConsentSDK {
private config: ConsentConfig
private deviceId: string
private consent: ConsentStatus | null = null
async init(config: ConsentConfig): Promise<void> {
this.config = config
this.deviceId = await this.getOrCreateDeviceId()
// Fetch consent status
this.consent = await this.fetchConsent()
// Apply consent immediately
this.applyConsent(this.consent)
// Show banner if needed
if (this.consent.showBanner) {
this.renderBanner()
}
}
private async getOrCreateDeviceId(): Promise<string> {
// Check localStorage first
let deviceId = localStorage.getItem("_consent_device_id")
if (deviceId) return deviceId
// Generate fingerprint
deviceId = await this.generateFingerprint()
localStorage.setItem("_consent_device_id", deviceId)
return deviceId
}
private async generateFingerprint(): Promise<string> {
// Collect browser signals (privacy-preserving subset)
const signals = {
userAgent: navigator.userAgent,
language: navigator.language,
timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
screenResolution: `${screen.width}x${screen.height}`,
colorDepth: screen.colorDepth,
// Avoid canvas fingerprinting (privacy concern)
}
// Hash to create stable identifier
const data = JSON.stringify(signals)
const hashBuffer = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(data))
const hashArray = Array.from(new Uint8Array(hashBuffer))
return (
"fp_" +
hashArray
.map((b) => b.toString(16).padStart(2, "0"))
.join("")
.slice(0, 32)
)
}
}
```
**Script blocking implementation:**
```typescript collapse={1-10, 45-60}
// script-blocker.ts
interface BlockedScript {
src: string
category: string
type: "script" | "iframe" | "img"
}
const blockedScripts: BlockedScript[] = []
function applyConsent(consent: ConsentStatus): void {
// Find all scripts with consent requirements
const scripts = document.querySelectorAll("script[data-consent-category]")
scripts.forEach((script) => {
const category = script.getAttribute("data-consent-category")
const consented = consent.categories[category]
if (consented) {
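// Enable script: remove the text/plain blocker first, then set src.
// Setting src on a connected script is what triggers execution,
// so the type must already be executable at that point.
script.removeAttribute("type")
if (script.getAttribute("data-src")) {
script.setAttribute("src", script.getAttribute("data-src")!)
script.removeAttribute("data-src")
}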
} else {
// Keep blocked (or block if not already)
if (script.getAttribute("src")) {
script.setAttribute("data-src", script.getAttribute("src")!)
script.removeAttribute("src")
script.setAttribute("type", "text/plain")
}
}
})
// Handle dynamically added scripts
observeNewScripts(consent)
}
function observeNewScripts(consent: ConsentStatus): void {
const observer = new MutationObserver((mutations) => {
mutations.forEach((mutation) => {
mutation.addedNodes.forEach((node) => {
if (node.nodeName === "SCRIPT") {
const script = node as HTMLScriptElement
const category = script.getAttribute("data-consent-category")
if (category && !consent.categories[category]) {
// Block dynamically added non-consented script
script.setAttribute("data-src", script.src)
script.removeAttribute("src")
script.type = "text/plain"
}
}
})
})
})
observer.observe(document.documentElement, {
childList: true,
subtree: true,
})
}
```
**Design decisions:**
| Decision | Rationale |
| ---------------------------- | -------------------------------------------------- |
| localStorage for device ID | Persists across sessions; not sent with every request, unlike a cookie |
| SHA-256 fingerprint | Stable identifier without invasive fingerprinting |
| MutationObserver for scripts | Catches dynamically injected tracking scripts |
| `type="text/plain"` blocking | Browser ignores script content without removing it |
### Anonymous-to-Authenticated Migration
When a user logs in, their anonymous device consent must merge with any existing authenticated consent.
**Migration logic:**
```typescript collapse={1-15, 70-85}
// identity-migration.ts
interface MigrationRequest {
tenantId: string
deviceId: string
userId: string
strategy: "device_wins_recent" | "user_wins" | "most_restrictive" | "prompt_user"
}
interface ConsentRecord {
categories: Record<string, boolean>
timestamp: Date
source: "device" | "user"
}
async function migrateConsent(request: MigrationRequest): Promise {
const { tenantId, deviceId, userId, strategy } = request
// Fetch both consent records
const [deviceConsent, userConsent] = await Promise.all([
getConsentByDevice(tenantId, deviceId),
getConsentByUser(tenantId, userId),
])
// If no existing user consent, simply link device consent to user
if (!userConsent) {
await linkDeviceToUser(tenantId, deviceId, userId)
return { migrated: true, conflicts: [] }
}
// If no device consent, keep user consent as-is
if (!deviceConsent) {
return { migrated: false, reason: "no_device_consent" }
}
// Merge based on strategy
const merged = mergeConsent(deviceConsent, userConsent, strategy)
// Write merged consent
await updateUserConsent(tenantId, userId, merged.categories)
// Audit the migration
await auditMigration(tenantId, deviceId, userId, deviceConsent, userConsent, merged)
// Invalidate caches
await invalidateConsentCache({ tenantId, identifier: deviceId, type: "device" })
await invalidateConsentCache({ tenantId, identifier: userId, type: "user" })
return merged
}
function mergeConsent(device: ConsentRecord, user: ConsentRecord, strategy: string): MergeResult {
const categories = new Set([...Object.keys(device.categories), ...Object.keys(user.categories)])
const merged: Record<string, boolean> = {}
const conflicts: Conflict[] = []
for (const category of categories) {
const deviceValue = device.categories[category]
const userValue = user.categories[category]
if (deviceValue === userValue) {
merged[category] = deviceValue
continue
}
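// Conflict resolution based on strategy
switch (strategy) {
case "device_wins_recent":
// Fall back to the other side when the newer record has no value for this category
merged[category] =
device.timestamp > user.timestamp ? (deviceValue ?? userValue) : (userValue ?? deviceValue)
break
case "user_wins":
merged[category] = userValue ?? deviceValue
break
case "most_restrictive":
// A missing choice counts as not consented
merged[category] = deviceValue === true && userValue === true
break
case "prompt_user":
// Stay restrictive until the user resolves the conflict in the UI
merged[category] = false
conflicts.push({ category, deviceValue, userValue })
break
}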
}
return { categories: merged, conflicts }
}
```
**Migration flow:**

**Edge cases:**
| Scenario | Handling |
| --------------------------- | ----------------------------------------------- |
| User has multiple devices | Each device's consent migrates independently |
| User logs out and back in | Device consent may have changed; re-migrate |
| User clears localStorage | New device ID generated; fresh consent flow |
| Conflict with `prompt_user` | Return conflicts to client; user resolves in UI |
### Regulation Detection Service
Automatically detect applicable privacy regulation based on user location.
**Detection logic:**
```typescript collapse={1-12, 50-65}
// regulation-service.ts
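import maxmind, { CityResponse } from "maxmind"
interface RegulationResult {
regulation: "gdpr" | "ccpa" | "lgpd" | "none"
country: string
region?: string
confidence: "high" | "medium" | "low"
}
// Typing the reader as CityResponse gives geo.country and geo.subdivisions their types
const geoDb = await maxmind.open<CityResponse>("/data/GeoLite2-City.mmdb")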
function detectRegulation(ipAddress: string): RegulationResult {
const geo = geoDb.get(ipAddress)
if (!geo || !geo.country) {
// Default to strictest regulation when unknown
return { regulation: "gdpr", country: "unknown", confidence: "low" }
}
const country = geo.country.iso_code
const region = geo.subdivisions?.[0]?.iso_code
// GDPR: EU/EEA countries
const gdprCountries = [
"AT",
"BE",
"BG",
"HR",
"CY",
"CZ",
"DK",
"EE",
"FI",
"FR",
"DE",
"GR",
"HU",
"IE",
"IT",
"LV",
"LT",
"LU",
"MT",
"NL",
"PL",
"PT",
"RO",
"SK",
"SI",
"ES",
"SE",
// EEA
"IS",
"LI",
"NO",
// UK (still applies GDPR-equivalent)
"GB",
]
if (gdprCountries.includes(country)) {
return { regulation: "gdpr", country, confidence: "high" }
}
// CCPA: California residents
if (country === "US" && region === "CA") {
return { regulation: "ccpa", country, region, confidence: "high" }
}
// LGPD: Brazil
if (country === "BR") {
return { regulation: "lgpd", country, confidence: "high" }
}
// Default: no specific regulation (still show banner for best practice)
return { regulation: "none", country, confidence: "high" }
}
// Tenant-specific overrides
function applyTenantOverrides(detected: RegulationResult, tenantConfig: TenantConfig): RegulationResult {
const override =
tenantConfig.regulations.overrides?.[`${detected.country}-${detected.region}`] ||
tenantConfig.regulations.overrides?.[detected.country]
if (override) {
return { ...detected, regulation: override }
}
return detected
}
```
**Regulation behavior matrix:**
| Regulation | Consent Model | Default Blocked | Withdrawal |
| ---------- | ------------- | ------------------- | ------------- |
| GDPR | Opt-in | All non-essential | Required |
| CCPA | Opt-out | Marketing (if sold) | Required |
| LGPD | Opt-in | All non-essential | Required |
| None | Opt-out | None | Best practice |
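The matrix above can be read as a rule for the consent state a visitor starts with before making any choice. A minimal sketch under that reading, assuming a `marketing` category ID as in the tenant configuration and a hypothetical `sellsPersonalData` flag for the CCPA "if sold" condition; `defaultConsent` and `CategoryDef` are illustrative names.

```typescript
type Regulation = "gdpr" | "ccpa" | "lgpd" | "none"

interface CategoryDef {
  id: string
  required: boolean
}

// Returns the consent state to apply before the visitor has chosen anything.
function defaultConsent(
  regulation: Regulation,
  categories: CategoryDef[],
  sellsPersonalData: boolean = false,
): Record<string, boolean> {
  const optIn = regulation === "gdpr" || regulation === "lgpd"
  const defaults: Record<string, boolean> = {}
  for (const category of categories) {
    if (category.required) {
      defaults[category.id] = true // essential categories are never blocked
    } else if (optIn) {
      defaults[category.id] = false // opt-in: blocked until the user agrees
    } else if (regulation === "ccpa" && category.id === "marketing") {
      defaults[category.id] = !sellsPersonalData // blocked only when data is sold
    } else {
      defaults[category.id] = true // opt-out: allowed until the user objects
    }
  }
  return defaults
}

const categoryDefs: CategoryDef[] = [
  { id: "essential", required: true },
  { id: "analytics", required: false },
  { id: "marketing", required: false },
]
console.log(defaultConsent("gdpr", categoryDefs))
// { essential: true, analytics: false, marketing: false }
console.log(defaultConsent("ccpa", categoryDefs, true))
// { essential: true, analytics: true, marketing: false }
```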
### Multi-Tenant Configuration Engine
**Configuration hierarchy:**

**Configuration resolution:**
```typescript collapse={1-10, 45-60}
// tenant-config.ts
interface ResolvedConfig {
tenantId: string
domain: string
categories: CategoryConfig[]
banner: BannerConfig
regulations: RegulationConfig
sdk: SDKConfig
}
async function resolveConfig(tenantId: string, domain: string): Promise<ResolvedConfig> {
// Check cache first
const cacheKey = `tenant:config:${tenantId}:${domain}`
const cached = await redis.get(cacheKey)
if (cached) return JSON.parse(cached)
// Load tenant base config
const tenant = await db.tenants.findById(tenantId)
if (!tenant) throw new Error("Tenant not found")
// Apply domain-specific overrides
let config = tenant.config
if (tenant.domainOverrides?.[domain]) {
config = deepMerge(config, tenant.domainOverrides[domain])
}
// Merge with global defaults
const resolved = deepMerge(GLOBAL_DEFAULTS, config)
// Cache for 5 minutes
await redis.setex(cacheKey, 300, JSON.stringify(resolved))
return resolved
}
// Configuration update propagation
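async function updateTenantConfig(tenantId: string, updates: Partial<ResolvedConfig>): Promise<void> {
// Update database
await db.tenants.update(tenantId, updates)
// Invalidate all cached configs for this tenant.
// KEYS walks the entire keyspace and blocks Redis while it runs;
// at production scale prefer SCAN or a per-tenant set of cache keys.
const keys = await redis.keys(`tenant:config:${tenantId}:*`)
if (keys.length > 0) {
await redis.del(...keys)
}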
// Trigger SDK regeneration if categories changed
if (updates.categories) {
await sdkBuildQueue.add({ tenantId, reason: "category_update" })
}
}
```
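`resolveConfig` relies on a `deepMerge` helper that is not shown. One plausible shape for it, assuming nested objects merge recursively while arrays and primitives from the override replace the base value wholesale. The sample `GLOBAL_DEFAULTS` values here are illustrative only.

```typescript
type Plain = { [key: string]: unknown }

function isPlainObject(value: unknown): value is Plain {
  return typeof value === "object" && value !== null && !Array.isArray(value)
}

// Returns a new object: values in `override` win, nested objects merge recursively,
// arrays and primitives are replaced wholesale. Untouched nested values are shared, not cloned.
function deepMerge(base: Plain, override: Plain): Plain {
  const result: Plain = { ...base }
  for (const [key, value] of Object.entries(override)) {
    // Skip absent values and guard against prototype pollution from parsed JSON.
    if (value === undefined || key === "__proto__") continue
    const existing = result[key]
    result[key] = isPlainObject(existing) && isPlainObject(value) ? deepMerge(existing, value) : value
  }
  return result
}

const GLOBAL_DEFAULTS: Plain = {
  banner: { position: "bottom", theme: "light", show_reject_all: true },
  tcf_enabled: false,
}
const tenantConfig: Plain = { banner: { theme: "dark" }, tcf_enabled: true }

const resolved = deepMerge(GLOBAL_DEFAULTS, tenantConfig)
console.log(JSON.stringify(resolved))
// {"banner":{"position":"bottom","theme":"dark","show_reject_all":true},"tcf_enabled":true}
```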
### Cache Invalidation Strategy
**Layered cache invalidation:**

**Implementation:**
```typescript collapse={1-8, 35-50}
// cache-invalidation.ts
interface InvalidationTarget {
tenantId: string
identifier: string // device_id or user_id
type: "device" | "user"
}
async function invalidateConsentCache(target: InvalidationTarget): Promise<void> {
const { tenantId, identifier, type } = target
// 1. Invalidate Redis cache
const redisKey = `consent:${tenantId}:${identifier}`
await redis.del(redisKey)
// 2. Invalidate CDN edge cache
const edgeKey = `consent/${tenantId}/${identifier}`
await cdnPurge(edgeKey)
// 3. Publish invalidation event for any connected clients
await pubsub.publish(`consent:invalidate:${tenantId}`, {
identifier,
type,
timestamp: Date.now(),
})
}
// SDK listens for invalidation events
function setupInvalidationListener(tenantId: string): void {
const eventSource = new EventSource(`/api/v1/consent/events?tenant=${tenantId}`)
eventSource.onmessage = (event) => {
const data = JSON.parse(event.data)
if (data.identifier === currentDeviceId) {
// Refetch consent immediately
refreshConsent()
}
}
}
```
**Cache TTLs:**
| Cache Layer | TTL | Rationale |
| ------------------- | ------- | --------------------------------- |
| Edge CDN | 60s | Balance freshness vs. origin load |
| Redis (consent) | 60s | Match edge TTL |
| Redis (config) | 300s | Config changes less frequent |
| Client localStorage | Session | Refresh on page load |
## Frontend Considerations
### Banner Performance
**Render-blocking mitigation:**
```html collapse={1-5, 20-30}
```
**Layout shift prevention:**
```css
/* Reserve space for banner to prevent CLS */
#consent-banner-placeholder {
position: fixed;
bottom: 0;
left: 0;
right: 0;
height: 0;
transition: height 0.3s ease-out;
}
#consent-banner-placeholder.visible {
height: 200px; /* Match actual banner height */
}
/* Alternative: overlay banner (no layout shift) */
.consent-banner-overlay {
position: fixed;
bottom: 0;
left: 0;
right: 0;
z-index: 9999;
/* Doesn't affect layout */
}
```
### State Management
```typescript collapse={1-10, 45-60}
// consent-state.ts
interface ConsentState {
// Consent data
status: ConsentStatus | null
loading: boolean
error: Error | null
// UI state
bannerVisible: boolean
preferencesOpen: boolean
// Pending changes
pendingCategories: Record<string, boolean>
}
const consentStore = createStore<ConsentState>({
status: null,
loading: true,
error: null,
bannerVisible: false,
preferencesOpen: false,
pendingCategories: {},
})
// Actions
function updateCategory(category: string, value: boolean): void {
consentStore.update((state) => ({
...state,
pendingCategories: {
...state.pendingCategories,
[category]: value,
},
}))
}
async function saveConsent(): Promise<void> {
const { pendingCategories } = consentStore.get()
consentStore.update((state) => ({ ...state, loading: true }))
try {
const result = await api.updateConsent(pendingCategories)
consentStore.update((state) => ({
...state,
status: result,
pendingCategories: {},
bannerVisible: false,
loading: false,
}))
// Persist to localStorage
localStorage.setItem("_consent_status", JSON.stringify(result))
// Apply to scripts
applyConsent(result)
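} catch (error) {
consentStore.update((state) => ({
...state,
// catch variables are typed unknown, so normalize to an Error
error: error instanceof Error ? error : new Error(String(error)),
loading: false,
}))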
}
}
```
### Google Consent Mode Integration
```typescript collapse={1-15, 50-65}
// google-consent-mode.ts
interface GoogleConsentState {
ad_storage: "granted" | "denied"
ad_user_data: "granted" | "denied"
ad_personalization: "granted" | "denied"
analytics_storage: "granted" | "denied"
}
function mapConsentToGoogle(consent: ConsentStatus): GoogleConsentState {
return {
ad_storage: consent.categories.marketing ? "granted" : "denied",
ad_user_data: consent.categories.marketing ? "granted" : "denied",
ad_personalization: consent.categories.marketing ? "granted" : "denied",
analytics_storage: consent.categories.analytics ? "granted" : "denied",
}
}
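// gtag must push the `arguments` object itself; gtag.js does not process
// plain arrays, so a rest-parameter array would be silently ignored.
// Declared at module scope so updateGoogleConsent can call it too.
function gtag(..._args: unknown[]): void {
window.dataLayer = window.dataLayer || []
window.dataLayer.push(arguments)
}
function initGoogleConsentMode(consent: ConsentStatus): void {
// Initialize with default (denied) before consent known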
gtag("consent", "default", {
ad_storage: "denied",
ad_user_data: "denied",
ad_personalization: "denied",
analytics_storage: "denied",
wait_for_update: 500, // Wait for consent check
})
// Update once consent is known
const googleConsent = mapConsentToGoogle(consent)
gtag("consent", "update", googleConsent)
}
// On consent change
function updateGoogleConsent(consent: ConsentStatus): void {
const googleConsent = mapConsentToGoogle(consent)
gtag("consent", "update", googleConsent)
}
```
**Google Consent Mode parameters:**
| Parameter | Maps to | Description |
| ------------------ | ----------------- | ------------------------- |
| ad_storage | Marketing cookies | Storage for advertising |
| ad_user_data | Marketing cookies | Sending user data for ads |
| ad_personalization | Marketing cookies | Personalized advertising |
| analytics_storage | Analytics cookies | Storage for analytics |
## Infrastructure Design
### Cloud-Agnostic Components
| Component | Purpose | Requirements |
| --------------- | ------------------------------ | ------------------------------ |
| CDN | SDK delivery, edge caching | Global PoPs, cache purge API |
| Key-value store | Consent cache | Sub-ms reads, TTL support |
| Relational DB | Consent records, tenant config | ACID, read replicas |
| Object storage | Audit archives, SDK assets | Versioning, lifecycle policies |
| Message queue | Async processing | Durability, dead letter queue |
| Geo database | IP to location | Low latency, regular updates |
### AWS Reference Architecture

**Service configuration:**
| Service | Configuration | Rationale |
| -------------- | --------------------------------- | ------------------------------- |
| CloudFront | 250+ PoPs, Lambda@Edge | Global low-latency SDK delivery |
| Lambda@Edge | 128MB, 5s timeout | Regulation detection at edge |
| API Gateway | 10K RPS, WAF | Rate limiting, DDoS protection |
| ECS Fargate | 2 vCPU, 4GB, Auto-scale 2-50 | Consent API servers |
| ElastiCache | Redis Cluster, 3 nodes, r6g.large | Sub-ms consent cache |
| RDS PostgreSQL | Multi-AZ, db.r6g.xlarge | Primary consent store |
| Read Replicas | eu-west-1, ap-south-1 | Regional read latency |
| S3 | Intelligent Tiering | Audit logs with lifecycle |
### Multi-Region Deployment

**Design decisions:**
| Decision | Rationale |
| ---------------------- | ------------------------------------------------- |
| Single write region | Simplifies consistency, GDPR compliance (EU data) |
| Regional read replicas | <50ms read latency globally |
| Local Redis per region | Sub-ms cache hits, no cross-region calls |
| Async replication | Acceptable for consent (eventual consistency) |
### Self-Hosted Alternatives
| Managed Service | Self-Hosted Option | Trade-off |
| --------------- | -------------------- | ------------------------------------ |
| CloudFront | Fastly/Cloudflare | More edge compute options |
| ElastiCache | Redis Cluster on EC2 | More control, operational burden |
| RDS PostgreSQL | PostgreSQL on EC2 | Custom extensions, cost at scale |
| Lambda@Edge | Cloudflare Workers | Better cold start, different pricing |
## Variations
### Server-Side Consent (Path B)
For regulated industries requiring real-time consent accuracy:
```typescript collapse={1-12, 45-60}
// server-side-consent.ts
import { ConsentService } from './consent-service';
interface ServerRenderContext {
request: Request;
consent: ConsentStatus;
allowedScripts: string[];
}
async function renderPageWithConsent(
request: Request,
pageComponent: Component
): Promise<Response> {
// Extract device ID from cookie
const deviceId = request.cookies.get('_consent_device_id');
// Fetch consent server-side
const consent = await consentService.getConsent(
TENANT_ID,
deviceId,
request.headers.get('cf-ipcountry')
);
// Determine which scripts to include
const allowedScripts = getScriptsForConsent(consent);
// Render page with consent context
const html = renderToString(
);
// Set consent cookie for client-side access
const response = new Response(html);
response.headers.set(
'Set-Cookie',
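// JSON contains quotes, commas, and spaces that are not valid in cookie values
`_consent_status=${encodeURIComponent(JSON.stringify(consent))}; Path=/; SameSite=Lax`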
);
return response;
}
function getScriptsForConsent(consent: ConsentStatus): string[] {
const scripts: string[] = ['essential.js'];
if (consent.categories.analytics) {
scripts.push('analytics.js');
}
if (consent.categories.marketing) {
scripts.push('marketing.js');
}
return scripts;
}
```
### IAB TCF 2.2 Support
For publishers requiring TCF compliance:
```typescript collapse={1-15, 55-70}
// tcf-support.ts
interface TCFConsent {
tcString: string // Base64-encoded TC string
gdprApplies: boolean
purposeConsents: Record<number, boolean>
vendorConsents: Record<number, boolean>
specialFeatureOptins: Record<number, boolean>
}
function generateTCString(consent: ConsentStatus): string {
// TCF 2.2 requires specific format
const tcData = {
version: 2,
created: consent.timestamp,
lastUpdated: consent.timestamp,
cmpId: 123, // Registered CMP ID
cmpVersion: 1,
consentScreen: 1,
consentLanguage: "EN",
vendorListVersion: 123,
tcfPolicyVersion: 4,
isServiceSpecific: false,
useNonStandardStacks: false,
purposeConsents: mapCategoryToTCFPurpose(consent.categories),
vendorConsents: {}, // Populated from GVL
}
return encodeTCString(tcData)
}
function mapCategoryToTCFPurpose(categories: Record<string, boolean>): Record<number, boolean> {
// TCF v2.2 purpose mapping
return {
1: categories.essential, // Store and/or access information on a device
2: categories.functional, // Use limited data to select advertising
3: categories.marketing, // Create profiles for personalised advertising
4: categories.marketing, // Use profiles to select personalised advertising
5: categories.marketing, // Create profiles to personalise content
6: categories.functional, // Use profiles to select personalised content
7: categories.analytics, // Measure advertising performance
8: categories.analytics, // Measure content performance
9: categories.analytics, // Understand audiences through statistics or combinations of data from different sources
10: categories.functional, // Develop and improve services
11: categories.essential, // Use limited data to select content
}
}
// Expose TC string via __tcfapi
function setupTCFAPI(tcfConsent: TCFConsent): void {
window.__tcfapi = (command, version, callback, parameter) => {
switch (command) {
case "getTCData":
// The TCF API invokes callbacks as callback(tcData, success)
callback(tcfConsent, true)
break
case "addEventListener":
// Register listener for consent changes
consentListeners.push(callback)
break
case "removeEventListener":
// Remove listener
break
}
}
}
```
### Consent Rate A/B Testing
```typescript collapse={1-10, 40-55}
// ab-testing.ts
interface BannerVariant {
id: string
position: "top" | "bottom" | "center"
layout: "minimal" | "detailed"
showRejectAll: boolean
primaryColor: string
}
async function getBannerVariant(tenantId: string, deviceId: string): Promise<BannerVariant> {
// Check if device already assigned to variant
const existingVariant = await redis.get(`ab:${tenantId}:${deviceId}`)
if (existingVariant) return JSON.parse(existingVariant)
// Get active experiment
const experiment = await db.experiments.findActive(tenantId)
if (!experiment) return DEFAULT_BANNER
// Deterministic assignment based on device ID
const hash = hashDeviceId(deviceId)
const bucket = hash % 100
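const variant = experiment.variants.find((v) => bucket >= v.startBucket && bucket < v.endBucket)
// Buckets outside every variant range fall back to the default banner
if (!variant) return DEFAULT_BANNER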
// Store assignment
await redis.setex(`ab:${tenantId}:${deviceId}`, 86400 * 30, JSON.stringify(variant))
return variant
}
// Track consent events for A/B analysis
async function trackConsentEvent(
tenantId: string,
deviceId: string,
variantId: string,
action: "accept_all" | "reject_all" | "customize" | "close",
): Promise<void> {
await analytics.track({
event: "consent_action",
properties: {
tenantId,
variantId,
action,
timestamp: Date.now(),
},
})
}
```
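The assignment code above calls `hashDeviceId` without defining it. A possible implementation is sketched below using an FNV-1a-style 32-bit hash over the ID's UTF-16 code units (identical to byte-wise FNV-1a for ASCII IDs such as `fp_...`). The choice of FNV-1a and the `VariantRange` shape are assumptions; any stable, well-distributed hash would serve.

```typescript
// FNV-1a-style 32-bit hash; deterministic, so a device always lands in the same bucket.
function hashDeviceId(deviceId: string): number {
  let hash = 0x811c9dc5 // FNV offset basis
  for (let i = 0; i < deviceId.length; i++) {
    hash ^= deviceId.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0 // FNV prime, kept as unsigned 32-bit
  }
  return hash
}

interface VariantRange {
  id: string
  startBucket: number // inclusive
  endBucket: number // exclusive
}

function assignVariant(deviceId: string, variants: VariantRange[]): VariantRange | undefined {
  const bucket = hashDeviceId(deviceId) % 100
  return variants.find((v) => bucket >= v.startBucket && bucket < v.endBucket)
}

const variants: VariantRange[] = [
  { id: "control", startBucket: 0, endBucket: 50 },
  { id: "minimal-bottom", startBucket: 50, endBucket: 100 },
]
const assigned = assignVariant("fp_xyz789", variants)
console.log(assigned?.id) // same variant on every call for this device
```

Because the bucket depends only on the device ID, the Redis entry in `getBannerVariant` acts as a cache rather than the source of truth; the assignment can be recomputed if the key expires, as long as the bucket ranges have not changed.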
**A/B test metrics:**
| Metric | Description | Target |
| ----------------- | ---------------------------------- | ------ |
| Consent rate | Users accepting any non-essential | 60-80% |
| Full consent rate | Users accepting all categories | 30-50% |
| Interaction rate | Users engaging with banner | 70-90% |
| Time to decision | Seconds from banner show to action | <10s |
## Conclusion
Cookie consent services must balance regulatory compliance, user experience, and technical performance:
1. **Edge-first architecture**: CDN-cached SDK and consent status enable sub-50ms consent checks. Origin servers handle only writes and cache misses.
2. **Multi-tenant isolation**: Tenant configuration cached at multiple layers. Each website has independent categories, regulations, and banner designs without infrastructure overhead.
3. **Anonymous-to-authenticated migration**: Device fingerprints track consent before login. On authentication, consent merges based on configurable strategies (most recent, most restrictive, or user choice).
4. **Regulatory flexibility**: Geo-detection applies GDPR, CCPA, or LGPD rules automatically. Tenant overrides handle edge cases and business requirements.
5. **Immutable audit trail**: Every consent change logged with full context. Partitioned tables and S3 archival support 7-year retention for regulatory audits.
**What this design optimizes for:**
- Read latency (sub-50ms consent checks)
- Regulatory compliance (audit trail, withdrawal support)
- Multi-tenant efficiency (shared infrastructure, isolated configuration)
- Anonymous user tracking (device fingerprint based)
**What it sacrifices:**
- Real-time consent accuracy (60s stale window acceptable)
- Write complexity (single primary region)
- Device fingerprint privacy (regulatory gray area)
**Known limitations:**
- Device fingerprint changes on browser update/reinstall
- Cross-device consent requires user authentication
- TCF 2.2 vendor list management adds operational complexity
- A/B testing may conflict with regulatory "don't manipulate consent" guidance
## Appendix
### Prerequisites
- Distributed systems fundamentals (caching, replication)
- Privacy regulations basics (GDPR, CCPA concepts)
- CDN and edge computing patterns
- Database sharding and read replicas
### Terminology
| Term | Definition |
| ------------------ | ----------------------------------------------------------------------------- |
| CMP | Consent Management Platform - system for collecting and managing user consent |
| TCF | Transparency & Consent Framework - IAB standard for consent communication |
| TC String | Transparency and Consent String - encoded consent record per TCF spec |
| Device Fingerprint | Browser/device characteristics hashed to create persistent identifier |
| GVL | Global Vendor List - IAB-maintained list of advertising vendors |
| DPO | Data Protection Officer - organization's privacy compliance lead |
### Summary
- **Edge-cached SDK** delivers consent checks in sub-50ms globally; writes route to single primary region
- **Multi-tenant architecture** isolates configuration per website while sharing infrastructure; tenant config cached at edge and Redis layers
- **Anonymous consent** tracked via device fingerprint before login; migrates to authenticated profile with configurable merge strategies
- **Regulation detection** uses MaxMind GeoIP to auto-apply GDPR, CCPA, or LGPD rules; tenant overrides support edge cases
- **Immutable audit log** records every consent change with full context; partitioned by month, archived to S3 after 1 year, retained 7 years
- **Read replicas per region** reduce global latency; eventual consistency (sub-60s) acceptable for consent use cases
### References
- [GDPR Article 6 - Lawfulness of Processing](https://gdpr.eu/article-6-how-to-process-personal-data-legally/) - Legal basis for consent
- [GDPR Article 7 - Conditions for Consent](https://gdpr.eu/article-7-how-to-get-consent-to-collect-personal-data/) - Consent requirements
- [IAB TCF 2.2 Specification](https://github.com/InteractiveAdvertisingBureau/GDPR-Transparency-and-Consent-Framework) - Technical specification
- [Google Consent Mode Documentation](https://developers.google.com/tag-platform/security/guides/consent) - Integration guide
- [CNIL Cookie Guidelines](https://www.cnil.fr/en/cookies-and-other-tracking-devices-cnil-publishes-new-guidelines) - French DPA recommendations
- [OneTrust Performance Architecture](https://www.onetrust.com/blog/global-cdn-asynchronous-loading/) - Industry reference
- [Cloudflare OneTrust Case Study](https://www.cloudflare.com/case-studies/onetrust/) - Scale benchmarks
- [ePrivacy Directive](https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX%3A32002L0058) - EU cookie law
- [CCPA Text](https://oag.ca.gov/privacy/ccpa) - California privacy law
---
## CRDTs for Collaborative Systems
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/core-distributed-patterns/crdt-for-collaborative-systems
**Category:** System Design / Core Distributed Patterns
**Description:** Conflict-free Replicated Data Types (CRDTs) are data structures mathematically guaranteed to converge to the same state across distributed replicas without coordination. They solve the fundamental challenge of distributed collaboration: allowing concurrent updates while ensuring eventual consistency without locking or consensus protocols. This article covers CRDT fundamentals, implementation variants, production deployments, and when to choose CRDTs over Operational Transformation (OT).
# CRDTs for Collaborative Systems
Conflict-free Replicated Data Types (CRDTs) are data structures mathematically guaranteed to converge to the same state across distributed replicas without coordination. They solve the fundamental challenge of distributed collaboration: allowing concurrent updates while ensuring eventual consistency without locking or consensus protocols.
This article covers CRDT fundamentals, implementation variants, production deployments, and when to choose CRDTs over Operational Transformation (OT).

CRDTs guarantee convergence through mathematical properties of the merge function—order of operations, network duplicates, and timing are irrelevant.
## Abstract
CRDTs guarantee convergence through **join-semilattice** properties: merge operations must be commutative, associative, and idempotent. This makes them fundamentally different from consensus-based systems—no coordination overhead, no leader election, no blocking on network partitions.
**Core mental model:**
- **State-based (CvRDT)**: Send full state, merge via lattice join. Simple semantics, any delivery order works. Cost: state transfer size.
- **Operation-based (CmRDT)**: Send operations only. Small messages, but requires exactly-once causal delivery. Cost: delivery infrastructure.
- **Delta-state**: Hybrid—send incremental state changes. Best of both when implemented correctly.
**Key insight**: The choice between variants is a **delivery infrastructure vs. merge complexity** tradeoff. State-based pushes complexity into the merge function and tolerates unreliable networks. Operation-based pushes complexity into the delivery layer and achieves smaller messages.
**Production reality**: Most systems use hybrid approaches. Figma uses operation-based with server ordering. Yjs and Automerge use optimized delta-state. Riak uses state-based with delta optimization.
## The Problem
### Why Naive Solutions Fail
**Approach 1: Last-Writer-Wins with Wall Clocks**
```typescript
// Naive LWW - seems simple
function merge<T>(a: { value: T; timestamp: number }, b: { value: T; timestamp: number }) {
  return a.timestamp > b.timestamp ? a : b
}
```
Fails because:
- **Clock skew**: Node A's clock is 5 seconds ahead. Its writes always win regardless of actual order.
- **Lost updates**: Two users edit simultaneously. One user's entire work disappears.
- **Non-deterministic**: When timestamps tie, the result depends on argument order, so replicas that process the same updates in a different order can end up with different values.
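To make these failure modes concrete, here is a minimal sketch that reuses the shape of the naive merge above. The `Stamped` type, the function name `naiveMerge`, and the sample timestamps are made up for illustration. It shows a fast clock winning against a later edit, and a timestamp tie producing different winners depending on argument order.

```typescript
// Hypothetical record type for illustration; mirrors the naive merge above.
interface Stamped<T> {
  value: T
  timestamp: number // wall-clock milliseconds (the source of the problem)
}

function naiveMerge<T>(a: Stamped<T>, b: Stamped<T>): Stamped<T> {
  return a.timestamp > b.timestamp ? a : b
}

// Clock skew: this edit happened first, but the node's clock runs 5 seconds fast.
const fromFastClock: Stamped<string> = { value: "older edit", timestamp: 1000 + 5000 }
const fromGoodClock: Stamped<string> = { value: "newer edit", timestamp: 2000 }
const skewWinner = naiveMerge(fromFastClock, fromGoodClock) // the older edit wins

// Tie: equal timestamps make the result depend on argument order.
const x: Stamped<string> = { value: "x", timestamp: 42 }
const y: Stamped<string> = { value: "y", timestamp: 42 }
const replica1 = naiveMerge(x, y) // picks y
const replica2 = naiveMerge(y, x) // picks x, so the replicas diverge
```

A deterministic tie-breaker (such as a node ID) fixes the divergence, but not the skew problem.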
**Approach 2: Locking/Pessimistic Concurrency**
Fails because:
- **Unavailable during partitions**: Can't acquire lock when network splits.
- **Latency penalty**: Every operation requires round-trip to lock server.
- **Deadlocks**: Distributed deadlock detection is complex and slow.
**Approach 3: Consensus on Every Write (Paxos/Raft)**
Fails because:
- **Unavailable during partitions**: Consensus requires majority quorum.
- **High latency**: Multiple round-trips per write (Paxos: 2 RTTs minimum).
- **Doesn't scale**: Every write must be acknowledged by a majority of nodes, and cluster membership must be known in advance. Open P2P scenarios are impractical.
### The Core Challenge
The fundamental tension: **strong consistency requires coordination, and coordination costs availability during partitions** (CAP theorem).
CRDTs resolve this by **designing data structures where concurrent operations commute**. Instead of preventing conflicts, they make it mathematically impossible for concurrent updates to produce divergent states.
## CRDT Foundations
### Join-Semilattice Mathematics
CRDTs are built on **join-semilattices**—partially ordered sets where any two elements have a least upper bound (join/merge).
**Formal requirements for merge function ⊔:**
| Property | Definition | Why It Matters |
| ----------- | ------------------------- | ---------------------------------------- |
| Commutative | A ⊔ B = B ⊔ A | Order of receiving updates is irrelevant |
| Associative | (A ⊔ B) ⊔ C = A ⊔ (B ⊔ C) | Grouping of merges is irrelevant |
| Idempotent | A ⊔ A = A | Duplicate messages are harmless |
**Monotonicity constraint**: Updates must be **inflations**—they can only move "up" in the lattice ordering. You can never return to a previous state.
> "Any state-based object satisfying the monotonic semilattice property is strongly eventually consistent."
> — Shapiro et al., "A Comprehensive Study of Convergent and Commutative Replicated Data Types" (2011)
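The three merge properties and the inflation constraint can be spot-checked mechanically. The sketch below does this for the simplest join there is, set union over a grow-only set. The helper names (`joinSets`, `sameSet`) and the sample sets are illustrative; a handful of sample values is a sanity check, not a proof.

```typescript
// Join for a grow-only set: plain set union.
function joinSets(a: ReadonlySet<string>, b: ReadonlySet<string>): Set<string> {
  const out = new Set<string>()
  a.forEach((item) => out.add(item))
  b.forEach((item) => out.add(item))
  return out
}

function sameSet(a: ReadonlySet<string>, b: ReadonlySet<string>): boolean {
  if (a.size !== b.size) return false
  let equal = true
  a.forEach((item) => {
    if (!b.has(item)) equal = false
  })
  return equal
}

const A = new Set(["a"])
const B = new Set(["b", "c"])
const C = new Set(["a", "d"])

const commutative = sameSet(joinSets(A, B), joinSets(B, A))
const associative = sameSet(joinSets(joinSets(A, B), C), joinSets(A, joinSets(B, C)))
const idempotent = sameSet(joinSets(A, A), A)
// Inflation: the join always contains each input, so state never moves "down".
const inflation = Array.from(A).every((item) => joinSets(A, B).has(item))
```

For a custom CRDT, the same checks can be run against randomly generated states in a property-based test.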
### Strong Eventual Consistency (SEC)
CRDTs provide **Strong Eventual Consistency**:
1. **Eventual delivery**: Every update eventually reaches every replica
2. **Convergence**: Replicas that have received the same updates are in identical states
3. **Termination**: All operations complete locally without blocking
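This is stronger than plain eventual consistency, which only promises that replicas converge at some point and may need conflict arbitration or rollbacks to get there. SEC guarantees **immediate local completion** and **deterministic convergence**: any two replicas that have received the same set of updates are already in the same state.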
## Design Paths
### Path 1: State-Based CRDTs (CvRDT)
**How it works:**
1. Each replica maintains full local state
2. Periodically, replicas exchange their complete state
3. Receiving replica merges incoming state with local state using join operation
4. Merge function satisfies semilattice properties
**When to choose this path:**
- Network is unreliable (messages may be lost, duplicated, reordered)
- State size is bounded or compressible
- Merge function is computationally cheap
- Simpler implementation is preferred over message efficiency
**Key characteristics:**
- Any gossip protocol works—no ordering guarantees needed
- Duplicate messages are automatically handled (idempotence)
- Full state must be transmitted on every sync
**G-Counter example (state-based):**
```typescript collapse={1-2}
type NodeId = string
interface GCounter {
  counts: Map<NodeId, number> // Each node tracks its own increment count
}
function increment(counter: GCounter, nodeId: NodeId): GCounter {
  const newCounts = new Map(counter.counts)
  newCounts.set(nodeId, (counter.counts.get(nodeId) ?? 0) + 1)
  return { counts: newCounts }
}
function merge(a: GCounter, b: GCounter): GCounter {
  // Pairwise maximum - satisfies all semilattice properties
  const merged = new Map<NodeId, number>()
  const allNodes = new Set([...a.counts.keys(), ...b.counts.keys()])
  for (const nodeId of allNodes) {
    merged.set(nodeId, Math.max(a.counts.get(nodeId) ?? 0, b.counts.get(nodeId) ?? 0))
  }
  return { counts: merged }
}
function value(counter: GCounter): number {
  return [...counter.counts.values()].reduce((sum, n) => sum + n, 0)
}
```
**Trade-offs:**
| Advantage | Disadvantage |
| --------------------------------------------- | ------------------------------------------------ |
| Simple delivery—any order, duplicates OK | State transfer can be expensive |
| Easy to reason about—state is self-describing | State can grow unbounded (actor IDs, tombstones) |
| Works over unreliable networks | Merge computation on every sync |
**Real-world: Riak** uses state-based CRDTs with delta optimization. They maintain full state but only transmit deltas when possible.
### Path 2: Operation-Based CRDTs (CmRDT)
**How it works:**
1. Each replica maintains local state
2. Operations are applied locally and broadcast to other replicas
3. Receiving replicas apply operations to their local state
4. Operations must commute when applied concurrently
**When to choose this path:**
- Reliable, exactly-once, causally-ordered delivery available
- Operations are small relative to state size
- Low-latency propagation is critical
- Can invest in delivery infrastructure
**Key characteristics:**
- Small message size (just the operation)
- Requires reliable causal broadcast—significant infrastructure investment
- Must handle late joiners (replay history or checkpoint + recent ops)
**G-Counter example (operation-based):**
```typescript
type Operation = { type: "increment"; nodeId: string; amount: number }
function apply(counter: number, op: Operation): number {
  // Operations must be delivered exactly once, in causal order
  return counter + op.amount
}
// The delivery layer must guarantee:
// 1. No duplicates
// 2. No lost messages
// 3. Causal ordering (if A caused B, A delivered before B)
```
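When the transport cannot promise exactly-once delivery, one common workaround is to make application idempotent by tagging each operation with a unique ID and ignoring IDs already seen. The sketch below assumes that approach. The `id` field, the `OpCounterState` shape, and the helper names are hypothetical additions, not part of the example above. Increments commute, so this particular counter needs no ordering guarantee once duplicates are filtered.

```typescript
// Hypothetical operation shape: the `id` field is an assumption added for deduplication.
interface IncrementOp {
  id: string // globally unique, e.g. `${nodeId}:${sequenceNumber}`
  nodeId: string
  amount: number
}

interface OpCounterState {
  total: number
  seen: Set<string> // IDs of operations already applied
}

function applyOnce(state: OpCounterState, op: IncrementOp): OpCounterState {
  if (state.seen.has(op.id)) return state // duplicate delivery: ignore
  const seen = new Set<string>()
  state.seen.forEach((id) => seen.add(id))
  seen.add(op.id)
  return { total: state.total + op.amount, seen }
}

function deliverAll(ops: IncrementOp[]): OpCounterState {
  let state: OpCounterState = { total: 0, seen: new Set<string>() }
  for (const op of ops) state = applyOnce(state, op)
  return state
}

const op1: IncrementOp = { id: "a:1", nodeId: "a", amount: 1 }
const op2: IncrementOp = { id: "b:1", nodeId: "b", amount: 2 }

// The network delivers op1 twice on one replica and reorders ops on the other.
const replicaX = deliverAll([op1, op2, op1])
const replicaY = deliverAll([op2, op1])
```

The `seen` set grows without bound in this form. Tracking the highest contiguous sequence number per node (a version vector) is the usual way to compress it.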
**Trade-offs:**
| Advantage | Disadvantage |
| -------------------------------- | ---------------------------------------- |
| Small messages (operations only) | Requires reliable causal delivery layer |
| Immediate propagation possible | Must track history for late joiners |
| Lower bandwidth | More complex reasoning about concurrency |
**Real-world: Figma** uses operation-based approach with their own delivery layer. Server provides ordering and validation. They invested heavily in transport infrastructure to achieve low-latency sync.
### Path 3: Delta-State CRDTs
**How it works:**
1. Track changes since last sync as "deltas"
2. Send only the delta (incremental state change), not full state
3. Deltas are themselves CRDTs—can be merged like states
4. Falls back to full state sync when deltas unavailable
**When to choose this path:**
- You want the small messages of op-based CRDTs but must tolerate an unreliable network, as state-based CRDTs do
- State is large but changes are typically small
- Can track sync points between replicas
**Key insight from research:**
> "Delta-state CRDTs combine the distributed nature of operation-based CRDTs with the uniquely simple model of state-based CRDTs."
> — Almeida et al., "Delta State Replicated Data Types" (2018)
**Trade-offs:**
| Advantage | Disadvantage |
| ----------------------------------- | ------------------------------ |
| Small messages in common case | Must track sync state per peer |
| Works over unreliable networks | Delta storage overhead |
| Falls back gracefully to full state | More complex implementation |
**Real-world: Yjs and Automerge** both use delta-state approaches with sophisticated compression.
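As a rough illustration of the delta idea, the sketch below applies it to a grow-only counter. The names (`DeltaCounter`, `incrementWithDelta`, `joinCounters`) are illustrative and this is not how Yjs or Automerge encode updates. The point is that a delta has the same shape as the state and is merged with the same join, so duplicated or reordered deltas stay harmless.

```typescript
type DeltaCounter = Record<string, number> // nodeId -> count; a delta has the same shape

// The same pairwise-max join merges full states and deltas alike.
function joinCounters(a: DeltaCounter, b: DeltaCounter): DeltaCounter {
  const out: DeltaCounter = { ...a }
  for (const nodeId of Object.keys(b)) {
    out[nodeId] = Math.max(out[nodeId] ?? 0, b[nodeId] ?? 0)
  }
  return out
}

// Increment locally and return both the new state and the delta to ship.
function incrementWithDelta(
  state: DeltaCounter,
  nodeId: string,
): { state: DeltaCounter; delta: DeltaCounter } {
  const delta: DeltaCounter = { [nodeId]: (state[nodeId] ?? 0) + 1 }
  return { state: joinCounters(state, delta), delta }
}

function total(counter: DeltaCounter): number {
  return Object.keys(counter).reduce((sum, nodeId) => sum + (counter[nodeId] ?? 0), 0)
}

let local: DeltaCounter = { a: 3, b: 7 }
const remote: DeltaCounter = { a: 3, b: 7 }
const step = incrementWithDelta(local, "a")
local = step.state
// Ship only step.delta ({ a: 4 }) instead of the full state; a duplicate delivery is harmless.
const remoteAfter = joinCounters(joinCounters(remote, step.delta), step.delta)
```

If a peer has missed deltas and the sender no longer holds them, sending the full state through the same `joinCounters` call is the fallback.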
### Comparison Matrix
| Factor | State-Based | Operation-Based | Delta-State |
| --------------------------- | ------------------ | -------------------- | -------------------- |
| Message size | Full state | Single operation | Incremental delta |
| Delivery requirement | Any (gossip OK) | Exactly-once, causal | Any |
| Late joiner handling | Send current state | Replay history | Send deltas or state |
| Implementation complexity | In merge function | In delivery layer | In delta tracking |
| Network partition tolerance | Excellent | Poor | Excellent |
| Typical latency | Higher (batched) | Lower (immediate) | Medium |
### Decision Framework

## Common CRDT Data Structures
### Counters
**G-Counter (Grow-Only):** Covered above. Foundation for other counters.
**PN-Counter (Positive-Negative):** Two G-Counters—one for increments, one for decrements.
```typescript collapse={1-8}
// Assumes the G-Counter functions above are grouped under a `GCounter` namespace
interface PNCounter {
  P: GCounter // Positive increments
  N: GCounter // Negative increments (decrements)
}
function increment(counter: PNCounter, nodeId: string): PNCounter {
  return { ...counter, P: GCounter.increment(counter.P, nodeId) }
}
function decrement(counter: PNCounter, nodeId: string): PNCounter {
  return { ...counter, N: GCounter.increment(counter.N, nodeId) }
}
function value(counter: PNCounter): number {
  return GCounter.value(counter.P) - GCounter.value(counter.N)
}
function merge(a: PNCounter, b: PNCounter): PNCounter {
  return {
    P: GCounter.merge(a.P, b.P),
    N: GCounter.merge(a.N, b.N),
  }
}
```
### Registers
**LWW-Register (Last-Writer-Wins):** Associates timestamp with each update. Highest timestamp wins.
```typescript
interface LWWRegister<T> {
  value: T
  timestamp: number // Lamport timestamp, NOT wall clock
  nodeId: string // Tie-breaker for equal timestamps
}
function merge<T>(a: LWWRegister<T>, b: LWWRegister<T>): LWWRegister<T> {
  if (a.timestamp > b.timestamp) return a
  if (b.timestamp > a.timestamp) return b
  // Equal timestamps: deterministic tie-breaker
  return a.nodeId > b.nodeId ? a : b
}
```
**Critical**: Use Lamport timestamps, not wall clocks. With wall clocks, a node whose clock runs fast wins every conflict regardless of the real order of writes.
**MV-Register (Multi-Value):** Preserves all concurrent values instead of picking winner.
```typescript collapse={1-4}
interface MVRegister<T> {
  values: Map<VectorClock, T> // All concurrent values, keyed by the clock of the write
}
function write<T>(reg: MVRegister<T>, value: T, clock: VectorClock): MVRegister<T> {
  // Remove all values that this write supersedes
  // (happenedBefore is the vector-clock comparison from the Causality Tracking section)
  const newValues = new Map<VectorClock, T>()
  for (const [vc, v] of reg.values) {
    if (!happenedBefore(vc, clock)) {
      newValues.set(vc, v) // Keep concurrent values
    }
  }
  newValues.set(clock, value)
  return { values: newValues }
}
function read<T>(reg: MVRegister<T>): T[] {
  return [...reg.values.values()] // May return multiple values
}
```
**Design decision**: MV-Register pushes conflict resolution to the application. Use when automatic resolution (LWW) loses important data.
### Sets
**G-Set (Grow-Only):** Elements can only be added, never removed.
**2P-Set (Two-Phase):** Add-set and remove-set. Element present if in add-set but not remove-set.
- **Limitation**: Once removed, element can never be re-added.
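A minimal sketch of a 2P-Set makes the re-add limitation visible. The names (`TwoPhaseSet`, `addTo2P`, and so on) and the string element type are chosen for illustration only.

```typescript
interface TwoPhaseSet {
  added: Set<string>
  removed: Set<string> // tombstones: once here, an element stays removed
}

function unionOf(a: Set<string>, b: Set<string>): Set<string> {
  const out = new Set<string>()
  a.forEach((item) => out.add(item))
  b.forEach((item) => out.add(item))
  return out
}

function addTo2P(set: TwoPhaseSet, element: string): TwoPhaseSet {
  return { added: unionOf(set.added, new Set([element])), removed: set.removed }
}

function removeFrom2P(set: TwoPhaseSet, element: string): TwoPhaseSet {
  // Only elements that were added can be removed.
  if (!set.added.has(element)) return set
  return { added: set.added, removed: unionOf(set.removed, new Set([element])) }
}

function merge2P(a: TwoPhaseSet, b: TwoPhaseSet): TwoPhaseSet {
  // Both halves are grow-only sets, so the merge is a pair of unions.
  return { added: unionOf(a.added, b.added), removed: unionOf(a.removed, b.removed) }
}

function has2P(set: TwoPhaseSet, element: string): boolean {
  return set.added.has(element) && !set.removed.has(element)
}

let cart: TwoPhaseSet = { added: new Set<string>(), removed: new Set<string>() }
cart = addTo2P(cart, "milk")
cart = removeFrom2P(cart, "milk")
cart = addTo2P(cart, "milk") // re-add has no visible effect: the tombstone wins
const milkPresent = has2P(cart, "milk") // false
```

The tombstone in `removed` can never be cleared by a merge, which is exactly why a removed element cannot come back.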
**OR-Set (Observed-Remove):** Most practical set CRDT. Each element tagged with unique ID. Remove only affects observed tags.
```typescript collapse={1-6}
type Tag = string // Globally unique (e.g., UUID or nodeId + sequence)
interface ORSet<T> {
  elements: Map<T, Set<Tag>> // Element -> set of live tags
  tombstones: Set<Tag> // Tags removed so far; without these, merge would resurrect removed elements
}
function add<T>(set: ORSet<T>, element: T, tag: Tag): ORSet<T> {
  const tags = new Set(set.elements.get(element) ?? [])
  tags.add(tag)
  const newElements = new Map(set.elements)
  newElements.set(element, tags)
  return { elements: newElements, tombstones: set.tombstones }
}
function remove<T>(set: ORSet<T>, element: T): ORSet<T> {
  // Tombstone only the tags we've "observed" - concurrent adds carry new tags and survive
  const observed = set.elements.get(element) ?? new Set<Tag>()
  const newElements = new Map(set.elements)
  newElements.delete(element)
  return { elements: newElements, tombstones: new Set([...set.tombstones, ...observed]) }
}
function merge<T>(a: ORSet<T>, b: ORSet<T>): ORSet<T> {
  const tombstones = new Set([...a.tombstones, ...b.tombstones])
  const merged = new Map<T, Set<Tag>>()
  const allElements = new Set([...a.elements.keys(), ...b.elements.keys()])
  for (const element of allElements) {
    const tagsA = a.elements.get(element) ?? new Set<Tag>()
    const tagsB = b.elements.get(element) ?? new Set<Tag>()
    // Union of tags, minus any tag that either replica has removed
    const liveTags = new Set([...tagsA, ...tagsB].filter((tag) => !tombstones.has(tag)))
    if (liveTags.size > 0) {
      merged.set(element, liveTags)
    }
  }
  return { elements: merged, tombstones }
}
function has<T>(set: ORSet<T>, element: T): boolean {
  return (set.elements.get(element)?.size ?? 0) > 0
}
```
**Add-wins semantics**: Concurrent add and remove → element present. This matches user intuition in most applications.
## Sequence CRDTs for Text Editing
Sequence CRDTs enable collaborative text editing. Each character gets a unique, ordered identifier that persists across all replicas.
### The Interleaving Problem
**The challenge**: Users A and B both type at position 5. A types "foo", B types "bar". Naive merge produces "fboaor" instead of "foobar" or "barfoo".
```
Initial: "Hello|World" (both cursors at position 5)
User A: "Hello|foo|World"
User B: "Hello|bar|World"
Naive: "HellofboaorWorld" ← Characters interleaved!
Correct: "HellofoobarWorld" or "HellobarfooWorld"
```
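The toy model below reproduces the "fboaor" outcome. It is not RGA, Logoot, or Fugue; it only assumes that each concurrently typed character carries a hypothetical `(replica, index)` pair and shows how the choice of sort key decides whether runs interleave.

```typescript
// Hypothetical per-character record: who typed it and its index within that user's run.
interface TypedChar {
  replica: string
  index: number
  char: string
}

function typeRun(replica: string, text: string): TypedChar[] {
  return text.split("").map((char, index) => ({ replica, index, char }))
}

function compareStrings(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0
}

// Naive ordering: compare per-character index first, so the two runs zip together.
function naiveOrder(a: TypedChar, b: TypedChar): number {
  return a.index - b.index || compareStrings(a.replica, b.replica)
}

// Run-preserving ordering: compare replica first, so each user's run stays contiguous.
function runOrder(a: TypedChar, b: TypedChar): number {
  return compareStrings(a.replica, b.replica) || a.index - b.index
}

function render(chars: TypedChar[], order: (a: TypedChar, b: TypedChar) => number): string {
  const inserted = chars.slice().sort(order).map((c) => c.char).join("")
  return "Hello" + inserted + "World"
}

const concurrentInserts = [...typeRun("A", "foo"), ...typeRun("B", "bar")]
const interleaved = render(concurrentInserts, naiveOrder) // "HellofboaorWorld"
const contiguous = render(concurrentInserts, runOrder) // "HellofoobarWorld"
```

Real sequence CRDTs face a harder version of this problem because characters arrive one at a time and each needs a stable position relative to its neighbors.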
### Major Algorithms
| Algorithm | Approach | Interleaving | ID Growth | Notes |
| ----------- | ----------------------------- | -------------- | --------- | ------------------------------- |
| RGA | Linked list + timestamps | Can interleave | Linear | Good general performance |
| Logoot/LSEQ | Fractional positions | Can interleave | Unbounded | Simple but IDs grow |
| Fugue | Designed for non-interleaving | Minimal | Bounded | Proven maximal non-interleaving |
| Eg-walker | Event graph replay | Minimal | Bounded | State-of-the-art performance |
**Fugue algorithm** (2023): Specifically designed to satisfy "maximal non-interleaving"—concurrent inserts at the same position are never interleaved.
> "We prove that Fugue satisfies a maximally strong non-interleaving property."
> — Weidner et al., "The Art of the Fugue" (2023)
### Rich Text: Peritext
**Peritext** handles inline formatting (bold, italic, links) in CRDTs.
**Key insight**: Formatting spans are linked to stable character identifiers, not positions. Formatting marks accumulate—remove marks don't delete add marks, they add counter-marks.
**Design decisions:**
- **Expand on edges**: When typing at the end of bold text, new characters inherit bold formatting
- **Anchor to characters**: Formatting boundaries attached to character IDs, not positions
- **Never delete marks**: Add "remove bold" marks rather than deleting "add bold" marks
## Production Implementations
### Figma: Server-Ordered Operations
**Context**: Real-time collaborative design tool. Millions of concurrent users editing complex documents.
**Implementation choices:**
- **Pattern variant**: Operation-based with server ordering
- **Architecture**: WebSocket to multiplayer servers; server is authoritative
- **Conflict resolution**: LWW per-property-per-object. Two users changing different properties don't conflict.
- **Persistence**: State in-memory, checkpointed to S3 every 30-60 seconds. Transaction log in DynamoDB.

**What worked:**
- Simplified conflict resolution by making server authoritative for ordering
- LWW per-property avoids complex merge logic for most operations
**What was hard:**
- Text editing required more sophisticated merging—adopted **Eg-walker** algorithm for code layers (2024)
- Handling network partitions while maintaining responsiveness
**Source**: [Figma Engineering Blog - How Figma's Multiplayer Technology Works](https://www.figma.com/blog/how-figmas-multiplayer-technology-works/)
### Yjs: Delta-State with Optimization
**Context**: Most popular CRDT library. 900k+ weekly npm downloads. Powers many collaborative editors.
**Implementation choices:**
- **Pattern variant**: Delta-state with sophisticated encoding
- **Architecture**: Network-agnostic. Works P2P, client-server, or hybrid.
- **Data model**: Shared types (Y.Map, Y.Array, Y.Text) that mirror JavaScript types
**Internal architecture:**

**Key optimizations:**
- **Block merging**: Consecutive operations from same client merged into single block
- **Efficient encoding**: V2 encoding influenced by Automerge research—run-length encoding for sequences
- **Deleted content removal**: Can delete content from deleted structs (tombstones kept, content discarded)
**What worked:**
- Network-agnostic design enables diverse deployment scenarios
- Provider architecture separates sync logic from data logic
**What was hard:**
- Garbage collection—tombstones accumulate over time
- Large documents can have significant memory overhead
**Source**: [Yjs Documentation](https://docs.yjs.dev/), [Yrs Architecture Deep Dive](https://www.bartoszsypytkowski.com/yrs-architecture/)
### Automerge: Local-First Philosophy
**Context**: Designed for local-first software where user's device is primary. "PostgreSQL for your local-first app."
**Implementation choices:**
- **Pattern variant**: State-based with optimized sync
- **Architecture**: Rust core, bindings for JS/WASM, C, Swift
- **Philosophy**: Automatic merging—no git-style merge conflicts exposed to users
**Key differentiators:**
- **Rigorous proofs**: Convergence verified with Isabelle theorem prover
- **Compact storage**: Sophisticated compression format
- **Deterministic conflict resolution**: Lamport timestamps + actor IDs ensure same result everywhere
**What worked:**
- Clean API hides CRDT complexity from application developers
- Excellent offline support out of the box
**What was hard:**
- Performance at scale—led to complete Rust rewrite
- Sync protocol complexity
**Source**: [Automerge Documentation](https://automerge.org/), [Local-First Software](https://www.inkandswitch.com/local-first/)
### Riak: Database-Level CRDTs
**Context**: Distributed key-value database. First production database to adopt CRDTs (2012).
**Implementation choices:**
- **Pattern variant**: State-based with delta optimization
- **Types supported**: Counters, Sets, Maps, HyperLogLogs (bucket-level); Flags, Registers (embedded)
- **Consistency**: Vector clocks for causality, optional causal consistency mode
**Production scale (League of Legends):**
- 7.5 million concurrent users
- 11,000 messages per second
- In-game chat powered by Riak CRDTs
**Challenges encountered:**
- Sets perform poorly for writes as cardinality grows
- Sets >500KB have issues—addressed with delta-replication and decomposition
- Recommended: decompose large sets into multiple entries
**Source**: [Riak Data Types Documentation](https://docs.riak.com/riak/kv/2.2.3/learn/concepts/crdts/index.html)
### Implementation Comparison
| Aspect | Figma | Yjs | Automerge | Riak |
| --------------- | ----------------- | ---------------- | ---------------- | ---------------- |
| Variant | Op-based (server) | Delta-state | State-based | State + delta |
| Architecture | Centralized | Any | P2P/local-first | Distributed DB |
| Offline support | Limited | Excellent | Excellent | N/A (server) |
| Rich text | Eg-walker | Native | Peritext | N/A |
| Maturity | Production | Production | Production | Production |
| Best for | Real-time SaaS | Editor libraries | Local-first apps | Key-value stores |
## Operational Concerns
### Garbage Collection
**The tombstone problem**: Deleted elements leave markers (tombstones) that must persist until all replicas have seen them. These accumulate over time.
**Why tombstones exist**: If replica A has tombstone for X, and replica B still has X, A must keep tombstone so X gets deleted when B syncs. Otherwise, X "resurrects."
**Strategies:**
| Strategy | Mechanism | Trade-off |
| --------------- | ------------------------------------------------------------- | ------------------------------------- |
| Epoch-based | Split into live/compacted portions at version vector boundary | Requires version vector tracking |
| Stability-based | Remove when update known to all replicas | Requires global knowledge |
| Time-based | Remove after `gc_grace_seconds` (Cassandra) | May resurrect if replica rejoins late |
| Consensus-based | Paxos/2PC to agree on removal | Defeats purpose of coordination-free |
**Production approach (Cassandra)**: `gc_grace_seconds` defaults to 10 days. Tombstones older than this are removed during compaction. Set based on expected node recovery time.
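For comparison with the time-based approach, here is a minimal sketch of the stability-based strategy from the table. It assumes each delete is identified by its originating node and sequence number, and that every replica periodically reports a version vector. The `Tombstone` and `VersionVector` shapes and the function names are hypothetical.

```typescript
type VersionVector = Record<string, number> // nodeId -> highest sequence number seen from that node

// Hypothetical tombstone record: identifies the delete by its origin node and sequence number.
interface Tombstone {
  elementId: string
  deletedBy: string
  seq: number
}

// A tombstone is stable once every known replica has acknowledged the delete.
function isStable(tombstone: Tombstone, acked: VersionVector[]): boolean {
  return acked.length > 0 && acked.every((vv) => (vv[tombstone.deletedBy] ?? 0) >= tombstone.seq)
}

function collectGarbage(tombstones: Tombstone[], acked: VersionVector[]): Tombstone[] {
  // Keep only tombstones that some replica may not have seen yet.
  return tombstones.filter((t) => !isStable(t, acked))
}

const tombstones: Tombstone[] = [
  { elementId: "x", deletedBy: "a", seq: 4 },
  { elementId: "y", deletedBy: "b", seq: 9 },
]
// Latest version vectors reported by each replica (including this one).
const ackedByReplicas: VersionVector[] = [
  { a: 5, b: 9 },
  { a: 4, b: 8 }, // this replica has not yet seen b's 9th event
]
const remaining = collectGarbage(tombstones, ackedByReplicas) // only the tombstone for "y" remains
```

The catch is the "every known replica" part: a replica that is offline for weeks blocks collection, which is the global-knowledge cost listed in the table.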
### Causality Tracking
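**Lamport timestamps**: Simple logical clock. Ordering is consistent with causality (if A happened before B, A's timestamp is smaller), but the timestamps alone cannot tell concurrent events apart from causally ordered ones.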
```typescript
// Lamport timestamp: increment on every event, max+1 on receive
let clock = 0
function localEvent(): number {
  return ++clock
}
function receiveEvent(remoteTimestamp: number): number {
  clock = Math.max(clock, remoteTimestamp) + 1
  return clock
}
```
**Vector clocks**: Full causality tracking. Can determine happened-before, happened-after, or concurrent.
```typescript collapse={1-4}
type VectorClock = Map<NodeId, number>
function increment(vc: VectorClock, nodeId: NodeId): VectorClock {
  const newVc = new Map(vc)
  newVc.set(nodeId, (vc.get(nodeId) ?? 0) + 1)
  return newVc
}
function merge(a: VectorClock, b: VectorClock): VectorClock {
  const merged = new Map<NodeId, number>()
  const allNodes = new Set([...a.keys(), ...b.keys()])
  for (const nodeId of allNodes) {
    merged.set(nodeId, Math.max(a.get(nodeId) ?? 0, b.get(nodeId) ?? 0))
  }
  return merged
}
function happenedBefore(a: VectorClock, b: VectorClock): boolean {
  // a < b iff all entries in a <= corresponding entry in b, and at least one <
  let hasLess = false
  for (const [nodeId, aTime] of a) {
    const bTime = b.get(nodeId) ?? 0
    if (aTime > bTime) return false
    if (aTime < bTime) hasLess = true
  }
  for (const [nodeId, bTime] of b) {
    if (!a.has(nodeId) && bTime > 0) hasLess = true
  }
  return hasLess
}
function concurrent(a: VectorClock, b: VectorClock): boolean {
  return !happenedBefore(a, b) && !happenedBefore(b, a)
}
```
**Cost**: O(N) space per operation where N is number of nodes. For large systems, consider hybrid logical clocks or bounded vector clocks.
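A hybrid logical clock keeps timestamps at constant size by pairing the largest physical time seen so far with a small logical counter. The sketch below follows the commonly described HLC update rules. The type and function names are illustrative, and the physical clock reading is passed in as `now` so the behavior is easy to test.

```typescript
interface HlcTimestamp {
  wall: number // largest physical time observed so far (ms)
  counter: number // logical counter that orders events sharing the same `wall` value
}

// Local or send event. `now` is the node's physical clock reading.
function hlcTick(clock: HlcTimestamp, now: number): HlcTimestamp {
  const wall = Math.max(clock.wall, now)
  return { wall, counter: wall === clock.wall ? clock.counter + 1 : 0 }
}

// Receive event: fold in the timestamp carried by the incoming message.
function hlcReceive(clock: HlcTimestamp, remote: HlcTimestamp, now: number): HlcTimestamp {
  const wall = Math.max(clock.wall, remote.wall, now)
  let counter = 0
  if (wall === clock.wall && wall === remote.wall) counter = Math.max(clock.counter, remote.counter) + 1
  else if (wall === clock.wall) counter = clock.counter + 1
  else if (wall === remote.wall) counter = remote.counter + 1
  return { wall, counter }
}

function hlcCompare(a: HlcTimestamp, b: HlcTimestamp): number {
  return a.wall - b.wall || a.counter - b.counter
}

// The sender's physical clock is 5 ms ahead of the receiver's.
const sent = hlcTick({ wall: 0, counter: 0 }, 105) // { wall: 105, counter: 0 }
const received = hlcReceive({ wall: 0, counter: 0 }, sent, 100) // { wall: 105, counter: 1 }
const next = hlcTick(received, 101) // { wall: 105, counter: 2 }
```

Even though the receiver's own clock reads earlier than the sender's, its timestamps still sort after the message it received, so causality is preserved. Like Lamport timestamps, HLCs do not detect concurrency.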
### Handling Late Joiners
**Challenge**: New replica needs full state. For large documents with long histories, this is expensive.
**Strategies:**
1. **Full state transfer**: Simple but costly. Works for state-based CRDTs.
2. **Checkpoint + recent ops**: Periodically snapshot state; new nodes get snapshot + ops since.
3. **Delta sync**: Track version vectors; send deltas since last known state.
**Eg-walker approach**: Store event graph on disk, document state as plain text. New joiners get current text + subset of event graph needed for future merges.
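The checkpoint-plus-recent-ops strategy can be sketched in a few lines. This example uses a deliberately simple stand-in for real operations (appending text) and assumes a gap-free, ordered sequence number per operation. The `AppendOp` and `Snapshot` shapes and the function names are hypothetical.

```typescript
interface AppendOp {
  seq: number // assumed gap-free and increasing
  text: string
}

interface Snapshot {
  seq: number // document state after applying all ops up to and including `seq`
  doc: string
}

function applyOps(doc: string, ops: AppendOp[]): string {
  let result = doc
  for (const op of ops) result += op.text
  return result
}

function takeSnapshot(previous: Snapshot, log: AppendOp[]): Snapshot {
  const pending = log.filter((op) => op.seq > previous.seq)
  const seq = pending.reduce((max, op) => Math.max(max, op.seq), previous.seq)
  return { seq, doc: applyOps(previous.doc, pending) }
}

// What a late joiner downloads: the latest snapshot plus only the ops after it.
function bootstrap(snapshot: Snapshot, log: AppendOp[]): string {
  return applyOps(snapshot.doc, log.filter((op) => op.seq > snapshot.seq))
}

const log: AppendOp[] = [
  { seq: 1, text: "a" },
  { seq: 2, text: "b" },
  { seq: 3, text: "c" },
]
const snap = takeSnapshot({ seq: 0, doc: "" }, log.slice(0, 2)) // { seq: 2, doc: "ab" }
const joinerDoc = bootstrap(snap, log) // "abc", replaying only op 3
const replayed = applyOps("", log) // "abc", replaying the full history
```

Once a snapshot exists, operations at or below its `seq` can be truncated from the log, as long as no connected replica still needs them.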
## CRDT vs Operational Transformation
### Historical Context
- **OT**: Invented late 1980s. Powers Google Docs. Requires central server for transformation.
- **CRDT**: First proposed 2006 (WOOT). Designed for decentralized/offline systems.
### Fundamental Difference
**OT**: Transform operations against each other to preserve intent.
```
User A: insert("X", 5) → transform against B's op → insert("X", 6)
User B: insert("Y", 3) → transform against A's op → insert("Y", 3)
```
**CRDT**: Design operations that commute naturally.
```
User A: insert("X", id_a) → apply directly
User B: insert("Y", id_b) → apply directly
Result: determined by ID ordering, not transformation
```
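The OT example above can be made concrete with a minimal insert-versus-insert transform for single characters. This is a sketch of the textbook rule, not the transform used by any particular product. The `InsertOp` shape and the `site` tie-breaker field are assumptions added for the example.

```typescript
interface InsertOp {
  char: string
  pos: number
  site: string // tie-breaker when both inserts target the same position
}

// Transform `op` so it can be applied after `applied` has already been applied.
function transformInsert(op: InsertOp, applied: InsertOp): InsertOp {
  const shifts = applied.pos < op.pos || (applied.pos === op.pos && applied.site < op.site)
  return shifts ? { ...op, pos: op.pos + 1 } : op
}

function applyInsert(doc: string, op: InsertOp): string {
  return doc.slice(0, op.pos) + op.char + doc.slice(op.pos)
}

const base = "0123456789"
const opA: InsertOp = { char: "X", pos: 5, site: "A" }
const opB: InsertOp = { char: "Y", pos: 3, site: "B" }

// One replica applies A and then the transformed B; the other applies B and then the transformed A.
const viaA = applyInsert(applyInsert(base, opA), transformInsert(opB, opA))
const viaB = applyInsert(applyInsert(base, opB), transformInsert(opA, opB))
// Both orders yield "012Y34X56789": A's insert shifts from 5 to 6, B's stays at 3.
```

Deletes, multi-character operations, and more than two concurrent sites are where transform functions become error-prone.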
### Comparison
| Aspect | OT | CRDT |
| ------------------------- | -------------------------------------------- | ---------------------------- |
| Architecture | Requires central server | Works P2P or centralized |
| Offline support | Poor (needs server) | Excellent |
| Intent preservation | Better (transforms designed for it) | Varies by algorithm |
| Implementation complexity | High (transform functions error-prone) | Moderate |
| Proven correctness | Difficult (many flawed algorithms published) | Mathematical proofs possible |
### When to Choose Each
**Choose OT when:**
- Always-online application with reliable connectivity
- Centralized architecture already exists
- Intent preservation is critical (e.g., cursor positions in concurrent edits)
- Using existing OT infrastructure (Google Docs API, ShareDB)
**Choose CRDT when:**
- Offline-first is required
- P2P or decentralized architecture
- Multi-device sync with unreliable networks
- Edge computing scenarios
- Want mathematical guarantees of convergence
### The Convergence: Eg-walker
Recent research (Eg-walker, 2024) combines benefits of both:
> "Eg-walker achieves order of magnitude less memory than existing CRDTs, orders of magnitude faster document loading, and orders of magnitude faster branch merging than OT—all while working P2P without a central server."
> — Gentle & Kleppmann, "Collaborative Text Editing with Eg-walker" (EuroSys 2025)
**Key insight**: Store the event graph (DAG of edits) on disk. Keep document state as plain text in memory. Build CRDT structure temporarily during merge, then discard.
## Common Pitfalls
### 1. Unbounded State Growth
**The mistake**: Not planning for tombstone/history cleanup.
**Example**: Notion-like app storing all operations. After 1 year, loading a page requires replaying 100K operations taking 30+ seconds.
**Solutions:**
- Periodic state snapshots with operation truncation
- Tombstone GC with grace period
- Compaction strategies (Eg-walker approach)
### 2. Assuming Strong Consistency
**The mistake**: Building features that assume immediate consistency.
**Example**: "Undo" feature that undoes "last operation"—fails with concurrent edits because "last" is ambiguous.
**Solutions:**
- Design for concurrent operations from the start
- Use causal consistency, not wall-clock ordering
- Undo should undo "my last operation," not "the last operation"
### 3. Clock Skew with LWW
**The mistake**: Using wall clocks for LWW timestamps.
**Example**: Server with clock 5 seconds ahead always wins conflicts, even against more recent actual edits.
**Solutions:**
- Use Lamport timestamps or vector clocks
- If wall clocks required, use hybrid logical clocks (HLC)
### 4. Large Set Operations
**The mistake**: Using OR-Set for large, frequently-modified collections.
**Example (Riak)**: OR-Sets >500KB have significant performance issues—each element carries metadata.
**Solutions:**
- Decompose into multiple smaller sets
- Use a purpose-built CRDT for the job (for example, a counter rather than a set when you only need a count)
- Consider application-level sharding
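One way to decompose a large set is to hash each element to one of a fixed number of smaller sets, each stored under its own key. The sketch below shows only the key-selection step. The key layout (`name:shard:n`), the function names, and the choice of an FNV-1a style hash over UTF-16 code units are assumptions for illustration.

```typescript
// Deterministic 32-bit string hash (FNV-1a style) so every replica picks the same shard.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash
}

// Hypothetical key layout: each key would hold its own small set CRDT.
function shardKey(setName: string, element: string, shardCount: number): string {
  return `${setName}:shard:${fnv1a(element) % shardCount}`
}

const exampleKey = shardKey("followers", "user-42", 16)
```

Membership checks and updates touch a single shard, while enumerating the whole set means reading every shard. The shard count has to stay fixed, or elements must be re-sharded when it changes.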
### 5. Ignoring Merge Complexity
**The mistake**: Assuming merge is always fast.
**Example**: State-based CRDT with O(n²) merge. Works fine with 100 elements, falls over with 10,000.
**Solutions:**
- Analyze merge complexity during design
- Use delta-state to reduce merge frequency
- Profile with realistic data sizes
## Implementation Guide
### Starting Point Decision

### Library Recommendations
| Library | Language | Best For | Maturity |
| --------------------- | -------- | ------------------------- | ----------------------------------- |
| Yjs | JS/TS | Collaborative editing | Production (900k+ weekly downloads) |
| Automerge | JS/Rust | Local-first apps | Production |
| Loro | Rust/JS | Rich text, moveable trees | Production |
| Diamond Types | Rust | High-performance text | Production |
| Akka Distributed Data | JVM | Actor-based systems | Production |
| riak_dt | Erlang | Key-value stores | Production |
### Building Custom: Checklist
Only build custom when existing libraries genuinely don't fit:
- [ ] Define merge semantics precisely before coding
- [ ] Prove commutativity/associativity/idempotence mathematically
- [ ] Handle clock skew (use logical clocks)
- [ ] Plan garbage collection strategy upfront
- [ ] Test with network partition simulation (e.g., Jepsen)
- [ ] Test with artificial latency injection
- [ ] Benchmark with realistic data sizes
- [ ] Consider formal verification (TLA+, Isabelle)
## Conclusion
CRDTs provide a mathematically rigorous solution to distributed collaboration. The semilattice properties—commutative, associative, idempotent merge—guarantee convergence without coordination.
**Key decisions:**
1. **State vs operation vs delta**: Trade delivery complexity against message size
2. **Data structure selection**: Match CRDT type to application semantics (counters, sets, sequences)
3. **Build vs buy**: Use existing libraries unless you have exceptional requirements and CRDT expertise
**The ecosystem is mature**: Yjs, Automerge, and Loro are production-ready. Recent algorithms (Fugue, Eg-walker) solve longstanding problems like interleaving and state growth.
**Start simple**: LWW-Register and OR-Set cover most use cases. Graduate to sequence CRDTs only when building collaborative text editing.
## Appendix
### Prerequisites
- Distributed systems fundamentals (network partitions, consistency models)
- CAP theorem and eventual consistency
- Basic understanding of partial orders and lattice theory helpful but not required
### Terminology
| Term | Definition |
| --------------------- | --------------------------------------------------------------------------------------------- |
| **CvRDT** | Convergent (state-based) CRDT |
| **CmRDT** | Commutative (operation-based) CRDT |
| **Tombstone** | Marker indicating deleted element; must persist for convergence |
| **Vector clock** | Logical clock tracking causality across multiple nodes |
| **Lamport timestamp** | Simple scalar logical clock; consistent with causality but cannot detect concurrent events |
| **Join-semilattice** | Set with a join (merge) operation that is commutative, associative, idempotent |
| **SEC** | Strong Eventual Consistency—replicas converge to identical state after receiving same updates |
| **OT** | Operational Transformation—alternative approach that in practice relies on a central server to order operations |
### Summary
- CRDTs guarantee convergence through **join-semilattice properties**: merge must be commutative, associative, and idempotent
- **State-based** CRDTs tolerate unreliable networks but transfer full state; **operation-based** require reliable causal delivery but send small messages; **delta-state** is the practical hybrid
- **OR-Set** is the most practical set CRDT; **Fugue** and **Eg-walker** are state-of-the-art for text editing
- Production systems (Figma, Yjs, Automerge, Riak) all use hybrid approaches optimized for their use cases
- **Garbage collection** is the primary operational challenge—plan tombstone strategy from the start
- **Use existing libraries** (Yjs, Automerge, Loro) unless requirements are exceptional
### References
**Foundational Papers:**
- [A Comprehensive Study of Convergent and Commutative Replicated Data Types](https://inria.hal.science/inria-00555588/document) - Shapiro et al., INRIA 2011. The definitive CRDT reference.
- [Conflict-free Replicated Data Types (SSS 2011)](https://link.springer.com/chapter/10.1007/978-3-642-24550-3_29) - Shapiro et al. Conference paper version.
- [Delta State Replicated Data Types](https://arxiv.org/abs/1603.01529) - Almeida et al. Delta-state CRDT formalization.
- [Pure Operation-Based Replicated Data Types](https://arxiv.org/abs/1710.04469) - Baquero & Shoker. Pure op-based CRDTs.
**Text Editing:**
- [The Art of the Fugue: Minimizing Interleaving in Collaborative Text Editing](https://arxiv.org/pdf/2305.00583) - Gentle et al., 2023. Fugue algorithm.
- [Collaborative Text Editing with Eg-walker](https://arxiv.org/abs/2409.14252) - Gentle & Kleppmann, EuroSys 2025. State-of-the-art performance.
- [Peritext: A CRDT for Rich Text](https://www.inkandswitch.com/peritext/) - Litt et al. Rich text formatting in CRDTs.
**Production Implementations:**
- [How Figma's Multiplayer Technology Works](https://www.figma.com/blog/how-figmas-multiplayer-technology-works/) - Figma Engineering Blog.
- [Yjs Documentation](https://docs.yjs.dev/) - Yjs official docs.
- [Automerge](https://automerge.org/) - Automerge project site.
- [Riak Data Types](https://docs.riak.com/riak/kv/2.2.3/learn/concepts/crdts/index.html) - Riak CRDT documentation.
- [Loro](https://loro.dev/) - Loro CRDT library.
**Additional Resources:**
- [crdt.tech](https://crdt.tech/) - Official CRDT resources, papers, and implementations list.
- [Local-First Software](https://www.inkandswitch.com/local-first/) - Ink & Switch essay on local-first architecture.
- [CRDT Glossary](https://crdt.tech/glossary) - Standard terminology definitions.
---
## Operational Transformation
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/core-distributed-patterns/operational-transformation
**Category:** System Design / Core Distributed Patterns
**Description:** Deep-dive into Operational Transformation (OT): the algorithm powering Google Docs, with its design variants, correctness properties, and production trade-offs. OT enables real-time collaborative editing by transforming concurrent operations so that all clients converge to the same document state. Despite being the foundation of nearly every production collaborative editor since 1995, OT has a troubled academic history—most published algorithms were later proven incorrect. This article covers why OT is hard, which approaches actually work, and how production systems sidestep the theoretical pitfalls.
# Operational Transformation
Deep-dive into Operational Transformation (OT): the algorithm powering Google Docs, with its design variants, correctness properties, and production trade-offs.
OT enables real-time collaborative editing by transforming concurrent operations so that all clients converge to the same document state. Despite being the foundation of nearly every production collaborative editor since 1995, OT has a troubled academic history—most published algorithms were later proven incorrect. This article covers why OT is hard, which approaches actually work, and how production systems sidestep the theoretical pitfalls.

OT core flow: concurrent operations are transformed against each other to preserve intent while achieving convergence.
## Abstract
OT solves the problem of concurrent edits in collaborative systems through a deceptively simple idea: when operations conflict, transform them so they can both apply correctly. The core mental model:
- **Operations** (insert, delete) carry position information that becomes stale when concurrent edits occur
- **Transformation functions** adjust an operation's position based on another operation that happened concurrently
- **TP1** (convergence): applying O1 then T(O2,O1) must equal applying O2 then T(O1,O2)—both paths reach the same state
- **TP2** (commutativity): transforming O3 against different orderings of O1,O2 must yield the same result—this property is notoriously hard to satisfy and most published algorithms fail it
- **Production systems avoid TP2** by using a central server to impose a canonical operation order, making the math tractable at the cost of requiring server round-trips
The key insight: practical OT is not distributed. Google Docs, Google Wave, and nearly all production systems use client-server architecture where the server decides operation order, eliminating the need for TP2 satisfaction.
## The Problem
### Why Naive Solutions Fail
**Approach 1: Last-write-wins**
Two users edit "hello" concurrently:
- User A inserts "x" at position 0 → "xhello"
- User B deletes character at position 4 → "hell"
With last-write-wins, one edit is lost. Collaborative editors cannot discard user intent.
**Approach 2: Locking**
Lock the document (or region) during edits. Users see "Document locked by User A" and must wait.
Fails because: latency makes locking unusable. With 100ms round-trip, a user typing 5 characters/second would spend most of their time waiting for locks. Google Docs reports 40+ concurrent editors on popular documents—locking would serialize all edits.
**Approach 3: Apply operations as-received**
User A and B both start with "abc":
- A: Insert 'x' at position 0 → sends Ins(0,'x')
- B: Delete at position 2 → sends Del(2)
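Each replica applies its own operation first and then the other's, unmodified. A goes from "abc" to "xabc", then applies Del(2), which now removes 'b' instead of 'c', leaving "xac". B goes from "abc" to "ab", then applies Ins(0,'x'), leaving "xab". Result: "xac" for A, "xab" for B. **Divergence.**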
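A minimal sketch of this scenario traces both replicas when each applies the remote operation exactly as received. The `Op` type and `apply` helper are illustrative assumptions, not taken from any library.

```typescript
// Single-character operations on a plain string.
type Op = { kind: "ins"; pos: number; ch: string } | { kind: "del"; pos: number }

function apply(doc: string, op: Op): string {
  return op.kind === "ins"
    ? doc.slice(0, op.pos) + op.ch + doc.slice(op.pos)
    : doc.slice(0, op.pos) + doc.slice(op.pos + 1)
}

const insX: Op = { kind: "ins", pos: 0, ch: "x" } // User A's edit
const delC: Op = { kind: "del", pos: 2 } // User B's edit, aimed at 'c'

// Each replica applies its own operation first, then the remote one unchanged.
const replicaA = apply(apply("abc", insX), delC) // "abc" -> "xabc" -> "xac"
const replicaB = apply(apply("abc", delC), insX) // "abc" -> "ab" -> "xab"
const diverged = replicaA !== replicaB
```

On replica A the delete lands on 'b' because the insert shifted every character one position to the right; nothing in `Del(2)` records that it was aimed at 'c'.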
### The Core Challenge
The fundamental tension: **preserving user intent while achieving convergence across all replicas**.
Each operation encodes intent relative to a specific document state. When operations are concurrent (neither causally depends on the other), their positions become meaningless without transformation. OT exists to adjust operations so they execute their intended effect regardless of the order they're applied.
## Pattern Overview
### Core Mechanism
OT works by defining **transformation functions** that take two concurrent operations and produce adjusted versions:
```
T(O1, O2) → O1' // O1 transformed against O2
T(O2, O1) → O2' // O2 transformed against O1
```
The transformed operations are designed so that:
- Starting from state S
- Applying O1 then O2' yields the same state as
- Applying O2 then O1'
For string operations, transformation typically adjusts positions:
```typescript
// Single-character insert; `site` is a unique replica ID used only for tie-breaking
type Insert = { pos: number; char: string; site: number }

// Transform insert against insert
function transformInsertInsert(op1: Insert, op2: Insert): Insert {
  if (op1.pos < op2.pos) {
    return op1 // No adjustment needed
  } else if (op1.pos > op2.pos) {
    return { ...op1, pos: op1.pos + 1 } // Shift right
  } else {
    // Both insert at the same position: break the tie by site ID.
    // Exactly one of the two operations shifts, so both paths converge.
    return op1.site < op2.site ? op1 : { ...op1, pos: op1.pos + 1 }
  }
}

// Transform delete against insert
function transformDeleteInsert(del: { pos: number }, ins: { pos: number; char: string }): { pos: number } {
  if (del.pos < ins.pos) {
    return del // Insert landed after the deleted character
  } else {
    return { pos: del.pos + 1 } // Insert landed at or before it: shift right
  }
}
```
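To see the convergence property at work, the sketch below applies two concurrent operations in both orders and compares the results. It is a self-contained illustration under stated assumptions: single-character operations, a hypothetical `site` field for tie-breaking (lower ID keeps its position), and a `noop` variant for the case where both sites delete the same character.

```typescript
type Op =
  | { kind: "ins"; pos: number; ch: string; site: number }
  | { kind: "del"; pos: number; site: number }
  | { kind: "noop" }

function apply(doc: string, op: Op): string {
  switch (op.kind) {
    case "ins":
      return doc.slice(0, op.pos) + op.ch + doc.slice(op.pos)
    case "del":
      return doc.slice(0, op.pos) + doc.slice(op.pos + 1)
    case "noop":
      return doc
  }
}

// Returns `a` adjusted so it can run after `b` has already been applied.
function transform(a: Op, b: Op): Op {
  if (a.kind === "noop" || b.kind === "noop") return a
  if (a.kind === "ins" && b.kind === "ins") {
    // Same-position inserts: the lower site ID keeps its position (assumed rule).
    const aFirst = a.pos < b.pos || (a.pos === b.pos && a.site < b.site)
    return aFirst ? a : { ...a, pos: a.pos + 1 }
  }
  if (a.kind === "ins" && b.kind === "del") {
    return a.pos <= b.pos ? a : { ...a, pos: a.pos - 1 }
  }
  if (a.kind === "del" && b.kind === "ins") {
    return a.pos < b.pos ? a : { ...a, pos: a.pos + 1 }
  }
  // Both delete: identical targets collapse to a no-op.
  if (a.pos === b.pos) return { kind: "noop" }
  return a.pos < b.pos ? a : { ...a, pos: a.pos - 1 }
}

// Apply o1 then o2', and o2 then o1', starting from the same document.
function bothPaths(doc: string, o1: Op, o2: Op): [string, string] {
  const viaO1 = apply(apply(doc, o1), transform(o2, o1))
  const viaO2 = apply(apply(doc, o2), transform(o1, o2))
  return [viaO1, viaO2]
}

const insertVsDelete = bothPaths(
  "abc",
  { kind: "ins", pos: 0, ch: "x", site: 1 },
  { kind: "del", pos: 2, site: 2 },
)
const samePositionInserts = bothPaths(
  "ab",
  { kind: "ins", pos: 1, ch: "x", site: 1 },
  { kind: "ins", pos: 1, ch: "y", site: 2 },
)
```

Both elements of each pair are equal: the first case yields "xab" on either path and the second yields "axyb". This checks pairs of operations only; it says nothing about three or more concurrent operations.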
### Key Invariants
1. **Convergence (TP1)**: All sites reach the same document state after applying all operations, regardless of arrival order
2. **Causality preservation**: Operations are applied respecting their causal dependencies (Lamport's happened-before)
3. **Intention preservation**: The effect of an operation matches what the user intended, even after transformation
### Failure Modes
| Failure | Impact | Mitigation |
| ------------------- | ---------------------------------------- | ------------------------------------------------------- |
| TP1 violation | Documents diverge permanently | Formal verification of transformation functions |
| TP2 violation | Divergence with 3+ concurrent operations | Use server-ordered architecture (avoid TP2 requirement) |
| Causality violation | Operations apply out of order | Vector clocks or server-assigned sequence numbers |
| Transformation bugs | Silent corruption, divergence | Checksums, periodic reconciliation |
## Design Paths
### Path 1: Client-Server OT (Jupiter/Wave Model)
**When to choose this path:**
- Network connectivity is reliable
- Latency tolerance allows server round-trips
- Simplicity is prioritized over offline capability
- Team wants proven, production-tested approach
**How it works:**
The server maintains a single canonical operation history. Clients send operations to the server, which:
1. Transforms incoming operations against any operations the client hasn't seen
2. Assigns a sequence number
3. Broadcasts the transformed operation to all clients
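The three server steps can be sketched as a small class. This is an illustrative outline, not the Jupiter or Wave implementation: `OtServer`, `receive`, and `baseRevision` are hypothetical names, the transformation function is injected, and broadcasting is left to the caller.

```typescript
// Transforms `incoming` so it can apply after `applied`.
type Transform<Op> = (incoming: Op, applied: Op) => Op

class OtServer<Op> {
  private history: Op[] = []
  private transform: Transform<Op>

  constructor(transform: Transform<Op>) {
    this.transform = transform
  }

  // `baseRevision` is the number of server operations the client had seen
  // when it created `op`.
  receive(op: Op, baseRevision: number): { op: Op; revision: number } {
    if (baseRevision < 0 || baseRevision > this.history.length) {
      throw new Error("unknown base revision")
    }
    // Step 1: transform against everything the client has not seen.
    let transformed = op
    for (const concurrent of this.history.slice(baseRevision)) {
      transformed = this.transform(transformed, concurrent)
    }
    // Step 2: appending to the log assigns the sequence number.
    this.history.push(transformed)
    // Step 3: the caller broadcasts this result to all clients.
    return { op: transformed, revision: this.history.length }
  }
}

// Usage with insert-only operations and a simple shift rule.
type Ins = { pos: number; ch: string }
const shift: Transform<Ins> = (a, b) => (a.pos < b.pos ? a : { ...a, pos: a.pos + 1 })

const server = new OtServer<Ins>(shift)
const first = server.receive({ pos: 0, ch: "x" }, 0)
const second = server.receive({ pos: 2, ch: "y" }, 0) // concurrent with `first`
```

Here `second` was created against revision 0, so the server shifts it from position 2 to position 3 before appending it as revision 2.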
Clients maintain three states:
- **Server state**: Last acknowledged server revision
- **Pending operations**: Sent to server, awaiting acknowledgment
- **Local operations**: Not yet sent (buffered during server round-trip)

Client-server OT: server assigns canonical ordering, clients transform against operations they haven't seen.
**Critical design choice**: Clients must wait for acknowledgment before sending the next operation batch. Google Wave enforced this strictly—during the wait, clients buffer local operations and send them in bulk after ACK.
**Why this avoids TP2:**
TP2 is required when operations can be transformed along different paths (different orderings of concurrent operations). With a central server:
- The server decides the canonical order
- All transformations follow a single path
- TP2 never comes into play
**Trade-offs vs other paths:**
| Aspect | Client-Server | Peer-to-Peer |
| ------------------------- | ------------------------------ | ----------------------- |
| TP2 requirement | Not needed | Required |
| Correctness proofs | Straightforward | Notoriously difficult |
| Server dependency | Single point of failure | No central dependency |
| Latency | Round-trip per operation batch | Direct client-to-client |
| Offline support | Limited (buffer operations) | Native |
| Implementation complexity | Moderate | Very high |
**Real-world example:**
Google Docs uses client-server OT derived from the Jupiter/Wave codebase. According to Google's engineering blog, every character change is saved as an event in a revision log. The document is rendered by replaying the log from the start.
Joseph Gentle (ShareJS author, ex-Google Wave): "Whatever the bugs are, I don't think they were ever fixed in the opensource [Wave] version. It's all just too complicated... You're right about OT—it gets crazy complicated if you implement it in a distributed fashion. But implementing it in a centralized fashion is actually not so bad."
### Path 2: Peer-to-Peer OT (adOPTed, GOTO)
**When to choose this path:**
- Offline-first requirement is critical
- No server infrastructure available
- Academic research context
- Willing to accept significant complexity
**How it works:**
Each site maintains its own operation history and transforms incoming operations against local history. Without a central server, sites can receive operations in different orders, requiring TP2 satisfaction.
The adOPTed algorithm (Ressel et al. 1996) introduced an N-dimensional interaction graph to track all valid transformation paths. Each operation is placed in this graph based on state vectors, and transformations follow edges to find equivalent operations.
**The TP2 problem:**
For three concurrent operations O1, O2, O3, TP2 requires:
```
T(O3, O1 ∘ T(O2, O1)) ≡ T(O3, O2 ∘ T(O1, O2))
```
Meaning: transforming O3 past O1-then-O2' must equal transforming O3 past O2-then-O1'.
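A small counterexample makes the requirement concrete. The sketch below uses position-only operations and an assumed tie-breaking rule (the lower `site` ID keeps its position); it transforms O3 along both paths and compares the outcomes. The names and the rule are illustrative, not taken from any of the published algorithms.

```typescript
type Op = { kind: "ins" | "del" | "noop"; pos: number; site: number }

// Returns `a` adjusted so it can run after `b` has already been applied.
function transform(a: Op, b: Op): Op {
  if (a.kind === "noop" || b.kind === "noop") return a
  if (b.kind === "ins") {
    const aStays = a.pos < b.pos || (a.kind === "ins" && a.pos === b.pos && a.site < b.site)
    return aStays ? a : { ...a, pos: a.pos + 1 }
  }
  // b is a delete
  if (a.kind === "del" && a.pos === b.pos) return { ...a, kind: "noop" as const }
  const aStays = a.kind === "ins" ? a.pos <= b.pos : a.pos < b.pos
  return aStays ? a : { ...a, pos: a.pos - 1 }
}

// Three concurrent operations on "abc" (0-indexed positions).
const o1: Op = { kind: "ins", pos: 2, site: 1 } // insert between 'b' and 'c'
const o2: Op = { kind: "del", pos: 1, site: 2 } // delete 'b'
const o3: Op = { kind: "ins", pos: 1, site: 3 } // insert between 'a' and 'b'

// Path 1: O3 transformed past O1, then past O2'.
const viaO1First = transform(transform(o3, o1), transform(o2, o1))
// Path 2: O3 transformed past O2, then past O1'.
const viaO2First = transform(transform(o3, o2), transform(o1, o2))

const tp2Holds = viaO1First.pos === viaO2First.pos
```

Along the first path O3 stays at position 1; along the second it is shifted to position 2, because deleting 'b' makes O1 and O3 look like same-position inserts and the tie-break then moves O3. These functions satisfy TP1 for each pair yet violate TP2 for this triple.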
**Why it fails in practice:**
Randolph et al. (2013) proved formally: "There is no IT function, based on classical parameters of delete and insert operations, which satisfies both TP1 and TP2."
Using controller synthesis techniques, they showed that position and character parameters are necessary but not sufficient. Every published TP2-claiming algorithm was subsequently shown to have counterexamples:
| Algorithm | Year | TP2 Claim | Status |
| --------- | ---- | ---------------- | ------------------- |
| dOPT | 1989 | Implied | Disproven |
| adOPTed | 1996 | Asserted | Disproven |
| SOCT2 | 1998 | "Proven" | Disproven |
| SDT | 2002 | Claimed | Disproven |
| IMOR | 2003 | Machine-verified | Invalid assumptions |
Raph Levien: "For a decade or so, TP2 was something of a philosopher's stone, with several alchemists of OT claiming that they had found a set of transforms satisfying TP2, only for counterexamples to emerge later."
**Trade-offs:**
- ✅ No server dependency
- ✅ Offline operation is native
- ✅ Lower latency (direct peer communication)
- ❌ No production-proven correct implementation exists
- ❌ Extreme implementation complexity
- ❌ Debugging distributed state divergence is brutal
**Real-world example:**
Google Wave attempted federated (peer-to-peer between servers) OT. From Joseph Gentle: "We got it working, kinda, but it was complex and buggy. We ended up with a scheme where every wave would arrange a tree of wave servers... But it never really worked."
Wave was eventually shut down; its federation protocol remains unproven in production.
### Path 3: Tombstone Transformation Functions (TTF)
**When to choose this path:**
- Need provably correct transformation functions
- Can tolerate memory overhead
- Building a new implementation from scratch
**How it works:**
TTF (Oster et al. 2006) sidesteps the TP2 impossibility by changing the data model. Instead of a string, the document is a sequence of (character, visible) pairs. Deletes set visible=false but retain the character as a tombstone.
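```typescript
type UniqueId = string // Assumed: any unique, totally ordered identifier works

interface TombstoneChar {
  char: string
  visible: boolean
  id: UniqueId // For ordering
}
type Document = TombstoneChar[]

// Mark the pos-th *visible* character as deleted (mutates and returns doc)
function deleteAt(doc: Document, pos: number): Document {
  let visibleCount = 0
  for (let i = 0; i < doc.length; i++) {
    if (doc[i].visible) {
      if (visibleCount === pos) {
        doc[i].visible = false // Keep the character as a tombstone
        return doc
      }
      visibleCount++
    }
  }
  return doc // pos is past the last visible character: nothing to delete
}
```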
**Why this satisfies TP2:**
With tombstones, positions are stable. A deleted character still occupies its logical position, so concurrent operations don't shift each other's targets unpredictably. The paper includes a theorem prover verification that TTF satisfies both TP1 and TP2.
**Trade-offs:**
- ✅ Provably correct (machine-verified)
- ✅ Simpler transformation logic
- ❌ Unbounded memory growth (tombstones accumulate)
- ❌ Need garbage collection strategy
- ❌ Operations may reference expired tombstones
**Garbage collection challenge:**
Tombstones can be removed only when all sites have seen all operations that reference them. This requires distributed garbage collection—itself a hard problem. Production systems typically use periodic snapshots with tombstone truncation, accepting that very late operations may fail.
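The "all sites have seen it" condition can be approximated with acknowledged revisions. The sketch below assumes server-assigned revision numbers starting at 1 and a map of the last revision each known site has acknowledged; `Tombstone`, `stableRevision`, and `collectable` are hypothetical names. A site missing from the map (for example, one that has been offline for weeks) is exactly the case this scheme cannot protect.

```typescript
interface Tombstone {
  id: string
  deletedAtRevision: number // Revision of the delete that created the tombstone
}

// Highest revision that every known site has acknowledged.
// With no known sites, returns 0 so that nothing is considered stable.
function stableRevision(ackedBySite: Map<string, number>): number {
  let min = Infinity
  ackedBySite.forEach((revision) => {
    min = Math.min(min, revision)
  })
  return min === Infinity ? 0 : min
}

// Tombstones whose delete has been seen by every known site.
function collectable(tombstones: Tombstone[], ackedBySite: Map<string, number>): Tombstone[] {
  const stable = stableRevision(ackedBySite)
  return tombstones.filter((t) => t.deletedAtRevision <= stable)
}

const acks = new Map<string, number>([
  ["site-a", 10],
  ["site-b", 7],
  ["site-c", 12],
])
const purgeable = collectable(
  [
    { id: "t1", deletedAtRevision: 5 },
    { id: "t2", deletedAtRevision: 7 },
    { id: "t3", deletedAtRevision: 9 },
  ],
  acks,
)
```

With the slowest site at revision 7, `t1` and `t2` can be purged while `t3` must stay until `site-b` catches up.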
### Path 4: Context-Based OT (COT)
**When to choose this path:**
- Need undo/redo support
- Want to avoid TP2 at the function level
- Building a distributed system with known operation context
**How it works:**
COT (Sun & Sun 2009) replaces history buffers with **context vectors** that capture the exact document state where an operation was created. Instead of transforming operations pairwise, COT transforms based on context differences.
The key insight: TP2 is required because operations can be transformed along different paths. If you track the exact context (which operations have been applied), you can always reconstruct the correct transformation path.
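The bookkeeping behind a context difference can be sketched in a few lines. This is a simplified illustration that models a context as a set of operation IDs; `ContextualOp` and `contextDifference` are hypothetical names, and the full algorithm does additional work to bring each pair of operations into the same context before transforming them.

```typescript
type OpId = string

interface ContextualOp {
  id: OpId
  context: Set<OpId> // Operations already applied when this one was created
}

// Operations present in the current document state that `op` has not
// accounted for, in the order they were executed locally.
function contextDifference(documentState: OpId[], op: ContextualOp): OpId[] {
  return documentState.filter((id) => !op.context.has(id))
}

const incoming: ContextualOp = { id: "d", context: new Set(["a"]) }
const toTransformAgainst = contextDifference(["a", "b", "c"], incoming)
```

Here the incoming operation was created after only `a` had been applied, so it has to be transformed against `b` and `c` before it can execute on the current state.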
**Advantages:**
- Supports undo/redo of any operation at any time
- Does NOT require TP2 at the transformation function level
- Transformation functions only need TP1
- Proven correct for specific operation sets
**Real-world implementations:**
COT powers several academic and commercial systems: CoMaya (3D modeling), CoWord (Microsoft Word), CodoxWord, and IBM OpenCoWeb.
**Trade-offs:**
- ✅ Avoids TP2 requirement through context tracking
- ✅ Native undo/redo support
- ✅ Formally verified for specific operations
- ❌ Context vectors grow with operation count
- ❌ More complex than client-server approach
- ❌ Less production exposure than Jupiter-derived systems
### Decision Framework

Decision tree for OT variant selection. Client-server dominates production use.
## Production Implementations
### Google Docs
**Context:** World's most widely used collaborative editor, 40+ concurrent editors on popular documents.
**Implementation choices:**
- Architecture: Client-server (Jupiter-derived)
- Transport: WebSocket with operation batching
- Persistence: Event-sourced revision log
**Architecture:**
Each document stores a revision log of operations. The visible document is the result of replaying all operations from the initial state. This enables:
- Revision history ("See previous versions")
- Conflict-free persistence (append-only log)
- Recovery from any point in time
**Specific details:**
- Operations use a streaming format: Retain(n), Insert(chars), Delete(n)
- Attributes (bold, font) are separate operations on ranges
- Checkpoints created periodically to avoid replaying full history
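A streaming operation of this shape walks the document from left to right. The sketch below shows one plausible way to apply it to a plain string; the `Component` type and `applyComponents` function are assumptions for illustration and do not describe Google's internal format.

```typescript
type Component = { retain: number } | { insert: string } | { delete: number }

function applyComponents(doc: string, components: Component[]): string {
  let cursor = 0 // Position in the original document
  let out = ""
  for (const c of components) {
    if ("retain" in c) {
      out += doc.slice(cursor, cursor + c.retain) // Copy characters unchanged
      cursor += c.retain
    } else if ("insert" in c) {
      out += c.insert // New text; the cursor does not move
    } else {
      cursor += c.delete // Skip characters so they are dropped
    }
  }
  // Retains and deletes together must account for every original character.
  if (cursor !== doc.length) {
    throw new Error("operation does not span the whole document")
  }
  return out
}

const edited = applyComponents("hello world", [{ retain: 6 }, { delete: 5 }, { insert: "there" }])
```

The example keeps "hello ", drops "world", and appends "there". Requiring every operation to span the whole document makes length mismatches between client and server fail loudly instead of corrupting text.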
**What worked:**
- Simplicity of client-server model enabled rapid iteration
- Operation batching amortizes round-trip cost
- Event sourcing simplified persistence and history
**What was hard:**
- Rich text requires tree-structured operations (paragraphs, lists, tables)
- Cursor/selection state must be transformed alongside content
- Presence indicators (who's editing where) add additional sync complexity
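Transforming a cursor follows the same position arithmetic as transforming an operation. The following sketch is an illustrative assumption rather than any editor's actual code: `Edit` describes a remote range insert or delete, and a remote insert exactly at the cursor leaves the cursor in place.

```typescript
type Edit =
  | { kind: "ins"; pos: number; length: number }
  | { kind: "del"; pos: number; length: number }

// Adjust a local cursor index after a remote edit has been applied.
function transformCursor(cursor: number, edit: Edit): number {
  if (edit.kind === "ins") {
    // Text inserted strictly before the cursor pushes it right.
    return edit.pos < cursor ? cursor + edit.length : cursor
  }
  if (edit.pos >= cursor) return cursor // Deletion entirely after the cursor
  // Deletion before the cursor pulls it left; if the cursor was inside the
  // deleted range it collapses to the start of that range.
  return Math.max(edit.pos, cursor - edit.length)
}

const afterInsert = transformCursor(5, { kind: "ins", pos: 2, length: 3 })
const afterDelete = transformCursor(5, { kind: "del", pos: 1, length: 2 })
const insideDelete = transformCursor(5, { kind: "del", pos: 3, length: 4 })
```

A selection can be handled by transforming its anchor and focus independently with the same function.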
**Source:** Google Drive Blog (2010), Google Wave Whitepapers
### Apache Wave (Open Source)
**Context:** Open-sourced Google Wave codebase, reference implementation for Wave protocol.
**Implementation choices:**
- Architecture: Client-server with federation attempt
- Protocol: Wave Federation Protocol (XMPP-based)
- Operation types: Document mutations, wavelet operations
**Architecture:**
Wave operations are hierarchical:
- **Wavelet**: Container for participants and documents
- **Document**: Rich text with elements and annotations
- **Operations**: Retain, InsertCharacters, DeleteCharacters, InsertElement, etc.
**Specific details from whitepaper:**
- Every character, start tag, or end tag is an "item"
- Gaps between items are "positions"
- Operations compose: B∙A satisfies (B∙A)(d) = B(A(d))
- Clients must wait for ACK before sending next batch
**What worked:**
- Document model handles rich text well
- Composition of operations simplifies some transformations
- Open protocol enabled third-party clients
**What was hard:**
- Federation never achieved production stability
- Complexity overwhelmed both users and developers
- Performance issues with large documents
**Source:** Apache Wave Whitepapers, Joseph Gentle retrospectives
### CKEditor 5
**Context:** Commercial rich-text editor with real-time collaboration, 4+ years R&D.
**Implementation choices:**
- Architecture: Client-server
- Key innovation: OT for tree-structured documents
- Transport: WebSocket
**The tree challenge:**
Rich text isn't a string—it's a tree (paragraphs containing spans containing text). Operations like "split paragraph" or "merge cells" don't map to simple insert/delete. CKEditor claims to be the first production system to solve OT for trees.
**Specific details:**
- Operations can "break" during transformation (one op becomes multiple)
- Selective undo: users only undo their own changes
- Custom transformation functions for 50+ operation types
**What worked:**
- Tree model handles complex formatting (tables, lists, images)
- Selective undo improves UX in multi-user scenarios
- Years of investment produced stable implementation
**What was hard:**
- Operation breaking adds significant complexity
- Testing all transformation combinations is combinatorial
- Performance optimization for large documents
**Source:** CKEditor Blog (2020), "Lessons Learned from Creating a Rich-Text Editor with Real-Time Collaboration"
### Implementation Comparison
| Aspect | Google Docs | Apache Wave | CKEditor 5 |
| -------------- | ----------------------- | ---------------------------- | --------------------------- |
| Architecture | Client-server | Client-server + federation | Client-server |
| Document model | Streaming ops | Hierarchical wavelets | Tree-structured |
| Offline | Limited buffer | Limited buffer | Planned sync queue |
| Rich text | Yes (ops on ranges) | Yes (elements + annotations) | Yes (native tree ops) |
| Undo model | Global history | Global history | Selective (user's ops only) |
| Scale | Proven at massive scale | Abandoned | Production-proven |
## Implementation Guide
### Starting Point Decision

Implementation decision tree. Building OT from scratch is rarely justified.
### Library Options
| Library | Language | Architecture | Maturity | Best For |
| --------------- | ---------- | ------------- | ---------- | ------------------- |
| ShareDB | Node.js | Client-server | Production | JSON documents |
| ot.js | JavaScript | Client-server | Production | Plain text |
| Yjs | JavaScript | P2P (CRDT) | Production | Offline-first |
| CKEditor 5 | JavaScript | Client-server | Production | Rich text |
| Quill + y-quill | JavaScript | P2P (CRDT) | Production | Rich text + offline |
### Building Custom (Rare)
**When to build custom:**
- Document model doesn't fit existing libraries
- Need specific consistency guarantees
- Performance requirements exceed library capabilities
- Regulatory requirements prevent third-party code
**Implementation checklist:**
- [ ] Choose architecture: client-server strongly recommended
- [ ] Define operation types (keep minimal)
- [ ] Implement TP1-satisfying transformation functions
- [ ] Add server sequencing to avoid TP2
- [ ] Implement client state machine (pending, sent, acknowledged)
- [ ] Add conflict detection with checksums
- [ ] Build reconciliation for detected divergence
- [ ] Test with artificial latency and packet loss
- [ ] Fuzz test transformation functions exhaustively
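For the checksum item, one lightweight approach is to have the server attach a document hash to a given revision and let each client compare it once it reaches that revision. The sketch below assumes plain-text documents and uses an FNV-1a-style hash over UTF-16 code units; `Heartbeat` and `hasDiverged` are hypothetical names.

```typescript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function checksum(doc: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < doc.length; i++) {
    hash ^= doc.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return hash >>> 0 // Force an unsigned 32-bit result
}

interface Heartbeat {
  revision: number
  checksum: number
}

// Returns true only when the client is at the same revision as the
// heartbeat and its content hashes differently.
function hasDiverged(localDoc: string, localRevision: number, server: Heartbeat): boolean {
  if (localRevision !== server.revision) return false // Not comparable yet
  return checksum(localDoc) !== server.checksum
}

const heartbeat: Heartbeat = { revision: 3, checksum: checksum("abc") }
const inSync = !hasDiverged("abc", 3, heartbeat)
const corrupted = hasDiverged("abd", 3, heartbeat)
```

A detected mismatch would then trigger the reconciliation step from the checklist, typically a full reload from the server's copy.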
Joseph Gentle's estimate: "Wave took 2 years to write and if we rewrote it today, it would take almost as long."
## Common Pitfalls
### 1. Attempting Peer-to-Peer OT
**The mistake:** Implementing distributed OT because it seems "cleaner" than server dependency.
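**Why it fails:** Nearly every published P2P OT algorithm built on plain insert/delete transformation functions was later shown to be incorrect. P2P requires TP2, and Randolph et al. (2013) proved that no transformation function using only the classical position and character parameters satisfies both TP1 and TP2.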
**Example:** A startup builds P2P OT for "serverless collaboration." After 6 months, users report documents diverging. Debugging reveals the transformation functions fail with 3+ concurrent editors—a case the test suite never covered.
**Solution:** Use client-server architecture. The server round-trip cost is far lower than the correctness risk of P2P.
### 2. Ignoring Operation Composition
**The mistake:** Treating operations as independent units instead of composable.
**Why it matters:** Client sends Op1, then Op2 before receiving Op1's ACK. Server receives both, but Op2 was created assuming Op1 applied. If server processes them as independent, transformation fails.
**Solution:** Follow the Wave protocol: keep at most one operation in flight and buffer further local edits until its ACK arrives. An edit made while an operation is pending was created on a document that already includes the pending operation, so it is queued unchanged; it is the incoming remote operations that must be transformed against the pending and buffered ones.
```typescript
// Client state machine
// `Operation` and `send()` are application-specific and assumed to exist
interface ClientState {
  serverRevision: number
  pending: Operation | null // Sent, awaiting ACK
  buffer: Operation[] // Not yet sent
}
function onLocalEdit(state: ClientState, op: Operation): ClientState {
  if (state.pending) {
    // `op` was created on top of the pending operation: buffer it unchanged
    return { ...state, buffer: [...state.buffer, op] }
  } else {
    // Nothing in flight: send immediately
    send(op)
    return { ...state, pending: op }
  }
}
```
### 3. Unbounded History Growth
**The mistake:** Storing all operations forever for "history" feature.
**Why it fails:** Document with 1 year of active editing accumulates millions of operations. Loading requires replaying all of them.
**Example:** Notion-like app stores every keystroke. After 6 months, loading a busy document takes 30 seconds as the client replays 500K operations.
**Solutions:**
- Periodic snapshots: Store document state at intervals, only replay recent operations
- Checkpoint compaction: Combine operations into single state updates
- Lazy loading: Load recent ops, fetch history on demand
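The snapshot approach can be sketched as follows. `Snapshot`, `LoggedOp`, and `loadDocument` are hypothetical names, and each logged operation is modeled as a function from document to document to keep the example independent of any particular operation format.

```typescript
interface Snapshot {
  revision: number // Revision the snapshot was taken at
  doc: string
}

interface LoggedOp {
  revision: number
  apply: (doc: string) => string
}

// Start from the newest snapshot and replay only the operations after it.
function loadDocument(snapshots: Snapshot[], log: LoggedOp[]): string {
  let base: Snapshot = { revision: 0, doc: "" } // Empty document if no snapshot exists
  for (const s of snapshots) {
    if (s.revision > base.revision) base = s
  }
  return log
    .filter((op) => op.revision > base.revision)
    .sort((a, b) => a.revision - b.revision)
    .reduce((doc, op) => op.apply(doc), base.doc)
}

const append = (revision: number, text: string): LoggedOp => ({
  revision,
  apply: (doc) => doc + text,
})

const loaded = loadDocument(
  [{ revision: 2, doc: "ab" }],
  [append(1, "a"), append(2, "b"), append(3, "c")],
)
```

Only the revision 3 operation is replayed here, so load time depends on the number of operations since the last snapshot rather than on the full history.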
### 4. Testing Only Happy Path
**The mistake:** Testing with low latency, no packet loss, and 2 concurrent users.
**Why it fails:** OT bugs emerge with high concurrency, reordered packets, and edge-case timing. The "dOPT puzzle" only manifests with specific 3-user concurrent patterns.
**Solution:**
- Fuzz testing: Random operation sequences with random timing
- Chaos engineering: Inject latency, reordering, duplication
- Invariant checking: Assert convergence after every operation sequence
- Property-based testing: Generate operation sequences, verify TP1 holds
```typescript
import fc from "fast-check"

// Property-based test for TP1
// arbitraryOperation() and arbitraryDocument() are project-specific generators;
// apply() and transform() are the functions under test.
test("TP1: transformation paths converge", () => {
  fc.assert(
    fc.property(arbitraryOperation(), arbitraryOperation(), arbitraryDocument(), (op1, op2, doc) => {
      const path1 = apply(apply(doc, op1), transform(op2, op1))
      const path2 = apply(apply(doc, op2), transform(op1, op2))
      expect(path1).toEqual(path2)
    }),
  )
})
```
### 5. Assuming Transform Functions Are Symmetric
**The mistake:** Assuming T(op1, op2) and T(op2, op1) are computed the same way.
**Why it fails:** Transformation depends on which operation is being adjusted. Insert-vs-insert with same position needs tie-breaking; the "winner" doesn't transform, the "loser" shifts.
**Example:** Two users insert at position 5. Without consistent tie-breaking, one client computes T(A,B)=A (A wins), other computes T(A,B)=shift(A) (B wins). Divergence.
**Solution:** Always use consistent tie-breaking: site IDs, timestamps, or operation IDs. The rule must be deterministic and known to all clients.
## Conclusion
OT solves collaborative editing through operation transformation, but its apparent simplicity hides decades of failed correctness proofs. The practical takeaway: **use client-server architecture**. It sidesteps TP2, enables straightforward implementation, and is the only approach proven at scale.
For new projects, prefer existing libraries (ShareDB, CKEditor, or CRDT-based alternatives like Yjs) over custom implementations. If building custom, invest heavily in property-based testing and invariant checking—the bugs that matter only emerge with specific concurrent operation sequences that manual testing will miss.
OT remains the foundation of Google Docs and most production collaborative editors, but its dominance is being challenged by CRDTs for offline-first applications. The choice depends on your consistency requirements, offline needs, and tolerance for implementation complexity.
## Appendix
### Prerequisites
- Understanding of distributed systems consistency models
- Familiarity with event sourcing concepts
- Basic knowledge of Lamport timestamps and causal ordering
- For CRDTs comparison: see Martin Kleppmann's ["CRDTs: The Hard Parts"](https://martin.kleppmann.com/2020/07/06/crdt-hard-parts-hydra.html)
### Terminology
- **OT (Operational Transformation)**: Algorithm that transforms concurrent operations to achieve convergence
- **TP1 (Transformation Property 1)**: Convergence property—different transformation paths yield same result
- **TP2 (Transformation Property 2)**: Commutativity property—transformation is independent of operation ordering
- **Convergence**: All replicas reach identical state after applying all operations
- **Intent preservation**: Transformed operation achieves user's original goal
- **Tombstone**: Marker for deleted content, retained for transformation stability
- **COT (Context-based OT)**: Variant tracking operation context for transformation
- **TTF (Tombstone Transformation Functions)**: Provably correct transformation using tombstones
### Summary
- OT transforms concurrent operations by adjusting positions based on what other operations have done
- **TP1 is mandatory**: transformation paths must converge to the same state
- **TP2 is impractical**: no standard string operations satisfy it; use client-server to avoid requiring it
- Production systems (Google Docs, CKEditor) all use client-server architecture for this reason
- Libraries (ShareDB, ot.js) are strongly preferred over custom implementations
- Testing must include fuzz testing and property-based tests—manual testing misses the bugs that matter
### References
- Ellis, C.A., Gibbs, S.J. (1989). ["Concurrency Control in Groupware Systems"](https://dl.acm.org/doi/10.1145/67544.66963) - ACM SIGMOD'89. Original dOPT algorithm.
- Nichols, D.A. et al. (1995). ["High-latency, low-bandwidth windowing in the Jupiter collaboration system"](https://dl.acm.org/doi/10.1145/215585.215706) - ACM UIST. Foundation for all practical OT.
- Ressel, M. et al. (1996). ["An Integrating, Transformation-Oriented Approach to Concurrency Control and Undo"](https://dl.acm.org/doi/10.1145/240080.240305) - ACM CSCW'96. Defined TP1/TP2.
- Sun, C., Ellis, C. (1998). ["Operational transformation in real-time group editors"](https://dl.acm.org/doi/10.1145/289444.289469) - ACM CSCW'98. GOT/GOTO algorithms.
- Oster, G. et al. (2006). ["Tombstone Transformation Functions for Ensuring Consistency"](https://hal.science/inria-00109039) - CollaborateCom. First provably correct OT.
- Sun, D., Sun, C. (2009). ["Context-Based Operational Transformation"](https://ieeexplore.ieee.org/document/4668339/) - IEEE TPDS. COT algorithm.
- Randolph, A. et al. (2013). ["On Consistency of Operational Transformation Approach"](https://arxiv.org/abs/1302.3292) - arXiv. Impossibility proof for TP2.
- Levien, R. (2016). ["Towards a unified theory of Operational Transformation and CRDT"](https://medium.com/@raphlinus/towards-a-unified-theory-of-operational-transformation-and-crdt-70485876f72f) - Medium. Excellent synthesis.
- [Apache Wave OT Whitepaper](https://svn.apache.org/repos/asf/incubator/wave/whitepapers/operational-transform/operational-transform.html) - Detailed protocol specification.
- [Google Drive Blog](https://drive.googleblog.com/2010/09/whats-different-about-new-google-docs.html) - Google Docs architecture overview.
- [CKEditor Blog](https://ckeditor.com/blog/lessons-learned-from-creating-a-rich-text-editor-with-real-time-collaboration/) - Production lessons for rich text OT.
- [ShareDB GitHub](https://github.com/share/sharedb) - Production-ready OT library.
---
## Distributed Locking
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/core-distributed-patterns/distributed-locking
**Category:** System Design / Core Distributed Patterns
**Description:** Distributed locks coordinate access to shared resources across multiple processes or nodes. Unlike single-process mutexes, they must handle network partitions, clock drift, process pauses, and partial failures, all while providing mutual exclusion guarantees that range from “best effort” to “correctness critical.” This article covers lock implementations (Redis, ZooKeeper, etcd, Chubby), the Redlock controversy, fencing tokens, lease-based expiration, and when to avoid locks entirely.
# Distributed Locking
Distributed locks coordinate access to shared resources across multiple processes or nodes. Unlike single-process mutexes, they must handle network partitions, clock drift, process pauses, and partial failures—all while providing mutual exclusion guarantees that range from "best effort" to "correctness critical."
This article covers lock implementations (Redis, ZooKeeper, etcd, Chubby), the Redlock controversy, fencing tokens, lease-based expiration, and when to avoid locks entirely.

Distributed locks must handle failures that single-process locks never face—network partitions, clock drift, and process pauses can all cause multiple clients to believe they hold the same lock simultaneously.
## Abstract
Distributed locking is fundamentally harder than it appears. The safety property—at most one client holds the lock at any time—requires either **consensus protocols** (ZooKeeper, etcd) or careful timing assumptions that can fail under realistic conditions (Redlock).
**Core mental model:**
- **Efficiency locks**: Prevent duplicate work. Occasional double-execution is tolerable. Redis single-node or Redlock works.
- **Correctness locks**: Protect invariants. Double-execution corrupts data. Requires consensus + fencing tokens.
**Key insight**: Most lock implementations provide **leases** (auto-expiring locks) rather than indefinite locks. Leases prevent deadlock from crashed clients but introduce the fundamental problem: **what if the lease expires while the client is still working?**
**Fencing tokens** solve this: the lock service issues a monotonically increasing token with each lock grant. The protected resource rejects operations with tokens lower than the highest it has seen. This transforms lease expiration from a safety violation into a detected-and-rejected stale operation.
**Decision framework:**
| Requirement | Implementation | Trade-off |
| ------------------------------- | ------------------------ | ------------------------------ |
| Best-effort deduplication | Redis single-node | Single point of failure |
| Efficiency with fault tolerance | Redlock (5 nodes) | No fencing, timing assumptions |
| Correctness critical | ZooKeeper/etcd + fencing | Operational complexity |
| Already using PostgreSQL | Advisory locks | Limited to single database |
## The Problem
### Why Naive Solutions Fail
**Approach 1: File-based locks across NFS**
```typescript
import { promises as fs } from "fs"
// Naive NFS lock - seems simple
async function acquireLock(path: string): Promise<boolean> {
  try {
    // "wx" = exclusive create; writeFile needs a string, not a number
    await fs.writeFile(path, String(process.pid), { flag: "wx" })
    return true
  } catch {
    return false // file exists
  }
}
```
Fails because:
- **NFS semantics vary**: `O_EXCL` isn't atomic on all NFS implementations
- **No expiration**: If the process crashes, lock file persists forever
- **No fencing**: Stale lock holders can still access the resource
**Approach 2: Database row locks**
```sql
-- Lock by inserting a row
INSERT INTO locks (resource_id, holder, acquired_at)
VALUES ('resource-1', 'client-a', NOW())
ON CONFLICT DO NOTHING;
```
Fails because:
- **No automatic expiration**: Crashed clients leave orphan locks
- **Clock drift**: `acquired_at` timestamps unreliable across nodes
- **Single point of failure**: Database becomes bottleneck
**Approach 3: Redis SETNX without TTL**
```
SETNX resource:lock client-id
```
Fails because:
- **No expiration**: Crashed client locks resource forever
- **Race on release**: Client must check-then-delete atomically
### The Core Challenge
The fundamental tension: **distributed systems are asynchronous**—there are no bounded delays on message delivery, no bounded process pauses, and no bounded clock drift.
Distributed locks exist to provide **mutual exclusion** across this asynchronous environment. The challenge: you cannot distinguish a slow client from a crashed client, and you cannot trust clocks.
> "Distributed locks are not just a scaling challenge—they're a correctness challenge. The algorithm must be correct even when clocks are wrong, networks are partitioned, and processes pause unexpectedly."
> — Martin Kleppmann, "How to do distributed locking" (2016)
## Lease-Based Locking
All practical distributed locks use **leases**—time-bounded locks that expire automatically. This prevents indefinite lock holding by crashed clients.
### Core Mechanism

### TTL Selection Formula
```
MIN_VALIDITY = TTL - (T_acquire - T_start) - CLOCK_DRIFT
```
Where:
- **TTL**: Initial lease duration
- **T_acquire - T_start**: Time elapsed acquiring the lock
- **CLOCK_DRIFT**: Maximum expected clock skew between client and server
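As a worked example of the formula, the sketch below plugs in illustrative numbers. The function name `minValidityMs` and the specific values (30s TTL, 120ms acquisition time, 300ms drift allowance) are assumptions chosen for the arithmetic, not recommendations.

```typescript
// Remaining safe lease time, per the MIN_VALIDITY formula above.
// All values are in milliseconds; tStartMs and tAcquireMs come from the same local clock.
function minValidityMs(ttlMs: number, tStartMs: number, tAcquireMs: number, clockDriftMs: number): number {
  const acquireElapsedMs = tAcquireMs - tStartMs
  return ttlMs - acquireElapsedMs - clockDriftMs
}

// Illustrative numbers (assumptions, not measurements):
// 30s TTL, 120ms spent acquiring, 300ms allowed for clock drift.
const validity = minValidityMs(30_000, 1_000, 1_120, 300)
console.log(validity) // 29580

// A non-positive result means the lease may already be unusable.
const tooSlow = minValidityMs(5_000, 0, 4_900, 300)
console.log(tooSlow > 0) // false
```

A client would typically treat a non-positive result as a failed acquisition and release the lock rather than start work.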
**Practical guidance:**
- **JVM applications**: TTL ≥ 60s (stop-the-world GC can pause for seconds)
- **Go/Rust applications**: TTL ≥ 30s (less GC concern, but network issues)
- **General rule**: TTL should be 10x your expected operation duration
### Clock Skew Issues
**Wall-clock danger:** Redis uses wall-clock time for TTL expiration. If the server's clock jumps forward (NTP adjustment, manual change), leases expire prematurely.
**Example failure scenario:**
1. Client acquires lock with TTL=30s at server time T
2. NTP adjusts server clock forward by 20s
3. Lock expires at "T+30s" = actual T+10s
4. Client still working; another client acquires lock
5. **Two clients now hold the "same" lock**
**Mitigation:** Use monotonic clocks where possible. Linux `clock_gettime(CLOCK_MONOTONIC)` measures elapsed time without wall-clock adjustments.
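On the client side, the same idea can be sketched in TypeScript. This is a minimal illustration that assumes `performance.now()` as the monotonic source (it is monotonic in Node.js and browsers); `LeaseTimer` and `safetyMarginMs` are hypothetical names. It only protects the client's own bookkeeping and does nothing about clock jumps on the lock server.

```typescript
// Tracks a lease deadline on the client using a monotonic clock.
// performance.now() is not affected by wall-clock adjustments (NTP steps, manual changes).
class LeaseTimer {
  private readonly deadlineMs: number
  private readonly now: () => number

  // safetyMarginMs is a hypothetical buffer for drift and in-flight requests.
  constructor(validityMs: number, safetyMarginMs: number, now: () => number = () => performance.now()) {
    this.now = now
    this.deadlineMs = now() + validityMs - safetyMarginMs
  }

  remainingMs(): number {
    return Math.max(0, this.deadlineMs - this.now())
  }

  isProbablyValid(): boolean {
    return this.remainingMs() > 0
  }
}

// Usage with a fake clock so the behaviour is easy to follow.
let fakeNow = 0
const timer = new LeaseTimer(30_000, 2_000, () => fakeNow)
console.log(timer.remainingMs()) // 28000
fakeNow = 29_000
console.log(timer.isProbablyValid()) // false
```

The method is named `isProbablyValid` on purpose: a process pause can still occur between the check and the write, which is why the check complements fencing rather than replacing it.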
> **Prior to Redis 7.0:** TTL expiration relied entirely on wall-clock time. Redis 7.0+ uses monotonic clocks internally for some operations, but the fundamental issue remains for distributed Redlock scenarios where multiple independent clocks are involved.
## Design Paths
### Path 1: Redis Single-Node Lock
**When to choose:**
- Lock is for efficiency (prevent duplicate work), not correctness
- Single point of failure is acceptable
- Lowest latency requirement
**Implementation:**
```typescript collapse={1-2}
import { Redis } from "ioredis"
async function acquireLock(redis: Redis, resource: string, clientId: string, ttlMs: number): Promise<boolean> {
  // SET with PX (millisecond expiry) and NX (only if not exists).
  // Expiry-before-NX argument order matches the ioredis type definitions.
  const result = await redis.set(resource, clientId, "PX", ttlMs, "NX")
  return result === "OK"
}
async function releaseLock(redis: Redis, resource: string, clientId: string): Promise<boolean> {
  // Lua script: atomic check-and-delete
  // Only delete if we still own the lock
  const script = `
    if redis.call("get", KEYS[1]) == ARGV[1] then
      return redis.call("del", KEYS[1])
    else
      return 0
    end
  `
  const result = await redis.eval(script, 1, resource, clientId)
  return result === 1
}
```
**Why the Lua script for release:** Without atomic check-and-delete, this race exists:
1. Client A's lock expires
2. Client B acquires lock
3. Client A (still thinking it has lock) calls `DEL`
4. Client A deletes Client B's lock
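The race can be reproduced without Redis. The sketch below uses an in-memory `Map` as a stand-in for the key space (an assumption for illustration; there is no expiry logic, and `naiveRelease` and `checkedRelease` are hypothetical helper names). In real Redis the compare and the delete must run inside one Lua script, because another client can act between two separate commands.

```typescript
// In-memory stand-in for a Redis key space (illustration only, no expiry logic).
const store = new Map<string, string>()

// Unsafe: deletes whatever is there, regardless of owner.
function naiveRelease(key: string): void {
  store.delete(key)
}

// Safe: mirrors the Lua script, delete only if the stored value is ours.
function checkedRelease(key: string, clientId: string): boolean {
  if (store.get(key) !== clientId) return false
  store.delete(key)
  return true
}

// Client A's lease has expired and client B now owns the key.
store.set("resource:lock", "client-b")

const releasedByA = checkedRelease("resource:lock", "client-a")
console.log(releasedByA, store.get("resource:lock")) // false client-b

naiveRelease("resource:lock") // what client A would do with a bare DEL
console.log(store.has("resource:lock")) // false: B's lock is gone
```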
**Trade-offs:**
| Advantage | Disadvantage |
| ------------------------- | -------------------------- |
| Simple implementation | Single point of failure |
| Low latency (~1ms) | No automatic failover |
| Well-understood semantics | Lost locks on master crash |
**Real-world:** This approach works well for rate limiting, cache stampede prevention, and other scenarios where occasional double-execution is tolerable.
### Path 2: Redlock (Multi-Node Redis)
**When to choose:**
- Need fault tolerance for efficiency locks
- Can tolerate timing assumptions
- Want Redis ecosystem (Lua scripts, familiar API)
**Algorithm (N=5 independent Redis instances):**
1. Get current time in milliseconds
2. Try to acquire lock on ALL N instances sequentially, with small timeout per instance
3. Lock is acquired if: majority (N/2 + 1) succeeded AND total elapsed time < TTL
4. Validity time = TTL - elapsed time
5. If failed, release lock on ALL instances (even those that succeeded)
```typescript collapse={1-8}
import { Redis } from "ioredis"
import { randomBytes } from "crypto"
interface RedlockResult {
  acquired: boolean
  validity: number
  value: string
}
// Uses releaseLock() from the single-node example above
async function redlockAcquire(instances: Redis[], resource: string, ttlMs: number): Promise<RedlockResult> {
  const value = randomBytes(20).toString("hex")
  const startTime = Date.now()
  const quorum = Math.floor(instances.length / 2) + 1
  let acquired = 0
  for (const redis of instances) {
    try {
      const result = await redis.set(resource, value, "PX", ttlMs, "NX")
      if (result === "OK") acquired++
    } catch {
      // Instance unavailable, continue
    }
  }
  const elapsed = Date.now() - startTime
  const validity = ttlMs - elapsed
  if (acquired >= quorum && validity > 0) {
    return { acquired: true, validity, value }
  }
  // Failed - release on all instances; ignore instances that are unreachable
  await Promise.all(instances.map((r) => releaseLock(r, resource, value).catch(() => false)))
  return { acquired: false, validity: 0, value }
}
```
**Critical limitation:** Redlock generates random values (20 bytes from `/dev/urandom`), not monotonically increasing tokens. **You cannot use Redlock values for fencing** because resources cannot determine which token is "newer."
**Trade-offs vs single-node:**
| Aspect | Single-Node | Redlock (N=5) |
| ----------------- | ----------- | -------------------------- |
| Fault tolerance | None | Survives N/2 failures |
| Latency | ~1ms | ~5ms (sequential attempts) |
| Complexity | Low | Medium |
| Fencing support | No | No |
| Clock assumptions | Server only | All N servers + client |
### Path 3: ZooKeeper
**When to choose:**
- Correctness-critical locks (fencing required)
- Already running ZooKeeper (Kafka, HBase ecosystem)
- Can tolerate higher latency for stronger guarantees
**Ephemeral sequential node recipe:**

**Algorithm:**
1. Client creates ephemeral sequential node under `/locks/resource`
2. Client lists all children, sorts by sequence number
3. If client's node has lowest sequence: lock acquired
4. Otherwise: set a watch on the node immediately before the client's own node in sequence order (its predecessor)
5. When watch fires: repeat step 2
**Why watch predecessor, not parent:**
- Watching parent causes **thundering herd**: all N clients wake when lock releases
- Watching predecessor: only next client wakes
**Fencing via zxid:** ZooKeeper's transaction ID (zxid) is a monotonically increasing 64-bit number. Use the zxid of your lock node as a fencing token.
```java collapse={1-6}
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
public class ZkLock {
    private final ZooKeeper zk;
    private String myNode;
    public ZkLock(ZooKeeper zk) {
        this.zk = zk;
    }
    public long acquireLock(String resource) throws Exception {
        String parent = "/locks/" + resource;
        // Create ephemeral sequential node (the parent znode must already exist)
        myNode = zk.create(
            parent + "/lock-",
            new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL_SEQUENTIAL
        );
        String myName = myNode.substring(myNode.lastIndexOf('/') + 1);
        while (true) {
            List<String> children = zk.getChildren(parent, false);
            Collections.sort(children);
            int myIndex = children.indexOf(myName);
            if (myIndex == 0) {
                // We have the lock - return creation zxid as fencing token
                Stat stat = zk.exists(myNode, false);
                return stat.getCzxid();
            }
            // Find predecessor and watch it
            String predecessor = children.get(myIndex - 1);
            CountDownLatch changed = new CountDownLatch(1);
            Stat stat = zk.exists(parent + "/" + predecessor, event -> changed.countDown());
            if (stat != null) {
                // Block until the watch fires (predecessor deleted or changed)
                changed.await();
            }
            // Loop: re-list children, since the predecessor may not have held the lock
        }
    }
}
```
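The decision made on each pass of that loop (do I hold the lock, and if not, which node do I watch?) can be isolated as a pure function. The TypeScript sketch below is not tied to any ZooKeeper client library; `decide` and `LockDecision` are hypothetical names, and it relies on ZooKeeper's zero-padded sequence suffixes so that string order matches numeric order.

```typescript
type LockDecision =
  | { kind: "acquired" }
  | { kind: "wait"; watch: string }

// children: znode names under the lock parent, e.g. "lock-0000000003".
// myName: the name of this client's own znode.
function decide(children: string[], myName: string): LockDecision {
  // Sequence suffixes are zero-padded, so lexicographic order matches numeric order.
  const sorted = [...children].sort()
  const myIndex = sorted.indexOf(myName)
  if (myIndex === -1) throw new Error(`own node ${myName} not found (session expired?)`)
  if (myIndex === 0) return { kind: "acquired" }
  // Watch only the immediate predecessor to avoid waking every waiter.
  return { kind: "wait", watch: sorted[myIndex - 1] }
}

const children = ["lock-0000000007", "lock-0000000005", "lock-0000000006"]
console.log(decide(children, "lock-0000000005")) // { kind: 'acquired' }
console.log(decide(children, "lock-0000000007")) // { kind: 'wait', watch: 'lock-0000000006' }
```

Throwing when the client's own node is missing is a design choice for this sketch: an ephemeral node disappears when the session expires, and continuing as if the client were still queued would be unsafe.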
**Trade-offs:**
| Advantage | Disadvantage |
| ----------------------------------- | ----------------------------- |
| Strong consistency (Zab consensus) | Higher latency (2+ RTTs) |
| Automatic cleanup (ephemeral nodes) | Operational complexity |
| Fencing tokens (zxid) | Session management overhead |
| No clock assumptions | Quorum unavailable = no locks |
### Path 4: etcd
**When to choose:**
- Kubernetes-native environment
- Prefer gRPC over custom protocols
- Need distributed KV store beyond just locking
**Lease-based locking:**
etcd provides first-class lease primitives. A lease is a token with a TTL; keys can be attached to leases and are automatically deleted when the lease expires.
```go collapse={1-10}
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func acquireLock(client *clientv3.Client, resource string) (*concurrency.Mutex, error) {
	// Create session with 30s TTL (the session keeps its lease alive in the background)
	session, err := concurrency.NewSession(client, concurrency.WithTTL(30))
	if err != nil {
		return nil, err
	}
	// Create mutex and acquire
	mutex := concurrency.NewMutex(session, "/locks/"+resource)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := mutex.Lock(ctx); err != nil {
		session.Close() // revoke the lease if we never got the lock
		return nil, err
	}
	// Use mutex.Header().Revision as fencing token
	return mutex, nil
}
```
**Fencing via revision:** etcd assigns a globally unique, monotonically increasing revision to every modification. Use `mutex.Header().Revision` as your fencing token.
**Critical limitation (Jepsen finding):** Under network partitions, etcd locks can fail to provide mutual exclusion. Jepsen testing found ~18% loss of acknowledged updates when locks protected concurrent modifications. The root cause: etcd must sacrifice correctness to preserve liveness in asynchronous systems.
> "etcd's lock is not safe. It is possible for two processes to simultaneously hold the same lock, even in healthy clusters."
> — Kyle Kingsbury, Jepsen analysis of etcd 3.4.3 (2020)
**Trade-offs:**
| Advantage | Disadvantage |
| ----------------------------------- | ------------------------------ |
| Raft consensus (strong consistency) | Jepsen found safety violations |
| Native lease support | Higher latency than Redis |
| Kubernetes integration | Operational complexity |
| Revision-based fencing | Quorum unavailable = no locks |
### Path 5: Database Advisory Locks (PostgreSQL)
**When to choose:**
- Already using PostgreSQL
- Lock scope is single database
- Don't want external dependencies
**Session-level advisory locks:**
```sql
-- Acquire lock (blocks until available)
SELECT pg_advisory_lock(hashtext('resource-1'));
-- Try acquire (returns immediately)
SELECT pg_try_advisory_lock(hashtext('resource-1'));
-- Release
SELECT pg_advisory_unlock(hashtext('resource-1'));
```
**Transaction-level advisory locks:**
```sql
-- Automatically released at transaction end
SELECT pg_advisory_xact_lock(hashtext('resource-1'));
-- Then do your work within the transaction
UPDATE resources SET ... WHERE id = 'resource-1';
```
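**Lock ID generation:** Advisory locks take a single 64-bit integer key (or a pair of 32-bit keys). `hashtext()` maps a string resource ID to a 32-bit integer, so distinct IDs can collide and end up sharing a lock. Use it for convenience, or encode your own scheme when collisions matter.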
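One way to "encode your own scheme" is to derive the key in the application. The sketch below takes the first 8 bytes of a SHA-256 digest as a signed 64-bit integer; `advisoryLockKey` is a hypothetical helper, and this scheme is an assumption for illustration rather than a PostgreSQL convention.

```typescript
import { createHash } from "node:crypto"

// Derives a signed 64-bit advisory lock key from a resource name.
// Uses the first 8 bytes of SHA-256; one possible scheme, not a PostgreSQL convention.
function advisoryLockKey(resource: string): bigint {
  const digest = createHash("sha256").update(resource, "utf8").digest()
  return digest.readBigInt64BE(0) // signed, so it fits PostgreSQL's bigint range
}

const key = advisoryLockKey("resource-1")
console.log(key === advisoryLockKey("resource-1")) // true: deterministic

// Pass key.toString() as a bigint query parameter,
// e.g. SELECT pg_advisory_xact_lock($1)
```

Using the full 64-bit key space makes accidental collisions between resource names far less likely than with a 32-bit hash, though every service that locks the same resource has to agree on the same derivation.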
**Connection pooling danger:** Session-level locks are tied to the database connection. With connection pooling (PgBouncer), your "session" may be reused by another client, leaking locks. **Use transaction-level locks with connection pooling.**
**Trade-offs:**
| Advantage | Disadvantage |
| ----------------------------- | --------------------------- |
| No external dependencies | Single database scope |
| ACID guarantees | Connection pooling issues |
| Already have PostgreSQL | Not for multi-database |
| Automatic transaction cleanup | Lock ID collisions possible |
### Comparison Matrix
| Factor | Redis Single | Redlock | ZooKeeper | etcd | PostgreSQL |
| ---------------------- | ------------ | --------------- | ------------ | ---------------- | --------------- |
| Fault tolerance | None | N/2 failures | N/2 failures | N/2 failures | Database HA |
| Fencing tokens | No | No | Yes (zxid) | Yes (revision) | No |
| Latency (acquire) | ~1ms | ~5-10ms | ~10-50ms | ~10-50ms | ~1-5ms |
| Clock assumptions | Yes | Yes (all nodes) | No | No | No |
| Correctness guarantee | No | No | Yes | Partial (Jepsen) | Yes (single DB) |
| Operational complexity | Low | Medium | High | Medium | Low |
### Decision Framework

## Fencing Tokens
### The Problem They Solve
Leases expire. When they do, a "stale" lock holder may still be executing its critical section. Without fencing, this corrupts the protected resource.
**Example failure scenario:**

### How Fencing Tokens Work
1. Lock service issues **monotonically increasing token** with each grant
2. Client includes token with every operation on protected resource
3. Resource tracks **highest token ever seen**
4. Resource **rejects** operations with token < highest seen

### Implementation Pattern
**Lock service side:**
```typescript collapse={1-4}
interface LockGrant {
  token: bigint // Monotonically increasing
  expiresAt: number
}
interface LockState {
  holder: string
  token: bigint
  expiresAt: number
}
class FencingLockService {
  private nextToken: bigint = 1n
  private locks: Map<string, LockState> = new Map()
  acquire(resource: string, clientId: string, ttlMs: number): LockGrant | null {
    const existing = this.locks.get(resource)
    if (existing && existing.expiresAt > Date.now()) {
      return null // Lock held
    }
    const token = this.nextToken++
    const expiresAt = Date.now() + ttlMs
    this.locks.set(resource, { holder: clientId, token, expiresAt })
    return { token, expiresAt }
  }
}
```
**Resource side:**
```typescript collapse={1-6}
interface FencedWrite {
  token: bigint
  data: unknown
}
class FencedStorage {
  private highestToken: Map<string, bigint> = new Map()
  private data: Map<string, unknown> = new Map()
  write(resource: string, write: FencedWrite): boolean {
    const highest = this.highestToken.get(resource) ?? 0n
    if (write.token < highest) {
      // Stale token - reject
      return false
    }
    // Accept write, update highest seen
    this.highestToken.set(resource, write.token)
    this.data.set(resource, write.data)
    return true
  }
}
```
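To see the two sides interact, the following self-contained sketch replays the expired-lease scenario with explicit timestamps instead of real waiting. It is a deliberately reduced, single-resource version of the classes above; the function names and the 10s lease are assumptions for illustration.

```typescript
// Minimal in-memory lock service and fenced store (single process, single resource).
let nextToken = 1n
let leaseExpiresAt = 0

function acquire(nowMs: number, ttlMs: number): bigint | null {
  if (leaseExpiresAt > nowMs) return null // still held
  leaseExpiresAt = nowMs + ttlMs
  return nextToken++
}

let highestToken = 0n
let stored = ""

function fencedWrite(token: bigint, data: string): boolean {
  if (token < highestToken) return false // stale holder
  highestToken = token
  stored = data
  return true
}

// t=0: client A gets token 1 with a 10s lease, then pauses (e.g. a long GC).
const tokenA = acquire(0, 10_000)!
// t=15s: the lease has expired, so client B gets token 2 and writes.
const tokenB = acquire(15_000, 10_000)!
console.log(fencedWrite(tokenB, "from B")) // true
// t=16s: client A wakes up and writes with its old token.
console.log(fencedWrite(tokenA, "from A")) // false
console.log(stored) // from B
```

The lock service never learns that client A overran its lease. The rejection happens entirely at the resource, which is why fencing only works when the resource itself participates.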
### Why Random Values Don't Work
Redlock uses random values (20 bytes), not ordered tokens. A resource cannot determine if `abc123` is "newer" than `xyz789`. This is why Redlock cannot provide fencing—the values lack the **ordering property** required to reject stale operations.
> "To make the lock safe with fencing, you need not just a random token, but a monotonically increasing token. And the only way to generate a monotonically increasing token is to use a consensus protocol."
> — Martin Kleppmann
### ZooKeeper zxid as Fencing Token
ZooKeeper's transaction ID (zxid) is perfect for fencing:
- **Monotonically increasing**: Every ZK transaction increments it
- **Globally ordered**: All clients see same ordering
- **Available at lock time**: `Stat.getCzxid()` returns creation zxid
```java
// When acquiring lock
Stat stat = zk.exists(myLockNode, false);
long fencingToken = stat.getCzxid();
// When accessing resource
resource.write(data, fencingToken);
```
## The Redlock Controversy
### Kleppmann's Critique (2016)
Martin Kleppmann identified fundamental problems with Redlock:
**1. Timing assumptions violated by real systems:**
Redlock assumes bounded network delay, bounded process pauses, and bounded clock drift. Real systems violate all three:
- Network packets can be delayed arbitrarily (TCP retransmits, routing changes)
- GC pauses can exceed lease TTL (observed: 1+ minutes in production JVMs)
- Clock skew can be seconds under adversarial NTP conditions
**2. No fencing capability:**
Even if Redlock worked perfectly, it generates random values, not monotonic tokens. Resources cannot reject stale operations.
**3. Clock jump scenario:**
1. Client acquires lock on 3 of 5 Redis instances
2. Clock on one instance jumps forward (NTP sync)
3. Lock expires prematurely on that instance
4. Another client acquires on 3 instances (the jumped one + 2 others)
5. **Two clients now hold majority**
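The scenario can be replayed with a small simulation in which every instance carries its own clock. This is an illustrative model only (the `Instance` shape, `tryAcquire`, and the 31s jump are assumptions), not an implementation of Redlock.

```typescript
// Each simulated Redis instance has its own clock and at most one lock entry.
interface Instance {
  name: string
  clockMs: number // this instance's view of "now"
  lock?: { owner: string; expiresAtMs: number }
}

function tryAcquire(inst: Instance, owner: string, ttlMs: number): boolean {
  const held = inst.lock !== undefined && inst.lock.expiresAtMs > inst.clockMs
  if (held) return false
  inst.lock = { owner, expiresAtMs: inst.clockMs + ttlMs }
  return true
}

function countAcquired(instances: Instance[], owner: string, ttlMs: number): number {
  return instances.filter((inst) => tryAcquire(inst, owner, ttlMs)).length
}

const [a, b, c, d, e]: Instance[] = ["A", "B", "C", "D", "E"].map((name) => ({ name, clockMs: 0 }))
const ttlMs = 30_000

// Step 1: client 1 reaches A, B, C only.
const client1 = countAcquired([a, b, c], "client-1", ttlMs)

// Steps 2-3: C's clock jumps forward past the lease expiry.
c.clockMs += 31_000

// Step 4: client 2 reaches C, D, E only, while client 1's lease is still live on A and B.
const client2 = countAcquired([c, d, e], "client-2", ttlMs)

console.log(client1, client2) // 3 3: both believe they hold a majority
```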
### Antirez's Response
Salvatore Sanfilippo (Redis creator) responded:
**1. Random values + CAS = sufficient:**
> "The token is a random string. If you use check-and-set (CAS), you can use the random string to ensure that only the lock owner can modify the resource."
**2. Post-acquisition time check:**
Redlock spec includes checking elapsed time after acquisition. If elapsed > TTL, the lock is considered invalid. This allegedly handles delayed responses.
**3. Monotonic clocks:**
Proposed using `CLOCK_MONOTONIC` instead of wall clocks to eliminate clock jump issues.
### The Verdict
Neither argument is fully satisfying:
| Kleppmann's points | Antirez's counterpoints | Reality |
| ------------------------ | ---------------------------- | ------------------------------------------------------------------- |
| GC pauses violate timing | Post-acquisition check helps | Pauses can happen _during_ resource access, not just during acquire |
| No fencing possible | Random + CAS works | CAS requires resource to store lock value; not always feasible |
| Clock jumps break safety | Use monotonic clocks | Cross-machine monotonic clocks don't exist |
**Practical guidance:**
- **Efficiency locks**: Redlock is acceptable. Double-execution is annoying but not catastrophic.
- **Correctness locks**: Use consensus-based systems (ZooKeeper) with fencing tokens. Redlock's random values cannot fence.
## Production Implementations
### Google Chubby: The Original
**Context:** Internal distributed lock service powering GFS, BigTable, and other Google infrastructure. Chubby itself was never open-sourced, but its published design inspired ZooKeeper.
**Architecture:**
- 5-replica Paxos cluster
- Replicas elect master using Paxos; master lease is several seconds
- Client sessions with grace periods (45s default)
- Files + locks (locks are files with special semantics)
**Key design decisions:**
- **Coarse-grained locks**: Designed for locks held minutes to hours, not milliseconds
- **Advisory locks by default**: Files don't prevent access without explicit lock checks
- **Master lease renewal**: Master doesn't lose leadership on brief network blips
- **Client grace period**: On leader change, clients have 45s to reconnect before session (and locks) invalidate
**Fencing mechanism:** Chubby supports sequencers (fencing tokens). The lock service hands out sequencers; resources can verify them with Chubby before accepting writes.
> "If a process's lease has expired, the lock server will refuse to validate the sequencer."
> — Mike Burrows, "The Chubby Lock Service" (2006)
**Scale:** Chubby is not designed for high-frequency locking. It's optimized for reliability of infrequent operations, not throughput.
### Uber: Driver Assignment
**Context:** When a rider requests a cab, multiple nearby drivers could be assigned. Exactly one driver must be assigned per ride.
**Problem:**
- Multiple matching service instances receive the same request
- Race condition: both try to assign the same driver
- Result: driver assigned to multiple rides, customer experience failure
**Solution:**
- Distributed lock on `driver:{driver_id}` before assignment
- Lock held only during assignment operation (~10-100ms)
- Redis-based (likely Redlock or single-node with replication)
**Why it works:** This is an efficiency lock. If two services somehow both assign the same driver (lock failure), the booking system downstream rejects the duplicate. Occasional failures are detected and handled.
### Netflix: Job Deduplication
**Context:** Millions of distributed jobs, some triggered by events that may arrive multiple times.
**Problem:**
- Event arrives at multiple consumer instances
- Same job should execute exactly once
- Idempotency alone doesn't help if job has side effects
**Solution approach:**
- Acquire lock before processing event
- Lock key: `job:{event_id}:{job_type}`
- TTL: Expected job duration + buffer
- Combined with idempotency keys in downstream services
**Insight:** Netflix uses a layered approach—locks provide first-line deduplication, idempotent operations provide safety net, and monitoring detects drift.
### Implementation Comparison
| Aspect | Google Chubby | Uber | Netflix |
| --------- | -------------------------- | -------------------------- | -------------------------- |
| Lock type | Correctness | Efficiency | Efficiency |
| Duration | Minutes-hours | Milliseconds | Seconds |
| Backend | Paxos (custom) | Redis | Redis/ZK hybrid |
| Fencing | Sequencers | Downstream checks | Idempotency keys |
| Scale | Low freq, high reliability | High freq, acceptable loss | High freq, acceptable loss |
## Lock-Free Alternatives
### When to Avoid Locks Entirely
Distributed locks add complexity and failure modes. Before reaching for a lock, consider:
**1. Idempotent operations:**
If your operation can safely execute multiple times with the same result, you don't need a lock.
```typescript
// Bad: non-idempotent
async function incrementCounter(id: string) {
  const current = await db.get(id)
  await db.set(id, current + 1)
}
// Good: idempotent with versioning
async function setCounterIfMatch(id: string, expectedVersion: number, newValue: number) {
  await db
    .update(id)
    .where("version", expectedVersion)
    .set({ value: newValue, version: expectedVersion + 1 })
}
```
**2. Compare-and-Swap (CAS):**
Many databases support atomic CAS. Use it instead of external locks.
```sql
-- CAS-based update
UPDATE resources
SET value = 'new-value', version = version + 1
WHERE id = 'resource-1' AND version = 42;
-- Check rows affected - if 0, retry with fresh version
```
**3. Optimistic concurrency:**
Assume no conflicts; detect and retry on collision.
```typescript collapse={1-6}
interface VersionedResource {
  data: unknown
  version: number
}
async function optimisticUpdate(id: string, transform: (data: unknown) => unknown) {
  while (true) {
    const resource = await db.get(id)
    const newData = transform(resource.data)
    const updated = await db.update(id, {
      data: newData,
      version: resource.version + 1,
      _where: { version: resource.version },
    })
    if (updated) return // Success
    // Version conflict - retry
  }
}
```
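The loop above depends on a `db` object that is not defined here. For a version that runs on its own, the sketch below swaps in an in-memory store with a compare-and-set primitive. `VersionedStore`, `optimisticUpdateSync`, and the bounded `maxRetries` are assumptions for illustration; a real database would enforce the version check in the `UPDATE ... WHERE version = ?` statement.

```typescript
interface Versioned<T> {
  data: T
  version: number
}

// In-memory store with a compare-and-set primitive (stand-in for a database row).
class VersionedStore<T> {
  private rows = new Map<string, Versioned<T>>()

  get(id: string): Versioned<T> | undefined {
    return this.rows.get(id)
  }

  // Writes only if the stored version still equals expectedVersion (0 = row absent).
  compareAndSet(id: string, expectedVersion: number, data: T): boolean {
    const currentVersion = this.rows.get(id)?.version ?? 0
    if (currentVersion !== expectedVersion) return false
    this.rows.set(id, { data, version: expectedVersion + 1 })
    return true
  }
}

function optimisticUpdateSync<T>(
  store: VersionedStore<T>,
  id: string,
  initial: T,
  transform: (data: T) => T,
  maxRetries = 5,
): boolean {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const row = store.get(id)
    const next = transform(row?.data ?? initial)
    if (store.compareAndSet(id, row?.version ?? 0, next)) return true
    // Version conflict: loop re-reads and retries
  }
  return false // give up instead of spinning forever
}

const counters = new VersionedStore<number>()
optimisticUpdateSync(counters, "hits", 0, (n) => n + 1)
optimisticUpdateSync(counters, "hits", 0, (n) => n + 1)
console.log(counters.get("hits")) // { data: 2, version: 2 }

// A writer holding a stale version is rejected.
console.log(counters.compareAndSet("hits", 1, 99)) // false
```

Bounding the retries is a design choice: under heavy contention an unbounded loop can spin indefinitely, which is the point at which a lock or a queue becomes the better tool.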
**4. Queue-based serialization:**
Route all operations for a resource to a single queue/partition.

This eliminates concurrent access by design.
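A minimal in-process sketch of the routing idea follows. It assumes a fixed partition count and an FNV-1a style string hash; `PartitionedQueue` is a hypothetical name. In production this role is usually played by a partitioned log or message queue with one consumer per partition.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units: small and deterministic.
function fnv1a(key: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash >>> 0
}

class PartitionedQueue<T> {
  private partitions: T[][]

  constructor(partitionCount: number) {
    this.partitions = Array.from({ length: partitionCount }, () => [])
  }

  partitionFor(resourceId: string): number {
    return fnv1a(resourceId) % this.partitions.length
  }

  // Every operation for the same resource lands in the same partition, in arrival order.
  enqueue(resourceId: string, op: T): number {
    const p = this.partitionFor(resourceId)
    this.partitions[p].push(op)
    return p
  }

  // One consumer per partition drains it sequentially.
  drain(partition: number, handle: (op: T) => void): void {
    for (const op of this.partitions[partition].splice(0)) handle(op)
  }
}

const queue = new PartitionedQueue<string>(4)
const p1 = queue.enqueue("resource-1", "debit 10")
const p2 = queue.enqueue("resource-1", "credit 5")
console.log(p1 === p2) // true: same resource, same partition

const applied: string[] = []
queue.drain(p1, (op) => applied.push(op))
console.log(applied) // [ 'debit 10', 'credit 5' ]
```

Because exactly one consumer owns each partition, operations on a given resource are applied one at a time without any lock. The cost is that changing the partition count remaps resources, so repartitioning has to be coordinated.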
### Decision: Lock vs Lock-Free
| Factor | Use Distributed Lock | Use Lock-Free |
| ----------------------- | ---------------------------- | -------------------------------- |
| Operation complexity | Multi-step, non-decomposable | Single atomic operation |
| Conflict frequency      | Frequent (retries expensive) | Rare (CAS retries are cheap)     |
| Side effects | External (can't retry) | Local (can retry) |
| Existing infrastructure | Lock service available | Database has CAS |
| Team expertise | Lock patterns understood | Lock-free patterns understood |
## Common Pitfalls
### 1. Holding Locks Across Async Boundaries
**The mistake:** Acquiring lock, then making RPC calls or doing I/O while holding it.
```typescript
// Dangerous: lock held during external call
const lock = await acquireLock(resource)
const data = await externalService.fetch() // Network call!
await db.update(resource, data)
await releaseLock(lock)
```
**What goes wrong:**
- External call takes 10s; lock TTL is 5s
- Lock expires while you're still working
- Another client acquires and corrupts data
**Solution:** Minimize lock scope. Fetch data first, then lock-update-unlock quickly.
```typescript
// Better: minimize lock duration
const data = await externalService.fetch()
const lock = await acquireLock(resource)
await db.update(resource, data)
await releaseLock(lock)
```
### 2. Ignoring Lock Acquisition Failure
**The mistake:** Assuming lock acquisition always succeeds.
```typescript
// Dangerous: no failure handling
await acquireLock(resource)
await criticalOperation()
await releaseLock(resource)
```
**What goes wrong:**
- Lock service unavailable → operation proceeds without lock
- Lock contention → silent failure, concurrent access
**Solution:** Always check acquisition result and handle failure.
```typescript
const acquired = await acquireLock(resource)
if (!acquired) {
  throw new Error("Failed to acquire lock - cannot proceed")
}
try {
  await criticalOperation()
} finally {
  await releaseLock(resource)
}
```
### 3. Lock-Release Race with TTL
**The mistake:** Releasing a lock you no longer own (it expired and was re-acquired).
```typescript
// Dangerous: release without ownership check
await lock.release() // May delete another client's lock!
```
**What goes wrong:**
1. Your lock expires due to slow operation
2. Another client acquires the lock
3. Your `release()` deletes their lock
4. Third client acquires, now two clients think they have it
**Solution:** Atomic release that checks ownership (shown in Redis Lua script earlier).
### 4. Thundering Herd on Lock Release
**The mistake:** All waiting clients wake simultaneously when lock releases.
**What goes wrong with ZooKeeper naive implementation:**
- 1000 clients watch `/locks/resource` parent node
- Lock releases, all 1000 receive watch notification
- All 1000 call `getChildren()` simultaneously
- ZooKeeper overloaded, lock acquisition stalls
**Solution:** Watch predecessor only (shown in ZooKeeper recipe earlier). Only one client wakes per release.
### 5. Missing Fencing on Correctness-Critical Locks
**The mistake:** Using Redlock (or any lease-based lock) without fencing for correctness-critical operations.
```typescript
// Dangerous: no fencing
const lock = await redlock.acquire(resource)
await storage.write(data) // Stale lock holder can overwrite!
await redlock.release(lock)
```
**Solution:** Either use a lock service with fencing tokens (ZooKeeper) or accept that this lock is efficiency-only.
### 6. Session-Level Locks with Connection Pooling
**The mistake:** Using PostgreSQL session-level advisory locks with PgBouncer.
```sql
-- Acquired by connection in pool
SELECT pg_advisory_lock(12345);
-- Connection returned to pool
-- Other client reuses connection
-- Lock is still held by "other" client!
```
**Solution:** Use transaction-level locks with pooling.
```sql
BEGIN;
SELECT pg_advisory_xact_lock(12345);
-- Do work
COMMIT; -- Lock automatically released
```
## Conclusion
Distributed locking is a coordination primitive that requires careful consideration of failure modes, timing assumptions, and fencing requirements.
**Key decisions:**
1. **Efficiency vs correctness:** Most locks are for efficiency (preventing duplicate work). These can use simpler implementations with known failure modes. Correctness-critical locks require consensus protocols and fencing.
2. **Fencing is non-negotiable for correctness:** Without fencing tokens, lease expiration during long operations corrupts data. Random lock values (Redlock) cannot fence.
3. **Timing assumptions are dangerous:** Redlock's safety depends on bounded network delays, process pauses, and clock drift. Real systems violate all three.
4. **Consider lock-free alternatives:** Idempotent operations, CAS, optimistic concurrency, and queue-based serialization often work better than distributed locks.
**Start simple:** Single-node Redis locks work for most efficiency scenarios. Graduate to ZooKeeper with fencing only when correctness is critical and you understand the operational cost.
## Appendix
### Prerequisites
- Distributed systems fundamentals (network partitions, consensus)
- CAP theorem and consistency models
- Basic understanding of lease-based coordination
### Terminology
| Term | Definition |
| ------------------ | --------------------------------------------------------------------------------- |
| **Lease** | Time-bounded lock that expires automatically |
| **Fencing token** | Monotonically increasing identifier that resources use to reject stale operations |
| **TTL** | Time-To-Live; duration before lease expires |
| **Quorum** | Majority of nodes (⌊N/2⌋ + 1) required for consensus |
| **Split-brain** | Network partition where multiple partitions believe they are authoritative |
| **zxid** | ZooKeeper transaction ID; monotonically increasing, usable as fencing token |
| **Advisory lock** | Lock that doesn't prevent access—just signals intention |
| **Ephemeral node** | ZooKeeper node that is automatically deleted when the client session ends |
### Summary
- Distributed locks are **harder than they appear**—network partitions, clock drift, and process pauses all cause multiple clients to believe they hold the same lock
- **Leases** (auto-expiring locks) prevent deadlock but introduce the lease-expiration-during-work problem
- **Fencing tokens** solve this by having the resource reject operations from stale lock holders
- **Redlock** provides fault-tolerant efficiency locks but **cannot fence** (random values lack ordering)
- **ZooKeeper/etcd** provide fencing tokens (zxid/revision) but add operational complexity
- **Lock-free alternatives** (CAS, idempotency, queues) often work better than distributed locks
- For **correctness-critical** locks: use consensus + fencing; for **efficiency** locks: Redis single-node is often sufficient
### References
**Foundational:**
- [How to do distributed locking](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html) - Martin Kleppmann's analysis of Redlock.
- [Is Redlock safe?](https://antirez.com/news/101) - Salvatore Sanfilippo's response.
- [The Chubby Lock Service for Loosely-Coupled Distributed Systems](https://research.google/pubs/the-chubby-lock-service-for-loosely-coupled-distributed-systems/) - Mike Burrows, OSDI 2006.
**Implementation Documentation:**
- [Redis Distributed Locks](https://redis.io/docs/latest/develop/clients/patterns/distributed-locks/) - Official Redis distributed lock documentation.
- [ZooKeeper Recipes and Solutions](https://zookeeper.apache.org/doc/current/recipes.html) - Official ZooKeeper lock recipe.
- [etcd Concurrency API](https://etcd.io/docs/v3.5/dev-guide/api_concurrency_ref/) - etcd lease and lock APIs.
- [PostgreSQL Advisory Locks](https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS) - PostgreSQL documentation.
**Testing and Analysis:**
- [Jepsen: etcd 3.4.3](https://jepsen.io/analyses/etcd-3.4.3) - Kyle Kingsbury's analysis finding safety violations in etcd locks.
- [Designing Data-Intensive Applications](https://dataintensive.net/) - Martin Kleppmann. Chapter 8 covers distributed coordination.
**Libraries:**
- [Redisson](https://redisson.org/) - Redis Java client with distributed locks.
- [node-redlock](https://github.com/mike-marcacci/node-redlock) - Redlock implementation for Node.js.
- [Curator](https://curator.apache.org/) - ZooKeeper recipes including distributed locks.
---
## Exactly-Once Delivery
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/core-distributed-patterns/exactly-once-delivery
**Category:** System Design / Core Distributed Patterns
**Description:** True exactly-once delivery is impossible in distributed systems—the Two Generals Problem (1975) and FLP impossibility theorem (1985) prove this mathematically. What we call “exactly-once” is actually “effectively exactly-once”: at-least-once delivery combined with idempotency and deduplication mechanisms that ensure each message’s effect occurs exactly once, even when the message itself is delivered multiple times.
# Exactly-Once Delivery
True exactly-once delivery is impossible in distributed systems—the Two Generals Problem (1975) and FLP impossibility theorem (1985) prove this mathematically. What we call "exactly-once" is actually "effectively exactly-once": at-least-once delivery combined with idempotency and deduplication mechanisms that ensure each message's effect occurs exactly once, even when the message itself is delivered multiple times.

Exactly-once semantics: at-least-once delivery + idempotent consumption = effectively exactly-once effect.
## Abstract
The mental model for exactly-once semantics:
1. **Network unreliability is fundamental**—messages can be lost, duplicated, or reordered. No protocol can guarantee exactly-once delivery at the network layer.
2. **Exactly-once is a composition**: at-least-once delivery (never lose messages) + idempotency/deduplication (make duplicates harmless) = exactly-once effect.
3. **Three implementation layers**:
   - **Producer side**: Idempotent producers with sequence numbers (Kafka), idempotency keys (Stripe)
   - **Broker side**: Deduplication windows (SQS: 5 min, Azure: up to 7 days), FIFO ordering, transactional commits
   - **Consumer side**: Idempotent operations, processed-message tracking, atomic state updates
4. **The deduplication window trade-off**: Every system must choose how long to remember message IDs. Shorter windows save storage but risk duplicates from slow retries. Longer windows add overhead but catch more duplicates.
5. **Boundary matters**: Most exactly-once guarantees (Kafka, Pub/Sub) stop at system boundaries. Cross-system exactly-once requires additional patterns (transactional outbox, idempotent consumers).
## The Problem
### Why Naive Solutions Fail
**Approach 1: Fire-and-forget (at-most-once)**
Send the message once with no retries. If the network drops it, the message is lost forever.
- Fails because: Message loss is common—TCP connections drop, services restart, packets get corrupted
- Example: Payment notification lost → customer never knows payment succeeded → duplicate payment attempt
**Approach 2: Retry until acknowledged (at-least-once)**
Keep retrying until you receive an acknowledgment. Never lose a message.
- Fails because: The acknowledgment itself can be lost. Producer retries a message that was actually processed.
- Example: Transfer $100 → ack lost → retry → transfer $100 again → $200 withdrawn
**Approach 3: Distributed transactions (two-phase commit)**
Coordinate sender and receiver in a distributed transaction to ensure atomic delivery.
- Fails because: Blocks on coordinator availability, doesn't handle network partitions, terrible performance
- Example: 2PC coordinator fails while holding locks → all participants blocked indefinitely
### The Core Challenge
The fundamental tension: **reliability requires retries, but retries create duplicates**.
The Two Generals Problem proves this mathematically. Two parties cannot achieve certainty of agreement over an unreliable channel—any finite sequence of confirmations leaves doubt about whether the final message arrived.
> **FLP Impossibility (Fischer, Lynch, Paterson, 1985)**: No deterministic algorithm can solve consensus in an asynchronous system where even one process may fail. The theorem assumes reliable message delivery but unbounded delays; under those assumptions, any fault-tolerant consensus algorithm has runs that never terminate.
Practical systems circumvent FLP through randomized algorithms (Ben-Or, Rabin), partial synchrony assumptions (Paxos, Raft), or failure detectors. The implication for exactly-once: we cannot guarantee it at the protocol level, so we make duplicates harmless instead.
## Delivery Semantics
### At-Most-Once
Each message is delivered zero or one times. Messages may be lost but are never redelivered.
**Implementation**: Send once, no retries, no acknowledgment tracking.
**Trade-offs**:
- ✅ Lowest latency and complexity
- ✅ No duplicate handling needed
- ❌ Data loss is guaranteed over time
- ❌ Unsuitable for critical operations
**Use cases**: Metrics collection, logging, real-time analytics where occasional loss is acceptable.
### At-Least-Once
Each message is delivered one or more times. Messages are never lost, but duplicates occur.
**Implementation**: Retry with exponential backoff until acknowledgment received. Store unacked messages durably.
**Trade-offs**:
- ✅ No data loss
- ✅ Simple to implement
- ❌ Consumer must handle duplicates
- ❌ Ordering not guaranteed with retries
**Use cases**: Event sourcing, audit logs, any system where data loss is unacceptable and consumers are idempotent.
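The retry loop described above can be sketched as follows. This is a minimal illustration under assumed names: `backoffDelayMs`, `sendAtLeastOnce`, and the default delays are hypothetical, and a production sender would also persist unacknowledged messages and add jitter.

```typescript
// Delay before retrying after failed attempt `attempt` (0-based), capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 100, maxMs = 10000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt))
}

// Retry `send` until it resolves (acknowledged) or the attempts run out.
// Returns how many times the message was sent, which may exceed one.
async function sendAtLeastOnce(
  send: () => Promise<void>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<number> {
  let lastError: unknown = new Error("maxAttempts must be at least 1")
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await send()
      return attempt + 1
    } catch (error) {
      lastError = error
      // Wait before the next attempt, but not after the final one
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt))
      }
    }
  }
  throw lastError
}
```

A failed `send` here may mean only that the acknowledgment was lost, so every retry is a potential duplicate on the receiving side. That is why this delivery mode depends on idempotent consumers.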
### Exactly-Once (Effectively)
Each message's effect occurs exactly once. The message may be delivered multiple times, but the system ensures idempotent processing.
**Implementation**: At-least-once delivery + one of:
- Idempotent operations (natural or designed)
- Deduplication at consumer (track processed message IDs)
- Transactional processing (atomic read-process-write)
**Trade-offs**:
- ✅ No data loss, no duplicate effects
- ❌ Higher complexity and latency
- ❌ Requires coordination between producer, broker, and consumer
- ❌ Deduplication window creates edge cases
**Use cases**: Financial transactions, order processing, any operation where duplicates cause real-world harm.
### Comparison
| Aspect | At-Most-Once | At-Least-Once | Exactly-Once |
| -------------- | ------------ | ------------- | ------------------------- |
| Message loss | Possible | Never | Never |
| Duplicates | Never | Possible | Prevented |
| Complexity | Low | Medium | High |
| Latency | Lowest | Medium | Highest |
| State required | None | Retry queue | Dedup store + retry queue |
## Design Paths
### Path 1: Idempotent Operations
Make the operation itself idempotent—applying it multiple times produces the same result as applying it once.
**When to choose this path:**
- Operations are naturally idempotent (SET vs INCREMENT)
- You control the consumer's state model
- Minimal infrastructure investment desired
**Key characteristics:**
- No deduplication storage required
- Works regardless of delivery semantics
- Requires careful operation design
**Natural idempotency examples:**
```typescript
// SET operations are naturally idempotent
await db.query("UPDATE users SET email = $1 WHERE id = $2", [email, userId])
// DELETE with specific criteria is idempotent
await db.query("DELETE FROM sessions WHERE user_id = $1 AND token = $2", [userId, token])
// Reads (SELECT) do not change state, so they are idempotent
const user = await db.query("SELECT * FROM users WHERE id = $1", [userId])
```
**Non-idempotent operations that need transformation:**
```typescript
// ❌ Non-idempotent: INCREMENT
await db.query("UPDATE accounts SET balance = balance + $1 WHERE id = $2", [amount, accountId])
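// ✅ Idempotent version: SET with version check
// newBalance is computed from the row read at expectedVersion; a retry finds
// the version already advanced, matches zero rows, and changes nothing
await db.query(
  `
  UPDATE accounts
  SET balance = $1, version = $2
  WHERE id = $3 AND version = $4
  `,
  [newBalance, newVersion, accountId, expectedVersion],
)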
```
**Trade-offs vs other paths:**
| Aspect | Idempotent Operations | Deduplication |
| ----------------- | --------------------------- | ----------------------- |
| Storage overhead | None | Message ID store |
| Design complexity | Higher (rethink operations) | Lower (add dedup layer) |
| Failure modes | Version conflicts | Window expiry |
| Latency | Lower | Higher (dedup lookup) |
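To see why the version check makes a retried increment harmless, here is a minimal in-memory sketch. The `accounts` map and `applyDeposit` helper are hypothetical stand-ins for the database row and the versioned `UPDATE`.

```typescript
interface Account {
  balance: number
  version: number
}

const accounts = new Map<string, Account>([["acc-1", { balance: 100, version: 1 }]])

// Apply a deposit only if the caller's expected version still matches,
// mirroring UPDATE ... WHERE id = $3 AND version = $4.
function applyDeposit(id: string, amount: number, expectedVersion: number): boolean {
  const account = accounts.get(id)
  if (account === undefined || account.version !== expectedVersion) {
    return false // zero rows matched: duplicate or concurrent update
  }
  accounts.set(id, { balance: account.balance + amount, version: expectedVersion + 1 })
  return true
}

const firstAttempt = applyDeposit("acc-1", 50, 1) // applied, version becomes 2
const retriedAttempt = applyDeposit("acc-1", 50, 1) // same message redelivered: no-op
```

The caller must distinguish "already applied" from "lost a race with a different update". Both surface as a version conflict, which is the failure mode listed in the table above.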
### Path 2: Idempotency Keys (API Pattern)
Client generates a unique key per logical operation. Server tracks keys and returns cached results for duplicates.
**When to choose this path:**
- Exposing APIs to external clients
- Operations are not naturally idempotent
- Client controls retry behavior
**Key characteristics:**
- Client generates unique key (UUID v4)
- Server stores operation result keyed by idempotency key
- Subsequent requests with same key return cached result
- Keys expire after a window (typically 24 hours)
**Implementation approach:**
```typescript
// Server-side idempotency key handling
import { Redis } from "ioredis"

interface IdempotencyRecord {
  status: "processing" | "completed" | "failed"
  response?: unknown
  createdAt: number
}

async function handleWithIdempotency(
  redis: Redis,
  idempotencyKey: string,
  operation: () => Promise<unknown>,
): Promise<{ cached: boolean; response: unknown }> {
  const key = `idem:${idempotencyKey}`

  // Atomically claim the key: NX sets it only if absent, and the 1-hour TTL
  // bounds the "processing" state so a crashed request cannot block retries forever
  const processing: IdempotencyRecord = { status: "processing", createdAt: Date.now() }
  const claimed = await redis.set(key, JSON.stringify(processing), "EX", 3600, "NX")

  if (claimed === null) {
    // Another request already owns this key
    const existing = await redis.get(key)
    const record: IdempotencyRecord | null = existing ? JSON.parse(existing) : null
    if (record !== null && record.status === "completed") {
      return { cached: true, response: record.response }
    }
    // Still processing - the HTTP layer should map this to 409 Conflict
    throw new Error("Request already in progress")
  }

  try {
    // Execute the operation and cache the result for the 24-hour window
    const response = await operation()
    const completed: IdempotencyRecord = { status: "completed", response, createdAt: processing.createdAt }
    await redis.set(key, JSON.stringify(completed), "EX", 86400)
    return { cached: false, response }
  } catch (error) {
    // Release the key so a retry re-executes the failed operation
    await redis.del(key)
    throw error
  }
}
```
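The full request lifecycle (claim the key, execute, cache the result, release on failure) is easier to follow without a Redis dependency. The sketch below is a synchronous in-memory illustration; `handleOnce` and the `records` map are hypothetical names, and it deliberately omits TTLs and cross-process atomicity.

```typescript
type IdemRecord = { status: "processing" } | { status: "completed"; response: unknown }

const records = new Map<string, IdemRecord>()

function handleOnce(key: string, operation: () => unknown): { cached: boolean; response: unknown } {
  const existing = records.get(key)
  if (existing !== undefined && existing.status === "completed") {
    return { cached: true, response: existing.response } // replay the stored result
  }
  if (existing !== undefined) {
    throw new Error("Request already in progress")
  }
  records.set(key, { status: "processing" })
  try {
    const response = operation()
    records.set(key, { status: "completed", response })
    return { cached: false, response }
  } catch (error) {
    records.delete(key) // allow a retry to re-execute the failed operation
    throw error
  }
}

let chargeCount = 0
const charge = () => {
  chargeCount += 1
  return { id: "ch_1", amount: 100 }
}
const firstCall = handleOnce("key-1", charge)
const secondCall = handleOnce("key-1", charge) // network retry with the same key
```

The second call returns the stored response without running `charge` again, which is the property that prevents a duplicate payment.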
**Stripe's implementation details:**
- Keys stored in Redis cluster shared across all API servers
- 24-hour retention window
- Keys recycled after window expires
- Response includes original status code and body
**Real-world example:**
Stripe processes millions of payment requests daily. Their idempotency key system:
- Client includes `Idempotency-Key` header with UUID
- Server returns `Idempotent-Replayed: true` header for cached responses
- First request that fails partway through is re-executed on retry
- First request that succeeds is returned from cache on retry
Result: Zero duplicate charges from network retries.
### Path 3: Broker-Side Deduplication
Message broker tracks message IDs and filters duplicates before delivery to consumers.
**When to choose this path:**
- Using a message broker that supports deduplication
- Want to offload deduplication from consumers
- Willing to accept deduplication window constraints
**Key characteristics:**
- Producer assigns unique message ID
- Broker maintains recent message IDs in memory/storage
- Duplicates filtered before consumer delivery
- Window-based: IDs forgotten after expiry
**Kafka idempotent producer (since 0.11, default since 3.0):**
The broker assigns a 64-bit Producer ID (PID) to each producer instance. The producer assigns monotonically increasing 32-bit sequence numbers per topic-partition:
```
Producer → [PID: 12345, Seq: 0] → Broker (accepts)
Producer → [PID: 12345, Seq: 1] → Broker (accepts)
Producer → [PID: 12345, Seq: 1] → Broker (duplicate, rejects)
Producer → [PID: 12345, Seq: 3] → Broker (out-of-order, error)
```
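The accept/duplicate/out-of-order decisions in the diagram can be sketched as a per-producer sequence check. This is a simplified model, not Kafka's broker code: `PartitionLog` is a hypothetical name, and it tracks a single last sequence number per producer instead of per-batch metadata.

```typescript
type AppendResult = "accepted" | "duplicate" | "out_of_order"

class PartitionLog {
  // Last sequence number appended, keyed by producer ID
  private lastSeq = new Map<number, number>()

  append(pid: number, seq: number): AppendResult {
    const stored = this.lastSeq.get(pid)
    const last = stored === undefined ? -1 : stored
    if (seq <= last) return "duplicate" // already appended: do not write again
    if (seq !== last + 1) return "out_of_order" // gap: an earlier message is missing
    this.lastSeq.set(pid, seq)
    return "accepted"
  }
}

const partitionLog = new PartitionLog()
const results = [0, 1, 1, 3].map((seq) => partitionLog.append(12345, seq))
```

Running the four sends from the diagram yields accepted, accepted, duplicate, out_of_order. The state is keyed by producer ID, which is why a restarted producer with a fresh PID is not protected.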
**Configuration (Kafka 3.0+):**
```properties
# Defaults changed in Kafka 3.0 - these are now on by default
enable.idempotence=true
acks=all
```
> **Prior to Kafka 3.0**: `enable.idempotence` defaulted to `false` and `acks` defaulted to `1`. Enabling idempotence required explicit configuration.
**Key limitation**: Idempotence is only guaranteed within a producer session. If the producer crashes and restarts without a `transactional.id`, it gets a new PID and sequence numbers reset—previously sent messages may be duplicated.
**AWS SQS FIFO deduplication:**
- **Fixed** 5-minute deduplication window (cannot be changed)
- Two methods: explicit `MessageDeduplicationId` or content-based (SHA-256 of body)
- After window expires, same ID can be submitted again
- Best practice: anchor `MessageDeduplicationId` to business context (e.g., `order-12345.payment`)
- With partitioning enabled: `MessageDeduplicationId + MessageGroupId` determines uniqueness
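The window behavior can be sketched with a small in-memory model. `DedupWindow` is a hypothetical class for illustration, not the SQS implementation; it takes the current time as a parameter so the expiry behavior is easy to see.

```typescript
class DedupWindow {
  private seenAt = new Map<string, number>()
  private windowMs: number

  constructor(windowMs: number) {
    this.windowMs = windowMs
  }

  // Returns true if the message should be delivered, false if it is a duplicate.
  accept(dedupId: string, nowMs: number): boolean {
    const firstSeen = this.seenAt.get(dedupId)
    if (firstSeen !== undefined && nowMs - firstSeen < this.windowMs) {
      return false // same ID inside the window: drop
    }
    this.seenAt.set(dedupId, nowMs) // new ID, or the previous entry has expired
    return true
  }
}

const MINUTE_MS = 60 * 1000
const dedupWindow = new DedupWindow(5 * MINUTE_MS)
const sentAtZero = dedupWindow.accept("order-12345.payment", 0)
const retryAtFourMin = dedupWindow.accept("order-12345.payment", 4 * MINUTE_MS)
const retryAtSixMin = dedupWindow.accept("order-12345.payment", 6 * MINUTE_MS)
```

The retry at four minutes is dropped, but the retry at six minutes is accepted as a new message because the window has expired. Retry policies therefore need to finish inside the deduplication window.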
**Trade-offs vs other paths:**
| Aspect | Broker-Side | Consumer-Side |
| -------------------- | -------------- | -------------------- |
| Consumer complexity | Lower | Higher |
| Dedup window control | Broker-defined | Application-defined |
| Cross-broker dedup | No | Yes |
| Storage location | Broker | Application database |
### Path 4: Consumer-Side Deduplication
Consumer tracks processed message IDs and skips duplicates.
**When to choose this path:**
- Broker doesn't support deduplication
- Need longer deduplication windows than broker provides
- Want application-level control over dedup logic
**Key characteristics:**
- Consumer stores processed message IDs durably
- Check before processing; skip if seen
- ID storage must be in same transaction as state updates
- Flexible window: can retain IDs indefinitely
**Implementation with database constraints:**
```typescript
// Idempotent consumer with database constraints
import { Pool, PoolClient } from "pg"

interface Message {
  id: string
  payload: unknown
}

async function processIdempotently(
  pool: Pool,
  subscriberId: string,
  message: Message,
  // The handler receives the transaction's client so its state updates
  // commit or roll back together with the processed-message record
  handler: (payload: unknown, client: PoolClient) => Promise<void>,
): Promise<{ processed: boolean }> {
  const client = await pool.connect()
  try {
    await client.query("BEGIN")
    // Insert message ID; ON CONFLICT DO NOTHING returns no row for a duplicate
    const result = await client.query(
      `INSERT INTO processed_messages (subscriber_id, message_id, processed_at)
       VALUES ($1, $2, NOW())
       ON CONFLICT DO NOTHING
       RETURNING message_id`,
      [subscriberId, message.id],
    )
    if (result.rowCount === 0) {
      // Duplicate - skip processing
      await client.query("ROLLBACK")
      return { processed: false }
    }
    // Process message (state updates happen here, inside the transaction)
    await handler(message.payload, client)
    await client.query("COMMIT")
    return { processed: true }
  } catch (error) {
    // Roll back so a failed handler leaves no processed-message record behind
    await client.query("ROLLBACK")
    throw error
  } finally {
    client.release()
  }
}
```
**Schema:**
```sql
CREATE TABLE processed_messages (
subscriber_id VARCHAR(255) NOT NULL,
message_id VARCHAR(255) NOT NULL,
processed_at TIMESTAMP NOT NULL DEFAULT NOW(),
PRIMARY KEY (subscriber_id, message_id)
);
-- Index for cleanup queries
CREATE INDEX idx_processed_messages_time
ON processed_messages (processed_at);
```
**Real-world example:**
A payment processor handling webhook retries:
- Each webhook includes unique `event_id`
- Before processing: check if `event_id` exists in `processed_webhooks` table
- If exists: return 200 OK immediately (idempotent response)
- If not: process event, insert ID, return 200 OK
- Daily job: delete records older than 30 days
Result: Webhooks can be retried indefinitely without duplicate effects.
### Path 5: Transactional Processing
Wrap read-process-write into an atomic transaction. Either all effects happen or none do.
**When to choose this path:**
- Using Kafka with exactly-once requirements
- Processing involves read → transform → write pattern
- Need atomicity across multiple output topics/partitions
**Key characteristics:**
- Producer, consumer, and state updates are transactional
- Consumer offset committed as part of transaction
- Aborted transactions don't affect state
- Requires `isolation.level=read_committed` on consumers
**Kafka transactional producer/consumer:**
```typescript
// Kafka exactly-once consume-transform-produce
import { Kafka, EachMessagePayload } from "kafkajs"

const kafka = new Kafka({ brokers: ["localhost:9092"] })
const producer = kafka.producer({
  transactionalId: "my-transactional-producer",
  maxInFlightRequests: 1,
  idempotent: true,
})
const consumer = kafka.consumer({
  groupId: "my-group",
  readUncommitted: false, // read_committed isolation
})

async function processExactlyOnce() {
  await producer.connect()
  await consumer.connect()
  await consumer.subscribe({ topic: "input-topic" })
  await consumer.run({
    // Offsets are committed through the transaction, not by the consumer
    autoCommit: false,
    eachMessage: async ({ topic, partition, message }: EachMessagePayload) => {
      const transaction = await producer.transaction()
      try {
        // Transform message
        const result = transform(message.value)
        // Produce to output topic (within transaction)
        await transaction.send({
          topic: "output-topic",
          messages: [{ value: result }],
        })
        // Commit consumer offset (within same transaction).
        // The committed offset is the next message to read, hence + 1.
        const nextOffset = (Number(message.offset) + 1).toString()
        await transaction.sendOffsets({
          consumerGroupId: "my-group",
          topics: [{ topic, partitions: [{ partition, offset: nextOffset }] }],
        })
        await transaction.commit()
      } catch (error) {
        await transaction.abort()
        throw error
      }
    },
  })
}

function transform(value: Buffer | null): string {
  // Your transformation logic
  return value?.toString().toUpperCase() ?? ""
}
```
**Kafka's transactional guarantees:**
- **Atomicity**: All messages in transaction commit together or none commit
- **Isolation**: Consumers with `read_committed` only see committed messages
- **Durability**: Committed transactions survive broker failures
**Trade-offs vs other paths:**
| Aspect | Transactional | Consumer-Side Dedup |
| ------------ | --------------------- | ------------------------ |
| Latency | Higher (coordination) | Lower |
| Complexity | Framework handles | Application handles |
| Cross-system | Kafka ecosystem only | Works with any broker |
| Recovery | Automatic | Manual offset management |
### Path 6: Transactional Outbox
Solve the dual-write problem by writing business data and events to the same database transaction, then asynchronously publishing events to the message broker.
**When to choose this path:**
- Need to update a database AND publish an event atomically
- Cannot tolerate lost events or phantom events (event published but DB write failed)
- Using a broker that doesn't support distributed transactions with your database
**Key characteristics:**
- Business data and outbox event written in single database transaction
- Background process reads outbox and publishes to broker
- Events marked as published or deleted after successful publish
- Requires idempotent consumers (relay may publish duplicates on crash recovery)
**Implementation approaches:**
| Approach | Description | Trade-offs |
| ------------------------- | ------------------------------------------------------ | --------------------------------------------------- |
| Polling Publisher | Background service polls outbox table periodically | Simple, adds latency (polling interval) |
| Change Data Capture (CDC) | Tools like Debezium read database transaction logs | Lower latency, preserves order, more infrastructure |
| Log-only outbox | PostgreSQL logical decoding without materializing rows | Minimal database growth, Postgres-specific |
**Implementation with polling publisher:**
```typescript
// Transactional outbox pattern
import { randomUUID } from "node:crypto"
import { Pool } from "pg"

interface OutboxEvent {
  id: string
  aggregateType: string
  aggregateId: string
  eventType: string
  payload: unknown
  createdAt: Date
}

// Assumed shapes for illustration; adapt to your domain model and broker client
interface Order {
  id: string
  customerId: string
  total: number
}

interface MessagePublisher {
  publish(event: OutboxEvent): Promise<void>
}

async function createOrderWithEvent(pool: Pool, order: Order): Promise<void> {
  const client = await pool.connect()
  try {
    await client.query("BEGIN")
    // Write business data
    await client.query(
      `INSERT INTO orders (id, customer_id, total, status)
       VALUES ($1, $2, $3, $4)`,
      [order.id, order.customerId, order.total, "created"],
    )
    // Write event to outbox in same transaction
    await client.query(
      `INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)
       VALUES ($1, $2, $3, $4, $5)`,
      [randomUUID(), "Order", order.id, "OrderCreated", JSON.stringify(order)],
    )
    await client.query("COMMIT")
  } catch (error) {
    await client.query("ROLLBACK")
    throw error
  } finally {
    client.release()
  }
}

// Polling publisher (separate process)
async function publishOutboxEvents(pool: Pool, publisher: MessagePublisher): Promise<void> {
  const client = await pool.connect()
  // ... polling and publishing logic
}
```
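The relay step elided above can be sketched in memory to show why it yields at-least-once publication. `OutboxRow` and `relayOnce` are hypothetical names; a database-backed relay would select rows where `published_at IS NULL` and update them after a successful publish.

```typescript
interface OutboxRow {
  id: string
  eventType: string
  payload: string
  publishedAt: number | null
}

// One relay pass: publish unpublished rows in insertion order and mark each
// row only after the broker call succeeds. If the process crashes between
// publish and mark, the next pass publishes that row again.
function relayOnce(rows: OutboxRow[], publish: (row: OutboxRow) => void, nowMs: number): number {
  let publishedCount = 0
  for (const row of rows) {
    if (row.publishedAt !== null) continue
    publish(row) // throws on broker failure; the row stays unpublished
    row.publishedAt = nowMs
    publishedCount += 1
  }
  return publishedCount
}

const outboxRows: OutboxRow[] = [
  { id: "e1", eventType: "OrderCreated", payload: "{}", publishedAt: null },
  { id: "e2", eventType: "OrderPaid", payload: "{}", publishedAt: null },
]
const sentIds: string[] = []
const recordSend = (row: OutboxRow): void => {
  sentIds.push(row.id)
}
const firstPass = relayOnce(outboxRows, recordSend, 1000)
const secondPass = relayOnce(outboxRows, recordSend, 2000)
```

Marking after publishing means a row is never lost, at the cost of possible re-publication. That is why the pattern requires idempotent consumers.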
**Outbox table schema:**
```sql
CREATE TABLE outbox (
id UUID PRIMARY KEY,
aggregate_type VARCHAR(255) NOT NULL,
aggregate_id VARCHAR(255) NOT NULL,
event_type VARCHAR(255) NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
published_at TIMESTAMP NULL
);
CREATE INDEX idx_outbox_unpublished ON outbox (created_at)
WHERE published_at IS NULL;
```
**Trade-offs vs direct publishing:**
| Aspect | Transactional Outbox | Direct to Broker |
| -------------- | ------------------------- | ------------------ |
| Atomicity | Guaranteed | Dual-write problem |
| Latency | Higher (async relay) | Lower (direct) |
| Complexity | Higher (outbox + relay) | Lower |
| Ordering | Preserved (by created_at) | Depends on broker |
| Infrastructure | Database + relay process | Broker only |
**Real-world usage:**
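Debezium's outbox event router reads the outbox table via CDC and publishes to Kafka, which eliminates the need for a custom polling publisher. Delivery into Kafka is at-least-once by default, so consumers should still deduplicate. Exactly-once delivery into Kafka additionally requires Kafka Connect's exactly-once source support (KIP-618, Kafka 3.3+).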
### Decision Framework

## Production Implementations
### Kafka: Confluent's EOS
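**Context:** Apache Kafka was originally developed at LinkedIn and is now an Apache Software Foundation project, with Confluent (founded by Kafka's original creators) as a major contributor. It processes trillions of messages per day across major tech companies.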
**Implementation choices:**
- Pattern variant: Idempotent producer + transactional processing
- Key customization: Producer ID (PID) with per-partition sequence numbers
- Scale: Tested at 1M+ messages/second with exactly-once guarantees
**Architecture:**

**Specific details:**
- Broker assigns 64-bit Producer ID to each producer on init
- Sequence numbers are per topic-partition, 32-bit integers
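- Broker retains sequence metadata for the last 5 batches per producer and partition, which is why `max.in.flight.requests.per.connection` must be 5 or lower with idempotence enabled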
- `transactional.id` persists PID across producer restarts
- Transaction coordinator manages two-phase commit for multi-partition writes
**Version evolution:**
| Version | Change |
| ---------- | ----------------------------------------------------------------------------------- |
| Kafka 0.11 | KIP-98 introduced idempotent producers and transactions |
| Kafka 2.5 | EOS v2 introduced (improved scalability via KIP-447) |
| Kafka 3.0 | `enable.idempotence=true` and `acks=all` become defaults |
| Kafka 3.3 | KIP-618 added exactly-once for Kafka Connect source connectors |
| Kafka 4.0 | Removed deprecated `exactly_once` and `exactly_once_beta` processing guarantees |
> **KIP-939 (in progress, 2024)**: Enables Kafka to participate in externally-coordinated 2PC transactions with databases. Adds a "prepare" RPC to the transaction coordinator, enabling atomic dual-writes between Kafka and external databases without the transactional outbox pattern.
**What worked:**
- Idempotency overhead low enough that it became the default (Kafka 3.0+)
- Transactional writes perform within 3% of non-transactional in benchmarks
**What was hard:**
- Transaction coordinator becomes single point of coordination
- `transactional.id` management across producer instances
- Consumer rebalancing during transaction can cause duplicates if not using `read_committed`
- **EOS stops at the Kafka cluster boundary**—beyond Kafka, consumers must be idempotent
**Source:** [KIP-98 - Exactly Once Delivery and Transactional Messaging](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging)
### Stripe: Idempotency Keys
**Context:** Payment processing platform handling millions of API requests daily. A single duplicate charge causes real financial harm.
**Implementation choices:**
- Pattern variant: Client-generated idempotency keys with server-side caching
- Key customization: 24-hour retention, Redis-backed distributed cache
- Scale: Handles all Stripe API traffic with idempotency support
**Architecture:**

**Specific details:**
- Keys are user-provided strings up to 255 characters
- Response cached includes status code, headers, and body
- `Idempotent-Replayed: true` header indicates cached response
- Keys in "processing" state return 409 Conflict on retry
- Separate Redis cluster for idempotency to isolate failure domains
**What worked:**
- Completely eliminates duplicate charges from network issues
- Clients can safely retry with exponential backoff
- No application logic changes needed for idempotent endpoints
**What was hard:**
- Determining correct 24-hour window (too short = duplicates, too long = storage cost)
- Handling partial failures (charge succeeded but idempotency record write failed)
- Cross-datacenter replication of idempotency store
**Source:** [Designing robust and predictable APIs with idempotency](https://stripe.com/blog/idempotency)
### Google Pub/Sub: Exactly-Once Delivery
**Context:** Google Cloud's managed messaging service. Added exactly-once delivery (GA December 2022).
**Implementation choices:**
- Pattern variant: Broker-side deduplication with acknowledgment tracking
- Key customization: Regional scope, unique message IDs
- Scale: Google-scale messaging with exactly-once in single region
**Specific details:**
- Exactly-once guaranteed **within a single cloud region only**
- Uses unique message IDs assigned by Pub/Sub
- Subscribers receive acknowledgment confirmation (acknowledge succeeded or failed)
- Only the **latest acknowledgment ID** can acknowledge a message—previous IDs fail
- Default ack deadline: 60 seconds if unspecified
**Supported configurations:**
| Feature | Exactly-Once Support |
| -------------------- | -------------------- |
| Pull subscriptions | Yes |
| StreamingPull API | Yes |
| Push subscriptions | No |
| Export subscriptions | No |
**Performance trade-offs:**
- **Higher latency**: Significantly higher publish-to-subscribe latency vs regular subscriptions
- **Throughput limitation**: ~1,000 messages/second when combined with ordered delivery
- **Publish-side duplicates**: Subscription may still receive duplicates originating from the publish side
**Client library minimum versions (required for exactly-once):**
| Language | Version |
| -------- | --------- |
| Python | v2.13.6+ |
| Java | v1.139.0+ |
| Go | v1.25.1+ |
| Node.js | v3.2.0+ |
**What worked:**
- Eliminates need for application-level deduplication in many cases
- Ack confirmation tells subscriber definitively if message was processed
**What was hard:**
- **Regional constraint**: Cross-region subscribers may receive duplicates
- Push subscriptions excluded (no ack confirmation mechanism)
- Still requires idempotent handlers for regional failover scenarios
- "The feature does not provide any guarantees around exactly-once side effects"—side effects are outside scope
**Source:** [Cloud Pub/Sub exactly-once delivery](https://cloud.google.com/pubsub/docs/exactly-once-delivery)
### Azure Service Bus: Duplicate Detection
**Context:** Microsoft's managed messaging service with configurable deduplication windows.
**Implementation choices:**
- Pattern variant: Broker-side deduplication with configurable window
- Key customization: Window from 20 seconds to 7 days
- Limitation: Standard/Premium tiers only (not Basic)
**Specific details:**
- Tracks `MessageId` of all messages during the deduplication window
- Duplicate messages are **accepted** (send succeeds) but **instantly dropped**
- Cannot enable/disable duplicate detection after queue/topic creation
- With partitioning: `MessageId + PartitionKey` determines uniqueness
**Configuration:**
```
duplicateDetectionHistoryTimeWindow: P7D // ISO 8601 duration, max 7 days
```
**Best practice for MessageId:**
```
{application-context}.{message-subject}
Example: purchase-order-12345.payment
```
**Trade-off:** Longer windows provide better duplicate protection but consume more storage for tracking message IDs.
**Source:** [Azure Service Bus duplicate detection](https://learn.microsoft.com/en-us/azure/service-bus-messaging/duplicate-detection)
### NATS JetStream: Message Deduplication
**Context:** High-performance messaging system with built-in deduplication.
**Implementation choices:**
- Pattern variant: Header-based deduplication with sliding window
- Key customization: Configurable window (default 2 minutes)
- Alternative: Infinite deduplication via `DiscardNewPerSubject` (NATS 2.9.0+)
**Specific details:**
- Uses `Nats-Msg-Id` header for duplicate detection
- Server tracks message IDs within deduplication window
- **Double acknowledgment** mechanism prevents erroneous re-sends after failures
**Infinite deduplication pattern (NATS 2.9.0+):**
```
// Stream configuration for infinite deduplication
{
"discard": "new",
"discard_new_per_subject": true,
"max_msgs_per_subject": 1
}
// Include unique ID in subject
Subject: orders.create.{order-id}
```
Publish fails if a message with that subject already exists—provides infinite exactly-once publication.
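The effect of this stream configuration can be modeled with a small in-memory sketch. `PerSubjectStream` is a hypothetical class, not the NATS client API; it only mimics the rule that a subject may hold at most one message and that new messages are discarded when the limit is reached.

```typescript
class PerSubjectStream {
  private bySubject = new Map<string, string>()

  // Mimics discard=new with max_msgs_per_subject=1:
  // reject the publish when the subject already holds a message.
  publish(subject: string, data: string): boolean {
    if (this.bySubject.has(subject)) {
      return false
    }
    this.bySubject.set(subject, data)
    return true
  }
}

const orderStream = new PerSubjectStream()
const firstPublish = orderStream.publish("orders.create.12345", "{\"total\": 100}")
const duplicatePublish = orderStream.publish("orders.create.12345", "{\"total\": 100}")
const otherOrder = orderStream.publish("orders.create.67890", "{\"total\": 250}")
```

Deduplication lasts as long as the original message is retained, which is what removes the time window. The trade-off is that the unique ID must be part of the subject.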
**Source:** [JetStream Model Deep Dive](https://docs.nats.io/using-nats/developer/develop_jetstream/model_deep_dive)
### Apache Pulsar: Transactions
**Context:** Multi-tenant distributed messaging with native exactly-once support since Pulsar 2.8.0.
**Implementation choices:**
- Pattern variant: Transaction API for atomic produce and acknowledgement
- Key customization: Cross-topic atomicity
- Scale: Used in production at Yahoo, Tencent, Verizon
**Specific details:**
- Transaction API enables atomic produce and acknowledgement across multiple topics
- Idempotent producer + exactly-once semantics at single partition level
- If transaction aborts, all writes and acknowledgments roll back
**Transaction flow:**
```
1. Begin transaction
2. Produce to topic A (within transaction)
3. Produce to topic B (within transaction)
4. Acknowledge consumed message (within transaction)
5. Commit or abort
```
**Integration:** Works with Apache Flink via `TwoPhaseCommitSinkFunction` for end-to-end exactly-once.
**Source:** [Pulsar Transactions](https://pulsar.apache.org/docs/next/txn-what/)
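The transaction flow above can be sketched as an in-memory model in which produced messages and acknowledgments are buffered and only become visible on commit. This is an illustration of the atomicity property only; `SketchTransaction` and its methods are hypothetical and are not the Pulsar client API.

```typescript
// Hypothetical in-memory model of atomic produce + acknowledge.
type TopicLog = Map<string, string[]>

class SketchTransaction {
  private pendingWrites: Array<{ topic: string; payload: string }> = []
  private pendingAcks: string[] = []
  private finished = false
  private topics: TopicLog
  private acked: Set<string>

  constructor(topics: TopicLog, acked: Set<string>) {
    this.topics = topics
    this.acked = acked
  }

  produce(topic: string, payload: string): void {
    this.pendingWrites.push({ topic, payload }) // buffered, invisible until commit
  }

  acknowledge(messageId: string): void {
    this.pendingAcks.push(messageId) // input stays unacknowledged until commit
  }

  commit(): void {
    if (this.finished) throw new Error("transaction already finished")
    this.pendingWrites.forEach((w) => {
      const log = this.topics.get(w.topic) ?? []
      log.push(w.payload)
      this.topics.set(w.topic, log)
    })
    this.pendingAcks.forEach((id) => this.acked.add(id))
    this.finished = true
  }

  abort(): void {
    this.pendingWrites = []
    this.pendingAcks = []
    this.finished = true
  }
}

const topics: TopicLog = new Map()
const acked = new Set<string>()

const abortedTxn = new SketchTransaction(topics, acked)
abortedTxn.produce("topic-a", "charge")
abortedTxn.acknowledge("in-1")
abortedTxn.abort() // nothing becomes visible; "in-1" would be redelivered

const committedTxn = new SketchTransaction(topics, acked)
committedTxn.produce("topic-a", "charge")
committedTxn.produce("topic-b", "receipt")
committedTxn.acknowledge("in-1")
committedTxn.commit() // both writes and the ack become visible together
```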
### Implementation Comparison
| Aspect | Kafka EOS | Stripe Idempotency | Pub/Sub | Azure Service Bus | NATS JetStream | Pulsar |
| -------------- | ----------------------- | ------------------- | --------------- | ----------------- | --------------- | -------------------- |
| Variant | Producer + transactions | Client keys + cache | Broker dedup | Broker dedup | Header dedup | Transactions |
| Scope | Kafka cluster | Any HTTP client | Single region | Queue/Topic | Stream | Multi-topic |
| Dedup window | Session/configurable | 24 hours | Regional | 20s–7 days | 2 min (default) | Transaction |
| Latency impact | 3% | Cache lookup | Significant | Minimal | Minimal | Transaction overhead |
| Client changes | Config only | Add header | Library upgrade | None | Add header | Transaction API |
## Common Pitfalls
### 1. Deduplication Window Expiry
**The mistake:** Retry timeout longer than deduplication window.
**Example:**
- Send message with ID "X" at T=0
- AWS SQS FIFO has 5-minute deduplication window
- Client retry policy: exponential backoff up to 10 minutes
- At T=6 minutes: client retries, SQS accepts as new message
- Result: Duplicate processing despite deduplication "guarantee"
**Solutions:**
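- Ensure the total retry duration (first send to final retry), not just the largest single delay, stays below the deduplication window
- Use exponential backoff with a capped delay, e.g. `min(2^attempt * 100ms, windowSize * 0.8)`, and stop retrying once the cumulative elapsed time approaches the window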
- For critical operations: implement consumer-side deduplication as backup
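One way to keep retries inside the window is to compute the schedule up front and stop once the cumulative delay would exceed a safety fraction of the window. The helper below is a sketch under assumptions: `retrySchedule`, the 100 ms base, and the 0.8 safety factor are illustrative, and jitter and per-request timeouts are left out.

```typescript
// Hypothetical helper: exponential backoff whose total elapsed time
// stays within a safety fraction of the deduplication window.
function retrySchedule(windowMs: number, baseMs = 100, safety = 0.8): number[] {
  const budgetMs = windowMs * safety
  const delays: number[] = []
  let elapsedMs = 0
  for (let attempt = 0; ; attempt++) {
    const delay = Math.min(baseMs * Math.pow(2, attempt), budgetMs)
    if (elapsedMs + delay > budgetMs) break // next retry would leave the budget
    delays.push(delay)
    elapsedMs += delay
  }
  return delays
}

const sqsFifoWindowMs = 5 * 60 * 1000 // 5-minute window from the example above
const schedule = retrySchedule(sqsFifoWindowMs)
const totalRetryMs = schedule.reduce((sum, d) => sum + d, 0)
```

With these inputs the schedule contains 11 delays totalling 204,700 ms, which stays under the 240,000 ms budget (80% of the 5-minute window).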
### 2. Producer Restart Losing Sequence State
**The mistake:** Idempotent producer without `transactional.id` loses sequence state on restart.
**Example:**
- Kafka producer with `enable.idempotence=true` but no `transactional.id`
- Producer crashes after sending message with seq=42
- Producer restarts, gets new PID, sequence resets to 0
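- Any message the application re-sends after the restart (for example, one whose acknowledgment was lost in the crash) carries the new PID, so the broker cannot match it to the earlier write and accepts it as "new"
- Result: Every re-sent message is duplicated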
**Solutions:**
- Set `transactional.id` for producers that must survive restarts
- Or: accept potential duplicates and ensure consumer idempotency
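The effect of a new PID can be seen in a simplified model of broker-side sequence tracking. This is a sketch, not Kafka's implementation: real brokers track sequences per producer ID and partition and also handle epochs, batches, and out-of-order errors, all of which are omitted. `SketchBroker` is a hypothetical name.

```typescript
// Simplified, hypothetical model of sequence-based deduplication.
class SketchBroker {
  private lastSeq = new Map<number, number>() // producer ID -> highest sequence seen
  readonly log: string[] = []

  append(producerId: number, seq: number, payload: string): "appended" | "duplicate" {
    const last = this.lastSeq.get(producerId) ?? -1
    if (seq <= last) return "duplicate" // already written by this producer session
    this.lastSeq.set(producerId, seq)
    this.log.push(payload)
    return "appended"
  }
}

const broker = new SketchBroker()
broker.append(1001, 0, "payment-42") // original send; assume the ack is lost
const sameSession = broker.append(1001, 0, "payment-42") // retry with same PID: dropped
const afterRestart = broker.append(2002, 0, "payment-42") // new PID after restart: accepted
```

The retry within the same session is rejected, while the identical payload sent under a fresh PID lands in the log a second time.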
### 3. Consumer Rebalancing Race Condition
**The mistake:** Processing message but not committing offset before rebalance.
**Example:**
- Consumer processes message from partition 0
- Before offset commit: rebalance triggered (session timeout, new consumer joins)
- Partition 0 reassigned to different consumer
- New consumer reads from last committed offset (before the processed message)
- Result: Message processed twice by two different consumers
**Solutions:**
- Use transactional consumers (offset committed with output)
- Implement idempotent consumer pattern (database constraint on message ID)
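- Increase `max.poll.interval.ms` (or lower `max.poll.records`) when processing is slow; `session.timeout.ms` only governs heartbeat-based failure detection
- Use cooperative rebalancing (`partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor`)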
### 4. Assuming Idempotency Key Uniqueness
**The mistake:** Using predictable keys that collide across users/operations.
**Example:**
- Developer uses `orderId` as idempotency key
- User A creates order 12345, key = "12345"
- User B creates order 12345 in different tenant, same key = "12345"
- User B's request returns User A's cached response
- Result: Data leakage between tenants
**Solutions:**
- Include tenant/user ID in key: `{tenantId}:{operationId}`
- Use client-generated UUIDs (UUID v4)
- Never derive keys solely from user-provided identifiers
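A short sketch of tenant-scoped keys, assuming an in-memory response cache. `scopedKey`, `handleOnce`, and `responseCache` are hypothetical names; a production version would use a shared store and also compare request fingerprints.

```typescript
// Hypothetical response cache keyed by a tenant-scoped idempotency key.
const responseCache = new Map<string, string>()

function scopedKey(tenantId: string, operationId: string): string {
  // Assumes tenantId never contains ":"; otherwise encode or length-prefix the parts.
  return `${tenantId}:${operationId}`
}

function handleOnce(key: string, execute: () => string): string {
  const cached = responseCache.get(key)
  if (cached !== undefined) return cached // replay the stored response
  const response = execute()
  responseCache.set(key, response)
  return response
}

const responseA = handleOnce(scopedKey("tenant-a", "12345"), () => "order for A")
const responseB = handleOnce(scopedKey("tenant-b", "12345"), () => "order for B")
const retryA = handleOnce(scopedKey("tenant-a", "12345"), () => "should not run")
```

Both tenants use operation ID `12345`, but the scoped keys differ, so tenant B never receives tenant A's cached response.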
### 5. Idempotency for GET Requests
**The mistake:** Adding idempotency keys to read operations.
**Example:**
- Developer adds idempotency keys to all endpoints including GET
- GET /user/123 with key "abc" returns user data, cached
- User updates their profile
- GET /user/123 with key "abc" returns stale cached data
- Result: Clients see outdated data indefinitely
**Solutions:**
- Idempotency keys only for state-changing operations that are not already idempotent (typically POST, sometimes PATCH); PUT and DELETE are idempotent by HTTP semantics
- GET requests are naturally idempotent—no key needed
- If caching reads, use standard HTTP caching (ETags, Cache-Control)
### 6. Clock Skew in Last-Write-Wins
**The mistake:** Using wall-clock timestamps for conflict resolution in distributed system.
**Example:**
- Node A (clock +100ms skew) writes value V1 at local time T1
- Node B (accurate clock) writes value V2 at local time T2
- T1 > T2 due to clock skew, but V2 was actually written later
- LWW comparison: V1 wins because T1 > T2
- Result: Causally later write (V2) is discarded
**Solutions:**
- Use Lamport timestamps or vector clocks instead of wall clocks
- Use hybrid logical clocks (HLC) for ordering with physical time hints
- Accept that LWW with physical clocks is eventually consistent, not causally consistent
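A minimal Lamport clock illustrates the first solution. The class below is a sketch: it orders causally related writes correctly regardless of wall-clock skew, but concurrent writes still need a tie-breaker (commonly a node ID), which is omitted here.

```typescript
// Minimal Lamport clock: a logical counter that respects causality.
class LamportClock {
  private counter = 0

  // Local event or send: advance the counter.
  tick(): number {
    this.counter += 1
    return this.counter
  }

  // Receive: jump past the sender's timestamp, then advance.
  receive(remoteStamp: number): number {
    this.counter = Math.max(this.counter, remoteStamp) + 1
    return this.counter
  }
}

const nodeA = new LamportClock()
const nodeB = new LamportClock()

const v1Stamp = nodeA.tick() // node A writes V1
const v2Stamp = nodeB.receive(v1Stamp) // node B observes V1, then writes V2
```

Because node B's stamp is derived from node A's, V2 always compares greater than V1, so the causally later write wins no matter how the physical clocks drift.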
### 7. Assuming EOS Extends Beyond System Boundaries
**The mistake:** Believing Kafka's exactly-once guarantees extend to downstream systems.
**Example:**
- Kafka Streams app with `processing.guarantee=exactly_once_v2`
- App reads from Kafka, processes, writes to PostgreSQL
- Assumption: "Kafka handles exactly-once, so PostgreSQL writes are safe"
- Reality: Kafka EOS only covers Kafka-to-Kafka. The PostgreSQL write is outside the transaction.
- Result: Consumer crashes after PostgreSQL write but before Kafka offset commit → duplicate write on restart
**Solutions:**
- Use transactional outbox pattern for database writes
- Implement idempotent database operations (upsert with message ID)
- Use KIP-939 (when available) for native 2PC with external databases
- Always design downstream consumers as idempotent regardless of upstream guarantees
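The idempotent database write can be sketched with an in-memory stand-in for a table that has a unique constraint on message ID. `SketchLedger` is hypothetical; in a real database the dedup record and the state change would be written in a single transaction.

```typescript
// Hypothetical in-memory stand-in for a table with UNIQUE(message_id).
class SketchLedger {
  private applied = new Set<string>() // plays the role of the unique constraint
  balance = 0

  // Records the message ID and applies the credit as one step.
  // Returns false when the message was already applied.
  applyCredit(messageId: string, amount: number): boolean {
    if (this.applied.has(messageId)) return false // redelivery: no effect
    this.applied.add(messageId)
    this.balance += amount
    return true
  }
}

const ledger = new SketchLedger()
const firstApply = ledger.applyCredit("msg-7", 50) // processed, then the consumer crashes before the offset commit
const redelivery = ledger.applyCredit("msg-7", 50) // redelivered after restart: ignored
```

The second call models the redelivery from the example above: the offset was never committed, so the message arrives again, but the balance changes only once.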
## Implementation Guide
### Starting Point Decision

### System Selection Guide
| Requirement | Recommended System | Reason |
| --------------------- | ------------------- | -------------------------------- |
| Kafka ecosystem | Kafka with EOS v2 | Native support, minimal overhead |
| Serverless/managed | SQS FIFO or Pub/Sub | No infrastructure to manage |
| Configurable window | Azure Service Bus | 20s to 7 days window |
| High performance | NATS JetStream | Low latency, simple model |
| Cross-topic atomicity | Pulsar or Kafka | Transaction APIs |
| HTTP API idempotency | Redis + custom code | Stripe pattern |
### When to Build Custom
**Build custom when:**
- Existing solutions don't fit your consistency requirements
- Cross-system exactly-once needed (Kafka → external database)
- Need longer deduplication windows than broker provides
- Performance requirements exceed library capabilities
**Implementation checklist:**
- [ ] Define deduplication key format (unique, collision-resistant)
- [ ] Choose deduplication storage (Redis, database, in-memory)
- [ ] Set deduplication window (longer than max retry delay)
- [ ] Implement atomic state update + dedup record insert
- [ ] Add cleanup job for expired deduplication records
- [ ] Test with network partition simulation
- [ ] Test with producer/consumer restart scenarios
- [ ] Document failure modes and recovery procedures
### Testing Exactly-Once
**Unit tests:**
- Same message ID processed twice → single state change
- Different message IDs → independent state changes
- Concurrent identical requests → single effect
**Integration tests:**
- Producer crash mid-send → no duplicates after restart
- Consumer crash mid-process → message reprocessed once
- Broker failover → no duplicates or losses
**Chaos testing:**
- Network partition between producer and broker
- Kill consumer during processing
- Slow consumer causing rebalance
- Clock skew between nodes
## Conclusion
Exactly-once delivery is a misnomer—true exactly-once is mathematically impossible in distributed systems. What we achieve is "effectively exactly-once": at-least-once delivery combined with idempotency mechanisms that ensure each message's effect occurs exactly once.
The key insight is that exactly-once is a **composition**, not a primitive:
1. Never lose messages (at-least-once delivery with retries and persistence)
2. Make duplicates harmless (idempotent operations, deduplication tracking, or transactional processing)
Choose your implementation based on your constraints:
- **Idempotent operations** when you control the state model
- **Idempotency keys** for external-facing APIs
- **Broker-side deduplication** when your broker supports it (Kafka, SQS FIFO, Pub/Sub, Azure Service Bus, NATS JetStream)
- **Consumer-side deduplication** for maximum control and longer windows
- **Transactional processing** for Kafka/Pulsar consume-transform-produce patterns
- **Transactional outbox** when you need atomic database + event writes
Every approach has failure modes around the deduplication window. Design your retry policies to fit within the window, and consider layered approaches (broker + consumer deduplication) for critical paths.
**Critical reminder**: Most exactly-once guarantees stop at system boundaries. Kafka EOS doesn't extend to your PostgreSQL database. Pub/Sub exactly-once is regional. Always design downstream consumers as idempotent, regardless of upstream guarantees.
## Appendix
### Prerequisites
- Understanding of distributed systems fundamentals (network failures, partial failures)
- Familiarity with message brokers (Kafka, SQS, or similar)
- Basic knowledge of database transactions
### Terminology
- **Idempotency**: Property where applying an operation multiple times produces the same result as applying it once
- **PID (Producer ID)**: Unique 64-bit identifier assigned to a Kafka producer instance by the broker
- **Deduplication window**: Time period during which the system remembers message IDs for duplicate detection
- **EOS (Exactly-Once Semantics)**: Kafka's term for effectively exactly-once processing guarantees
- **2PC (Two-Phase Commit)**: Distributed transaction protocol that ensures atomic commits across multiple participants
- **CDC (Change Data Capture)**: Technique for reading database changes from transaction logs
- **Transactional outbox**: Pattern where events are written to an outbox table in the same database transaction as business data
### Summary
- True exactly-once delivery is impossible (Two Generals 1975, FLP 1985)
- "Exactly-once" means at-least-once delivery + idempotent consumption
- Six implementation paths: idempotent operations, idempotency keys, broker-side dedup, consumer-side dedup, transactional processing, transactional outbox
- Every deduplication mechanism has a window—design retry policies accordingly
- Kafka EOS: idempotent producers (PID + sequence) + transactional consumers (`read_committed`); producer idempotence is enabled by default since Kafka 3.0
- Deduplication windows vary: SQS FIFO (5 min fixed), Azure Service Bus (20s–7 days), NATS (2 min default)
- Most exactly-once guarantees stop at system boundaries—always design downstream consumers as idempotent
- Test with chaos: network partitions, restarts, rebalancing, clock skew
### References
**Theoretical Foundations**
- [Impossibility of Distributed Consensus with One Faulty Process](https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf) - FLP impossibility theorem (1985)
- [A Brief Tour of FLP Impossibility](https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/) - Accessible explanation of FLP
- [Two Generals' Problem](https://en.wikipedia.org/wiki/Two_Generals%27_Problem) - First computer communication problem proven unsolvable
**Apache Kafka**
- [KIP-98 - Exactly Once Delivery and Transactional Messaging](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging) - Kafka's exactly-once specification
- [KIP-129 - Streams Exactly-Once Semantics](https://cwiki.apache.org/confluence/display/KAFKA/KIP-129%3A+Streams+Exactly-Once+Semantics) - Kafka Streams exactly-once
- [KIP-447 - Producer scalability for exactly once semantics](https://cwiki.apache.org/confluence/display/KAFKA/KIP-447:+Producer+scalability+for+exactly+once+semantics) - EOS v2 improvements
- [KIP-939 - Support Participation in 2PC](https://cwiki.apache.org/confluence/display/KAFKA/KIP-939:+Support+Participation+in+2PC) - Kafka + external database atomic writes
- [Message Delivery Guarantees for Apache Kafka](https://docs.confluent.io/kafka/design/delivery-semantics.html) - Confluent official docs
- [Exactly-once Support in Apache Kafka](https://medium.com/@jaykreps/exactly-once-support-in-apache-kafka-55e1fdd0a35f) - Jay Kreps on Kafka EOS
**Cloud Messaging Services**
- [Cloud Pub/Sub exactly-once delivery](https://cloud.google.com/pubsub/docs/exactly-once-delivery) - Google Pub/Sub implementation
- [AWS SQS FIFO exactly-once processing](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues-exactly-once-processing.html) - AWS implementation
- [Azure Service Bus duplicate detection](https://learn.microsoft.com/en-us/azure/service-bus-messaging/duplicate-detection) - Azure implementation
**Other Messaging Systems**
- [JetStream Model Deep Dive](https://docs.nats.io/using-nats/developer/develop_jetstream/model_deep_dive) - NATS deduplication
- [Pulsar Transactions](https://pulsar.apache.org/docs/next/txn-what/) - Apache Pulsar exactly-once
**API Idempotency**
- [Designing robust and predictable APIs with idempotency](https://stripe.com/blog/idempotency) - Stripe's idempotency key pattern
- [Implementing Stripe-like Idempotency Keys in Postgres](https://brandur.org/idempotency-keys) - Detailed implementation guide
**Patterns**
- [Transactional outbox pattern](https://microservices.io/patterns/data/transactional-outbox.html) - Microservices.io pattern reference
- [Reliable Microservices Data Exchange With the Outbox Pattern](https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/) - Outbox pattern with CDC
- [Idempotent Consumer Pattern](https://microservices.io/patterns/communication-style/idempotent-consumer.html) - Microservices.io pattern reference
**General**
- [The impossibility of exactly-once delivery](https://blog.bulloak.io/post/20200917-the-impossibility-of-exactly-once/) - Theoretical foundations
- [You Cannot Have Exactly-Once Delivery](https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/) - Why true exactly-once is impossible
---
## Event Sourcing
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/core-distributed-patterns/event-sourcing-deep-dive
**Category:** System Design / Core Distributed Patterns
**Description:** A deep-dive into event sourcing: understanding the core pattern, implementation variants, snapshot strategies, schema evolution, and production trade-offs across different architectures.
# Event Sourcing
A deep-dive into event sourcing: understanding the core pattern, implementation variants, snapshot strategies, schema evolution, and production trade-offs across different architectures.

Event sourcing architecture: commands produce events stored immutably; projections derive read models; state rebuilt by replaying events from snapshots.
## Abstract
Event sourcing stores application state as a sequence of immutable events rather than current state. The core insight: **events are facts that happened—they cannot be deleted or modified, only appended**. State is derived by replaying events (a "left fold" over the event stream).
Key mental model:
- **Write path**: Commands → Validation → Event(s) → Append to stream
- **Read path**: Subscribe to events → Project into read-optimized models
- **Recovery**: Snapshot + Replay events since snapshot = Current state
Design decisions after choosing event sourcing:
1. **Stream granularity**: Per-aggregate (common) vs shared streams (rare)
2. **Projection strategy**: Synchronous (strong consistency) vs async (eventual)
3. **Snapshot policy**: Event-count vs time-based vs state-triggered
4. **Schema evolution**: Upcasting (transform on read) vs stream transformation (migrate data)
Event sourcing is not CQRS (Command Query Responsibility Segregation), but with event sourcing you must use CQRS—reads come from projections, not the event store.
## The Problem
### Why Naive Solutions Fail
**Approach 1: Mutable state in a database (CRUD)**
- **Fails because**: Current state overwrites history. You cannot answer "what was the balance at 3pm yesterday?" without separate audit logging.
- **Example**: A bank account shows $500. A customer disputes a charge. Without event history, proving what transactions occurred requires mining application logs—if they exist.
**Approach 2: Adding audit tables alongside CRUD**
- **Fails because**: Audit tables and current state can diverge. The audit is a second-class citizen—queries hit current state, audits are bolted on.
- **Example**: A bug updates the balance without writing to the audit table. Now the audit and state disagree. Which is correct?
**Approach 3: Database triggers for audit logging**
- **Fails because**: Triggers capture "what changed," not "why it changed." You see "balance changed from 600 to 500" but not "customer withdrew $100 at ATM #4421."
- **Example**: Debugging why an order was cancelled. Trigger logs show "status changed to CANCELLED" but not "cancelled by customer due to shipping delay."
### The Core Challenge
The fundamental tension: **optimized writes vs queryable history**. CRUD optimizes for writes (overwrite current state) but sacrifices history. Audit logging preserves history but creates dual-write complexity and semantic loss.
Event sourcing exists to make **events the single source of truth**—current state becomes a derived view that can be rebuilt at any time.
## Pattern Overview
### Core Mechanism
Every state change is captured as an immutable event appended to an event stream. Current state is derived by replaying events from the beginning (or from a snapshot).
From Martin Fowler's foundational definition:
> "Capture all changes to an application state as a sequence of events."
Events are structured facts:
```typescript collapse={1-3}
// Event structure
interface DomainEvent {
  eventId: string // Unique identifier
  streamId: string // Aggregate this event belongs to
  eventType: string // e.g., "OrderPlaced", "ItemAdded"
  data: unknown // Event-specific payload
  metadata: {
    timestamp: Date
    version: number // Position in stream
    causationId?: string // What caused this event
    correlationId?: string // Request/saga identifier
  }
}
```
State reconstruction follows a deterministic left-fold:
```
currentState = events.reduce(applyEvent, initialState)
```
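A worked instance of the fold, using a bank account. The event shape here is simplified for illustration (`kind`, `amount`, and a numeric `at` timestamp are assumptions, not the `DomainEvent` structure above). The same fold answers temporal queries by replaying only the events up to a point in time.

```typescript
type AccountEvent =
  | { kind: "Deposited"; amount: number; at: number }
  | { kind: "Withdrawn"; amount: number; at: number }

function applyAccountEvent(balance: number, e: AccountEvent): number {
  switch (e.kind) {
    case "Deposited":
      return balance + e.amount
    case "Withdrawn":
      return balance - e.amount
  }
}

const accountEvents: AccountEvent[] = [
  { kind: "Deposited", amount: 600, at: 1 },
  { kind: "Withdrawn", amount: 100, at: 2 },
  { kind: "Deposited", amount: 25, at: 3 },
]

// Current state: fold over the whole stream.
const currentBalance = accountEvents.reduce(applyAccountEvent, 0)

// Temporal query: fold over events up to a point in time.
function balanceAt(stream: AccountEvent[], at: number): number {
  return stream.filter((e) => e.at <= at).reduce(applyAccountEvent, 0)
}

const balanceAtTwo = balanceAt(accountEvents, 2)
```

Here `currentBalance` is 525 and `balanceAtTwo` is 500, the balance after the withdrawal but before the final deposit.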
### Key Invariants
1. **Append-only**: Events are never updated or deleted. The event store is an immutable log.
2. **Deterministic replay**: Given the same events in the same order, replay produces identical state.
3. **Event ordering**: Events within a stream are totally ordered by version number.
4. **Idempotent projections**: Projections must handle duplicate event delivery gracefully.
### Failure Modes
| Failure | Impact | Mitigation |
| -------------------------------- | ----------------------------------------------------- | --------------------------------------------------- |
| Event store unavailable | Commands blocked; reads from projections may continue | Multi-region replication; read replicas |
| Projection lag | Stale read models | Monitor lag; circuit breaker on staleness threshold |
| Event schema mismatch | Projection crashes or produces incorrect state | Schema registry; upcasting; versioned projections |
| Unbounded event growth | Replay becomes prohibitively slow | Snapshots; event archival to cold storage |
| Concurrent writes to same stream | Optimistic concurrency violation | Retry with conflict resolution; smaller aggregates |
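
The optimistic concurrency row can be made concrete with an expected-version check on append. This is a minimal in-memory sketch; `SketchEventStore` is a hypothetical name and events are reduced to strings for brevity.

```typescript
// Hypothetical in-memory event store illustrating expected-version checks.
class SketchEventStore {
  private streams = new Map<string, string[]>()

  // Appends only if the stream is still at the version the writer loaded.
  append(streamId: string, expectedVersion: number, newEvents: string[]): number {
    const stream = this.streams.get(streamId) ?? []
    if (stream.length !== expectedVersion) {
      throw new Error(`conflict: expected ${expectedVersion}, actual ${stream.length}`)
    }
    this.streams.set(streamId, stream.concat(newEvents))
    return stream.length + newEvents.length // new stream version
  }
}

const sketchStore = new SketchEventStore()
const versionOne = sketchStore.append("order-1", 0, ["OrderPlaced"])

// Two writers both loaded version 1; only the first append wins.
const versionTwo = sketchStore.append("order-1", 1, ["ItemAdded"])
let conflictDetected = false
try {
  sketchStore.append("order-1", 1, ["OrderCancelled"])
} catch {
  conflictDetected = true // caller reloads the aggregate and retries the command
}
```

The losing writer has to reload the stream and re-run its decision logic against the new state, which is why smaller aggregates reduce contention.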
## Design Paths
### Path 1: Pure Event Sourcing
**When to choose this path:**
- Audit trail is a regulatory requirement (finance, healthcare)
- Temporal queries are core functionality ("what was the state on date X?")
- Domain naturally expresses state changes as events
- You need to rebuild projections for new query patterns
**Key characteristics:**
- Events are the **only** source of truth
- All read models are projections derived from events
- No direct database writes outside the event store
- State can be fully rebuilt from events at any time
**Implementation approach:**
```typescript collapse={1-5, 15-25}
// Pure event sourcing: aggregate loaded from events
interface Aggregate<TState> {
  state: TState
  version: number
}
// Command handler pattern
async function handle<TState, TEvent>(
  command: Command,
  loadAggregate: (id: string) => Promise<Aggregate<TState>>,
  decide: (state: TState, command: Command) => TEvent[],
  appendEvents: (streamId: string, expectedVersion: number, events: TEvent[]) => Promise<void>,
): Promise<void> {
  const aggregate = await loadAggregate(command.aggregateId)
  const newEvents = decide(aggregate.state, command)
  // expectedVersion lets the store reject the append if another writer got there first
  await appendEvents(command.aggregateId, aggregate.version, newEvents)
}
// Loading aggregate from event store
// (EventStore is assumed to expose readStream returning the stream's events in order)
async function loadAggregate<TState, TEvent>(
  streamId: string,
  eventStore: EventStore<TEvent>,
  evolve: (state: TState, event: TEvent) => TState,
  initialState: TState,
): Promise<Aggregate<TState>> {
  const events = await eventStore.readStream(streamId)
  const state = events.reduce(evolve, initialState)
  return { state, version: events.length }
}
```
**Real-world example:**
LMAX Exchange processes 6 million orders per second using pure event sourcing. Their Business Logic Processor runs entirely in-memory, single-threaded. Events are the recovery mechanism: on restart, the processor loads the last snapshot and replays subsequent events to reconstruct state. Martin Fowler's write-up of the architecture reports a full restart, including snapshot load and replay of a day's journal, in under a minute.
Key implementation details:
- Ring buffer (Disruptor) handles I/O concurrency: 20 million slots for input
- Latency: sub-50 nanosecond event processing (mean latency 3 orders of magnitude below queue-based approaches)
- Nightly snapshots; BLP restarts every night with zero downtime via replication
**Trade-offs vs other paths:**
| Aspect | Pure Event Sourcing | Hybrid (ES + CRUD) |
| ----------------------- | -------------------------------- | ------------------------------------------- |
| Audit completeness | Full history guaranteed | Partial—only ES domains audited |
| Query flexibility | Build any projection from events | Some data may not be projectable |
| Complexity | Higher—must project all reads | Lower—CRUD for simple domains |
| Migration cost | Very high (all-or-nothing) | Incremental |
| Team expertise required | Event-driven thinking throughout | Can isolate ES to specific bounded contexts |
### Path 2: Hybrid Event Sourcing (ES + CRUD)
**When to choose this path:**
- Some domains require audit trails; others don't
- Team has mixed experience with event-driven architecture
- Migrating from existing CRUD system incrementally
- Read-heavy workloads where projection lag is unacceptable for some data
**Key characteristics:**
- Core business domains use event sourcing (orders, transactions, inventory)
- Supporting domains use CRUD (user preferences, product catalog)
- Clear boundaries between ES and CRUD contexts
- Events may be published from CRUD systems for integration
**Implementation approach:**
```typescript collapse={1-3}
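// Hybrid: Event-sourced orders, CRUD product catalog
// Event shapes below are assumed for this sketch
type OrderEvent =
  | { type: "OrderPlaced"; data: { items: string[] } }
  | { type: "OrderShipped" }
// Order aggregate (event-sourced)
class OrderAggregate {
  status: "new" | "placed" | "shipped" = "new"
  items: string[] = []
  apply(event: OrderEvent): void {
    switch (event.type) {
      case "OrderPlaced":
        this.status = "placed"
        this.items = event.data.items
        break
      case "OrderShipped":
        this.status = "shipped"
        break
    }
  }
}
// Product service (CRUD with event publishing)
class ProductService {
  // db and eventBus are injected dependencies; their shapes are assumed here
  constructor(
    private db: { products: { update(id: string, patch: { price: number }): Promise<void> } },
    private eventBus: { publish(event: { type: string; productId: string; newPrice: number }): Promise<void> },
  ) {}
  async updatePrice(productId: string, newPrice: number): Promise<void> {
    // CRUD update
    await this.db.products.update(productId, { price: newPrice })
    // Publish event for downstream consumers (not source of truth)
    await this.eventBus.publish({ type: "PriceUpdated", productId, newPrice })
  }
}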
```
**Real-world example:**
Walmart's Inventory Availability System uses hybrid event sourcing. The inventory domain is fully event-sourced to track every stock movement. Product catalog data (descriptions, images) remains in traditional databases. Events from inventory updates feed projections that power real-time availability queries for millions of customers.
Key details:
- Events partitioned by product-node ID for scalability
- Command processor validates business rules before event emission
- Read and write sides use different tech stacks, scaled independently
**Trade-offs vs pure ES:**
| Aspect | Hybrid | Pure ES |
| ---------------------- | ------------------------------------------- | --------------------------------- |
| Migration path | Gradual, bounded context by context | Big-bang or strangler pattern |
| Team adoption | Easier—ES expertise not required everywhere | Requires organization-wide buy-in |
| Consistency | Mixed—ES domains eventual, CRUD immediate | Eventual consistency throughout |
| Operational complexity | Higher—two paradigms to operate | Single paradigm (but complex) |
### Path 3: Event Sourcing with Synchronous Projections
**When to choose this path:**
- Strong consistency between writes and reads required
- Projection latency is unacceptable
- Lower throughput acceptable for consistency guarantee
- Single-node or low-scale deployments
**Key characteristics:**
- Projections updated in same transaction as event append
- Read model immediately reflects write
- No eventual consistency concerns
- Limited scalability—projection work blocks command processing
**Implementation approach:**
```typescript collapse={1-2, 18-22}
// Synchronous projection: update read model in same transaction
// (db, loadFromEvents, appendEvents, and the projection updaters are assumed helpers)
async function handleCommand(command: PlaceOrderCommand): Promise<void> {
  await db.transaction(async (tx) => {
    // 1. Load aggregate
    const order = await loadFromEvents(tx, command.orderId)
    // 2. Execute business logic
    const events = order.place(command.items)
    // 3. Append events
    await appendEvents(tx, command.orderId, events)
    // 4. Update projection synchronously
    for (const event of events) {
      await updateOrderListProjection(tx, event)
      await updateInventoryProjection(tx, event)
    }
  })
  // Transaction commits: events + projections atomically
}
```
**Real-world example:**
Marten (for .NET/PostgreSQL) supports inline projections that run within the same database transaction as event appends. This guarantees that when a command succeeds, the read model is immediately consistent.
From Marten documentation: inline projections are "run as part of the same transaction as the events being captured and updated in the underlying database as well."
**Trade-offs:**
| Aspect | Synchronous Projections | Async Projections |
| ----------------- | --------------------------------------- | ------------------------------------ |
| Consistency | Strong—reads reflect writes immediately | Eventual—lag between write and read |
| Throughput | Lower—projection work blocks writes | Higher—writes return without waiting |
| Scalability | Limited by projection cost | Projections scale independently |
| Failure isolation | Projection failure fails the write | Write succeeds; projection can retry |
### Path 4: Event Sourcing with Asynchronous Projections
**When to choose this path:**
- High write throughput required
- Eventual consistency acceptable
- Projections are expensive (aggregations, joins)
- Need to scale reads independently of writes
**Key characteristics:**
- Events appended to store; command returns immediately
- Background processes subscribe to event streams
- Projections catch up asynchronously
- Must handle out-of-order delivery, duplicates, and lag
**Implementation approach:**
```typescript collapse={1-4, 20-28}
// Async projection with checkpointing
interface ProjectionState {
  lastProcessedPosition: number
  // projection-specific state
}
async function runProjection(
  subscription: EventSubscription,
  project: (state: ProjectionState, event: DomainEvent) => ProjectionState,
  checkpoint: (position: number) => Promise<void>,
): Promise<void> {
  let state = await loadProjectionState()
  for await (const event of subscription.fromPosition(state.lastProcessedPosition)) {
    state = project(state, event)
    // Checkpoint periodically, not on every event, to batch writes.
    // Events after the last checkpoint are redelivered on restart, so project must be idempotent.
    if (event.position % 100 === 0) {
      await checkpoint(event.position)
    }
  }
}
// Idempotent projection handling duplicates
function projectOrderPlaced(state: OrderListState, event: OrderPlacedEvent): OrderListState {
  // Skip if already processed (idempotency)
  if (state.processedEventIds.has(event.eventId)) {
    return state
  }
  return {
    ...state,
    orders: [...state.orders, { id: event.data.orderId, status: "placed" }],
    // Copy the set so the previous state object is not mutated
    processedEventIds: new Set(state.processedEventIds).add(event.eventId),
  }
}
```
**Real-world example:**
Netflix's Downloads feature (launched November 2016) uses Cassandra-backed event sourcing with async projections. The team built this in six months, transforming from a stateless to stateful service.
Key implementation details:
- Three components: Event Store, Aggregate Repository, Aggregate Service
- Projections rebuild on-demand for debugging
- Horizontal scaling limited only by disk space
**Trade-offs:**
| Aspect | Async Projections | Sync Projections |
| ----------------- | ------------------------------------------ | ----------------------------------- |
| Write latency | Low—return immediately | Higher—wait for projections |
| Read freshness | Eventual (seconds to minutes lag) | Immediate |
| Failure isolation | Yes—projection failures don't block writes | No—projection failure fails command |
| Complexity | Higher—handle duplicates, ordering, lag | Lower—transactional guarantees |
### Decision Framework

## Snapshot Strategies
Replaying thousands of events for every command becomes prohibitive. Snapshots store materialized state at a point in time, enabling replay from snapshot + recent events.
### When to Snapshot
From the EventStoreDB team: "Do not add snapshots right away. They are not needed at the beginning."
Add snapshots when:
- Aggregate event counts exceed hundreds (measure first)
- Load times impact user experience or SLAs
- Cold start times are unacceptable
### Snapshot Approaches
| Strategy | Trigger | Pros | Cons |
| ------------------- | -------------------------------------------- | ------------------------------- | --------------------------------------- |
| **Event-count** | Every N events (e.g., 100) | Predictable load times | May snapshot unnecessarily |
| **Time-based** | Every N hours/days | Operational simplicity | Variable event counts between snapshots |
| **State-triggered** | On significant transitions (draft→published) | Snapshots at natural boundaries | Requires domain knowledge |
| **On-demand** | When load time exceeds threshold | Only when needed | First slow load before snapshot exists |
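
Three of the strategies in the table reduce to a simple predicate over metrics the loader already has. The sketch below is illustrative only: `SnapshotPolicy`, `SnapshotContext`, and `shouldSnapshot` are hypothetical names, and the state-triggered strategy is left out because it depends on domain-specific events.

```typescript
type SnapshotPolicy =
  | { kind: "event-count"; everyN: number }
  | { kind: "time-based"; everyMs: number }
  | { kind: "on-demand"; maxLoadMs: number }

interface SnapshotContext {
  eventsSinceSnapshot: number // events appended since the last snapshot
  msSinceSnapshot: number // time elapsed since the last snapshot
  lastLoadMs: number // how long the most recent aggregate load took
}

function shouldSnapshot(policy: SnapshotPolicy, ctx: SnapshotContext): boolean {
  switch (policy.kind) {
    case "event-count":
      return ctx.eventsSinceSnapshot >= policy.everyN
    case "time-based":
      return ctx.msSinceSnapshot >= policy.everyMs
    case "on-demand":
      return ctx.lastLoadMs > policy.maxLoadMs
  }
}

const loadContext: SnapshotContext = { eventsSinceSnapshot: 120, msSinceSnapshot: 1000, lastLoadMs: 15 }
const byCount = shouldSnapshot({ kind: "event-count", everyN: 100 }, loadContext)
const byTime = shouldSnapshot({ kind: "time-based", everyMs: 60 * 60 * 1000 }, loadContext)
const byLoadTime = shouldSnapshot({ kind: "on-demand", maxLoadMs: 50 }, loadContext)
```

For this context only the event-count policy fires: 120 events exceed the threshold of 100, while the elapsed time and load time are both well under their limits.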
### Implementation
```typescript collapse={1-5, 22-30}
// Snapshot-aware aggregate loading
// (EventStore and SnapshotStore are assumed interfaces)
interface Snapshot<TState> {
  state: TState
  version: number
  schemaVersion: number // For schema evolution
}
async function loadWithSnapshot<TState, TEvent>(
  streamId: string,
  eventStore: EventStore<TEvent>,
  snapshotStore: SnapshotStore<TState>,
  evolve: (state: TState, event: TEvent) => TState,
  initialState: TState,
): Promise<{ state: TState; version: number }> {
  // Try loading snapshot first
  const snapshot = await snapshotStore.load(streamId)
  const startVersion = snapshot?.version ?? 0
  const startState = snapshot?.state ?? initialState
  // Replay only events after snapshot
  const events = await eventStore.readStream(streamId, { fromVersion: startVersion + 1 })
  const state = events.reduce(evolve, startState)
  return { state, version: startVersion + events.length }
}
// Snapshot after command if threshold exceeded
async function maybeSnapshot<TState>(
  streamId: string,
  state: TState,
  version: number,
  snapshotStore: SnapshotStore<TState>,
  threshold: number = 100,
): Promise<void> {
  // Treat "no snapshot yet" as version 0 so the first snapshot can be taken
  const lastSnapshot = (await snapshotStore.getVersion(streamId)) ?? 0
  if (version - lastSnapshot > threshold) {
    await snapshotStore.save(streamId, { state, version, schemaVersion: 1 })
  }
}
```
### Schema Evolution for Snapshots
When snapshot schema changes:
1. Increment `schemaVersion` in new snapshots
2. On load, check `schemaVersion`
3. If outdated, discard snapshot and rebuild from events
This is simpler than migrating snapshots: events are the source of truth, and snapshots are an ephemeral optimization.
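The three steps above can be sketched as a small decision function. This is a minimal sketch, not the article's implementation: `CURRENT_SCHEMA_VERSION`, `StoredSnapshot`, and `resolveStart` are illustrative names, and event versions are assumed to start at 1, matching the `startVersion + 1` convention used earlier.

```typescript
// Illustrative constant: bump it whenever the snapshot shape changes.
const CURRENT_SCHEMA_VERSION = 2

interface StoredSnapshot<TState> {
  state: TState
  version: number
  schemaVersion: number
}

interface ReplayStart<TState> {
  state: TState
  fromVersion: number // first event version that still has to be replayed
}

// Decide where replay starts: from a usable snapshot, or from scratch.
function resolveStart<TState>(
  snapshot: StoredSnapshot<TState> | undefined,
  initialState: TState,
): ReplayStart<TState> {
  if (snapshot && snapshot.schemaVersion === CURRENT_SCHEMA_VERSION) {
    return { state: snapshot.state, fromVersion: snapshot.version + 1 }
  }
  // Missing or outdated snapshot: ignore it and rebuild from the first event.
  return { state: initialState, fromVersion: 1 }
}
```

The caller then replays events from `fromVersion` and may write a fresh snapshot with the current `schemaVersion`.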
## Event Schema Evolution
Events are immutable and stored forever. Schema changes must be backward-compatible or handled via transformation.
### Core Principle
From Greg Young (EventStoreDB creator):
> "A new version of an event must be convertible from the old version. If not, it is not a new version but a new event."
### Evolution Strategies
#### Strategy 1: Additive Changes (Preferred)
Add optional fields; never remove or rename.
```typescript collapse={1-3}
// Version 1
interface OrderPlacedV1 {
  orderId: string
  customerId: string
  items: OrderItem[]
}
// Version 2: Added optional field
interface OrderPlacedV2 {
  orderId: string
  customerId: string
  items: OrderItem[]
  discountCode?: string // New optional field
}
// Projection handles both
function projectOrder(event: OrderPlacedV1 | OrderPlacedV2): Order {
  return {
    id: event.orderId,
    customerId: event.customerId,
    items: event.items,
    discountCode: "discountCode" in event ? event.discountCode : undefined,
  }
}
```
#### Strategy 2: Upcasting (Transform on Read)
Transform old events to new schema when reading.
```typescript collapse={1-4, 18-22}
// Upcaster transforms old schema to current schema
type Upcaster<TOld, TNew> = (old: TOld) => TNew
const orderPlacedUpcasters: Map<number, Upcaster<any, DomainEvent>> = new Map([
  [
    1,
    (v1: OrderPlacedV1) => ({
      ...v1,
      discountCode: undefined,
      source: "unknown", // Default for old events
    }),
  ],
  [
    2,
    (v2: OrderPlacedV2) => ({
      ...v2,
      source: "unknown",
    }),
  ],
])
function upcast(event: StoredEvent): DomainEvent {
  const upcaster = orderPlacedUpcasters.get(event.schemaVersion)
  return upcaster ? upcaster(event.data) : event.data
}
```
**When to use**: Schema changes are frequent; you want projections to only handle the latest version.
**Trade-off**: CPU cost on every read; upcaster chain grows over time.
#### Strategy 3: Stream Transformation (Migrate Data)
Rewrite the event stream with transformed events during a release window.
**When to use**: Breaking changes that upcasting cannot handle; reducing technical debt from long upcaster chains.
**Trade-off**: Requires downtime or careful blue-green deployment; must preserve event IDs and ordering.
**Warning from Greg Young**: "Updating an existing event can cause large problems." Prefer stream transformations (copy with transform) over in-place modification.
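A copy-with-transform migration could look roughly like the sketch below. It is an assumption-laden illustration: the `StoredEvent` shape, `transformStream`, and the `customer` to `customerId` rename are hypothetical, and a real migration would write to a new stream in the event store rather than return an array.

```typescript
interface StoredEvent {
  id: string
  type: string
  schemaVersion: number
  data: Record<string, unknown>
}

// Copy-with-transform: build a new stream and leave the source untouched.
// Array.map keeps the original ordering.
function transformStream(
  source: readonly StoredEvent[],
  transform: (event: StoredEvent) => StoredEvent,
): StoredEvent[] {
  return source.map((event) => {
    const migrated = transform(event)
    // Event identity must survive the migration.
    if (migrated.id !== event.id) {
      throw new Error(`transform changed event id ${event.id}`)
    }
    return migrated
  })
}

// Hypothetical breaking change: rename `customer` to `customerId` on old OrderPlaced events.
function renameCustomerField(event: StoredEvent): StoredEvent {
  if (event.type !== "OrderPlaced" || event.schemaVersion >= 2) return event
  const { customer, ...rest } = event.data
  return { ...event, schemaVersion: 2, data: { ...rest, customerId: customer } }
}
```

Because the source stream is never modified, the migration can be verified against the copy and rolled back by simply discarding it.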
### Schema Registry
For production systems, use a schema registry:
- Store event schemas with versions
- Validate events against schema on write
- Reject events that don't conform
- Generate types from schemas (Avro, Protobuf, JSON Schema)
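To make "validate on write, reject on mismatch" concrete, here is a toy in-memory registry. It is only a sketch: `InMemorySchemaRegistry`, `EventSchema`, and the field-type rules are invented for illustration, and production registries built on Avro, Protobuf, or JSON Schema check far more than field presence and primitive types.

```typescript
type FieldType = "string" | "number" | "boolean"
type EventSchema = Record<string, { type: FieldType; optional?: boolean }>

class InMemorySchemaRegistry {
  private schemas = new Map<string, EventSchema>()

  register(eventType: string, version: number, schema: EventSchema): void {
    this.schemas.set(`${eventType}@${version}`, schema)
  }

  // Returns a list of violations; an empty list means the event conforms.
  validate(eventType: string, version: number, data: Record<string, unknown>): string[] {
    const schema = this.schemas.get(`${eventType}@${version}`)
    if (!schema) return [`no schema registered for ${eventType}@${version}`]
    const errors: string[] = []
    for (const [field, rule] of Object.entries(schema)) {
      const value = data[field]
      if (value === undefined) {
        if (!rule.optional) errors.push(`missing required field "${field}"`)
      } else if (typeof value !== rule.type) {
        errors.push(`field "${field}" should be ${rule.type}`)
      }
    }
    return errors
  }
}
```

A write path would call `validate` before appending and refuse the append when the returned list is non-empty.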
## Projections and Read Models
Projections transform event streams into query-optimized read models.
### Projection Lifecycle (Marten Model)
| Type | Execution | Consistency | Use Case |
| ---------- | -------------------------- | --------------- | ------------------------------------- |
| **Inline** | Same transaction as events | Strong | Critical reads; simple projections |
| **Async** | Background process | Eventual | Complex aggregations; high throughput |
| **Live** | On-demand, not persisted | None (computed) | One-time analytics; debugging |
### Building Projections
```typescript collapse={1-6}
// Projection as left-fold over events
interface Projection<TState> {
  initialState: TState
  apply: (state: TState, event: DomainEvent) => TState
}
// Order list projection for dashboard
const orderListProjection: Projection<{ orders: { id: string; status: string; total: number }[]; totalRevenue: number }> = {
  initialState: { orders: [], totalRevenue: 0 },
  apply(state, event) {
    switch (event.type) {
      case "OrderPlaced":
        return {
          ...state,
          orders: [...state.orders, { id: event.data.orderId, status: "placed", total: event.data.total }],
          totalRevenue: state.totalRevenue + event.data.total,
        }
      case "OrderShipped":
        return {
          ...state,
          orders: state.orders.map((o) => (o.id === event.data.orderId ? { ...o, status: "shipped" } : o)),
        }
      case "OrderRefunded":
        return {
          ...state,
          orders: state.orders.map((o) => (o.id === event.data.orderId ? { ...o, status: "refunded" } : o)),
          totalRevenue: state.totalRevenue - event.data.refundAmount,
        }
      default:
        return state
    }
  },
}
```
### Rebuilding Projections
Since events are the source of truth, projections can be rebuilt at any time:
1. Drop existing projection state
2. Replay all events through projection logic
3. New projection reflects current business logic
**Use cases**:
- Bug fix in projection logic
- New read model for new query pattern
- Schema change in read database
**Warning**: Rebuilding from millions of events takes time. For Netflix Downloads, "re-runs took DAYS to complete" during development—mitigated by snapshots and event archival.
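The rebuild procedure is just the left-fold run again from the initial state. The sketch below assumes a simplified `OrderEvent` union and a two-parameter `Projection` type defined only for this example; `rebuild` and `revenueProjection` are illustrative names.

```typescript
type OrderEvent =
  | { type: "OrderPlaced"; total: number }
  | { type: "OrderRefunded"; refundAmount: number }

interface Projection<TState, TEvent> {
  initialState: TState
  apply: (state: TState, event: TEvent) => TState
}

// Rebuild = forget the stored read model and fold the full history again.
function rebuild<TState, TEvent>(
  projection: Projection<TState, TEvent>,
  allEvents: readonly TEvent[],
): TState {
  return allEvents.reduce(
    (state, event) => projection.apply(state, event),
    projection.initialState,
  )
}

// Hypothetical fixed projection: imagine the previous version ignored refunds.
const revenueProjection: Projection<{ totalRevenue: number }, OrderEvent> = {
  initialState: { totalRevenue: 0 },
  apply(state, event) {
    switch (event.type) {
      case "OrderPlaced":
        return { totalRevenue: state.totalRevenue + event.total }
      case "OrderRefunded":
        return { totalRevenue: state.totalRevenue - event.refundAmount }
      default:
        return state
    }
  },
}
```

Deploying the corrected `apply` and calling `rebuild` over the full history yields a read model that reflects the fix for past events as well as future ones.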
### Cross-Projection Dependencies
**Problem**: Projection A depends on Projection B's state. During replay, B may not be at the same position as A.
From Dennis Doomen's production experience:
> "Rebuilding projection after schema upgrade caused projector to read from another projection that was at a different state in the event store's history."
**Solutions**:
1. **Single-pass projections**: Project to denormalized models; avoid joins at projection time
2. **Explicit dependencies**: Projections declare dependencies; rebuild engine ensures correct ordering
3. **Position tracking**: Each projection tracks which events it has seen; dependent projections wait
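Position tracking (solution 3) can be reduced to a single check before a dependent projection handles an event. This is a sketch under assumptions: `Checkpoints`, `ProjectionSpec`, and `canProcess` are hypothetical names, and positions are assumed to be global, monotonically increasing event numbers.

```typescript
// Last event position each projection has fully processed.
type Checkpoints = Map<string, number>

interface ProjectionSpec {
  name: string
  dependsOn: string[]
}

// A projection may process the event at `position` only when every
// projection it reads from has already processed that same event.
function canProcess(spec: ProjectionSpec, position: number, checkpoints: Checkpoints): boolean {
  return spec.dependsOn.every((dep) => (checkpoints.get(dep) ?? 0) >= position)
}
```

A rebuild engine would hold back the dependent projection while `canProcess` returns `false` and retry after the dependency's checkpoint advances.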
## Event Store Technologies
### EventStoreDB (KurrentDB)
Purpose-built for event sourcing.
**Performance characteristics**:
- 15,000+ writes/second
- 50,000+ reads/second
- Quorum-based replication (majority must acknowledge)
**Strengths**:
- Native event sourcing primitives (streams, subscriptions, projections)
- Optimistic concurrency per stream
- Persistent subscriptions with checkpointing
- Built-in projections in JavaScript
**Limitations**:
- No built-in sharding (must implement at application level)
- Performance bounded by disk I/O
- Read-only replicas require cluster connection
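Per-stream optimistic concurrency is the primitive that makes "save this event only if the version is still X" possible. The sketch below shows the idea with an in-memory store; it is not the EventStoreDB client API, and `InMemoryEventStore`, `ConcurrencyError`, and the use of stream length as the version are assumptions for illustration.

```typescript
class ConcurrencyError extends Error {}

class InMemoryEventStore<TEvent> {
  private streams = new Map<string, TEvent[]>()

  readStream(streamId: string): TEvent[] {
    return [...(this.streams.get(streamId) ?? [])]
  }

  // Append only if the stream is still at the version the caller read.
  // Returns the new stream version.
  append(streamId: string, expectedVersion: number, events: TEvent[]): number {
    const stream = this.streams.get(streamId) ?? []
    if (stream.length !== expectedVersion) {
      throw new ConcurrencyError(
        `expected version ${expectedVersion}, stream is at ${stream.length}`,
      )
    }
    this.streams.set(streamId, [...stream, ...events])
    return stream.length + events.length
  }
}
```

A command handler that loses the race typically reloads the aggregate, re-validates the command against the newer state, and retries the append.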
### Apache Kafka
Designed for event streaming, not event sourcing.
**Why Kafka is problematic for event sourcing** (from Serialized.io):
1. **Topic scalability**: Not designed for millions of topics (aggregates can easily reach millions)
2. **Entity loading**: No practical way to load events for specific entity by ID within a topic
3. **No optimistic concurrency**: Cannot do "save this event only if version is still X"
4. **Compaction destroys history**: Log compaction keeps only latest value per key—antithetical to event sourcing
**Where Kafka fits**: Transport layer for events to downstream consumers. Use dedicated event store for source of truth, Kafka for integration.
### PostgreSQL-Backed Stores
Use a relational database as the event store.
**Implementation patterns**:
- **Transactional outbox**: Events + state change in same transaction
- **LISTEN/NOTIFY**: PostgreSQL async notifications for real-time subscriptions
- **Advisory locks**: Application-level locking for projection synchronization
**Libraries**: Marten (.NET), pg-event-store (Node.js), Eventide (Ruby)
**Trade-offs**:
| Aspect | PostgreSQL | EventStoreDB |
| ------------------------- | ------------------------------- | ---------------------------------- |
| Operational familiarity | High (existing expertise) | New technology to learn |
| Event sourcing primitives | Must build or use library | Native |
| Transactions | Full ACID with application data | Event store separate from app DB |
| Scaling | Traditional DB scaling | Purpose-built but limited sharding |
### Cloud Services
**AWS EventBridge**: Event routing and orchestration; not an event store. 90+ AWS service integrations. Use for serverless event-driven architectures, not as source of truth.
**Azure Event Hubs**: High-volume event ingestion (millions/second). Kafka-compatible. Geo-disaster recovery. Similar caveats as Kafka for event sourcing.
## Production Implementations
### LMAX: 6 Million Orders/Second
**Context**: Financial exchange requiring ultra-low latency and complete audit trail.
**Architecture**:
- Business Logic Processor (BLP): Single-threaded, in-memory, event-sourced
- Disruptor: Lock-free ring buffer for I/O (25M+ messages/second, sub-50ns latency)
- Replication: Three BLPs process all input events (two in primary DC, one DR)
**Specific details**:
- Ring buffers: 20M slots (input), 4M slots (output each)
- Nightly snapshots; BLPs restart nightly with zero downtime
- Cryptographic operations offloaded from core logic
- Validation split: state-independent checks before BLP, state-dependent in BLP
**What worked**: In-memory event sourcing eliminated database as bottleneck.
**What was hard**: Debugging distributed replay; ensuring determinism in business logic.
**Source**: [Martin Fowler - LMAX Architecture](https://martinfowler.com/articles/lmax.html)
### Netflix Downloads: Six-Month Build
**Context**: November 2016 feature launch requiring stateful service (download tracking, licensing) built in months.
**Architecture**:
- Cassandra-backed event store
- Three components: Event Store, Aggregate Repository, Aggregate Service
- Async projections for query optimization
**Specific details**:
- Horizontal scaling: only disk space was the constraint
- Projections rebuilt during debugging/development
- Full auditing enabled rapid issue diagnosis
**What worked**: Event sourcing flexibility enabled pivoting requirements during development.
**What was hard**: Rebuild times during development ("re-runs took DAYS")—mitigated by snapshotting.
**Source**: [InfoQ - Scaling Event Sourcing for Netflix Downloads](https://www.infoq.com/presentations/netflix-scale-event-sourcing/)
### Uber Cadence: Durable Execution at Scale
**Context**: Workflow orchestration for hundreds of applications across Uber.
**Architecture**:
- Event sourcing for workflow state (deterministic replay)
- Components: Front End, History Service, Matching Service
- Cassandra backend
**Specific details**:
- Multitenant clusters with hundreds of domains each
- Single service supports 100+ applications
- Worker Goroutines reduced from 16K to 100 per history host (a reduction of over 99%)
**Key insight**: Workflows are event-sourced—state reconstructed by replaying historical events. Cadence (and its fork, Temporal) demonstrates event sourcing for durable execution, not just data storage.
**Source**: [Uber Engineering - Cadence Overview](https://www.uber.com/blog/open-source-orchestration-tool-cadence-overview/)
### Implementation Comparison
| Aspect | LMAX | Netflix Downloads | Uber Cadence |
| ----------------- | ------------------ | ----------------- | ---------------------- |
| Domain | Financial exchange | Media licensing | Workflow orchestration |
| Reported scale | 6M orders/second | Not disclosed | Hundreds of domains per cluster |
| Consistency | Single-threaded | Eventual | Per-workflow |
| Storage | In-memory + disk | Cassandra | Cassandra |
| Snapshot strategy | Nightly | As needed | Per-workflow history |
| Team size | Small, specialized | 6-month team | Platform team |
## Common Pitfalls
### 1. Unbounded Event Growth
**The mistake**: Not planning for event retention or archival.
**Example**: Financial app storing every price tick (millions/day). After one year, rebuilding account balances requires replaying 3TB of events. Queries take minutes; cold starts are impossible.
**Solutions**:
- **Snapshots**: Every N events or time period
- **Archival**: Move events older than N days to cold storage (S3, Glacier)
- **Temporal streams**: Segment streams by time period (orders-2024-Q1)
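Time-segmented stream names such as `orders-2024-Q1` can be derived from the event timestamp. The helper below is a hypothetical sketch; the `category-year-Qn` format follows the example above, and using UTC for the quarter boundary is an assumption.

```typescript
// Hypothetical naming helper for quarterly stream segments.
function quarterlyStreamName(category: string, occurredAt: Date): string {
  const year = occurredAt.getUTCFullYear()
  const quarter = Math.floor(occurredAt.getUTCMonth() / 3) + 1 // months 0-2 are Q1
  return `${category}-${year}-Q${quarter}`
}
```

Older segments can then be archived as whole streams, and a rebuild only needs to read the segments inside the period of interest.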
### 2. Assuming Event Order Guarantees
**The mistake**: Building logic that assumes "event A always comes before event B."
**Reality** (from production war stories): "Events are duplicated, delayed, reordered, partially processed, or replayed long after the context that created them has changed."
**Example**: Projection assumes `OrderPlaced` arrives before `OrderShipped`. During backfill, events arrive out of order; projection crashes.
**Solutions**:
- Treat each event as independent input
- Build projections that handle any arrival order
- Use event timestamps for ordering within projections, not arrival order
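One way to build a projection that tolerates any arrival order is to let each event set only the facts it carries and derive status afterwards. The sketch below assumes a simplified two-event model; `OrderRow`, `applyOrderEvent`, and `statusOf` are illustrative names.

```typescript
type OrderEvent =
  | { type: "OrderPlaced"; orderId: string; total: number }
  | { type: "OrderShipped"; orderId: string }

interface OrderRow {
  total?: number // unknown until OrderPlaced is seen
  placed: boolean
  shipped: boolean
}

type OrderTable = Map<string, OrderRow>

// Each event records only its own fact, so reordering and duplicates converge
// on the same row instead of crashing on a missing order.
function applyOrderEvent(table: OrderTable, event: OrderEvent): void {
  const row: OrderRow = table.get(event.orderId) ?? { placed: false, shipped: false }
  if (event.type === "OrderPlaced") {
    row.placed = true
    row.total = event.total
  } else {
    row.shipped = true
  }
  table.set(event.orderId, row)
}

// Status is derived from accumulated facts, not from the last event seen.
function statusOf(row: OrderRow): "shipped" | "placed" | "unknown" {
  if (row.shipped) return "shipped"
  return row.placed ? "placed" : "unknown"
}
```

With this shape, `OrderShipped` arriving first creates a row with an unknown total, and the later `OrderPlaced` fills it in without downgrading the status.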
### 3. Dual-Write to Event Store and Downstream
**The mistake**: Writing event to store, then publishing to message queue in separate operations.
**Example**: `appendEvent()` succeeds, `publishToKafka()` fails. Event exists but downstream never sees it.
**Solutions**:
- **Transactional outbox**: Write event + outbox entry in same transaction; separate process polls outbox
- **Change Data Capture (CDC)**: Database-level capture of event table changes
- **Event store with built-in pub/sub**: EventStoreDB subscriptions
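The transactional outbox can be sketched with an in-memory stand-in for the database. Everything here is illustrative: `FakeDatabase` simulates a single transaction with one synchronous method, and `relayOutbox` plays the role of the separate polling process. Delivery is at-least-once, so consumers still need to be idempotent.

```typescript
interface OutboxEntry {
  id: number
  payload: string
  published: boolean
}

class FakeDatabase {
  events: string[] = []
  outbox: OutboxEntry[] = []

  // Stand-in for one database transaction: both writes happen or neither does.
  appendWithOutbox(payload: string): void {
    const nextEvents = [...this.events, payload]
    const nextOutbox = [...this.outbox, { id: this.outbox.length + 1, payload, published: false }]
    this.events = nextEvents
    this.outbox = nextOutbox
  }
}

// Relay: publish pending entries in order; mark an entry only after publish succeeds.
function relayOutbox(db: FakeDatabase, publish: (payload: string) => void): number {
  let sent = 0
  for (const entry of db.outbox) {
    if (entry.published) continue
    try {
      publish(entry.payload)
      entry.published = true
      sent++
    } catch {
      break // preserve ordering: retry this entry on the next poll
    }
  }
  return sent
}
```

If the broker is down, the entry simply stays unpublished and the next poll retries it, which removes the "event exists but downstream never sees it" failure from the example above.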
### 4. Projection Complexity Explosion
**The mistake**: Building projections with complex joins, aggregations, and cross-stream queries.
**Example**: "Order revenue by region by product category by month" projection requires joining order events, product catalog, and region data. Rebuild takes hours.
**Solutions**:
- Denormalize aggressively at projection time
- Accept data duplication in read models
- Build multiple focused projections vs one complex one
- Consider whether query belongs in analytics layer (data warehouse)
### 5. Schema Migration Nightmares
**The mistake**: Changing event schemas without migration strategy.
**Example** (from production): Team added `discount_code` field to `OrderPlaced` events. Old projections ignored it. Replay applied 2024 logic to 2022 events, giving customers unintended discounts.
**Solutions**:
- **Additive changes only**: New optional fields with defaults
- **Versioned projections**: Tag projections with compatible schema versions
- **Code version in events**: Every event carries code version that created it
- **Upcasters**: Transform old events to current schema on read
### 6. "Time Travel Debugging" Oversold
**The mistake**: Expecting event sourcing to make debugging trivial.
**Reality** (from Chris Kiehl's experience): "99% of the time 'bad states' were bad events caused by standard run-of-the-mill human error. Having a ledger provided little value over normal debugging intuition."
**Practical questions**:
- How do you apply time-travel on production data?
- How do you inspect intermediate states if events are in binary format?
- How do you fix bugs for users already impacted?
**Example**: "Order stuck in pending for 3 hours" took 6 hours to debug—what would've been a 20-minute fix in CRUD required understanding the full event history.
**Solution**: Event sourcing provides audit capability, not magic debugging. Invest in:
- Event visualization tooling
- Projection debugging (show state at any event)
- Compensating event workflows for fixing bad state
## GDPR and Data Deletion
Event sourcing's immutability conflicts with "right to be forgotten."
### Crypto Shredding Pattern
The pattern, as described by Mathias Verraes:
**How it works**:
1. Store personal data encrypted in event store (not plaintext)
2. Store encryption key separately (different database/filesystem)
3. When user requests deletion, delete only the encryption key
4. Encrypted data becomes permanently unreadable
**Implementation**:
```typescript collapse={1-4, 18-25}
// Events store only encrypted personal data
interface OrderPlacedEvent {
  orderId: string
  customerRef: string // Reference to customer, not PII
  encryptedCustomerData: string // AES-encrypted name, address
  items: OrderItem[]
}
// Key storage separate from event store
interface CustomerKeyStore {
  getKey(customerRef: string): Promise<string | null> // null once the key has been deleted
  deleteKey(customerRef: string): Promise<void> // "Forget" customer
}
// Projection decrypts only if key exists
async function projectOrder(event: OrderPlacedEvent, keyStore: CustomerKeyStore) {
  const key = await keyStore.getKey(event.customerRef)
  const customerData = key
    ? await decrypt(event.encryptedCustomerData, key)
    : { name: "[deleted]", address: "[deleted]" }
  return {
    orderId: event.orderId,
    customerName: customerData.name,
    // ...
  }
}
```
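The encrypt, decrypt, and forget steps themselves can be sketched with Node's built-in `node:crypto` module. This is one possible shape, not the article's implementation: the in-memory `customerKeys` map stands in for the separate key store, keys are raw `Buffer`s rather than strings, and the choice of AES-256-GCM with a packed nonce, tag, and ciphertext layout is an assumption.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto"

// In-memory stand-in for the separate key store: one 256-bit key per customer.
const customerKeys = new Map<string, Buffer>()

function encryptFor(customerRef: string, plaintext: string): string {
  let key = customerKeys.get(customerRef)
  if (!key) {
    key = randomBytes(32)
    customerKeys.set(customerRef, key)
  }
  const iv = randomBytes(12) // fresh nonce per message, as GCM requires
  const cipher = createCipheriv("aes-256-gcm", key, iv)
  const body = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()])
  // Pack nonce (12 bytes) + auth tag (16 bytes) + ciphertext into one base64 string.
  return Buffer.concat([iv, cipher.getAuthTag(), body]).toString("base64")
}

function decryptFor(customerRef: string, encoded: string): string | null {
  const key = customerKeys.get(customerRef)
  if (!key) return null // key shredded: the payload can no longer be read
  const raw = Buffer.from(encoded, "base64")
  const decipher = createDecipheriv("aes-256-gcm", key, raw.subarray(0, 12))
  decipher.setAuthTag(raw.subarray(12, 28))
  return Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]).toString("utf8")
}

// "Right to be forgotten": delete only the key, never the events.
function forgetCustomer(customerRef: string): void {
  customerKeys.delete(customerRef)
}
```

In a real deployment the key store would be a separate, access-controlled system, since any surviving copy of a key (including in backups of the key store) keeps the data recoverable.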
**Benefits**:
- Event stream remains intact (immutable)
- Deleting key effectively removes data from all backups
- Existing events still queryable for non-personal data
**Legal caveat**: GDPR states encrypted personal information is still personal information. Some legal interpretations argue crypto shredding may not be fully compliant. Consult legal counsel.
### Forgettable Payloads Alternative
Store personal data in separate database; events contain only references.
**Trade-off**: Requires joins at read time; personal data store becomes another system to maintain.
## Conclusion
Event sourcing replaces mutable state with an immutable log of domain events. Current state is derived, not stored. This enables complete audit trails, temporal queries, and projection flexibility—at the cost of increased complexity, eventual consistency, and schema evolution challenges.
Key takeaways:
1. **Not a silver bullet**: CQRS/ES adds complexity. Use only for bounded contexts where audit trails or temporal queries justify the cost.
2. **Design decisions multiply**: Stream granularity, projection strategy, snapshot policy, and schema evolution must all be decided.
3. **Production realities differ from theory**: Real systems handle out-of-order events, duplicate delivery, projection lag, and schema migrations.
4. **Start simple**: No snapshots initially. Synchronous projections if consistency matters. Add complexity only when measured.
5. **Kafka is not an event store**: Purpose-built stores (EventStoreDB) or database-backed solutions (Marten, PostgreSQL) are better fits.
## Appendix
### Prerequisites
- Understanding of domain-driven design (aggregates, bounded contexts)
- Familiarity with distributed systems consistency models (eventual vs strong)
- Basic understanding of CQRS (Command Query Responsibility Segregation)
### Terminology
| Term | Definition |
| -------------------------- | ------------------------------------------------------------------------- |
| **Event** | Immutable fact representing something that happened in the domain |
| **Stream** | Ordered sequence of events for an aggregate |
| **Projection** | Read model derived by applying events to initial state |
| **Snapshot** | Materialized state at a point in time; optimization for replay |
| **Upcasting** | Transforming old event schema to current schema on read |
| **Aggregate** | DDD concept; consistency boundary containing related entities |
| **Left-fold** | Reducing a sequence to a single value by applying a function cumulatively |
| **Optimistic concurrency** | Conflict detection by checking version hasn't changed since read |
| **Transactional outbox** | Pattern for reliable event publishing within database transaction |
### Summary
- **Core pattern**: State = replay(events). Events are immutable facts; current state is derived.
- **Design paths**: Pure ES vs hybrid; sync vs async projections. Choose based on consistency and throughput requirements.
- **Snapshots**: Add when replay time becomes problematic. Event-count or time-based triggers.
- **Schema evolution**: Additive changes preferred. Upcasting for breaking changes. Never modify existing events in place.
- **Production reality**: Handle out-of-order events, duplicates, projection lag, and unbounded growth.
- **GDPR**: Crypto shredding pattern separates encryption keys from encrypted data.
### References
#### Foundational Sources
- [Martin Fowler - Event Sourcing](https://martinfowler.com/eaaDev/EventSourcing.html) - Original definition and patterns
- [Martin Fowler - CQRS](https://martinfowler.com/bliki/CQRS.html) - CQRS relationship and warnings
- [Greg Young - CQRS Documents](https://cqrs.files.wordpress.com/2010/11/cqrs_documents.pdf) - Comprehensive CQRS/ES guide from the pattern's creator
- [Microsoft - Exploring CQRS and Event Sourcing](https://www.microsoft.com/en-us/download/details.aspx?id=34774) - Reference implementation with Greg Young foreword
#### Production Case Studies
- [Martin Fowler - LMAX Architecture](https://martinfowler.com/articles/lmax.html) - 6M orders/second event sourcing
- [Netflix Tech Blog - Scaling Event Sourcing for Downloads](https://netflixtechblog.com/scaling-event-sourcing-for-netflix-downloads-episode-2-ce1b54d46eec) - Six-month build case study
- [Walmart Global Tech - Inventory Availability System](https://medium.com/walmartglobaltech/design-inventory-availability-system-using-event-sourcing-1d0f022e399f) - Hybrid event sourcing at scale
- [Uber Engineering - Cadence Overview](https://www.uber.com/blog/open-source-orchestration-tool-cadence-overview/) - Event sourcing for durable execution
#### Implementation Patterns
- [Event-Driven.io - Projections and Read Models](https://event-driven.io/en/projections_and_read_models_in_event_driven_architecture/) - Projection lifecycle patterns
- [Event-Driven.io - Event Versioning](https://event-driven.io/en/how_to_do_event_versioning/) - Schema evolution strategies
- [Domain Centric - Snapshotting](https://domaincentric.net/blog/event-sourcing-snapshotting) - Snapshot strategy decision framework
- [Marten Documentation](https://martendb.io/) - PostgreSQL-backed event sourcing for .NET
#### Failure Modes and Lessons
- [Dennis Doomen - The Ugly of Event Sourcing](https://www.dennisdoomen.com/2017/11/the-ugly-of-event-sourcingreal-world.html) - Production war stories
- [Chris Kiehl - Event Sourcing is Hard](https://chriskiehl.com/article/event-sourcing-is-hard) - Debugging reality check
- [Serialized.io - Why Kafka is Not for Event Sourcing](https://serialized.io/blog/apache-kafka-is-not-for-event-sourcing/) - Technology selection guidance
#### GDPR and Compliance
- [Mathias Verraes - Crypto Shredding Pattern](https://verraes.net/2019/05/eventsourcing-patterns-throw-away-the-key/) - Data deletion strategy
- [Event-Driven.io - GDPR in Event-Driven Architecture](https://event-driven.io/en/gdpr_in_event_driven_architecture/) - Compliance patterns
#### Tools and Frameworks
- [EventStoreDB Documentation](https://developers.eventstore.com/) - Purpose-built event store
- [LMAX Disruptor](https://lmax-exchange.github.io/disruptor/) - High-performance ring buffer
- [Axon Framework](https://www.axoniq.io/framework) - Java CQRS/ES toolkit (70M+ downloads)
- [Temporal.io](https://temporal.io/) - Durable execution platform (Cadence fork)
---
## Change Data Capture
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/core-distributed-patterns/change-data-capture
**Category:** System Design / Core Distributed Patterns
**Description:** Change Data Capture (CDC) extracts and streams database changes to downstream systems in real-time. Rather than polling databases or maintaining dual-write logic, CDC reads directly from the database’s internal change mechanisms—transaction logs, replication streams, or triggers—providing a reliable, non-invasive way to propagate data changes across systems. This article covers CDC approaches, log-based implementation internals, production patterns, and when each variant makes sense.
# Change Data Capture
Change Data Capture (CDC) extracts and streams database changes to downstream systems in real-time. Rather than polling databases or maintaining dual-write logic, CDC reads directly from the database's internal change mechanisms—transaction logs, replication streams, or triggers—providing a reliable, non-invasive way to propagate data changes across systems.
This article covers CDC approaches, log-based implementation internals, production patterns, and when each variant makes sense.

CDC captures changes from the database's transaction log and emits structured change events to downstream consumers—each event contains the operation type, before/after state, and source metadata.
## Abstract
CDC provides **eventually-consistent data propagation** without application-level dual writes. The fundamental insight: databases already track all changes internally for durability and replication—CDC exposes this stream externally.
**Core mental model:**
- **Log-based CDC** reads the database's Write-Ahead Log (WAL) or binary log. Non-invasive, captures all changes including those from direct SQL. The gold standard for production.
- **Trigger-based CDC** uses database triggers to capture changes. Higher database overhead, but works when log access is unavailable.
- **Polling-based CDC** queries for changes via timestamps or sequence columns. Misses hard deletes, adds query load, but requires no special database access.
**Key insight**: The choice between approaches is a **source access vs. operational complexity** tradeoff. Log-based requires database configuration and replication slots but captures everything with minimal overhead. Polling requires no special access but misses deletions and adds query load.
**Production reality**: Log-based CDC dominates production systems. Debezium (open-source) and AWS DMS (managed) are the primary tools. Most implementations use Kafka as the change event transport.
## The Problem
### Why Naive Solutions Fail
**Approach 1: Dual Writes in Application Code**
```typescript collapse={1-5}
async function updateUser(userId: string, data: UserData) {
  await db.users.update(userId, data)
  await kafka.publish("users", { op: "UPDATE", after: data })
}
```
Fails because:
- **Partial failures**: Database commits but Kafka publish fails. Data is now inconsistent.
- **Distributed transaction complexity**: 2PC across database and Kafka is slow and fragile.
- **Missed changes**: Direct SQL updates, migrations, and other services bypass the publish logic.
- **Ordering**: No guarantee that Kafka messages arrive in commit order.
**Approach 2: Polling with Timestamps**
```sql
SELECT * FROM users WHERE updated_at > :last_poll_time
```
Fails because:
- **Misses hard deletes**: Deleted rows don't appear in query results.
- **Clock skew**: `updated_at` timestamp may not reflect actual commit order across replicas.
- **Polling interval trade-off**: Frequent polling adds load; infrequent polling adds latency.
- **Transaction visibility**: A long-running transaction can commit rows whose `updated_at` is older than the last poll time, so the next poll silently skips them.
**Approach 3: Trigger-Based Capture**
```sql
CREATE TRIGGER user_changes AFTER INSERT OR UPDATE OR DELETE ON users
FOR EACH ROW EXECUTE FUNCTION capture_change();
```
Fails at scale because:
- **Transaction overhead**: Trigger executes within the transaction, adding latency to every write.
- **Lock contention**: Writing to a change table can create lock conflicts.
- **Operational burden**: Triggers must be maintained across schema changes.
### The Core Challenge
The fundamental tension: **application code cannot reliably capture all database changes without the database's cooperation**. Direct SQL, stored procedures, migrations, and multiple services all modify data outside application control.
CDC resolves this by **reading changes where they're already reliably recorded**—the database's transaction log. This log exists for durability and replication; CDC treats it as a public API.
## CDC Approaches
### Log-Based CDC (Primary Approach)
**How it works:**
1. CDC connector acts as a replica consumer for the database's transaction log
2. Connector maintains position (LSN, binlog coordinates, or GTID) for resumability
3. Changes parsed from binary log format into structured events
4. Events published to message broker, maintaining transaction boundaries
**Database-specific mechanisms:**
| Database | Log Type | Access Method | Position Tracking |
| ---------- | --------------------- | ------------------------ | ------------------------- |
| PostgreSQL | WAL (Write-Ahead Log) | Logical Replication Slot | LSN (Log Sequence Number) |
| MySQL | Binary Log | Binlog client protocol | GTID or file:position |
| MongoDB | Oplog | Change Streams API | Resume token |
| SQL Server | Transaction Log | CDC tables or log reader | LSN |
**Why log-based is preferred:**
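- **Complete capture**: Every committed row change, including deletes, is in the log; DDL is included where the log records it (MySQL's binlog does, PostgreSQL logical decoding does not)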
- **Minimal overhead**: Reading the log adds no load to write path
- **Transactional boundaries**: Changes can be grouped by transaction
- **Ordering guarantees**: Log order matches commit order
**Trade-offs:**
| Advantage | Disadvantage |
| ------------------------------ | ------------------------------- |
| Captures all changes | Requires database configuration |
| No write-path overhead | Log format is database-specific |
| Transaction ordering preserved | Replication slot management |
| Includes deletes (and DDL where the log records it) | Requires log retention tuning |
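The log-based flow described in this section (read after the saved position, publish in log order, advance the position) can be sketched as a single polling pass. This is a simplified model, not how Debezium or any specific connector is implemented: `LogEntry`, `CheckpointStore`, and `drainLog` are hypothetical, and the log is assumed to be an array already sorted by LSN.

```typescript
interface LogEntry {
  lsn: number
  op: "INSERT" | "UPDATE" | "DELETE"
  table: string
  row: Record<string, unknown>
}

interface CheckpointStore {
  load(): number
  save(lsn: number): void
}

// One polling pass over the log. Returns how many entries were published.
function drainLog(
  log: readonly LogEntry[],
  checkpoint: CheckpointStore,
  publish: (entry: LogEntry) => void,
): number {
  const resumeFrom = checkpoint.load()
  let published = 0
  for (const entry of log) {
    if (entry.lsn <= resumeFrom) continue // already delivered before the restart
    publish(entry)
    // Saved after publish: a crash in between causes a duplicate, never a gap.
    checkpoint.save(entry.lsn)
    published++
  }
  return published
}
```

Saving the checkpoint after publishing gives at-least-once delivery, which is why downstream consumers of CDC events are usually written to be idempotent.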
### Trigger-Based CDC
**How it works:**
1. Create triggers on source tables for INSERT, UPDATE, DELETE
2. Triggers write change records to shadow tables
3. Separate process polls shadow tables and publishes events
4. Shadow table records deleted after successful publish
**When to choose:**
- Log-based access unavailable (managed databases, permission restrictions)
- Only specific tables need capture (trigger overhead is localized)
- Legacy databases without logical replication support
**Trade-offs:**
| Advantage | Disadvantage |
| ------------------------------------- | -------------------------------- |
| Works without special database access | Adds latency to every write |
| Full control over captured data | Trigger maintenance overhead |
| Selective capture | Lock contention on shadow tables |
### Polling-Based CDC
**How it works:**
1. Query source tables periodically for changes since last poll
2. Use `updated_at` timestamp or sequence column to identify changes
3. Mark captured rows or track high-water mark
4. Publish changes to downstream systems
**When to choose:**
- Read replica available for polling (isolates from production writes)
- Soft deletes only (hard deletes not used)
- Near-real-time acceptable (seconds to minutes latency)
**Limitations:**
- Cannot capture hard deletes without tombstone markers
- Timestamp precision issues (multiple changes within same timestamp)
- Must poll frequently to approach real-time
- No transaction grouping
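A high-water-mark poller shows both how polling works and why hard deletes are invisible to it. The sketch below runs against an in-memory array instead of a database; `Row`, `PollResult`, and `pollChanges` are illustrative names, and `updatedAt` is a plain number standing in for a timestamp column.

```typescript
interface Row {
  id: number
  name: string
  updatedAt: number
}

interface PollResult {
  changes: Row[]
  highWaterMark: number
}

// Equivalent in spirit to: SELECT * FROM t WHERE updated_at > :mark ORDER BY updated_at
function pollChanges(table: readonly Row[], highWaterMark: number): PollResult {
  const changes = table
    .filter((row) => row.updatedAt > highWaterMark)
    .sort((a, b) => a.updatedAt - b.updatedAt)
  // Advance the mark to the newest change seen; keep it if nothing changed.
  const newMark = changes.reduce((max, row) => Math.max(max, row.updatedAt), highWaterMark)
  return { changes, highWaterMark: newMark }
}
```

If a row is hard-deleted between two polls, it simply no longer matches the filter, so no change is ever reported for it; that is the gap tombstone markers or soft deletes are meant to close.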
### Decision Framework

## Log-Based CDC Internals
### PostgreSQL: WAL and Logical Replication
PostgreSQL's CDC uses **logical replication**, which decodes the physical WAL into logical change events.
**Configuration requirements:**
```sql
-- Equivalent to editing postgresql.conf; all three settings require a server restart
ALTER SYSTEM SET wal_level = 'logical';     -- Required for logical replication
ALTER SYSTEM SET max_replication_slots = 4; -- At least one per CDC connector
ALTER SYSTEM SET max_wal_senders = 4;       -- Connections for replication
-- Create replication slot (done by Debezium automatically)
SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');
```
**Output plugins:**
| Plugin | Output Format | Use Case |
| --------------- | --------------- | ----------------------------------------------- |
| `pgoutput` | Binary protocol | Native PostgreSQL replication, Debezium default |
| `wal2json` | JSON | External systems requiring JSON |
| `test_decoding` | Text | Debugging and testing |
**Critical operational concern—slot bloat:**
PostgreSQL retains WAL until every replication slot has consumed it. If a CDC connector goes down, its slot stops advancing and WAL accumulates on the primary:
```sql
-- Monitor slot lag
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
-- Set maximum retained WAL (PostgreSQL 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
```
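Without `max_slot_wal_keep_size`, an inactive slot can fill the disk. This is the most common CDC production incident. With the limit set, PostgreSQL invalidates a slot that falls further behind than the limit, so the disk is protected but the connector must take a new snapshot.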
**Version evolution:**
> **PostgreSQL 17 (2024)**: Introduced logical replication failover. Replication slot state can be synchronized to standby servers, enabling CDC continuity during primary failover. Prior versions required re-snapshotting after failover.
### MySQL: Binary Log
MySQL's CDC reads the binary log, which records all data modifications.
**Configuration requirements:**
```ini
# my.cnf
server-id = 1 # Unique across replication topology
log_bin = mysql-bin # Enable binary logging
binlog_format = ROW # Required: ROW format (not STATEMENT)
binlog_row_image = FULL # Capture before and after state
binlog_expire_logs_seconds = 259200 # Retention: 3 days (replaces expire_logs_days, deprecated in 8.0)
```
**GTID (Global Transaction ID):**
GTIDs uniquely identify transactions across the replication topology, enabling position-independent replication.
```ini
# my.cnf: enable GTID mode
gtid_mode = ON
enforce_gtid_consistency = ON
# Format: server_uuid:transaction_id
# Example: 3E11FA47-71CA-11E1-9E33-C80AA9429562:23
```
**Why GTID matters for CDC:**
- **Resumability**: CDC connector can resume from GTID regardless of binlog file rotation
- **Failover**: After primary failover, GTID identifies exactly which transactions to resume from
- **Multi-source**: When capturing from multiple MySQL instances, GTIDs prevent duplicate processing
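The resumability and deduplication points above come down to set membership: a connector can ask whether a given GTID is already covered by the set it has processed. The sketch below parses the classic `uuid:start-end[:start-end...]` form. The function names are hypothetical, and tagged GTIDs from newer MySQL releases are deliberately not handled.

```python
def parse_gtid_set(gtid_set):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(start, end), ...]}."""
    result = {}
    for part in gtid_set.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = []
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
        result.setdefault(uuid.lower(), []).extend(ranges)
    return result

def contains(gtid_set, gtid):
    """True if a single GTID ('uuid:n') falls inside the executed set."""
    uuid, txn = gtid.split(":")
    ranges = parse_gtid_set(gtid_set).get(uuid.lower(), [])
    return any(low <= int(txn) <= high for low, high in ranges)

SERVER = "3E11FA47-71CA-11E1-9E33-C80AA9429562"
executed = f"{SERVER}:1-23:40-45"
already_seen = contains(executed, f"{SERVER}:23")
not_yet_seen = contains(executed, f"{SERVER}:24")
```

Because the check depends only on the server UUID and transaction number, it gives the same answer on any replica, regardless of binlog file names or offsets.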
**Binlog format comparison:**
| Format | Content | CDC Compatibility |
| --------- | ----------------------------------- | ---------------------------------------- |
| STATEMENT | SQL statements | Poor—cannot determine actual row changes |
| ROW | Actual row changes | Required for CDC |
| MIXED | Statement or row depending on query | Unreliable for CDC |
### MongoDB: Change Streams
MongoDB provides Change Streams, a high-level API over the oplog (operations log).
```typescript collapse={1-4}
import { MongoClient } from "mongodb"

const client = new MongoClient(uri) // uri: replica-set connection string, defined elsewhere
const db = client.db("mydb")
// Watch collection-level changes
const changeStream = db.collection("users").watch([], {
  fullDocument: "updateLookup", // Include full document on updates
  // Include before-image (MongoDB 6.0+; requires changeStreamPreAndPostImages on the collection)
  fullDocumentBeforeChange: "whenAvailable",
})
changeStream.on("change", (change) => {
  // change.operationType: 'insert' | 'update' | 'delete' | 'replace'
  // change.fullDocument: current document state
  // change.fullDocumentBeforeChange: previous state (if configured)
  // change._id: resume token for resumability
})
```
**Key differences from relational CDC:**
- **Schema-free**: Documents can vary; change events reflect actual structure
- **Nested changes**: Updates to nested fields captured as partial updates
- **Resume tokens**: Opaque tokens for resumability (vs. LSN/GTID)
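**Limitation**: Change Streams require a replica set or sharded cluster because they are built on the oplog. A standalone `mongod` doesn't support them, although a single-node replica set is enough for development.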
## Design Paths
### Path 1: Debezium + Kafka Connect
**Context**: Open-source CDC platform. Most popular choice for self-managed CDC.
**When to choose this path:**
- Self-managed infrastructure with Kafka already in place
- Need sub-second latency
- Require full control over configuration and schema handling
- Multi-database environments
**Key characteristics:**
- One Kafka topic per table (configurable)
- Schema Registry integration for Avro/Protobuf/JSON Schema
- Exactly-once source delivery with Kafka Connect 3.3.0+ (KIP-618) in distributed mode
- Snapshot for initial data load, then streaming
**Configuration example:**
```json collapse={1-2, 15-25}
{
"name": "users-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "db.example.com",
"database.port": "5432",
"database.user": "debezium",
"database.password": "${secrets:postgres/password}",
"database.dbname": "myapp",
"topic.prefix": "myapp",
"table.include.list": "public.users,public.orders",
"slot.name": "debezium_users",
"publication.name": "dbz_publication",
"snapshot.mode": "initial",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "rewrite"
}
}
```
**Trade-offs vs other paths:**
| Aspect | Debezium | AWS DMS | Fivetran |
| ------------------ | ------------------- | --------------- | ---------------- |
| Latency | Sub-second | Seconds-minutes | Seconds-minutes |
| Cost (100GB/day) | Infrastructure only | ~$200-400/mo | ~$1,500-3,000/mo |
| Operational burden | High | Low | Very low |
| Customization | Full control | Limited | Limited |
| Schema handling | Schema Registry | Basic | Automatic |
**Real-world: Shopify**
Shopify migrated from query-based to Debezium CDC:
- **Scale**: 173B requests on Black Friday 2024, 12TB data/minute peak
- **Implementation**: Debezium + Kafka Connect on Kubernetes
- **Outcome**: Real-time analytics replaced batch processing; fraud detection went from minutes to milliseconds
### Path 2: AWS Database Migration Service
**Context**: Managed CDC service integrated with AWS ecosystem.
**When to choose this path:**
- AWS-centric infrastructure
- Prefer managed over self-managed
- Target is AWS service (S3, Redshift, DynamoDB)
- Batch/near-real-time acceptable (not sub-second)
**Key characteristics:**
- Full load + ongoing CDC in single task
- Automatic schema migration (optional)
- Built-in monitoring via CloudWatch
- No Kafka required (direct to S3/Redshift)
**Limitations:**
- **Tables without primary keys**: Skipped during CDC (critical gap)
- **Latency**: Seconds to minutes, not sub-second
- **Large transactions**: Can cause significant lag
- **DDL propagation**: Limited support; may require manual intervention
**Cost model (2025):**
| Component | Pricing |
| -------------------- | ----------------------------------- |
| Replication instance | $0.016-$0.624/hour (size-dependent) |
| Data transfer | Standard AWS rates |
| Storage | $0.10/GB-month |
### Path 3: Maxwell's Daemon (MySQL-Specific)
**Context**: Lightweight MySQL CDC tool. Simpler than Debezium for MySQL-only environments.
**When to choose:**
- MySQL only
- Want simpler deployment than full Kafka Connect
- JSON output acceptable (no schema registry)
- Lower operational overhead priority
**Output format:**
```json
{
"database": "myapp",
"table": "users",
"type": "update",
"ts": 1706745600,
"data": { "id": 1, "name": "Alice", "email": "alice@example.com" },
"old": { "name": "Old Name" }
}
```
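In the event above, `data` holds the full row after the change while `old` holds only the previous values of the columns that changed. A consumer that needs a full before-image can overlay `old` on `data`. The helper names below are hypothetical.

```python
def before_image(event):
    """Rebuild the full previous row for an update event, or None otherwise."""
    if event["type"] != "update":
        return None
    # Unchanged columns come from `data`; changed columns are restored from `old`.
    return {**event["data"], **event.get("old", {})}

def changed_columns(event):
    """Names of the columns whose values changed in this event."""
    return sorted(event.get("old", {}))

event = {
    "database": "myapp",
    "table": "users",
    "type": "update",
    "ts": 1706745600,
    "data": {"id": 1, "name": "Alice", "email": "alice@example.com"},
    "old": {"name": "Old Name"},
}
previous = before_image(event)
changed = changed_columns(event)
```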
**Trade-offs:**
| Advantage | Disadvantage |
| ----------------------- | ---------------------------- |
| Simple deployment | MySQL only |
| Multiple output targets | No schema registry |
| Lightweight | Less mature ecosystem |
| Easy JSON parsing | Single-threaded per database |
### Comparison Matrix
| Factor | Debezium | AWS DMS | Maxwell | Fivetran |
| ------------------ | --------------- | --------------- | ------------ | --------------- |
| Databases | 10+ | 20+ | MySQL only | 500+ |
| Latency | Sub-second | Seconds-minutes | Sub-second | Seconds-minutes |
| Deployment | Self-managed | Managed | Self-managed | SaaS |
| Schema evolution | Schema Registry | Basic | JSON only | Automatic |
| Cost at scale | Low (infra) | Medium | Low | High |
| Operational burden | High | Low | Medium | Very low |
## Production Implementations
### LinkedIn: Databus
**Context**: LinkedIn built Databus (2012) as one of the first production CDC systems. Open-sourced; influenced later designs.
**Implementation details:**
- **Relay pattern**: Relays pull from OLTP database, deserialize to Avro, store in circular memory buffer
- **Bootstrap service**: Provides full data snapshots for new consumers or catch-up
- **Infinite lookback**: New consumers can request full dataset without stressing production database
- **Transactional ordering**: Preserves commit order within source
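The relay and bootstrap roles can be sketched with a bounded in-memory buffer. This is an illustrative model, not Databus code: the `Relay` class, the contiguous integer sequence numbers, and the `None` return that signals "bootstrap required" are all assumptions.

```python
from collections import deque

class Relay:
    """Keeps only the most recent `capacity` change events in memory."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # items are (sequence, event)

    def append(self, sequence, event):
        self.buffer.append((sequence, event))  # oldest entry is evicted when full

    def read_since(self, sequence):
        """Events newer than `sequence`, or None if the window has moved past it."""
        if self.buffer and self.buffer[0][0] > sequence + 1:
            return None  # gap: some events the consumer needs were evicted
        return [event for seq, event in self.buffer if seq > sequence]

relay = Relay(capacity=3)
for seq in range(1, 6):
    relay.append(seq, f"change-{seq}")  # buffer now holds sequences 3, 4, 5

caught_up = relay.read_since(3)      # slightly behind: served from memory
needs_bootstrap = relay.read_since(1)  # too far behind: must load a snapshot first
```

A consumer that receives `None` would load a snapshot from the bootstrap service, note the sequence number the snapshot is consistent with, and then resume reading from the relay.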
**Scale:**
- Thousands of events/second per relay server
- Millisecond end-to-end latency
- Powers: Social Graph Index, People Search Index, member profile replicas
**Key insight from LinkedIn:**
> "The relay maintains a sliding time window of changes in memory. Consumers that fall behind can catch up from the relay; consumers that fall too far behind bootstrap from a snapshot and then resume streaming."
### Airbnb: SpinalTap + Riverbed
**Context**: Airbnb uses CDC for their materialized views framework, processing billions of events daily.
**SpinalTap (CDC layer):**
- Scalable CDC across MySQL, DynamoDB, and internal storage
- Kafka as event transport
- Handles sharded monolith with transactional consistency
**Riverbed (materialized views):**

**Scale (2024):**
- 2.4 billion CDC events per day
- 350 million documents written daily to materialized views
- 50+ materialized views (search, payments, reviews, itineraries)
- Lambda architecture: Kafka (online) + Spark (offline)
**What worked:**
- GraphQL DSL for declarative view definitions
- Automatic schema evolution handling
- Real-time search index updates
### Netflix: DBLog
**Context**: Netflix developed DBLog for CDC across heterogeneous databases.
**Key innovation—incremental snapshots:**
Traditional CDC: Full snapshot (locks table) → Start streaming
DBLog approach:
```
1. Start CDC streaming (no snapshot)
2. Incrementally snapshot in chunks:
- Select small range by primary key
- Emit snapshot events
- Continue streaming concurrently
3. Reconcile snapshot with streaming at consumer
```
**Benefits:**
- No long-running locks or table copies
- Snapshot can be paused/resumed
- Works alongside live traffic
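A simplified sketch of how a chunk can be reconciled with concurrent log events. It assumes the watermark approach described in Netflix's DBLog write-up: the chunk is selected between a low and a high watermark, and any chunk row whose primary key also appears in the log during that window is dropped because the log event is newer. The function name and data shapes are hypothetical.

```python
def interleave_chunk(chunk_rows, window_events):
    """Merge one snapshot chunk with the log events seen while it was selected.

    chunk_rows:    {primary_key: row} read between the low and high watermark
    window_events: [(primary_key, row), ...] log events inside that window
    """
    changed_keys = {pk for pk, _ in window_events}
    # Log events keep their original order and always win over snapshot rows.
    output = [("log", pk, row) for pk, row in window_events]
    # Surviving snapshot rows are emitted once the high watermark is reached.
    output += [
        ("snapshot", pk, row)
        for pk, row in sorted(chunk_rows.items())
        if pk not in changed_keys
    ]
    return output

chunk = {1: "a0", 2: "b0", 3: "c0"}
window = [(2, "b1")]  # row 2 was updated while the chunk was being read
merged = interleave_chunk(chunk, window)
```

Dropping the stale snapshot row for key 2 is what prevents an old value from overwriting a newer log event downstream.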
**Production since 2018:**
- Powers Delta platform (data synchronization)
- Studio applications event processing
- Connectors for MySQL, PostgreSQL, CockroachDB, Cassandra
### WePay: Cassandra CDC
**Context**: WePay (now part of Chase) built CDC for Cassandra, which exposes changes only as raw commit-log segments on each node rather than as a change-stream API.
**Key design decisions:**
- **Agent per node**: Each Cassandra node has a local CDC agent reading commit logs
- **Primary agent pattern**: Each agent is "primary" for a subset of partition keys, avoiding duplicates
- **Exactly-once**: Achieved at agent level through offset tracking
**Open-sourced**: Now part of Debezium as incubating Cassandra connector.
### Implementation Comparison
| Aspect | LinkedIn Databus | Airbnb SpinalTap | Netflix DBLog | WePay Cassandra |
| ----------------- | ----------------- | ------------------ | -------------------- | --------------------- |
| Primary database | Oracle/MySQL | MySQL/DynamoDB | Heterogeneous | Cassandra |
| Snapshot approach | Bootstrap server | Full then stream | Incremental chunks | N/A (no snapshot) |
| Scale | Thousands/sec | Billions/day | Studio-scale | Payments-scale |
| Open-source | Yes (archived) | No | Concepts only | Yes (Debezium) |
| Key innovation | Relay + bootstrap | Materialized views | Incremental snapshot | Primary agent pattern |
## Schema Evolution
### The Schema Challenge
CDC events must carry schema information. When source schema changes, downstream consumers must handle the evolution.
**Problem scenarios:**
1. **Column added**: New events have field; old events don't
2. **Column removed**: Old events have field; new events don't
3. **Column renamed**: Appears as remove + add
4. **Type changed**: `INT` → `BIGINT`, `VARCHAR(50)` → `VARCHAR(100)`
### Schema Registry Integration

**How it works:**
1. CDC connector serializes event with schema
2. Schema registered in Schema Registry (if new)
3. Event includes schema ID reference (not full schema)
4. Consumer fetches schema by ID, caches locally
5. Consumer deserializes using fetched schema
**Compatibility modes:**
| Mode | Allows | Use Case |
| -------- | ---------------------------- | ---------------------------------- |
| BACKWARD | New schema can read old data | Consumers updated before producers |
| FORWARD | Old schema can read new data | Producers updated before consumers |
| FULL | Both directions | Most restrictive; safest |
| NONE | Any change | Development only |
**Recommended approach**: BACKWARD_TRANSITIVE (all previous versions readable by latest)
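The compatibility modes reduce to one question: can a reader schema decode data written with another schema? The sketch below models a schema as a dict of field specs and ignores type promotion (such as `int` to `long`), so it is stricter than a real Avro check. All names are illustrative assumptions.

```python
def can_read(reader, writer):
    """True if data written with `writer` can be decoded using `reader`."""
    for name, spec in reader.items():
        if name in writer:
            if writer[name]["type"] != spec["type"]:
                return False  # simplified: no type promotion
        elif "default" not in spec:
            return False      # reader needs a field the writer never wrote
    return True

def is_backward_compatible(new, old):
    return can_read(new, old)  # new schema reads old data

def is_forward_compatible(new, old):
    return can_read(old, new)  # old schema reads new data

v1 = {"id": {"type": "int"}, "name": {"type": "string"}}
v2_optional = {**v1, "phone": {"type": "string", "default": None}}
v2_required = {**v1, "phone": {"type": "string"}}
v2_dropped = {"id": {"type": "int"}}

results = {
    "add optional field": (is_backward_compatible(v2_optional, v1), is_forward_compatible(v2_optional, v1)),
    "add required field": (is_backward_compatible(v2_required, v1), is_forward_compatible(v2_required, v1)),
    "drop field": (is_backward_compatible(v2_dropped, v1), is_forward_compatible(v2_dropped, v1)),
}
```

Only the optional-field addition passes in both directions, which is why it is the default safe change under FULL compatibility.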
### Handling DDL Changes
**Safe operations (backward compatible):**
- Add nullable column
- Add column with default value
- Increase column size (`VARCHAR(50)` → `VARCHAR(100)`)
**Breaking operations (require coordination):**
- Remove column
- Rename column
- Change column type
- Add NOT NULL column without default
**Migration pattern for breaking changes:**
```
1. Add new column (nullable)
2. Deploy producers writing to new column
3. Backfill new column from old column
4. Deploy consumers reading new column
5. Stop writing old column
6. Remove old column
```
### Debezium Schema Handling
Debezium can be configured to persist schema history and emit schema change events:
```json
{
"schema.history.internal.kafka.topic": "schema-changes.myapp",
"schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
"include.schema.changes": "true"
}
```
**Schema change events:**
```json
{
"source": { "table": "users", "db": "myapp" },
"ddl": "ALTER TABLE users ADD COLUMN phone VARCHAR(20)",
"databaseName": "myapp",
"tableChanges": [{
"type": "ALTER",
"id": "myapp.users",
"table": {
"columns": [...]
}
}]
}
```
## Exactly-Once Semantics
### The Delivery Challenge
CDC involves multiple hops where failures can occur:
```
Database → CDC Connector → Kafka → Consumer → Target System
```
Each transition can fail after partial completion.
### Kafka Exactly-Once (Since 0.11.0)
**Idempotent producer:**
```properties
enable.idempotence=true
```
The producer assigns a sequence number to each message. The broker deduplicates by (producer_id, partition, sequence), so a retried send is not appended twice.
**Transactional writes:**
```java
producer.initTransactions();
producer.beginTransaction();
producer.send(record1);
producer.send(record2);
producer.commitTransaction(); // Atomic: all or nothing
```
**Consumer isolation:**
```properties
isolation.level=read_committed
```
Consumer only sees committed transactional messages.
### Debezium EOS (Kafka 3.3.0+)
Debezium source connectors support exactly-once when:
1. Kafka Connect configured for exactly-once source
2. Kafka Connect 3.3.0+ running in distributed mode (KIP-618)
3. Connector offset storage in Kafka
```properties
# Kafka Connect worker config
exactly.once.source.support=enabled
```
**How it works:**
1. Connector reads changes and offset atomically
2. Writes records + offset update in single Kafka transaction
3. On restart, resumes from last committed offset
4. No duplicate events published to Kafka
**Important caveat**: This is exactly-once from database to Kafka. Consumer-to-target still needs idempotent handling.
### End-to-End Exactly-Once
For true end-to-end exactly-once, the consumer must apply changes idempotently, for example by versioning each write with the source position:
```typescript collapse={1-5}
async function processChange(change: ChangeEvent) {
const key = `${change.source.table}:${change.key}`
const version = change.source.lsn
// Idempotent upsert using source version
await target.upsert(
{
id: key,
data: change.after,
_version: version,
},
{
where: { _version: { lt: version } }, // Only apply if newer
},
)
}
```
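The same version guard can be pushed into the target database so that redelivered events become no-ops. This sketch uses SQLite's upsert syntax; the `target` table, its column names, and the use of an integer log position as the version are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id TEXT PRIMARY KEY, data TEXT, version INTEGER)")

def apply_change(conn, key, data, version):
    """Insert the row, or update it only when the incoming version is newer."""
    conn.execute(
        "INSERT INTO target (id, data, version) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET data = excluded.data, version = excluded.version "
        "WHERE excluded.version > target.version",
        (key, data, version),
    )
    conn.commit()

apply_change(conn, "users:1", "v1", 100)
apply_change(conn, "users:1", "v2", 150)
apply_change(conn, "users:1", "v1", 100)  # redelivered after a consumer restart

final_row = conn.execute(
    "SELECT data, version FROM target WHERE id = 'users:1'"
).fetchone()
```

Because the guard lives in a single statement, it holds even when two consumer instances race on the same key.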
## CDC Consumer Patterns
### Transactional Outbox Integration
The **transactional outbox pattern** ensures reliable event publishing by writing events to a database table (outbox) within the same transaction as business data.

**CDC as outbox relay:**
```sql
-- Outbox table
CREATE TABLE outbox (
id UUID PRIMARY KEY,
aggregate_type VARCHAR(255),
aggregate_id VARCHAR(255),
type VARCHAR(255),
payload JSONB,
created_at TIMESTAMP DEFAULT NOW()
);
-- Application writes to outbox in same transaction
BEGIN;
UPDATE users SET email = 'new@example.com' WHERE id = 123;
INSERT INTO outbox (id, aggregate_type, aggregate_id, type, payload)
VALUES (gen_random_uuid(), 'User', '123', 'EmailChanged', '{"email": "new@example.com"}');
COMMIT;
```
**Debezium outbox transform:**
```json
{
"transforms": "outbox",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"transforms.outbox.table.field.event.key": "aggregate_id",
"transforms.outbox.table.field.event.payload": "payload",
"transforms.outbox.route.by.field": "aggregate_type",
"transforms.outbox.route.topic.replacement": "events.${routedByValue}"
}
```
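The property that makes the outbox reliable is that the business write and the event row commit or roll back together. The sketch below demonstrates this with SQLite; the table layout mirrors the outbox table above, while `change_email` and the simulated crash are assumptions for illustration.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, aggregate_type TEXT, "
    "aggregate_id TEXT, type TEXT, payload TEXT)"
)
conn.execute("INSERT INTO users VALUES (123, 'old@example.com')")
conn.commit()

def change_email(conn, user_id, email, fail_before_commit=False):
    """Update the user and record the event in one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE users SET email = ? WHERE id = ?", (email, user_id))
            conn.execute(
                "INSERT INTO outbox VALUES (?, 'User', ?, 'EmailChanged', ?)",
                (str(uuid.uuid4()), str(user_id), json.dumps({"email": email})),
            )
            if fail_before_commit:
                raise RuntimeError("simulated crash before commit")
    except RuntimeError:
        pass  # both writes were rolled back together

def snapshot(conn):
    email = conn.execute("SELECT email FROM users WHERE id = 123").fetchone()[0]
    events = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
    return email, events

change_email(conn, 123, "new@example.com", fail_before_commit=True)
after_failure = snapshot(conn)
change_email(conn, 123, "new@example.com")
after_success = snapshot(conn)
```

With CDC tailing the outbox table, an event reaches Kafka only if the business change committed, which removes the dual-write failure mode.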
### Cache Invalidation
CDC enables event-driven cache invalidation without TTL guessing:

**Implementation:**
```typescript collapse={1-8}
interface ChangeEvent {
op: "c" | "u" | "d" // create, update, delete
before: Record<string, unknown> | null
after: Record<string, unknown> | null
source: { table: string }
}
async function handleChange(change: ChangeEvent) {
const table = change.source.table
const key = change.after?.id ?? change.before?.id
// Invalidate cache entry
await redis.del(`${table}:${key}`)
// Optional: warm cache with new value
if (change.op !== "d" && change.after) {
await redis.setex(`${table}:${key}`, 3600, JSON.stringify(change.after))
}
}
```
**Benefits over TTL:**
- Immediate invalidation (sub-second vs. minutes/hours)
- No stale reads from long TTLs
- No thundering herd from short TTLs
### Search Index Synchronization
CDC keeps search indices in sync with source of truth:

**Kafka Connect Elasticsearch sink:**
```json
{
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"topics": "myapp.public.products",
"connection.url": "http://elasticsearch:9200",
"type.name": "_doc",
"key.ignore": "false",
"schema.ignore": "true",
"behavior.on.null.values": "delete"
}
```
**Handling deletions:**
- Debezium emits a delete event followed by a tombstone (null value) for each deleted row
- Sink connector translates tombstone to Elasticsearch delete
- Index stays synchronized including deletions
### Analytics Pipeline Feeding
CDC enables real-time analytics without batch ETL:

**Lambda architecture simplification:**
| Traditional | CDC-Based |
| -------------------------- | ---------------------------------- |
| Batch ETL (daily) + Stream | Single CDC stream |
| Batch for completeness | Snapshot + stream for completeness |
| Hours-old data | Seconds-old data |
| Multiple pipelines | Single pipeline |
## Common Pitfalls
### 1. Replication Slot Disk Bloat (PostgreSQL)
**The mistake**: Not monitoring replication slot lag.
**What happens**: CDC connector goes down or can't keep up. PostgreSQL retains all WAL since last consumed position. Disk fills. Database crashes.
**Example**: Connector had network issue for 2 hours. 50GB of WAL accumulated. Recovery required manual slot deletion and re-snapshot.
**Solutions:**
```sql
-- Monitor slot lag
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag,
active
FROM pg_replication_slots;
-- Set maximum retained WAL (PostgreSQL 13+), then reload the configuration
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();
-- Alert on inactive slots
SELECT slot_name FROM pg_replication_slots WHERE NOT active;
```
### 2. Tables Without Primary Keys
**The mistake**: Creating tables without primary keys, then adding them to CDC.
**What happens**: AWS DMS skips these tables entirely during CDC. Debezium can capture but updates/deletes can't be keyed properly.
**Example**: Legacy table `audit_log` had no PK. Added to CDC scope. All changes captured as creates; updates appeared as new rows.
**Solutions:**
- Add primary keys to all tables before enabling CDC
- Use composite key if no natural key exists
- For truly keyless tables, add surrogate key column
### 3. Large Transaction Handling
**The mistake**: Running batch updates (millions of rows) during CDC operation.
**What happens**: Debezium buffers changes until transaction commits. Memory pressure. Downstream lag. Potential OOM.
**Example**: Nightly job updating 5M rows in single transaction. CDC connector memory spiked to 8GB, causing restart. Other tables' CDC delayed by 30 minutes.
**Solutions:**
- Break large updates into batches with commits
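- Bound connector memory with `max.batch.size` and `max.queue.size` (or `max.queue.size.in.bytes`)
- Schedule large batch jobs during low-traffic windows
- Use Debezium incremental snapshots (triggered through a signal) to re-capture existing rows in chunks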
### 4. Snapshot + Streaming Race Conditions
**The mistake**: Not understanding snapshot isolation during initial load.
**What happens**: Snapshot reads table at point-in-time. Streaming starts from "after snapshot." Changes during snapshot can be missed or duplicated.
**Example**:
1. Snapshot starts at LSN 100
2. Row inserted at LSN 150
3. Snapshot reads row (sees insertion)
4. Streaming starts at LSN 100
5. Streaming also captures insertion at LSN 150
6. Duplicate row in target
**Solutions:**
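Debezium handles the hand-off itself by recording the log position before the snapshot and streaming from there (`snapshot.locking.mode` below is a MySQL connector option):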
```json
{
"snapshot.mode": "initial",
"snapshot.locking.mode": "minimal"
}
```
Consumer must be idempotent to handle potential duplicates during snapshot-to-streaming transition.
### 5. Schema Change During CDC
**The mistake**: Assuming DDL changes propagate seamlessly.
**What happens**:
- Column added: Old consumers fail parsing
- Column removed: Data loss if not handled
- Type changed: Deserialization errors
**Example**: Added `phone` column to `users` table. CDC captured the DDL. Downstream consumer's Avro schema didn't have `phone`. Consumer crashed with schema mismatch error.
**Solutions:**
- Use Schema Registry with BACKWARD compatibility
- Test schema changes in staging with CDC running
- Coordinate consumer deployments with schema changes
- Monitor for schema change events before production DDL
## Implementation Guide
### Starting Point Decision

### Checklist for Production CDC
**Database preparation:**
- [ ] Enable logical replication/binary logging
- [ ] Create dedicated CDC user with minimal permissions
- [ ] Configure log retention appropriately
- [ ] Add primary keys to all tables in scope
- [ ] Test DDL change impact
**Infrastructure:**
- [ ] Kafka cluster sized for CDC throughput
- [ ] Schema Registry deployed and accessible
- [ ] Monitoring dashboards for connector lag
- [ ] Alerting on replication slot lag (PostgreSQL)
- [ ] Alerting on connector failures
**Operational:**
- [ ] Runbook for connector restart
- [ ] Runbook for re-snapshot after extended downtime
- [ ] Backup strategy for connector offsets
- [ ] Schema change coordination process
- [ ] Large transaction handling policy
### Capacity Planning
**Throughput estimation:**
```
CDC messages/sec ≈ row changes/sec on captured tables (one event per changed row, plus a tombstone per delete)
```
Message size depends on row width and change type. Updates carry both before and after images, so they are roughly twice the size of an insert for the same row.
**Kafka sizing:**
| Metric | Recommendation |
| -------------------- | ----------------------------------------- |
| Partitions per topic | 2-3 × expected consumer parallelism |
| Replication factor | 3 (standard Kafka recommendation) |
| Retention | 7 days minimum (allows consumer recovery) |
| Broker disk | 3 × (daily CDC volume) × retention days |
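The sizing rules in the table are simple enough to turn into a back-of-the-envelope calculator. The function names are hypothetical, and the 100 GB/day figure is an illustrative input, not a measurement.

```python
def broker_disk_gb(daily_volume_gb, retention_days, replication_factor=3):
    """Total cluster disk: every retained byte is stored once per replica."""
    return replication_factor * daily_volume_gb * retention_days

def partitions_per_topic(consumer_parallelism, headroom=2):
    """Leave room to scale consumers out without repartitioning."""
    return headroom * consumer_parallelism

disk = broker_disk_gb(daily_volume_gb=100, retention_days=7)
partitions = partitions_per_topic(consumer_parallelism=4)
```

This ignores compression and index overhead, so treat the result as an upper-bound starting point and adjust after measuring real topic sizes.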
## Conclusion
CDC transforms database changes into reliable event streams, enabling real-time data propagation without application-level dual writes. Log-based CDC—reading from WAL, binlog, or oplog—is the production standard, capturing all changes with minimal database impact.
**Key decisions:**
1. **Log-based vs. polling**: Log-based captures everything including deletes; polling is simpler but misses hard deletes and adds latency
2. **Debezium vs. managed**: Debezium offers sub-second latency and full control; managed services (DMS, Fivetran) reduce operational burden
3. **Schema evolution strategy**: Schema Registry with BACKWARD compatibility prevents consumer breakage
**Critical operational concerns:**
- PostgreSQL replication slot bloat is the most common production incident
- Large transactions can cause memory pressure and downstream lag
- Tables without primary keys create CDC gaps
**Start simple**: Single database → Debezium → Kafka → single consumer. Add complexity (schema registry, multiple sources, complex routing) as requirements demand.
## Appendix
### Prerequisites
- Database administration fundamentals (replication, transaction logs)
- Message broker concepts (Kafka topics, partitions, consumer groups)
- Distributed systems basics (eventual consistency, exactly-once semantics)
### Terminology
| Term | Definition |
| -------------------- | ------------------------------------------------------------------- |
| **WAL** | Write-Ahead Log—PostgreSQL's transaction log for durability |
| **Binlog** | Binary Log—MySQL's log of all data modifications |
| **Oplog** | Operations Log—MongoDB's capped collection recording writes |
| **LSN** | Log Sequence Number—position in PostgreSQL WAL |
| **GTID** | Global Transaction ID—MySQL's cross-topology transaction identifier |
| **Replication slot** | PostgreSQL mechanism to track consumer position and retain WAL |
| **Tombstone** | Kafka message with null value indicating deletion |
| **Schema Registry** | Service storing and versioning message schemas |
| **Snapshot** | Initial full data load before streaming changes |
### Summary
- CDC extracts database changes from transaction logs with minimal impact on write performance
- **Log-based CDC** (Debezium, DMS) is the production standard—captures all operations including deletes and DDL
- **PostgreSQL** uses logical replication slots; monitor slot lag and set `max_slot_wal_keep_size` to prevent disk bloat
- **MySQL** requires `binlog_format=ROW` and benefits from GTID for resumability across failover
- **Exactly-once semantics** require Kafka Connect 3.3.0+ with exactly-once source support enabled, plus consumer-side idempotency for end-to-end guarantees
- **Schema evolution** needs Schema Registry with BACKWARD compatibility; coordinate schema changes with consumer deployments
- **Transactional outbox** pattern integrates naturally with CDC for reliable event publishing
### References
**Official Documentation:**
- [Debezium Documentation](https://debezium.io/documentation/reference/stable/architecture.html) - Architecture, connectors, and configuration
- [Debezium Exactly-Once Delivery](https://debezium.io/blog/2023/06/22/towards-exactly-once-delivery/) - EOS implementation details
- [PostgreSQL Logical Replication](https://www.postgresql.org/docs/current/logical-replication.html) - Native PostgreSQL replication
- [PostgreSQL Logical Decoding](https://www.postgresql.org/docs/current/logicaldecoding.html) - WAL decoding internals
- [MySQL Binary Log](https://dev.mysql.com/doc/refman/8.0/en/binary-log.html) - Binlog configuration and format
- [MySQL GTID](https://dev.mysql.com/doc/refman/8.4/en/replication-gtids-concepts.html) - Global Transaction ID concepts
- [MongoDB Change Streams](https://www.mongodb.com/docs/manual/changeStreams/) - Change Stream API reference
- [AWS DMS CDC](https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Task.CDC.html) - DMS ongoing replication
**Engineering Blogs:**
- [LinkedIn: Open Sourcing Databus](https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system) - Original Databus architecture
- [Shopify: Capturing Every Change](https://shopify.engineering/capturing-every-change-shopify-sharded-monolith) - CDC at Shopify scale
- [Netflix: DBLog](https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b) - Incremental snapshot approach
- [Airbnb: SpinalTap](https://medium.com/airbnb-engineering/capturing-data-evolution-in-a-service-oriented-architecture-72f7c643ee6f) - CDC for materialized views
**Patterns and Best Practices:**
- [Transactional Outbox Pattern](https://microservices.io/patterns/data/transactional-outbox.html) - Reliable event publishing pattern
- [AWS Transactional Outbox](https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/transactional-outbox.html) - AWS implementation guide
- [PostgreSQL Replication Slots Deep Dive](https://www.morling.dev/blog/mastering-postgres-replication-slots/) - Operational guidance
- [Advantages of Log-Based CDC](https://debezium.io/blog/2018/07/19/advantages-of-log-based-change-data-capture/) - Comparison with other approaches
**Kafka Exactly-Once:**
- [Kafka Exactly-Once Semantics](https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/) - Confluent explanation
- [KIP-98: Exactly Once Delivery](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging) - Original Kafka proposal
---
## Database Migrations at Scale
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/core-distributed-patterns/database-migrations-at-scale
**Category:** System Design / Core Distributed Patterns
**Description:** Changing database schemas in production systems without downtime requires coordinating schema changes, data transformations, and application code across distributed systems. The core challenge: the schema change itself takes milliseconds, but MySQL’s ALTER TABLE on a 500GB table with row locking would take days and block all writes. This article covers the design paths, tool mechanisms, and production patterns that enable zero-downtime migrations.
# Database Migrations at Scale
Changing database schemas in production systems without downtime requires coordinating schema changes, data transformations, and application code across distributed systems. The core challenge: the schema change itself takes milliseconds, but MySQL's `ALTER TABLE` on a 500GB table with row locking would take days and block all writes. This article covers the design paths, tool mechanisms, and production patterns that enable zero-downtime migrations.

Decision tree for selecting migration approach based on table characteristics and change type.
## Abstract
Database migrations at scale are fundamentally about **change isolation**—ensuring schema modifications don't block production traffic. Three primary mechanisms achieve this:
1. **Shadow table approach**: Create a copy of the table with the new schema, copy data in batches, capture ongoing changes, then atomically swap. The critical design choice is _how_ to capture changes—triggers (synchronous, blocks queries) vs. binlog consumption (asynchronous, adds latency).
2. **Expand-contract pattern**: Add new columns/tables alongside old ones, run dual-write/dual-read during transition, then remove old structures. This trades migration complexity for deployment flexibility—any stage can be rolled back independently.
3. **Native instant DDL**: Metadata-only changes that avoid data copying entirely. Limited to specific operations (MySQL 8.0.29+ supports adding/dropping columns instantly), but transforms hours-long migrations into milliseconds.
The production reality: most migrations combine these approaches. A column type change might use gh-ost for the schema change, expand-contract for application rollout, and backfill jobs for data transformation—each stage with its own rollback plan.
## The Problem
### Why Naive Schema Changes Fail
**Approach 1: Direct `ALTER TABLE`**
MySQL's `ALTER TABLE` (before online DDL arrived in 5.6, or for operations that still fall back to `ALGORITHM=COPY`) blocks all writes for the duration of the table copy and takes an exclusive metadata lock for the final swap:
```sql
-- On a 200GB table with 500M rows, this takes 4-8 hours
-- Writes are blocked for the entire duration; reads stall during the final swap
ALTER TABLE orders ADD COLUMN shipping_estimate DATETIME;
```
The table must be copied row-by-row to a new structure. For a 200GB table at 100MB/s internal throughput, that's ~33 minutes just for copying—but with checkpointing, validation, and lock contention, real-world times are 4-8 hours. During this time, the application is effectively down.
**Approach 2: Blue-Green Table Swap**
Create a new table, copy data, swap:
```sql
CREATE TABLE orders_new LIKE orders;
ALTER TABLE orders_new ADD COLUMN shipping_estimate DATETIME;
INSERT INTO orders_new SELECT *, NULL FROM orders;
RENAME TABLE orders TO orders_old, orders_new TO orders;
```
The `RENAME` is atomic, but the gap between finishing the `INSERT` and executing `RENAME` means writes to `orders` are lost. On a table with 1,000 writes/second, even a 100ms gap loses 100 records.
**Approach 3: Application-Level Dual Write**
Write to both old and new tables from application code:
```python
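def create_order(order_data):
    db.execute("INSERT INTO orders ...", order_data)
    # Second write targets the table with the new column; a crash between
    # the two statements leaves the tables out of sync
    db.execute("INSERT INTO orders_new ...", order_data)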
```
Without distributed transactions, a crash between the two writes leaves the tables inconsistent. More subtly, the two writes aren't atomic—concurrent transactions might see different states. Stripe's engineering blog notes that dual-write "works only if you can tolerate some data inconsistency during the migration window."
### The Core Challenge
The fundamental tension: **schema changes require exclusive access, but production systems require continuous availability**. Online schema change (OSC) tools resolve this by creating the illusion of atomicity through careful orchestration—maintaining a shadow copy that stays synchronized until a brief, sub-second cutover.
## Design Paths
### Path 1: Trigger-Based (pt-online-schema-change)
Percona's pt-online-schema-change (pt-osc) uses MySQL triggers to synchronously capture changes.
**Mechanism:**
1. Create shadow table `_orders_new` with new schema
2. Create `AFTER INSERT`, `AFTER UPDATE`, `AFTER DELETE` triggers on original table
3. Copy rows in batches (default: 1,000 rows per chunk)
4. Triggers mirror every DML to the shadow table in real-time
5. Atomic swap: `RENAME TABLE orders TO _orders_old, _orders_new TO orders`
**Why triggers?** The synchronous nature means zero replication lag—every change to the original table is immediately applied to the shadow table within the same transaction. This simplifies the cutover: when the batch copy finishes, the shadow table is already current.
**Code flow:**
```sql
-- Trigger created by pt-osc (simplified)
CREATE TRIGGER pt_osc_ins_orders AFTER INSERT ON orders
FOR EACH ROW
REPLACE INTO _orders_new (id, ..., shipping_estimate)
VALUES (NEW.id, ..., NULL);
```
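The whole trigger-based flow (shadow table, mirroring triggers, chunked copy, swap) can be exercised end to end on a toy table. This sketch uses SQLite as a stand-in for MySQL and is not pt-osc's implementation; the trigger names and the chunk size are made up, and SQLite's two `ALTER TABLE ... RENAME` statements are not the single atomic `RENAME TABLE` that MySQL provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
INSERT INTO orders (id, total) VALUES (1, 10.0), (2, 20.0), (3, 30.0);

-- Step 1: shadow table with the new schema
CREATE TABLE _orders_new (id INTEGER PRIMARY KEY, total REAL, shipping_estimate TEXT);

-- Step 2: triggers mirror every write on the original into the shadow table
CREATE TRIGGER osc_ins AFTER INSERT ON orders BEGIN
  REPLACE INTO _orders_new (id, total) VALUES (NEW.id, NEW.total);
END;
CREATE TRIGGER osc_upd AFTER UPDATE ON orders BEGIN
  REPLACE INTO _orders_new (id, total) VALUES (NEW.id, NEW.total);
END;
CREATE TRIGGER osc_del AFTER DELETE ON orders BEGIN
  DELETE FROM _orders_new WHERE id = OLD.id;
END;
""")

# Application traffic that arrives while the migration is running
conn.execute("UPDATE orders SET total = 25.0 WHERE id = 2")
conn.execute("INSERT INTO orders (id, total) VALUES (4, 40.0)")
conn.execute("DELETE FROM orders WHERE id = 3")


def copy_in_chunks(conn, chunk_size=2):
    """Step 3: copy by primary-key range; OR IGNORE keeps newer trigger-written rows."""
    last_id = 0
    while True:
        ids = conn.execute(
            "SELECT id FROM orders WHERE id > ? ORDER BY id LIMIT ?", (last_id, chunk_size)
        ).fetchall()
        if not ids:
            break
        upper = ids[-1][0]
        conn.execute(
            "INSERT OR IGNORE INTO _orders_new (id, total) "
            "SELECT id, total FROM orders WHERE id > ? AND id <= ?",
            (last_id, upper),
        )
        last_id = upper


copy_in_chunks(conn)

# Step 5: drop the triggers and swap the tables
conn.executescript("""
DROP TRIGGER osc_ins;
DROP TRIGGER osc_upd;
DROP TRIGGER osc_del;
ALTER TABLE orders RENAME TO _orders_old;
ALTER TABLE _orders_new RENAME TO orders;
""")

migrated = conn.execute(
    "SELECT id, total, shipping_estimate FROM orders ORDER BY id"
).fetchall()
print(migrated)
```

The copy uses insert-if-absent while the triggers use replace, so a row that was updated mid-migration keeps the trigger's newer value regardless of which side reaches it first.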
**Trade-offs:**
| Aspect | Characteristic |
| -------------------- | ----------------------------------------------- |
| Consistency | Strong—changes are synchronous |
| Performance overhead | 10-15% on DML operations (trigger execution) |
| Replication format | Works with statement-based, row-based, or mixed |
| Foreign keys | Better support than gh-ost |
| Cut-over | Atomic RENAME, sub-second |
**When to choose:**
- Environments with statement-based replication (SBR)
- Tables with foreign key constraints
- Lower write throughput (<5,000 writes/sec on typical hardware)
- Simpler operational requirements
**Real-world:** Percona benchmarks show pt-osc completing migrations ~2x faster than gh-ost under light load, because triggers execute in parallel with the application while gh-ost's single-threaded binlog processing serializes changes.
### Path 2: Binlog-Based (gh-ost)
GitHub's gh-ost avoids triggers entirely by consuming the MySQL binary log.
**Mechanism:**
1. Create shadow table `_orders_gho` with new schema
2. Connect to a replica and simulate being a replica itself
3. Copy rows in batches from the original table
4. Simultaneously tail the binlog for changes to the original table
5. Apply binlog events to the shadow table asynchronously
6. Cut-over using a lock-and-rename coordination protocol
**Why binlog?** Triggers compete with application queries for row locks and add parsing overhead to every DML. On high-throughput tables (10,000+ writes/sec), trigger contention causes deadlocks and latency spikes. Binlog consumption is asynchronous—it doesn't block application queries.
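The interplay between row copy and asynchronous event replay can be modeled without a database. The sketch below is a toy model, not gh-ost code: `Source`, `copy_chunk`, and `apply_binlog` are hypothetical names, and a deque stands in for the binary log.

```python
from collections import deque


class Source:
    """Toy primary: a table plus an append-only change log standing in for the binlog."""

    def __init__(self, rows):
        self.table = dict(rows)
        self.binlog = deque()

    def write(self, key, value):
        self.table[key] = value
        self.binlog.append(("upsert", key, value))

    def delete(self, key):
        self.table.pop(key, None)
        self.binlog.append(("delete", key, None))


def copy_chunk(source, shadow, keys):
    """Row copy: insert-if-absent, so it never overwrites a value the applier already wrote."""
    for key in keys:
        if key in source.table:
            shadow.setdefault(key, source.table[key])


def apply_binlog(source, shadow):
    """Applier: a single consumer replaying events in commit order."""
    applied = 0
    while source.binlog:
        op, key, value = source.binlog.popleft()
        if op == "upsert":
            shadow[key] = value
        else:
            shadow.pop(key, None)
        applied += 1
    return applied


src = Source({1: "a", 2: "b", 3: "c"})
shadow = {}

copy_chunk(src, shadow, [1, 2])   # first chunk copied
src.write(2, "b2")                # application keeps writing; nothing blocks it
src.delete(3)
src.write(4, "d")
backlog = len(src.binlog)         # shadow is behind until the applier catches up
copy_chunk(src, shadow, [3, 4])   # remaining chunk (row 3 is already gone)
applied = apply_binlog(src, shadow)

print(backlog, applied, shadow == src.table)
```

The application's writes never wait on the shadow table; the cost is the backlog, which must be drained before cut-over.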
**The cut-over challenge:**
MySQL doesn't support atomic table swap within a single connection that holds a lock. gh-ost's solution uses two connections:
```
Connection 1: LOCK TABLES orders WRITE, _orders_gho WRITE
-- Creates sentry table to block premature RENAME
Connection 2: RENAME TABLE orders TO _orders_del, _orders_gho TO orders
-- Blocked until sentry is dropped
Connection 1: DROP TABLE sentry; UNLOCK TABLES
-- RENAME executes, appears atomic to replicas
```
The sentry table mechanism ensures the RENAME only executes after all binlog events are applied. From the replica's perspective (which only sees the binlog), the RENAME appears instantaneous.
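Why the sentry matters can be shown with a toy catalog. This is a simplified model under stated assumptions, not gh-ost's code: `Catalog` and `cut_over` are hypothetical, and the only MySQL behavior relied on is that a multi-table `RENAME TABLE` is all-or-nothing and fails if a target name already exists.

```python
class Catalog:
    """Toy table catalog; rename() is all-or-nothing like a multi-table RENAME TABLE."""

    def __init__(self, tables):
        self.tables = dict(tables)

    def rename(self, pairs):
        staged = dict(self.tables)
        for src, dst in pairs:
            if dst in staged:
                raise RuntimeError(f"table {dst!r} already exists")
            staged[dst] = staged.pop(src)
        self.tables = staged  # publish every rename at once

    def drop(self, name):
        del self.tables[name]


def cut_over(catalog, lock_holder_survives):
    swap = [("orders", "_orders_del"), ("_orders_gho", "orders")]
    catalog.tables["_orders_del"] = "sentry"  # sentry occupies the RENAME target name
    if not lock_holder_survives:
        # The lock vanished before the backlog was applied: the queued RENAME
        # runs early, collides with the sentry, and changes nothing.
        try:
            catalog.rename(swap)
        except RuntimeError:
            return False
    catalog.drop("_orders_del")  # backlog applied: remove the sentry
    catalog.rename(swap)         # the queued RENAME now succeeds
    return True


ok = Catalog({"orders": "old schema", "_orders_gho": "new schema"})
failed = Catalog({"orders": "old schema", "_orders_gho": "new schema"})
ok_result = cut_over(ok, lock_holder_survives=True)
failed_result = cut_over(failed, lock_holder_survives=False)
print(ok_result, ok.tables["orders"], failed_result, failed.tables["orders"])
```

In the failure case the original table is untouched, so the cut-over can simply be retried.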
**Trade-offs:**
| Aspect | Characteristic |
| -------------------- | --------------------------------------------------- |
| Consistency | Eventually consistent during migration (binlog lag) |
| Performance overhead | Network I/O for binlog, but no trigger overhead |
| Replication format | Requires row-based replication (RBR) |
| Foreign keys | No support (can't track FK updates via binlog) |
| Cut-over | Coordinated lock-and-rename, typically 1-3 seconds |
**When to choose:**
- High write throughput (>5,000 writes/sec)
- Long-running migrations where throttling is needed
- Need for true pause capability (pausing gh-ost stops all migration work)
- Modern MySQL 5.7+ / 8.0 environments with RBR
**Real-world:** GitHub migrates tables with millions of rows while serving production traffic. Their documentation notes that under sustained high load, gh-ost may lag behind indefinitely if writes exceed single-threaded binlog processing capacity.
### Path 3: Native Instant DDL (MySQL 8.0+)
MySQL 8.0.12 introduced `ALGORITHM=INSTANT` for metadata-only changes.
**Supported instant operations (8.0.29+):**
- `ADD COLUMN` at any position
- `DROP COLUMN`
- `RENAME COLUMN`
- `SET DEFAULT` / `DROP DEFAULT`
- Modify `ENUM`/`SET` column definition
**Mechanism:**
Instead of copying the table, MySQL stores the new column definition in metadata. Existing rows return `NULL` (or the default value) for the new column until explicitly updated. New rows include the column in their storage format.
```sql
-- Completes in milliseconds regardless of table size
ALTER TABLE orders ADD COLUMN shipping_estimate DATETIME, ALGORITHM=INSTANT;
```
**Why the 64-version limit?** Each instant `ADD COLUMN` or `DROP COLUMN` increments a row version counter. MySQL caps this at 64 to bound the complexity of interpreting row formats during reads. After 64 such operations, the table must be rebuilt before further instant changes are possible: an `ALTER` that requests `ALGORITHM=INSTANT` is rejected with an error, while one that omits the algorithm clause falls back to a slower rebuild without warning.
**Trade-offs:**
| Aspect | Characteristic |
| ------------------- | -------------------------------------------- |
| Speed | Milliseconds for any table size |
| Limitations | 64 instant operations per table rebuild |
| Row format | Compressed tables not supported |
| Combined operations | Can't mix instant + non-instant in one ALTER |
**When to choose:**
- Adding nullable columns or columns with defaults
- Dropping unused columns
- Any supported operation on tables too large for OSC
**Gotcha:** Even "online" ALTER TABLE on the primary causes replication lag on replicas. The replica must acquire the same locks and perform the same operation. For non-instant operations on large tables, replicas will lag for hours.
### Path 4: Expand-Contract (Application-Coordinated)
For changes that require data transformation (not just schema changes), the expand-contract pattern decouples the migration into independently deployable stages.
**Phases:**

**Stage 1: Expand (backward compatible)**
```sql
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
```
```python
# Application code v2: Write to both
def update_user_name(user_id, first, last):
    db.execute("""
        UPDATE users
        SET first_name = %s, last_name = %s, full_name = %s
        WHERE id = %s
    """, (first, last, f"{first} {last}", user_id))
```
**Stage 2: Migrate (transition)**
Run backfill job:
```sql
-- Process in batches of 10,000
UPDATE users
SET full_name = CONCAT(first_name, ' ', last_name)
WHERE full_name IS NULL
LIMIT 10000;
```
Deploy read from new column:
```python
def get_display_name(user_id):
    return db.query("SELECT full_name FROM users WHERE id = %s", user_id)
```
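If reads switch to `full_name` before the backfill has reached every row, unmigrated rows return `NULL`. One way to bridge that window (an assumption here, not something the article prescribes) is a read path that prefers the new column and falls back to the legacy pair. The helper below is a hypothetical sketch operating on an already-fetched row:

```python
from typing import Mapping, Optional


def display_name(row: Mapping[str, Optional[str]]) -> str:
    """Prefer the new column; fall back to the legacy pair until the backfill reaches the row."""
    full_name = row.get("full_name")
    if full_name is not None:
        return full_name
    first = row.get("first_name") or ""
    last = row.get("last_name") or ""
    return f"{first} {last}".strip()


unmigrated = {"first_name": "Ada", "last_name": "Lovelace", "full_name": None}
migrated_row = {"first_name": "Ada", "last_name": "Lovelace", "full_name": "Ada King Lovelace"}
print(display_name(unmigrated), "|", display_name(migrated_row))
```

Once the backfill is verified complete, the fallback branch becomes dead code and can be removed ahead of the contract stage.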
**Stage 3: Contract (remove legacy)**
```python
# Application code v4: Write only new column
def update_user_name(user_id, full_name):
    db.execute("UPDATE users SET full_name = %s WHERE id = %s", (full_name, user_id))
```
```sql
ALTER TABLE users DROP COLUMN first_name, DROP COLUMN last_name;
```
**Trade-offs:**
| Aspect | Characteristic |
| ---------------------- | ------------------------------------------------- |
| Rollback capability | Each stage independently reversible |
| Deployment flexibility | Different services can be at different stages |
| Complexity | Multiple deployments, temporary increased storage |
| Duration | Days to weeks for full migration |
**When to choose:**
- Data transformation required (not just schema change)
- Multiple services read/write the table
- Need fine-grained rollback at each stage
- Regulatory or audit requirements for gradual rollout
**Real-world:** Stripe uses expand-contract for all schema migrations involving data changes. Their engineering blog notes: "We never change more than a few hundred lines at a time—each stage is a separate pull request with its own review."
### Decision Framework

## Production Implementations
### GitHub: gh-ost at Scale
**Context:** GitHub serves millions of repositories with MySQL clusters handling tens of thousands of queries per second. Schema changes on tables like `commits` (billions of rows) were impossible with traditional OSC tools.
**Implementation choices:**
- Pattern variant: Binlog-based with replica consumption
- Key customization: Configurable throttling based on replica lag, custom cut-over timing
- Scale: Tables with hundreds of millions of rows, thousands of writes/sec
**Specific details:**
- Reads binlog from replica to minimize primary load
- Supports `--postpone-cut-over` to schedule final swap during low-traffic windows
- Dynamic configuration via Unix socket while migration runs
- Heartbeat injection for lag detection (configurable via `--heartbeat-interval-millis`, default 100ms)
**What worked:**
- True pause capability—unlike trigger-based tools, pausing gh-ost means zero additional load
- Testability—can run in "test-on-replica" mode that never touches production
**What was hard:**
- Single-threaded binlog processing can't keep up with very high write loads
- Initial versions had bugs in cut-over that caused brief outages
- Network bandwidth—must transfer both data copy and binlog stream
### Slack: Vitess Migration
**Context:** Slack's messaging infrastructure required scaling from a single MySQL shard to dozens while maintaining sub-100ms latency for message delivery.
**Implementation choices:**
- Pattern variant: VReplication-based (Vitess's internal replication)
- Key customization: Table comprising 20% of overall query load migrated incrementally
- Scale: 0 to 2.3 million QPS over 3 years
**Specific details:**
- Generic backfill system for cloning tables with double-writes from application
- Parallel double-read diffing to verify semantic correctness before cutover
- Data model based on collocating all data per Slack team on same shard
**What worked:**
- VReplication's built-in consistency guarantees reduced custom verification code
- Declarative schema migrations—define desired state, Vitess computes diff
**What was hard:**
- 3-year migration timeline required maintaining backward compatibility throughout
- Double-read verification caught subtle bugs that would have caused data loss
### Figma: PostgreSQL Horizontal Sharding
**Context:** Figma's single PostgreSQL database on AWS's largest RDS instance was approaching capacity limits with ~100x growth from 2020-2022.
**Implementation choices:**
- Pattern variant: Custom horizontal sharding with proxy layer
- Key customization: DBProxy service intercepts SQL and routes dynamically
- Scale: Largest RDS instance to horizontally sharded cluster
**Specific details:**
- DBProxy implements load-shedding and request hedging
- Shadow application readiness testing predicts live traffic behavior
- Maintained rollback capability throughout—no one-way migrations
**What worked:**
- ~30 seconds downtime during table moves (vs. hours with traditional approaches)
- Transparent scale-out—no future application changes required
**What was hard:**
- Evaluated CockroachDB, TiDB, Spanner—migration risk too high for timeline
- Skip expensive backfills by designing schema for eventual sharding from start
### Stripe: Scientist for Verification
**Context:** Stripe migrates millions of active payment objects while maintaining 99.999% availability and financial consistency.
**Implementation choices:**
- Pattern variant: Expand-contract with shadow verification
- Key customization: GitHub Scientist library for runtime comparison
- Scale: 5 million database queries per second
**Scientist verification pattern:**
```ruby
# Compare old and new code paths in production
experiment = Scientist::Experiment.new("order-migration")
experiment.use { legacy_order_lookup(id) } # Control: always returned
experiment.try { new_order_lookup(id) } # Candidate: compared
experiment.run
```
**Specific details:**
- Runs both code paths, compares results, publishes mismatches
- Never uses Scientist for writes (side effects would execute twice)
- Data reconciliation scripts run alongside experiments
- Hadoop MapReduce for offline data transformation (vs. expensive production queries)
**What worked:**
- Caught subtle bugs in new code paths before cutover
- Mismatches trigger alerts, allowing fixes before full rollout
**What was hard:**
- Scientist blocks run sequentially—data may change between executions
- Large-scale comparisons require careful sampling to avoid performance impact
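The control/candidate pattern is not tied to Ruby. The sketch below is a minimal Python rendering of the same idea, not the Scientist library's API: `run_experiment` and the `mismatches` list are hypothetical, and a real setup would publish mismatches to metrics rather than a list.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def run_experiment(
    name: str,
    control: Callable[[], T],
    candidate: Callable[[], T],
    mismatches: List[Tuple[str, object, object]],
) -> T:
    """Run both read paths, record mismatches, and always return the control result."""
    expected = control()  # control errors propagate exactly as before the experiment
    try:
        observed = candidate()
    except Exception as exc:  # candidate failures must never reach the caller
        mismatches.append((name, expected, repr(exc)))
        return expected
    if observed != expected:
        mismatches.append((name, expected, observed))
    return expected


def broken_lookup():
    raise KeyError("order 7 missing in new store")


mismatches: List[Tuple[str, object, object]] = []
same = run_experiment("order-lookup", lambda: {"id": 1, "total": 10}, lambda: {"id": 1, "total": 10}, mismatches)
differs = run_experiment("order-lookup", lambda: {"id": 2, "total": 10}, lambda: {"id": 2, "total": 11}, mismatches)
crashed = run_experiment("order-lookup", lambda: {"id": 7, "total": 5}, broken_lookup, mismatches)
print(len(mismatches), differs, crashed)
```

As the list above notes, this is only safe for reads: both callables execute, so any side effect would happen twice.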
### Implementation Comparison
| Aspect | GitHub (gh-ost) | Slack (Vitess) | Figma (Custom) | Stripe (Scientist) |
| ------------ | ---------------- | ---------------- | --------------------- | ------------------------------ |
| Approach | Binlog-based OSC | VReplication | Proxy + sharding | Expand-contract + verification |
| Scale | Billions of rows | 2.3M QPS | 100x growth | 5M queries/sec |
| Downtime | 1-3 sec cutover | Sub-second | ~30 sec moves | Zero (staged) |
| Verification | Test-on-replica | Double-read diff | Shadow traffic | Scientist comparison |
| Rollback | Re-run migration | VReplication | Maintained throughout | Per-stage |
## Implementation Guide
### Starting Point Decision

### Tool Selection Matrix
| Tool | Best For | Avoid When |
| --------------- | ------------------------------------- | -------------------------------------------- |
| **INSTANT DDL** | Add/drop columns, any table size | Unsupported operations, 64-version limit hit |
| **pt-osc** | Simpler setup, FK support, SBR | High write load (>5K/sec), trigger-sensitive |
| **gh-ost** | High throughput, pausable, testable | Foreign keys, SBR environments |
| **Spirit** | MySQL 8.0+, checkpoint/resume | Non-MySQL, need FK support |
| **Vitess** | Already on Vitess, need revertibility | Single database, simple schema |
| **pgroll** | PostgreSQL, reversible migrations | Not Postgres 14+, can't use views |
### Backfill Best Practices
**Batching and throttling:**
```python
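import time

def backfill_column(batch_size=10000, sleep_between=0.5):
    # `db` and `save_checkpoint` come from the surrounding application;
    # `db.execute` is assumed to return a DB-API-style cursor
    last_id = 0
    while True:
        # Find the upper primary-key bound of the next batch
        rows = db.execute("""
            SELECT id FROM users
            WHERE id > %s
            ORDER BY id LIMIT %s
        """, (last_id, batch_size)).fetchall()
        if not rows:
            break
        batch_max_id = rows[-1][0]
        # Update only that key range; re-running a range is safe
        db.execute("""
            UPDATE users
            SET full_name = CONCAT(first_name, ' ', last_name)
            WHERE id > %s AND id <= %s AND full_name IS NULL
        """, (last_id, batch_max_id))
        last_id = batch_max_id
        # Checkpoint for resume
        save_checkpoint(last_id)
        # Throttle to avoid replica lag
        time.sleep(sleep_between)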
```
**Key requirements:**
- **Idempotent**: Safe to re-run on failure
- **Resumable**: Checkpoint progress for restart
- **Throttled**: Monitor replica lag, pause if > threshold
- **Batched by primary key**: Avoid full table scans
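The "pause if lag exceeds threshold" requirement can be a small gate called between batches. A minimal sketch: `get_lag_seconds` is a hypothetical callable that the caller wires to whatever lag source is available (for example a heartbeat table), and the threshold and poll interval are arbitrary defaults.

```python
import time
from typing import Callable


def wait_for_replicas(
    get_lag_seconds: Callable[[], float],
    max_lag_seconds: float = 1.0,
    poll_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until replica lag drops to the threshold; return how many times it paused."""
    pauses = 0
    while get_lag_seconds() > max_lag_seconds:
        sleep(poll_interval)
        pauses += 1
    return pauses


# Simulated lag readings: two readings over the threshold, then recovery
readings = iter([5.0, 2.0, 0.4])
pauses = wait_for_replicas(lambda: next(readings), sleep=lambda _seconds: None)
print(f"paused {pauses} times before resuming the backfill")
```

Injecting `sleep` keeps the gate testable without real delays.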
### Verification Checklist
Before cutover:
- [ ] Shadow table row count matches original (within replication lag window)
- [ ] Checksum verification passes (Spirit, pt-osc support this)
- [ ] Replica lag is within acceptable bounds
- [ ] No blocking queries on the table
- [ ] Rollback procedure documented and tested
During cutover:
- [ ] Cut-over scheduled during low-traffic window
- [ ] Monitoring dashboards open
- [ ] On-call engineer available
- [ ] Rollback trigger defined (latency spike, error rate)
After cutover:
- [ ] Application metrics stable
- [ ] No increase in error rates
- [ ] Old table preserved for 24-48 hours (emergency rollback)
- [ ] Cleanup scheduled (drop old table after validation period)
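For the row-count and checksum items in the pre-cutover list, the core idea is to fingerprint the columns both tables share, in primary-key order. The sketch below uses SQLite and SHA-256 purely for illustration; it is not how Spirit or pt-osc compute checksums, and `table_fingerprint` is a hypothetical helper that must only be given trusted identifiers because it interpolates table and column names.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table, columns, key="id"):
    """Return (row_count, sha256 hex digest) over the named columns, in key order."""
    digest = hashlib.sha256()
    count = 0
    query = f"SELECT {', '.join(columns)} FROM {table} ORDER BY {key}"
    for row in conn.execute(query):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
CREATE TABLE _orders_new (id INTEGER PRIMARY KEY, total REAL, shipping_estimate TEXT);
INSERT INTO orders VALUES (1, 10.0), (2, 20.0);
INSERT INTO _orders_new (id, total) VALUES (1, 10.0), (2, 20.0);
""")

shared = ["id", "total"]  # compare only columns that exist in both schemas
in_sync = table_fingerprint(conn, "orders", shared) == table_fingerprint(conn, "_orders_new", shared)

conn.execute("UPDATE _orders_new SET total = 21.0 WHERE id = 2")  # simulate drift
drifted = table_fingerprint(conn, "orders", shared) != table_fingerprint(conn, "_orders_new", shared)
print(in_sync, drifted)
```

On a live table the two fingerprints must be taken at a consistent point, otherwise in-flight writes show up as false mismatches.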
## Common Pitfalls
### 1. Blocking Queries During Cut-Over
**The mistake:** Running long-running queries while attempting cut-over.
**What happens:** gh-ost and pt-osc need an exclusive metadata lock for the final RENAME. Long-running `SELECT` statements (reporting queries, analytics) hold shared metadata locks that block it. While the RENAME waits, new queries on the table queue behind its pending lock request, causing application timeouts.
**Solutions:**
- Kill long-running queries before cut-over window
- Use `--postpone-cut-over` to schedule during low-traffic windows
- Implement query timeout policies in the application
### 2. Foreign Key Constraints
**The mistake:** Using gh-ost on tables with foreign keys.
**What happens:** gh-ost creates a shadow table without FK relationships. Inserts to child tables referencing the original table fail because the FK points to the wrong table during migration.
**Solutions:**
- Use pt-osc for tables with foreign keys
- Remove FKs before migration, re-add after (requires application-level integrity)
- Migrate parent tables before child tables
### 3. Underestimating Backfill Duration
**The mistake:** Assuming the production backfill will take the staging run time multiplied by the table size ratio.
**What happens:** Production has more concurrent load, larger transactions, more replica lag. A backfill that took 2 hours in staging takes 2 days in production, blocking the contract phase.
**Solutions:**
- Run backfill during off-peak hours
- Use aggressive throttling (better slow than blocking production)
- Parallelize across partitions if table is partitioned
### 4. Missing Rollback Plan
**The mistake:** Planning only for success.
**What happens:** Migration completes, application deployed, then subtle data corruption discovered. No clear path to restore the old schema with new data intact.
**Solutions:**
- Keep old columns/tables for 24-48 hours after cutover
- Design for bidirectional compatibility during expand phase
- Test rollback procedure before starting migration
### 5. Instant DDL Version Limit
**The mistake:** Assuming INSTANT DDL always works.
**What happens:** After 64 instant operations on a table, MySQL forces a table rebuild. This can surprise teams mid-migration with a multi-hour ALTER.
**Solutions:**
- Track instant operation count per table
- Proactively rebuild tables approaching the limit during maintenance windows
- Monitor the counter with `SELECT NAME, TOTAL_ROW_VERSIONS FROM INFORMATION_SCHEMA.INNODB_TABLES` (available in MySQL 8.0.29+)
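Tracking the counter can be automated by filtering the rows that query returns. A small sketch: the function name and the headroom of 8 are arbitrary assumptions, and the input is whatever `(NAME, TOTAL_ROW_VERSIONS)` tuples your client library hands back.

```python
def tables_near_instant_limit(rows, limit=64, headroom=8):
    """rows: iterable of (table_name, total_row_versions) tuples."""
    return sorted(
        (name, versions) for name, versions in rows if versions >= limit - headroom
    )


# Example rows as they might come back from INFORMATION_SCHEMA.INNODB_TABLES
sample = [("shop/orders", 60), ("shop/users", 3), ("shop/items", 56)]
flagged = tables_near_instant_limit(sample)
print(flagged)
```

Flagged tables are candidates for a planned rebuild during a maintenance window, before the limit is hit mid-migration.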
## Conclusion
Database migrations at scale reduce to a core principle: **isolate change from production traffic through incremental, reversible steps**. The shadow table approach (whether trigger-based or binlog-based) enables schema changes without downtime. Expand-contract enables data transformations without big-bang deployments. Native instant DDL enables metadata changes without any table copying.
The production patterns (verification via Scientist-style comparison, throttled backfills, careful cut-over coordination) exist because migrations fail. GitHub built gh-ost after trigger-based tools caused deadlocks. Stripe adopted GitHub's Scientist to catch the subtle bugs that code migrations introduce. Figma built custom sharding after evaluating NewSQL databases.
The common thread: plan for rollback at every stage, verify before trusting, and never assume the happy path.
## Appendix
### Prerequisites
- Understanding of database replication (binary log, replica lag)
- Familiarity with ACID transactions and locking
- Knowledge of blue-green deployment patterns
### Summary
- **Shadow table approach**: pt-osc (triggers) vs. gh-ost (binlog)—choose based on write load and foreign key needs
- **Expand-contract**: For data transformations, decouple into backward-compatible stages
- **Instant DDL**: Use for supported operations; track the 64-version limit
- **Verification matters**: Stripe's Scientist, Slack's double-read—verify before cutover
- **Plan for failure**: Keep old structures 24-48 hours, test rollback before starting
### References
- [gh-ost: GitHub's Online Schema-Change Tool](https://github.com/github/gh-ost) - Design documentation and cut-over mechanism
- [pt-online-schema-change](https://docs.percona.com/percona-toolkit/pt-online-schema-change.html) - Percona Toolkit documentation
- [MySQL 8.0 Online DDL Operations](https://dev.mysql.com/doc/refman/8.0/en/innodb-online-ddl-operations.html) - Official MySQL documentation on INSTANT algorithm
- [Spirit: MySQL Online Schema Change](https://github.com/block/spirit) - Block's modern reimplementation of gh-ost
- [Vitess Online DDL](https://vitess.io/docs/user-guides/schema-changes/ddl-strategies/) - VReplication-based migrations
- [pgroll: Zero-Downtime PostgreSQL Migrations](https://github.com/xataio/pgroll) - Expand-contract for PostgreSQL
- [Stripe Engineering: Online Migrations at Scale](https://stripe.com/blog/online-migrations) - Expand-contract and Scientist verification
- [GitHub Scientist](https://github.com/github/scientist) - Runtime comparison library
- [Slack Engineering: Scaling Datastores with Vitess](https://slack.engineering/scaling-datastores-at-slack-with-vitess/) - 3-year Vitess migration
- [Figma: How the Databases Team Lived to Tell the Scale](https://www.figma.com/blog/how-figmas-databases-team-lived-to-tell-the-scale/) - Custom PostgreSQL sharding
- [Martin Fowler: Parallel Change](https://martinfowler.com/bliki/ParallelChange.html) - Expand-contract pattern origin
---
## Multi-Region Architecture
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/core-distributed-patterns/multi-region-architecture
**Category:** System Design / Core Distributed Patterns
**Description:** Building systems that span multiple geographic regions to achieve lower latency, higher availability, and regulatory compliance. This article covers the design paths—active-passive, active-active, and cell-based architectures—along with production implementations from Netflix, Slack, and Uber, data replication strategies, conflict resolution approaches, and the operational complexity trade-offs that determine which pattern fits your constraints.
# Multi-Region Architecture
Building systems that span multiple geographic regions to achieve lower latency, higher availability, and regulatory compliance. This article covers the design paths—active-passive, active-active, and cell-based architectures—along with production implementations from Netflix, Slack, and Uber, data replication strategies, conflict resolution approaches, and the operational complexity trade-offs that determine which pattern fits your constraints.

Multi-region architecture with cell-based isolation: traffic routes to nearest region, cells provide fault isolation within regions, and data replicates asynchronously across regions.
## Abstract
Multi-region architecture navigates a fundamental tension: **global reach requires geographic distribution, but distribution introduces latency for coordination**. The core design decision is where to place the consistency boundary:
- **Active-passive**: Single writer, simple consistency, higher RTO during failover
- **Active-active**: Multiple writers, lower RTO, requires conflict resolution
- **Cell-based**: Regional isolation limits blast radius regardless of active/passive choice
The CAP theorem forces the choice: partition tolerance is mandatory across regions (WAN failures happen), so you trade consistency for availability or vice versa. Most production systems choose eventual consistency with idempotent operations and reconciliation—accepting that replicas may temporarily diverge but will converge.
**Key numbers to remember:**
| Metric | Typical Value |
| ------------------------------------- | ------------------ |
| Cross-region RTT (US-East to EU-West) | 80-120ms |
| Sync replication latency penalty | 2× RTT per write |
| Async replication lag (normal) | 10ms - 1s |
| Async replication lag (degraded) | Minutes to hours |
| Active-active failover | Seconds |
| Active-passive failover | Minutes (scale-up) |
| Cell-based failover (Slack) | < 5 minutes |
## The Problem
### Why Single-Region Fails
**Latency ceiling**: A user in Tokyo accessing servers in US-East experiences 150-200ms RTT before any processing. For interactive applications, this degrades UX—humans perceive delays > 100ms.
**Availability ceiling**: A single region, even with multiple availability zones, shares failure domains: regional network issues, power grid problems, or cloud provider outages. AWS US-East-1 has experienced multiple region-wide incidents affecting all AZs.
**Compliance barriers**: GDPR, data residency laws, and data sovereignty requirements mandate that certain data stays within geographic boundaries. A single-region architecture cannot satisfy conflicting jurisdiction requirements.
### Why Naive Multi-Region Fails
**Approach 1: Synchronous replication everywhere**
Write latency becomes `2 × RTT + processing`. With an 80-120ms US-to-EU RTT, every write takes at least 160-240ms before any processing. Users experience this as application sluggishness. Under high load, write queues back up and cascade into failures.
**Approach 2: Read replicas only, single primary**
Reads are fast (local), but writes route to a single region. Users far from the primary experience write latency. During primary region failure, writes are unavailable until manual failover—RTO measured in minutes to hours.
**Approach 3: Multi-primary without conflict resolution**
Concurrent writes to the same data in different regions corrupt state. Last-write-wins by wall clock fails because clock skew between regions can be seconds. The system appears to work until edge cases surface in production.
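The clock-skew failure is easy to reproduce. In the sketch below, the region names, values, and the 2-second skew are invented for illustration: the write that really happened later loses because the other region's clock runs fast.

```python
from dataclasses import dataclass


@dataclass
class Write:
    region: str
    value: str
    true_time: float   # when the write really happened (seconds)
    clock_skew: float  # how far this region's wall clock is off

    @property
    def wall_clock(self) -> float:
        return self.true_time + self.clock_skew


def last_write_wins(a: Write, b: Write) -> Write:
    """Naive LWW: trust each region's wall-clock timestamp."""
    return a if a.wall_clock >= b.wall_clock else b


first = Write("us-east", "address=old", true_time=100.0, clock_skew=2.0)
second = Write("eu-west", "address=new", true_time=101.0, clock_skew=0.0)

winner = last_write_wins(first, second)
print(f"kept {winner.value!r} from {winner.region}")
```

The user's newer address is silently discarded, and nothing in the system reports an error.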
### The Core Challenge
The fundamental tension: **coordination across regions requires communication, communication requires time, and time is latency**. Strong consistency demands coordination. Lower latency demands less coordination. You cannot have both.
Multi-region architecture exists to navigate this tension by:
1. Defining clear consistency boundaries (what must be consistent, what can be eventual)
2. Choosing replication strategies that match latency requirements
3. Designing conflict resolution for inevitable concurrent writes
4. Building isolation boundaries to limit failure propagation
## Design Paths
### Path 1: Active-Passive

Active-passive: all traffic goes to primary region; standby receives replicated data but serves no traffic until failover.
**How it works:**
- One region (primary) handles all read and write traffic
- Standby region receives asynchronously replicated data
- Standby services may be scaled down or off to reduce cost
- Failover promotes standby to primary (manual or automated)
**When to choose:**
- Write latency is critical (single-writer means no cross-region coordination)
- Operational simplicity is prioritized
- Cost is a concern (standby can run minimal infrastructure)
- RTO of minutes is acceptable
**Key characteristics:**
- **RPO**: Depends on replication lag (typically seconds to minutes)
- **RTO**: Minutes to tens of minutes (standby scale-up, DNS propagation)
- **Consistency**: Strong within primary region
- **Cost**: Lower (standby runs minimal or no compute)
**Failover process:**
1. Detect primary failure (health checks, synthetic monitoring)
2. Stop replication to prevent stale writes
3. Promote standby database to primary
4. Scale up standby compute
5. Update DNS/routing to point to new primary
6. Verify application health
**Trade-offs vs active-active:**
| Aspect | Active-Passive | Active-Active |
| ---------------------- | ---------------------- | -------------------------------- |
| Write latency | Lowest (single region) | Higher if sync, same if async |
| RTO | Minutes | Seconds |
| Operational complexity | Lower | Higher |
| Cost | Lower | Higher |
| Consistency model | Strong | Eventually consistent or complex |
**Real-world consideration:** AWS Elastic Disaster Recovery achieves RTO in minutes and RPO in seconds for active-passive setups. Azure Site Recovery provides 5-minute crash-consistent recovery points. These tools automate the failover process but still require standby scale-up time.
### Path 2: Active-Active

Active-active: both regions serve traffic simultaneously with bidirectional data replication.
**How it works:**
- All regions actively serve production traffic
- Each region has full read/write capability
- Data replicates bidirectionally between regions
- Conflict resolution handles concurrent writes to same data
**When to choose:**
- Near-zero RTO is required (no failover delay)
- Users are globally distributed (each region serves local users)
- Write availability cannot be sacrificed during region failures
- Team can handle conflict resolution complexity
**Key characteristics:**
- **RPO**: Zero (no data loss if replication is synchronous) or near-zero (async)
- **RTO**: Seconds (traffic reroutes automatically)
- **Consistency**: Eventually consistent (async) or linearizable (sync with latency cost)
- **Cost**: Higher (full capacity in all regions)
**Conflict resolution strategies:**
| Strategy | How It Works | Trade-offs |
| ----------------------- | ------------------------------------- | ------------------------------------------- |
| Last-Write-Wins (LWW) | Timestamp-based; later write wins | Simple but loses earlier concurrent writes |
| Application-level merge | Custom logic per data type | Flexible but complex to implement correctly |
| CRDTs | Mathematically guaranteed convergence | Limited data structures, can grow unbounded |
| Quorum writes | Majority must agree | Higher latency, reduced availability |
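To make the CRDT row concrete, here is a grow-only counter (G-Counter), one of the simplest CRDTs. It is a generic textbook sketch rather than any particular library's API: each region increments only its own slot, and merging takes the element-wise maximum, so replicas converge regardless of merge order or repetition.

```python
class GCounter:
    """Grow-only counter: one slot per region, merged by element-wise max."""

    def __init__(self, region):
        self.region = region
        self.counts = {}

    def increment(self, amount=1):
        self.counts[self.region] = self.counts.get(self.region, 0) + amount

    def merge(self, other):
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    @property
    def value(self):
        return sum(self.counts.values())


us = GCounter("us-east")
eu = GCounter("eu-west")
us.increment(3)  # concurrent local writes in each region
eu.increment(2)

us.merge(eu)
eu.merge(us)
us.merge(eu)     # merging again changes nothing (idempotent)

print(us.value, eu.value)
```

The "limited data structures" trade-off shows up immediately: this counter cannot decrement, and supporting that requires a different structure.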
**Real-world example (Netflix):**
Netflix runs active-active across US-East and US-West:
- Strict region autonomy: no synchronous cross-region calls
- Services discover only local instances
- Writes go local, replicate asynchronously
- Embraced eventual consistency with idempotent operations
- Business logic redesigned to handle temporary divergence
- User profiles may temporarily differ but converge through replicated event journals
Result: Invisible failover to users; routine Chaos Kong tests drop entire regions.
**Trade-offs vs active-passive:**
| Aspect | Active-Active | Active-Passive |
| ---------------------- | --------------------------- | -------------------- |
| RTO | Seconds | Minutes |
| Conflict handling | Required | None (single writer) |
| Data consistency | Eventual (typically) | Strong |
| Resource utilization | Higher (all regions active) | Lower |
| Operational complexity | Higher | Lower |
### Path 3: Cell-Based Architecture

Cell-based architecture: each cell is an isolated, complete deployment serving a subset of users; failure in Cell A1 (highlighted) doesn't affect A2 or A3.
**How it works:**
- Workload is partitioned into isolated cells
- Each cell is a complete, independent deployment
- Cells don't share state with other cells
- Users are routed to a specific cell (by user ID, tenant, geography)
- Cell failure affects only users assigned to that cell
**When to choose:**
- Blast radius limitation is critical
- Multi-tenant systems where tenant isolation is required
- Gradual rollout of changes (per-cell deployment)
- High availability requirements where zone/region failures are unacceptable
**Key characteristics:**
- **Blast radius**: Limited to cell size (e.g., 1/N of users)
- **Independence**: Cells don't communicate with each other
- **Scaling**: Add more cells rather than scaling cell size
- **Complexity**: Cell routing, cross-cell operations (rare)
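Cell routing needs a mapping that is stable across processes and restarts. A minimal sketch with hypothetical cell names: it hashes the user ID with SHA-256 because Python's built-in `hash()` is salted per process. Plain modulo routing reshuffles users whenever the cell count changes, so a production router would more likely use an explicit assignment table or consistent hashing.

```python
import hashlib
from collections import Counter


def cell_for_user(user_id: str, cells: list[str]) -> str:
    """Map a user to a cell with a stable hash of the user ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return cells[int.from_bytes(digest[:8], "big") % len(cells)]


cells = ["cell-a1", "cell-a2", "cell-a3", "cell-a4"]
assignments = Counter(cell_for_user(f"user-{n}", cells) for n in range(10_000))

# With 4 cells, losing one cell affects roughly a quarter of users
for cell in cells:
    print(cell, assignments[cell])
```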
**Cell sizing considerations:**
| Cell Size | Blast Radius | Cost Efficiency | Operational Overhead |
| --------------------- | ------------ | --------------- | -------------------- |
| Small (1% of users) | Minimal | Lower | Higher (more cells) |
| Medium (10% of users) | Moderate | Balanced | Moderate |
| Large (25% of users) | Higher | Higher | Lower |
**Real-world example (Slack):**
Slack migrated to cell-based architecture after availability zone outages:
- Each AZ contains a completely siloed backend deployment
- Services only communicate within their AZ
- Failure in one AZ is contained to that AZ
- Can drain affected AZ within 5 minutes
- Traffic shifted gradually (1% granularity)
- Edge load balancers (Envoy) distributed across regions
Result: AZ failures no longer cascade; graceful degradation affects subset of users.
**Combining with active-active:**
Cell-based architecture is orthogonal to active-passive/active-active:
- **Active-passive cells**: Each cell has a primary and standby
- **Active-active cells**: Cells in different regions serve same user partition
- **Region-scoped cells**: Cells within a region, replicated to other regions
### Decision Framework

Decision tree for multi-region architecture patterns based on RTO requirements, conflict handling capability, and blast radius concerns.
**Quick decision matrix:**
| If you need... | Choose... |
| ----------------------------------- | ------------------------------------ |
| Simplest operations, minutes RTO OK | Active-Passive |
| Seconds RTO, can handle conflicts | Active-Active |
| Seconds RTO, no conflicts | Active-Active with data partitioning |
| Limit blast radius | Add Cell-Based to any pattern |
## Data Replication Strategies
### Synchronous Replication
**How it works:**
Write is not acknowledged until all (or quorum of) replicas confirm receipt.
```
Client → Primary → [Replicas confirm] → Ack to Client
```
**Latency impact:**
Write latency = `RTT to farthest required replica + processing` (one round trip, i.e., 2 × one-way latency)
For US-East to EU-West (assuming 80ms one-way latency, so 160ms RTT):
- Minimum write latency: 160ms + processing
- P99 can exceed 300ms under load
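The difference between waiting for all replicas and waiting for a quorum can be made concrete with a small latency model. This is a sketch under stated assumptions: `quorumWriteLatencyMs` is a hypothetical helper, `acksRequired` counts only remote acknowledgments, and the RTT values in the usage note are illustrative rather than measured.

```typescript
// The primary can acknowledge once the slowest replica inside the
// required acknowledgment set has responded.
function quorumWriteLatencyMs(
  replicaRttMs: number[], // round-trip time from primary to each replica
  acksRequired: number, // how many replica confirmations the write waits for
  processingMs = 0,
): number {
  if (acksRequired < 1 || acksRequired > replicaRttMs.length) {
    throw new Error("acksRequired must be between 1 and the number of replicas")
  }
  const sorted = [...replicaRttMs].sort((a, b) => a - b)
  return sorted[acksRequired - 1] + processingMs
}
```

With replicas at 2ms, 70ms, and 160ms RTT, requiring all three acknowledgments costs 160ms, while requiring two costs 70ms: the farthest replica drops out of the critical path.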
**When to use:**
- Zero RPO is mandatory (financial transactions, audit logs)
- Data loss is worse than latency
- Write volume is low enough to absorb latency
**Implementations:**
- **Google Spanner**: Synchronous Paxos-based replication; external consistency
- **CockroachDB**: Consensus-based; write committed when majority acknowledges
### Asynchronous Replication
**How it works:**
Primary acknowledges write immediately, replicates in background.
```
Client → Primary → Ack to Client
            ↓ (async)
         Replicas
```
**Latency impact:**
Write latency = local processing only (microseconds to milliseconds)
**Replication lag:**
| Condition | Typical Lag |
| ------------------ | ---------------------- |
| Normal operation | 10ms - 1s |
| Network congestion | Seconds to minutes |
| Region degradation | Minutes to hours |
| Uber HiveSync P99 | ~20 minutes (batch) |
| AWS Aurora Global | Sub-second (streaming) |
**When to use:**
- Write throughput is critical
- Temporary inconsistency is acceptable
- RPO of seconds-to-minutes is tolerable
**Risk:**
Primary failure before replication completes = data loss. Committed writes may not have propagated.
### Semi-Synchronous Replication
**How it works:**
Synchronously replicate to N replicas; asynchronously to others.
```
Client → Primary → [N replicas confirm] → Ack
            ↓ (async)
     Remaining replicas
```
**Trade-off:**
- Better durability than fully async (data exists in N+1 locations)
- Better latency than fully sync (only N replicas in critical path)
- Common pattern: sync to one replica in same region, async to others
**Implementations:**
- MySQL semi-sync replication
- PostgreSQL synchronous standby with async secondaries
### Replication Topology Patterns

Replication topologies: Star (primary to all replicas), Chain (reduces primary load), Mesh (multi-primary active-active).
| Topology | Use Case | Trade-off |
| -------- | ------------------------------- | --------------------------- |
| Star | Active-passive, read replicas | Primary is bottleneck |
| Chain | Reduce primary replication load | Higher lag to end of chain |
| Mesh | Active-active multi-primary | Complex conflict resolution |
## Conflict Resolution
### The Conflict Problem
In active-active systems, concurrent writes to the same data in different regions create conflicts:
```
Region A: SET user.name = "Alice" @ T1
Region B: SET user.name = "Bob" @ T1
```
Both writes succeed locally. When replication happens, which value wins?
### Last-Write-Wins (LWW)
**Mechanism:** Attach timestamp to each write; higher timestamp wins.
```typescript
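// T is the stored value type; the timestamp orders competing writes.
type LWWValue<T> = {
  value: T
  timestamp: number // Hybrid logical clock recommended
}

// Ties keep the local value, so two replicas can diverge on equal timestamps;
// add a deterministic tie-breaker (e.g., node ID) if that matters.
function merge<T>(local: LWWValue<T>, remote: LWWValue<T>): LWWValue<T> {
  return local.timestamp >= remote.timestamp ? local : remote
}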
```
**Clock considerations:**
- Wall clock: Skew between regions can reach seconds; NTP reduces it but does not eliminate it
- Logical clocks: Monotonic per node; need hybrid for cross-node ordering
- Hybrid Logical Clocks (HLC): Combine wall time with logical counter; used by CockroachDB
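A hybrid logical clock can be sketched in a few dozen lines. The version below follows the commonly described HLC update rules (take the maximum wall time seen so far, and use a counter to break ties). It is an illustrative sketch, not CockroachDB's implementation; the names `HybridLogicalClock`, `HlcTimestamp`, and `compareHlc` are assumptions, and the physical clock is injectable so the behavior can be checked deterministically.

```typescript
type HlcTimestamp = { wallMs: number; counter: number }

class HybridLogicalClock {
  private wallMs = 0
  private counter = 0
  private readonly physicalNow: () => number

  constructor(physicalNow: () => number = () => Date.now()) {
    this.physicalNow = physicalNow
  }

  // Local event or message send.
  now(): HlcTimestamp {
    const pt = this.physicalNow()
    if (pt > this.wallMs) {
      this.wallMs = pt
      this.counter = 0
    } else {
      this.counter++ // physical clock stalled or went backwards
    }
    return { wallMs: this.wallMs, counter: this.counter }
  }

  // Message receive: never fall behind a timestamp we have observed.
  receive(remote: HlcTimestamp): HlcTimestamp {
    const pt = this.physicalNow()
    const prevWall = this.wallMs
    const maxWall = Math.max(prevWall, remote.wallMs, pt)
    if (maxWall === prevWall && maxWall === remote.wallMs) {
      this.counter = Math.max(this.counter, remote.counter) + 1
    } else if (maxWall === prevWall) {
      this.counter++
    } else if (maxWall === remote.wallMs) {
      this.counter = remote.counter + 1
    } else {
      this.counter = 0 // physical time is strictly ahead of both
    }
    this.wallMs = maxWall
    return { wallMs: this.wallMs, counter: this.counter }
  }
}

// Negative if a orders before b, positive if after, zero if equal.
function compareHlc(a: HlcTimestamp, b: HlcTimestamp): number {
  return a.wallMs !== b.wallMs ? a.wallMs - b.wallMs : a.counter - b.counter
}
```

Timestamps produced this way stay monotonic on a node even when its wall clock steps backwards, and a received timestamp always orders before the next local one.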
**Trade-offs:**
- **Pro:** Simple to implement
- **Con:** Earlier concurrent write is silently lost
- **Use when:** Losing concurrent writes is acceptable (e.g., last-update-wins is the business rule)
### Application-Level Merge
**Mechanism:** Custom merge function per data type.
```typescript
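// Assumed shapes for illustration; real carts carry more fields.
type CartItem = { id: string; quantity: number }
type Cart = { items: CartItem[] }

function mergeShoppingCart(local: Cart, remote: Cart): Cart {
  // Union of items; for duplicates, sum quantities.
  // Caveat: summing is not idempotent, so merging the same remote state twice double-counts.
  const merged = new Map<string, CartItem>()
  for (const item of [...local.items, ...remote.items]) {
    const existing = merged.get(item.id)
    if (existing) {
      existing.quantity += item.quantity
    } else {
      merged.set(item.id, { ...item }) // copy so the inputs are not mutated
    }
  }
  return { items: Array.from(merged.values()) }
}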
```
**Trade-offs:**
- **Pro:** Full control over merge semantics
- **Con:** Must implement and test for each data type
- **Use when:** Business logic dictates specific merge behavior
### CRDTs (Conflict-Free Replicated Data Types)
**Mechanism:** Data structures mathematically guaranteed to converge without conflicts.
**Core CRDT types:**
| CRDT | Use Case | Behavior |
| ------------------------ | ---------------------- | -------------------------------------------------- |
| G-Counter | Increment-only counter | Each node tracks own count; merge = max per node |
| PN-Counter | Counter with decrement | Two G-Counters (positive, negative); value = P - N |
| G-Set | Add-only set | Union on merge |
| OR-Set (Observed-Remove) | Set with remove | Tracks add/remove per element with unique tags |
| LWW-Register | Single value | Last-write-wins with timestamp |
| MV-Register | Multi-value register | Keeps all concurrent values |
**G-Counter example:**
```typescript
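type NodeId = string
type GCounter = Map<NodeId, number> // per-node increment counts

// Each node only ever increments its own entry.
function increment(counter: GCounter, nodeId: NodeId): GCounter {
  const newCounter = new Map(counter)
  newCounter.set(nodeId, (counter.get(nodeId) ?? 0) + 1)
  return newCounter
}

// Merge takes the per-node maximum, so it is commutative, associative, and idempotent.
function merge(a: GCounter, b: GCounter): GCounter {
  const merged = new Map<NodeId, number>()
  for (const nodeId of new Set([...a.keys(), ...b.keys()])) {
    merged.set(nodeId, Math.max(a.get(nodeId) ?? 0, b.get(nodeId) ?? 0))
  }
  return merged
}

// The counter's value is the sum of all per-node counts.
function value(counter: GCounter): number {
  return Array.from(counter.values()).reduce((sum, n) => sum + n, 0)
}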
```
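The PN-Counter row in the table composes two G-Counters, one for increments and one for decrements. A minimal self-contained sketch follows; the function names (`incrementPN`, `decrementPN`, `mergePN`, `valuePN`) are illustrative and not taken from any library.

```typescript
type NodeId = string
type GCounter = Map<NodeId, number>
type PNCounter = { increments: GCounter; decrements: GCounter }

// Return a copy of the counter with nodeId's entry increased by one.
function bump(counter: GCounter, nodeId: NodeId): GCounter {
  const next = new Map(counter)
  next.set(nodeId, (counter.get(nodeId) ?? 0) + 1)
  return next
}

// Per-node maximum, as in a plain G-Counter merge.
function mergeG(a: GCounter, b: GCounter): GCounter {
  const merged = new Map<NodeId, number>(a)
  b.forEach((count, nodeId) => {
    merged.set(nodeId, Math.max(merged.get(nodeId) ?? 0, count))
  })
  return merged
}

function total(counter: GCounter): number {
  let sum = 0
  counter.forEach((n) => (sum += n))
  return sum
}

function incrementPN(c: PNCounter, nodeId: NodeId): PNCounter {
  return { increments: bump(c.increments, nodeId), decrements: c.decrements }
}

function decrementPN(c: PNCounter, nodeId: NodeId): PNCounter {
  return { increments: c.increments, decrements: bump(c.decrements, nodeId) }
}

// Merge each half independently; both halves only ever grow.
function mergePN(a: PNCounter, b: PNCounter): PNCounter {
  return {
    increments: mergeG(a.increments, b.increments),
    decrements: mergeG(a.decrements, b.decrements),
  }
}

// value = P - N
function valuePN(c: PNCounter): number {
  return total(c.increments) - total(c.decrements)
}
```

Because a decrement is recorded as growth in the second counter, the merge never has to "subtract" anything, which is what keeps it safe to apply in any order and any number of times.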
**Trade-offs:**
- **Pro:** Automatic convergence; no custom merge logic
- **Con:** Limited data structures; can grow unbounded (tombstones)
- **Use when:** Data model fits CRDT primitives
**Production usage:**
- **Riak:** State-based CRDTs with delta-state optimization
- **Redis CRDB:** CRDTs for active-active geo-distribution
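- **Figma:** CRDT-inspired design for collaborative editing; a central server orders updates and applies last-writer-wins per property rather than running a fully decentralized CRDT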
See [CRDTs for Collaborative Systems](../crdt-for-collaborative-systems/README.md) for deep-dive.
### Choosing a Conflict Resolution Strategy
| Scenario | Recommended Strategy |
| ----------------------- | ---------------------------------------- |
| User profile updates | LWW (last update wins is expected) |
| Shopping cart | Application merge (union of items) |
| Counters (likes, views) | G-Counter CRDT |
| Collaborative documents | Operation-based CRDTs or OT |
| Financial balances | Avoid conflict (single writer or quorum) |
## Global Load Balancing
### GeoDNS
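**Mechanism:** The authoritative DNS server returns different IP addresses based on the geographic location inferred from the querying resolver's IP address (or from the EDNS Client Subnet option, when the resolver sends it).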
```
Client (Tokyo) → DNS query → GeoDNS → Returns IP of Asia-Pacific region
Client (NYC) → DNS query → GeoDNS → Returns IP of US-East region
```
**Limitations:**
- IP geolocation is imprecise (VPNs, corporate proxies, mobile networks)
- DNS caching delays failover (TTL typically minutes)
- No real-time health awareness
**When to use:**
- Coarse-grained geographic routing
- Latency optimization (not failover)
- Cost is a concern (simpler than anycast)
### Anycast
**Mechanism:** Multiple servers share the same IP address; BGP routing directs to "closest" server.
```
Same IP announced from:
- US-East data center
- EU-West data center
- Asia-Pacific data center
BGP routes to topologically closest (not geographically closest)
```
**Advantages:**
- Fast failover (BGP typically reconverges in seconds)
- Works regardless of DNS caching
- True network proximity (based on routing, not geography)
**Limitations:**
- Requires own AS number and upstream relationships
- Complex to operate
- Stateful connections can break during route changes
**Production usage:**
Cloudflare announces service IPs from every data center worldwide. Traffic always routes to closest data center. Regional Services feature passes traffic to region-specific processing after edge inspection.
### Latency-Based Routing
**Mechanism:** Route based on measured latency, not assumed geography.
AWS Route 53 latency-based routing:
1. AWS measures latency from DNS resolver networks to each region
2. Returns IP of region with lowest latency for that resolver
3. Periodic re-measurement adapts to network changes
**Advantages:**
- Actual performance, not assumed
- Adapts to network conditions
**Limitations:**
- Measures resolver-to-region, not end-user-to-region
- Still subject to DNS caching
### Global Server Load Balancing (GSLB)
**Mechanism:** Combines geographic awareness, health checks, and load balancing.
```
GSLB considers:
- Geographic proximity
- Server health (active health checks)
- Current load per region
- Latency measurements
```
**Typical decision flow:**
1. Client request arrives
2. GSLB checks health of all regions
3. Filters to healthy regions
4. Selects based on latency/load/geography
5. Returns appropriate endpoint
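The decision flow above can be sketched as a filter followed by a selection. This is a simplified illustration: `RegionStatus`, `selectRegion`, and the `maxLoad` threshold are hypothetical, and real GSLB products weigh these signals in vendor-specific ways.

```typescript
type RegionStatus = {
  name: string
  healthy: boolean // result of active health checks
  latencyMs: number // measured latency toward this region
  load: number // current utilization in [0, 1]
}

function selectRegion(regions: RegionStatus[], maxLoad = 0.8): RegionStatus | undefined {
  // Steps 2-3: keep only healthy regions.
  const healthy = regions.filter((r) => r.healthy)
  // Prefer regions with headroom; if every healthy region is hot, use any healthy one.
  const withHeadroom = healthy.filter((r) => r.load < maxLoad)
  const candidates = withHeadroom.length > 0 ? withHeadroom : healthy
  // Step 4: lowest latency wins among the remaining candidates.
  return candidates.reduce<RegionStatus | undefined>(
    (best, r) => (best === undefined || r.latencyMs < best.latencyMs ? r : best),
    undefined,
  )
}
```

Returning `undefined` when no region is healthy leaves the caller to decide between a static fallback and an error.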
**Trade-off vs simpler approaches:**
| Approach | Failover Speed | Health Awareness | Complexity |
| -------- | --------------- | ----------------- | ---------- |
| GeoDNS | Minutes (TTL) | None | Low |
| Anycast | Seconds (BGP) | Route-level | High |
| GSLB | Seconds-Minutes | Application-level | Medium |
## Production Implementations
### Netflix: Active-Active Multi-Region
**Context:** Streaming service with 200M+ subscribers; downtime directly impacts revenue.
**Architecture:**
- Full stack deployed in US-East and US-West
- All regions active, serving production traffic
- No synchronous cross-region calls (strict region autonomy)
**Key design decisions:**
| Decision | Rationale |
| -------------------------- | ---------------------------------------------- |
| Async replication | Write latency critical for UX |
| Regional service discovery | Eliminates cross-region call latency |
| Idempotent operations | Safe to retry; handles duplicate processing |
| Eventual consistency | Accepted temporary divergence for availability |
**Data handling:**
- Writes occur locally, replicate asynchronously
- User profiles and playback states may temporarily differ
- Convergence through replicated event journals
- Deterministic A/B test bucketing (same user, same bucket regardless of region)
**Routing and failover:**
- Enhanced Zuul proxy handles active-active routing
- Detects and handles mis-routed requests
- Global routing layer shifts traffic transparently during issues
- Failover is invisible to users
**Durability:**
- Every data element replicated across 3 AZs per region
- Routine S3 snapshots to another region and non-AWS cloud
- Can survive complete region loss
**Testing:**
- Chaos Kong: Drops entire AWS region in production
- Chaos Gorilla: Drops entire availability zone
- Regular failover exercises
**Outcome:** Near-zero RTO, invisible failovers, routine region-drop tests.
### Slack: Cellular Architecture
**Context:** Enterprise messaging; AZ outages triggered architecture redesign.
**Motivation:**
Previous monolithic deployment meant AZ failure affected all users. Migration to cell-based architecture to limit blast radius.
**Architecture:**
- Each AZ contains siloed backend deployment
- Components constrained to single AZ
- Services only communicate within their AZ
- Edge load balancers (Envoy) distributed across regions
**Cell isolation:**
```
Cell A1 (AZ-1): Services A, B, C → Database A1
Cell A2 (AZ-2): Services A, B, C → Database A2
Cell A3 (AZ-3): Services A, B, C → Database A3
No cross-cell communication
```
**Failover capabilities:**
| Metric | Value |
| ----------------------------- | ------------------- |
| Drain affected AZ | < 5 minutes |
| Traffic shift granularity | 1% |
| Request handling during drain | Graceful completion |
**Implementation details:**
- Heavy investment in Envoy/xDS migration from HAProxy
- In-house xDS control plane (Rotor)
- Edge load balancers in different regions
- Control plane replicated regionally, resilient to single AZ failure
**Outcome:** AZ failures contained; graceful degradation affects subset of users.
### Uber: Multi-Region Data Consistency
**Context:** Ride-sharing and delivery; 5M daily Hive events, 8PB of data replication.
**HiveSync System:**
Cross-region batch replication for data lake:
- Event-driven jobs capture Hive Metastore changes
- MySQL logs replication events asynchronously
- Converts to replication jobs as finite-state machines
- DAG-based orchestration with dynamic sharding
**Performance:**
| Metric | Target | Actual |
| --------------------- | ------- | ----------- |
| Replication SLA | 4 hours | Met |
| P99 lag | - | ~20 minutes |
| Cross-region accuracy | - | 99.99% |
**Data Reparo Service:**
- Scans regions for anomalies
- Fixes mismatches for consistency
- Catches replication failures and drift
**Multi-region Kafka (uReplicator):**
- Open-source solution for Kafka replication
- Extends MirrorMaker with reliability guarantees
- Zero-data-loss guarantee
- Supports active/active and active/passive consumption
**Failover handling:**
- Tracks consumption offset in primary region
- Replicates offset to other regions
- On primary failure: consumers resume from replicated offset
### CockroachDB: Multi-Active Availability
**Context:** Distributed SQL database designed for multi-region from the start.
**Approach:**
Multi-Active Availability: all replicas handle reads AND writes.
**Replication mechanism:**
- Consensus-based (Raft variant)
- Write committed when majority acknowledges
- At least 3 replicas required
**Key features:**
| Feature | Description |
| -------------------- | ------------------------------------------------------------------- |
| Transparent failover | Region failure handled without application changes |
| Zero RPO | Majority-commit means no data loss |
| Near-zero RTO | Automatic leader election |
| Non-voting replicas | Follow Raft log without quorum participation; reduces write latency |
**Multi-region topology patterns:**
1. **Regional tables:** Data pinned to specific region for compliance
2. **Global tables:** Replicated everywhere for low-latency reads
3. **Survival goals:** Configure whether to survive region or zone failure
**Availability:**
- Multi-region instances: 99.999% availability target
- Regional instances: 99.99% availability target
### Google Spanner: Multi-Region with External Consistency
**Context:** Google's globally distributed, strongly consistent database.
**Consistency guarantee:**
External consistency—stronger than linearizability. Transaction order observed by clients matches actual commit order. Achieved through TrueTime (GPS + atomic clocks).
**Replication:**
- Synchronous, Paxos-based
- Write quorum from replicas across regions
- Witness replicas provide fault tolerance without full data
**Architecture (typical):**
```
Default: 2 read-write regions (2 replicas each) + 1 witness region
Write path:
1. Leader replica receives write
2. Replicates to quorum (majority)
3. Commits when quorum acknowledges
4. Async replicates to remaining replicas
```
**Availability:**
- Multi-region: 99.999% SLA
- Regional: 99.99% SLA
**Trade-off:**
Higher write latency (cross-region Paxos) in exchange for strongest consistency guarantees.
### Implementation Comparison
| Aspect | Netflix | Slack | Uber | CockroachDB | Spanner |
| ----------- | ------------- | ---------- | ---------------- | ------------ | ------------ |
| Pattern | Active-Active | Cell-Based | Hybrid | Multi-Active | Multi-Active |
| Consistency | Eventual | Eventual | Eventual (batch) | Strong | External |
| RTO | Seconds | < 5 min | Varies | Near-zero | Near-zero |
| RPO | Near-zero | Near-zero | Minutes (batch) | Zero | Zero |
| Complexity | High | High | High | Medium | High |
## Common Pitfalls
### 1. Assuming Cross-Region Calls Are Fast
**The mistake:** Designing services that make synchronous cross-region calls, assuming network is reliable.
**Example:** Authentication service in US-East calls authorization service in EU-West for every request. Under load, 100ms+ RTT cascades into timeouts.
**Why it happens:** Works fine in development (same region) and low traffic (no queue buildup).
**Solutions:**
- Enforce region autonomy: services only call local instances
- Replicate data needed for authorization to each region
- Design for async where possible
### 2. Underestimating Replication Lag
**The mistake:** Building features that assume immediate replication.
**Example:** User updates profile in Region A, immediately checks from Region B, sees stale data. Files support ticket about "lost" update.
**Why it happens:** Normal lag is sub-second; pathological cases (network issues, load) can be minutes.
**Solutions:**
- Read-your-own-writes: Route user to same region for reads after write
- Version tokens: Client includes version; server ensures that version is visible
- UI feedback: Show "saving..." until confirmation propagates
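The version-token approach can be sketched as a freshness check at read time. This is an assumption-laden illustration: `Replica`, `appliedVersion`, and `pickReadReplica` are hypothetical names, and the primary is assumed to always hold the latest committed write.

```typescript
type Replica = { region: string; appliedVersion: number }

// The client sends the version returned by its last write. Serve the read
// locally only if a local replica has applied at least that version;
// otherwise fall back to the primary.
function pickReadReplica(
  clientVersionToken: number,
  localReplicas: Replica[],
  primary: Replica,
): Replica {
  const fresh = localReplicas.find((r) => r.appliedVersion >= clientVersionToken)
  return fresh ?? primary
}
```

An alternative to falling back is to wait briefly for the local replica to catch up, which trades read latency for less cross-region traffic.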
### 3. Clock Skew in LWW
**The mistake:** Using wall clock time for last-write-wins without accounting for skew.
**Example:** Region A's clock is 5 seconds ahead. All its writes "win" against Region B, even if Region B's writes were actually later.
**Why it happens:** NTP reduces skew but doesn't eliminate it. Cloud providers have millisecond-level skew between regions under good conditions; seconds under bad.
**Solutions:**
- Hybrid Logical Clocks: Combine wall time with logical counter
- Centralized timestamp service: Single source of truth (but adds latency)
- Application-level versioning: Client-provided version numbers
### 4. Unbounded Growth in CRDTs
**The mistake:** Using CRDTs without planning for garbage collection.
**Example:** OR-Set tracks tombstones for deleted elements. After a year, set has 100K tombstones, 1K actual elements. Memory explodes.
**Why it happens:** CRDTs guarantee convergence by keeping metadata. Without cleanup, metadata grows forever.
**Solutions:**
- Tombstone expiry: Remove tombstones after grace period (risk: resurrection if old replica reconnects)
- Periodic compaction: Checkpoint state, truncate history
- Bounded metadata: Cap actor IDs, merge old entries
### 5. Testing Only Happy Path
**The mistake:** Testing failover manually once; not testing regularly or under load.
**Example:** Failover works in staging. In production, DNS cache TTL is higher, standby takes longer to scale, dependent services timeout during transition.
**Why it happens:** Failover testing is expensive and scary. Teams avoid it.
**Solutions:**
- Chaos engineering: Regular production failure injection (Chaos Monkey, Chaos Kong)
- Game days: Scheduled failover exercises
- Automated failover testing: CI/CD pipeline includes failover scenarios
### 6. Split-Brain Without Quorum
**The mistake:** Active-active with 2 regions; network partition leads to both accepting writes independently.
**Example:** US-East and EU-West can't communicate. Both continue serving traffic, writing conflicting data. When partition heals, data is corrupted beyond automatic merge.
**Why it happens:** 2-region active-active has no quorum; neither can determine if it's the "real" primary.
**Solutions:**
- 3+ regions: Quorum requires majority (2 of 3)
- Witness region: Doesn't serve traffic but participates in quorum
- Partition detection: One region goes read-only during partition
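The quorum rule behind these solutions fits in a few lines. The sketch below is illustrative (`hasQuorum` and `writeModeDuringPartition` are hypothetical names): a region keeps accepting writes only while it can reach a strict majority of voting regions, counting itself.

```typescript
type WriteMode = "read-write" | "read-only"

// Strict majority: more than half of all voters must be reachable.
function hasQuorum(reachableVoters: number, totalVoters: number): boolean {
  return reachableVoters > totalVoters / 2
}

function writeModeDuringPartition(reachableVoters: number, totalVoters: number): WriteMode {
  return hasQuorum(reachableVoters, totalVoters) ? "read-write" : "read-only"
}
```

With two voters, each side of a partition sees 1 of 2 and neither has a majority, which is why a third region or a witness is needed for one side to keep accepting writes safely.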
## Conclusion
Multi-region architecture is a spectrum of trade-offs, not a single pattern to apply. The decision tree starts with RTO requirements:
- **Minutes acceptable:** Active-passive with async replication—simpler operations, lower cost
- **Seconds required:** Active-active with conflict resolution—higher complexity, near-zero RTO
- **Blast radius concern:** Add cell-based isolation—limits failure impact regardless of active/passive choice
Data replication strategy follows from RPO:
- **Zero data loss:** Synchronous replication—pay the latency cost
- **Seconds-to-minutes acceptable:** Asynchronous replication—better performance, accept lag
Conflict resolution depends on data model:
- **Overwrite is OK:** Last-write-wins
- **Custom semantics needed:** Application-level merge
- **Countable/set-like data:** CRDTs
Production systems like Netflix, Slack, and Uber demonstrate that eventual consistency with idempotent operations and reconciliation handles most use cases. Strong consistency (Spanner, CockroachDB) is achievable but at latency cost.
The meta-lesson: **design for failure from the start**. Assume regions will fail, replication will lag, and conflicts will occur. Build idempotency, reconciliation, and graceful degradation into the foundation rather than retrofitting later.
## Appendix
### Prerequisites
- Understanding of distributed systems fundamentals (CAP theorem, consensus)
- Familiarity with database replication concepts
- Knowledge of DNS and network routing basics
### Terminology
| Term | Definition |
| ---------------------------------- | ----------------------------------------------------------------------------------- |
| **RTO (Recovery Time Objective)** | Maximum acceptable time system can be down during failure |
| **RPO (Recovery Point Objective)** | Maximum acceptable data loss measured in time |
| **Active-Passive** | Architecture where one region serves traffic; others are standby |
| **Active-Active** | Architecture where all regions serve traffic simultaneously |
| **Cell-Based Architecture** | Isolated deployments (cells) each serving subset of users |
| **CRDT** | Conflict-free Replicated Data Type; data structure that merges automatically |
| **Anycast** | Routing technique where multiple locations share same IP; network routes to closest |
| **GeoDNS** | DNS that returns different IPs based on client's geographic location |
| **Split-Brain** | Failure mode where partitioned nodes operate independently, causing divergence |
| **Quorum** | Majority of nodes that must agree for operation to succeed |
### Summary
- Multi-region navigates the tension between global reach and coordination latency
- Active-passive: simple, minutes RTO, single writer
- Active-active: complex, seconds RTO, requires conflict resolution
- Cell-based: limits blast radius, orthogonal to active/passive choice
- Data replication: sync (zero RPO, high latency) vs async (low latency, potential data loss)
- Conflict resolution: LWW (simple, loses data), CRDTs (automatic, limited types), app merge (flexible, complex)
- Production systems embrace eventual consistency with idempotent operations
### References
**Architecture Patterns:**
- [AWS Well-Architected: Multi-Region Active-Active](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/) - AWS multi-region DR patterns
- [AWS Well-Architected: Cell-Based Architecture](https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/what-is-a-cell-based-architecture.html) - Cell-based architecture guidance
- [Azure Multi-Region Design](https://learn.microsoft.com/en-us/azure/well-architected/reliability/highly-available-multi-region-design) - Azure multi-region strategies
**Production Case Studies:**
- [Netflix Active-Active for Multi-Regional Resiliency](https://netflixtechblog.com/active-active-for-multi-regional-resiliency-c47719f6685b) - Netflix's active-active architecture
- [Slack's Migration to Cellular Architecture](https://slack.engineering/slacks-migration-to-a-cellular-architecture/) - Slack's cell-based transformation
- [Uber's HiveSync for Cross-Region Data](https://www.uber.com/blog/building-ubers-data-lake-batch-data-replication-using-hivesync/) - Uber's data replication system
- [Uber's Kafka Disaster Recovery](https://www.uber.com/blog/kafka/) - uReplicator for multi-region Kafka
**Database Multi-Region:**
- [CockroachDB Multi-Active Availability](https://www.cockroachlabs.com/docs/stable/multi-active-availability) - CockroachDB's approach
- [Google Spanner Multi-Region](https://cloud.google.com/blog/topics/developers-practitioners/demystifying-cloud-spanner-multi-region-configurations) - Spanner replication and consistency
- [AWS Aurora Global Database](https://aws.amazon.com/blogs/database/monitor-amazon-aurora-global-database-replication-at-scale-using-amazon-cloudwatch-metrics-insights/) - Aurora replication monitoring
**Distributed Systems Theory:**
- [CAP Theorem](https://en.wikipedia.org/wiki/CAP_theorem) - Brewer's theorem and practical implications
- [CRDTs](https://crdt.tech/) - Conflict-free Replicated Data Types resources
- [Raft Consensus](https://raft.github.io/) - Raft algorithm specification
**Global Load Balancing:**
- [Cloudflare Anycast Primer](https://blog.cloudflare.com/a-brief-anycast-primer/) - How anycast works
- [AWS Route 53 Latency Routing](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy-latency.html) - Latency-based routing
---
## Graceful Degradation
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/core-distributed-patterns/graceful-degradation
**Category:** System Design / Core Distributed Patterns
**Description:** Graceful degradation is the discipline of designing distributed systems that maintain partial functionality when components fail, rather than collapsing entirely. The core insight: a system serving degraded responses to all users is preferable to one returning errors to most users. This article covers the pattern variants, implementation trade-offs, and production strategies that separate resilient systems from fragile ones.
# Graceful Degradation
Graceful degradation is the discipline of designing distributed systems that maintain partial functionality when components fail, rather than collapsing entirely. The core insight: a system serving degraded responses to all users is preferable to one returning errors to most users. This article covers the pattern variants, implementation trade-offs, and production strategies that separate resilient systems from fragile ones.

Graceful degradation creates multiple intermediate states between full health and complete failure, with recovery paths back to normal operation.
## Abstract
Graceful degradation transforms hard dependencies into soft dependencies through a hierarchy of fallback behaviors. The mental model:
1. **Degradation Hierarchy**: Systems define ordered fallback states—from full functionality through progressively simpler modes down to static responses—each trading capability for availability
2. **Failure Isolation**: Patterns like circuit breakers, bulkheads, and timeouts contain failures to prevent cascade propagation across service boundaries
3. **Load Management**: Admission control and load shedding protect system capacity by rejecting excess work early, keeping latency acceptable for admitted requests
4. **Recovery Coordination**: Backoff with jitter prevents thundering herd on recovery; retry budgets cap amplification during degraded states
The key design tension: aggressive fallbacks improve availability but may serve stale or incomplete data. Conservative fallbacks preserve correctness but risk cascade failures. Production systems typically err toward availability, with explicit SLOs defining acceptable staleness.
## The Problem
### Why Naive Approaches Fail
**Approach 1: Fail-Fast Everything**
Returning errors immediately when any dependency is unavailable seems honest, but propagates failures upstream. A single slow database query can cascade through dozens of dependent services, each timing out and returning errors to their callers.
**Approach 2: Infinite Retries**
Retrying failed requests until they succeed appears resilient, but creates retry storms. If a service handles 10,000 requests per second and fails for 10 seconds, naive retries generate 100,000+ additional requests, overwhelming any recovery attempt.
**Approach 3: Long Timeouts**
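Setting generous timeouts (30+ seconds) to "wait for things to recover" exhausts connection pools and threads. A service with 100 threads and 30-second timeouts can complete at most 100 / 30 ≈ 3.3 requests/second while every call runs to its timeout. If the same service normally answers in about 30ms (roughly 3,300 requests/second), that is a 1000x capacity reduction.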
### The Core Challenge
Distributed systems face a fundamental tension: **availability versus correctness**. When a dependency fails, you must choose between:
- Returning an error (correct but unavailable)
- Returning stale/incomplete data (available but potentially incorrect)
- Blocking until recovery (neither available nor responsive)
Graceful degradation provides a framework for making this choice deliberately, with explicit trade-offs documented in SLOs.
## Pattern Overview
### Core Mechanism
Graceful degradation works by defining a **degradation hierarchy**—an ordered list of fallback behaviors activated as failures accumulate:
| Level | State | Behavior | Trade-off |
| ----- | ----------- | ----------------------------- | ------------------------------ |
| 0 | Healthy | Full functionality | None |
| 1 | Degraded | Serve cached/stale data | Staleness vs availability |
| 2 | Limited | Disable non-critical features | Functionality vs stability |
| 3 | Minimal | Read-only mode | Writes rejected vs reads preserved |
| 4 | Static | Return default responses | Personalization vs uptime |
| 5 | Unavailable | Return error | Complete failure |
Each level represents an explicit trade-off. The system progresses through levels only when lower levels become untenable.
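One way to picture the hierarchy is as an ordered list of handlers tried in turn. The sketch below is a simplified, synchronous illustration (`Level`, `Outcome`, and `serveWithFallbacks` are hypothetical names); a production version would be asynchronous and would add timeouts and circuit breakers around each level.

```typescript
type Level<T> = { name: string; run: () => T }
type Outcome<T> = { value: T; servedBy: string; degraded: boolean }

// Try each level in order; the first one that succeeds serves the response.
function serveWithFallbacks<T>(levels: Level<T>[]): Outcome<T> {
  let lastError: unknown
  for (let i = 0; i < levels.length; i++) {
    const level = levels[i]
    try {
      // Anything after the first level counts as a degraded response.
      return { value: level.run(), servedBy: level.name, degraded: i > 0 }
    } catch (err) {
      lastError = err // remember why this level failed, then try the next one
    }
  }
  // Every level failed: this corresponds to the "Unavailable" state.
  throw lastError ?? new Error("no degradation levels configured")
}
```

Returning `servedBy` and `degraded` alongside the value lets callers emit metrics or show a staleness indicator instead of hiding the fallback.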
### Key Invariants
1. **Monotonic Degradation**: Systems move through degradation levels in order; skipping from healthy to unavailable without intermediate states indicates missing fallbacks
2. **Bounded Impact**: Any single component failure affects only the features depending on that component; unrelated functionality continues normally
3. **Explicit Recovery**: Systems don't automatically return to healthy state—circuit breakers test recovery, and operators may gate full restoration
### Failure Modes
| Failure | Impact | Mitigation |
| ----------------- | ------------------------------------------------- | --------------------------------------------- |
| Cascade failure | One service failure propagates to all dependents | Circuit breakers, timeouts, bulkheads |
| Retry storm | Failed requests amplified by retries | Exponential backoff, jitter, retry budgets |
| Thundering herd | Simultaneous recovery overwhelms system | Staggered recovery, jitter on first request |
| Stale data served | Users see outdated information | TTL limits, staleness indicators in UI |
| Split brain | Different servers in different degradation states | Centralized health checks, consensus on state |
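Two of the mitigations in the table, backoff with jitter and retry budgets, are small enough to sketch. The code below is illustrative only: `backoffDelayMs` implements the "full jitter" variant (a random delay between zero and an exponentially growing ceiling), and `RetryBudget` is a hypothetical, simplified budget that omits time windowing.

```typescript
// Full jitter: random delay in [0, min(cap, base * 2^attempt)).
// The random source is injectable so the function can be tested.
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 10000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt))
  return Math.floor(random() * ceiling)
}

// Allow retries only while they stay under a fixed fraction of requests,
// which caps how much extra load retries can add during an outage.
class RetryBudget {
  private requests = 0
  private retries = 0
  private readonly maxRetryRatio: number

  constructor(maxRetryRatio = 0.1) {
    this.maxRetryRatio = maxRetryRatio
  }

  recordRequest(): void {
    this.requests++
  }

  tryAcquireRetry(): boolean {
    if (this.retries + 1 > this.requests * this.maxRetryRatio) return false
    this.retries++
    return true
  }
}
```

Jitter spreads retries out in time so clients do not retry in lockstep; the budget bounds their total volume regardless of timing.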
## Design Paths
### Path 1: Circuit Breaker Pattern
**When to choose this path:**
- Dependencies have distinct failure modes (timeout, error, slow)
- You need automatic recovery detection
- Service mesh or library support available
**Key characteristics:**
The circuit breaker monitors call success rates and "trips" when failures exceed a threshold, preventing further calls to a failing service. This gives the dependency time to recover without continued load.
**Three states:**

- **Closed**: Normal operation, requests pass through, failures tracked
- **Open**: Requests fail immediately without calling dependency
- **Half-Open**: Limited test requests probe for recovery
**Implementation approach:**
```typescript title="circuit-breaker.ts" collapse={1-5, 35-50}
type CircuitState = "closed" | "open" | "half-open"

interface CircuitBreakerConfig {
  failureThreshold: number // Consecutive failures before opening (e.g., 5)
  successThreshold: number // Successes in half-open needed to close (e.g., 2)
  timeout: number // Time in open state before probing (e.g., 30000ms)
  monitorWindow: number // Sliding window size (e.g., 10); unused in this simplified version
}

class CircuitOpenError extends Error {}

class CircuitBreaker {
  private state: CircuitState = "closed"
  private failures = 0
  private successes = 0
  private lastFailureTime = 0

  constructor(private readonly config: CircuitBreakerConfig) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      // After the open timeout elapses, let probe requests through
      if (Date.now() - this.lastFailureTime > this.config.timeout) {
        this.state = "half-open"
        this.successes = 0
      } else {
        throw new CircuitOpenError("Circuit is open")
      }
    }
    try {
      const result = await fn()
      this.onSuccess()
      return result
    } catch (error) {
      this.onFailure()
      throw error
    }
  }

  private onSuccess(): void {
    this.failures = 0 // Any success resets the consecutive-failure count
    if (this.state === "half-open") {
      this.successes++
      if (this.successes >= this.config.successThreshold) {
        this.state = "closed"
      }
    }
  }

  private onFailure(): void {
    this.failures++
    this.lastFailureTime = Date.now()
    // A failed probe in half-open reopens immediately; otherwise wait for the threshold
    if (this.state === "half-open" || this.failures >= this.config.failureThreshold) {
      this.state = "open"
    }
  }
}
```
**Production configuration (Resilience4j):**
| Parameter | Typical Value | Rationale |
| --------------------------------------- | ------------- | ------------------------------------------- |
| `failureRateThreshold` | 50% | Opens when half of recorded calls fail |
| `slidingWindowSize` | 10-20 calls | Enough samples that one or two failures do not dominate the rate |
| `minimumNumberOfCalls` | 5 | Don't trip on first few failures |
| `waitDurationInOpenState` | 10-30s | Time for dependency to recover |
| `permittedNumberOfCallsInHalfOpenState` | 2-3 | Enough probes to confirm recovery |
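
The hand-written breaker earlier counts consecutive failures, while the parameters in this table describe a failure rate over a sliding window. A minimal sketch of that count-based window follows; the `SlidingWindow` class and its method names are hypothetical and only mirror the parameter names above, they are not Resilience4j's internals.

```typescript
// Count-based sliding window: remembers the outcome of the last `size` calls.
class SlidingWindow {
  private readonly outcomes: boolean[] = [] // true = failed call
  private readonly size: number
  private readonly minimumNumberOfCalls: number
  private readonly failureRateThreshold: number // percent, e.g. 50

  constructor(size: number, minimumNumberOfCalls: number, failureRateThreshold: number) {
    this.size = size
    this.minimumNumberOfCalls = minimumNumberOfCalls
    this.failureRateThreshold = failureRateThreshold
  }

  record(failed: boolean): void {
    this.outcomes.push(failed)
    if (this.outcomes.length > this.size) this.outcomes.shift() // evict the oldest outcome
  }

  failureRate(): number {
    if (this.outcomes.length === 0) return 0
    const failures = this.outcomes.filter((failed) => failed).length
    return (failures / this.outcomes.length) * 100
  }

  shouldOpen(): boolean {
    // Too few samples: never trip, however bad the rate looks
    if (this.outcomes.length < this.minimumNumberOfCalls) return false
    return this.failureRate() >= this.failureRateThreshold
  }
}
```

With `new SlidingWindow(10, 5, 50)`, four failures in a row do not open the circuit because fewer than five calls have been recorded; the fifth call makes the rate eligible for evaluation.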
**Real-world example: Netflix Hystrix**
Netflix pioneered circuit breakers at scale, processing tens of billions of thread-isolated calls daily. Their key insight: the circuit breaker's fallback must be simpler than the primary path.
> "The fallback is for giving users a reasonable response when the circuit is open. It shouldn't try to be clever—a simple cached value or default is better than complex retry logic." — Netflix Tech Blog
Hystrix is now in maintenance mode; Netflix recommends Resilience4j for new projects, which uses a functional composition model:
```java title="Resilience4jExample.java" collapse={1-8, 18-25}
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.vavr.control.Try;
import java.time.Duration;
import java.util.function.Supplier;

// Configuration
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofSeconds(10))
    .slidingWindowSize(10)
    .minimumNumberOfCalls(5)
    .permittedNumberOfCallsInHalfOpenState(3)
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("userService", config);

// Usage - functional composition
// (User is assumed to be the return type of userService.getUser)
Supplier<User> decoratedSupplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> userService.getUser(userId));

Try<User> result = Try.ofSupplier(decoratedSupplier)
    .recover(throwable -> getCachedUser(userId)); // Fallback
```
**Trade-offs vs other paths:**
| Aspect | Circuit Breaker | Timeout Only | Retry Only |
| ------------------ | ----------------------------- | ----------------- | -------------------- |
| Failure detection | Automatic (threshold) | Per-request | None |
| Recovery detection | Automatic (half-open) | Manual | None |
| Overhead | State tracking per dependency | Minimal | Minimal |
| Configuration | Multiple parameters | Single timeout | Retry count, backoff |
| Best for | Unstable dependencies | Slow dependencies | Transient failures |
### Path 2: Load Shedding
**When to choose this path:**
- System receives traffic spikes beyond capacity
- Some requests are more important than others
- You control the server-side admission logic
**Key characteristics:**
Load shedding rejects excess requests _before_ they consume resources, keeping latency acceptable for admitted requests. The key insight from AWS:
> "The goal is to keep serving good latencies to the requests you do accept, rather than serving bad latencies to all requests." — AWS Builders' Library
**Implementation approaches:**
**Server-side admission control:**
```typescript title="load-shedder.ts" collapse={1-3, 30-45}
import { Request, Response, NextFunction } from "express"

interface LoadShedderConfig {
  maxConcurrent: number // Max concurrent requests
  maxQueueSize: number // Max waiting requests (queueing is omitted from this excerpt)
  priorityHeader: string // Header indicating request priority
}

class LoadShedder {
  private currentRequests = 0

  constructor(private readonly config: LoadShedderConfig) {}

  middleware = (req: Request, res: Response, next: NextFunction) => {
    const priority = this.getPriority(req)
    const capacity = this.getAvailableCapacity()
    // High priority: admit while any capacity remains
    // Low priority: admit only while more than 30% capacity remains
    const threshold = priority === "high" ? 0 : 0.3
    if (capacity <= threshold) {
      res.status(503).header("Retry-After", "5").send("Service overloaded")
      return
    }
    this.currentRequests++
    // "close" also fires when the client disconnects early, so the slot is always released
    res.on("close", () => this.currentRequests--)
    next()
  }

  private getPriority(req: Request): "high" | "low" {
    // Payment endpoints are always high priority
    if (req.path.includes("/checkout") || req.path.includes("/payment")) {
      return "high"
    }
    // Node lower-cases incoming header names; anything other than "high" counts as low
    const header = req.headers[this.config.priorityHeader.toLowerCase()]
    return header === "high" ? "high" : "low"
  }

  private getAvailableCapacity(): number {
    return 1 - this.currentRequests / this.config.maxConcurrent
  }
}
```
**Priority-based shedding:**
| Priority | Shed When Capacity Below | Examples |
| -------- | ------------------------ | ---------------------------------- |
| Critical | 0% (never shed) | Health checks, payment processing |
| High | 20% | User-facing reads, search |
| Medium | 40% | Background syncs, analytics events |
| Low | 60% | Batch jobs, reports, prefetching |
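
The tiers in this table can be expressed as a lookup from priority to a capacity threshold. The sketch below assumes available capacity is reported as a fraction between 0 and 1; the `Priority` type, the `SHED_BELOW` map, and `shouldShed` are hypothetical names introduced for illustration.

```typescript
type Priority = "critical" | "high" | "medium" | "low"

// Shed a request when available capacity (0..1) falls below its tier's threshold.
const SHED_BELOW: Record<Priority, number> = {
  critical: 0, // capacity is never below 0, so critical traffic is never shed
  high: 0.2,
  medium: 0.4,
  low: 0.6,
}

function shouldShed(priority: Priority, availableCapacity: number): boolean {
  return availableCapacity < SHED_BELOW[priority]
}
```

At 50% available capacity this sheds low-priority work and admits everything else; at 30% medium-priority work is shed as well.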
**Real-world example: Shopify**
Shopify handles 100x traffic spikes during flash sales by shedding non-essential features:
- **Always preserved**: Checkout, payment processing, order confirmation
- **Shed early**: Product recommendations, recently viewed, wish lists
- **Shed under load**: Search suggestions, inventory counts, shipping estimates
> "Our pod model means each merchant's traffic is isolated. But within a pod, we have explicit rules: checkout never degrades. Everything else can pause." — Shopify Engineering
**Real-world example: Google GFE**
Google's Global Front End (GFE) implements multi-tier admission control:
1. **Connection limits**: Cap TCP connections per client IP
2. **Request rate limits**: Per-user and per-API quotas
3. **Priority queues**: Critical traffic bypasses congestion
4. **Adaptive shedding**: Increase rejection rate as CPU approaches saturation
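
One way to picture the adaptive shedding step is a rejection probability that ramps up as CPU utilization approaches saturation. This is a generic sketch, not Google's algorithm: the linear ramp, the 70% and 95% thresholds, and both function names are assumptions chosen for illustration.

```typescript
// Rejection probability ramps linearly from 0 at `startAt` to 1 at `saturateAt`.
// `cpu` is utilization as a fraction between 0 and 1.
function rejectionProbability(cpu: number, startAt = 0.7, saturateAt = 0.95): number {
  if (cpu <= startAt) return 0
  if (cpu >= saturateAt) return 1
  return (cpu - startAt) / (saturateAt - startAt)
}

// The random source is injectable so the decision can be tested deterministically.
function shouldReject(cpu: number, random: () => number = Math.random): boolean {
  return random() < rejectionProbability(cpu)
}
```

A probabilistic ramp avoids the cliff of a single hard threshold: as load rises, a growing share of requests is rejected instead of all of them at once.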
Google SRE recommends:
- **Per-request retry cap**: Maximum 3 attempts
- **Per-client retry budget**: Keep retries under 10% of normal traffic
**Trade-offs:**
| Aspect | Load Shedding | No Shedding |
| --------------------- | ------------------------ | ------------------------ |
| Latency under load | Stable for admitted | Degrades for all |
| Throughput under load | Maintained | Collapses |
| User experience | Some see errors | All see slow |
| Implementation | Requires priority scheme | Simpler |
| Capacity planning | Can overprovision less | Need headroom for spikes |
### Path 3: Bulkhead Pattern
**When to choose this path:**
- Multiple independent workloads share resources
- One workload's failure shouldn't affect others
- You need blast radius containment
**Key characteristics:**
Named after ship compartments that prevent a hull breach from sinking the entire vessel, bulkheads isolate failures to affected components.
**Implementation levels:**
| Level | Isolation Unit | Use Case |
| ---------------- | --------------------------------- | -------------------------- |
| Thread pools | Per-dependency thread pool | Different latency profiles |
| Connection pools | Per-service connection limits | Database isolation |
| Process | Separate processes/containers | Complete memory isolation |
| Cell | Independent infrastructure stacks | Regional blast radius |
**Thread pool isolation:**
```typescript title="bulkhead.ts" collapse={1-4, 30-40}
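// Node.js runs JavaScript on one thread, so this bulkhead caps concurrent
// async calls (semaphore-style) rather than owning a real thread pool.
interface BulkheadConfig {
  maxConcurrent: number // Max parallel executions
  maxWait: number // Max queue time (ms)
  name: string // For metrics/logging
}

class BulkheadFullError extends Error {}
class BulkheadTimeoutError extends Error {}

class Bulkhead {
  private executing = 0
  private waiters: Array<() => void> = []

  constructor(private readonly config: BulkheadConfig) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.executing >= this.config.maxConcurrent) {
      // The wait queue is bounded by maxConcurrent as well
      if (this.waiters.length >= this.config.maxConcurrent) {
        throw new BulkheadFullError(`${this.config.name} bulkhead full`)
      }
      await this.waitForCapacity() // Resolves already holding a slot
    } else {
      this.executing++
    }
    try {
      return await fn()
    } finally {
      const next = this.waiters.shift()
      if (next) next() // Hand the slot directly to the next waiter
      else this.executing--
    }
  }

  private waitForCapacity(): Promise<void> {
    return new Promise((resolve, reject) => {
      const waiter = () => {
        clearTimeout(timeout)
        resolve()
      }
      const timeout = setTimeout(() => {
        // Drop the timed-out waiter so it cannot absorb a later wake-up
        this.waiters = this.waiters.filter((w) => w !== waiter)
        reject(new BulkheadTimeoutError(`${this.config.name} wait timeout`))
      }, this.config.maxWait)
      this.waiters.push(waiter)
    })
  }
}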
```
**Real-world example: AWS Cell-Based Architecture**
AWS uses cell-based architecture for blast radius containment:

Each cell:
- Handles a subset of customers (often by customer ID hash)
- Has independent database, cache, and service instances
- Shares no state with other cells
- Can fail without affecting other cells
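
The customer-to-cell mapping mentioned in the first bullet can be sketched as a deterministic hash of the customer ID. This is an illustration only: the FNV-1a hash, the modulo placement, and the names `hashString` and `cellFor` are assumptions, not a description of how AWS routes traffic.

```typescript
// FNV-1a: a small, deterministic, non-cryptographic 32-bit hash,
// applied here to the string's UTF-16 code units.
function hashString(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return hash >>> 0 // force an unsigned 32-bit result
}

// Every request for the same customer lands in the same cell.
function cellFor(customerId: string, cellCount: number): number {
  return hashString(customerId) % cellCount
}
```

Plain modulo placement remaps most customers whenever `cellCount` changes, so a real router would more likely keep an explicit assignment table or use consistent hashing.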
**Real-world example: Shopify Pod Model**
Shopify isolates each merchant into a "pod"—a fully independent slice with its own:
- MySQL primary and replicas
- Redis cluster
- Memcached nodes
- Background job workers
A pod failure affects only merchants in that pod, not the entire platform.
**Trade-offs:**
| Aspect | Bulkhead | Shared Resources |
| ---------------------- | ---------------------- | ---------------- |
| Resource efficiency | Lower (duplication) | Higher |
| Blast radius | Contained | System-wide |
| Operational complexity | Higher (more units) | Lower |
| Cost | Higher | Lower |
| Recovery time | Faster (smaller scope) | Slower |
### Path 4: Timeout and Retry Strategy
**When to choose this path:**
- Failures are transient (network blips, temporary overload)
- Idempotent operations (safe to retry)
- Acceptable latency budget for retries
**Key characteristics:**
Timeouts prevent resource exhaustion from slow dependencies. Retries handle transient failures. Combined poorly, they create retry storms. Combined well, they provide resilience without amplification.
**Timeout configuration:**
Start with P99.9 latency plus 20-30% buffer:
| Metric | Value | Timeout |
| ------------- | ----- | ------------------ |
| P50 latency | 20ms | — |
| P90 latency | 80ms | — |
| P99 latency | 300ms | — |
| P99.9 latency | 800ms | 1000ms (800 + 25%) |
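
The arithmetic in the last row can be derived from raw latency samples. The sketch below uses the nearest-rank percentile method and a 25% buffer; both choices, and the function names `percentile` and `suggestTimeout`, are assumptions for illustration. A production system would normally read percentiles from its metrics backend.

```typescript
// Nearest-rank percentile over a sample of observed latencies (ms).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples")
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank, 1) - 1]
}

// Timeout = P99.9 latency plus a buffer (25% by default), rounded up to whole ms.
function suggestTimeout(samples: number[], bufferFraction = 0.25): number {
  return Math.ceil(percentile(samples, 99.9) * (1 + bufferFraction))
}
```

P99.9 is only meaningful with at least a few thousand samples; with fewer, the result is effectively the maximum observed latency.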
**Retry with exponential backoff and jitter:**
```typescript title="retry.ts" collapse={1-2, 25-35}
interface RetryConfig {
  maxAttempts: number // Total attempts (including first)
  baseDelay: number // Initial delay (ms)
  maxDelay: number // Cap on delay (ms)
  jitterFactor: number // 0-1, randomness factor
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function retryWithBackoff<T>(fn: () => Promise<T>, config: RetryConfig): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      return await fn()
    } catch (error) {
      lastError = error
      // Sleep only if another attempt will follow
      if (attempt < config.maxAttempts - 1) {
        const delay = calculateDelay(attempt, config)
        await sleep(delay)
      }
    }
  }
  throw lastError
}

function calculateDelay(attempt: number, config: RetryConfig): number {
  // Exponential: baseDelay * 2^attempt
  const exponential = config.baseDelay * Math.pow(2, attempt)
  // Cap at maxDelay
  const capped = Math.min(exponential, config.maxDelay)
  // Add jitter: multiply by random factor between (1 - jitter) and (1 + jitter)
  const jitter = 1 + (Math.random() * 2 - 1) * config.jitterFactor
  return Math.floor(capped * jitter)
}
```
**Why jitter matters:**
Without jitter, all clients retry at the same intervals, creating synchronized spikes:
```
Time:      0s   1s   2s   4s   8s
Client A:  X    R    R    R    R
Client B:  X    R    R    R    R
Client C:  X    R    R    R    R
           ↓    ↓    ↓    ↓    ↓
Load:      3    3    3    3    3   (spikes)
```
With jitter, retries spread across time:
```
Time:      0s    1s    2s    3s    4s    5s
Client A:  X     R              R
Client B:  X        R                 R
Client C:  X           R                 R
           ↓     ↓  ↓  ↓        ↓     ↓  ↓
Load:      3     1  1  1        1     1  1  (distributed)
```
**Jitter the first request too:**
Sophie Bits (Cloudflare) notes that even the first request needs jitter when many clients start simultaneously (deployment, recovery):
```typescript
// Before making the first request
await sleep(Math.random() * initialJitterWindow)
```
**Retry budgets (Google SRE approach):**
Instead of per-request retry limits, track retries as a percentage of traffic:
```typescript title="retry-budget.ts" collapse={1-3}
class RetryBudget {
  private requestCount = 0
  private retryCount = 0
  private readonly budgetPercent = 0.1 // 10% of traffic

  canRetry(): boolean {
    // Allow retry only if retries are under budget
    return this.retryCount < this.requestCount * this.budgetPercent
  }

  recordRequest(): void {
    this.requestCount++
  }

  recordRetry(): void {
    this.retryCount++
  }

  // Reset counters periodically (e.g., every minute)
  reset(): void {
    this.requestCount = 0
    this.retryCount = 0
  }
}
```
**Production configuration:**
| Parameter | Value | Rationale |
| ------------- | ------ | ---------------------------- |
| Max attempts | 3 | Diminishing returns beyond 3 |
| Base delay | 100ms | Fast enough for user-facing |
| Max delay | 30-60s | Prevent indefinite waits |
| Jitter factor | 0.5 | 50% randomness spreads load |
| Retry budget | 10% | Caps amplification |
**Trade-offs:**
| Aspect | Aggressive Retries | Conservative Retries |
| -------------------------- | ------------------ | -------------------- |
| Transient failure recovery | Better | Worse |
| Amplification risk | Higher | Lower |
| User-perceived latency | Variable | Predictable |
| Resource consumption | Higher | Lower |
### Path 5: Fallback Strategies
**When to choose this path:**
- Some response is better than no response
- Acceptable to serve stale or simplified data
- Clear degradation hierarchy defined
**Key characteristics:**
Fallbacks define what to return when the primary path fails. The hierarchy typically follows: fresh data → cached data → default data → error.
**Fallback hierarchy:**
```typescript title="fallback-chain.ts" collapse={1-4, 35-50}
interface FallbackChain<T> {
  primary: () => Promise<T>
  fallbacks: Array<{
    name: string
    fn: () => Promise<T>
    condition?: (error: Error) => boolean // Skip this fallback unless it returns true
  }>
  default: T
}

async function executeWithFallbacks<T>(chain: FallbackChain<T>): Promise<{
  result: T
  source: string
  degraded: boolean
}> {
  // Try primary
  try {
    return { result: await chain.primary(), source: "primary", degraded: false }
  } catch (primaryError) {
    // Normalize: a caught value is not guaranteed to be an Error
    const error = primaryError instanceof Error ? primaryError : new Error(String(primaryError))
    // Try fallbacks in order
    for (const fallback of chain.fallbacks) {
      if (fallback.condition && !fallback.condition(error)) {
        continue
      }
      try {
        const result = await fallback.fn()
        return { result, source: fallback.name, degraded: true }
      } catch {
        // Continue to next fallback
      }
    }
    // All fallbacks failed, return default
    return { result: chain.default, source: "default", degraded: true }
  }
}

// Usage example (Product, productService, cache, cdnCache, and id are assumed
// to exist in the surrounding application)
const productChain: FallbackChain<Product> = {
  primary: () => productService.getProduct(id),
  fallbacks: [
    { name: "cache", fn: () => cache.get(`product:${id}`) },
    { name: "cdn", fn: () => cdnCache.getProduct(id) },
  ],
  default: { id, name: "Product Unavailable", price: null },
}
```
**Common fallback patterns:**
| Pattern | Use Case | Trade-off |
| ------------------- | ---------------------------- | ----------------------- |
| Cached data | Read-heavy workloads | Staleness |
| Default values | Configuration, feature flags | Loss of personalization |
| Simplified response | Complex aggregations | Incomplete data |
| Read-only mode | Write path failures | No updates |
| Static content | Complete backend failure | No dynamic data |
**Real-world example: Netflix Recommendations**
When Netflix's recommendation service is slow or unavailable:
1. **Primary**: Personalized recommendations from ML pipeline
2. **Fallback 1**: Cached recommendations (updated hourly)
3. **Fallback 2**: Popular content in user's region
4. **Fallback 3**: Globally popular content
5. **Default**: Static "Trending Now" list
The UI doesn't change—users see recommendations regardless of which tier served them.
**Real-world example: Feature Flag Fallbacks**
PostHog improved feature flag resilience by implementing local evaluation:
- **Primary**: Real-time flag evaluation via API (500ms latency)
- **Fallback**: Local evaluation with cached definitions (10-20ms)
- **Default**: Hard-coded default values
Result: Flags work even if PostHog's servers are unreachable.
**Trade-offs:**
| Aspect | Rich Fallbacks | Simple Fallbacks |
| ------------------------- | ------------------- | ----------------- |
| User experience | Better degraded UX | Worse degraded UX |
| Implementation complexity | Higher | Lower |
| Testing burden | Higher (more paths) | Lower |
| Cache infrastructure | Required | Optional |
| Staleness risk | Higher | Lower |
### Decision Framework

## Production Implementations
### Netflix: Resilience Library Evolution
**Context:** Netflix serves 200M+ subscribers with hundreds of microservices. A single page load touches dozens of services.
**Implementation evolution:**
1. **2011-2012**: Hystrix developed internally
2. **2012**: Hystrix open-sourced, became industry standard
3. **2018**: Hystrix enters maintenance mode
4. **2019+**: Resilience4j recommended for new projects
**Why the transition?**
Hystrix was designed for Java 6/7 with thread pool isolation as the primary mechanism. Resilience4j uses:
- Java 8 functional composition
- Lighter weight (Vavr was its only external dependency in 1.x, and 2.x removed it)
- Composable decorators (stack circuit breaker + rate limiter + retry)
- Better metrics integration
**Key learnings from Netflix:**
> "Invest more in making your primary code reliable than in building elaborate fallbacks. A fallback you've never tested in production is a fallback that doesn't work."
**Specific metrics:**
- Hystrix processed tens of billions of thread-isolated calls daily
- Hundreds of billions of semaphore-isolated calls daily
- Circuit breaker trip rate: target < 0.1% during normal operation
### AWS: Cell-Based Architecture
**Context:** AWS services must maintain regional isolation—a failure in us-east-1 shouldn't affect eu-west-1.
**Implementation:**
- Each cell is an independent stack (compute, storage, networking)
- Cells share no state; each has own database
- Shuffle sharding maps customers to multiple cells, minimizing correlated failures
- Global routing layer directs traffic to healthy cells
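
Shuffle sharding, mentioned in the list above, gives each customer a small pseudo-random subset of cells so that two customers rarely share the same full set. The sketch below derives the subset deterministically from the customer ID; the hash, the seeded generator, and the names `hashString`, `seededRandom`, and `shardFor` are all assumptions for illustration, not AWS's implementation.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units, used to seed the generator.
function hashString(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return hash >>> 0
}

// Small deterministic PRNG in the style of mulberry32; returns values in [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0
  return () => {
    state = (state + 0x6d2b79f5) >>> 0
    let t = state
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Pick `shardSize` distinct cells out of `cellCount` for this customer.
function shardFor(customerId: string, cellCount: number, shardSize: number): number[] {
  if (shardSize > cellCount) throw new Error("shardSize exceeds cellCount")
  const rand = seededRandom(hashString(customerId))
  const cells = Array.from({ length: cellCount }, (_, i) => i)
  // Partial Fisher-Yates shuffle: only the first `shardSize` positions are needed
  for (let i = 0; i < shardSize; i++) {
    const j = i + Math.floor(rand() * (cellCount - i))
    const tmp = cells[i]
    cells[i] = cells[j]
    cells[j] = tmp
  }
  return cells.slice(0, shardSize)
}
```

With 8 cells and shards of size 2 there are 28 possible pairs, so a failure triggered by one customer fully overlaps with only a small fraction of other customers.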
**Configuration:**
- Cell size: 1-5% of total capacity per cell
- Minimum cells: 3 per region for redundancy
- Health check interval: 5-10 seconds
- Failover time: < 60 seconds
**Results:**
- Blast radius limited to cell size (1-5% of customers)
- Regional failures don't cascade globally
- Recovery is cell-by-cell, not all-or-nothing
### Slack: CI/CD Circuit Breakers
**Context:** Slack's CI/CD pipeline had cascading failures between systems, causing developer frustration.
**Implementation:**
Slack's Checkpoint system implements orchestration-level circuit breakers:
1. **Metrics collection**: Error rates, latency percentiles, queue depths
2. **Awareness phase**: Alert teams when error rates rise (before tripping)
3. **Trip decision**: Automated trip when thresholds exceeded
4. **Recovery**: Gradual traffic increase after manual verification
**Results (since 2020):**
- No cascading failures between CI/CD systems
- Increased service availability
- Fewer flaky developer experiences
> "The key insight was sharing visibility _before_ opening the circuit. Teams get a heads-up that their system is approaching the threshold, often fixing issues before the breaker trips."
### Shopify: Pod-Based Isolation
**Context:** Shopify handles 30TB+ data per minute during peak sales, with 100x traffic spikes.
**Implementation:**
- **Pod model**: Each merchant assigned to a pod
- **Pod contents**: Dedicated MySQL, Redis, Memcached, workers
- **Graceful degradation rules**:
- Checkout: Never degrades
- Cart: Degrades last
- Recommendations: First to shed
- Inventory counts: Can show stale data
**Tools:**
- **Toxiproxy**: Simulates network failures before production
- **Packwerk**: Enforces module boundaries in monolith
**Results:**
- Flash sales handled without checkout degradation
- Merchant isolation prevents noisy neighbor problems
- Predictable performance under load
### Discord: Backpressure and Coalescing
**Context:** Discord handles 1M+ push requests per minute with extreme traffic spikes during gaming events.
**Implementation:**
- **GenStage (Elixir)**: Built-in backpressure for message processing
- **Request coalescing**: Deduplicate identical requests in Rust services
- **Consistent hash routing**: Same requests route to same servers, improving deduplication
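
Request coalescing can be sketched as a map of in-flight promises keyed by request identity: concurrent callers asking for the same key share one underlying load. Discord's services are written in Rust; the TypeScript below only illustrates the idea, and the `Coalescer` class is a hypothetical name.

```typescript
// Deduplicates concurrent loads for the same key by sharing one in-flight promise.
class Coalescer<K, V> {
  private readonly inFlight = new Map<K, Promise<V>>()

  get(key: K, load: () => Promise<V>): Promise<V> {
    const existing = this.inFlight.get(key)
    if (existing) return existing // Join the request that is already running

    // Remove the entry once the load settles so later calls fetch fresh data
    const promise = load().finally(() => this.inFlight.delete(key))
    this.inFlight.set(key, promise)
    return promise
  }
}
```

Coalescing only helps when identical requests reach the same process, which is why it pairs with the consistent hash routing described above.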
**Results:**
- Eliminated hot partitions
- Stable latencies during 100x spikes
- Fewer on-call emergencies
## Implementation Guide
### Starting Point Decision

### Library Options
| Library | Language | Patterns | Maturity | Notes |
| ------------- | ------------ | -------------------------------- | ---------- | ---------------------------- |
| Resilience4j | Java/Kotlin | CB, Retry, Bulkhead, RateLimiter | Production | Netflix recommended |
| Polly | .NET | CB, Retry, Timeout, Bulkhead | Production | Extensive policy composition |
| opossum | Node.js | Circuit Breaker | Production | Simple, well-tested |
| cockatiel | Node.js | CB, Retry, Timeout, Bulkhead | Production | TypeScript-first |
| go-resiliency | Go | CB, Retry, Semaphore | Production | Simple, idiomatic Go |
| Istio | Service Mesh | CB, Retry, Timeout | Production | No code changes, YAML config |
### Service Mesh Configuration (Istio)
```yaml title="istio-destination-rule.yaml"
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: user-service
spec:
  host: user-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```
```yaml title="istio-virtual-service.yaml" collapse={1-8}
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: user-service
spec:
  hosts:
    - user-service
  http:
    - route:
        - destination:
            host: user-service
      timeout: 5s
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx,reset,connect-failure
```
### Implementation Checklist
- [ ] **Define degradation hierarchy**: Document each level and its trade-offs
- [ ] **Identify critical paths**: Which features must never degrade?
- [ ] **Set timeout SLOs**: Based on P99.9 + buffer, not arbitrary values
- [ ] **Configure circuit breakers**: Tune thresholds for your traffic patterns
- [ ] **Implement fallbacks**: Test that they actually work under failure
- [ ] **Add retry budgets**: Prevent amplification (< 10% of traffic)
- [ ] **Instrument everything**: Metrics for each degradation level
- [ ] **Test failure modes**: Chaos engineering before production incidents
- [ ] **Document recovery procedures**: How to verify system is healthy
## Common Pitfalls
### 1. Untested Fallbacks
**The mistake:** Building fallback paths that are never exercised in production.
**Example:** A team builds an elaborate cache fallback for their user service. In production, the cache is always warm, so the fallback code path is never executed. When the cache fails during an incident, the fallback has a bug that causes a null pointer exception.
**Solutions:**
- Chaos engineering: Regularly fail dependencies in production
- Fallback testing: Exercise fallback paths in staging/canary
- Synthetic monitoring: Periodically call fallback paths directly
### 2. Synchronized Retry Storms
**The mistake:** Retrying immediately, or backing off without jitter, so that every client retries in lockstep.
**Example:** A service has 10,000 clients. The database has a 1-second outage. Without jitter, all 10,000 clients retry at exactly 1 second, then 2 seconds, then 4 seconds—creating synchronized spikes that prevent recovery.
**Solutions:**
- Always use exponential backoff with jitter
- Implement retry budgets (< 10% of traffic)
- Jitter the first request after deployment/restart
### 3. Circuit Breaker Thrashing
**The mistake:** Tuning a circuit breaker so loosely that it opens and closes in rapid succession, which can be worse than having no breaker at all.
**Example:** A circuit breaker with `failureThreshold=2` and `successThreshold=1` trips after two failures, closes after one success, trips again immediately. The constant state changes create overhead without providing stability.
**Solutions:**
- Increase `minimumNumberOfCalls` (at least 5-10)
- Use sliding window for failure rate, not raw counts
- Extend `waitDurationInOpenState` to allow real recovery
- Require multiple successes in half-open state
### 4. Stale Data Without Indication
**The mistake:** Serving cached data without indicating staleness to users.
**Example:** A stock trading app falls back to cached prices during an outage. Users see prices from 30 minutes ago but think they're current, making trades based on stale data.
**Solutions:**
- Show "as of" timestamps prominently
- Visual indicators for degraded state (yellow banner, icon)
- Disable actions that require fresh data (trading, purchasing)
- Set TTL limits on how stale data can be served
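
The last two solutions can be combined in the cache read path: return the value together with its age so the UI can show an "as of" timestamp, and refuse to serve anything older than a hard limit. This is a minimal sketch; the `CacheEntry` and `StaleRead` shapes and the `readCached` function are hypothetical names, and times are epoch milliseconds.

```typescript
interface CacheEntry<T> {
  value: T
  fetchedAt: number // epoch ms when the value was fetched from the source
}

interface StaleRead<T> {
  value: T
  asOf: number // timestamp to display next to the value
  ageMs: number
  stale: boolean // true when the UI should show a degraded indicator
}

function readCached<T>(
  entry: CacheEntry<T> | undefined,
  now: number,
  freshForMs: number,
  maxStaleMs: number,
): StaleRead<T> | null {
  if (!entry) return null
  const ageMs = now - entry.fetchedAt
  if (ageMs > maxStaleMs) return null // Too old to serve at all
  return { value: entry.value, asOf: entry.fetchedAt, ageMs, stale: ageMs > freshForMs }
}
```

Callers treat `null` as a cache miss and fall through to the next fallback tier; a `stale: true` result is the signal to disable actions such as trading that need fresh data.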
### 5. Missing Bulkheads
**The mistake:** Sharing thread pools/connection pools across all dependencies.
**Example:** Service A has 100 threads. Dependency B becomes slow (10s response time). All 100 threads block on B. Dependency C (healthy) becomes unreachable because no threads are available—a slow dependency causes total failure.
**Solutions:**
- Separate thread pools per dependency
- Connection pool limits per downstream service
- Semaphore isolation for fast dependencies
- Thread pool isolation for slow/risky dependencies
### 6. Load Shedding Without Priority
**The mistake:** Shedding requests randomly instead of by priority.
**Example:** Under load, a checkout service sheds 50% of requests randomly. Half of payment confirmations are lost, causing customer support incidents. Meanwhile, prefetch requests for thumbnails continue consuming capacity.
**Solutions:**
- Define priority tiers (critical, high, medium, low)
- Shed lowest priority first
- Never shed critical paths (payments, health checks)
- Tag requests with priority at entry point
## Conclusion
Graceful degradation is not a single pattern but a discipline: designing systems with explicit failure modes, testing those modes regularly, and accepting that partial functionality serves users better than complete outages.
The most resilient systems share these characteristics:
- **Degradation hierarchy is documented**: Teams know exactly what fails first and what never fails
- **Fallbacks are tested**: Chaos engineering proves fallbacks work before incidents
- **Amplification is controlled**: Retry budgets and jitter prevent self-inflicted outages
- **Blast radius is contained**: Bulkheads ensure one failure doesn't become total failure
Start simple—timeouts and circuit breakers cover most cases. Add complexity (load shedding, cell architecture) only when scale demands it.
## Appendix
### Prerequisites
- Distributed systems fundamentals (network partitions, CAP theorem)
- Service-oriented architecture concepts
- Basic understanding of observability (metrics, traces, logs)
### Terminology
- **Blast Radius**: The scope of impact when a component fails—smaller is better
- **Bulkhead**: Isolation boundary preventing failures from spreading
- **Circuit Breaker**: Pattern that stops calling a failing dependency after threshold exceeded
- **Degradation Hierarchy**: Ordered list of fallback behaviors from full to minimal functionality
- **Jitter**: Random variation added to timing to prevent synchronized behavior
- **Load Shedding**: Rejecting excess requests to maintain latency for admitted requests
- **Retry Budget**: Cap on retries as percentage of normal traffic (typically 10%)
- **Thundering Herd**: Many clients simultaneously retrying or reconnecting after an outage
### Summary
- Graceful degradation defines explicit intermediate states between healthy and failed, with documented trade-offs at each level
- Circuit breakers prevent cascade failures by stopping calls to failing dependencies; tune thresholds based on traffic patterns, not defaults
- Load shedding protects capacity by rejecting low-priority work early; define priority tiers and never shed critical paths
- Bulkheads contain blast radius through isolation (thread pools, cells, pods); the trade-off is resource duplication
- Retries need exponential backoff with jitter plus retry budgets (< 10%) to prevent amplification
- Fallbacks must be tested regularly through chaos engineering—untested fallbacks don't work when needed
### References
**Foundational Resources:**
- [AWS Well-Architected Framework: Reliability Pillar](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html) - Graceful degradation best practices
- [Google SRE Book: Handling Overload](https://sre.google/sre-book/handling-overload/) - Load shedding and admission control
- [Release It! Second Edition](https://pragprog.com/titles/mnee2/release-it-second-edition/) by Michael Nygard - Stability patterns including circuit breakers and bulkheads
**Pattern Implementations:**
- [Netflix Tech Blog: Introducing Hystrix](http://techblog.netflix.com/2012/11/hystrix.html) - Original circuit breaker at scale
- [Resilience4j Documentation](https://resilience4j.readme.io/docs/circuitbreaker) - Modern resilience library
- [AWS Builders' Library: Using Load Shedding](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/) - Server-side admission control
- [AWS Builders' Library: Timeouts, Retries, and Backoff with Jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) - Retry strategy design
**Production Case Studies:**
- [Slack Engineering: Circuit Breakers for CI/CD](https://slack.engineering/circuit-breakers/) - Orchestration-level resilience
- [Discord: How Discord Stores Trillions of Messages](https://discord.com/blog/how-discord-stores-trillions-of-messages) - Backpressure and coalescing
- [AWS: Cell-Based Architecture](https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/what-is-a-cell-based-architecture.html) - Blast radius containment
**Failure Analysis:**
- [Microsoft Azure: Retry Storm Antipattern](https://learn.microsoft.com/en-us/azure/architecture/antipatterns/retry-storm/) - Retry amplification
- [Encore: Thundering Herd Problem](https://encore.dev/blog/thundering-herd-problem) - Recovery coordination
- [InfoQ: Anatomy of a Cascading Failure](https://www.infoq.com/articles/anatomy-cascading-failure/) - Failure propagation mechanisms
---
## Virtualization and Windowing
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/frontend-system-design/virtualization-and-windowing
**Category:** System Design / Frontend System Design
**Description:** Rendering large lists (1,000+ items) without virtualization creates a DOM tree so large that layout calculations alone can block the main thread for hundreds of milliseconds. Virtualization solves this by rendering only visible items plus a small buffer, keeping DOM node count constant regardless of list size. The trade-off: implementation complexity for consistent O(viewport) rendering performance.
# Virtualization and Windowing
Rendering large lists (1,000+ items) without virtualization creates a DOM tree so large that layout calculations alone can block the main thread for hundreds of milliseconds. Virtualization solves this by rendering only visible items plus a small buffer, keeping DOM node count constant regardless of list size. The trade-off: implementation complexity for consistent O(viewport) rendering performance.

Virtualization trades algorithmic complexity for constant DOM size, keeping frame times under the 16ms budget.
## Abstract
Virtualization is a **viewport-centric rendering strategy**: calculate which items are visible, render only those plus a buffer (overscan), and position them using GPU-accelerated transforms. The core insight is that users can only see viewport-sized content at any moment—rendering everything else is wasted work.
Three implementation approaches exist:
1. **Fixed-height**: Pure math-based positioning when all items are identical height. Simplest, fastest, but rarely applicable to real content.
2. **Variable-height**: Measure items as they render, cache heights, estimate unmeasured items. Production standard for dynamic content.
3. **DOM recycling**: Reuse DOM nodes as items scroll out, updating content rather than creating/destroying elements.
Critical constraints shape the design:
- **Frame budget**: 16ms per frame for 60fps, ~12ms after browser overhead
- **Transform positioning**: GPU-accelerated `translateY()` vs layout-triggering `top`
- **Measurement APIs**: ResizeObserver for heights, IntersectionObserver for visibility
- **Accessibility**: Screen readers require ARIA live regions; keyboard navigation requires focus management
A newer alternative, CSS `content-visibility: auto`, defers rendering while keeping full DOM—enabling browser find-in-page and anchor links that virtualization breaks.
## The Challenge
### Browser Constraints
The browser's rendering pipeline processes elements sequentially:
1. **Style** - Match CSS rules to elements
2. **Layout** - Calculate element geometry (expensive, proportional to DOM size)
3. **Paint** - Fill pixels for colors, text, shadows
4. **Composite** - Assemble layers, apply GPU transforms
Layout is the bottleneck. As documented in Chrome DevTools performance analysis, layout time scales with DOM size because the engine must resolve every element's position relative to its ancestors and siblings.
**Main thread budget**: At 60fps, each frame has about 16.7ms total. The browser typically needs roughly 4-6ms for its own work, leaving about 10-12ms for JavaScript execution, style recalculation, and layout combined. A single layout pass on 1,000 elements can exceed this entirely.
**Memory pressure**: Each DOM node consumes memory for its JavaScript wrapper, style data, and layout information. On low-end mobile devices with 50-100MB practical limits, 10,000+ nodes can trigger garbage collection pauses or crashes.
### When Virtualization Becomes Necessary
| Item Count | Without Virtualization | With Virtualization |
| ---------- | ---------------------------------------- | ------------------- |
| 50-100 | Acceptable on desktop, measure on mobile | Optional |
| 500+ | Noticeable jank on scroll | Recommended |
| 1,000+ | Severe degradation, blocked frames | Required |
| 10,000+ | Unusable | Only viable path |
### User Experience Requirements
- **Perceived smoothness**: Scroll must feel native (60fps, no jumps)
- **Instant response**: Arrow key navigation, click handling under 100ms
- **Scroll position stability**: Jumping to arbitrary positions via scrollbar must work
- **Find-in-page**: Users expect Cmd/Ctrl+F to search all content (virtualization breaks this)
## Design Paths
### Path 1: Fixed-Height Virtualization
**How it works:**
When all items share identical height, positioning is pure arithmetic—no measurement required.
```typescript title="fixed-height-virtualizer.ts" collapse={1-3,29-35}
import { useState, useCallback } from 'react';
interface FixedVirtualizerProps<T> {
  items: T[];
  itemHeight: number;
  containerHeight: number;
  overscan?: number;
  renderItem: (item: T, index: number) => React.ReactNode;
}
function FixedVirtualizer<T>({
  items,
  itemHeight,
  containerHeight,
  overscan = 3,
  renderItem,
}: FixedVirtualizerProps<T>) {
  const [scrollTop, setScrollTop] = useState(0);
  // Calculate visible range using arithmetic only
  const startIndex = Math.max(0, Math.floor(scrollTop / itemHeight) - overscan);
  const endIndex = Math.min(
    items.length,
    Math.ceil((scrollTop + containerHeight) / itemHeight) + overscan
  );
  const visibleItems = items.slice(startIndex, endIndex);
  const offsetY = startIndex * itemHeight;
  const totalHeight = items.length * itemHeight;
  const handleScroll = useCallback((e: React.UIEvent<HTMLDivElement>) => {
    setScrollTop(e.currentTarget.scrollTop);
  }, []);
  // Layout sketch: scroll container, full-height spacer, translated window of items
  return (
    <div style={{ height: containerHeight, overflow: 'auto' }} onScroll={handleScroll}>
      <div style={{ height: totalHeight, position: 'relative' }}>
        <div style={{ transform: `translateY(${offsetY}px)` }}>
          {visibleItems.map((item, i) => (
            <div key={startIndex + i} style={{ height: itemHeight }}>
              {renderItem(item, startIndex + i)}
            </div>
          ))}
        </div>
      </div>
    </div>
  );
}
```
**Why `transform` instead of `top`:**
- `transform: translateY()` is compositor-only—GPU handles it without triggering layout
- `top` or `margin-top` triggers full layout recalculation on every scroll frame
- This single choice can be the difference between 60fps and 15fps scrolling
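Fixed heights also make programmatic scrolling pure arithmetic, since every item's top edge is `index * itemHeight`. The following is a minimal sketch under that assumption; `scrollTopForIndex` and its `align` option are hypothetical names for illustration and are not part of the component above.

```typescript
type Align = "start" | "center" | "end"

// Hypothetical helper: the scrollTop that brings item `index` into view.
function scrollTopForIndex(
  index: number,
  itemCount: number,
  itemHeight: number,
  containerHeight: number,
  align: Align = "start",
): number {
  const itemTop = index * itemHeight
  let target = itemTop
  if (align === "center") target = itemTop - (containerHeight - itemHeight) / 2
  if (align === "end") target = itemTop - (containerHeight - itemHeight)
  // Clamp to the scrollable range [0, totalHeight - containerHeight]
  const maxScrollTop = Math.max(0, itemCount * itemHeight - containerHeight)
  return Math.min(maxScrollTop, Math.max(0, target))
}
```

Assigning the result to the scroll container's `scrollTop` would then trigger the normal scroll handler and re-render the window around the target item.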
**Performance characteristics:**
| Metric | Value |
| ------------------------- | ------------------------------- |
| DOM nodes | O(viewport) — typically 20-50 |
| Layout time | 1-3ms per frame |
| Memory | O(viewport) — no caching needed |
| Scroll performance | Consistent 60fps |
| Implementation complexity | Low |
**Best for:**
- Log viewers with monospace text
- Simple data tables with uniform rows
- Chat applications with text-only messages
- Any list where enforcing fixed height is acceptable
**Trade-offs:**
- ✅ Simplest implementation, smallest bundle contribution
- ✅ Predictable performance—pure math, no measurement
- ✅ No height cache memory overhead
- ❌ Real content (images, variable text, expandable sections) rarely fits
- ❌ Forces design constraints on UI
**Real-world example:**
VS Code's minimap uses fixed-height line rendering. Each line is represented at a consistent pixel height regardless of content wrapping in the main editor. This enables sub-millisecond position calculations for files with 100K+ lines.
### Path 2: Variable-Height Virtualization
**How it works:**
Items are measured as they render using ResizeObserver. A height cache stores measurements indexed by item. Unmeasured items use an estimated height.
```typescript title="variable-height-virtualizer.ts" collapse={1-5,60-80}
import { useState, useEffect, useRef, useCallback } from "react"
interface VariableVirtualizerProps<T> {
items: T[]
estimatedItemHeight: number
containerHeight: number
overscan?: number
renderItem: (item: T, index: number, measureRef: (el: HTMLElement | null) => void) => React.ReactNode
}
function VariableVirtualizer<T>({
items,
estimatedItemHeight,
containerHeight,
overscan = 3,
renderItem,
}: VariableVirtualizerProps<T>) {
const [scrollTop, setScrollTop] = useState(0)
const [heightCache, setHeightCache] = useState<Map<number, number>>(new Map())
// Calculate positions with measured or estimated heights
const getItemOffset = (index: number): number => {
let offset = 0
for (let i = 0; i < index; i++) {
offset += heightCache.get(i) ?? estimatedItemHeight
}
return offset
}
// Binary search to find start index from scroll position
const findStartIndex = (scrollPos: number): number => {
let low = 0,
high = items.length - 1
while (low < high) {
const mid = Math.floor((low + high) / 2)
if (getItemOffset(mid + 1) <= scrollPos) {
low = mid + 1
} else {
high = mid
}
}
return Math.max(0, low - overscan)
}
const startIndex = findStartIndex(scrollTop)
// Find end index by accumulating heights until we exceed viewport
let endIndex = startIndex
let accumulatedHeight = 0
while (endIndex < items.length && accumulatedHeight < containerHeight + overscan * estimatedItemHeight) {
accumulatedHeight += heightCache.get(endIndex) ?? estimatedItemHeight
endIndex++
}
endIndex = Math.min(items.length, endIndex + overscan)
const totalHeight = getItemOffset(items.length)
const offsetY = getItemOffset(startIndex)
// Measure items as they render
const measureElement = useCallback(
(index: number) => (el: HTMLElement | null) => {
if (el) {
const observer = new ResizeObserver(([entry]) => {
const height = entry.contentRect.height
setHeightCache((prev) => {
if (prev.get(index) === height) return prev
const next = new Map(prev)
next.set(index, height)
return next
})
})
observer.observe(el)
return () => observer.disconnect()
}
},
[],
)
// ... scroll handler and render logic
}
```
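Note that `getItemOffset` above walks the cache from index 0 on every call, so each binary search step costs O(n). One common way to avoid this is to precompute prefix sums whenever the height cache changes and search those instead. The sketch below assumes that approach; `buildOffsets` and `findIndexAtOffset` are hypothetical helper names, not part of the component above.

```typescript
// Hypothetical helper: offsets[i] is the top of item i; offsets[count] is the total height.
function buildOffsets(count: number, heights: Map<number, number>, estimate: number): number[] {
  const offsets = new Array<number>(count + 1)
  offsets[0] = 0
  for (let i = 0; i < count; i++) {
    offsets[i + 1] = offsets[i] + (heights.get(i) ?? estimate)
  }
  return offsets
}

// Binary search for the item whose [top, bottom) range contains scrollPos.
function findIndexAtOffset(offsets: number[], scrollPos: number): number {
  let low = 0
  let high = offsets.length - 2 // index of the last item
  while (low < high) {
    const mid = (low + high) >>> 1
    if (offsets[mid + 1] <= scrollPos) low = mid + 1
    else high = mid
  }
  return Math.max(0, low)
}
```

Rebuilding the array is still O(n), but it happens once per cache change rather than once per lookup; in a React component it could plausibly be memoized on the height cache.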
**The estimation problem:**
When users drag the scrollbar to an arbitrary position, the virtualizer must render items that haven't been measured. Initial render uses estimates, which may be wrong—causing a visible "jump" as measurements arrive and positions correct.
**Mitigation strategies:**
1. **Running average**: Track average measured height, use for estimates
2. **Type-based estimates**: If items have types (text, image, card), use type-specific averages
3. **Larger buffers**: Overscan more items to catch estimation errors before they're visible
4. **Progressive correction**: Animate position corrections rather than snapping
**Performance characteristics:**
| Metric | Value |
| ------------------------- | ------------------------------------------- |
| DOM nodes | O(viewport) — typically 20-50 |
| Layout time | 2-5ms (measurement overhead) |
| Memory | O(n) for height cache in worst case |
| Scroll performance | 30-60fps depending on measurement frequency |
| Implementation complexity | High |
**Best for:**
- Social media feeds with mixed content
- Chat applications with images, embeds, reactions
- Any dynamic content where height enforcement is impossible
**Trade-offs:**
- ✅ Handles real-world content naturally
- ✅ Smooth scroll with proper buffering
- ❌ Height cache grows with items scrolled through
- ❌ Scroll position jumps possible on fast scrollbar drag
- ❌ Complex implementation with many edge cases
**Real-world example:**
Twitter/X uses variable-height virtualization for the home timeline. Tweets contain text, images, videos, polls, and embedded content—all with different heights. They mitigate scroll jumps by maintaining a larger buffer in the scroll direction and using type-based height estimates (tweets with images get higher estimates).
### Path 3: DOM Recycling
**How it works:**
Instead of creating and destroying DOM elements as items enter/exit the viewport, a fixed pool of elements is reused. When an item scrolls out, its DOM node is repositioned and its content updated with the new item.
```typescript title="dom-recycling-concept.ts"
// Conceptual representation - production implementations are more complex
class DOMRecycler {
private pool: HTMLElement[] = []
private poolSize: number
constructor(viewportSize: number, overscan: number) {
// Fixed pool size based on viewport, never grows
this.poolSize = viewportSize + overscan * 2
this.initializePool()
}
private initializePool() {
for (let i = 0; i < this.poolSize; i++) {
const element = document.createElement("div")
element.className = "virtual-item"
this.pool.push(element)
}
}
// When item scrolls out of view
recycleElement(element: HTMLElement, newItem: Item, newPosition: number) {
// Reuse same DOM node - update content, reposition
element.textContent = newItem.content
element.style.transform = `translateY(${newPosition}px)`
// No DOM creation/destruction - just property updates
}
}
```
**Why recycling improves performance:**
- **No GC pressure**: Element creation allocates memory; destruction triggers garbage collection
- **Warm caches**: Reused elements have cached style computations
- **Predictable memory**: Pool size is constant regardless of list length
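The pooling idea can be sketched independently of the DOM. The following is an illustrative, assumed design (the `RecyclePool` class and its method names are hypothetical, not taken from any library): nodes are handed out per item index and returned to a free list once their index leaves the rendered range.

```typescript
// Hypothetical pool: hands out reusable nodes keyed by the item index they display.
class RecyclePool<T> {
  private free: T[] = []
  private inUse = new Map<number, T>()
  private created = 0
  private createNode: () => T

  constructor(createNode: () => T) {
    this.createNode = createNode
  }

  // Get the node for `index`, reusing a released node when one is available.
  acquire(index: number): T {
    const existing = this.inUse.get(index)
    if (existing !== undefined) return existing
    let node = this.free.pop()
    if (node === undefined) {
      node = this.createNode()
      this.created++
    }
    this.inUse.set(index, node)
    return node
  }

  // Return every node whose index falls outside [start, end) to the free list.
  releaseOutside(start: number, end: number): void {
    this.inUse.forEach((node, index) => {
      if (index < start || index >= end) {
        this.inUse.delete(index)
        this.free.push(node)
      }
    })
  }

  // Number of nodes ever created; stays flat once the pool is warm.
  get createdCount(): number {
    return this.created
  }
}
```

In a browser, `createNode` would typically wrap `document.createElement`, and the caller would update content and `transform` on each acquired node.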
**Performance characteristics:**
| Metric | Value |
| -------------- | ------------------------------- |
| DOM nodes | Fixed pool size (constant) |
| GC pauses | Eliminated during scroll |
| Memory pattern | Flat—no growth with scroll |
| Initial render | Slightly slower (pool creation) |
**Real-world example:**
AG Grid uses DOM recycling for both row and cell elements in data grids. When scrolling a 100K-row grid, the same ~50 row elements are reused continuously. Their documentation notes this eliminates GC pauses that would otherwise occur every few seconds during continuous scrolling.
### Path 4: CSS `content-visibility` (Newer Alternative)
**How it works:**
The `content-visibility` CSS property tells the browser to skip rendering work for off-screen elements while keeping them in the DOM.
```css title="content-visibility-example.css"
.list-item {
content-visibility: auto;
contain-intrinsic-size: 0 100px; /* Width height estimate */
}
```
**Values:**
- `visible` (default): Render normally
- `auto`: Skip rendering when off-screen, render when visible
- `hidden`: Never render the element's contents (similar to `display: none` on the contents, but the element's own box stays in layout and its rendering state is preserved)
**Critical requirement—`contain-intrinsic-size`:**
Without this, off-screen elements contribute zero height to layout, collapsing the scrollbar. The browser needs height estimates to maintain proper scroll dimensions.
**Browser support (as of 2025):**
All major browsers support `content-visibility` (Chrome 85+, Firefox 125+, Safari 18+). Edge cases remain in Safari for nested scrolling contexts.
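For progressive enhancement, support can be checked at runtime with the standard `CSS.supports()` API. The sketch below is one possible shape; `pickRenderingMode` and its injectable `supports` parameter are hypothetical names, added here so the logic can run outside a browser.

```typescript
type SupportsFn = (property: string, value: string) => boolean

// Hypothetical helper: choose a rendering strategy based on CSS support.
function pickRenderingMode(supports?: SupportsFn): "content-visibility" | "virtualize" {
  // In a browser, fall back to the real CSS.supports(); elsewhere (SSR, tests) assume no support.
  const cssApi = (globalThis as unknown as { CSS?: { supports?: SupportsFn } }).CSS
  const check: SupportsFn = supports ?? ((p, v) => cssApi?.supports?.(p, v) ?? false)
  return check("content-visibility", "auto") ? "content-visibility" : "virtualize"
}
```

A caller might use the result to decide whether to render the full list with the CSS class applied or to mount a virtualizer instead.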
**Performance characteristics:**
| Metric | Value |
| ------------------------- | -------------------------------------------- |
| DOM nodes | All items remain in DOM |
| Layout time | O(viewport) for painting, full DOM for style |
| Memory | O(n) — all items exist |
| Find-in-page | ✅ Works natively |
| Anchor links | ✅ Works natively |
| Implementation complexity | Low (CSS only) |
**Trade-offs vs virtualization:**
| Capability | Virtualization | content-visibility |
| --------------------- | ---------------- | ------------------ |
| Find-in-page (Ctrl+F) | ❌ Broken | ✅ Works |
| Anchor links (#id) | ❌ Broken | ✅ Works |
| Memory usage | O(viewport) | O(n) |
| Bundle size impact | Library required | Zero (CSS) |
| Browser support | Universal | Modern browsers |
| Accessibility | Requires ARIA | Native |
**When to use `content-visibility` over virtualization:**
- Lists under 10,000 items where memory isn't critical
- When find-in-page is required
- When anchor navigation is required
- When simplicity is prioritized over minimal memory
- Progressive enhancement scenarios
**Real-world example:**
web.dev reports roughly a 7x rendering performance improvement on initial load for a demo travel blog page after applying `content-visibility: auto` to chunked content sections. The full DOM still exists for search engines and find-in-page, but off-screen sections don't consume rendering time.
### Decision Matrix
| Factor | Fixed-Height | Variable-Height | DOM Recycling | content-visibility |
| --------------------- | ------------ | --------------- | ------------- | ------------------ |
| Item count limit | Unlimited | Unlimited | Unlimited | ~10K (memory) |
| Dynamic heights | ❌ | ✅ | ✅ | ✅ |
| Find-in-page | ❌ | ❌ | ❌ | ✅ |
| Memory efficiency | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐ |
| Implementation effort | Low | High | High | Trivial |
| GC pauses | Possible | Possible | Eliminated | Possible |
| Browser support | Universal | Universal | Universal | Modern |
### Decision Framework
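One way to read the decision matrix above is as a short chain of checks. The sketch below encodes that reading; the `chooseStrategy` function, its input shape, and the exact cutoffs (taken from the tables in this article) are illustrative assumptions rather than hard rules. DOM recycling is treated as an optimization layered on either virtualization path, so it is not a separate branch here.

```typescript
type Strategy = "none" | "content-visibility" | "fixed-height" | "variable-height"

interface ListRequirements {
  itemCount: number
  uniformHeight: boolean
  needsFindInPage: boolean
}

// Hypothetical decision helper; thresholds mirror the tables in this article.
function chooseStrategy({ itemCount, uniformHeight, needsFindInPage }: ListRequirements): Strategy {
  // Small lists: plain rendering is usually acceptable
  if (itemCount <= 100) return "none"
  // Find-in-page only works when every item stays in the DOM
  if (needsFindInPage && itemCount < 10_000) return "content-visibility"
  // Otherwise virtualize, using pure arithmetic when heights are uniform
  return uniformHeight ? "fixed-height" : "variable-height"
}
```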

## Browser APIs for Virtualization
### ResizeObserver for Height Measurement
ResizeObserver reports element size changes asynchronously, avoiding the performance penalty of synchronous `getBoundingClientRect()` calls.
```typescript title="resize-observer-measurement.ts" collapse={1-2,20-25}
// Height measurement pattern used by react-virtuoso
function measureItemHeight(element: HTMLElement, onMeasure: (height: number) => void): () => void {
const observer = new ResizeObserver((entries) => {
// contentRect excludes padding and border
const height = entries[0].contentRect.height
onMeasure(height)
})
observer.observe(element)
// Cleanup function
return () => observer.disconnect()
}
```
**Critical detail—what ResizeObserver measures:**
ResizeObserver's `contentRect` reports the content box (excluding padding, border, margin). If items have margins, add them manually:
```typescript title="margin-aware-measurement.ts"
const computedStyle = getComputedStyle(element)
const marginTop = parseFloat(computedStyle.marginTop)
const marginBottom = parseFloat(computedStyle.marginBottom)
const totalHeight = entry.contentRect.height + marginTop + marginBottom
```
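Padding and border are likewise excluded from `contentRect`. Where the browser provides it, `ResizeObserverEntry.borderBoxSize` reports the border-box size directly. The sketch below assumes a horizontal writing mode (so `blockSize` is the height) and uses a hypothetical `measuredHeight` helper with a structural entry type so it can run without a DOM.

```typescript
// Structural subset of ResizeObserverEntry, so the helper is testable without a DOM.
interface SizeEntry {
  contentRect: { height: number }
  borderBoxSize?: ReadonlyArray<{ blockSize: number }>
}

// Prefer the border-box height (content + padding + border) when it is reported,
// fall back to the content-box height, then add margins supplied by the caller.
function measuredHeight(entry: SizeEntry, marginTop = 0, marginBottom = 0): number {
  const borderBox = entry.borderBoxSize?.[0]?.blockSize
  const base = borderBox ?? entry.contentRect.height
  return base + marginTop + marginBottom
}
```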
**Infinite loop prevention:**
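The ResizeObserver spec prevents infinite loops by limiting each delivery pass within a frame to elements deeper in the DOM than the shallowest element notified in the previous pass. If a callback resizes an element at or above that depth (e.g., by changing its own content), the new notification is deferred to the next frame, and the browser reports a "ResizeObserver loop" error instead of looping.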
### IntersectionObserver for Visibility Detection
IntersectionObserver detects when elements enter or exit the viewport without triggering layout calculations.
```typescript title="intersection-observer-sentinel.ts" collapse={1-2}
// Sentinel pattern for infinite scroll loading
function setupLoadMoreTrigger(sentinel: HTMLElement, onVisible: () => void) {
const observer = new IntersectionObserver(
(entries) => {
if (entries[0].isIntersecting) {
onVisible() // Load more items
}
},
{
rootMargin: "200px", // Trigger 200px before visible
threshold: 0,
},
)
observer.observe(sentinel)
return () => observer.disconnect()
}
```
**Why IntersectionObserver over scroll events:**
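- **Asynchronous delivery**: The browser computes intersections during its own rendering work and invokes the callback afterward, so no handler code runs on every scroll event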
- **No layout thrashing**: Doesn't force synchronous layout
- **Batched**: Multiple intersections delivered in one callback
### requestAnimationFrame for Scroll Handling
Scroll events can fire more often than the screen repaints, depending on the browser and input device. Deferring work to `requestAnimationFrame` (RAF) caps updates at one per frame.
```typescript title="raf-scroll-handling.ts"
let scheduled = false
let lastScrollTop = 0
function handleScroll(e: Event) {
lastScrollTop = (e.target as HTMLElement).scrollTop
if (!scheduled) {
scheduled = true
requestAnimationFrame(() => {
updateVisibleRange(lastScrollTop)
scheduled = false
})
}
}
```
**Why this matters:**
On a 144Hz display, scroll events might fire 144+ times per second. Without RAF throttling, you'd recalculate visible items on every event—most of which occur between frames and waste CPU cycles.
## Edge Cases and Failure Modes
### Scroll Position Jumping
**Symptom**: When dragging the scrollbar to an arbitrary position, content briefly shows wrong items before correcting.
**Root cause**: Unmeasured items use estimated heights. If estimates are wrong, calculated scroll positions are wrong.
**Mitigation strategies:**
```typescript title="jump-mitigation.ts"
// Strategy 1: Type-based estimates
const heightEstimates: Record<string, number> = {
text: 60,
image: 300,
video: 400,
card: 150,
}
function estimateHeight(item: Item): number {
return heightEstimates[item.type] ?? 100
}
// Strategy 2: Running average
const defaultEstimate = 100 // fallback used before any item is measured (assumed value)
let totalMeasured = 0
let measurementCount = 0
function updateEstimate(measuredHeight: number) {
totalMeasured += measuredHeight
measurementCount++
}
function getEstimate(): number {
return measurementCount > 0 ? totalMeasured / measurementCount : defaultEstimate
}
```
### Keyboard Navigation and Focus Management
**Critical accessibility failure**: When a focused item scrolls out of the viewport and is unmounted, focus is lost. Screen reader users lose their position entirely.
**Solution**: Track logical focus (item index) separately from DOM focus. When scrolling brings the focused item back into view, restore DOM focus.
```typescript title="focus-management.ts" collapse={1-3,25-35}
import { useState, useEffect, useRef, useCallback } from "react"
function useFocusManagement(visibleRange: { start: number; end: number }) {
const [focusedIndex, setFocusedIndex] = useState<number | null>(null)
const itemRefs = useRef<Map<number, HTMLElement>>(new Map())
// Track which item should have focus (logical, not DOM)
const handleKeyDown = useCallback(
(e: KeyboardEvent) => {
if (e.key === "ArrowDown" && focusedIndex !== null) {
setFocusedIndex(focusedIndex + 1)
} else if (e.key === "ArrowUp" && focusedIndex !== null) {
setFocusedIndex(Math.max(0, focusedIndex - 1))
}
},
[focusedIndex],
)
// Restore DOM focus when focused item becomes visible
useEffect(() => {
if (focusedIndex !== null && focusedIndex >= visibleRange.start && focusedIndex <= visibleRange.end) {
itemRefs.current.get(focusedIndex)?.focus()
}
}, [focusedIndex, visibleRange])
return { focusedIndex, setFocusedIndex, itemRefs, handleKeyDown }
}
```
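The hook above has no upper bound on `ArrowDown`, because it does not know the list length. One way to handle this is a pure helper that clamps the logical index. The sketch below is an assumption-laden illustration: `nextFocusIndex`, the `itemCount` parameter, and the `Home`/`End` handling are not part of the hook above.

```typescript
// Hypothetical helper: next logical focus index, clamped to [0, itemCount - 1].
function nextFocusIndex(current: number | null, key: string, itemCount: number): number | null {
  if (itemCount === 0) return null
  const last = itemCount - 1
  switch (key) {
    case "ArrowDown":
      return current === null ? 0 : Math.min(last, current + 1)
    case "ArrowUp":
      return current === null ? 0 : Math.max(0, current - 1)
    case "Home":
      return 0
    case "End":
      return last
    default:
      return current // unrelated keys leave focus unchanged
  }
}
```

Because the result is a logical index, the virtualizer still has to scroll that item into the rendered range before DOM focus can be restored.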
### Screen Reader Accessibility
**Fundamental problem**: Screen readers navigate the DOM. Virtualized content doesn't exist in the DOM.
**ARIA live regions**: Announce content changes to screen readers.
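```html title="aria-live-regions.html"
<!-- Illustrative markup using the attributes from the table below -->
<div role="list" aria-live="polite" aria-relevant="additions" aria-busy="false">
  <!-- Only the currently rendered items exist here -->
</div>
```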
**ARIA attributes for virtualized lists:**
| Attribute | Purpose |
| --------------------------- | --------------------------------------------- |
| `role="list"` | Identifies container as a list |
| `aria-live="polite"` | Announce changes when user is idle |
| `aria-relevant="additions"` | Only announce new items, not removals |
| `aria-busy="true"` | Indicate loading state |
| `aria-setsize` | Total number of items (including virtualized) |
| `aria-posinset` | Position of each visible item in full list |
```html title="aria-positioning.html"
<div role="listitem" aria-setsize="10000" aria-posinset="42">Item 42 of 10,000</div>
```
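When items are rendered from data, these attributes can be derived from the item's index. A minimal sketch, assuming a hypothetical `virtualItemAria` helper; the one detail worth encoding is that `aria-posinset` is 1-based while array indexes are 0-based.

```typescript
interface VirtualItemAria {
  role: "listitem"
  "aria-setsize": number
  "aria-posinset": number
}

// Hypothetical helper: ARIA attributes for the item at a 0-based index in the full list.
function virtualItemAria(index: number, totalCount: number): VirtualItemAria {
  return {
    role: "listitem",
    "aria-setsize": totalCount, // size of the full list, not just the rendered window
    "aria-posinset": index + 1, // 1-based position
  }
}
```

In a React component the returned object could be spread onto each rendered item element.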
**Standards work**: The WICG (Web Incubator Community Group) has incubated a `<virtual-scroller>` element proposal intended to provide native browser support for virtualization with built-in accessibility.
### Find-in-Page Limitations
**Hard constraint**: The browser's Ctrl+F/Cmd+F searches only content that exists in the DOM. Virtualized items that are not currently rendered are not in the DOM, so they cannot be found.
**Solutions:**
1. **Custom search UI**: Implement application-level search that queries data, calculates positions, and scrolls to results
```typescript title="custom-search.ts"
function searchAndScroll(query: string, items: Item[], virtualizer: Virtualizer) {
const matchIndex = items.findIndex((item) => item.content.toLowerCase().includes(query.toLowerCase()))
if (matchIndex !== -1) {
virtualizer.scrollToIndex(matchIndex, { align: "center" })
}
}
```
2. **Warning UI**: Display a notice explaining that browser search won't find all content
3. **`content-visibility` alternative**: For lists under ~10K items where find-in-page is critical, use CSS `content-visibility: auto` instead of virtualization
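A custom search UI usually needs every match, plus next/previous stepping, rather than only the first hit. The sketch below extends the first solution under that assumption; `findMatches` and `stepMatch` are hypothetical helpers, and items are assumed to expose a `content` string as in the example above.

```typescript
interface SearchableItem {
  content: string
}

// Indexes of every item whose content contains the query (case-insensitive).
function findMatches(items: SearchableItem[], query: string): number[] {
  const needle = query.trim().toLowerCase()
  if (needle === "") return []
  const matches: number[] = []
  items.forEach((item, index) => {
    if (item.content.toLowerCase().includes(needle)) matches.push(index)
  })
  return matches
}

// Step through matches like a browser's next/previous buttons, wrapping at the ends.
// `current` is a position within the matches array, or -1 when nothing is selected yet.
function stepMatch(matchCount: number, current: number, direction: 1 | -1): number {
  if (matchCount === 0) return -1
  if (current < 0) return direction === 1 ? 0 : matchCount - 1
  return (current + direction + matchCount) % matchCount
}
```

The selected entry of the matches array would then be passed to the virtualizer's scroll-to-index call.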
### Bidirectional Scrolling (Chat Interfaces)
**Unique challenge**: Chat interfaces load history when scrolling up and receive new messages at the bottom. Both directions modify the list, and scroll position must be preserved.
**Scroll anchoring**: When items are added above the viewport, adjust scroll position to keep the visible content stationary.
```typescript title="scroll-anchoring.ts"
function addHistoryItems(newItems: Item[], existingItems: Item[]) {
// Record current scroll position and reference item
const scrollContainer = containerRef.current
const currentScrollTop = scrollContainer.scrollTop
const anchorItem = findFirstVisibleItem()
const anchorOffset = getItemOffset(anchorItem.index)
// Add items to beginning
const combined = [...newItems, ...existingItems]
// After render, adjust scroll to maintain anchor position
requestAnimationFrame(() => {
const newAnchorOffset = getItemOffset(anchorItem.index + newItems.length)
const adjustment = newAnchorOffset - anchorOffset
scrollContainer.scrollTop = currentScrollTop + adjustment
})
}
```
**Reverse infinite scroll**: Discord's message list uses this pattern. Scrolling to the top triggers history loading, with scroll position adjusted so the user doesn't notice items being prepended.
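The adjustment computed in `addHistoryItems` above amounts to the total height of the prepended items. A minimal sketch of that arithmetic, assuming a hypothetical `anchoredScrollTop` helper in which unmeasured items fall back to an estimate:

```typescript
// Hypothetical helper: scrollTop that keeps visible content stationary after a prepend.
// `prependedHeights` holds measured heights, with `undefined` for items not yet measured.
function anchoredScrollTop(
  currentScrollTop: number,
  prependedHeights: Array<number | undefined>,
  estimatedHeight: number,
): number {
  const addedHeight = prependedHeights.reduce<number>(
    (sum, height) => sum + (height ?? estimatedHeight),
    0,
  )
  return currentScrollTop + addedHeight
}
```

If estimates were used, the same calculation would need to run again once real measurements arrive, applying only the difference.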
### Dynamic Content (Images Loading)
**Problem**: Images without dimensions cause layout shift when they load, invalidating cached heights.
**Solutions:**
1. **Aspect ratio CSS**: Reserve space before image loads
```css title="aspect-ratio-placeholder.css"
.image-container {
aspect-ratio: 16 / 9;
width: 100%;
}
.image-container img {
width: 100%;
height: 100%;
object-fit: cover;
}
```
2. **Re-measure on load**: Update height cache when images complete
```typescript title="image-load-remeasure.ts" collapse={1-2}
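function ImageItem({ item, onHeightChange }: Props) {
  const ref = useRef<HTMLDivElement>(null);
  const handleImageLoad = () => {
    if (ref.current) {
      onHeightChange(ref.current.offsetHeight);
    }
  };
  // Markup is a sketch; `item.src` and `item.alt` are assumed fields
  return (
    <div ref={ref}>
      <img src={item.src} alt={item.alt} onLoad={handleImageLoad} />
    </div>
  );
}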
```
## Performance Optimization
### Overscan Configuration
Overscan renders additional items beyond the viewport to prevent blank areas during fast scrolling.
**Trade-off**: Higher overscan = smoother scrolling but more rendering work per frame.
| Scroll Speed | Recommended Overscan |
| ----------------------- | -------------------- |
| Slow/moderate | 1-2 items |
| Fast scrolling expected | 3-5 items |
| Scrollbar drag support | 5-10 items |
**Direction-aware overscan**: Render more items in the scroll direction for better perceived smoothness.
```typescript title="directional-overscan.ts"
function calculateOverscan(scrollDirection: "up" | "down") {
return {
overscanBefore: scrollDirection === "up" ? 5 : 2,
overscanAfter: scrollDirection === "down" ? 5 : 2,
}
}
```
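To use direction-aware overscan, the virtualizer needs to infer the direction and apply it to the visible range. The sketch below shows one way this might look; `detectDirection` and `applyOverscan` are hypothetical helpers, and the 5/2 defaults simply reuse the values from the example above.

```typescript
type ScrollDirection = "up" | "down"

// Infer direction from consecutive scrollTop samples; keep the previous direction when unchanged.
function detectDirection(
  prevScrollTop: number,
  nextScrollTop: number,
  previous: ScrollDirection = "down",
): ScrollDirection {
  if (nextScrollTop > prevScrollTop) return "down"
  if (nextScrollTop < prevScrollTop) return "up"
  return previous
}

// Expand a visible [start, end) range asymmetrically, clamped to the list bounds.
function applyOverscan(
  start: number,
  end: number,
  itemCount: number,
  direction: ScrollDirection,
  leading = 5,
  trailing = 2,
): { start: number; end: number } {
  const before = direction === "up" ? leading : trailing
  const after = direction === "down" ? leading : trailing
  return {
    start: Math.max(0, start - before),
    end: Math.min(itemCount, end + after),
  }
}
```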
### Scroll Event Handling Strategies
**RAF throttling (recommended)**:
```typescript title="raf-throttle.ts"
let rafId: number | null = null
function handleScroll(e: Event) {
if (rafId === null) {
rafId = requestAnimationFrame(() => {
updateVisibleItems()
rafId = null
})
}
}
```
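**Passive event listeners**: Mark listeners as passive to tell the browser the handler won't call `preventDefault()`. This matters most for `wheel` and `touchmove` listeners, which can otherwise delay scrolling; `scroll` events are not cancelable, so the flag is harmless there and mainly documents intent.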
```typescript title="passive-scroll.ts"
container.addEventListener("scroll", handleScroll, { passive: true })
```
### GPU-Accelerated Positioning
**Use transforms, not layout-triggering properties:**
```css title="gpu-positioning.css"
/* ✅ GPU-accelerated */
.virtual-item {
transform: translateY(var(--offset));
will-change: transform; /* Hint to browser */
}
/* ❌ Triggers layout */
.virtual-item {
position: absolute;
top: var(--offset);
}
```
**will-change caveat**: MDN warns that `will-change` is "intended to be used as a last resort." Excessive use creates too many compositor layers, consuming memory and potentially hurting performance. Use selectively on elements that will actually animate.
### Memory Management for Large Lists
**Height cache strategy**: For lists with millions of items, storing every height becomes problematic.
**Segment-based caching**: Store heights in segments, evict old segments using LRU.
```typescript title="segment-cache.ts"
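class SegmentedHeightCache {
  // Map preserves insertion order, so the first key is the least recently used segment
  private segments: Map<number, Map<number, number>> = new Map()
  private segmentSize = 1000
  private maxSegments = 10
  get(index: number): number | undefined {
    const segment = this.touch(Math.floor(index / this.segmentSize))
    return segment?.get(index)
  }
  set(index: number, height: number) {
    const segmentId = Math.floor(index / this.segmentSize)
    let segment = this.touch(segmentId)
    if (!segment) {
      if (this.segments.size >= this.maxSegments) {
        this.evictOldest()
      }
      segment = new Map()
      this.segments.set(segmentId, segment)
    }
    segment.set(index, height)
  }
  // Move an existing segment to the most-recently-used position
  private touch(segmentId: number): Map<number, number> | undefined {
    const segment = this.segments.get(segmentId)
    if (segment) {
      this.segments.delete(segmentId)
      this.segments.set(segmentId, segment)
    }
    return segment
  }
  private evictOldest() {
    const firstKey = this.segments.keys().next().value
    if (firstKey !== undefined) this.segments.delete(firstKey)
  }
}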
```
## Grid Virtualization
### Two-Dimensional Windowing
Grid virtualization requires windowing both rows AND columns simultaneously.
```typescript title="grid-virtualizer.ts" collapse={1-3,35-45}
interface GridVirtualizerProps {
rowCount: number
columnCount: number
rowHeight: number
columnWidth: number
containerWidth: number
containerHeight: number
}
function calculateVisibleGrid({
scrollTop,
scrollLeft,
rowHeight,
columnWidth,
containerWidth,
containerHeight,
rowCount,
columnCount,
}: GridVirtualizerProps & { scrollTop: number; scrollLeft: number }) {
const startRow = Math.floor(scrollTop / rowHeight)
const endRow = Math.min(rowCount, Math.ceil((scrollTop + containerHeight) / rowHeight) + 1)
const startCol = Math.floor(scrollLeft / columnWidth)
const endCol = Math.min(columnCount, Math.ceil((scrollLeft + containerWidth) / columnWidth) + 1)
return {
visibleRows: { start: startRow, end: endRow },
visibleCols: { start: startCol, end: endCol },
}
}
```
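Once the visible row and column ranges are known, each cell can be positioned with a two-axis transform. A minimal sketch for fixed row and column sizes, using a hypothetical `layoutVisibleCells` helper that consumes ranges shaped like the return value above:

```typescript
interface IndexRange {
  start: number
  end: number // exclusive
}

interface CellPlacement {
  row: number
  col: number
  transform: string
}

// Enumerate the cells in the visible ranges with their translate offsets.
function layoutVisibleCells(
  rows: IndexRange,
  cols: IndexRange,
  rowHeight: number,
  columnWidth: number,
): CellPlacement[] {
  const cells: CellPlacement[] = []
  for (let row = rows.start; row < rows.end; row++) {
    for (let col = cols.start; col < cols.end; col++) {
      cells.push({
        row,
        col,
        transform: `translate(${col * columnWidth}px, ${row * rowHeight}px)`,
      })
    }
  }
  return cells
}
```

The number of rendered cells is the product of the two range lengths, which is why grids benefit from windowing both axes.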
### Grid-Specific Challenges
**Header synchronization**: Column headers must scroll horizontally with the body; row headers must scroll vertically.
```typescript title="header-sync.ts"
// Sync column header horizontal scroll with body
function syncHeaders(bodyScrollLeft: number, bodyScrollTop: number) {
columnHeaderRef.current.scrollLeft = bodyScrollLeft
rowHeaderRef.current.scrollTop = bodyScrollTop
}
```
**Cell selection**: Multi-cell selection requires tracking 2D ranges and rendering selection highlights efficiently.
**Column resizing**: When a column's width changes, the cached horizontal offsets of every column after it must be recomputed, and every rendered row must reposition its cells. This is typically more expensive than a row height change, which only shifts the rows below it.
## Real-World Implementations
### Discord: Message Virtualization
**Challenge**: Millions of messages across servers, with rich content (embeds, reactions, replies).
**Architecture**:
- Backend: ScyllaDB for message storage (replaced Cassandra for better latency)
- Frontend: Custom virtualization with DOM recycling
- Real-time: Elixir processes for WebSocket coordination
**Virtualization approach**:
- Fixed-ish heights: Most messages have predictable heights
- Type-based estimation for embeds, images
- Bidirectional scrolling for history loading
- Scroll anchoring when new messages arrive
**Key insight**: Discord separates "jump to message" (scrollbar drag) from "scroll to load more" (reaching boundaries). Jump uses estimation; boundary loading measures precisely.
### Figma: Canvas Virtualization
**Challenge**: Infinite canvas with millions of objects at arbitrary zoom levels.
**Architecture**:
- Rendering: WebGL (now WebGPU in Chrome 127+)
- Spatial indexing: R-tree for visible object queries
- Level-of-detail: Simplify distant objects
**Key insight**: Figma doesn't use DOM virtualization—they bypass the DOM entirely with WebGL. The "virtualization" is which objects to send to the GPU based on the current viewport and zoom level.
**Outcome**: 60fps with 100K+ objects, 50ms initial render for complex files.
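The general idea of viewport culling on a zoomable canvas can be sketched as follows. This is not Figma's implementation: it uses a linear scan instead of an R-tree for clarity, and the `Camera` shape, `visibleWorldRect`, and `cullObjects` are hypothetical names chosen for illustration.

```typescript
interface Rect {
  x: number
  y: number
  width: number
  height: number
}

// Assumed camera model: x/y are the world coordinates of the viewport's top-left corner.
interface Camera {
  x: number
  y: number
  zoom: number
}

// World-space rectangle currently visible on a screen of the given pixel size.
function visibleWorldRect(camera: Camera, screenWidth: number, screenHeight: number): Rect {
  return {
    x: camera.x,
    y: camera.y,
    width: screenWidth / camera.zoom,
    height: screenHeight / camera.zoom,
  }
}

// Axis-aligned rectangle overlap test.
function intersects(a: Rect, b: Rect): boolean {
  return a.x < b.x + b.width && a.x + a.width > b.x && a.y < b.y + b.height && a.y + a.height > b.y
}

// Linear scan for clarity; a spatial index such as an R-tree would replace this filter at scale.
function cullObjects<T extends { bounds: Rect }>(objects: T[], view: Rect): T[] {
  return objects.filter((object) => intersects(object.bounds, view))
}
```

Zooming in shrinks the world-space rectangle, so fewer objects pass the filter and reach the renderer.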
### VS Code: Editor Virtualization
**Challenge**: Files with 100K+ lines, syntax highlighting, code folding, minimap.
**Approach**:
- Line-based virtualization with fixed heights (monospace font)
- Incremental tokenization for syntax highlighting
- Minimap uses fixed-height representation
**Large file optimizations** (`editor.largeFileOptimizations`):
- Disable syntax highlighting above threshold
- Disable code folding computation
- Disable minimap rendering
**Key insight**: VS Code selectively disables features as file size increases, trading functionality for responsiveness.
### Slack: Hybrid Virtual Scrollbar
**Challenge**: Message history spanning years, with mixed content types.
**Approach**:
- True overflow scrolling (native scrollbar behavior)
- Custom virtual scrollbar UI overlaid
- Backend pagination with cursors
**Key insight**: Rather than reimplementing scroll physics, Slack uses the browser's native scrolling and only virtualizes the scrollbar UI—reducing complexity while maintaining familiar scroll feel.
## Library Comparison
### react-window
**Author**: Brian Vaughn (bvaughn), former React team member
**Design philosophy**: Minimal, focused alternative to react-virtualized.
**Components**:
- `FixedSizeList` / `VariableSizeList`
- `FixedSizeGrid` / `VariableSizeGrid`
**Bundle size**: ~6KB gzipped
**Best for**: Teams wanting a lightweight, well-maintained React solution for common virtualization patterns.
### react-virtuoso
**Specialization**: Variable-height content with automatic measurement.
**Key features**:
- ResizeObserver-based automatic height tracking
- Built-in handling for prepending items (chat use case)
- Grouped lists with sticky headers
**Bundle size**: ~15KB gzipped
**Best for**: Social feeds, chat interfaces—anywhere content heights vary unpredictably.
### @tanstack/virtual
**Author**: Tanner Linsley (TanStack maintainer)
**Architecture**: Framework-agnostic core with adapters for React, Vue, Svelte, Solid.
**Key innovation**: Inversion of control—framework adapters implement browser interactions, core handles calculations.
**Bundle size**: ~10KB gzipped (core)
**Best for**: Multi-framework teams, or when you need virtualization outside React.
### Library Decision Matrix
| Library | Bundle Size | Variable Height | Auto-measure | Framework | Best For |
| -------------------- | ----------- | --------------- | ------------ | --------- | --------------- |
| react-window | 6KB | Manual | ❌ | React | Simple lists |
| react-virtuoso | 15KB | Built-in | ✅ | React | Dynamic content |
| @tanstack/virtual | 10KB | Built-in | ✅ | Any | Multi-framework |
| vue-virtual-scroller | 12KB | Built-in | ✅ | Vue | Vue apps |
## Conclusion
Virtualization is the most dependable way to render very large lists in the browser. The core principle is to render only what's visible, which reduces DOM work from O(n) to O(viewport) regardless of list size.
Choose your approach based on constraints:
- **Fixed-height virtualization** when you can enforce uniform item heights—simplest and fastest
- **Variable-height virtualization** for real-world content—the production standard
- **DOM recycling** when GC pauses are unacceptable—adds complexity but eliminates allocation churn
- **CSS `content-visibility`** when find-in-page matters and list size is under ~10K—simplest implementation, keeps full DOM
The implementation details matter: use `transform` over `top`, RAF-throttle scroll events, manage focus for accessibility, and handle the edge cases (scroll jumping, bidirectional scrolling, dynamic content) that differentiate production code from demos.
## Appendix
### Prerequisites
- Browser rendering pipeline (layout, paint, composite)
- React or framework fundamentals (for library examples)
- Basic understanding of time complexity notation
### Terminology
| Term | Definition |
| ---------------- | --------------------------------------------------------------- |
| Virtualization | Rendering only visible items plus buffer |
| Windowing | Synonym for virtualization in this context |
| Overscan | Extra items rendered beyond visible viewport |
| DOM recycling | Reusing DOM elements instead of create/destroy |
| Height cache | Storage of measured item heights for positioning |
| Scroll anchoring | Maintaining visual position when items are added above viewport |
### Summary
- Virtualization keeps DOM size constant (O(viewport)) regardless of list length
- Fixed-height is simplest (pure math); variable-height handles real content
- Use `transform: translateY()` for GPU-accelerated positioning
- ResizeObserver measures heights; IntersectionObserver detects visibility
- `content-visibility: auto` is a CSS-only alternative that preserves find-in-page
- Focus management and ARIA attributes are critical for accessibility
- Real-world implementations (Discord, Figma, VS Code) combine techniques based on their specific constraints
### References
- [W3C Intersection Observer Specification](https://www.w3.org/TR/intersection-observer/) - Authoritative API specification
- [CSSWG ResizeObserver Specification](https://drafts.csswg.org/resize-observer/) - Observer API for element size changes
- [MDN: content-visibility](https://developer.mozilla.org/en-US/docs/Web/CSS/content-visibility) - CSS property documentation
- [web.dev: Virtualize long lists with react-window](https://web.dev/articles/virtualize-long-lists-react-window) - Chrome DevRel implementation guide
- [web.dev: content-visibility: the new CSS property that boosts your rendering performance](https://web.dev/articles/content-visibility) - Performance case study
- [TanStack Virtual Documentation](https://tanstack.com/virtual/latest/docs/introduction) - Framework-agnostic library docs
- [react-window GitHub](https://github.com/bvaughn/react-window) - Library source and documentation
- [react-virtuoso Documentation](https://virtuoso.dev/) - Variable-height virtualization library
- [WICG Virtual Scroller Proposal](https://github.com/WICG/virtual-scroller) - Ongoing standardization effort
- [Figma: Keeping Figma Fast](https://www.figma.com/blog/keeping-figma-fast/) - Canvas virtualization case study
- [AG Grid: DOM Virtualization](https://www.ag-grid.com/javascript-data-grid/dom-virtualisation/) - Grid virtualization patterns
---
## Offline-First Architecture
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/frontend-system-design/offline-first-architecture
**Category:** System Design / Frontend System Design
**Description:** Building applications that prioritize local data and functionality, treating network connectivity as an enhancement rather than a requirement: the storage APIs, sync strategies, and conflict resolution patterns that power modern collaborative and offline-capable applications. Offline-first inverts the traditional web model: instead of fetching data from servers and caching it locally, data lives locally first and syncs to servers when possible. This article explores the browser APIs that enable this pattern, the sync strategies that keep data consistent, and how production applications like Figma, Notion, and Linear solve these problems at scale.
# Offline-First Architecture
Building applications that prioritize local data and functionality, treating network connectivity as an enhancement rather than a requirement—the storage APIs, sync strategies, and conflict resolution patterns that power modern collaborative and offline-capable applications.
Offline-first inverts the traditional web model: instead of fetching data from servers and caching it locally, data lives locally first and syncs to servers when possible. This article explores the browser APIs that enable this pattern, the sync strategies that keep data consistent, and how production applications like Figma, Notion, and Linear solve these problems at scale.

Offline-first architecture: application reads/writes to local storage first, service worker manages caching, and sync happens in the background when connectivity allows.
## Abstract
Offline-first architecture treats local storage as the primary data source and network as a sync mechanism. The core mental model:
- **Local-first data**: Application reads from and writes to local storage (IndexedDB, OPFS) immediately. Network operations are asynchronous background tasks, not blocking user interactions.
- **Service Workers as network proxy**: Service Workers intercept all network requests, enabling caching strategies (cache-first, network-first, stale-while-revalidate) and background sync when connectivity returns.
- **Conflict resolution is the hard problem**: When multiple clients modify the same data offline, syncing creates conflicts. Three approaches: Last-Write-Wins (simple but loses data), Operational Transform (requires central server), and CRDTs (mathematically guaranteed convergence but complex).
- **Storage is constrained and unreliable**: Browser storage quotas vary wildly (Safari: 1GB, Chrome: 60% of disk). Storage can be evicted without warning unless persistent storage is requested and granted.
| Pattern | Complexity | Data Loss Risk | Offline Duration | Best For |
| ---------- | ---------- | ------------------ | ---------------- | ----------------- |
| Cache-only | Low | High (stale data) | Minutes | Static assets |
| Sync queue | Medium | Medium (conflicts) | Hours | Form submissions |
| OT-based | High | Low | Days | Real-time collab |
| CRDT-based | Very High | None | Indefinite | P2P, long offline |
## The Challenge
### Browser Constraints
Building offline-first applications means working within browser limitations that don't exist in native apps.
**Main thread contention**: IndexedDB operations are asynchronous but still affect the main thread. Large reads/writes can cause jank. Service Workers run on a separate thread but share CPU with the page.
**Storage quotas**: Browsers limit how much data an origin can store, and quotas vary dramatically:
| Browser | Best-Effort Mode | Persistent Mode | Eviction Behavior |
| ------- | ---------------------- | --------------------- | ------------------------------- |
| Chrome | 60% of disk | 60% of disk | LRU when >80% full |
| Firefox | 10% of disk (max 10GB) | 50% of disk (max 8TB) | LRU by origin |
| Safari | ~1GB total | Not supported | 7 days without user interaction |
**Safari's aggressive eviction**: Safari deletes all website data (IndexedDB, Cache API, localStorage) after 7 days without user interaction when Intelligent Tracking Prevention (ITP) is enabled. This fundamentally breaks long-term offline storage for Safari users.
> "After 7 days of Safari use without user interaction on your site, all the website's script-writable storage forms are deleted." — WebKit Blog, 2020
**Storage API fragmentation**: Different storage mechanisms have different characteristics:
| Storage Type | Max Size | Persistence | Indexed | Transaction Support |
| -------------- | ------------ | ------------------ | ------- | ------------------- |
| localStorage | 5MB | Persistent | No | No |
| sessionStorage | 5MB | Tab session | No | No |
| IndexedDB | Origin quota | Persistent | Yes | Yes |
| Cache API | Origin quota | Persistent | No | No |
| OPFS | Origin quota | Persistent | No | No |
### Network Realities
**`navigator.onLine` is unreliable**: This API only indicates whether the browser has a network interface—not whether it can reach the internet. A LAN connection without internet access reports `online: true`.
```typescript
// Don't rely on this for actual connectivity
navigator.onLine // true even without internet access

// Instead, detect actual connectivity
async function checkConnectivity(): Promise<boolean> {
  try {
    const response = await fetch("/api/health", {
      method: "HEAD",
      cache: "no-store",
    })
    return response.ok
  } catch {
    return false
  }
}
```
**Network transitions are complex**: Users move between WiFi, cellular, and offline. Requests can fail mid-flight. Servers can be reachable but slow. Offline-first apps must handle all these states gracefully.
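One way to model these states is a small classifier over the result of a health probe. The sketch below is illustrative only: the `ProbeResult` shape, the `classifyConnectivity` name, and the 2000 ms threshold are assumptions made for this example, not part of any browser API.

```typescript
// Hypothetical result of a health probe such as checkConnectivity()
interface ProbeResult {
  ok: boolean // did the probe receive a successful response?
  latencyMs: number // round-trip time of the probe
}

type ConnectivityState = "online" | "degraded" | "offline"

// Assumed threshold: anything slower than this counts as "degraded"
const SLOW_THRESHOLD_MS = 2000

function classifyConnectivity(probe: ProbeResult | null): ConnectivityState {
  // null models a probe that threw (request failed mid-flight, no route)
  if (probe === null || !probe.ok) return "offline"
  return probe.latencyMs > SLOW_THRESHOLD_MS ? "degraded" : "online"
}
```

Treating "reachable but slow" as its own state lets the UI keep working from local data instead of blocking on a request that may never finish.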
### Scale Factors
The right offline strategy depends on data characteristics:
| Factor | Simple Offline | Full Offline-First |
| ------------------- | --------------------- | ----------------------- |
| Data size | < 10MB | 100MB+ |
| Update frequency | < 1/hour | Real-time |
| Concurrent editors | Single user | Multiple users |
| Offline duration | Minutes | Days/weeks |
| Conflict complexity | Overwrites acceptable | Must preserve all edits |
## Storage Layer
### IndexedDB: The Foundation
IndexedDB is the primary storage mechanism for offline-first apps. It's a transactional, indexed object store that can handle large amounts of structured data.
**Transaction model**: IndexedDB uses transactions with three modes:
- `readonly`: Multiple concurrent reads allowed
- `readwrite`: Serialized writes, blocks other readwrite transactions on same object stores
- `versionchange`: Schema changes, exclusive access to entire database
```typescript collapse={1-3, 22-30}
// Database initialization with versioning
const DB_NAME = "offline-app"
const DB_VERSION = 2

function openDatabase(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const request = indexedDB.open(DB_NAME, DB_VERSION)

    request.onupgradeneeded = (event) => {
      const db = request.result
      const oldVersion = event.oldVersion

      // Version 1: Initial schema
      if (oldVersion < 1) {
        const store = db.createObjectStore("documents", { keyPath: "id" })
        store.createIndex("by_updated", "updatedAt")
      }

      // Version 2: Add sync metadata
      if (oldVersion < 2) {
        // Reuse the versionchange transaction to reach the existing store
        const store = request.transaction!.objectStore("documents")
        store.createIndex("by_sync_status", "syncStatus")
      }
    }

    request.onsuccess = () => resolve(request.result)
    request.onerror = () => reject(request.error)
  })
}
```
**Versioning is critical**: IndexedDB schema changes require version increments. Opening a database with a lower version than exists fails. The `onupgradeneeded` handler must handle all version migrations sequentially.
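The sequential-migration rule can be made explicit with a small planning function. This is a minimal sketch under stated assumptions: `MigrationStep`, `MIGRATIONS`, and `pendingMigrations` are hypothetical names, and the version 3 step is invented purely to show a multi-step upgrade.

```typescript
// Hypothetical migration table: each step upgrades the schema to `toVersion`
type MigrationStep = { toVersion: number; description: string }

const MIGRATIONS: MigrationStep[] = [
  { toVersion: 1, description: "create documents store" },
  { toVersion: 2, description: "add by_sync_status index" },
  { toVersion: 3, description: "add attachments store (illustrative only)" },
]

// Returns the steps to run, in order, when upgrading oldVersion -> newVersion
function pendingMigrations(oldVersion: number, newVersion: number): MigrationStep[] {
  if (newVersion < oldVersion) {
    throw new Error("IndexedDB cannot open a database at a lower version")
  }
  return MIGRATIONS
    .filter((step) => step.toVersion > oldVersion && step.toVersion <= newVersion)
    .sort((a, b) => a.toVersion - b.toVersion)
}
```

A user who last opened the app at version 1 and returns at version 3 gets steps 2 and 3 in order, which is the same effect the chained `if (oldVersion < n)` checks produce inside `onupgradeneeded`.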
**Cursor operations for large datasets**: For datasets too large to load entirely, use cursors:
```typescript collapse={1-2, 20-25}
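async function* iterateDocuments(db: IDBDatabase): AsyncGenerator<unknown> {
  const tx = db.transaction("documents", "readonly")
  const store = tx.objectStore("documents")
  const request = store.openCursor()

  while (true) {
    // Each cursor.continue() re-fires onsuccess on the same request
    const cursor = await new Promise<IDBCursorWithValue | null>((resolve, reject) => {
      request.onsuccess = () => resolve(request.result)
      request.onerror = () => reject(request.error)
    })
    if (!cursor) break

    // Caveat: the transaction auto-commits if the consumer awaits
    // non-IndexedDB work between items, so keep per-item work synchronous
    yield cursor.value
    cursor.continue()
  }
}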
```
### Origin Private File System (OPFS)
OPFS provides file system access within the browser sandbox—faster than IndexedDB for binary data and large files.
**When to use OPFS over IndexedDB**:
- Binary files (images, videos, documents)
- Large blobs (>10MB)
- Sequential read/write patterns
- Web Workers with synchronous access needed
```typescript collapse={1-2, 18-22}
// OPFS access
async function saveFile(name: string, data: ArrayBuffer): Promise<void> {
  const root = await navigator.storage.getDirectory()
  const fileHandle = await root.getFileHandle(name, { create: true })

  // Async API (main thread or worker)
  const writable = await fileHandle.createWritable()
  await writable.write(data)
  await writable.close()
}

// Sync access handle (dedicated workers only) - much faster
// Acquiring the handle is async; its write/flush/close calls are synchronous
async function saveFileSync(name: string, data: ArrayBuffer): Promise<void> {
  const root = await navigator.storage.getDirectory()
  const fileHandle = await root.getFileHandle(name, { create: true })
  const accessHandle = await fileHandle.createSyncAccessHandle()
  try {
    accessHandle.truncate(0) // drop any previous contents
    accessHandle.write(data, { at: 0 })
    accessHandle.flush()
  } finally {
    accessHandle.close() // releases the exclusive lock on the file
  }
}
```
**OPFS limitations**:
- No indexing (unlike IndexedDB)—you manage your own file organization
- Synchronous API only in Web Workers
- No cross-origin access
- Same quota as IndexedDB (shared origin quota)
### Storage Manager API
The Storage Manager API provides quota information and persistence requests:
```typescript
async function checkStorageStatus(): Promise<{
  quota: number
  usage: number
  persistent: boolean
}> {
  const estimate = await navigator.storage.estimate()
  const persistent = await navigator.storage.persisted()
  return {
    quota: estimate.quota ?? 0,
    usage: estimate.usage ?? 0,
    persistent,
  }
}

async function requestPersistence(): Promise<boolean> {
  // Chrome auto-grants for "important" sites (bookmarked, installed PWA)
  // Firefox prompts the user
  // Safari doesn't support persistent storage
  if (navigator.storage.persist) {
    return await navigator.storage.persist()
  }
  return false
}
```
**Persistence reality**: Without persistent storage, browsers can evict your data at any time when storage pressure occurs. Chrome uses LRU eviction by origin. Safari's 7-day limit applies regardless of persistence requests.
**Design implication**: Never assume local data will survive. Always design for re-sync from server. Treat local storage as a cache that improves UX, not as the source of truth.
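A quota check before large writes is one practical consequence of these constraints. The following is a rough sketch, not a recommended policy: `canStore`, `StorageEstimateLike`, and the 10% reserve are assumptions chosen for illustration.

```typescript
// Mirrors the shape returned by navigator.storage.estimate()
interface StorageEstimateLike {
  quota?: number
  usage?: number
}

// Assumed policy: keep 10% of the quota free as a safety margin
const RESERVE_FRACTION = 0.1

function canStore(estimate: StorageEstimateLike, bytesNeeded: number): boolean {
  const quota = estimate.quota ?? 0
  const usage = estimate.usage ?? 0
  const available = quota * (1 - RESERVE_FRACTION) - usage
  return bytesNeeded <= available
}
```

In a browser the argument would come from `await navigator.storage.estimate()`. Both fields are estimates, so a write can still fail with a quota error and needs its own error handling.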
## Service Workers
Service Workers are JavaScript workers that intercept network requests, enabling offline functionality and background sync.
### Lifecycle
Service Workers have a distinct lifecycle that affects how updates propagate:
```
Install → Waiting → Activate → Running → Idle → Terminated
                                  ↑                  ↓
                                  └── Fetch event ───┘
```
**Installation**: Service Worker is downloaded and parsed. `install` event fires—use this to pre-cache critical assets.
**Waiting**: New Service Worker waits until all tabs using the old version close. This prevents breaking in-flight requests.
**Activation**: Old Service Worker is replaced. `activate` event fires—use this to clean up old caches.
```typescript collapse={1-5, 30-40}
// service-worker.ts
const CACHE_VERSION = "v2"
const STATIC_CACHE = `static-${CACHE_VERSION}`
const DYNAMIC_CACHE = `dynamic-${CACHE_VERSION}`

self.addEventListener("install", (event: ExtendableEvent) => {
  event.waitUntil(
    caches.open(STATIC_CACHE).then((cache) => {
      return cache.addAll(["/", "/app.js", "/styles.css", "/offline.html"])
    }),
  )
  // Skip waiting to activate immediately (use carefully)
  // self.skipWaiting();
})

self.addEventListener("activate", (event: ExtendableEvent) => {
  event.waitUntil(
    caches.keys().then((keys) => {
      return Promise.all(keys.filter((key) => !key.includes(CACHE_VERSION)).map((key) => caches.delete(key)))
    }),
  )
  // Take control of all pages immediately
  // self.clients.claim();
})
```
**skipWaiting pitfall**: Calling `skipWaiting()` activates the new Service Worker immediately, but existing pages still have old JavaScript. This can cause version mismatches between page code and Service Worker. Only use if your update is backward-compatible.
### Caching Strategies
Jake Archibald's "Offline Cookbook" defines canonical caching strategies. Each has distinct trade-offs:
**Cache-First**: Serve from cache, fall back to network. Best for static assets that rarely change.
```typescript
async function cacheFirst(request: Request): Promise<Response> {
  const cached = await caches.match(request)
  if (cached) return cached

  const response = await fetch(request)
  if (response.ok) {
    const cache = await caches.open(STATIC_CACHE)
    // Clone before caching: a Response body can only be read once
    await cache.put(request, response.clone())
  }
  return response
}
```
**Network-First**: Try network, fall back to cache. Best for frequently-updated content where freshness matters.
```typescript collapse={1-2, 18-22}
async function networkFirst(request: Request, timeout = 3000): Promise<Response> {
  const controller = new AbortController()
  const timeoutId = setTimeout(() => controller.abort(), timeout)
  try {
    const response = await fetch(request, { signal: controller.signal })
    if (response.ok) {
      const cache = await caches.open(DYNAMIC_CACHE)
      await cache.put(request, response.clone())
    }
    return response
  } catch {
    // Network error or timeout: fall back to the cached copy
    const cached = await caches.match(request)
    if (cached) return cached
    throw new Error("Network failed and no cache available")
  } finally {
    // Clear the timer on every path, not only on success
    clearTimeout(timeoutId)
  }
}
```
**Stale-While-Revalidate**: Serve from cache immediately, update cache in background. Best for content where slight staleness is acceptable.
```typescript
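async function staleWhileRevalidate(request: Request): Promise<Response> {
  const cache = await caches.open(DYNAMIC_CACHE)
  const cached = await cache.match(request)

  // Always start a revalidation, even when a cached copy exists
  const fetchPromise = fetch(request).then(async (response) => {
    if (response.ok) {
      await cache.put(request, response.clone())
    }
    return response
  })

  if (cached) {
    // Swallow background failures so they don't surface as unhandled rejections
    fetchPromise.catch(() => {})
    return cached
  }
  return fetchPromise
}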
```
**Strategy selection by resource type**:
| Resource Type | Strategy | Rationale |
| ------------------------- | ------------------------ | ----------------------- |
| App shell (HTML, JS, CSS) | Cache-first with version | Immutable builds |
| API responses | Network-first | Freshness critical |
| User-generated content | Stale-while-revalidate | UX + eventual freshness |
| Images/media | Cache-first | Rarely change |
| Authentication endpoints | Network-only | Must be fresh |
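The table above can be expressed as a routing function that a `fetch` handler consults before dispatching to a strategy. This is a sketch only: the `/api/auth/` and `/api/content/` prefixes, the `RequestInfoLike` type, and `pickStrategy` are hypothetical and would need to match the application's real URL layout.

```typescript
type Strategy = "cache-first" | "network-first" | "stale-while-revalidate" | "network-only"

// Minimal structural view of a request; field names follow the Fetch API's Request
interface RequestInfoLike {
  url: string
  destination: string // e.g. "image", "script", "style", "document", ""
}

function pickStrategy(request: RequestInfoLike): Strategy {
  const { pathname } = new URL(request.url)

  // Hypothetical route prefixes; adjust to the application's URL layout
  if (pathname.startsWith("/api/auth/")) return "network-only"
  if (pathname.startsWith("/api/content/")) return "stale-while-revalidate"
  if (pathname.startsWith("/api/")) return "network-first"

  // Images, media, and app shell assets
  const cacheable = ["image", "video", "audio", "script", "style", "document"]
  if (cacheable.includes(request.destination)) return "cache-first"

  return "network-first"
}
```

Ordering matters here: the more specific API prefixes are checked before the generic `/api/` rule, so authentication requests never fall through to a cached response.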
### Background Sync
Background Sync API allows deferring actions until connectivity is available:
```typescript collapse={1-5, 25-35}
// In your application code
async function queueSync(data: SyncData): Promise<void> {
  // Store the data in IndexedDB
  await saveToSyncQueue(data)

  // Register for background sync
  const registration = await navigator.serviceWorker.ready
  await registration.sync.register("sync-pending-changes")
}

// In service worker
self.addEventListener("sync", (event: SyncEvent) => {
  if (event.tag === "sync-pending-changes") {
    event.waitUntil(processSyncQueue())
  }
})

async function processSyncQueue(): Promise<void> {
  const pending = await getPendingSyncItems()
  for (const item of pending) {
    // A rejected fetch propagates, which also triggers a retry
    const response = await fetch("/api/sync", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(item),
    })
    if (!response.ok) {
      // Rejecting the waitUntil promise tells the browser to retry later
      throw new Error(`Sync failed with status ${response.status}`)
    }
    await markSynced(item.id)
  }
}
```
**Background Sync limitations**:
- Chromium-based browsers only (as of 2024); Firefox and Safari do not support it
- No guarantee of timing—browser decides when to fire sync event
- Limited to ~3 minutes of execution time
- Requires Service Worker to be registered
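Because support is limited, registration is usually wrapped in feature detection with a fallback path. The sketch below assumes a hypothetical `scheduleSync` helper and a caller-supplied `runFallback` callback (for example, one that flushes the queue on the next `online` event); the structural `SyncCapableRegistration` type stands in for `ServiceWorkerRegistration`.

```typescript
// Structural type: only the part of ServiceWorkerRegistration this sketch touches
interface SyncCapableRegistration {
  sync?: { register(tag: string): Promise<void> }
}

type SyncPath = "background-sync" | "fallback"

async function scheduleSync(
  registration: SyncCapableRegistration,
  tag: string,
  runFallback: () => Promise<void>, // hypothetical fallback supplied by the caller
): Promise<SyncPath> {
  if (registration.sync) {
    try {
      await registration.sync.register(tag)
      return "background-sync"
    } catch {
      // Registration can be rejected (for example, permission denied); fall through
    }
  }
  await runFallback()
  return "fallback"
}
```

Returning which path was taken makes it easy to log how often the fallback runs across browsers.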
**Periodic Background Sync**: Allows periodic sync even when the app is closed. It requires explicit permission and is available only in Chromium-based browsers:
```typescript
// Check support and register
const registration = await navigator.serviceWorker.ready
// periodicSync lives on the ServiceWorkerRegistration, not on navigator.serviceWorker
if ("periodicSync" in registration) {
  const status = await navigator.permissions.query({
    name: "periodic-background-sync" as PermissionName,
  })
  if (status.state === "granted") {
    await registration.periodicSync.register("sync-content", {
      minInterval: 24 * 60 * 60 * 1000, // 24 hours minimum
    })
  }
}
```
### Workbox
Workbox (Google) encapsulates Service Worker patterns in a production-ready library. It's used by ~54% of mobile sites with Service Workers.
```typescript collapse={1-8, 25-35}
import { precacheAndRoute } from "workbox-precaching"
import { registerRoute } from "workbox-routing"
import { CacheFirst, NetworkFirst, StaleWhileRevalidate } from "workbox-strategies"
import { BackgroundSyncPlugin } from "workbox-background-sync"
import { ExpirationPlugin } from "workbox-expiration"

// Precache app shell (injected at build time)
precacheAndRoute(self.__WB_MANIFEST)

// API calls: network-first with background sync fallback
registerRoute(
  ({ url }) => url.pathname.startsWith("/api/"),
  new NetworkFirst({
    cacheName: "api-cache",
    plugins: [
      new BackgroundSyncPlugin("api-queue", {
        maxRetentionTime: 24 * 60, // 24 hours (value is in minutes)
      }),
    ],
  }),
)

// Images: cache-first
registerRoute(
  ({ request }) => request.destination === "image",
  new CacheFirst({
    cacheName: "images",
    plugins: [
      new ExpirationPlugin({
        maxEntries: 100,
        maxAgeSeconds: 30 * 24 * 60 * 60, // 30 days
      }),
    ],
  }),
)
```
**Why use Workbox**:
- Handles cache versioning and cleanup automatically
- Precaching with revision hashing
- Built-in plugins for expiration, broadcast updates, background sync
- Webpack/Vite integration for build-time manifest generation
## Sync Strategies
When clients make changes offline, syncing those changes creates the hardest problems in offline-first architecture.
### Last-Write-Wins (LWW)
The simplest conflict resolution: most recent timestamp wins.
```typescript
interface Document {
  id: string
  content: string
  updatedAt: number // Unix timestamp
}

function resolveConflict(local: Document, remote: Document): Document {
  return local.updatedAt > remote.updatedAt ? local : remote
}
```
**When LWW works**:
- Single-user applications
- Data where loss is acceptable (analytics, logs)
- Coarse-grained updates (entire document, not fields)
**When LWW fails**:
- Multi-user editing (Alice's changes overwrite Bob's)
- Fine-grained updates (field-level changes lost)
- Clock skew between clients causes wrong "winner"
**Clock skew problem**: Client clocks can drift. A device with clock set to the future always wins. Solutions:
- Use server timestamps (but requires connectivity)
- Hybrid logical clocks (Lamport timestamp + physical time)
- Vector clocks (discussed below)
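The hybrid logical clock option can be sketched in a few lines. This follows the standard HLC update rules as I understand them; the `HLC` type and the `tick`, `receive`, and `compareHLC` names are chosen for this example, and the caller passes in the physical time so the logic stays deterministic.

```typescript
interface HLC {
  wallTime: number // highest physical time seen so far (ms)
  counter: number // logical tie-breaker within the same wallTime
}

// Advance the clock for a local event or before sending a message
function tick(clock: HLC, now: number): HLC {
  if (now > clock.wallTime) return { wallTime: now, counter: 0 }
  return { wallTime: clock.wallTime, counter: clock.counter + 1 }
}

// Merge a timestamp received from another node
function receive(clock: HLC, remote: HLC, now: number): HLC {
  const wallTime = Math.max(clock.wallTime, remote.wallTime, now)
  if (wallTime === clock.wallTime && wallTime === remote.wallTime) {
    return { wallTime, counter: Math.max(clock.counter, remote.counter) + 1 }
  }
  if (wallTime === clock.wallTime) return { wallTime, counter: clock.counter + 1 }
  if (wallTime === remote.wallTime) return { wallTime, counter: remote.counter + 1 }
  return { wallTime, counter: 0 }
}

// Total order: wall time first, then counter
function compareHLC(a: HLC, b: HLC): number {
  return a.wallTime - b.wallTime || a.counter - b.counter
}
```

A timestamp produced after receiving a message always sorts after that message, even if the local physical clock is behind. This does not by itself reject a wildly wrong remote clock; production systems typically also cap how far ahead a received `wallTime` may be.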
### Vector Clocks
Vector clocks track causality—which events "happened before" others—without synchronized physical clocks.
```typescript
type VectorClock = Map<string, number>

function increment(clock: VectorClock, nodeId: string): VectorClock {
  const newClock = new Map(clock)
  newClock.set(nodeId, (newClock.get(nodeId) ?? 0) + 1)
  return newClock
}

function merge(a: VectorClock, b: VectorClock): VectorClock {
  const result = new Map(a)
  for (const [nodeId, count] of b) {
    result.set(nodeId, Math.max(result.get(nodeId) ?? 0, count))
  }
  return result
}

function compare(a: VectorClock, b: VectorClock): "before" | "after" | "equal" | "concurrent" {
  let aBefore = false
  let bBefore = false
  const allNodes = new Set([...a.keys(), ...b.keys()])
  for (const nodeId of allNodes) {
    const aCount = a.get(nodeId) ?? 0
    const bCount = b.get(nodeId) ?? 0
    if (aCount < bCount) aBefore = true
    if (bCount < aCount) bBefore = true
  }
  if (aBefore && !bBefore) return "before"
  if (bBefore && !aBefore) return "after"
  if (!aBefore && !bBefore) return "equal" // Identical histories, no conflict
  return "concurrent" // True conflict
}
```
**Vector clocks detect conflicts but don't resolve them**: When `compare` returns `'concurrent'`, you have a true conflict that needs application-specific resolution.
**Space overhead**: Vector clocks grow with the number of writers. For N clients, each clock holds up to N entries, so the metadata attached to every value is O(N). Dynamo-style systems use "version vectors" with pruning.
### Operational Transform (OT)
Operational Transform models changes as operations that can be transformed when concurrent.
**How OT works**:
1. Client captures operations: `insert('Hello', position: 0)`
2. Client applies operation locally (optimistic update)
3. Client sends operation to server
4. Server transforms operation against concurrent operations
5. Server broadcasts transformed operation to other clients
```typescript
interface TextOperation {
  type: "insert" | "delete"
  position: number
  text?: string // for insert
  length?: number // for delete
}

// Transform op1 given op2 was applied first
function transform(op1: TextOperation, op2: TextOperation): TextOperation {
  if (op2.type === "insert") {
    if (op1.position >= op2.position) {
      return { ...op1, position: op1.position + op2.text!.length }
    }
  } else if (op2.type === "delete") {
    if (op1.position >= op2.position + op2.length!) {
      return { ...op1, position: op1.position - op2.length! }
    }
    // More complex cases: overlapping deletes, etc.
  }
  return op1
}
```
**OT requires central coordination**: The server maintains operation history and performs transforms. This means OT doesn't work for true peer-to-peer or extended offline scenarios.
**OT complexity**: Transformation functions are notoriously difficult to get right. Google Docs has had OT-related bugs despite years of engineering. The transformation must satisfy mathematical properties (convergence, intention preservation) that are hard to verify.
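The convergence property can be checked on a tiny insert-only case. This sketch is deliberately narrower than a real OT engine: it handles only inserts, and the `clientId` tie-break for equal positions is an assumption added so that both sites resolve ties the same way. All names here are illustrative.

```typescript
interface InsertOp {
  position: number
  text: string
  clientId: string // used only to break ties deterministically
}

function applyInsert(doc: string, op: InsertOp): string {
  return doc.slice(0, op.position) + op.text + doc.slice(op.position)
}

// Transform `op` so it can be applied after `applied` has already been applied
function transformInsert(op: InsertOp, applied: InsertOp): InsertOp {
  const shifts =
    applied.position < op.position ||
    (applied.position === op.position && applied.clientId < op.clientId)
  return shifts ? { ...op, position: op.position + applied.text.length } : op
}

const baseDoc = "Hello"
const opA: InsertOp = { position: 5, text: " world", clientId: "a" }
const opB: InsertOp = { position: 0, text: ">> ", clientId: "b" }

// Site A applies opA, then opB transformed against opA; site B does the reverse
const siteA = applyInsert(applyInsert(baseDoc, opA), transformInsert(opB, opA))
const siteB = applyInsert(applyInsert(baseDoc, opB), transformInsert(opA, opB))
```

Both sites end with the same string despite applying the operations in opposite orders. Without the tie-break, two inserts at the same position would each shift past the other and the sites would diverge, which is the kind of subtle case that makes full transform functions hard to verify.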
**Where OT excels**: Real-time collaborative editing with always-on connectivity. Low latency because changes apply immediately with optimistic updates.
### CRDTs (Conflict-free Replicated Data Types)
CRDTs are data structures mathematically designed to merge without conflicts. Any order of applying changes converges to the same result.
**Two types of CRDTs**:
**State-based (CvRDT)**: Replicate entire state, merge using mathematical join.
```typescript
// G-Counter: Grow-only counter
type GCounter = Map<string, number>

function increment(counter: GCounter, nodeId: string): GCounter {
  const newCounter = new Map(counter)
  newCounter.set(nodeId, (newCounter.get(nodeId) ?? 0) + 1)
  return newCounter
}

function merge(a: GCounter, b: GCounter): GCounter {
  const result = new Map(a)
  for (const [nodeId, count] of b) {
    // Per-node max makes merge commutative, associative, and idempotent
    result.set(nodeId, Math.max(result.get(nodeId) ?? 0, count))
  }
  return result
}

function value(counter: GCounter): number {
  return Array.from(counter.values()).reduce((sum, n) => sum + n, 0)
}
```
**Operation-based (CmRDT)**: Replicate operations, apply in any order. Requires reliable delivery (all operations eventually arrive).
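For contrast with the state-based counter, here is a minimal operation-based counter. It is a sketch under one stated assumption: the transport may redeliver operations, so each operation carries a unique `opId` and replicas drop duplicates. The `CounterOp`, `OpCounter`, and `applyOp` names are illustrative.

```typescript
interface CounterOp {
  opId: string // globally unique, e.g. `${nodeId}:${seq}`
  delta: number
}

interface OpCounter {
  value: number
  seen: Set<string> // IDs of operations already applied
}

function applyOp(state: OpCounter, op: CounterOp): OpCounter {
  // Redelivered operations are ignored, so at-least-once delivery is safe
  if (state.seen.has(op.opId)) return state
  return { value: state.value + op.delta, seen: new Set(state.seen).add(op.opId) }
}
```

Addition commutes, so replicas that receive the same set of operations in different orders reach the same value. The price is the `seen` set, which grows until some form of acknowledgement lets old IDs be discarded.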
**Common CRDT types**:
| CRDT | Use Case | Trade-off |
| ------------------------------- | --------------- | ----------------------------- |
| G-Counter | Likes, views | Grow-only |
| PN-Counter | Votes (up/down) | Two G-Counters |
| G-Set | Tags, followers | Grow-only set |
| OR-Set (Observed-Remove) | General sets | Handles concurrent add/remove |
| LWW-Register | Single values | Last-write-wins |
| RGA (Replicated Growable Array) | Text editing | Complex, high overhead |
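The OR-Set row is the least obvious entry in the table, so a compact state-based sketch may help. This is a simplified illustration rather than a production design: tags are supplied by the caller (in practice they would be generated uniquely per add), removed tags are kept forever, and the function names are chosen for this example.

```typescript
interface ORSet<T> {
  adds: Map<T, Set<string>> // element -> unique tags from add operations
  removed: Set<string> // tags whose add was observed and then removed
}

function emptySet<T>(): ORSet<T> {
  return { adds: new Map(), removed: new Set() }
}

function addElement<T>(set: ORSet<T>, element: T, tag: string): ORSet<T> {
  const adds = new Map(set.adds)
  adds.set(element, new Set<string>(adds.get(element) ?? []).add(tag))
  return { adds, removed: set.removed }
}

function removeElement<T>(set: ORSet<T>, element: T): ORSet<T> {
  // Only tags observed locally are removed; concurrent adds elsewhere survive
  const removed = new Set(set.removed)
  for (const tag of set.adds.get(element) ?? []) removed.add(tag)
  return { adds: set.adds, removed }
}

function mergeSets<T>(a: ORSet<T>, b: ORSet<T>): ORSet<T> {
  const adds = new Map<T, Set<string>>()
  for (const source of [a.adds, b.adds]) {
    for (const [element, tags] of source) {
      adds.set(element, new Set<string>([...(adds.get(element) ?? []), ...tags]))
    }
  }
  return { adds, removed: new Set([...a.removed, ...b.removed]) }
}

function hasElement<T>(set: ORSet<T>, element: T): boolean {
  for (const tag of set.adds.get(element) ?? []) {
    if (!set.removed.has(tag)) return true
  }
  return false
}
```

When one replica removes an element while another concurrently re-adds it, the re-add carries a tag the remover never saw, so the element is present after merging. That "add wins" behavior is what distinguishes an OR-Set from a naive two-set design.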
**Text CRDTs**: For collaborative text editing, specialized CRDTs like RGA, WOOT, or Yjs's implementation track character positions with unique IDs that survive concurrent edits.
```typescript
// Simplified RGA node structure
interface RGANode {
  id: { clientId: string; seq: number }
  char: string
  tombstone: boolean // Deleted but kept for ordering
  after: RGANode["id"] | null // Insert position
}
```
**CRDT trade-offs**:
- **Pros**: Mathematically guaranteed convergence, works fully offline, no central server required
- **Cons**: High memory overhead (tombstones, metadata), complex implementation, eventual consistency only
> "CRDTs are the only data structures that can guarantee consistency in a fully decentralized system, but many published algorithms have subtle bugs. It's easy to implement CRDTs badly." — Martin Kleppmann
**Interleaving anomaly**: When two users type "foo" and "bar" at the same position, naive CRDTs may produce "fboaor" instead of "foobar" or "barfoo". Production CRDTs (Yjs, Automerge) handle this with sophisticated tie-breaking.
### Sync Strategy Comparison
| Factor | LWW | Vector Clocks | OT | CRDT |
| ------------------------- | ----------------- | --------------------- | -------------------- | -------------------- |
| Conflict resolution | Automatic (lossy) | Detect only | Server-based | Automatic (lossless) |
| Offline duration | Any | Any | Short (needs server) | Any |
| Implementation complexity | Low | Medium | High | Very High |
| Memory overhead | Low | Medium | Low | High |
| P2P support | Yes | Partial | No | Yes |
| Data loss risk | High | Application-dependent | Low | None |
## Design Paths
### Path 1: Cache-Only PWA
**Architecture**: Service Worker caches static assets and API responses. No local database. Changes require network.
```
Browser → Service Worker → Cache API → (Network when available)
```
**Best for**:
- Read-heavy applications (news, documentation)
- Short offline periods (subway, airplane mode)
- Content that doesn't change offline
**Implementation complexity**:
| Aspect | Effort |
|--------|--------|
| Initial setup | Low |
| Feature additions | Low |
| Sync logic | None |
| Testing | Low |
**Device/network profile**:
- Works well on: All devices, any network
- Struggles on: Extended offline, collaborative editing
**Trade-offs**:
- Simplest implementation
- No sync conflicts
- Limited offline functionality
- Stale data possible
### Path 2: Sync Queue Pattern
**Architecture**: Changes stored in IndexedDB queue, processed when online. Server is source of truth.
```
App → IndexedDB (queue) → Background Sync → Server → IndexedDB (confirmed)
```
**Best for**:
- Form submissions (surveys, orders)
- Single-user data (personal notes, todos)
- Tolerance for occasional conflicts
**Implementation complexity**:
| Aspect | Effort |
|--------|--------|
| Initial setup | Medium |
| Feature additions | Medium |
| Sync logic | Medium (queue management) |
| Testing | Medium |
**Key implementation concerns**:
- Idempotency: Server must handle duplicate submissions
- Ordering: Queue processes FIFO, but network latency can reorder
- Failure handling: Permanent failures need user notification
```typescript collapse={1-8, 30-40}
interface SyncQueueItem {
  id: string
  operation: "create" | "update" | "delete"
  entity: string
  data: unknown
  timestamp: number
  retries: number
}

const MAX_RETRIES = 5 // illustrative limit; tune per application

async function processQueue(): Promise<void> {
  const queue = await getSyncQueue()
  for (const item of queue) {
    try {
      await sendToServer(item)
      await removeFromQueue(item.id)
    } catch (error) {
      if (isPermanentError(error)) {
        await markAsFailed(item.id)
        notifyUser(`Failed to sync: ${item.entity}`)
      } else {
        await incrementRetry(item.id)
        // item.retries holds the count from before this attempt
        if (item.retries + 1 >= MAX_RETRIES) {
          await markAsFailed(item.id)
        }
      }
    }
  }
}
```
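The idempotency concern listed above falls on the server: a retried item must not be applied twice. One common approach is to key results by the client-generated item ID. The sketch below is a hypothetical in-memory version; `handleSyncItem`, the `processed` map, and `applyCount` are assumptions for illustration, and a real server would persist the keys.

```typescript
// Hypothetical server-side state: results keyed by client-generated item ID
const processed = new Map<string, { status: "applied" }>()
let applyCount = 0 // counts real side effects, for illustration only

function handleSyncItem(item: { id: string; entity: string }): {
  status: "applied"
  duplicate: boolean
} {
  const previous = processed.get(item.id)
  if (previous) {
    // Retry of an item that already succeeded: return the stored result, do nothing
    return { ...previous, duplicate: true }
  }
  applyCount += 1 // stand-in for the real write
  processed.set(item.id, { status: "applied" })
  return { status: "applied", duplicate: false }
}
```

With this in place the client can retry freely after a timeout, because it cannot tell whether the first attempt reached the server.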
**Trade-offs**:
- Handles common offline scenarios
- Server-side conflict resolution
- May lose changes on permanent failures
- Doesn't support real-time collaboration
### Path 3: CRDT-Based Local-First
**Architecture**: Local CRDT state is authoritative. Peers sync directly or through relay server. No central source of truth.
```
App → CRDT State (IndexedDB) ↔ Peer/Server ↔ Other Clients' CRDT State
```
**Best for**:
- Collaborative editing (documents, whiteboards)
- P2P applications
- Extended offline with multiple editors
**Implementation complexity**:
| Aspect | Effort |
|--------|--------|
| Initial setup | High |
| Feature additions | High |
| Sync logic | Very High (CRDT implementation) |
| Testing | Very High |
**Library options**:
| Library | Focus | Bundle Size | Mature |
|---------|-------|-------------|--------|
| Yjs | Text/structured data | ~15KB | Yes |
| Automerge | JSON documents | ~100KB | Yes |
| Liveblocks | Real-time + CRDT | SaaS | Yes |
| ElectricSQL | Postgres sync | ~50KB | Emerging |
```typescript collapse={1-5, 25-35}
import * as Y from "yjs"
import { IndexeddbPersistence } from "y-indexeddb"
import { WebsocketProvider } from "y-websocket"
// Create CRDT document
const doc = new Y.Doc()
// Persist to IndexedDB
const persistence = new IndexeddbPersistence("my-doc", doc)
// Sync with server/peers when online
const provider = new WebsocketProvider("wss://sync.example.com", "my-doc", doc)
// Get shared types
const text = doc.getText("content")
const todos = doc.getArray("todos")
// Changes automatically sync
text.insert(0, "Hello")
```
**Trade-offs**:
- True offline-first with guaranteed convergence
- Supports P2P architecture
- Complex implementation
- High memory overhead
- Eventual consistency only (no transactions)
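To make "guaranteed convergence" concrete, the following is a minimal sketch of a state-based grow-only counter (G-Counter), one of the simplest CRDTs. It is not how Yjs or Automerge work internally; the `GCounter` type and the `increment`, `merge`, and `value` helpers are illustrative names chosen for this example.

```typescript
// Minimal state-based G-Counter sketch (illustrative; not Yjs/Automerge internals).
// Each replica increments only its own slot; merge takes the per-replica maximum.
type GCounter = Record<string, number>

function increment(counter: GCounter, replicaId: string, by = 1): GCounter {
  return { ...counter, [replicaId]: (counter[replicaId] ?? 0) + by }
}

function merge(a: GCounter, b: GCounter): GCounter {
  const merged: GCounter = { ...a }
  for (const replicaId of Object.keys(b)) {
    merged[replicaId] = Math.max(merged[replicaId] ?? 0, b[replicaId] ?? 0)
  }
  return merged
}

function value(counter: GCounter): number {
  return Object.keys(counter).reduce((sum, id) => sum + (counter[id] ?? 0), 0)
}

// Two replicas edit while offline, then exchange state in either order.
const phone = increment(increment({}, "phone"), "phone") // { phone: 2 }
const laptop = increment({}, "laptop", 3) // { laptop: 3 }

const mergedOnPhone = merge(phone, laptop)
const mergedOnLaptop = merge(laptop, phone)
```

Because `merge` is commutative, associative, and idempotent, both replicas end up with the same total regardless of delivery order or duplicated messages. Production libraries apply the same idea to far richer data types, which is where the complexity and memory overhead listed above come from.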
### Decision Framework

## Real-World Implementations
### Figma: Canvas-Level Offline
**Challenge**: Complex vector graphics with potentially millions of objects, multiple concurrent editors.
**Approach**:
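- CRDT-inspired multiplayer engine (Figma's own write-up describes a server-authoritative design that borrows from CRDTs rather than a pure decentralized CRDT)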
- 30-day offline window (7 days on Safari due to ITP)
- Changes stored in IndexedDB with timestamp metadata
- On reconnect, changes merge via CRDT semantics
**Technical details**:
- WebAssembly for CRDT operations (performance critical)
- Custom CRDT for vector graphics (not text-focused)
- Selective sync—only download what's viewed, not entire project
- Background prefetch of likely-needed files
**Limitation**: Can't download entire project for offline. Must have previously opened a file within the offline window.
**Key insight**: "The hardest part isn't the CRDT—it's making the UX feel instant while syncing in the background." — Evan Wallace, Figma CTO
### Notion: Block-Based CRDT
**Challenge**: Rich text documents with blocks (paragraphs, code, embeds), tables, and databases.
**Approach**:
- Custom CRDT system (inspired by Martin Kleppmann's research)
- Per-page sync with `lastDownloadedTimestamp` tracking
- Selective sync—only fetch pages with newer `lastUpdatedTime` on server
**Technical details**:
- Peritext integration for rich text formatting CRDTs (handles formatting spans)
- Database views sync separately from underlying data
- 50-row database limit in initial offline version (increased over time)
**Limitation**: Non-text properties (select fields, relations) harder to merge. Some conflict resolution requires user intervention for complex database changes.
**Source**: Notion engineering blog, 2024
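The timestamp comparison behind selective sync can be sketched as below. This is an assumption-laden illustration, not Notion's code: the `PageMeta` shape, the `pagesToFetch` helper, and the use of numeric timestamps are all hypothetical; only the `lastUpdatedTime` and `lastDownloadedTimestamp` names come from the description above.

```typescript
// Hypothetical shapes for illustration only.
interface PageMeta {
  id: string
  lastUpdatedTime: number // server-side modification time (ms since epoch)
}

// lastDownloadedTimestamp per page, as tracked on the client.
type DownloadLog = Map<string, number>

// Return the IDs of pages whose server copy is newer than the local copy.
function pagesToFetch(serverPages: PageMeta[], downloaded: DownloadLog): string[] {
  return serverPages
    .filter((page) => {
      const lastDownloadedTimestamp = downloaded.get(page.id)
      // Never downloaded, or the server has changed since the last download
      return lastDownloadedTimestamp === undefined || page.lastUpdatedTime > lastDownloadedTimestamp
    })
    .map((page) => page.id)
}

const stalePages = pagesToFetch(
  [
    { id: "a", lastUpdatedTime: 100 },
    { id: "b", lastUpdatedTime: 200 },
    { id: "c", lastUpdatedTime: 300 },
  ],
  new Map([
    ["a", 100],
    ["b", 150],
  ]),
)
```

Here page `a` is skipped because nothing changed since it was downloaded, while `b` (updated since) and `c` (never downloaded) are fetched.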
### Linear: Delta Sync
**Challenge**: Project management with issues, projects, and workflows. Must feel instant.
**Approach**:
- Bootstrap process downloads initial state
- WebSocket for incremental delta packets
- IndexedDB for local cache, not full offline editing
- Server maintains authoritative sync ID (incremental integer)
**Technical details**:
- Sync ID increments with each server-side transaction
- Client tracks last seen sync ID, requests deltas since that ID
- Optimistic updates with rollback on server rejection
- Not true offline-first—designed as connectivity failsafe
**Trade-off accepted**: Offline is "best effort"—can view cached data, but edits require eventual connectivity. Simpler than full CRDT but limits offline duration.
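A rough sketch of the client side of this sync-ID scheme follows. Linear's actual wire format is not described here, so the `Delta` shape and the `DeltaSyncClient` class are hypothetical; the sketch only illustrates tracking the last seen sync ID and ignoring deltas that were already applied.

```typescript
// Hypothetical delta shape; a null patch means "entity deleted".
interface Delta {
  syncId: number
  entityId: string
  patch: Record<string, unknown> | null
}

class DeltaSyncClient {
  private lastSyncId = 0
  private readonly entities = new Map<string, Record<string, unknown>>()

  // Apply a batch of deltas in sync-ID order, skipping anything already seen.
  applyDeltas(deltas: Delta[]): void {
    const ordered = [...deltas].sort((a, b) => a.syncId - b.syncId)
    for (const delta of ordered) {
      if (delta.syncId <= this.lastSyncId) continue // duplicate or replayed packet
      if (delta.patch === null) {
        this.entities.delete(delta.entityId)
      } else {
        const current = this.entities.get(delta.entityId) ?? {}
        this.entities.set(delta.entityId, { ...current, ...delta.patch })
      }
      this.lastSyncId = delta.syncId
    }
  }

  // The cursor a client would send when requesting "deltas since this ID".
  get cursor(): number {
    return this.lastSyncId
  }

  get(entityId: string): Record<string, unknown> | undefined {
    return this.entities.get(entityId)
  }
}

const client = new DeltaSyncClient()
client.applyDeltas([
  { syncId: 2, entityId: "ISSUE-1", patch: { title: "B" } },
  { syncId: 1, entityId: "ISSUE-1", patch: { title: "A", status: "todo" } },
])
// A replayed packet with an old sync ID is ignored
client.applyDeltas([{ syncId: 1, entityId: "ISSUE-1", patch: { title: "stale" } }])
```

A monotonically increasing cursor makes reconnection cheap: the client asks only for what it missed instead of re-downloading the bootstrap state.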
### Excalidraw: P2P with localStorage
**Challenge**: Collaborative whiteboard with no backend requirement.
**Approach**:
- Pseudo-P2P: Central server relays end-to-end encrypted messages
- State stored in localStorage (keys: `excalidraw` for objects, `excalidraw-state` for UI)
- Union merge for conflict resolution—all elements from all clients combine
- End-to-end encryption—server never sees content
**Technical details**:
- WebSockets via Socket.IO for message relay
- No server-side storage—all state is client-side
- Room-based collaboration with shareable links
- Works fully offline for local edits
**Limitation**: Union merge means no true delete—"deleted" elements can reappear if another client hadn't seen the delete. Trade-off for simplicity.
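The resurrection problem is easy to reproduce with a small sketch of a union merge. The `SceneElement` shape and the "higher `version` wins" tie-break are assumptions made for this example, not a description of Excalidraw's actual reconciliation code.

```typescript
// Hypothetical element shape for illustration.
interface SceneElement {
  id: string
  version: number
  data: string
}

// Union merge: keep every element seen by any client.
// Assumption: when both sides have the same id, the higher version wins.
function unionMerge(local: SceneElement[], remote: SceneElement[]): SceneElement[] {
  const byId = new Map<string, SceneElement>()
  for (const element of [...local, ...remote]) {
    const existing = byId.get(element.id)
    if (!existing || element.version > existing.version) {
      byId.set(element.id, element)
    }
  }
  return Array.from(byId.values())
}

// Client A removed "rect-1" from its local array; client B never saw the delete.
const clientA: SceneElement[] = [{ id: "text-1", version: 1, data: "hello" }]
const clientB: SceneElement[] = [
  { id: "text-1", version: 1, data: "hello" },
  { id: "rect-1", version: 1, data: "rectangle" },
]

// After merging, "rect-1" is back in client A's scene.
const mergedScene = unionMerge(clientA, clientB)
```

A common way around this, at the cost of extra state, is to keep a tombstone (the element stays in the array with a deleted flag and a bumped version) so the delete itself can win the merge.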
## Browser Constraints Deep Dive
### Storage Quota Management
```typescript
async function manageStorageQuota(): Promise<void> {
  // estimate() may omit either field, so fall back to safe defaults
  const { quota = 0, usage = 0 } = await navigator.storage.estimate()
  if (quota === 0) return
  const usagePercent = (usage / quota) * 100
  if (usagePercent > 80) {
    // Proactive cleanup before hitting quota
    await evictOldCache()
  }
  if (usagePercent > 95) {
    // Critical: may start failing writes
    await aggressiveCleanup()
    notifyUser("Storage nearly full")
  }
}

async function evictOldCache(): Promise<void> {
  const cache = await caches.open("dynamic-cache")
  const requests = await cache.keys()
  // Sort by access time (stored in custom header or IndexedDB metadata)
  const sorted = await sortByLastAccess(requests)
  // Evict oldest 20%
  const toEvict = sorted.slice(0, Math.floor(sorted.length * 0.2))
  await Promise.all(toEvict.map((req) => cache.delete(req)))
}
```
**Quota exceeded handling**: When quota is exceeded, `IndexedDB` and `Cache API` throw `QuotaExceededError`. Always wrap storage operations:
```typescript
async function safeWrite(key: string, value: unknown): Promise<boolean> {
  try {
    await writeToIndexedDB(key, value)
    return true
  } catch (error) {
    // `error` is `unknown` in TypeScript, so narrow before reading `.name`
    if (error instanceof DOMException && error.name === "QuotaExceededError") {
      await evictOldCache()
      try {
        // Retry once after freeing space
        await writeToIndexedDB(key, value)
        return true
      } catch {
        notifyUser("Storage full. Some data may not be saved offline.")
        return false
      }
    }
    throw error
  }
}
```
### Safari's 7-Day Eviction
Safari's ITP deletes all script-writable storage after 7 days without user interaction. Mitigation strategies:
1. **Prompt for PWA installation**: Installed PWAs are exempt from 7-day limit
2. **Request persistent storage**: Not supported in Safari, but doesn't hurt
3. **Design for re-sync**: Assume local data may disappear
4. **Track last interaction**: Warn users approaching 7-day cliff
```typescript
const SAFARI_EVICTION_DAYS = 7
const MS_PER_DAY = 1000 * 60 * 60 * 24

function checkEvictionRisk(): { daysRemaining: number; atRisk: boolean } {
  const lastInteraction = localStorage.getItem("lastInteraction")
  if (!lastInteraction) {
    localStorage.setItem("lastInteraction", Date.now().toString())
    return { daysRemaining: SAFARI_EVICTION_DAYS, atRisk: false }
  }
  const daysSince = (Date.now() - Number.parseInt(lastInteraction, 10)) / MS_PER_DAY
  const daysRemaining = SAFARI_EVICTION_DAYS - daysSince
  // Update interaction timestamp
  localStorage.setItem("lastInteraction", Date.now().toString())
  return {
    daysRemaining: Math.max(0, daysRemaining),
    atRisk: daysRemaining < 2,
  }
}
```
### Cross-Browser Testing
Offline-first behavior varies significantly across browsers. Test matrix:
| Scenario | Chrome | Firefox | Safari |
| ----------------------------- | ------------------- | ------------------ | ------------------- |
| Quota exceeded | QuotaExceededError | QuotaExceededError | QuotaExceededError |
| Persistent storage | Auto-grant for PWAs | User prompt | Not supported |
| Background sync | Supported | Not supported | Not supported |
| Service Worker + private mode | Works | Works | Limited |
| IndexedDB in iframe | Works | Works | Blocked (3rd party) |
## Common Pitfalls
### 1. Trusting navigator.onLine
**The mistake**: Using `navigator.onLine` to determine if sync should happen.
```typescript
// Don't do this
if (navigator.onLine) {
  await syncData()
}
```
**Why it fails**: `navigator.onLine` only reports whether the device has an active network interface, not whether the internet is reachable. A LAN without internet access, a captive portal, or a blocking firewall all still yield `true`.
**The fix**: Use actual fetch with timeout as connectivity check:
```typescript
async function canReachServer(): Promise<boolean> {
  const controller = new AbortController()
  // Abort the request if the server has not answered within 5 seconds
  const timeoutId = setTimeout(() => controller.abort(), 5000)
  try {
    const response = await fetch("/api/health", {
      method: "HEAD",
      signal: controller.signal,
      cache: "no-store",
    })
    return response.ok
  } catch {
    // Network error or timeout: treat as unreachable
    return false
  } finally {
    clearTimeout(timeoutId)
  }
}
```
### 2. Ignoring IndexedDB Versioning
**The mistake**: Not handling schema upgrades properly.
```typescript
// Dangerous: no version handling
const request = indexedDB.open("mydb")
request.onsuccess = () => {
  const db = request.result
  const tx = db.transaction("users", "readwrite") // May not exist!
}
```
**Why it fails**: If schema changes, existing users have old schema. Without proper `onupgradeneeded`, code accessing new object stores crashes.
**The fix**: Always increment version and handle migrations:
```typescript
const DB_VERSION = 3 // Increment with each schema change
const request = indexedDB.open("mydb", DB_VERSION)

request.onupgradeneeded = (event) => {
  const db = request.result
  const oldVersion = event.oldVersion
  // Migrate through each version
  if (oldVersion < 1) {
    db.createObjectStore("users", { keyPath: "id" })
  }
  if (oldVersion < 2) {
    db.createObjectStore("settings", { keyPath: "key" })
  }
  if (oldVersion < 3) {
    // Reuse the upgrade transaction to modify an existing store
    const users = request.transaction!.objectStore("users")
    users.createIndex("by_email", "email", { unique: true })
  }
}
```
### 3. Service Worker Update Races
**The mistake**: Using `skipWaiting()` without considering page code compatibility.
**Why it fails**: Old page JavaScript + new Service Worker can have API mismatches. Cached responses may not match expected format.
**The fix**: Either reload the page after Service Worker update, or ensure backward compatibility:
```typescript
// In Service Worker
self.addEventListener("message", (event) => {
  if (event.data === "skipWaiting") {
    self.skipWaiting()
  }
})

// In page
let refreshing = false
navigator.serviceWorker.addEventListener("controllerchange", () => {
  // New SW took over, reload once to ensure consistency
  if (refreshing) return // guard against repeated reloads
  refreshing = true
  window.location.reload()
})
```
### 4. Unbounded Storage Growth
**The mistake**: Caching without eviction policy.
```typescript
// Grows forever
const cache = await caches.open("api-responses")
cache.put(request, response) // Never cleaned up
```
**Why it fails**: Eventually hits quota, causing write failures. User experience degrades suddenly rather than gracefully.
**The fix**: Implement LRU or time-based eviction:
```typescript
const MAX_CACHE_ENTRIES = 100
const MAX_CACHE_AGE_MS = 7 * 24 * 60 * 60 * 1000 // 7 days; compare against x-cached-at when sweeping

async function cacheWithEviction(request: Request, response: Response): Promise<void> {
  const cache = await caches.open("api-responses")
  const keys = await cache.keys()
  // Evict if over limit
  if (keys.length >= MAX_CACHE_ENTRIES) {
    await cache.delete(keys[0]) // FIFO (keys are in insertion order), or implement LRU
  }
  // Store with timestamp
  const headers = new Headers(response.headers)
  headers.set("x-cached-at", Date.now().toString())
  const newResponse = new Response(response.body, {
    status: response.status,
    headers,
  })
  await cache.put(request, newResponse)
}
```
### 5. Sync Conflict Denial
**The mistake**: Assuming conflicts won't happen because "users don't edit the same thing."
**Why it fails**: Conflicts happen when the same user edits on multiple devices, when sync is delayed, or when retries duplicate operations.
**The fix**: Design for conflicts from the start:
- Use idempotent operations with unique IDs
- Implement conflict detection and resolution UI
- Log conflicts for debugging
- Test with simulated network partitions
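The first item, idempotent operations with unique IDs, can be sketched as follows. The `Operation` shape and the `IdempotentCounterStore` class are hypothetical and kept in memory for brevity; a real server would persist the set of applied IDs alongside the data.

```typescript
// Hypothetical operation shape: the client generates opId once and reuses it on retry.
interface Operation {
  opId: string
  entityId: string
  delta: number
}

class IdempotentCounterStore {
  private readonly appliedOpIds = new Set<string>()
  private readonly totals = new Map<string, number>()
  // Duplicates are logged rather than silently dropped, to help debugging
  readonly duplicateLog: string[] = []

  // Returns true if the operation was applied, false if it was a duplicate.
  apply(op: Operation): boolean {
    if (this.appliedOpIds.has(op.opId)) {
      this.duplicateLog.push(op.opId)
      return false
    }
    this.appliedOpIds.add(op.opId)
    this.totals.set(op.entityId, (this.totals.get(op.entityId) ?? 0) + op.delta)
    return true
  }

  total(entityId: string): number {
    return this.totals.get(entityId) ?? 0
  }
}

const store = new IdempotentCounterStore()
const addFive: Operation = { opId: "op-1", entityId: "cart-42", delta: 5 }

store.apply(addFive) // first delivery
store.apply(addFive) // retry after a timeout: same opId, so it is ignored
```

Without the `opId` check, the retried request would add 5 twice, which is exactly the duplicate-operation conflict described above.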
## Conclusion
Offline-first architecture inverts the traditional web assumption: data lives locally, sync is background, and network is optional. This enables responsive UX regardless of connectivity but introduces complexity in storage management, sync strategies, and conflict resolution.
Key architectural decisions:
**Storage choice**: IndexedDB for structured data with indexing needs, OPFS for binary files and performance-critical access, Cache API for HTTP responses. All share origin quota—monitor and manage proactively.
**Sync strategy**: LWW for simple, loss-tolerant cases. Sync queues for form-style interactions. OT for real-time collaboration with reliable connectivity. CRDTs for true offline-first with guaranteed convergence.
**Browser reality**: Safari's 7-day eviction breaks long-term offline. Persistent storage is unreliable. `navigator.onLine` cannot confirm real connectivity. Design for data loss and re-sync.
The technology is mature—Yjs, Automerge, and Workbox provide production-ready foundations. The complexity is in choosing the right trade-offs for your use case and handling the edge cases that browser APIs don't abstract away.
## Appendix
### Prerequisites
- **Browser storage APIs**: localStorage, IndexedDB concepts
- **Service Workers**: Basic lifecycle and fetch interception
- **Distributed systems basics**: Consistency models, network partitions
- **Promises/async**: Modern JavaScript async patterns
### Terminology
- **CmRDT**: Commutative/operation-based CRDT—replicate operations, apply in any order
- **CvRDT**: Convergent/state-based CRDT—replicate state, merge with join function
- **ITP**: Intelligent Tracking Prevention—Safari's privacy feature that limits storage
- **LWW**: Last-Write-Wins—conflict resolution where latest timestamp wins
- **OPFS**: Origin Private File System—browser file system API
- **OT**: Operational Transform—sync strategy that transforms concurrent operations
- **PWA**: Progressive Web App—web app with offline capability via Service Worker
- **Tombstone**: Marker for deleted item in CRDT—kept for ordering, never truly removed
### Summary
- **Local-first data model**: Application reads/writes to IndexedDB or OPFS immediately; network sync is asynchronous
- **Service Workers**: Intercept requests, implement caching strategies (cache-first, network-first, stale-while-revalidate), enable background sync
- **Storage constraints**: Quotas vary (Safari ~1GB, Chrome 60% disk); Safari evicts after 7 days without interaction; persistent storage helps but isn't guaranteed
- **Conflict resolution**: LWW loses data; OT requires server; CRDTs guarantee convergence but are complex; choose based on offline duration and collaboration needs
- **Production patterns**: Figma uses CRDTs with 30-day window; Notion uses CRDTs with selective sync; Linear uses delta sync (not true offline-first); Excalidraw uses union merge with localStorage
### References
- [Service Workers Spec - W3C](https://www.w3.org/TR/service-workers/) - Normative specification
- [The Offline Cookbook - Jake Archibald](https://jakearchibald.com/2014/offline-cookbook/) - Canonical caching patterns
- [IndexedDB API - W3C](https://www.w3.org/TR/IndexedDB/) - Storage specification
- [CRDTs: The Hard Parts - Martin Kleppmann](https://martin.kleppmann.com/2020/07/06/crdt-hard-parts-hydra.html) - CRDT design challenges
- [Workbox Documentation - Google](https://developer.chrome.com/docs/workbox/) - Service Worker library
- [Storage Quotas and Eviction - MDN](https://developer.mozilla.org/en-US/docs/Web/API/Storage_API/Storage_quotas_and_eviction_criteria) - Browser storage limits
- [Origin Private File System - web.dev](https://web.dev/articles/origin-private-file-system) - OPFS guide
- [Full Third-Party Cookie Blocking - WebKit](https://webkit.org/blog/10218/full-third-party-cookie-blocking-and-more/) - Safari's 7-day storage limit
- [Figma's Multiplayer Technology - Evan Wallace](https://www.figma.com/blog/how-figmas-multiplayer-technology-works/) - Production CRDT implementation
- [How We Made Notion Available Offline - Notion Engineering](https://www.notion.com/blog/how-we-made-notion-available-offline) - Block-based CRDT sync
- [Linear's Sync Engine - Reverse Engineering](https://marknotfound.com/posts/reverse-engineering-linears-sync-magic/) - Delta sync approach
- [Excalidraw P2P Collaboration](https://plus.excalidraw.com/blog/building-excalidraw-p2p-collaboration-feature) - Union merge pattern
- [CRDT Papers Collection](https://crdt.tech/papers.html) - Academic CRDT research
- [Yjs Documentation](https://docs.yjs.dev/) - Production CRDT library
- [Automerge](https://automerge.org/) - JSON CRDT library
---
## Client State Management
**URL:** https://sujeet.pro/legacy-personal-site/v4/articles/system-design/frontend-system-design/client-state-management
**Category:** System Design / Frontend System Design
**Description:** Choosing the right state management approach requires understanding that “state” is not monolithic—different categories have fundamentally different requirements. Server state needs caching, deduplication, and background sync. UI state needs fast updates and component isolation. Form state needs validation and dirty tracking. Conflating these categories is the root cause of most state management complexity.
# Client State Management
Choosing the right state management approach requires understanding that "state" is not monolithic—different categories have fundamentally different requirements. Server state needs caching, deduplication, and background sync. UI state needs fast updates and component isolation. Form state needs validation and dirty tracking. Conflating these categories is the root cause of most state management complexity.

State categories map to specialized tools—using the wrong tool for a category creates unnecessary complexity.
## Abstract
Client state management reduces to one principle: **match the tool to the state category**.
- **Server state** (API data) → Use TanStack Query or SWR. These handle caching, deduplication, background refetch, and optimistic updates. Trying to manage server state with Redux or Context creates cache invalidation nightmares.
- **UI state** → Start with `useState`. Escalate to Context for cross-component sharing, Zustand for module-level state, or Jotai for fine-grained reactivity. Global state libraries are overkill for most UI state.
- **Form state** → Use React Hook Form for performance-critical forms (uncontrolled inputs, minimal re-renders). Built-in browser validation covers most cases.
- **URL state** → Store filters, pagination, and view modes in query parameters. Makes state shareable and survives refresh.
- **Derived state** → Compute, don't store. Use selectors with memoization to avoid stale data and reduce state surface area.
The 2025 consensus: push server state to specialized libraries, keep global state minimal, and let URLs carry shareable state.
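As an illustration of "compute, don't store", here is a minimal memoized selector sketch. `createSelector` below is a hand-rolled helper written for this example (libraries such as Reselect provide more capable versions), and the `CartState` shape is hypothetical.

```typescript
// Minimal single-input memoized selector (illustrative helper, not a library API).
function createSelector<S, I, R>(pick: (state: S) => I, compute: (input: I) => R): (state: S) => R {
  let hasRun = false
  let lastInput: I | undefined
  let lastResult: R | undefined
  return (state: S): R => {
    const input = pick(state)
    // Recompute only when the picked input changes by reference
    if (!hasRun || input !== lastInput) {
      lastResult = compute(input)
      lastInput = input
      hasRun = true
    }
    return lastResult as R
  }
}

// Hypothetical state shape
interface CartState {
  items: { price: number; qty: number }[]
  couponCode: string
}

let computeCount = 0
const selectTotal = createSelector(
  (state: CartState) => state.items,
  (items) => {
    computeCount++
    return items.reduce((sum, item) => sum + item.price * item.qty, 0)
  },
)

const items = [
  { price: 10, qty: 2 },
  { price: 5, qty: 1 },
]
const totalA = selectTotal({ items, couponCode: "" })
const totalB = selectTotal({ items, couponCode: "SAVE10" }) // items unchanged: cached result reused
```

The total is never stored in state, so it cannot drift out of sync with `items`, and the memoization keeps unrelated updates (such as `couponCode`) from triggering a recompute.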
## State Categories
Not all state is equal. Each category has distinct characteristics that dictate the right management approach.
### Server State
Data fetched from APIs. Characteristics:
| Property | Implication |
| -------------- | ---------------------------------- |
| Asynchronous | Needs loading/error states |
| Shared | Multiple components read same data |
| Cacheable | Avoid redundant fetches |
| Stale | Data can become outdated |
| Owned remotely | Server is source of truth |
**Why specialized tools exist:** Managing server state with `useState` or Redux requires manually handling caching, deduplication, background refetch, cache invalidation, and optimistic updates. TanStack Query and SWR solve these problems out of the box.
### UI State
Local, synchronous state for user interactions:
- Modal open/closed
- Dropdown selections
- Accordion expansion
- Animation states
- Component-specific flags
**Design principle:** UI state should live as close to where it's used as possible. Lifting state to global stores creates unnecessary coupling and re-renders.
### Form State
Specialized category with unique requirements:
| Requirement | Why It Matters |
| ---------------- | ---------------------------- |
| Dirty tracking | Show unsaved changes warning |
| Validation | Field-level and form-level |
| Submission state | Disable button, show loading |
| Error mapping | Associate errors with fields |
| Array fields | Dynamic add/remove rows |
**Why form libraries exist:** Controlled inputs backed by `useState` re-render the component on every keystroke. Libraries like React Hook Form use uncontrolled inputs, keeping field values in the DOM and re-rendering only when validation or submission state changes.
### URL State
Query parameters and path segments that represent application state:
```
/products?category=electronics&sort=price&page=2
```
**Why URL state matters:**
1. **Shareable** — Users can share filtered views
2. **Bookmarkable** — Browser back/forward works correctly
3. **Server-renderable** — Initial state from URL, no hydration mismatch
4. **Debuggable** — State visible in address bar
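A small sketch of round-tripping filter state through the query string with the standard `URLSearchParams` API is shown below. The `ProductFilters` shape, the helper names, and the `"relevance"` default sort are assumptions made for the example.

```typescript
// Hypothetical filter shape matching the example URL above.
interface ProductFilters {
  category: string | null
  sort: string
  page: number
}

const DEFAULT_SORT = "relevance" // assumed default, omitted from the URL

function parseFilters(search: string): ProductFilters {
  const params = new URLSearchParams(search)
  const page = Number.parseInt(params.get("page") ?? "1", 10)
  return {
    category: params.get("category"),
    sort: params.get("sort") ?? DEFAULT_SORT,
    // Fall back to page 1 on missing or malformed input
    page: Number.isNaN(page) || page < 1 ? 1 : page,
  }
}

function serializeFilters(filters: ProductFilters): string {
  const params = new URLSearchParams()
  if (filters.category) params.set("category", filters.category)
  if (filters.sort !== DEFAULT_SORT) params.set("sort", filters.sort)
  if (filters.page > 1) params.set("page", String(filters.page))
  const query = params.toString()
  return query ? `?${query}` : ""
}

const filters = parseFilters("?category=electronics&sort=price&page=2")
const roundTripped = serializeFilters(filters)
```

Omitting default values keeps shared URLs short, and validating on parse means a hand-edited or stale link degrades to sensible defaults instead of breaking the view.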
### Persistent State
State that survives page refresh or browser close:
| Storage | Capacity | Persistence | Use Case |
| -------------- | ------------ | ------------ | ----------------------- |
| localStorage | 5MB | Permanent | Preferences, tokens |
| sessionStorage | 5MB | Tab lifetime | Temporary wizard state |
| IndexedDB | ~50% of disk | Permanent | Large datasets, offline |
| Cookies | 4KB | Configurable | Auth tokens (httpOnly) |
## Server State Management
### The Stale-While-Revalidate Pattern
SWR (the library) is named after RFC 5861's `stale-while-revalidate` cache directive. The pattern:
1. Return cached (potentially stale) data immediately
2. Revalidate in the background
3. Update UI when fresh data arrives
**Why this design:** Users see instant responses while data freshness is maintained. The alternative—waiting for network—creates perceived slowness even on fast connections.
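The three steps can be sketched in a few lines. This is a simplified illustration of the pattern, not the internals of SWR or TanStack Query; the `SwrCache` class and `SwrResult` shape are hypothetical, and error handling and deduplication are left out.

```typescript
interface SwrResult<T> {
  data: T | undefined // cached value, possibly stale (undefined on first read)
  revalidation: Promise<T> // resolves with fresh data
}

class SwrCache<T> {
  private readonly cache = new Map<string, T>()
  private readonly fetcher: (key: string) => Promise<T>

  constructor(fetcher: (key: string) => Promise<T>) {
    this.fetcher = fetcher
  }

  read(key: string): SwrResult<T> {
    // 1. Return cached (potentially stale) data immediately
    const data = this.cache.get(key)
    // 2. Revalidate in the background
    const revalidation = this.fetcher(key).then((fresh) => {
      this.cache.set(key, fresh)
      // 3. The caller updates the UI when this promise resolves
      return fresh
    })
    return { data, revalidation }
  }
}

// Usage with a stand-in fetcher that returns a new version on each call
let fetchCount = 0
const profileCache = new SwrCache<string>(async (key) => `${key}-v${++fetchCount}`)
const firstRead = profileCache.read("user") // nothing cached yet, fetch starts
```

On the first read there is nothing to show until the fetch resolves; every later read returns the previous value synchronously while a fresh one is fetched.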
### TanStack Query Deep Dive
As of TanStack Query v5, the core caching model uses two time-based controls:
| Setting | Default | Purpose |
| ----------- | ------- | ------------------------------------- |
| `staleTime` | 0 | How long data is considered fresh |
| `gcTime` | 5 min | How long inactive data stays in cache |
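**Recommended relationship:** keep `gcTime` ≥ `staleTime`. `gcTime` only starts counting once a query has no active observers. If inactive data is garbage collected while it would still count as fresh, the next mount has nothing cached to show and must fetch from scratch.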
```ts title="query-config.ts" collapse={1-2,12-20}
import { QueryClient } from "@tanstack/react-query"

// Aggressive refetching (default): always revalidate
const defaultConfig = {
  staleTime: 0, // Data immediately stale
  gcTime: 5 * 60_000, // Keep in cache 5 minutes
}

// Conservative refetching: stable data
const stableDataConfig = {
  staleTime: 15 * 60_000, // Fresh for 15 minutes
  gcTime: 30 * 60_000, // Cache for 30 minutes
}

const queryClient = new QueryClient({
  defaultOptions: {
    queries: defaultConfig,
  },
})
```
**When staleTime > 0 makes sense:**
- User profile data (changes infrequently)
- Configuration/feature flags
- Reference data (countries, currencies)
- Data with explicit invalidation triggers
**Design rationale behind staleTime: 0 default:** TanStack Query chose safety over efficiency. Stale data causes bugs that are hard to trace; extra network requests are visible and debuggable.
### Automatic Refetch Triggers
TanStack Query refetches stale queries on:
| Trigger | Default | Rationale |
| ---------------------- | ------- | ------------------------------------ |
| `refetchOnMount` | true | Component might show stale data |
| `refetchOnWindowFocus` | true | User returned, data may have changed |
| `refetchOnReconnect` | true | Network was down, data may be stale |
**Production consideration:** Disable `refetchOnWindowFocus` for data that rarely changes or where refetch is expensive:
```ts collapse={1-3}
import { useQuery } from "@tanstack/react-query"
useQuery({
  queryKey: ["analytics", "dashboard"],
  queryFn: fetchDashboard,
  refetchOnWindowFocus: false, // Heavy query, explicit refresh only
  staleTime: 5 * 60_000,
})
```
### Request Deduplication
TanStack Query automatically deduplicates in-flight requests:
```ts collapse={1-4}
// Both components mount simultaneously
// Only ONE network request is made
// Component A
const { data } = useQuery({ queryKey: ["user", 1], queryFn: fetchUser })
// Component B (mounts at same time)
const { data } = useQuery({ queryKey: ["user", 1], queryFn: fetchUser })
// ↑ Shares the in-flight request from Component A
```
**How it works:** Query keys are serialized and compared. If a request for that key is already pending, subsequent calls wait for the same Promise.
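A stripped-down sketch of that mechanism is shown below. It is not TanStack Query's implementation: `dedupedFetch` and the `inFlight` map are hypothetical names, and plain `JSON.stringify` stands in for the library's own key hashing, which is more careful (for example about object key order).

```typescript
type QueryKey = ReadonlyArray<unknown>

// Pending requests, indexed by serialized query key
const inFlight = new Map<string, Promise<unknown>>()

function dedupedFetch<T>(queryKey: QueryKey, queryFn: () => Promise<T>): Promise<T> {
  const hash = JSON.stringify(queryKey) // simplification of real key hashing
  const pending = inFlight.get(hash)
  if (pending) {
    // A request for this key is already running: share its Promise
    return pending as Promise<T>
  }

  const request = queryFn()
  inFlight.set(hash, request)
  // Forget the request once it settles so later calls fetch again
  const clear = (): void => {
    inFlight.delete(hash)
  }
  request.then(clear, clear)
  return request
}

// Two "components" ask for the same key at the same time
let networkCalls = 0
const fetchUser = (): Promise<{ id: number }> => {
  networkCalls++
  return Promise.resolve({ id: 1 })
}

const fromComponentA = dedupedFetch(["user", 1], fetchUser)
const fromComponentB = dedupedFetch(["user", 1], fetchUser)
```

Both callers receive the same Promise, so `fetchUser` runs once. A call with a different key, such as `["user", 2]`, starts its own request.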
### Cache Invalidation Strategies
Invalidation marks queries as stale, triggering refetch if they're currently rendered.
```ts title="invalidation-patterns.ts" collapse={1-3,25-30}
import { useQueryClient } from "@tanstack/react-query"

const queryClient = useQueryClient()

// Invalidate all queries starting with 'todos'
queryClient.invalidateQueries({ queryKey: ["todos"] })

// Invalidate exact key only
queryClient.invalidateQueries({
  queryKey: ["todos", "list"],
  exact: true,
})

// Predicate-based invalidation
queryClient.invalidateQueries({
  predicate: (query) =>
    query.queryKey[0] === "todos" && query.state.data?.some((todo) => todo.assignee === userId),
})

// Invalidate and refetch immediately (don't wait for render)
queryClient.invalidateQueries({
  queryKey: ["todos"],
  refetchType: "all", // Also refetch inactive queries
})
```
**Design decision:** Invalidation is separate from refetching. `invalidateQueries` marks matching queries as stale and, by default, refetches only those that are currently rendered; inactive queries refetch the next time they mount. Use `refetchQueries` for an immediate network call regardless of render state.
### Optimistic Updates
Update the UI immediately, roll back on error:
```ts title="optimistic-update.ts" collapse={1-5,35-45}
import { useMutation, useQueryClient } from "@tanstack/react-query"

interface Todo {
  id: string
  title: string
  completed: boolean
}

const queryClient = useQueryClient()
const mutation = useMutation({
  mutationFn: updateTodo,
  onMutate: async (newTodo: Todo) => {
    // Cancel outgoing refetches (they'd overwrite optimistic update)
    await queryClient.cancelQueries({ queryKey: ["todos"] })
    // Snapshot previous value
    const previous = queryClient.getQueryData<Todo[]>(["todos"])
    // Optimistically update
    queryClient.setQueryData<Todo[]>(["todos"], (old) =>
      old?.map((t) => (t.id === newTodo.id ? newTodo : t)),
    )
    // Return snapshot for rollback
    return { previous }
  },
  onError: (err, newTodo, context) => {
    // Roll back on error
    queryClient.setQueryData(["todos"], context?.previous)
  },
  onSettled: () => {
    // Refetch to ensure server state
    queryClient.invalidateQueries({ queryKey: ["todos"] })
  },
})
```
**Why `onSettled` invalidation:** Even on success, the server might have modified the data (timestamps, computed fields). Invalidation ensures the cache matches server state.
### SWR Comparison
SWR takes a more minimal approach:
| Feature | TanStack Query | SWR |
| --------------------- | ------------------ | --------------------- |
| Devtools | Built-in | Community |
| Mutations | `useMutation` hook | Manual with `mutate` |
| Infinite queries | `useInfiniteQuery` | `useSWRInfinite` |
| Optimistic updates | Via `onMutate` | Via `optimisticData` |
| Request deduplication | Automatic | Configurable interval |
| Bundle size | ~13KB | ~4KB |
**When to choose SWR:** Simpler apps, smaller bundle priority, preference for minimal API. SWR's `mutate` function handles most cases without dedicated mutation hooks.
## Global UI State
### When Global State Is Necessary
Global state is appropriate when:
1. **Multiple unrelated components** need the same data
2. **No common ancestor** can reasonably hold the state
3. **State persists** across route changes
Common legitimate cases:
- Authentication state
- Theme/appearance settings
- Feature flags
- Toast/notification queue
- Modal registry
### When to Avoid Global State
**Server state doesn't belong in global stores.** This was the primary mistake of early Redux applications. Patterns like "fetch in action creator, store in Redux" create:
- Manual cache invalidation logic
- No request deduplication
- No background refetch
- Complex loading state management
> "With the hooks provided by React, you can share these pieces of application state with the rest of the application. No need for a library to do it for you."
> — Kent C. Dodds, "Application State Management with React"
### React Context Limitations
Context is not optimized for frequent updates:
```ts title="context-problem.tsx" collapse={1-3}
import { createContext, useContext, useState } from 'react'
interface AppState {
user: User
theme: 'light' | 'dark'
notifications: Notification[]
}
const AppContext = createContext<AppState | null>(null)
function App() {
const [state, setState] = useState