LLM Router Architecture: From Distributed Systems Patterns to a Production Circuit Breaker in Redis

Published: 2026-06-23 (Last updated: 2026-06-24)


I recently spent some time digging into LLM Routers and AI Gateways. What stood out was how much of their architecture borrows directly from distributed systems patterns. That connection isn’t new — it’s been written about before. What I wanted to do here is go one level deeper: pick one piece of it, the Circuit Breaker, and follow it all the way through to a working implementation.

What an LLM Router actually is

When I first looked at LLM Routers, my mental model was something like “a convenient wrapper around multiple LLM providers.” Once I broke it down, though, the real shape of it turned out to be an AI-native evolution of the Layer 7 (application-layer) load balancer / reverse proxy. Think of it as the traffic-control paradigm that NGINX and Envoy have handled for years, rebuilt to fit a payload — the LLM request — that’s non-linear and unpredictable.

Mapping it to existing distributed systems patterns

Break an LLM Router down into its core functions, and four map cleanly onto established distributed-systems patterns.

1. API normalization and protocol translation (Reverse Proxy)

The equivalent here is what API gateways like Kong or Envoy already do — gRPC ↔ REST, XML ↔ JSON translation. For an LLM Router, this means converting each provider’s distinct API shape (Gemini, Anthropic, Cohere, and so on) into the de facto industry standard: an OpenAI-compatible schema, in real time.
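As a rough illustration of that translation step, here is a minimal sketch that maps one provider-style reply onto an OpenAI-compatible shape. Both interfaces are simplified, hypothetical subsets chosen for this example (real payloads carry more fields), and `normalizeReply` is an illustrative name, not part of any gateway's API.

```typescript
// Simplified, hypothetical subsets of two response shapes (assumption:
// real provider payloads contain more fields than shown here).
interface AnthropicStyleReply {
  id: string;
  model: string;
  content: { type: "text"; text: string }[];
  stop_reason: "end_turn" | "max_tokens";
  usage: { input_tokens: number; output_tokens: number };
}

interface OpenAIStyleReply {
  id: string;
  object: "chat.completion";
  model: string;
  choices: {
    index: number;
    message: { role: "assistant"; content: string };
    finish_reason: "stop" | "length";
  }[];
  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
}

// Translate one provider-shaped reply into the OpenAI-compatible shape.
function normalizeReply(reply: AnthropicStyleReply): OpenAIStyleReply {
  const text = reply.content.map((block) => block.text).join("");
  const inputTokens = reply.usage.input_tokens;
  const outputTokens = reply.usage.output_tokens;
  return {
    id: reply.id,
    object: "chat.completion",
    model: reply.model,
    choices: [
      {
        index: 0,
        message: { role: "assistant", content: text },
        finish_reason: reply.stop_reason === "max_tokens" ? "length" : "stop",
      },
    ],
    usage: {
      prompt_tokens: inputTokens,
      completion_tokens: outputTokens,
      total_tokens: inputTokens + outputTokens,
    },
  };
}

const normalized = normalizeReply({
  id: "msg_example",
  model: "example-model",
  content: [{ type: "text", text: "Hello" }],
  stop_reason: "end_turn",
  usage: { input_tokens: 12, output_tokens: 3 },
});
console.log(normalized.usage.total_tokens); // 15
```

The point is only that the router owns one canonical shape and every provider adapter converges on it; streaming and tool calls would need their own mapping.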

2. Throttling and rate limiting

The equivalent is Token Bucket / Leaky Bucket. But there’s an LLM-specific wrinkle. A normal API can assume “one request = one unit consumed.” An LLM request is “one request = an unknown number of tokens consumed” — you don’t know the real cost until the response comes back. How to close that gap is what the later sections dig into.

3. Resilience and recovery (Circuit Breaker)

The equivalent is the Circuit Breaker pattern from microservices — first described by Michael Nygard in Release It! (2007), then brought to microservices architecture by Netflix’s Hystrix (2012). On detecting a 429 (rate limited) or 503 (unavailable), the router temporarily cuts off the failing provider, then after a cooldown period uses “the next real request” (what this article will later call the probe) to check whether it has recovered. This article is entirely about how this pattern combines with the Token Bucket above.
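For readers who haven't seen the pattern written out, a minimal single-process sketch of the three states could look like this. `SimpleBreaker` and its method names are hypothetical, and the sketch deliberately ignores concurrency: once the cooldown ends it lets every caller through, which is exactly the gap the rest of this article is about.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

// Hypothetical in-memory breaker; not the Redis implementation from this article.
class SimpleBreaker {
  state: BreakerState = "CLOSED";
  failures = 0;
  openedAt = 0;
  failureThreshold: number;
  cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // May a request be sent at time nowMs?
  allow(nowMs: number): boolean {
    if (this.state === "OPEN") {
      if (nowMs - this.openedAt < this.cooldownMs) return false;
      this.state = "HALF_OPEN"; // cooldown over: the next request acts as the probe
    }
    return true;
  }

  // Feed the outcome of a request back into the breaker.
  record(ok: boolean, nowMs: number): void {
    if (ok) {
      this.state = "CLOSED";
      this.failures = 0;
      return;
    }
    this.failures += 1;
    // A failed probe reopens immediately; otherwise wait for the threshold.
    if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = nowMs;
    }
  }
}

const demoBreaker = new SimpleBreaker(2, 1000);
demoBreaker.record(false, 0);
demoBreaker.record(false, 10); // second failure trips the breaker
const blockedDuringCooldown = demoBreaker.allow(500);
const probeAllowed = demoBreaker.allow(1010); // cooldown elapsed, now HALF_OPEN
demoBreaker.record(true, 1020); // probe succeeded, back to CLOSED
console.log(blockedDuringCooldown, probeAllowed, demoBreaker.state); // false true CLOSED
```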

4. Semantic routing

This replaces traditional path-matching or regex-based branching with routing based on semantic distance — how close a prompt is to some reference point in vector space. It shows up in two places: cost optimization (route simple tasks to a lighter model) and security (detect prompt injection attempts).
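To make "semantic distance" concrete, here is a toy sketch that routes by cosine similarity. The three-dimensional vectors, the `light-model` / `heavy-model` targets, and the function names are all made up for illustration; a real router would get its embeddings from an embedding model.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface RouteAnchor {
  label: string;
  target: string;
  embedding: number[];
}

// Hypothetical reference points in a toy 3-dimensional embedding space.
const routeAnchors: RouteAnchor[] = [
  { label: "simple lookup", target: "light-model", embedding: [1, 0, 0] },
  { label: "multi-step reasoning", target: "heavy-model", embedding: [0, 1, 0] },
];

// Pick the anchor whose embedding is closest to the prompt's embedding.
// Assumes at least one candidate is supplied.
function pickRoute(promptEmbedding: number[], candidates: RouteAnchor[]): RouteAnchor {
  return candidates.reduce((best, candidate) =>
    cosine(promptEmbedding, candidate.embedding) > cosine(promptEmbedding, best.embedding)
      ? candidate
      : best
  );
}

const chosenRoute = pickRoute([0.9, 0.2, 0.1], routeAnchors);
console.log(chosenRoute.target); // light-model
```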

This is a well-established field in its own right (vector search, embeddings) and would be worth a post by itself. But it’s hard to verify deterministically, so it’s out of scope for this article.

The full picture: a multi-layer defense pipeline

Put these together, and a production-grade AI backend tends to look like a multi-layer pipeline:

[Figure: LLM Router Request Pipeline]

Scope note: This article focuses on Layer 4 — specifically, how Circuit Breaker and Token Bucket integrate. Layers 2 and 3 (semantic cache and cost router) both depend on semantic routing, which warrants its own post.

The next section covers why I picked Circuit Breaker out of the four.

Why Circuit Breaker

Of the four patterns, why dig into Circuit Breaker specifically? It came down to something that kept nagging at me while I was reading about it.

Most LLM Router write-ups explain Circuit Breaker up to the point of “once it’s Half-Open, the next request checks whether the provider has recovered.” What they don’t get into is what “the next request” actually means once you try to implement it.

Two things specifically didn’t sit right.

1. “The next request” can arrive more than once, at the same time

In production, the moment a Circuit Breaker’s cooldown ends, it’s entirely plausible that multiple requests hit the same provider simultaneously. Half-Open is supposed to mean “try exactly one.” Without any safeguard, every request waiting at that moment assumes it’s the one probe and fires off at once — which, if the provider genuinely hasn’t recovered yet, just multiplies the load on something already struggling.

2. Circuit Breaker and Token Bucket look similar, but they’re different mechanisms

In an LLM Router, Circuit Breaker (“is the provider broken?”) and rate limiting / Token Bucket (“do I still have budget left?”) are two different concerns — but it’s easy to see how they’d end up sharing one failure counter if implemented carelessly.¹ When that happens, a request that fails only because your own budget ran out gets miscounted as “the provider is down,” and you end up tripping the breaker on a perfectly healthy provider.
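One way to picture the separation is to give the two kinds of rejection their own counters. The sketch below is hypothetical (the type names and the threshold of 5 are assumptions for the example): a request refused by the local bucket never touches the counter the breaker looks at.

```typescript
// Three outcomes the router can observe for one request.
type CallOutcome =
  | { kind: "LOCAL_BUDGET_EXHAUSTED" } // refused by our own bucket, never sent
  | { kind: "PROVIDER_ERROR"; httpStatus: number } // the provider answered with an error
  | { kind: "OK" };

interface HealthCounters {
  providerFailures: number; // the only number the breaker should look at
  localRejections: number; // budget pressure, tracked separately
}

function recordOutcome(counters: HealthCounters, outcome: CallOutcome): HealthCounters {
  if (outcome.kind === "LOCAL_BUDGET_EXHAUSTED") {
    return {
      providerFailures: counters.providerFailures,
      localRejections: counters.localRejections + 1,
    };
  }
  if (outcome.kind === "PROVIDER_ERROR") {
    return {
      providerFailures: counters.providerFailures + 1,
      localRejections: counters.localRejections,
    };
  }
  // A success clears the provider failure streak.
  return { providerFailures: 0, localRejections: counters.localRejections };
}

const FAILURE_THRESHOLD = 5; // illustrative value
let tally: HealthCounters = { providerFailures: 0, localRejections: 0 };
for (let i = 0; i < 5; i++) {
  tally = recordOutcome(tally, { kind: "LOCAL_BUDGET_EXHAUSTED" });
}
const breakerShouldTrip = tally.providerFailures >= FAILURE_THRESHOLD;
console.log(tally.localRejections, breakerShouldTrip); // 5 false
```

With a single shared counter, those same five local rejections would have tripped the breaker on a healthy provider.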

The first issue — concurrent probes — doesn’t really come up in what I read; most pieces stop at “Half-Open retries once it’s recovered” and never get into what happens when more than one request hits that moment at once. The second issue isn’t unknown — it’s the kind of thing experienced teams know to watch for² — but the specifics are easy to miss until you’re actually writing the integration code yourself. That’s why this article isn’t about Circuit Breaker on its own — it’s about getting the combination of Circuit Breaker and Token Bucket right.

The next section works through issue #2 first (the budget/breaker mix-up), then comes back to issue #1 (concurrent probes).

Integrating Circuit Breaker and Token Bucket

The shape of budget management

In an LLM Router, “budget management” boils down to a two-phase process: Reserve (claim an estimated amount up front) and Reconcile (settle the difference once the real cost is known). It’s close to an expense report — submit an estimate, then true it up once the receipt comes in.
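Stripped of Redis and of refill logic, the two phases can be sketched in a few lines. `Ledger`, `reserve`, and `reconcile` are hypothetical names for this example, and the numbers are illustrative.

```typescript
interface Ledger {
  tokens: number;
  capacity: number;
}

// Reserve: claim the estimate up front; refuse if the bucket cannot cover it.
function reserve(ledger: Ledger, estimated: number): boolean {
  if (ledger.tokens < estimated) return false;
  ledger.tokens -= estimated;
  return true;
}

// Reconcile: once the real usage is known, hand back (or charge) the difference.
function reconcile(ledger: Ledger, estimated: number, actual: number): void {
  ledger.tokens = Math.min(ledger.capacity, ledger.tokens + (estimated - actual));
}

const demoLedger: Ledger = { tokens: 1000, capacity: 1000 };
const wasAdmitted = reserve(demoLedger, 300); // estimate: 300 tokens
reconcile(demoLedger, 300, 250); // real usage turned out to be 250
console.log(wasAdmitted, demoLedger.tokens); // true 750
```

After both phases the ledger has been charged exactly the real usage (1000 - 250), regardless of how far off the estimate was.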

Two parties sit on either side of this:

The local side (your own router): decides “I should still have budget left” based only on its own record of past requests.
The provider side (the API itself): holds the actual, authoritative state. Only the provider knows the real, current usage against your API key.

If both sides always agreed, this would be trivial. They don’t.

Why local budget tracking is structurally approximate

The local side’s judgment is, at best, an estimate inferred from its own request history. There are several reasons it can drift: another process sharing the same API key, a billing-cycle boundary that doesn’t line up with your local clock, a tokenizer used for estimation that doesn’t exactly match the provider’s real accounting.

When that estimate drifts, this happens: locally, you decide you still have budget and send the request — but the provider has already hit its limit, and you get a 429 back. In other words, passing the local check doesn’t guarantee the server will accept it. This isn’t an edge case; it’s an unavoidable consequence of the local model being an approximation in the first place.
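A small simulation makes the drift visible. Everything here is simulated in memory under assumed numbers: `providerSide` stands in for the provider's authoritative counter, `localView` for the router's own record, and a second process silently spends against the same key.

```typescript
// The provider's authoritative counter for one API key (simulated).
const providerSide = { limit: 1000, used: 0 };

function providerAccepts(cost: number): boolean {
  if (providerSide.used + cost > providerSide.limit) return false; // would be a 429
  providerSide.used += cost;
  return true;
}

// Our router's local view: it only sees its own traffic.
const localView = { remaining: 1000 };

// Another process shares the same key and spends 700 tokens we never see.
providerAccepts(700);

// We spend 200. Both sides agree this succeeds.
localView.remaining -= 200;
const firstAccepted = providerAccepts(200);

// The local check says 800 tokens are left, so a 300-token request looks safe...
const localCheckPasses = localView.remaining >= 300;
// ...but the provider only has 100 left and rejects it.
const secondAccepted = providerAccepts(300);

console.log(firstAccepted, localCheckPasses, secondAccepted); // true true false
```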

How this gap should feed back into Circuit Breaker

This raises a question: when that 429 comes back, was it because your budget ran out, or because the provider itself is overloaded and rejecting everyone?

Honestly, you can’t reliably tell the two apart from the response alone — many providers return the same 429 for both.

So the design I went with here is to skip the root-cause question and treat “a 429 came back” as the only signal that matters. Rather than building out classification logic, I treat it as confirmed evidence that the local model has drifted from reality: force the local bucket to zero and feed it into the Circuit Breaker’s failure tracking. That avoids the cost of building a root-cause classifier while keeping the safe-side behavior: once a drift is confirmed, back off from this provider for a while.

This has a real limit, though. If a provider’s response headers include something like Retry-After or X-RateLimit-Remaining, using those values directly is more accurate than guessing: a remaining-quota header can overwrite the local bucket, and Retry-After can set how long to back off. I haven’t implemented that here; it’s the natural next improvement if you want more precision.
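As a sketch of what that header-based refinement might involve (not something the scripts in this article do), the function below pulls both hints out of a header map. It assumes header names have already been lowercased, and that `x-ratelimit-remaining` is expressed in the same unit as the local bucket; that header is not standardized, so its name and unit vary by provider. `readRateLimitHints` is a hypothetical helper.

```typescript
interface RateLimitHints {
  remainingTokens?: number; // candidate value to overwrite the local bucket with
  retryAfterMs?: number; // candidate back-off before trying this provider again
}

// Assumption: header names are lowercased by the HTTP client.
function readRateLimitHints(headers: Record<string, string>, nowMs: number): RateLimitHints {
  const hints: RateLimitHints = {};

  const rawRemaining = headers["x-ratelimit-remaining"];
  if (rawRemaining !== undefined && /^\d+$/.test(rawRemaining)) {
    hints.remainingTokens = Number(rawRemaining);
  }

  const rawRetry = headers["retry-after"];
  if (rawRetry !== undefined) {
    if (/^\d+$/.test(rawRetry)) {
      hints.retryAfterMs = Number(rawRetry) * 1000; // delay-in-seconds form
    } else {
      const retryAt = Date.parse(rawRetry); // HTTP-date form
      if (!isNaN(retryAt)) hints.retryAfterMs = Math.max(0, retryAt - nowMs);
    }
  }
  return hints;
}

const demoHints = readRateLimitHints(
  { "retry-after": "30", "x-ratelimit-remaining": "0" },
  0
);
console.log(demoHints.remainingTokens, demoHints.retryAfterMs); // 0 30000
```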

A separate problem in Half-Open: probe single-flight

That covers the budget side. There’s a second, unrelated problem inside Circuit Breaker itself — the one mentioned earlier: multiple requests racing to claim the probe slot the instant a cooldown ends.

This is independent of the budget mismatch above, so it needs a different fix. The ADMIT script in the next section handles both of these — the budget drift and the concurrent-probe problem — but through two separate mechanisms.

Implementation: the ADMIT / REPORT scripts

Why Lua

A naive implementation looks like this:

read remaining balance (GET)
check if it's enough
if so, decrement it (SET)

This splits read, check, and write into three separate steps. If two requests interleave between those steps, you get overcommitment: both read the same balance, both pass the check, and together they pull more tokens than actually exist. Redis executes commands on a single thread and runs a Lua script to completion before serving any other command, so bundling those three steps into one script guarantees nothing else can interleave while it runs. That’s the whole reason for moving multi-step checks into Lua. Redis also has MULTI/EXEC (queue up several commands, then run them all atomically), but the queued commands can’t branch on a value read inside the transaction. Once “what to write” depends on “what we just read,” Lua is the right tool.
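The overcommitment is easy to reproduce without Redis. The sketch below is a deterministic in-memory simulation with hypothetical names: `naiveInterleaved` forces the worst-case ordering (every client reads before any client writes), while `atomicSequential` models what a single atomic script gives you.

```typescript
interface Balance {
  tokens: number;
}

// Naive flow: each client runs GET, check, SET as separate steps, and the
// steps interleave so that all the GETs happen before any SET.
function naiveInterleaved(store: Balance, costs: number[]): number {
  const snapshots = costs.map(() => store.tokens); // every client reads first
  let admitted = 0;
  costs.forEach((cost, i) => {
    const seen = snapshots[i] ?? 0;
    if (seen >= cost) {
      store.tokens = seen - cost; // write based on a stale read
      admitted += 1;
    }
  });
  return admitted;
}

// Atomic flow: read, check and write form one indivisible step per client.
function atomicSequential(store: Balance, costs: number[]): number {
  let admitted = 0;
  costs.forEach((cost) => {
    if (store.tokens >= cost) {
      store.tokens -= cost;
      admitted += 1;
    }
  });
  return admitted;
}

const naiveStore: Balance = { tokens: 100 };
const naiveAdmitted = naiveInterleaved(naiveStore, [80, 80]);
const atomicStore: Balance = { tokens: 100 };
const atomicAdmitted = atomicSequential(atomicStore, [80, 80]);
console.log(naiveAdmitted, atomicAdmitted); // 2 1
```

In the naive run, 160 tokens are handed out from a bucket of 100, and the stored balance (20) hides that anything went wrong.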

Script 1: ADMIT — should this request go through?

This script does three things in a single atomic call:

Checks the Circuit Breaker state (CLOSED / OPEN / HALF_OPEN).
If OPEN and the cooldown has elapsed, attempts the transition to HALF_OPEN and claims the probe slot.
Checks the Token Bucket balance and, if sufficient, reserves the estimated tokens.

local circuit_key = KEYS[1]
local bucket_key  = KEYS[2]

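local now               = tonumber(ARGV[1])  -- current time in ms, supplied by the caller
local cooldown_ms       = tonumber(ARGV[2])  -- how long the breaker stays OPEN before a probe
local probe_lock_ttl_ms = tonumber(ARGV[3])  -- after this long, a probe claim counts as stale
local capacity          = tonumber(ARGV[4])  -- bucket size, in tokens
local refill_rate       = tonumber(ARGV[5])  -- tokens per ms (multiplied by elapsed ms below)
local estimated         = tonumber(ARGV[6])  -- estimated token cost of this request
local reservation_id    = ARGV[7]
local reservation_ttl   = ARGV[8]            -- ms; used as the PX expiry of the reservation key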

local state          = redis.call('HGET', circuit_key, 'state') or 'CLOSED'
local opened_at       = tonumber(redis.call('HGET', circuit_key, 'opened_at')) or 0
local probe_in_flight = redis.call('HGET', circuit_key, 'probe_in_flight') or '0'
local probe_claimed_at = tonumber(redis.call('HGET', circuit_key, 'probe_claimed_at')) or 0

local is_probe = false

-- Breaker check
if state == 'OPEN' then
  if (now - opened_at) < cooldown_ms then
    return {0, 'BREAKER_OPEN', 0}
  end
  if probe_in_flight == '1' and (now - probe_claimed_at) < probe_lock_ttl_ms then
    return {0, 'PROBE_IN_PROGRESS', 0}
  end
  redis.call('HSET', circuit_key, 'state', 'HALF_OPEN', 'probe_in_flight', '1', 'probe_claimed_at', now)
  is_probe = true
elseif state == 'HALF_OPEN' then
  if probe_in_flight == '1' and (now - probe_claimed_at) < probe_lock_ttl_ms then
    return {0, 'HALF_OPEN_WAIT', 0}
  end
  redis.call('HSET', circuit_key, 'probe_in_flight', '1', 'probe_claimed_at', now)
  is_probe = true
end

-- Reserve (token bucket, lazy refill)
local tokens = tonumber(redis.call('HGET', bucket_key, 'tokens')) or capacity
local last   = tonumber(redis.call('HGET', bucket_key, 'last_refill')) or now
local elapsed = now - last
tokens = math.min(capacity, tokens + elapsed * refill_rate)

if tokens < estimated then
  redis.call('HSET', bucket_key, 'tokens', tokens, 'last_refill', now)
  if is_probe then
    redis.call('HSET', circuit_key, 'state', 'OPEN', 'probe_in_flight', '0')
  end
  return {0, 'INSUFFICIENT_TOKENS', 0}
end

tokens = tokens - estimated
redis.call('HSET', bucket_key, 'tokens', tokens, 'last_refill', now)
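-- Note: this key is built inside the script instead of being passed via KEYS.
-- That works on a single Redis instance but is not safe on Redis Cluster.
redis.call('SET', 'reservation:' .. reservation_id,
  estimated .. ':' .. (is_probe and '1' or '0'),
  'PX', reservation_ttl)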

return {1, reservation_id, is_probe and 1 or 0}

There are two things worth calling out here.

The probe_in_flight flag resolves who gets to probe. Even if multiple requests notice the cooldown has ended at the same moment, this flag means: “if it’s not set, I set it, and I become the probe.” Exactly one request ends up winning. This is the same idea as a distributed lock built on SETNX.

That alone creates a new problem, though: what if the request that claimed the probe slot crashes afterward? The cleanup step (REPORT, below) never runs, probe_in_flight never clears, and the system gets stuck believing a probe is permanently in flight. That’s what orphan recovery handles. probe_claimed_at records a timestamp, and once probe_lock_ttl_ms has elapsed, the old probe claim is treated as stale and a new request is allowed to take over. The elseif state == 'HALF_OPEN' branch in the script above is exactly where this staleness check happens.
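The claim-and-takeover logic can be modeled in memory to see both behaviors side by side. This is a hypothetical TypeScript mirror of the breaker branch of ADMIT once the cooldown has elapsed, not the script itself; in Redis the atomicity comes from scripts running one at a time, which the sequential calls below stand in for. The 5000 ms TTL is an illustrative value.

```typescript
interface ProbeSlot {
  state: "OPEN" | "HALF_OPEN";
  probeInFlight: boolean;
  probeClaimedAt: number;
}

// Returns true if the caller becomes the probe.
function tryClaimProbe(slot: ProbeSlot, nowMs: number, probeLockTtlMs: number): boolean {
  const lockIsFresh = slot.probeInFlight && nowMs - slot.probeClaimedAt < probeLockTtlMs;
  if (lockIsFresh) return false; // someone else holds a live claim
  slot.state = "HALF_OPEN";
  slot.probeInFlight = true; // claim the slot, or take over a stale claim
  slot.probeClaimedAt = nowMs;
  return true;
}

const probeSlot: ProbeSlot = { state: "OPEN", probeInFlight: false, probeClaimedAt: 0 };
const PROBE_TTL_MS = 5000;

// Three requests arrive at t=1000 and are evaluated one after another.
const firstWave = [1000, 1000, 1000].map((t) => tryClaimProbe(probeSlot, t, PROBE_TTL_MS));

// The winner crashes and never reports. Before the TTL, the slot stays locked...
const claimBeforeTtl = tryClaimProbe(probeSlot, 5999, PROBE_TTL_MS);
// ...and once the TTL has passed, a new request takes over the stale claim.
const claimAfterTtl = tryClaimProbe(probeSlot, 6000, PROBE_TTL_MS);

console.log(firstWave.join(","), claimBeforeTtl, claimAfterTtl); // true,false,false false true
```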

Once a reservation succeeds, it stores the estimated token count and whether this was a probe in a reservation:{id} key with a TTL. That TTL is the safety net in case REPORT is never called.

Script 2: REPORT — feeding the real outcome back

After the actual API call returns, a second script reflects that outcome back into the bucket and the breaker state.

local circuit_key = KEYS[1]
local bucket_key  = KEYS[2]

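local reservation_id    = ARGV[1]
local success           = ARGV[2] == '1'
local actual_tokens     = tonumber(ARGV[3]) or 0  -- real usage reported by the provider
local is_rate_limit_err = ARGV[4] == '1'          -- '1' when the call failed with a 429
local failure_threshold = tonumber(ARGV[5])       -- failures before CLOSED trips to OPEN
local now               = tonumber(ARGV[6])       -- current time in ms
local capacity          = tonumber(ARGV[7])       -- bucket size, in tokens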

local reservation_key = 'reservation:' .. reservation_id
local raw = redis.call('GET', reservation_key)

if not raw then
  return {0, 'RESERVATION_EXPIRED'}
end

local sep = string.find(raw, ':')
local estimated = tonumber(string.sub(raw, 1, sep - 1))
local is_probe = string.sub(raw, sep + 1) == '1'

redis.call('DEL', reservation_key)

if success then
  local delta = estimated - actual_tokens
  local tokens = tonumber(redis.call('HGET', bucket_key, 'tokens')) or capacity
  tokens = math.min(capacity, tokens + delta)
  redis.call('HSET', bucket_key, 'tokens', tokens)
  redis.call('HSET', circuit_key, 'state', 'CLOSED', 'failure_count', 0, 'probe_in_flight', '0')
  return {1, 'OK'}
end

if is_rate_limit_err then
  redis.call('HSET', bucket_key, 'tokens', 0)
end

if is_probe then
  redis.call('HSET', circuit_key, 'state', 'OPEN', 'opened_at', now, 'probe_in_flight', '0')
else
  local failure_count = redis.call('HINCRBY', circuit_key, 'failure_count', 1)
  if failure_count >= failure_threshold then
    redis.call('HSET', circuit_key, 'state', 'OPEN', 'opened_at', now)
  end
end

return {1, 'RECORDED'}

On success, it reconciles the estimate against the actual usage and unconditionally moves the breaker back to CLOSED. On failure, a 429 forces the bucket to zero; if this was the probe, it goes straight back to OPEN without waiting on the failure threshold. A regular (non-probe) failure just increments the failure counter, tripping the breaker only once the threshold is hit.
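On the caller's side, the outcome of the HTTP call has to be turned into the argument list REPORT expects. A minimal sketch of that mapping follows; `buildReportArgs` and `ProviderCallResult` are hypothetical names, and treating any 2xx as success and only 429 as a rate-limit error is an assumption of this example.

```typescript
interface ProviderCallResult {
  httpStatus: number;
  actualTokens: number;
}

// Build ARGV in the order the REPORT script reads it:
// reservation_id, success, actual_tokens, is_rate_limit_err,
// failure_threshold, now, capacity.
function buildReportArgs(
  reservationId: string,
  result: ProviderCallResult,
  failureThreshold: number,
  nowMs: number,
  capacity: number
): string[] {
  const success = result.httpStatus >= 200 && result.httpStatus < 300;
  const isRateLimit = result.httpStatus === 429;
  return [
    reservationId,
    success ? "1" : "0",
    String(success ? result.actualTokens : 0),
    isRateLimit ? "1" : "0",
    String(failureThreshold),
    String(nowMs),
    String(capacity),
  ];
}

const rateLimitedArgs = buildReportArgs(
  "r-1",
  { httpStatus: 429, actualTokens: 0 },
  5,
  1700000000000,
  10000
);
console.log(rateLimitedArgs.join(",")); // r-1,0,0,1,5,1700000000000,10000
```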

Verification: running it against a real local Redis

I didn’t want this to stop at “should work in theory.” So I spun up a real Redis instance and ran the ADMIT/REPORT scripts against it. No LLM was actually called — the part standing in for doActualApiCall() is a plain mock that can be told to succeed, fail, or return a 429 on demand. Since this is purely testing the bookkeeping, no real API key is needed anywhere.

Five scenarios:

Scenario 1: the happy path. Reserve 100 tokens, the actual usage comes back as 80 — confirmed the bucket correctly gets 20 tokens back.

Scenario 2: tripping the breaker after the failure threshold. Five consecutive failures move the breaker from CLOSED to OPEN. Also confirmed that, while OPEN, Reserve itself gets rejected immediately — the call never even reaches the API.

Scenario 3: probe single-flight under Half-Open. This was the one I most wanted to see. Simulated the moment cooldown ends by firing three requests at once via Promise.all. Result: exactly one won the probe slot; the other two were rejected with HALF_OPEN_WAIT. This is direct confirmation that the “should be atomic” claim actually holds under real concurrent execution.

Scenario 4: a failed probe snaps straight back to OPEN. When the probe request itself fails with a 429, confirmed it goes back to OPEN immediately, without waiting on the failure threshold, and that the bucket gets forced to zero.

Scenario 5: recovering from a crashed probe (orphan recovery). Simulated a request claiming the probe slot and then “crashing” — never calling REPORT. Confirmed new requests are correctly rejected within probe_lock_ttl_ms, and that once that window passes, a different request can successfully claim a new probe slot.

5 scenarios, 18 assertions, all passing.

While running this, I made a basic arithmetic mistake in my own test expectations for Scenario 1 — nothing sophisticated, just a wrong number: 10000 - 100 (reserve) + 20 (reconcile delta) = 9920 is correct; I’d written 9980. It’s a small thing on its own, but it’s a useful reminder of how this category of system actually breaks in practice: not from flawed reasoning, but from a number being off by a little. A lock TTL set in the wrong unit, a threshold one digit short — these are the kind of mistakes that don’t show up until you’re actually running the thing, and they’re exactly the kind that matter most here. Reasoning about a design on paper doesn’t give you that. Running it does.
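One way to guard against that kind of slip is to derive the expected value from named parts and cross-check it, rather than typing a literal. The sketch below uses the Scenario 1 numbers quoted above; the constant names are illustrative and not taken from the actual test file.

```typescript
const BUCKET_CAPACITY = 10000;
const RESERVED_TOKENS = 100;
const ACTUAL_TOKENS = 80;

const afterReserve = BUCKET_CAPACITY - RESERVED_TOKENS; // 9900
const reconcileDelta = RESERVED_TOKENS - ACTUAL_TOKENS; // 20 tokens handed back
const expectedBalance = afterReserve + reconcileDelta; // 9920

// Cross-check: the net effect must equal capacity minus what was really used.
const crossCheck = BUCKET_CAPACITY - ACTUAL_TOKENS;

console.log(expectedBalance, crossCheck, expectedBalance === crossCheck); // 9920 9920 true
```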

There are two ways to follow this verification yourself. One is the code directly — clone the repo, start redis-server, run node test.js, and you’ll get the same result. The other is the demo below: hit “Fire 3 Concurrent Requests” and you can watch the exact behavior from Scenario 3 — three requests in, exactly one wins the probe slot — happen live in the browser.

A few parameters — cooldown, probe lock TTL, the failure threshold — are tuned shorter in the demo than in the test above, just so you don’t have to sit and wait. The logic being exercised is identical.

You can also open the standalone demo at circuit-breaker-token-bucket-demo.vercel.app.

A couple of things this didn’t cover

One thing this article didn’t directly touch on: what should happen to the token budget when a request fails for reasons that have nothing to do with rate limits — a timeout, a 500 from the provider, a dropped connection mid-stream?

I looked for an industry standard here and didn’t find one. What I did find is that Azure’s API Management documentation at least names the underlying tension: the real cost of a request often isn’t known until the response comes back, which means the provider has frequently already spent real compute by the time a failure happens.

That’s really the crux of it. Inference isn’t free — when a request fails partway through, someone already paid for the GPU time, win or lose. Whether that cost should land on the provider (refund the tokens) or the caller (keep the charge) is a question worth taking seriously, not just a UX nitpick. From a user’s side, “it failed, so give it back” feels obviously fair. From a provider’s side, the compute was already spent regardless of what the client received — refunding doesn’t undo that cost, it just relocates it.

I don’t have a confident answer here, and I’m skeptical anyone has a fully settled one. A reasonable middle ground probably treats failures differently by cause — refunding on a clear 5xx (the provider’s problem), not refunding on something the client could have caught earlier — but that’s a direction to explore, not a conclusion this article reaches.
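Purely to show what "treat failures differently by cause" could look like in code, here is one hypothetical policy. The cause categories, the function name, and especially the partial-refund rule for timeouts and dropped streams are assumptions made for this sketch, not a recommendation or an industry convention.

```typescript
type FailureCause = "PROVIDER_5XX" | "TIMEOUT" | "CLIENT_4XX" | "STREAM_DROPPED";

// How many reserved tokens go back into the bucket after a failed request.
function tokensToRefund(cause: FailureCause, reserved: number, consumedSoFar: number): number {
  if (cause === "PROVIDER_5XX") return reserved; // provider's problem: refund everything
  if (cause === "CLIENT_4XX") return 0; // the caller could have caught it: keep the charge
  // Assumption: for timeouts and dropped streams, refund only the part
  // that was not demonstrably consumed before the failure.
  return Math.max(0, reserved - consumedSoFar);
}

console.log(
  tokensToRefund("PROVIDER_5XX", 100, 40),
  tokensToRefund("CLIENT_4XX", 100, 40),
  tokensToRefund("STREAM_DROPPED", 100, 40)
); // 100 0 60
```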

Another point worth being precise about: this article is about routing across providers — your own infrastructure deciding between OpenAI, Anthropic, Google, and so on. That’s a different question from what a single provider does inside their own stack — e.g., routing a request to a lighter or heavier internal model before it ever leaves their infrastructure. The underlying idea (classify the request, route to the right amount of compute) is the same; the operational boundary isn’t.³

Closing thoughts

I went back through the architecture of LLM Routers and AI Gateways with a distributed-systems lens. The broad framing — “an LLM Router is an AI-native evolution of an L7 load balancer” — isn’t new; it’s already out there. What I did here was pick one piece of it (Circuit Breaker), work through how it should combine with Token Bucket, and actually run it against a real local Redis instance to check.

It’s worth being precise about where this sits relative to existing work.

At the design level: keeping Circuit Breaker (failure detection) and Token Bucket (budget tracking) as separate pieces of state, and treating something like a 429 as a uniform failure signal rather than trying to classify its root cause — as far as I could find, this matches an established pattern already present in existing AI Gateways like LiteLLM.

At the implementation level: the two-phase Reserve/Reconcile token accounting, the single-flight mechanism for claiming a Half-Open probe, and the forced bucket depletion on a confirmed 429 — I didn’t find prior art for this specific combination in what I looked at. I took the established pattern, applied it to constraints specific to LLMs (usage isn’t known up front; multiple probes can arrive concurrently), built it from there, and checked it against 5 scenarios and 18 assertions on a real local Redis.

Circuit Breaker for LLM APIs is a focused corner of infrastructure. For anyone building at this layer, the details are where things get interesting, which is why I didn’t stop at the concept: I built it and actually ran it to check.

Wrap Up

This post started from the observation that LLM Routers map cleanly onto distributed systems patterns, and zoomed into one of them all the way to a working implementation. Semantic routing itself — and the two layers built on top of it (semantic cache, cost router) — each have enough depth to warrant their own post, and I may get to them if the opportunity arises.

The code is on GitHub. The demo is live at circuit-breaker-token-bucket-demo.vercel.app.