Octo: How We Stopped Manually Failing Over RPC Providers and Cut Overage Costs
TRM built Octo, an in-house RPC load balancer and router, to replace 28+ developer days a year of manual provider failover with automatic multi-vendor routing and circuit breaking — cutting quota usage by 30% and eliminating third-party overages for the first time in over a year.
If you've ever been paged because an upstream vendor went down, you already know this problem. Any system that depends on external APIs inherits their failure modes — and the default response is manual: someone notices, someone switches, someone updates the runbook.
At TRM, that dependency was RPC providers powering our blockchain data pipelines, and we spent over 28 developer days in a single year managing it by hand. This is what we built to stop doing that.
Key takeaways
- Single-provider architectures are a reliability tax. They feel simple until the first incident; then they get expensive.
- Routing should be a platform concern, not chain-specific glue. The more chains you support, the less sustainable ad hoc failover becomes.
- Config is architecture. A system where weights and providers are tunable without code changes is one you can actually operate at scale.
- An AI assistant closes the metrics-to-config loop. The playground sends RPC traffic through Octo to live providers; each run’s per-provider errors, latency, and failover stats become the inputs for the next draft weight proposal.
{{horizontal-line}}
The problem
At TRM, we run our own nodes — but for resilience and coverage across chains, we also rely on third-party remote procedure call (RPC) providers. For a long time, the default was also the simplest: wire each chain to one provider and move on.

That model works — until a provider has an outage, starts rate limiting, returns stale responses, or quietly changes behavior in a way that degrades your pipeline. And when it breaks, someone gets paged.
We spent over 28 developer days in a single year manually switching providers, adjusting configs, and firefighting vendor incidents. Some chains were effectively locked into a single provider because of a specific RPC method no other vendor supported. Others were bleeding overages because they had no way to distribute load. The real cost was solving the same problem chain by chain, incident by incident, with no shared layer to lean on.
What we built
Octo is an in-house RPC load balancer and router that sits between our blockchain ingestion code and upstream providers. Chains opt in and immediately get multi-vendor routing, weighted load balancing, automatic failover, and centralized credential management — all driven by config, not code.
The key shift: provider logic moved out of chain-specific code into a shared control layer. Chain code stopped caring about which vendor it was talking to.

The architecture
Composition
Octo splits responsibilities across a few composable types, which a single OctoClient wires together; a sketch of that wiring follows the list below.
Each layer has one job:
- Config — Declares the chain’s upstream endpoints, traffic-shaping knobs (weights / per-method overrides), retry budget, and resilience thresholds — what should exist, not how to call it
- SecretsClient — Boundary for retrieving sensitive material used when materializing upstream clients (e.g. tokens, keys) so credentials are not embedded directly in static config
- CircuitBreaker — Per-upstream outcome memory: exposes a continuous “How safe is this upstream right now?” signal and applies temporary exclusion after sustained retryable failures; successful calls reset the bad streak
- NodeAdapters — Concrete client implementations keyed by upstream identity: translate Octo’s request object into vendor-specific wire calls and surface vendor-specific errors for the router to classify
- SelectionStrategy — Pure “Where should the next attempt go?” logic: chooses an upstream from the registered set using configured preferences plus the resilience health signal (and optional per-request labels like method)
- Router — The coordinator loop: ask the strategy for a target, delegate to the matching adapter, classify the outcome, update resilience, apply pacing/backoff between attempts, and stop when a response succeeds or retries are exhausted

OctoClient is the public entry; it holds a Router built from these parts.
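
Here is roughly what that composition looks like in Python. The type and method names follow the list above (vendor_health() and select() reappear in the next sections), but the signatures are assumptions made for illustration, not the production interfaces.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

# Interface sketch only: names mirror the layer list above; exact signatures are assumptions.

@dataclass
class ProviderConfig:
    id: str
    weight: int
    endpoint_ref: str                                    # secret reference, never an inlined credential
    methods: list[str] = field(default_factory=lambda: ["*"])

@dataclass
class Config:
    chain_id: str
    max_retries: int
    providers: list[ProviderConfig]

class SecretsClient(Protocol):
    def resolve(self, ref: str) -> str: ...              # turn a reference into a usable endpoint/token

class NodeAdapter(Protocol):
    def send(self, method: str, params: list[Any]) -> Any: ...   # vendor-specific wire call

class CircuitBreaker(Protocol):
    def vendor_health(self, provider_id: str) -> float: ...      # 0.0 (excluded) .. 1.0 (fully trusted)
    def record(self, provider_id: str, ok: bool, retryable: bool) -> None: ...

class SelectionStrategy(Protocol):
    def select(self, method: str) -> str: ...            # provider id for the next attempt

class Router:
    """Coordinator: pick a target, delegate, classify the outcome, update resilience,
    pace, retry. (The loop body is sketched in the Runtime section below.)"""

    def __init__(self, config: Config, strategy: SelectionStrategy,
                 adapters: dict[str, NodeAdapter], breaker: CircuitBreaker) -> None:
        self.config = config
        self.strategy = strategy
        self.adapters = adapters
        self.breaker = breaker

class OctoClient:
    """Public entry point: holds a Router built from these parts. The adapters are
    materialized from endpoints resolved through the SecretsClient before wiring."""

    def __init__(self, config: Config, adapters: dict[str, NodeAdapter],
                 strategy: SelectionStrategy, breaker: CircuitBreaker) -> None:
        self.router = Router(config, strategy, adapters, breaker)
```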
Runtime
This is what happens after OctoClient is built: each call() enters the router’s retry loop until a vendor returns a successful response, a non-retryable error stops the call, or retries are exhausted.

If every vendor’s effective weight hits zero, selection fails fast — that’s the “no healthy upstream” case.
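
Written out as a sketch that continues the interfaces above, the loop looks roughly like this; the exponential pacing and the error types are illustrative choices, not necessarily what production Octo does.

```python
import time

class NonRetryableError(Exception):
    """A failure the router should not route around (e.g. a malformed request)."""

def route_call(method, params, *, config, strategy, adapters, breaker, base_delay_s=0.1):
    """Body of the router's retry loop, written as a free function for the sketch."""
    last_error = None
    for attempt in range(config.max_retries + 1):
        # Fail-fast path: select() raises immediately if no upstream is healthy.
        provider_id = strategy.select(method)
        try:
            result = adapters[provider_id].send(method, params)
            breaker.record(provider_id, ok=True, retryable=False)   # success resets the bad streak
            return result
        except NonRetryableError:
            breaker.record(provider_id, ok=False, retryable=False)
            raise                                                   # stop: another vendor won't fix this call
        except Exception as err:                                    # retryable vendor failure
            breaker.record(provider_id, ok=False, retryable=True)
            last_error = err
            time.sleep(base_delay_s * (2 ** attempt))               # pacing/backoff before the next attempt
    raise RuntimeError(f"retries exhausted for {method!r}") from last_error
```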
Selection and resilience
Selection is not “pick at deploy time.” Each attempt can see a different vendor_health(); the breaker remembers outcomes, so the next select() can land on a different provider.
The idea is simple: when a provider keeps failing in a retryable way, we treat it as “less trustworthy” for the next pick. That trust score ramps down as failures stack up, so fewer new requests go that way. If failures keep coming, that provider can drop out completely for a cooldown window. After a successful call, we reset the streak and it’s back in the mix.

In practice, this means a provider that starts degrading at 3:00am is automatically deprioritized for the next request — no config change, no deploy, no human waking up to notice. The system corrects itself within the request loop.
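
A simplified, concrete version of that feedback loop is below. The trip threshold, cooldown window, and linear trust ramp are illustrative values, not Octo’s production tuning.

```python
import random
import time

class CircuitBreaker:
    """Per-upstream outcome memory: trust ramps down with consecutive retryable
    failures, drops to zero for a cooldown after a sustained streak, and a
    success resets the streak."""

    def __init__(self, trip_after: int = 5, cooldown_s: float = 60.0) -> None:
        self.trip_after = trip_after
        self.cooldown_s = cooldown_s
        self._streak: dict[str, int] = {}           # provider_id -> consecutive retryable failures
        self._tripped_until: dict[str, float] = {}  # provider_id -> monotonic time it may return

    def record(self, provider_id: str, ok: bool, retryable: bool) -> None:
        if ok:
            self._streak[provider_id] = 0                       # success: back in the mix
            return
        if retryable:
            streak = self._streak.get(provider_id, 0) + 1
            self._streak[provider_id] = streak
            if streak >= self.trip_after:                       # sustained failures: temporary exclusion
                self._tripped_until[provider_id] = time.monotonic() + self.cooldown_s

    def vendor_health(self, provider_id: str) -> float:
        """Continuous 'how safe is this upstream right now?' signal in [0, 1]."""
        if time.monotonic() < self._tripped_until.get(provider_id, 0.0):
            return 0.0                                          # cooling down: fully excluded
        streak = self._streak.get(provider_id, 0)
        return max(0.0, 1.0 - streak / self.trip_after)         # trust ramps down as failures stack up

def select(providers, breaker: CircuitBreaker, method: str):
    """Weighted pick over providers that serve this method, scaled by current trust."""
    eligible = [p for p in providers if "*" in p.methods or method in p.methods]
    weights = [p.weight * breaker.vendor_health(p.id) for p in eligible]
    if not eligible or sum(weights) == 0:
        raise RuntimeError("no healthy upstream")               # the fail-fast case from the runtime loop
    return random.choices(eligible, weights=weights, k=1)[0]
```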
Configuration
Every chain that opts into Octo defines a JSON config that describes its provider setup. Here's an illustrative example:
```json
{
  "chain_id": "chain-x",
  "max_retries": 3,
  "providers": [
    {
      "id": "Provider A",
      "weight": 70,
      "endpoint": "{{SECRET_SOURCE:VENDOR_A_ENDPOINT}}",
      "methods": ["*"]
    },
    {
      "id": "Provider B",
      "weight": 10,
      "endpoint": "{{SECRET_SOURCE:VENDOR_B_ENDPOINT}}",
      "methods": ["*"]
    },
    {
      "id": "Provider C",
      "weight": 10,
      "endpoint": "{{SECRET_SOURCE:VENDOR_C_ENDPOINT}}",
      "methods": ["specific_method_only"]
    },
    {
      "id": "Provider D",
      "weight": 10,
      "endpoint": "{{NO_AUTH:VENDOR_D_ENDPOINT}}",
      "methods": ["*"]
    }
  ]
}
```
This is a simplified, renamed sketch of the ideas we encode in config — not a verbatim production file.
A few things worth noting here:
- Weights are per-provider, not binary. You're not choosing a primary and a fallback — you're expressing a traffic distribution you can tune without touching code.
- Method-level routing is supported. If one vendor is the only option for a specific RPC call (a real scenario we hit), you can scope it to just that method while other vendors handle the rest.
- Secrets are referenced, never inlined. Credentials stay in the secret store; the config just holds the reference key.
- Failover behavior is explicit. What counts as a retryable failure is declared, not assumed.
The practical upside: shifting load off a degraded vendor, onboarding a new provider, or adjusting weights for a backfill window is a config change — reviewable, auditable, and deployable without touching chain logic.
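
As a rough illustration of what that review surface might look like, here is a hypothetical loader for the config shape above. The field names follow the example; the placeholder convention check and validation rules are assumptions, not production behavior.

```python
import json
import re
from pathlib import Path

# Hypothetical loader for the illustrative config above; validation rules are assumptions.
SECRET_REF = re.compile(r"^\{\{SECRET_SOURCE:(?P<key>[A-Z0-9_]+)\}\}$")

def load_chain_config(path: str) -> dict:
    cfg = json.loads(Path(path).read_text())
    if sum(p["weight"] for p in cfg["providers"]) <= 0:
        raise ValueError("at least one provider must be able to take traffic")
    for provider in cfg["providers"]:
        if not provider["methods"]:
            raise ValueError(f'{provider["id"]}: must serve at least one method (or "*")')
        # Endpoints are references; the credential itself is fetched from the
        # secret store only when the upstream client is materialized.
        provider["needs_secret"] = bool(SECRET_REF.match(provider["endpoint"]))
    return cfg
```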

Failover that actually works
Many stacks retry across providers; fewer define failover as an operational contract you can measure.
For us, failover wasn't "eventually use another provider." It was a specific operational contract: provider incidents that used to require a human should resolve automatically, within a window short enough that downstream pipelines never notice.
That shaped a few concrete decisions in the design:
- Provider selection had to be dynamic per request, not wired at startup. A healthy provider at boot time is not necessarily healthy at 2:00am.
- Failover had to happen inside the request path, with the router moving on from a retryable error before it cascades downstream (what counts as “retryable” is sketched after this list).
- The routing layer had to be observable by default. If a provider starts timing out, the signal needs to exist before anyone gets paged — not after a lag accumulates in a downstream table.
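
For example, the classification that drives “move on vs. stop” might look like this — the status-code buckets below are an illustrative assumption, not Octo’s actual rules:

```python
from typing import Optional

RETRYABLE_STATUS = {429, 500, 502, 503, 504}       # rate limits and vendor-side failures
NON_RETRYABLE_STATUS = {400, 401, 403, 404}        # our request is the problem; failover won't fix it

def is_retryable(status_code: Optional[int] = None, timed_out: bool = False) -> bool:
    """Should the router move on to another provider for this failure?"""
    if timed_out or status_code is None:            # timeouts / dropped connections: try elsewhere
        return True
    if status_code in NON_RETRYABLE_STATUS:
        return False                                # the same request would fail on any vendor
    return status_code in RETRYABLE_STATUS
```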

The goal wasn't dramatic incident response, but rather removing humans from the critical path of routine provider failure so the on-call queue stays quiet.
Letting AI tune the config
Deploying Octo was step one. Getting weights right per chain was the harder problem.
Provider behavior isn't uniform. What works for a high-throughput chain doesn't necessarily translate to one with archival query patterns. The optimal weight distribution depends on latency profiles, error rates under load, failover frequency, and how different providers behave across different RPC methods — data you can only get from observation, not intuition.
So we built a playground that made the observation loop fast. The playground fires real RPC calls through Octo's router against live providers, collects per-vendor stats — traffic distribution, error rates, latency percentiles, failover counts — and surfaces them in a structured output an agent can reason over.
Then we gave an AI agent access to it. The agent could run a config, read the outcome, reason about what the stats implied — which provider was absorbing too much of the failover burden, where latency was clustering, whether the weight distribution matched actual reliability — and propose a revised config. Run again. Observe. Iterate.
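
In sketch form, the loop is just run, read, propose, repeat. The stats fields below follow the categories named above (traffic distribution, error rates, latency percentiles, failover counts); the specific percentiles and the run_playground / propose_weights helpers are stand-ins, not the real playground or agent.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProviderStats:
    provider_id: str
    traffic_share: float        # fraction of calls this provider actually served
    error_rate: float
    p50_latency_ms: float
    p99_latency_ms: float
    failover_count: int         # times the router had to move on from this provider

def tune(config: dict,
         run_playground: Callable[[dict], list[ProviderStats]],
         propose_weights: Callable[[dict, list[ProviderStats]], dict],
         max_iterations: int = 5) -> dict:
    """Run a config, read the per-provider stats, let the agent propose a revision, repeat."""
    for _ in range(max_iterations):
        stats = run_playground(config)             # real RPC calls through Octo's router, live providers
        proposal = propose_weights(config, stats)  # agent reasons over the structured output
        if proposal == config:
            return config                          # converged: stats no longer suggest a change
        config = proposal
    return config
```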
For rolled-out chains, this process converged on configurations that human intuition would have taken days of production observation to reach. The agent wasn't guessing — it had the same feedback signal a human would use, just processed faster and without the operational fatigue of waiting for real incidents to reveal the gaps.
Results
For one of our highest-throughput chains, we'd been locked into a single vendor — the only provider that supported a specific RPC method we depended on. Octo let us route around that constraint. 50% of traffic shifted to other providers once alternatives became viable.
Across chains that adopted Octo:
- 30% reduction in overall RPC quota usage
- Zero third-party overages since rollout — for the first time in over a year
The operational change was harder to quantify but more meaningful day-to-day:
- Provider incidents that used to require a human now resolve automatically
- The on-call queue got quieter
- Engineers stopped getting paged for problems they never had to touch
Lessons we carry forward
Building Octo reinforced a few things that weren't obvious until we were deep in it.
External dependencies deserve first-class architecture
We treat our own services as real infrastructure — versioned, monitored, with clear failure modes. RPC providers got treated as implementation details. That gap between how we thought about them and how much we actually depended on them is where all the operational pain came from.
Config is a better place to put complexity than code
Every time we'd hardcoded routing behavior in chain-specific logic, we paid for it later during an incident — when the person on call wasn't the person who wrote it. Moving that behavior into config meant it was visible, auditable, and changeable without a deploy.
A shared layer compounds in ways a per-chain fix doesn't
The first chain we migrated to Octo cut overages. The second avoided a provider outage it never would have recovered from alone. The third got faster backfills without anyone touching its code. Three different problems, one shared layer — and each chain after that got all three benefits on day one.
What’s next
Octo is live and adding chains. The near-term work is mostly execution: rolling out to more workloads, tightening provider-level observability, and making the onboarding path faster as adoption grows.
The more interesting open questions are around routing intelligence. Static weights work well once tuned, but a router that can adapt to real-time provider health — adjusting without a config change when a vendor starts degrading — is a meaningfully better system. That's the direction we're headed. And the AI-assisted tuning loop is still in its early stages. As more chains onboard, it becomes a feedback engine rather than a one-time calibration tool.
The best infrastructure is the kind your on-call team stops talking about — because they stop having to.


