How We Built Real-time Blockchain Ingestion at TRM

Vlad Duta
Ankush Sharma

This post is for engineers building real-time ingestion, streaming pipelines, or correctness-sensitive systems.

Key takeaways

  1. Define “real-time” explicitly: Freshness, correctness, and reorg behavior must be product-level contracts — not implementation details.
  2. Absorb complexity so chain integrators don’t have to: A small set of reusable primitives beats re-solving streaming edge cases for every blockchain.
  3. Data contracts > orchestration: Stable event interfaces let you run the same pipeline locally, in CI, in streaming, and in backfill — without rewriting logic.
  4. Fast feedback loops compound: The most important feature of an ingestion system is how quickly engineers can validate correctness at scale.

{{horizontal-line}}

Two real-time paths (and what this post covers)

At TRM, we follow money at internet scale: nearly 150 blockchains, every material movement, every token, every value shift, every counterparty. That ambition is the point — because we’re building a safer financial system for billions. To do it, we ingest the raw firehose from every chain we support.

Most blockchains don’t hand you a clean “sender → amount → receiver.” They give you chain-specific encodings, nested call traces, program logs, and fragments of intent that only become real after deep decoding and modeling. Our work is to turn that chaos into precise, auditable flows of funds, then normalize it into a single internal schema — whether the underlying reality is one-to-one, one-to-many, or many-to-many.

For years, our ingestion pipeline was “fast enough.” Then high-throughput chains arrived — and customer expectations shifted. Speed wasn’t the bar anymore. Freshness was.

At TRM, “real-time” lives in two systems:

  • Customer-facing real-time pipeline (event-based): A low-latency pipeline that powers live product experiences.
  • Blockchain data ingestion pipeline (modeling-heavy): A blockchain ingestion-and-modeling pipeline that transforms raw chain data into correctness-sensitive outputs — standardized transfers, balances, and flow-of-funds — through joins, state, backfills, and reorg-aware correction.

This post is about that backbone: How we modernized ingestion so modeled outputs aren’t just correct — they’re consistently fresh, reliable, and semantically precise. TRM already delivered real-time intelligence before this work; what changed is the freshness, reliability, and meaning of the pipeline that produces the outputs customers trust most.

What we gained from modernizing real-time ingestion

Modernizing the pipeline significantly improved the overall delivery and latency of blockchain data:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Median freshness (head → modeled events) | 2–5 minutes | < 15 seconds p50, < 30 seconds p99 | 800% |
| Time to onboard a new chain | n days | n/2 days | 50% |
| Cost per million events | $x | $0.65x | 35% |
| Chains per quarter | y | 3y | 300% |
| On-call incidents per month | z | z/4 | 400% |

System overview

Our original ingestion infrastructure was built when chain throughput was lower and latency requirements were looser. For the batch pipelines, we used a conventional microbatch approach: extract raw chain data, transform it into modeled tables, and load into BigQuery. We leaned heavily on SQL because it’s excellent for joins, modeling, and onboarding — plus BigQuery handles parallel execution and operational scaling well.

The architecture below describes our blockchain ingestion-and-modeling path. It runs alongside our separate customer-facing event pipeline; the focus here is the ingestion system that turns raw chain data into correctness-sensitive modeled outputs with explicit freshness and reorg semantics.

At a high level, the new architecture is event-driven and contract-first:

  1. Produce a well-defined event from any available data source or sources, regardless of the logical format and encoding
  2. Run a processor that transforms events into modeled outputs
  3. Validate continuously

Microbatching assumes you can afford to wait minutes to hours for correctness and joins. Once customers wanted near-real-time visibility and high-throughput chains produced orders of magnitude more events, that assumption became our bottleneck. Our answer was a hybrid approach: eliminate streaming joins from the real-time pipelines by unifying the input streams, then consolidate processing around those unified feeds.

{{horizontal-line}}

The inflection point: High-throughput chains + real-time expectations

Chains like Solana, BSC, and Base introduced a different operating regime:

  • Much higher event volume
  • More complex transaction structure (program-driven behavior)
  • Tighter product expectations (near-real-time monitoring)

We initially tried to adapt the existing ETL approach to streaming. Conceptually it felt straightforward: frameworks like Beam or Spark can run “the same” code as batch or streaming.

In practice, retrofitting batch into streaming meant confronting an entire class of requirements that didn’t exist in the batch-first world.

Why “just make our ETL streaming” didn’t work

Building batch first and adding streaming later forces you to confront a new class of problems:

A. State and joins

  • How do you join massive reference datasets efficiently in streaming?
  • Do you reload data per worker per batch? Preload into in-memory indices? Deal with skew?
  • How do you update reference datasets without destabilizing the system?

B. Time and correctness

  • Out-of-order data and late arrivals complicate correctness guarantees.
  • In batch, the flow-of-funds for a transaction is either present or absent. In streaming, partial results are unavoidable unless you build explicit gating and completeness signals.
  • Streams can stall; windows can remain “open” longer than expected; consumers need a way to reason about freshness and completeness.

C. Delivery semantics

  • Duplicates happen. Idempotency must be end-to-end.
  • Retrying work should never corrupt balances or double-count events.
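
As a rough illustration of what end-to-end idempotency means in practice, the sketch below keys every write on a deterministic event ID so a retried or replayed delivery overwrites the same record instead of double-counting. The in-memory store and field names are hypothetical stand-ins, not our production sink.

```python
# Hypothetical sketch: an idempotent sink keyed on a deterministic event_id.
class IdempotentSink:
    def __init__(self):
        # Stand-in for a keyed store with upsert/merge semantics.
        self._events: dict[str, dict] = {}

    def write(self, event: dict) -> None:
        # The same event_id always lands on the same record, so reprocessing is safe.
        self._events[event["event_id"]] = event

    def balance(self, asset: str) -> int:
        return sum(e["amount"] for e in self._events.values() if e["asset"] == asset)


sink = IdempotentSink()
transfer = {"event_id": "eth:0xabc:0", "asset": "ETH", "amount": 10}
sink.write(transfer)
sink.write(transfer)  # duplicate delivery after a retry
assert sink.balance("ETH") == 10  # still counted once
```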

And then there’s reorg reality: in a real-time product, you often ingest data at the “edge of truth.” If you serve incorrect or incomplete data, customers see it immediately.

{{how-we-built-real-time-blockchain-ingestion-callout-1}}

At that point we realized: the ingestion-and-modeling system needed to be designed for real-time from the start — because joins, state, idempotency, late data, and reorg correction behave fundamentally differently at the edge of truth.

What we tried (and why it wasn’t enough)

Option A: Stretch microbatch (smaller batches / more frequent scheduling)

  • ✅ Minimal rewrite
  • ❌ Still not true real-time; compute costs explode with frequent reprocessing; correctness delayed

Option B: Dual batch + streaming in Beam/Spark (“same code”)

  • ✅ Promising abstraction
  • ❌ State/joins, late data, idempotency, and reorg correction are fundamentally different concerns; complexity doubled

Option C: Pure streaming with heavy state (in-memory joins, stateful ops everywhere)

  • ✅ Low latency
  • ❌ State blowups, skew, operational risk, harder backfills

Option D: Put everything in SQL / BigQuery streaming-ish

  • ✅ Strong join semantics
  • ❌ Expensive/awkward for per-event processing and reorg correction; difficult to enforce idempotency at the right grain

Option E: Contract-first event pipeline with reusable stages (what we built)

  • ✅ Isolates chain-specific code, supports streaming + backfill + CI with same contracts
  • ✅ Makes idempotency/reorg semantics explicit
  • ✅ Scales onboarding

{{horizontal-line}}

What “real-time” ingestion really requires

Before choosing tools, we wrote down the minimum set of engineering constraints that a real-time flow-of-funds system must satisfy.

1. A canonical event model

We needed stable representations for:

  • Raw chain envelopes (blocks, transactions, logs/traces)
  • Decoded actions
  • Flow-of-funds events
  • State updates (balances, positions, aggregates)

If every chain produced a bespoke shape, we’d never scale development across dozens of integrations.

2. Deterministic event IDs + end-to-end idempotency

There is no reliable “exactly once” in fault-tolerant streaming. Retries, timeouts, restarts, and backfills all create duplicates.

So we treat idempotency as a core contract:

  • Every event has a deterministic ID derived from chain provenance
  • Writes are idempotent (upserts/merge semantics, conflict-resistant updates)
  • Downstream consumers can safely reprocess data without double-counting
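
A minimal sketch of the idea, with illustrative provenance fields (chain, transaction hash, and an index within the transaction); the exact fields and hashing scheme differ per chain.

```python
import hashlib

def deterministic_event_id(chain: str, tx_hash: str, index: int) -> str:
    """Derive a stable ID from chain provenance so every replay produces the same key."""
    provenance = f"{chain}:{tx_hash}:{index}"
    return hashlib.sha256(provenance.encode("utf-8")).hexdigest()

# Reprocessing the same transaction always yields the same ID, no matter when or where it runs.
assert deterministic_event_id("ethereum", "0xabc", 3) == deterministic_event_id("ethereum", "0xabc", 3)
```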

3. Time semantics: Out-of-order, late data, and stalled streams

Streaming is not “batch but continuous.” It introduces time:

  • Out-of-order arrivals
  • Late data
  • Stalled partitions
  • Sliding windows and watermarks

A real-time system must define:

  • Lateness tolerances
  • Completeness semantics (what “done” means)
  • How consumers reason about partial results
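
To make this concrete, here is one way lateness and completeness can be declared explicitly in Beam's Python SDK; the window size and lateness tolerance below are illustrative, not the values we run.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, AfterWatermark
from apache_beam.utils.timestamp import Duration

def window_events(events):
    """Fixed one-minute event-time windows that fire at the watermark and re-fire on late data."""
    return events | beam.WindowInto(
        window.FixedWindows(60),                     # 1-minute event-time windows
        trigger=AfterWatermark(late=AfterCount(1)),  # re-fire when late events arrive
        allowed_lateness=Duration(seconds=300),      # tolerate 5 minutes of lateness
        accumulation_mode=AccumulationMode.ACCUMULATING,
    )
```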

4. Joins against large reference datasets

Real-time modeling still needs context: token metadata, contract mappings, pricing, chain parameters.

We needed a join strategy that’s:

  • Fast and memory-safe
  • Updatable
  • Consistent across local, CI, backfill, and streaming runtimes
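
One common strategy, sketched below under the assumption that the reference dataset fits comfortably in memory, is a side-input join in Beam's Python SDK: the reference data is materialized as a per-worker lookup map. The field names are hypothetical, and larger or skew-prone datasets need different treatment, per the constraints above.

```python
import apache_beam as beam

def enrich_with_token_metadata(transfers, token_metadata):
    """Attach reference data to each transfer via a side input (hypothetical field names)."""
    def attach(transfer, metadata):
        enriched = dict(transfer)
        enriched["token"] = metadata.get(transfer["asset"], {"symbol": "UNKNOWN"})
        return enriched

    # AsDict materializes the (asset, metadata) reference PCollection as a lookup map per worker.
    return transfers | beam.Map(attach, metadata=beam.pvalue.AsDict(token_metadata))
```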

5. Reorgs and eventual consistency

Blockchains are eventually consistent by design. In real-time ingestion you operate near the “edge of truth,” where reorgs and partial propagation are more visible.

We needed explicit contracts for:

  • What “confirmed” means per chain
  • How corrections propagate
  • How we reconcile modeled outputs after reorgs

6. Validation at scale

Correctness on small test sets is necessary but not sufficient. We must:

  • Fail fast on cost/perf issues
  • Validate invariants continuously
  • Test end-to-end behavior against real chain data early

The goal: A framework that makes the common case fast and safe

TRM continues to onboard new blockchains rapidly. We needed a system where chain onboarding is mostly:

  • Chain-specific decoding
  • Minimal modeling customization
  • Strong defaults for the rest

But then we went beyond the technicalities. We wanted to define a mindset:

Make it easy to do the right thing (idempotency, correctness, reorg handling, validation), and hard to accidentally build a fragile one-off pipeline. The framework keeps us from introducing logic that isn't suitable for real-time, so we don't have to stop and think about the next step or redesign the system for every chain. It dictates what to do step by step, providing clarity and direction, yet it's flexible enough that changing requirements don't force a redesign.

The architecture: Event-driven ingestion with contract-first interfaces

We adopted an event-driven architecture where data contracts are the interface, not orchestration.

Instead of building a waterfall of tasks and SQL queries, the unit of work becomes:

  1. Produce a well-defined event from any available data source or sources, regardless of the logical format and encoding
  2. Run a processor that transforms events into modeled outputs
  3. Validate continuously
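
As a sketch of what the processor step might look like as an interface (the names are illustrative, not our internal API): consume contract-conformant events, emit zero or more modeled outputs.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator

class Processor(ABC):
    """Hypothetical processor interface: contract-conformant events in, modeled outputs out."""

    @abstractmethod
    def process(self, event: dict) -> Iterable[dict]:
        """Transform one input event into zero or more modeled events."""

class NativeTransferProcessor(Processor):
    def process(self, event: dict) -> Iterator[dict]:
        # Illustrative: map a decoded native-transfer action into a flow-of-funds event.
        if event.get("kind") == "native_transfer":
            yield {
                "event_id": event["event_id"],
                "from": event["from"],
                "to": event["to"],
                "asset": event["asset"],
                "amount": event["amount"],
            }
```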

Contracts: The concept that makes everything composable

A contract-first design lets components be swapped, scaled independently, and run in different environments without rewriting logic.

Here’s a simplified example of one kind of event we use:

TransferEvent (simplified)

  • transaction_hash
  • timestamp
  • from
  • to
  • asset
  • amount
  • event_id (deterministic)

The details differ by chain, but the contract remains stable. That stability is what turns “onboarding new chains” into “implement decoding + minimal modeling,” instead of “rebuild a pipeline.”

Events are published via the contract layer. Transport can vary (JSON/HTTP, Protobuf/gRPC, Parquet or JSON files, etc.). The key requirement is contract adherence.
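
A minimal sketch of that contract as a typed record; the field types are assumptions for illustration, and "from" is renamed only because it is a Python keyword.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransferEvent:
    """Simplified transfer contract from above; types are illustrative."""
    event_id: str          # deterministic, derived from chain provenance
    transaction_hash: str
    timestamp: int         # e.g. block timestamp as epoch seconds
    from_address: str      # "from" in the contract; renamed here because it's a Python keyword
    to_address: str
    asset: str
    amount: int            # assumed to be in the asset's smallest denomination
```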

{{horizontal-line}}

Developer workflow: Onboarding a chain without reinventing streaming

This is where the framework pays off day-to-day.

Instead of assembling a bespoke workflow and a cascade of SQL, chain onboarding becomes:

  1. Implement decoding
    • Parse logs/traces into canonical intermediate actions
    • A key rule: decoding should be deterministic and reproducible
  2. Implement/extend modeling
    • Map intermediate actions into flow-of-funds events + state change events
    • Use common primitives for IDs, idempotency, and joins
    • This is where chain-specific encoding complexity lives
  3. Run the integration harness
    • Inject static dependencies
    • Replay a known block range locally
    • Compare modeled outputs against expected invariants and golden test vectors
  4. Scale test early
    • Validate throughput and cost against realistic load
    • Fail fast before committing to production rollout
  5. Ship with continuous validation
    • Real-time invariants (no negative amounts, no gaps in block numbers, no duplicates)
    • Reconciliation checks (transfer amounts and balances add up)
    • Reorg correction propagation
    • Lag monitoring on multiple dimensions to quickly identify bottlenecks
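
The invariants in step 5 can be expressed as simple checks over a batch of modeled events. The sketch below shows their shape (field names are hypothetical); the real checks run continuously against production data.

```python
def validate_events(events: list[dict]) -> list[str]:
    """Illustrative invariant checks: no negative amounts, no duplicates, no block-number gaps."""
    violations = []

    if any(e["amount"] < 0 for e in events):
        violations.append("negative amount")

    event_ids = [e["event_id"] for e in events]
    if len(event_ids) != len(set(event_ids)):
        violations.append("duplicate event_id")

    blocks = sorted({e["block_number"] for e in events})
    if blocks and blocks[-1] - blocks[0] + 1 != len(blocks):
        violations.append("gap in block numbers")

    return violations
```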

The mental shift is important: Engineers focus on business logic and correctness, while the framework absorbs streaming mechanics and operational hazards.

One framework, multiple runtimes

A practical test of the architecture was: Can we run the same logic in multiple modes?

We do, with minimal changes:

  • Local Docker for development and iteration
  • CI/CD for end-to-end integration testing
  • Streaming runtime (e.g. Beam + Kubernetes) for real-time ingestion
  • Airflow for batch backfills across long history windows
  • Standalone Kubernetes for specialized live querying

Because interfaces are contract-based, not tightly coupled to an orchestration engine, we can move workloads to the best runtime for the job.

Reorg handling as a first-class behavior

In real-time ingestion, “finality” is a spectrum. We treat it as explicit state:

  • Observed: Seen at/near head, potentially reorgable
  • Confirmed: Past a chain-specific finality threshold
  • Reorg’ed: Events that were previously emitted but invalidated

Instead of trying to hide reorg behavior, we model it explicitly so downstream consumers can react predictably — either by only consuming confirmed data or by consuming observed data with correction awareness.
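
A minimal sketch of those states and how an event might be classified; the confirmation depth below stands in for whatever per-chain finality threshold applies.

```python
from enum import Enum

class FinalityState(Enum):
    OBSERVED = "observed"    # seen at/near head, potentially reorgable
    CONFIRMED = "confirmed"  # past the chain-specific finality threshold
    REORGED = "reorged"      # previously emitted, later invalidated

def classify(block_number: int, head: int, invalidated: bool, confirmation_depth: int) -> FinalityState:
    """Illustrative classification of an event's finality state."""
    if invalidated:
        return FinalityState.REORGED
    if head - block_number >= confirmation_depth:
        return FinalityState.CONFIRMED
    return FinalityState.OBSERVED
```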

{{horizontal-line}}

What’s next

A real-time ingestion framework is never "done." We plan to expand it beyond blockchain modeling to other suitable use cases over the upcoming quarters. Areas we continue to invest in:

  • Stronger automatic reconciliation and correction tooling
  • Improved contract versioning and backward-compatibility patterns
  • More reusable decoders/modelers as chain ecosystems evolve
  • Earlier and more predictive scale testing for cost/performance

Interested in this kind of work? We’re hiring engineers who like complex distributed systems and correctness problems. Check out our open roles and apply here.
