eBPF at Scale: How TRM Labs Modernized Its Observability Stack with Groundcover

Michael Shaffer

TRM processes data across 55+ blockchains, ingesting trillions of data points to power real-time investigations and compliance workflows. At that scale, observability isn't optional, but our previous platform made it expensive.

The cost of knowing

Our previous observability platform had a cost model that didn't scale with our infrastructure. Over time, that model compounded into four constraints we couldn't engineer our way out of:

  • Cost at ingest and at query: Every stored event and every query carried a cost. That created constant pressure to reduce ingest volume and keep exploratory analysis cheap. It also meant that agentic workflows, which often query continuously and at volumes no human analyst would, carried a real and growing cost penalty.
  • Coverage compromises: We deliberately chose not to ingest large classes of data purely for cost reasons.
  • Vendor lock-in: Every dashboard and automation we built deepened our dependency on a proprietary query language, raising switching costs at the same time that per-unit prices kept creeping upward.
  • Data residency: All of our observability data lived in a third-party SaaS region. As we expand into new geographies, we want the option to keep production telemetry inside our own cloud and align with data sovereignty requirements.

The most visible symptom of these challenges was drop rules. To stay within budget, our infrastructure team maintained an explicit allowlist of approved metrics. Kubernetes gave us hundreds of signals for free, but we were only permitted to keep a few dozen. The rest were dropped at the collection layer before they ever reached storage. Drop rules are a reasonable response to a pricing model that doesn't scale with high-cardinality infrastructure, but they represent a fundamental trade-off: observability coverage for cost predictability. The gaps they create are invisible until you need them, typically at 2:00am during an incident.

We knew we needed a switch.

When we began evaluating Groundcover, the first thing that stood out was a simple architectural difference: it doesn't charge per ingested event. It charges per node. That single shift changed the paradigm from "What can we afford to observe?" to "Observe everything, query what you need."

That was enough to make us pay attention.

What eBPF actually changes

Groundcover is built on eBPF (Extended Berkeley Packet Filter), a Linux kernel technology that lets programs run sandboxed logic at the kernel level, intercepting system calls, network packets, and process behavior without modifying application code.

In practice, this means Groundcover deploys a single DaemonSet (one pod per node) across your Kubernetes clusters, and from that single deployment, you automatically get:

  • Distributed traces across all services, captured at the network layer
  • An API catalog populated automatically from observed HTTP/gRPC traffic
  • Infrastructure metrics (CPU, memory, disk, network) without configuring a single scrape job
  • Log collection correlated with the traces they belong to
  • Service maps generated from real traffic, not from configuration

No language agents. No SDK dependencies. No sampling decisions made by your application. No code changes required. For the infra team, this also meant no per-cluster agent projects: deploy the sensor and observability simply appears.
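To make the deployment model concrete, here is a minimal, illustrative sketch of the generic shape an eBPF-sensor DaemonSet takes. The image name, labels, and mounts are placeholders for illustration, not Groundcover's actual chart:

```yaml
# Illustrative only: the generic shape of an eBPF-sensor DaemonSet.
# Image name and labels are placeholders, not Groundcover's manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-sensor
  namespace: observability
spec:
  selector:
    matchLabels: {app: ebpf-sensor}
  template:
    metadata:
      labels: {app: ebpf-sensor}
    spec:
      hostPID: true            # observe host processes, not just this pod
      containers:
        - name: sensor
          image: example.com/ebpf-sensor:latest  # placeholder image
          securityContext:
            privileged: true   # loading eBPF programs needs kernel access
          volumeMounts:
            - {name: sys-kernel-debug, mountPath: /sys/kernel/debug}
      volumes:
        - name: sys-kernel-debug
          hostPath: {path: /sys/kernel/debug}
```

One pod per node is all it takes; Kubernetes schedules the sensor onto every node, including new ones, automatically.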

For our Kubernetes-based infrastructure (data pipeline orchestration, blockchain data ingestion, EU-region services) this was essentially instant observability. We deployed the sensor, waited a few minutes, and our teams were navigating a populated API catalog and end-to-end traces they had never had before. Services that had minimal instrumentation on our previous platform suddenly had full trace coverage. "We do have API tracing working for the EU site because they are deployed to Kubernetes and we get it automatically with the eBPF sensors" became a recurring refrain as we rolled out cluster by cluster.

Groundcover's BYOC (Bring Your Own Cloud) ClickHouse model means our telemetry now lives inside our own cloud perimeter, in regions we control. Fully self-hosted observability would have given us the same data sovereignty, but at real operational cost. BYOC is the middle path: you own the storage and the data without owning the complexity of running the collection and query layer yourself.

The challenge: Not everything runs on Kubernetes

TRM's production environment is not a monolith. Alongside our Kubernetes infrastructure, we run customer-facing product APIs on Render, a managed PaaS platform. Managed platforms like Render don't expose kernel access, which means eBPF can't reach them.

This asymmetry defined the architecture of our migration. We needed two distinct paths:

Path A (Kubernetes): Deploy the eBPF sensor. Full telemetry collection begins automatically. Run in parallel with our previous platform for validation, then cut over. Zero application code changes.

Path B (Render/PaaS): Build a custom ingestion pipeline that works within the constraints of a managed platform.

Path B is where things got interesting.

Engineering around a TLS constraint

Render supports log drains over syslog/TLS, but with a catch: it allows only a single destination, and that destination must accept TLS connections at Layer 4. We needed to accept a TLS syslog stream from Render, parse it, and fan the output out to both Groundcover and our previous observability platform simultaneously for the duration of the migration.

Our solution was a two-component pipeline:

1. A TLS-terminating load balancer at the edge

We stood up a cloud load balancer to handle TLS termination, accepting Render's syslog-over-TLS connections and forwarding plain TCP to the backend. This bridged Render's TLS requirement with an OTEL Collector that would have otherwise needed its own certificate management.

2. A Kubernetes-hosted OpenTelemetry Collector as the processing layer

The collector receives the plain TCP syslog stream, parses RFC 5424 format, extracts JSON from message bodies, normalizes severity levels, and fans the output to multiple exporters simultaneously from the same pipeline. During the entire migration window, both systems received identical data, enabling side-by-side dashboard and alert validation before any cutover.
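A trimmed sketch of what such a pipeline looks like in OpenTelemetry Collector configuration. Endpoints and exporter names below are placeholders; the production config carries additional parsing and normalization:

```yaml
receivers:
  syslog:
    tcp:
      listen_address: "0.0.0.0:54527"  # plain TCP from the load balancer
    protocol: rfc5424

processors:
  batch: {}

exporters:
  otlp/groundcover:
    endpoint: groundcover.example.internal:4317  # placeholder endpoint
  otlp/previous:
    endpoint: previous-vendor.example.com:4317   # placeholder endpoint

service:
  pipelines:
    logs:
      receivers: [syslog]
      processors: [batch]
      exporters: [otlp/groundcover, otlp/previous]  # dual-write fan-out
```

The fan-out lives entirely in the `exporters` list of one pipeline, which is what made running both vendors in parallel cheap to operate.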

For metrics and traces on Render, we built a composite OTEL module in our shared observability package. Rather than a hard switch between vendors, the module emits to both destinations simultaneously via separate exporters. Switching a service fully to Groundcover required changing two environment variables. No code deployment, no risk. Each team could validate their dashboards in Groundcover at their own pace and flip the switch when they were confident.
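The cutover mechanism can be sketched in a few lines. The environment-variable names and endpoints here are hypothetical stand-ins; the real module wires these decisions into separate OTLP exporters on a shared provider:

```python
import os

def active_exporters(env=None):
    """Return the OTLP endpoints a service should emit to.

    OTEL_EXPORT_GROUNDCOVER / OTEL_EXPORT_PREVIOUS are hypothetical
    variable names used for illustration. Both default to enabled,
    giving dual-write during the migration window.
    """
    if env is None:
        env = os.environ
    endpoints = []
    if env.get("OTEL_EXPORT_GROUNDCOVER", "true") == "true":
        endpoints.append(env.get("GROUNDCOVER_OTLP_ENDPOINT", "groundcover:4317"))
    if env.get("OTEL_EXPORT_PREVIOUS", "true") == "true":
        endpoints.append(env.get("PREVIOUS_OTLP_ENDPOINT", "previous:4317"))
    return endpoints

# Default: dual-write to both backends.
assert active_exporters({}) == ["groundcover:4317", "previous:4317"]
# Flip one variable to cut over -- no code deployment required.
assert active_exporters({"OTEL_EXPORT_PREVIOUS": "false"}) == ["groundcover:4317"]
```

Because the switch is configuration, not code, rolling back a cutover is as fast as rolling it forward.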

This dual-write architecture was the safety net that made a cross-company migration feel manageable rather than terrifying.

What we learned

Rethink your queries, don't just translate them

The instinct when switching platforms is to reproduce every existing query. That instinct leads you astray.

Proprietary DSLs and PromQL handle time-series aggregation differently. Functions that behaved implicitly in a proprietary query language require explicit look-back windows in PromQL. When we understood why a query worked, not just what it literally said, we were able to transition smoothly. When we copy-pasted query logic, we often ended up chasing false discrepancies.
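A concrete example of the difference: where a proprietary DSL might compute a counter's rate with an implicit look-back, PromQL makes the window an explicit range selector. The metric name below is illustrative:

```promql
# The [5m] range selector is mandatory: rate() only operates on a
# range vector, so the look-back window must be stated explicitly.
rate(http_requests_total{job="api"}[5m])
```

Choosing that window deliberately (rather than inheriting a vendor default) is exactly the kind of "understand why it worked" step that kept us from chasing false discrepancies.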

Treat the migration as an opportunity to audit your metrics strategy. In many cases, replacing a complex log parsing query with a properly emitted OTEL metric (a counter, gauge, or histogram) produces dashboards that are faster, more reliable, and easier for both humans and LLMs to reason about. If you find yourself extracting a number from a log string every time you want to graph something, the real fix is to emit a metric.

Open standards give AI leverage

Before the first dashboard gets migrated, the team should align on the tools and patterns everyone will use. We landed on Grafana backed by ClickHouse SQL as our standard. Grafana is a known quantity with extensive community resources, and querying via standards-based tools means engineers don't have to learn a new proprietary DSL. Because Grafana and ClickHouse SQL are vendor-agnostic and work against any self-hosted telemetry backend, this choice keeps us maximally mobile.
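As an illustration of the kind of panel query this standard enables, here is a hedged sketch in ClickHouse SQL. The table and column names are hypothetical, not Groundcover's actual schema:

```sql
-- Error-log volume per service, bucketed per minute (hypothetical schema).
SELECT
    toStartOfMinute(timestamp) AS t,
    service_name,
    count() AS errors
FROM logs
WHERE severity = 'error'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY t, service_name
ORDER BY t
```

Any engineer who knows SQL can read, review, and extend a panel like this, with no proprietary DSL in the loop.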

It also makes AI a force multiplier. LLMs understand open query languages deeply, which made it practical to use Claude to write first drafts of Grafana panel definitions and significantly accelerate the migration of over 100 dashboards. One engineer maintained a running "context file" that an LLM iteratively improved as he worked through migrations, encoding hard-won patterns and sharing it across the team. That file compressed weeks of trial-and-error into seconds of context lookup. Proprietary query languages don't give you this kind of leverage.

Invest in parsing rules early

Logs from PaaS environments often arrive as semi-structured strings. Without parsing rules that promote fields into queryable attributes, you lose the ability to filter, alert, and build dashboards on log data effectively. This isn't a Groundcover-specific concern. It's a reflection of how much normalization work a mature platform may have been doing automatically, often invisibly.
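The idea behind a parsing rule can be sketched in Python against a made-up log line. In practice this normalization happens in the platform's parsing-rule engine rather than application code, and the log format below is an assumption for illustration:

```python
import re

# Hypothetical semi-structured PaaS log line.
LINE = 'level=error service=payments latency_ms=412 msg="upstream timeout"'

# Promote key=value pairs (quoted or bare) into queryable attributes.
PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')

def parse(line):
    return {key: quoted if quoted else bare
            for key, quoted, bare in PAIR.findall(line)}

attrs = parse(LINE)
assert attrs["service"] == "payments"
assert attrs["msg"] == "upstream timeout"
assert int(attrs["latency_ms"]) == 412
```

Once fields like `service` and `latency_ms` are first-class attributes instead of substrings, filtering, alerting, and dashboarding on them becomes trivial.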

Teams that invested in parsing rules before handing over on-call responsibilities had a dramatically smoother experience. Start with the logs your on-call team reaches for most during incidents. Getting the most important structured fields queryable early made the difference between on-call engineers trusting Groundcover and reverting to old habits.

From dashboards to agents: MCP-powered incident discovery

The clearest sign to me that we had crossed into something new came from a demo session I prepared for an all-hands meeting. Rather than a scripted walkthrough, I connected the Groundcover MCP to Claude Opus and asked a single open-ended question: "Is there anything demo-worthy in this production metrics set?"

Within ten seconds, the model surfaced what looked like an incident. Elevated system load, query concurrency patterns correlating with CPU and memory pressure, ephemeral disk pressure cascading into pod evictions. It wasn't synthetic. It was a real, undetected production issue that had been sitting in the data waiting to be found.

What stood out wasn't just that the issue was discovered, but how: the MCP correlated signals across metrics, events, and logs to identify a specific failure mode, not just "high load." The investigation started with a vague exploratory question and converged naturally on a root cause with actionable remediation steps, all validated against live system behavior.

Groundcover MCP now sits at the core of our internal AI workflows, enabling engineers on call to query production telemetry in natural language without needing to know which dashboard to open first. Teams have since built self-triage capabilities on top of it, allowing services to own their incident workflows while relying on a shared observability foundation.

This kind of exploratory querying wouldn't have been financially realistic under a strict ingest-and-query usage model. Now, it's the backbone of our incident response strategy.

Results

  • Over 80% reduction in observability costs, while collecting over twice the data
  • Drop rules fully deprecated. Every metric and log is now collected. No more deliberate blind spots.
  • Full-fidelity traces across the request path, including spans down to individual database queries
  • Telemetry inside our own cloud perimeter, with region-level control and longer retention windows
  • AI-native incident response via Groundcover MCP integrated into our AI workflows
  • New services start on Groundcover by default. New platform projects begin with modern observability from day one.
  • Standardized on open, portable tooling. Grafana and OpenTelemetry are vendor-agnostic by design. Our dashboards, pipelines, and instrumentation are not tied to any single observability backend.

What we'd tell our past selves

eBPF, OTEL, and MCP are a stack, not a menu

eBPF gives you breadth for free across Kubernetes: automatic metrics, traces, logs, and service maps with zero instrumentation overhead. OTEL gives you depth and control for PaaS and non-Kubernetes environments, and lets you enrich spans with business context your kernel can't see. MCP closes the loop by making all of that data queryable by both humans and AI agents. The right model is eBPF as the baseline, OTEL for enrichment, and MCP as the interface.

Dual-write is worth the engineering investment

The composite telemetry module and syslog fan-out pipeline added real upfront work, but they reduced migration risk to near zero. We never had a dark cutover. Every team validated their Groundcover setup before giving up their safety net.

Different paradigms, not just gaps

No two observability platforms model the same concepts identically. Every migration surfaces workflows that need to be rebuilt differently, not because one tool is better or worse, but because they make different design choices. We addressed the differences with Grafana dashboards and runbooks, and we've found the Groundcover team to be responsive to product feedback. Treating these as paradigm differences to adapt to, rather than blockers, led to better outcomes than expecting a one-for-one feature match.

Choose partners, not just products

The quality of support you get from a vendor matters as much as the product itself, especially during a migration. The Groundcover team has been a genuine partner throughout: quick to respond, proactive about sharing guidance, and willing to roll up their sleeves in hands-on debugging sessions. Weekly syncs during the migration gave us a direct line to the people building the product. If you're evaluating observability platforms, ask yourself not just what the software does, but how the team shows up when things get hard.

Active knowledge transfer is non-negotiable

Infrastructure ran hands-on pairing sessions with each team, embedded in on-call rotations during the transition window, and wrote runbooks that met engineers where they were. A new observability platform is only as good as the team's ability to use it under pressure.

---

It's a rare occasion when you can cut observability costs by over 80% and make the platform strictly better at the same time. Broader coverage, better incident response, data in your own cloud, AI-native workflows. This migration was one of those occasions.

The constraint was never Kubernetes. It was the assumption that you have to choose what to observe based on cost. eBPF breaks that assumption for every workload it can reach. For the rest, a well-designed OTEL pipeline comes close enough. And when your observability platform is queryable by an AI that can find real production incidents before you know to look for them, you've crossed into something that feels genuinely different from a dashboard you check when things go wrong.
