Leadership Lessons From a 400 TB Migration

Elevating Execution: Lessons From a 400 TB Migration on an Ambitious Timeline

Engineering

June 30, 2026

14 min

See how TRM's data platform team migrated from Postgres to StarRocks on Iceberg in just 1.5 quarters, cutting USD 1.2 million in annual cloud costs

By Amit Plaha

At TRM, “elevating execution” means up-leveling your team's delivery: setting the right goals, delivering on time, and raising the quality bar. It’s easy to nod at as a phrase but hard to do in practice, especially on the kind of project that decides whether a company crosses its next S-curve or stalls at the foot of it.

This post is about what elevating execution actually looked like during one such project: migrating Address Transfers. Address Transfers is the table our customers use to see how a blockchain address has sent and received funds. It is the first table we build, and every downstream calculation depends on it, so it affects most API routes and product features.

We had to move this data from Postgres to StarRocks on Apache Iceberg. We moved roughly 400 TB of data across two Postgres clusters (blue/green) for 70+ blockchains, migrated 30+ data-intensive API routes, saved north of USD 1.2 million in annual cloud cost, and unlocked customer-facing capabilities that were previously impossible. And we completed the migration of routes in about a quarter and a half.

Crossing an S-curve on an ambitious timeline came down to five leadership moves: working backwards from the date, re-shaping the team to match the work, spending in-person time on the moments that earn it, treating transparency as a mechanic rather than a virtue, and failing fast, failing different, and adapting when the obvious path is wrong for customers. The story below is how each of these principles held up under pressure.

Overcoming the S-curve

Every durable engineering org eventually meets the ceiling of its current technology stack. Postgres had served TRM well for years, but as our data and customer base grew, we started approaching its limits on cost, scale, and the kinds of queries we could serve to customers. Address Transfers tables were timing out on Postgres, and we simply could not give customers the view they needed. StarRocks on Iceberg offered a new ceiling, but came with its own challenges.

Serving Iceberg at single-digit-second latency is not a well-trodden path. Most companies use Iceberg as a data lake, where analytics queries can take minutes. Using it as a low-latency serving layer for a product UI is rare, and the small number of teams who have done it haven’t exactly written friendly how-to guides. We had no industry playbook to borrow from — no public reference architecture, no peer postmortem, no SDK that abstracted the hard parts. Every meaningful problem had to be solved from our own first principles.

The proof phase: What we tested before execution

Before any migration began, the platform team spent a full quarter de-risking the execution. We ran 9+ experiments answering specific questions about Iceberg's behavior under TRM's query patterns: Could we hit single-digit-second latency on the joins our APIs actually run? How did partitioning and bucket sizing affect tail latency? What did compaction cost in throughput? Could the system hold up at customer-facing concurrency?

The second major part was NGQF, our NextGen Query Federation framework: the layer that lets API routes serve from StarRocks on Iceberg (batch) and AlloyDB (real-time). Building it also meant moving away from Postgres functions to a TypeScript-based business-logic layer.

By the end of the proof phase, we had a tested, measured, and documented case for why this would work.

We had to operate under three constraints

Proving the technology works was the prerequisite. Everything that followed was about executing under the real constraints of the business, our team, and the needs of our customers.

The customer experience floor was non-negotiable: Some impact during a migration of this scope was unavoidable, but we needed to be disciplined in detecting and acting on it immediately.
The timeline was ambitious but fixed: Roughly 1.5 quarters, with leadership commitments behind it. Two forcing functions converged on this timeline:
1. A GCP contract renegotiation tied to a committed-use discount (CUD) purchase locked the cost savings to a calendar date.
2. We were starting to hit a wall with parallel copy on Postgres as we loaded Solana Address Transfers; the loading pipeline was getting harder to manage every week.
The org shape was wrong: The natural owner of this work, the team that owned the API routes’ business logic and SLOs, was tied up with other commitments and could not take it on.

Execution came down to five concrete practices, each anchored in a guiding principle or leadership lesson. The pillars that follow are how each one held up under pressure.

Pillar 1: Work backwards from the timeline, not forward from the plan

The most common failure mode I see on ambitious projects is working forward: "Here’s everything we want to do, here’s how long each piece takes, here’s when we will finish." But that math almost always lands somewhere unacceptable, and then teams must either negotiate the deadline (sometimes not possible) or quietly cut quality.

Working backwards is different. You start from the go-live date and ask, "What is the minimum viable shape of ‘done,’ and what has to be true two weeks before then, four weeks before then, eight weeks before then?" It’s a forcing function that makes you confront which scope is real, which is decorative, and which you’ve just been carrying around because you never sat down to question it.

For this project, working backwards meant:

First, de-risking the technical complexity completely before committing the org. As covered in the proof phase, we didn’t start migrating routes "in parallel" with the experiments; we let the proof finish.
Then, sequencing the riskiest and most critical routes first (not the easiest ones). Most teams default to tackling what’s easiest first because it builds confidence and visible momentum. But we did the opposite. If a critical, high-traffic route fundamentally could not meet our latency bar on Iceberg, we needed to know in week 3 — while we still had room to re-optimize, re-architect, or escalate — not in week 10, when there would be no good options left. The price was a harder first month, but the payoff was no major late-stage surprises.

Pillar 2: Re-shape the org to match the work

For a migration of this scale and complexity, the “textbook” approach would be to have the team that already owns the API routes (their business logic and their SLOs) also own the migration. But the textbook answer wasn’t an option for us: that team had other commitments that couldn’t move.

We had two choices: wait or re-shape. Waiting also wasn’t on the table — the dollar cost of the delay compounded with every quarter, and customers couldn’t get the Address Transfer tables’ data reliably and performantly. So we re-shaped.

The platform team — which had originally scoped itself to the proof of the technology — absorbed the migration of 30+ routes as well. We pulled in other platform engineers whose roadmap work had to be partially deprioritized to free them up. I went into those conversations eyes-open: this project had a real cost, and I owned it explicitly with stakeholders rather than pretending the work would still get done on the old schedule.

A few things that mattered here:

The cost of re-shaping was named, not hidden. When you ask people to context-switch onto a new project, the work they were doing before doesn’t just disappear. Pretending otherwise burns trust with adjacent teams who later wonder why their priorities slipped. We were explicit about what was deferred and why.
The right people, not the available people. It’s tempting to onboard whoever has bandwidth. We picked engineers based on the shape of the work — Iceberg fluency, comfort with performance tuning, willingness to own a route end-to-end including the rollback if it came to that.
Authority traveled with responsibility. The engineers running the migration had the authority to call a rollback without escalating to me. If they had to wait for permission, they would wait too long and customers would notice.

Pillar 3: Spend in-person time on the moments that earn it

We’re a distributed team; most of the time that works well. But there are specific moments — like this one, where a new team had to ramp on novel technology under a hard deadline — when remote collaboration cannot compress trust, context, and decision velocity fast enough.

So I organized a one-week in-person “work week.” This got us all collaborating in the same room, in the same timezone, on the same whiteboards (and eating the same lunches). The result of this intentional in-person time was even better than I had expected.

Knowledge transfer that would have taken weeks happened in days. Folks who had built deep intuition during the proof phase and from long tenures at TRM brought everyone else up to speed. The newer engineers learned by watching them build and debug live, not by reading docs alone.
Decisions that would have ping-ponged across time zones got made in single conversations. When you can read the room, you can tell when consensus is real — or when someone is quietly unconvinced.
A team identity formed around the project. This is the part that’s hardest to quantify, but the most important. The engineers came out of that week not just better informed but more personally invested in the outcome — this carried them through the hard weeks that followed.

Pillar 4: Bad news travels at the speed of good news

This is the pillar I think about most often, and the one I would most want a younger version of me to internalize earlier.

During the migration, several routes were slower on the new stack than our pre-migration benchmarks predicted. The customer experience floor was non-negotiable, so the answer in each case was the same: roll back, investigate, optimize, re-attempt.

Every rollback was a small failure. Cumulatively, they would have looked like a project in trouble if you only saw the rollback count and nothing else. I told our CTO, Rahul Raina, about every single one. In the same weekly update, in the same format, at the same level of detail as the wins. Not because I was performing transparency, but because I had learned, sometimes the hard way on earlier projects, that transparency is a mechanic.

Every honest update compounds into discretionary capital. When you’ve shown up week after week with the unvarnished truth, leadership knows what your green status actually means. When you eventually need air cover — for a slip, for a budget request, for an unpopular decision — you have the credit balance to spend.
Early bad news is actionable; late bad news is only reportable. A CTO who hears about a rollback in week 3 can still help: reallocate budget, share a pattern from a prior project, or adjust an external commitment before it becomes a broken promise. The same news in week 7, alongside a missed milestone, leaves leadership with nothing to do but react. Telling them early gives them a job that’s not just damage control.

In this project, the weekly cadence worked because the format was consistent: What landed, what rolled back, what we learned, what is next. The hard weeks read as hard. The good weeks read as good. By the end of the project, there were no surprises — exactly what you want when the stakes are high.

Pillar 5: Fail fast, fail different; adapt when reality moves

Plans are hypotheses. The discipline of execution is running tight loops and making sure every cycle teaches you something new. This pillar showed up at two scales.

At iteration scale

When a route regressed on the new stack, we rolled back, formed a new hypothesis, made one change, and tried again. The rule was that consecutive failures had to look different. If the same route was failing the same way twice, the loop itself was broken, not just the latest attempt. Re-running a broken approach with cosmetic tweaks is how most migrations stall.
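The "consecutive failures had to look different" rule can be made mechanical. The sketch below is a minimal illustration in Python, not TRM's actual tooling: the `Attempt` record, the `loop_is_broken` helper, the route name, and the failure labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Attempt:
    """One cutover attempt for a route (hypothetical record shape)."""
    route: str
    change: str                   # the single change made for this attempt
    failure_mode: Optional[str]   # None means the attempt succeeded


def loop_is_broken(attempts: List[Attempt]) -> bool:
    """Return True if the two most recent attempts failed in the same way."""
    if len(attempts) < 2:
        return False
    prev, last = attempts[-2], attempts[-1]
    if prev.failure_mode is None or last.failure_mode is None:
        return False
    return prev.failure_mode == last.failure_mode


# Illustrative history: two different changes, same failure signature.
history = [
    Attempt("example-route", "baseline cutover", "p99 over budget on join"),
    Attempt("example-route", "repartitioned table", "p99 over budget on join"),
]

print(loop_is_broken(history))  # True: stop and rethink the hypothesis
```

The check only works if failure modes are labeled consistently, so in practice the label would need to come from something measurable rather than free text.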

At project scale

Mid-flight, an adjacent decision landed on the same 30+ routes we were migrating: moving the real-time data layer behind NGQF off Postgres and onto a managed AlloyDB deployment. The conventional answer was to migrate the routes twice. We evaluated it quickly: a doubled customer-facing risk window, real-time data pushed out by months, and doubled engineering cost. Instead, we absorbed the combined scope and migrated once, on the integrated path. The original target was July 30, but that timeline shifted to the first week of September. We re-baselined visibly with our CTO.

Iteration scale: Every cycle fails differently from the last. If you cannot name what was different about the second attempt, the loop is broken.
Project scale: Run the loop on the new constraint, not the playbook. Convention is often wrong when reality has changed. Evaluate fast, walk away from the wrong answer, re-baseline loudly.

Our companion post on real-time blockchain ingestion covers the upstream half of that story.

The outcome

We migrated roughly 400 TB of data to Iceberg and cut over 30+ API routes to StarRocks and our NextGen Query Federation framework. We realized annualized cloud savings north of USD 1.2 million. Address Transfers tables are now servable to customers at acceptable latency, and those same routes deliver real-time blockchain data, a capability customers had been asking for and that a sequential migration would have pushed out by months. And, less easy to put on a slide but more durable, we now have a platform team that knows how to execute together at this level, on novel technology, under real pressure. It’s muscle that will make the next hard project possible.

This last point connects to a concept I keep coming back to as a manager: burn multiple. The point of crossing an S-curve is not the technology — it’s reducing the cost of every future dollar of engineering investment. A team that can deliver Address Transfers in 1.5 quarters can deliver the next critical migration in less time, with lower coordination tax.

What didn't go perfectly, and what we learned

A migration of this scope, on a timeline this tight, on technology this novel, was never going to be clean. A few big lessons we learned the hard way:

Latency regressions on specific routes reached customers before we caught them. Our pre-migration benchmarks did not always predict real-world load and query shapes. We have since invested in synthetic load testing and shadow-traffic tooling so the next migration starts from a stricter pre-cutover bar.
A handful of routes saw brief errors during cutover, which we resolved as they occurred. We have since built better cutover automation and made staged rollouts behind feature flags and automatic rollback triggers standard practice, so the same issues do not recur.
We were inventing the playbook, and the tools were not yet ready to help us shortcut it. We had no precedent that we could lean on, and the agentic AI tooling available in 2025 could not reliably reason about route-level performance optimization. We also lacked Iceberg observability into the cardinality of our data, which made optimizations difficult to identify. The team carried the full cognitive load, often solving the same hard problem more than once before it stuck. That has changed dramatically since.
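To make the idea of an automatic rollback trigger concrete, here is a minimal Python sketch of one possible check over a staged-rollout window. It is an illustration under stated assumptions, not a description of TRM's implementation: the function names, the 5000 ms p99 budget, and the 1% error-rate ceiling are hypothetical values chosen for the example.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample set."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def should_roll_back(
    latencies_ms: Sequence[float],
    errors: int,
    requests: int,
    p99_budget_ms: float = 5000.0,   # hypothetical latency budget
    max_error_rate: float = 0.01,    # hypothetical error-rate ceiling
) -> bool:
    """Decide whether a staged rollout window breaches the floor."""
    if requests == 0:
        return False  # no traffic observed yet, nothing to judge
    if errors / requests > max_error_rate:
        return True
    if not latencies_ms:
        return False
    return percentile(latencies_ms, 99) > p99_budget_ms


# Illustrative window: 2 of 100 requests are very slow, so p99 breaches.
window = [100.0] * 98 + [9000.0] * 2
print(should_roll_back(window, errors=0, requests=100))  # True
```

A trigger like this would sit behind the feature flag that controls the rollout percentage, so that a breach flips traffic back to the old path without waiting for a human decision.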

In part two of this blog series, we’ll cover how we used these learnings and the agentic AI tooling that had matured to accelerate the next 180+ route migrations in roughly the same window.

Credits and gratitude: Andrew Fisher for being the technical lead and anchor; Andrew Fisher and Pedro for proving the path; Elena, Kendall, and Stan Kudrow for absorbing the migration alongside their roadmap commitments; Anand Lalvani for the real-time DB platform work; and Michael Andrews and Rahul Raina for the mentorship.
