Replacing the Engine Mid-flight: How a Closed-loop AI Agent Upgraded StarRocks Across 58 Releases With Zero Downtime
AI
Engineering
April 17, 2026


We built a closed-loop AI system that turns production traffic into automated database optimization — and used it to upgrade StarRocks across 58 releases with zero customer impact.

Vijay Shekhawat

At TRM Labs, StarRocks is one of the most critical components of our serving layer (we wrote about adopting it during our lakehouse migration). It powers the analytical queries behind our blockchain intelligence products — entity risk scores, transaction flow analysis, exposure calculations, and more. Every time a compliance team screens an address or an investigator traces funds, that query hits StarRocks.

We run three production clusters handling distinct workloads. Together, they serve millions of API queries across hundreds of distinct API endpoints. When StarRocks is slow, customers feel it. When StarRocks is wrong, customers get bad data. The stakes are high.

{{horizontal-line}}

The problem: Replacing the engine mid-flight

We had been running version 3.2.11 for nearly two years, deliberately prioritizing stability over upgrades. But the trade-off had a cost. Going from 3.2.11 to 4.0.6 meant skipping 58 releases — four minor versions and one major version bump. Query optimizer decisions change. Default settings change. How specific joins, hints, and scan operations behave — all of it can shift in subtle ways.

The conventional playbook — deploy the new version, spot-check a few queries, and hope for the best — doesn't scale when you have hundreds of API endpoints serving production traffic. And running the new version locally or against staging wouldn't reveal how it behaves at production scale — different data volumes, different query concurrency, different cache pressure.

We needed a fundamentally different approach.

The insight: A closed-loop optimization system

The breakthrough wasn't any single tool — it was the loop that connected them.

We realized that upgrading a critical database safely requires a system that can automatically detect what regressed, diagnose why it regressed, produce a targeted fix, and validate that the fix works, all against real production traffic and at scale.

The data told us what regressed. The profiles told us why. AI produced the fix. The shadow environment validated it. That closed loop — data → diagnosis → code change → validation — is what made it possible to complete this upgrade in weeks instead of months.

This isn't just a StarRocks upgrade pattern; it's a methodology that applies to any database version migration, configuration change, or schema evolution where you need to validate behavior across hundreds of query patterns at production scale. The loop is the reusable idea, and the tools are how we implemented it.

{{horizontal-line}}

The loop in practice

Step 1: Data | Shadow every query

We built infrastructure to run real production traffic against both versions simultaneously. A second StarRocks cluster running v4.0.6 received mirrored production queries from our primary cluster (v3.2.11). The primary path remained completely untouched, with zero added latency to customers.

Every query was logged with latency, success/failure, and an MD5 hash of the SQL text (query_hash) for cross-system correlation. Critically, we weren't just measuring performance; we were catching queries that stopped working entirely. Any query that succeeded on the primary cluster but failed on the shadow cluster was flagged immediately, providing a continuous regression signal across all API endpoints.
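The correlation key is simple to reproduce. Here is a minimal Python sketch of the query_hash computation (the production tools are written in Go, and whether the SQL text is normalized before hashing is not specified here — this hashes the raw text):

```python
import hashlib

def query_hash(sql: str) -> str:
    """Deterministic MD5 of the SQL text, used to correlate
    shadow-proxy log entries with execution profiles."""
    return hashlib.md5(sql.encode("utf-8")).hexdigest()

# The same statement always yields the same hash on both clusters.
h = query_hash("SELECT risk_score FROM entities WHERE address = ?")
```

Because the hash is a pure function of the SQL text, both the proxy and the profile collector can compute it independently and still agree.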

The shadow proxy mirrors production queries to a second StarRocks cluster asynchronously, while profile collectors poll both clusters for execution profiles. All data lands in GCS as Hive-partitioned JSONL, queryable via BigQuery/Trino/StarRocks.

Step 2: Diagnosis | Profile every regression

The query_hash is what ties the system together. With both clusters instrumented, every query gets a deterministic identifier that links the shadow proxy logs (what happened) to full execution profiles (why it happened).

The query_hash (MD5 of the SQL text) is the join key that correlates proxy logs with execution profiles across both clusters, enabling per-endpoint performance comparisons and root-cause analysis.

This lets us write queries like:

SELECT
    proxy.query_hash,
    proxy.duration_ms AS primary_ms,
    shadow.duration_ms AS shadow_ms,
    profiles.profile_text
FROM shadow_proxy_logs proxy
JOIN shadow_proxy_logs shadow
    ON proxy.query_hash = shadow.query_hash
    AND proxy.target = 'primary'
    AND shadow.target = 'shadow'
JOIN starrocks_profiles profiles
    ON proxy.query_hash = profiles.query_hash
WHERE shadow.duration_ms > proxy.duration_ms * 1.5

This pipeline takes us from "endpoint X is 200ms slower on v4.0" all the way down to "the regression is in a specific scan operator due to missing data cache warmup."

Step 3: Code change | The agentic AI optimization loop

Step 3 is itself a loop within the loop. While the outer cycle (data → diagnosis → code change → validation) runs across the full upgrade, the AI optimization agent runs its own inner cycle for each individual endpoint.

The execution profiles from Step 2 don't just explain regressions to humans; they feed directly into an autonomous AI agent that can optimize endpoints without manual intervention.

For each slow endpoint, the agent:

  1. Captures a baseline: Collects the full context — query profile (runtime metrics), EXPLAIN plan, table schemas (clustering keys, partitions), and a result snapshot for correctness comparison.
  2. Identifies the bottleneck: Reads the profile and schema artifacts, applies StarRocks-specific tuning knowledge, and pinpoints the issue (expensive scans, suboptimal joins, missing aggregation pushdown, predicate placement).
  3. Modifies the SQL template: Makes one targeted change at a time, not a “big-bang” rewrite.
  4. Verifies via compare-baseline: A test harness renders SQL from both the modified branch and the baseline, executes both against StarRocks with multiple iterations and cache busting, then compares many dimensions, primarily:
    • Correctness: Row-by-row result diff
    • Performance: Median latency delta
    • EXPLAIN plan: Structural diff
    • Query profile: CPU, I/O, memory
  5. Pass or revert: If the change passes all checks, keep it and try the next optimization. If it fails, revert and try a different approach. Max three attempts per endpoint.
  6. Ships the fix: Makes the code changes and creates a PR with a detailed performance analysis report.
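The pass-or-revert cycle in steps 3–5 above can be sketched as a small driver loop. This is an illustrative Python sketch, not the agent's actual implementation; propose_change and compare_baseline are hypothetical stand-ins for the AI step and the test harness:

```python
MAX_ATTEMPTS = 3  # per endpoint, as in step 5

def optimize_endpoint(baseline_sql, propose_change, compare_baseline):
    """One targeted change at a time; keep a change only if the harness
    confirms correctness and a latency win, otherwise revert to the
    last known-good SQL and try a different approach."""
    current = baseline_sql
    for attempt in range(MAX_ATTEMPTS):
        candidate = propose_change(current, attempt)
        report = compare_baseline(candidate, current)
        if report["rows_match"] and report["median_latency_delta_ms"] < 0:
            current = candidate  # pass: keep it and build on it
        # fail: fall through; the next attempt starts from `current`
    return current
```

The key design choice is that a failed attempt never pollutes the baseline: every candidate is compared against the last change that passed all checks.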

The test harness that powers the verification step is also available as a standalone tool — developers can use it independently to validate any SQL change before merging.

This lets us systematically improve performance across dozens of API endpoints without introducing regressions.

Step 4: Validate | Re-test in shadow

Every fix re-enters the loop. The modified queries run against the shadow cluster with real production traffic, and the same profiling infrastructure confirms the improvement. This isn't a spot-check — we compare the full latency distribution (P50, P90, P95, P99) before and after, across thousands of real queries. A fix that improves the median but blows up the P99 tail gets caught here, not in production. Only when the data confirms improvement across the full distribution does the change move toward production.
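The distribution-wide gate can be sketched in a few lines of Python using only the standard library. The percentile set and the strict "no tracked percentile may regress" rule are assumptions for illustration:

```python
import statistics

def latency_gate(before_ms, after_ms, quantiles=(50, 90, 95, 99)):
    """Require the fix to improve (or at least not regress) every
    tracked percentile, not just the median."""
    def pct(samples, q):
        # statistics.quantiles(n=100) returns the 1st..99th percentiles
        return statistics.quantiles(samples, n=100)[q - 1]
    return all(pct(after_ms, q) <= pct(before_ms, q) for q in quantiles)
```

A fix that shaves the median but inflates the P99 tail fails this gate, which is exactly the failure mode the shadow re-test is designed to catch.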

{{horizontal-line}}

What the loop caught

A critical correctness bug

The most important catch wasn't a performance regression, but a correctness issue. The newer version was returning inconsistent results on specific Iceberg queries related to data file bucket encoding.

We traced through StarRocks' 2 million+ line C++ codebase, using AI to navigate code paths, understand parameter interactions, and narrow down the root cause across multiple layers of abstraction. We found the bug, applied a fix, and confirmed the results were consistent before proceeding. We reported it upstream as StarRocks #70215.

Failing queries caught before cutover

Not all regressions are slow queries; some are broken queries. The shadow proxy flagged every query that returned an error on v4.0.6 but succeeded on v3.2.11. This surfaced dozens of failures that would have become customer-facing errors on cutover: queries rejected due to stricter SQL validation, changed function signatures, altered type-casting behavior, and subtle SQL standard enforcement changes — for example, columns in a USING clause can no longer be table-qualified in the SELECT, something older versions silently allowed. These behavioral changes don't show up in changelogs, but they break production queries. Each failure was caught, diagnosed, and fixed in the shadow environment before we touched production.
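Detecting this class of regression from the shadow logs amounts to a set intersection. A hedged Python sketch, assuming each log record carries a query_hash and a success flag (the real log schema may differ):

```python
def failing_on_shadow(primary_logs, shadow_logs):
    """Return query_hashes that succeeded on the primary cluster but
    errored on the shadow cluster — candidate breaking changes that
    would become customer-facing errors at cutover."""
    ok_primary = {r["query_hash"] for r in primary_logs if r["success"]}
    failed_shadow = {r["query_hash"] for r in shadow_logs if not r["success"]}
    return sorted(ok_primary & failed_shadow)
```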

Stability issues reported upstream

During shadow testing, we identified stability issues in StarRocks 4.0 and reported them directly to the StarRocks team, contributing fixes back to the open-source project.

{{horizontal-line}}

The results

The combination of StarRocks 4.0's engine improvements and dozens of code-level endpoint optimizations identified by the closed-loop AI optimizer delivered the gains below.

The engine upgrade alone delivered a significant improvement: 36 API endpoints got faster between v3.2.11 and v4.0.6, 25 of them with zero code changes. The new query planner, cost-based optimizer, and execution engine delivered gains out of the box:

| Route | P90 Impact | P99 Impact |
| --- | --- | --- |
| Single-Key Lookup | 36.9% faster | 44.0% faster |
| Property Resolution | 68.8% faster | 36.0% faster |
| Bulk Screening | 32.1% faster | 32.2% faster |
| Flow Aggregation | 49.8% faster | 51.0% faster |
| Cross-Entity Join (Type A) | 55.5% faster | 94.4% faster |
| Cross-Entity Join (Type B) | 48.4% faster | 94.3% faster |

By automatically profiling every production query against both clusters, the closed-loop AI optimizer identified patterns in our own SQL that were preventing the new optimizer from reaching its full potential. That analysis led to 19 targeted optimizations across 20+ API endpoints, grouped into four categories:

| Category | Endpoints Optimized | Impact |
| --- | --- | --- |
| Query plan fixes (predicate pushdown, dead-code cleanup) | ~13 | I/O reduced up to 83%; planner time reduced up to 61% |
| Scan and JOIN elimination (unnecessary JOINs, deferred JOINs) | ~4 | Eliminated redundant table scans and JOINs; memory usage reduced up to 70% |
| Query restructuring (consolidate scans, eliminate CROSS JOINs) | ~2 | Reduced table scans from 5 to 2; eliminated nested-loop join patterns |
| Data partitioning and correctness (per-chain filtering, type fixes) | ~4 | Resolved timeouts; fixed a v4.0 correctness regression |

In total, 25 endpoints improved automatically from the StarRocks engine upgrade, and 20+ endpoints were further optimized through targeted code changes — many benefiting from both.

All three production clusters were upgraded successfully with zero customer-facing incidents.

{{horizontal-line}}

The tools: Open-source

The loop is powered by two tools we're open-sourcing today.

Shadow Proxy

The StarRocks Shadow Proxy is a MySQL-protocol-aware proxy written in Go. It sits between your application and StarRocks, forwarding queries to the primary cluster synchronously and mirroring them to the shadow cluster asynchronously.

Key design decisions:

  • Protocol-aware packet parsing: The proxy parses MySQL packet headers to identify complete packets, enabling intelligent routing without packet fragmentation issues.
  • One shadow worker per client connection: Each client connection spawns a dedicated shadow goroutine with a bounded queue (default 10,000 packets). If the shadow cluster falls behind, packets are dropped — never blocking the primary path.
  • TLS termination: The proxy terminates client-facing TLS and optionally uses STARTTLS for shadow connections, handling the SSL handshake dance required by StarRocks' MySQL protocol.
  • Selective mirroring: Query filtering lets you include or exclude specific SQL operations or patterns, and a sampling rate lets you control the volume of shadow traffic.
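The drop-never-block semantics of the shadow worker can be illustrated in Python with a bounded queue (the real proxy implements this in Go with a goroutine per client connection; sizes here are illustrative):

```python
import queue
import threading

def make_shadow_worker(forward, maxsize=10_000):
    """Per-connection shadow worker: a bounded queue drained by a
    background thread. enqueue() never blocks the primary path; when
    the shadow cluster falls behind, packets are dropped."""
    q = queue.Queue(maxsize=maxsize)

    def worker():
        while True:
            pkt = q.get()
            if pkt is None:  # shutdown sentinel
                break
            forward(pkt)     # mirror to the shadow cluster

    threading.Thread(target=worker, daemon=True).start()

    def enqueue(pkt):
        try:
            q.put_nowait(pkt)  # drop, never block, when the queue is full
            return True
        except queue.Full:
            return False

    return enqueue
```

The trade-off is deliberate: under shadow-cluster backpressure you lose some mirrored traffic, but the primary path never pays for it.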

Every query is logged to GCS in a Hive-compatible layout for direct BigQuery/Trino/StarRocks table access:

gs://bucket/query-logs/year=2026/month=03/day=05/hour=14/20260305_143022_a1b2c3d4.jsonl
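A sketch of how such an object path could be built; the field layout is inferred from the example above, and the proxy's exact naming logic may differ:

```python
from datetime import datetime

def log_object_path(bucket: str, ts: datetime, file_id: str) -> str:
    """Hive-partitioned layout (year=/month=/day=/hour=) so BigQuery,
    Trino, or StarRocks can read the logs as a partitioned table."""
    return (
        f"gs://{bucket}/query-logs/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"
        f"{ts:%Y%m%d_%H%M%S}_{file_id}.jsonl"
    )
```

Because the partition keys are encoded in the path, engines can prune by time range without opening a single file.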

Each log entry includes an MD5 hash of the SQL text — the query_hash — which becomes the join key connecting what happened with why it happened.

Profile Collector

The StarRocks Profile Collector is a companion tool that continuously polls StarRocks Frontend (FE) nodes for full query execution profiles. These profiles are enriched explain plans that include scan statistics, data cache hit rates, pipeline timing, and complete query plans.

The collector:

  • Discovers FE nodes dynamically via Kubernetes headless services or static host lists.
  • Polls incrementally using the FE's /api/query_detail endpoint, fetching only new queries since the last poll.
  • Deduplicates using an LRU cache to avoid re-fetching profiles within a configurable TTL.
  • Computes the same query_hash (MD5 of SQL text) as the shadow proxy, enabling cross-system joins.
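The deduplication step can be sketched as an LRU cache with a TTL. This is an illustrative Python sketch, not the collector's implementation; the maxsize and TTL defaults are assumptions (the real collector makes the TTL configurable):

```python
import time
from collections import OrderedDict

class TTLCache:
    """LRU-with-TTL dedup cache: a profile is re-fetched only if its
    query_id is unseen or its cache entry has expired."""

    def __init__(self, maxsize=10_000, ttl_s=3600.0, clock=time.monotonic):
        self.maxsize, self.ttl_s, self.clock = maxsize, ttl_s, clock
        self._seen = OrderedDict()  # query_id -> time last fetched

    def should_fetch(self, query_id: str) -> bool:
        now = self.clock()
        seen_at = self._seen.get(query_id)
        if seen_at is not None and now - seen_at < self.ttl_s:
            self._seen.move_to_end(query_id)  # refresh LRU position
            return False
        self._seen[query_id] = now
        self._seen.move_to_end(query_id)
        while len(self._seen) > self.maxsize:  # evict least recently used
            self._seen.popitem(last=False)
        return True
```

Injecting the clock keeps the expiry logic testable without real waiting.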

Profiles are written to GCS in the same Hive-partitioned format, making it trivial to set up BigQuery/Trino/StarRocks tables and start querying immediately.

Both tools currently write to GCS as Hive-partitioned JSONL. Apache Iceberg support as a storage backend is planned for both.

If you run StarRocks in production and have ever been nervous about an upgrade, a config change, or a schema migration, these tools give you a way to validate with real traffic before committing. Both are designed to run as Kubernetes sidecars, support GCS and S3 backends, and export Prometheus metrics for production observability.

We hope the StarRocks community finds them useful. If you do, we'd love to hear about it.

{{horizontal-line}}

Join our team

At TRM, the data platform team builds infrastructure that directly impacts how financial institutions, regulators, and law enforcement fight financial crime. If you're excited about solving hard problems at the intersection of databases, distributed systems, and blockchain analytics, we're hiring.
