From BigQuery to Lakehouse: How We Built a Petabyte-Scale Data Analytics Platform – Part 1

At TRM Labs, we provide blockchain intelligence tools to help financial institutions, crypto businesses, and government agencies detect and investigate crypto-related financial crime and fraud.

Our analytics platform processes petabytes of blockchain data across 30+ blockchains and answers 500+ customer queries per minute with ultra-low latency, powered by distributed Postgres and BigQuery. For years, we optimized BigQuery to scale efficiently, but when we needed to run on-premise, we hit a wall. BigQuery couldn’t support our multi-environment needs, and scaling Postgres for massive data ingestion became too costly. We needed an open, self-hosted, secure, and performant solution—so we built a petabyte-scale data lakehouse with Apache Iceberg and StarRocks to power our user-facing analytics needs. This is how we did it, what we learned, and why it might change how you think about data architecture.

1. First-Generation Data Platform

In our First-Generation Data Platform, we used a distributed Postgres cluster (Citus Data) for fast point lookups and small pre-aggregates. When workloads exceeded the distributed Postgres cluster's capacity, we federated larger queries and ad hoc aggregations to BigQuery.
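To make the split concrete, here is a minimal sketch (hypothetical table names, not our production schema) of the two query shapes: a sharded point lookup served by Postgres, and a full-history aggregation federated to BigQuery.

-- Served by the distributed Postgres (Citus) cluster:
-- a point lookup against a table distributed by address (hypothetical schema)
SELECT address, balance, last_active
FROM address_summaries
WHERE address = '0x1234abcd...';

-- Federated to BigQuery: an ad hoc aggregation over full transaction history
SELECT chain, DATE(timestamp) AS day, SUM(amount) AS total_volume
FROM transactions
GROUP BY chain, day;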

[Figure 1] Shows how TRM's First-Generation Data Platform handled user-facing analytics and routed queries through Postgres and BigQuery.

2. Beyond BigQuery and towards a next-generation open data lakehouse

While BigQuery served our customer analytics needs well for years, we encountered limitations as we expanded into multi-environment deployments, including on-premise environments. The need to share blockchain analytics data across multiple sites made managed services like BigQuery impractical, and our serving workloads required a new scaling approach.

Key requirements driving the shift

  • Multi-site deployment: The need to deploy our platform across multiple on-premise sites while maintaining data sharing capabilities required using open source solutions that could be deployed on Kubernetes.
  • Scale and performance: Our largest customer-facing workload, which previously relied on BigQuery for serving, contains 115 TB+ of data and grows by 2-3% monthly. These read queries involve complex multi-level joins with time-based and array-based filters. Meeting a three-second P95 latency at high concurrency with BigQuery proved challenging without investing in expensive compute slots. Moving workloads like this to distributed Postgres would be expensive for storage reasons alone.

Our Next Generation Data Platform needed to combine a data lake's flexibility with a data warehouse's performance and reliability. Building a modern data lakehouse around Apache Iceberg enabled interoperability with query engines and distributed compute engines that support the Iceberg specification. After benchmarking several query engines, we chose StarRocks. This combination of Apache Iceberg and StarRocks met our multi-site deployment and performance requirements while providing key advantages for future growth.

Opportunities we saw

  • Open standards: Apache Iceberg's open source implementation provides schema evolution, time travel, and efficient metadata management on object storage. Its flexibility enables deployment in on-premise multi-site environments, making it well suited for sharing blockchain analytics data across multiple locations (a short sketch of these features follows this list).
  • High-performance data lake: StarRocks provides ultra-low latency and high concurrency through advanced caching and fully vectorized query processing. Combining StarRocks with Iceberg gives us data warehouse performance while maintaining data lake flexibility.
  • Query engine independence: Building our data lakehouse on Apache Iceberg gives us the flexibility to integrate any compatible query engine, ensuring we can adapt as technology evolves. In the year since running our benchmarks, we've observed rapid advancement in query engine performance. We're eager to re-evaluate both incumbent solutions (e.g., Trino and DuckDB) and newcomers (e.g., ClickHouse and Crunchy Data Warehouse). This flexibility keeps us at the cutting edge of performance and cost-effectiveness, independent of any single vendor.
  • Cost reduction: Since data and metadata are stored efficiently on object storage, we identified an opportunity to migrate workloads from our distributed Postgres cluster, reducing SSD storage costs.
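To give a concrete flavor of these capabilities, here is a minimal Spark SQL sketch (hypothetical table name; the partition evolution statement requires Iceberg's Spark SQL extensions) of schema evolution, partition evolution, and time travel on an Iceberg table:

-- Schema evolution: add a column without rewriting existing data files
ALTER TABLE lake.blockchain_transactions ADD COLUMNS (fee DECIMAL(38,18));

-- Partition evolution: change the partition spec for data written from now on
ALTER TABLE lake.blockchain_transactions ADD PARTITION FIELD days(timestamp);

-- Time travel: query the table as of an earlier point in time
SELECT COUNT(*)
FROM lake.blockchain_transactions TIMESTAMP AS OF '2024-01-01 00:00:00';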

3. Why Apache Iceberg + StarRocks for a data lakehouse

With multi-environment deployments, including on-premise, becoming a key requirement, we needed an alternative solution for customer-facing analytics use cases. We had a few key observations from our work with BigQuery and Postgres:

  1. Minimizing the data read at query time is critical; compression, clustering, and partitioning all reduce how much a query must scan (see the sketch after this list).
  2. Traditional B-tree style indexes become inefficient at petabyte scale.
  3. Modern vectorized CPU execution (e.g., SIMD) significantly speeds up query processing.
  4. Horizontal scaling enables high concurrency while keeping costs reasonable.
  5. Separating compute from storage lets us flexibly switch between query engines or combine them for optimal workload performance, without duplicating data.
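As one worked example of the first observation, Iceberg exposes table layout controls that cut the bytes scanned per query. A hedged Spark SQL sketch (hypothetical table; the write-order statement requires Iceberg's Spark SQL extensions):

-- Compression: smaller data files mean fewer bytes read per scan
ALTER TABLE lake.transactions
SET TBLPROPERTIES ('write.parquet.compression-codec' = 'zstd');

-- Clustering: sort rows within data files so selective filters
-- can skip entire row groups via min/max statistics
ALTER TABLE lake.transactions WRITE ORDERED BY chain, from_address;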

Based on these insights, we moved beyond traditional OLAP data stores (e.g., ClickHouse) and began exploring the emerging "Data Lakehouse" market. We needed to make two key decisions: (1) our storage format and (2) our query engine.

3.1 Storage format

At TRM, our storage needs, especially with the advent of high-throughput blockchains, are growing exponentially each year. We needed to ensure our storage system was both performant and cost-effective as we onboarded additional blockchains in the future.

Starting with cost, we knew we needed to move away from SSDs toward object stores since even the most expensive object store is 4x cheaper than the least expensive SSD.

With our option set narrowed to object stores, we evaluated three of the most popular open table formats for building a data lakehouse: Delta Lake, Apache Hudi, and Apache Iceberg.

Although Delta Lake offered compelling features and performance, we ruled it out due to its lack of partition evolution and its overlap with Iceberg for large-scale analytics and batch processing. We then benchmarked Apache Hudi; our best-performing Hudi table was 3x slower than Apache Iceberg.

We made a bet on Apache Iceberg, which provided exceptional read performance while sporting wide community adoption, an active development community, and broad catalog and query engine support.

3.2 Query engine

Having chosen our table format, we benchmarked several query engines that were compatible with Iceberg in their open source versions. We evaluated three engines: Trino, StarRocks, and DuckDB. Our benchmarks showed that StarRocks consistently outperformed the others (see Figure 2 below; a sketch of how an engine attaches to the shared Iceberg catalog follows it).

  • Trino: An open source, distributed query engine designed to query very large datasets.
  • StarRocks: A fast, open source query engine for analytics on and off the data lakehouse.
  • DuckDB: Open source, in-process analytical SQL query engine.
[Figure 2] Our benchmark results show query performance comparisons for lookup/filter operations across multiple configurations of three engines, tested on a 2.57 TB blockchain analytics dataset. StarRocks delivered consistently superior response times in all configurations.
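One practical benefit of this setup is engine interchangeability: every engine reads the same Iceberg tables, so attaching a new engine is mostly a matter of pointing it at the catalog. A hedged sketch in StarRocks SQL (catalog name, database, and metastore URI are placeholders) of how a StarRocks cluster can reference Iceberg tables through a Hive-compatible metastore:

-- Register the Iceberg catalog with StarRocks (placeholder metastore URI)
CREATE EXTERNAL CATALOG iceberg_lake
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "hive",
    "hive.metastore.uris" = "thrift://metastore-host:9083"
);

-- Query the Iceberg table in place; no data is copied
SELECT COUNT(*) FROM iceberg_lake.analytics.blockchain_transactions;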

3.3 Experimentation results

Our experimentation focused on two main workloads [6.1.2]: point-lookup queries with filtering and complex aggregation queries with filtering. We used JMeter to conduct load testing and verify that the query engines could maintain performance under high concurrency.

3.3.1 Experiment with Lookup/Filter

Figure 2 shows our results for this workload, where we tested point-lookup and range-lookup queries that return small subsets from a 2.57 TB dataset. We observed:

  • StarRocks: Consistently delivered the best performance across all configurations, achieving query response times as low as 470 ms with data caching.
  • Trino: Delivered response times between 1,030 ms and 1,410 ms, varying with cluster size.
  • DuckDB: Achieved reasonable 2-3 second performance on a powerful single node. We halted DuckDB testing after this benchmark due to limitations in its Iceberg table support, and will revisit it once the DuckDB Iceberg Extension adds support for predicate pushdown.

3.3.2 Experiment with Complex Aggregation

[Figure 3] Our benchmark results comparing complex aggregation query performance between Trino and StarRocks across different cluster configurations.

In our next experiment, we tested queries that performed SUM, COUNT, and GROUP BY operations with array and date-range filters on a 2.85 TB dataset. Our findings showed:

  • StarRocks: StarRocks handled complex aggregate workloads exceptionally well, achieving latencies of ~2 seconds without caching and as low as 500 ms with caching on our largest test cluster.
  • Trino: Although Trino's performance improved significantly with larger clusters, it hit a performance ceiling at approximately 2.5 seconds.

3.3.3 Stress testing

We used JMeter to stress test Trino's and StarRocks' performance under high concurrency.

  • StarRocks: StarRocks consistently outperformed Trino during high-concurrency testing for both lookup and aggregation workloads. When data caching was enabled, performance improved even further.
  • Trino: Trino's performance degraded as concurrent user load increased. When we conducted these tests in early 2024, Trino did not have data caching capabilities for data lake tables. While this feature was later added in Trino 439, we have not yet evaluated it.
[Figure 4] Results of a multi-threaded test using the Complex Aggregations workload (left) and Lookup/Filter workload (right).

4. Our path forward

[Figure 5] Illustrates our Next Generation Data Platform for user-facing analytics.

Based on our evaluation of three open table formats and experimentation with multiple query engines, we decided to build a data lakehouse with StarRocks and Apache Iceberg as core components to address key requirements for building TRM's data platform across multiple sites and to improve performance for our customers.

  • Data lakehouse delivers dual benefits: Our data lakehouse approach combines a data lake's flexibility with a data warehouse's performance, enabling fast and reliable customer-facing analytics.
  • Apache Iceberg: With its open standards, robust schema evolution, and efficient metadata handling, Iceberg delivers the cross-engine interoperability we need.
  • StarRocks: By tuning Iceberg table partitioning and clustering, StarRocks cluster sizing, and caching strategies, we achieved exceptional performance with low latency at high concurrency. These improvements delivered a 50% improvement in P95 response times and a 54% reduction in query timeout errors, ensuring we could meet our target query performance goals.
  • Testing is key: Real-world workloads revealed usage patterns and optimization opportunities that benchmarks alone couldn't identify, underscoring the critical importance of thorough testing at scale.

In part two of this series, we will explore how we brought this architecture to life, from deploying Apache Iceberg on object storage to optimizing StarRocks for multi-environment deployments, including on-premise environments.

5. Data Engineering at TRM

At TRM Labs, we are driven by a bold mission: to protect civilization from AI crime and build a safer world for billions. By advancing blockchain intelligence and creating the blockchain data platform of the future, we tackle the toughest challenges in financial crime and blockchain analytics.

Our mission is powered by experts like:

  • Vijay Shekhawat (co-author), a key member of the TRM Labs Data Platform team, brings deep expertise in real-time streaming, Data Lakehouse architectures, and building secure, high-throughput pipelines for petabyte-scale analytics—driving TRM's mission forward.
  • Andrew Fisher (co-author), a Staff Software Engineer on the TRM Labs Data Platform team, specializes in large-scale batch data loads and Data Lakehouse solutions that power petabyte-scale analytics in the fight against cryptocurrency fraud.
  • Elena Tomlinson, Moamen Ali, Brice Kamgne, Steven Hope, and Sharad Bhadouria, key contributors from the TRM Labs Data Product Team, have been instrumental in validating the data lakehouse architecture by migrating key workloads. Their work has been critical in ensuring the platform’s scalability, efficiency, and high performance under real-world use cases.

Special thanks to Michael Andrews and Amit Plaha for their insightful reviews and ongoing dedication to technical excellence throughout this project.

Join our team

Our engineers are building a petabyte-scale data lakehouse with ultra-low latency response times, tackling some of the toughest challenges in blockchain analytics to fight crime and build a safer world for billions of people. Excited and ready to make an impact? Or think you can build it better? Explore opportunities and apply today.


6. Appendix / detailed experiments

6.1 — Experimentation journey

To evaluate Apache Iceberg's query performance, we benchmarked typical read workloads using different query engines to find the one that best met our performance and scale requirements. We focused on two categories of workloads: lookup/filter and complex aggregation.

6.1.1 — Data preparation

For each workload, we created Iceberg tables from our existing BigQuery data. Our queries typically reference specific blockchain addresses, off-chain entities, or time periods, so we strategically partitioned our datasets to ensure queries would access only relevant data subsets. Additionally:

  • We exported multiple tables (2-3 TB) from BigQuery to Parquet format on Google Cloud Storage.
  • Using PySpark, we transformed these Parquet files into Iceberg tables, optimizing them with bucketing and sorting configurations (an equivalent Spark SQL sketch follows this list).
  • We used Dataproc Metastore to maintain schema and table metadata.
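For reference, a hedged Spark SQL equivalent of that PySpark transformation (table name, bucket count, sort key, and paths are placeholders):

-- Create an Iceberg table from the exported Parquet files,
-- bucketing on the lookup key as in section 6.1.2.1
CREATE TABLE lake.analytics.blockchain_transactions
USING iceberg
PARTITIONED BY (bucket(3000, transaction_id))
TBLPROPERTIES ('write.parquet.compression-codec' = 'zstd')
AS SELECT * FROM parquet.`gs://placeholder-bucket/exports/transactions/*.parquet`;

-- Apply a sorted write order for clustering
-- (requires Iceberg's Spark SQL extensions)
ALTER TABLE lake.analytics.blockchain_transactions WRITE ORDERED BY transaction_id;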

6.1.2 — Workload description

6.1.2.1 Experiment with Lookup/Filter

-- Sample Iceberg table for blockchain transactions
CREATE TABLE blockchain_transactions (
    transaction_id STRING,
    block_number BIGINT,
    from_address STRING,
    to_address STRING,
    amount DECIMAL(38,18),
    timestamp TIMESTAMP,
    chain STRING
) USING iceberg
PARTITIONED BY (bucket(3000, transaction_id));

-- Sample Query to filter transactions by ID
SELECT 
    transaction_id,
    from_address,
    to_address,
    amount,
    timestamp
FROM blockchain_transactions 
WHERE transaction_id = '0x1234abcd...';
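The range-lookup variant of this workload (section 3.3.1) followed the same shape; a sketch with placeholder values:

-- Sample Query to filter transactions by address and time range
SELECT
    transaction_id,
    from_address,
    to_address,
    amount,
    timestamp
FROM blockchain_transactions
WHERE from_address = '0xabcd1234...'
  AND timestamp BETWEEN TIMESTAMP '2024-01-01 00:00:00'
                    AND TIMESTAMP '2024-01-31 23:59:59';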

6.1.2.2 Experiment with Complex Aggregation

-- Sample Query for Complex Aggregation
-- (identifiers prefixed with __ are parameters substituted at runtime)
SELECT
    e.id::TEXT AS entity_id,
    SUM(CASE WHEN t.transaction_type = 'type_a' THEN t.weight_normalized END) AS type_a_volume,
    SUM(CASE WHEN t.transaction_type = 'type_b' THEN t.weight_normalized END) AS type_b_volume,
    SUM(t.weight_normalized) AS total_volume,
    SUM(CASE WHEN t.transaction_type = 'type_a' THEN 1 ELSE 0 END) AS type_a_count,
    SUM(CASE WHEN t.transaction_type = 'type_b' THEN 1 ELSE 0 END) AS type_b_count,
    COUNT(*) AS total_count
FROM transactions t
JOIN entities e ON t.entity_id = e.id
JOIN categories c ON e.category_id = c.id
WHERE t.chain = __chain::TEXT
  AND t.chain_id = __chain_id::TEXT
  AND t.timestamp BETWEEN __start_date::TIMESTAMP AND __end_date::TIMESTAMP
  AND t.path_length BETWEEN __min_hop AND __max_hop
  AND (
        __entity_ids IS NULL 
        OR t.entity_id = ANY(__entity_ids)
      )
  AND (
        __category_ids IS NULL 
        OR EXISTS (
            SELECT 1 
            FROM unnest(t.entity_category_ids) AS cat_id
            WHERE cat_id = ANY(__category_ids)
        )
      )
  AND e.id IS NOT NULL
GROUP BY e.id
ORDER BY __sort_column __sort_order NULLS LAST
LIMIT __limit
OFFSET __offset;

6.1.3 — Infrastructure

To evaluate performance scaling, we deployed each query engine with varying configurations of compute resources. Detailed configuration specifications are provided in the next section.

  • StarRocks: Deployed on GKE using the StarRocks Kubernetes operator.
  • Trino: Deployed on Google Dataproc (managed Hadoop/Spark platform).
  • DuckDB: Deployed on GCP VMs.

6.2 — Infrastructure

6.2.1 Trino

  • How: Deployed on Google Dataproc (managed Hadoop/Spark platform) for its simplicity and built-in autoscaling capabilities.
  • Version: Trino 433
  • Cluster sizes for experimentation:

6.2.2 StarRocks

  • How: Deployed on GKE using the StarRocks Kubernetes operator.
  • Version: 3.1
  • Cluster sizes for experimentation:

6.2.3 DuckDB

  • How: Deployed on Google Cloud Platform (GCP) virtual machines.
  • Version: 0.9.2
  • Cluster sizes for experimentation:

6.3 — Stress testing methodology

Tools and setup:

  • JMeter: Utilized for simulating user requests to the query server.
    • JMeter generates realistic server loads by simulating concurrent user requests. It supports multiple protocols including JDBC, HTTP, and FTP.
    • Key components:
      • Thread group: A set of simulated users performing specified actions, with adjustable user count, ramp-up period, and execution frequency.
  • Parameterized queries & CSV data: Queries utilize dynamic parameters from CSV files to create diverse, realistic workloads.
  • JMeter concurrent threads: 1, 3, 5, 8, 13, 21, 34, 55 (simulates concurrent users)
  • Execution:
    • Duration: Each test runs for 1 minute continuously.
    • Query limit: Tests operate without query limits to measure maximum system capacity.