Journey to KEDA: Optimizing Airflow Worker Scaling in Kubernetes
The TRM Labs engineering team recently migrated our Apache Airflow infrastructure from Google Cloud Composer to a self-managed solution on Google Kubernetes Engine (GKE). While this migration gave us greater control and flexibility, we needed to optimize our worker scaling strategy to achieve better cost efficiency — without compromising performance.
What is the role of TRM's infrastructure team?
TRM Labs provides blockchain analytics solutions to help law enforcement and national security agencies, financial institutions, and cryptocurrency businesses detect, investigate, and disrupt crypto-related fraud and financial crime. Our platform combines advanced analytics with blockchain data to provide comprehensive solutions for transaction monitoring, wallet screening, and entity risk scoring.
Our infrastructure team plays a critical role in maintaining TRM's technical operations by:
- Managing critical infrastructure components — including Kubernetes clusters, observability systems, and CI/CD
- Deploying and maintaining tools for automated infrastructure deployment using Infrastructure-as-Code (IaC)
- Leading disaster recovery efforts in collaboration with the data engineering team to maintain system uptime
- Working closely with security teams on infrastructure hardening and patch management
This case study describes one of our key infrastructure initiatives to optimize our Airflow worker deployment using KEDA.
The challenge: Dynamically adjusting worker count based on real-time demand
At TRM, we execute nearly two million tasks per day across all Airflow environments — which means our workers are almost always executing tasks. Because we were initially focused on getting our GKE deployment up and running as quickly as possible, we configured a static number of workers to run 24/7. This meant deploying enough workers to process our periods of highest task throughput, even though these peak periods only occurred during specific times of the day (such as during end-of-day batch runs).
During off-peak hours, these workers would sit largely idle, consuming expensive compute resources while utilizing only a fraction of their capacity. We were effectively paying for peak capacity 24/7, despite only needing it for a few hours each day.
This static worker count approach created a clear mismatch between resource allocation and actual usage patterns. We needed a way to dynamically adjust our worker count based on real-time demand, rather than continuously provisioning for worst-case scenarios.
The solution: Scaling workers with KEDA
KEDA (Kubernetes Event-driven Autoscaling) presented an attractive solution for specific workloads. It offered the ability to scale our Airflow workers with a horizontal pod autoscaler based on the actual workload — specifically, the number of running and queued tasks. This promised three key benefits:
- Cost optimization: Automatically scaling down workers during low-activity periods
- Performance adaptability: Scaling up workers during peak throughput to handle increasing task volumes
- Future-proof infrastructure: As more DAGs and tasks were thrown at our Airflow environments over time, the number of workers deployed would naturally scale with the workload volume
Our implementation approach
KEDA is driven by the concept that event throughput should define the number of replicas requested by the horizontal pod autoscaler. In Airflow, event throughput has a very clean mapping to task volume, which means we wanted our worker count to scale with the number of tasks being executed at any given time. Our implementation focused on simplicity and reliability:
- Used the out-of-box KEDA scaling query without custom modifications
- At TRM, we use the Celery Executor for our worker configuration, which means we can directly reference our `worker_concurrency` parameter to determine the ideal worker replica count; a trimmed `ScaledObject` sketch using this query follows this list:

  ```sql
  SELECT
      ceil(COUNT(*)::decimal / {{ .Values.config.celery.worker_concurrency }})
  FROM task_instance
  WHERE state='running' OR state='queued'
  ```
- Set a minimum replica floor of one pod to ensure basic task-processing availability for DAGs that generate observability or operational tasks
  - Without a floor of one worker, miscellaneous tasks that execute at intervals of one hour or less would force us to spin up and tear down a worker in a loop
- Configured the maximum worker count at 1.5x our previous static count to handle burst periods
  - Keeping the max replica count at the same value we had pre-KEDA would not have unlocked any burstable task throughput
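To make the shape of this configuration concrete, here is a trimmed sketch of what the resulting KEDA `ScaledObject` can look like when using KEDA's PostgreSQL scaler against the Airflow metadata database. The names, connection environment variable, replica counts, and hard-coded divisor are illustrative placeholders rather than our exact production values; in our chart the divisor comes from the templated `worker_concurrency` value shown in the query above.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: airflow-worker               # placeholder name
spec:
  scaleTargetRef:
    name: airflow-worker             # the worker Deployment (not a StatefulSet)
  minReplicaCount: 1                 # floor of one pod for periodic housekeeping tasks
  maxReplicaCount: 30                # ~1.5x the old static count (illustrative number)
  pollingInterval: 30                # seconds between queries of the metadata database
  triggers:
    - type: postgresql
      metadata:
        # Scale so that replicas ~= (running + queued tasks) / worker_concurrency.
        # The divisor below stands in for the templated worker_concurrency value.
        query: >-
          SELECT ceil(COUNT(*)::decimal / 16)
          FROM task_instance
          WHERE state='running' OR state='queued'
        targetQueryValue: "1"
        connectionFromEnv: AIRFLOW_METADATA_CONN   # assumed secret-backed env var
```

KEDA then creates and manages the underlying horizontal pod autoscaler for the worker Deployment, so there is no separate HPA object to maintain.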
Understanding workload patterns
Before implementing KEDA, we needed to carefully analyze our workload patterns. We identified two distinct types:
1. Batch workloads
These workloads have clear peaks and valleys in resource usage, making them ideal candidates for KEDA:
- Predictable daily batch processes
- Periodic data transformations
- End-of-day reporting tasks
2. Continuous workloads
Some workloads aren't suitable for KEDA, particularly ones which experience consistent traffic. Here's why:
- Overhead of pod creation/deletion outweighs benefits: When we want to scale up the number of replicas due to a large number of queued tasks, those few minutes between Kubernetes requesting the worker pod and the pod accepting new tasks can be detrimental to environments with tight SLOs. This issue compounds when new pods require new nodes.
- Risk of processing delays during scale-up events
Initial implementation challenges
Our first implementation revealed some interesting challenges:
1. Rethinking StatefulSets
Our investigation into KEDA led us to question our use of StatefulSets for Airflow workers. We discovered that:
- StatefulSets were originally chosen for log persistence, but we had since moved to shipping logs remotely to GCS
- Workers didn't require stable network identifiers or ordered deployment
- The sequential nature of StatefulSet updates was hindering our scaling capabilities
This realization was a pivotal moment; we had been using StatefulSets out of convention rather than necessity. Moving to Deployments would not only solve our KEDA implementation challenges but also simplify our overall architecture.
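To illustrate the shape of that change (the names, image, and command below are placeholders, not our actual manifests), the worker becomes an ordinary `Deployment` with no per-pod volume claims, stable network identity, or ordered rollout:

```yaml
apiVersion: apps/v1
kind: Deployment                     # previously a StatefulSet
metadata:
  name: airflow-worker               # placeholder name
spec:
  replicas: 1                        # effectively owned by KEDA's HPA once enabled
  selector:
    matchLabels:
      component: airflow-worker
  template:
    metadata:
      labels:
        component: airflow-worker
    spec:
      containers:
        - name: worker
          image: our-airflow-image:tag            # placeholder image
          args: ["airflow", "celery", "worker"]   # start a Celery worker (simplified)
      # No volumeClaimTemplates: task logs ship remotely to GCS,
      # so pods no longer need per-pod persistent storage.
```

Because pods in a Deployment are interchangeable and are not created or removed in a fixed order, scaling operations are no longer serialized behind individual pods.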
2. The timing problem
Understanding this scenario requires familiarity with Celery's warm (graceful) shutdown mechanism.
When workers receive a termination signal from the Kubernetes control plane, they enter a graceful shutdown state. In this state, workers:
- Mark themselves as unschedulable to the Airflow scheduler
- Continue processing existing tasks for `GracePeriodSeconds` seconds
- Terminate all processes if tasks haven't completed within the grace period
This mechanism ensures that existing tasks have an opportunity to complete before worker pods are terminated during scaling events. Below is an example of a poor interaction between grace periods and StatefulSet workers.
- At 23:30 UTC: Two replicas of the worker pod are executing tasks. One of the workers picks up a task that takes `GracePeriodSeconds` to execute.
- At 23:50 UTC: KEDA scales to one replica, marking one pod for termination. Pods are marked at random, so the worker executing the task from 23:30 is selected. That worker goes into a warm shutdown period where it won't accept any more tasks, but it will take `GracePeriodSeconds` to release resources back to the node because of the task it picked up at 23:30.
- At 00:00 UTC: A new batch run starts.
- At 00:10 UTC: Thousands of tasks are queued, triggering a scale event on the HPA, but it cannot create more replicas because one of the workers is still terminating.
- At 23:30 + `GracePeriodSeconds` UTC: The worker with the long-running task finally releases its resources back to the node pool, unblocking the creation of the desired replica count.
In the scenario above, at worst, we're held hostage by our own grace period.
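For reference, the grace period at play here maps to the standard pod-level `terminationGracePeriodSeconds` setting on the worker's template; the value and names below are purely illustrative, not our production configuration:

```yaml
# Worker pod template (excerpt)
spec:
  terminationGracePeriodSeconds: 600   # illustrative; the "GracePeriodSeconds" in the timeline above
  containers:
    - name: worker
      image: our-airflow-image:tag     # placeholder image
      args: ["airflow", "celery", "worker"]
```

A long grace period protects long-running tasks from being killed mid-flight, but as the timeline shows, it also bounds how long a terminating StatefulSet worker can block the next scale-up.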
Testing and rollout
We adopted a methodical approach to testing and deployment:
- One week testing period in staging environments to validate performance
- First migrating the workers from `StatefulSet` to `Deployment`, then enabling KEDA with the above configurations
- Careful monitoring of task completion times and SLOs
- After a week of validating the impact on usability and SLOs, promoting the changes to our production environments
Measurable results and lessons learned
The implementation of KEDA yielded significant improvements:
- Approximately 50% reduction in worker node pool costs
- Achieved via worker count reduction of roughly 50% over a 24-hour period
- Consistent SLO performance in batch environments where KEDA was enabled
- Successfully managed peak loads of 150,000+ daily tasks per environment
We also learned several valuable lessons throughout this journey:
- The importance of understanding workload patterns before implementing autoscaling
- Not all workloads benefit from event-driven scaling
- The value of questioning conventional infrastructure choices
- Simple, out-of-box configurations can often deliver significant benefits
- Migrating from stateful to stateless worker objects yielded positive results even in environments that didn't have KEDA enabled, an unrelated win outside the scope of our initial goals
Future considerations
While our implementation has been successful, we continue to explore opportunities for further optimization:
- Fine-tuning scaling parameters based on continued observation
- Monitoring for potential adjustments to minimum and maximum replica counts
- Evaluating additional KEDA features for specific use cases
- Evaluating node management tools to execute more efficient bin packing and workload rightsizing
- Splitting worker queues so that different types of tasks route to different types of worker machines (a rough sketch of this idea follows this list)
  - General-purpose workers will never allow us to achieve perfect node utilization because memory-intensive and CPU-intensive tasks are executed by the same workers
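As a rough sketch of the queue-splitting idea, and nothing more than that (the queue name, node pool label, and second Deployment below are all hypothetical), a dedicated worker pool could subscribe only to a memory-heavy Celery queue and be pinned to a high-memory node pool, while tasks opt in by setting `queue="highmem"` on their Airflow operators:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-worker-highmem        # hypothetical second worker pool
spec:
  selector:
    matchLabels:
      component: airflow-worker-highmem
  template:
    metadata:
      labels:
        component: airflow-worker-highmem
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: highmem-pool    # hypothetical GKE node pool
      containers:
        - name: worker
          image: our-airflow-image:tag                 # placeholder image
          # Subscribe only to the hypothetical "highmem" Celery queue so that
          # memory-hungry tasks land on appropriately sized machines.
          args: ["airflow", "celery", "worker", "--queues", "highmem"]
```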
By implementing KEDA with a focus on simplicity and careful testing, we achieved substantial cost savings while maintaining performance. The success of this project demonstrates the value of applying the right tool to the right problem, even when using basic configurations.