Journey to KEDA: Optimizing Airflow Worker Scaling in Kubernetes
The TRM Labs engineering team recently migrated our Apache Airflow infrastructure from Google Cloud Composer to a self-managed solution on Google Kubernetes Engine (GKE). While this migration gave us greater control and flexibility, we needed to optimize our worker scaling strategy to achieve better cost efficiency — without compromising performance.
What is the role of TRM's infrastructure team?
TRM Labs provides blockchain analytics solutions to help law enforcement and national security agencies, financial institutions, and cryptocurrency businesses detect, investigate, and disrupt crypto-related fraud and financial crime. Our platform combines advanced analytics with blockchain data to provide comprehensive solutions for transaction monitoring, wallet screening, and entity risk scoring.
Our infrastructure team plays a critical role in maintaining TRM's technical operations by:
- Managing critical infrastructure components — including Kubernetes clusters, observability systems, and CI/CD
- Deploying and maintaining tools for automated infrastructure deployment using Infrastructure-as-Code (IaC)
- Leading disaster recovery efforts in collaboration with the data engineering team to maintain system uptime
- Working closely with security teams on infrastructure hardening and patch management
This case study describes one of our key infrastructure initiatives to optimize our Airflow worker deployment using KEDA.
The challenge: Dynamically adjusting worker count based on real-time demand
At TRM, we execute nearly two million tasks per day across all Airflow environments — which means our workers are almost always executing tasks. Because we were initially focused on getting our GKE deployment up and running as quickly as possible, we configured a static number of workers to run 24/7. This meant deploying enough workers to process our periods of highest task throughput, even though these peak periods only occurred during specific times of the day (such as during end-of-day batch runs).
During off-peak hours, these workers would sit largely idle, consuming expensive compute resources while utilizing only a fraction of their capacity. We were effectively paying for peak capacity 24/7, despite only needing it for a few hours each day.
This static worker count approach created a clear mismatch between resource allocation and actual usage patterns. We needed a way to dynamically adjust our worker count based on real-time demand, rather than continuously provisioning for worst-case scenarios.
The solution: Scaling workers with KEDA
KEDA (Kubernetes Event-driven Autoscaling) presented an attractive solution for specific workloads. It offered the ability to scale our Airflow workers with a horizontal pod autoscaler based on the actual workload — specifically, the number of running and queued tasks. This promised three key benefits:
- Cost optimization: Automatically scaling down workers during low-activity periods
- Performance adaptability: Scaling up workers during peak throughput to handle increasing task volumes
- Future-proof infrastructure: As more DAGs and tasks were thrown at our Airflow environments over time, the number of workers deployed would naturally scale with the workload volume
Our implementation approach
KEDA is driven by the concept that event throughput should define the number of replicas requested by the horizontal pod autoscaler. In Airflow, event throughput has a very clean mapping to task volume, which means we wanted our worker count to scale with the number of tasks being executed at any given time. Our implementation focused on simplicity and reliability:
- Used the out-of-box KEDA scaling query without custom modifications
- At TRM, we use the Celery Executor for our worker configuration, which means we can directly reference our `worker_concurrency` parameter to determine the ideal worker replica count; a trimmed `ScaledObject` sketch using this query follows this list:

  ```sql
  SELECT
      ceil(COUNT(*)::decimal / {{ .Values.config.celery.worker_concurrency }})
  FROM task_instance
  WHERE state='running' OR state='queued'
  ```
- Set a minimum replica floor of one pod to ensure basic task-processing availability for DAGs that generate observability or operational tasks
  - Without a floor of one worker, miscellaneous tasks that execute at intervals of one hour or less would force us to spin up and tear down a worker in a loop
- Configured the maximum worker count at 1.5x our previous static count to handle burst periods
  - Keeping the max replica count at the same value we had pre-KEDA would not have unlocked any burstable task throughput
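To make the shape of this configuration concrete, here is a trimmed sketch of what the resulting KEDA `ScaledObject` can look like when using KEDA's PostgreSQL scaler against the Airflow metadata database. The names, connection environment variable, replica counts, and hard-coded divisor are illustrative placeholders rather than our exact production values; in our chart the divisor comes from the templated `worker_concurrency` value shown in the query above.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: airflow-worker               # placeholder name
spec:
  scaleTargetRef:
    name: airflow-worker             # the worker Deployment (not a StatefulSet)
  minReplicaCount: 1                 # floor of one pod for periodic housekeeping tasks
  maxReplicaCount: 30                # ~1.5x the old static count (illustrative number)
  pollingInterval: 30                # seconds between queries of the metadata database
  triggers:
    - type: postgresql
      metadata:
        # Scale so that replicas ~= (running + queued tasks) / worker_concurrency.
        # The divisor below stands in for the templated worker_concurrency value.
        query: >-
          SELECT ceil(COUNT(*)::decimal / 16)
          FROM task_instance
          WHERE state='running' OR state='queued'
        targetQueryValue: "1"
        connectionFromEnv: AIRFLOW_METADATA_CONN   # assumed secret-backed env var
```

KEDA then creates and manages the underlying horizontal pod autoscaler for the worker Deployment, so there is no separate HPA object to maintain.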
Understanding workload patterns
Before implementing KEDA, we needed to carefully analyze our workload patterns. We identified two distinct types:
1. Batch workloads
These workloads have clear peaks and valleys in resource usage, making them ideal candidates for KEDA:
- Predictable daily batch processes
- Periodic data transformations
- End-of-day reporting tasks
2. Continuous workloads
Some workloads aren't suitable for KEDA, particularly ones which experience consistent traffic. Here's why:
- Overhead of pod creation/deletion outweighs benefits: When we want to scale up the number of replicas due to a large number of queued tasks, those few minutes between Kubernetes requesting the worker pod and the pod accepting new tasks can be detrimental to environments with tight SLOs. This issue compounds when new pods require new nodes.
- Risk of processing delays during scale-up events
Initial implementation challenges
Our first implementation revealed some interesting challenges:
1. Rethinking StatefulSets
Our investigation into KEDA led us to question our use of StatefulSets for Airflow workers. We discovered that:
- StatefulSets were originally chosen for log persistence, but we had since moved to shipping logs remotely to GCS
- Workers didn't require stable network identifiers or ordered deployment
- The sequential nature of StatefulSet updates was hindering our scaling capabilities
This realization was a pivotal moment; we had been using StatefulSets out of convention rather than necessity. Moving to Deployments would not only solve our KEDA implementation challenges but also simplify our overall architecture.
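To illustrate the shape of that change (the names, image, and command below are placeholders, not our actual manifests), the worker becomes an ordinary `Deployment` with no per-pod volume claims, stable network identity, or ordered rollout:

```yaml
apiVersion: apps/v1
kind: Deployment                     # previously a StatefulSet
metadata:
  name: airflow-worker               # placeholder name
spec:
  replicas: 1                        # effectively owned by KEDA's HPA once enabled
  selector:
    matchLabels:
      component: airflow-worker
  template:
    metadata:
      labels:
        component: airflow-worker
    spec:
      containers:
        - name: worker
          image: our-airflow-image:tag            # placeholder image
          args: ["airflow", "celery", "worker"]   # start a Celery worker (simplified)
      # No volumeClaimTemplates: task logs ship remotely to GCS,
      # so pods no longer need per-pod persistent storage.
```

Because pods in a Deployment are interchangeable and are not created or removed in a fixed order, scaling operations are no longer serialized behind individual pods.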
2. The timing problem
Understanding this scenario requires familiarity with Celery's warm (graceful) shutdown mechanism.
When workers receive a termination signal from the Kubernetes control plane, they enter a graceful shutdown state. In this state, workers:
- Mark themselves as unschedulable to the Airflow scheduler
- Continue processing existing tasks for `GracePeriodSeconds` seconds
- Terminate all processes if tasks haven't completed within the grace period
This mechanism ensures that existing tasks have an opportunity to complete before worker pods are terminated during scaling events. Below is an example of a poor interaction between grace periods and StatefulSet workers.
- At 23:30 UTC: Two replicas of the worker pod are executing tasks. One of the workers picks up a task that takes `GracePeriodSeconds` to execute.
- At 23:50 UTC: KEDA scales to one replica, marking one pod for termination. Pods are marked at random, so the worker executing the task from 23:30 is selected. That worker goes into a warm shutdown period where it won't accept any more tasks, but it will take `GracePeriodSeconds` to release resources back to the node because of the task it picked up at 23:30.
- At 00:00 UTC: A new batch run starts.
- At 00:10 UTC: Thousands of tasks are queued, triggering a scale event on the HPA, but it cannot create more replicas because one of the workers is still terminating.
- At 23:30 + `GracePeriodSeconds` UTC: The worker with the long-running task finally releases its resources back to the node pool, unblocking the creation of the desired replica count.
In the scenario above, at worst, we're held hostage by our own grace period.
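For reference, the grace period at play here maps to the standard pod-level `terminationGracePeriodSeconds` setting on the worker's template; the value and names below are purely illustrative, not our production configuration:

```yaml
# Worker pod template (excerpt)
spec:
  terminationGracePeriodSeconds: 600   # illustrative; the "GracePeriodSeconds" in the timeline above
  containers:
    - name: worker
      image: our-airflow-image:tag     # placeholder image
      args: ["airflow", "celery", "worker"]
```

A long grace period protects long-running tasks from being killed mid-flight, but as the timeline shows, it also bounds how long a terminating StatefulSet worker can block the next scale-up.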
Testing and rollout
We adopted a methodical approach to testing and deployment:
- One week testing period in staging environments to validate performance
- First migrating the workers from `StatefulSet` to `Deployment`, then enabling KEDA with the above configurations
- Careful monitoring of task completion times and SLOs
- After a week of validating the impact on usability and SLOs, promoting the changes to our production environments
Measurable results and lessons learned
The implementation of KEDA yielded significant improvements:
- Approximately 50% reduction in worker node pool costs
- Achieved via worker count reduction of roughly 50% over a 24-hour period
- Consistent SLO performance in batch environments where KEDA was enabled
- Successfully managed peak loads of 150,000+ daily tasks per environment
We also learned several valuable lessons throughout this journey:
- The importance of understanding workload patterns before implementing autoscaling
- Not all workloads benefit from event-driven scaling
- The value of questioning conventional infrastructure choices
- Simple, out-of-box configurations can often deliver significant benefits
- Migrating from stateful to stateless worker objects yielded positive results even in environments that didn't have KEDA enabled, an unrelated win outside the scope of our initial goals
Future considerations
While our implementation has been successful, we continue to explore opportunities for further optimization:
- Fine-tuning scaling parameters based on continued observation
- Monitoring for potential adjustments to minimum and maximum replica counts
- Evaluating additional KEDA features for specific use cases
- Evaluating node management tools to execute more efficient bin packing and workload rightsizing
- Splitting worker queues so that different types of tasks route to different types of worker machines (a rough sketch of this idea follows this list)
  - General-purpose workers will never allow us to achieve perfect node utilization because memory-intensive and CPU-intensive tasks are executed by the same workers
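As a rough sketch of the queue-splitting idea, and nothing more than that (the queue name, node pool label, and second Deployment below are all hypothetical), a dedicated worker pool could subscribe only to a memory-heavy Celery queue and be pinned to a high-memory node pool, while tasks opt in by setting `queue="highmem"` on their Airflow operators:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-worker-highmem        # hypothetical second worker pool
spec:
  selector:
    matchLabels:
      component: airflow-worker-highmem
  template:
    metadata:
      labels:
        component: airflow-worker-highmem
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: highmem-pool    # hypothetical GKE node pool
      containers:
        - name: worker
          image: our-airflow-image:tag                 # placeholder image
          # Subscribe only to the hypothetical "highmem" Celery queue so that
          # memory-hungry tasks land on appropriately sized machines.
          args: ["airflow", "celery", "worker", "--queues", "highmem"]
```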
By implementing KEDA with a focus on simplicity and careful testing, we achieved substantial cost savings while maintaining performance. The success of this project demonstrates the value of applying the right tool to the right problem, even when using basic configurations.