How we used machine learning to classify one million Ethereum addresses


January 28, 2019

We used machine learning, specifically active learning, to automatically identify and label Ethereum addresses that are highly likely to belong to exchanges.

This data powers the TRM platform, which helps digital asset issuers and exchanges stay compliant and grow faster.

This effort builds on the work done by Sid Shekhar, Matthias De Aliaga, Will Price, and others by showing how active learning can be used to cluster and identify Ethereum addresses.

Can we use machine learning to identify exchange-owned addresses on Ethereum?

We attempted to answer this question with both unsupervised and supervised learning. We started with unsupervised learning to see what unexpected patterns might be in the data. Then we used supervised learning to get more definitive results.

First, we collected the data.

We used Google BigQuery’s Ethereum dataset to pull the top 1,000,000 addresses ranked by ETH volume traded.

In order to extract patterns from the addresses (e.g., which addresses belong to an exchange), we first defined the traits that we would draw comparisons on.

For each address, we calculated 40+ traits that help us categorize it. These traits (or features, in machine-learning speak) included statistics on which assets the address held, how often it transacted, and whom it transacted with.
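To make the idea of per-address features concrete, here is a minimal sketch that derives three illustrative traits from a list of transfers. The toy transfers, the address strings, and the feature names (tx_count, eth_volume, unique_counterparties) are hypothetical; they are not the 40+ features used in the project.

```python
from collections import defaultdict

# Hypothetical toy transfers: (sender, receiver, amount in ETH).
transfers = [
    ("0xaaa", "0xbbb", 5.0),
    ("0xccc", "0xbbb", 2.5),
    ("0xbbb", "0xddd", 7.0),
    ("0xaaa", "0xccc", 1.0),
]


def address_features(transfers):
    """Return {address: feature dict} with a few illustrative traits."""
    tx_count = defaultdict(int)
    volume = defaultdict(float)
    counterparties = defaultdict(set)
    for sender, receiver, amount in transfers:
        # Count each transfer once for the sender and once for the receiver.
        for addr, other in ((sender, receiver), (receiver, sender)):
            tx_count[addr] += 1
            volume[addr] += amount
            counterparties[addr].add(other)
    return {
        addr: {
            "tx_count": tx_count[addr],
            "eth_volume": volume[addr],
            "unique_counterparties": len(counterparties[addr]),
        }
        for addr in tx_count
    }


features = address_features(transfers)
print(features["0xbbb"])
```

Each address ends up as one row of numbers, which is the shape the clustering and classification steps expect.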

Now that we had collected our data, it was time to run the numbers.

Approach 1: Unsupervised learning

Before kicking off, we prepared the data a little further: we applied scaling and dimensionality reduction (Principal Component Analysis and t-SNE).
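As a rough illustration of this preprocessing, the sketch below standardizes a feature matrix and projects it onto its first two principal components using NumPy. The synthetic matrix, the helper names, and the choice to standardize before PCA are assumptions for illustration; the post does not specify the exact pipeline, and t-SNE is not shown here.

```python
import numpy as np


def standardize(X):
    """Scale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero on constant columns
    return (X - mean) / std


def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(Xc, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return Xc @ vt[:n_components].T, explained[:n_components]


rng = np.random.default_rng(0)
# Synthetic stand-in for an (addresses x features) matrix with mixed scales.
X = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])
X_scaled = standardize(X)
coords, explained = pca_project(X_scaled)
print(coords.shape, explained)
```

Without the scaling step, the column with the largest raw magnitude would dominate the principal components, which is why features such as balances and transaction counts are usually put on a common scale first.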


Visualization of addresses on the first two principal components

We trained a K-means algorithm to see if there are natural ‘clusters’ within Ethereum addresses. Our hope was to see multiple well-differentiated clusters.

We used a small set of labeled addresses from the TRM platform to test the accuracy of the model, and found the addresses to be well-differentiated.
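The clustering-then-checking step can be sketched as follows: a plain K-means (Lloyd's algorithm) in NumPy, followed by a look at how a small labeled subset falls across the clusters. The two synthetic blobs, the choice of k=2, and the subset size of 20 are assumptions made for the example, not values from the project.

```python
import numpy as np


def nearest_centroid(X, centroids):
    """Index of the closest centroid for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (cluster index per row, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        assignment = nearest_centroid(X, centroids)
        # Move each centroid to the mean of its points (keep it if empty).
        new_centroids = np.array([
            X[assignment == j].mean(axis=0) if np.any(assignment == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return nearest_centroid(X, centroids), centroids


rng = np.random.default_rng(1)
# Synthetic 2-D points standing in for reduced address features.
exchange_like = rng.normal(loc=6.0, scale=0.5, size=(100, 2))
other = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
X = np.vstack([exchange_like, other])
is_exchange = np.array([True] * 100 + [False] * 100)

assignment, centroids = kmeans(X, k=2)

# Pretend only a small subset of addresses carries a label.
labeled = rng.choice(len(X), size=20, replace=False)
for j in range(2):
    in_cluster = labeled[assignment[labeled] == j]
    share = is_exchange[in_cluster].mean() if len(in_cluster) else float("nan")
    print(f"cluster {j}: {len(in_cluster)} labeled, exchange share {share:.2f}")
```

If the labeled exchange addresses concentrate in one cluster, that suggests the features separate exchanges from other address types; a roughly even split would suggest they do not.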

This chart shows the correlation between different features and is used to select features.

One caveat we saw in the clustering is that two exchange-owned addresses can look very different. For instance, one Binance-owned address has a very large ETH balance (1M+) and few transactions (~100), whereas one Bibox-owned address has a small ETH balance (3K) and many transactions (450K+).

The unsupervised learning helped us see that there could be clear differences between exchange-owned addresses and other types of addresses (e.g., market makers, OTC desks, retail investors).

Now, it was time to use supervised learning in order to predict whether a new, specific address is an exchange-owned address or not.

Approach 2: Supervised learning

Our goal: to build a system that can automatically detect and label exchange-owned Ethereum addresses.

We decided to use active learning because the number of unlabeled addresses is high and manual labeling is time-consuming and expensive.

We started by generating over 40 features for each address. As part of preprocessing, we discarded some of the features that had high correlations with other features.
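One common way to discard highly correlated features is a greedy pass over the Pearson correlation matrix. The sketch below assumes a cutoff of 0.9 and uses made-up feature names; neither comes from the original analysis, and the actual selection procedure may have differed.

```python
import numpy as np


def drop_correlated(X, names, threshold=0.9):
    """Greedily keep a feature only if |r| with every kept feature is <= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # Pearson correlation between columns
    kept = []
    for i in range(X.shape[1]):
        if all(abs(corr[i, j]) <= threshold for j in kept):
            kept.append(i)
    return [names[i] for i in kept]


rng = np.random.default_rng(0)
tx_count = rng.normal(size=500)
eth_volume = rng.normal(size=500)
# Nearly a copy of tx_count, so it carries almost no extra information.
tx_count_30d = tx_count + rng.normal(scale=0.01, size=500)

X = np.column_stack([tx_count, eth_volume, tx_count_30d])
selected = drop_correlated(X, ["tx_count", "eth_volume", "tx_count_30d"])
print(selected)
```

The result depends on column order: the first feature of a correlated pair is kept and the later one is dropped.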

Pearson correlation post

We trained a classification model on our initial set of labeled exchange addresses.

This decision tree classifier is used to visualize the most determinant features in our model.

Then we used this model to predict the probability of an unlabeled address being an exchange address.
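The train, predict, and label cycle of active learning can be sketched as below. This is only an illustration under several assumptions: a small logistic regression written in NumPy stands in for the classification model (the post does not name the one used), uncertainty sampling is assumed as the query strategy, the data is synthetic, and the 0.95 cutoff for "high probability" is hypothetical.

```python
import numpy as np


def add_bias(X):
    return np.hstack([X, np.ones((len(X), 1))])


def fit_logistic(X, y, lr=0.1, n_steps=2000):
    """Logistic regression by batch gradient descent; returns weights (bias last)."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def predict_proba(X, w):
    """Predicted probability of the positive ("exchange") class."""
    return 1.0 / (1.0 + np.exp(-add_bias(X) @ w))


def most_uncertain(proba, n):
    """Positions of the n predictions closest to 0.5."""
    return np.argsort(np.abs(proba - 0.5))[:n]


rng = np.random.default_rng(0)
# Synthetic pool: class 1 ("exchange") centred at +1.5, class 0 at -1.5.
X_pool = np.vstack([rng.normal(1.5, 1.0, (300, 2)), rng.normal(-1.5, 1.0, (300, 2))])
y_true = np.array([1] * 300 + [0] * 300)  # stands in for the human annotator

# Seed set: five labeled examples per class.
labeled = list(range(5)) + list(range(300, 305))
for _ in range(5):
    w = fit_logistic(X_pool[labeled], y_true[labeled])
    labeled_set = set(labeled)
    unlabeled = np.array([i for i in range(len(X_pool)) if i not in labeled_set])
    proba = predict_proba(X_pool[unlabeled], w)
    # Ask the "annotator" for labels on the ten least certain addresses.
    labeled.extend(unlabeled[most_uncertain(proba, 10)].tolist())

w = fit_logistic(X_pool[labeled], y_true[labeled])
final_proba = predict_proba(X_pool, w)
accuracy = np.mean((final_proba >= 0.5) == y_true)
high_confidence = np.flatnonzero(final_proba >= 0.95)  # hypothetical cutoff
print(len(labeled), accuracy, len(high_confidence))
```

The point of the loop is that manual labeling effort goes to the addresses the model is least sure about, while addresses scored with high probability can be labeled automatically.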

Out of 100 addresses that our model predicted to be exchanges with a “high probability”, 95 were confirmed to be exchange-owned, a precision of 95% on that sample.
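Since the check covers a sample of 100 addresses, it can help to put an uncertainty range around the observed rate. The sketch below computes the precision and a Wilson score interval for 95 confirmations out of 100; the use of a Wilson interval and the 95% confidence level are illustrative choices, not part of the original analysis.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95% confidence)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


precision = 95 / 100
low, high = wilson_interval(95, 100)
print(f"precision={precision:.2f}, interval=({low:.3f}, {high:.3f})")
```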

After validating the accuracy of our model, we ran it on the entire Ethereum blockchain in order to label many more exchange-owned addresses.

Conclusion

Through this project, we were able to label over 600,000 new Ethereum addresses. Next, we will apply our learnings to expand our labeled addresses across all categories: from market makers to dark net markets.

These newly labeled Ethereum addresses help us advance our mission to make blockchains more trusted and secure. By de-anonymizing blockchain data, we make it easier for financial institutions to comply with regulations like KYC/AML.

About TRM: The TRM platform is the first platform designed specifically to streamline on-chain AML compliance for digital asset issuers, protocols, and exchanges, saving them time and reducing risk. The TRM platform includes solutions for on-chain customer due diligence, transaction monitoring, and relationship management.
