SocialFi Token Recommendation

Seshat helps you build and productionize ML dataflows. The goal of this use case is to develop a token recommendation system using Seshat SDK.

We will follow this road map:

  1. Problem understanding
  2. Data understanding
  3. Preprocess and clean data
  4. Vectorize entities
  5. Perform the recommendation

Problem understanding

The goal of this use case is to receive a public address as input and recommend a set of tokens to that address. We are specifically interested in designing a system for ERC20 tokens on the Ethereum network. Generally speaking, this is similar to any recommendation system, such as the Netflix or YouTube recommenders: the goal is to observe addresses' past behaviour and recommend new items to them. That being said, we focus on blockchain data and build the recommendation for public addresses. Note that "public address" and "address" are used interchangeably in this document, and that in this use case each user is represented by an address.

Data understanding

After understanding the problem, the next step is to understand what kind of data can be used for token recommendation. There are numerous ways a recommendation engine can be built. The easiest is to simply recommend the most popular tokens to everyone; however, this approach treats all addresses (i.e., users) the same. In this tutorial, we want to use each address's past behaviour to recommend a customized set of tokens to it. We therefore use past transaction data: data that shows which token was sent from which address to which address, and in what amount. The following table shows this data in tabular format.

Here is the description of each of the attributes:

  1. ID: A unique identifier for each row. It can be randomly generated, come from the data provider, or even be the hash of the transaction.
  2. block_number: The number of the block in which this transaction is recorded.
  3. contract_address: On the blockchain, each token is represented by a contract address. Hence, contract_address is the unique identifier of a token.
  4. from_address: The public address of the sender of this transaction.
  5. to_address: The public address of the receiver of this transaction.
  6. amount_precise: The amount of tokens that the sender sends to the receiver in this transaction.
  7. symbol: The symbol of the token (i.e., contract). Note that unlike contract_address, symbol is not necessarily unique; that is why symbol is only used for presentation purposes and not in the recommendation process.
| ID | block_number | contract_address | from_address | to_address | amount_precise | symbol |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 19645114 | 0x521dea5cd838b23b0a7c430e5f3e8e186b61c34d | 0xaf0a8ec616c4a9772ad344977f4ed2e03cd05670 | 0x408018da3786c6c4046f460011905c69d60659d0 | 5443975685.60311 | AIQQ |
| 2 | 19645126 | 0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48 | 0x3416cf6c708da44db2624d63ea0aaef7113527c6 | 0x74de5d4fcbf63e00296fd95d33236b9794016631 | 2298.98696 | USDC |

Given this dataset, our goal is to build a system that takes it as input and recommends tokens (i.e., contract_address values) to public addresses. You can download a snapshot of this data from this link. Note that this is the ERC20 token transfer data of 5,000 blocks starting from block 19,000,000, downloaded using the Flipside API. The file must be unzipped before usage.
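Before wiring the data into Seshat, it can help to sanity-check the snapshot directly. The following is a minimal sketch using pandas, assuming the unzipped file sits at data/token_transfer_19000000_19005100.csv (the path used throughout this tutorial) and that its columns match the table above:

import pandas as pd

# Load the snapshot for a quick look at its shape and attributes.
df = pd.read_csv("data/token_transfer_19000000_19005100.csv")

print(df.shape)              # number of transfers and attributes
print(df.columns.tolist())   # expect the seven attributes described above
print(df.head())             # first few rows, as in the table above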

Preprocess and clean data

Before we start building the recommendation system, we need to perform a series of data preprocessing and cleaning steps. This is the first step in showing how easy it is to develop an ML dataflow with Seshat SDK. In this tutorial, we assume the input data resides in a local file. Note that Seshat supports other data sources too, such as providers like Flipside or a relational database.

In Seshat SDK, the entire ML operation is handled by a FeatureView. Within the FeatureView, we define the data source, along with a series of pipelines that will be performed on the input data source. The following code illustrates this:

from seshat.feature_view.base import FeatureView
from seshat.source.local import LocalSource
from seshat.transformer.pipeline import Pipeline
from seshat.transformer.trimmer.base import ZeroAddressTrimmer


class TokenRecommendation(FeatureView):
    name = "Token Recommendation with Local csv as Source"
    offline_source = LocalSource(path="data/token_transfer_19000000_19005100.csv")

    offline_pipeline = Pipeline(
        [
            ZeroAddressTrimmer(
                zero_address="0x0000000000000000000000000000000000000000"
            )
        ]
    )


recommender = TokenRecommendation()
view = recommender()

print(view.data.to_raw())

Let's see how this code works. First, we create the TokenRecommendation class, which inherits from FeatureView. Within the TokenRecommendation class, we assign a name and specify the offline_source attribute. In this tutorial, the offline source is the csv file that contains the transactions, and it is an object of type LocalSource.

The last part of the TokenRecommendation class defines the pipeline. The pipeline is composed of a list of transformers that are applied to the dataset. As we will see later in this tutorial, Seshat SDK provides a number of transformers that streamline ML development for different applications. As our first pipeline, we use only one data-cleaning transformer, which removes zero addresses from the input transaction dataset. In the Ethereum network, the zero address is a special address that no one controls, typically appearing as the counterparty in token mint and burn events. For the sake of token recommendation, we remove the transactions (i.e., rows in our input data) where the zero address is either the sender or the receiver. Note that the pipeline is of type Pipeline and essentially receives a list of transformer instances as input. Each transformer, such as ZeroAddressTrimmer, is applied to the data and performs a specific task.

After the TokenRecommendation class is defined, we initialize it and call the resulting object recommender. Note that in Seshat SDK, initializing a class inherited from FeatureView does not start the ML process. In order to start the ML process, we need to call the recommender object: view = recommender(). Behind the scenes, this causes Seshat SDK to invoke the __call__ method of the FeatureView, which in turn fetches the data and executes the pipeline and all the transformers within it. The view object contains the preprocessed and clean dataset in an attribute called data.
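For intuition, the cleaning performed by ZeroAddressTrimmer corresponds roughly to the following pandas filter. This is an illustration using the column names from the table above, not the transformer's actual implementation:

import pandas as pd

ZERO_ADDRESS = "0x0000000000000000000000000000000000000000"

df = pd.read_csv("data/token_transfer_19000000_19005100.csv")

# Keep only transfers where neither endpoint is the zero address,
# i.e. drop token mint and burn events.
cleaned = df[(df["from_address"] != ZERO_ADDRESS) & (df["to_address"] != ZERO_ADDRESS)]
print(len(df) - len(cleaned), "mint/burn rows removed")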

Now that we have covered the basics of the FeatureView class, let's see an extended version of the above example with more preprocessing transformers.

from seshat.feature_view.base import FeatureView
from seshat.source.local import LocalSource
from seshat.transformer.pipeline import Pipeline
from seshat.transformer.trimmer.base import (
    ZeroAddressTrimmer,
    LowTransactionTrimmer,
    FeatureTrimmer,
    ContractTrimmer,
)
from seshat.general import configs
from seshat.utils.contracts import PopularContractsFinder


class TokenRecommendation(FeatureView):
    name = "Token Recommendation with Local csv as Source and save result to database"
    offline_source = LocalSource(path="data/token_transfer_19000000_19005100.csv")

    offline_pipeline = Pipeline(
        [
            ZeroAddressTrimmer(
                zero_address="0x0000000000000000000000000000000000000000"
            ),
            LowTransactionTrimmer(min_transaction_num=20),
            FeatureTrimmer(
                columns=[
                    configs.CONTRACT_ADDRESS_COL,
                    configs.FROM_ADDRESS_COL,
                    configs.TO_ADDRESS_COL,
                    configs.AMOUNT_PRICE_COL,
                    configs.BLOCK_NUM_COL,
                ]
            ),
            ContractTrimmer(
                PopularContractsFinder().find, contract_list_kwargs={"limit": 100}
            ),
        ]
    )


recommender = TokenRecommendation()
view = recommender()

print(view.data.to_raw())

In the above code, we added three more transformers to the pipeline (a rough pandas equivalent of their logic follows the list):

  1. LowTransactionTrimmer: removes addresses with a low number of transactions. By default, an address is removed if it does not reach the minimum number of transactions as either sender or receiver.
  2. FeatureTrimmer: keeps only the required attributes. This makes the dataset smaller and hence more memory efficient for subsequent transformers. Note that here we use the configs module within Seshat SDK to get the names of the required attributes.
  3. ContractTrimmer: performs processing on contracts. Remember that in the token recommendation use case, a contract address represents a token. Here, we are interested in keeping only contracts with a high trading frequency; more specifically, the top-k most frequent contracts. Thus, the ContractTrimmer transformer receives PopularContractsFinder and the number of required contracts as input.
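Here is a rough pandas equivalent of these three trimmers, under the same file path and column-name assumptions as before. The SDK's internals may differ; in particular, counting an address's transactions over both its sender and receiver roles is one plausible reading of LowTransactionTrimmer's default behaviour.

import pandas as pd

df = pd.read_csv("data/token_transfer_19000000_19005100.csv")

# LowTransactionTrimmer: count appearances as sender or receiver and
# keep only transfers between addresses with at least 20 transactions.
counts = pd.concat([df["from_address"], df["to_address"]]).value_counts()
active = counts[counts >= 20].index
df = df[df["from_address"].isin(active) & df["to_address"].isin(active)]

# FeatureTrimmer: keep only the attributes the later steps need.
df = df[["contract_address", "from_address", "to_address",
         "amount_precise", "block_number"]]

# ContractTrimmer + PopularContractsFinder: keep the 100 most
# frequently traded contracts.
top_contracts = df["contract_address"].value_counts().head(100).index
df = df[df["contract_address"].isin(top_contracts)]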

Vectorize entities

After preprocessing the input dataset, the next step is to convert addresses into vectors. Vectorizing addresses allows us to easily compare them with each other. For the purpose of our token recommendation use case, once addresses are vectorized, we can find similar addresses: for any given input address, we first find its closest addresses in the vector space (these close addresses are called neighbours), then find the tokens purchased by the neighbour addresses and recommend them to the input address.

There are many ways to vectorize addresses. The first method we cover in this tutorial is called pivoting. Essentially, for any given address, the vector representing that address records the number (or amount) of each token the address bought or sold, with the buys and sells of each token as separate dimensions. In order to produce higher-quality vectors and keep the vector space from growing too large, we build these vectors only for popular tokens (i.e., contracts). For example, if we keep only the 100 most popular contracts, then each address is represented by a vector of size 200: 100 dimensions for the bought contracts and 100 dimensions for the sold contracts.

Within Seshat SDK, this can be achieved with a few lines of code. We essentially add one transformer to the pipeline: TokenPivotVectorizer.

from seshat.feature_view.base import FeatureView
from seshat.source.local import LocalSource
from seshat.transformer.pipeline import Pipeline
from seshat.transformer.trimmer.base import (
    ZeroAddressTrimmer,
    LowTransactionTrimmer,
    FeatureTrimmer,
    ContractTrimmer,
)
from seshat.general import configs
from seshat.utils.contracts import PopularContractsFinder
from seshat.transformer.vectorizer.pivot import TokenPivotVectorizer


class TokenRecommendation(FeatureView):
    name = "Token Recommendation with Local csv as Source and save result to database"
    offline_source = LocalSource(path="data/token_transfer_19000000_19005100.csv")

    offline_pipeline = Pipeline(
        [
            ZeroAddressTrimmer(
                zero_address="0x0000000000000000000000000000000000000000"
            ),
            LowTransactionTrimmer(min_transaction_num=20),
            FeatureTrimmer(
                columns=[
                    configs.CONTRACT_ADDRESS_COL,
                    configs.FROM_ADDRESS_COL,
                    configs.TO_ADDRESS_COL,
                    configs.AMOUNT_PRICE_COL,
                    configs.BLOCK_NUM_COL,
                ]
            ),
            ContractTrimmer(
                PopularContractsFinder().find, contract_list_kwargs={"limit": 100}
            ),
            TokenPivotVectorizer(
                group_keys={"default": "default", "vector": "recom_pivot"}
            ),
        ]
    )


recommender = TokenRecommendation()
view = recommender()

print(view.data["default"].to_raw())
print(view.data["recom_pivot"].to_raw())
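Conceptually, the pivot vectorization resembles the pandas sketch below, which builds one row per address with separate sell and buy dimensions per contract. This is a rough illustration of the idea, not TokenPivotVectorizer's actual implementation; df is the trimmed dataframe from the earlier sketches.

# Sell side: total amount of each token an address sent out.
sells = df.pivot_table(index="from_address", columns="contract_address",
                       values="amount_precise", aggfunc="sum", fill_value=0)

# Buy side: total amount of each token an address received.
buys = df.pivot_table(index="to_address", columns="contract_address",
                      values="amount_precise", aggfunc="sum", fill_value=0)

# One vector per address: with 100 contracts kept, each address gets a
# 200-dimensional vector (100 sell dimensions + 100 buy dimensions).
vectors = (sells.add_prefix("sell_")
                .join(buys.add_prefix("buy_"), how="outer")
                .fillna(0))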

Perform the recommendation

The last step of this tutorial is to perform the recommendation. Since we already have the vectorized representation of each address, we need to find the distances between these vectors. This then helps us find the closest addresses (i.e., neighbours) for any given input address. In order to do this, we add CosineSimilarityVectorizer to the end of the pipeline.
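As a quick refresher before the updated pipeline, cosine similarity between two address vectors can be computed with plain numpy as below. This is our own illustration of the quantity being measured; the transformer's output layout may differ.

import numpy as np

def cosine_sim(u, v):
    # 1.0 means the vectors point in the same direction (most similar);
    # 0.0 means the addresses share no common buy/sell activity.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))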

from seshat.feature_view.base import FeatureView
from seshat.source.local import LocalSource
from seshat.transformer.pipeline import Pipeline
from seshat.transformer.trimmer.base import (
    ZeroAddressTrimmer,
    LowTransactionTrimmer,
    FeatureTrimmer,
    ContractTrimmer,
)
from seshat.general import configs
from seshat.utils.contracts import PopularContractsFinder
from seshat.transformer.vectorizer.pivot import TokenPivotVectorizer
from seshat.transformer.vectorizer.cosine_similarity import CosineSimilarityVectorizer


class TokenRecommendation(FeatureView):
    name = "Token Recommendation with Local csv as Source and save result to database"
    offline_source = LocalSource(path="data/token_transfer_19000000_19005100.csv")

    offline_pipeline = Pipeline(
        [
            ZeroAddressTrimmer(
                zero_address="0x0000000000000000000000000000000000000000"
            ),
            LowTransactionTrimmer(min_transaction_num=20),
            FeatureTrimmer(
                columns=[
                    configs.CONTRACT_ADDRESS_COL,
                    configs.FROM_ADDRESS_COL,
                    configs.TO_ADDRESS_COL,
                    configs.AMOUNT_PRICE_COL,
                    configs.BLOCK_NUM_COL,
                ]
            ),
            ContractTrimmer(
                PopularContractsFinder().find, contract_list_kwargs={"limit": 100}
            ),
            TokenPivotVectorizer(
                group_keys={"default": "default", "vector": "recom_pivot"}
            ),
            CosineSimilarityVectorizer(square_shape=False),
        ]
    )


recommender = TokenRecommendation()
view = recommender()

print(view.data["default"].to_raw())
print(view.data["recom_pivot"].to_raw())
print(view.data["cosine_sim"].to_raw())

Note that at the end of the above code, we print the results of these pipelines, which are three SFrames: the cleaned transactions, the pivot vectors, and the cosine similarities.
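To close the loop, here is one hypothetical way to turn such similarity scores into actual recommendations, sketched with scikit-learn over the vectors dataframe from the pivot sketch above. This post-processing is our own illustration rather than part of the SDK pipeline, and the choice of 10 neighbours and 5 tokens is arbitrary.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# vectors: the address-by-(sell_/buy_ contract) matrix from the pivot sketch.
sim = cosine_similarity(vectors.values)  # square address-by-address matrix


def recommend(address, k_neighbours=10, n_tokens=5):
    i = vectors.index.get_loc(address)
    # Closest addresses in the vector space, skipping the address itself.
    neighbours = np.argsort(sim[i])[::-1][1:k_neighbours + 1]
    # Aggregate what the neighbours bought (the buy_ dimensions) ...
    neighbour_buys = vectors.iloc[neighbours].filter(like="buy_").sum()
    # ... and drop tokens the input address has already received.
    own_buys = vectors.loc[address].filter(like="buy_")
    candidates = neighbour_buys[own_buys == 0]
    return candidates.sort_values(ascending=False).head(n_tokens).index.tolist()


# Any address that survived the trimming steps can be queried.
print(recommend(vectors.index[0]))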