Quick Start
Seshat helps you build and productionize ML dataflows for Web2 and Web3 applications. It streamlines data preprocessing and cleaning, creates test and training data, and ensures that the data is up to date for inference.
This document helps you get up and running quickly with the Seshat SDK. The main goal of this tutorial is to build a vector representation for blockchain addresses (i.e., users). For a comprehensive tutorial that goes from the required raw data all the way to model inference, see our use case on SocialFi Token Recommendation.
System Requirements
This project requires Python 3.11 or higher. Make sure you have Python installed on your system before proceeding with the installation. You can download the latest version of Python from the official Python website.
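The version requirement above can be checked from Python itself before installing anything. The snippet below is a minimal sketch using only the standard library; the helper name python_is_supported is hypothetical and not part of Seshat.

```python
import sys

# Minimum version stated in the requirements above.
REQUIRED = (3, 11)


def python_is_supported(version_info=sys.version_info, required=REQUIRED):
    """Return True if the (major, minor) version meets the requirement."""
    return tuple(version_info[:2]) >= required


if python_is_supported():
    print("Python version is sufficient for Seshat.")
else:
    found = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Python {REQUIRED[0]}.{REQUIRED[1]}+ is required, found {found}.")
```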
Install Seshat
Use the following command to install the latest version of Seshat SDK.
pip install sdk-seshat-python
For Flipside support, also run this command:
pip install "sdk-seshat-python[flipside_support]"
Run this command for PostgreSQL support:
pip install "sdk-seshat-python[postgres_support]"
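To confirm that the installation succeeded, you can ask Python for the installed version of the distribution. This is a small sketch based on the standard library's importlib.metadata; the distribution name is taken from the pip commands above, and the helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution name as used in the pip commands above.
seshat_version = installed_version("sdk-seshat-python")
if seshat_version is None:
    print("sdk-seshat-python is not installed in this environment.")
else:
    print(f"sdk-seshat-python {seshat_version} is installed.")
```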
Required Data
With Seshat, you can use both online and offline data. Online data can be obtained from data providers such as Flipside.
For this quick start tutorial, download a snapshot of token transfer transaction data from this link. The data shows which token was sent from or received by which address, and in what amount. Make sure to unzip the CSV file before using it!
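Before feeding the snapshot into a pipeline, it can help to look at its header and first few rows. The sketch below uses only the standard library and makes no assumption about the column names; the helper name preview_csv is hypothetical, and the file path is the one used later in this tutorial.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path, n_rows=5):
    """Return the header row and the first n_rows data rows of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = list(islice(reader, n_rows))
    return header, rows


# Path used later in this tutorial; adjust it to where you unzipped the file.
snapshot = Path("data/token_transfer_19000000_19005100.csv")
if snapshot.exists():
    header, rows = preview_csv(snapshot)
    print(header)
    for row in rows:
        print(row)
else:
    print(f"{snapshot} not found; download and unzip the snapshot first.")
```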
Running the First Pipeline
In this part, we show you how to get a vector representation for each address in the dataset. These vectors can then be used for downstream tasks, such as recommendation. We also show you how to run a data preprocessing pipeline with Seshat. For data preprocessing, we run only one pipeline, which removes addresses with few transactions. Seshat offers plenty of other data cleaning and preprocessing pipelines.
In Seshat SDK, the entire ML operation is handled by FeatureView. Within the FeatureView, we define the data source. Different pipelines are defined inside the Pipeline object. Here, we only have two pipelines: LowTransactionTrimmer and TokenPivotVectorizer. After running the pipeline, we store the vectors of addresses in address_vectors.
from seshat.feature_view.base import FeatureView
from seshat.source.local import LocalSource
from seshat.transformer.pipeline import Pipeline
from seshat.transformer.trimmer.base import LowTransactionTrimmer
from seshat.transformer.vectorizer.pivot import TokenPivotVectorizer


class AddressVectorsView(FeatureView):
    name = "Vectorizing blockchain addresses using token transfer transactions"
    # Offline data: the unzipped token transfer snapshot.
    offline_source = LocalSource(path="data/token_transfer_19000000_19005100.csv")
    # Steps applied to the offline data, in order.
    offline_pipeline = Pipeline(
        [
            LowTransactionTrimmer(min_transaction_num=20),
            TokenPivotVectorizer(),
        ]
    )


vector_view = AddressVectorsView()
# Calling the view runs the pipeline on the data source.
view = vector_view()
# Store the resulting address vectors.
address_vectors = view.data["vector"].to_raw()
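To build intuition for what the two steps do, here is a conceptual sketch in plain Python. It is not Seshat's implementation: the record layout (address, token, amount), the helper names, and the choice to sum amounts per token are all assumptions made for illustration. The real LowTransactionTrimmer and TokenPivotVectorizer may count transactions and build vectors differently.

```python
from collections import Counter, defaultdict


def trim_low_activity(rows, min_transaction_num):
    """Drop rows of addresses that appear in fewer than min_transaction_num rows."""
    counts = Counter(address for address, _token, _amount in rows)
    return [row for row in rows if counts[row[0]] >= min_transaction_num]


def pivot_vectors(rows):
    """Build one vector per address with one dimension per token (summed amounts)."""
    tokens = sorted({token for _address, token, _amount in rows})
    index = {token: i for i, token in enumerate(tokens)}
    vectors = defaultdict(lambda: [0.0] * len(tokens))
    for address, token, amount in rows:
        vectors[address][index[token]] += amount
    return tokens, dict(vectors)


# Hypothetical toy records: (address, token, amount).
rows = [
    ("0xaaa", "USDC", 10.0),
    ("0xaaa", "WETH", 1.5),
    ("0xaaa", "USDC", 5.0),
    ("0xbbb", "USDC", 2.0),
    ("0xccc", "WETH", 0.5),
    ("0xccc", "DAI", 7.0),
]

kept = trim_low_activity(rows, min_transaction_num=2)
tokens, vectors = pivot_vectors(kept)
print(tokens)
for address, vector in vectors.items():
    print(address, vector)
```

In this toy data, 0xbbb has a single transfer and is removed by the trimming step, so it gets no vector. The remaining addresses each get a vector whose dimensions follow the sorted token list.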