Pivot Strategy
The Pivot Vectorizer can be employed using different strategies, primarily involving aggregation methods. These strategies define how the pivot table is constructed and what type of information it captures from the data. This section discusses the built-in strategies available and how you can define your own custom strategies by overriding specific methods.
This section assumes that you are familiar with the pivot vectorizer. If you are not, read the pivot documentation first.
Implementation
Strategies follow the same logic that transformers use: they implement simple handler methods that receive the raw data, with method names following the format handler_name + frame_name. For example, an aggregation strategy whose HANDLER_NAME is "run" has methods like "run_df" and "run_spf".
The implementation of these handler methods is generally the same in every case: they build the pivot and then rename the columns as configured. The core pivoting functionality lives in the methods _run_df and _run_spf, which are responsible for creating the pivot and returning it; all other tasks are handled by the default methods. This design makes it easy to define a strategy by overriding a single small method, as discussed further in the Write Your Own Strategy section below.
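For intuition, here is a minimal, hypothetical sketch of that pattern. The class name, the renaming step, and the exact signatures are illustrative assumptions, not the library's actual internals:

import pandas as pd


class SketchStrategy:
    # Hypothetical strategy illustrating the naming convention:
    # HANDLER_NAME ("run") + frame name ("df" / "spf") -> run_df / run_spf.
    HANDLER_NAME = "run"

    def __init__(self, address_col, pivot_columns):
        self.address_col = address_col
        self.pivot_columns = pivot_columns

    def run_df(self, raw: pd.DataFrame, *args, **kwargs):
        # Public handler: delegate the pivoting, then apply shared
        # post-processing (here, an illustrative column rename).
        pivot = self._run_df(raw, self.address_col, *args, **kwargs)
        return pivot.add_prefix("token_address_")

    def _run_df(self, raw: pd.DataFrame, index_col, *args, **kwargs):
        # The small method a concrete strategy overrides to build the pivot.
        raise NotImplementedError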
Built-in Strategies
There are multiple built-in strategies you can use, each offering benefits depending on your data.
To explain them, we define a single dataset and show the result produced by each strategy. Consider the following table as our data:
address | token | amount |
---|---|---|
address_1 | token_1 | 100 |
address_2 | token_2 | 200 |
address_1 | token_2 | 150 |
address_3 | token_1 | 300 |
address_1 | token_3 | 50 |
address_2 | token_1 | 100 |
The corresponding input SFrame is constructed as follows:
sf = DFrame.from_raw(
    {
        "address": [
            "address_1",
            "address_2",
            "address_1",
            "address_3",
            "address_1",
            "address_2",
        ],
        "token": ["token_1", "token_2", "token_2", "token_1", "token_3", "token_1"],
        "amount": [100, 200, 150, 300, 50, 100],
    }
)
Count
The Count strategy counts the records in each group: it uses the "size" aggregation function in the pandas implementation and "count" in the PySpark implementation. The result for our example data looks like this:
address | token_address_token_1 | token_address_token_2 | token_address_token_3 |
---|---|---|---|
address_1 | 1 | 1 | 1 |
address_2 | 1 | 1 | 0 |
address_3 | 1 | 0 | 0 |
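For intuition, the table above roughly corresponds to what a plain pandas pivot_table with aggfunc="size" produces (the vectorizer additionally renames the pivoted columns, e.g. to token_address_token_1). The snippet below is a simplified approximation, not the strategy's internal code:

import pandas as pd

df = pd.DataFrame(
    {
        "address": ["address_1", "address_2", "address_1", "address_3", "address_1", "address_2"],
        "token": ["token_1", "token_2", "token_2", "token_1", "token_3", "token_1"],
        "amount": [100, 200, 150, 300, 50, 100],
    }
)
# Count how many records exist for each (address, token) pair.
counts = df.pivot_table(index="address", columns="token", aggfunc="size", fill_value=0)
print(counts)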
Example
vectorizer = TokenPivotVectorizer(
    strategy=CountStrategy(address_col="address", pivot_columns=["token"]),
    address_cols=["address"],
    result_address_col="address",
    contract_address_col="token",
    square_shape=False,
)
sf = vectorizer(sf)
sf["vector"].data
>>>
     address  token_address_token_1  token_address_token_2  \
0  address_1                      1                      1
1  address_2                      1                      1
2  address_3                      1                      0

   token_address_token_3
0                      1
1                      0
2                      0
Sum
The Sum strategy uses the "sum" aggregation function. It needs an additional argument called value_column, which indicates the column that the sum is applied to. If you set value_column to "amount" in our example, you get this result:
address | token_address_token_1 | token_address_token_2 | token_address_token_3 |
---|---|---|---|
address_1 | 100 | 150 | 50 |
address_2 | 100 | 200 | 0 |
address_3 | 300 | 0 | 0 |
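Again just for intuition, this matches a plain pandas pivot_table over the "amount" column, reusing the df built in the Count sketch above; it is an approximation, not the strategy's internal code:

# Sum the "amount" column for each (address, token) pair.
sums = df.pivot_table(
    index="address", columns="token", values="amount", aggfunc="sum", fill_value=0
)
print(sums)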
Example
sf = DFrame.from_raw(
    {
        "address": [
            "address_1",
            "address_2",
            "address_1",
            "address_3",
            "address_1",
            "address_2",
        ],
        "token": ["token_1", "token_2", "token_2", "token_1", "token_3", "token_1"],
        "amount": [100, 200, 150, 300, 50, 100],
    }
)
vectorizer = TokenPivotVectorizer(
    strategy=SumStrategy(
        address_col="address", pivot_columns=["token"], value_column="amount"
    ),
    address_cols=["address"],
    result_address_col="address",
    contract_address_col="token",
    should_normalize=False,
)
sf = vectorizer(sf)
sf["vector"].data
>>>
     address  token_address_token_1  token_address_token_2  \
0  address_1                    100                    150
1  address_2                    100                    200
2  address_3                    300                      0

   token_address_token_3
0                     50
1                      0
2                      0
Bool
The Bool strategy produces one of two values, 1 or 0, where 1 means true and 0 means false. A value of 1 shows that at least one record exists with a relation in that column, and 0 shows that no relation was found. In other words, the Bool strategy checks whether the count in each cell is zero: if it is, the value becomes 0; otherwise it becomes 1.
With the Bool strategy, our data looks like this:
address | token_address_token_1 | token_address_token_2 | token_address_token_3 |
---|---|---|---|
address_1 | 1 | 1 | 1 |
address_2 | 1 | 1 | 0 |
address_3 | 1 | 0 | 0 |
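Conceptually, the Bool result is the Count result clamped to 0/1. A rough pandas approximation, reusing the counts pivot from the Count sketch above:

# 1 where at least one record exists for the (address, token) pair, else 0.
bools = counts.gt(0).astype(int)
print(bools)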
Example
vectorizer = TokenPivotVectorizer(
    strategy=BoolStrategy(address_col="address", pivot_columns=["token"]),
    address_cols=["address"],
    result_address_col="address",
    contract_address_col="token",
    should_normalize=False,
)
sf = vectorizer(sf)
sf["vector"].data
>>>
     address  token_address_token_1  token_address_token_2  \
0  address_1                      1                      1
1  address_2                      1                      1
2  address_3                      1                      0

   token_address_token_3
0                      1
1                      0
2                      0
Write Your Own Strategy
You can write your own strategy quickly by overriding a few methods. In this section, we implement the "BoolStrategy" ourselves. Follow these steps:
- Define your custom strategy class so that it inherits from "AggregationStrategy":

from seshat.transformer.vectorizer.utils import AggregationStrategy


class BoolStrategy(AggregationStrategy):
    pass

- Define the handler methods for your desired raw type. For example, override the _run_df method, which is used for DFrame. Make sure that you return the pivot as the result of the method:

from pandas import DataFrame


class BoolStrategy(AggregationStrategy):
    def _run_df(self, raw: DataFrame, index_col, *args, **kwargs):
        pivot = raw.pivot_table(
            index=index_col, columns=self.pivot_columns, aggfunc="size", fill_value=0
        )
        # Convert counts to booleans: 1 if any record exists, otherwise 0.
        pivot = pivot.apply(lambda col: col.map(lambda x: 1 if x > 0 else 0))
        return pivot
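If you also need the PySpark path, you can override _run_spf in the same class. The sketch below is an assumption-laden example: it assumes _run_spf receives a PySpark DataFrame plus the index column with the same kind of signature as _run_df, and that a single pivot column is used. Check the AggregationStrategy base class for the exact contract before relying on it.

from pyspark.sql import DataFrame as SparkDataFrame
from pyspark.sql import functions as F


class BoolStrategy(AggregationStrategy):
    # _run_df as defined above...

    def _run_spf(self, raw: SparkDataFrame, index_col, *args, **kwargs):
        # Count records per (index, pivot) pair, then clamp the counts to 0/1.
        pivot = raw.groupBy(index_col).pivot(self.pivot_columns[0]).count().fillna(0)
        for column in [c for c in pivot.columns if c != index_col]:
            pivot = pivot.withColumn(column, F.when(F.col(column) > 0, 1).otherwise(0))
        return pivot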