
Pivot Strategy

The Pivot Vectorizer supports different strategies, each built around an aggregation method. A strategy defines how the pivot table is constructed and what information it captures from the data. This section covers the built-in strategies and shows how to define a custom strategy by overriding specific methods.

note

This section assumes that you are familiar with the pivot vectorizer. If you are not, read the pivot documentation first.

Implementation

Strategies follow the same logic as transformers. Each strategy implements simple methods that receive the raw data, and each method name combines the handler name and the frame name in the form handler_name + "_" + frame_name. For example, an aggregation strategy whose HANDLER_NAME is "run" has methods such as run_df and run_spf.
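The naming convention can be illustrated with a minimal sketch. The class RunStrategySketch and its dispatch method are hypothetical names invented for this illustration; they only show how a method name could be assembled from the handler name and the frame name, and they are not the library's actual dispatch code.

```python
class RunStrategySketch:
    """Hypothetical stand-in for a strategy; not part of the library."""

    HANDLER_NAME = "run"

    def run_df(self, raw):
        # Handler for the "df" frame type.
        return f"run_df handled {raw!r}"

    def run_spf(self, raw):
        # Handler for the "spf" frame type.
        return f"run_spf handled {raw!r}"

    def dispatch(self, raw, frame_name):
        # Build the method name as handler_name + "_" + frame_name,
        # then look it up on the instance and call it.
        method = getattr(self, f"{self.HANDLER_NAME}_{frame_name}")
        return method(raw)


strategy = RunStrategySketch()
print(strategy.dispatch("rows", "df"))
print(strategy.dispatch("rows", "spf"))
```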

These methods are generally implemented the same way in every strategy: they build the pivot and then rename its columns as configured. The core pivoting logic lives in separate methods, _run_df and _run_spf, which are responsible for creating the pivot and returning it. Default methods handle everything else. This design makes it easy to define a strategy by overriding one small method, as described in the Write Your Own Strategy section.
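This split between a public handler and a small overridable core can be sketched with plain pandas. Everything here is an assumption made for illustration: AggregationSketch, CountSketch, and the prefix argument are hypothetical and do not reproduce the library's base class or its column-renaming rules.

```python
import pandas as pd


class AggregationSketch:
    """Hypothetical base class: the public handler wraps an overridable core."""

    def __init__(self, pivot_columns, prefix):
        self.pivot_columns = pivot_columns
        self.prefix = prefix  # assumed renaming rule, for illustration only

    def run_df(self, raw, index_col):
        # Shared work: build the pivot, rename its columns, restore the index column.
        pivot = self._run_df(raw, index_col)
        pivot.columns = [f"{self.prefix}{name}" for name in pivot.columns]
        return pivot.reset_index()

    def _run_df(self, raw, index_col):
        # Subclasses override only this method.
        raise NotImplementedError


class CountSketch(AggregationSketch):
    def _run_df(self, raw, index_col):
        # Core pivoting logic: count the records per (index, pivot value) pair.
        return raw.pivot_table(
            index=index_col, columns=self.pivot_columns, aggfunc="size", fill_value=0
        )


raw = pd.DataFrame(
    {
        "address": ["address_1", "address_2", "address_1", "address_3", "address_1", "address_2"],
        "token": ["token_1", "token_2", "token_2", "token_1", "token_3", "token_1"],
        "amount": [100, 200, 150, 300, 50, 100],
    }
)
result = CountSketch(pivot_columns=["token"], prefix="token_address_").run_df(raw, "address")
print(result)
```

Only _run_df differs between strategies in this sketch; the renaming and index handling stay in the base class.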

Built-in Strategies

There are multiple strategies that you can use, each offering benefits based on your data.

To explain each strategy, we define a dataset and show the result with each strategy. Consider the following table as our data:

| address | token | amount |
| --- | --- | --- |
| address_1 | token_1 | 100 |
| address_2 | token_2 | 200 |
| address_1 | token_2 | 150 |
| address_3 | token_1 | 300 |
| address_1 | token_3 | 50 |
| address_2 | token_1 | 100 |

And consider the input SFrame as follows:

sf = DFrame.from_raw(
    {
        "address": [
            "address_1",
            "address_2",
            "address_1",
            "address_3",
            "address_1",
            "address_2",
        ],
        "token": ["token_1", "token_2", "token_2", "token_1", "token_3", "token_1"],
        "amount": [100, 200, 150, 300, 50, 100],
    }
)

Count

The Count strategy counts the records for each combination of address and pivot value. It uses the "size" aggregation function in the pandas implementation and "count" in the PySpark implementation. Applied to the example data, it produces this table:

| address | token_address_token_1 | token_address_token_2 | token_address_token_3 |
| --- | --- | --- | --- |
| address_1 | 1 | 1 | 1 |
| address_2 | 1 | 1 | 0 |
| address_3 | 1 | 0 | 0 |
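The counts in this table can be reproduced with plain pandas, independent of the vectorizer. This is a minimal sketch of the "size" aggregation on the example data; it leaves out the column renaming that the strategy applies, and the variable names are chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "address": ["address_1", "address_2", "address_1", "address_3", "address_1", "address_2"],
        "token": ["token_1", "token_2", "token_2", "token_1", "token_3", "token_1"],
        "amount": [100, 200, 150, 300, 50, 100],
    }
)

# Count the rows for every (address, token) pair; pairs that never occur become 0.
count_pivot = df.pivot_table(
    index="address", columns="token", aggfunc="size", fill_value=0
)
print(count_pivot)
```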

Example

vectorizer = TokenPivotVectorizer(
    strategy=CountStrategy(address_col="address", pivot_columns=["token"]),
    address_cols=["address"],
    result_address_col="address",
    contract_address_col="token",
    square_shape=False,
)
sf = vectorizer(sf)
sf["vector"].data

>>>
     address  token_address_token_1  token_address_token_2  \
0  address_1                      1                      1
1  address_2                      1                      1
2  address_3                      1                      0

   token_address_token_3
0                      1
1                      0
2                      0

Sum

The Sum strategy uses the "sum" aggregation function. It takes an additional argument, value_column, which names the column to be summed. With value_column set to "amount" in our example, you get this result:

| address | token_address_token_1 | token_address_token_2 | token_address_token_3 |
| --- | --- | --- | --- |
| address_1 | 100 | 150 | 50 |
| address_2 | 100 | 200 | 0 |
| address_3 | 300 | 0 | 0 |
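The same sums can be checked with plain pandas. This is a minimal sketch that assumes the value column maps onto the values argument of pivot_table; it does not show the strategy's own column renaming.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "address": ["address_1", "address_2", "address_1", "address_3", "address_1", "address_2"],
        "token": ["token_1", "token_2", "token_2", "token_1", "token_3", "token_1"],
        "amount": [100, 200, 150, 300, 50, 100],
    }
)

# Sum "amount" for every (address, token) pair; pairs that never occur become 0.
sum_pivot = df.pivot_table(
    index="address", columns="token", values="amount", aggfunc="sum", fill_value=0
)
print(sum_pivot)
```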

Example

sf = DFrame.from_raw(
    {
        "address": [
            "address_1",
            "address_2",
            "address_1",
            "address_3",
            "address_1",
            "address_2",
        ],
        "token": ["token_1", "token_2", "token_2", "token_1", "token_3", "token_1"],
        "amount": [100, 200, 150, 300, 50, 100],
    }
)
vectorizer = TokenPivotVectorizer(
    strategy=SumStrategy(
        address_col="address", pivot_columns=["token"], value_column="amount"
    ),
    address_cols=["address"],
    result_address_col="address",
    contract_address_col="token",
    should_normalize=False,
)
sf = vectorizer(sf)
sf["vector"].data

>>>
     address  token_address_token_1  token_address_token_2  \
0  address_1                    100                    150
1  address_2                    100                    200
2  address_3                    300                      0

   token_address_token_3
0                     50
1                      0
2                      0

Bool

The Bool strategy produces cells with one of two values, 1 or 0. The value 1 (true) means that at least one record relates the address to that column, and 0 (false) means that no such record exists. In other words, the Bool strategy checks whether the count in each cell is zero: a zero count stays 0, and any nonzero count becomes 1.

By using the Bool strategy, our data will look like this:

| address | token_address_token_1 | token_address_token_2 | token_address_token_3 |
| --- | --- | --- | --- |
| address_1 | 1 | 1 | 1 |
| address_2 | 1 | 1 | 0 |
| address_3 | 1 | 0 | 0 |
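In the example data no (address, token) pair occurs more than once, so the Bool table matches the Count table. The sketch below uses plain pandas and appends one extra row, a repeated (address_1, token_1) record that is not part of the original dataset and is added purely for illustration, to show where the two differ.

```python
import pandas as pd

df = pd.DataFrame(
    {
        # The last row is an illustrative duplicate, not part of the example data.
        "address": ["address_1", "address_2", "address_1", "address_3", "address_1", "address_2", "address_1"],
        "token": ["token_1", "token_2", "token_2", "token_1", "token_3", "token_1", "token_1"],
        "amount": [100, 200, 150, 300, 50, 100, 25],
    }
)

# Count the rows per (address, token) pair first.
count_pivot = df.pivot_table(
    index="address", columns="token", aggfunc="size", fill_value=0
)

# Collapse counts to presence flags: any nonzero count becomes 1.
bool_pivot = (count_pivot > 0).astype(int)

print(count_pivot.loc["address_1", "token_1"])  # count reflects the duplicate
print(bool_pivot.loc["address_1", "token_1"])   # flag stays at 1
```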

Example

vectorizer = TokenPivotVectorizer(
    strategy=BoolStrategy(address_col="address", pivot_columns=["token"]),
    address_cols=["address"],
    result_address_col="address",
    contract_address_col="token",
    should_normalize=False,
)
sf = vectorizer(sf)
sf["vector"].data

>>>
     address  token_address_token_1  token_address_token_2  \
0  address_1                      1                      1
1  address_2                      1                      1
2  address_3                      1                      0

   token_address_token_3
0                      1
1                      0
2                      0

Write Your Own Strategy

You can write your own strategy quickly by overriding a few methods. In this section, we implement "BoolStrategy" as an example. Follow these steps:

  1. Define your custom strategy class that inherits from "AggregationStrategy":
    from seshat.transformer.vectorizer.utils import AggregationStrategy

    class BoolStrategy(AggregationStrategy):
        pass
  2. Define the handler methods for your desired raw type. For example, override the _run_df method used for DFrame:
    from pandas import DataFrame

    class BoolStrategy(AggregationStrategy):
        def _run_df(self, raw: DataFrame, index_col, *args, **kwargs):
            # Count the records per (index, pivot value) pair; missing pairs become 0.
            pivot = raw.pivot_table(
                index=index_col, columns=self.pivot_columns, aggfunc="size", fill_value=0
            )
            # Collapse counts to presence flags: 1 if any record exists, else 0.
            pivot = pivot.apply(lambda col: col.map(lambda x: 1 if x > 0 else 0))
            return pivot
    Make sure that you return the pivot as the result of the method.