Pivot Strategy
The Pivot Vectorizer can be employed using different strategies, primarily involving aggregation methods. These strategies define how the pivot table is constructed and what type of information it captures from the data. This section discusses the built-in strategies available and how you can define your own custom strategies by overriding specific methods.
This section assumes that you are familiar with the pivot vectorizer. If you are not, read the pivot documentation first.
Implementation
Strategies follow the same logic that transformers use: they implement simple handler methods that receive the raw data, with method names following the format handler_name + frame_name. For example, an aggregation strategy whose HANDLER_NAME is "run" has methods like "run_df" and "run_spf".
The implementation of these handler methods is generally the same in every case: they build the pivot and then rename the columns as configured. The core pivoting functionality lives in the methods _run_df and _run_spf, which are responsible for creating the pivot and returning it; all other tasks are handled by the default methods. This design makes it easy to define a strategy by overriding a single small method, as discussed further in the Write Your Own Strategy section below.
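For intuition, here is a minimal, hypothetical sketch of that pattern. The class name, the renaming step, and the exact signatures are illustrative assumptions, not the library's actual internals:

import pandas as pd


class SketchStrategy:
    # Hypothetical strategy illustrating the naming convention:
    # HANDLER_NAME ("run") + frame name ("df" / "spf") -> run_df / run_spf.
    HANDLER_NAME = "run"

    def __init__(self, address_col, pivot_columns):
        self.address_col = address_col
        self.pivot_columns = pivot_columns

    def run_df(self, raw: pd.DataFrame, *args, **kwargs):
        # Public handler: delegate the pivoting, then apply shared
        # post-processing (here, an illustrative column rename).
        pivot = self._run_df(raw, self.address_col, *args, **kwargs)
        return pivot.add_prefix("token_address_")

    def _run_df(self, raw: pd.DataFrame, index_col, *args, **kwargs):
        # The small method a concrete strategy overrides to build the pivot.
        raise NotImplementedError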
Built-in Strategies
There are multiple built-in strategies you can use, each offering benefits depending on your data.
To explain them, we define a single dataset and show the result produced by each strategy. Consider the following table as our data:
address | token | amount |
---|---|---|
address_1 | token_1 | 100 |
address_2 | token_2 | 200 |
address_1 | token_2 | 150 |
address_3 | token_1 | 300 |
address_1 | token_3 | 50 |
address_2 | token_1 | 100 |
The corresponding input SFrame is constructed as follows:
sf = DFrame.from_raw(
    {
        "address": [
            "address_1",
            "address_2",
            "address_1",
            "address_3",
            "address_1",
            "address_2",
        ],
        "token": ["token_1", "token_2", "token_2", "token_1", "token_3", "token_1"],
        "amount": [100, 200, 150, 300, 50, 100],
    }
)
Count
The Count strategy counts the records in each group: it uses the "size" aggregation function in the pandas implementation and "count" in the PySpark implementation. The result for our example data looks like this:
address | token_address_token_1 | token_address_token_2 | token_address_token_3 |
---|---|---|---|
address_1 | 1 | 1 | 1 |
address_2 | 1 | 1 | 0 |
address_3 | 1 | 0 | 0 |
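For intuition, the table above roughly corresponds to what a plain pandas pivot_table with aggfunc="size" produces (the vectorizer additionally renames the pivoted columns, e.g. to token_address_token_1). The snippet below is a simplified approximation, not the strategy's internal code:

import pandas as pd

df = pd.DataFrame(
    {
        "address": ["address_1", "address_2", "address_1", "address_3", "address_1", "address_2"],
        "token": ["token_1", "token_2", "token_2", "token_1", "token_3", "token_1"],
        "amount": [100, 200, 150, 300, 50, 100],
    }
)
# Count how many records exist for each (address, token) pair.
counts = df.pivot_table(index="address", columns="token", aggfunc="size", fill_value=0)
print(counts)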
Example
vectorizer = TokenPivotVectorizer(
    strategy=CountStrategy(address_col="address", pivot_columns=["token"]),
    address_cols=["address"],
    result_address_col="address",
    contract_address_col="token",
    square_shape=False,
)
sf = vectorizer(sf)
sf["vector"].data
>>>
     address  token_address_token_1  token_address_token_2  \
0  address_1                      1                      1
1  address_2                      1                      1
2  address_3                      1                      0

   token_address_token_3
0                      1
1                      0
2                      0
Sum
The Sum strategy uses the "sum" aggregation function. It needs an additional argument called value_column, which indicates the column that the sum is applied to. If you set value_column to "amount" in our example, you get this result:
address | token_address_token_1 | token_address_token_2 | token_address_token_3 |
---|---|---|---|
address_1 | 100 | 150 | 50 |
address_2 | 100 | 200 | 0 |
address_3 | 300 | 0 | 0 |
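Again just for intuition, this matches a plain pandas pivot_table over the "amount" column, reusing the df built in the Count sketch above; it is an approximation, not the strategy's internal code:

# Sum the "amount" column for each (address, token) pair.
sums = df.pivot_table(
    index="address", columns="token", values="amount", aggfunc="sum", fill_value=0
)
print(sums)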
Example
sf = DFrame.from_raw(
    {
        "address": [
            "address_1",
            "address_2",
            "address_1",
            "address_3",
            "address_1",
            "address_2",
        ],
        "token": ["token_1", "token_2", "token_2", "token_1", "token_3", "token_1"],
        "amount": [100, 200, 150, 300, 50, 100],
    }
)
vectorizer = TokenPivotVectorizer(
    strategy=SumStrategy(
        address_col="address", pivot_columns=["token"], value_column="amount"
    ),
    address_cols=["address"],
    result_address_col="address",
    contract_address_col="token",
    should_normalize=False,
)
sf = vectorizer(sf)
sf["vector"].data
>>>
     address  token_address_token_1  token_address_token_2  \
0  address_1                    100                    150
1  address_2                    100                    200
2  address_3                    300                      0

   token_address_token_3
0                     50
1                      0
2                      0
Bool
The Bool strategy produces one of two values, 1 or 0, where 1 means true and 0 means false. A value of 1 shows that at least one record exists with a relation in that column, and 0 shows that no relation was found. In other words, the Bool strategy checks whether the count in each cell is zero: if it is, the value becomes 0; otherwise it becomes 1.
With the Bool strategy, our data looks like this:
address | token_address_token_1 | token_address_token_2 | token_address_token_3 |
---|---|---|---|
address_1 | 1 | 1 | 1 |
address_2 | 1 | 1 | 0 |
address_3 | 1 | 0 | 0 |
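Conceptually, the Bool result is the Count result clamped to 0/1. A rough pandas approximation, reusing the counts pivot from the Count sketch above:

# 1 where at least one record exists for the (address, token) pair, else 0.
bools = counts.gt(0).astype(int)
print(bools)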
Example
vectorizer = TokenPivotVectorizer(
    strategy=BoolStrategy(address_col="address", pivot_columns=["token"]),
    address_cols=["address"],
    result_address_col="address",
    contract_address_col="token",
    should_normalize=False,
)
sf = vectorizer(sf)
sf["vector"].data
>>>
     address  token_address_token_1  token_address_token_2  \
0  address_1                      1                      1
1  address_2                      1                      1
2  address_3                      1                      0

   token_address_token_3
0                      1
1                      0
2                      0
Write Your Own Strategy
You can write your own strategy quickly by overriding a few methods. In this section, we implement the "BoolStrategy" ourselves. Follow these steps:
- Define your custom strategy class so that it inherits from "AggregationStrategy":

from seshat.transformer.vectorizer.utils import AggregationStrategy


class BoolStrategy(AggregationStrategy):
    pass

- Define the handler methods for your desired raw type. For example, override the _run_df method, which is used for DFrame. Make sure that you return the pivot as the result of the method:

from pandas import DataFrame


class BoolStrategy(AggregationStrategy):
    def _run_df(self, raw: DataFrame, index_col, *args, **kwargs):
        pivot = raw.pivot_table(
            index=index_col, columns=self.pivot_columns, aggfunc="size", fill_value=0
        )
        # Convert counts to booleans: 1 if any record exists, otherwise 0.
        pivot = pivot.apply(lambda col: col.map(lambda x: 1 if x > 0 else 0))
        return pivot
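If you also need the PySpark path, you can override _run_spf in the same class. The sketch below is an assumption-laden example: it assumes _run_spf receives a PySpark DataFrame plus the index column with the same kind of signature as _run_df, and that a single pivot column is used. Check the AggregationStrategy base class for the exact contract before relying on it.

from pyspark.sql import DataFrame as SparkDataFrame
from pyspark.sql import functions as F


class BoolStrategy(AggregationStrategy):
    # _run_df as defined above...

    def _run_spf(self, raw: SparkDataFrame, index_col, *args, **kwargs):
        # Count records per (index, pivot) pair, then clamp the counts to 0/1.
        pivot = raw.groupBy(index_col).pivot(self.pivot_columns[0]).count().fillna(0)
        for column in [c for c in pivot.columns if c != index_col]:
            pivot = pivot.withColumn(column, F.when(F.col(column) > 0, 1).otherwise(0))
        return pivot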