Introduction To Derivers

The deriver is a transformer designed to derive new data or features from an existing dataset. Similar to other transformers, the deriver utilizes group keys, retrieving these keys from the input data and passing their raw values to the handlers for processing.

Derivers can work on both group SFrames and single SFrames, although some function exclusively on group SFrames. In either case, their handler functions typically operate on the raw data. This flexibility makes it possible to generate a wide variety of new features or data points from the existing dataset.

How it works

For pandas dataframes, the HANDLER_NAME is derive and the corresponding implementation is derive_df. This function applies the derive step to pandas dataframes, so you can transform and enrich your datasets within the familiar pandas framework.

SFrame from Columns

One of the key features of Seshat is the SFrame and its ability to be grouped. This deriver creates a new SFrame from columns of an existing SFrame. For example, assume you load a simple transactions dataset that contains from_address and to_address columns, and you want a new SFrame holding all the values of these columns. SFrameFromColsDeriver handles this case.

How it works

The deriver extracts the specified columns from the input SFrame and consolidates their values into a new SFrame. This functionality is particularly useful for isolating specific data points or preparing data for further analysis. By creating a new SFrame with the desired columns, the data processing workflow is streamlined, allowing for a focus on relevant attributes.

The deriver first retrieves the group keys from the input SFrame, then passes their raw values to the handler functions, which process the data according to the specified logic. Because it can handle both group and single SFrames, the deriver adapts to various data structures and requirements.

Example

from seshat.data_class import DFrame
from seshat.transformer.deriver import SFrameFromColsDeriver

data = {"from_address": ["zero", "bar", "baz"], "to_address": ["baz", "zero", "qux"]}
sf_input = DFrame.from_raw(data)
sf_output = SFrameFromColsDeriver(
    group_keys={"default": "default", "address": "address"},
    result_col="address",
    cols=["from_address", "to_address"],
)(sf_input)

sf_output["address"].data
>>>
address
0 zero
1 baz
2 bar
3 qux
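For intuition, the consolidation can be approximated in plain pandas. The sketch below is not Seshat's implementation. It assumes the deriver flattens the selected columns row by row and keeps the first occurrence of each value, an assumption that happens to reproduce the ordering shown above. All variable names are illustrative.

```python
import pandas as pd

# Same toy data as the example above.
df = pd.DataFrame(
    {
        "from_address": ["zero", "bar", "baz"],
        "to_address": ["baz", "zero", "qux"],
    }
)
cols = ["from_address", "to_address"]

# Flatten the selected columns row by row (row-major order).
# ASSUMPTION: this ordering is a guess, not documented Seshat behavior.
values = df[cols].to_numpy().ravel()

# Keep each distinct value once, in order of first appearance.
address_df = pd.DataFrame({"address": pd.unique(values)})
print(address_df)
```

The result is a one-column frame with four distinct addresses, which plays the role of the "address" SFrame in the example.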

Feature for Address

Assume you want to derive a feature for one SFrame from another SFrame. This deriver is built specifically for that purpose. Both SFrames share the same key, which can be a single column or a list of columns, such as "address" or ["contract_address", "token"]. To derive a new feature from the other columns, you provide an aggregation function and the columns to derive the feature from. All of these parameters can be set on this deriver.

The ability to handle both single columns and lists of columns adds to the flexibility of this deriver. By specifying the necessary columns and the aggregation function, you can tailor the feature derivation process to fit the specific needs of your dataset. This flexibility and ease of use make the deriver a powerful tool in the data preparation toolkit, ensuring that you can efficiently create meaningful and useful features from your existing data. The automatic conversion to numeric types further enhances its utility, allowing for seamless integration into various analytical and modeling workflows.

How it works

If your columns need to be numeric and you are unsure whether their types are set correctly, you do not need to convert them yourself. The is_numeric parameter is True by default, so the column types are converted to numeric automatically before the aggregation runs. This removes a common source of errors caused by inconsistent data types, such as amounts stored as strings.

This deriver is especially useful in scenarios where data transformation and feature engineering are essential. For example, in a financial transactions dataset, you might have columns such as "from_address" and "to_address" in one SFrame and similar columns in another SFrame. Using this deriver, you can aggregate data like total transaction amounts or the count of transactions per address, thus creating new, valuable features for analysis or machine learning models.

A feature can be derived in two ways, depending on the aggregation type: numeric aggregation and non-numeric aggregation.

Example

This part gives an example of each type: numeric and non-numeric aggregation.

Numeric Aggregations

This deriver aims to simplify and automate numeric aggregations within SFrames, particularly when you need to derive new features by grouping and summarizing data from one SFrame to enrich another.

from seshat.data_class import DFrame, GroupSFrame
from seshat.transformer.deriver import FeatureForAddressDeriver

data = {
    "from_address": ["foo", "bar", "baz", "foo"],
    "to_address": ["baz", "foo", "qux", "qux"],
    "amount": ["10", "5", "15", "16"],
    "token": ["token_1", "token_2", "token_3", "token_4"],
}
sf_default = DFrame.from_raw(data)
sf_address = DFrame.from_raw({"address": ["foo", "bar", "baz", "qux"]})
sf_input = GroupSFrame(children={"default": sf_default, "address": sf_address})
deriver = FeatureForAddressDeriver(
    value_col="amount",
    default_index_col="from_address",
    address_index_col="address",
    result_col="sent_amount",
    is_numeric=True,
    agg_func="sum",
)

sf_output = deriver(sf_input)
sf_output["address"].data

>>>
address sent_amount
0 foo 26.0
1 bar 5.0
2 baz 15.0
3 qux 0.0
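The numeric case can be pictured as a convert, group, aggregate, and join sequence. The following is a minimal pandas sketch under that assumption, not the library's actual code. The intermediate names (default_df, address_df, sent, result) are illustrative, and filling missing addresses with 0 is inferred from the output above.

```python
import pandas as pd

default_df = pd.DataFrame(
    {
        "from_address": ["foo", "bar", "baz", "foo"],
        "amount": ["10", "5", "15", "16"],  # stored as strings on purpose
    }
)
address_df = pd.DataFrame({"address": ["foo", "bar", "baz", "qux"]})

# Step 1: convert the value column to numeric (the role of is_numeric=True).
default_df["amount"] = pd.to_numeric(default_df["amount"])

# Step 2: aggregate the value column per sender (the role of agg_func="sum").
sent = (
    default_df.groupby("from_address")["amount"]
    .sum()
    .rename("sent_amount")
    .reset_index()
)

# Step 3: attach the aggregate to the address frame, keeping every address.
result = address_df.merge(
    sent, how="left", left_on="address", right_on="from_address"
).drop(columns="from_address")

# Addresses that never sent anything get 0 instead of NaN.
result["sent_amount"] = result["sent_amount"].fillna(0).astype(float)
print(result)
```

A left join is used so that addresses with no outgoing transactions, such as qux, still appear in the result.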

Non-numeric Aggregations

In addition to numeric aggregations, the deriver also supports non-numeric aggregations. These are particularly useful for transforming categorical or textual data into aggregated formats.

deriver = FeatureForAddressDeriver(
    value_col="token",
    default_index_col="from_address",
    address_index_col="address",
    result_col="sent_tokens",
    agg_func="unique",
    is_numeric=False,
)

sf_output = deriver(sf_input)
sf_output["address"].data
>>>
address sent_tokens
0 foo [token_1, token_4]
1 bar [token_2]
2 baz [token_3]
3 qux []
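A comparable pandas sketch for the non-numeric case is shown below. It is only an approximation of what the deriver might do: it groups the tokens by sender, collects the distinct values, and assumes that addresses without any match receive an empty list, as the output above suggests. The names used are illustrative.

```python
import pandas as pd

default_df = pd.DataFrame(
    {
        "from_address": ["foo", "bar", "baz", "foo"],
        "token": ["token_1", "token_2", "token_3", "token_4"],
    }
)
address_df = pd.DataFrame({"address": ["foo", "bar", "baz", "qux"]})

# Distinct tokens per sender; no numeric conversion is involved here.
tokens = default_df.groupby("from_address")["token"].unique()

# Look up each address; fall back to an empty list when it never sent a token.
result = address_df.copy()
result["sent_tokens"] = [list(tokens.get(a, [])) for a in result["address"]]
print(result)
```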

Operation on Columns Deriver

If you want to generate a new feature based on an operation on the same SFrame, you can use this deriver. This functionality is particularly useful when you need to perform calculations or transformations within the same dataset to create new, insightful features.

For example, assume you have an SFrame that includes columns for the amount each address has sent (amount_sent) and received (amount_received). If you want to know the total amount a user interacted with, which is the sum of the amounts sent and received, you can use this deriver. By applying an aggregation function such as sum or a custom operation, the deriver will compute the total interaction amount for each user and create a new column in the SFrame with this derived feature.

How it works

Imagine you have the following SFrame:

address    amount_sent  amount_received
address_1  100          50
address_2  200          100
address_3  150          75

You want to create a new column called total_interacted that represents the total amount each address has interacted with. Using this deriver, you would:

  1. Specify Columns for Operation:

    • Identify the columns amount_sent and amount_received to be used in the operation.
  2. Define the Operation:

    • Define the operation to sum these two columns. In this case, the operation is amount_sent + amount_received.
  3. Apply the Deriver:

    • The deriver will perform the specified operation on each row and create a new column total_interacted.

The resulting SFrame will be:

address    amount_sent  amount_received  total_interacted
address_1  100          50               150
address_2  200          100              300
address_3  150          75               225

You can change the operation through the agg_func argument. For example, if you multiply the columns instead of summing them, the result is:

address    amount_sent  amount_received  total_interacted
address_1  100          50               5000
address_2  200          100              20000
address_3  150          75               11250
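The row-wise arithmetic behind both tables can be checked with a short pandas sketch. This is an illustration of the operation only, assuming a plain row-wise sum and product over the chosen columns. It does not use Seshat, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "address": ["address_1", "address_2", "address_3"],
        "amount_sent": [100, 200, 150],
        "amount_received": [50, 100, 75],
    }
)
cols = ["amount_sent", "amount_received"]

# Row-wise sum of the selected columns: 150, 300, 225.
summed = df.assign(total_interacted=df[cols].sum(axis=1))

# Row-wise product of the same columns: 5000, 20000, 11250.
multiplied = df.assign(total_interacted=df[cols].prod(axis=1))

print(summed)
print(multiplied)
```

Only the reduction applied across the columns changes between the two results. The input columns and the name of the result column stay the same.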

Example

from seshat.data_class import DFrame
from seshat.transformer.deriver import OperationOnColsDeriver

sf_input = DFrame.from_raw(
    {
        "address": ["foo", "bar", "baz", "qux"],
        "sent_amount": [10, 20, 0, 15],
        "received_amount": [20, 10, 15, 0],
    }
)

deriver = OperationOnColsDeriver(
    cols=["sent_amount", "received_amount"],
    result_col="total_transactions_amount",
    agg_func="sum",
)
sf_output = deriver(sf_input)
sf_output.data

>>>
address sent_amount received_amount total_transactions_amount
0 foo 10 20 30
1 bar 20 10 30
2 baz 0 15 15
3 qux 15 0 15

The aggregation function applied by this deriver is set through the agg_func argument, so you can switch operations without changing anything else.


There are also other built-in derivers that you can find on this page.