Introduction To Derivers
The deriver is a transformer designed to derive new data or features from an existing dataset. Similar to other transformers, the deriver utilizes group keys, retrieving these keys from the input data and passing their raw values to the handlers for processing.
Derivers can work on both group SFrames and single SFrames, although some derivers work only on group SFrames. In either case, their handler functions typically require the raw data. This flexibility makes it possible to generate a wide variety of new features or data points from the existing dataset.
How it works
For pandas dataframes, the HANDLER_NAME used is derive, and the corresponding implementation is derive_df. This function applies the derive process to pandas dataframes, so users can transform and enrich their datasets within the familiar pandas framework.
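The relationship between the handler name and its implementation can be pictured with a small toy class. This is only a sketch of name-based dispatch under stated assumptions: ToyDeriver, its __call__ logic, the "_df" suffix lookup, and the list-of-dicts input are hypothetical stand-ins, not Seshat's actual classes or internals.

```python
class ToyDeriver:
    """Hypothetical stand-in for a deriver; this is not a Seshat class."""

    HANDLER_NAME = "derive"

    def __call__(self, rows):
        # Assumed convention: handler name + "_df" selects the implementation.
        handler = getattr(self, f"{self.HANDLER_NAME}_df")
        return handler(rows)

    def derive_df(self, rows):
        # Derive a new field from existing ones without mutating the input rows.
        return [{**row, "total": row["sent"] + row["received"]} for row in rows]


rows = [{"sent": 10, "received": 20}, {"sent": 5, "received": 0}]
derived = ToyDeriver()(rows)
print(derived)
```

The point of the sketch is only the naming link: a class-level handler name resolves to a method that does the deriving.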
SFrame from Columns
One of the key features of Seshat is the SFrame and its ability to be grouped. The deriver in Seshat is designed to create a new SFrame from columns of an existing SFrame. For example, assume you load a simple transactions dataset that contains from_address and to_address columns, and you want to create a new SFrame with all the values of these columns. In this case, the deriver is used.
How it works
The deriver extracts the specified columns from the input SFrame and consolidates their values into a new SFrame. This functionality is particularly useful for isolating specific data points or preparing data for further analysis. By creating a new SFrame with the desired columns, the data processing workflow is streamlined, allowing for a focus on relevant attributes.
This process begins by identifying group keys, which are retrieved from the input SFrame. These raw values are passed to handler functions that process the data according to the specified logic. The flexibility to handle both group and single SFrames allows the deriver to adapt to various data structures and requirements.
Example
from seshat.data_class import DFrame
from seshat.transformer.deriver import SFrameFromColsDeriver
data = {"from_address": ["zero", "bar", "baz"], "to_address": ["baz", "zero", "qux"]}
sf_input = DFrame.from_raw(data)
sf_output = SFrameFromColsDeriver(
    group_keys={"default": "default", "address": "address"},
    result_col="address",
    cols=["from_address", "to_address"],
)(sf_input)
sf_output["address"].data
>>>
address
0 zero
1 baz
2 bar
3 qux
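To make the consolidation step concrete, here is a minimal plain-Python sketch of the same idea, independent of Seshat. The function name values_from_columns is hypothetical, and both the deduplication and the row-by-row ordering are assumptions inferred from the output shown above rather than documented behavior.

```python
def values_from_columns(table, cols):
    """Collect the distinct values found in `cols`, walking the table row by row."""
    n_rows = len(table[cols[0]])
    seen = {}  # a dict preserves first-seen order, so it works as an ordered set
    for i in range(n_rows):
        for col in cols:
            seen.setdefault(table[col][i], None)
    return list(seen)


data = {"from_address": ["zero", "bar", "baz"], "to_address": ["baz", "zero", "qux"]}
addresses = values_from_columns(data, ["from_address", "to_address"])
print(addresses)  # ['zero', 'baz', 'bar', 'qux']
```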
Feature for Address
Assume you want to derive a feature for one SFrame from another SFrame. This deriver is built specifically for that purpose. The two SFrames have corresponding index columns, each of which can be a single column or a list of columns, such as "address" or ["contract_address", "token"]. Deriving the new feature requires an aggregation function and the specific columns from which to derive it. All of these parameters can be set in this deriver.
The ability to handle both single columns and lists of columns adds to the flexibility of this deriver. By specifying the necessary columns and the aggregation function, you can tailor the feature derivation process to fit the specific needs of your dataset. This flexibility and ease of use make the deriver a powerful tool in the data preparation toolkit, ensuring that you can efficiently create meaningful and useful features from your existing data. The automatic conversion to numeric types further enhances its utility, allowing for seamless integration into various analytical and modeling workflows.
How it works
If your columns need to be numeric and you are not sure whether the column types are set correctly, this is handled for you. The is_numeric parameter is set to True by default, so the column types are converted to numeric automatically. This ensures the data is correctly typed for numerical operations and avoids problems caused by inconsistent data types.
This deriver is especially useful in scenarios where data transformation and feature engineering are essential. For example, in a financial transactions dataset, you might have columns such as "from_address" and "to_address" in one SFrame and similar columns in another SFrame. Using this deriver, you can aggregate data like total transaction amounts or the count of transactions per address, thus creating new, valuable features for analysis or machine learning models.
There are two types of deriving a feature based on the aggregation type: numeric aggregation and non-numeric aggregation.
Example
This section gives examples of both numeric and non-numeric aggregation.
Numeric Aggregations
The goal of this deriver is to simplify and automate numeric aggregations within SFrames, particularly when you need to derive new features by grouping and summarizing data from one SFrame to create a new SFrame.
from seshat.data_class import DFrame, GroupSFrame
from seshat.transformer.deriver import FeatureForAddressDeriver
data = {
    "from_address": ["foo", "bar", "baz", "foo"],
    "to_address": ["baz", "foo", "qux", "qux"],
    "amount": ["10", "5", "15", "16"],
    "token": ["token_1", "token_2", "token_3", "token_4"],
}
sf_default = DFrame.from_raw(data)
sf_address = DFrame.from_raw({"address": ["foo", "bar", "baz", "qux"]})
sf_input = GroupSFrame(children={"default": sf_default, "address": sf_address})
deriver = FeatureForAddressDeriver(
    value_col="amount",
    default_index_col="from_address",
    address_index_col="address",
    result_col="sent_amount",
    is_numeric=True,
    agg_func="sum",
)
sf_output = deriver(sf_input)
sf_output["address"].data
>>>
address sent_amount
0 foo 26.0
1 bar 5.0
2 baz 15.0
3 qux 0.0
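The result above can be reproduced by hand with a short plain-Python sketch. It is not Seshat's implementation: the helper name sum_per_address is hypothetical, and treating addresses without any matching rows as 0.0 is an assumption taken from the output shown above.

```python
def sum_per_address(rows, index_col, value_col, addresses):
    """Sum `value_col` per `index_col`; addresses with no rows keep 0.0."""
    totals = {address: 0.0 for address in addresses}
    for row in rows:
        key = row[index_col]
        if key in totals:
            # Amounts arrive as strings, so convert before adding
            # (this mirrors the numeric conversion step).
            totals[key] += float(row[value_col])
    return totals


rows = [
    {"from_address": "foo", "amount": "10"},
    {"from_address": "bar", "amount": "5"},
    {"from_address": "baz", "amount": "15"},
    {"from_address": "foo", "amount": "16"},
]
sent_amount = sum_per_address(rows, "from_address", "amount", ["foo", "bar", "baz", "qux"])
print(sent_amount)  # {'foo': 26.0, 'bar': 5.0, 'baz': 15.0, 'qux': 0.0}
```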
Non-numeric Aggregations
In addition to numeric aggregations, the deriver also supports non-numeric aggregations. These are particularly useful for transforming categorical or textual data into aggregated formats.
deriver = FeatureForAddressDeriver(
    value_col="token",
    default_index_col="from_address",
    address_index_col="address",
    result_col="sent_tokens",
    agg_func="unique",
    is_numeric=False,
)
sf_output = deriver(sf_input)
sf_output["address"].data
>>>
address sent_tokens
0 foo [token_1, token_4]
1 bar [token_2]
2 baz [token_3]
3 qux []
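A non-numeric aggregation follows the same grouping pattern, only the combining step differs. The sketch below is plain Python rather than Seshat code; unique_per_address is a hypothetical helper, and the empty list for addresses without matching rows is an assumption based on the output shown above.

```python
def unique_per_address(rows, index_col, value_col, addresses):
    """Collect the distinct `value_col` entries per `index_col`, in first-seen order."""
    unique = {address: [] for address in addresses}
    for row in rows:
        bucket = unique.get(row[index_col])
        if bucket is not None and row[value_col] not in bucket:
            bucket.append(row[value_col])
    return unique


rows = [
    {"from_address": "foo", "token": "token_1"},
    {"from_address": "bar", "token": "token_2"},
    {"from_address": "baz", "token": "token_3"},
    {"from_address": "foo", "token": "token_4"},
]
sent_tokens = unique_per_address(rows, "from_address", "token", ["foo", "bar", "baz", "qux"])
print(sent_tokens)
# {'foo': ['token_1', 'token_4'], 'bar': ['token_2'], 'baz': ['token_3'], 'qux': []}
```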
Operation on Columns Deriver
If you want to generate a new feature based on an operation on the same SFrame, you can use this deriver. This functionality is particularly useful when you need to perform calculations or transformations within the same dataset to create new, insightful features.
For example, assume you have an SFrame that includes columns for the amount each address has sent (amount_sent) and received (amount_received). If you want to know the total amount a user interacted with, which is the sum of the amounts sent and received, you can use this deriver. By applying an aggregation function such as sum or a custom operation, the deriver will compute the total interaction amount for each user and create a new column in the SFrame with this derived feature.
How it works
Imagine you have the following SFrame:
address | amount_sent | amount_received |
---|---|---|
address_1 | 100 | 50 |
address_2 | 200 | 100 |
address_3 | 150 | 75 |
You want to create a new column called total_interacted that represents the total amount each address has interacted with. Using this deriver, you would:
- Specify Columns for Operation:
  - Identify the columns amount_sent and amount_received to be used in the operation.
- Define the Operation:
  - Define the operation to sum these two columns. In this case, the operation is amount_sent + amount_received.
- Apply the Deriver:
  - The deriver will perform the specified operation on each row and create a new column total_interacted.
The resulting SFrame will be:
address | amount_sent | amount_received | total_interacted |
---|---|---|---|
address_1 | 100 | 50 | 150 |
address_2 | 200 | 100 | 300 |
address_3 | 150 | 75 | 225 |
You can change the aggregation function with the agg_func argument. For example, if you multiply the columns instead, the result is:
address | amount_sent | amount_received | total_interacted |
---|---|---|---|
address_1 | 100 | 50 | 5000 |
address_2 | 200 | 100 | 20000 |
address_3 | 150 | 75 | 11250 |
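Both tables above can be reproduced with a short row-wise sketch in plain Python. This is an illustration only, not Seshat's implementation: operate_on_columns and the OPERATIONS mapping are hypothetical, and in particular the "mul" key is an invented label, since this page does not state which agg_func value selects multiplication.

```python
from functools import reduce
import operator

# Hypothetical operation names for this sketch only.
OPERATIONS = {"sum": operator.add, "mul": operator.mul}


def operate_on_columns(rows, cols, result_col, op):
    """Combine the values of `cols` in each row and store the result in `result_col`."""
    combine = OPERATIONS[op]
    return [
        {**row, result_col: reduce(combine, (row[col] for col in cols))}
        for row in rows
    ]


rows = [
    {"address": "address_1", "amount_sent": 100, "amount_received": 50},
    {"address": "address_2", "amount_sent": 200, "amount_received": 100},
    {"address": "address_3", "amount_sent": 150, "amount_received": 75},
]
cols = ["amount_sent", "amount_received"]
summed = operate_on_columns(rows, cols, "total_interacted", "sum")
multiplied = operate_on_columns(rows, cols, "total_interacted", "mul")
print([row["total_interacted"] for row in summed])      # [150, 300, 225]
print([row["total_interacted"] for row in multiplied])  # [5000, 20000, 11250]
```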
Example
from seshat.data_class import DFrame
from seshat.transformer.deriver import OperationOnColsDeriver
sf_input = DFrame.from_raw(
    {
        "address": ["foo", "bar", "baz", "qux"],
        "sent_amount": [10, 20, 0, 15],
        "received_amount": [20, 10, 15, 0],
    }
)
deriver = OperationOnColsDeriver(
    cols=["sent_amount", "received_amount"],
    result_col="total_transactions_amount",
    agg_func="sum",
)
sf_output = deriver(sf_input)
sf_output.data
>>>
address sent_amount received_amount total_transactions_amount
0 foo 10 20 30
1 bar 20 10 30
2 baz 0 15 15
3 qux 15 0 15
In this deriver, the aggregation function can be changed with the agg_func argument.
There are also other built-in derivers that you can find on this page.