Other Trimmers

The trimmer introduction covered how trimmers are implemented and walked through several commonly used ones. This section covers the remaining trimmers: the Duplicate, NaN, Group, and Inclusion trimmers.

Duplicate Trimmer

The Duplicate Trimmer removes duplicate rows from the input SFrame. Rows are compared on a specified subset of columns, and only rows that are unique according to that subset are kept.

Example

from seshat.data_class import DFrame
from seshat.transformer.trimmer.base import DuplicateTrimmer

sf = DFrame.from_raw(
    {
        "address": ["address_1", "address_1", "address_1"],
        "feature": [1, 2, 1],
    }
)

trimmer = DuplicateTrimmer()
sf = trimmer(sf)
sf.data
>>>
     address  feature
0  address_1        1
1  address_1        2

By default, the subset covers all columns, so only rows that are identical in every column are treated as duplicates.
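To make the effect of the subset concrete, here is a minimal plain-Python sketch of the same idea. It is not the library's implementation: `drop_duplicates` is a hypothetical helper, rows are modeled as dictionaries, and keeping the first occurrence of each duplicate is an assumption based on the output above.

```python
def drop_duplicates(rows, subset=None):
    """Keep the first row seen for each distinct combination of subset values."""
    seen = set()
    kept = []
    for row in rows:
        # With no subset, compare on every column (the documented default).
        columns = subset if subset is not None else list(row)
        key = tuple(row[col] for col in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"address": "address_1", "feature": 1},
    {"address": "address_1", "feature": 2},
    {"address": "address_1", "feature": 1},
]

# Comparing on all columns removes only the exact repeat (the third row).
all_columns = drop_duplicates(rows)

# Comparing on "address" alone treats all three rows as duplicates.
address_only = drop_duplicates(rows, subset=["address"])

print(all_columns)
print(address_only)
```

Narrowing the subset makes the trimmer more aggressive: with only `address` in the subset, a single row survives even though the `feature` values differ.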

NaN Trimmer

Like the Duplicate Trimmer, the NaN Trimmer works on a subset of columns, but instead of removing duplicates it drops the rows that have NaN values in those columns. It relies on the dropna methods of pandas and PySpark.

Example

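import pandas as pd

from seshat.data_class import DFrame
# Assumed import path: NaNTrimmer is taken to live next to DuplicateTrimmer.
from seshat.transformer.trimmer.base import NaNTrimmer

sf = DFrame.from_raw(
    {
        "address": ["address_1", "address_2", "address_3"],
        "feature": [1, 2, pd.NA],
    }
)

trimmer = NaNTrimmer(subset=["feature"])
sf = trimmer(sf)
sf.data
>>>
     address feature
0  address_1       1
1  address_2       2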

As with the Duplicate Trimmer, the subset defaults to all columns.
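The difference between a restricted subset and the default can be sketched in plain Python. This is an illustration, not the library's code: `is_missing` and `drop_missing` are hypothetical helpers, and dropping a row when any subset column is missing is an assumption that mirrors the default behavior of pandas `dropna`.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing(rows, subset=None):
    """Drop every row that has a missing value in any of the subset columns."""
    kept = []
    for row in rows:
        # With no subset, check every column (the documented default).
        columns = subset if subset is not None else list(row)
        if not any(is_missing(row[col]) for col in columns):
            kept.append(row)
    return kept


rows = [
    {"address": "address_1", "feature": 1.0},
    {"address": None, "feature": 2.0},
    {"address": "address_3", "feature": float("nan")},
]

# Only "feature" is checked, so the row with a missing address survives.
feature_only = drop_missing(rows, subset=["feature"])

# Every column is checked, so both incomplete rows are dropped.
all_columns = drop_missing(rows)

print(feature_only)
print(all_columns)
```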

Group Trimmer

The Group Trimmer removes a specific SFrame from an input GroupSFrame. It is useful when the GroupSFrame contains SFrames that are no longer needed and you want to drop them.

The SFrame to remove is selected through the default key of group_keys: its value is the key of the child SFrame to drop.

Example

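# Assumed import paths, following the DuplicateTrimmer example above.
from seshat.data_class import DFrame, GroupSFrame
from seshat.transformer.trimmer.base import GroupTrimmer

some_sf = DFrame.from_raw(None)
sf = GroupSFrame(children={"default": some_sf, "address": some_sf})

# The value under "default" names the child SFrame to remove.
trimmer = GroupTrimmer(group_keys={"default": "address"})
sf = trimmer(sf)
list(sf.keys)
>>>
['default']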
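Conceptually, the operation above is the removal of one entry from a keyed collection. The following plain-Python sketch models the group as a dictionary. `trim_group` is a hypothetical helper, and returning a new mapping rather than mutating the input is an assumption, not a statement about how GroupSFrame behaves.

```python
def trim_group(children, group_keys):
    """Return a copy of children without the entry named by group_keys["default"]."""
    target = group_keys["default"]
    return {key: frame for key, frame in children.items() if key != target}


children = {
    "default": [{"address": "address_1"}],
    "address": [{"address": "address_2"}],
}

trimmed = trim_group(children, group_keys={"default": "address"})

print(list(trimmed))
```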

Inclusion Trimmer

The Inclusion Trimmer filters the rows of the default SFrame according to whether they also appear in the other SFrame within a GroupSFrame. The default_col and other_col arguments name the columns to compare. In the first example below, which uses the default settings, the trimmer drops the rows of the default SFrame that exist in the other SFrame. This is useful when you want to filter the main dataset based on the presence of its rows in another dataset.

To reverse this and drop the rows that do NOT exist in the other SFrame, set the exclude attribute to False, as in the second example.

Example

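# Assumed import paths, following the DuplicateTrimmer example above.
from seshat.data_class import DFrame, GroupSFrame
from seshat.transformer.trimmer.base import InclusionTrimmer

sf_default = DFrame.from_raw({"address": ["address_1", "address_2", "address_3"]})
sf_other = DFrame.from_raw({"address": ["address_1"]})

sf = GroupSFrame(children={"default": sf_default, "other": sf_other})

sf = InclusionTrimmer(default_col="address", other_col="address")(sf)
sf["default"].data
>>>
     address
0  address_2
1  address_3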

With exclude set to False, only the rows that also exist in the other SFrame are kept:

sf_default = DFrame.from_raw({"address": ["address_1", "address_2", "address_3"]})
sf_other = DFrame.from_raw({"address": ["address_1"]})

sf = GroupSFrame(children={"default": sf_default, "other": sf_other})

sf = InclusionTrimmer(default_col="address", other_col="address", exclude=False)(sf)
sf["default"].data
>>>
address
0 address_1
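The two outputs above correspond to a membership test against the values of the other SFrame. Here is a minimal plain-Python sketch of that logic. `inclusion_trim` is a hypothetical helper, and treating `exclude=True` as the default is an assumption inferred from the first example, not something confirmed from the library source.

```python
def inclusion_trim(default_rows, other_rows, default_col, other_col, exclude=True):
    """Filter default_rows by whether the default_col value appears in other_rows."""
    other_values = {row[other_col] for row in other_rows}
    if exclude:
        # Drop the rows whose value is present in the other collection.
        return [row for row in default_rows if row[default_col] not in other_values]
    # Keep only the rows whose value is present in the other collection.
    return [row for row in default_rows if row[default_col] in other_values]


default_rows = [{"address": a} for a in ("address_1", "address_2", "address_3")]
other_rows = [{"address": "address_1"}]

excluded = inclusion_trim(default_rows, other_rows, "address", "address")
included = inclusion_trim(default_rows, other_rows, "address", "address", exclude=False)

print(excluded)
print(included)
```

Building a set of the other values first keeps each membership check cheap, which matters when the other collection is large.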

Together, these trimmers cover removing duplicate rows, dropping rows with NaN values, removing unneeded SFrames from a group, and filtering one SFrame against another, so that your datasets are ready for analysis and processing.