Other Trimmers
The trimmer introduction covered how trimmers are implemented and walked through several useful ones. This section covers the remaining trimmers.
Duplicate Trimmer
The duplicate trimmer removes duplicate rows from the input SFrame based on a specified subset of columns. By identifying and retaining only unique rows according to the selected columns, this trimmer ensures that your dataset is free from redundancy.
Example
from seshat.data_class import DFrame
from seshat.transformer.trimmer.base import DuplicateTrimmer
sf = DFrame.from_raw(
{
"address": ["address_1", "address_1", "address_1"],
"feature": [1, 2, 1],
}
)
trimmer = DuplicateTrimmer()
sf = trimmer(sf)
sf.data
>>>
address feature
0 address_1 1
1 address_1 2
If no subset is given, all columns are used.
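To make the effect of the subset concrete, here is a minimal sketch in plain pandas. It is not seshat's implementation; it only assumes that the trimmer behaves like pandas' drop_duplicates on the underlying data, which matches the output shown above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "address": ["address_1", "address_1", "address_1"],
        "feature": [1, 2, 1],
    }
)

# No subset: a row counts as a duplicate only if every column
# matches an earlier row, so the third row is dropped.
all_columns = df.drop_duplicates()

# subset=["address"]: rows are compared on "address" alone,
# so only the first row for each address is kept.
by_address = df.drop_duplicates(subset=["address"])

print(all_columns)
print(by_address)
```

With all columns, two rows remain; restricting the comparison to address leaves a single row.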
NaN Trimmer
This trimmer is similar to the Duplicate Trimmer, but it drops the rows that contain NaN values in the given subset of columns. It relies on the dropna methods of pandas and PySpark.
Example
import pandas as pd
sf = DFrame.from_raw(
{
"address": ["address_1", "address_2", "address_3"],
"feature": [1, 2, pd.NA],
}
)
trimmer = NaNTrimmer(subset=["feature"])
sf = trimmer(sf)
sf.data
>>>
address feature
0 address_1 1
1 address_2 2
Like the Duplicate Trimmer, the default value of the subset is all the columns.
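Since the trimmer relies on dropna, the difference between a subset and the all-columns default can be sketched directly in pandas. This is an illustration on a plain DataFrame, not seshat code, and the missing address value is added here only to show the contrast.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "address": ["address_1", None, "address_3"],
        "feature": [1, 2, pd.NA],
    }
)

# Only "feature" is checked, so the row with a missing address survives.
only_feature = df.dropna(subset=["feature"])

# No subset: a missing value in any column drops the row.
any_column = df.dropna()

print(only_feature)
print(any_column)
```

Checking only feature keeps the first two rows, while the default keeps just the first.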
Group Trimmer
The Group Trimmer is designed to trim (remove) a specific SFrame from an input GroupSFrame. This trimmer is useful when you have SFrames within the GroupSFrame that are no longer needed and you want to remove them to clean up your dataset.
To choose the SFrame to remove, pass its key as the value of the default key in group_keys.
Example
some_sf = DFrame.from_raw(None)
sf = GroupSFrame(children={"default": some_sf, "address": some_sf})
trimmer = GroupTrimmer(group_keys={"default": "address"})
sf = trimmer(sf)
list(sf.keys)
>>>
['default']
Inclusion Trimmer
The Inclusion Trimmer filters the rows of the default SFrame according to whether their values appear in the other SFrame within a GroupSFrame. By default, it drops the rows whose values exist in the other SFrame. This is useful when you want to filter the main dataset based on the presence of its rows in another dataset.
If you instead want to drop the rows that do NOT exist in the other SFrame, keeping only the ones that do, set the exclude attribute to false.
Example
sf_default = DFrame.from_raw({"address": ["address_1", "address_2", "address_3"]})
sf_other = DFrame.from_raw({"address": ["address_1"]})
sf = GroupSFrame(children={"default": sf_default, "other": sf_other})
sf = InclusionTrimmer(default_col="address", other_col="address")(sf)
sf["default"].data
>>>
address
0 address_2
1 address_3
If you set the exclude attribute to false, only the rows that exist in the other SFrame are kept:
sf_default = DFrame.from_raw({"address": ["address_1", "address_2", "address_3"]})
sf_other = DFrame.from_raw({"address": ["address_1"]})
sf = GroupSFrame(children={"default": sf_default, "other": sf_other})
sf = InclusionTrimmer(default_col="address", other_col="address", exclude=False)(sf)
sf["default"].data
>>>
address
0 address_1
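The two modes can be reproduced with a membership mask in plain pandas. This is a sketch of the filtering logic only, under the assumption that the comparison is a simple value match between the two columns; it does not show how seshat implements the trimmer.

```python
import pandas as pd

default = pd.DataFrame({"address": ["address_1", "address_2", "address_3"]})
other = pd.DataFrame({"address": ["address_1"]})

# True for each row of `default` whose address also appears in `other`.
in_other = default["address"].isin(other["address"])

# Corresponds to the first example: rows found in `other` are dropped.
excluded = default[~in_other].reset_index(drop=True)

# Corresponds to exclude=False: only rows found in `other` are kept.
included = default[in_other].reset_index(drop=True)

print(excluded)
print(included)
```

The first result holds address_2 and address_3, and the second holds address_1, matching the two outputs above.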
By understanding and using these additional trimmers, you can further refine and clean your datasets, ensuring they are well-prepared for analysis and processing.