Merger
The Merger transformer takes a GroupSFrame, selects two SFrames from it, and merges them either vertically or horizontally. This powerful tool is designed to facilitate the combination of data from different sources or different subsets of the same dataset, making it an essential component for data preparation and integration tasks.
Merging Types & How it works
As noted above, Merger can merge SFrames vertically or horizontally. This part explains both types and how to use them.
Vertically
By merging SFrames vertically, the Merger transformer appends rows from one SFrame to another, effectively stacking them. This is useful when you have separate datasets with the same structure (same columns) that you want to combine into a single dataset. For example, you might have transaction records from different months that you need to merge into a single continuous dataset for yearly analysis.
How it works
For vertical merging, the only thing you need to set is axis=0; everything else is handled by the defaults. Then, as with other transformers, set group_keys so that it matches the keys of your input GroupSFrame if they differ from the defaults, and call the merger with the input sf.
from seshat.data_class import DFrame, GroupSFrame
from seshat.transformer.merger import Merger
sf_1 = DFrame.from_raw(
{"address": ["address_1", "address_2", "address_3"], "feature_1": [1, 12, 20]}
)
sf_2 = DFrame.from_raw(
{"address": ["address_4", "address_5", "address_6"], "feature_2": [1, 2, 3]}
)
sf = GroupSFrame(children={"default": sf_1, "other": sf_2})
merger = Merger(axis=0)
sf = merger(sf)
sf["merged"].data
>>>
address feature_1 feature_2
0 address_1 1.0 NaN
1 address_2 12.0 NaN
2 address_3 20.0 NaN
3 address_4 NaN 1.0
4 address_5 NaN 2.0
5 address_6 NaN 3.0
As you can see, where the columns of the two SFrames do not match, the missing values in the output sf are filled with NaN.
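The same stacking behavior can be sketched in plain pandas. This assumes that DFrame wraps a pandas DataFrame, which the printed output suggests but the text above does not state; pd.concat is used here as a stand-in and is not necessarily what Merger calls internally.

```python
import pandas as pd

# Plain-pandas stand-ins for the two SFrames in the example above.
df_1 = pd.DataFrame(
    {"address": ["address_1", "address_2", "address_3"], "feature_1": [1, 12, 20]}
)
df_2 = pd.DataFrame(
    {"address": ["address_4", "address_5", "address_6"], "feature_2": [1, 2, 3]}
)

# Stack the rows; a column missing on one side is filled with NaN,
# which also turns the integer feature columns into floats.
stacked = pd.concat([df_1, df_2], axis=0, ignore_index=True)
print(stacked)
```

If the NaN values are unwanted, align the column names of the two inputs before merging.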
Horizontally
Merging SFrames horizontally involves aligning them side by side, based on a common key or simply by index, effectively adding columns from one SFrame to another. This approach is useful when you have different sets of features for the same entities. For instance, you might have one SFrame with addresses and one feature and another SFrame with a different feature for the same addresses, and you want to combine them into a single dataset for a comprehensive view of each address.
How it works
By default, axis is set to 1. There are several ways to configure the merger:
- If the key columns have the same names in both SFrames, you can simply use the on argument:
sf_left = DFrame.from_raw(
{"address": ["address_1", "address_2", "address_3"], "feature_1": [1, 12, 20]}
)
sf_right = DFrame.from_raw(
{"address": ["address_1", "address_2", "address_3"], "feature_2": [1, 2, 3]}
)
sf = GroupSFrame(children={"default": sf_left, "other": sf_right})
merger = Merger(axis=1, on="address")
sf = merger(sf)
sf["merged"].data
>>>
address feature_1 feature_2
0 address_1 1 1
1 address_2 12 2
2 address_3 20 3
- The on argument can also be a list of columns:
sf_left = DFrame.from_raw(
{
"address": ["address_1", "address_2", "address_3"],
"property": [10, 20, 30],
"feature_1": [1, 12, 20],
}
)
sf_right = DFrame.from_raw(
{
"address": ["address_1", "address_2", "address_3"],
"property": [10, 20, 30],
"feature_2": [1, 2, 3],
}
)
sf = GroupSFrame(children={"default": sf_left, "other": sf_right})
merger = Merger(axis=1, on=["address", "property"])
sf = merger(sf)
sf["merged"].data
>>>
address property feature_1 feature_2
0 address_1 10 1 1
1 address_2 20 12 2
2 address_3 30 20 3
- If the key columns have different names in the two SFrames, use the left_on and right_on arguments:
sf_left = DFrame.from_raw(
{
"left_address": ["address_1", "address_2", "address_3"],
"property": [10, 20, 30],
"feature_1": [1, 12, 20],
}
)
sf_right = DFrame.from_raw(
{
"right_address": ["address_1", "address_2", "address_3"],
"property": [10, 20, 30],
"feature_2": [1, 2, 3],
}
)
sf = GroupSFrame(children={"default": sf_left, "other": sf_right})
merger = Merger(
axis=1,
left_on=["left_address", "property"],
right_on=["right_address", "property"],
)
sf = merger(sf)
sf["merged"].data
>>>
left_address property feature_1 feature_2
0 address_1 10 1 1
1 address_2 20 12 2
2 address_3 30 20 3
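For readers who know pandas, the left_on / right_on case corresponds roughly to the following sketch. It assumes pandas-like join semantics; it is not taken from the Merger source. Note that plain pandas keeps both differently named key columns, so the sketch drops right_address by hand to get the column layout shown above.

```python
import pandas as pd

left = pd.DataFrame(
    {
        "left_address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_1": [1, 12, 20],
    }
)
right = pd.DataFrame(
    {
        "right_address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_2": [1, 2, 3],
    }
)

# Pair the key columns position by position: left_address with
# right_address, property with property. pandas defaults to an inner
# join; with fully matching keys every join type returns the same rows.
merged = left.merge(
    right,
    left_on=["left_address", "property"],
    right_on=["right_address", "property"],
)

# Drop the redundant right-hand key (an illustrative step, see above).
merged = merged.drop(columns=["right_address"])
print(merged)
```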
Other Options
There are some common options you can use to configure how the merger works:
Merge how
An important argument for mergers is merge_how, which can be one of "left", "right", "inner", "outer", or "cross".
For example, note the difference between merging with "left" and "right" in these two examples:
sf_left = DFrame.from_raw(
{
"address": ["address_1", "address_2", "address_3"],
"feature_1": [1, 12, 20],
}
)
sf_right = DFrame.from_raw(
{
"address": ["address_1", "address_4", "address_5"],
"feature_2": [1, 2, 3],
}
)
sf = GroupSFrame(children={"default": sf_left, "other": sf_right})
merger = Merger(
axis=1,
on="address",
merge_how="left",
)
sf = merger(sf)
sf["merged"].data
>>>
address feature_1 feature_2
0 address_1 1 1.0
1 address_2 12 NaN
2 address_3 20 NaN
And if you set merge_how to "right", you get this output:
merger = Merger(
axis=1,
on="address",
merge_how="right",
)
sf = merger(sf)
sf["merged"].data
>>>
address feature_1 feature_2
0 address_1 1.0 1
1 address_4 NaN 2
2 address_5 NaN 3
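To compare all five options at a glance, here is a small pandas sketch that counts the rows each join type produces for the two frames above. It assumes merge_how follows the same semantics as the how parameter of pandas merge, which the option names suggest but the text does not confirm. In pandas, a cross join takes no key, so it is handled separately; how Merger combines "cross" with on is not covered here.

```python
import pandas as pd

left = pd.DataFrame(
    {"address": ["address_1", "address_2", "address_3"], "feature_1": [1, 12, 20]}
)
right = pd.DataFrame(
    {"address": ["address_1", "address_4", "address_5"], "feature_2": [1, 2, 3]}
)

# Only address_1 appears on both sides.
row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    row_counts[how] = len(left.merge(right, on="address", how=how))

# A cross join pairs every left row with every right row.
row_counts["cross"] = len(left.merge(right, how="cross"))

print(row_counts)
# {'left': 3, 'right': 3, 'inner': 1, 'outer': 5, 'cross': 9}
```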
Schema for right sf
To merge only specific columns from the right SFrame, you can pass right_schema to the Merger transformer. This allows you to specify which columns from the right SFrame should be included in the final merged dataset. By using right_schema, you can control the structure and content of the resulting SFrame, ensuring that only the necessary data is included.
right_schema = Schema(cols=[Col("A"), Col("B_right")])
merger = Merger(
axis=1,
on="A",
merge_how="left",
right_schema=right_schema,
group_keys={"default": "default", "other": "the_other", "merged": "merged_sf"},
)
result_sf = merger(sf)
result_sf["merged_sf"].data
>>>
A B_left C_left B_right
0 foo 1 1 5
1 foo 1 1 8
2 bar 2 2 6
3 baz 3 3 7
4 foo 5 4 5
5 foo 5 4 8
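The input SFrames for this example are not shown. The sketch below uses hypothetical pandas inputs (including an invented C_right column) chosen so that a left merge reproduces the output above, and it models right_schema as a simple column selection on the right frame before merging. That is an assumption about the effect, not a description of how Schema and Col work internally.

```python
import pandas as pd

# Hypothetical inputs; C_right exists only to show a column being excluded.
left = pd.DataFrame(
    {"A": ["foo", "bar", "baz", "foo"], "B_left": [1, 2, 3, 5], "C_left": [1, 2, 3, 4]}
)
right = pd.DataFrame(
    {"A": ["foo", "bar", "baz", "foo"], "B_right": [5, 6, 7, 8], "C_right": [9, 9, 9, 9]}
)

# Keep only the columns the right-hand schema names, then merge.
right_cols = ["A", "B_right"]
merged = left.merge(right[right_cols], on="A", how="left")
print(merged)
```

Each "foo" row on the left matches two "foo" rows on the right, which is why the result has six rows rather than four.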
Merging inplace
The inplace parameter in the Merger transformer specifies whether the merged SFrame should replace the original SFrame or create a new one. When inplace is set to True, the merged SFrame replaces the original SFrame, making it the default dataset for subsequent operations. This is useful when you want to update your data in place without creating additional copies.
merger = Merger(
axis=1,
on="A",
merge_how="left",
inplace=True,
group_keys={"default": "default", "other": "the_other"},
)
result_sf = merger(sf)
list(result_sf.keys)
>>> ['default', 'the_other']
Dropping unmerged
The drop_unmerged parameter in the Merger transformer is used to control whether the default and other SFrames are dropped from the GroupSFrame after the merge. When drop_unmerged is set to True, both the default and other SFrames are removed from the GroupSFrame once the merge operation is complete. This helps in maintaining a clean dataset by ensuring that only the merged SFrame remains.
sf["sf_new"] = sf["default"]
merger = Merger(
axis=1,
on="A",
merge_how="left",
drop_unmerged=True,
group_keys={"default": "default", "other": "the_other"},
)
result_sf = merger(sf)
list(result_sf.keys)
>>> ['sf_new', 'merged']
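The key bookkeeping of inplace and drop_unmerged can be modeled with an ordinary dict standing in for the GroupSFrame. The function place_result below is hypothetical and purely illustrative. How the two flags interact when both are set is not described above, so this model simply lets inplace take precedence.

```python
def place_result(children, merged, default="default", other="other",
                 merged_key="merged", inplace=False, drop_unmerged=False):
    """Return the group's children after a merge (illustrative model only)."""
    result = dict(children)
    if inplace:
        # The merged frame takes over the default slot; no new key is added.
        result[default] = merged
        return result
    if drop_unmerged:
        # Remove both merge inputs; unrelated children are kept.
        result.pop(default, None)
        result.pop(other, None)
    result[merged_key] = merged
    return result


group = {"default": "left frame", "the_other": "right frame"}
after_inplace = place_result(group, "merged frame", other="the_other", inplace=True)
print(list(after_inplace))  # ['default', 'the_other']

group["sf_new"] = group["default"]
after_drop = place_result(group, "merged frame", other="the_other", drop_unmerged=True)
print(list(after_drop))  # ['sf_new', 'merged']
```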
List Merger
ListMerger takes a list of SFrames, grouped or non-grouped, and returns a single GroupSFrame that contains all of the input SFrames.
How it works
This merger first creates an empty GroupSFrame, then for each SFrame in the input list:
- If the SFrame is non-grouped, it is added to the group under the sf_prefix argument plus the index of that SFrame in the input list (for example, sf1).
- If the SFrame is grouped, each child is added to the result group under the sf_prefix argument plus the index of that element plus the key of the child (for example, sf0_default).
Example
from seshat.data_class import DFrame, GroupSFrame
from seshat.transformer.merger.base import ListMerger
some_sf = DFrame.from_raw(None)
sf_list = [GroupSFrame(children={"default": some_sf, "address": some_sf}), some_sf]
merger = ListMerger(sf_prefix="sf")
sf = merger(sf_list)
list(sf.keys)
>>>
['sf0_default', 'sf0_address', 'sf1']
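The naming rule described above can be modeled in a few lines of plain Python. In this sketch a dict stands in for a GroupSFrame and any other object for a non-grouped SFrame. The function flatten_names is hypothetical, and the underscore separator is inferred from the example output rather than from the ListMerger source.

```python
def flatten_names(items, sf_prefix="sf"):
    """Flatten a list of frames and groups into one dict of named frames."""
    result = {}
    for index, item in enumerate(items):
        if isinstance(item, dict):
            # Grouped input: one entry per child, named prefix + index + "_" + key.
            for child_key, child in item.items():
                result[f"{sf_prefix}{index}_{child_key}"] = child
        else:
            # Non-grouped input: a single entry named prefix + index.
            result[f"{sf_prefix}{index}"] = item
    return result


frame = object()  # placeholder for an SFrame
names = list(flatten_names([{"default": frame, "address": frame}, frame]))
print(names)  # ['sf0_default', 'sf0_address', 'sf1']
```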
Multi Merger
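The MultiMerger transformer applies multiple mergers to the input SFrame. It takes a list of mergers and applies each of them in sequence. A later merger can consume the result of an earlier one through group_keys, as the second merger in the example below does with "merged".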
Example
sf_left = DFrame.from_raw(
{
"left_address": ["address_1", "address_2", "address_3"],
"property": [10, 20, 30],
"feature_1": [1, 12, 20],
}
)
sf_right = DFrame.from_raw(
{
"right_address": ["address_1", "address_2", "address_3"],
"property": [10, 20, 30],
"feature_2": [1, 2, 3],
}
)
sf_bottom = DFrame.from_raw(
{
"left_address": ["address_4", "address_5", "address_6"],
"property": [40, 50, 60],
"feature_3": [5, 6, 7],
}
)
sf = GroupSFrame(children={"default": sf_left, "other": sf_right, "bottom": sf_bottom})
merger = MultiMerger(
mergers=[
Merger(
axis=1,
left_on=["left_address", "property"],
right_on=["right_address", "property"],
),
Merger(axis=0, group_keys={"default": "merged", "other": "bottom"}),
]
)
sf = merger(sf)
sf["merged"].data
>>>
left_address property feature_1 feature_2 feature_3
0 address_1 10 1.0 1.0 NaN
1 address_2 20 12.0 2.0 NaN
2 address_3 30 20.0 3.0 NaN
3 address_4 40 NaN NaN 5.0
4 address_5 50 NaN NaN 6.0
5 address_6 60 NaN NaN 7.0
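The two steps above can be traced by hand in plain pandas: first a horizontal merge, then a vertical merge of that result with the third frame. This is a sketch assuming pandas-like semantics, with the right-hand key column dropped manually; it is not how MultiMerger is implemented.

```python
import pandas as pd

left = pd.DataFrame(
    {
        "left_address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_1": [1, 12, 20],
    }
)
right = pd.DataFrame(
    {
        "right_address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_2": [1, 2, 3],
    }
)
bottom = pd.DataFrame(
    {
        "left_address": ["address_4", "address_5", "address_6"],
        "property": [40, 50, 60],
        "feature_3": [5, 6, 7],
    }
)

# Step 1: horizontal merge on the paired key columns.
merged = left.merge(
    right,
    left_on=["left_address", "property"],
    right_on=["right_address", "property"],
).drop(columns=["right_address"])

# Step 2: vertical merge of the step-1 result with the third frame.
result = pd.concat([merged, bottom], axis=0, ignore_index=True)
print(result)
```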