
Merger

The Merger transformer takes a GroupSFrame, selects two SFrames from it, and merges them either vertically or horizontally. This powerful tool is designed to facilitate the combination of data from different sources or different subsets of the same dataset, making it an essential component for data preparation and integration tasks.

Merging Types & How it works

As noted above, the Merger can combine SFrames vertically or horizontally. This section explains both modes and how to use them.

Vertically

By merging SFrames vertically, the Merger transformer appends rows from one SFrame to another, effectively stacking them. This is useful when you have separate datasets with the same structure (same columns) that you want to combine into a single dataset. For example, you might have transaction records from different months that you need to merge into a single continuous dataset for yearly analysis.

How it works

For a vertical merge, the only required setting is axis=0; everything else is handled by default. As with other transformers, set group_keys to match the keys in your input GroupSFrame, then call the merger with the input sf.

from seshat.data_class import DFrame, GroupSFrame
from seshat.transformer.merger import Merger

sf_1 = DFrame.from_raw(
    {"address": ["address_1", "address_2", "address_3"], "feature_1": [1, 12, 20]}
)

sf_2 = DFrame.from_raw(
    {"address": ["address_4", "address_5", "address_6"], "feature_2": [1, 2, 3]}
)


sf = GroupSFrame(children={"default": sf_1, "other": sf_2})

merger = Merger(axis=0)
sf = merger(sf)
sf["merged"].data
>>>
     address  feature_1  feature_2
0  address_1        1.0        NaN
1  address_2       12.0        NaN
2  address_3       20.0        NaN
3  address_4        NaN        1.0
4  address_5        NaN        2.0
5  address_6        NaN        3.0

As you can see, where the columns do not match, the output sf contains NaN for the missing values.
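The NaN-filling behaviour can be reproduced with plain pandas. This is only a sketch under the assumption that a vertical merge behaves like `pandas.concat` along axis 0, which the output above suggests but the text does not state; the `df_1` and `df_2` names are stand-ins for the two SFrames.

```python
import pandas as pd

# Plain-pandas stand-ins for the two SFrames in the example above.
df_1 = pd.DataFrame(
    {"address": ["address_1", "address_2", "address_3"], "feature_1": [1, 12, 20]}
)
df_2 = pd.DataFrame(
    {"address": ["address_4", "address_5", "address_6"], "feature_2": [1, 2, 3]}
)

# Stack the rows; ignore_index=True renumbers them 0..5.
stacked = pd.concat([df_1, df_2], axis=0, ignore_index=True)

# A column missing from one input is filled with NaN, which also turns
# the integer feature columns into floats.
print(stacked)
```

If the NaN values are unwanted, aligning the column names of the inputs before merging avoids them.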

Horizontally

Merging SFrames horizontally involves aligning them side by side, based on a common key or simply by index, effectively adding columns from one SFrame to another. This approach is useful when you have different sets of features for the same entities. For instance, you might have one SFrame with addresses and one feature, and another SFrame with a different feature for the same addresses, and you want to combine them into a single dataset for a comprehensive view of each address.

How it works

By default, axis is set to 1. There are several ways to configure the merger:

  • If the key columns have the same name in both SFrames, you can simply use the on argument:

    sf_left = DFrame.from_raw(
        {"address": ["address_1", "address_2", "address_3"], "feature_1": [1, 12, 20]}
    )

    sf_right = DFrame.from_raw(
        {"address": ["address_1", "address_2", "address_3"], "feature_2": [1, 2, 3]}
    )


    sf = GroupSFrame(children={"default": sf_left, "other": sf_right})

    merger = Merger(axis=1, on="address")
    sf = merger(sf)
    sf["merged"].data
    >>>
         address  feature_1  feature_2
    0  address_1          1          1
    1  address_2         12          2
    2  address_3         20          3
  • The on argument can also be a list of columns:

    sf_left = DFrame.from_raw(
        {
            "address": ["address_1", "address_2", "address_3"],
            "property": [10, 20, 30],
            "feature_1": [1, 12, 20],
        }
    )

    sf_right = DFrame.from_raw(
        {
            "address": ["address_1", "address_2", "address_3"],
            "property": [10, 20, 30],
            "feature_2": [1, 2, 3],
        }
    )


    sf = GroupSFrame(children={"default": sf_left, "other": sf_right})

    merger = Merger(axis=1, on=["address", "property"])
    sf = merger(sf)
    sf["merged"].data
    >>>
         address  property  feature_1  feature_2
    0  address_1        10          1          1
    1  address_2        20         12          2
    2  address_3        30         20          3
  • If the key column or columns have different names in the two SFrames, use the left_on and right_on arguments:

    sf_left = DFrame.from_raw(
        {
            "left_address": ["address_1", "address_2", "address_3"],
            "property": [10, 20, 30],
            "feature_1": [1, 12, 20],
        }
    )

    sf_right = DFrame.from_raw(
        {
            "right_address": ["address_1", "address_2", "address_3"],
            "property": [10, 20, 30],
            "feature_2": [1, 2, 3],
        }
    )


    sf = GroupSFrame(children={"default": sf_left, "other": sf_right})

    merger = Merger(
        axis=1,
        left_on=["left_address", "property"],
        right_on=["right_address", "property"],
    )
    sf = merger(sf)
    sf["merged"].data

    >>>
       left_address  property  feature_1  feature_2
    0     address_1        10          1          1
    1     address_2        20         12          2
    2     address_3        30         20          3
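For readers who know pandas, the key options above map naturally onto `DataFrame.merge`. The sketch below is a plain-pandas analogue, not Seshat code, and it assumes the Merger follows pandas-style join semantics. Note one difference: plain pandas keeps both differently named key columns, so the right-hand key is dropped explicitly here to match the output shown above. Whether the Merger does this internally is an assumption.

```python
import pandas as pd

left = pd.DataFrame(
    {
        "left_address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_1": [1, 12, 20],
    }
)
right = pd.DataFrame(
    {
        "right_address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_2": [1, 2, 3],
    }
)

# Keys are paired by position: left_address with right_address,
# property with property. The pandas default join type is "inner".
joined = left.merge(
    right,
    left_on=["left_address", "property"],
    right_on=["right_address", "property"],
)

# pandas keeps both differently named key columns; drop the right-hand copy.
joined = joined.drop(columns=["right_address"])
print(joined)
```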

Other Options

There are also some common options you can use to configure how the merger works:

Merge how

The most important argument you can set on a merger is merge_how, which must be one of "left", "right", "inner", "outer", or "cross".

For example, note the difference between merging with "left" and "right" in these two examples:

sf_left = DFrame.from_raw(
    {
        "address": ["address_1", "address_2", "address_3"],
        "feature_1": [1, 12, 20],
    }
)

sf_right = DFrame.from_raw(
    {
        "address": ["address_1", "address_4", "address_5"],
        "feature_2": [1, 2, 3],
    }
)


sf = GroupSFrame(children={"default": sf_left, "other": sf_right})

merger = Merger(
    axis=1,
    on="address",
    merge_how="left",
)
sf = merger(sf)
sf["merged"].data
>>>
     address  feature_1  feature_2
0  address_1          1        1.0
1  address_2         12        NaN
2  address_3         20        NaN

If you set merge_how to "right", you get this output instead:

merger = Merger(
    axis=1,
    on="address",
    merge_how="right",
)
sf = merger(sf)
sf["merged"].data
>>>
     address  feature_1  feature_2
0  address_1        1.0          1
1  address_4        NaN          2
2  address_5        NaN          3
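To compare all five join types side by side, the same data can be run through plain pandas. This sketch assumes merge_how carries the same meaning as the `how` argument of `pandas.DataFrame.merge`, which matches the "left" and "right" outputs above; the `rows_by_how` dictionary is just an illustrative helper. The "cross" option needs pandas 1.2 or newer and takes no key.

```python
import pandas as pd

left = pd.DataFrame(
    {"address": ["address_1", "address_2", "address_3"], "feature_1": [1, 12, 20]}
)
right = pd.DataFrame(
    {"address": ["address_1", "address_4", "address_5"], "feature_2": [1, 2, 3]}
)

rows_by_how = {}
for how in ["left", "right", "inner", "outer", "cross"]:
    if how == "cross":
        # A cross join pairs every left row with every right row; no key.
        result = left.merge(right, how="cross")
    else:
        result = left.merge(right, on="address", how=how)
    rows_by_how[how] = len(result)

print(rows_by_how)
# {'left': 3, 'right': 3, 'inner': 1, 'outer': 5, 'cross': 9}
```

Only address_1 appears in both inputs, so "inner" keeps one row while "outer" keeps all five distinct addresses.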

Schema for right sf

To merge only specific columns from the right SFrame, you can pass right_schema to the Merger transformer. This allows you to specify which columns from the right SFrame should be included in the final merged dataset. By using right_schema, you can control the structure and content of the resulting SFrame, ensuring that only the necessary data is included.

right_schema = Schema(cols=[Col("A"), Col("B_right")])
merger = Merger(
    axis=1,
    on="A",
    merge_how="left",
    right_schema=right_schema,
    group_keys={"default": "default", "other": "the_other", "merged": "merged_sf"},
)
result_sf = merger(sf)
result_sf["merged_sf"].data
>>>
     A  B_left  C_left  B_right
0  foo       1       1        5
1  foo       1       1        8
2  bar       2       2        6
3  baz       3       3        7
4  foo       5       4        5
5  foo       5       4        8
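The example above does not show its input data, so the following plain-pandas sketch uses invented frames chosen to reproduce the same output. It assumes that overlapping columns receive `_left` and `_right` suffixes and that right_schema then selects which right-hand columns survive; the text does not confirm how Seshat implements this, and the values in the right-hand `C` column are made up.

```python
import pandas as pd

# Hypothetical inputs; only the resulting shape is taken from the example.
left = pd.DataFrame({"A": ["foo", "bar", "baz", "foo"], "B": [1, 2, 3, 5], "C": [1, 2, 3, 4]})
right = pd.DataFrame({"A": ["foo", "bar", "baz", "foo"], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})

merged = left.merge(right, on="A", how="left", suffixes=("_left", "_right"))

# Stand-in for right_schema: the right-hand columns to keep.
wanted_right = ["A", "B_right"]
keep = [
    col
    for col in merged.columns
    if not col.endswith("_right") or col in wanted_right
]
result = merged[keep]

# C_right is dropped; B_right is kept alongside all left-hand columns.
print(result)
```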

Merging inplace

The inplace parameter in the Merger transformer specifies whether the merged SFrame should replace the original SFrame or create a new one. When inplace is set to True, the merged SFrame replaces the original SFrame, making it the default dataset for subsequent operations. This is useful when you want to update your data in place without creating additional copies.

merger = Merger(
    axis=1,
    on="A",
    merge_how="left",
    inplace=True,
    group_keys={"default": "default", "other": "the_other"},
)
result_sf = merger(sf)
list(result_sf.keys)
>>> ['default', 'the_other']

Dropping unmerged

The drop_unmerged parameter in the Merger transformer is used to control whether the default and other SFrames are dropped from the GroupSFrame after the merge. When drop_unmerged is set to True, both the default and other SFrames are removed from the GroupSFrame once the merge operation is complete. This helps in maintaining a clean dataset by ensuring that only the merged SFrame remains.
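The effect of inplace and drop_unmerged on the group's keys can be modelled with an ordinary dict of DataFrames. This is a toy model for building intuition, not Seshat's implementation: `merge_group` and its parameters are hypothetical names, and because the text does not say what happens when both flags are set, the sketch simply lets inplace take precedence.

```python
import pandas as pd


def merge_group(group, default="default", other="other", merged="merged",
                inplace=False, drop_unmerged=False, **merge_kwargs):
    """Toy model of the key bookkeeping; not Seshat's implementation."""
    result = dict(group)  # shallow copy so the caller's group is untouched
    out = result[default].merge(result[other], **merge_kwargs)
    if inplace:
        # The merged frame takes over the default key; no new key is added.
        result[default] = out
        return result
    if drop_unmerged:
        # Remove both inputs, leaving unrelated children in place.
        del result[default], result[other]
    result[merged] = out
    return result


frame = pd.DataFrame({"A": ["foo", "bar"], "B": [1, 2]})
group = {"default": frame, "the_other": frame, "sf_new": frame}

in_place = merge_group(group, other="the_other", inplace=True, on="A")
dropped = merge_group(group, other="the_other", drop_unmerged=True, on="A")

print(list(in_place))  # ['default', 'the_other', 'sf_new']
print(list(dropped))   # ['sf_new', 'merged']
```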

sf["sf_new"] = sf["default"]
merger = Merger(
    axis=1,
    on="A",
    merge_how="left",
    drop_unmerged=True,
    group_keys={"default": "default", "other": "the_other"},
)
result_sf = merger(sf)
list(result_sf.keys)
>>> ['sf_new', 'merged']

List Merger

ListMerger takes a list of SFrames, grouped or non-grouped, and consolidates them into a single GroupSFrame that contains every input SFrame.

How it works

This merger first creates an empty GroupSFrame, then, for each SFrame in the input list:

  1. If the SFrame is non-group, it is added to the group under the sf_prefix argument followed by the SFrame's index in the input list (for example, sf1).

  2. If the SFrame is grouped, each child is added to the result group under the sf_prefix argument, the index of the group in the input list, and the child's key (for example, sf0_default).
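The two naming rules can be sketched in a few lines of plain Python. Here a dict stands in for a grouped SFrame and any other object for a non-group SFrame; both stand-ins, the `merged_keys` helper, and the underscore separator (inferred from key names such as sf0_default) are assumptions for illustration only.

```python
def merged_keys(items, sf_prefix="sf"):
    """Return the child names the two naming rules would produce."""
    names = []
    for index, item in enumerate(items):
        if isinstance(item, dict):
            # Grouped input: one entry per child, suffixed with the child key.
            names.extend(f"{sf_prefix}{index}_{key}" for key in item)
        else:
            # Non-group input: prefix plus position in the list.
            names.append(f"{sf_prefix}{index}")
    return names


keys = merged_keys([{"default": object(), "address": object()}, object()])
print(keys)  # ['sf0_default', 'sf0_address', 'sf1']
```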

Example

from seshat.data_class import DFrame, GroupSFrame
from seshat.transformer.merger.base import ListMerger

some_sf = DFrame.from_raw(None)
sf_list = [GroupSFrame(children={"default": some_sf, "address": some_sf}), some_sf]

merger = ListMerger(sf_prefix="sf")
sf = merger(sf_list)
list(sf.keys)
>>>
['sf0_default', 'sf0_address', 'sf1']

Multi Merger

The MultiMerger transformer applies multiple mergers to the input SFrame. It takes a list of mergers and applies them sequentially, so each merger operates on the output of the previous one.

Example

sf_left = DFrame.from_raw(
    {
        "left_address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_1": [1, 12, 20],
    }
)

sf_right = DFrame.from_raw(
    {
        "right_address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_2": [1, 2, 3],
    }
)

sf_bottom = DFrame.from_raw(
    {
        "left_address": ["address_4", "address_5", "address_6"],
        "property": [40, 50, 60],
        "feature_3": [5, 6, 7],
    }
)

sf = GroupSFrame(children={"default": sf_left, "other": sf_right, "bottom": sf_bottom})


merger = MultiMerger(
    mergers=[
        Merger(
            axis=1,
            left_on=["left_address", "property"],
            right_on=["right_address", "property"],
        ),
        Merger(axis=0, group_keys={"default": "merged", "other": "bottom"}),
    ]
)

sf = merger(sf)
sf["merged"].data
>>>
   left_address  property  feature_1  feature_2  feature_3
0     address_1        10        1.0        1.0        NaN
1     address_2        20       12.0        2.0        NaN
2     address_3        30       20.0        3.0        NaN
3     address_4        40        NaN        NaN        5.0
4     address_5        50        NaN        NaN        6.0
5     address_6        60        NaN        NaN        7.0
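The sequential behaviour can be pictured as function composition over a group. In this sketch a plain dict of pandas DataFrames stands in for the GroupSFrame and each step is an ordinary function; `horizontal`, `vertical`, and `apply_all` are hypothetical names, not Seshat APIs, and the steps assume pandas-style merge and concat semantics.

```python
from functools import reduce

import pandas as pd


def horizontal(group):
    # Step 1: join "default" and "other" on their key columns.
    out = group["default"].merge(
        group["other"],
        left_on=["left_address", "property"],
        right_on=["right_address", "property"],
    ).drop(columns=["right_address"])
    return {**group, "merged": out}


def vertical(group):
    # Step 2: stack the previous result on top of "bottom".
    out = pd.concat([group["merged"], group["bottom"]], ignore_index=True)
    return {**group, "merged": out}


def apply_all(steps, group):
    # Feed each step the output of the one before it.
    return reduce(lambda acc, step: step(acc), steps, group)


group = {
    "default": pd.DataFrame({"left_address": ["address_1", "address_2", "address_3"],
                             "property": [10, 20, 30], "feature_1": [1, 12, 20]}),
    "other": pd.DataFrame({"right_address": ["address_1", "address_2", "address_3"],
                           "property": [10, 20, 30], "feature_2": [1, 2, 3]}),
    "bottom": pd.DataFrame({"left_address": ["address_4", "address_5", "address_6"],
                            "property": [40, 50, 60], "feature_3": [5, 6, 7]}),
}

final = apply_all([horizontal, vertical], group)
print(final["merged"])
```

The second step reads the "merged" child written by the first, which mirrors how the second Merger above maps its default key to "merged".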