Merger

The Merger transformer takes a GroupSFrame, selects two SFrames from it, and merges them either vertically or horizontally. It is designed for combining data from different sources or different subsets of the same dataset, which makes it useful for data preparation and integration tasks.

Merging Types & How it works

As mentioned above, the Merger can merge SFrames vertically or horizontally. This part explains both types and how to use them.

Vertically

By merging SFrames vertically, the Merger transformer appends rows from one SFrame to another, effectively stacking them. This is useful when you have separate datasets with the same structure (same columns) that you want to combine into a single dataset. For example, you might have transaction records from different months that you need to merge into a single continuous dataset for yearly analysis.

How it works

To merge vertically, the only thing you need to do is set axis to 0; everything else is handled by the defaults. Then, as with any other transformer, set group_keys so that it matches the keys of your input GroupSFrame and call the merger with the input sf.

from seshat.data_class import DFrame, GroupSFrame
from seshat.transformer.merger import Merger

sf_1 = DFrame.from_raw(
    {"address": ["address_1", "address_2", "address_3"], "feature_1": [1, 12, 20]}
)

sf_2 = DFrame.from_raw(
    {"address": ["address_4", "address_5", "address_6"], "feature_2": [1, 2, 3]}
)


sf = GroupSFrame(children={"default": sf_1, "other": sf_2})

merger = Merger(axis=0)
sf = merger(sf)
sf["merged"].data
>>>
     address  feature_1  feature_2
0  address_1        1.0        NaN
1  address_2       12.0        NaN
2  address_3       20.0        NaN
3  address_4        NaN        1.0
4  address_5        NaN        2.0
5  address_6        NaN        3.0

As you can see, where the columns of the two SFrames do not match, the output sf contains NaN for the missing values.
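The stacking behaviour shown above can be illustrated without Seshat at all. The following is a minimal plain-Python sketch, not the library's implementation: the helper name stack_rows is hypothetical, rows are represented as dictionaries, and missing cells become None where the real output shows NaN.

```python
def stack_rows(top, bottom):
    """Stack two lists of row dicts, keeping the union of their columns."""
    # Keep column order stable: columns of `top` first, then new ones from `bottom`.
    columns = []
    for row in top + bottom:
        for name in row:
            if name not in columns:
                columns.append(name)
    # Cells a row does not have are filled with None (the DataFrame output shows NaN).
    return [{name: row.get(name) for name in columns} for row in top + bottom]


first = [
    {"address": "address_1", "feature_1": 1},
    {"address": "address_2", "feature_1": 12},
]
second = [{"address": "address_4", "feature_2": 1}]

stacked = stack_rows(first, second)
for row in stacked:
    print(row)
```

Every output row carries all three columns, and the cells that one of the inputs could not supply are empty, which is the same pattern as the NaN cells in the merged table.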

Horizontally

Merging SFrames horizontally aligns them side by side, based on a common key or simply by index, effectively adding columns from one SFrame to another. This approach is useful when you have different sets of features for the same entities. For instance, you might have one sf with an address and some features and another sf with other features for the same addresses, and you want to combine them into a single dataset for a complete view of each address.

How it works

By default, axis is set to 1, so the Merger merges horizontally unless told otherwise. There are several ways to configure the merge:

If the column to merge on has the same name in both SFrames, you can simply pass it with the on argument, like this:

sf_left = DFrame.from_raw(
    {"address": ["address_1", "address_2", "address_3"], "feature_1": [1, 12, 20]}
)

sf_right = DFrame.from_raw(
    {"address": ["address_1", "address_2", "address_3"], "feature_2": [1, 2, 3]}
)


sf = GroupSFrame(children={"default": sf_left, "other": sf_right})

merger = Merger(axis=1, on="address")
sf = merger(sf)
sf["merged"].data
>>>
     address  feature_1  feature_2
0  address_1          1          1
1  address_2         12          2
2  address_3         20          3

The on argument can also be a list of columns:

sf_left = DFrame.from_raw(
    {
        "address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_1": [1, 12, 20],
    }
)

sf_right = DFrame.from_raw(
    {
        "address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_2": [1, 2, 3],
    }
)


sf = GroupSFrame(children={"default": sf_left, "other": sf_right})

merger = Merger(axis=1, on=["address", "property"])
sf = merger(sf)
sf["merged"].data
>>>
     address  property  feature_1  feature_2
0  address_1        10          1          1
1  address_2        20         12          2
2  address_3        30         20          3

If the column or columns to merge on have different names in the two SFrames, you can use the left_on and right_on arguments:

sf_left = DFrame.from_raw(
    {
        "left_address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_1": [1, 12, 20],
    }
)

sf_right = DFrame.from_raw(
    {
        "right_address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_2": [1, 2, 3],
    }
)


sf = GroupSFrame(children={"default": sf_left, "other": sf_right})

merger = Merger(
    axis=1,
    left_on=["left_address", "property"],
    right_on=["right_address", "property"],
)
sf = merger(sf)
sf["merged"].data
>>>
  left_address  property  feature_1  feature_2
0    address_1        10          1          1
1    address_2        20         12          2
2    address_3        30         20          3

Other Options

There are some common options that you can use to further configure how the Merger works:

Merge how

The most important argument you can set is merge_how, which must be one of "left", "right", "inner", "outer", or "cross".

For example, note the difference between merging with "left" and "right" in the following two examples. With "left", every row of the default SFrame is kept; with "right", every row of the other SFrame is kept:

sf_left = DFrame.from_raw(
    {
        "address": ["address_1", "address_2", "address_3"],
        "feature_1": [1, 12, 20],
    }
)

sf_right = DFrame.from_raw(
    {
        "address": ["address_1", "address_4", "address_5"],
        "feature_2": [1, 2, 3],
    }
)


sf = GroupSFrame(children={"default": sf_left, "other": sf_right})

merger = Merger(
    axis=1,
    on="address",
    merge_how="left",
)
sf = merger(sf)
sf["merged"].data
>>>
     address  feature_1  feature_2
0  address_1          1        1.0
1  address_2         12        NaN
2  address_3         20        NaN

If you set merge_how to "right" instead, you get this output:

merger = Merger(
    axis=1,
    on="address",
    merge_how="right",
)
sf = merger(sf)
sf["merged"].data
>>>
     address  feature_1  feature_2
0  address_1        1.0          1
1  address_4        NaN          2
2  address_5        NaN          3
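To make the effect of merge_how more concrete, here is a plain-Python sketch of key-based joining. It is an illustration only, not how Seshat performs the merge: the function join_rows is hypothetical, it assumes each key value appears at most once per side, it does not cover "cross", and it uses None where the outputs above show NaN.

```python
def join_rows(left, right, on, how="inner"):
    """Join two lists of row dicts on one key column (assumes unique keys per side)."""
    left_cols = [c for c in left[0] if c != on]
    right_cols = [c for c in right[0] if c != on]
    left_by_key = {row[on]: row for row in left}
    right_by_key = {row[on]: row for row in right}

    # `how` decides which key values survive in the result.
    if how == "left":
        keys = [row[on] for row in left]
    elif how == "right":
        keys = [row[on] for row in right]
    elif how == "inner":
        keys = [row[on] for row in left if row[on] in right_by_key]
    elif how == "outer":
        keys = [row[on] for row in left]
        keys += [row[on] for row in right if row[on] not in left_by_key]
    else:
        raise ValueError(f"unsupported how: {how!r}")

    result = []
    for key in keys:
        merged = {on: key}
        # Columns from a side that lacks this key are filled with None.
        for col in left_cols:
            merged[col] = left_by_key.get(key, {}).get(col)
        for col in right_cols:
            merged[col] = right_by_key.get(key, {}).get(col)
        result.append(merged)
    return result


left = [
    {"address": "address_1", "feature_1": 1},
    {"address": "address_2", "feature_1": 12},
    {"address": "address_3", "feature_1": 20},
]
right = [
    {"address": "address_1", "feature_2": 1},
    {"address": "address_4", "feature_2": 2},
    {"address": "address_5", "feature_2": 3},
]

left_join = join_rows(left, right, on="address", how="left")
right_join = join_rows(left, right, on="address", how="right")
inner_join = join_rows(left, right, on="address", how="inner")
```

With the same data as the two examples above, left_join keeps address_1 to address_3, right_join keeps address_1, address_4 and address_5, and inner_join keeps only address_1, the one key present on both sides.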

Schema for right sf

To merge only specific columns from the right SFrame, you can pass right_schema to the Merger transformer. This allows you to specify which columns from the right SFrame should be included in the final merged dataset. By using right_schema, you can control the structure and content of the resulting SFrame, ensuring that only the necessary data is included.

right_schema = Schema(cols=[Col("A"), Col("B_right")])
merger = Merger(
    axis=1,
    on="A",
    merge_how="left",
    right_schema=right_schema,
    group_keys={"default": "default", "other": "the_other", "merged": "merged_sf"},
)
result_sf = merger(sf)
result_sf["merged_sf"].data
>>>
     A  B_left  C_left  B_right
0  foo       1      1        5
1  foo       1      1        8
2  bar       2      2        6
3  baz       3      3        7
4  foo       5      4        5
5  foo       5      4        8

Merging inplace

The inplace parameter in the Merger transformer specifies whether the merged SFrame should replace the original SFrame or create a new one. When inplace is set to True, the merged SFrame replaces the original SFrame, making it the default dataset for subsequent operations. This is useful when you want to update your data in place without creating additional copies.

merger = Merger(
    axis=1,
    on="A",
    merge_how="left",
    inplace=True,
    group_keys={"default": "default", "other": "the_other"},
)
result_sf = merger(sf)
list(result_sf.keys)
>>> ['default', 'the_other']

Dropping unmerged

The drop_unmerged parameter in the Merger transformer is used to control whether the default and other SFrames are dropped from the GroupSFrame after the merge. When drop_unmerged is set to True, both the default and other SFrames are removed from the GroupSFrame once the merge operation is complete. This helps in maintaining a clean dataset by ensuring that only the merged SFrame remains.

sf["sf_new"] = sf["default"]
merger = Merger(
    axis=1,
    on="A",
    merge_how="left",
    drop_unmerged=True,
    group_keys={"default": "default", "other": "the_other"},
)
result_sf = merger(sf)
list(result_sf.keys)
>>> ['sf_new', 'merged']
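The key lists printed in the inplace and drop_unmerged examples can be summarised with a small bookkeeping sketch. This is only a model inferred from those two outputs, not Seshat code: the function result_keys is hypothetical, the group is a plain dict, the key order is an assumption, and the combination of both flags is not documented here, so the sketch refuses it.

```python
def result_keys(group, default="default", other="other", merged="merged",
                inplace=False, drop_unmerged=False):
    """Return the keys a group would hold after a merge, per the documented outputs."""
    if inplace and drop_unmerged:
        # Not covered by the examples, so this sketch does not guess.
        raise NotImplementedError("combination of flags is not modelled")
    keys = list(group)
    if inplace:
        # The merged frame takes over the `default` slot; no new key appears.
        return keys
    if drop_unmerged:
        # Both inputs are removed before the merged frame is added.
        keys = [k for k in keys if k not in (default, other)]
    return keys + [merged]


inplace_keys = result_keys({"default": 1, "the_other": 2},
                           other="the_other", inplace=True)
dropped_keys = result_keys({"default": 1, "the_other": 2, "sf_new": 3},
                           other="the_other", drop_unmerged=True)
print(inplace_keys)
print(dropped_keys)
```

The first call reproduces the keys of the inplace example and the second reproduces the keys of the drop_unmerged example, where sf_new survives because it is neither the default nor the other SFrame.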

List Merger

The ListMerger takes a list of SFrames, whether grouped or non-grouped, and returns a single GroupSFrame that contains all of the input SFrames.

How it works

This merger first creates an empty GroupSFrame. Then, for each sframe in the input list:

If the sframe is non-group, it is added to the result group under a name made of the sf_prefix argument plus the index of that sframe in the input list.
If the sframe is grouped, each child is added to the result group under a name made of the sf_prefix argument plus the index of the grouped sframe in the input list plus the key of the child.
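The naming rule described above can be sketched in plain Python. This is not the ListMerger source: the function flatten_frames is hypothetical, a dict stands in for a GroupSFrame and any other object for a non-group sframe, and the underscore before the child key is inferred from the example output in this section.

```python
def flatten_frames(frames, sf_prefix="sf"):
    """Collect grouped and non-grouped frames into one flat mapping of names."""
    result = {}
    for index, frame in enumerate(frames):
        if isinstance(frame, dict):
            # Grouped input: one entry per child, named prefix + index + "_" + child key.
            for key, child in frame.items():
                result[f"{sf_prefix}{index}_{key}"] = child
        else:
            # Non-group input: named prefix + index.
            result[f"{sf_prefix}{index}"] = frame
    return result


frames = [{"default": "frame_a", "address": "frame_b"}, "frame_c"]
flat = flatten_frames(frames, sf_prefix="sf")
print(list(flat))
```

The printed names are sf0_default, sf0_address and sf1: the index always refers to the position in the input list, so children of the same group share it.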

Example

from seshat.data_class import DFrame, GroupSFrame
from seshat.transformer.merger.base import ListMerger

some_sf = DFrame.from_raw(None)
sf_list = [GroupSFrame(children={"default": some_sf, "address": some_sf}), some_sf]

merger = ListMerger(sf_prefix="sf")
sf = merger(sf_list)
list(sf.keys)
>>>
['sf0_default', 'sf0_address', 'sf1']

Multi Merger

The “MultiMerger” transformer will apply multiple mergers on the input SFrame. It takes a list of mergers and applies each of them sequentially to the SFrame.

Example

sf_left = DFrame.from_raw(
    {
        "left_address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_1": [1, 12, 20],
    }
)

sf_right = DFrame.from_raw(
    {
        "right_address": ["address_1", "address_2", "address_3"],
        "property": [10, 20, 30],
        "feature_2": [1, 2, 3],
    }
)

sf_bottom = DFrame.from_raw(
    {
        "left_address": ["address_4", "address_5", "address_6"],
        "property": [40, 50, 60],
        "feature_3": [5, 6, 7],
    }
)

sf = GroupSFrame(children={"default": sf_left, "other": sf_right, "bottom": sf_bottom})


merger = MultiMerger(
    mergers=[
        Merger(
            axis=1,
            left_on=["left_address", "property"],
            right_on=["right_address", "property"],
        ),
        Merger(axis=0, group_keys={"default": "merged", "other": "bottom"}),
    ]
)

sf = merger(sf)
sf["merged"].data
>>>
  left_address  property  feature_1  feature_2  feature_3
0    address_1        10        1.0        1.0        NaN
1    address_2        20       12.0        2.0        NaN
2    address_3        30       20.0        3.0        NaN
3    address_4        40        NaN        NaN        5.0
4    address_5        50        NaN        NaN        6.0
5    address_6        60        NaN        NaN        7.0
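The important point in this example is that the second merger reads the "merged" key written by the first one. A minimal plain-Python sketch of that sequential application follows; it is not the MultiMerger implementation, and apply_in_order and combine are hypothetical stand-ins that operate on a dict of lists instead of a GroupSFrame.

```python
def apply_in_order(steps, group):
    """Apply each step to the group in turn, feeding every result to the next step."""
    for step in steps:
        group = step(group)
    return group


def combine(left_key, right_key, out_key="merged"):
    """Build a stand-in merger that stores left + right under out_key."""
    def step(group):
        updated = dict(group)  # copy so the caller's group is left untouched
        updated[out_key] = group[left_key] + group[right_key]
        return updated
    return step


group = {"default": ["a"], "other": ["b"], "bottom": ["c"]}
steps = [
    combine("default", "other"),   # writes "merged"
    combine("merged", "bottom"),   # reads the "merged" produced by the first step
]
result = apply_in_order(steps, group)
print(result["merged"])
```

Because the steps run strictly in order, swapping them would fail here: the second step would look for a "merged" entry that does not exist yet.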
