Implementation

Transformers are used throughout the seshat project. They serve one general purpose: any operation that takes an SFrame from state A to state B should be implemented as a transformer. By this definition, Pipeline, Branch, Deriver, Splitter, and Merger are all transformers.

How transformers work

Transformers generally receive an SFrame as input and, based on its type, find the proper handler. If a handler for the SFrame type is not found, the SFrame is converted to the DEFAULT_FRAME of the transformer. After executing the transformation on the SFrame, the result is converted back to the initial SFrame type.

The dispatch that finds a handler compatible with the input SFrame relies on a naming pattern. The handler base name is defined by the transformer's HANDLER_NAME attribute. Each SFrame type has a frame name: for example, DFrame's frame name is "df" and SPFrame's is "spf". The handler for a given SFrame type is therefore named HANDLER_NAME + "_" + the SFrame's frame name, for example handle_df.

class CustomTransformer(Transformer):
    HANDLER_NAME = "handle"

    def handle_df(self, default: pd.DataFrame, *args, **kwargs): ...

    def handle_spf(self, default: PySparkDataFrame, *args, **kwargs): ...
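The lookup described above can be sketched with getattr. This is a minimal illustration rather than seshat's actual code: ToyTransformer, resolve_handler, and DEFAULT_FRAME_NAME are hypothetical names, and the conversion to and from the default frame is reduced to a boolean flag.

```python
class ToyTransformer:
    """Hypothetical transformer used only to illustrate name-based dispatch."""

    HANDLER_NAME = "handle"
    DEFAULT_FRAME_NAME = "df"  # assumption: stands in for DEFAULT_FRAME

    def handle_df(self, default):
        # Handler for the "df" frame name; doubles every value.
        return [value * 2 for value in default]

    def resolve_handler(self, frame_name):
        """Return (handler, needs_conversion) for the given frame name."""
        handler = getattr(self, f"{self.HANDLER_NAME}_{frame_name}", None)
        if handler is not None:
            return handler, False
        # No dedicated handler: fall back to the default frame's handler
        # and signal that the input must be converted first.
        fallback = getattr(self, f"{self.HANDLER_NAME}_{self.DEFAULT_FRAME_NAME}")
        return fallback, True


transformer = ToyTransformer()
handler, needs_conversion = transformer.resolve_handler("df")
fallback_handler, fallback_needs_conversion = transformer.resolve_handler("spf")
```

Here "spf" has no handle_spf method, so the sketch returns handle_df and flags that a conversion is needed, which mirrors the fallback to DEFAULT_FRAME described earlier.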

A key property of transformers is that they work well with GroupSFrame. Generally, transformers receive a GroupSFrame in the constructor. Based on this, the raw values are retrieved from the input SFrame and passed to the handler methods. The handler works on raw data and returns a dictionary whose keys match the transformer's group_keys and whose values are the resulting raw data. After the handler runs, the result is set on the input SFrame.
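The flow from grouped input to handler and back can be sketched with plain dictionaries. Everything here is an assumption made for illustration: the group is modeled as a dict, group_keys as a mapping from handler argument names to group keys, and ToyGroupTransformer is not a seshat class.

```python
class ToyGroupTransformer:
    """Hypothetical transformer; a plain dict stands in for a grouped SFrame."""

    def __init__(self, group_keys):
        # Assumption: maps handler argument names to keys of the group.
        self.group_keys = group_keys

    def handle(self, default):
        # Works on raw data only and returns a dict keyed like group_keys.
        return {"default": [value + 1 for value in default]}

    def __call__(self, group):
        # 1. Pull the raw data for every handler argument out of the group.
        kwargs = {arg: group[key] for arg, key in self.group_keys.items()}
        # 2. Run the handler on the raw data.
        result = self.handle(**kwargs)
        # 3. Store each returned value under its group key in a copy.
        output = dict(group)
        for arg, raw in result.items():
            output[self.group_keys[arg]] = raw
        return output


group_in = {"users": [1, 2], "items": [9]}
group_out = ToyGroupTransformer({"default": "users"})(group_in)
```

Only the member named by group_keys is touched; the "items" member passes through unchanged.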

Changing single SFrame to GroupSFrame

If a transformer takes a single raw data object and returns several, the output SFrame is always grouped, even when the input SFrame is not. One example of such a transformer is SFrameFromColsDeriver.
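A rough sketch of this behavior, using a list of row dictionaries as raw data and a plain dict as the group. The functions split_columns and apply and the keys "default" and "other" are hypothetical; this is not how SFrameFromColsDeriver is implemented.

```python
def split_columns(rows, cols):
    """Hypothetical handler: one table in, two tables out."""
    kept = [{k: v for k, v in row.items() if k not in cols} for row in rows]
    derived = [{k: row[k] for k in cols} for row in rows]
    return {"default": kept, "other": derived}


def apply(sf, cols):
    # Wrap an ungrouped input so that both cases share one code path.
    group = dict(sf) if isinstance(sf, dict) else {"default": sf}
    # The handler returns two raw values, so the output is always a group.
    group.update(split_columns(group["default"], cols))
    return group


single_input = [{"a": 1, "b": 2}]
grouped_output = apply(single_input, ["b"])
```

The input is a single table, yet the result holds two members, so it can only be represented as a group.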

Only groups

Some transformers work only with GroupSFrame because their handlers need more than one raw data object. This is declared with the ONLY_GROUP attribute.
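One way such a flag could be enforced is sketched below. How seshat actually checks ONLY_GROUP is not shown in this document, so ToyMerger, OnlyGroupError, and the dict-based group are all assumptions.

```python
class OnlyGroupError(TypeError):
    """Hypothetical error type for this sketch."""


class ToyMerger:
    """Hypothetical transformer that needs at least two raw inputs."""

    ONLY_GROUP = True

    def validate(self, sf):
        # A plain dict stands in for a grouped SFrame in this sketch.
        if self.ONLY_GROUP and not isinstance(sf, dict):
            raise OnlyGroupError(
                f"{type(self).__name__} accepts only grouped input"
            )


# A grouped input passes validation without raising.
ToyMerger().validate({"default": [1], "other": [2]})
```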

Validating input SFrame

The transformer should validate the input SFrame and raise clear exceptions before passing it to the handler methods. This validation happens in the validate method. To add further checks, override validate and call the parent implementation first:

class SomeTransformer(Transformer):
    def validate(self, sf: SFrame):
        super().validate(sf)
        # Your validation code.

Validating columns in the input SFrame is a common type of validation. For this, the _validate_columns method is implemented. This method receives the input SFrame, its key, and the columns that should be present. The best place to call this method is in the validate method.

def validate(self, sf: SFrame):
    super().validate(sf)
    self._validate_columns(sf, self.default_sf_key, *self.cols)
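To make the idea of a column check concrete, here is a standalone sketch of what such a helper might do. It is not the body of _validate_columns: the function name, the dict of column sets, and the error messages are assumptions.

```python
def validate_columns(group, sf_key, *cols):
    """Hypothetical check: raise if any required column is absent."""
    if sf_key not in group:
        raise KeyError(f"no sframe registered under key {sf_key!r}")
    available = group[sf_key]
    missing = [col for col in cols if col not in available]
    if missing:
        raise ValueError(f"sframe {sf_key!r} is missing columns: {missing}")


# Each group member is modeled as the set of its column names.
group = {"default": {"user_id", "price"}}
validate_columns(group, "default", "user_id", "price")  # passes silently

try:
    validate_columns(group, "default", "user_id", "quantity")
except ValueError as exc:
    error_message = str(exc)
```

Naming both the key and the missing columns in the message keeps the failure easy to trace when several grouped inputs are involved.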

Immutability

Transformers are immutable: they do not modify the input SFrame, and the output is a different SFrame from the input.

transformer: Transformer
sf_output = transformer(sf_input)
sf_output == sf_input  # False
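A common way to get this behavior is to copy the input before changing anything. The sketch below assumes that approach; ToyFrame and AddOne are hypothetical, and seshat may achieve immutability differently.

```python
import copy


class ToyFrame:
    """Hypothetical stand-in for an SFrame holding raw data."""

    def __init__(self, data):
        self.data = data


class AddOne:
    """Hypothetical transformer that never touches its input."""

    def __call__(self, sf):
        # Work on a deep copy so the caller's object stays unchanged.
        output = copy.deepcopy(sf)
        output.data = [value + 1 for value in output.data]
        return output


sf_input = ToyFrame([1, 2])
sf_output = AddOne()(sf_input)
```

After the call, sf_input still holds its original data and sf_output is a separate object.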