Create Transformer

You can create a transformer by following these steps:

  1. Define the default group keys for the input sf. To do this, set DEFAULT_GROUP_KEYS.

    class CustomTransformer(Transformer):
        DEFAULT_GROUP_KEYS = {"default": "default", "address": "address"}

    DEFAULT_GROUP_KEYS is used as the group_keys by default. If you want to use different names, pass group_keys to the constructor when you instantiate the transformer (see the construction sketch after these steps).

  2. If your transformation needs more than one raw input and all of them must be present (none can be None), set ONLY_GROUP to True.

    class CustomTransformer(Transformer):
        DEFAULT_GROUP_KEYS = {"default": "default", "address": "address"}
        ONLY_GROUP = True

    The default value of ONLY_GROUP is False.

  3. Override the validate method. If you need to check that the input sf contains specific columns, you can use the _validate_columns method.

    class CustomTransformer(Transformer):
        DEFAULT_GROUP_KEYS = {"default": "default", "address": "address"}
        ONLY_GROUP = True

        def validate(self, sf: SFrame):
            super().validate(sf)
            self._validate_columns(sf, self.default_sf_key, "column_1", "column_2")
  4. Set HANDLER_NAME to your preferred value. For example, you can choose derive for derivers and trim for trimmers (a deriver sketch appears after these steps).

    class CustomTransformer(Transformer):
        DEFAULT_GROUP_KEYS = {"default": "default", "address": "address"}
        ONLY_GROUP = True
        HANDLER_NAME = "transform"

        def validate(self, sf: SFrame):
            super().validate(sf)
            self._validate_columns(sf, self.default_sf_key, "column_1", "column_2")
  5. Implement methods based on the input raw format. The method name should follow this rule:

    HANDLER_NAME + _ + FRAME_NAME

    For example, FRAME_NAME is df for pandas and spf for pyspark (a pyspark handler sketch appears after these steps).

    class CustomTransformer(Transformer):
        DEFAULT_GROUP_KEYS = {"default": "default", "address": "address"}
        ONLY_GROUP = True
        HANDLER_NAME = "transform"

        def validate(self, sf: SFrame):
            super().validate(sf)
            self._validate_columns(sf, self.default_sf_key, "column_1", "column_2")

        def transform_df(default: pd.DataFrame, address: pd.DataFrame, *args, **kwargs): ...
  6. Return a dictionary from the handler method whose keys match the group_keys and whose values are the raw data.

    class CustomTransformer(Transformer):
        DEFAULT_GROUP_KEYS = {"default": "default", "address": "address"}
        ONLY_GROUP = True
        HANDLER_NAME = "transform"

        def validate(self, sf: SFrame):
            super().validate(sf)
            self._validate_columns(sf, self.default_sf_key, "column_1", "column_2")

        def transform_df(default: pd.DataFrame, address: pd.DataFrame, *args, **kwargs):
            # your transformation implementation ...
            return {"default": default, "address": address}
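
If the same transformer should also accept pyspark input, you can add a second handler that follows the naming rule with FRAME_NAME spf. The sketch below is a minimal illustration, assuming the handler receives pyspark DataFrame objects and returns a dictionary the same way as transform_df; validate and transform_df from the example above are omitted for brevity.

    from pyspark.sql import DataFrame as SparkDataFrame

    class CustomTransformer(Transformer):
        DEFAULT_GROUP_KEYS = {"default": "default", "address": "address"}
        ONLY_GROUP = True
        HANDLER_NAME = "transform"

        # Handler for pyspark input: HANDLER_NAME ("transform") + "_" + FRAME_NAME ("spf")
        def transform_spf(default: SparkDataFrame, address: SparkDataFrame, *args, **kwargs):
            # your pyspark transformation implementation ...
            return {"default": default, "address": address}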
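The HANDLER_NAME you pick determines the handler method names. For instance, a deriver with HANDLER_NAME set to "derive" would implement derive_df for pandas input. The AgeDeriver class and its columns below are hypothetical and only illustrate the naming rule; the handler keeps the same signature style as the examples above.

    import pandas as pd

    class AgeDeriver(Transformer):
        DEFAULT_GROUP_KEYS = {"default": "default"}
        HANDLER_NAME = "derive"

        # Handler name: HANDLER_NAME ("derive") + "_" + FRAME_NAME ("df")
        def derive_df(default: pd.DataFrame, *args, **kwargs):
            # hypothetical derivation: compute "age" from a "birth_year" column
            default["age"] = pd.Timestamp.now().year - default["birth_year"]
            return {"default": default}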
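Finally, when your pipeline uses different group names than the defaults, you can override DEFAULT_GROUP_KEYS at construction time. This is a minimal sketch, assuming the Transformer constructor accepts a group_keys mapping as described in step 1; the key names are illustrative.

    # Use the class defaults: {"default": "default", "address": "address"}
    transformer = CustomTransformer()

    # Override the defaults with pipeline-specific names (illustrative keys)
    transformer = CustomTransformer(
        group_keys={"default": "users", "address": "user_addresses"}
    )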