
Pipeline

A Pipeline applies a list of transformers sequentially to an input SFrame. Each transformer processes the data and passes its output to the next transformer in the list. The output of the last transformer is the result of the pipeline.

Each transformer in the list must produce an output that the next transformer, if there is one, can accept as input. This keeps the design modular: independent transformations can be chained together to build more complex data processing workflows.
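The sequential flow described above can be sketched in plain Python. This is only an illustration of the idea, not Seshat's implementation: `run_pipes`, `add_one`, and `keep_even` are hypothetical names, and ordinary lists stand in for an SFrame.

```python
from typing import Any, Callable, Sequence

# Hypothetical stand-in: a "pipe" is anything callable on the data.
Pipe = Callable[[Any], Any]


def run_pipes(data: Any, pipes: Sequence[Pipe]) -> Any:
    """Feed data through each pipe in order and return the last output."""
    for pipe in pipes:
        # The output of one pipe becomes the input of the next.
        data = pipe(data)
    return data


def add_one(rows: list) -> list:
    # Illustrative transformation: shift every value by one.
    return [r + 1 for r in rows]


def keep_even(rows: list) -> list:
    # Illustrative filter: keep only the even values.
    return [r for r in rows if r % 2 == 0]


result = run_pipes([1, 2, 3], [add_one, keep_even])
print(result)  # [2, 4]
```

Here `keep_even` works only because `add_one` returns a list it can iterate over, which is the compatibility requirement between adjacent transformers.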

Pipeline is a Transformer

Note that the pipeline is itself a transformer, so it is configured and called in the same way as any other transformer. You configure it by passing pipes to the constructor, and you use it by calling the configured object on an SFrame. The rest of this section explains both steps.
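One consequence of a pipeline being a transformer is that a pipeline could, in principle, serve as a pipe inside another pipeline. The sketch below shows that idea with a hypothetical stand-in class, `SketchPipeline`, and plain lists. Whether Seshat's `Pipeline` supports nesting in exactly this way is an assumption and has not been verified here.

```python
class SketchPipeline:
    """Hypothetical stand-in for a pipeline; not Seshat's Pipeline class."""

    def __init__(self, pipes):
        # Configuration happens in the constructor.
        self.pipes = list(pipes)

    def __call__(self, data):
        # Calling the object runs every pipe in order.
        for pipe in self.pipes:
            data = pipe(data)
        return data


def double(values):
    # Illustrative transformation: multiply every value by two.
    return [v * 2 for v in values]


def drop_small(values):
    # Illustrative filter: keep values greater than 2.
    return [v for v in values if v > 2]


# The inner pipeline is callable, so it can be used like any other pipe.
inner = SketchPipeline(pipes=[double])
outer = SketchPipeline(pipes=[inner, drop_small])

nested_result = outer([1, 2, 3])
print(nested_result)  # [4, 6]
```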

How to Configure

Pipeline accepts a list of transformers through its pipes argument. The order of the pipes matters because the SFrame passes through the transformers in the order they are listed.

pipeline = Pipeline(pipes=[Transformer1(), Transformer2()])

Once it has been constructed with its pipes, the pipeline is ready to be called.
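To see why the order of pipes matters, consider this small sketch. It uses plain Python functions and lists rather than Seshat transformers and SFrames; `run_in_order`, `scale`, and `keep_below_25` are hypothetical names chosen for illustration.

```python
from functools import reduce


def run_in_order(pipes, data):
    # Apply each pipe to the result of the previous one, left to right.
    return reduce(lambda current, pipe: pipe(current), pipes, data)


def scale(values):
    # Illustrative transformation: multiply every value by ten.
    return [v * 10 for v in values]


def keep_below_25(values):
    # Illustrative filter: keep values smaller than 25.
    return [v for v in values if v < 25]


scale_then_filter = run_in_order([scale, keep_below_25], [1, 2, 3])
filter_then_scale = run_in_order([keep_below_25, scale], [1, 2, 3])

print(scale_then_filter)  # [10, 20]
print(filter_then_scale)  # [10, 20, 30]
```

The same two steps give different results depending on their position, so the list passed as pipes should follow the intended processing order.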

Example

To illustrate, let's consider a simple example. Suppose we have two transformers: Transformer1 which normalizes data, and Transformer2 which filters out rows based on a condition. We can create and use a pipeline as follows:

from seshat.data_class import SFrame
from seshat.transformer.pipeline import Pipeline


class Transformer1:
    def __call__(self, sf: SFrame) -> SFrame:
        # Normalize data in sf
        return sf


class Transformer2:
    def __call__(self, sf: SFrame) -> SFrame:
        # Filter rows in sf
        return sf

pipeline = Pipeline(pipes=[Transformer1(), Transformer2()])

sf_input = SFrame.from_raw({"A": [1, 2, 3], "B": [4, 5, 6]})
sf_output = pipeline(sf_input)

print(sf_output.to_raw())

In this example, sf_input is processed by Transformer1 and then by Transformer2, resulting in sf_output.

Benefits of Using Pipelines

  • Modularity: Each transformation step is encapsulated in its own transformer, making the pipeline easy to extend and modify.
  • Reusability: Transformers can be reused across different pipelines, promoting code reuse.
  • Maintainability: The clear structure of pipelines helps in maintaining and understanding the data processing workflow.
  • Scalability: Pipelines can handle complex workflows by chaining multiple transformers, scaling from simple to advanced data processing tasks.

By understanding how to configure and use pipelines, you can create efficient and maintainable data processing workflows tailored to your specific needs.