Schema

Schema has two main jobs: converting the data type of columns and renaming columns.

Col Config

Col is a dataclass that holds the configuration for a single column of the schema. Its main fields are:

  • original: the current name of the column
  • to: the desired column name
  • dtype: the data type to which the column values should be converted

The following example renames column A to renamed_A:

from seshat.data_class import DFrame
from seshat.transformer.schema import Schema, Col

sf = DFrame.from_raw(data={"A": ["1", "2"]})
schema = Schema(cols=[Col("A", to="renamed_A")])
sf = schema(sf)

sf.data.columns
>>> Index(['renamed_A'], dtype='object')
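
The example above only renames a column. To illustrate how the three fields fit together, including dtype, here is a minimal pure-Python sketch. It does not use seshat: ColSketch and apply_cols are hypothetical names, and treating dtype as a callable such as int is an assumption made only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ColSketch:
    """Hypothetical stand-in for Col; not the seshat class."""

    original: str
    to: Optional[str] = None
    dtype: Optional[Callable] = None  # assumption: a callable such as int or float


def apply_cols(data: dict, cols: list) -> dict:
    """Convert and rename the listed columns of a dict-of-lists table."""
    result = {}
    for col in cols:
        values = data[col.original]
        if col.dtype is not None:
            # Convert every value in the column to the requested type.
            values = [col.dtype(v) for v in values]
        # Fall back to the original name when no new name is given.
        result[col.to or col.original] = values
    return result


converted = apply_cols({"A": ["1", "2"]}, [ColSketch("A", to="renamed_A", dtype=int)])
print(converted)  # {'renamed_A': [1, 2]}
```

The string values "1" and "2" come out as integers under the new column name. How the real Col accepts dtype values should be checked against the library itself.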

Other Col Config

Schema is a general-purpose component used in several places in the project, so the Col config can also carry fields that only matter for specific purposes.

  • is_id: Set this field to True for each column that should act as an ID. ID columns are used in flipside sources and in the SQL Database Saver.
  • update_func: This field applies when updating the database, and tells the database how to update each column. It can be sum, mean, or replace. Use replace when the column value should be overwritten instead of combined.
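
To make the three update_func options concrete, here is a rough pure-Python sketch of how a stored row and an incoming row might be combined column by column. It is not seshat code: the combine function, the string names for the rules, and the reading of mean as the average of the stored and incoming values are all assumptions made for illustration.

```python
def combine(existing, incoming, update_func):
    """Combine a stored value with an incoming one (assumed semantics)."""
    if update_func == "sum":
        return existing + incoming
    if update_func == "mean":
        # Assumption: average of the stored value and the incoming value.
        return (existing + incoming) / 2
    if update_func == "replace":
        # The incoming value overwrites the stored one.
        return incoming
    raise ValueError(f"unknown update_func: {update_func!r}")


stored_row = {"id": 7, "total": 10, "score": 4.0, "label": "old"}
incoming_row = {"id": 7, "total": 5, "score": 2.0, "label": "new"}
rules = {"total": "sum", "score": "mean", "label": "replace"}

updated_row = dict(stored_row)
for column, rule in rules.items():
    updated_row[column] = combine(stored_row[column], incoming_row[column], rule)

print(updated_row)  # {'id': 7, 'total': 15, 'score': 3.0, 'label': 'new'}
```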

Exclusive

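The exclusive argument controls what happens to columns that are not defined in the schema. When it is True, the schema keeps only the defined columns. When it is False, the schema changes the defined columns and passes all other columns through unchanged.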

sf = DFrame.from_raw(data={"A": ["1", "2"], "B": ["foo", "bar"]})
schema = Schema(cols=[Col("A", to="renamed_A")], exclusive=True)
sf = schema(sf)
sf.data.columns
>>> Index(['renamed_A'], dtype='object') # exclusive is true

sf = DFrame.from_raw(data={"A": ["1", "2"], "B": ["foo", "bar"]})
schema = Schema(cols=[Col("A", to="renamed_A")], exclusive=False)
sf = schema(sf)
sf.data.columns
>>> Index(['renamed_A', 'B'], dtype='object') # exclusive is false

The default value of exclusive is True.
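
The column selection rule can be summarised in a few lines of plain Python. This is only a sketch of the behaviour shown in the examples above, not the library's implementation, and output_columns is a hypothetical helper name.

```python
def output_columns(columns, renames, exclusive=True):
    """Return the resulting column names for a given original-to-new mapping."""
    result = []
    for name in columns:
        if name in renames:
            # Defined columns are always kept, under their new name.
            result.append(renames[name])
        elif not exclusive:
            # Undefined columns survive only when exclusive is False.
            result.append(name)
    return result


print(output_columns(["A", "B"], {"A": "renamed_A"}, exclusive=True))   # ['renamed_A']
print(output_columns(["A", "B"], {"A": "renamed_A"}, exclusive=False))  # ['renamed_A', 'B']
```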

ID

The Schema sometimes needs an ID. For example, when a saver has to update existing data, it needs columns that act as IDs to identify the rows to update. To make a column an ID, set the is_id field of its Col to True.
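
To show why an update needs ID columns, here is a small pure-Python sketch of an update-or-insert step keyed on them. It is an illustration under assumptions, not how the seshat savers are implemented: the upsert function and the row layout are hypothetical, and matching rows are simply overwritten.

```python
def upsert(stored_rows, incoming_rows, id_columns):
    """Update rows whose ID values match an incoming row, append the rest."""
    # Index the stored rows by the tuple of their ID values.
    index = {tuple(row[c] for c in id_columns): row for row in stored_rows}
    for row in incoming_rows:
        key = tuple(row[c] for c in id_columns)
        if key in index:
            # Same ID: overwrite the stored row's values in place.
            index[key].update(row)
        else:
            # New ID: add the row to the table.
            stored_rows.append(row)
            index[key] = row
    return stored_rows


table = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
upsert(table, [{"user_id": 2, "name": "rob"}, {"user_id": 3, "name": "cat"}], ["user_id"])
print(table)
# [{'user_id': 1, 'name': 'ann'}, {'user_id': 2, 'name': 'rob'}, {'user_id': 3, 'name': 'cat'}]
```

Without an ID column there would be no way to tell that the incoming row for user 2 refers to an existing row, so every incoming row would have to be appended.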