Schema
The Schema has two main jobs: changing the data type of columns and changing the column names.
Col Config
The Col is a dataclass that serves as a configuration for the schema. The main fields of Col are:
original: the current name of the column
to: the desired column name
dtype: the data type to which the column values should be converted
from seshat.data_class import DFrame
from seshat.transformer.schema import Schema, Col
sf = DFrame.from_raw(data={"A": ["1", "2"]})
schema = Schema(cols=[Col("A", to="renamed_A")])
sf = schema(sf)
sf.data.columns
>>> Index(['renamed_A'], dtype='object')
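The example above only renames a column. To illustrate how renaming and dtype conversion fit together, here is a minimal plain-Python sketch. It does not use seshat and is not its implementation: ColSketch and apply_cols are hypothetical names, the table is a simple dict of lists instead of a DFrame, and treating dtype as a callable such as int is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ColSketch:
    """Hypothetical stand-in for Col; not the seshat class."""
    original: str                      # current column name
    to: Optional[str] = None           # desired column name (None keeps the name)
    dtype: Optional[Callable] = None   # assumption: a callable such as int or float


def apply_cols(table: dict, cols: list) -> dict:
    """Rename and convert the configured columns of a dict-of-lists table."""
    result = dict(table)  # shallow copy; columns without a config are kept as-is
    for col in cols:
        values = result.pop(col.original)
        if col.dtype is not None:
            # Build a new list so the input table is never modified.
            values = [col.dtype(v) for v in values]
        result[col.to or col.original] = values
    return result


table = {"A": ["1", "2"], "B": ["foo", "bar"]}
converted = apply_cols(table, [ColSketch("A", to="renamed_A", dtype=int)])
print(converted)
```

In this sketch the strings in column A become integers under the new name renamed_A, while column B is left unchanged.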
Other Col Config
Schema is a general-purpose component used in several places in the project, so the Col config can also carry fields that only have meaning for a specific purpose.
is_id: Set this field to True on each column that should act as an ID. The ID column is used in flipside sources and in the SQL Database Saver.
update_func: This field applies when you want to update the database; it tells the database how to update each column. It can be sum, mean, or replace. Use replace when the existing column value should be replaced.
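To make the three update_func options more concrete, here is a plain-Python sketch of how an update keyed on an ID column could combine stored and incoming values. It is an illustration under stated assumptions, not the SQL Database Saver's actual logic: combine and update_rows are hypothetical helpers, the address, volume, and label columns are made up, and computing mean as the average of just the stored and incoming values is an assumption.

```python
def combine(existing, incoming, update_func):
    """Combine a stored value with an incoming one (assumed semantics)."""
    if update_func == "sum":
        return existing + incoming
    if update_func == "mean":
        # Assumption: plain average of the two values, not a running mean.
        return (existing + incoming) / 2
    if update_func == "replace":
        return incoming
    raise ValueError(f"unknown update_func: {update_func!r}")


def update_rows(stored, incoming, id_col, update_funcs):
    """Merge incoming rows into stored rows, matching them on id_col."""
    merged = {row[id_col]: dict(row) for row in stored}
    for row in incoming:
        key = row[id_col]
        if key not in merged:
            merged[key] = dict(row)  # unseen ID: insert the row unchanged
            continue
        for name, func in update_funcs.items():
            merged[key][name] = combine(merged[key][name], row[name], func)
    return list(merged.values())


stored = [{"address": "0xabc", "volume": 10, "label": "old"}]
incoming = [
    {"address": "0xabc", "volume": 5, "label": "new"},
    {"address": "0xdef", "volume": 7, "label": "fresh"},
]
result = update_rows(
    stored, incoming, "address", {"volume": "sum", "label": "replace"}
)
print(result)
```

Here the row whose ID already exists has its volume summed and its label replaced, and the row with a new ID is inserted unchanged.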
Exclusive
The exclusive argument indicates whether the schema should keep only the columns defined for it, or change the defined columns and leave all other columns untouched.
sf = DFrame.from_raw(data={"A": ["1", "2"], "B": ["foo", "bar"]})
schema = Schema(cols=[Col("A", to="renamed_A")], exclusive=True)
sf = schema(sf)
sf.data.columns
>>> Index(['renamed_A'], dtype='object') # exclusive is true
sf = DFrame.from_raw(data={"A": ["1", "2"], "B": ["foo", "bar"]})
schema = Schema(cols=[Col("A", to="renamed_A")], exclusive=False)
sf = schema(sf)
sf.data.columns
>>> Index(['renamed_A', 'B'], dtype='object') # exclusive is false
The default value of exclusive is True.
ID
The Schema sometimes needs an ID. For example, when a saver has to update existing data, it needs columns that act as IDs. To make a column an ID, set the is_id field of its Col to True.