Feature View
The "Feature View" manages the retrieval, processing, and optional storage of feature data for machine learning models, accommodating both real-time inference and batch training workflows. The class can handle different data sources and processing pipelines, depending on whether it is operating in online or offline mode.
When the "Feature View" is called, it generally first fetches data from the source. Based on its configuration, it may split the fetched data before passing it into the pipeline. After the feature view has been called, invoking its save method stores the result in the database according to the saving configuration. Additionally, by defining evaluators for the "Feature View," the model output can be evaluated.
As mentioned, the "Feature View" has both online and offline modes. The online mode is used for inference, while the offline mode is used to prepare training data for AI models. Each mode has its own source and pipeline.
Influence on the Project
Feature view is the heart of Seshat, as the design of other sections is directly influenced by the feature view and its philosophy. Feature view gathers everything you want to do in Seshat into one cohesive class, encompassing transformers, pipelines, sources, splitters, profiling, and every other implementation present.
Feature view maintains a consistent flow: it fetches data using sources, transforms it using a pipeline of transformers, splits it if needed, and saves the result if a saver is defined.
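The consistent flow just described (fetch, transform, split, save) can be sketched with stand-in components. Everything below is illustrative, not the real Seshat API: the function name and the trivial callables are assumptions made for the sketch.

```python
# Minimal stand-in sketch of the feature view flow: fetch -> transform -> split -> save.
# All names here are illustrative, not the real Seshat API.

def run_feature_view(source, pipeline, splitter=None, saver=None):
    data = source()                                 # fetch data from the source
    data = pipeline(data)                           # transform it with the pipeline
    result = splitter(data) if splitter else data   # split if a splitter is defined
    if saver:
        saver(result)                               # save if a saver is defined
    return result

# Usage with trivial callables standing in for real components:
fetched = run_feature_view(
    source=lambda: [1, 2, 3, 4],
    pipeline=lambda rows: [r * 2 for r in rows],
    splitter=lambda rows: {"train": rows[:3], "test": rows[3:]},
)
```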
Online & Offline Modes
There are online and offline modes for the feature view. Offline mode is used when you apply transformations to input data and use the result for training models. The online mode, on the other hand, is for inference: it reads the processed data from the source and, with further transformers, produces the output you need.
The feature view generally has both online and offline modes. Setting the online attribute to true enables online mode; setting it to false enables offline mode. The two modes differ in source and pipeline, so you must define these according to your needs.
First, define your feature view:
class MyFeatureView(FeatureView):
    online_pipeline = Pipeline(pipes=[])
    offline_pipeline = Pipeline(pipes=[])
Pipeline
To read more about the pipeline, read its documentation.
Now, if you want it to work in online mode, set the online attribute.
class MyFeatureView(FeatureView):
    online_pipeline = Pipeline(pipes=[])
    offline_pipeline = Pipeline(pipes=[])
    online = True
Equivalently, setting online to false makes it work in offline mode, meaning that it runs the offline_pipeline.
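A hypothetical sketch of how the online flag could select the active pipeline. The attribute names follow the documentation, but the dispatch method (here called get_pipeline) and its internals are illustrative, not the real Seshat implementation:

```python
# Illustrative sketch of mode dispatch via the `online` flag.
# `get_pipeline` and the stand-in pipelines are assumptions, not Seshat's API.

class FeatureViewSketch:
    # Stand-in "pipelines": plain callables tagging their mode.
    online_pipeline = staticmethod(lambda data: ("online", data))
    offline_pipeline = staticmethod(lambda data: ("offline", data))
    online = True

    def get_pipeline(self):
        # Pick the pipeline that matches the current mode.
        return self.online_pipeline if self.online else self.offline_pipeline

view = FeatureViewSketch()
online_result = view.get_pipeline()("rows")   # runs the online pipeline
view.online = False
offline_result = view.get_pipeline()("rows")  # runs the offline pipeline
```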
Source
Like the pipeline, the feature view accepts two attributes called online_source and offline_source. These can be different if needed. For example, the offline source can be a local source, while the online source can be an SQL database source.
If you are not familiar with the Source, check the source documentation to ensure you understand it.
class MyFeatureView(FeatureView):
    # other feature view attributes...
    offline_source = LocalSource(path="path_to_your_local_source")
    online_source = SQLDBSource(url="your_database_url")
Integration with Profiling
Profiling can easily be integrated into the feature view. First, you must configure the profiling, which is done through the profile_config attribute. Let's look at the default profiling configuration of the feature view and learn how to change it.
To learn more about it, see the profiling and logging documentation.
Default Profiling
The default profiling configuration is like this:
class FeatureView:
    # other feature view attributes...
    profile_config = ProfileConfig(logging.INFO, default_tracking=True)
The profile_config must be set to an instance of "ProfileConfig." So, by default, you have these settings for profiling:
- The log level is set to INFO.
- The log directory, where log files are saved, is set to the "./logs" path. If the directory does not exist, profiling creates it.
- Show in console is set to true, meaning that event logs are printed to the console.
- Default tracking is enabled, so the main methods of transformers, sources, and SFrame are tracked. If you register additional tracks using the track decorator (@track), your chosen methods are tracked as well.
- By default, memory profiling and cProfile profiling are disabled.
Customize Configuration
Assume that you want to enable memory profiling and cProfile profiling. You can simply define profile_config and enable them.
class MyFeatureView(FeatureView):
    # other feature view attributes...
    profile_config = ProfileConfig(
        logging.INFO,
        default_tracking=True,
        mem_profile_conf=MemProfileConfig(log_path="memory.txt", enable=True),
        cprofile_conf=CProfileConfig(log_path="cprofile.txt", enable=True),
    )
After that, once your feature view is called, the "./logs" directory will contain these files:
$ cd logs && tree
.
├── cprofile.txt
├── event.txt
└── memory.txt
Example of Event Logs
By enabling profiling and setting show_in_console to true, the event logs are printed to the console.
2024-06-18 13:01:08,283 - INFO - >>> start LocalSource.fetch:
- /path_to_seshat_sdk/seshat/sdk-seshat-python/seshat/source/local/base.py:24
2024-06-18 13:01:08,625 - INFO - >>> finish LocalSource.fetch:
- Memory Changing: +78.21875
- Time Spent in method itself: 2.4417e-05
- Cumulative Time Spent: 0.12808609008789062
- /path_to_seshat_sdk/seshat/sdk-seshat-python/seshat/source/local/base.py:24
2024-06-18 13:01:08,625 - INFO - >>> start Pipeline.__call__:
- /path_to_seshat_sdk/seshat/sdk-seshat-python/seshat/transformer/pipeline/base.py:40
2024-06-18 13:01:08,836 - INFO - >>> finish Pipeline.__call__:
- Memory Changing: +0.015625
- Time Spent in method itself: 1.5830000000000001e-06
- Cumulative Time Spent: 1.4781951904296875e-05
- /path_to_seshat_sdk/seshat/sdk-seshat-python/seshat/transformer/pipeline/base.py:40
Splitting
Splitting is very useful for producing the test and train datasets. You have different options for when to split the data; for example, you can split it right after the data is fetched. First, let's define a splitter for the feature view:
class MyFeatureView(FeatureView):
    # other feature view attributes...
    splitter = Splitter()
By default, splitting occurs at the end of the feature view.
If you want the feature view to split the data at the start, set the split_at_start attribute to true.
class MyFeatureView(FeatureView):
    # other feature view attributes...
    splitter = Splitter()
    split_at_start = True
Test and Train Data
The test and train data can be accessed by calling the test_data and train_data methods, respectively. As you know, the splitter result is a dictionary of SFrames with string keys. When either method is called and the data is not yet split, the split method is invoked; it passes the SFrame to the splitter and keeps the result on the feature view instance. The test and train data are then retrieved from that stored splitting result.
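This lazy behavior can be sketched with a minimal stand-in class. The method names (split, train_data, test_data, split_data) follow the documentation, but the implementation details are assumptions:

```python
# Illustrative sketch of lazy splitting: `test_data`/`train_data` trigger `split`
# only if the data has not been split yet. Not the real Seshat implementation.

class LazySplitView:
    def __init__(self, data, splitter):
        self.data = data
        self.splitter = splitter
        self.split_data = None  # populated on first access

    def split(self):
        if self.split_data is None:
            # The splitter returns a dict keyed by strings, e.g. "train"/"test".
            self.split_data = self.splitter(self.data)

    def train_data(self):
        self.split()
        return self.split_data["train"]

    def test_data(self):
        self.split()
        return self.split_data["test"]

view = LazySplitView(
    data=list(range(10)),
    splitter=lambda rows: {"train": rows[:8], "test": rows[8:]},
)
train = view.train_data()  # triggers the split on first access
test = view.test_data()    # reuses the stored split result
```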
Example
Assume that you have this feature view, with split_at_start equal to false, the default value.
class MyFeatureView(FeatureView):
    offline_pipeline = Pipeline(pipes=[])
    offline_source = LocalSource(path="./data/token_transfer_19000000_19005100.csv")
    splitter = RandomSplitter()
    split_at_start = False

feature_view = MyFeatureView()
feature_view()
test_sf = feature_view.test_data()
train_sf = feature_view.train_data()
If you set split_at_start to true, then splitting occurs automatically after the source is fetched.
class MyFeatureView(FeatureView):
    offline_pipeline = Pipeline(pipes=[])
    offline_source = LocalSource(path="./data/token_transfer_19000000_19005100.csv")
    splitter = RandomSplitter()
    split_at_start = True

feature_view = MyFeatureView()
feature_view()
feature_view.split_data
>>>
{'train': <seshat.data_class.pandas.DFrame at 0x2803c7a90>,
 'test': <seshat.data_class.pandas.DFrame at 0x2803c6fd0>}
Note that for cases where the split occurs at the start, both test and train SFrames will be passed to the pipeline.
Flow
We have already described the overall flow of the feature view from beginning to end, but it is worth explaining it in more detail. The flow runs when the feature view instance is called.
- Set up the profiler: First, the profiler is set up based on its configuration, by calling the setup class method of the Profiler.
- Get the source and the pipeline: As you know, the feature view has two modes, online and offline. Based on the online attribute, the appropriate source and pipeline are looked up from the feature view attributes.
- Fetch the data from the source: The source is called, and the data is fetched.
- Split if necessary: If split_at_start is set to true, the splitter splits the data; otherwise, nothing happens.
- Run the pipeline: The pipeline is called, and the SFrame is passed to it. If the data is split, both the test and train data are passed to the pipeline.
- Tear down the profiler: Finally, the tear_down method of the profiler is called, saving the log files to the desired path.
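The ordering of these steps can be sketched as a runnable function. The event names and the dict-based "view" are stand-ins invented for the sketch; the real Seshat flow may differ in detail:

```python
# A runnable sketch of the documented call order, using stand-in callables.
# Event names and the dict-based "view" are illustrative, not the Seshat API.

def call_feature_view(view, online=False, split_at_start=False):
    events = ["profiler.setup"]                        # 1. set up the profiler
    mode = "online" if online else "offline"           # 2. select source/pipeline by mode
    source = view[f"{mode}_source"]
    pipeline = view[f"{mode}_pipeline"]
    data = source()                                    # 3. fetch the data
    events.append("source.fetch")
    if split_at_start and "splitter" in view:
        data = view["splitter"](data)                  # 4. split before the pipeline
        events.append("splitter")
    if isinstance(data, dict):
        # When split first, both train and test parts go through the pipeline.
        data = {key: pipeline(part) for key, part in data.items()}
    else:
        data = pipeline(data)                          # 5. run the pipeline
    events.append("pipeline")
    events.append("profiler.tear_down")                # 6. tear down the profiler
    return events, data

events, result = call_feature_view(
    {
        "offline_source": lambda: [1, 2, 3],
        "offline_pipeline": lambda rows: [r * 10 for r in rows],
        "splitter": lambda rows: {"train": rows[:2], "test": rows[2:]},
    },
    split_at_start=True,
)
```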
Saving
You can also save the transformed data into the database. If you define a splitter for your feature view, only the train data is passed to the saver; otherwise, the whole processed data is saved.
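The selection rule, as documented, can be sketched as a small helper. The function name and signature are illustrative, not the real Seshat API:

```python
# Sketch of the documented save rule: with a splitter defined, only the train
# data goes to the saver; without one, everything processed is saved.
# Names here are invented for the sketch, not taken from Seshat.

def select_data_to_save(processed, split_result=None):
    if split_result is not None:
        return split_result["train"]  # splitter defined: save train data only
    return processed                  # no splitter: save the whole processed data

unsplit_save = select_data_to_save([1, 2, 3])
split_save = select_data_to_save([1, 2, 3], {"train": [1, 2], "test": [3]})
```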
Example
Saving to the database is not part of the feature view flow, so you must call it after the feature view instance is called.
class MyFeatureView(FeatureView):
    offline_pipeline = Pipeline(pipes=[])
    offline_source = LocalSource(path="./data/token_transfer_19000000_19005100.csv")
    saver = Saver(save_configs=[...])

feature_view = MyFeatureView()
feature_view()
feature_view.save()
By following these steps, you can efficiently manage the entire data processing workflow within the feature view, from fetching and transforming data to saving the processed results in a database. This structured approach ensures a smooth and consistent handling of data, facilitating both real-time inference and batch training workflows.