Feature View

The "Feature View" manages the retrieval, processing, and optional storage of feature data for machine learning models, accommodating both real-time inference and batch training workflows. The class can handle different data sources and processing pipelines, depending on whether it is operating in online or offline mode.

When the "Feature View" is called, it first fetches data from the source. Based on its configuration, it may split the fetched data and then pass it into the pipeline. After the feature view has been called, you can persist the result to the database, according to the saving configuration, by calling the save method. Additionally, by defining evaluators for the "Feature View," the model output can be evaluated.

As mentioned, the "Feature View" has both online and offline modes. The online mode is used for inferencing, while the offline mode is used to find training data for AI models. Each mode has its own source and pipeline.

Influence on the Project

Feature view is the heart of Seshat, as the design of other sections is directly influenced by the feature view and its philosophy. Feature view gathers everything you want to do in Seshat into one cohesive class, encompassing transformers, pipelines, sources, splitters, profiling, and every other implementation present.

Feature view maintains a consistent flow: it fetches data using sources, transforms it using a pipeline of transformers, splits it if needed, and saves the result if a saver is defined.

Online & Offline Modes

There are online and offline modes for the feature view. Offline mode is used when you apply transformations to input data and use the result for training models. The online mode, on the other hand, is for inference: it reads the processed data from its source and applies further transformers to produce the output you need.

The feature view has both online and offline modes. Setting the online attribute to true enables online mode; setting it to false enables offline mode.

The online and offline modes differ in source and pipeline, so you must define these according to your needs.

First, define your feature view:

class MyFeatureView(FeatureView):
    online_pipeline = Pipeline(pipes=[])
    offline_pipeline = Pipeline(pipes=[])

Pipeline

note

To read more about the pipeline, read its documentation.

Now, if you want it to run in online mode, set the online attribute.

class MyFeatureView(FeatureView):
    online_pipeline = Pipeline(pipes=[])
    offline_pipeline = Pipeline(pipes=[])
    online = True

Equivalently, setting online to false makes it work in offline mode, meaning that it runs the offline_pipeline.
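To make the mode switch concrete, here is a minimal, self-contained sketch of how the online flag could dispatch between the two pipelines. This is an illustration only, not Seshat's actual implementation; the active_pipeline method and the list-based transformers are assumptions made for the sketch.

```python
class Pipeline:
    """Toy pipeline: applies each transformer in order (sketch only)."""

    def __init__(self, pipes):
        self.pipes = pipes

    def __call__(self, data):
        for pipe in self.pipes:
            data = pipe(data)
        return data


class FeatureView:
    online = False  # offline mode unless overridden

    def active_pipeline(self):
        # Dispatch on the `online` attribute: online mode runs
        # online_pipeline, offline mode runs offline_pipeline.
        return self.online_pipeline if self.online else self.offline_pipeline


class MyFeatureView(FeatureView):
    online_pipeline = Pipeline(pipes=[lambda d: d + ["online"]])
    offline_pipeline = Pipeline(pipes=[lambda d: d + ["offline"]])
    online = True
```

With this sketch, `MyFeatureView().active_pipeline()` returns the online pipeline; flipping online to false would return the offline one instead.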

Source

Like the pipelines, the feature view accepts two attributes called online_source and offline_source. These can be different if needed. For example, the offline source can be a local file source while the online source is a SQL database source.

note

If you are not familiar with the Source, check the source documentation to ensure you understand it.

class MyFeatureView(FeatureView):
    # other feature view attributes...
    offline_source = LocalSource(path="path_to_your_local_source")
    online_source = SQLDBSource(url="your_database_url")

Integration with Profiling

Profiling can easily integrate into the feature view. First, you must configure the profiling, which can be done in the profile_config. Let's see the default configuration of profiling in the feature view and learn how to change its configuration.

note

To learn more about it, see the profiling and logging documentation.

Default Profiling

The default profiling configuration is like this:

class FeatureView:
    # other feature view attributes...
    profile_config = ProfileConfig(logging.INFO, default_tracking=True)

The profile_config must be set to an instance of "ProfileConfig." So, by default, you have these settings for profiling:

  • The log level is set to INFO.
  • The log directory, where log files are saved, is set to ./logs. If the directory does not exist, profiling creates it.
  • Show in console is set to true, meaning that event logs are printed to the console.
  • Default tracking is enabled, so the main methods of transformers, sources, and SFrame are tracked. If you register additional methods with the track decorator (@track), they are tracked as well.
  • Memory profiling and cProfile profiling are disabled by default.
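Spelled out explicitly, the defaults listed above correspond to a configuration roughly like the following. This is an illustrative sketch: field names such as level, log_dir, and show_in_console are assumptions for the sake of the example, not Seshat's exact signature; only the names appearing in the snippets on this page are taken from the source.

```python
import logging
from dataclasses import dataclass, field


# Illustrative mirror of the default profiling settings described above.
# Field names beyond those shown in this page's snippets are assumptions.
@dataclass
class MemProfileConfig:
    log_path: str = "memory.txt"
    enable: bool = False  # memory profiling disabled by default


@dataclass
class CProfileConfig:
    log_path: str = "cprofile.txt"
    enable: bool = False  # cProfile profiling disabled by default


@dataclass
class ProfileConfig:
    level: int = logging.INFO           # log level INFO
    log_dir: str = "./logs"             # created if it does not exist
    show_in_console: bool = True        # event logs echoed to the console
    default_tracking: bool = True       # track transformers, sources, SFrame
    mem_profile_conf: MemProfileConfig = field(default_factory=MemProfileConfig)
    cprofile_conf: CProfileConfig = field(default_factory=CProfileConfig)
```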

Customize Configuration

Assume that you want to enable memory profiling and cProfile profiling. Simply define profile_config and enable them:

class MyFeatureView(FeatureView):
    # other feature view attributes...
    profile_config = ProfileConfig(
        logging.INFO,
        default_tracking=True,
        mem_profile_conf=MemProfileConfig(log_path="memory.txt", enable=True),
        cprofile_conf=CProfileConfig(log_path="cprofile.txt", enable=True),
    )

After your feature view is called, the ./logs directory will contain these files:

$ cd logs && tree
.
├── cprofile.txt
├── event.txt
└── memory.txt

Example of Event Logs

By enabling profiling and setting show_in_console to true, the event logs are printed to the console.

2024-06-18 13:01:08,283 - INFO - >>> start LocalSource.fetch:
- /path_to_seshat_sdk/seshat/sdk-seshat-python/seshat/source/local/base.py:24
2024-06-18 13:01:08,625 - INFO - >>> finish LocalSource.fetch:
- Memory Changing: +78.21875
- Time Spent in method itself: 2.4417e-05
- Cumulative Time Spent: 0.12808609008789062
- /path_to_seshat_sdk/seshat/sdk-seshat-python/seshat/source/local/base.py:24
2024-06-18 13:01:08,625 - INFO - >>> start Pipeline.__call__:
- /path_to_seshat_sdk/seshat/sdk-seshat-python/seshat/transformer/pipeline/base.py:40
2024-06-18 13:01:08,836 - INFO - >>> finish Pipeline.__call__:
- Memory Changing: +0.015625
- Time Spent in method itself: 1.5830000000000001e-06
- Cumulative Time Spent: 1.4781951904296875e-05
- /path_to_seshat_sdk/seshat/sdk-seshat-python/seshat/transformer/pipeline/base.py:40

Splitting

Splitting is useful for producing train and test datasets. You have different options for when to split the data; for example, you can split it right after the data is fetched. First, define a splitter for the feature view:

class MyFeatureView(FeatureView):
    # other feature view attributes...
    splitter = Splitter()

By default, splitting occurs at the end of the feature view.

If you want the feature view to split the data at the start, set the split_at_start attribute to true.

class MyFeatureView(FeatureView):
    # other feature view attributes...
    splitter = Splitter()
    split_at_start = True

Test and Train Data

The test and train data are accessed by calling the test_data and train_data methods, respectively. Recall that the splitter's result is a dictionary of SFrames keyed by strings. When either method is called and the data is not yet split, the split method runs first: it passes the SFrame to the splitter and stores the result on the feature view instance. The test and train data are then retrieved from that stored splitting result.
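The lazy behavior just described can be sketched with a small, self-contained class. This is a simplified illustration, not Seshat's code: the constructor arguments and the split_data attribute name are assumptions; only test_data, train_data, and split mirror the methods mentioned above.

```python
class LazySplitFeatureView:
    """Sketch of lazy splitting: the splitter runs at most once."""

    def __init__(self, splitter, sframe):
        self.splitter = splitter
        self.sframe = sframe
        self.split_data = None  # filled in on first access

    def split(self):
        # Only invoke the splitter if there is no cached result yet.
        if self.split_data is None:
            self.split_data = self.splitter(self.sframe)
        return self.split_data

    def train_data(self):
        return self.split()["train"]

    def test_data(self):
        return self.split()["test"]
```

Calling train_data and then test_data runs the splitter only once; both methods read from the cached dictionary.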

Example

Assume that you have this feature view, with split_at_start equal to false, the default value.

class MyFeatureView(FeatureView):
    offline_pipeline = Pipeline(pipes=[])
    offline_source = LocalSource(path="./data/token_transfer_19000000_19005100.csv")
    splitter = RandomSplitter()
    split_at_start = False

feature_view = MyFeatureView()
feature_view()
test_sf = feature_view.test_data()
train_sf = feature_view.train_data()

If you set split_at_start to true, splitting occurs automatically after the source is fetched.

class MyFeatureView(FeatureView):
    offline_pipeline = Pipeline(pipes=[])
    offline_source = LocalSource(path="./data/token_transfer_19000000_19005100.csv")
    splitter = RandomSplitter()
    split_at_start = True

feature_view = MyFeatureView()
feature_view()
feature_view.split_data
>>> {'train': <seshat.data_class.pandas.DFrame at 0x2803c7a90>,
     'test': <seshat.data_class.pandas.DFrame at 0x2803c6fd0>}

Note that for cases where the split occurs at the start, both test and train SFrames will be passed to the pipeline.

Flow

We already described the overall flow of the feature view from beginning to end; here it is in more detail. This flow runs when the feature view instance is called.

  • Set up the profiler: First, the profiler is set up based on its configuration, by calling the setup class method of the Profiler.
  • Get the source and the pipeline: The feature view has two modes, online and offline. Based on the online attribute, the corresponding source and pipeline are selected from the feature view attributes.
  • Fetch the data from the source: The source is called, and the data is fetched.
  • Split if necessary: If split_at_start is set to true, the splitter splits the data; otherwise, nothing happens at this stage.
  • Run the pipeline: The pipeline is called with the SFrame. If the data has been split, both the test and train data are passed to the pipeline.
  • Tear down the profiler: Finally, the profiler's tear_down method is called, saving the log files to the configured path.
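The steps above can be condensed into a sketch of the call flow. This is a hypothetical simplification with stubbed collaborators, not the real Seshat internals; the constructor and the profiler/source objects are assumptions made so the sketch is self-contained.

```python
class FlowSketch:
    """Condensed sketch of the feature view call flow (not the real code)."""

    def __init__(self, *, profiler, source, pipeline,
                 splitter=None, split_at_start=False):
        self.profiler = profiler
        self.source = source
        self.pipeline = pipeline
        self.splitter = splitter
        self.split_at_start = split_at_start

    def __call__(self):
        self.profiler.setup()                      # 1. set up the profiler
        sf = self.source.fetch()                   # 2-3. fetch from the mode's source
        if self.split_at_start and self.splitter:  # 4. optional early split
            sf = self.splitter(sf)                 #    both frames go onward
        result = self.pipeline(sf)                 # 5. run the pipeline
        self.profiler.tear_down()                  # 6. flush the log files
        return result
```

Running an instance with stub objects confirms the ordering: setup, fetch, pipeline, tear_down.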

Saving

You can also save the transformed data to the database. If you have defined a splitter for your feature view, the train data is passed to the saver; otherwise, the whole processed dataset is saved.
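The saver's input selection can be sketched as follows. This is assumed, simplified logic for illustration; the attribute names result and split_data are hypothetical stand-ins for wherever the feature view keeps its processed and split data.

```python
class SaveSketch:
    """Sketch: with a splitter defined, only the train split is saved."""

    def __init__(self, saver, splitter=None):
        self.saver = saver
        self.splitter = splitter
        self.result = None      # whole processed dataset (no splitter case)
        self.split_data = None  # {"train": ..., "test": ...} when split

    def save(self):
        if self.splitter is not None:
            data = self.split_data["train"]  # only train data is persisted
        else:
            data = self.result               # no splitter: save everything
        return self.saver(data)
```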

Example

Saving to the database is not part of the feature view flow, so you must call the save method yourself after the feature view instance has been called.

class MyFeatureView(FeatureView):
    offline_pipeline = Pipeline(pipes=[])
    offline_source = LocalSource(path="./data/token_transfer_19000000_19005100.csv")
    saver = Saver(save_configs=[...])

feature_view = MyFeatureView()
feature_view()
feature_view.save()

By following these steps, you can efficiently manage the entire data processing workflow within the feature view, from fetching and transforming data to saving the processed results in a database. This structured approach ensures a smooth and consistent handling of data, facilitating both real-time inference and batch training workflows.