Feature set API
Registering a feature set
To register a feature set, you first need to obtain the schema. See Schema API for information on how to create the schema.
- Python
- Scala
project.feature_sets.register(schema, "feature_set_name", description="", primary_key=None, time_travel_column=None, time_travel_column_format="yyyy-MM-dd HH:mm:ss", secret=False, partition_by=None, time_travel_column_as_partition=False, flow=None)
If the secret argument is set to True, the feature set is visible only to its owners (which also means all owners of the project where this feature set is being registered). Other users in the system cannot see the feature set in the output of the "list feature sets" call and cannot view the feature set details.
If the partition_by argument is not set, Feature Store uses the time travel column to partition the layout by each ingestion. If partition_by is defined, time_travel_column_as_partition can be set to True to additionally use time-travel-based partitioning.
The feature_sets.register and feature_set.flow methods use the FeatureSetFlow enum. An enum (enumeration) is a fundamental programming concept that allows developers to define a set of named values; enums provide a convenient way to group related values and make code more readable and maintainable.
If the flow argument is set, it influences where data is stored. The following values (from the FeatureSetFlow enum) are supported (see the sketch after this list):
- FeatureSetFlow.OFFLINE_ONLY - data is stored only in the offline feature store. Online ingestion and materialization are disabled.
- FeatureSetFlow.ONLINE_ONLY - data is stored only in the online feature store. Offline ingestion and materialization are disabled.
- FeatureSetFlow.OFFLINE_ONLINE_MANUAL - data is stored in both the offline and online Feature Store, but automatic materialization to online is disabled. Propagating data from online to offline is automatic, but offline to online is manual and must be triggered by an online materialization call.
- FeatureSetFlow.OFFLINE_ONLINE_AUTOMATIC - data is stored in both the offline and online Feature Store, and automatic materialization to online is enabled. Data is propagated automatically in both directions (offline to online and online to offline), so you don't have to call materialize_online; it is done automatically.
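As a minimal sketch, a registration that keeps data only in the offline store might look like the following. The import path for FeatureSetFlow, the feature set name, and the primary key column are assumptions for illustration:
from featurestore import FeatureSetFlow  # import path is an assumption
fs = project.feature_sets.register(
    schema,
    "driver_stats",  # hypothetical feature set name
    description="Driver statistics",
    primary_key=["driver_id"],  # hypothetical primary key feature
    flow=FeatureSetFlow.OFFLINE_ONLY,  # disables online ingestion and materialization
)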
project.featureSets.register(schema, "feature_set_name", description="", primaryKey=Seq(), timeTravelColumn="", timeTravelColumnFormat="yyyy-MM-dd HH:mm:ss", secret=false, partitionBy="", timeTravelColumnAsPartition=false, flow="")
If the secret argument is set to true, the feature set is visible only to its owners (which also means all owners of the project where this feature set is being registered). Other users in the system cannot see the feature set in the output of the "list feature sets" call and cannot view the feature set details.
If the partitionBy argument is not set, Feature Store uses the time travel column to partition the layout by each ingestion. If partitionBy is defined, timeTravelColumnAsPartition can be set to true to additionally use time-travel-based partitioning.
The featureSets.register and featureSets.flow methods use the FeatureSetFlow enum. An enum (enumeration) is a fundamental programming concept that allows developers to define a set of named values; enums provide a convenient way to group related values and make code more readable and maintainable.
If the flow argument is set, it influences where data is stored. The following values (from the enumeration ai.h2o.featurestore.core.FeatureSetFlow) are supported:
- FeatureSetFlow.OFFLINE_ONLY - data is stored only in the offline feature store. Online ingestion and materialization are disabled.
- FeatureSetFlow.ONLINE_ONLY - data is stored only in the online feature store. Offline ingestion and materialization are disabled.
- FeatureSetFlow.OFFLINE_ONLINE_MANUAL - data is stored in both the offline and online Feature Store, but automatic materialization to online is disabled. Propagating data from online to offline is automatic, but offline to online is manual and must be triggered by an online materialization call.
- FeatureSetFlow.OFFLINE_ONLINE_AUTOMATIC - data is stored in both the offline and online Feature Store, and automatic materialization to online is enabled. Data is propagated automatically in both directions (offline to online and online to offline), so you don't have to call materializeOnline; it is done automatically.
If the primary key or partition by arguments contain the same feature multiple times, only the distinct values are used.
If a value in the primary key, partition by, or time travel column arguments corresponds to two or more features, the most nested feature is selected by default. To select a specific feature instead, enclose the feature name in backticks (``).
For example, suppose a feature set contains a feature named "test.data" and a second feature "test" with a nested feature "data". By default, the value "test.data" selects the nested feature "data". To select the feature named "test.data" instead, change the value to "`test.data`".
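The two selections above can be sketched as follows (the feature set name is hypothetical):
# selects the nested feature "data" inside the feature "test"
project.feature_sets.register(schema, "events", primary_key=["test.data"])
# selects the flat feature literally named "test.data"
project.feature_sets.register(schema, "events", primary_key=["`test.data`"])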
Feature Store uses the time format used by Spark. The specification is available at https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
To see naming conventions for feature set names, please visit Default naming rules.
To register a derived feature set, you first need to obtain the derived schema. See Schema API for information on how to create the schema.
- Python
- Scala
import featurestore.transformations as t
spark_pipeline_transformation = t.SparkPipeline("...")
derived_schema = client.extract_derived_schema([parent_feature_set], spark_pipeline_transformation)
project.feature_sets.register(derived_schema, "derived_feature_set", description="", primary_key=None, time_travel_column=None, time_travel_column_format="yyyy-MM-dd HH:mm:ss", secret=False, partition_by=None, time_travel_column_as_partition=False)
import ai.h2o.featurestore.core.transformations.SparkPipeline
val sparkPipelineTransformation = SparkPipeline("...")
val derivedSchema = client.extractDerivedSchema(Seq(parentFeatureSet), sparkPipelineTransformation)
project.featureSets.register(derivedSchema, "derived_feature_set", description="", primaryKey=Seq(), timeTravelColumn="", timeTravelColumnFormat="yyyy-MM-dd HH:mm:ss", secret=false, partitionBy="", timeTravelColumnAsPartition=false)
Features can be masked by setting Special Data fields in the schema. For further information, please visit Modify special data on a schema.
Setting any of the following attributes to true marks the feature for masking (see the sketch after this list):
- spi
- pci
- rpi
- demographic
- sensitive
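As a hedged sketch, marking schema columns for masking might look like the following. The way a column is looked up on the schema object and the column names are assumptions for illustration (the feature set API shown later in this page uses fs.features["col"].special_data in the same spirit):
# assumption: schema columns expose a special_data holder, as features do on a feature set
schema["card_number"].special_data.pci = True  # hypothetical column
schema["ssn"].special_data.spi = True  # hypothetical column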
Feature Store does not support registering feature sets with the following characters in column names:
- ,
- ;
- { or }
- ( or )
- new line character
- tab character
- =
Time travel column selection
You can specify a time travel column during the registration call. If the column is specified, Feature Store will use that column to obtain time travel data and will use it for incremental ingest purposes. The explicitly passed time travel column must be present in the schema passed to the registration call.
If the time travel column is not specified, a virtual one is created, so you can still do time travel on static feature sets. Each ingestion to this feature set is treated as a new batch of data with a new timestamp.
Use the following register method argument to specify the name of the time travel column explicitly:
- Python
- Scala
time_travel_column
timeTravelColumn
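For instance, a registration that pins the time travel column explicitly might look like this (the feature set name and column name are hypothetical; the column must be present in the schema):
fs = project.feature_sets.register(
    schema,
    "sales",  # hypothetical feature set name
    time_travel_column="event_time",  # hypothetical column present in the schema
    time_travel_column_format="yyyy-MM-dd HH:mm:ss",
)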
Inferring the data type of date-time columns during feature set registration
File types without schema information: For file types that have no metadata about column types (e.g., CSV), Feature Store detects date-time columns as regular strings.
File types containing schema information: For file types that keep information about the data types (e.g., Parquet), Feature Store respects those types. If a date-time column is stored with a type of Timestamp or Date, Feature Store will respect that during the registration.
Listing feature sets within a project
The list method does not return feature sets directly. Instead, it returns an iterator which obtains the feature sets lazily.
- Python
- Scala
project.feature_sets.list(query=None, advanced_search_options=None)
project.featureSets.list(query="", advancedSearchOption=Seq())
The query and advancedSearchOption arguments are optional and specify which feature sets should be returned. By default, no filtering options are specified.
To filter feature sets by name, description, or tags, please use the query parameter.
- Python
- Scala
project.feature_sets.list(query="My feature")
project.featureSets.list(query="My feature")
The advancedSearchOption parameter allows you to filter feature sets by feature name, description, or tags.
To provide the advancedSearchOption in your requests, follow these steps:
- Python
- Scala
from featurestore.core.search_operator import SearchOperator
from featurestore.core.search_field import SearchField
from featurestore import AdvancedSearchOption
search_options = [AdvancedSearchOption(search_operator=SearchOperator.SEARCH_OPERATOR_LIKE, search_field=SearchField.SEARCH_FIELD_FEATURE_NAME, search_value="super feature")]
project.feature_sets.list(advanced_search_options=search_options)
import ai.h2o.featurestore.core.SearchOperator
import ai.h2o.featurestore.core.SearchField
import ai.h2o.featurestore.core.entities.AdvancedSearchOption
val searchOptions = Seq(AdvancedSearchOption(SearchOperator.SEARCH_OPERATOR_LIKE, SearchField.SEARCH_FIELD_FEATURE_NAME, "super feature"))
project.featureSets.list(advancedSearchOption=searchOptions)
Both parameters can be used together, as shown below.
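For example, combining the query with the advanced search options from the snippet above, and looping over the lazily returned feature sets:
# reuses search_options from the previous example
for fs in project.feature_sets.list(query="My feature", advanced_search_options=search_options):
    print(fs)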
Obtaining a feature set
- Python
- Scala
fs = project.feature_sets.get("feature_set_name", version=None)
val fs = project.featureSets.get("feature_set_name")
or
val fs = project.featureSets.get("feature_set_name", "version")
If the version is not specified, the latest version of the feature set is returned.
You can also list all major versions of the feature set:
- Python
- Scala
fs.major_versions()
fs.majorVersions()
This call shows all major versions of the feature set (the current and previous ones).
You can also list all versions of the feature set:
- Python
- Scala
fs.list_versions()
fs.listVersions()
This call shows all versions of the feature set (the current and previous ones).
Previewing data
You can preview up to a maximum of 100 rows and 50 features.
- Python
- Scala
fs.get_preview()
fs.getPreview()
Setting feature set permissions
Refer to Permissions for more information.
Deleting feature sets
- Python
- Scala
fs = project.feature_sets.get("name")
fs.delete()
val fs = project.featureSets.get("name")
fs.delete()
Deleting feature set major versions
- Python
- Scala
fs = project.feature_sets.get("name")
major_versions = fs.major_versions()
major_versions[0].delete()
val fs = project.featureSets.get("name")
val majorVersions = fs.majorVersions()
majorVersions(0).delete()
Updating feature set fields
To update a field, simply call the setter of that field, for example:
- Python
- Scala
fs = project.feature_sets.get("name")
fs.secret = False
fs.deprecated = True
fs.time_to_live.offline = 2
fs.special_data.legal.approved = True
fs.special_data.legal.notes = "Legal notes"
fs.features["col"].special_data.legal.approved = True
fs.features["col"].special_data.legal.notes = "Legal notes"
# Add a new tag to the feature set
fs.tags.append("new tag") # This will add the new tag to the list of existing tags
# Add new tags that will overwrite any existing tags
fs.tags = ["new tag 1", "new tag 2"] # This will overwrite the existing tags with the given list of values
# Assigning a string to tags is not supported
fs.tags = "new tag" # This operation is not supported as tags accepts only a list of strings as input
# Add a new value to the data source domains on the feature set
fs.data_source_domains.append("new domain") # This will add the new domain to the list of existing domains
# Add new domains that will overwrite any existing domains
fs.data_source_domains = ["new domain 1", "new domain 2"] # This will overwrite the existing domains with the given list of values
# Assigning a string to domain is not supported
fs.data_source_domains = "new domain" # This operation is not supported as domain accepts only a list of strings as input
val fs = project.featureSets.get("name")
fs.secret = false
fs.deprecated = true
fs.timeToLive.offline = 2
fs.specialData.legal.approved = true
fs.specialData.legal.notes = "Legal notes"
fs.features("col").specialData.legal.approved = true
fs.features("col").specialData.legal.notes = "Legal notes"
// Add a new tag to the feature set
fs.tags = fs.tags :+ "new tag" // This will add the new tag to the list of existing tags
// Add new tags that will overwrite any existing tags
fs.tags = Seq("new tag 1", "new tag 2") // This will overwrite the existing tags with the given seq of values
// Assigning a string to tags is not supported
fs.tags = "new tag" // This operation is not supported as tags accepts only a seq of strings as input
// Add a new value to the data source domains on the feature set
fs.dataSourceDomains = fs.dataSourceDomains :+ "new domain" // This will add the new domain to the list of existing domains
// Add new domains that will overwrite any existing domains
fs.dataSourceDomains = Seq("new domain 1", "new domain 2") // This will overwrite the existing domains with the given seq of values
// Assigning a string to domains is not supported
fs.dataSourceDomains = "new domain" // This operation is not supported as dataSourceDomains accepts only a seq of strings as input
The feature type can be changed as follows:
- Python
- Scala
from featurestore.core.entities.feature import CATEGORICAL
fs = project.feature_sets.get("name")
feature = fs.features["feature"]
feature.profile.feature_type = CATEGORICAL
import ai.h2o.featurestore.core.entities.Feature.CATEGORICAL
val fs = project.featureSets.get("name")
val feature = fs.features("feature")
feature.profile.featureType = CATEGORICAL
The following list of fields can be updated on the feature set object:
- Python
- Scala
- tags
- data_source_domains
- feature_set_type
- description
- application_name
- application_id
- deprecated
- process_interval
- process_interval_unit
- flow
- feature_set_state
- secret
- time_to_live.ttl_offline
- time_to_live.ttl_offline_interval
- time_to_live.ttl_online
- time_to_live.ttl_online_interval
- special_data.legal.approved
- special_data.legal.notes
- feature[].status
- feature[].profile.feature_type
- feature[].importance
- feature[].description
- feature[].special
- feature[].monitoring.anomaly_detection
- feature[].classifiers
- tags
- dataSourceDomains
- featureSetType
- description
- applicationName
- applicationId
- deprecated
- processInterval
- processIntervalUnit
- flow
- featureSetState
- secret
- timeToLive.ttlOffline
- timeToLive.ttlOfflineInterval
- timeToLive.ttlOnline
- timeToLive.ttlOnlineInterval
- specialData.legal.approved
- specialData.legal.notes
- feature[].status
- feature[].profile.featureType
- feature[].importance
- feature[].description
- feature[].special
- feature[].monitoring.anomalyDetection
- feature[].classifiers
To retrospectively find out who updated a feature set and when, call:
- Python
- Scala
fs.last_updated_by
fs.last_updated_date_time
fs.lastUpdatedBy
fs.lastUpdatedDateTime
Recommendation and classifiers
Refer to the Recommendation API for more information.
New version API
Refer to the Create new feature set version API for more information.
Feature set schema API
Getting schema
To get a feature set's schema, run:
- Python
- Scala
fs = project.feature_sets.get("gdp")
fs.schema.get()
val fs = project.featureSets.get("gdp")
fs.schema.get()
Checking schema compatibility
To compare a feature set's schema with a new data source's schema, run:
- Python
- Scala
fs = project.feature_sets.get("gdp")
new_schema = client.extract_schema_from_source(<source>)
fs.schema.is_compatible_with(new_schema, compare_data_types=True)
val fs = project.featureSets.get("gdp")
val newSchema = client.extractSchemaFromSource(<source>)
fs.schema.isCompatibleWith(newSchema, compareDataTypes=true)
Parameters explanation:
- Python
- Scala
- new_schema: the new schema to check compatibility with.
- compare_data_types: accepts True/False; indicates whether data types need to be compared.
  - If compare_data_types is True, then data types for features with the same name will be verified.
  - If compare_data_types is False, then data types for features with the same name will not be verified.
- newSchema: the new schema to check compatibility with.
- compareDataTypes: accepts true/false; indicates whether data types need to be compared.
  - If compareDataTypes is true, then data types for features with the same name will be verified.
  - If compareDataTypes is false, then data types for features with the same name will not be verified.
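A minimal usage sketch, assuming the call returns a boolean:
# check the new source's schema against the registered feature set's schema
if fs.schema.is_compatible_with(new_schema, compare_data_types=True):
    print("Schemas are compatible")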
Patching new schema
Patching a schema checks for matching features between the new schema and the existing fs.schema. If there is a match, metadata such as special_data and description is copied into the new schema.
To patch the new schema with feature set's schema, run:
- Python
- Scala
fs = project.feature_sets.get("gdp")
new_schema = client.extract_schema_from_source(<source>)
fs.schema.patch_from(new_schema, compare_data_types=True)
val fs = project.featureSets.get("gdp")
val newSchema = client.extractSchemaFromSource(<source>)
fs.schema.patchFrom(newSchema, compareDataTypes=true)
Parameters explanation:
- Python
- Scala
- new_schema: the new schema that needs to be patched.
- compare_data_types: accepts True/False; indicates whether data types are to be compared while patching.
  - If compare_data_types is True, then the data type from the feature set schema is retained for features with the same name and different types.
  - If compare_data_types is False, then the data type from the new schema is retained for features with the same name and different types.
- newSchema: the new schema that needs to be patched.
- compareDataTypes: accepts true/false; indicates whether data types are to be compared while patching.
  - If compareDataTypes is true, then the data type from the feature set schema is retained for features with the same name and different types.
  - If compareDataTypes is false, then the data type from the new schema is retained for features with the same name and different types.
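One possible end-to-end flow combining the compatibility check and patching from the snippets above (the source placeholder stands in for a real data source, as in the originals):
new_schema = client.extract_schema_from_source(<source>)
# verify the feature names line up before carrying metadata over
if fs.schema.is_compatible_with(new_schema, compare_data_types=False):
    fs.schema.patch_from(new_schema, compare_data_types=True)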
Offline to online API
To push existing data from the offline Feature Store into the online one, run:
Blocking approach:
- Python
- Scala
feature_set.materialize_online()
featureSet.materializeOnline()
Non-Blocking approach:
- Python
- Scala
future = feature_set.materialize_online_async()
future = featureSet.materializeOnlineAsync()
A feature set must have a primary key and a time travel column defined in order to materialize the offline store into the online store.
More information about asynchronous methods is available at Asynchronous methods.
Subsequent calls to materialization only push the new records since the last call to online.
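For example, only the delta since the previous materialization is pushed on each call (a minimal sketch):
feature_set.materialize_online()  # first call: pushes all existing offline records
# ...an offline ingestion adds new records...
feature_set.materialize_online()  # second call: pushes only the newly ingested records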
Feature set jobs API
You can get the list of jobs that are currently being processed for a specific feature set by running:
- Python
- Scala
You can also retrieve a specific type of job by specifying the job_type parameter.
from featurestore.core.job_types import INGEST, RETRIEVE, EXTRACT_SCHEMA
fs.get_active_jobs()
fs.get_active_jobs(job_type=INGEST)
You can also retrieve a specific type of job by specifying the jobType parameter.
import ai.h2o.featurestore.core.JobTypes.{INGEST, RETRIEVE, EXTRACT_SCHEMA}
fs.getActiveJobs()
fs.getActiveJobs(jobType=INGEST)
Refreshing feature set
To refresh the feature set to contain the latest information, call:
- Python
- Scala
fs.refresh()
fs.refresh()
Getting recommendations
To get recommendations, call:
- Python
- Scala
fs.get_recommendations()
fs.getRecommendations()
The following conditions must hold for recommendations:
- The feature set must have at least one classifier defined.
- The results will be based on the retrieve permissions of the user.
Marking feature as target variable
When feature sets are used to train ML models, it can be beneficial to know which feature was used as the model's target variable. To communicate this knowledge between different feature set users, you can mark a feature as a target variable, discard that mark, and list the marked features.
- Python
- Scala
feature_state = fs.features["state"]
feature_state.mark_as_target_variable()
fs.list_features_used_as_target_variable()
feature_state.discard_as_target_variable()
val featureState = fs.features("state")
featureState.markAsTargetVariable()
fs.listFeaturesUsedAsTargetVariable()
featureState.discardAsTargetVariable()
Listing feature set users
From the feature set owner's perspective, it may be necessary to understand who is actually allowed to access and modify a given feature set. Therefore, there are convenience methods to list feature set users according to their rights. Each of these methods returns a list of users that have the specified or higher rights, their actual access rights, and a resource type (project or feature set) specifying where the access right permission comes from.
The list method does not return users directly. Instead, it returns an iterator which obtains the users lazily.
- Python
- Scala
# listing users by access rights
fs = project.feature_sets.get("training_fs")
owners = fs.list_owners()
editors = fs.list_editors()
sensitive_consumers = fs.list_sensitive_consumers()
consumers = fs.list_consumers()
viewers = fs.list_viewers()
# accessing returned element
owner = next(owners)
owner.user
owner.access_type
owner.resource_type
// listing users by access rights
val fs = project.featureSets.get("training_fs")
val owners = fs.listOwners()
val editors = fs.listEditors()
val sensitiveConsumers = fs.listSensitiveConsumers()
val consumers = fs.listConsumers()
val viewers = fs.listViewers()
// accessing returned element
val owner = owners.next
owner.user
owner.accessType
owner.resourceType
Artifacts
Refer to the Artifacts API for more information.