Version: 1.2.0

Schema API

A schema is extracted from a data source. The schema represents the features of the feature set.

Creating the schema

  • create_from is available on the Schema class and creates a schema instance from a string-formatted schema
  • create_derived_from is available on the Schema class and creates a derived schema instance from a string-formatted schema, a parent feature set, and a transformation
  • to_string is available on a schema instance and serializes the schema object to string format
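The string format accepted by create_from is a comma-separated list of `name type` pairs. As an illustration only (this is not how the library parses schemas), such a string can be split into (name, type) pairs with plain Python:

```python
# Illustration only: split a string-formatted schema of the kind
# accepted by Schema.create_from into (name, type) pairs.
def parse_schema_string(schema_str):
    columns = []
    for column in schema_str.split(","):
        # The first space separates the column name from its data type
        name, _, data_type = column.strip().partition(" ")
        columns.append((name, data_type.strip()))
    return columns

print(parse_schema_string("col1 string, col2 string, col3 integer"))
# [('col1', 'string'), ('col2', 'string'), ('col3', 'integer')]
```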

Usage

Create a schema from a string

A schema can be created from a string format:

from featurestore import Schema
schema_str = "col1 string, col2 string, col3 integer"
schema = Schema.create_from(schema_str)

Create a derived schema from a string

from featurestore import Schema
import featurestore.transformations as t
spark_pipeline_transformation = t.SparkPipeline("...")
schema_str = "id INT, text STRING, label DOUBLE, state STRING, date STRING, words ARRAY<STRING>"
# parent_feature_set is an existing feature set, e.g. project.feature_sets.get("example")
schema = Schema.create_derived_from(schema_str, [parent_feature_set], spark_pipeline_transformation)

Create a schema from a data source

A schema can also be created from a data source. To see all supported data sources, see Supported data sources.

schema = client.extract_schema_from_source(source)
schema = client.extract_schema_from_source(source, credentials)
note

An optional parameter, credentials, can be specified. If specified, these credentials are used instead of environment variables.

Create a schema from a feature set

feature_set = project.feature_sets.get("example")
schema = Schema.create_from(feature_set)

Create a derived schema from a parent feature set with applied transformation

A derived schema can be created from an existing feature set using a selected transformation. To see all supported transformations, see Supported derived transformation.

import featurestore.transformations as t
spark_pipeline_transformation = t.SparkPipeline("...")

schema = client.extract_derived_schema([parent_feature_set], spark_pipeline_transformation)

Load schema from a feature set

You can also load a schema from an existing feature set:

schema = feature_set.schema.get()

Create a new schema by changing the data type of the current schema

from featurestore.core.data_types import STRING
schema["col"].data_type = STRING
# nested columns
schema["col1"].schema["col2"].data_type = STRING

Create a new schema by column selection

# features is a list of feature names
new_schema = schema.select(features)   # keep only the listed features
new_schema = schema.exclude(features)  # drop the listed features

Create a new schema by adding a new feature schema

from featurestore.core.data_types import STRING
from featurestore import FeatureSchema
new_feature_schema = FeatureSchema("new_name", STRING)
# Append
new_schema = schema.append(new_feature_schema)  # append to the end
new_schema = schema.append(new_feature_schema, schema["old"])  # append after "old"
# Prepend
new_schema = schema.prepend(new_feature_schema)  # prepend to the beginning
new_schema = schema.prepend(new_feature_schema, schema["old"])  # prepend before "old"

Modify special data on a schema

schema["col1"].special_data.sensitive = True
schema["col2"].special_data.spi = True
# Nested feature modification
schema["col3"].schema["col4"].special_data.pci = True
note

Available special data fields on the Schema object are spi, pci, rpi, demographic, and sensitive. These are boolean fields and can be set to either True or False.

Modify feature type

from featurestore.core.entities.feature import *
schema["col1"].feature_type = NUMERICAL
schema["col2"].feature_type = AUTOMATIC_DISCOVERY
# Nested feature modification
schema["col3"].schema["col4"].feature_type = TEXT

AUTOMATIC_DISCOVERY means that the feature type is determined automatically on the backend based on the feature's data type. AUTOMATIC_DISCOVERY is the default value for all feature types in a schema.

Set feature description

It is also possible to provide a description for a feature schema. This description is propagated to the feature.

schema["col1"].description = "The best feature"

Set feature classifier

Features in a feature set can be tagged by a classifier from a predefined list. The classifier on the feature denotes the type of data stored in the feature.

client.classifiers.list()  # returns all classifiers configured on the backend
schema["col1"].classifier = "emailId"

Save schema as string

A schema can be serialized to string format:

str_schema = schema.to_string()
