Supported data sources
Data must first be ingested into Feature Store before it can be used. Ingestion is the act of uploading data into Feature Store.
Feature Store supports reading data from the following sources:
CSV
CSV file format. Supported path locations are S3 buckets and Azure Blob Storage.
User API:
- Python
- Scala
Parameters:
- path: String - path to the CSV file
- delimiter: String - values delimiter
source = CSVFile(path=..., delimiter=...)
Parameters:
- path: String - path to the CSV file
- delimiter: String - values delimiter
val source = CSVFile(path=..., delimiter=...)
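For example, a minimal Python sketch might look like the following; the S3 path and delimiter are placeholder values:
from featurestore import *

# Hypothetical values: point the source at a comma-delimited CSV file in an S3 bucket.
source = CSVFile(path="s3://my-bucket/transactions/2023-01.csv", delimiter=",")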
CSV folder
CSV folder source. Supported path locations are S3 buckets and Azure Blob Storage.
User API:
- Python
- Scala
Parameters:
- root_folder: String - path to the root folder
- delimiter: String - values delimiter
- filter_pattern: String - pattern to locate the files. To match the files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern. For example, filter_pattern="data/.*/.*/.*comp/.*" will match the file "data/1996-03-03/1/1679-comp/hello.json".
source = CSVFolder(root_folder=..., delimiter=..., filter_pattern=...)
Parameters:
- rootFolder: String - path to the root folder
- delimiter: String - values delimiter
- filterPattern: String - pattern to locate the files. To match the files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern. For example, filterPattern="data/.*/.*/.*comp/.*" will match the file "data/1996-03-03/1/1679-comp/hello.json".
val source = CSVFolder(rootFolder=..., delimiter=..., filterPattern=...)
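For example, a minimal Python sketch that reuses the pattern from the parameter description above; the root folder is a placeholder value:
from featurestore import *

# Hypothetical example: match files such as "data/1996-03-03/1/1679-comp/hello.csv" under the root folder.
source = CSVFolder(
    root_folder="s3://my-bucket",
    delimiter=",",
    filter_pattern="data/.*/.*/.*comp/.*",
)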
Parquet
Parquet file format. Supported path locations are S3 buckets and Azure Blob Storage.
User API:
- Python
- Scala
Parameters:
- path: String - path to the Parquet file
source = ParquetFile(path=...)
Parameters:
- path: String - path to the Parquet file
val source = ParquetFile(path=...)
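For example, a minimal Python sketch; the path is a placeholder value:
from featurestore import *

# Hypothetical example: a single Parquet file stored in an S3 bucket.
source = ParquetFile(path="s3://my-bucket/events/events.parquet")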
Parquet folder
Parquet folder source. Supported path locations are S3 buckets and Azure Blob Storage.
User API:
- Python
- Scala
Parameters:
- root_folder: String - path to the root folder
- filter_pattern: String - pattern to locate the files. To match the files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern. For example, filter_pattern="data/.*/.*/.*comp/.*" will match the file "data/1996-03-03/1/1679-comp/hello.json".
source = ParquetFolder(root_folder=..., filter_pattern=...)
Parameters:
- rootFolder: String - path to the root folder
- filterPattern: String - pattern to locate the files. To match the files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern. For example, filterPattern="data/.*/.*/.*comp/.*" will match the file "data/1996-03-03/1/1679-comp/hello.json".
val source = ParquetFolder(rootFolder=..., filterPattern=...)
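For example, a minimal Python sketch reusing the pattern from the parameter description above; the root folder is a placeholder value:
from featurestore import *

# Hypothetical example: select Parquet files such as "data/1996-03-03/1/1679-comp/part-0000.parquet".
source = ParquetFolder(
    root_folder="s3://my-bucket",
    filter_pattern="data/.*/.*/.*comp/.*",
)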
JSON
JSON file format. Supported path locations are S3 buckets and Azure Blob Storage. Different types of JSON formats are supported; read more here to learn which types of JSON files are supported. By default, multiline is set to False.
User API:
- Python
- Scala
Parameters:
- path: String - path to the JSON file
- multiline: Boolean - True when the input is JSON where one entry spans multiple lines, otherwise False
source = JSONFile(path=..., multiline=...)
Parameters:
- path: String - path to the JSON file
- multiline: Boolean - true when the input is JSON where one entry spans multiple lines, otherwise false
val source = JsonFile(path=..., multiline=...)
Please keep in mind that a JSON object is an unordered set of name/value pairs. This means that using JSON files to extract a schema can produce a schema whose feature order differs from the order in the file.
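For example, a minimal Python sketch; the path is a placeholder value:
from featurestore import *

# Hypothetical example: a JSON file where each record spans multiple lines, so multiline is set to True.
source = JSONFile(path="s3://my-bucket/customers.json", multiline=True)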
JSON folder
JSON folder source. Supported path locations are S3 buckets and Azure Blob Storage.
User API:
- Python
- Scala
Parameters:
- root_folder: String - path to the root folder
- multiline: Boolean - True when the input is JSON where one entry spans multiple lines, otherwise False
- filter_pattern: String - pattern to locate the files. To match the files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern. For example, filter_pattern="data/.*/.*/.*comp/.*" will match the file "data/1996-03-03/1/1679-comp/hello.json".
source = JSONFolder(root_folder=..., multiline=..., filter_pattern=...)
Parameters:
- rootFolder: String - path to the root folder
- multiline: Boolean - true when the input is JSON where one entry spans multiple lines, otherwise false
- filterPattern: String - pattern to locate the files. To match the files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern. For example, filterPattern="data/.*/.*/.*comp/.*" will match the file "data/1996-03-03/1/1679-comp/hello.json".
val source = JsonFolder(rootFolder=..., multiline=..., filterPattern=...)
Please keep in mind that a JSON object is an unordered set of name/value pairs. This means that using JSON files to extract a schema can produce a schema whose feature order differs from the order in the file.
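For example, a minimal Python sketch combining the parameters above; the root folder is a placeholder value:
from featurestore import *

# Hypothetical example: single-line JSON records, selected by the pattern from the description above.
source = JSONFolder(
    root_folder="s3://my-bucket",
    multiline=False,
    filter_pattern="data/.*/.*/.*comp/.*",
)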
MongoDB
Data stored in MongoDB can also be accessed by Feature Store. For MongoDB authentication, the following environment variables are used to provide user credentials:
- MONGODB_USER
- MONGODB_PASSWORD
User API:
- Python
- Scala
Parameters:
- connection_uri: String - a MongoDB server URI, e.g. connection_uri="mongodb+srv://my_cluster.mongodb.net/test"
- database: String - name of a database on the server, e.g. database="sample_guides"
- collection: String - name of a collection to read the data from, e.g. collection="planets"
source = MongoDbCollection(connection_uri=..., database=..., collection=...)
Parameters:
- connectionUri: String - a MongoDB server URI, e.g. connectionUri="mongodb+srv://my_cluster.mongodb.net/test"
- database: String - name of a database on the server, e.g. database="sample_guides"
- collection: String - name of a collection to read the data from, e.g. collection="planets"
val source = MongoDbCollection(connectionUri=..., database=..., collection=...)
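For example, a minimal Python sketch using the example values from the parameter descriptions; the credentials are placeholders and would normally be set outside the script:
import os
from featurestore import *

# Hypothetical credentials; Feature Store reads them from these environment variables.
os.environ["MONGODB_USER"] = "reader"
os.environ["MONGODB_PASSWORD"] = "secret"

source = MongoDbCollection(
    connection_uri="mongodb+srv://my_cluster.mongodb.net/test",
    database="sample_guides",
    collection="planets",
)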
Delta table
Delta table format. The table can be stored in either S3 or Azure Blob Storage.
User API:
- Python
- Scala
Parameters:
- path: String - path to the Delta table
- version: Int - (optional) version of the Delta table
- timestamp: String - (optional) timestamp of the data in the table
- filter: DeltaTableFilter - (optional) filter on the Delta table
source = DeltaTable(path=..., version=..., timestamp=..., filter=...)
Parameters:
- path: String - path to the Delta table
- version: Int - (optional) version of the Delta table
- timestamp: String - (optional) timestamp of the data in the table
- filter: DeltaTableFilter - (optional) filter on the Delta table
val source = DeltaTable(path=..., version=..., timestamp=..., filter=...)
DeltaTableFilter API:
- Python
- Scala
Parameters:
- column: String - name of the column
- operator: String - operator to be applied
- value: String|Double|Boolean - value to be applied in the filter
delta_filter = DeltaTableFilter(column=..., operator=..., value=...)
Parameters:
- column: String - name of the column
- operator: String - operator to be applied
- value: String|Double|Boolean - value to be applied in the filter
val deltaFilter = DeltaTableFilter(column=..., operator=..., value=...)
The operator can be one of the following: <, <=, >, >=, =!= and ==.
Valid parameter combinations
- Path
- Path, Version
- Path, Version, Filter
- Path, Timestamp
- Path, Timestamp, Filter
- Path, Filter
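For example, a minimal Python sketch of the Path, Version, Filter combination; the path, version, and filter values are placeholders:
from featurestore import *

# Hypothetical example: version 3 of a Delta table, filtered to rows where the "country" column equals "US".
delta_filter = DeltaTableFilter(column="country", operator="==", value="US")
source = DeltaTable(
    path="s3://my-bucket/delta/transactions",
    version=3,
    filter=delta_filter,
)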
JDBC
JDBC table format. Currently, we support the following JDBC connections:
- PostgreSQL
- Teradata
User API:
- Python
- Scala
Parameters:
- connection_url: String - connection string including the database name
- table: String - table to fetch data from
- query: String - query to fetch data from
- partition_options: PartitionOptions - (optional) parameters to enable parallel execution; applicable only when table is specified. PartitionOptions consists of num_partitions, partition_column, lower_bound, upper_bound and fetch_size.
source = JdbcTable(connection_url=..., table=..., partition_options=PartitionOptions(num_partitions = ..., partition_column = ..., lower_bound = ..., upper_bound = ..., fetch_size=...))
source = JdbcTable(connection_url=..., query=...)
Parameters:
- table: String - table to fetch data from
- connectionUrl: String - connection string including the database name
- query: String - query to fetch data from
- partitionOptions: PartitionOptions - (optional) parameters to enable parallel execution; applicable only when table is specified. PartitionOptions consists of numPartitions, partitionColumn, lowerBound, upperBound and fetchSize.
val source = JdbcTable(connectionUrl=..., table=...,partitionOptions = new PartitionOptions(numPartitions = ..., partitionColumn = ..., lowerBound = ..., upperBound = ..., fetchSize = ...))
val source = JdbcTable(connectionUrl=..., query=...)
The format of the connection URL is a standard JDBC connection string, such as:
- For Teradata: jdbc:teradata://host:port/database
- For PostgreSQL: jdbc:postgresql://host:port/database
The database is a mandatory part of the connection string in the case of Feature Store. Note that only one of table or query is supported at the same time. Additionally, PartitionOptions can only be specified together with table, and all of these options must be specified if any of them is. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are only used to decide the partition stride, not to filter the rows in the table; all rows in the table will be partitioned and returned. This option applies only to reading.
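For example, a minimal Python sketch for PostgreSQL with parallel reads; the host, database, table, and partition values are placeholders:
from featurestore import *

# Hypothetical example: read a PostgreSQL table in parallel across 4 partitions.
source = JdbcTable(
    connection_url="jdbc:postgresql://db.example.com:5432/shop",
    table="public.orders",
    partition_options=PartitionOptions(
        num_partitions=4,
        partition_column="order_id",  # numeric column; the bounds below only set the partition stride
        lower_bound=1,
        upper_bound=1000000,
        fetch_size=10000,
    ),
)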
Snowflake table
Extract data from Snowflake tables or queries.
User API:
- Python
- Scala
Parameters:
- table: String - table to fetch data from
- database: String - Snowflake database
- url: String - URL of the Snowflake instance
- query: String - query to fetch data from
- warehouse: String - Snowflake warehouse
- schema: String - Snowflake schema
- insecure: Boolean - if True, Snowflake will not perform SSL verification
- proxy: Proxy object - proxy specification
- role: String - Snowflake role
- account: String - Snowflake account name
The table and query parameters cannot be configured simultaneously.
from featurestore import *
proxy = Proxy(host=..., port=..., user=..., password=...)
source = SnowflakeTable(table=..., database=..., url=..., query=..., warehouse=..., schema=..., insecure=...,
proxy=..., role=..., account=...)
Parameters:
- table: String - table to fetch data from
- database: String - Snowflake database
- url: String - URL of the Snowflake instance
- query: String - query to fetch data from
- warehouse: String - Snowflake warehouse
- schema: String - Snowflake schema
- insecure: Boolean - if true, Snowflake will not perform SSL verification
- proxy: Proxy object - proxy specification
- role: String - Snowflake role
- account: String - Snowflake account name
The table and query parameters cannot be configured simultaneously.
val proxy = Proxy(host=..., port=..., user=..., password=...)
val source = SnowflakeTable(table=..., database=..., url=..., query=..., warehouse=..., schema=..., insecure=...,
proxy=..., role=..., account=...)
A proxy is an optional argument in the Snowflake data source API. If a proxy is not being used, the proxy configuration can simply be set to None.
Users can make use of a proxy only if the proxy feature is enabled by the administrator of the Snowflake account, so confirm that proxy support is enabled before attempting to configure a proxy in the Snowflake data source API.
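For example, a minimal Python sketch that reads a whole table without a proxy; all identifiers (URL, warehouse, schema, role, account) are placeholders, and query is omitted because table and query cannot be combined:
from featurestore import *

# Hypothetical example: read the CUSTOMERS table from Snowflake with SSL verification enabled and no proxy.
source = SnowflakeTable(
    table="CUSTOMERS",
    database="ANALYTICS",
    url="https://myorg-myaccount.snowflakecomputing.com",
    warehouse="COMPUTE_WH",
    schema="PUBLIC",
    insecure=False,
    proxy=None,
    role="ANALYST",
    account="myorg-myaccount",
)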
Snowflake Cursor object
Extract data from Snowflake tables or queries.
User API:
The Snowflake Cursor object is currently only supported in the Python client.
Parameters:
- database: String - Snowflake database
- url: String - URL of the Snowflake instance
- warehouse: String - Snowflake warehouse
- schema: String - Snowflake schema
- snowflake_cursor: Object - Snowflake cursor
- insecure: Boolean - if True, Snowflake will not perform SSL verification
- proxy: Proxy object - proxy specification
- role: String - Snowflake role
- account: String - Snowflake account name
source = SnowflakeCursor(database=..., url=..., warehouse=..., schema=..., snowflake_cursor=..., insecure=...,
proxy=..., role=..., account=...)
Database snippet:
Internally, the Snowflake Cursor is converted to a SnowflakeTable with query and is therefore saved in the same format in the database.
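For example, a minimal Python sketch; the cursor here is created with the standard snowflake-connector-python package (an assumption of this sketch), and all credentials and identifiers are placeholders:
import snowflake.connector
from featurestore import *

# Hypothetical example: reuse the cursor of an existing Snowflake connection.
conn = snowflake.connector.connect(user="reader", password="secret", account="myorg-myaccount")
cursor = conn.cursor()
cursor.execute("SELECT * FROM ANALYTICS.PUBLIC.CUSTOMERS")  # hypothetical query executed on the cursor

source = SnowflakeCursor(
    database="ANALYTICS",
    url="https://myorg-myaccount.snowflakecomputing.com",
    warehouse="COMPUTE_WH",
    schema="PUBLIC",
    snowflake_cursor=cursor,
    insecure=False,
    proxy=None,
    role="ANALYST",
    account="myorg-myaccount",
)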
Spark Data Frame
When using a Spark Data Frame as the source, several conditions must first be met. Read about the Spark dependencies to understand these requirements.
User API:
- Python
- Scala
Parameters:
- dataframe: DataFrame - Spark Data Frame instance
source = SparkDataFrame(dataframe=...)
Parameters:
- dataframe: DataFrame - Spark Data Frame instance
val source = SparkDataFrame(dataframe=...)
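For example, a minimal Python sketch that wraps a small in-memory Spark Data Frame; it assumes the Spark dependencies mentioned above are already in place, and the frame contents are placeholders:
from pyspark.sql import SparkSession
from featurestore import *

# Hypothetical example: build a tiny Spark Data Frame and hand it to Feature Store.
spark = SparkSession.builder.appName("feature-store-ingest").getOrCreate()
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

source = SparkDataFrame(dataframe=df)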