Supported data sources
Data must first be ingested into Feature Store before it can be used. Ingestion is the act of uploading data into Feature Store.
Feature Store supports reading data from the following protocols:
- s3 (internally reuses the s3a client)
- s3a
- wasbs (encrypted) and wasb (legacy)
- abfss (encrypted) and abfs (legacy)
- http/https (data is uploaded to internal storage)
- drive (to read files from H2O Drive)
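As an illustration of how these protocols appear in practice, the sketch below shows typical path formats by scheme. Every bucket, container, account, and file name is a placeholder, and the exact Azure URL layout should be checked against your own storage account.
# Illustrative path formats only; all names below are placeholders.
s3_path = "s3a://my-bucket/data/transactions.csv"                             # S3 (s3:// works as well)
wasbs_path = "wasbs://my-container@myaccount.blob.core.windows.net/data.csv"  # Azure Blob Storage (encrypted)
abfss_path = "abfss://my-filesystem@myaccount.dfs.core.windows.net/data.csv"  # ADLS Gen2 (encrypted)
https_path = "https://example.com/datasets/events.csv"                        # uploaded to internal storage
drive_path = "drive://my-subdirectory/example-file.csv"                       # H2O Drive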
CSV
CSV file format. Supported path locations are S3 bucket, Azure Blob Storage, HTTP/HTTPS URL, and H2O Drive.
User API:
- Python
- Scala
Parameters:
- path: String - path to the CSV file
- delimiter: String - values delimiter
source = CSVFile(path=..., delimiter=...)
Parameters:
- path: String - path to the CSV file
- delimiter: String - values delimiter
val source = CSVFile(path=..., delimiter=...)
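As a concrete Python sketch, where the bucket and file names are placeholders (any of the supported locations listed above would work):
source = CSVFile(path="s3a://my-bucket/data/users.csv", delimiter=",")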
CSV folder
CSV Folder source. Supported path locations are S3 bucket and Azure Blob Storage.
User API:
- Python
- Scala
Parameters:
- root_folder: String - path to the root folder
- delimiter: String - values delimiter
- filter_pattern: String - pattern used to locate the files. To match files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern. For example, filter_pattern="data/.*/.*/.*comp/.*" will match the file "data/1996-03-03/1/1679-comp/hello.json".
source = CSVFolder(root_folder=..., delimiter=..., filter_pattern=...)
Parameters:
- rootFolder: String - path to the root folder
- delimiter: String - values delimiter
- filterPattern: String - pattern used to locate the files. To match files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern. For example, filterPattern="data/.*/.*/.*comp/.*" will match the file "data/1996-03-03/1/1679-comp/hello.json".
val source = CSVFolder(rootFolder=..., delimiter=..., filterPattern=...)
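A hedged Python sketch that reuses the filter pattern from the description above; the bucket name and the matched file name in the comment are placeholders:
# The pattern contains five expressions separated by "/", so it matches files at depth 5,
# e.g. "data/1996-03-03/1/1679-comp/part-0001.csv" under the root folder.
source = CSVFolder(
    root_folder="s3a://my-bucket",  # placeholder root location
    delimiter=",",
    filter_pattern="data/.*/.*/.*comp/.*",
)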
Parquet
Parquet file format. Supported path locations are S3 bucket, Azure Blob Storage, HTTP/HTTPS URL, and H2O Drive.
User API:
- Python
- Scala
Parameters:
- path: String - path to the Parquet file
source = ParquetFile(path=...)
Parameters:
- path: String - path to the Parquet file
val source = ParquetFile(path=...)
Parquet folder
Parquet folder source. Supported path locations are S3 bucket and Azure Blob Storage.
User API:
- Python
- Scala
Parameters:
- root_folder: String - path to the root folder
- filter_pattern: String - pattern used to locate the files. To match files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern. For example, filter_pattern="data/.*/.*/.*comp/.*" will match the file "data/1996-03-03/1/1679-comp/hello.json".
source = ParquetFolder(root_folder=..., filter_pattern=...)
Parameters:
- rootFolder: String - path to the root folder
- filterPattern: String - pattern used to locate the files. To match files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern. For example, filterPattern="data/.*/.*/.*comp/.*" will match the file "data/1996-03-03/1/1679-comp/hello.json".
val source = ParquetFolder(rootFolder=..., filterPattern=...)
JSON
JSON file format. Supported path locations are S3 bucket, Azure Blob Storage, HTTP/HTTPS URL, and H2O Drive. Different types of JSON formats are supported; read more here to learn which types of JSON files are supported. By default, multiline is set to False.
User API:
- Python
- Scala
Parameters:
- path: String - path to the JSON file
- multiline: Boolean - True if one JSON entry spans multiple lines, otherwise False
source = JSONFile(path=..., multiline=...)
Parameters:
- path: String - path to the JSON file
- multiline: Boolean - true if one JSON entry spans multiple lines, otherwise false
val source = JsonFile(path=..., multiline=...)
Please keep in mind that a JSON object is an unordered set of name/value pairs. This means that using JSON files for extracting schema can produce a schema with a different order of features than that used in the file.
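To make the multiline flag concrete, here is a hedged Python sketch; the file contents shown in the comments and the paths are made up:
# One JSON record per line (JSON Lines) -> multiline=False (the default):
#   {"id": 1, "name": "alpha"}
#   {"id": 2, "name": "beta"}
source_lines = JSONFile(path="s3a://my-bucket/events.jsonl", multiline=False)

# A record (or array of records) spread across several lines -> multiline=True:
#   [
#     {"id": 1, "name": "alpha"},
#     {"id": 2, "name": "beta"}
#   ]
source_pretty = JSONFile(path="s3a://my-bucket/events.json", multiline=True)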
JSON folder
JSON folder source. Supported path locations are S3 bucket and Azure Blob Storage.
User API:
- Python
- Scala
Parameters:
- root_folder: String - path to the root folder
- multiline: Boolean - True if one JSON entry spans multiple lines, otherwise False
- filter_pattern: String - pattern used to locate the files. To match files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern. For example, filter_pattern="data/.*/.*/.*comp/.*" will match the file "data/1996-03-03/1/1679-comp/hello.json".
source = JSONFolder(root_folder=..., multiline=..., filter_pattern=...)
Parameters:
- rootFolder: String - path to the root folder
- multiline: Boolean - true if one JSON entry spans multiple lines, otherwise false
- filterPattern: String - pattern used to locate the files. To match files at depth "N", the filter pattern must contain N expressions separated by "/", where each expression is either an exact string or a regex pattern. For example, filterPattern="data/.*/.*/.*comp/.*" will match the file "data/1996-03-03/1/1679-comp/hello.json".
val source = JsonFolder(rootFolder=..., multiline=..., filterPattern=...)
Please keep in mind that a JSON object is an unordered set of name/value pairs. This means that using JSON files for extracting schema can produce a schema with a different order of features than that used in the file.
MongoDB
Feature Store can also access data stored in MongoDB. For MongoDB authentication, the following environment variables are used to provide user credentials:
- MONGODB_USER
- MONGODB_PASSWORD
User API:
- Python
- Scala
Parameters:
- connection_uri: String - a MongoDB server URI, e.g. connection_uri="mongodb+srv://my_cluster.mongodb.net/test"
- database: String - name of a database on the server, e.g. database="sample_guides"
- collection: String - name of a collection to read the data from, e.g. collection="planets"
source = MongoDbCollection(connection_uri=..., database=..., collection=...)
Parameters:
- connectionUri: String - a MongoDB server URI, e.g. connectionUri="mongodb+srv://my_cluster.mongodb.net/test"
- database: String - name of a database on the server, e.g. database="sample_guides"
- collection: String - name of a collection to read the data from, e.g. collection="planets"
val source = MongoDbCollection(connectionUri=..., database=..., collection=...)
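A hedged end-to-end Python sketch using the example values above; exporting the variables from Python is only one way to provide the credentials, and the user name and password are placeholders:
import os

# Credentials are picked up from environment variables, not passed to the API.
os.environ["MONGODB_USER"] = "my_user"          # placeholder
os.environ["MONGODB_PASSWORD"] = "my_password"  # placeholder

source = MongoDbCollection(
    connection_uri="mongodb+srv://my_cluster.mongodb.net/test",
    database="sample_guides",
    collection="planets",
)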
Delta table
Delta table format. The table can be stored in either S3 or Azure Blob Storage.
User API:
- Python
- Scala
Parameters:
- path: String - path to the Delta table
- version: Int - (Optional) version of the Delta table
- timestamp: String - (Optional) timestamp of the data in the table
- filter: DeltaTableFilter - (Optional) filter on the Delta table
source = DeltaTable(path=..., version=..., timestamp=..., filter=...)
Parameters:
- path: String - path to the Delta table
- version: Int - (Optional) version of the Delta table
- timestamp: String - (Optional) timestamp of the data in the table
- filter: DeltaTableFilter - (Optional) filter on the Delta table
val source = DeltaTable(path=..., version=..., timestamp=..., filter=...)
DeltaTableFilter API:
- Python
- Scala
Parameters:
- column: String - name of the column
- operator: String - operator to be applied
- value: String|Double|Boolean - value to be applied in the filter
delta_filter = DeltaTableFilter(column=..., operator=..., value=...)
Parameters:
- column: String - name of the column
- operator: String - operator to be applied
- value: String|Double|Boolean - value to be applied in the filter
val deltaFilter = DeltaTableFilter(column=..., operator=..., value=...)
The operator can be one of the following: <, <=, >, >=, =!=, and ==.
Supported operators
The following operators are supported: ==, <, >, <=, and >=.
Valid parameter combinations
- Path
- Path, Version
- Path, Version, Filter
- Path, Timestamp
- Path, Timestamp, Filter
- Path, Filter
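For illustration, the combinations above translate to calls like the following hedged Python sketch; the path, version, timestamp, and filter values are placeholders:
# Path only
source = DeltaTable(path="s3a://my-bucket/delta/events")

# Path, Version, Filter
source = DeltaTable(
    path="s3a://my-bucket/delta/events",
    version=3,
    filter=DeltaTableFilter(column="country", operator="==", value="US"),
)

# Path, Timestamp
source = DeltaTable(path="s3a://my-bucket/delta/events", timestamp="2023-01-01 00:00:00")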
JDBC
JDBC table format. Currently, we support the following JDBC connections:
- PostgreSQL
- Teradata
User API:
- Python
- Scala
Parameters:
- connection_url: String - connection string including the database name
- table: String - table to fetch data from
- query: String - query to fetch data from
- partition_options: PartitionOptions - (Optional) parameters to enable parallel execution; applicable only when table is specified. PartitionOptions consists of num_partitions, partition_column, lower_bound, upper_bound, and fetch_size.
source = JdbcTable(connection_url=..., table=..., partition_options=PartitionOptions(num_partitions=..., partition_column=..., lower_bound=..., upper_bound=..., fetch_size=...))
source = JdbcTable(connection_url=..., query=...)
Parameters:
- connectionUrl: String - connection string including the database name
- table: String - table to fetch data from
- query: String - query to fetch data from
- partitionOptions: PartitionOptions - (Optional) parameters to enable parallel execution; applicable only when table is specified. PartitionOptions consists of numPartitions, partitionColumn, lowerBound, upperBound, and fetchSize.
val source = JdbcTable(connectionUrl=..., table=..., partitionOptions=new PartitionOptions(numPartitions=..., partitionColumn=..., lowerBound=..., upperBound=..., fetchSize=...))
val source = JdbcTable(connectionUrl=..., query=...)
The format of the connection URL is a standard JDBC connection string, such as:
- For Teradata: jdbc:teradata://host:port/database
- For PostgreSQL: jdbc:postgresql://host:port/database
The database is a mandatory part of the connection string in the case of Feature Store. Note that only one of table or query can be specified at the same time. Additionally, PartitionOptions can only be specified together with table. If any of these options is specified, they must all be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Note that lowerBound and upperBound are only used to decide the partition stride, not to filter the rows in the table; all rows in the table are partitioned and returned. These options apply only to reading.
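A hedged Python sketch of a parallel read over a numeric column; the connection details, table name, and bounds are placeholders:
source = JdbcTable(
    connection_url="jdbc:postgresql://host:5432/mydb",  # database name is mandatory
    table="transactions",                               # placeholder table name
    partition_options=PartitionOptions(
        num_partitions=8,        # number of parallel reads
        partition_column="id",   # must be a numeric, date, or timestamp column
        lower_bound=1,           # with upper_bound, only sets the partition stride
        upper_bound=1000000,     # rows outside the bounds are still returned
        fetch_size=10000,
    ),
)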
Snowflake table
Extract data from Snowflake tables or queries.
User API:
- Python
- Scala
Parameters:
- table: String - table to fetch data from
- database: String - Snowflake database
- url: String - URL of the Snowflake instance
- query: String - query to fetch data from
- warehouse: String - Snowflake warehouse
- schema: String - Snowflake schema
- insecure: Boolean - if True, Snowflake will not perform SSL verification
- proxy: Proxy object - proxy specification
- role: String - Snowflake role
- account: String - Snowflake account name
The table and query parameters cannot be configured simultaneously.
from featurestore import *
proxy = Proxy(host=..., port=..., user=..., password=...)
source = SnowflakeTable(table=..., database=..., url=..., query=..., warehouse=..., schema=..., insecure=...,
proxy=..., role=..., account=...)
Parameters:
- table: String - table to fetch data from
- database: String - Snowflake database
- url: String - URL of the Snowflake instance
- query: String - query to fetch data from
- warehouse: String - Snowflake warehouse
- schema: String - Snowflake schema
- insecure: Boolean - if true, Snowflake will not perform SSL verification
- proxy: Proxy object - proxy specification
- role: String - Snowflake role
- account: String - Snowflake account name
The table and query parameters cannot be configured simultaneously.
val proxy = Proxy(host=..., port=..., user=..., password=...)
val source = SnowflakeTable(table=..., database=..., url=..., query=..., warehouse=..., schema=..., insecure=...,
proxy=..., role=..., account=...)
A proxy is an optional argument in the Snowflake data source API. If a proxy is not being used, the proxy configuration can simply be set to None.
The use of a proxy is possible for users only if the proxy feature is enabled by the administrator of the Snowflake account. Therefore, it is important to confirm whether proxy support is enabled before attempting to configure a proxy in the Snowflake data source API.
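A hedged Python sketch of both cases; every connection value below is a placeholder, and the URL format should be checked against your own Snowflake account:
# Without a proxy, set proxy to None.
source = SnowflakeTable(
    table="CUSTOMERS", database="MY_DB", url="https://myaccount.snowflakecomputing.com",
    warehouse="MY_WH", schema="PUBLIC", insecure=False, proxy=None, role="ANALYST", account="myaccount",
)

# With a proxy (only when proxy support is enabled for the Snowflake account).
proxy = Proxy(host="proxy.example.com", port=8080, user="proxy_user", password="proxy_password")
source = SnowflakeTable(
    table="CUSTOMERS", database="MY_DB", url="https://myaccount.snowflakecomputing.com",
    warehouse="MY_WH", schema="PUBLIC", insecure=False, proxy=proxy, role="ANALYST", account="myaccount",
)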
Snowflake Cursor object
Extract data from Snowflake tables or queries.
User API:
The Snowflake Cursor object is currently only supported in the Python client.
Parameters:
- database: String - Snowflake database
- url: String - URL of the Snowflake instance
- warehouse: String - Snowflake warehouse
- schema: String - Snowflake schema
- snowflake_cursor: Object - Snowflake cursor
- insecure: Boolean - if True, Snowflake will not perform SSL verification
- proxy: Proxy object - proxy specification
- role: String - Snowflake role
- account: String - Snowflake account name
source = SnowflakeCursor(database=..., url=..., warehouse=..., schema=..., snowflake_cursor=..., insecure=...,
proxy=..., role=..., account=...)
Database snippet: Internally, the Snowflake Cursor is converted to a SnowflakeTable with a query and is therefore saved in the same format in the database.
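A hedged sketch using the snowflake-connector-python package to obtain the cursor; all connection values are placeholders, and the assumption that the cursor should already have executed the query of interest is ours, not taken from the documentation above:
import snowflake.connector

# Open a Snowflake connection and create the cursor that will be handed to Feature Store.
conn = snowflake.connector.connect(user="my_user", password="my_password", account="myaccount")
cursor = conn.cursor()
cursor.execute("SELECT * FROM MY_DB.PUBLIC.CUSTOMERS")  # assumed: the cursor carries the query

source = SnowflakeCursor(
    database="MY_DB", url="https://myaccount.snowflakecomputing.com", warehouse="MY_WH",
    schema="PUBLIC", snowflake_cursor=cursor, insecure=False, proxy=None, role="ANALYST", account="myaccount",
)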
Spark Data Frame
When using a Spark Data Frame as the source, several conditions must be met first. Read about the Spark dependencies to understand these requirements.
User API:
- Python
- Scala
Parameters:
- dataframe: DataFrame - Spark Data Frame instance
source = SparkDataFrame(dataframe=...)
Parameters:
- dataframe: DataFrame - Spark Data Frame instance
val source = SparkDataFrame(dataframe=...)
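A hedged Python sketch, assuming an active SparkSession and that the Spark dependency requirements mentioned above are already satisfied; the input path is a placeholder:
from pyspark.sql import SparkSession

# Build (or reuse) a Spark session and load some data into a DataFrame.
spark = SparkSession.builder.getOrCreate()
dataframe = spark.read.csv("s3a://my-bucket/data/users.csv", header=True)  # placeholder path

source = SparkDataFrame(dataframe=dataframe)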
Accessing H2O Drive Data
When the H2O Drive application is running in the same cloud environment as Feature Store, users can access the files they have uploaded to H2O Drive. To refer to those files, specify the scheme as drive. However, due to technical limitations, access to H2O Drive files is currently not possible when the user is authenticated to Feature Store via a PAT token.
Examples
- Python
- Scala
source_1 = CSVFile("drive://example-file-1.csv")
source_2 = CSVFile("drive://my-subdirectory/example-file-2.csv")
val source1 = CSVFile("drive://example-file-1.csv")
val source2 = CSVFile("drive://my-subdirectory/example-file-2.csv")