This example shows how to parse a small CSV file.
The well-known iris dataset is used.
The H2O examples use the h2o.importFile method, which is a "pull-into-h2o" operation.
This is a fast, scalable, highly optimized way to read data: H2O initiates the data transfer as a read operation and pulls the data from a data store.
library(ex)
path = ex.locate("data/iris/iris_wheader.csv")
df = read.csv(path)
stopifnot(nrow(df) == 150)
library(h2o)
library(ex)
path = ex.locate("data/iris/iris_wheader.csv")
h2o.init()
h2odf = h2o.importFile(path)
stopifnot(nrow(h2odf) == 150)
import h2o
h2o.init()
path = "data/iris/iris_wheader.csv"
df = h2o.import_file(path=path)
assert df.nrow == 150
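The assertions above check for 150 rows even though the file carries a header line, because the header is not counted as data. A minimal sketch of that rule with the standard csv module follows; the three sample rows, the fifth column name `class`, and the helper `count_data_rows` are illustrative assumptions, not the contents of iris_wheader.csv.

```python
import csv
import io

# Hypothetical three-row sample laid out like a CSV file with a header line.
# The first four column names follow the slicing examples; "class" is assumed.
SAMPLE = """sepal_len,sepal_wid,petal_len,petal_wid,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""

def count_data_rows(text):
    """Count the data rows in CSV text whose first line is a header."""
    reader = csv.DictReader(io.StringIO(text))  # first line becomes the field names
    return sum(1 for _ in reader)

n_rows = count_data_rows(SAMPLE)
print(n_rows)
```

The sample has four lines, but only the three lines after the header are counted as rows.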
This example shows how to parse a small CSV file.
The well-known iris dataset is used.
The H2O examples use the h2o.uploadFile method, which is a "push-to-h2o" operation.
This approach is not scalable and is intended only for smaller data sizes. The client pushes the data from a local filesystem (for example, your laptop where R is running) to H2O.
For big-data operations, you don't want the data stored on, or flowing through, the client.
library(h2o)
library(ex)
path = ex.locate("data/iris/iris_wheader.csv")
h2o.init()
h2odf = h2o.uploadFile(path)
stopifnot(nrow(h2odf) == 150)
This example shows how to slice columns from a frame of data.
H2O slices out columns of data lazily and materializes a shared copy only when an IO operation triggers it.
path <- "data/iris/iris_wheader.csv"
df <- read.csv(path)
# slice 1 column by index
c1 <- df[,1]
# slice 1 column by name
c1_1 <- df[, "sepal_len"]
# slice cols by vector of indexes
cols <- df[, 1:4]
# slice cols by vector of names
cols_1 <- df[, c("sepal_len", "sepal_wid", "petal_len", "petal_wid")]
library(h2o)
path <- "data/iris/iris_wheader.csv"
h2o.init()
df <- h2o.importFile(path)
# slice 1 column by index
c1 <- df[,1]
# slice 1 column by name
c1_1 <- df[, "sepal_len"]
# slice cols by vector of indexes
cols <- df[, 1:4]
# slice cols by vector of names
cols_1 <- df[, c("sepal_len", "sepal_wid", "petal_len", "petal_wid")]
import h2o
h2o.init()
path = "data/iris/iris_wheader.csv"
df = h2o.import_file(path=path)
# slice 1 column by index
c1 = df[:,0]
# slice 1 column by name
c1_1 = df[:, "sepal_len"]
# slice cols by list of indexes
cols = df[:, list(range(4))]
# slice cols by a list of names
cols_1 = df[:, ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]]
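The idea of lazy slicing can be sketched in plain Python. The class below is a toy model written for this example, not H2O's implementation: `LazyColumns`, its `materialized` counter, and `to_list` are hypothetical names. Slicing only records which columns were asked for; values are copied when `to_list` is called, which stands in for the triggering IO.

```python
class LazyColumns:
    """Toy stand-in for a lazily sliced frame (not H2O's implementation)."""

    def __init__(self, names, rows, selected=None):
        self.names = names      # column names of the underlying data
        self.rows = rows        # shared between slices, never copied while slicing
        self.selected = list(range(len(names))) if selected is None else selected
        self.materialized = 0   # how many times values were actually copied

    def __getitem__(self, cols):
        # Accept one index or name, or a sequence of them; record the selection only.
        if not isinstance(cols, (list, tuple, range)):
            cols = [cols]
        picked = [self.selected[c] if isinstance(c, int) else self.names.index(c)
                  for c in cols]
        return LazyColumns(self.names, self.rows, picked)

    def to_list(self):
        # The "triggering IO": only here are the selected values copied out.
        self.materialized += 1
        return [[row[i] for i in self.selected] for row in self.rows]


names = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
rows = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]  # illustrative values
frame = LazyColumns(names, rows)

by_index = frame[0]              # 0-based, unlike R's df[,1]
by_name = frame["sepal_len"]
first_two = frame[range(2)]

copies_before = by_index.materialized   # still 0: slicing copied nothing
first_col = by_index.to_list()          # values are copied only now
```

Note that Python column indexes are 0-based while the R snippets are 1-based, so `df[:,0]` and `df[,1]` name the same column.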
This example shows how to slice rows from a frame of data.
H2O slices out rows of data lazily and materializes a shared copy only upon IO.
path <- "data/iris/iris_wheader.csv"
df <- read.csv(path)
# slice 1 row by index
c1 <- df[15,]
# slice a range of rows
c1_1 <- df[25:49,]
# slice with a boolean mask
mask <- df[,"sepal_len"] < 4.4
cols <- df[mask,]
# filter out missing values
mask <- is.na(df[,"sepal_len"])
cols <- df[!mask,]
library(h2o)
path <- "data/iris/iris_wheader.csv"
h2o.init()
df <- h2o.importFile(path)
# slice 1 row by index
c1 <- df[15,]
# slice a range of rows
c1_1 <- df[25:49,]
# slice with a boolean mask
mask <- df[,"sepal_len"] < 4.4
cols <- df[mask,]
# filter out missing values
mask <- is.na(df[,"sepal_len"])
cols <- df[!mask,]
import h2o
h2o.init()
path = "data/iris/iris_wheader.csv"
df = h2o.import_file(path=path)
# slice 1 row by index
c1 = df[15,:]
# slice a range of rows
c1_1 = df[list(range(25, 50)), :]
# slice with a boolean mask
mask = df["sepal_len"] < 4.4
cols = df[mask,:]
# filter out missing values
mask = df["sepal_len"].isna()
cols = df[~mask,:] # note how to perform a logical not with the '~'
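Row indexes differ between the two languages: R is 1-based with inclusive ranges, while Python is 0-based with half-open ranges. The Python range from 25 up to (but not including) 50 therefore picks the 26th through 50th rows, whereas R's `25:49` picks the 25th through 49th. The sketch below shows the translation and the two mask patterns on a plain list; `r_range_to_python` is a hypothetical helper and the values are illustrative.

```python
def r_range_to_python(first, last):
    """Translate R's 1-based inclusive first:last into a 0-based Python range."""
    return range(first - 1, last)

sepal_len = [5.1, 4.3, None, 4.4, 4.2]  # illustrative values, None marks a missing entry

# Boolean mask: keep values below 4.4 (missing entries are treated as False here).
mask = [v is not None and v < 4.4 for v in sepal_len]
small = [v for v, keep in zip(sepal_len, mask) if keep]

# Filter out missing values: the plain-list analogue of negating an isna() mask.
is_missing = [v is None for v in sepal_len]
present = [v for v, miss in zip(sepal_len, is_missing) if not miss]

# The Python index range that selects the same rows as R's 25:49.
rows_25_to_49 = r_range_to_python(25, 49)
```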
This example shows how to replace values in a frame of data.
path <- "data/iris/iris_wheader.csv"
df <- read.csv(path)
# replace a single numerical datum
df[15,3] <- 2
# replace a single categorical datum
df[15,5] <- "versicolor"
# replace a whole column
df[,1] <- 3*df[,1]
# replace by row mask
df[df[,"sepal_len"] < 4.4, "sepal_len"] <- 22
# replacement with ifelse
df[,"sepal_len"] <- ifelse(df[,"sepal_len"] < 4.4, 22, df[,"sepal_len"])
# replace missing values with 0
df[is.na(df[,"sepal_len"]), "sepal_len"] <- 0
# alternative with ifelse
df[,"sepal_len"] <- ifelse(is.na(df[,"sepal_len"]), 0, df[,"sepal_len"])
library(h2o)
path <- "data/iris/iris_wheader.csv"
h2o.init()
df <- h2o.importFile(path)
# replace a single numerical datum
df[15,3] <- 2
# replace a single categorical datum
# unimplemented as of 3.6.0.8 (tibshirani)
# replace a whole column
df[,1] <- 3*df[,1]
# replace by row mask
df[df[,"sepal_len"] < 4.4, "sepal_len"] <- 22 # BUG: https://0xdata.atlassian.net/browse/PUBDEV-2520
# replacement with ifelse
df[,"sepal_len"] <- h2o.ifelse(df[,"sepal_len"] < 4.4, 22, df[,"sepal_len"])
# replace missing values with 0
df[is.na(df[,"sepal_len"]), "sepal_len"] <- 0
# alternative with ifelse
df[,"sepal_len"] <- h2o.ifelse(is.na(df[,"sepal_len"]), 0, df[,"sepal_len"])
import h2o
h2o.init()
path = "data/iris/iris_wheader.csv"
df = h2o.import_file(path=path)
# replace a single numerical datum
df[14,2] = 2
# replace a single categorical datum
# unimplemented as of 3.6.0.8 (tibshirani)
# replace a whole column
df[0] = 3*df[0]
# replace by row mask
df[df["sepal_len"] < 4.4, "sepal_len"] = 22 # BUG: https://0xdata.atlassian.net/browse/PUBDEV-2520
# replacement with ifelse
df["sepal_len"] = (df["sepal_len"] < 4.4).ifelse(22, df["sepal_len"])
# replace missing values with 0
df[df["sepal_len"].isna(), "sepal_len"] = 0
# alternative with ifelse
df["sepal_len"] = (df["sepal_len"].isna()).ifelse(0, df["sepal_len"]) # note the parentheses!
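When porting the R snippets to Python, keep in mind that R's `<-` is not an assignment operator in Python. Python reads `x <- 0` as the comparison `x < (-0)`, so nothing is stored and no error is raised. The standard-library check below illustrates this with a plain integer; the variable names are only for the example.

```python
import ast

# Parse the expression and inspect what Python thinks it is.
node = ast.parse("x <- 0", mode="eval").body
is_comparison = isinstance(node, ast.Compare) and isinstance(node.ops[0], ast.Lt)

x = 5
result = (x <- 0)   # evaluates 5 < -0, a comparison; x is left unchanged
```

Use `=` for assignment in Python, as in `x = 0`.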
This example shows how to build a simple classifier and view/save its POJO.
library(h2o)
path <- "data/iris/iris_wheader.csv"
h2o.init()
df <- h2o.importFile(path)
# build a simple random forest classifier
model <- h2o.randomForest(x=1:4, y=5, training_frame=df)
# print the pojo to screen
h2o.download_pojo(model)
# save it to a local directory (modelSavePath is a placeholder that you define)
h2o.download_pojo(model, modelSavePath)
import h2o
h2o.init()
path = "data/iris/iris_wheader.csv"
df = h2o.import_file(path=path)
# build a simple random forest classifier
from h2o.estimators.random_forest import H2ORandomForestEstimator
model = H2ORandomForestEstimator()
model.train(x=list(range(4)), y=4, training_frame=df)
# print the pojo to screen
h2o.download_pojo(model)
# save it to a local directory (modelSavePath is a placeholder that you define)
h2o.download_pojo(model, modelSavePath)
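The snippets above do not define `modelSavePath`; it stands for a local directory that receives the POJO. One possible way to set it up before the download call is sketched below. The temporary location and the folder name `pojo` are assumptions made for the example, not values taken from H2O.

```python
import os
import tempfile

# Hypothetical choice: a "pojo" folder inside a fresh temporary directory.
modelSavePath = os.path.join(tempfile.mkdtemp(), "pojo")

# Create the directory up front so the download has somewhere to write.
os.makedirs(modelSavePath, exist_ok=True)
```

Any writable directory works in place of the temporary one, for example a folder inside your project.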