Data Science Examples

A collection of data science examples implemented across a variety of languages and libraries.

[How to contribute] [List of contributors] [Report an issue]

Table of Contents

Section 1. Parsing and Ingesting Data

Section 2. Munging Data

Section 3. Training Models

Section 4. Making Predictions

Tags

Tag           Related Examples
classifier    How Do I View Trees From A Random Forest Model?
csv           How Do I Import Data?, How Do I Upload Data?
import        How Do I Import Data?
missing       How Do I Replace Values In A Frame?
munge         How do I Slice Columns From A Frame?, How Do I Slice Rows From A Frame?, How Do I Replace Values In A Frame?
parse         How Do I Import Data?, How Do I Upload Data?
pojo          How Do I View Trees From A Random Forest Model?
pull          How Do I Import Data?
push          How Do I Upload Data?
randomforest  How Do I View Trees From A Random Forest Model?
replace       How Do I Replace Values In A Frame?
slice         How do I Slice Columns From A Frame?, How Do I Slice Rows From A Frame?
trees         How Do I View Trees From A Random Forest Model?
upload        How Do I Upload Data?

The Examples

1.1. Parsing and Ingesting Data / How Do I Import Data?

This example shows how to import and parse a small CSV file.

The well-known iris dataset is used.

The H2O examples use the h2o.importFile method, which is a "pull-into-h2o" operation.

This is a fast, scalable, highly optimized way to read data. H2O initiates the data transfer as a read operation and pulls the data from a data store.


library(ex)
path = ex.locate("data/iris/iris_wheader.csv")
df = read.csv(path)
stopifnot(nrow(df) == 150)


library(h2o)
library(ex)
path = ex.locate("data/iris/iris_wheader.csv")
h2o.init()
h2odf = h2o.importFile(path)
stopifnot(nrow(h2odf) == 150)


import h2o
h2o.init()
path = "data/iris/iris_wheader.csv"
df = h2o.import_file(path=path)
assert df.nrow == 150
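
For comparison with the base R baseline above, a plain-Python baseline can be sketched with the standard csv module. This is only an illustration: the three inline rows stand in for data/iris/iris_wheader.csv, the "class" column name is an assumption, and read_rows is a hypothetical helper that is not part of H2O.

```python
import csv
import io

# Illustrative stand-in for data/iris/iris_wheader.csv: a header plus three rows.
# The first four header names follow the slicing examples; "class" is assumed.
SAMPLE = """sepal_len,sepal_wid,petal_len,petal_wid,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

def read_rows(handle):
    """Parse a CSV stream with a header row into a list of dicts."""
    return list(csv.DictReader(handle))

rows = read_rows(io.StringIO(SAMPLE))

# The sample has 3 data rows; the full iris file would have 150.
assert len(rows) == 3
# csv leaves every value as a string until it is converted explicitly.
assert float(rows[0]["sepal_len"]) == 5.1
```

To read the real file, pass an open file handle instead, for example one obtained from open(path, newline="").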

1.2. Parsing and Ingesting Data / How Do I Upload Data?

This example shows how to upload and parse a small CSV file.

The well-known iris dataset is used.

The H2O examples use the h2o.uploadFile method, which is a "push-to-h2o" operation.

This approach is not scalable and is intended only for smaller data sizes. The client pushes the data from a local filesystem (for example, on your laptop where R is running) to H2O.

For big-data operations, you don't want the data stored on, or flowing through, the client.


library(h2o)
library(ex)
path = ex.locate("data/iris/iris_wheader.csv")
h2o.init()
h2odf = h2o.uploadFile(path)
stopifnot(nrow(h2odf) == 150)
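
The difference between the pull and push styles can be pictured with a toy model. The sketch below is not H2O code: ToyServer, import_file, upload and client_upload are hypothetical names, and a dict stands in for the server's storage. It only shows which process opens the file in each case.

```python
import os
import tempfile

class ToyServer:
    """Hypothetical stand-in for a data server; not H2O's implementation."""

    def __init__(self):
        self.frames = {}

    def import_file(self, key, path):
        # Pull: the server opens the path itself, so the path must be
        # readable from wherever the server runs.
        with open(path, "rb") as f:
            self.frames[key] = f.read()

    def upload(self, key, payload):
        # Push: the server only receives bytes the client has already read.
        self.frames[key] = payload

def client_upload(server, key, path):
    # Every byte of the file passes through the client process here.
    with open(path, "rb") as f:
        payload = f.read()
    server.upload(key, payload)
    return len(payload)  # number of bytes that flowed through the client

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.csv")
    with open(path, "wb") as f:
        f.write(b"a,b\n1,2\n")
    server = ToyServer()
    server.import_file("pulled", path)
    pushed_bytes = client_upload(server, "pushed", path)

# Both routes end with the same data on the server.
assert server.frames["pulled"] == server.frames["pushed"]
```

In the push route the client holds the whole payload in memory, which is why the text above reserves it for small files.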

2.1. Munging Data / How do I Slice Columns From A Frame?

This example shows how to slice columns from a frame of data.

H2O slices out columns lazily, and materializes a shared copy only when an operation that triggers I/O requires it.


path <- "data/iris/iris_wheader.csv"
df <- read.csv(path)

# slice 1 column by index
c1 <- df[,1]

# slice 1 column by name
c1_1 <- df[, "sepal_len"]

# slice cols by vector of indexes
cols <- df[, 1:4]

# slice cols by vector of names
cols_1 <- df[, c("sepal_len", "sepal_wid", "petal_len", "petal_wid")]


library(h2o)
path <- "data/iris/iris_wheader.csv"
h2o.init()
df <- h2o.importFile(path)

# slice 1 column by index
c1 <- df[,1]

# slice 1 column by name
c1_1 <- df[, "sepal_len"]

# slice cols by vector of indexes
cols <- df[, 1:4]

# slice cols by vector of names
cols_1 <- df[, c("sepal_len", "sepal_wid", "petal_len", "petal_wid")]


import h2o
h2o.init()
path = "data/iris/iris_wheader.csv"
df = h2o.import_file(path=path)

# slice 1 column by index
c1 = df[:,0]

# slice 1 column by name
c1_1 = df[:, "sepal_len"]

# slice cols by list of indexes
cols = df[:, list(range(4))]  # list() makes this a real list on Python 3 as well

# slice cols by a list of names
cols_1 = df[:, ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]]
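
One way to picture lazy slicing is a view that records which columns were requested and copies data only when asked to produce output. The toy class below is a sketch under that assumption: LazyColumns, select and to_list are hypothetical names, and H2O's actual mechanism is not shown in this document.

```python
class LazyColumns:
    """Toy lazy column selection; a sketch, not how H2O implements it."""

    def __init__(self, rows, names, selected=None):
        self._rows = rows        # shared reference, never copied here
        self._names = names
        self._selected = list(names) if selected is None else selected
        self.materialized = 0    # counts how often data was actually copied

    def select(self, cols):
        # Record the selection only; no row data is touched yet.
        picked = [self._names[c] if isinstance(c, int) else c for c in cols]
        return LazyColumns(self._rows, self._names, picked)

    def to_list(self):
        # The triggering step: build the sliced copy on demand.
        self.materialized += 1
        idx = [self._names.index(n) for n in self._selected]
        return [[row[i] for i in idx] for row in self._rows]

names = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
rows = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]

frame = LazyColumns(rows, names)
view = frame.select([0, "petal_len"])   # mix of index and name, as above
assert view.materialized == 0           # nothing copied so far
sliced = view.to_list()                 # the copy happens here
```

Selecting by index or by name yields the same kind of view, which mirrors the index-based and name-based slices in the examples above.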

2.2. Munging Data / How Do I Slice Rows From A Frame?

This example shows how to slice rows from a frame of data.

H2O slices out rows lazily, and materializes a shared copy only when an operation that triggers I/O requires it.


path <- "data/iris/iris_wheader.csv"
df <- read.csv(path)

# slice 1 row by index
c1 <- df[15,]

# slice a range of rows
c1_1 <- df[25:49,]

# slice with a boolean mask
mask <- df[,"sepal_len"] < 4.4
rows <- df[mask,]

# filter out missing values
mask <- is.na(df[,"sepal_len"])
rows <- df[!mask,]


library(h2o)
path <- "data/iris/iris_wheader.csv"
h2o.init()
df <- h2o.importFile(path)

# slice 1 row by index
c1 <- df[15,]

# slice a range of rows
c1_1 <- df[25:49,]

# slice with a boolean mask
mask <- df[,"sepal_len"] < 4.4
rows <- df[mask,]

# filter out missing values
mask <- is.na(df[,"sepal_len"])
rows <- df[!mask,]


import h2o
h2o.init()
path = "data/iris/iris_wheader.csv"
df = h2o.import_file(path=path)

# slice 1 row by index
c1 = df[15,:]

# slice a range of rows
c1_1 = df[list(range(25, 50)), :]  # 0-based positions 25..49; list() makes this a real list on Python 3 as well

# slice with a boolean mask
mask = df["sepal_len"] < 4.4
rows = df[mask,:]

# filter out missing values
mask = df["sepal_len"].isna()
rows = df[~mask,:]  # '~' performs the logical not
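
The row selections above rely on two ideas that are easy to check in plain Python: index ranges differ between R (1-based, inclusive) and Python (0-based, end-exclusive), and a boolean mask keeps the rows where it is true. The sketch below uses made-up values and a hypothetical helper, r_range_to_positions; it does not call H2O.

```python
import math

def r_range_to_positions(first, last):
    """0-based positions for the rows that R's first:last selects."""
    return list(range(first - 1, last))

# R's 25:49 covers 25 rows; the matching 0-based positions run from 24 to 48.
positions = r_range_to_positions(25, 49)

# Illustrative values only; NaN stands in for a missing measurement.
sepal_len = [5.1, 4.3, float("nan"), 4.9, 4.4]

# Boolean mask: one True/False per row (NaN < 4.4 evaluates to False).
mask = [v < 4.4 for v in sepal_len]
short = [v for v, keep in zip(sepal_len, mask) if keep]

# Missing-value mask, then keep the rows where it is False (the "not" step).
missing = [math.isnan(v) for v in sepal_len]
present = [v for v, gone in zip(sepal_len, missing) if not gone]
```

Note that a strict comparison leaves out the boundary value 4.4, and that the missing entry never passes the comparison mask.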

2.3. Munging Data / How Do I Replace Values In A Frame?

This example shows how to replace values in a frame of data.


path <- "data/iris/iris_wheader.csv"
df <- read.csv(path)

# replace a single numerical datum
df[15,3] <- 2

# replace a single categorical datum
df[15,5] <- "versicolor"

# replace a whole column
df[,1] <- 3*df[,1]

# replace by row mask
df[df[,"sepal_len"] < 4.4, "sepal_len"] <- 22

# replacement with ifelse
df[,"sepal_len"] <- ifelse(df[,"sepal_len"] < 4.4, 22, df[,"sepal_len"])

# replace missing values with 0
df[is.na(df[,"sepal_len"]), "sepal_len"] <- 0

# alternative with ifelse
df[,"sepal_len"] <- ifelse(is.na(df[,"sepal_len"]), 0, df[,"sepal_len"])


library(h2o)
path <- "data/iris/iris_wheader.csv"
h2o.init()
df <- h2o.importFile(path)

# replace a single numerical datum
df[15,3] <- 2

# replace a single categorical datum
# unimplemented as of 3.6.0.8 (tibshirani)

# replace a whole column
df[,1] <- 3*df[,1]

# replace by row mask
df[df[,"sepal_len"] < 4.4, "sepal_len"] <- 22  # BUG: https://0xdata.atlassian.net/browse/PUBDEV-2520

# replacement with ifelse
df[,"sepal_len"] <- h2o.ifelse(df[,"sepal_len"] < 4.4, 22, df[,"sepal_len"])

# replace missing values with 0
df[is.na(df[,"sepal_len"]), "sepal_len"] <- 0

# alternative with ifelse
df[,"sepal_len"] <- h2o.ifelse(is.na(df[,"sepal_len"]), 0, df[,"sepal_len"])


import h2o
h2o.init()
path = "data/iris/iris_wheader.csv"
df = h2o.import_file(path=path)

# replace a single numerical datum
df[14,2] = 2

# replace a single categorical datum
# unimplemented as of 3.6.0.8 (tibshirani)

# replace a whole column
df[0] = 3*df[0]

# replace by row mask
df[df["sepal_len"] < 4.4, "sepal_len"] = 22  # BUG: https://0xdata.atlassian.net/browse/PUBDEV-2520

# replacement with ifelse
df["sepal_len"] = (df["sepal_len"] < 4.4).ifelse(22, df["sepal_len"])

# replace missing values with 0
df[df["sepal_len"].isna(), "sepal_len"] = 0

# alternative with ifelse
df["sepal_len"] = (df["sepal_len"].isna()).ifelse(0, df["sepal_len"])  # note the parentheses!
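
The ifelse replacements above are elementwise choices: for each row, take one value where the test is true and another where it is false. A plain-Python sketch of that idea follows. The ifelse helper here is a hypothetical stand-in written for illustration, not H2O's ifelse, and the values are made up.

```python
import math

def ifelse(test, yes, no):
    """Elementwise choice: yes where test is true, else the matching no value."""
    return [yes if t else n for t, n in zip(test, no)]

# Illustrative values only; NaN stands in for a missing measurement.
sepal_len = [5.1, 4.3, float("nan"), 4.9]

# Replace values below 4.4 with 22; everything else is carried over.
replaced = ifelse([v < 4.4 for v in sepal_len], 22, sepal_len)

# Replace missing values with 0.
filled = ifelse([math.isnan(v) for v in sepal_len], 0, sepal_len)
```

The test is computed first and passed in as its own sequence, which is the same ordering the parenthesized H2O expressions above enforce.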


3.1. Training Models / How Do I View Trees From A Random Forest Model?

This example shows how to build a simple classifier and view/save its POJO.


library(h2o)
path <- "data/iris/iris_wheader.csv"
h2o.init()
df <- h2o.importFile(path)

# build a simple random forest classifier
model <- h2o.randomForest(x=1:4, y=5, training_frame=df)

# print the pojo to screen
h2o.download_pojo(model)

# save it to a local directory of your choice
# (modelSavePath is a placeholder; set it to an existing directory first)
h2o.download_pojo(model, modelSavePath)


import h2o
h2o.init()
path = "data/iris/iris_wheader.csv"
df = h2o.import_file(path=path)

# build a simple random forest classifier
from h2o.estimators.random_forest import H2ORandomForestEstimator
model = H2ORandomForestEstimator()
model.train(x=list(range(4)), y=4, training_frame=df)  # list() makes x a real list on Python 3 as well

# print the pojo to screen
h2o.download_pojo(model)

# save it to a local directory of your choice
# (modelSavePath is a placeholder; set it to an existing directory first)
h2o.download_pojo(model, modelSavePath)