For this self-paced course, we will explore the Titanic dataset from the perspective of a passenger life insurance company while using and learning about H2O.ai's enterprise product, Driverless AI. We will explore possible risk factors derived from this dataset that could have been considered when selling passenger insurance. More specifically, we will create a predictive model to determine what factors contributed to a passenger surviving.

In part, this self-paced course will also be an overview of Driverless AI. You will learn how to load data, explore data details, generate automatic visualizations, launch an experiment, explore feature engineering, and view experiment results. We will also take a quick tour of the Machine Learning Interpretability report that you can generate right after an experiment is complete.

Note: This self-paced course has been built on Aquarium, which is H2O.ai's cloud environment providing software access for workshops, conferences, and training. The labs in Aquarium have datasets, experiments, projects, and other content preloaded. If you use your version of Driverless AI, you will not see the preloaded content.

Note: Aquarium's Driverless AI Test Drive lab has a license key built-in, so you don't need to request one to use it. Each Driverless AI Test Drive instance will be available to you for two hours, after which it will terminate. No work will be saved. If you need more time to further explore Driverless AI, you can always launch another Test Drive instance or reach out to our sales team via the contact us form.

Overview

H2O Driverless AI is an artificial intelligence (AI) platform for automatic machine learning. Driverless AI automates some of the most difficult data science and machine learning workflows, such as feature engineering, model validation, model tuning, model selection, and model deployment. It aims to achieve the highest predictive accuracy, comparable to expert data scientists, but in a much shorter time thanks to end-to-end automation. Driverless AI also offers automatic visualizations and machine learning interpretability (MLI). Especially in regulated industries, model transparency and explanation are just as important as predictive performance. Modeling pipelines (feature engineering and models) are exported (in full fidelity, without approximations) both as Python modules and as pure Java standalone scoring artifacts.

Why Driverless AI?

Over the last several years, machine learning has become an integral part of many organizations' decision-making processes at various levels. With not enough data scientists to fill the increasing demand for data-driven business processes, H2O.ai offers Driverless AI, which automates several time-consuming aspects of a typical data science workflow, including data visualization, feature engineering, predictive modeling, and model explanation.

H2O Driverless AI is a high-performance, GPU-enabled computing platform for automatic development and rapid deployment of state-of-the-art predictive analytics models. It reads tabular data from plain text sources, Hadoop, or S3 buckets and automates data visualization and building predictive models. Driverless AI targets business applications such as loss-given-default, probability of default, customer churn, campaign response, fraud detection, anti-money-laundering, demand forecasting, and predictive asset maintenance models. (Or in machine learning parlance: common regression, binomial classification, and multinomial classification problems).

Tour

Welcome to the Driverless AI Datasets page (the first thing you will see when you open your Driverless AI URL):

dai-datasets-page

On the Datasets page, the following options and features can be found. We will quickly review them now and explore them further before and after we launch an experiment in the upcoming tasks.

Before we load the dataset for our experiment, let us review some introductory concepts around Machine Learning.

Deeper Dive and Resources

Artificial Intelligence and Machine Learning

The concepts found in this task are meant to provide a high-level overview of Machine Learning. At the end of this task, you can find links to resources that offer a more in-depth explanation of the concepts covered here.

Machine learning is a subset of artificial intelligence, the broader field focused on creating machines that can simulate human intelligence. One critical distinction between artificial intelligence and machine learning is that machine learning models "learn" from the data the models get exposed to. Arthur Samuel, a machine learning pioneer back in 1959, defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed" [1]. A machine learning algorithm trains on a dataset to make predictions. These predictions are, at times, used to optimize a system or assist with decision-making.

Machine Learning Training

Advances in technology have made it easier for data to be collected and made available. The type of data available will determine the kind of training that the machine learning model can undergo. There are two types of machine learning training: supervised and unsupervised learning. Supervised learning is used when the dataset contains the output that you are trying to predict; when the variable to predict is not present, the training is called unsupervised learning. Both types of training define the relationship between input and output variables.

In machine learning, the input variables are called features and the output variables are called labels. The labels, in this case, are what we are trying to predict. The goal is to take the inputs/features and use them to come up with predictions on never-before-seen data. In linear regression, the features are the x-variables, and the labels are the y-variables. An example of a label could be the future price of avocados. As for feature examples in this self-paced course, in Task 3 we will see the following features when creating our survival prediction model: passenger class, sex, age, passenger fare, cabin number, etc.

A machine learning model defines the relationship between features and labels. Anyone can train a model by feeding it examples of particular instances of data. You can have two types of examples: labeled and unlabeled. Labeled examples are those where both the X and Y values (features, labels) are known. Unlabeled examples are those where we know the X value, but we don't know the Y value [1]. Your dataset is a collection of such examples: the columns that will be used for training are the features, the rows are the instances of those features, and the column that you want to predict is the label.
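To make this distinction concrete, here is a minimal, hypothetical sketch in Python using pandas (one of the data-wrangling tools mentioned later in this course). It separates a small Titanic-style table into a feature matrix X and a label vector y; the column names mirror the dataset described later, but the rows are invented purely for illustration.

```python
import pandas as pd

# A tiny, made-up sample shaped like the Titanic data used later in this course.
data = pd.DataFrame({
    "pclass":   [1, 3, 2],
    "sex":      ["female", "male", "male"],
    "age":      [29.0, 25.0, 40.0],
    "fare":     [211.34, 7.23, 13.00],
    "survived": [1, 0, 0],   # the label we want to predict
})

# Features (X): the input columns the model learns from.
X = data.drop(columns=["survived"])

# Label (y): the output column the model is trained to predict.
y = data["survived"]

print(X.head())
print(y.head())
```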

Supervised learning takes labeled examples and allows a model that is being trained to learn the relationship between features and labels. The trained model can then be used on unlabeled data to predict the missing Y value. The model can be tested with either labeled or unlabeled data. Note that H2O Driverless AI creates models with labeled examples.

Data Preparation

A machine learning model is only as good as the data used to train it. If you use garbage data to train your model, you will get a garbage model. With that said, before uploading a dataset into tools that will assist you with building your machine learning model, such as Driverless AI, ensure that the dataset has been cleaned and prepared for training. Transforming raw data into another format that is more appropriate and valuable for analytics is called data wrangling.

Data wrangling can include extracting, parsing, joining, standardizing, augmenting, cleansing, and consolidating data. Data preparation also means getting the dataset into the correct format for what you are trying to do: duplicates are removed, missing data is fixed or removed, and categorical values are transformed or encoded to a numerical type. Tools like Python datatable, Pandas, and R are great assets for data wrangling.
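As a rough illustration of that kind of wrangling, the sketch below uses pandas on a hypothetical local copy of titanic.csv. The specific cleaning steps, the file paths, and the derived column name are assumptions for demonstration only; real data will need its own checks.

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # hypothetical path to a raw copy of the data

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Fix or remove missing data: fill numeric gaps, drop rows missing the label.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["survived"])

# Encode a categorical column as a numerical type.
df["sex_encoded"] = (df["sex"] == "female").astype(int)

df.to_csv("titanic_clean.csv", index=False)
```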

Data wrangling can be done in Driverless AI via a data recipe, the JDBC connector, or live code that creates a new dataset by modifying the existing one.

Data Transformation/Feature Engineering

Data transformation or feature engineering is the process of creating new features from the existing ones. Proper data transformations on a dataset can include scaling, decomposition, and aggregation [2]. Some data transformations include looking at all the features and identifying which features can be combined to make new ones that will be more useful to the model's performance. For categorical features, the recommendation is for classes that have few observations to be grouped to reduce the likelihood of the model overfitting. Categorical features may be converted to numerical representations since many algorithms cannot handle categorical features directly. In addition, data transformation removes features that are not used or are redundant [3]. These are only a few suggestions for approaching feature engineering, which is very time-consuming due to its repetitive nature and can also be costly. Once the data transformations are done well (or at least well understood), the next step in creating a model is selecting an algorithm.
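For instance, the hedged sketch below derives a new feature by aggregating two Titanic columns and groups rare categories of another. Driverless AI automates this kind of work, so this snippet is only a manual analogue; the file name, new column names, and the rare-category threshold are arbitrary assumptions.

```python
import pandas as pd

df = pd.read_csv("titanic_clean.csv")  # hypothetical output of the previous wrangling step

# Aggregation: combine two existing columns into one that may be more useful.
df["family_size"] = df["sibsp"] + df["parch"] + 1

# Group rare categories so that classes with few observations are lumped
# together, reducing the chance of overfitting on tiny classes.
counts = df["embarked"].value_counts()
rare = counts[counts < 50].index
df["embarked_grouped"] = df["embarked"].where(~df["embarked"].isin(rare), "OTHER")

# Convert the categorical feature to a numerical representation (one-hot encoding).
df = pd.get_dummies(df, columns=["embarked_grouped"], prefix="embarked")
```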

Algorithm Selection

"Machine learning algorithms are described as learning a target function (f) that best maps input variables (x) to an output variable(y): Y= f(x)" [4]. In supervised learning, there are many algorithms to select from for training. The type of algorithm(s) will depend on the size of your data set, structure, and the type of problem you are trying to solve. Through trial and error, the best performing algorithms can be found for your dataset. Some of those algorithms include linear regression, regression trees, random forests, Naive Bayes, and boosting, to name a few [5].

Model Training

Datasets

When training a machine learning model, one good practice is to split your dataset into subsets: training, validation, and testing sets. A good ratio for the entire dataset is 70-15-15: 70% of the whole dataset for training, 15% for validation, and the remaining 15% for testing. The training set is the data used to train the model, and it needs to be big enough to get significant results from it. The validation set is the data held back from training and is used to evaluate and adjust the trained model's hyperparameters and, hence, adjust the performance. Finally, the test set is data that has also been held back and is used to confirm the final model's results [1].
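A common way to produce a 70-15-15 split is two successive calls to scikit-learn's train_test_split, as in this sketch; the data here is synthetic and the random seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 1,000 rows, 5 features, binary label.
rng = np.random.default_rng(seed=42)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First hold out 30% of the data, then split that 30% evenly into
# validation (15% of the total) and test (15% of the total) sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=42)

print(len(X_train), len(X_valid), len(X_test))  # 700 / 150 / 150
```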

Note: The validation dataset is used for tuning the modeling pipeline. If provided, the entire training data will be used for training, and validation of the modeling pipeline is performed with only this validation dataset. When you do not include a validation dataset, Driverless AI will do K-fold cross-validation for I.I.D. (identically and independently distributed) experiments and multiple rolling window validation splits for time series experiments. For this reason, it is not generally recommended to include a validation dataset as you are then validating on only a single dataset. Please note that time series experiments cannot be used with a validation dataset: including a validation dataset will disable the ability to select a time column and vice versa.

This dataset must have the same number of columns (and column types) as the training dataset. Also, note that if provided, the validation set is not sampled down, so it can lead to large memory usage, even if accuracy=1 (which reduces the train size). In a moment, we will learn more about accuracy when preparing an experiment.[10]

Another part of model training is fitting and tuning the models. For fitting and tuning, hyperparameters need to be tuned, and cross-validation needs to take place using only the training data. Various hyperparameter values will need to be tested. "A hyperparameter is a parameter that is set before the learning process begins. These parameters are tunable and can directly affect how well a model trains" [7]. For example, the learning rate is a hyperparameter that determines how quickly the model learns.

With cross-validation, the whole dataset is utilized, and each model is trained on a different subset of the training data [8]. Additionally, a cross-validation loop will be set to calculate the cross-validation score for each set of hyperparameters for each algorithm. Based on the cross-validation score and hyperparameter values, you can select the model for each algorithm that has been tuned with training data and tested with your test set. Remember that "a model in [Machine Learning(ML)] is the output of an ML algorithm run on data. It represents what was learned by a machine learning algorithm." [9]
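The cross-validation loop over hyperparameter values described above is roughly what scikit-learn's GridSearchCV automates. The sketch below is only an illustration; the parameter grid, the model, and the stand-in dataset are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate hyperparameter values; every combination is scored with
# 5-fold cross-validation using the training data only.
param_grid = {"n_estimators": [100, 300], "max_depth": [4, 8, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("best cross-validation AUC:", round(search.best_score_, 3))
print("held-out test AUC:", round(search.score(X_test, y_test), 3))
```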

What are the challenges in AI Model Development?

One of the significant challenges in developing a single production-ready model is that it can take weeks or months to build. Developing a model involves feature engineering, model building, and model deployment. All of these tasks are very repetitive, time-consuming, and require advanced knowledge of feature generation, algorithms, parameters, and model deployment. Finally, there needs to be in-depth knowledge of and confidence in how the model was generated to justify how it made its decisions.

What is Automated Machine Learning, and why is it important?

AutoML, or Automated Machine Learning, is the process of automating algorithm selection, feature generation, hyperparameter tuning, iterative modeling, and model assessment. AutoML tools such as H2O Driverless AI make it easy to train and evaluate machine learning models. Automating the repetitive tasks around machine learning development allows individuals to focus on the data and the business problems they are trying to solve.

With this in mind, let's explore and load the data that we will use to predict whether a passenger would have survived the Titanic accident.

References

Deeper Dive and Resources

What is the Driverless AI Workflow?

The typical Driverless AI workflow is to:

  1. Load data
  2. Visualize data
  3. Run an experiment
  4. Interpret the model
  5. Deploy the scoring pipeline

In addition, you can diagnose a model, transform another dataset, score the model against another dataset, and manage your data in Projects. This self-paced course's focus will be on steps 1 - 4. We will cover Driverless AI's other aspects in other self-paced courses found in the Driverless AI learning path. We will start with step 1: load data.

About the Dataset

The dataset used for this experiment is a version of the Titanic Kaggle dataset. It contains a list of the estimated passengers aboard the RMS Titanic.

The RMS Titanic was a British commercial passenger liner that sank after colliding with an iceberg in the North Atlantic Ocean on April 15, 1912. Of an estimated 2,224 passengers and crew members on their way from Southampton to New York City, more than 1,500 lost their lives.

This tragedy shocked the international community and led to better safety regulations for ships. The lack of lifeboats, amongst other things, was one of the factors that resulted in a significant loss of life. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.

Figure 1. RMS Titanic

To further understand the data, please consider the table below:

| Attribute | Definition | Key |
| --- | --- | --- |
| passenger Id | Id randomly generated | - |
| pclass | Passenger Class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| survived | Survival | 0 = No, 1 = Yes |
| name_with_salutations | Passenger name | - |
| name_without_salutations | Passenger name without salutations | - |
| sex | Sex | Female, Male |
| age | Age in years | - |
| sibsp | Number of siblings/spouses aboard | - |
| parch | Number of parents/children aboard | - |
| ticket | Ticket number | - |
| fare | Passenger fare | - |
| cabin | Cabin number | - |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
| boat | Boat number | - |
| body | Body number | - |
| home.des | Home Destination | - |

Add the Dataset

1. Navigate back to the H2O Driverless AI Datasets page. To add the dataset:

a. Click on Add a Dataset (or Drag and Drop)
b. Select FILE SYSTEM:

2. Inside the FILE SYSTEM:

select-titanic-dataset

a. Enter the following in the search bar: /data/TestDrive/titanic.csv
b. Select the titanic.csv
c. Click to Import Selection:

3. The following will appear after you have successfully imported the dataset:

titanic-set-overview

Now that the dataset has been imported, let's discover in the next task how Driverless AI allows users to further understand a selected dataset. Doing so will allow us to explore the second step of the Driverless AI workflow: visualize data.

Deeper Dive and Resources

Details

We are now going to explore the Titanic dataset that we just loaded.

1. On the Dataset Overview page, click on the titanic.csv. The following options will appear:

titanic-set-actions

Note: A dataset can only be deleted if it's not being used in an experiment. Otherwise, you must delete the experiment first, and then the dataset.

Next, we are going to confirm that the dataset loaded correctly and that it has the correct number of rows and columns.

2. Click the Details option, and it will take you to the Dataset Details Page:

3. To continue learning about what details are available, click on the following button: Dataset Rows. The following will appear:

titanic-set-rows-page

4. Exit and return to the Datasets page.

Split the Dataset

From the titanic.csv dataset, we are going to create two datasets: training and test. 75% of the data will be used to train the model, and the other 25% will be used to test the trained model.

1. Click on the titanic.csv file and select Split:

titanic-set-split-1

2. Split the data into two sets: titanic_train and titanic_test, then save the changes. Use the image below as a guide:

titanic-set-split-2

The split ratio of .75 (75% for the training set and 25% for the test set) was selected for this particular dataset given the set's relatively small total size. (A pandas equivalent of this split is sketched at the end of this section.)

3. Verify that the three Titanic datasets, titanic_test, titanic_train, and titanic.csv, are there:

three-datasets
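Outside the Driverless AI UI, an equivalent 75/25 split could be done with pandas, as in this hedged sketch. The file names mirror the ones used in this task, but the local paths and the random seed are assumptions, and this is a plain random sample rather than the splitter built into Driverless AI (which offers additional options, such as selecting a target column).

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # path to your local copy of the dataset

# Randomly sample 75% of the rows for training; the rest becomes the test set.
titanic_train = df.sample(frac=0.75, random_state=1234)
titanic_test = df.drop(titanic_train.index)

titanic_train.to_csv("titanic_train.csv", index=False)
titanic_test.to_csv("titanic_test.csv", index=False)
```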

Autoviz

Now that the titanic.csv dataset has been split, we will use the titanic_train dataset. Before we begin our experiment, let's visualize the dataset to better understand which features and labels will play a crucial role in our machine learning model.

There are two ways to visualize the training set:

titanic-train-visualize

1. Pick a method to visualize the titanic_train dataset. Right after, the following will appear:

train-set-visualization-ready

2. Click on the titanic_train visualization, and the following graphs will appear:

train-set-visualizations

Is it possible to visualize how variables on the training set are correlated? Can we determine what other variables are strongly correlated to a passenger's survival? The answer to those questions is yes! One of the graphs that allow us to visualize the correlations between variables is the Correlation Graph.

3. Let's explore the correlation between the survived variable and other variables in the dataset:

What variables are strongly correlated with the 'survived' variable? Based on the correlation graph, we can see that no strong correlations with the survived attribute were inferred from the titanic_train dataset. Although the graph shows no strong correlation, that is not to say that we will not be able to predict whether someone survived the Titanic accident. Visualizing the dataset only gives us an idea/preview of the data that will be used to train our model; it can also provide a deeper understanding of the data while highlighting outliers. (A quick pandas sketch of this kind of correlation check follows the graph below.)

train-set-correlation-graph
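If you want to double-check the correlation picture outside of AutoViz, a rough pandas sketch like the one below computes pairwise correlations for the numeric columns. The file name assumes you have a local export of the training split, and the column list assumes the Titanic schema described earlier.

```python
import pandas as pd

train = pd.read_csv("titanic_train.csv")  # hypothetical export of the training split

# Pearson correlations between the numeric columns and the label.
numeric_cols = ["survived", "pclass", "age", "sibsp", "parch", "fare"]
corr = train[numeric_cols].corr()

# Correlation of each numeric column with 'survived', strongest first.
print(corr["survived"].drop("survived").sort_values(key=abs, ascending=False))
```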

4. Exit out of the Correlation Graph view by clicking on the X at the graph's top-right corner.

5. After you are done exploring the other graphs, go back to the Datasets page. While exploring, keep in mind that Driverless AI shows graphs that represent "relevant" aspects of the data. The following are the types of graphs available:

In the next task, we will proceed to step 3 of our Driverless AI workflow: run an experiment.

References

Deeper Dive and Resources

We are going to launch our first experiment. An experiment means that we are going to generate a prediction using a dataset of our choice. In this case, we will use the titanic_train dataset.

1. Return to the Datasets Overview page and click on the titanic_train dataset, then select Predict:

titanic-train-predict

If this is your first time launching an experiment, the following prompt will appear, asking if you want to take a tour:

driverless-tour

If you would like to take a quick tour of the Experiments page, select YES; the short tour will cover the following items:

2. For the time being, select Not Now; you can go back and take the tour later. For the most part, this self-paced course will cover what's mentioned during the tour.

3. The Experiment preview page will appear; this preview page displays all settings that Driverless AI will use before launching an experiment:

train-set-experiment-page

4. Continuing with our experiment, name your experiment as follows: Titanic Classification Tutorial

5. Click Dropped Columns, drop the following columns, then select Done:

train-set-drop-columns

We removed these attributes (columns) to create a cleaner dataset. Attributes such as boat and body are excluded because they are clear indicators that a passenger survived and can lead to data leakage. A clean dataset is essential for the creation of a good predictive model. The process of data cleansing needs to be done with all datasets to rid the set of any unwanted observations, structural errors, unwanted outliers, or missing data.

6. For our experiment, we will be using a test dataset. To select the test dataset, select TEST DATASET and select the titanic_test:

add-test-set

7. Now, select the TARGET COLUMN. In our case, the column will be survived. We want to know who would survive based on the information the model will be trained on (e.g., age):

train-set-drop-name-column

The survived attribute was selected because, as an insurance company, we want to know what attributes can contribute to passengers' survival. Knowing these attributes from the perspective of an insurance company can be beneficial because it can give the company a good idea of how to set insurance rates.

8. Your experiment page should look similar to the one below; these are the system suggestions based on the data selected to train this model:

experiment-settings

9. Update the following experiment settings so they match the image below, then select Launch Experiment (use the + (increase) or - (decrease) icons located next to each training setting):

update-experiment-settings

Note: The dataset and the target column are the minimum elements required to launch an experiment.

10. The Experiment page will look similar to the one below after 95% of the experiment is complete:

experiment-running-46

11. Once the experiment is complete, an Experiment Summary will appear:

experiment-summary

Deeper Dive and Resources

Again, Driverless AI performs feature engineering on the dataset to determine the optimal representation of the data being used to train the models (experiment):

feature-engineering-1

Transformations in Driverless AI are applied to columns in the data. The transformers create the engineered features in experiments. Driverless AI provides a number of transformers. The following transformers are available for regression and classification (multiclass and binary) experiments:

Below are just some of the transformers found in our experiment:

1. Look at some of the variables in the Variable Importance section. Note that some of the variables start with _CVTE (_CVTargetEncode) followed by the dataset's column name. Other variables might begin with _NumToCatTE or _WoE, depending on the experiment you run. These are the new, high-value features for our training dataset. (A rough sketch of what this kind of target encoding does appears at the end of this section.)

These transformations are created with the following transformers:

To learn more about Driverless AI Transformations please refer to the Driverless AI documentation here.

2. Hover over any of the variables under Variable Importance to get a simple explanation of the transformer used, as seen in the image below:

The complete list of features used in the final model is available in the Experiment Summary and Logs. The experiment summary also provides a list of the original features and their estimated feature importance. In other words, the experiment summary and logs include the transformations that Driverless AI applied to our titanic experiment.
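As a rough intuition for what a cross-validation target-encoding transformer does, the sketch below computes out-of-fold target means for one categorical column using pandas and scikit-learn. Driverless AI's actual transformer is more sophisticated, so treat this as an assumption-laden illustration only; the file name, column choices, and number of folds are placeholders.

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_csv("titanic_train.csv")  # hypothetical training split
col, target = "embarked", "survived"   # illustrative column choices

df["embarked_cvte"] = 0.0
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# For each fold, encode the held-out rows with the target mean computed on the
# other folds, so that no row is ever encoded using its own label.
for train_idx, valid_idx in kf.split(df):
    fold_means = df.iloc[train_idx].groupby(col)[target].mean()
    encoded = df.iloc[valid_idx][col].map(fold_means)
    df.loc[df.index[valid_idx], "embarked_cvte"] = encoded.fillna(df[target].mean()).values
```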

Deeper Dive and Resources

Let's explore the results of this classification experiment. You can find useful metrics in the experiment summary at the bottom-right of the Experiment Summary page. Next to the summary section, you can observe graphs reflecting insights from the training and validation data for this classification problem. Now, let us observe and learn about these graphs and the summary generated by Driverless AI. Feel free to follow along as we explore each subsection of the summary section (you can access each graph discussed below by clicking on its name):

References

Deeper Dive and Resources

For non-time-series experiments, Driverless AI provides several visual explanations and reason codes for the trained Driverless AI model and its results. After the predictive model is finished, we can access these reason codes and visuals by generating an MLI report. With that in mind, let us focus on the fourth step of the Driverless AI workflow: interpret the model.

1. Generate MLI Report: In the Status Complete section, click Interpret this Model:

The Model Interpretation page is organized into three tabs:

Once the MLI experiment is finished, the following should appear:

dai-model

a. Summary of MLI experiment. This page provides an overview of the interpretation, including the dataset and Driverless AI experiment (if available) that were used for the interpretation along with the feature space (original or transformed), target column, problem type, and k-Lime information:

mli-report-page-1.jpg
mli-report-page-2.jpg
mli-report-page-3.jpg
mli-report-page-4.jpg

b. The DAI Model tab is organized into tiles for each interpretation method. To view a specific plot, click the tile for the plot that you want to view.

For binary classification and regression experiments, this tab includes Feature Importance and Shapley (not supported for RuleFit and TensorFlow models) plots for original and transformed features as well as Partial Dependence/ICE, Disparate Impact Analysis (DIA), Sensitivity Analysis, NLP Tokens and NLP LOCO (for text experiments), and Permutation Feature Importance (if the autodoc_include_permutation_feature_importance configuration option is enabled) plots.

For multiclass classification experiments, this tab includes Feature Importance and Shapley plots for original and transformed features:

dai-model-graphs

A surrogate model is a data mining and engineering technique in which a generally simpler model is used to explain another, usually more complex, model or phenomenon. For example, the decision tree surrogate model is trained to predict the predictions of the more complex Driverless AI model using the original model inputs. The trained surrogate model enables a heuristic understanding (i.e., not a mathematically precise understanding) of the mechanisms of the highly complex and nonlinear Driverless AI model.
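To make the surrogate idea concrete, here is a hedged, generic sketch (not Driverless AI's implementation): a simple decision tree is trained to mimic the predictions of a more complex model using the original inputs. The dataset, models, and depth limit are placeholders chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeRegressor, export_text

data = load_breast_cancer()
X, y = data.data, data.target  # placeholder dataset

# The "complex" model whose behavior we want to explain.
complex_model = GradientBoostingClassifier(random_state=0).fit(X, y)
complex_preds = complex_model.predict_proba(X)[:, 1]

# The surrogate: a shallow, easy-to-read tree trained on the complex model's
# predictions rather than on the original labels.
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, complex_preds)

print("surrogate fidelity (R^2 vs. complex model):", round(surrogate.score(X, complex_preds), 3))
print(export_text(surrogate, feature_names=list(data.feature_names)))
```

The fidelity score gives a heuristic sense of how faithfully the simple tree approximates the complex model; the printed tree is then read as an approximate explanation of its behavior.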

c. The Surrogate Model tab is organized into tiles for each interpretation method. To view a specific plot, click the tile for the plot that you want to view. For binary classification and regression experiments, this tab includes K-LIME/LIME-SUP and Decision Tree plots as well as Feature Importance, Partial Dependence, and LOCO plots for the Random Forest surrogate model. For more information on these plots, see Surrogate Model Plots.

surrogate-models

d. Dashboard - The Dashboard button contains a dashboard with an overview of the interpretations (built using surrogate models).

dashboard

e. The Action button on the MLI page can be used to download the reason codes, scoring pipelines for productionization and MLI logs:

f. *n* Running | n Failed | n Done - This option gives you access to a status log that displays the build status of the charts, graphs, plots, and metrics being generated by Driverless AI.

g. DATASETS - It takes you to the Datasets page.

h. EXPERIMENTS - Takes you to the Experiments page.

i. MLI - It takes you to the MLI page to generate or find already developed various interpretations for experiments.

MLI Dashboard

1. Select the MLI Dashboard and explore the different types of insights and explanations regarding the model and its results. All plots are interactive.

mli-dashboard

Note: In the top right corner, where it says "Row Number or Column Value," you can search for a particular observation by row number or by column value. The user cannot specify arbitrary columns - MLI automatically chooses columns whose values are unique (the dataset row count equals the number of unique values in the column).

Every graph, plot, or chart we have observed has a ? icon located at its top right corner; it provides further information about the visual.

With the above in mind, we can say that the top three factors that contributed to a passenger surviving are as follows: sex, cabin, and class. From the perspective of an insurance company, knowing this information can drastically determine certain groups' insurance rates.

Deeper Dive and Resources

To emphasize, Driverless AI allows you to download auto-generated documents such as the Experiment Summary & Logs and the MLI Report, all at the click of a button.

Experiment Summary & Logs

Click on Download Summary & Logs: Driverless AI will download a zip file:

download-experiment-summary

When you open the zip file, Driverless AI will include the following files:

Besides the DOWNLOAD SUMMARY & LOGS, you can click the DOWNLOAD AUTODOC option to gain insights about the experiment further:

The AutoDoc feature is used to generate automated machine learning documentation for individual Driverless AI experiments. This editable document contains an overview of the experiment and includes other significant details like feature engineering and final model performance. To generate an AutoDoc click on the DOWNLOAD AUTODOC option located in the STATUS: COMPLETE section.

download-autoreport

Deeper Dive and Resources

Before we conclude this self-paced course, note that we haven't focused on the fifth step of the Driverless AI workflow: deploy the scoring pipeline. Given the complexity of that step, we will explore it in the following self-paced course:

Before exploring that final step of the Driverless AI workflow, however, it is recommended to proceed to the next self-paced course in the learning path. The second self-paced course will provide a deeper understanding of Driverless AI's UI and its functionalities. Once again, the second self-paced course is as follows:

If you want to test Driverless AI without the constraints of the Aquarium lab, such as the two-hour time limit and the inability to save work, you can request a 21-day trial license key for your own Driverless AI environment.

Driverless AI provides a Project Workspace for managing datasets and experiments related to a specific business problem or use case. Whether you are trying to detect fraud or predict user retention, datasets and experiments can be stored and saved in individual projects. A Leaderboard on the Projects page allows you to quickly compare performance and results and identify the best solution for your problem.

You can link datasets and experiments from the Projects page, and you can run new experiments. When you link an existing experiment to a Project, Driverless AI will automatically link the experiment's datasets to this project (if not already linked).

Explore an Existing Project Workspace

1. Select Projects; an image similar to the one below will appear (the Projects section is located at the top of the Driverless AI UI, next to the Datasets section):

projects-page

2. Let's explore what each project can contain: open the project Time Series Tutorial, and the following will appear:

projects-page-time-series

Create a Project Workspace

3. To create a Project Workspace:

a. Click the Projects option on the top menu
b. Click New Project
c. Specify a name for the project and provide a description
d. Click Create Project. This creates an empty Project page