For this tutorial, we will explore the Titanic dataset from the perspective of a passenger life insurance company with H2O.ai's enterprise product, Driverless AI. We will explore possible risk factors derived from this dataset that could have been considered when selling passenger insurance during this time. More specifically, we will create a predictive model to determine what factors contributed to a passenger surviving.
In this overview of Driverless AI, you will learn how to load data, explore data details, generate Auto visualizations, launch an experiment, explore feature engineering, view experiment results and get a quick tour of the Machine Learning Interpretability report.
Note: This tutorial has been built on Aquarium, which is H2O.ai's cloud environment providing software access for workshops, conferences, and training. The labs in Aquarium have datasets, experiments, projects, and other content preloaded. If you use your own version of Driverless AI, you will not see the preloaded content.
Note: Aquarium's Driverless AI Test Drive lab has a license key built-in, so you don't need to request one to use it. Each Driverless AI Test Drive instance will be available to you for two hours, after which it will terminate. No work will be saved. If you need more time to further explore Driverless AI, you can always launch another Test Drive instance or reach out to our sales team via the contact us form.
Welcome to the Driverless AI Datasets page!
The Driverless AI UI is easy to navigate. The following features, as well as a few datasets, are found on the Datasets page. We will explore these features as we launch an experiment in the next tasks.
The concepts found in this section are meant to provide a high-level overview of Machine Learning. At the end of this section, you can find links to resources that offer a more in-depth explanation of the concepts covered here.
Machine learning is a subset of artificial intelligence, a field whose focus is to create machines that can simulate human intelligence. One critical distinction between artificial intelligence and machine learning is that machine learning models "learn" from the data they are exposed to. Arthur Samuel, a machine learning pioneer, defined machine learning in 1959 as a "field of study that gives computers the ability to learn without being explicitly programmed". A machine learning algorithm trains on a dataset to make predictions. These predictions are, at times, used to optimize a system or assist with decision making.
Advances in technology have made it easier for data to be collected and made available. The available type of data will determine the kind of training that the machine learning model can undergo. There are two types of machine learning training: supervised and unsupervised learning. Supervised learning is when the dataset contains the output that you are trying to predict. When the output to be predicted is not present, it's called unsupervised learning. Both types of training define the relationship between input and output variables.
In machine learning, the input variables are called features and the output variables are called labels. The labels, in this case, are what we are trying to predict. The goal is to take the inputs/variables/features and use them to come up with predictions on never-before-seen data. In linear regression, the features are the x-variables, and the labels are the y-variables. An example of a label could be the future price of avocados. Some examples of features could be the features found in the dataset for this tutorial on Task 3, such as Passenger Class, Sex, Age, Passenger Fare, Cabin number, etc.
A machine learning model defines the relationship between features and labels. A model can be trained by feeding it examples. An example is a particular instance of data. You can have two types of examples: labeled and unlabeled. Labeled examples are those where the x and y values (features, labels) are known. Unlabeled examples are those where we know the x value, but we don't know what the y value is (feature, ?). Your dataset is a collection of examples; the columns that will be used for training are the features; the rows are the instances of those features. The column that you want to predict is the label.
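To make the vocabulary above concrete, here is a minimal Python sketch that separates features from the label and labeled from unlabeled examples. The rows and every column name except `survived` are hypothetical placeholders, not the exact headers in titanic.csv.

```python
# Hypothetical rows; only the 'survived' column name comes from this tutorial.
rows = [
    {"pclass": 1, "sex": "female", "age": 29.0, "fare": 211.34, "survived": 1},
    {"pclass": 3, "sex": "male", "age": 22.0, "fare": 7.25, "survived": 0},
    {"pclass": 2, "sex": "female", "age": 35.0, "fare": 26.0, "survived": None},  # unlabeled
]

LABEL = "survived"  # the column we want to predict


def split_example(row):
    """Return (features, label) for a single example."""
    features = {name: value for name, value in row.items() if name != LABEL}
    return features, row[LABEL]


# Labeled examples have a known y value; unlabeled examples only have features.
labeled = [split_example(r) for r in rows if r[LABEL] is not None]
unlabeled = [split_example(r)[0] for r in rows if r[LABEL] is None]

print(len(labeled), "labeled examples,", len(unlabeled), "unlabeled example")
```

A supervised model would be trained on `labeled` and could then be asked to predict the missing label for each entry in `unlabeled`.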
Supervised learning takes labeled examples and allows a model that is being trained to learn the relationship between features and labels. The trained model can then be used on unlabeled data to predict the missing y value. Applying a trained model to unlabeled data in this way is usually called scoring or inference; it is not the same as unsupervised learning, in which no labels are available during training. Note that H2O Driverless AI creates models with labeled examples.
A machine learning model is as good as the data that is used to train it. If you use garbage data to train your model, you will get a garbage model. With this said, before uploading a dataset into tools that will assist you with building your machine learning model such as Driverless AI, ensure that the dataset has been cleaned and prepared for training. The process of transforming raw data into another format, which is more appropriate and valuable for analytics, is called data wrangling.
Data wrangling can include extracting, parsing, joining, standardizing, augmenting, cleansing, and consolidating data. Data preparation means that the dataset is in the correct format for what you are trying to do: duplicates have been removed, missing data has been fixed or removed, and categorical values have been transformed or encoded to a numerical type. Tools like Python datatable, Pandas, and R are great assets for data wrangling.
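The three preparation steps named above (removing duplicates, fixing missing data, encoding categorical values) can be sketched in plain Python as follows. The records are hypothetical, and filling with the median is just one possible choice; in practice a library such as datatable or Pandas would normally do this work.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "sex": "male", "age": 22.0},
    {"id": 1, "sex": "male", "age": 22.0},    # exact duplicate
    {"id": 2, "sex": "female", "age": None},  # missing age
    {"id": 3, "sex": "female", "age": 38.0},
]

# 1. Remove exact duplicates while keeping the original order.
seen, deduped = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# 2. Fill missing ages with the median of the known ages (an assumed strategy).
known_ages = [r["age"] for r in deduped if r["age"] is not None]
fill_value = median(known_ages)
cleaned = [dict(r, age=fill_value if r["age"] is None else r["age"]) for r in deduped]

# 3. Encode the categorical 'sex' column as integer codes.
codes = {value: i for i, value in enumerate(sorted({r["sex"] for r in cleaned}))}
for r in cleaned:
    r["sex"] = codes[r["sex"]]

print(cleaned)
```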
Data wrangling can be done in Driverless AI via a data recipe, the JDBC connector or through live code which will create a new dataset by modifying the existing one.
Data transformation or feature engineering is the process of creating new features from the existing ones. Proper data transformations on a dataset can include scaling, decomposition, and aggregation. Some data transformations include looking at all the features and identifying which features can be combined to make new ones that will be more useful to the performance of the model. For categorical features, the recommendation is for classes that have few observations to be grouped to reduce the likelihood of the model overfitting. Categorical features may be converted to numerical representations since many algorithms cannot handle categorical features directly. Last but not least, remove features that are not used or are redundant. These are only a few suggestions when approaching feature engineering. Feature engineering is very time-consuming due to its repetitive nature; it can also be costly. The next step in creating a model is selecting an algorithm.
"Machine learning algorithms are described as learning a target function (f) that best maps input variables (x) to an output variable (y): Y = f(x)". In supervised learning, there are many algorithms to select from for training. The type of algorithm(s) will depend on the size of your dataset, its structure, and the type of problem you are trying to solve. Through trial and error, the best performing algorithms can be found for your dataset. Some of those algorithms include linear regression, regression trees, random forests, naive Bayes, and boosting, to name a few.
One good practice when training a machine learning model is to split up your dataset into subsets: training, validation, and testing sets. A common ratio for the entire dataset is 70-15-15: 70% of the whole dataset for training, 15% for validation, and the remaining 15% for testing. The training set is the data that will be used to train the model, and it needs to be big enough to get significant results from it. The validation set is the data that has been held back from the training and will be used to evaluate and adjust the hyperparameters of the trained model and hence adjust the performance. Finally, the test set is data that has also been held back and will be used to confirm the results of the final model.
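As an illustration of the 70-15-15 split described above, the following sketch shuffles a list of rows and cuts it into three parts. The function name, the seed, and the use of integer row IDs are assumptions made for the example.

```python
import random


def train_valid_test_split(rows, train=0.70, valid=0.15, seed=42):
    """Shuffle the rows, then cut them into training, validation, and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # a fixed seed keeps the split reproducible
    n_train = round(len(rows) * train)
    n_valid = round(len(rows) * valid)
    return (
        rows[:n_train],
        rows[n_train:n_train + n_valid],
        rows[n_train + n_valid:],  # whatever remains becomes the test set
    )


# 100 hypothetical row IDs stand in for a real dataset.
train_set, valid_set, test_set = train_valid_test_split(range(100))
print(len(train_set), len(valid_set), len(test_set))
```

Every row lands in exactly one of the three sets, so the test set stays unseen until the final evaluation.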
Note: The validation dataset is used for tuning the modeling pipeline. If provided, the entire training data will be used for training, and validation of the modeling pipeline is performed with only this validation dataset. When you do not include a validation dataset, Driverless AI will do K-fold cross validation for I.I.D. experiments and multiple rolling window validation splits for time series experiments. For this reason it is not generally recommended to include a validation dataset as you are then validating on only a single dataset. Please note that time series experiments cannot be used with a validation dataset: including a validation dataset will disable the ability to select a time column and vice versa.
Another part of model training is fitting and tuning the models. For fitting and tuning, hyperparameters need to be tuned, and cross-validation needs to take place using only the training data. Various hyperparameter values will need to be tested. "A hyperparameter is a parameter that is set before the learning process begins. These parameters are tunable and can directly affect how well a model trains. An example of hyper parameter in machine learning is learning rate". With cross-validation, the whole dataset is utilized, and each model is trained on a different subset of the training data. Additionally, a cross-validation loop will be set to calculate the cross-validation score for each set of hyperparameters for each algorithm. Based on the cross-validation score and hyperparameter values, you can select the model (remember that "a model in ML is the output of a machine learning algorithm run on data. It represents what was learned by a machine learning algorithm") for each algorithm that has been tuned with training data and test it using your test set.
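The cross-validation loop described above can be sketched as follows. A tiny 1-D k-nearest-neighbors classifier stands in for "an algorithm", and its `k` is the hyperparameter being tuned; the data, function names, and candidate values are all hypothetical.

```python
def k_folds(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs; every row is held out exactly once."""
    indices = list(range(n_rows))
    for fold in range(n_folds):
        valid_idx = indices[fold::n_folds]
        held_out = set(valid_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, valid_idx


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def cv_accuracy(data, k, n_folds=5):
    """Average validation accuracy over the folds for one hyperparameter value."""
    scores = []
    for train_idx, valid_idx in k_folds(len(data), n_folds):
        train = [data[i] for i in train_idx]
        hits = sum(knn_predict(train, data[i][0], k) == data[i][1] for i in valid_idx)
        scores.append(hits / len(valid_idx))
    return sum(scores) / len(scores)


# Hypothetical (feature, label) pairs used as the training data.
data = [(x, 1 if x > 10 else 0) for x in range(20)]

# Score each candidate hyperparameter value and keep the best one.
results = {k: cv_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
print(results, "best k:", best_k)
```

The selected model would then be evaluated once on the held-out test set, which plays no part in this loop.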
One of the significant challenges faced in developing a single production-ready model is that it can take weeks or months to build it. Developing a model involves feature engineering, model building, and model deployment. All tasks are very repetitive, time-consuming, require advanced knowledge of feature generation, algorithms, parameters, and model deployment. Finally, there needs to be in-depth knowledge and confidence in how the model was generated to explain and justify how the model made its decisions.
AutoML or Automated Machine Learning is the process of automating algorithm selection, feature generation, hyperparameter tuning, iterative modeling, and model assessment. AutoML tools such as H2O Driverless AI make it easy to train and evaluate machine learning models. Automating the repetitive tasks of machine learning development allows people in the industry to focus on the data and the business problems they are trying to solve.
 Google's Machine Learning Crash Course
 About Train, Validation and Test Sets in Machine Learning
 Data Science Primer - Data Cleaning
 Feature Engineering
 Towards Data Science- Supervised vs Unsupervised Learning
 Selecting the best Machine Learning Algorithm for your regression problem
 Deep AI - What is a hyperparameter?
 H2O.ai's Driverless AI - Internal Validation Technique
 Difference between Algorithm and Model in Machine Learning
The typical Driverless AI workflow is to:
In addition, you can diagnose a model, transform another dataset, score the model against another dataset, and manage your data in Projects. The focus of this tutorial will be on steps 1 - 4. The other aspects of Driverless AI will be covered in other tutorials found in the Driverless AI learning path. We will start with loading the data.
1. Navigate back to the H2O Driverless AI Datasets page.
The dataset used for this experiment is a version of the Titanic Kaggle dataset. This dataset contains the list of estimated passengers aboard the RMS Titanic.
The RMS Titanic was a British commercial passenger liner that sank after colliding with an iceberg in the North Atlantic Ocean on April 15, 1912. Of an estimated 2,224 passengers and crew members, more than 1,500 lost their lives while on their way to New York City from Southampton.
This tragedy shocked the international community and led to better safety regulations for ships. The lack of lifeboats, amongst other things, was one of the factors that resulted in a significant loss of life. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.
1309 rows, one row per passenger, and 16 columns representing attributes of each passenger:
Id randomly generated
1 = 1st, 2 = 2nd, 3 = 3rd
Passenger name without salutations
Age in years
Number of siblings/Spouse aboard
Number of Parents/Children aboard
Port of Embarkation
C = Cherbourg, Q = Queenstown, S = Southampton
1. Click on Add a Dataset (or Drag and Drop)
2. Select FILE SYSTEM
3. Enter the following:
/data/TestDrive/titanic.csv into the search bar
4. If the file loaded successfully, then you should see an image similar to the one below:
Things to Note:
We are now going to explore the Titanic dataset that we just loaded.
1. Continuing on the Dataset Overview page, click on the titanic.csv dataset. The following options will appear:
Note: A dataset can only be deleted if it's not being used in an experiment. Otherwise, you must delete the experiment first, and then the dataset can be deleted.
2. Next, we are going to confirm that the dataset loaded correctly and that it has the correct number of rows and columns by clicking on Details.
3. Click on Details. Details will take you to the Dataset Details Page
Things to Note:
Logical type (can be changed)
Format for Date and Datetime columns (can be changed)
Note: Driverless AI recognizes the following column types: integer, string, real, boolean, and time. Date columns are given a string "str" type.
4. Select Dataset Rows
Things to Note:
5. Exit and return to Datasets Overview page.
From the titanic.csv dataset, we are going to create two datasets, training and test. 75% of the data will be used for training the model, and 25% to test the trained model.
1. Click on the titanic.csv file and select Split
2. Split the data into two sets: titanic_train and titanic_test, then save the changes. Use the image below as a guide:
Things to Note:
titanic_train(this will serve as the training set)
titanic_test(this will serve as the test set)
The split ratio of 0.75 (75% for the training set and 25% for the test set) was selected for this particular dataset to not generalize the model given the total size of the set.
The training set contains 981 rows, each row representing a passenger, and 16 columns representing the attributes of each passenger.
The test set contains 328 rows, each row representing a passenger, and 16 columns representing the attributes of each passenger.
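The row counts above follow from the 0.75 split ratio applied to the 1309 rows of titanic.csv. The short check below reproduces them; truncating the fractional row is an assumption chosen because it matches the counts stated here, and how Driverless AI rounds internally is not documented in this tutorial.

```python
total_rows = 1309   # rows in titanic.csv
train_ratio = 0.75  # split ratio selected in the Split dialog

# 1309 * 0.75 = 981.75; truncation (an assumption) gives the 981 rows noted above.
train_rows = int(total_rows * train_ratio)
test_rows = total_rows - train_rows

print(train_rows, test_rows)  # 981 328
```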
Verify that the three Titanic datasets, titanic_test, titanic_train and titanic.csv are there:
Now that the titanic.csv dataset has been split, we will use the titanic_train set for the remainder of the tutorial.
There are two ways to visualize the training set:
Method 1: Click on the titanic_train file, select Visualize, then click on the visualization file generated.
Method 2: Click on Autoviz located at the top of the UI page, where you will be asked for the dataset you want to visualize.
1. Pick a method to visualize the titanic_train dataset. A similar image should appear:
Click on the titanic_train visualization, and the following screen will appear.
Is it possible to visualize how variables on the training set are correlated? Can we determine what other variables are strongly correlated to a passenger's survival? The answer to those questions is yes! One of the graphs that allow us to visualize the correlations between variables is the Correlation Graph.
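For intuition about what "correlated" means numerically, the sketch below computes the Pearson correlation coefficient between two columns. This is a generic measure used for illustration only; it is an assumption that it resembles what the Correlation Graph uses, and the Help text in the graph is the authoritative description. The sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for six passengers.
survived = [1, 1, 0, 0, 1, 0]
fare = [80.0, 50.0, 8.0, 7.0, 30.0, 10.0]

# A value near +1 or -1 indicates a strong linear relationship; near 0, a weak one.
print(round(pearson(fare, survived), 3))
```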
Let's explore the correlation between the 'survived' variable and other variables in the dataset.
2. Select the Correlation Graph and then click on Help located at the lower-left corner of the graph.
3. Take a minute to read about how the correlation graph was constructed. Learn more about how variables are color-coded to show their correlations.
4. Take the 'survived' variable and drag it slightly to have a better look at the other variables Driverless AI found it is correlated to.
What variables are strongly correlated with the 'survived' variable?
Things to Note:
5. Exit out of the Correlation Graph view by clicking on the X at the top-right corner of the graph.
6. After you are done exploring the other graphs, go back to the datasets page.
Driverless AI shows graphs of the "relevant" aspects of the data. The following are the types of graphs available:
We are going to launch our first experiment. An experiment means that we are going to generate a prediction using a dataset of our choice.
1. Return to the Dataset Overview page
2. Click on the titanic_train dataset then select Predict
If this is your first time launching an experiment, the following prompt will appear, asking if you want to take a tour.
If you would like to take a quick tour of the Experiments page, select YES, the quick tour will cover the following items:
3. Select Not Now to come back and take the tour another time.
4. The following Experiment page will appear:
Things to Note:
Note: To disable assistant, click on assistant again.
Continuing with our experiment:
Name your experiment as follows:
Titanic Classification Tutorial
5. Click Dropped Columns, drop the following columns: Passenger_Id, name_with_salutations, name_without_salutations, boat, body and home.dest. Then select Done.
These attributes (columns) were removed to create a cleaner dataset. Attributes such as boat and body are excluded because they are clear indicators of whether a passenger survived and can lead to data leakage. For our experiment, the survived column will suffice to create a model.
A clean dataset is essential for the creation of a good predictive model. The process of data cleansing needs to be done with all datasets to rid the set of any unwanted observations, structural errors, unwanted outliers, or missing data.
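Outside the UI, the column drop from step 5 amounts to filtering each row's keys, as in this minimal sketch. The dropped column names come from step 5; the row values and the `age` column are hypothetical.

```python
# Columns dropped in step 5 of this task.
DROPPED = {
    "Passenger_Id", "name_with_salutations", "name_without_salutations",
    "boat", "body", "home.dest",
}


def drop_columns(row, dropped=DROPPED):
    """Return a copy of the row without the dropped columns."""
    return {name: value for name, value in row.items() if name not in dropped}


# Hypothetical row: 'boat' and 'body' are only known after the sinking, so
# keeping them would leak the outcome into the features.
row = {"Passenger_Id": 17, "boat": "5", "body": None,
       "home.dest": "New York, NY", "age": 29.0, "survived": 1}

print(drop_columns(row))  # {'age': 29.0, 'survived': 1}
```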
6. Select Test Dataset and then click on
7. Now select the Target Column. In our case, the column will be 'survived'.
The survived attribute was selected because, as an insurance company, we want to know what other attributes can contribute to the survival of passengers aboard a ship and incorporate that into our insurance rates.
8. Your experiment page should look similar to the one below; these are the system suggestions:
Things to Note:
9. Update the following experiment settings so that they match the image below, then select Launch Experiment.
Note: To Launch an Experiment: The dataset and the target column are the minimum elements required to launch an experiment.
10. The Experiment page will look similar to the one below after 45% complete:
Things to Note:
Once the experiment is complete, an Experiment Summary will appear:
Things to Note:
Driverless AI performs feature engineering on the dataset to determine the optimal representation of the data. Various stages of the features appear throughout the iterations of the experiment. These can be viewed by hovering over points on the Iteration Data - Validation Graph and seeing the updates on the Variable Importance section.
Transformations in Driverless AI are applied to columns in the data. The transformers create engineered features in experiments. There are many types of transformers, below are just some of the transformers found in our dataset:
1. Look at some of the variables in Variable Importance. Note that some of the variables start with _CVTE followed by a column from the dataset. Some other variables might also begin with _WoE depending on the experiment you run. These are the new, high-value features for our training dataset.
These transformations are created with the following transformers:
You can also hover over any of the variables under variable importance to get a simple explanation of the transformer used as seen in the image below:
The complete list of features used in the final model is available in the Experiment Summary artifacts. The Experiment Summary also provides a list of the original features and their estimated feature importance.
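For readers curious about the _CVTE prefix, it refers to cross-validation target encoding. The sketch below shows the general idea of out-of-fold target encoding under stated assumptions: the fold assignment, the fallback to the overall mean, and the sample data are all hypothetical, and this is not Driverless AI's actual implementation.

```python
from collections import defaultdict


def out_of_fold_target_encode(categories, targets, n_folds=3):
    """Replace each category with the mean target computed on the OTHER folds."""
    n = len(categories)
    prior = sum(targets) / n  # fallback when a category is unseen in the other folds
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        # Accumulate target statistics from rows outside the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
        # Encode the rows inside the current fold with those statistics.
        for i in range(fold, n, n_folds):
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if counts[c] else prior
    return encoded


# Hypothetical column values.
sex = ["f", "m", "f", "m", "f", "m"]
survived = [1, 0, 1, 0, 1, 1]
print(out_of_fold_target_encode(sex, survived))
```

Because each row is encoded without using its own target value, the new feature carries information about the label without leaking it directly.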
Let's explore the results of this classification experiment. You can find the results on the Experiment Summary at the left-bottom of the Experiment page. The resulting plots are insights from the training and validation data resulting from the classification problem. Each plot will be given a brief overview.
If you are interested in learning more about each plot and the metrics derived from those plots covered in this section, then check out our next tutorial Machine Learning Experiment Scoring and Analysis Tutorial - Financial Focus.
Once the experiment is done, a summary is generated at the bottom-right corner of the Experiment page.
The summary includes:
Most of the information in the Experiment Summary tab, along with additional detail, can be found in the Experiment Summary Report (Yellow Button "Download Experiment Summary").
2. ROC - Receiver Operating Characteristics
This type of graph is called a Receiver Operating Characteristic curve (or ROC curve). It is a plot of the true positive rate against the false positive rate for the different possible cutpoints of a diagnostic test.
An ROC curve is a useful tool because it only focuses on how well the model was able to distinguish between classes with the help of the Area Under the Curve or AUC. "AUC's can help represent the probability that the classifier will rank a randomly selected positive observation higher than a randomly selected negative observation". However, for models where one of the classes occurs rarely, a high AUC could provide a false sense that the model is correctly predicting the results. This is where the notion of precision and recall becomes essential.
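The ranking interpretation of AUC quoted above can be checked directly by counting pairs, as in this small sketch. The labels and scores are hypothetical, and counting ties as half a win is a common convention assumed here.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels (1 = survived) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

# 8 of the 9 positive/negative pairs are ordered correctly.
print(auc(labels, scores))
```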
The ROC curve below shows Receiver-Operator Characteristics curve stats on validation data along with the best Accuracy, MCC, and F1 values.
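This ROC curve gives an Area Under the Curve, or AUC, of 0.8472. The AUC tells us that, given a randomly selected survivor and a randomly selected non-survivor, the model ranks the survivor higher about 84.72% of the time. Note that this is a measure of ranking quality, not classification accuracy.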
Learn more about the ROC Curve on Machine Learning Experiment Scoring and Analysis Tutorial - Financial Focus: ROC.
3. Prec-Recall: Precision-Recall Graph
Prec-Recall is a complementary tool to ROC curves, especially when the dataset has a significant skew. The Prec-Recall curve plots the precision or positive predictive value (y-axis) versus sensitivity or true positive rate (x-axis) for every possible classification threshold. At a high level, we can think of precision as a measure of exactness or quality of the results while recall as a measure of completeness or quantity of the results obtained by the model. Prec-Recall measures the relevance of the results obtained by the model.
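To see how precision and recall move as the classification threshold changes, consider this minimal sketch. The labels, scores, and thresholds are hypothetical, and returning a precision of 1.0 when nothing is predicted positive is a convention assumed for the example.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical true labels (1 = survived) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for threshold in (0.75, 0.5):
    print(threshold, precision_recall(labels, scores, threshold))
```

Lowering the threshold from 0.75 to 0.5 finds every survivor (higher recall) at the cost of one false alarm (lower precision); the Prec-Recall curve traces this trade-off over all thresholds.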
The Prec-Recall plot below shows the Precision-Recall curve on validation data along with the best Accuracy, MCC, and F1 values. The area under this curve is called AUCPR.
Similarly to the ROC curve, we can look at the area under the Prec-Recall curve, or AUCPR, which here is 0.8146. The closer this value is to 1, the better the model keeps precision high as recall increases; in other words, the passengers it flags as survivors remain mostly relevant results. Like the AUC, the AUCPR summarizes the whole curve and should not be read as an accuracy of 81.46%.
Learn more about the Prec-Recall Curve on Machine Learning Experiment Scoring and Analysis Tutorial - Financial Focus: Prec-Recall.
4. Cumulative Lift Chart
Lift can help us answer the question of how much better one can expect to do with the predictive model compared to a random model (or no model). Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with a model and with a random model (or no model). In other words, the ratio of gain % to the random expectation % at a given quantile. The random expectation of the xth quantile is x%.
The Cumulative Lift chart shows lift stats on validation data. For example, "How many times more observations of the positive target class are in the top predicted 1%, 2%, 10%, etc. (cumulative) compared to selecting observations randomly?" By definition, the Lift at 100% is 1.0.
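The definition of cumulative lift above can be sketched as the positive rate in the top-scored fraction divided by the overall positive rate. The data below is hypothetical, and rounding the quantile size to a whole number of rows is an assumption.

```python
def cumulative_lift(labels, scores, fraction):
    """Positive rate in the top-scored fraction divided by the overall positive rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(ranked[:top_n]) / top_n
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate


# Hypothetical labels (1 = survived) and model scores for ten passengers.
labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

print(cumulative_lift(labels, scores, 0.2))  # top 20% is 2.5 times richer in positives
print(cumulative_lift(labels, scores, 1.0))  # lift at 100% is 1.0 by definition
```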
Learn more about the Cumulative Lift Chart on Machine Learning Experiment Scoring and Analysis Tutorial - Financial Focus: Cumulative Lift.
5. Cumulative Gains Chart
Gain and Lift charts measure the effectiveness of a classification model by looking at the ratio between the results obtained with a trained model versus a random model (or no model). The Gain and Lift charts help us evaluate the performance of the classifier as well as answer questions such as what percentage of the dataset captured has a positive response as a function of the selected percentage of a sample. Additionally, we can explore how much better we can expect to do with a model compared to a random model (or no model).
To better visualize the percentage of positive responses compared to a selected percentage of the sample, we use Cumulative Gains and Quantile.
In the Gains Chart below, the x-axis shows the percentage of cases from the total number of cases in the test dataset, while the y-axis shows the percentage of positive outcomes or survivors in terms of quantiles.
The Cumulative Gains Chart below shows Gains stats on validation data. For example, "What fraction of all observations of the positive target class are in the top predicted 1%, 2%, 10%, etc. (cumulative)?" By definition, the Gains at 100% are 1.0.
The Gains chart above tells us that when looking at the 20% quantile, the model can positively identify ~45% of the survivors, compared to a random model (or no model), which would be able to positively identify ~20% of the survivors at the 20% quantile.
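Cumulative gains can be computed in the same spirit: the share of all positives that falls inside the top-scored fraction. The sketch below uses hypothetical data and assumes the quantile size is rounded to a whole number of rows.

```python
def cumulative_gain(labels, scores, fraction):
    """Share of all positives captured within the top-scored fraction of rows."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    top_n = max(1, round(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Hypothetical labels (1 = survived) and model scores for ten passengers.
labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

for fraction in (0.2, 0.5, 1.0):
    # A random model would be expected to capture a share equal to `fraction`.
    print(fraction, cumulative_gain(labels, scores, fraction))
```

With this toy data the top 20% of rows captures 50% of the survivors, against the 20% a random ordering would be expected to capture.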
Learn more about the Cumulative Gains Chart on Machine Learning Experiment Scoring and Analysis Tutorial - Financial Focus: Cumulative Gains.
Kolmogorov-Smirnov or K-S measures the performance of classification models by measuring the degree of separation between positives and negatives for validation or test data. "The K-S is 100 if the scores partition the population into two separate groups in which one group contains all the positives and the other all the negatives. On the other hand, If the model cannot differentiate between positives and negatives, then it is as if the model selects cases randomly from the population. The K-S would be 0. In most classification models, the K-S will fall between 0 and 100, and that the higher the value, the better the model is at separating the positive from negative cases."
K-S or the Kolmogorov-Smirnov chart measures the degree of separation between positives and negatives for validation or test data.
Hover over a point in the chart to view the quantile percentage and Kolmogorov-Smirnov value for that point.
For the K-S chart above, if we look at the top 60% of the data, the at-chance model (the dotted diagonal line) tells us that only 60% of the data was successfully separated between positives and negatives (survived and did not survive). However, the model achieved a K-S value of 0.4091; that is, about 41% of the cases were successfully separated between positives and negatives.
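One common way to compute a K-S value, sketched below, is to walk down the rows in order of decreasing score and track the largest gap between the cumulative share of positives and the cumulative share of negatives. The data is hypothetical, distinct scores are assumed, and the result is on a 0 to 1 scale (multiply by 100 for the 0 to 100 scale quoted above); this may differ in detail from how Driverless AI computes it.

```python
def ks_statistic(labels, scores):
    """Largest gap between cumulative positive and negative rates (0 to 1 scale)."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    n_pos = sum(ranked)
    n_neg = len(ranked) - n_pos
    seen_pos = seen_neg = 0
    best = 0.0
    for y in ranked:
        if y == 1:
            seen_pos += 1
        else:
            seen_neg += 1
        best = max(best, abs(seen_pos / n_pos - seen_neg / n_neg))
    return best


# Hypothetical labels (1 = survived) and model scores for ten passengers.
labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

print(round(ks_statistic(labels, scores), 4))
```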
Learn more about the Kolmogorov-Smirnov chart on Machine Learning Experiment Scoring and Analysis Tutorial - Financial Focus: Kolmogorov-Smirnov chart.
 ROC Curves and Under the Curve (AUC) Explained
 H2O Driverless AI - Experiment Graphs
 Model Evaluation Classification
 Lift Analysis Data Scientist Secret Weapon
 H2O Driverless AI - Kolmogorov-Smirnov
 Model Evaluation- Classification
After the predictive model is finished, we can explore the interpretability of our model. In other words, what are the results and how did those results come to be?
Questions to consider before viewing the MLI Report:
There are two ways to generate the MLI Report: selecting the MLI link on the upper-right corner of the UI or clicking the Interpret this Model button on the Experiment page.
Generate the MLI report:
1. On the Status: Complete Options, select Interpret this Model
2. Once the MLI model is complete, you should see an image similar to the one below:
3. Once the MLI Experiment is finished, a pop-up appears; go to the MLI page by clicking Yes.
4. The MLI Interpretability Page has the explanations of the model results in a human-readable format.
This section describes MLI functionality and features for regular experiments. For non-time-series experiments, this page provides several visual explanations and reason codes for the trained Driverless AI model and its results.
Things to Note:
Select the MLI Dashboard and explore the different types of insights and explanations regarding the model and its results. All plots are interactive.
1. K-Lime - Global Interpretability model explanation plot:
This plot shows Driverless AI model and LIME model predictions in sorted order by the Driverless AI model predictions. In white is the global linear model of Driverless AI predictions.
Learn more about K-Lime with our Machine Learning Interpretability Tutorial.
2. Feature Importance -
This graph shows the essential features that drive the model behavior.
Learn more about Feature Importance with our Machine Learning Interpretability Tutorial.
3. Decision Tree Surrogate model
The Decision Tree Surrogate model displays an approximate flowchart of the complex Driverless AI model's decision making.
Higher and more frequent features are more important. Features above or below one another can indicate an interaction. Finally, the thickest edges are the most common decision paths through the tree that lead to a predicted numerical outcome.
Learn more about Decision Trees with our Machine Learning Interpretability Tutorial.
4. Partial Dependence and Individual Conditional Expectation (ICE) plot. This plot represents the model prediction for different values of the original variables. It shows the average model behavior for important original variables.
The grey bar represents the standard deviation of predictions. The yellow dot represents the average predictions.
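The relationship between ICE and partial dependence can be sketched in a few lines: an ICE curve varies one feature for a single row, and partial dependence averages those curves over all rows. The `toy_model` function below is a hypothetical stand-in for a trained model, and the rows and grid values are made up for illustration.

```python
def toy_model(row):
    """Hypothetical stand-in for a trained model: returns a survival score."""
    score = 0.2
    score += 0.5 if row["sex"] == "female" else 0.0
    score += 0.2 if row["age"] < 16 else 0.0
    return min(score, 1.0)


def ice_curve(model, row, feature, grid):
    """Predictions for ONE row as the chosen feature is swept across the grid."""
    return [model(dict(row, **{feature: value})) for value in grid]


def partial_dependence(model, rows, feature, grid):
    """Average of the ICE curves over all rows, one value per grid point."""
    curves = [ice_curve(model, r, feature, grid) for r in rows]
    return [sum(column) / len(column) for column in zip(*curves)]


# Hypothetical passengers and grid of ages to evaluate.
rows = [
    {"sex": "female", "age": 30},
    {"sex": "male", "age": 40},
    {"sex": "male", "age": 8},
]
grid = [5, 25, 45]

print(ice_curve(toy_model, rows[1], "age", grid))          # one passenger
print(partial_dependence(toy_model, rows, "age", grid))    # average behavior
```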
Learn more about Partial Dependence Plots with our Machine Learning Interpretability Tutorial.
5. Explanations provide detailed, easy-to-read Reason Codes for the top Global/Local Attributions.
6. Driverless AI offers other plots located under Driverless AI Model and Surrogate Models. Take a few minutes to explore these plots; they are all interactive. About this Plot will provide an explanation of each plot.
Driverless AI Model
7. Click on the MLI link and learn more about "Machine Learning Interpretability with Driverless AI."
Driverless AI allows you to download auto-generated documents such as the Experiment Summary and the MLI Report, all at the click of a button.
1. Click on Download Experiment Summary
When you open the zip file, the following files should be included:
2. Open the auto-generated .doc report and review the experiment results.
3. Click on Download Autoreport
Autoreport is a Word version of an auto-generated report for the experiment. A report file (AutoDoc) is included in the experiment summary.
The zip file for the Autoreport provides insight into the following:
Check out the next Driverless AI tutorial, Machine Learning Experiment Scoring and Analysis Tutorial - Financial Focus, where you will learn how to:
Driverless AI provides a Project Workspace for managing datasets and experiments related to a specific business problem or use case. Whether you are trying to detect fraud or predict user retention, datasets and experiments can be stored and saved in the individual projects. A Leaderboard on the Projects page allows you to easily compare performance and results and identify the best solution for your problem.
From the Projects page, you can link datasets and/or experiments, and you can run new experiments. When you link an existing experiment to a Project, the datasets used for the experiment will automatically be linked to this project (if not already linked).
1. Select Projects; an image similar to the one below will appear:
Things to Note:
3. Open the Time Series Tutorial, an image similar to the one below will appear:
Things to Note:
### Create a Project Workspace
To create a Project Workspace: