Many tools, such as ROC and Precision-Recall curves, are available to evaluate how well a classification model predicts outcomes. In this tutorial, we will use a subset of the Freddie Mac Single-Family Loan-Level dataset to build a classification model and use it to predict whether a loan will become delinquent. Through H2O's Driverless AI Diagnostics tool, we will explore the financial impact of false positive and false negative predictions while working with tools like the ROC Curve, Prec-Recall Curve, Gain and Lift Charts, and the K-S Chart. Finally, we will look at a few metrics, such as AUC, F-Scores, GINI, MCC, and Log Loss, to help us evaluate the performance of the generated model.

Note: We recommend that you go over the entire tutorial first to review all the concepts, that way, once you start the experiment, you will be more familiar with the content.

You will need the following to be able to do this tutorial:

Note: Aquarium's Driverless AI Test Drive lab has a license key built-in, so you don't need to request one to use it. Each Driverless AI Test Drive instance will be available to you for two hours, after which it will terminate. No work will be saved. If you need more time to further explore Driverless AI, you can always launch another Test Drive instance or reach out to our sales team via the contact us form.

About the Dataset

This dataset contains information about "loan-level credit performance data on a portion of fully amortizing fixed-rate mortgages that Freddie Mac bought between 1999 to 2017. Features include demographic factors, monthly loan performance, credit performance including property disposition, voluntary prepayments, MI Recoveries, non-MI recoveries, expenses, current deferred UPB and due date of last paid installment."[1]

[1] Our dataset is a subset of the Freddie Mac Single-Family Loan-Level Dataset. It contains approximately 500,000 rows and is about 80 MB.

The subset of the dataset this tutorial uses has a total of 27 features (columns) and 500,137 loans (rows).

Download the Dataset

Download H2O's subset of the Freddie Mac Single-Family Loan-Level dataset to your local drive and save it as a CSV file.

Launch Experiment

1. Load the loan_level_500k.csv file to Driverless AI by clicking Add Dataset (or Drag and Drop) on the Datasets overview page. Click on Upload File, then select the loan_level_500k.csv file. Once the file is uploaded, select Details.

loan-level-details-selection

Note: You will see 6 more datasets, but you can ignore them, as we will be working with the loan_level_500k.csv file.

2. Let's take a quick look at the columns:

loan-level-details-page

Things to Note:

3. Continue scrolling through the current page to see more columns (image is not included)

4. Return to the Datasets overview page

5. Click on the loan_level_500k.csv file then split

loan-level-split-1

6. Split the data into two sets: freddie_mac_500_train and freddie_mac_500_test. Use the image below as a guide:

loan-level-split-2

Things to Note:

  1. Type freddie_mac_500_train for OUTPUT NAME 1; this will serve as the training set
  2. Type freddie_mac_500_test for OUTPUT NAME 2; this will serve as the test set
  3. For Target Column, select Delinquent
  4. You can set the Random Seed to any number you'd like; we chose 42. Choosing a random seed lets us obtain a consistent split
  5. Change the split value to .75 by adjusting the slider to 75% or entering .75 in the section that says Train/Valid Split Ratio
  6. Save

The training set contains 375k rows, each row representing a loan, and 27 columns representing the attributes of each loan including the column that has the label we are trying to predict.

Note: The actual rows in the training and test splits vary by user since the data is split randomly. The test set contains 125k rows, each row representing a loan, and 27 attribute columns representing the attributes of each loan.

7. Verify that there are three datasets, freddie_mac_500_test, freddie_mac_500_train and loan_level_500k.csv:

loan-level-three-datasets

8. Click on the freddie_mac_500_train file then select Predict.

9. Select Not Now on the "First time Driverless AI, Click Yes to get a tour!" prompt. A similar image should appear:

loan-level-predict

Name your experiment Freddie Mac Classification Tutorial

10. Select Dropped Cols, drop the following 2 columns:

These two columns are dropped because they are both clear indicators that the loans will become delinquent and will cause data leakage.

train-set-drop-columns

11. Select Target Column, then select Delinquent

train-set-select-delinquent

12. Select Test Dataset, then freddie_mac_500_test

add-test-set

13. A similar Experiment page should appear:

experiment-settings-1

In Task 2, we will explore and update the Experiment Settings.

1. Hover over Experiment Settings and note the three knobs: Accuracy, Time and Interpretability.

The Experiment Settings describe the Accuracy, Time, and Interpretability of your specific experiment. The knobs on the experiment settings are adjustable; as their values change, the meaning of the settings shown at the bottom-left of the page changes.

Here is an overview of the Experiments settings:

Accuracy

By increasing the accuracy setting, Driverless AI gradually adjusts the method for performing the evolution and ensemble. A machine learning ensemble combines multiple learning algorithms to obtain better predictive performance than could be obtained from any single learning algorithm [1]. With a low accuracy setting, Driverless AI varies features (from feature engineering) and models, but they all compete evenly against each other. At higher accuracy, each independent main model will evolve independently and become part of the final ensemble as an ensemble over different main models. At higher accuracies, Driverless AI will also evolve and ensemble feature types, such as Target Encoding, on and off, evolving them independently. Finally, at the highest accuracies, Driverless AI performs both model and feature tracking and ensembles all of those variations.

Time

Time specifies the relative time for completing the experiment (i.e., higher settings take longer). Early stopping will take place if the experiment doesn't improve the score for the specified number of iterations. The higher the time value, the more time will be allotted for further iterations, which means that the recipe will have more time to investigate new transformations in feature engineering and model hyperparameter tuning.

Interpretability

The interpretability knob is adjustable. The higher the interpretability, the simpler the features the main modeling routine will extract from the dataset. If the interpretability is high enough then a monotonically constrained model will be generated.

2. For this tutorial update the following experiment settings so that they match the image below:

This configuration was selected to generate a model quickly with a sufficient level of accuracy in the H2O Driverless AI Test Drive environment.

experiment-settings-2

Expert Settings

3. Hover over Expert Settings and click on it. An image similar to the one below will appear:

expert-settings-1

Things to Note:

  1. + Upload Custom Recipe
  2. + Load Custom Recipe From URL
  3. Official Recipes (External)
  4. Experiment
  5. Model
  6. Features
  7. Timeseries
  8. NLP
  9. Recipes
  10. System

Expert Settings are options that are available for those who would like to set their settings manually. Explore the available expert settings by clicking on the tabs at the top of the page.

Expert settings include:

Experiment Settings

Model Settings

Features Settings

Time Series Settings

NLP Settings

Recipes Settings

System Settings

4. For this experiment, turn ON RuleFit models under the Model tab, then select Save.

The RuleFit [2] algorithm creates an optimal set of decision rules by first fitting a tree model and then fitting a Lasso (L1-regularized) GLM model to create a linear model consisting of the most important tree leaves (rules). The RuleFit model can exceed the accuracy of Random Forests while retaining the explainability of decision trees.

expert-settings-rulefit-on

Turning RuleFit on adds it to the list of algorithms that Driverless AI will consider for the experiment. The final selection of algorithms depends on the data and the configuration selected.

5. Before selecting Launch, make sure that your Experiment page looks similar to the one above, once ready, click on Launch.

Learn more about what each setting means and how it can be updated from its default values by visiting H2O's Documentation- Expert Settings

Resources

[1] Ensemble Learning

[2] J. Friedman, B. Popescu. "Predictive Learning via Rule Ensembles". 2005

Deeper Dive

As we learned in the Automatic Machine Learning Introduction with Driverless AI, it is essential to evaluate the performance of a model once it has been generated. These metrics are used to evaluate the quality of the model that was built and to determine what model score threshold should be used to make predictions. There are multiple metrics for assessing binary classification machine learning models, such as the Receiver Operating Characteristic or ROC curve, Precision and Recall or Prec-Recall, and Lift, Gain, and K-S Charts, to name a few. Each metric evaluates different aspects of the machine learning model. The concepts below are for metrics used in H2O's Driverless AI to assess the performance of the classification models it generates. The concepts are covered at a very high level; to learn more in depth about each metric covered here, see the additional resources at the end of this task.

Binary Classifier

Let's take a look at a binary classification model. A binary classification model predicts to which of two categories (classes) the elements of a given set belong. In the case of our example, the two categories (classes) are defaulting on your home loan and not defaulting. The generated model should be able to predict into which category each customer falls.

binary-output

However, two other possible outcomes need to be considered: the false negatives and the false positives. False negatives are the cases where the model predicted that someone would not default on their loan, but they did. False positives are the cases where the model predicted that someone would default on their mortgage, but in reality, they did not. All of the outcomes are visualized through a confusion matrix, which is the two by two table seen below:

Binary classifications produce four outcomes:

Predicted as Positive:

True Positive = TP

False Positive = FP

Predicted as Negative:

True Negative = TN

False Negative = FN

binary-classifier-four-outcomes

Confusion Matrix:

confusion-matrix

From this confusion table, we can measure error-rate, accuracy, specificity, sensitivity, and precision, all useful metrics to test how good our model is at classifying or predicting. These metrics will be defined and explained in the next sections.

On a fun side note, you might be wondering why the name "Confusion Matrix"? Some might say that it's because a confusion matrix can be very confusing. Jokes aside, the confusion matrix is also known as the error matrix since it makes it easy to visualize the classification rate of the model, including the error rate. The term "confusion matrix" is also used in psychology, and the Oxford dictionary defines it as "A matrix representing the relative frequencies with which each of a number of stimuli is mistaken for each of the others by a person in a task requiring recognition or identification of stimuli. Analysis of these data allows a researcher to extract factors (2) indicating the underlying dimensions of similarity in the perception of the respondent. For example, in colour-identification tasks, relatively frequent confusion of reds with greens would tend to suggest daltonism." [1] In other words, it captures how frequently a person performing a classification task confuses one item for another. In the case of ML, it is the machine learning model that performs the classification, and we evaluate how frequently the model, rather than a human, confuses one label with another.

ROC

An essential tool for classification problems is the ROC Curve, or Receiver Operating Characteristic Curve. The ROC Curve visually shows the performance of a binary classifier; in other words, it "tells how much a model is capable of distinguishing between classes" [2] at each possible threshold. Continuing with the Freddie Mac example, the output variable or label is whether or not the customer will default on their loan, evaluated at a given threshold.

Once the model has been built and trained using the training dataset, it gets passed through a classification method (Logistic Regression, Naive Bayes Classifier, support vector machines, decision trees, random forest, etc.); this will give the probability of each customer defaulting.

The ROC curve plots Sensitivity or true positive rate (y-axis) versus 1-Specificity or false positive rate (x-axis) for every possible classification threshold. A classification threshold or decision threshold is the probability value that the model uses to determine which class a prediction belongs to. The threshold acts as a boundary between classes. Since we are dealing with probabilities between 0 and 1, an example of a threshold can be 0.5, which tells the model that anything below 0.5 belongs to one class and anything at or above 0.5 belongs to the other class. The threshold can be selected to maximize the true positives while minimizing false positives. A threshold depends on the scenario that the ROC curve is being applied to and the type of output we look to maximize. Learn more about the application of thresholds and their implications in Task 6: ER: ROC.
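
To make the idea of a decision threshold concrete, here is a minimal sketch (the probabilities below are hypothetical, not output from the Driverless AI model) that converts predicted probabilities into class labels at a 0.5 threshold:

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class (Delinquent = 1)
predicted_probs = np.array([0.10, 0.48, 0.52, 0.91])

threshold = 0.5  # decision threshold: at or above -> class 1, below -> class 0
predicted_classes = (predicted_probs >= threshold).astype(int)

print(predicted_classes)  # [0 0 1 1]
```

Moving the threshold up or down trades false positives against false negatives, which is exactly what the ROC curve summarizes across all thresholds.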

Given our use case of predicting loan delinquency, the following provides a description of the values in the confusion matrix:

What are sensitivity and specificity? The true positive rate is the ratio of the number of true positive predictions divided by all positive actuals. This ratio is also known as recall or sensitivity, and it is measured from 0.0 to 1.0, where 0 is the worst and 1.0 is the best sensitivity. Sensitivity is a measure of how well the model predicts the positive case.

The true negative rate is the ratio of the number of true negative predictions divided by the sum of true negatives and false positives. This ratio is also known as specificity and is measured from 0.0 to 1.0, where 0 is the worst and 1.0 is the best specificity. Specificity is a measure of how often the model correctly predicts the negative case.

The false negative rate is 1 - Sensitivity, or the ratio of false negatives divided by the sum of the true positives and false negatives [3].

The following image provides an illustration of the ratios for sensitivity, specificity and false negative rate.

sensitivity-and-specificity

Recall = Sensitivity = True Positive Rate = TP / (TP + FN)

Specificity = True Negative Rate = TN / (FP + TN)

false-positive-rate

1 - Specificity = False Positive Rate = 1 - True Negative Rate = FP / (FP + TN )
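
To make the three ratios above concrete, here is a small sketch that computes them from hypothetical confusion-matrix counts (the TP, FP, TN, and FN values below are made up for illustration):

```python
# Hypothetical confusion-matrix counts
TP, FP, TN, FN = 80, 20, 880, 20

sensitivity = TP / (TP + FN)          # recall / true positive rate
specificity = TN / (TN + FP)          # true negative rate
false_positive_rate = FP / (FP + TN)  # 1 - specificity

print(round(sensitivity, 3))          # 0.8
print(round(specificity, 3))          # 0.978
print(round(false_positive_rate, 3))  # 0.022
```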

An ROC Curve can also tell you how well your model did by quantifying its performance. The score is determined by the percentage of the area that lies under the ROC curve, otherwise known as the Area Under the Curve or AUC.

Below are four types of ROC Curves with their AUCs:

Note: The closer the ROC Curve is to the left ( the bigger the AUC percentage), the better the model is at separating between classes.

The Perfect ROC Curve (in red) below can separate classes with 100% accuracy and has an AUC of 1.0 (in blue):

roc-auc-1

The ROC Curve below is very close to the left corner, and therefore it does a good job in separating classes with an AUC of 0.7 or 70%:

roc-auc-07

In the case above, the model correctly predicted the positive and negative outcomes in 70% of the cases, while in the other 30% of the cases it produced some mix of false positives or false negatives.

This ROC Curve lies on the diagonal line that splits the graph in half. Since it is further away from the left corner, it does a very poor job of distinguishing between classes; this is the worst case scenario, and it has an AUC of 0.5 or 50%:

roc-auc-05

An AUC of 0.5 tells us that our model is only as good as a random model that has a 50% chance of predicting the outcome. Our model is no better than flipping a coin: 50% of the time it can correctly predict the outcome.

Finally, the ROC Curve below represents another perfect scenario! When the ROC curve lies below the random chance model, the model needs to be reviewed carefully. The reason for this is that there could have been potential mislabeling of the negatives and positives, which caused the values to be reversed; hence the ROC curve sits below the random chance model. Although this ROC Curve looks like it has an AUC of 0.0 or 0%, when we flip it, we get an AUC of 1 or 100%.

roc-auc-0

A ROC curve is a useful tool because it focuses only on how well the model was able to distinguish between classes. "AUC's can help represent the probability that the classifier will rank a randomly selected positive observation higher than a randomly selected negative observation" [4]. However, for models where the positive class occurs rarely, a high AUC could provide a false sense that the model is correctly predicting the results. This is where the notions of precision and recall become important.
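
If you would like to reproduce an ROC curve and its AUC outside of Driverless AI, a minimal scikit-learn sketch looks like the following (the labels and scores are hypothetical):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical actual labels (1 = default) and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.05, 0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.80, 0.35, 0.90]

# One (false positive rate, true positive rate) pair per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(round(auc, 2))  # ~0.96 for this toy data
```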

Applicability of ROC Curves in the Real World

The ROC curve and its AUC are important evaluation metrics for measuring the performance of any classification model. To understand the applicability of ROC curves, consider the following ROC curves, with their AUCs and smooth histograms, from a binary classifier applied to the task below:

Task: Identify the most effective ROC Curve that will distinguish between green and red apples.

Below are three ROC Curves, evaluated against the goal of finding the one that best distinguishes between red and green apples.

As noted above, the closer the ROC curve is to the left (the more significant the AUC percentage), the better the model is at separating classes.

Note: Before moving forward, it's essential to clarify that the smooth histograms plotted are the result of previous data. Such data has determined that apples with more than 50% (threshold) of their body being red will be considered red apples. Therefore, anything below 50% will be a green apple.

ROC One:

ROC-1

In this case, the above smooth Histogram (1A) is telling us that the distribution will be as follows:

The green bell curve represents green apples while the red bell curve represents the red apples and the threshold will be at 50%. The x-axis represents the predicted probabilities, while the y-axis represents the count of observations.

From general observations, we can see that the smooth histogram shows that the current classifier can fully separate red and green apples, and that the separation occurs at the 0.5 threshold.

When we draw the ROC Curve (1B) for the smooth Histogram above, we will get the following results:

The ROC Curve is telling us how good the model is at distinguishing between two classes: in this case, we refer to red and green apples as the two classes. Looking at the ROC Curve, it has an AUC of 1 (in blue). Hence, as discussed earlier, an AUC of one tells us that the model is performing at 100% (perfect performance). This is not always the case, though: a curve that looks like it has an AUC of zero, when flipped, can give us an AUC of one; that is why we always need to review the model carefully.

Accordingly, this ROC Curve is telling us that the classifier model can distinguish between red and green apples in 100% of the cases it has to identify. Consequently, the random model becomes irrelevant. For reference, the random model (dashed line) essentially represents a classifier that does no better than random guessing.

Therefore, the ROC Curve for this smooth Histogram will be perfect because it can separate red and green apples.

ROC Two:

ROC-2

When we draw the ROC curve (2B) for the smooth Histogram (2A) above, we will get the following results:

When looking at the ROC Curve, it will have an AUC of .7 (in blue).

Accordingly, this ROC curve tells us that the classifier model can't adequately distinguish between red and green apples in all cases. In a way, this classifier gets closer to the random model, which does no better than random guessing.

Therefore, this ROC curve is not perfect at classifying red and green apples. That is not to say that the ROC curve is entirely wrong; it just has a 30% margin of error.

ROC Three:

ROC-3

When we draw the ROC curve (3B) for the smooth Histogram (3A) above, we will get the following results:

When looking at the ROC Curve, it will have an AUC of .5 (in blue).

Accordingly, this ROC curve tells us that the classifier model can't adequately distinguish between red and green apples. In effect, this classifier behaves like the random model, which does no better than random guessing.

Therefore, this ROC curve is not perfect at classifying red and green apples. That is not to say that the ROC curve is entirely wrong; it has a 50% margin of error.

Conclusion

In this case, we will choose the first ROC Curve (with a great classifier) because it leads to an AUC of 1.0 (can separate the green apple class and red apple class (the two classes) with 100% accuracy).

Prec-Recall

The Precision-Recall Curve or Prec-Recall or P-R is another tool for evaluating classification models that is derived from the confusion matrix. Prec-Recall is a complementary tool to ROC curves, especially when the dataset has a significant skew. The Prec-Recall curve plots the precision or positive predictive value (y-axis) versus sensitivity or true positive rate (x-axis) for every possible classification threshold. At a high level, we can think of precision as a measure of exactness or quality of the results while recall as a measure of completeness or quantity of the results obtained by the model. Prec-Recall measures the relevance of the results obtained by the model.

Precision is the ratio of correct positive predictions divided by the total number of positive predictions. This ratio is also known as the positive predictive value and is measured from 0.0 to 1.0, where 0.0 is the worst and 1.0 is the best precision. Precision is more focused on the positive class than on the negative class; it measures the probability of correct detection among the predicted positive values (TP and FP).

Precision = True positive predictions / Total number of positive predictions = TP / (TP + FP)

As mentioned in the ROC section, Recall is the true positive rate which is the ratio of the number of true positive predictions divided by all positive actuals. Recall is a metric of the actual positive predictions. It tells us how many correct positive results occurred from all the positive samples available during the test of the model.

Recall = Sensitivity = True Positive Rate = TP / (TP + FN)

precision-recall

Below is another way of visualizing Precision and Recall, this image was borrowed from https://commons.wikimedia.org/wiki/File:Precisionrecall.svg.

prec-recall-visual

A Prec-Recall Curve is created by connecting all precision-recall points through non-linear interpolation [5]. The Prec-Recall plot is broken down into two sections, "Good" and "Poor" performance. "Good" performance can be found in the upper right corner of the plot and "Poor" performance in the lower left corner; see the image below to view the perfect Prec-Recall plot. This division is generated by the baseline. The baseline for Prec-Recall is determined by the ratio of Positives (P) and Negatives (N), where y = P / (P + N); this function represents a classifier with a random performance level [6]. When the dataset is balanced, the value of the baseline is y = 0.5. If the dataset is imbalanced, where the number of P's is higher than the number of N's, the baseline is adjusted accordingly, and vice versa.
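
As a rough sketch of how a Prec-Recall curve and its baseline can be computed outside of Driverless AI (the labels and scores below are hypothetical), scikit-learn provides the building blocks:

```python
from sklearn.metrics import precision_recall_curve, auc

# Hypothetical actual labels (1 = default) and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.10, 0.35, 0.40, 0.50, 0.65, 0.70, 0.20, 0.30, 0.90, 0.15]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)  # area under the Prec-Recall curve

# Baseline: ratio of positives to total cases, y = P / (P + N)
baseline = sum(y_true) / len(y_true)

print(round(pr_auc, 2), baseline)  # baseline is 0.4 for this toy data
```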

The Perfect Prec-Recall Curve is a combination of two straight lines (in red). The plot tells us that the model made no prediction errors! In other words, no false positives (perfect precision) and no false negatives (perfect recall) assuming a baseline of 0.5.

prec-recall-1

Similarly to the ROC curve, we can use the area under the curve or AUC to help us compare the performance of the model with other models.

Note: The closer the Prec-Recall Curve is to the upper-right corner (the bigger the AUC percentage) the better the model is at correctly predicting the true positives.

This Prec-Recall Curve in red below has an AUC of approximately 0.7 (in blue) with a relative baseline of 0.5:

prec-recall-07

Finally, this Prec-Recall Curve represents the worst case scenario where the model is generating 100% false positives and false negatives. This Prec-Recall Curve has an AUC of 0.0 or 0%:

prec-recall-00

From the Prec-Recall plot, some metrics are derived that can be helpful in assessing the model's performance, such as accuracy and Fᵦ scores. These metrics will be explained in more depth in the next section of the concepts. Just note that accuracy or ACC is the number of correct predictions divided by the total number of predictions, and Fᵦ is a weighted harmonic mean of recall and precision.

When looking at ACC in the context of Prec-Recall, it is imperative to note that ACC does not perform well on imbalanced datasets. This is why F-scores can be used to account for the skewed dataset in Prec-Recall.

As you consider the accuracy of a model for the positive cases you want to know a couple of things:

There are also various Fᵦ scores that can be considered: F1, F2, and F0.5. The 1, 2, and 0.5 are the weights given to recall relative to precision. F1, for instance, means that both precision and recall have equal weight, while F2 gives recall higher weight than precision, and F0.5 gives precision higher weight than recall.
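
A short sketch of the three weightings, using scikit-learn's fbeta_score on hypothetical labels and predictions, shows how the scores diverge when recall is higher than precision:

```python
from sklearn.metrics import fbeta_score

# Hypothetical actual labels and predicted classes (after applying a threshold)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # precision = 0.6, recall = 0.75

f1 = fbeta_score(y_true, y_pred, beta=1)     # equal weight: ~0.667
f2 = fbeta_score(y_true, y_pred, beta=2)     # recall weighted higher: ~0.714
f05 = fbeta_score(y_true, y_pred, beta=0.5)  # precision weighted higher: ~0.625

print(round(f1, 3), round(f2, 3), round(f05, 3))
```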

Prec-Recall is a good tool to consider for classifiers because it is a great alternative for large skews in the class distribution. Use precision and recall to focus on a small positive class: when the positive class is smaller and the ability to correctly detect positive samples is our main focus (correct detection of negative examples is less important to the problem), we should use precision and recall.

If you are using a model metric of Accuracy and you see issues with Prec-Recall then you might consider using a model metric of logloss.

Applicability of Precision-Recall Curves in the Real World

In hopes of understanding the applicability of Precision-Recall Curves in the real world, let us see how we can make use of Precision-Recall curves as a metric to check the performance of the binary classification model used above to distinguish between green and red apples.

As mentioned above, Positive Predictive Value refers to Precision. Let us imagine again that we are in an apple factory trying to build a model that will be able to distinguish between red and green apples. The purpose of this model is to keep red apple boxes from containing green apples and vice versa. Accordingly, this is a binary classification problem in which the dependent variable is 0 or 1: either 0 for a green apple or 1 for a red apple. In this circumstance, Precision is the proportion of the apples our model labeled as red that are actually red.

With that in mind, we will follow to calculate precision as follows:

Precision = True Positive / (True Positive + False Positive)

Recall (sensitivity) specifies the proportion of actual red apples that the model predicted as red apples.

The recall will be calculated as follows:

Recall = True Positive / (True Positive + False Negative)

Hence, the main difference between Precision and Recall is the denominator of the fraction. In Recall, false negatives are included, whereas in Precision, false positives are considered.

Difference between Precision and Recall

Let's assume there is a total of 1000 apples. Out of 1000, 500 apples are actually red. Out of 500, we correctly predicted 400 of them as red.

For Precision, the following will be true: out of 1000 apples, we predicted 800 as red apples, and out of those 800, only 400 are actually red. This gives a Recall of 400/500 = 80% and a Precision of 400/800 = 50%.

To further understand these two ideas of Recall and Precision, imagine for a moment one of your high school teachers asking you about the dates of four main holidays: Halloween, Christmas, New Year, and Memorial Day. You manage to recall all four dates, but it takes you twenty attempts in total. Your Recall score will be 100%, but your Precision score will be 20%, which is four divided by twenty.

It is important to note that after you calculate the Recall and Precision values from various confusion matrices for different thresholds and plot the results on a Precision-Recall curve, you should take the following into consideration:

In conclusion, Precision-Recall curves allow for a more in-depth analysis of binary classification models, such as the one trying to distinguish between red and green apples.

GINI, ACC, F1, F0.5, F2, MCC and Log Loss

ROC and Prec-Recall curves are extremely useful for testing a binary classifier because they provide visualization for every possible classification threshold. From those plots we can derive single model metrics (e.g., ACC, F1, F0.5, F2 and MCC). There are also other single metrics that can be used concurrently to evaluate models, such as GINI and Log Loss. The following is a discussion of the model scores ACC, F1, F0.5, F2, MCC, GINI, and Log Loss. The model scores are what the ML model optimizes on.

GINI

The Gini index is a well-established method to quantify the inequality among values of frequency distribution and can be used to measure the quality of a binary classifier. A Gini index of zero expresses perfect equality (or a totally useless classifier), while a Gini index of one expresses maximal inequality (or a perfect classifier).

GINI index formula

gini-index-formula

where pj is the probability of class j [18].

The Gini index is based on the Lorenz curve. The Lorenz curve plots the true positive rate (y-axis) as a function of percentiles of the population (x-axis).

The Lorenz curve represents a collective of models represented by the classifier. The location on the curve is given by the probability threshold of a particular model. (i.e., Lower probability thresholds for classification typically lead to more true positives, but also to more false positives.)[12]

The Gini index itself is independent of the model and only depends on the Lorenz curve determined by the distribution of the scores (or probabilities) obtained from the classifier.
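
In practice, the Gini coefficient reported for a binary classifier is often computed directly from the ROC AUC as Gini = 2 × AUC − 1. The sketch below uses that common convention with hypothetical data; treat it as an illustration rather than the exact formula Driverless AI uses:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical actual labels and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.4, 0.2, 0.8, 0.7, 0.5, 0.9]

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1  # 1 = perfect ranking, 0 = random ranking

print(round(gini, 2))
```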

Accuracy

Accuracy or ACC (not to be confused with AUC or area under the curve) is a single metric for binary classification problems. ACC is the number of correct predictions divided by the total number of predictions. In other words, it measures how well the model can correctly identify both the true positives and the true negatives. Accuracy is measured in the range of 0 to 1, where 1 is perfect accuracy or perfect classification, and 0 is poor accuracy or poor classification [8].

Using the confusion matrix table, ACC can be calculated in the following manner:

Accuracy equation = (TP + TN) / (TP + TN + FP + FN)

accuracy-equation
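
A quick sketch applying the accuracy equation above to hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts
TP, TN, FP, FN = 400, 9000, 350, 250

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(round(accuracy, 3))  # 0.94 -- note this says nothing about class balance
```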

F-Score: F1, F0.5 and F2

The F1 Score is another measurement of classification accuracy. It represents the harmonic average of precision and recall. F1 is measured in the range of 0 to 1, where 0 means that there are no true positives, and 1 means that there are neither false negatives nor false positives (perfect precision and recall) [9].

Using the confusion matrix table, the F1 score can be calculated in the following manner:

F1 = 2TP /( 2TP + FN + FP)

F1 equation:

f1-score-equation

F0.5 equation:

f05-score-equation

Where:
Precision is the positive observations (true positives) the model correctly identified from all the observations it labeled as positive (the true positives + the false positives). Recall is the positive observations (true positives) the model correctly identified from all the actual positive cases (true positives + false negatives)[15].


The F2 score is the weighted harmonic mean of the precision and recall (given a threshold value). Unlike the F1 score, which gives equal weight to precision and recall, the F2 score gives more weight to recall than to precision. More weight should be given to recall for cases where False Negatives are considered to have a stronger negative business impact than False Positives. For example, if your use case is to predict which customers will churn, you may consider False Negatives worse than False Positives. In this case, you want your predictions to capture all of the customers that will churn. Some of these customers may not be at risk for churning, but the extra attention they receive is not harmful. More importantly, no customers actually at risk of churning have been missed [15].

f2-score-equation

Where: Precision is the positive observations (true positives) the model correctly identified from all the observations it labeled as positive (the true positives + the false positives). Recall is the positive observations (true positives) the model correctly identified from all the actual positive cases (the true positives + the false negatives).

MCC

MCC or Matthews Correlation Coefficient is used as a measure of the quality of binary classifications [10]. The MCC is the correlation coefficient between the observed and predicted binary classifications. MCC is measured in the range between -1 and +1, where +1 is a perfect prediction, 0 is no better than a random prediction, and -1 means all predictions are incorrect [9].

Using the confusion matrix table MCC can be calculated in the following manner:

MCC equation:

mcc-equation
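
A minimal sketch of MCC on hypothetical labels, using scikit-learn, with the confusion-matrix form shown in a comment for reference:

```python
from sklearn.metrics import matthews_corrcoef

# Hypothetical actual labels and predicted classes
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# Equivalent confusion-matrix form:
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
print(round(matthews_corrcoef(y_true, y_pred), 3))  # 0.467 here
```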

Log Loss (Logloss)

The logarithmic loss metric can be used to evaluate the performance of a binomial or multinomial classifier. Unlike AUC, which looks at how well a model can classify a binary target, logloss evaluates how close a model's predicted values (uncalibrated probability estimates) are to the actual target value. For example, does a model tend to assign a high predicted value like 0.80 for the positive class, or does it show a poor ability to recognize the positive class and assign a lower predicted value like 0.50? A model with a log loss of 0 would be the perfect classifier. When the model is unable to make correct predictions, the log loss increases, making the model a poor model [11].

Binary classification equation:

logloss-binary-classification-equation

Multiclass classification equation:

logloss-multiclass-classification-equation

Where:
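
To make the binary log loss equation above concrete, here is a minimal sketch (with hypothetical labels and probabilities) that computes it by hand and compares the result with scikit-learn's log_loss:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical actual labels and predicted probabilities of the positive class
y_true = np.array([1, 0, 1, 0])
p = np.array([0.80, 0.10, 0.60, 0.30])

# Binary log loss: -(1/N) * sum( y*log(p) + (1 - y)*log(1 - p) )
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(round(manual, 4), round(log_loss(y_true, p), 4))  # the two values match
```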

The Diagnostics section in Driverless AI calculates the ACC, F1, and MCC values and plots those values on the ROC and Prec-Recall curves, making it easier to identify the best threshold for the generated model. Additionally, it also calculates the Log Loss score for your model, allowing you to quickly assess whether the model you generated is a good model or not.

Let's get back to evaluating metrics results for models.

Gain and Lift Charts

Gain and Lift charts measure the effectiveness of a classification model by looking at the ratio between the results obtained with a trained model and those obtained with a random model (or no model) [7]. The Gain and Lift charts help us evaluate the performance of the classifier as well as answer questions such as what percentage of the dataset captured has a positive response as a function of the selected percentage of a sample. Additionally, we can explore how much better we can expect to do with a model compared to a random model (or no model) [7].

One way we can think of gain is "for every step that is taken to predict an outcome, the level of uncertainty decreases. A drop of uncertainty is the loss of entropy which leads to knowledge gain" [15]. The Gain Chart plots the true positive rate (sensitivity) versus the predictive positive rate (support) where:

Sensitivity = Recall = True Positive Rate = TP / (TP + FN)

Support = Predictive Positive Rate = (TP + FP) / (TP + FP + FN + TN)

sensitivity-and-support

To better visualize the percentage of positive responses compared to a selected percentage of the sample, we use Cumulative Gains and Quantiles. Cumulative gains are obtained by taking the predictive model and applying it to the test dataset, which is a subset of the original dataset. The predictive model scores each case with a probability. The scores are then sorted in descending order by the predicted score. The quantiles take the total number of cases (a finite number) and partition the finite set into subsets of nearly equal size. The percentiles are plotted from the 0th to the 100th percentile. We then plot the cumulative number of positive cases up to each quantile, starting at 0% with the cases that scored the highest probabilities and ending at 100% with the positive cases that scored the lowest probabilities.
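
A rough sketch of how cumulative gains can be computed by hand (hypothetical scores and labels) follows the same recipe: sort by descending score, then accumulate the positives:

```python
import numpy as np

# Hypothetical predicted probabilities and actual labels (1 = default)
y_score = np.array([0.90, 0.80, 0.75, 0.60, 0.55, 0.40, 0.30, 0.20, 0.15, 0.05])
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])

# Sort cases by descending predicted score, then accumulate the positives
order = np.argsort(-y_score)
cum_positives = np.cumsum(y_true[order])

population_pct = np.arange(1, len(y_true) + 1) / len(y_true)  # x-axis
cumulative_gain = cum_positives / y_true.sum()                # y-axis

print(population_pct[2], cumulative_gain[2])  # the top 30% of cases captures 50% of the positives
```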

In the cumulative gains chart, the x-axis shows the percentage of cases from the total number of cases in the test dataset, while the y-axis shows the percentage of positive responses in terms of quantiles. Since the probabilities have been ordered from highest to lowest, we can look at the percentage of predicted positive cases found in the top 10% or 20% as a way to narrow down the number of positive cases that we are interested in. Visually, the performance of the predictive model can be compared to that of a random model (or no model). The random model is represented below in red as the worst case scenario of random sampling.

cumulative-gains-chart-worst-case

How can we identify the best case scenario in relation to the random model? To do this we need to identify a Base Rate first. The Base Rate sets the limits of the optimal curve. The best gains are always controlled by the Base Rate. An example of a Base Rate can be seen on the chart below (dashed green).

cumulative-gains-chart-best-case

The above chart represents the best case scenario of a cumulative gains chart assuming a base rate of 20%. In this scenario all the positive cases were identified before reaching the base rate.

The chart below represents an example of a predictive model (solid green curve). We can see how well the predictive model did in comparison to the random model (dotted red line). Now, we can pick a quantile and determine the percentage of positive cases up to that quantile in relation to the entire test dataset.

cumulative-gains-chart-predictive-model

Lift can help us answer the question of how much better one can expect to do with the predictive model compared to a random model (or no model). Lift is a measure of the effectiveness of a predictive model calculated as the ratio between the results obtained with a model and with a random model (or no model). In other words, it is the ratio of the gain % to the random expectation % at a given quantile, where the random expectation at the xth quantile is x% [16].

Lift = Predictive rate/ Actual rate

When plotting lift, we also plot it against quantiles in order to help us visualize how likely it is that a positive case will take place, since the Lift chart is derived from the cumulative gains chart. The points of the lift curve are calculated by determining the ratio between the result predicted by our model and the result using a random model (or no model). For instance, assuming a base rate (or hypothetical threshold) of 20% from a random model, we would take the cumulative gain percent at the 20% quantile, X, and divide it by 20. We do this for all the quantiles until we get the full lift curve.
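
Continuing the cumulative-gains sketch above, lift at each quantile is just the cumulative gain divided by the random expectation for that quantile (all values here are hypothetical):

```python
import numpy as np

# Hypothetical quantiles and their cumulative gains
population_pct = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
cumulative_gain = np.array([0.50, 0.75, 0.90, 1.00, 1.00])

# Lift = cumulative gain at a quantile / random expectation at that quantile
lift = cumulative_gain / population_pct

print(lift[0])  # 2.5: the top 20% contains 2.5x more positives than random selection would
```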

We can start the lift chart with the base rate as seen below, recall that the base rate is the target threshold.

lift-chart-base-rate

When looking at the cumulative lift for the top quantiles, X, it means that when we select, let's say, the top 20% of the quantiles from the total test cases based on the model, we can expect X/20 times the number of positive cases that would be found by randomly selecting 20% with the random model.

lift-chart

Kolmogorov-Smirnov Chart

Kolmogorov-Smirnov or K-S measures the performance of classification models by measuring the degree of separation between positives and negatives for validation or test data [13].

The KS statistic is the maximum difference between the cumulative percentage of responders or 1's (cumulative true positive rate) and the cumulative percentage of non-responders or 0's (cumulative false positive rate). The significance of the KS statistic is that it helps us understand what portion of the population should be targeted to get the highest response rate (1's) [17].
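
Since the ROC curve already provides the cumulative true positive rate and false positive rate at every threshold, the K-S statistic can be sketched as the largest gap between the two (hypothetical data below):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical actual labels and predicted probabilities
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]

# KS = max difference between cumulative TPR and cumulative FPR
fpr, tpr, thresholds = roc_curve(y_true, y_score)
ks = np.max(tpr - fpr)

print(round(ks, 2))  # ~0.58 for this toy data
```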

k-s-chart

References

[1] Confusion Matrix definition, "A Dictionary of Psychology"

[2] Towards Data Science - Understanding AUC- ROC Curve

[3] Introduction to ROC

[4] ROC Curves and Area Under the Curve (AUC) Explained

[5] Introduction to Precision-Recall

[6] Tharwat, Applied Computing and Informatics (2018)

[7] Model Evaluation Classification

[8] Wiki Accuracy

[9] Wiki F1 Score

[10] Wiki Matthews Correlation Coefficient

[11] Wiki Log Loss

[12] H2O's GINI Index

[13] H2O's Kolmogorov-Smirnov

[14] Model Evaluation- Classification

[15] What is Information Gain in Machine Learning

[16] Lift Analysis Data Scientist Secret Weapon

[17] Machine Learning Evaluation Metrics Classification Models

[18] Scikit-Learn: Decision Tree Learning I - Entropy, GINI, and Information Gain

Deeper Dive and Resources

At the end of the experiment, a summary of the project will appear on the right-lower corner. Also, note that the name of the experiment is at the top-left corner.

experiment-results-summary

The summary includes the following:

Most of the information in the Experiment Summary tab, along with additional detail, can be found in the Experiment Summary Report (Yellow Button "Download Experiment Summary").

Below are three questions to test your understanding of the experiment summary and frame the motivation for the following section.

1. Find the number of features that were scored for your model and the total features that were selected.

2. Take a look at the validation Score for the final pipeline and compare that value to the test score. Based on those scores would you consider this model a good or bad model?

Note: If you are not sure what Log loss is, feel free to review the concepts section of this tutorial.

3. So what do the Log Loss values tell us? The essential Log Loss value is the test score value. This value tells us how well the generated model did against the freddie_mac_500_test set based on the error rate. In the case of the experiment Freddie Mac Classification Tutorial, the test score LogLoss = 0.1194556 +/- 0.001470279, which quantifies how far the model's predicted probabilities are from the actual labels. The greater the Log Loss value, the more significant the misclassification. For this experiment, the Log Loss was relatively small, meaning the error rate for misclassification was not substantial. But what would a score like this mean for an institution like Freddie Mac?

In the next few tasks we will explore the financial implications of misclassification by exploring the confusion matrix and plots derived from it.

Deeper Dive and Resources

Now we are going to run a model diagnostics on the freddie_mac_500_test set. The diagnostics model allows you to view model performance for multiple scorers based on an existing model and dataset through the Python API.

1. Select Diagnostics

diagnostics-select

2. Once in the Diagnostics page, select + Diagnose Model

diagnose-model

3. In the Create new model diagnostics page:

  1. Click on Diagnosed Experiment then select the experiment that you completed in Task 4: Freddie Mac Classification Tutorial
  2. Click on Dataset then select the freddie_mac_500_test dataset
  3. Initiate the diagnostics model by clicking on Launch Diagnostics

create-new-model-diagnostic

4. After the model diagnostics is done running, a model similar to the one below will appear:

new-model-diagnostics

Things to Note:

  1. Name of new diagnostics model
  2. Model: Name of ML model used for diagnostics
  3. Dataset: Name of the dataset used for diagnostics
  4. Message: Message regarding the new diagnostics model
  5. Status: Status of the new diagnostics model
  6. Time: Time it took for the new diagnostics model to run
  7. Options for this model

5. Click on the new diagnostics model and a page similar to the one below will appear:

diagnostics-model-results

Things to Note:

  1. Info: Information about the diagnostics model including the name of the test dataset, name of the experiment used and the target column used for the experiment
  2. Scores: Summary for the values for GINI, MCC, F05, F1, F2, Accuracy, Log loss, AUC and AUCPR in relation to how well the experiment model scored against a "new" dataset
    • Note: The new dataset must be the same format and with the same number of columns as the training dataset
  3. Metric Plots: Metrics used to score the experiment model, including the ROC Curve, Prec-Recall Curve, Cumulative Gains, Lift Chart, Kolmogorov-Smirnov Chart, and Confusion Matrix
  4. Download Predictions: Download the diagnostics predictions

Note: The scores will be different for the train dataset and the validation dataset used during the training of the model.

Confusion Matrix

As mentioned in the concepts section, the confusion matrix is the root from which most metrics used to test the performance of a model originate. The confusion matrix provides an overview of a supervised model's ability to classify.

Click on the confusion matrix located in the Metric Plots section of the Diagnostics page, bottom-right corner. An image similar to the one below will come up:

diagnostics-confusion-matrix-0

The confusion matrix lets you choose a desired threshold for your predictions. In this case, we will take a closer look at the confusion matrix generated by the Driverless AI model with the default threshold, which is 0.5.

The first part of the confusion matrix we are going to look at is the Predicted labels and Actual labels. As shown on the image below the Predicted label values for Predicted Condition Negative or 0 and Predicted Condition Positive or 1 run vertically while the Actual label values for Actual Condition Negative or 0 and Actual Condition Positive or 1 run horizontally on the matrix.

Using this layout, we will be able to determine how well the model predicted the people that defaulted and those that did not from our Freddie Mac test dataset. Additionally, we will be able to compare it to the actual labels from the test dataset.

diagnostics-confusion-matrix-1

Moving into the inner part of the matrix, we find the number of cases for True Negatives, False Positives, False Negatives and True Positives. The confusion matrix for this generated model tells us that:

diagnostics-confusion-matrix-2

The next layer we will look at is the Total sections for Predicted label and Actual label.

On the right side of the confusion matrix are the totals for the Actual label and at the base of the confusion matrix, the totals for the Predicted label.

Actual label

Predicted label

diagnostics-confusion-matrix-3

The final layer of the confusion matrix we will explore is the errors. The errors section is one of the first places where we can check how well the model performed: the better the model does at classifying labels on the test dataset, the lower the error rate will be. The error rate is also known as the misclassification rate, which answers the question: how often is the model wrong?

For this particular model these are the errors:

What does the misclassification error of .03546 mean?
One of the best ways to understand the impact of this misclassification error is to look at the financial implications of the False Positives and False Negatives. The False Negatives represent the loans predicted not to default that in reality did default.
Additionally, we can look at the mortgages that Freddie Mac missed out on by not granting loans because the model predicted that the borrowers would default when in reality they did not (the False Positives).

One way to look at the financial implications for Freddie Mac is to look at the total paid interest rate per loan. The mortgages on this dataset are traditional home equity loans which means that the loans are:

For this tutorial, we will assume a 6% Annual Percent Rate (APR) over 30 years. APR is the amount one pays to borrow the funds. Additionally, we are going to assume an average home loan of $167,473 (this average was calculated by taking the sum of all the loans on the freddie_mac_500.csv dataset and dividing it by 30,001 which is the total number of mortgages on this dataset). For a mortgage of $167,473 the total interest paid after 30 years would be $143,739.01 [1].

When looking at the False Positives, we can think about the 152 cases of people the model predicted should not be granted a home loan because they were predicted to default on their mortgage. These 152 loans translate to approximately 21.8 million dollars in lost potential interest income (152 * $143,739.01).

Now, looking at the False Negatives, we do the same and take the 4,282 cases that were granted a loan because the model predicted that they would not default on their home loan. These 4,282 cases translate to approximately 615 million dollars in lost interest (4,282 * $143,739.01), since these loans defaulted.

The misclassification rate provides a summary of the sum of the False Positives and False Negatives divided by the total cases in the test dataset. The misclassification rate for this model was .03546. If this model were used to determine home loan approvals, the mortgage institution would need to consider approximately 615 million dollars in losses from misclassified loans that were approved and shouldn't have been, and approximately 22 million dollars from loans that were not approved because they were classified as defaulting.

One way to look at these results is to ask the question: is missing out on approximately 22 million dollars from loans that were not approved better than losing about 615 million dollars from loans that were approved and then defaulted? There is no definite answer to this question, and the answer depends on the mortgage institution.

diagnostics-confusion-matrix-4

Scores

Driverless AI conveniently provides a summary of the scores for the performance of the model given the test dataset.

The scores section provides a summary of the Best Scores found in the metrics plots:

The image below represents the scores for the Freddie Mac Classification Tutorial model using the freddie_mac_500_test dataset:

diagnostics-scores

When the experiment was run for this classification model, Driverless AI determined that the best scorer for it was the Logarithmic Loss or LOGLOSS due to the imbalanced nature of the dataset. LOGLOSS focuses on getting the probabilities right (it strongly penalizes wrong probabilities). The selection of Logarithmic Loss makes sense since we want a model that can correctly classify those who are most likely to default while ensuring that those who qualify for a loan can get one.

Recall that Log Loss is the logarithmic loss metric used to evaluate the performance of a binomial or multinomial classifier, where a model with a Log Loss of 0 would be the perfect classifier. Our model scored a LOGLOSS value of 0.1195 +/- .0015 after testing it with the test dataset. From the confusion matrix, we saw that the model had issues classifying perfectly; however, it was able to classify with an ACCURACY of 0.9646 +/- .0006. The financial implications of the misclassifications have been covered in the confusion matrix section above.

Driverless AI has the option to change the type of scorer used for the experiment. Recall that for this dataset the scorer was selected to be logloss. An experiment can be re-run with another scorer. For general imbalanced classification problems, AUCPR and MCC scorers are good choices, while F05, F1, and F2 are designed to balance between recall and precision.
The AUC is designed for ranking problems. Gini is similar to the AUC but measures the quality of ranking (inequality) for regression problems.

In the next few tasks we will explore the scorer further and the Scores values in relation to the residual plots.

References

[1] Amortization Schedule Calculator

Deeper Dive and Resources

From the Diagnostics page click on the ROC Curve. An image similar to the one below will appear:

diagnostics-roc-curve

To review, an ROC curve demonstrates the following:

Going back to the Freddie Mac dataset, even though the model was scored with the Logarithmic Loss to penalize for error we can still take a look at the ROC curve results and see if it supports our conclusions from the analysis of the confusion matrix and scores section of the diagnostics page.

1. Based on the ROC curve that Driverless AI model generated for your experiment, identify the AUC. Recall that a perfect classification model has an AUC of 1.

2. For each of the following points on the curve, determine the True Positive Rate, False Positive rate, and threshold by hovering over each point below as seen on the image below:

diagnostics-roc-best-acc

Recall that for a binary classification problem, accuracy is the number of correct predictions made as a ratio of all predictions made. A threshold is used to convert predicted probabilities into predicted classes. For this model, it was determined that the best accuracy is found at a threshold of .5102.

At this threshold, the model predicted:

3. From the AUC, Best MCC, F1, and Accuracy values from the ROC curve, how would you qualify your model: is it a good or bad model? Use the key points below to help you assess the ROC Curve.

Remember that for the ROC curve:

Note: If you are not sure what AUC, MCC, F1, and Accuracy are or how they are calculated review the concepts section of this tutorial.

New Model with Same Parameters

In case you were curious and wanted to know if you could improve the accuracy of the model, this can be done by changing the scorer from Logloss to Accuracy.

1. To do this, click on the Experiments page.

2. Click on the experiment you did for task 1 and select New Model With Same Params

new-model-w-same-params

An image similar to the one below will appear. Note that this page has the same settings as the settings in Task 1. The only difference is that in the Scorer section, Logloss was updated to Accuracy. Everything else should remain the same.

3. If you haven't done so, select Accuracy on the scorer section then select Launch Experiment

new-model-accuracy

Similarly to the experiment in Task 1, wait for the experiment to run. After the experiment is done running, a similar page will appear. Note that on the summary located on the bottom right-side, both the validation and test scores are no longer being scored by Logloss but by Accuracy instead.

new-experiment-accuracy-summary

We are going to use this new experiment to run a new diagnostics test. You will need the name of the new experiment. In this case, the experiment name is 1.Freddie Mac Classification Tutorial.

4. Go to the Diagnostics tab.

5. Once in the Diagnostics page, select +Diagnose Model

6. In the Create new model diagnostics page:

  1. Click on Diagnosed Experiment, then select the experiment that you completed; in this case, the experiment name is 1.Freddie Mac Classification Tutorial
  2. Click on Dataset then select the freddie_mac_500_test dataset
  3. Initiate the diagnostics model by clicking on Launch Diagnostics

diagnostics-create-new-model-for-accuracy

7. After the model diagnostics is done running, a new diagnostics model will appear

8. Click on the new diagnostics model. On the Scores section observe the accuracy value. Compare this Accuracy value to the Accuracy value from task 6.

diagnostics-scores-accuracy-model

9. Next, locate the new ROC curve and click on it. Hover over the Best ACC point on the curve. An image similar to the one below will appear:

diagnostics-roc-curve-accuracy-model

How much improvement did we get from optimizing the accuracy via the scorer?

The new model predicted:

The first model predicted:

The threshold for best accuracy changed from .5102 for the first diagnostics model to .5134 for the new model. This increase in threshold improved accuracy, or the number of correct predictions made as a ratio of all predictions made. Note, however, that while the number of FP decreased, the number of FN increased. We were able to reduce the number of cases that were falsely predicted to default, but in doing so, we increased the number of FN, or cases that were predicted not to default but did.

The takeaway is that there is no win-win; trade-offs need to be made. By optimizing for accuracy, we increased the number of approved mortgage loans, especially for applicants who would have been denied a mortgage because they were predicted to default when, in reality, they did not. However, we also increased the number of cases that should not have been granted a loan but were. As a mortgage lender, would you prefer to reduce the number of False Positives or False Negatives?
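If you want to see this trade-off outside of Driverless AI, here is a rough sketch with synthetic data. The thresholds below are exaggerated compared with .5102 and .5134 so the shift is easy to see: raising the threshold reduces False Positives at the cost of more False Negatives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic probabilities and labels, just to illustrate the FP/FN trade-off
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)
# defaulters tend to get higher scores, with plenty of overlap
probs = np.clip(rng.normal(0.35 + 0.3 * y_true, 0.2), 0, 1)

for threshold in (0.40, 0.60):  # exaggerated thresholds to make the shift visible
    y_pred = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold:.2f}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"accuracy={(tp + tn) / len(y_true):.3f}")

# Raising the threshold makes the model more conservative about predicting "Delinquent":
# the number of False Positives drops, while the number of False Negatives grows.
```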

10. Exit out of the ROC curve by clicking on the x located at the top-right corner of the plot, next to the Download option

Deeper Dive and Resources

Continuing on the diagnostics page, select the P-R curve. The P-R curve should look similar to the one below:

diagnostics-pr-curve

Remember that for the Prec-Recall:

Looking at the P-R curve results, is this a good model to determine if a customer will default on their home loan? Let's take a look at the values found on the P-R curve.

1. Based on the P-R curve that the Driverless AI model generated for your experiment, identify the AUC.

2. For each of the following points on the curve, determine the Precision, Recall (True Positive Rate), and threshold by hovering over each point, as seen in the image below:

diagnostics-prec-recall-best-mccr

3. From the observed AUC, Best MCC, F1, and Accuracy values for the P-R curve, how would you qualify your model? Is it a good or bad model? Use the key points below to help you assess the P-R curve.

Remember that for the P-R curve:

Note: If you are not sure what AUC, MCC, F1, and Accuracy are or how they are calculated review the concepts section of this tutorial.
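As a side note, if you would like to reproduce the P-R quantities outside of Driverless AI, a minimal sketch with scikit-learn and synthetic scores (not the tutorial's actual predictions) could look like the following:

```python
import numpy as np
from sklearn.metrics import (precision_recall_curve, auc,
                             precision_score, recall_score, f1_score)

# Synthetic scores and labels standing in for the Driverless AI predictions
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(rng.normal(0.3 + 0.4 * y_true, 0.25), 0, 1)

# Area under the precision-recall curve (AUCPR), approximated by the trapezoidal rule
precision, recall, thresholds = precision_recall_curve(y_true, scores)
print("AUCPR:", round(auc(recall, precision), 3))

# Precision, recall, and F1 for the hard predictions at a single threshold (0.5 here)
y_pred = (scores >= 0.5).astype(int)
print("precision:", round(precision_score(y_true, y_pred), 3),
      "recall:", round(recall_score(y_true, y_pred), 3),
      "F1:", round(f1_score(y_true, y_pred), 3))
```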

New Model with Same Parameters

Similar to Task 6, we can improve the area under the curve for precision-recall by creating a new model with the same parameters. Note that you need to change the Scorer from Logloss to AUCPR. You can try this on your own.

To review how to run a new experiment with the same parameters and a different scorer, follow the steps in Task 6, section New Model with Same Parameters.

new-model-w-same-params-aucpr

Note: If you ran the new experiment, go back to the diagnostic for the experiment we were working on.

Deeper Dive and Resources

Continuing on the diagnostics page, select the CUMULATIVE GAIN curve. The Gains curve should look similar to the one below:

diagnostics-gains

Remember that for the Gains curve:

Note: The y-axis of the plot has been adjusted to represent quantiles; this allows for focus on the quantiles that have the most data and therefore the most impact.

1. Hover over the various quantile points on the Gains chart to view the quantile percentage and cumulative gain values

2. What is the cumulative gain at the 1%, 2%, and 10% quantiles?

diagnostics-gains-10-percent

For this Gain Chart, if we look at the top 1% of the data, the at-chance model (the dotted diagonal line) tells us that we would have correctly identified 1% of the defaulted mortgage cases. The model generated (yellow curve) shows that it was able to identify about 12% of the defaulted mortgage cases.

If we hover over the top 10% of the data, the at-chance model (the dotted diagonal line) tells us that we would have correctly identified 10% of the defaulted mortgage cases. The model generated (yellow curve) shows that it was able to identify about 53% of the defaulted mortgage cases.
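To see where these gain numbers come from, here is a rough sketch of how cumulative gain can be computed by hand. The data is synthetic, so the exact values will differ from the chart above.

```python
import numpy as np

def cumulative_gain(y_true, scores, quantile):
    """Fraction of all positives captured in the top `quantile` of cases,
    when cases are ranked by the model's score (highest first)."""
    order = np.argsort(scores)[::-1]          # sort cases by score, descending
    top_k = int(np.ceil(quantile * len(scores)))
    captured = y_true[order][:top_k].sum()    # positives found in the top slice
    return captured / y_true.sum()

# Synthetic example: a random model would capture roughly q of the positives at
# quantile q, while a useful model captures noticeably more.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=10_000)
scores = np.clip(rng.normal(0.3 + 0.4 * y_true, 0.25), 0, 1)

for q in (0.01, 0.02, 0.10):
    print(f"top {q:.0%}: cumulative gain = {cumulative_gain(y_true, scores, q):.2f}")
```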

3. Based on the shape of the gain curve and the baseline (white diagonal dashed line), would you consider this a good model?

Remember that the perfect prediction model starts out pretty steep, and as a rule of thumb, the steeper the curve, the higher the gain. The area between the baseline (white diagonal dashed line) and the gain curve (yellow curve), better known as the area under the curve, visually shows us how much better our model is than the random model. There is always room for improvement: the gain curve could be steeper.

Note: If you are not sure what AUC or what the gain chart is, feel free to review the concepts section of this tutorial.

4. Exit out of the Gains chart by clicking on the x located at the top-right corner of the plot, next to the Download option

Deeper Dive and Resources

Continuing on the diagnostics page, select the LIFT curve. The Lift curve should look similar to the one below:

diagnostics-lift

Remember that for the Lift curve:

A Lift chart is a visual aid for measuring model performance.

Note: The y-axis of the plot has been adjusted to represent quantiles; this allows for focus on the quantiles that have the most data and therefore the most impact.

1. Hover over the various quantile points on the Lift chart to view the quantile percentage and cumulative lift values

2. What is the cumulative lift at the 1%, 2%, and 10% quantiles?

diagnostics-lift-10-percent

For this Lift Chart, all the predictions were sorted according to decreasing scores generated by the model. In other words, uncertainty increases as the quantile moves to the right. At the 10% quantile, our model achieved a cumulative lift of about 5.3, meaning that the top 10% of cases ranked by the model contained about five times more defaults than a random 10% sample would.
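For reference, cumulative lift at a quantile is simply the positive (default) rate in the top model-ranked slice divided by the overall positive rate. A minimal sketch with made-up numbers:

```python
import numpy as np

def cumulative_lift(y_true, scores, quantile):
    """How many times more positives the top `quantile` of model-ranked cases
    contains, compared with a random sample of the same size."""
    order = np.argsort(scores)[::-1]             # rank cases by score, highest first
    top_k = int(np.ceil(quantile * len(scores)))
    rate_in_top = y_true[order][:top_k].mean()   # positive rate in the top slice
    overall_rate = y_true.mean()                 # positive rate in the whole dataset
    return rate_in_top / overall_rate

# Tiny made-up example: 10 loans, 2 defaults, both ranked at the top by the model
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05])
print(cumulative_lift(y_true, scores, 0.20))   # 1.0 / 0.2 = 5.0 -> lift of 5 at the top 20%
```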

3. Based on the area between the lift curve and the baseline (white horizontal dashed line), is this a good model?

The area between the baseline (white horizontal dashed line) and the lift curve (yellow curve), better known as the area under the curve, visually shows us how much better our model is than the random model.

4. Exit out of the Lift chart by clicking on the x located at the top-right corner of the plot, next to the Download option

Deeper Dive and Resources

Continuing on the diagnostics page, select the KS chart. The K-S chart should look similar to the one below:

diagnostics-ks

Remember that for the K-S chart:

Note: The y-axis of the plot has been adjusted to represent quantiles; this allows for focus on the quantiles that have the most data and therefore the most impact.

1. Hover over the various quantile points on the K-S chart to view the quantile percentage and Kolmogorov-Smirnov values

2. What is the Kolmogorov-Smirnov value at the 1%, 2%, and 10% quantiles?

diagnostics-ks-20-percent

For this K-S chart, if we look at the top 20% of the data, the at-chance model (the dotted diagonal line) tells us that only 20% of the data was successfully separated between positives and negatives (defaulted and not defaulted). The model, however, achieved a value of 0.548, meaning about 54.8% of the cases were successfully separated between positives and negatives.
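For reference, the Kolmogorov-Smirnov statistic can also be computed directly as the maximum distance between the score distributions of the two classes. A small sketch with synthetic scores (not the tutorial's actual predictions), using scipy:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic scores standing in for the model's predicted probabilities
rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, size=10_000)
scores = np.clip(rng.normal(0.3 + 0.4 * y_true, 0.25), 0, 1)

# The K-S statistic is the maximum distance between the cumulative score
# distributions of the positive class (defaulted) and the negative class (paid).
ks_stat, _ = ks_2samp(scores[y_true == 1], scores[y_true == 0])
print("K-S statistic:", round(ks_stat, 3))  # closer to 1 means better separation
```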

3. Based on the K-S curve (yellow) and the baseline (white diagonal dashed line), is this a good model?

4. Exit out of the K-S chart by clicking on the x located at the top-right corner of the plot, next to the Download option

Deeper Dive and Resources

Driverless AI makes it easy to download the results of your experiments, all at the click of a button.

1. Let's explore the auto generated documents for this experiment. On the Experiment page select Download Experiment Summary.

download-experiment-summary

The Experiment Summary contains the following:

A report file is included in the experiment summary. This report provides insight into the training data and any detected shifts in distribution, the validation schema selected, model parameter tuning, feature evolution and the final set of features chosen during the experiment.

2. Open the report .docx file; this auto-generated report contains the following information:

3. Take a few minutes to explore the report

4. Explore Feature Evolution and Feature Transformation. How is this summary different from the summary provided in the Experiments Page?

Answer: In the Experiments page, you can set the name of your experiment, set up the dataset being used to create an experiment, view the total number of rows and columns of your dataset, drop columns, select a dataset to validate, etc.

In contrast, the experiment summary report contains insight into the training data and any detected shifts in distribution, the validation schema, etc. In particular, when exploring the feature evolution and feature transformation sections of the summary report, we will encounter the following information:

Feature evolution: This summary will detail the algorithms used to create the experiment. 

Feature transformation: This summary provides information about the new high-value features that were automatically engineered for the given dataset.

5. Find the section titled Final Model on the report.docx and explore the following items:

Deeper Dive and Resources

Check out the next tutorial: Machine Learning Interpretability, where you will learn how to: