Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that tries to identify and extract opinions from a given text. It aims to gauge the attitudes, sentiments, and emotions of a speaker or writer through the computational treatment of subjectivity in text. The result can be a binary like/dislike rating or a numerical rating from 1 to 5.

Figure 1. Sentiment

Sentiment Analysis is an important subfield of NLP. It can help a company create targeted brand messages and understand consumers' preferences. These insights could be critical for a company to increase its reach and influence across a range of sectors.

Here are some of the uses of Sentiment Analysis from a business perspective:

In this self-paced course, we will learn some core NLP concepts that will enable us to build and understand an NLP model capable of classifying fine food reviews from Amazon customers. In other words, we will conduct a Sentiment Analysis on the various customer reviews.

Note: It is highly recommended that you go over the entire self-paced course before starting the experiment.

References

You will need the following to be able to do this self-paced course:

Note: Aquarium's Driverless AI Test Drive lab has a license key built-in, so you don't need to request one to use it. Each Driverless AI Test Drive instance will be available to you for two hours, after which it will terminate. No work will be saved. If you need more time to further explore Driverless AI, you can always launch another Test Drive instance or reach out to our sales team via the contact us form.

About the Dataset

The dataset consists of reviews of fine foods from Amazon. The data spans a period of more than 10 years, from October 1999 to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories[1]. The data consists of 568,454 reviews, 256,059 users, 74,258 products, and 260 users with more than 50 reviews.

Our aim is to study these reviews and predict whether a review is positive or negative.
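To make the positive/negative target concrete, here is a minimal sketch of how a 1 to 5 star rating could be collapsed into a binary label. The function name rating_to_label and the threshold of 4 stars are assumptions made for illustration; this course does not state how the provided datasets were labeled.

```python
def rating_to_label(score: int, positive_threshold: int = 4) -> str:
    """Map a 1-5 star rating to a coarse sentiment label.

    The threshold is an assumption for illustration, not the rule used
    to prepare the course datasets.
    """
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    return "positive" if score >= positive_threshold else "negative"


sample_scores = [5, 1, 4, 2, 3]
labels = [rating_to_label(score) for score in sample_scores]
print(labels)
```

With this threshold, a 3-star review counts as negative; a different project might drop neutral reviews instead.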

The data was originally hosted by SNAP (Stanford Large Network Dataset Collection), a collection of more than 50 large network datasets. It includes social networks, web graphs, road networks, internet networks, citation networks, collaboration networks, and communication networks [2].

Dataset Overview

In Aquarium, the Driverless AI Test Drive instance already contains the Amazon fine food review dataset, split into a train and a test dataset for this self-paced course's experiment. Both datasets are listed on the Datasets overview page. However, you can also upload the datasets externally. To learn more about how to add the dataset, please consult Appendix A: Add the Dataset.

1. In the Datasets overview page, observe the two datasets we will use for this self-paced course:

datasets-overview

2. Click the following dataset, then select the Details option: AmazonFineFoodReviews_train...:

dataset-details

3. Let's take a quick look at the columns of the training dataset:

4. Return to the Datasets page.

Launch Experiment

The experiment has already been pre-built, given that it takes more than two hours to complete. Later, you will be guided on how to access the pre-built experiment right before we start analyzing the effectiveness of the built NLP model. For now, consider the instructions below as if you were building the experiment from scratch:

1. In the Datasets page, click the following dataset, then select the Predict option: AmazonFineFoodReviews-train-26k.csv:

launch-experiment

2. As soon as you select the Predict option, you are asked if you want to take a tour of the Driverless AI environment. Skip it for now by clicking Not Now. The following will appear:

initial-experiment-overview

3. Next, you will need to provide the following information to Driverless AI:

name-experiment

At this point, your experiment preview page should look similar to the following:

final-experiment-screen

In Task 2, we will continue editing our experiment settings.

Acknowledgement

References

Deeper Dive and Resources

This task deals with the settings that will enable us to run an effective NLP experiment. Let's go through these settings and adjust them as follows:

experiment-settings

  1. Additionally, there are three more buttons located beneath the experimental settings knob which stand for the following:
    • Classification or Regression: Driverless AI automatically determines the problem type based on the response column. Though not recommended, you can override this setting by clicking this button. Our current problem is that of Classification.
      • Make sure this setting is set to Classification
    • Reproducible: This button allows you to build an experiment with a random seed and get reproducible results. If this is disabled (default), the results will vary between runs.
      • Don't enable this setting
    • GPUS Enable: Specify whether to enable GPUs. (Note that this option is ignored on CPU-only systems).
      • Make sure this setting is enabled

We selected the above settings to generate a model with sufficient accuracy in the H2O Driverless AI Test Drive environment. At this point, your experiment preview page should look similar to the following:

final-experiment-launch

The amount of time this experiment takes to complete depends on the memory, the availability of a GPU in the system, and the expert settings a user might select. If the system does not have a GPU, the experiment might run for a longer time. You can launch the experiment and wait for it to finish, or you can access a pre-built version in the Experiments section. After discussing a few NLP concepts in the upcoming two tasks, we will discuss how to access this pre-built experiment right before analyzing its performance.

Resources

Deeper Dive

Natural Language Processing (NLP)

NLP is the field of study that focuses on the interactions between human language and computers. NLP sits at the intersection of computer science, artificial intelligence, and computational linguistics[1]. NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as:

Text data is highly unstructured, but machine learning algorithms usually work with numeric input features. So before we start any NLP project, we need to pre-process and normalize the text so that it can be fed into commonly available machine learning algorithms. This essentially means we need to build a pipeline that breaks the problem down into several pieces. We can then apply various methodologies to these pieces and plug the solutions together in a pipeline.

Building a Typical NLP Pipeline

nlp-pipeline

The figure above shows how a typical pipeline looks. It is also important to note that there may be variations depending upon the problem at hand. Hence the pipeline will have to be adjusted to suit our needs. Driverless AI automates the above process. Let's try and understand some of the components of the pipeline in brief:

Text preprocessing

Text pre-processing involves using various techniques to convert raw text into well-defined sequences of linguistic components with standard structure and notation. Some of those techniques are:

It is important to note here that the above steps are not mandatory, and their usage depends upon the use case. For instance, in sentiment analysis, emoticons signify polarity, and stripping them off from the text may not be a good idea. The general goal of Normalization, Stemming, and Lemmatization techniques is to improve the model's generalization. Essentially we are mapping different variants of what we consider to be the same or very similar "word" to one token in our data.
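To make these ideas concrete, here is a minimal sketch of a preprocessing step written with only the Python standard library. The stop-word list, the suffix-stripping rule, and every function name are assumptions made for illustration; a real project would more likely rely on an established NLP library, and this is not how Driverless AI preprocesses text. Note that the tokenizer deliberately keeps simple emoticons, since they can signal polarity.

```python
import re
from typing import List

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "of", "to", "it"}

# Naive suffix stripping that stands in for a real stemmer (illustration only).
SUFFIXES = ("ing", "ed", "s")


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def tokenize(text: str) -> List[str]:
    """Split into word tokens, keeping simple emoticons such as :) and :(."""
    return re.findall(r"[:;]-?[()]|[a-z0-9']+", text)


def naive_stem(token: str) -> str:
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> List[str]:
    """Normalize, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = tokenize(normalize(text))
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("The cookies were AMAZING and arrived quickly :)"))
# ['cookie', 'were', 'amaz', 'arriv', 'quickly', ':)']
```

The crude stems ("amaz", "arriv") show the trade-off: variants of a word collapse to one token, at the cost of readability.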

Feature Extraction

Machine learning algorithms usually expect features in the form of numeric vectors. Hence, after the initial preprocessing phase, we need to transform the text into a meaningful vector (or array) of numbers. This process is called feature extraction. Let's see how some of the feature-extracting techniques work.

The intuition behind the Bag of Words is that documents are similar if they have identical content, and we can get an idea about the meaning of the document from its content alone.

Example implementation

The following models a text document using bag-of-words. Here are two simple text documents:

Based on these two text documents, a list is constructed as follows for each document:

Representing each bag-of-words as a JSON object and assigning it to the respective JavaScript variable:

It is important to note that BoW does not retain word order and is sensitive towards document length, i.e., token frequency counts could be higher for longer documents.
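Both properties can be seen in a minimal Python sketch of Bag of Words. The sample documents and the helper names bag_of_words and to_vector are made up for illustration and are not taken from the course datasets.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word token occurs in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def to_vector(bow, vocabulary):
    """Turn a bag-of-words into a count vector over a fixed vocabulary."""
    return [bow[word] for word in vocabulary]


doc_a = "great coffee great price"
doc_b = "great price great coffee"   # same words, different order
doc_long = doc_a + " " + doc_a       # same content, twice as long

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)
bow_long = bag_of_words(doc_long)

vocabulary = sorted(bow_a)  # ['coffee', 'great', 'price']

print(bow_a == bow_b)                   # True: word order is not retained
print(to_vector(bow_a, vocabulary))     # [1, 2, 1]
print(to_vector(bow_long, vocabulary))  # [2, 4, 2]: counts grow with length
```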

It is also possible to create BoW models with consecutive words, also known as n-grams:
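As a small sketch of the idea, the hypothetical ngrams helper below joins consecutive tokens, so that a phrase such as "not good" survives as a single feature. The sample sentence is made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "this coffee is not good".split()

bigrams = ngrams(tokens, 2)
print(bigrams)
# ['this coffee', 'coffee is', 'is not', 'not good']

# A bag of bigrams is built exactly like a bag of single words.
bigram_counts = Counter(bigrams)
print(bigram_counts["not good"])  # 1
```

A unigram model sees only "not" and "good" separately, which is why n-grams can help with negation in sentiment tasks.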

The dimensions of the output vectors are high. This also gives importance to the rare terms that occur in the corpus, which might help our classification tasks:
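A weighting scheme that favors rare terms in this way is TF-IDF, which this course later lists among the techniques used by Driverless AI. The sketch below assumes that this is the technique meant here and implements one common unsmoothed variant (term frequency multiplied by the logarithm of the inverse document frequency). Libraries often use smoothed variants, so exact numbers will differ. The sample documents and the function name tf_idf are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["good coffee", "good tea", "good chocolate bar"]
weights = tf_idf(docs)

print(weights[0]["good"])    # 0.0: "good" occurs in every document
print(weights[0]["coffee"])  # positive: "coffee" occurs in one document only
```

Each document yields one weight per vocabulary term, so the vector dimension grows with the size of the vocabulary.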

Building Text Classification Models

Once the features have been extracted, they can then be used for training a classifier.
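As an illustration of this step, the sketch below trains a small multinomial Naive Bayes classifier on word counts. Naive Bayes is a stand-in chosen because it fits in a few lines; it is not the algorithm Driverless AI uses, and the class name, the training sentences, and the smoothing value alpha are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over word counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing strength
        self.class_counts = Counter()           # documents per label
        self.word_counts = defaultdict(Counter) # word counts per label
        self.vocabulary = set()

    def fit(self, documents, labels):
        for doc, label in zip(documents, labels):
            tokens = doc.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, document):
        tokens = document.lower().split()
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [
    "great taste love it",
    "delicious and fresh",
    "terrible stale taste",
    "awful waste of money",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(model.predict("love this delicious snack"))  # positive
print(model.predict("stale and awful"))            # negative
```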

With this task in mind, let's learn about Driverless AI NLP Recipes.

References

Deeper Dive and Resources

Note: This section will discuss all current NLP model capabilities of Driverless AI. Keep in mind that not all settings discussed below have been enabled in the current sentiment analysis experiment.

Text data can contain critical information to inform better predictions. Driverless AI automatically converts text strings into features using powerful techniques like TFIDF, CNN, and GRU. Driverless AI now also includes state-of-the-art PyTorch BERT transformers. With advanced NLP techniques, Driverless AI can also process larger text blocks, build models using all available data, and solve business problems like sentiment analysis, document classification, and content tagging.

The Driverless AI platform can support both standalone text and text with other columns as predictive features. In particular, the following NLP recipes are available for a given text column:

driverless-nlp-recipe

Key Capabilities of Driverless AI NLP Recipes

Deeper Dive and Resources

Industry Use Cases leveraging NLP

If you decided to run the experiment constructed in tasks one and two, it is most likely still running. Whether your experiment is still running or you didn't launch it at all, you can access the pre-built version in the Experiments section:

1. In the Experiments section select the experiment with the following name: Sentiment Analysis Tutorial:

pre-ran-experiment

2. Let's review the experiment summary page and assess the quality and efficiency of the model we built:

experiment-results-ui

If you would like to explore how custom recipes can improve predictions, in other words, how custom recipes could decrease the value of LOGLOSS in the experiment we are currently observing, please refer to Appendix B.
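For reference, here is a minimal sketch of how log loss is computed for a binary problem such as positive versus negative reviews. It uses the standard definition, the average negative log-likelihood of the true labels; the function name log_loss and the clipping constant eps are assumptions for this example, and the sample probabilities are made up rather than taken from the experiment.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood for binary labels (0 or 1).

    y_prob holds the predicted probability of class 1 for each row.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


y_true = [1, 0]
confident = log_loss(y_true, [0.9, 0.1])  # confident and correct
unsure = log_loss(y_true, [0.5, 0.5])     # no better than a coin flip

print(round(confident, 4))  # 0.1054
print(round(unsure, 4))     # 0.6931
```

Lower is better: a model is rewarded for assigning high probability to the correct class and penalized heavily for confident mistakes.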

Deeper Dive and Resources

It's time to test your skills!

The challenge is to analyze and perform Sentiment Analysis on the tweets using the US Airline Sentiment dataset. This dataset will help to gauge people's sentiments about each of the major U.S. airlines.

This data comes from Crowdflower's Data for Everyone library and consists of tweets in which travelers expressed their feelings about every major U.S. airline in February 2015. The tweets have been classified as positive, negative, and neutral.

Steps:

1. Import the dataset from here:

Here are some samples from the dataset:

challenge-dataset

2. Split the dataset into a training set and a testing set in an 80:20 ratio.

3. Run an experiment where the target column is airline_sentiment using only the default Transformers. You can exclude all other columns from the dataset except the 'text' column.

4. Run another instance of the same experiment, but this time include the TensorFlow models and the built-in transformers.

5. Next, repeat the experiment with a custom recipe from here.

6. Using Logloss as the scorer, observe the following outcomes:
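For step 2 above, an 80:20 split can be sketched in plain Python as follows. The helper name train_test_split, the seed value, and the placeholder rows are assumptions made for illustration; you can just as well split the dataset with any other tool you prefer.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows with a fixed seed and cut them into two parts."""
    shuffled = list(rows)                  # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = [f"tweet_{i}" for i in range(10)]   # placeholder rows
train, test = train_test_split(rows)

print(len(train), len(test))  # 8 2
```

Shuffling before cutting matters when the file is ordered, for example by airline or by date.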

Deeper Dive and Resources

Add the Datasets

Consider the following steps to import the training and test Amazon Fine Food Reviews datasets:

1. Select + Add Dataset (or Drag and Drop) then click on the following option: File System:

appendix-add-datasets

2. Enter the following into the search bar: data/Kaggle/AmazonFineFoodReviews/.

3. Select the following two datasets:

appendix-datasets-preview

4. Right after, click the following button: Click to Import Selection.

5. If the files loaded successfully, the two datasets will be displayed in the Datasets page:

appendix-upload-dataset

The latest versions of Driverless AI implement a key feature called BYOR[1], which stands for Bring Your Own Recipes and was introduced in Driverless AI 1.7.0. This feature has been designed to enable data scientists or domain experts to influence and customize the machine learning optimization used by Driverless AI as per their business needs. This additional feature engineering technique is aimed at improving the accuracy of the model.

Recipes are customizations and extensions to the Driverless AI platform. They are nothing but Python code snippets uploaded into Driverless AI at runtime, like plugins. Recipes can be either one or a combination of the following:

recipes-workflow

Uploading a Custom Recipe

H2O has built and open-sourced several recipes[2], which can be used as templates. For this experiment, we could use the following recipe: text_sentiment_transformer.py which extracts sentiment from text using pre-trained models from TextBlob[3].

Please note that in this appendix, we will show you how to add the Sentiment transformer. However, we don't recommend that you run this on Aquarium, as Aquarium provides a small environment; the experiment might not finish on time or might not give you the expected results. If you are trying to see how recipes can help improve an NLP experiment, we recommend that you obtain a bigger machine with more resources to see improvements.

1. In the Experiments section, click on the three dots next to the experiment: Sentiment Analysis. In it, select the following option: New Experiment with Same Settings. The following will appear:

clickon-expert-settings

2. A new window with Expert Experiment Settings will appear. Here you can either upload a custom recipe or load a custom recipe from a URL:

expert-experiment-settings-overview-1

3. The first way to upload a custom recipe is by clicking on the + UPLOAD CUSTOM RECIPE button (a): this option allows you to upload custom recipes located on your computer. We will not use this option.

4. The second way to upload a custom recipe is by clicking on the + LOAD CUSTOM RECIPE FROM URL button (b): this option allows you to load a recipe located on GitHub. We will use this option. Click this (b) option and paste the following custom recipe URL:

https://raw.githubusercontent.com/h2oai/driverlessai-recipes/rel-1.9.1/transformers/nlp/text_sentiment_transformer.py

5. While the recipe is uploading, the following will appear (Driverless AI automatically performs basic acceptance tests for all custom recipes; this can be enabled or disabled):

acceptance-tests

6. Driverless AI offers several available recipes that can be accessed when clicking on the OFFICIAL RECIPES (OPEN SOURCE) button (c):

official-recipes

7. Whenever you use a recipe, you have access to the following recipe settings located in the Recipes tab (e.g., transformers, models, scorers):

selecting-specific-transformers

8. Click Save. The selected transformer should now appear on the main Experiment screen as follows:

9. Now, you are ready to launch the Experiment with the Custom Recipe.

References