CS 4641 Final Project: Food Review Analysis
By Matthew Rho, Prashant Sathambakkam, Sai Gogineni
Introduction
Online reviews are a modern-day necessity when shopping for products, because they are an effective indicator of product quality for both manufacturers and consumers. For context, Amazon has more than 2.5 million active sellers and 150 million Amazon Prime members. In short, Amazon is an e-commerce giant: 89% of online users trust Amazon's integrity more than that of other sites, and it has the most developed infrastructure for online reviews.
Good review – 5 stars:
Poor review – 2 stars:
Machine Learning Method
The goal of this project is to read a review and determine, using machine learning algorithms, what rating the customer would give; we chose the Amazon Fine Food Reviews dataset to train on. We use two classifiers, a Support Vector Machine (SVM) and Naive Bayes, and compare which is more accurate at this task. Naive Bayes performs well when the features can be treated as independent (each assumed to contribute equally to the result) and requires little computational power; SVM can achieve higher accuracy but is more computationally expensive on large datasets. Our project is based on supervised learning, as our training data is labeled with features; we describe these features in more depth in the following section. The pipeline is to preprocess the training data, create word feature vectors, and train the classifiers on those feature vectors.
If, by the end of this project, we can achieve satisfactory accuracy in predicting a rating from a review, the result would be a useful tool for real-world applications. An algorithm that can read any review and determine the equivalent rating would be versatile for both sellers and consumers, and would also provide additional feedback for PR teams.
Training Data
Our training data comes from an Amazon Fine Food Reviews dataset compiled on Kaggle. We selected it because it contains a substantial number of reviews: 568,454 in total, each with 9 potential features. Because analyzing every review would have made the runtime impractical, we reduced the dataset to a subset of 10,000 reviews. On inspection, we found four features relevant to our analysis and subsequent training: the product ID, the helpfulness rating, the actual product rating, and the review text itself. We discarded the remaining irrelevant features (such as the time the review was written, the user profile name, etc.). Using the product ID, we were able to isolate 7 different types of products in the Amazon Fine Foods market within our training data.
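A rough sketch of this loading and subsetting step is shown below; the file name and column names are assumptions based on how the Kaggle "Amazon Fine Food Reviews" dataset is published, not necessarily our exact script.

```python
# Hypothetical loading/subsetting step; Reviews.csv and the column names are
# assumptions based on the Kaggle "Amazon Fine Food Reviews" dataset.
import pandas as pd

reviews = pd.read_csv("Reviews.csv")  # all 568,454 reviews

# Keep only the features relevant to our analysis: product ID, the two
# helpfulness counts, the rating (Score), and the review text.
reviews = reviews[["ProductId", "HelpfulnessNumerator",
                   "HelpfulnessDenominator", "Score", "Text"]]

# Reduce to a 10,000-review subset so training stays tractable.
subset = reviews.sample(n=10_000, random_state=0)
```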
Word cloud of the most common words among all reviews:
Bar graphs of the number of reviews for each of 7 types of products:
Below are the distributions of the 1-5 ratings for each product. The X-axis represents the product rating, and the Y-axis represents the number of reviews with that rating.
Correlation between helpfulness numerator/denominator and rating:
Our dataset reports a review's helpfulness as two numbers: the helpfulness numerator is the number of users who found the review helpful, and the helpfulness denominator is the total number of users who voted on it. We can see that this distribution appears to be balanced. The X-axis is the ratio of the helpfulness numerator to the helpfulness denominator, and the Y-axis is the rating of the review.
Preprocessing
The words we keep from the full review text determine how effective our algorithm will be, as they provide the actual insight into the product rating. In this step we clean the text of each review. Reviews contain irrelevant text, such as punctuation and articles (words that mark a noun as specific or unspecific); these must be omitted because they are not indicative of the sentiment the review communicates. We used Python to remove such words with the standard list of English stop words. We also removed repeating words, corrected misspelled words, converted all text to lowercase, and removed URLs, which were prevalent in a large number of reviews.
Preprocessing the review text:
Shown below is how the actual review text has been preprocessed into a list of relevant words corresponding to the review (buy, several, vitality, can, dog, ...).
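A minimal sketch of this cleaning step, under the assumptions above, could look like the following: lowercase the text, strip URLs and punctuation, remove English stop words, and drop repeated words (spelling correction is omitted here for brevity).

```python
import re
import string
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def preprocess(review_text):
    # Lowercase and strip URLs, which appeared in many reviews.
    text = review_text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop stop words and keep each remaining word only once.
    words = []
    for word in text.split():
        if word not in STOP_WORDS and word not in words:
            words.append(word)
    return words
```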
Word Feature Vectors
In this step we create a list of all words that appear in the preprocessed review text. We then turn this list into distinct word features, where each key is a word and its value is that word's frequency in the training set. This is essentially our dictionary of the training set; we can then match the vocabulary in each preprocessed review against it to create word feature vectors, using the apply_features() function in Python's nltk library. This is where the feature extraction occurs.
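A sketch of this step is below, assuming labeled_reviews is a list of (word list, rating) pairs produced by the preprocessing step; the exact feature encoding here is illustrative.

```python
import nltk

# Word frequencies over the whole training set form our "dictionary".
all_words = nltk.FreqDist(word for words, rating in labeled_reviews
                          for word in words)
word_features = list(all_words)

def review_features(words):
    word_set = set(words)
    # One feature per dictionary word: does this review contain it?
    return {word: (word in word_set) for word in word_features}

# apply_features lazily builds (feature dict, rating) pairs for training.
feature_sets = nltk.classify.apply_features(review_features, labeled_reviews)
```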
Train/Test Ratio
A crucial step in providing data to these machine learning algorithms was experimenting with the ratio of training data to test data. This matters because of overfitting, a modeling error in which an algorithm is fit so closely to limited training data that it becomes inaccurate on test data. To discover which ratio would be the most efficient and accurate, we tested 9 ratios, from 10% training/90% test up to 90% training/10% test in 10-point increments.
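A small sketch of how these splits could be generated from the feature sets is shown below; the simple front/back split point is an illustrative assumption.

```python
# Train fractions 0.1, 0.2, ..., 0.9 correspond to the nine ratios above.
ratios = [i / 10 for i in range(1, 10)]

splits = []
for ratio in ratios:
    cut = int(len(feature_sets) * ratio)
    train_set, test_set = feature_sets[:cut], feature_sets[cut:]
    splits.append((ratio, train_set, test_set))
```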
Train/Test Ratio Predictions for Product 1 (Naive Bayes)
Naive Bayes and SVM from Python
Finally, we use the chosen ratio of training data to train our two selected machine learning algorithms: the Naive Bayes classifier and the SVM classifier. For Naive Bayes, we used the implementation in Python's nltk library. For SVM, we built our own SVM using Python's scikit-learn library. We trained these classifiers on the word feature vectors calculated in the previous step. Once each classifier is trained, we use its classify function to predict the product rating of the reviews in the test data. The accuracy of these predictions is our final result, which we compare across all train/test ratios. The results appear in the following section.
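A minimal sketch of training both classifiers on one split is below. Wrapping scikit-learn's LinearSVC in nltk's SklearnClassifier is one way to feed it the same word-feature dictionaries; the specific SVM settings are assumptions rather than the exact configuration used in the project.

```python
import nltk
from nltk.classify import SklearnClassifier
from sklearn.svm import LinearSVC

# Naive Bayes from nltk.
nb_classifier = nltk.NaiveBayesClassifier.train(train_set)

# SVM from scikit-learn, wrapped so it accepts nltk-style feature dicts.
svm_classifier = SklearnClassifier(LinearSVC()).train(train_set)

# Predict the product rating of a single held-out review.
features, actual_rating = test_set[0]
predicted_rating = svm_classifier.classify(features)
```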
Results
We calculated the results for the Naive Bayes and SVM algorithms by checking whether the rating predicted by the algorithm matched the actual rating in the test data. The accuracy is simply the number of correctly predicted reviews divided by the total number of reviews in the test set.
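In code, this is the same calculation performed by nltk.classify.accuracy; an explicit sketch makes it clear:

```python
def accuracy(classifier, test_set):
    # Fraction of test reviews whose predicted rating matches the true rating.
    correct = sum(1 for features, rating in test_set
                  if classifier.classify(features) == rating)
    return correct / len(test_set)

nb_accuracy = accuracy(nb_classifier, test_set)
svm_accuracy = accuracy(svm_classifier, test_set)
```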
After running all 18 tests (9 ratios for each of the 2 classifiers), we made an interesting discovery: SVM was consistently more accurate than Naive Bayes at predicting the product rating. Naive Bayes performed in the low-60% accuracy range, while SVM performed in the mid-60% to low-70% range. As expected, both algorithms' accuracy was lower when the training data was limited. However, we could also see overfitting at work: the 80% train/20% test ratio was more accurate than the 90% train/10% test ratio. We conclude that the 80% train/20% test ratio with the Support Vector Machine classifier produces the most accurate model for predicting the product rating of a written review.
Train/Test Ratio Comparison Chart:
SVM vs. Naive Bayes Accuracy Chart:
Overall, the highest accuracy we achieved was 71.37%. For this project we consider this satisfactory, since it is significantly better than a random predictor, and a human predictor would likely perform at around that level. That said, there is always room for improvement and optimization; for example, combining the SVM and Naive Bayes approaches into an even more accurate algorithm is a possibility for future work.
Conclusion
Currently, our model predicts the rating of a review on a scale from 1 to 5. If this scale is altered (e.g., ranging from 1-10, or reversed so that 1 is the best and 5 is the worst), our model cannot accurately predict the rating. In future iterations, we would like to add a variable to the model that lets the user decide what scale to use, so the model can accurately predict the rating of any review on any scale.
Another aspect we would like to improve is identifying “troll” or “spam” reviews. Many people post fake reviews, which skews the model used to predict ratings. We would like the model to identify which reviews should be discarded so that they do not affect the predictions.
Lastly, as mentioned above, we would like to combine the SVM and Naive Bayes models into a “hybrid” model that works better than either model alone. Naive Bayes assumes that our features are independent of each other, which is not completely true, but it has fast processing times. SVM takes longer to process large datasets but can reduce the impact that outliers (troll or spam accounts) have on the model. A hybrid that takes the best qualities of each algorithm could provide even better accuracy than we achieved with either algorithm separately.