Abstract

This project uses machine learning to classify videos by their popularity based on attributed data of the video. We have collected data using kaggle and the YouTube API to generate information about basic features of a video including its Likes, Dislikes, category, Channel Subscriber Count, View Count and Ratio of likes to dislikes. Using the data obtained we have run decision tree models on the dataset just containing Rounded Likes, Rounded View count, subscriber count and category ID. Using this information we have obtained information about the "popularity" of the video with respect to being able to get an understanding of how the likes to dislikes ratio varies for the videos. The dataset being used is comprised of 8650 examples.

The following decision trees were obtained on running the model made using python libraries of scikit-learn. Note: before preparing this model, the dataset was explored on Weka to find the distribution of the attributes and it was seen that a ZeroR model obtained an accuracy of ~55%.

Information Gain Entropy Decision Tree

Information Gain Entropy Decision Tree


Gini Index Decision Tree

Gini Index Decision Tree

For the above trees, the dataset used was split into test and training sets. A 70-30 set was taken into consideration, so here 70% of the dataset was used for training and 30% of the data was used for testing. The gini index criterion used obtained an accuracy of ~84% and information gain criterion on the decision tree gained an accuracy of ~80%.

Final Project Report

Our final project report can be viewed here

Project Authors

George Malty and Shankar Dutt Salwan