Data Description

The dataset 'Trending YouTube Video Statistics' was put together by Mitchell Jolly using the YouTube API. It has records from 2008 and was last updated on 2019-06-02. The scripts that scraped the data from the YouTube API can be found here, and the primary aim of the dataset is to help determine the year's top trending YouTube videos.

The data is split into 10 country-specific datasets: USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan and India. We choose the Canada dataset to explore. Each row is a trending video, with features such as category, trending date, tags, number of views, likes, dislikes, comment count and the video description.

Before running the code below, make sure the data has been downloaded and processed, since the processed CSV file is what we read in here.

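# packages used below: tidyverse (dplyr, ggplot2, forcats, tibble), knitr (kable), corrplot, DT (datatable)
library(tidyverse); library(knitr); library(corrplot); library(DT)
# read Canada dataset csv file
CAN <- read.csv('../data/youtube_processed.csv')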

Below is the number of rows and columns for the dataset.

nrow(CAN) 
## [1] 40881
ncol(CAN)
## [1] 18

The following are the data types of the 18 columns in the dataset.

features <- CAN %>% colnames() %>% tibble()
types <- CAN %>% sapply(class) %>% tibble()
feature_type <- cbind(features,types)
colnames(feature_type)<-c("Features","Type")
kable(feature_type) %>% kableExtra::kable_styling(full_width = F)

Features                 Type
X.1                      integer
X                        integer
video_id                 factor
trending_date            factor
title                    factor
channel_title            factor
category_id              integer
publish_time             logical
tags                     factor
views                    integer
likes                    integer
dislikes                 integer
comment_count            integer
thumbnail_link           factor
comments_disabled        factor
ratings_disabled         factor
video_error_or_removed   factor
description              factor

For text columns such as title, tags and description, the character class makes more sense than factor. Converting them can be part of the data grooming process.
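
As an illustration of that conversion, here is a minimal base R sketch. The data frame `toy` and its values are hypothetical stand-ins for `CAN`; only the column names come from the table above.

```r
# Toy data frame standing in for CAN (hypothetical values, not from the dataset)
toy <- data.frame(
  title = factor(c("Video A", "Video B")),
  tags = factor(c("music|live", "news")),
  description = factor(c("First clip", "Second clip")),
  views = c(100L, 250L)
)

# Convert only the free-text columns from factor to character
text_cols <- c("title", "tags", "description")
toy[text_cols] <- lapply(toy[text_cols], as.character)

# Check the resulting classes; views is left untouched
sapply(toy, class)
```

Another option is to pass stringsAsFactors = FALSE to read.csv(), which keeps every string column as character at load time. Columns that really are categorical would then need to be converted back to factor.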

EDA

We plot the trend between likes and views:

ggplot(CAN, aes(views, likes)) +
  geom_point(alpha = 0.2, position = "jitter", color = "blue") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::comma_format()) +
  labs(x = "Views", y = "Likes") +
  ggtitle("Trends between YouTube video views and likes") +
  theme_bw()

We see that, in general, the number of likes increases with the number of views. The points are concentrated in the bottom-left corner: most videos have fewer than 50 million views and fewer than 1 million likes.
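
Because the points crowd into one corner, log-scaled axes may spread them out. The sketch below is an optional variation, not part of the original analysis, and the values in `toy` are hypothetical. To apply it to the real data, replace `toy` with `CAN`.

```r
library(ggplot2)

# Hypothetical toy values spanning several orders of magnitude
toy <- data.frame(
  views = c(1200, 54000, 310000, 2500000, 48000000),
  likes = c(15, 900, 7200, 81000, 1300000)
)

# Same scatter plot as above, but with both axes on a log10 scale
p <- ggplot(toy, aes(views, likes)) +
  geom_point(alpha = 0.2, color = "blue") +
  scale_x_log10(labels = scales::comma_format()) +
  scale_y_log10(labels = scales::comma_format()) +
  labs(x = "Views (log scale)", y = "Likes (log scale)") +
  theme_bw()
p
```

Note that log10(0) is -Inf, so rows with zero views or zero likes would need to be filtered out or offset before using this scale.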

We explore how many videos are in each category.

category_vids <- CAN %>% group_by(category_id) %>% 
  tally() %>% 
  arrange(desc(n))
kable(category_vids) %>%  kableExtra::kable_styling(full_width = F)

category_id       n
24            13451
25             4159
22             4105
23             3773
10             3731
17             2787
1              2060
26             2007
20             1344
28             1155
27              991
19              392
15              369
2               353
43              124
29               74
30                6
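
Next, we plot view counts over time for the largest category (category_id 24, entertainment).

# five largest categories by video count
list_of_category <- c(24, 25, 22, 23, 10)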
CAN %>% filter(category_id==24) %>%
  mutate(year_month = format(as.Date(trending_date), "%Y-%m") ) %>%
  group_by(channel_title, trending_date) %>% 
  # summarize(mean_likes = mean(likes),
  #           mean_dislikes= mean(dislikes),
  #           mean_comment_count = mean(comment_count),
  #           mean_views = mean(views)) %>%
  arrange(trending_date) %>%
  ggplot(aes(x=as.Date(trending_date),y=views)) + geom_line() +
  labs(x = "Date", y = "Number of views") +
  ggtitle("View counts over time in entertainment category") +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_x_date(date_breaks = "months")

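The channels are listed below from most to least trending, measured by the number of trending records per channel, together with each channel's mean views, likes, comments and dislikes:

CAN %>% group_by(channel_title) %>% 
  summarise(count = n(),                      # number of trending records
            mean_views = mean(views),         # per-channel averages
            mean_likes = mean(likes),
            mean_comments = mean(comment_count),
            mean_dislikes = mean(dislikes)) %>% 
  arrange(desc(count)) %>% datatable()        # most trending channels first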

Here is a plot of number of videos by category.

category_vids %>% 
  ggplot(aes(y = n,
             # order the categories by their video count
             x = fct_reorder(as.factor(category_id), n))) +
  geom_col() + 
  coord_flip() + 
  ylab("count") + 
  xlab("category") + 
  theme_bw() +
  theme(legend.position = "none") +
  ggtitle("Number of videos by Category")

The category corresponding to its ID can be found here.

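Top 5 categories by number of videos (category_id): 24, 25, 22, 23 and 10.

Bottom 5 categories by number of videos (category_id): 30, 29, 43, 2 and 15.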
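
One way to pull out the top and bottom categories is sketched below in base R. The counts are copied from the category table printed earlier. The names `ordered`, `top5` and `bottom5` are introduced here for illustration only.

```r
# Counts copied from the category table printed earlier
category_vids <- data.frame(
  category_id = c(24, 25, 22, 23, 10, 17, 1, 26, 20, 28, 27, 19, 15, 2, 43, 29, 30),
  n = c(13451, 4159, 4105, 3773, 3731, 2787, 2060, 2007, 1344,
        1155, 991, 392, 369, 353, 124, 74, 6)
)

# Sort by count, largest first, then take the first and last five rows
ordered <- category_vids[order(-category_vids$n), ]
top5 <- head(ordered, 5)
bottom5 <- tail(ordered, 5)

top5
bottom5
```

The category IDs could then be matched to names using the mapping referenced above.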

Correlation plots

Next, we explore correlation between numerical columns.

CAN %>% select(views, likes, dislikes,comment_count) %>% 
  cor() %>% 
  round(2) %>% 
  corrplot(
    type="lower", 
    method="color", 
    tl.srt=45,
    addCoef.col = "white",
    diag = FALSE)

We note that the highest correlation is between the number of likes and the number of comments, and the lowest is between the number of views and the number of dislikes.

Research questions

What is the relationship between a video's category, the number of views it receives, its likes/dislikes and its comment count?

Plan of action

Given our research question, we are mainly interested in comment counts, likes, dislikes and views, and in how these change over time. Hence we will perform the subsequent analysis on a reduced dataset containing these variables, after dealing with any missing values. To investigate the question, we will plot time series to visualise trends in the variables over time, and perform both simple and multiple linear regression to estimate the relationships between variables.
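
A minimal sketch of the regression step follows, using simulated stand-in data rather than the real dataset. The choice of response and predictors, the use of na.omit() for missing values, and the names `toy`, `reduced`, `simple_fit` and `multiple_fit` are all assumptions made for illustration. The actual analysis may model the variables differently.

```r
set.seed(1)

# Simulated stand-in for the dataset (hypothetical values)
toy <- data.frame(views = round(rlnorm(200, meanlog = 11, sdlog = 1)))
toy$likes <- round(0.03 * toy$views * exp(rnorm(200, sd = 0.3)))
toy$comment_count <- round(0.1 * toy$likes * exp(rnorm(200, sd = 0.3)))
toy$dislikes <- round(0.05 * toy$likes * exp(rnorm(200, sd = 0.3)))
toy$likes[3] <- NA  # introduce one missing value

# Keep only the columns of interest and drop rows with missing values
reduced <- na.omit(toy[, c("views", "likes", "dislikes", "comment_count")])

# Simple linear regression: one predictor
simple_fit <- lm(likes ~ views, data = reduced)

# Multiple linear regression: several predictors
multiple_fit <- lm(views ~ likes + dislikes + comment_count, data = reduced)

summary(simple_fit)
summary(multiple_fit)
```

With the real data, `toy` would be replaced by `CAN`. Given how skewed the counts are, it may also be worth checking the residuals or trying log-transformed variables.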