The dataset "Trending YouTube Video Statistics" was put together by Mitchell Jolly using the YouTube API. It has records from 2008 and was last updated on 2019-06-02. The scripts that scraped the data from the YouTube API can be found here, and the primary aim of the dataset is to help determine the year's top trending YouTube videos.
The collection includes 10 datasets, one for each of the following countries: USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan and India. We choose the Canada dataset to explore. Each row of the dataset is a trending video, with features such as category, trending date, tags, number of views, likes, dislikes, comment count and video description.
Make sure you have loaded and processed the data first.
# read Canada dataset csv file
CAN <- read.csv('../data/youtube_processed.csv')
Below is the number of rows and columns for the dataset.
nrow(CAN)
## [1] 40881
ncol(CAN)
## [1] 18
The following are the data types of the eighteen columns in the dataset.
features <- CAN %>% colnames() %>% tibble()
types <- CAN %>% sapply(class) %>% tibble()
feature_type <- cbind(features,types)
colnames(feature_type)<-c("Features","Type")
kable(feature_type) %>% kableExtra::kable_styling(full_width = F)
Features | Type |
---|---|
X.1 | integer |
X | integer |
video_id | factor |
trending_date | factor |
title | factor |
channel_title | factor |
category_id | integer |
publish_time | logical |
tags | factor |
views | integer |
likes | integer |
dislikes | integer |
comment_count | integer |
thumbnail_link | factor |
comments_disabled | factor |
ratings_disabled | factor |
video_error_or_removed | factor |
description | factor |
For some columns, such as title, tags and description, it makes more sense for the data class to be character instead of factor. This conversion can be handled as part of the data grooming process.
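As a minimal sketch of that conversion, the chunk below uses a small toy data frame (hypothetical values, standing in for `CAN`) and base R only. The choice of which columns to treat as free text is our assumption.

```r
# Toy data frame standing in for CAN (values are made up for illustration)
toy <- data.frame(
  title       = factor(c("Video A", "Video B")),
  tags        = factor(c("music|pop", "news")),
  description = factor(c("First clip", "Second clip")),
  views       = c(100L, 250L)
)

# Columns we assume should hold free text rather than categories
text_cols <- c("title", "tags", "description")

# Convert each selected factor column to character, leaving the rest untouched
toy[text_cols] <- lapply(toy[text_cols], as.character)

# Inspect the resulting classes
sapply(toy, class)
```

Alternatively, passing `stringsAsFactors = FALSE` to `read.csv()` keeps every text column as character at load time, at the cost of converting genuinely categorical columns back to factor by hand.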
We plot the trend between likes and views:
ggplot(CAN, aes(views, likes)) +
geom_point(alpha =0.2,position="jitter", color = "blue") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::comma_format()) +
labs(x = "Views", y = "Likes") +
ggtitle("Trends between Youtube video views and likes")+
theme_bw()
We see that, in general, the number of likes increases with the number of views. The points are concentrated in the bottom-left corner: most videos have fewer than 50 million views and fewer than 1 million likes.
We explore how many videos are in each category.
category_vids <- CAN %>% group_by(category_id) %>%
tally() %>%
arrange(desc(n))
kable(category_vids) %>% kableExtra::kable_styling(full_width = F)
category_id | n |
---|---|
24 | 13451 |
25 | 4159 |
22 | 4105 |
23 | 3773 |
10 | 3731 |
17 | 2787 |
1 | 2060 |
26 | 2007 |
20 | 1344 |
28 | 1155 |
27 | 991 |
19 | 392 |
15 | 369 |
2 | 353 |
43 | 124 |
29 | 74 |
30 | 6 |
list_of_category = c(24,25,22,23,10)
CAN %>% filter(category_id==24) %>%
mutate(year_month = format(as.Date(trending_date), "%Y-%m") ) %>%
group_by(channel_title, trending_date) %>%
# summarize(mean_likes = mean(likes),
# mean_dislikes= mean(dislikes),
# mean_comment_count = mean(comment_count),
# mean_views = mean(views)) %>%
arrange(trending_date) %>%
ggplot(aes(x=as.Date(trending_date),y=views)) + geom_line() +
labs(x = "Date", y = "Number of views") +
ggtitle("View counts over time in entertainment category") +
scale_y_continuous(labels = scales::comma_format()) +
scale_x_date(date_breaks = "months")
The trending channels are listed below, ordered from highest to lowest average comment count:
# per-channel video count and mean engagement (columns hold means, not sums)
CAN %>% group_by(channel_title) %>%
  summarise(count = n(),
            mean_views = mean(views),
            mean_likes = mean(likes),
            mean_comments = mean(comment_count),
            mean_dislikes = mean(dislikes)) %>%
  arrange(desc(mean_comments)) %>% datatable()
Here is a plot of number of videos by category.
# order the category bars by their video count (one count per category)
category_vids %>% ggplot(aes(y = n,
  x = fct_reorder(as.factor(category_id), n))) +
geom_bar(stat="identity") +
coord_flip() +
ylab("count") +
xlab("category") +
theme_bw() +
theme(legend.position = "none") +
ggtitle("Number of videos by Category")
The category corresponding to its ID can be found here.
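As a rough sketch of how the IDs could be turned into readable labels, the chunk below attaches names to the five largest categories with a named lookup vector. The counts are taken from the table above. The names are our assumption of the YouTube API category listing and should be checked against the linked reference before use.

```r
# Assumed ID-to-name mapping (verify against the category reference)
category_names <- c(
  "24" = "Entertainment",
  "25" = "News & Politics",
  "22" = "People & Blogs",
  "23" = "Comedy",
  "10" = "Music"
)

# Counts for the five largest categories, copied from the table above
top_counts <- data.frame(
  category_id = c(24, 25, 22, 23, 10),
  n           = c(13451, 4159, 4105, 3773, 3731)
)

# Look up each ID by its character form; unmatched IDs would become NA
top_counts$category <- unname(category_names[as.character(top_counts$category_id)])

top_counts
```

A named vector keeps the sketch free of extra packages. For the full set of IDs, a lookup data frame joined with `dplyr::left_join()` would be the more usual choice.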
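Top 5 categories by number of trending videos (category IDs, most frequent first): 24, 25, 22, 23 and 10.
Bottom 5 categories by number of trending videos (category IDs, least frequent first): 30, 29, 43, 2 and 15.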
Next, we explore correlation between numerical columns.
CAN %>% select(views, likes, dislikes,comment_count) %>%
cor() %>%
round(2) %>%
corrplot(
type="lower",
method="color",
tl.srt=45,
addCoef.col = "white",
diag = FALSE)
We note that the highest correlation is between the number of likes and the number of comments, and the lowest is between the number of views and the number of dislikes.
What is the relationship between video category, the number of views a video receives, likes/dislikes and comment count?
What trends exist between comment count and video likes/dislikes?
Given our research questions, we are mainly interested in comment counts, likes, dislikes and views, and in how these change over time. Hence we will perform the subsequent analysis on a reduced dataset containing only these variables, after dealing with any missing values. To investigate the questions, we will plot time series of the data to visualise trends in the variables over time, and perform both simple and multiple linear regression to estimate the relationships between variables.
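The regression step might look roughly like the sketch below. It runs on simulated toy data (not the Canada dataset), drops incomplete rows with `na.omit()`, and fits one simple and one multiple linear model with `lm()`. The choice of `comment_count` as the response and the particular predictors are assumptions made for illustration.

```r
set.seed(1)  # make the simulated toy data reproducible

# Simulated stand-in for the reduced dataset (all values are artificial)
n_obs <- 200
toy <- data.frame(views = round(runif(n_obs, 1e3, 1e6)))
toy$likes         <- round(0.03 * toy$views + rnorm(n_obs, sd = 500))
toy$dislikes      <- round(0.002 * toy$views + rnorm(n_obs, sd = 50))
toy$comment_count <- round(0.1 * toy$likes + 0.5 * toy$dislikes +
                           rnorm(n_obs, sd = 100))

# Inject two missing values to mimic incomplete records
toy$likes[c(3, 17)] <- NA

# Deal with missing values by keeping complete rows only
toy_complete <- na.omit(toy)

# Simple linear regression: comment count explained by likes alone
simple_fit <- lm(comment_count ~ likes, data = toy_complete)

# Multiple linear regression: add dislikes and views as predictors
multiple_fit <- lm(comment_count ~ likes + dislikes + views, data = toy_complete)

coef(simple_fit)
summary(multiple_fit)$r.squared
```

Dropping incomplete rows is the simplest way to handle missing values. Imputation may be preferable if many rows are affected.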