Data Description

The dataset 'Trending YouTube Video Statistics' was put together by Mitchell Jolly using the YouTube API. It has records from 2008 and was last updated on 2019-06-02. The scripts that scraped the data from the YouTube API can be found here, and the primary aim of the dataset is to help determine the year's top trending YouTube videos.

There are 10 datasets, one for each of the following countries: USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan and India. We chose the Canada dataset to explore. Each row is a trending video, with features such as category, trending date, tags, number of views, likes, dislikes, comment count and video description.

Make sure you have loaded in the data first and processed it.

# read Canada dataset csv file
CAN <- read.csv('../data/youtube_processed.csv')

Below is the number of rows and columns for the dataset.

nrow(CAN)

## [1] 40881

ncol(CAN)

## [1] 18

The following are the data types of the 18 columns in the dataset.

features <- CAN %>% colnames() %>% tibble()
types <- CAN %>% sapply(class) %>% tibble()
feature_type <- cbind(features,types)
colnames(feature_type)<-c("Features","Type")
kable(feature_type) %>% kableExtra::kable_styling(full_width = F)

Features	Type
X.1	integer
X	integer
video_id	factor
trending_date	factor
title	factor
channel_title	factor
category_id	integer
publish_time	logical
tags	factor
views	integer
likes	integer
dislikes	integer
comment_count	integer
thumbnail_link	factor
comments_disabled	factor
ratings_disabled	factor
video_error_or_removed	factor
description	factor

For some columns, such as title, tags and description, it makes more sense for the data class to be character instead of factor. This conversion may be part of the data grooming process.
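As a minimal sketch of that grooming step, the free-text columns could be converted from factor to character as shown here. The toy data frame `toy` and the vector `text_cols` are illustrative assumptions and not part of the original analysis; only the column names are taken from the table above.

```r
# Toy stand-in for CAN (hypothetical values; column names taken from the table above)
toy <- data.frame(
  title = c("Video A", "Video B"),
  tags = c("music|pop", "news"),
  description = c("First description", "Second description"),
  views = c(100L, 250L),
  stringsAsFactors = TRUE
)

# Columns assumed to hold free text rather than categories
text_cols <- c("title", "tags", "description")

# Convert only those columns to character, leaving the others untouched
toy[text_cols] <- lapply(toy[text_cols], as.character)

# Inspect the resulting classes
sapply(toy, class)
```

Passing `stringsAsFactors = FALSE` to `read.csv()` would avoid factors at load time instead, at the cost of also reading genuinely categorical columns as character.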

EDA

We plot the trend between likes and views:

ggplot(CAN, aes(views, likes)) +
  geom_point(alpha =0.2,position="jitter", color = "blue") + 
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::comma_format()) +
  labs(x = "Views", y = "Likes") +
  ggtitle("Trends between Youtube video views and likes")+
  theme_bw()

We see that, in general, the number of likes increases with the number of views. The points are concentrated in the bottom left corner (most videos have fewer than 50 million views and fewer than 1 million likes).

We explore how many videos are in each category.

category_vids <- CAN %>% group_by(category_id) %>% 
  tally() %>% 
  arrange(desc(n))
kable(category_vids) %>%  kableExtra::kable_styling(full_width = F)

category_id	n
24	13451
25	4159
22	4105
23	3773
10	3731
17	2787
1	2060
26	2007
20	1344
28	1155
27	991
19	392
15	369
2	353
43	124
29	74
30	6

# Top five categories by video count (24 = Entertainment)
list_of_category <- c(24, 25, 22, 23, 10)

# Plot view counts over time for the Entertainment category
CAN %>% filter(category_id == 24) %>%
  arrange(trending_date) %>%
  ggplot(aes(x = as.Date(trending_date), y = views)) + geom_line() +
  labs(x = "Date", y = "Number of views") +
  ggtitle("View counts over time in entertainment category") +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_x_date(date_breaks = "months")

Trending channels are as follows, ordered from highest to lowest mean comment count:

# Per-channel video count and mean engagement metrics
CAN %>% group_by(channel_title) %>% 
  summarise(count = n(),
            mean_views = mean(views),
            mean_likes = mean(likes),
            mean_comments = mean(comment_count),
            mean_dislikes = mean(dislikes)) %>% 
  arrange(desc(mean_comments)) %>% datatable()

Here is a plot of number of videos by category.

# Order categories by their video count so the bars are sorted
category_vids %>% ggplot(aes(y = n,
             x = fct_reorder(as.factor(category_id), n))) +
  geom_bar(stat = "identity") + 
  coord_flip() + 
  ylab("count") + 
  xlab("category") + 
  theme_bw() +
  theme(legend.position = "none") +
  ggtitle("Number of videos by Category")

The category corresponding to its ID can be found here.

Top 5 Categories are:

Category 24: Entertainment
Category 25: News and Politics
Category 22: People and Blogs
Category 23: Comedy
Category 10: Music

Bottom 5 Categories are:

Category 30: Movies
Category 29: Nonprofits & Activism
Category 43: Shows
Category 2: Autos and Vehicles
Category 15: Pets and Animals
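To make tables and plots easier to read, the IDs could be mapped to these names with a small lookup vector. This is only a sketch: `category_names` and the toy vector `ids` are hypothetical, and the lookup covers just the ten categories named above, so any other ID comes back as `NA`.

```r
# Lookup from category ID to name, limited to the ten categories named above
category_names <- c(
  "24" = "Entertainment",
  "25" = "News and Politics",
  "22" = "People and Blogs",
  "23" = "Comedy",
  "10" = "Music",
  "30" = "Movies",
  "29" = "Nonprofits & Activism",
  "43" = "Shows",
  "2"  = "Autos and Vehicles",
  "15" = "Pets and Animals"
)

# Toy IDs standing in for CAN$category_id (hypothetical values)
ids <- c(24, 10, 30, 17)

# Index by the ID as a string; IDs missing from the lookup yield NA
labels <- unname(category_names[as.character(ids)])
labels
```

The same indexing could be applied to the full `category_id` column once the remaining categories are added to the lookup.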

Correlation plots

Next, we explore correlation between numerical columns.

CAN %>% select(views, likes, dislikes,comment_count) %>% 
  cor() %>% 
  round(2) %>% 
  corrplot(
    type="lower", 
    method="color", 
    tl.srt=45,
    addCoef.col = "white",
    diag = FALSE)

We note that the highest correlation is between the number of likes and the number of comments, and the lowest correlation is between the number of views and the number of dislikes.
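Rather than reading the extremes off the plot, the strongest and weakest pairs could also be extracted from the correlation matrix directly. The sketch below uses synthetic data, so its values say nothing about the real dataset; `toy`, `cor_mat` and `cor_pairs` are illustrative names.

```r
# Synthetic stand-in for the four numeric columns (hypothetical values)
set.seed(1)
n <- 200
views <- runif(n, 1e3, 1e6)
likes <- 0.03 * views + rnorm(n, sd = 5e3)
dislikes <- 0.002 * views + rnorm(n, sd = 2e3)
comment_count <- 0.1 * likes + rnorm(n, sd = 500)
toy <- data.frame(views, likes, dislikes, comment_count)

# Same correlation matrix as in the plot, rounded to two decimals
cor_mat <- round(cor(toy), 2)

# Keep each pair once by blanking the diagonal and the upper triangle
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# Flatten to one row per pair of variables
cor_pairs <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
cor_pairs <- cor_pairs[!is.na(cor_pairs$Freq), ]
names(cor_pairs) <- c("var1", "var2", "correlation")

# Pairs with the highest and lowest correlation
cor_pairs[which.max(cor_pairs$correlation), ]
cor_pairs[which.min(cor_pairs$correlation), ]
```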

Research questions

What is the relationship between a video's category, the number of views it receives, its likes/dislikes and its comment count?

Does the number of comment counts on Youtube videos correlate with the number of likes or dislikes on a video?
What trends exist between comment count and video likes/dislikes?

Plan of action

With our research questions we are mainly interested in the comment counts, likes, dislikes, views, and how these change over time. Hence we will perform subsequent analysis using this reduced dataset, after dealing with any missing values. To investigate the questions we will plot time series of the data to visualise trends in variables over time as well as perform both simple and multiple linear regression analysis to estimate the relationship between variables.
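A minimal sketch of the missing-value and regression steps might look as follows. It assumes rows with missing values are simply dropped, which is one option among several, and it runs on synthetic data; `toy`, `reduced`, `simple_fit` and `multiple_fit` are illustrative names rather than the analysis actually carried out.

```r
# Synthetic stand-in for the numeric columns (hypothetical values)
set.seed(42)
n <- 100
toy <- data.frame(views = runif(n, 1e3, 1e6))
toy$likes <- 0.03 * toy$views + rnorm(n, sd = 2000)
toy$dislikes <- 0.002 * toy$views + rnorm(n, sd = 500)
toy$comment_count <- 0.1 * toy$likes + 0.5 * toy$dislikes + rnorm(n, sd = 200)

# Introduce two missing values to illustrate the cleaning step
toy$comment_count[c(3, 17)] <- NA

# Reduced dataset: keep the variables of interest and drop incomplete rows
keep <- c("views", "likes", "dislikes", "comment_count")
reduced <- toy[complete.cases(toy[keep]), keep]

# Simple linear regression: comment count explained by likes alone
simple_fit <- lm(comment_count ~ likes, data = reduced)

# Multiple linear regression: comment count explained by likes, dislikes and views
multiple_fit <- lm(comment_count ~ likes + dislikes + views, data = reduced)

coef(simple_fit)
coef(multiple_fit)
```

Calling `summary()` on either fit would report standard errors and R-squared for comparing the two models.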

Dataset, EDA and research question

Marion Nyberg & Rachel Han

28/02/2020
