library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(here)
## here() starts at /Users/racquellemangahas/Desktop/stat547_class/project/group_13-1
library(ggplot2)
library(tidyr)
We found the dataset at: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016/data
This compiled dataset combines four other datasets linked by time and place. It was built to find signals correlated with increased suicide rates among different cohorts globally, across the socio-economic spectrum, with the aim of informing suicide prevention. The dataset has 12 columns: country, year, sex, age group, count of suicides, population, suicide rate, country-year composite key, HDI for year, gdp_for_year, gdp_per_capita, and generation (based on age grouping average).
The references for this study are:
United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506
World Bank. (2018). World development indicators: GDP (current US$) by country: 1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#
[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook
World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/
suiciderates <- read.table("suiciderates.csv", sep = " ")
## Error in file(file, "rt"): cannot open the connection
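The `cannot open the connection` error above generally means R could not find or open the file at the given path; for a relative path, this depends on the working directory at knit time. Below is a minimal sketch of a guard that fails early with a clearer message. `read_if_exists()` is a hypothetical helper (not part of any package), and a temporary file stands in for the real CSV.

```r
# Hypothetical helper: stop with a readable message if the file is missing.
read_if_exists <- function(path, ...) {
  if (!file.exists(path)) {
    stop("File not found: ", normalizePath(path, mustWork = FALSE),
         " (working directory: ", getwd(), ")")
  }
  read.table(path, ...)
}

# Stand-in for the real data: a tiny space-separated file in a temp location.
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(country = c("A", "B"), suicides_no = c(3L, 5L))
write.table(toy, tmp, sep = " ")

# Reads back the toy file; a missing path would raise the error above instead.
toy_read <- read_if_exists(tmp, sep = " ")
```

In the project itself, the path could be built with `here::here()`, since `here` is already loaded; the file's exact location within the project is not stated here, so it is left open.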
Peek at dataset:
DT::datatable(suiciderates)
## Error in crosstalk::is.SharedData(data): object 'suiciderates' not found
Exploratory Data Analysis of ‘suiciderates’
How many rows?
nrow(suiciderates)
## Error in nrow(suiciderates): object 'suiciderates' not found
How many columns?
ncol(suiciderates)
## Error in ncol(suiciderates): object 'suiciderates' not found
Summary of suiciderates dataset:
summary(suiciderates)
## Error in summary(suiciderates): object 'suiciderates' not found
Figuring out NAs in ‘suiciderates’ dataset:
Out of the entire dataset (27820 observations of 12 variables), what proportion of values are NAs?
sum(is.na(suiciderates)) / (27820 * 12)
## Error in eval(expr, envir, enclos): object 'suiciderates' not found
For the column Human Development Index (HDI) for year, what proportion of values are NAs?
sum(is.na(suiciderates$HDI.for.year)) / 27820
## Error in eval(expr, envir, enclos): object 'suiciderates' not found
Since about 70% of the values in ‘HDI for year’ are NAs (roughly 5.8% of all values in the dataset), we have decided to exclude that variable from our analyses: with so much missing data, it could not be used reliably when looking at suicide rates.
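The NA proportions discussed above can also be computed without hard-coding the dimensions. Below is a minimal sketch on a made-up toy data frame (the values are illustrative, not taken from the dataset). It also shows why parentheses matter in the explicit form.

```r
# Toy data frame (made-up values) with NAs confined to one column.
toy <- data.frame(
  year         = c(1990, 1991, 1992, 1993),
  suicides_no  = c(10, 12, 9, 11),
  HDI.for.year = c(NA, 0.7, NA, 0.8)
)

# Share of all cells that are NA: is.na() returns a logical matrix, and
# mean() of logicals is the proportion of TRUE values.
overall_na <- mean(is.na(toy))

# Share of NAs per column.
na_by_column <- colMeans(is.na(toy))

# Equivalent explicit form. The parentheses are required because
# a / b * c is evaluated as (a / b) * c in R.
overall_na_explicit <- sum(is.na(toy)) / (nrow(toy) * ncol(toy))
```

Multiplying either proportion by 100 gives a percentage.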
Removing NAs and creating refined ‘suicideratesnew’ dataset:
Next, we select only the variables we are interested in, dropping ‘HDI for year’.
suicideratesnew <- suiciderates %>%
select(-HDI.for.year)
## Error in eval(lhs, parent, parent): object 'suiciderates' not found
DT::datatable(suicideratesnew)
## Error in crosstalk::is.SharedData(data): object 'suicideratesnew' not found
We now check what proportion of NAs remains in this dataset:
sum(is.na(suicideratesnew)) / (27820 * 11)
## Error in eval(expr, envir, enclos): object 'suicideratesnew' not found
The proportion of NAs in the new dataset is now 0, confirming that ‘HDI for year’ contained all the NAs.
Exploratory Data Analysis of ‘suicideratesnew’:
How many rows?
nrow(suicideratesnew)
## Error in nrow(suicideratesnew): object 'suicideratesnew' not found
How many columns?
ncol(suicideratesnew)
## Error in ncol(suicideratesnew): object 'suicideratesnew' not found
Summary of suicideratesnew dataset:
summary(suicideratesnew)
## Error in summary(suicideratesnew): object 'suicideratesnew' not found
Plots
In this first plot, we look at how the mean number of suicides per observation (row) differs between generations, globally, for 1985-2016.
gen_suicides <- suicideratesnew %>%
  group_by(generation) %>%
  summarise(mean_suicides = mean(suicides_no))
## Error in eval(lhs, parent, parent): object 'suicideratesnew' not found
DT::datatable(gen_suicides)
## Error in crosstalk::is.SharedData(data): object 'gen_suicides' not found
gen_suicides %>%
  ggplot() +
  geom_col(aes(x = fct_reorder(generation, mean_suicides), y = mean_suicides, fill = generation)) +
  xlab("Generation") +
  ylab("Mean # of suicides") +
  theme_minimal() +
  coord_flip() +
  ggtitle("Average number of suicides globally across generations (1985-2016)") +
  theme(plot.title = element_text(hjust = 0.5))
## Error in eval(lhs, parent, parent): object 'gen_suicides' not found
In the second plot, we look at how the total number of suicides per year has changed in Canada, to see if there is a trend.
canada_suicides <- suicideratesnew %>%
  filter(country == "Canada") %>%
  group_by(year) %>%
  summarise(sum_suicides = sum(suicides_no))
## Error in eval(lhs, parent, parent): object 'suicideratesnew' not found
DT::datatable(canada_suicides)
## Error in crosstalk::is.SharedData(data): object 'canada_suicides' not found
canada_suicides %>%
  ggplot() +
  geom_line(aes(x = year, y = sum_suicides)) +
  xlab("Year") +
  ylab("Sum of suicides") +
  theme_minimal() +
  ggtitle("Number of suicides in Canada (1985-2016)") +
  theme(plot.title = element_text(hjust = 0.5))
## Error in eval(lhs, parent, parent): object 'canada_suicides' not found
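The Canada plot is built from raw counts, which can rise simply because the population grows. Below is a minimal sketch of a population-adjusted alternative on made-up toy rows. The column name `population` is an assumption based on the column list in the dataset description, and the numbers are illustrative only.

```r
library(dplyr)

# Made-up toy rows; `population` is an assumed column name.
toy <- data.frame(
  country     = "Canada",
  year        = c(2000, 2000, 2001, 2001),
  sex         = c("female", "male", "female", "male"),
  suicides_no = c(10, 30, 12, 36),
  population  = c(1e6, 1e6, 1.2e6, 1.2e6)
)

# Total count per year alongside a rate per 100,000 people.
# The rate pools counts and populations across groups before dividing,
# rather than averaging group-level rates.
rate_by_year <- toy %>%
  group_by(year) %>%
  summarise(
    sum_suicides  = sum(suicides_no),
    rate_per_100k = sum(suicides_no) / sum(population) * 1e5
  )
```

In this toy example the yearly count goes up while the rate per 100,000 stays the same, which is the kind of difference a count-based plot cannot show.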
Lastly, we look at the distribution of suicide counts (on a log10 scale) for each sex across the entire dataset.
suicideratesnew %>%
  ggplot() +
  geom_violin(aes(x = sex, y = log10(suicides_no), fill = sex)) +
  xlab("Sex") +
  ylab("log10(Number of suicides)") +
  theme_minimal() +
  ggtitle("Distribution of suicides between sexes, globally (1985-2016)") +
  theme(plot.title = element_text(hjust = 0.5))
## Error in eval(lhs, parent, parent): object 'suicideratesnew' not found
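One caveat for the log10 transform in the violin plot: if any rows have `suicides_no` equal to 0, `log10()` returns `-Inf` for them, and ggplot2 typically drops such non-finite values with a warning. Below is a minimal sketch of the issue and two possible workarounds, using made-up counts. Whether the real data contains zeros is not shown here, so treat this as a check to run rather than a confirmed problem.

```r
# Made-up counts, including a zero.
counts <- c(0, 1, 10, 100)

# log10(0) is -Inf, so the first element is not finite.
logged <- log10(counts)
finite_flags <- is.finite(logged)

# Option 1: shift by one before taking the log
# (the axis then shows log10(count + 1), not log10(count)).
shifted <- log10(counts + 1)

# Option 2: keep only strictly positive counts before transforming.
positive_only <- log10(counts[counts > 0])
```

Either option changes what the plot represents, so the axis label or caption should say which one was used.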
Research Question
Between 1985 and 2016, how did suicide rates differ between sexes and generations, and are they significantly correlated with each country's GDP per capita?
How?
To address our research question, we will first examine suicide rates among different generations. If the exploratory analysis suggests a relationship between the variables of interest, we will then fit a linear regression and plot those variables together with the regression line.
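The planned regression step could look roughly like the sketch below. It uses base R's `lm()` and `cor.test()` on made-up toy values; the column name `rate_per_100k` is hypothetical, and `gdp_per_capita` follows the column list in the dataset description. The numbers do not come from the dataset and imply nothing about the real relationship.

```r
# Made-up toy values for illustration only.
toy <- data.frame(
  gdp_per_capita = c(1000, 2000, 3000, 4000, 5000, 6000),
  rate_per_100k  = c(14.2, 13.1, 12.5, 11.0, 10.4, 9.1)
)

# Simple linear regression of the rate on GDP per capita.
fit <- lm(rate_per_100k ~ gdp_per_capita, data = toy)
slope <- coef(fit)[["gdp_per_capita"]]

# Pearson correlation test between the two variables.
ct <- cor.test(toy$gdp_per_capita, toy$rate_per_100k)
```

`summary(fit)` reports the coefficient estimates and their p-values, and `ct$p.value` gives the p-value of the correlation test. With the real data, a regression line can be added to a scatter plot with `geom_smooth(method = "lm")`.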