library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

library(here)

## here() starts at /Users/racquellemangahas/Desktop/stat547_class/project/group_13-1

library(ggplot2)
library(tidyr)

Task 1: Choosing a dataset

We found the dataset at: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016/data

Task 2: Project Proposal & EDA

2.1: Introduce and describe your dataset

This dataset was compiled from four other datasets, linked by time and place, to find signals correlated with increased suicide rates among different cohorts globally, across the socio-economic spectrum. The motivation for compiling it was suicide prevention. The dataset has 12 columns: country, year, sex, age group, count of suicides, population, suicide rate, country-year composite key, HDI for year, gdp_for_year, gdp_per_capita, and generation (based on age grouping average).

The references for this study are:

United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506

World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#

[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook

World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/

2.2: Load your dataset

# Read the comma-separated file; the first row holds the column names
suiciderates <- read.csv("suiciderates.csv")

## Error in file(file, "rt"): cannot open the connection
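The column name `HDI.for.year` used later in this report suggests the file is read with a header row and R's default name cleaning. Here is a minimal sketch of how `read.csv()` parses a comma-separated file and turns a header containing spaces into a syntactic name. It uses a small made-up CSV written to a temporary file; the three columns and their values are illustrative, not taken from the dataset.

```r
# Toy CSV written to a temporary file; values are made up for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "country,year,HDI for year",
  "Canada,1985,",
  "Canada,1986,0.9"
), tmp)

# read.csv() assumes header = TRUE and sep = ",", and by default
# (check.names = TRUE) turns "HDI for year" into "HDI.for.year".
toy <- read.csv(tmp)

names(toy)
toy$HDI.for.year   # the empty field in a numeric column is read as NA

# Checking that the file exists first gives a clearer failure than
# "cannot open the connection".
file.exists(tmp)
```

A "cannot open the connection" error usually means the path does not resolve from the current working directory, so a `file.exists()` check on the path is a quick first diagnostic.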

Peek at dataset:

DT::datatable(suiciderates)

## Error in crosstalk::is.SharedData(data): object 'suiciderates' not found

2.3: Explore your dataset

Exploratory Data Analysis of ‘suiciderates’

How many rows?

nrow(suiciderates)

## Error in nrow(suiciderates): object 'suiciderates' not found

How many columns?

ncol(suiciderates)

## Error in ncol(suiciderates): object 'suiciderates' not found

Summary of suiciderates dataset:

summary(suiciderates)

## Error in summary(suiciderates): object 'suiciderates' not found

Figuring out NAs in ‘suiciderates’ dataset:

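Out of the entire dataset (27820 observations of 12 variables), what proportion of cells are NAs?

# Parentheses matter: divide by the total number of cells (rows * columns)
sum(is.na(suiciderates)) / (27820 * 12)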

## Error in eval(expr, envir, enclos): object 'suiciderates' not found

For the column ‘HDI for year’ (Human Development Index), what proportion of values are NAs?

sum(is.na(suiciderates$HDI.for.year))/27820

## Error in eval(expr, envir, enclos): object 'suiciderates' not found

About 70% of the values in ‘HDI for year’ are NAs. If that column holds all the NAs, this is roughly 5.8% of all cells in the dataset (0.70 / 12). Because so much of this variable is missing, it cannot be used reliably when looking at suicide rates, so we have decided to exclude it from our analyses.
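The two NA calculations above can also be written without hard-coding the number of rows and columns. This is a minimal sketch on a made-up data frame; the values are illustrative and only the column names `suicides_no` and `HDI.for.year` come from this report.

```r
# Toy data frame standing in for the real data; values are made up.
toy <- data.frame(
  country      = c("A", "A", "B", "B"),
  suicides_no  = c(10, 0, 5, 7),
  HDI.for.year = c(NA, 0.8, NA, NA)
)

# Share of missing cells in the whole table: is.na() returns a logical
# matrix, and the mean of a logical is the proportion of TRUE values.
overall_na <- mean(is.na(toy))

# Share of missing values in each column.
na_by_column <- colMeans(is.na(toy))

overall_na
na_by_column
```

Looking at the per-column shares makes it easy to see whether the missing values are concentrated in a single variable.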

Removing NAs and creating refined ‘suicideratesnew’ dataset:

Next, we select only the variables we are interested in, which removes ‘HDI for year’.

suicideratesnew <- suiciderates %>% 
  select(-HDI.for.year)

## Error in eval(lhs, parent, parent): object 'suiciderates' not found

DT::datatable(suicideratesnew)

## Error in crosstalk::is.SharedData(data): object 'suicideratesnew' not found

We now check what proportion of NAs remains in this dataset:

# Parentheses matter: divide by the total number of cells (rows * columns)
sum(is.na(suicideratesnew)) / (27820 * 11)

## Error in eval(expr, envir, enclos): object 'suicideratesnew' not found

The new dataset contains 0% NAs, confirming that ‘HDI for year’ contained all the NAs.

Exploratory Data Analysis of ‘suicideratesnew’:

How many rows?

nrow(suicideratesnew)

## Error in nrow(suicideratesnew): object 'suicideratesnew' not found

How many columns?

ncol(suicideratesnew)

## Error in ncol(suicideratesnew): object 'suicideratesnew' not found

Summary of suicideratesnew dataset:

summary(suicideratesnew)

## Error in summary(suicideratesnew): object 'suicideratesnew' not found

Plots

In this first plot, we look at how the mean number of suicides differs between generations, globally, between 1985 and 2016.

gen_suicides <- suicideratesnew %>% 
  group_by(generation) %>% 
  summarise("mean_suicides"=mean(suicides_no))

## Error in eval(lhs, parent, parent): object 'suicideratesnew' not found

DT::datatable(gen_suicides)

## Error in crosstalk::is.SharedData(data): object 'gen_suicides' not found

gen_suicides %>% 
  ggplot() +
  geom_col(aes(x=fct_reorder(generation, mean_suicides),y=mean_suicides, fill=generation)) +
  xlab("Generation") +
  ylab("Mean # of suicides") +
  theme_minimal() +
  coord_flip() + 
  ggtitle("Average number of suicides globally across generations (1985-2016)") +
  theme(plot.title = element_text(hjust = 0.5))

## Error in eval(lhs, parent, parent): object 'gen_suicides' not found

In the second plot, we look at how the total number of suicides in Canada has changed over the years, to see whether there is a trend.

canada_suicides <- suicideratesnew %>% 
  filter(country== 'Canada') %>% 
  group_by(year) %>% 
  summarise("sum_suicides"=sum(suicides_no))

## Error in eval(lhs, parent, parent): object 'suicideratesnew' not found

DT::datatable(canada_suicides)

## Error in crosstalk::is.SharedData(data): object 'canada_suicides' not found

canada_suicides %>% 
  ggplot() +
  geom_line(aes(x=year, y=sum_suicides)) +
  xlab("Year") +
  ylab("Sum of suicides") +
  theme_minimal() +
  ggtitle("Number of suicides in Canada (1985-2016)") +
  theme(plot.title = element_text(hjust = 0.5))

## Error in eval(lhs, parent, parent): object 'canada_suicides' not found

Lastly, we look at the distribution of the number of suicides for each sex across the entire dataset.

suicideratesnew %>% 
  ggplot() +
  geom_violin(aes(x=sex, y= log10(suicides_no), fill=sex)) +
  xlab("Sex") +
  ylab("log10(Number of suicides)") +
  theme_minimal() +
  ggtitle("Distribution of suicides between sexes, globally (1985-2016)") +
  theme(plot.title = element_text(hjust = 0.5))

## Error in eval(lhs, parent, parent): object 'suicideratesnew' not found
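One caveat with the log scale: `log10(0)` is `-Inf`, so any row with zero suicides cannot be drawn, and ggplot2 drops non-finite values with a warning. Whether the dataset contains such rows is an assumption here, not something shown above. A small sketch with made-up counts shows how to check for zeros, along with one common workaround (adding 1 before taking the log):

```r
counts <- c(0, 9, 99, 999)   # made-up counts for illustration

log10(counts)                # the first element is -Inf
sum(counts == 0)             # number of values that would be dropped
log10(counts + 1)            # finite for all non-negative counts
```

If the workaround is used, the axis label should say so, for example "log10(Number of suicides + 1)".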

Research question & plan of action

Research Question

Between 1985 and 2016, how did suicide rates differ between sexes and generations, and are they significantly correlated with each country's GDP per capita?

How?

With our research question, we are interested in suicide rates across sexes and generations. We will first check whether there is a relationship between the variables of interest. If there is, we will perform a linear regression analysis and plot those variables with a regression line.
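The planned regression step could look roughly like the sketch below. It is an illustration under stated assumptions, not the analysis itself: the data are made up, and the column names `suicide_rate` and `gdp_per_capita` are hypothetical stand-ins for whatever the loaded dataset actually calls those variables.

```r
# Made-up data; the column names are hypothetical stand-ins.
toy <- data.frame(
  gdp_per_capita = c(1000, 5000, 10000, 20000, 40000),
  suicide_rate   = c(8.1, 9.4, 11.2, 12.0, 13.5)
)

# Fit a simple linear regression of suicide rate on GDP per capita.
fit <- lm(suicide_rate ~ gdp_per_capita, data = toy)

coef(fit)      # intercept and slope
summary(fit)   # standard errors, p-values, R-squared

# Scatter plot with the fitted line (base graphics keep the sketch
# self-contained).
plot(toy$gdp_per_capita, toy$suicide_rate,
     xlab = "GDP per capita", ylab = "Suicide rate")
abline(fit)
```

In a ggplot2 workflow like the one used for the plots above, `geom_point()` followed by `geom_smooth(method = "lm")` would draw a comparable scatter plot with a fitted line.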

Milestone 1