Alcohol use has been linked with short-term cognitive impairment in a variety of situations, such as operating a motor vehicle. Numerous factors have been found to affect a student’s performance in a class, from sleep to diet. One previous study showed a negative effect of alcohol on academic achievement in a student dataset from the United States (Balsa et al.). Thus, it would be interesting to see whether this effect on performance can be replicated in other datasets and whether the time of alcohol consumption (weekend or weekday) makes a difference.
The dataset was obtained from UCI and is originally from Fabio Pagnotta and Hossain Mohammad Amran. It contains survey data from Portuguese high school students in a Math class and a Portuguese class, with information on 33 attributes. Each class has its own .csv file, but I will focus on the Portuguese class dataset because it contains more students (649). Each row represents one student.
Below is the entire variable set before the variable names were cleaned up. The table was generated from a colon-separated file I made from the original txt metadata file:
variable | description | type |
---|---|---|
school | student’s school | binary: GP for Gabriel Pereira or MS for Mousinho da Silveira |
sex | student’s sex | binary: F for female or M for male |
age | student’s age | numeric: from 15 to 22 |
address | student’s home address type | binary: U for urban or R for rural |
famsize | family size | binary: LE3 for less or equal to 3 or GT3 for greater than 3 |
Pstatus | parent’s cohabitation status | binary: T for living together or A for apart |
Medu | mother’s education | numeric: 0 for none, 1 for primary education (4th grade), 2 for 5th to 9th grade, 3 for secondary education or 4 for higher education |
Fedu | father’s education | numeric: 0 for none, 1 for primary education (4th grade), 2 for 5th to 9th grade, 3 for secondary education or 4 for higher education |
Mjob | mother’s job | nominal: teacher, health care related, civil services (e.g. administrative or police), at_home or other |
Fjob | father’s job | nominal: teacher, health care related, civil services (e.g. administrative or police), at_home or other |
reason | reason to choose this school | nominal: close to home, school reputation, course preference or other |
guardian | student’s guardian | nominal: mother, father or other |
traveltime | home to school travel time | numeric: 1 for <15 min., 2 for 15 to 30 min., 3 for 30 min. to 1 hour, or 4 for >1 hour |
studytime | weekly study time | numeric: 1 for <2 hours, 2 for 2 to 5 hours, 3 for 5 to 10 hours, or 4 for >10 hours |
failures | number of past class failures | numeric: n if 1<=n<3, else 4 |
schoolsup | extra educational support | binary: yes or no |
famsup | family educational support | binary: yes or no |
paid | extra paid classes within the course subject (Math or Portuguese) | binary: yes or no |
activities | extra-curricular activities | binary: yes or no |
nursery | attended nursery school | binary: yes or no |
higher | wants to take higher education | binary: yes or no |
internet | Internet access at home | binary: yes or no |
romantic | with a romantic relationship | binary: yes or no |
famrel | quality of family relationships | numeric: from 1 for very bad to 5 for excellent |
freetime | free time after school | numeric: from 1 for very low to 5 for very high |
goout | going out with friends | numeric: from 1 for very low to 5 for very high |
Dalc | workday alcohol consumption | numeric: from 1 for very low to 5 for very high |
Walc | weekend alcohol consumption | numeric: from 1 for very low to 5 for very high |
health | current health status | numeric: from 1 for very bad to 5 for very good |
absences | number of school absences | numeric: from 0 to 93 |
G1 | first period grade | numeric: from 0 to 20 |
G2 | second period grade | numeric: from 0 to 20 |
G3 | final grade | numeric: from 0 to 20, output target |
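The later sections refer to renamed columns such as `final_grades`, `t1_grades`, `workday_alc` and `weekend_alc`. A minimal sketch of how such a renaming could be done in base R is shown here. The toy values, the `new_names` mapping and the commented-out file name and separator are assumptions for illustration, not the exact cleaning code used for this report.

```r
# Toy stand-in for the Portuguese-class file; the values are invented.
# With the real file, something like the following might apply (assumed name and separator):
# students <- read.csv("student-por.csv", sep = ";")
students <- data.frame(
  Dalc = c(1, 2, 1, 5),
  Walc = c(1, 3, 2, 5),
  G1   = c(12, 10, 14, 8),
  G2   = c(13, 11, 14, 7),
  G3   = c(13, 10, 15, 8)
)

# Hypothetical mapping from the original names to the descriptive names used later.
new_names <- c(
  Dalc = "workday_alc",
  Walc = "weekend_alc",
  G1   = "t1_grades",
  G2   = "t2_grades",
  G3   = "final_grades"
)

# Rename only the columns that appear in the mapping; others are left untouched.
matched <- names(students) %in% names(new_names)
names(students)[matched] <- unname(new_names[names(students)[matched]])

print(names(students))
```

Keeping the mapping in one named vector makes it easy to check every renamed variable against the table above.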
In this correlogram, we see a variety of factors associated with final grades. The colour scheme shows positive correlations in blue and negative correlations in red. It makes sense that term 1 grades (t1_grades) and term 2 grades (t2_grades) have the highest correlations with final_grades, as earlier term grades are correlated with later term grades. We will mainly focus on alcohol use (workday and weekend), which shows a negative correlation with final grades.
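The numbers behind a correlogram like this one are pairwise correlations. The sketch below computes them with base R's `cor()` on simulated stand-in data, so the values it prints are not the survey results. The variable names follow the report, but the simulation parameters are invented.

```r
set.seed(42)
n <- 200

# Simulated stand-in data; these are NOT the survey values.
workday_alc  <- sample(1:5, n, replace = TRUE)
weekend_alc  <- pmin(5, workday_alc + sample(0:2, n, replace = TRUE))
t1_grades    <- 13 - 0.6 * workday_alc + rnorm(n, sd = 2.5)
final_grades <- t1_grades + rnorm(n, sd = 1)

dataset <- data.frame(workday_alc, weekend_alc, t1_grades, final_grades)

# Pearson correlation matrix; method = "spearman" is an alternative for ordinal scales.
corr <- cor(dataset, method = "pearson")

# Correlation of every variable with the final grade, sorted from lowest to highest.
print(round(sort(corr[, "final_grades"]), 2))
```

Because the alcohol variables are ordinal 1 to 5 scales, a rank-based correlation may be worth comparing against the Pearson values.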
Let’s look at how final grades are spread across levels of weekend and workday alcohol use.
We see differences in the spread from very low (1) to very high (5) consumption, with a general decrease in the mean as alcohol consumption increases, especially for workday consumption.
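The decrease in the mean can also be checked numerically by averaging final grades within each consumption level. This is a small sketch on invented toy values, so the means it prints only illustrate the calculation.

```r
# Toy data with invented values; one row per student.
dataset <- data.frame(
  workday_alc  = c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5),
  final_grades = c(14, 12, 13, 12, 11, 11, 10, 10, 9, 8)
)

# Mean final grade within each workday alcohol level (1 = very low, 5 = very high).
mean_by_level <- tapply(dataset$final_grades, dataset$workday_alc, mean)
print(mean_by_level)

# Number of students per level, useful because high levels are often sparse.
print(table(dataset$workday_alc))
```

Reporting the group sizes alongside the means helps show how much weight each level's average deserves.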
Let’s look at the distribution of grades.
The distribution of grades appears to be a bit left-skewed.
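One way to put a number on the skew is the moment-based sample skewness, where a negative value indicates a longer left tail. The sketch below defines a small helper, `sample_skewness`, which is a hypothetical name introduced here, and applies it to invented grades.

```r
# Moment-based sample skewness: third central moment over the 1.5 power of the second.
sample_skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Invented grades with a few very low values, mimicking a left tail.
toy_grades <- c(0, 8, 10, 11, 12, 12, 13, 13, 14, 15)

skew <- sample_skewness(toy_grades)
print(skew)  # negative for a left-skewed sample
```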
Let’s look at potential confounding factors, such as the sex of the student, parental status and family support, and how average final grades vary across them.
##### Potential confounding factors and grades
It doesn’t look like there is a large difference in grades between males and females. Males have a slightly lower average, but overall the groups are similar. This is good because sex is unlikely to be a major confound in the data. Family support and parental status also show similar average values across groups.
In this analysis, I will use linear regression to determine the relationship between alcohol use (weekend, workday or both) and students’ final grades. I chose final grades as the output variable because they are more resistant to short-term effects, as they depend on work throughout the term.
I will remove students with very bad health status (1) to reduce confounds in the data. My main focus is on the alcohol use categories and final grades, so I will largely ignore the other factors. I will then perform a linear regression analysis and plot a regression line using the relevant variables.
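The filtering step can be sketched in base R as follows. The toy rows are invented, and `filtered` is a name introduced here for illustration.

```r
# Toy data with invented values; health is coded 1 (very bad) to 5 (very good).
dataset <- data.frame(
  health       = c(1, 3, 5, 1, 4, 2),
  workday_alc  = c(2, 1, 1, 4, 2, 3),
  final_grades = c(9, 13, 15, 8, 12, 11)
)

# Keep every student except those reporting very bad health (1).
filtered <- subset(dataset, health != 1)

print(nrow(dataset) - nrow(filtered))  # number of rows removed
```

Printing the number of removed rows makes it easy to report how much the sample shrinks.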
I fitted a multiple linear regression using R’s lm function, after removing students with very bad health status (1). I used workday alcohol and weekend alcohol as covariates and also included the interaction between the two.
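A model of this shape can be written with R's formula interface, where `a * b` expands to both main effects plus their interaction. The sketch below fits it to simulated stand-in data, so its coefficients are not the ones reported in this analysis. Passing `data = dataset` instead of writing `dataset$...` inside the formula is a stylistic choice made here, and it gives shorter term names in the output.

```r
set.seed(123)
n <- 300

# Simulated stand-in data; the coefficients used to generate it are invented.
dataset <- data.frame(
  workday_alc = sample(1:5, n, replace = TRUE),
  weekend_alc = sample(1:5, n, replace = TRUE)
)
dataset$final_grades <- 13.5 - 0.8 * dataset$workday_alc -
  0.3 * dataset$weekend_alc + rnorm(n, sd = 3)

# workday_alc * weekend_alc = workday_alc + weekend_alc + workday_alc:weekend_alc
lm_model <- lm(final_grades ~ workday_alc * weekend_alc, data = dataset)

# Coefficient table: estimate, standard error, t statistic and p-value per term.
print(coef(summary(lm_model)))
```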
I included a plot with workday alcohol on the x-axis, coloured by weekend alcohol.
I didn’t include an interaction plot because the interaction was not significant (as you will see below).
Let’s look at our linear model results.
```r
tidy_lm_model <- tidy(lm_model)
tidy_lm_model
## # A tibble: 4 x 5
##   term                                    estimate std.error statistic  p.value
##   <chr>                                      <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                               13.7       0.619     22.2  1.21e-78
## 2 dataset$workday_alc                       -1.16      0.478     -2.44 1.52e- 2
## 3 dataset$weekend_alc                       -0.370     0.208     -1.78 7.61e- 2
## 4 dataset$workday_alc:dataset$weekend_alc    0.161     0.117      1.38 1.69e- 1
glance(lm_model)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
## 1    0.0437        0.0385  3.16      8.45 1.70e-5     4 -1434. 2877. 2899.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
```
The only term that appears significant is workday alcohol, with a p-value of 0.0152. The interaction term is not significant, so we can interpret workday alcohol as a main effect on grades. The estimate for this main effect is -1.16. This indicates a negative association between workday alcohol and grades: as workday drinking increases, final grades decrease.
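To make the coefficients concrete, the sketch below plugs the rounded estimates from the table into the fitted equation. The helper names `predict_grade` and `workday_slope` are introduced here for illustration. Because the model contains an interaction term, the workday estimate of -1.16 is the slope at a weekend level of 0, which lies outside the 1 to 5 scale. The implied slope at an observed weekend level adds the interaction estimate times that level. These are point estimates only, and the interaction itself was not significant.

```r
# Rounded estimates copied from the coefficient table above.
coefs <- c(intercept = 13.7, workday = -1.16, weekend = -0.370, interaction = 0.161)

# Predicted final grade for given alcohol levels (each on the 1 to 5 scale).
predict_grade <- function(workday_alc, weekend_alc) {
  coefs[["intercept"]] +
    coefs[["workday"]] * workday_alc +
    coefs[["weekend"]] * weekend_alc +
    coefs[["interaction"]] * workday_alc * weekend_alc
}

# Implied change in grade per one-step increase in workday alcohol,
# holding the weekend level fixed.
workday_slope <- function(weekend_alc) {
  coefs[["workday"]] + coefs[["interaction"]] * weekend_alc
}

print(predict_grade(1, 1))  # lowest consumption on both scales
print(workday_slope(1))     # slope when weekend use is very low
print(workday_slope(5))     # slope when weekend use is very high
```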
Let’s look at a Q-Q plot of the residuals and a residual vs fitted plot.
The residuals do not all fall on the reference line of the Q-Q plot and thus are not fully normally distributed. A residual vs fitted plot should show random dispersion around the horizontal zero line if a linear model is appropriate for the data. In this case, the points do not seem fully randomly dispersed.
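Both diagnostic plots can be produced with base R graphics. The sketch below uses a model fitted to simulated stand-in data, so the plots it draws will look better behaved than the ones discussed here. The object names `fit` and `res` are introduced for illustration.

```r
set.seed(7)
n <- 200

# Simulated stand-in data; NOT the survey values.
workday_alc  <- sample(1:5, n, replace = TRUE)
final_grades <- 13 - 0.8 * workday_alc + rnorm(n, sd = 3)
fit <- lm(final_grades ~ workday_alc)

res <- resid(fit)

# Use a null graphics device when run non-interactively so no file is written.
if (!interactive()) pdf(NULL)

# Q-Q plot: points close to the reference line suggest roughly normal residuals.
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)

# Residuals vs fitted: points should scatter randomly around the zero line.
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs fitted")
abline(h = 0, lty = 2)

if (!interactive()) invisible(dev.off())
```

With a predictor that takes only five values, the fitted values fall into a few vertical bands, which is expected and not by itself a sign of a problem.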
The only predictor variable that was significant was workday alcohol, which had a negative association with final grades. This is in line with Balsa et al.’s study, which found a significant but small negative association between alcohol and grades, specifically for males. In my case, I did not separate by gender, which could be a future analysis. I also think including other covariates, such as family support, would be a good idea in the future. Finally, given that the Q-Q plot and residual vs fitted plot show departures from the model assumptions, it may be best to change the model from a linear regression that treats alcohol use as a numeric predictor to a more complex model that treats this predictor as categorical and uses dummy variables.
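As a rough sketch of that suggested follow-up, wrapping the predictor in `factor()` makes `lm()` create one dummy variable per level above the baseline. The example below uses simulated stand-in data with an invented, deliberately non-linear effect, and compares the numeric and categorical fits with an F-test. It illustrates the idea and is not an analysis of the survey data.

```r
set.seed(99)

# Simulated stand-in data: 40 students per consumption level, invented effects.
workday_alc  <- rep(1:5, each = 40)
level_effect <- c(0, -1.5, -1.8, -2.0, -2.2)  # non-linear on purpose
final_grades <- 13 + level_effect[workday_alc] + rnorm(length(workday_alc), sd = 3)
dataset <- data.frame(workday_alc, final_grades)

# Numeric coding: a single slope shared across all steps of the scale.
numeric_model <- lm(final_grades ~ workday_alc, data = dataset)

# Categorical coding: level 1 is the baseline, levels 2 to 5 get dummy variables.
dataset$workday_fct <- factor(dataset$workday_alc)
factor_model <- lm(final_grades ~ workday_fct, data = dataset)

print(coef(factor_model))

# The numeric model is nested in the categorical one, so an F-test can compare them.
print(anova(numeric_model, factor_model))
```

Each dummy coefficient is the difference in mean grade between that level and level 1, which avoids assuming equal spacing between the survey's categories.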
Balsa AI, Giuliano LM, French MT. The effects of alcohol use on academic achievement in high school. Econ Educ Rev. 2011;30(1):1–15. doi:10.1016/j.econedurev.2010.06.015