Mental Health in Tech: Can treatment status be predicted?
In this build week project, I took a look at a dataset from Kaggle: a survey about mental health among tech workers conducted back in 2014.
I’m quite passionate about mental health; medically speaking, it’s often overlooked and under-reported, and in some cultures frankly not even believed in. So I was keen to select a dataset about it, and this one gathered responses from individuals around the world working in “tech”.
A quick background on mental health: according to an infographic created by the National Alliance on Mental Illness, 1 in 5 U.S. adults experience mental illness, with a 37% prevalence rate among those who identify as lesbian, gay, or bisexual. With this prior knowledge, I was excited to delve into a dataset with some interesting variables, hoping to predict whether or not a person sought treatment based on the given factors.
Some factors I initially presumed would be of value were Gender, Age, Family History, Benefits, Leave, and Mental health consequence (whether discussing a mental health issue with your employer would have negative consequences). But the data showed something far more interesting and, in a way, completely boring as well. An oxymoron, but please indulge me. Below are charts depicting why I thought the above variables would track well with treatment as the target.
Initially, a baseline built on the simple majority class shows that seeking treatment is a 50:50 coin toss: the respondents who sought treatment make up half the dataset, and those who did not make up the other half. This is not a satisfactory result, so three different models were applied. Below are their accuracy scores (the ratio of correctly predicted observations to total observations). Those familiar with these models might have better intuition than mine, but I would have thought XGBoost would yield the best prediction score.
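For readers who want to reproduce this kind of comparison, here is a minimal sketch of a majority-class baseline against a few classifiers. The exact models and preprocessing used in the project aren’t spelled out beyond XGBoost, so logistic regression and random forest, the file name, and the column names are assumptions.

```python
# Sketch only: column names ("treatment"), file name, and the non-XGBoost models are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

df = pd.read_csv("survey.csv")                      # hypothetical file name
X = pd.get_dummies(df.drop(columns=["treatment"]))  # one-hot encode categorical answers
y = (df["treatment"] == "Yes").astype(int)          # target: did the person seek treatment?

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the majority class (roughly a coin toss on this data).
majority = y_train.mode()[0]
print(f"Baseline accuracy: {accuracy_score(y_test, [majority] * len(y_test)):.2f}")

# Three candidate models (the first two are assumed choices).
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name} accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```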
In an attempt to understand my data more, I graphed the feature importances of my model, i.e. which features matter most to the model when making its predictions, and the permutation importances, defined as the decrease in a model’s score when a single feature’s values are randomly shuffled.
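A rough sketch of how those two views can be produced, assuming the same preprocessing as the previous snippet and a random forest as the fitted model (the project’s actual choice of model for these plots is not stated):

```python
# Sketch only: preprocessing and the random forest are assumptions carried over from the previous snippet.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("survey.csv")  # hypothetical file name
X = pd.get_dummies(df.drop(columns=["treatment"]))
y = (df["treatment"] == "Yes").astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Built-in (impurity-based) feature importances of the fitted model.
pd.Series(rf.feature_importances_, index=X.columns).nlargest(10).plot(
    kind="barh", ax=axes[0], title="Feature importance"
)

# Permutation importance: mean drop in accuracy when one column is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
pd.Series(perm.importances_mean, index=X.columns).nlargest(10).plot(
    kind="barh", ax=axes[1], title="Permutation importance"
)

plt.tight_layout()
plt.show()
```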
As we can see, the variable that made the most impact was work_interfere, the answer to the question: If you have a mental health condition, do you feel that it interferes with your work? And family_history, the answer to the question: Do you have a family history of mental illness?
At this point, those savvy to data science are rushing to say “data leakage”, and yes, this is a firm example of data leakage. But it logically makes sense why these would be good predictors of whether or not a person sought treatment according to the survey: if they feel it interferes with work, they would seek treatment; if they have a family history of it, they would have sought treatment.
The obvious aside, two more variables surprised me: leave and care_options. These are the answers to the questions: How easy is it for you to take medical leave for a mental health condition? and Do you know the options for mental health care your employer provides?
At first glance they seem slightly unintuitive (based on my first assumptions), but most of these questions are, in some way, a form of leak, leaving us with leaky predictors. There are things the data analyst can do to eliminate leaky predictors, such as screening columns that are statistically correlated with the target (a sketch of such a screen is shown below), and I guess that’s where having some degree of domain experience comes in handy.
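One possible way to do that screening for categorical survey answers is a chi-square test of independence between each column and the target. This is a sketch under assumptions: the threshold, the test, and the decision of what to do with flagged columns are illustrative, not what the project actually did.

```python
# Sketch only: chi-square screening of categorical columns against the target is an assumed approach.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("survey.csv")  # hypothetical file name, as in the earlier sketches

suspect_columns = []
for col in df.select_dtypes(include="object").columns:
    if col == "treatment":
        continue
    # Cross-tabulate the column against the target and test for independence.
    contingency = pd.crosstab(df[col], df["treatment"])
    chi2, p_value, dof, expected = chi2_contingency(contingency)
    if p_value < 0.001:  # very strong association: candidate leaky predictor
        suspect_columns.append(col)

print("Columns strongly associated with treatment:", suspect_columns)
```

Columns flagged this way (work_interfere and family_history would very likely appear) can then be reviewed with domain knowledge before deciding whether to drop them.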
But the surveyor also bears some responsibility: they should be cautious about how their datasets will be used and warn against building predictive models on datasets where columns highly correlate with the target, i.e. asking whether respondents have a family history of mental illness, or straight up asking if they have a mental health condition that gets in the way of work. Perhaps not all Kaggle datasets were bound to be good datasets for predictive modeling.