American Opioid Crisis through the lens of Data Science
26 May 2018
by Alexander Goodkind
Between 2012 and 2016 the number of overdoses per 100,000 people rose
from 8 to 12, prompting an epidemic to be declared at the national level.
Data science is about telling a story through data. My approach to this
topic was optimistic: I wanted to take my existing knowledge of computer
science and learn about data science, while also learning about a topic
outside of computing, with the goal of communicating what I learned
through a visual story (a data visualization).
Part 1: “find a dataset”
The dataset I chose was the Opioid Overdose Death Rate in America between 1999 and 2016.
I wanted to use this dataset because I grew up in a family without many
engineers, or tech minds at all. My mom and her mom were both nurses, so I
was never exposed to technology, and everything I know has been self-taught.
However, a big part of my life has been knowing a lot about how hospitals
and insurance work, and especially how opioid prescriptions work and how
heavily they are prescribed. My mom is plagued with chronic arthritis and
back problems, and as a nurse she will purposefully take a quarter of the
prescribed amount of medicine because she has seen the life-changing
effects these drugs have on people.
Now, with a dataset selected, I was on a mission to tell this story in a
meaningful and effective way. Knowing our dataset, we need to seek out the
actual data; the best source I could find was cdc.gov and the NVSS
(National Vital Statistics System). They have publicly available data on
statewide and countywide overdose rates on a per-drug basis.
Part 2: “clean the data”
The dataset provided by wonder.cdc.gov came as a tab-separated text
document; conveniently, pandas has an option that handles that for us.
However, the file also had about 68 rows of unnecessary "footnotes"
appended by the website, which we can strip easily while reading:
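A runnable sketch of both fixes; the column layout and contents below are a toy stand-in for the real WONDER export, and `skipfooter` is one way to drop the trailing footnote rows:

```python
import io
import pandas as pd

# Toy stand-in for the tab-separated export from wonder.cdc.gov
# (the real file has many more rows and ~68 footnote lines).
raw = (
    "Notes\tState\tYear\tDeaths\tPopulation\tCrude Rate\n"
    "\tAlabama\t1999\t39\t4430141\t0.9\n"
    "\tAlaska\t1999\t25\t624779\t4.0\n"
    "---\t\t\t\t\t\n"                          # footnote rows appended
    "Dataset: Multiple Cause of Death\t\t\t\t\t\n"  # by the website
)

# sep='\t' handles the tab separation; skipfooter drops the trailing
# footnote rows (and requires the python parsing engine).
df = pd.read_csv(io.StringIO(raw), sep="\t", skipfooter=2, engine="python")
```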
We also had a Notes column that contained gibberish, which we can drop
with a similar operation:
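On the same toy frame, dropping the column looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    "Notes": [None, None],            # gibberish column from the export
    "State": ["Alabama", "Alaska"],
    "Deaths": [39, 25],
})

# Remove the Notes column entirely.
df = df.drop(columns=["Notes"])
```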
Now we can save the result for use with Plotly:
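Saving is one call; the file name here is illustrative, and `index=False` keeps pandas' row index out of the file:

```python
import pandas as pd

df = pd.DataFrame({"State": ["Alabama", "Alaska"], "Deaths": [39, 25]})

# Write the cleaned frame out for the Plotly step.
df.to_csv("overdose_clean.csv", index=False)

# Read it back to confirm the round trip.
check = pd.read_csv("overdose_clean.csv")
```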
Part 3: “feel the data”
This is the longest and most difficult part, as it requires the most work.
I elected to use pandas as the tool to manipulate the data, and Plotly as
my main visualization tool because it easily allows exporting to HTML/JS
documents.
Because I am not using the Plotly cloud, I need to set up my environment in
offline mode so I can export my data instead of using their online graphing
tools. This can be done a couple of different ways; the way I did it was:
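One way to do this, assuming the classic `plotly.offline` API (which still ships with current Plotly):

```python
# plotly.offline.plot renders a figure to a standalone HTML/JS file
# instead of sending it to the Plotly cloud service.
from plotly.offline import plot

# Later, once the figure is assembled:
# plot(fig, filename='overdose_map.html')
```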
Next I imported my previously cleaned dataset. I also set up a dictionary
mapping two-letter state codes to state names and attached the codes to
the dataset:
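A sketch of both steps, with a toy frame standing in for the real dataset and the code-to-name dictionary truncated for brevity:

```python
import pandas as pd

# Toy version of the cleaned dataset (the real one spans 1999-2016
# and all 50 states).
df = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Year": [1999, 1999],
    "Deaths": [39, 25],
    "Population": [4430141, 624779],
})

# Two-letter postal codes -> state names (truncated here).
state_codes = {"AL": "Alabama", "AK": "Alaska"}

# Invert the mapping and attach a 'State Code' column to the frame.
name_to_code = {name: code for code, name in state_codes.items()}
df["State Code"] = df["State"].map(name_to_code)
```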
These next few lines declare our axes and data points:
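The original declarations were not preserved, so this is a minimal guess at what they looked like; the variable names are my own:

```python
# The animation axis is the run of years; the map categories are the
# state codes (truncated here, AL through WY in the full data).
years = list(range(1999, 2017))
states = ["AL", "AK", "AZ"]
```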
We now need to recalculate the crude rate given to us by the CDC, because
they do not include numbers below 8 (considering them insignificant, though
in our dataset they still matter), and then insert a text value for each
row that will eventually correspond to a hover tooltip with some general
information:
This next part is where we need to start thinking about how our data is
going to look. We need to create a new data frame that represents a
flattened version of the original, almost as if each row were a year
corresponding to a dictionary of states, each with its respective crude
rate (deaths per 100,000). We know this is the shape we want because, in
the end, we want a frame animation keyed on the years 1999, 2000, 2001,
…, 2016, where each year contains the same set of states, AL, AK, …, WY,
each with its respective crude rate. Therefore we need to make our data
frame represent this.
Long story short, the way we do this in pandas is with the following lines of code:
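A runnable sketch on a toy frame; the grouped column names follow the text, and the duplicate rows here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1999, 1999, 2000],
    "State": ["Alabama", "Alabama", "Alabama"],
    "State Code": ["AL", "AL", "AL"],
    "Text": ["Alabama...", "Alabama...", "Alabama..."],
    "Deaths": [20, 19, 45],
})

# Group by Year/State/State Code/Text and sum the remaining columns,
# collapsing duplicate rows into one row per state per year.
df_flat = df.groupby(
    ["Year", "State", "State Code", "Text"], as_index=False
).sum()
```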
Here df_flat is our flattened data frame: we take the original data frame
grouped by Year, State, State Code, and Text, and at the end we sum all
other non-grouped values.
Part 4: “Constructing the visualization”
The first thing we need to do is define our initial frame (the z-axis);
this will be the year 1999:
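Written as the plain dict Plotly accepts for a US choropleth trace; the rates below are illustrative:

```python
# 1999 rates keyed by state code (placeholder values).
rates_1999 = {"AL": 0.9, "AK": 4.0}

initial_frame = {
    "type": "choropleth",
    "locationmode": "USA-states",
    "locations": list(rates_1999.keys()),
    "z": list(rates_1999.values()),  # z-axis: crude rate per 100,000
}
```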
Now we can begin constructing all our other frames:
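Each subsequent frame reuses the same trace shape, named by year so the slider can find it; the per-year rates here are placeholders:

```python
rates_by_year = {
    1999: {"AL": 0.9, "AK": 4.0},
    2000: {"AL": 1.1, "AK": 4.2},
}

frames = []
for year, rates in sorted(rates_by_year.items()):
    frames.append({
        "data": [{
            "type": "choropleth",
            "locationmode": "USA-states",
            "locations": list(rates.keys()),
            "z": list(rates.values()),
        }],
        "name": str(year),  # matched against the slider steps later
    })
```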
Now we can define our color range. A good tool I used for this can be
found online (be sure to select 11 classes).
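The 11 colors get mapped onto Plotly's [0, 1] colorscale positions; the hex values below are illustrative, not necessarily the palette used in the original:

```python
# An 11-class sequential palette (placeholder hex values).
colors = ["#ffffe5", "#fff7bc", "#fee391", "#fec44f", "#fe9929",
          "#ec7014", "#cc4c02", "#993404", "#662506", "#4d1c02",
          "#331302"]

# Pair each color with an evenly spaced position between 0 and 1,
# the format Plotly expects for a custom colorscale.
colorscale = [[i / (len(colors) - 1), c] for i, c in enumerate(colors)]
```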
Next we construct our main data dictionary; this just ties together the
first few things we defined, plus some titles:
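A sketch following Plotly's choropleth schema; the values and the colorbar title are illustrative:

```python
# The main data list: the initial trace plus hover text and a
# colorbar title.
data = [{
    "type": "choropleth",
    "locationmode": "USA-states",
    "locations": ["AL", "AK"],
    "z": [0.9, 4.0],
    "text": ["Alabama<br>Rate: 0.9", "Alaska<br>Rate: 4.0"],
    "colorbar": {"title": "Deaths per 100,000"},
}]
```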
Now we need to make our sliders. This is super important because it ties
the user-controllable slider bar to each frame:
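A sketch following Plotly's sliders schema: one step per year, each step animating to the frame with the matching name (durations are illustrative):

```python
years = ["1999", "2000"]

sliders = [{
    "active": 0,
    "steps": [
        {
            "label": year,
            "method": "animate",
            # First arg names the frame to jump to; second sets the
            # frame and transition behavior.
            "args": [[year], {
                "frame": {"duration": 300, "redraw": True},
                "transition": {"duration": 300},
            }],
        }
        for year in years
    ],
}]
```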
The slider basically defines the transition effects and which frame each
step jumps to.
The very last part is defining the layout, which controls how the graph
will look:
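A sketch of the layout dict, assuming the usual `geo` settings for a US map (scope and Albers projection); the title and options are illustrative:

```python
layout = {
    "title": "Opioid Overdose Death Rate, 1999-2016",
    "geo": {
        "scope": "usa",                       # restrict the map to the US
        "projection": {"type": "albers usa"},
        "showlakes": True,
    },
    "sliders": [],  # the slider list built earlier goes here
}
```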
A takeaway from this whole ordeal is that data science and data analytics
are not just buzzwords; they are powerful tools and ways of processing
data that allow us to collect data, draw new conclusions, and convey new
information in fascinating, efficient, and interesting ways.