American Opioid Crisis through the lens of Data Science
Between 2012 and 2016, overdose deaths rose from 8 to 12 per 100,000 people, leading to an epidemic being declared on a national level.
Data science is about telling a story through data. My approach to this topic was optimistic: I wanted to build on my existing knowledge of computer science, learn data science, and explore a non-computer-science subject, with the goal of communicating what I learned through a visual story (a data visualization).
Part 1: “find a dataset”
The dataset I chose was the Opioid Overdose Death Rate in America between 1999 and 2016. I wanted to use it because I grew up in a family without many engineers, or tech minds at all. My mom and her mom were both nurses, so I was never exposed to technology, and everything I know has been self-taught. However, a big part of my life has been knowing a lot about how hospitals and insurance work, especially how opioid prescriptions work and how heavily they are prescribed. My mom is plagued with chronic arthritis and back problems, and as a nurse she will purposely take a quarter of the prescribed amount of medicine because she has seen the life-changing effects these drugs have on people.
Now, with a dataset selected, I was set on a mission to tell this story in a meaningful and effective way. Knowing our dataset, we need to seek out the actual data; for me, the best sources were cdc.gov and the NVSS (National Vital Statistics System). They have publicly available data on statewide and countywide overdose rates on a per-drug basis.
Part 2: “clean the data”
The dataset provided by wonder.cdc.gov came as a tab-separated text document; conveniently, pandas has an option that handles that for us.
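A minimal sketch of the load step, using an inline two-row sample in place of the real export (the filename `overdose_data.txt` is an assumption):

```python
import io
import pandas as pd

# Tiny stand-in for the CDC WONDER export, which is tab-separated
# even though it ships as a .txt file
sample = "Notes\tState\tYear\tDeaths\n\tAlabama\t1999\t197\n\tAlaska\t1999\t30\n"
df = pd.read_csv(io.StringIO(sample), sep='\t')

# For the real file it would be:
# df = pd.read_csv('overdose_data.txt', sep='\t')
```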
However, we had about 68 rows of unnecessary “footnotes” that were added by the website. We can solve that easily by running:
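Something like the following, shown here on a toy frame with two trailing "footnote" rows standing in for the real export's 68:

```python
import pandas as pd

# Three data rows followed by junk rows, mimicking the WONDER footer
df = pd.DataFrame({'State': ['Alabama', 'Alaska', 'Arizona', None, None],
                   'Deaths': [197, 30, 631, None, None]})

n_footnotes = 2          # 68 for the real export
df = df[:-n_footnotes]   # slice off the trailing footnote rows
```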
We also had a Notes column that contained gibberish; a similar operation gets rid of it:
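For example (column name per the WONDER export):

```python
import pandas as pd

df = pd.DataFrame({'Notes': ['---', None],
                   'State': ['Alabama', 'Alaska']})

# The Notes column carries no data we need
df = df.drop(columns=['Notes'])
```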
Now we can save the result for use with Plotly:
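A sketch of the save step; the output filename `overdose_cleaned.csv` is an assumption:

```python
import pandas as pd

df = pd.DataFrame({'State': ['Alabama'], 'Year': [1999], 'Deaths': [197]})

# index=False keeps the pandas row index out of the file
df.to_csv('overdose_cleaned.csv', index=False)
```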
Part 3: “feel the data”
This is the longest and most difficult part, as it requires the most work.
I elected to use pandas as the tool to manipulate the data and Plotly as my main visualization tool, because Plotly easily allows exporting to HTML/JS documents.
Because I am not using Plotly's cloud service, I need to initialize my environment in offline mode so I can export my data instead of using their online graphing tools. This can be done a couple of different ways; the way I did it was:
Next, I imported my previously cleaned dataset:
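Shown here with an inline sample in place of the saved file (the filename `overdose_cleaned.csv` is an assumption):

```python
import io
import pandas as pd

# Stand-in for the file saved at the end of Part 2
sample = "State,Year,Deaths,Population\nAlabama,1999,197,4430141\n"
df = pd.read_csv(io.StringIO(sample))

# For the real file: df = pd.read_csv('overdose_cleaned.csv')
```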
I also set up a dictionary mapping state names to their two-letter codes and added the codes to the dataset:
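In outline (the dictionary is abbreviated here; the real one covers all states):

```python
import pandas as pd

# Abbreviated name-to-code map; the full dict covers all 50 states + DC
state_codes = {'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ'}

df = pd.DataFrame({'State': ['Alabama', 'Alaska', 'Arizona']})
df['State Code'] = df['State'].map(state_codes)
```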
These next few lines declare our axes and data points:
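Roughly like this, with toy rows and illustrative variable names (not the author's exact ones):

```python
import pandas as pd

# Toy rows standing in for the cleaned dataset
df = pd.DataFrame({'Year': [1999, 1999], 'State Code': ['AL', 'AK'],
                   'Deaths': [197, 30], 'Population': [4430141, 624779]})

# The series the plot will be built from
years = df['Year']
locations = df['State Code']   # choropleth keys
deaths = df['Deaths']
population = df['Population']
```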
We now need to recalculate the crude rate given to us by the CDC, since they do not include counts below 8 (considering them insignificant; in our dataset, however, they still matter). Then we insert a text column whose value for each row will eventually appear in a hover tooltip with some general information:
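A sketch of both steps (the crude rate is deaths per 100,000 people; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                   'Year': [1999, 1999],
                   'Deaths': [197, 5],   # Alaska's 5 would be suppressed by WONDER
                   'Population': [4430141, 624779]})

# Recompute the rate ourselves so low counts still get a value
df['Crude Rate'] = df['Deaths'] / df['Population'] * 100_000

# Hover text: one <br>-separated blurb per row
df['Text'] = (df['State'] + '<br>Deaths: ' + df['Deaths'].astype(str)
              + '<br>Rate: ' + df['Crude Rate'].round(1).astype(str))
```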
This next part is where we need to start thinking about how our data is going to look. We need to create a new data frame that represents a flattened version of the original, almost as if each row were a year, which corresponds to a dictionary of states, which correspond to their respective crude rates (deaths per 100,000).
We structure it this way because, in the end, we want a frame animation based on the years 1999, 2000, 2001, …, 2016, where each year contains the same set of states, AL, AK, …, WY, each with its respective crude rate.
Therefore we need to make our data frame represent this.
Long story short, the way we do this in pandas is with the following lines of code:
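A minimal version on toy data (two Alabama rows collapse into one):

```python
import pandas as pd

df = pd.DataFrame({'Year': [1999, 1999, 1999],
                   'State': ['Alabama', 'Alabama', 'Alaska'],
                   'State Code': ['AL', 'AL', 'AK'],
                   'Text': ['Alabama<br>', 'Alabama<br>', 'Alaska<br>'],
                   'Deaths': [100, 97, 30]})

# One row per (Year, State); non-grouped numeric columns are summed
df_flat = df.groupby(['Year', 'State', 'State Code', 'Text'],
                     as_index=False).sum()
```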
Here df_flat is our flattened data frame: the original data frame grouped by Year, State, State Code, and Text, with all remaining non-grouped values summed at the end.
Part 4: “constructing the visualization”
The first thing we need to do is define our initial frame (the z-axis); this will be the year 1999:
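A sketch of the first choropleth trace, using a toy `df_flat` in place of the real one (the trace keys are standard Plotly choropleth fields; variable names are illustrative):

```python
import pandas as pd

# Toy stand-in for df_flat: two states over two years
df_flat = pd.DataFrame({'Year': [1999, 1999, 2000, 2000],
                        'State Code': ['AL', 'AK', 'AL', 'AK'],
                        'Crude Rate': [4.8, 5.6, 5.0, 6.1],
                        'Text': ['Alabama', 'Alaska', 'Alabama', 'Alaska']})

snap = df_flat[df_flat['Year'] == 1999]
initial_trace = {'type': 'choropleth', 'locationmode': 'USA-states',
                 'locations': snap['State Code'].tolist(),
                 'z': snap['Crude Rate'].tolist(),
                 'text': snap['Text'].tolist()}
```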
Now we can begin constructing all our other frames:
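One trace per year, each wrapped in a named frame so the slider can target it (again on a toy `df_flat`):

```python
import pandas as pd

df_flat = pd.DataFrame({'Year': [1999, 1999, 2000, 2000],
                        'State Code': ['AL', 'AK', 'AL', 'AK'],
                        'Crude Rate': [4.8, 5.6, 5.0, 6.1],
                        'Text': ['Alabama', 'Alaska', 'Alabama', 'Alaska']})

def year_trace(year):
    """Choropleth trace dict for a single year's slice of df_flat."""
    snap = df_flat[df_flat['Year'] == year]
    return {'type': 'choropleth', 'locationmode': 'USA-states',
            'locations': snap['State Code'].tolist(),
            'z': snap['Crude Rate'].tolist(),
            'text': snap['Text'].tolist()}

# One named frame per year; the name is what the slider animates to
frames = [{'data': [year_trace(y)], 'name': str(y)}
          for y in sorted(df_flat['Year'].unique())]
```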
Now we can define our color range; a good tool I used for this can be found here (be sure to select 11):
Next, we construct our main data dictionary, which just ties together the first few things we defined along with some titles:
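In shape, something like this (trace, colorscale, and colorbar title; the names and the colorbar label are illustrative):

```python
# Stand-ins for the initial trace and colorscale built above
initial_trace = {'type': 'choropleth', 'locationmode': 'USA-states',
                 'locations': ['AL', 'AK'], 'z': [4.8, 5.6]}
scl = [[0, '#fff5f0'], [1, '#67000d']]

# Merge the trace with the color settings and a colorbar title
data = [dict(initial_trace,
             colorscale=scl, zmin=0, zmax=30,
             colorbar={'title': 'Deaths per 100,000'})]
```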
Now we need to make our sliders. This is super important because it allows the user-controllable slider bar to match up with each frame:
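A sketch of the slider structure Plotly uses for frame animations: one step per year, each animating to the frame of the same name (durations are illustrative):

```python
# One slider step per year in the dataset
years = range(1999, 2017)
steps = [{'method': 'animate',
          'label': str(y),
          'args': [[str(y)],   # frame name to animate to
                   {'frame': {'duration': 300, 'redraw': True},
                    'transition': {'duration': 300}}]}
         for y in years]

sliders = [{'active': 0,
            'currentvalue': {'prefix': 'Year: '},
            'steps': steps}]
```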
The slider basically defines the transition effects and which frame each step will show.
The very last part is defining the layout, which just specifies how the graph will look:
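Roughly this shape: restrict the map to the US and attach the sliders (title text is illustrative; the empty `sliders` list stands in for the one built above):

```python
layout = {'title': 'Opioid Overdose Death Rate per 100,000 (1999-2016)',
          'geo': {'scope': 'usa',                 # US-only map
                  'projection': {'type': 'albers usa'},
                  'showlakes': True,
                  'lakecolor': 'rgb(255, 255, 255)'},
          'sliders': []}                          # attach the sliders here
```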
Now we can finally graph!
Take a look here:
Clone the repo and try it with a different dataset:
A takeaway from this whole ordeal is that data science and data analytics are not just buzzwords; they are powerful tools and ways of processing data that allow us to collect information, draw new conclusions, and convey what we find in fascinating, efficient, and interesting ways.