Understanding Over-Policing using Python

Johnathan Padilla
4 min read · Jun 15, 2020


Over the following weeks, I will be taking part in a mentorship program put on by ParsonsTKO and TechSoup, where I will work on projects focused on data analysis. Each week has a specific theme tied to a part of the data life cycle. This week’s article covers how to find data that fits your project scope and how to load it into Python for analysis. This is a collaborative project, and others in my group have written articles on the same data using different tools. I will link those articles here:

Juliana Albertini: ParsonsTKO & TechSoup Data Strategy Mentorship Program: Week 2 Recap

Sebastian Martin Perez: Dog Paddling in the Deep End of Data Science

Goals for the Week

Last week’s topic was data collection and the overall search for data. Our group decided the topic most worth exploring in the current social context is the prosecution of crime relative to law enforcement spending. Our initial questions: Have police interactions risen alongside budgets? Is police spending increasing in proportion to population growth? Is policing getting more dangerous? These were great questions to prime my search for data. My goal for the week was to solidify my understanding of data collection and to get familiar with the problem of over-policing.

Collecting the Data

Juliana Albertini found a great website that has collected data on police at the federal level. On the Bureau of Justice Statistics site, you will find a publications and products section that contains a zipped data folder called Federal Justice Statistics, 2015–2016. This folder holds several datasets. Their format is not what we would like to load into a Jupyter Notebook, so it will take some preliminary cleaning before loading into Python. There are ways to load the datasets as is; however, that is slightly out of the scope of this project. For now, just download the zipped folder and explore some of the data in Excel.
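If you want a quick look at what is inside the archive before opening anything in Excel, Python’s standard library can list its contents. This is just a sketch: the filename below is a placeholder, so substitute whatever name the zip downloaded as.

```python
from pathlib import Path
import zipfile

# Placeholder filename -- use the name of the zip you actually downloaded
zip_path = Path("federal_justice_statistics_2015_2016.zip")

if zip_path.exists():
    with zipfile.ZipFile(zip_path) as z:
        # List every dataset bundled in the archive
        names = z.namelist()
        for name in names:
            print(name)
else:
    names = []
    print(f"{zip_path} not found -- download it from the BJS site first")
```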

The next dataset comes from The Tax Policy Center, curated by the Urban Institute. After you click “getting started,” the site brings you to a query process that allows you to pull data “primarily from the US Census Bureau” and extract it in CSV format. This website is great for pulling the data in a simple format and allows for a state-by-state breakdown of policing data.

For this week, we will stop here in terms of data collection.

Loading the Data

First, the data from the BJS source needs to be cleaned up. Below is an example of how the sheets may look when you first download them.

A dataset containing totals for each step of the arrest process over 22 years

The data in this particular set is the federal arrest data from 1994 to 2016. All we want are the column headers and the data below them. The easiest way to do this is to copy and paste the data into another sheet (all sheets needed/used will be in the GitHub link provided at the end of the article). You should be left with the following:

Cleaned dataset
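If you would rather skip the copy-and-paste step, pandas can drop the title rows above the headers and the footnotes below the data at load time. The snippet below illustrates the idea on a small in-memory CSV standing in for the real sheet; with the actual Excel file you would call pd.read_excel with the same skiprows/skipfooter arguments, adjusted to wherever the header row sits in your download.

```python
import io
import pandas as pd

# Stand-in for a BJS sheet: two title rows above the headers, one footnote below
raw = io.StringIO(
    "Table 1. Federal arrests\n"
    "1994-2016\n"
    "Year,Arrests\n"
    "1994,100\n"
    "1995,120\n"
    "Note: see methodology.\n"
)

# skiprows drops the title rows; skipfooter drops the footnote
# (skipfooter requires the python parsing engine)
df = pd.read_csv(raw, skiprows=2, skipfooter=1, engine="python")
print(df)
```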

As for the data from the Tax Policy Center website, we can just load it as is, which is super cool!

Porting to Python

Okay, now that the data is in the correct format, we can go ahead and open up a Jupyter Notebook (if you are not comfortable with Jupyter Notebooks, look out for my comprehensive guide to getting started with Notebooks). Make sure that the notebook is in the same folder as the datasets; otherwise, the notebook will not be able to see them.
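One quick way to confirm the notebook can see the files is to list the spreadsheets in its working directory. The .xlsx pattern here is an assumption based on the files described above; adjust it if you saved the data under different extensions.

```python
from pathlib import Path

# List every Excel file sitting next to the notebook; your arrest and
# expenditure sheets should appear here before you try to load them
spreadsheets = sorted(p.name for p in Path.cwd().glob("*.xlsx"))
print(spreadsheets)
```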

First, we must import the packages needed to load the datasets. For our purposes, we only need two: Pandas and IPython.display.

import pandas as pd
from IPython.display import display

Next, we can use the read_excel() function like so:

# if your data is in CSV or txt format, use pd.read_csv instead
# note that you must specify the delimiter used when reading a txt file
Arrest = pd.read_excel('name of file.xlsx')
Expend = pd.read_excel('name of file.xlsx')
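Since the comment mentions specifying a delimiter for text files, here is a small illustration with tab-separated data held in memory; for a real file you would pass the path instead of the StringIO object.

```python
import io
import pandas as pd

# Tab-separated text standing in for a .txt download
tsv = io.StringIO("Year\tArrests\n1994\t100\n1995\t120\n")

# sep tells pandas which delimiter the file uses (here, a tab)
df = pd.read_csv(tsv, sep="\t")
print(df.shape)  # (2, 2)
```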

Running this code creates two DataFrames, Arrest and Expend, which we can then print to an output cell like this:

# use the display function when you want to see multiple tables in one output cell
display(Arrest.head())
display(Expend.head())

Now we have two datasets that are loaded into a Jupyter Notebook and ready for cleaning.
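Before cleaning, it helps to size up each table first. The calls below are shown on a tiny stand-in frame, but they apply equally to the Arrest and Expend frames loaded above.

```python
import pandas as pd

# Small stand-in frame; run the same calls on Arrest and Expend
df = pd.DataFrame({"Year": [1994, 1995], "Arrests": [100.0, None]})

print(df.shape)   # number of rows and columns
print(df.dtypes)  # which columns loaded as text vs. numbers

# Count missing values per column -- a first target for cleaning
missing = df.isna().sum()
print(missing)
```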

GitHub: https://github.com/Johnpadilla-personal/Over-policing-analysis
