An Introduction to Data Cleaning: Using Regular Expressions to Clean your Data

Johnathan Padilla
4 min readJun 24, 2020

Johnathan Padilla, Thomas Kidd, and Hadrien Picq are fellows in the TechSoup and ParsonsTKO Summer 2020 Data Strategy Mentorship Program, a select group of upcoming data analysts and scientists to learn more about what the social impact and non-profit analytics sectors look like from industry professionals.

Regex is a Superpower

Unlike powers inherited by radioactive spiders, regex can be used as irresponsibly as you want. There are limits to this power but when used skillfully those limits can be largely ignored. I’ll show you a quick example of regex’s power. Say you are given a string like the following:

Words = 'the quick brownTfox runs super! fast: '

The sentence is all garbled and has characters that we do not want to keep. We can correct and recreate this sentence using regex. First, you have to split the string on the characters of your choice.

split = re.split('T|!|:',Words)
split

I chose to split on the ‘T’, ‘!’ and the ‘:’ which yields a list of only the desired characters

['the quick brown', 'fox runs super', ' fast', ' ']

From here I would want to recombine the strings to one string. I can do this by adding each portion of the list to each other like so

corrected_words = split[0] + split[3] + split[1] + split[2]
corrected_words

Be sure to note that I placed the space character in between ‘brown’ and ‘fox’ as no space existed there in the first place. Printing the new string should look like this:

'the quick brown fox runs super fast'

Using the regex method is a faster way of cleaning text data that is simpler and quicker to use compared to manually splitting the strings. Now that we have established a basis for using regex I can dive into the data being used for the project.

Overview of the Data

Sample of data

The data above was pulled straight from OpenAQ’s S3 bucket using AWS Athena. The data was exported into CSV format and read into a python notebook using pd.read_csv(). Each column contains information ranging from date to source type. Much of the data is stored in a list within the cell so we will have to work around that to access the actual values. There are also some columns that will be useless later on. Let’s get to cleaning!

Cleaning the Data

First, we have to import the necessary packages and load the dataset into the notebook:

import pandas as pd
import re
df = pd.read_csv('18.01.01 - 18.01.29.csv')

Now that the dataset is loaded in we first want to get rid of some useless columns.

df = df.drop(columns = {'city', 'country', 'sourcename', 'mobile', 'sourcetype', 'attribution', 'averagingperiod'})

Next, looking at the date column we can see that the dates and times are inside of a bracketed list with two different types of DateTime assignments. The first DateTime type is in UTC and the second is local time. For easy analysis, we would want the local date in its own column. The same goes for the local times. We can do this by leveraging regex.

def get_date(data):
'''
yeilds cleaned date and creates new date column to use for time series plotting
'''

dates = []
times = []
i = 0
while i <= len(data)-1:
time_split = re.split('{ |=|T|-', data['date'][i])
date_split = re.split('{ |=|T',data['date'][i])
dates.append(date_split[1])
times.append(time_split[-2])
i +=1
df['time'] = times
df['date'] = dates

In the function above I used the re.split() function to pull the local date and local time separately from the original date column. I iterate this process until I have gone through every row of the dataset. I append the correct values to initially empty lists so they can be assigned to new columns once it is done iterating. Note that I assigned the dates to the ‘date’ column which will overwrite the data that was there originally; if you want to retain this data assign the cleaned date to a column named something other than ‘date’.

Final dataframe yielded by the function get_date()

One optional cleaning would be to organize the dataset in date order. The same process can be used to get latitude and longitude individually.

Recap

We learned about regex and just one of its many applications. We also went over how to use regex in a function that cleans a data set containing air quality data. For me, this dataset is 90% ready for exploratory analysis. I would do a few more stylistic cleaning techniques to make the dataset a little more friendly to work with. Be on the lookout for the next installment of this series where I will be getting into the actual analysis of the data.

--

--

No responses yet