It’s remarkable how much it’s possible to do on modern computer hardware. Sometimes though, to make the most of this hardware, it’s important to use the right software tools for the job.
In this post we’re going to show how it’s possible to easily clean a data file with 20 million life policies using Python on a modest modern laptop.
The hardware we’re going to use has 8 gigabytes of memory and uses a recent (2020) Intel quad core processor. It’s likely that your everyday work laptop has roughly this specification or better.
The data file is a comma delimited text file with 20 million rows. Each row has 34 fields with fields representing the full range of data needed to model a life protection policy. The file is around 2.3 Gigabytes in size.
There are a few problems with this data though if we want to use it in our modelling. It has dates of birth rather than age at entry and some fields that we don’t need. So we need to do some simple cleaning.
As we’re using Python the obvious tool to use is pandas the popular data analysis and manipulation library. Let’s try reading the file using pandas in a Jupyter notebook.
Uh-oh. We immediately get an out of error. There just isn't enough memory to read a file of this size.
It’s worth saying here that for the majority of tasks of this type, pandas will work fine. For example, in this case it will read a 10 million policy file with no problems.
However, even with a smaller policy file, reading everything into memory, as pandas does, is inherently inefficient. If we want to perform complex operations on the file then it's very likely we will run out of memory at a later point.
It would be possible to write a program that uses the Python standard library to read this file line by line, break each line into individual fields and do the required cleaning. This sounds like quite a lot of work though.
There is a much better way. Enter the Python library petl.
Put simply, petl allows you to specify transformations of tabular data with that data being streamed, ie read line by line, rather than read into memory in one go.
The program to clean this data file is remarkably short:
import petl as etl def clean_data(table): table = etl.cutout(table, "region") table = etl.addfield(table, 'entry_year', lambda r: date_to_year(r['entry_date'])) table = etl.addfield(table, 'life_1_age_at_entry_years', lambda r: age_at_entry(r['life_1_birth_date'], r['entry_date'])) table = etl.addfield(table, 'life_2_age_at_entry_years', lambda r: age_at_entry(r['life_2_birth_date'], r['entry_date'])) table = etl.addfield(table, 'life_1_age_at_entry_months', 0) table = etl.addfield(table, 'life_2_age_at_entry_months', 0) table = etl.cutout(table, 'life_1_birth_date') table = etl.cutout(table, 'life_2_birth_date') table = etl.cutout(table, 'entry_date') return table s = "/home/jovyan/protection.csv" d = "/home/jovyan/protection_cleaned.csv" t = etl.fromcsv(s) c = clean_data(t) etl.tocsv(c, d)
It’s also very easy to understand. Each line in the clean_data function describes how we are to transform individual fields in the data file.
We've also used a couple of functions 'age_at_entry' and 'date_to_year' (not shown here) that generate dates and ages from the data supplied.
The full Jupyter notebook with this code is available for download here:
Running this on my laptop it takes around 22 minutes to read, clean and then write the 20 million policy data file.
Even better, memory use is minimal: only a few Megabytes of additional memory is used to do this. This makes integrating this code into a much larger workflow very straightforward.
There is a lot more to petl too. Its capabilities include:
- Reading from a range of data sources, including Excel xlsx files and XML files.
- Generating and range of statistics for the values read.
- Reading from and writing to databases using the Python library SQLAlchemy.
Of course, petl isn’t a complete replacement for pandas. For simple cleaning of large data files, though, it’s a great alternative.
It’s astonishing how capable modern computers can be with the right software.
Not all of us will need to deal with data-files with 20 million policies but it’s good to know that we have the power and the tools to do so if we need to.
Please subscribe to receive email updates of future posts, using the button on the bottom right.
Please also get in touch if you’d like to discuss any topic we cover or share how you're using data science tools in your work: contact details are on the 'Contact' page accessible through the menu at the top of this page. Or alternatively share your thoughts in the comments below.