How to Solve Missing Data in Python

Two data scientists plug a Python pain point with their imputation package Autoimpute.

Written by Philip Baker.
5-minute read

  • Data Analytics
  • Technology and Innovation

In Brief

  1. Working with incomplete data sets is one of the hardest challenges data scientists face.
  2. Unlike other languages, the Python programming language had no robust approach to imputation before Shahid Barkat and Joseph Kearney built Autoimpute.
  3. Key to their package’s success is users’ ability to choose the imputation method most relevant to the specific context of their data.

 

Missing data is one of the main challenges data scientists face when preparing data to model.

Due to the complexity of the problem, most programming languages have no built-in way to solve it, instead leaving it up to the user to develop a way to impute the data on their own.

As part of their capstone project, Shahid Barkat and Joseph Kearney, 2019 graduates of the Master of Science in Analytics (MScA), developed an imputation package named Autoimpute that goes a long way toward solving this issue for the Python programming language.

“Our goal was to create usable software that would help people regardless of their industry,” Kearney says. “Further, we wanted to put it out there and see how well it performs in the real world.”

They met both of these goals. After they graduated and published their package on GitHub, a dedicated following of users grew around Autoimpute, leading Barkat and Kearney, in the months before COVID-19, to travel to Python data conferences to present their work. Since then, inquiries from users have continued to arrive regularly.

“We’re happy with its success in the open-source community from a recognition standpoint and also for our personal careers,” Barkat says. “The response we ended up getting was much more than we expected when we first started this project.”

Both Barkat and Kearney have settled into new jobs since graduating, a fact they attribute to having Autoimpute to boast about, along with their UChicago MScA degrees. Barkat works in Chicago as a customer engineer for Google, while Kearney is a senior analytics engineer at Seated in New York.

“Having a proven example of your work in addition to the degree really helps people get a sense for what you can offer,” Kearney says. “It serves to cut through some of the intense competition you find in the field today.”

The hard work of cleaning data

When Barkat and Kearney began their project, they did not foresee that they would focus on missingness. Through discussions with their faculty advisor, Arnab Bose, chief scientific officer at Abzooba, it became clear that missing data was one of the key pain points for data scientists working today, particularly for those working in Python.

From a student’s perspective, this might have come as a surprise. Although cleaning data takes up nearly 80% of a data scientist’s working time by a commonly cited estimate, dealing with messy data is not a central feature of coursework, which tends to focus on the intricacies of algorithms and machine learning tools. For students, data sets tend to be perfect.

Once Barkat and Kearney appreciated the significance of this lacuna in Python, however, they realized that imputation was an area where they could add some real value.

“In the data science community, you have all these cool machine learning methods, and those new to the field want to jump straight to that,” Kearney says. “But in the end, your results are only as good as the data you put in. We wanted to take a step back and acknowledge that even if dealing with missing data isn’t sexy, it is required—especially if you want your analysis to yield results that people take seriously.”

A very modular package

Part of the reason for Python’s increasing popularity with data scientists is its broad applicability across industries. To accommodate this wide community, Barkat and Kearney incorporated an array of the most effective imputation methods into Autoimpute, giving users the power to choose the methods most relevant to the specific context of their data.

“The goal for our package was to be industry agnostic,” Kearney says. “The object was to develop a package that allowed users, regardless of their industry, to choose a method that could learn the specific characteristics of their data and then give the analyst the further power to choose methods specific to particular situations.” 

Barkat notes that the aspect to highlight is Autoimpute’s modularity. 

“There’s a lot of flexibility for practitioners using our package. They can adjust the parameters and choose the different imputation methods they’d like to use. They can change the number of times that a data set gets imputed. Users can add imputation methods of their own if they want as well. They just have to follow the approach documented in our GitHub, and we can bolt it on.”
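To make the idea of modularity concrete, here is a minimal sketch in plain Python of what pluggable imputation can look like: a strategy is chosen per column, the number of imputed copies is a parameter, and a user-defined method can be registered alongside the built-in ones. This is not Autoimpute's API. The names `STRATEGIES`, `register_strategy`, and `impute` are hypothetical and exist only for illustration, and the three strategies shown are deliberately simple stand-ins for the methods a real package would offer.

```python
import random
import statistics
from typing import Callable, Dict, List, Optional

# A column is a list of numbers in which None marks a missing value.
Column = List[Optional[float]]
# A strategy takes a column plus a random generator and returns a filled copy.
Strategy = Callable[[Column, random.Random], Column]


def _observed(column: Column) -> List[float]:
    """Return only the values that are present."""
    return [v for v in column if v is not None]


def mean_strategy(column: Column, rng: random.Random) -> Column:
    fill = statistics.mean(_observed(column))
    return [fill if v is None else v for v in column]


def median_strategy(column: Column, rng: random.Random) -> Column:
    fill = statistics.median(_observed(column))
    return [fill if v is None else v for v in column]


def random_draw_strategy(column: Column, rng: random.Random) -> Column:
    # Each gap gets a value sampled from the observed ones, so repeated
    # imputations of the same column can differ from one another.
    seen = _observed(column)
    return [rng.choice(seen) if v is None else v for v in column]


# Hypothetical registry of available methods (illustration only).
STRATEGIES: Dict[str, Strategy] = {
    "mean": mean_strategy,
    "median": median_strategy,
    "random_draw": random_draw_strategy,
}


def register_strategy(name: str, fn: Strategy) -> None:
    """Add a user-defined imputation method to the registry."""
    STRATEGIES[name] = fn


def impute(
    data: Dict[str, Column],
    strategy: Dict[str, str],
    n: int = 1,
    seed: int = 0,
) -> List[Dict[str, Column]]:
    """Return n completed copies of data, using one strategy per column."""
    rng = random.Random(seed)
    return [
        {name: STRATEGIES[strategy[name]](col, rng) for name, col in data.items()}
        for _ in range(n)
    ]


# Example: a different method per column, and three imputed copies.
data = {
    "age": [30.0, None, 50.0, 40.0],
    "income": [40.0, None, 60.0, 80.0],
}
copies = impute(data, {"age": "mean", "income": "random_draw"}, n=3, seed=1)
```

Adding a custom method under these assumptions is one call, for example `register_strategy("zero", lambda col, rng: [0.0 if v is None else v for v in col])`, after which `"zero"` can be named in the per-column mapping like any built-in strategy. The original `data` is left untouched, because every strategy returns a new list.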

Excited for future growth

In the end, Barkat points to Autoimpute’s full set of methods and end-to-end framework for imputation as key to its success. “I think that’s where we found a niche. Our package is solely focused on imputation. We’re not trying to do anything else outside of that scope.” 

Busy with their new jobs these days, Barkat and Kearney nevertheless have ambitions for the package and are excited to work with others who want to contribute. In their GitHub package documentation, they discuss some potential directions for growth, which include scaling the existing methods and integrating the package with other Python packages.

“Two people have already made meaningful contributions that have improved the package,” Barkat notes, “and numerous others have suggested changes or edits to make the package more robust. Right now, we’re continuing to support the package, answer queries, fix bugs, and engage with ideas in the community.”

Kearney agrees, adding, “Even if we met our original goal, which was to build something others could use in their day-to-day, we still see a lot of exciting directions and possibilities for Autoimpute’s future.”

Philip Baker

Staff Writer

Philip Baker is a staff writer at the University of Chicago. He graduated from the College with a degree in English.