This post is part of our data science and analytics learning series. Please contact us if you have a topic you would like us to cover!
In data science, we want to avoid GIGO, or Garbage In, Garbage Out. In other words, your output can only be as good as your input. We build models with machine learning to make decisions based on the model’s predictions – for competitive advantage or to anticipate behavior, for example.
For more trustworthy results, we want to train our models on good, clean data.
What is Data Prep, and Why Does It Matter?
Data preparation is a critical step in any data science pipeline. Some examples of data prep:
- Remove or replace null or empty values. For example, replace empty entries in an “Age” column with the average of all of the other values in the column
- Create completely new columns. For example, calculate a “Profit” column from “Revenue” and “Costs” columns
- Enhance your data. For example, extend datasets with other related datasets to expand the context.
The initial quality and intended use of your data determine how much prep work you’ll need to do.
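The three examples above can be sketched in a few lines of pandas. This is a minimal illustration rather than a prescribed workflow: the sample values, the "CustomerID" key, and the "Region" lookup table are made up for the example.

```python
import pandas as pd

# Hypothetical sample data; column names follow the examples in the text.
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Age": [34.0, None, 52.0, 28.0],
    "Revenue": [1200.0, 800.0, 1500.0, 400.0],
    "Costs": [700.0, 650.0, 900.0, 450.0],
})
# Hypothetical related dataset used to add context.
regions = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Region": ["West", "East", "East", "South"],
})

# Replace empty "Age" entries with the mean of the known values.
customers["Age"] = customers["Age"].fillna(customers["Age"].mean())

# Create a new "Profit" column from the existing columns.
customers["Profit"] = customers["Revenue"] - customers["Costs"]

# Enhance the data by joining the related dataset on a shared key.
prepared = customers.merge(regions, on="CustomerID", how="left")
print(prepared)
```

Filling with the mean is only one option. Depending on your data, the median or a dedicated "unknown" category may be a better fit.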
How to Prepare Your Data for Machine Learning
1. Cull Data to Focus
Machine learning models are designed to pick out trends in your data. But too much noise can hamper how the model identifies patterns. To avoid distracting the model, you’ll want to remove incorrect and unrelated data so that the model avoids irrelevant or obvious trends.
For example, if you have retail sales data that’s broken down by hour, it might make sense to remove any data that falls outside of business hours, such as early mornings and late evenings. This discards a trend that is easily explained: sales are lower after hours than during business hours. The model can then focus on the more subtle (and interesting) trends that occur throughout the business day.
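As a rough sketch of the business-hours filter, assuming a "Timestamp" column and a 9:00 to 17:00 trading window (both invented for this example), you could keep only the rows inside that window:

```python
import pandas as pd

# Hypothetical hourly sales data.
sales = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2024-03-04 07:00", "2024-03-04 10:00",
        "2024-03-04 14:00", "2024-03-04 20:00",
    ]),
    "Sales": [12.0, 340.0, 410.0, 25.0],
})

# Assumed business hours; adjust to match your own opening times.
OPEN_HOUR, CLOSE_HOUR = 9, 17

# Keep rows whose hour falls in [OPEN_HOUR, CLOSE_HOUR).
hours = sales["Timestamp"].dt.hour
business_sales = sales[(hours >= OPEN_HOUR) & (hours < CLOSE_HOUR)]
print(business_sales)
```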
Other examples to look for are:
- Remove outliers. For example, spotlight trends in donations data by discarding the very small or very large donations.
- Ensure correct grouping. For example, make sure a “State” column represents all values homogeneously, by using either the two-letter abbreviation or the full name for each state.
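Both ideas can be sketched with pandas. The donation amounts, the partial state lookup, and the 10th and 90th percentile cutoffs below are all assumptions chosen for illustration; the right thresholds depend on your data.

```python
import pandas as pd

# Hypothetical donations with inconsistent state labels.
donations = pd.DataFrame({
    "State": ["CA", "California", "ny", "New York", "TX", "CA"],
    "Amount": [50.0, 75.0, 1.0, 60.0, 25000.0, 40.0],
})

# Partial, hypothetical lookup from full state names to abbreviations.
STATE_ABBREVIATIONS = {"california": "CA", "new york": "NY", "texas": "TX"}

def to_abbreviation(value: str) -> str:
    """Return a two-letter code; unknown values are simply upper-cased."""
    cleaned = value.strip()
    return STATE_ABBREVIATIONS.get(cleaned.lower(), cleaned.upper())

# Ensure correct grouping: one representation per state.
donations["State"] = donations["State"].map(to_abbreviation)

# Remove outliers: keep donations between the assumed percentile cutoffs.
low = donations["Amount"].quantile(0.10)
high = donations["Amount"].quantile(0.90)
typical = donations[donations["Amount"].between(low, high)]
print(typical)
```

Percentile cutoffs are one simple choice. Fixed business thresholds or a rule based on the interquartile range are common alternatives.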
2. Normalize for Accuracy
If your data uses varying units of measurement across columns, consider normalizing your data to a standard unit of measurement. This helps your model avoid treating large numbers in one feature, or column, as more important than smaller numbers in a different feature.
For example, if your data includes one feature that uses pounds and another that uses tons, consider choosing one unit of measurement (either pounds or tons) for both features. With shared units across features, your model can better assess their relationship.
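A minimal sketch of the pounds-and-tons example follows. The column names are hypothetical, and the conversion factor assumes US short tons (2,000 pounds); use 2,240 for long tons or convert to kilograms for metric tonnes.

```python
import pandas as pd

# Assumption: "tons" means US short tons.
POUNDS_PER_TON = 2000

# Hypothetical shipment data with mixed units.
shipments = pd.DataFrame({
    "PackageWeightLb": [150.0, 900.0, 2400.0],
    "TruckCapacityTons": [1.0, 2.5, 5.0],
})

# Express both features in pounds, then drop the column in tons.
shipments["TruckCapacityLb"] = shipments["TruckCapacityTons"] * POUNDS_PER_TON
shipments = shipments.drop(columns="TruckCapacityTons")
print(shipments)
```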
3. Understand the Task and Metrics Before Training
Your intuition and grasp of your data are important pieces of the equation as you refine your models. Before training a model, zoom in on a specific problem or task and ensure your data addresses it. You might want to create new features or metrics from existing data that are relevant to trends you’re trying to analyze or predict.
For example, if your dataset includes a “Date” column, you could derive a Boolean “Holiday” column from it, so that your analysis can single out the days that impact your bottom line.
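One way to derive such a column is to check each date against a list of holidays. The two dates below are placeholders; in practice the list would come from your own business calendar.

```python
import pandas as pd

# Hypothetical holiday list; replace with the dates relevant to your business.
HOLIDAYS = pd.to_datetime(["2024-07-04", "2024-12-25"])

# Hypothetical daily sales data.
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2024-07-03", "2024-07-04", "2024-12-25"]),
    "Sales": [1000.0, 2300.0, 150.0],
})

# True when the date appears in the holiday list, False otherwise.
daily["Holiday"] = daily["Date"].isin(HOLIDAYS)
print(daily)
```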
4. Bin Data to Focus Models
Continuous data can take any value within a range. The weights of people working in a warehouse, for example, are continuous values, as opposed to the grouped weights of distinct products in the warehouse. When preparing your data, especially for tree-based models, consider binning continuous values into a manageable number of buckets, usually between three and ten. Discretizing data helps your model focus on broader trends that noisier data would otherwise hide, and it simplifies interpretations and decisions.
For example, assess whether it is meaningful, given your chosen model, to separate:
- An “Age” column into common demographic age ranges, such as 18-24, 25-34, etc.
- A “Loan Approval” column into high, medium, and low bins.
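The "Age" example can be sketched with pandas' cut function. The bin edges and labels here are assumptions based on the ranges mentioned above, with a hypothetical upper limit of 120.

```python
import pandas as pd

# Hypothetical ages.
people = pd.DataFrame({"Age": [18, 24, 25, 34, 47, 70]})

# Assumed bin edges and labels; six buckets in total.
edges = [18, 25, 35, 45, 55, 65, 120]
labels = ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"]

# right=False makes each bin include its lower edge and exclude its upper
# edge, so 25 falls into "25-34" rather than "18-24".
people["AgeGroup"] = pd.cut(
    people["Age"], bins=edges, labels=labels, right=False
)
print(people)
```

Values outside the edges (for instance, ages under 18) become missing values with this setup, so check the range of your data before choosing edges.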
5. Hone Time Data with Lag or Moving Averages
When working with time series data, such as sales or growth over time, a lag or a moving average helps your model focus on trends instead of the short-term fluctuations between smaller units of time, such as hours or days.
For example, creating a 14-day moving average can expose two-week cycles that might otherwise be obscured.
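Both features are short to express in pandas. In this sketch the daily sales figures are synthetic, and the 7-day lag is an arbitrary choice for illustration; only the 14-day window comes from the example above.

```python
import pandas as pd

# Synthetic daily sales for 30 consecutive days.
daily_sales = pd.DataFrame(
    {"Sales": list(range(100, 130))},
    index=pd.date_range("2024-01-01", periods=30, freq="D"),
)

# Lag feature: the sales value from 7 days earlier (assumed lag length).
daily_sales["SalesLag7"] = daily_sales["Sales"].shift(7)

# Moving average: the mean of the current day and the 13 days before it.
daily_sales["SalesMA14"] = daily_sales["Sales"].rolling(window=14).mean()

print(daily_sales.tail())
```

The first rows of each new column are empty because there is not enough history yet (7 rows for the lag, 13 for the moving average). You will usually want to drop or fill them before training.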
Measure Twice, Cut Once
Data preparation is a crucial aspect of your data science workflow. We’ve covered a few tips and tricks to help you prepare your data to boost the trustworthiness of your models and analysis. For more tips, visit our Machine Learning Best Practices, where we also address model training.