For more trustworthy results, we want to train our models on good, clean data.
What is Data Prep, and Why Does It Matter?
Data preparation is a critical step in any data science pipeline. Some examples of data prep:
- Remove or replace null or empty values. For example, replace empty entries in an “Age” column with the average of the other values in that column.
- Create completely new columns. For example, calculate a “Profit” column from “Revenue” and “Costs” columns.
- Enhance your data. For example, extend datasets with other related datasets to expand the context.
The initial quality and intended use of your data dictate the amount of prep work you’ll want to do.
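As a rough illustration of the first two examples above, here is a minimal pandas sketch. The sample values are made up for the illustration; only the “Age”, “Revenue”, “Costs”, and “Profit” column names come from the examples.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the examples above.
df = pd.DataFrame({
    "Age": [34.0, None, 28.0, 46.0],
    "Revenue": [1200.0, 950.0, 400.0, 780.0],
    "Costs": [800.0, 500.0, 450.0, 300.0],
})

# Replace missing ages with the mean of the known ages.
# Series.mean() skips missing values by default.
mean_age = df["Age"].mean()
df["Age"] = df["Age"].fillna(mean_age)

# Derive a new column from two existing ones.
df["Profit"] = df["Revenue"] - df["Costs"]

print(df)
```

Mean imputation is only one option. Depending on the data, a median or a group-specific average may distort the column less.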
How to Prepare Your Data for Machine Learning
1. Cull Data to Focus
Machine learning models are designed to pick out trends in your data. But too much noise can hamper how the model identifies patterns. To avoid distracting the model, you’ll want to remove incorrect and unrelated data so that the model avoids irrelevant or obvious trends.
Ensure correct grouping. For example, make sure a “State” column represents all values homogeneously, by using either the two-letter abbreviation or the full name for each state.
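One way to make a “State” column consistent is to map every entry to its two-letter abbreviation. The sketch below assumes a small, hypothetical lookup table and a hypothetical helper named to_abbreviation; a real table would need to cover every state you expect to see.

```python
import pandas as pd

# Hypothetical sample data with mixed spellings of the same states.
df = pd.DataFrame({"State": ["WI", "Wisconsin", "wi ", "California", "CA"]})

# Hypothetical lookup table; a real one would list every state.
STATE_ABBREVIATIONS = {"wisconsin": "WI", "california": "CA"}

def to_abbreviation(value: str) -> str:
    """Return the two-letter form of a state name or abbreviation."""
    cleaned = value.strip()
    # Full names go through the lookup; anything else is assumed to
    # already be an abbreviation and is simply upper-cased.
    return STATE_ABBREVIATIONS.get(cleaned.lower(), cleaned.upper())

df["State"] = df["State"].map(to_abbreviation)

print(df["State"].value_counts())
```

After the mapping, the five original spellings collapse into two groups, so a model or a group-by sees each state once.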
2. Normalize for Accuracy
If your data uses varying units of measurement across columns, consider normalizing your data to a standard unit of measurement. This helps your model avoid mistaking large numbers in one feature, or column, for a sign that the feature is more important than another feature with smaller numbers.
For example, if your data includes one feature that uses pounds and another that uses tons, consider choosing one unit of measurement—either pounds or tons—for both features. With shared units across features, your model can better assess their relationship.
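The pounds-and-tons example might look like the following sketch. The column names and values are hypothetical, and the conversion assumes US short tons (2,000 pounds each); metric tonnes or long tons would need a different factor.

```python
import pandas as pd

# Assumption: "tons" means US short tons.
POUNDS_PER_SHORT_TON = 2000

# Hypothetical sample data: one feature in pounds, another in tons.
df = pd.DataFrame({
    "package_weight_lb": [12.0, 40.0, 7.5],
    "truck_load_tons": [1.5, 2.0, 0.75],
})

# Convert the tons column to pounds so both features share one unit.
df["truck_load_lb"] = df["truck_load_tons"] * POUNDS_PER_SHORT_TON

# Drop the original column so the same quantity is not represented twice.
df = df.drop(columns=["truck_load_tons"])

print(df)
```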
3. Understand the Task and Metrics Before Training
Your intuition and grasp of your data are important pieces of the equation as you refine your models. Before training a model, zoom in on a specific problem or task and ensure your data addresses it. You might want to create new features or metrics from existing data that are relevant to trends you’re trying to analyze or predict.
For example, if your dataset includes a “Date” column, you could add a Boolean “Holiday” column so that your analysis can focus on the days that impact your bottom line.
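A Boolean “Holiday” column can be derived from “Date” with a simple membership check. The holiday list and the “Sales” figures in this sketch are hypothetical; in practice the list would come from your own business calendar.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-07-03", "2024-07-04", "2024-12-25"]),
    "Sales": [1000, 2500, 300],
})

# Hypothetical holiday calendar; replace with the dates that matter to you.
HOLIDAYS = pd.to_datetime(["2024-01-01", "2024-07-04", "2024-12-25"])

# True where the row's date appears in the holiday calendar.
df["Holiday"] = df["Date"].isin(HOLIDAYS)

# Example use: keep only the holiday rows for a focused analysis.
holiday_sales = df[df["Holiday"]]

print(holiday_sales)
```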
4. Bin Data to Focus Models
Continuous data is data that can take any value within a range. The weights of people working in a warehouse, for example, are continuous values, as opposed to the grouped weights of distinct products in the warehouse. When preparing your data, especially for tree-based models, consider binning continuous values into a manageable number of buckets, usually between three and ten. Discretizing data helps your model focus on broader trends otherwise hidden by noisier data, and it simplifies interpretations and decisions.
For example, bin a “Loan Approval” column into high, medium, and low bins.
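Here is a minimal sketch of three-way binning with pandas. It assumes the “Loan Approval” column holds a numeric score between 0 and 1, which the example above does not specify, and the bin edges (0.4 and 0.7) are arbitrary choices for illustration.

```python
import pandas as pd

# Assumption: the column holds a numeric approval score between 0 and 1.
df = pd.DataFrame({"loan_approval_score": [0.05, 0.33, 0.5, 0.72, 0.98]})

# Hypothetical bin edges. Each interval includes its right edge, and
# include_lowest=True makes the first interval include 0.0 as well.
df["loan_approval_bin"] = pd.cut(
    df["loan_approval_score"],
    bins=[0.0, 0.4, 0.7, 1.0],
    labels=["low", "medium", "high"],
    include_lowest=True,
)

print(df)
```

If you would rather have roughly equal numbers of rows in each bucket than fixed edges, pd.qcut is the quantile-based counterpart.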
5. Hone Time Data with Lag or Moving Averages
When working with time series data, such as sales or growth over time, a lag or a moving average helps your model focus on trends instead of the larger variations between smaller units, such as hours or days.
For example, creating a 14-day moving average can expose bi-weekly cycles that might otherwise be obscured.
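Both ideas can be sketched in a few lines of pandas. The daily sales values below are invented, and the 7-day lag is an arbitrary choice for illustration; only the 14-day window comes from the example above.

```python
import pandas as pd

# Hypothetical daily sales: 100 per day for two weeks, then 200 per day.
dates = pd.date_range("2024-01-01", periods=28, freq="D")
df = pd.DataFrame({"sales": [100.0] * 14 + [200.0] * 14}, index=dates)

# 14-day moving average: each value is the mean of that day and the
# 13 days before it, so the first 13 rows have no value yet.
df["sales_ma_14"] = df["sales"].rolling(window=14).mean()

# 7-day lag: each row sees the sales value from one week earlier.
df["sales_lag_7"] = df["sales"].shift(7)

print(df.tail())
```

The leading rows of both new columns are empty because there is not enough history to fill them, so you will usually drop or impute those rows before training.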
Measure Twice, Cut Once
Data preparation is a crucial aspect of your data science workflow. We’ve covered a few tips and tricks to help you prepare your data to boost the trustworthiness of your models and analysis. For more tips, visit our Machine Learning Best Practices Docs page, where we also address model training.