Data Discovery First!

Any meaningful analysis project starts with understanding your data. I can’t count the number of times I’ve gotten to what I thought was the end of a project, only to realize that I had some fundamental misunderstanding of the data. In a recent project to model marketing campaign performance for a brick-and-mortar retailer, we identified proximity to a store as the most important driver of campaign success. On closer scrutiny, it turned out the store distance metric was outdated and unreliable, forcing us back to the feature engineering stage.


Because of pitfalls like these, I always start a new project by digging into the data to make sure I have a solid understanding of it before going further. A straightforward way to tackle this is to check the distribution of each important column with a histogram. This helps me catch things like outliers, missing values, unexpected values, missing date ranges, and class imbalances.


The Describe skill in DataChat is an easy way to do this. It gives me a quick overview of key distribution metrics, such as mean, min, and max, and, even more importantly, it generates links to histograms for each column.
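
Outside of DataChat, a rough pandas equivalent of this step might look like the sketch below; the file name is a placeholder, not the actual dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; substitute your own dataset
df = pd.read_csv("world_happiness.csv")

# Key distribution metrics (count, mean, std, min, quartiles, max) per numeric column
print(df.describe())

# One histogram per numeric column to eyeball distributions, outliers, and gaps
df.hist(bins=30, figsize=(12, 8))
plt.show()
```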


Here we see a roughly normal distribution. In subsequent analyses, I can apply standard statistical techniques such as normalization or standard-deviation-based outlier detection, and visualize the results with something like a violin chart.
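
For a roughly normal column, a simple standard-deviation rule is one way to flag outliers. Here is a minimal sketch; the file and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("world_happiness.csv")    # hypothetical file name
col = df["Healthy Life Expectancy"]        # hypothetical, roughly normal column

# Standardize to z-scores and flag rows more than 3 standard deviations from the mean
z_scores = (col - col.mean()) / col.std()
outliers = df[z_scores.abs() > 3]
print(outliers)
```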

However, my “GDP Per Capita” column appears to follow a classic power law distribution, so I know to be careful of the effect of outliers. If I were going to train a machine learning model, I might transform this column onto a logarithmic scale first.
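
One way to do that, assuming a pandas workflow, is a log transform of the skewed column; log1p is used here simply to handle any zero values safely:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("world_happiness.csv")  # hypothetical file name

# Compress the long right tail of a power-law-like feature
df["Log GDP Per Capita"] = np.log1p(df["GDP Per Capita"])
```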


In another example, this huge class imbalance lets me know that I need to be careful of outliers and overfitting.
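
A quick way to quantify an imbalance like this is to look at the relative frequency of each class; the label column here is hypothetical:

```python
import pandas as pd

df = pd.read_csv("world_happiness.csv")  # hypothetical file name

# Share of rows in each class; values far from uniform signal an imbalance
print(df["Region"].value_counts(normalize=True))
```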


If you’d like to follow along, you can find the data that powered these charts at https://tinyurl.com/DataChatWHR.


As we can see, understanding your data before you begin your analysis can help you avoid common pitfalls and problems later on. Sometimes it may also point to your data being incorrect (like in the store distance example). So, validating the data is an important first step, and with the descriptive visual analytics tools in DataChat, this part is easy. By the way, if you aren’t an expert in distributions, don’t worry. DataChat can do many things automatically for you, especially when you start to use its automatic machine learning capabilities, such as Analyze. We’ll cover those capabilities in a later post.

5 Tips for Preparing Your Data for Machine Learning

This post is part of our data science and analytics learning series. Please contact us if you have a topic you would like us to cover!

In data science, we want to avoid GIGO, or Garbage In, Garbage Out. In other words, your output can only be as good as your input. We build models with machine learning to make decisions based on the model’s predictions – for competitive advantage or to anticipate behavior, for example.

For more trustworthy results, we want to train our models on good, clean data.

What is Data Prep, and Why Does It Matter?

Data preparation is a critical step in any data science pipeline. Some examples of data prep, sketched in code after the list below:

  • Remove or replace null or empty values. For example, replace empty entries in an “Age” column with the average of all of the other values in the column.
  • Create completely new columns. For example, calculate a “Profit” column from “Revenue” and “Costs” columns.
  • Enhance your data. For example, extend datasets with other related datasets to expand the context.
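
Here is a rough pandas sketch of those three steps; every file, table, and column name is hypothetical:

```python
import pandas as pd

df = pd.read_csv("sales.csv")          # hypothetical dataset
regions = pd.read_csv("regions.csv")   # hypothetical related dataset

# 1. Replace empty "Age" entries with the mean of the remaining values
df["Age"] = df["Age"].fillna(df["Age"].mean())

# 2. Create a completely new "Profit" column from existing columns
df["Profit"] = df["Revenue"] - df["Costs"]

# 3. Enhance the data by joining in a related dataset
df = df.merge(regions, on="Store ID", how="left")
```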

The initial quality and intended use of your data dictate the amount of prep work you’ll want to do.


How to Prepare Your Data for Machine Learning

1. Cull Data to Focus

Machine learning models are designed to pick out trends in your data, but too much noise can hamper how well a model identifies patterns. To keep the model focused, you’ll want to remove incorrect and unrelated data so that it doesn’t latch onto irrelevant or obvious trends.

For example, if you have retail sales data that’s broken down by hour, it might make sense to remove any data that falls outside of business hours, such as early mornings and late evenings. This helps your model by discarding trends that are easily explainable (sales are lower after hours than during business hours) and allows it to focus on the more subtle, and more interesting, trends that occur throughout the business day.
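
As an illustration, a filter along these lines keeps only business-hour rows; the column names and the 9-to-5 window are assumptions:

```python
import pandas as pd

sales = pd.read_csv("hourly_sales.csv", parse_dates=["Timestamp"])  # hypothetical dataset

# Keep only rows whose hour falls within an assumed 9:00-17:00 business window
business_hours = sales[sales["Timestamp"].dt.hour.between(9, 17)]
```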

Other examples to look for, sketched in code below:

  • Remove outliers. For example, spotlight trends in donations data by discarding the very small or very large donations.
  • Ensure correct grouping. For example, make sure a “State” column represents all values homogeneously, by using either the two-letter abbreviation or the full name for each state.
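
A rough pandas sketch of both checks, with hypothetical file and column names:

```python
import pandas as pd

donations = pd.read_csv("donations.csv")  # hypothetical dataset

# Remove outliers: keep donations between the 1st and 99th percentiles
low, high = donations["Amount"].quantile([0.01, 0.99])
donations = donations[donations["Amount"].between(low, high)].copy()

# Ensure correct grouping: map full state names to two-letter abbreviations
state_map = {"Wisconsin": "WI", "California": "CA"}  # extend as needed
donations["State"] = donations["State"].replace(state_map)
```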

2. Normalize for Accuracy

If your data uses varying units of measurement across columns, consider normalizing your data to a standard unit of measurement. This helps your model avoid mistaking large numbers in one feature, or column, as more important than smaller numbers in a different feature.

For example, if your data includes one feature that uses pounds and another that uses tons, consider choosing one unit of measurement—either pounds or tons—for both features. With shared units across features, your model can better assess their relationship.
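
For instance, assuming a pandas workflow and hypothetical column names, you might convert everything to pounds, or min-max scale all numeric features:

```python
import pandas as pd

shipments = pd.read_csv("shipments.csv")  # hypothetical dataset

# Convert the tons-based feature to pounds so both weight columns share a unit
shipments["Container Weight (lbs)"] = shipments["Container Weight (tons)"] * 2000

# Alternatively, min-max scale every numeric feature into the 0-1 range
numeric = shipments.select_dtypes("number")
scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())
```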

3. Understand the Task and Metrics Before Training

Your intuition and grasp of your data are important pieces of the equation as you refine your models. Before training a model, zoom in on a specific problem or task and ensure your data addresses it. You might want to create new features or metrics from existing data that are relevant to trends you’re trying to analyze or predict.

For example, if your dataset includes a “Date” column, deriving a Boolean “Holiday” column from it lets the model focus on the days that impact your bottom line.
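
A minimal sketch of that feature, assuming pandas and a hypothetical list of dates that matter to the business:

```python
import pandas as pd

df = pd.read_csv("daily_sales.csv", parse_dates=["Date"])  # hypothetical dataset

# Hypothetical set of dates that impact the bottom line
holidays = pd.to_datetime(["2023-11-24", "2023-12-25"])

# Boolean feature: True on a holiday, False otherwise
df["Holiday"] = df["Date"].isin(holidays)
```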

4. Bin Data to Focus Models

Continuous data can take on any value within a range. The weights of people working in a warehouse, for example, are continuous values, as opposed to the grouped weights of distinct products in the warehouse. When preparing your data, especially for tree-based models, consider binning continuous values into a manageable number of buckets, usually between three and ten. Discretizing data helps your model focus on broader trends otherwise hidden by noisier data, and it simplifies interpretations and decisions.

For example, assess whether it is meaningful, given your chosen model, to separate (as sketched in code below):

  • An “Age” column into common demographic age ranges, such as 18-24, 25-34, etc.
  • A “Loan Approval” column into high, medium, and low bins.
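
Here is a rough binning sketch with pandas; the dataset, column names, and exact cut points are assumptions:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical dataset

# Bin a continuous "Age" column into common demographic ranges
age_bins = [0, 17, 24, 34, 44, 54, 64, 120]
age_labels = ["<18", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]
df["Age Group"] = pd.cut(df["Age"], bins=age_bins, labels=age_labels)

# Or split a numeric column (hypothetical here) into three equal-sized buckets
df["Approval Tier"] = pd.qcut(df["Approval Score"], q=3, labels=["low", "medium", "high"])
```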

5. Hone Time Data with Lag or Moving Averages

When working with time series data, such as sales or growth over time, a lag feature or a moving average helps your model focus on trends instead of the noisy variation between smaller time units, such as hours or days.

For example, creating a 14-day moving average smooths out day-to-day fluctuations and can expose bi-weekly cycles that might otherwise be obscured.
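
A quick pandas sketch of both ideas, assuming a daily sales table with hypothetical column names:

```python
import pandas as pd

sales = pd.read_csv("daily_sales.csv", parse_dates=["Date"])  # hypothetical dataset
sales = sales.sort_values("Date")

# 14-day moving average smooths out day-to-day noise
sales["Sales 14d MA"] = sales["Sales"].rolling(window=14).mean()

# 1-day lag feature gives the model yesterday's value as context
sales["Sales Lag 1"] = sales["Sales"].shift(1)
```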

Measure Twice, Cut Once

Data preparation is a crucial aspect of your data science workflow. We’ve covered a few tips and tricks to help you prepare your data to boost the trustworthiness of your models and analysis. For more tips, visit our Machine Learning Best Practices, where we also address model training.