Any meaningful analysis project starts with understanding your data. I can’t count the number of times I’ve gotten to what I thought was the end of a project, only to realize that I had some fundamental misunderstanding of the data. In a recent project to model marketing campaign performance for a brick-and-mortar retailer, we identified proximity to a store as the most important driver of campaign success. Upon closer scrutiny, it turned out that the store distance metric was outdated and unreliable, which forced us back to the feature engineering stage.
Because of these types of pitfalls, I always start a new project by digging into the data to make sure I have a solid understanding of it before going further. A simple way to do this is to check the distribution of each important column with a histogram. This helps me catch things like outliers, missing values, unexpected values, missing date ranges, and class imbalances.
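If you want to see what this kind of check looks like outside of DataChat, here is a minimal sketch in plain Python. It is not how DataChat works internally; the function name `summarize_column` and the sample values are hypothetical, made up only to illustrate counting missing values and binning a column into a rough histogram.

```python
from statistics import mean


def summarize_column(values, bins=5):
    """Return the missing count, basic stats, and equal-width histogram counts."""
    # Treat None as a missing value and set it aside.
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    lo, hi = min(present), max(present)
    # Fall back to a width of 1.0 if every value is identical.
    width = (hi - lo) / bins or 1.0

    counts = [0] * bins
    for v in present:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1

    return {
        "missing": missing,
        "min": lo,
        "max": hi,
        "mean": mean(present),
        "histogram": counts,
    }


# Hypothetical column with two missing values and one suspicious entry.
scores = [4.2, 5.1, 5.3, None, 5.8, 6.0, 6.1, 7.4, None, 25.0]
summary = summarize_column(scores)

print("missing:", summary["missing"])
for count in summary["histogram"]:
    print("#" * count)
```

Even in this tiny example, the histogram makes the problem visible: almost everything sits in the first bin, and a single value sits far away in the last one.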
The Describe skill in DataChat is an easy way to create these histograms. It also gives me a quick overview of some key distribution metrics—such as mean, min, and max—but even more importantly, it generates links to histograms.
Here we see a somewhat normal distribution. I can now apply standard statistical techniques like normalization or standard-deviation-based outlier detection in subsequent analyses, and I can use a chart like a violin chart to inspect the spread more closely.
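One common form of standard-deviation-based outlier detection is the z-score rule: flag any value that sits more than a few standard deviations from the mean. The sketch below assumes a roughly normal column; the function name `zscore_outliers`, the threshold of 3, and the sample values are illustrative assumptions, not something taken from DataChat.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose z-score exceeds the threshold in absolute value."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical data: twenty values between 5.0 and 6.9, plus one extreme value.
values = [5.0 + 0.1 * i for i in range(20)] + [15.0]
outliers = zscore_outliers(values)
print(outliers)
```

Keep in mind that this rule only makes sense when the histogram looks roughly bell-shaped, and that very small samples can never produce a z-score above 3, so the threshold may need adjusting.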
However, my “GDP Per Capita” column appears to follow a classic power-law distribution, so I know to be careful about the effect of outliers. If I were going to train a machine learning model, I might apply a logarithmic transformation to this column first.
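To get a feel for why a logarithm helps with a heavy-tailed column, here is a small sketch. The GDP figures are invented for illustration and are not taken from the dataset linked in this post. Note that a logarithm only works for strictly positive values.

```python
import math
from statistics import median

# Hypothetical, heavy-tailed GDP per capita values (all strictly positive).
gdp = [500, 800, 1200, 2500, 6000, 15000, 40000, 110000]

# Base-10 log: each unit step is a tenfold increase in the raw value.
log_gdp = [math.log10(v) for v in gdp]

# Compare how far the largest value sits from the median, before and after.
raw_ratio = max(gdp) / median(gdp)
log_ratio = max(log_gdp) / median(log_gdp)

print(f"raw max/median: {raw_ratio:.1f}")
print(f"log max/median: {log_ratio:.2f}")
```

On the raw scale, the largest value dwarfs the typical one. After the transformation, the values are much closer together, so a few very large entries have less influence on a model.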
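In another example, this huge class imbalance lets me know that I need to be careful: a model can look accurate simply by always predicting the majority class, and it can easily overfit to the handful of minority examples.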
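A quick way to quantify an imbalance like this is to compute each class's share of the rows. The sketch below uses invented labels and a hypothetical helper named `class_balance`. The largest share is also the accuracy you would get by always guessing the majority class, which is a useful baseline to keep in mind.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the total as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


# Hypothetical target column: 95 "no" rows and 5 "yes" rows.
labels = ["no"] * 95 + ["yes"] * 5
shares = class_balance(labels)

# Accuracy of a "model" that always predicts the most common class.
majority_baseline = max(shares.values())

print(shares)
print(f"majority-class baseline accuracy: {majority_baseline:.0%}")
```

If a trained model does not clearly beat that baseline, its headline accuracy is telling you about the imbalance rather than about the model.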
If you’d like to follow along, you can find the data that powered these charts at https://tinyurl.com/DataChatWHR.
As we can see, understanding your data before you begin your analysis can help you avoid common pitfalls and problems later on. Sometimes it may also point to your data being incorrect (like in the store distance example). So, validating the data is an important first step, and with the descriptive visual analytics tools in DataChat, this part is easy. By the way, if you aren’t an expert in distributions, don’t worry. DataChat can do many things automatically for you, especially when you start to use its automatic machine learning capabilities, such as Analyze. We’ll cover those capabilities in a later post.
DataChat is a cohesive analytics platform that uses natural language to make a broad range of data science tools, including data wrangling, preparation, exploration, visualization, and predictive modeling, accessible to everyone. Contact us or schedule a demo to learn more about how DataChat can help you improve your business outcomes.