Collaboration with Data Recipes

I’ve always felt the push and pull of balancing synchronous and asynchronous communication channels for analytics work. On one hand, it’s essential to make sure everyone is on the same page with a shared understanding of the problem and the data, for which meetings and collaborative sessions are key. On the other hand, time for focused deep work is precious, so limiting meetings is also critical. Meeting fatigue can kill productivity.


The good news is that DataChat is well-positioned to help you find the right balance when working with others. In particular, our Collaborate skill allows multiple users to work together on the same analytics problem in real time by editing the data, charts, and models in a shared session.

For asynchronous teamwork, we also facilitate co-creation without requiring extensive meeting time. Every data product in DataChat, from charts to machine learning models, is backed by a data recipe, which provides a complete data history. We call these recipes “workflows”, and they can be shared and edited, allowing team members to review and refine each other’s work.


DataChat workflows provide unrivaled understandability, transparency, and collaboration by exposing every step of the analytics work. Often, I can just send a workflow to a colleague or client to explain how I would approach a complicated topic, and we can then go back and forth on edits to make sure it actually solves the problem. Much as you would collaborate on a Google Doc, you can now collaborate on and co-create data recipes in DataChat; in fact, when it comes to collaboration, we think of recipes very much like documents.

Time Series Prediction as a Tool for Creating Business Targets

Sometimes I work with teams that know they want to use AI to help guide their business, but they aren’t sure where to start. If the data has a temporal component (and it often does), I turn to time series prediction because it is so simple to do in DataChat, and the business value is immediate. Imagine you have a time series dataset containing the KPI that matters to you: monthly sales, daily active users, or inventory levels over time, for example.


You can use DataChat’s Predict Time Series skill to infer future values of this KPI based on your historical time series data. 

This prediction can be seen as a “business as usual” baseline. Having a baseline like this makes it easy for you to look back and see whether your business has beaten expectations. The ability to beat expectations, especially when the expectations follow a historical trend of improvement, is a great indicator of a high-performing team.
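
For intuition, here is what that kind of “business as usual” baseline looks like outside DataChat. This is just a minimal sketch using statsmodels’ Holt-Winters exponential smoothing on made-up monthly sales figures; DataChat’s Predict Time Series skill handles these details for you:

    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # Made-up monthly sales history (illustrative only)
    sales = pd.Series(
        [120, 132, 129, 141, 150, 158, 163, 172, 180, 185, 196, 204],
        index=pd.date_range("2023-01-01", periods=12, freq="MS"),
    )

    # Fit an exponential smoothing model with an additive trend
    model = ExponentialSmoothing(sales, trend="add").fit()

    # Forecast the next six months: the "business as usual" baseline
    print(model.forecast(6))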

Why I Love Analyze

Even though I have a Master’s degree in Computer Science with a focus on AI and machine learning (ML), I prefer to use DataChat’s Analyze skill over traditional tools like Python for ML tasks. With Python, I have to remember complicated syntax, know which packages to use and when, consult package documentation as APIs change over time, remember how to hold out a percentage of my data for testing, and more. All of this is slow, tedious, and, frankly, frustrating. With Analyze, I can get the work done so much faster because it automates all of these tedious details.

Because I don’t build models in Python every day, I often forget the exact syntax for the various packages I want to use. Instead of slowly plodding through a notebook with the sklearn or pandas docs pulled up on another screen, I simply let DataChat handle all of that complexity for me, and I never have to worry about Python package management.

Not only is Analyze faster from a productivity perspective, but it also makes sure I follow data science best practices. For example, Analyze automatically:

  • Trains multiple cutting-edge ML models and selects the best one, so I don’t have to worry about model selection or staying on top of the latest ML literature.
  • Performs k-fold cross-validation to help avoid overfitting.
  • Bins continuous columns to improve model performance.
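
To make that concrete, here is a rough sketch of the manual scikit-learn boilerplate those three bullets replace. None of this is DataChat’s internal implementation; the dataset and candidate models are just illustrative stand-ins:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import KBinsDiscretizer

    X, y = load_breast_cancer(return_X_y=True)

    # Candidate models I would otherwise have to pick between by hand
    candidates = {
        "logistic": LogisticRegression(max_iter=5000),
        "random_forest": RandomForestClassifier(random_state=0),
        "gradient_boosting": GradientBoostingClassifier(random_state=0),
    }

    best_name, best_score = None, -np.inf
    for name, model in candidates.items():
        # Bin continuous columns, then score with 5-fold cross-validation
        pipeline = make_pipeline(
            KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile"),
            model,
        )
        score = cross_val_score(pipeline, X, y, cv=5).mean()
        if score > best_score:
            best_name, best_score = name, score

    print(f"best model: {best_name} (mean accuracy {best_score:.3f})")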

Both faster and better? It’s no wonder that I prefer using DataChat for my ML needs.

Data Discovery First!

Any meaningful analysis project starts with understanding your data. I can’t count the number of times I’ve gotten to what I thought was the end of a project, only to realize that I had some fundamental misunderstanding of the data. In a recent project to model marketing campaign performance for a brick-and-mortar retailer, we identified proximity to a store as the most important driver of campaign success. Upon closer scrutiny, it turned out the store distance metric was outdated and unreliable, forcing us back to the feature engineering stage.


Because of these types of pitfalls, I always start a new project by digging into the data to make sure I have a solid understanding of it before going further. A straightforward way to tackle this is to check the distributions of each of the important columns with histograms. This helps me catch things like outliers, missing values, unexpected values, missing date ranges, and class imbalances.
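
Outside DataChat, that first pass boils down to a few lines of pandas. Here is a minimal, illustrative sketch; the file name and column names are placeholders, not the actual dataset:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Placeholder file; swap in your own dataset
    df = pd.read_csv("world_happiness.csv")

    # Summary statistics: count, mean, min, max, and quartiles per column
    print(df.describe())

    # Histograms of the important columns (placeholder names)
    df[["Happiness Score", "GDP Per Capita"]].hist(bins=30)
    plt.tight_layout()
    plt.show()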


The Describe skill in DataChat is an easy way to create these histograms. It also gives me a quick overview of some key distribution metrics—such as mean, min, and max—but even more importantly, it generates links to histograms. 


For a column whose histogram shows a roughly normal distribution, I can apply standard statistical techniques in subsequent analyses, such as normalization or standard-deviation-based outlier detection (which a chart like a violin plot can help visualize).
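
As a quick illustration of that standard-deviation idea, here is a minimal z-score sketch on synthetic, roughly normal data (in practice the values would come from the real column):

    import numpy as np
    import pandas as pd

    # Synthetic, roughly normal sample standing in for a real column
    rng = np.random.default_rng(0)
    values = pd.Series(rng.normal(loc=5.5, scale=1.1, size=1000))

    # Flag anything more than three standard deviations from the mean
    z = (values - values.mean()) / values.std()
    print(values[z.abs() > 3])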

However, my “GDP Per Capita” column appears to follow a classic power-law distribution, so I know to be careful of the effect of outliers. If I were going to train a machine learning model, I might try a logarithmic scale instead.
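
To illustrate, here is a small sketch of that log transform on synthetic heavy-tailed data standing in for a column like “GDP Per Capita”:

    import numpy as np
    import pandas as pd

    # Heavy-tailed synthetic stand-in for a power-law-ish column
    rng = np.random.default_rng(1)
    gdp = pd.Series(rng.pareto(a=2.0, size=1000) * 10_000)

    # log1p compresses the long right tail (and is well-defined at zero)
    gdp_log = np.log1p(gdp)
    print(f"skew before: {gdp.skew():.1f}, after: {gdp_log.skew():.1f}")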


In another example, a histogram that reveals a huge class imbalance lets me know that I need to be careful of overfitting and of models that simply favor the majority class.
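
Checking for that kind of imbalance outside a chart is a one-liner; here is a sketch with a made-up label column:

    import pandas as pd

    # Made-up, heavily imbalanced label column
    labels = pd.Series(["no_churn"] * 950 + ["churn"] * 50)

    # Relative class frequencies make the imbalance obvious
    print(labels.value_counts(normalize=True))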


If you’d like to follow along, you can find the data that powered these charts at https://tinyurl.com/DataChatWHR.


As we can see, understanding your data before you begin your analysis can help you avoid common pitfalls and problems later on. Sometimes it may also point to your data being incorrect (like in the store distance example). So, validating the data is an important first step, and with the descriptive visual analytics tools in DataChat, this part is easy. By the way, if you aren’t an expert in distributions, don’t worry. DataChat can do many things automatically for you, especially when you start to use its automatic machine learning capabilities, such as Analyze. We’ll cover those capabilities in a later post.