May Webinar Recap: An Introduction to DataChat

We recently hosted a webinar to show how DataChat’s Conversational Intelligence is transforming the way organizations leverage AI and BI to make data-driven decisions. In this 30-minute video, Jignesh Patel, co-founder and CEO, and Danny Thompson, Executive Vice President of Sales, gave an introduction to our all-in-one, state-of-the-art analytics platform, discussed the benefits of using our Guided English Language to create no-code data science pipelines, and shared success stories from our customers.

Some common questions from the webinar included:

Is DataChat a SaaS platform? What cloud vendors do you support?

Yes, DataChat is an analytics SaaS platform and can run on any major cloud provider, including Amazon Web Services (AWS), Google Cloud Platform (GCP), and Azure. We also allow customers to deploy in their own private cloud to address data privacy concerns.

What data sources does DataChat support (files, databases, etc.)?

DataChat supports a number of data sources, including common flat files (such as CSV, TSV, and Excel files) and many major database systems, such as Snowflake, BigQuery, PostgreSQL, MySQL, SQL Server, and more.

Can a college student get access to DataChat?

Send an email to info@datachat.ai. We are spinning up a new program to provide free access to the DataChat platform for a limited time, specifically for college and university students. We are also launching DataChat University to help you learn the platform at your own pace. Lastly, we provide rich documentation on our website that is available to our customers.

We’re always excited to share stories about how DataChat can power analytics for our customers. If you have a specific use case you would like to see demonstrated, please let us know at info@datachat.ai. You can also follow us on LinkedIn and Twitter for more updates and announcements.

5 Tips for Preparing Your Data for Machine Learning

This post is part of our data science and analytics learning series. Please contact us if you have a topic you would like us to cover!

In data science, we want to avoid GIGO, or Garbage In, Garbage Out. In other words, your output can only be as good as your input. We build models with machine learning to make decisions based on the model’s predictions – for competitive advantage or to anticipate behavior, for example.

For more trustworthy results, we want to train our models on good, clean data.

What is Data Prep, and Why Does It Matter?

Data preparation is a critical step in any data science pipeline. Some examples of data prep:

  • Remove or replace null or empty values. For example, replace empty entries in an “Age” column with the average of all of the other values in the column.
  • Create completely new columns. For example, calculate a “Profit” column from “Revenue” and “Costs” columns.
  • Enhance your data. For example, extend datasets with other related datasets to expand the context.

The initial quality and intended use of your data dictate the amount of prep work you’ll want to do.
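
As a concrete illustration of the first two examples above, here is a minimal pandas sketch (the DataFrame and its “Age”, “Revenue”, and “Costs” columns are hypothetical) that fills missing ages with the column average and derives a “Profit” column:

```python
import pandas as pd

# Hypothetical sales records with one missing "Age" value
df = pd.DataFrame({
    "Age": [34, None, 52, 41],
    "Revenue": [1200.0, 950.0, 1800.0, 700.0],
    "Costs": [400.0, 300.0, 650.0, 280.0],
})

# Replace null "Age" entries with the average of the other values
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Derive a new "Profit" column from the existing columns
df["Profit"] = df["Revenue"] - df["Costs"]
```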

How to Prepare Your Data for Machine Learning

1. Cull Data to Focus

Machine learning models are designed to pick out trends in your data, but too much noise can hamper how the model identifies patterns. To keep the model focused, remove incorrect and unrelated data so that it isn’t distracted by irrelevant or obvious trends.

For example, if you have retail sales data that’s broken down by hour, it might make sense to remove any data that falls outside of business hours, such as early mornings and evenings. This helps your model by discarding trends that are easily explainable (sales are lower after hours than during business hours) and allows it to focus on the more subtle (and interesting) trends that occur throughout the business day.
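
As a rough illustration, the following pandas sketch filters a hypothetical hourly sales table down to business hours (assumed here to be 9am to 5pm):

```python
import pandas as pd

# Hypothetical hourly retail sales data
sales = pd.DataFrame({
    "timestamp": pd.date_range("2022-05-01", periods=48, freq="h"),
    "sales": range(48),
})

# Keep only the rows that fall within business hours (assumed 9am-5pm, inclusive)
business_hours = sales[sales["timestamp"].dt.hour.between(9, 17)]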

Other examples to look for:

  • Remove outliers. For example, spotlight trends in donations data by discarding the very small or very large donations.
  • Ensure correct grouping. For example, make sure a “State” column represents all values homogeneously, by using either the two-letter abbreviation or the full name for each state.
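
Both checks are straightforward in pandas. Here is a minimal sketch, assuming a hypothetical donations table with mixed “state” formats; the percentile cutoffs are an illustrative choice:

```python
import pandas as pd

# Hypothetical donations data with an extreme outlier and mixed "state" formats
donations = pd.DataFrame({
    "amount": [5, 25, 40, 55, 90, 10000],
    "state": ["WI", "Wisconsin", "IL", "IL", "WI", "Illinois"],
})

# Discard the very small and very large donations (outside the 5th-95th percentiles)
low, high = donations["amount"].quantile([0.05, 0.95])
trimmed = donations[donations["amount"].between(low, high)].copy()

# Represent every state the same way, using the two-letter abbreviation
trimmed["state"] = trimmed["state"].replace({"Wisconsin": "WI", "Illinois": "IL"})
```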

2. Normalize for Accuracy

If your data uses varying units of measurement across columns, consider normalizing your data to a standard unit of measurement. This helps your model avoid treating large numbers in one feature, or column, as more important than smaller numbers in a different feature.

For example, if your data includes one feature that uses pounds and another that uses tons, consider choosing one unit of measurement—either pounds or tons—for both features. With shared units across features, your model can better assess their relationship.
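
A minimal sketch of this kind of normalization, assuming a hypothetical shipments table with one weight feature in pounds and another in tons (1 US ton = 2,000 pounds):

```python
import pandas as pd

# Hypothetical shipment data with weights in mixed units
shipments = pd.DataFrame({
    "package_weight_lbs": [12.0, 48.5, 30.2],
    "pallet_weight_tons": [0.5, 1.2, 0.8],
})

# Convert tons to pounds so both features share one unit of measurement
shipments["pallet_weight_lbs"] = shipments["pallet_weight_tons"] * 2000
shipments = shipments.drop(columns=["pallet_weight_tons"])
```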

3. Understand the Task and Metrics Before Training

Your intuition and grasp of your data are important pieces of the equation as you refine your models. Before training a model, zoom in on a specific problem or task and ensure your data addresses it. You might want to create new features or metrics from existing data that are relevant to trends you’re trying to analyze or predict.

For example, if your dataset includes a “Date” column, adding a Boolean “Holiday” column lets you focus your analysis on the days that impact your bottom line.
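
As a sketch, assuming a hypothetical “Date” column and a hand-picked list of holidays, the Boolean flag might be derived like this:

```python
import pandas as pd

# Hypothetical dataset with a "Date" column
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-07-04", "2022-07-05", "2022-12-25"]),
    "Sales": [5200, 1100, 8400],
})

# Hand-picked list of holidays that matter to the business (an assumption)
holidays = pd.to_datetime(["2022-07-04", "2022-12-25"])

# Boolean "Holiday" feature derived from the existing "Date" column
df["Holiday"] = df["Date"].isin(holidays)

# Restrict analysis to the days that impact the bottom line
holiday_sales = df[df["Holiday"]]
```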

4. Bin Data to Focus Models

Continuous data is data that can have a broad range of values. The weights of people working in a warehouse, for example, are continuous values, as opposed to the grouped weights of distinct products in the warehouse. When preparing your data, especially for tree-based models, consider binning continuous values into a manageable number of buckets, usually between three and ten. Discretizing data helps your model focus on broader trends otherwise hidden by noisier data, and it simplifies interpretations and decisions.

For example, assess whether it is meaningful, given your chosen model, to separate:

  • An “Age” column into common demographic age ranges, such as 18-24, 25-34, etc.
  • A “Loan Approval” column into high, medium, and low bins.
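
As a rough sketch of the first case, pandas’ cut function can bin a hypothetical “Age” column into demographic ranges:

```python
import pandas as pd

# Hypothetical "Age" column to discretize
ages = pd.DataFrame({"Age": [19, 23, 31, 45, 38, 67]})

# Bin continuous ages into common demographic ranges
bins = [18, 24, 34, 44, 54, 120]
labels = ["18-24", "25-34", "35-44", "45-54", "55+"]

# include_lowest=True keeps 18-year-olds in the first bucket
ages["AgeGroup"] = pd.cut(ages["Age"], bins=bins, labels=labels, include_lowest=True)
```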

5. Hone Time Data with Lag or Moving Averages

When working with time series data, such as sales or growth over time, a lag or a moving average helps your model focus on trends instead of the larger variations between smaller units, such as hours or days.

For example, creating a 14-day moving average can expose bi-weekly cycles that might otherwise be obscured.
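
Here is a minimal pandas sketch, assuming a hypothetical daily sales series, that computes both a 14-day moving average and a simple one-day lag:

```python
import pandas as pd

# Hypothetical daily sales series over two months
sales = pd.Series(
    range(60),
    index=pd.date_range("2022-01-01", periods=60, freq="D"),
)

# A 14-day moving average smooths day-to-day noise and can expose bi-weekly cycles
moving_avg = sales.rolling(window=14).mean()

# A one-day lag lets the model relate each day to the previous one
lagged = sales.shift(1)
```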

Measure Twice, Cut Once

Data preparation is a crucial aspect of your data science workflow. We’ve covered a few tips and tricks to help you prepare your data to boost the trustworthiness of your models and analysis. For more tips, visit our Machine Learning Best Practices, where we also address model training.

Introduction to DataChat: A Q&A with Co-Founder Rogers Jeffrey Leo John

We asked DataChat’s co-founder, Rogers Jeffrey Leo John, to tell us about the origins of the company. What was missing from the marketplace? How has DataChat pushed the envelope?

Q: How did DataChat start?

The idea formed after Jignesh [Patel, our CEO and co-founder] was a visiting scientist at Pivotal Labs. He observed their data teams and the problems they were trying to solve. He noticed that most of the problems (and their solutions) followed similar patterns: when training a model, they ended up in a loop of loading the Python package, selecting features, training the model, then analyzing the results. This was followed by tweaking the features and retraining the model all over again.

With that observation in mind, we wrote our first paper, in which we proposed an early prototype of Ava, our Conversational Intelligence assistant, that could abstract the model training loop into a Python template.

Q: How did you improve the model training process?

We realized that, by leveraging controlled natural language (CNL), we could abstract away the programming languages (Python, R, SQL, etc.) from the user in favor of a subset of English. That was the genesis of DataChat’s Guided English Language© (GEL), which was inspired by the “language” used by aviators, such as the NATO phonetic alphabet. GEL allows the user to build data science workflows without needing to know Python, R, SQL, or any other traditional data science tool.

We spun our research out into DataChat and have been growing and evolving ever since.

Q: How has Ava evolved?

While developing Ava and GEL to make model training more intuitive, we’ve also expanded GEL to cover a wide array of data science tools and functions, including data ingestion, data wrangling, and visualization, along with machine learning and explainable artificial intelligence. This makes us a truly all-in-one platform that allows more business users to work with their own data to answer their own questions without needing to learn how to code or work with more complicated data science tools.

Q: What are some innovations in DataChat’s design?

One problem we’re solving is the reproducibility gap. A few years ago, the industry didn’t care about reproducibility; they were more concerned with model accuracy and less concerned about how they got there. We baked reproducibility into DataChat from the beginning.

By having conversations with Ava in GEL, we’re also automating the documentation and commenting pieces of the data science process. Our workflows are built in English, which makes it very easy to look back and see exactly what happened and when. This not only makes it easy to understand the logic behind the pipeline, but also improves governance and transparency across an organization.

Q: As the platform matured, what other problems have you focused on?

Most of the gains in DataChat can be attributed to how we’ve built industry best practices directly into the platform to avoid common pitfalls, such as including your label in your feature set. This helps build confidence for novice users and lessens the mental load for more experienced users.

We’ve also seen performance improvements when it comes to data wrangling. For example, in one of our test cases, 100 percent of the DataChat users were able to at least attempt every question in a set of data wrangling problems, compared to 73 percent of Python users.

Overall, we think we’ve already pushed the envelope considerably and are continuing to do so as we add more functionality and features to the platform.

Q: Who will use DataChat?

The biggest challenge we’re trying to solve is how to democratize data science and bring data science tools to more users. Everybody has data, but not everybody can pay a data scientist to work with it or has the time to learn the tools themselves.

Q: How can DataChat improve the data science field?

Obviously, making data science more accessible is a huge win for everybody. Business users feel empowered, and data teams get more of their time back. Rather than chasing tickets, they can spend more time digging deeper into their data to find more novel insights for their organization.

When we were first developing the platform, we ran some user studies with users whose data science knowledge ranged from nothing at all to intermediate Python experience. We found that DataChat improved those users’ time to first model by a factor of 10 (~2 minutes in DataChat vs. ~20 minutes in Python).

We also saw that DataChat users were able to train more models (an average of six compared to an average of two in Python) and 100 percent of the DataChat users were able to train at least one model (compared to 80 percent of Python users). Models trained in DataChat were also more accurate (with F1 scores in the 0.7-0.8 range compared to 0.3-0.8 in Python).

Q: Has the platform met your initial expectations since spinning out from UW-Madison?

We think so. We’ve made an impact for our customers from day one: they’re more productive, they’re finding insights they couldn’t before, and they’re unlocking tools that were previously out of their reach.