In June, we had the privilege of presenting our research paper, "DataChat: An Intuitive and Collaborative Data Analytics Platform," at the SIGMOD conference. The paper introduced the DataChat platform, designed to make data science intuitive, powerful, and accessible to all users. Since its unveiling, our approach has drawn substantial interest across a variety of industries, prompting us to continuously enhance the platform. In this blog post, we revisit the experiments we conducted earlier this year to showcase our significant improvements.
Our research methodology involves assessing the capabilities of DataChat using the dev split of the Spider benchmark dataset. However, we identified certain limitations within the Spider dataset during our initial experiments: a majority of the samples in the Spider dev set were relatively easy to reason about. To make our study more comprehensive, we introduced two metrics, Misalignment (M) and Degree of Composition (C), to better characterize the dataset's complexity.
Misalignment (M) measures how tokens in a natural language query are disconnected from table identifiers and other semantic concepts relevant to the analytics task. Degree of Composition (C), on the other hand, assesses the functional complexity of the ground truth SQL program. The distribution of data samples across these metrics is illustrated in Figure 7.
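To make the Misalignment metric concrete, here is a minimal sketch of one way such a score could be computed. This is an illustrative simplification under our own assumptions, not the paper's actual definition: it approximates misalignment as the fraction of query tokens that have no exact match in the schema vocabulary, whereas the real metric accounts for broader semantic concepts.

```python
def misalignment(query_tokens, schema_tokens):
    """Illustrative misalignment score: the fraction of natural language
    query tokens with no exact (case-insensitive) match among table and
    column identifiers. A higher value means the query's wording is more
    disconnected from the schema."""
    schema_vocab = {t.lower() for t in schema_tokens}
    if not query_tokens:
        return 0.0
    unmatched = [t for t in query_tokens if t.lower() not in schema_vocab]
    return len(unmatched) / len(query_tokens)
```

Under this simplified scheme, a query whose tokens all appear as schema identifiers scores 0, while a query phrased entirely in concepts absent from the schema scores 1.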
Figure 7: Distribution of all samples from the Spider dev split, characterized by misalignment (M) and degree of composition (C), annotated with the number of points in each zone. Based on the distribution of the scores, the thresholds for M and C were chosen to be 0.4 and 30, respectively. Most samples fall in the (low, low) zone.
To ensure a more balanced dataset, we sampled equitably from the four partitions defined by the metrics M and C. This allowed us to evaluate the performance of our method across varying levels of dataset complexity.
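The balanced sampling step can be sketched as simple stratified sampling over the four (M, C) zones. The thresholds below follow Figure 7; the function name, the per-zone sample size, and the dict-based sample representation are illustrative assumptions, not our production pipeline.

```python
import random

def stratified_sample(samples, m_threshold=0.4, c_threshold=30,
                      per_zone=50, seed=0):
    """Partition samples into four zones by their M and C scores
    (low/high on each axis) and draw up to `per_zone` samples from each.
    Each sample is assumed to be a dict with "M" and "C" keys."""
    zones = {(False, False): [], (False, True): [],
             (True, False): [], (True, True): []}
    for s in samples:
        zones[(s["M"] >= m_threshold, s["C"] >= c_threshold)].append(s)
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for zone in zones.values():
        balanced.extend(rng.sample(zone, min(per_zone, len(zone))))
    return balanced
```

Capping each zone at the same size prevents the dominant (low, low) zone from swamping the evaluation set.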
To assess the effectiveness of our approach, we fed each natural language query and its corresponding datasets into DataChat Ask to generate a result table. We then compared the generated result tables against the ground truth tables, which were derived by executing the provided SQL queries against the input datasets. The results are presented in the updated table below.
Our comprehensive analysis of the dataset yielded a remarkable 14.3% overall improvement in the performance of DataChat. Notably, the most substantial enhancements were observed in the high-complexity category, showcasing DataChat's improved ability to handle complex data analytics tasks effectively.
It's worth emphasizing that our method relies exclusively on schema information and does not consider the data samples themselves. The significant improvements can be attributed to various factors, including:
Semantic Grounding in the Program Checker: Enabling more precise static code repairs.
Enhanced Semantic Layer: Allowing the definition of custom concepts, providing users with greater flexibility.
In-House Error Analysis: Continuous analysis of failure cases, driving improvements in areas such as API coverage and function-level behavior.
The recent improvements to the DataChat platform highlight our commitment to delivering a powerful and intuitive data analytics solution for all users and our dedication to advancing the field of data science and analytics. As we continue to refine and innovate, DataChat remains the foremost platform for accessible and effective data analytics.
DataChat is a cohesive analytics platform that uses natural language to make a broad range of data science tools, including data wrangling, preparation, exploration, visualization, and predictive modeling, accessible to everyone to improve business outcomes. Contact us or schedule a demo to learn more.