Data analytics is crucial for business growth and problem-solving, and it depends on accurate, consistent data. Outliers can heavily skew your findings, so it is essential to address them early in your analytical process. They often signal measurement errors, entry errors, poor sampling, and other issues. In this example, we'll examine how sunshine hours relate to life expectancy across various cities, and use the dataset to show how to handle outliers and null values.
Step One: Load the Data
The initial step involves loading your dataset into a DataChat session. You can access the dataset through this link. Once loaded, your dataset will resemble the following:

With the data successfully loaded, the next task is to address the outliers.
Step Two: Detect Outliers
Creating visualizations is a powerful way to investigate your data and can be a useful tool to help catch outliers. We recommend using scatter, boxplot, or violin charts to detect outliers. In this example, we’ll use a scatter chart. We can use DataChat’s Chart Builder to create a visualization that looks something like this:

At first glance, we can see a distinct outlier near the bottom of the chart. Hovering over this point provides more insight, revealing values for each available column. It appears that Johannesburg might be an outlier with a life expectancy of 56.30 years:

To validate suspected outliers, we recommend using the Detect skill. The Detect skill leverages DataChat’s machine-learning capabilities to identify outliers in numerical columns. It uses the Isolation Forest method to assign an anomaly score and a ranking to each data point. Scores range from -1 to 1, with lower scores indicating a higher likelihood of an outlier.
Implementing the Detect skill yields a modified dataset resembling the following:

The output confirms that Johannesburg (row 1) holds the highest outlier score.
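For readers who want to see the same idea outside DataChat, here is a minimal sketch using scikit-learn's `IsolationForest`. It is not how the Detect skill is implemented; it only illustrates the technique. Only Johannesburg's life expectancy (56.30) comes from this example. The other cities and values are made up for illustration. Note that scikit-learn's `score_samples` uses its own scale (values at or below 0, lower meaning more anomalous), which differs from the -1 to 1 range described above.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy data: only Johannesburg's life expectancy (56.30) is from the article;
# all other rows and values are hypothetical placeholders.
df = pd.DataFrame({
    "City": ["Johannesburg", "City A", "City B", "City C",
             "City D", "City E", "City F", "City G"],
    "SunshineHoursCity": [3000.0, 2100.0, 2300.0, 1900.0,
                          2500.0, 2200.0, 2050.0, 2400.0],
    "LifeExpectancyyearsCountry": [56.30, 80.1, 81.4, 79.5,
                                   82.0, 80.8, 78.9, 81.9],
})

features = df[["SunshineHoursCity", "LifeExpectancyyearsCountry"]]

# Fit an Isolation Forest; random_state makes the run repeatable.
model = IsolationForest(n_estimators=200, random_state=0).fit(features)

# In scikit-learn, lower scores mean "more anomalous".
df["anomaly_score"] = model.score_samples(features)

# Rank 1 = most anomalous row.
df["anomaly_rank"] = df["anomaly_score"].rank(method="first").astype(int)

most_anomalous = df.sort_values("anomaly_score").iloc[0]["City"]
print(df.sort_values("anomaly_rank"))
```

Sorting by `anomaly_rank` puts the most suspicious rows first, which mirrors the scores and rankings the Detect skill adds to the dataset.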
Step Three: Remove Outliers
With Johannesburg identified as a significant outlier in our dataset, it's advisable to remove it to avoid distorting the rest of the data. To achieve this, we can use the Drop skill to remove Johannesburg from the dataset. Alternatively, you could consider substituting the outlier with an average value.
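Both options, dropping the row or substituting an average, can be sketched in pandas. This is an illustration rather than what the Drop skill does internally. The non-Johannesburg rows are hypothetical, and we assume "average" means the mean of the remaining rows, which the text does not specify.

```python
import pandas as pd

# Toy data: only Johannesburg's value (56.30) is from the article;
# the other rows are hypothetical placeholders.
df = pd.DataFrame({
    "City": ["Johannesburg", "City A", "City B", "City C"],
    "LifeExpectancyyearsCountry": [56.30, 80.0, 82.0, 81.0],
})

is_outlier = df["City"] == "Johannesburg"

# Option 1: drop the outlier row entirely.
dropped = df[~is_outlier].reset_index(drop=True)

# Option 2: keep the row, but replace its value with the mean of the
# remaining rows (assumption: the outlier itself is excluded from the mean).
replacement = df.loc[~is_outlier, "LifeExpectancyyearsCountry"].mean()
replaced = df.copy()
replaced.loc[is_outlier, "LifeExpectancyyearsCountry"] = replacement
```

Dropping is simpler and avoids inventing a value. Substituting keeps the row's other columns available for analysis.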
Step Four: Identify Null Values
Null values are another data quality issue to watch for. In DataChat, nulls can be spotted by sorting a column in descending order, which places null values at the top of the dataset. In this example, we focus on the "SunshineHoursCity" and "LifeExpectancyyearsCountry" columns.
Sorting the columns in descending order reveals a single null value in the "SunshineHoursCity" column.
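As a rough pandas equivalent of this check, the sketch below counts nulls per column and sorts so that nulls appear first. All values are hypothetical. Only the column names and the fact that Geneva's sunshine hours are missing come from this example. pandas places nulls last by default, so `na_position="first"` is set to mimic the behavior described here.

```python
import pandas as pd

# Toy data: hypothetical values; Geneva's missing sunshine hours
# mirrors the null described in the article.
df = pd.DataFrame({
    "City": ["Geneva", "City A", "City B", "City C"],
    "SunshineHoursCity": [None, 2100.0, 2300.0, 1900.0],
    "LifeExpectancyyearsCountry": [82.6, 80.1, 81.4, 79.5],
})

# Count nulls in the two columns of interest.
null_counts = df[["SunshineHoursCity", "LifeExpectancyyearsCountry"]].isna().sum()

# Sort descending with nulls placed at the top of the result.
sorted_df = df.sort_values(
    "SunshineHoursCity", ascending=False, na_position="first"
)

# List which cities have a missing sunshine value.
null_cities = df.loc[df["SunshineHoursCity"].isna(), "City"].tolist()
print(null_counts)
print(null_cities)
```

Counting with `isna().sum()` scales better than sorting when a dataset has many columns, since it reports every column's null count at once.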

Step Five: Replace Null Values
Given that machine learning algorithms typically cannot handle null values, addressing these gaps in the dataset becomes pivotal to ensure accurate data analysis. While the "LifeExpectancyyearsCountry" column has no nulls, there's one in the "SunshineHoursCity" column, specifically for "Geneva."
Based on the generated chart, it appears this data adheres to a uniform distribution, with "LifeExpectancyyearsCountry" spanning the 70-85 range. We recommend replacing the null with the average of "SunshineHoursCity", which gives a value that is predictably similar to the rest of the column. That way, we can continue to use the row without losing important information from its other columns. This replacement can be done using the Clean skill. The result looks like this:

The null value is replaced with the column’s average (2,224.95), resulting in a revised dataset.
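Mean imputation of this kind can be sketched in pandas as follows. This is an illustration, not the Clean skill's implementation, and the values are hypothetical, so the mean here will not match the 2,224.95 computed from the real dataset.

```python
import pandas as pd

# Toy data: hypothetical values; Geneva's sunshine hours are missing.
df = pd.DataFrame({
    "City": ["Geneva", "City A", "City B", "City C"],
    "SunshineHoursCity": [None, 2100.0, 2300.0, 1900.0],
})

# mean() skips nulls by default, so this averages the three known values.
column_mean = df["SunshineHoursCity"].mean()

# Fill the missing entry with the column mean.
df["SunshineHoursCity"] = df["SunshineHoursCity"].fillna(column_mean)
print(df)
```

Filling with the mean leaves the column's average unchanged, though it slightly reduces the column's variance, which is worth keeping in mind when many values are missing.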
Using DataChat, we were able to detect outliers and replace null values, thus preparing us for a more accurate analysis and more dependable results.
DataChat is a cohesive analytics platform that uses natural language to make a broad range of data science tools, including data wrangling, preparation, exploration, visualization, and predictive modeling, accessible to everyone to improve business outcomes. Contact us or schedule a demo to learn more about how DataChat can help you improve your business outcomes.