Data analytics is essential for business growth and problem-solving, so making sure that your data is as accurate and consistent as it can be is key. Outliers can greatly impact your analysis and skew your findings, so it’s important to handle them before you get too deep into your work. Outliers often represent measurement errors, entry errors, poor sampling, and more, which is why it’s important to understand how to handle them. In this example, we’ll compare how sunshine hours affect life expectancy in different cities to show you how to find and handle numerical outliers and null values.
Step One: Load Your Data
As always, the first step is to load your data into a DataChat session. If you’d like to follow along, you can find this dataset here. We end up with a dataset that looks something like this:
With our data loaded, we’re ready to start handling outliers.
Step Two: Detect Outliers
Creating visualizations is a powerful way to investigate your data and can be a useful tool to help catch outliers. We recommend using scatter, boxplot, or violin charts to detect outliers. In this example, we’ll use a scatter chart. We can use DataChat’s Chart Builder to create a visualization that looks something like this:
At first glance, we can see a distinct outlier near the bottom of the chart. Hovering over the data point gives us a bit more information, showing us the values for each available column in the dataset. It looks like Johannesburg might be an outlier with a life expectancy of 56.30 years:
We recommend using the Detect skill to confirm suspected outliers. The Detect skill leverages DataChat’s machine-learning capabilities to help you identify outliers in numerical columns using the Isolation Forest Method’s anomaly score to provide outlier scores and ranking for each data point. The score can range from -1 to 1; the lower the score, the more likely it is that a given data point is an outlier.
Using Detect results in a dataset that looks something like this:
The output confirms that Johannesburg, row 1, has the highest outlier score.
Step Three: Remove Outliers
Now that we’ve confirmed that Johannesburg is a substantial outlier in our dataset, we can remove the outlier so we don’t skew the rest of our data. In this case, we can use the Drop skill to remove Johannesburg from the dataset. In some cases, you might also consider replacing this outlier with an average value.
Step Four: Find Null Values
Nulls values are another important type of outlier to watch. In DataChat, you can view a column’s null by simply sorting the column in descending order, placing the null values at the top of the dataset. In this example, we’re focusing on two specific target columns, “SunshineHoursCity” and “LifeExpectancyyearsCountry”. After sorting the columns in descending order, we can see that “SunshineHoursCity” has a single null value:
Step Five: Replace Null Values
Since machine learning algorithms don’t typically support data with null values, handling the null values in this column is critical to prevent drawing inaccurate inferences about our data. There are no null values in the LifeExpectancyyearsCountry column, but there is one null value in the SunshineHoursCity column where “City” is “Geneva”.
Based on the chart we generated, this data seems to follow a uniform distribution – where “LifeExpectancyyearsCountry” remains between the range of 70-85. We recommend using the average “SunshineHoursCity” to replace the null with a value that is predictably similar. We can continue to use the data without losing important information from the other columns. We can replace the null with the average of “SunshineHoursCity” using the Clean skill. The result looks like this:
The null value is replaced with the column’s average (2,224.95) and a new dataset is created. Using DataChat, we were able to detect outliers and replace null values in our data, preparing us for a far more accurate analysis and more reliable results.
DataChat is a cohesive analytics platform that uses natural language to make a broad range of data science tools, including data wrangling, preparation, exploration, visualization, and predictive modeling, accessible to everyone to improve business outcomes. Contact us or schedule a demo to learn more about how DataChat can help you improve your business outcomes.