Data Cleaning in Data Mining

Data cleaning is the process of detecting and correcting dirty data. Real-world data is rarely clean: it can be incorrect for many reasons, such as hardware errors or failures, network errors, and human error. The data should therefore be cleaned before mining.

What are the importance and benefits of data cleaning?

1. Data cleaning removes major errors.
2. Data cleaning supports more accurate decisions, which in turn can lead to happier customers and more sales.
3. Data cleaning removes the inconsistencies that are likely to occur when data from multiple sources is stored in one data-set.
4. Data cleaning makes the data-set more efficient, more reliable, and more accurate.

Sources of Missing Values

There are many sources of missing data. Some of the major ones are:
A user forgot to fill in a field.
A programming error occurred.
Data was lost while being transferred manually from a legacy database.

Dirty data	Examples
Incomplete data	salary = " "
Inconsistent data	Age = "5 years", Birthday = "06/06/1990", Current Year = "2024"
Noisy data	Salary = "-5000", Name = "123"
Intentional error	Sometimes applications allot a default value to an attribute automatically, e.g. some applications set gender to male by default: gender = "male"
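
The categories in the table can be checked for mechanically. The following is a minimal sketch, assuming each record is a Python dict with the hypothetical keys name, salary, age and birth_year; the function name find_issues and the one-year tolerance on age are illustrative choices, not part of any standard.

```python
def find_issues(record, current_year):
    """Return a list of problems found in one record (a dict).

    The keys "name", "salary", "age" and "birth_year" are assumed
    for illustration only.
    """
    issues = []

    # Incomplete data: the salary is missing or blank.
    salary = record.get("salary")
    if salary is None or str(salary).strip() == "":
        issues.append("incomplete: salary is empty")
    else:
        # Noisy data: the salary is not a number, or is negative.
        try:
            if float(salary) < 0:
                issues.append("noisy: salary is negative")
        except (TypeError, ValueError):
            issues.append("noisy: salary is not a number")

    # Noisy data: a name should not contain digits.
    name = str(record.get("name", ""))
    if any(ch.isdigit() for ch in name):
        issues.append("noisy: name contains digits")

    # Inconsistent data: the stated age disagrees with the birth year.
    age = record.get("age")
    birth_year = record.get("birth_year")
    if age is not None and birth_year is not None:
        expected = current_year - birth_year
        if abs(expected - age) > 1:  # allow for a birthday later in the year
            issues.append("inconsistent: age does not match birth year")

    return issues


record = {"name": "123", "salary": "-5000", "age": 5, "birth_year": 1990}
for issue in find_issues(record, current_year=2024):
    print(issue)
```

A default value such as gender = "male" cannot be detected from a single record. It usually shows up only as a suspicious distribution across the whole data-set.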

How to Handle Incomplete/Missing Data?

Ignore the tuple
Fill in the missing value manually
Fill in the missing values automatically, using:
- The attribute mean
- A constant value, if a suitable constant exists
- The most probable value, estimated with a Bayesian formula or a decision tree
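
Three of the options above (ignoring the tuple, filling with the attribute mean, and filling with a constant) can be sketched in a few lines. This is an illustration only: it assumes rows are dicts and that a missing value is stored as None, and the function names are hypothetical.

```python
from statistics import mean


def drop_incomplete(rows, key):
    """Ignore the tuple: keep only rows where `key` has a value."""
    return [dict(r) for r in rows if r.get(key) is not None]


def fill_with_constant(rows, key, constant):
    """Replace every missing `key` with the given constant."""
    return [
        {**r, key: constant} if r.get(key) is None else dict(r)
        for r in rows
    ]


def fill_with_mean(rows, key):
    """Replace every missing `key` with the mean of the known values."""
    known = [r[key] for r in rows if r.get(key) is not None]
    if not known:
        # Nothing to average, so return the rows unchanged.
        return [dict(r) for r in rows]
    return fill_with_constant(rows, key, mean(known))


rows = [{"salary": 1000}, {"salary": None}, {"salary": 3000}]
print(drop_incomplete(rows, "salary"))
print(fill_with_constant(rows, "salary", 0))
print(fill_with_mean(rows, "salary"))
```

Each function returns new rows instead of modifying the input, so the original data stays available for comparison.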

How to Handle Noisy Data?

Binning
Regression
Clustering
Combined computer and human inspection.

What is Binning?

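Binning is a technique in which we first sort the data and then partition it into equal-frequency bins, so that each bin holds the same number of values. For example, twelve sorted values can be split into three bins of four values each: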

Bin 1	2, 3, 6, 8
Bin 2	14,16,18,24
Bin 3	26,28,30,32
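
The partitioning step can be sketched as follows. The function name equal_frequency_bins is hypothetical, and the handling of leftover values (they go to the first bins) is an assumption for the case where the values do not divide evenly.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins = []
    start = 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


values = [8, 2, 24, 3, 6, 14, 16, 18, 26, 28, 30, 32]
for number, bin_values in enumerate(equal_frequency_bins(values, 3), start=1):
    print(f"Bin {number}", bin_values)
```

With the twelve values used here, the three bins come out as 2, 3, 6, 8; 14, 16, 18, 24; and 26, 28, 30, 32, which matches the table above.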

Types of binning:

There are several ways to smooth the values within each bin. Some of them are as follows:

Smoothing by bin means: every value in a bin is replaced by the mean of that bin.

Bin 1	4.75, 4.75, 4.75, 4.75
Bin 2	18,18,18,18
Bin 3	29,29,29,29
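
As a check on the numbers above, smoothing by bin means can be sketched like this. The function name smooth_by_means is hypothetical, and the sketch assumes every bin contains at least one value.

```python
from statistics import mean


def smooth_by_means(bins):
    """Replace every value in each bin with that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]


bins = [[2, 3, 6, 8], [14, 16, 18, 24], [26, 28, 30, 32]]
for number, smoothed in enumerate(smooth_by_means(bins), start=1):
    print(f"Bin {number}", smoothed)
```

For Bin 1 the mean is (2 + 3 + 6 + 8) / 4 = 4.75, for Bin 2 it is 72 / 4 = 18, and for Bin 3 it is 116 / 4 = 29.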

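Smoothing by bin medians: every value in a bin is replaced by the median of that bin.

Smoothing by bin boundaries: every value in a bin is replaced by the closer of the bin's minimum and maximum values.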
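
The two remaining smoothing methods can be sketched in the same way, reusing the three bins from the binning example. The function names are hypothetical, each bin is assumed to be non-empty, and sending a value that is equally far from both boundaries to the lower one is an assumption of this sketch.

```python
from statistics import median


def smooth_by_medians(bins):
    """Replace every value in each bin with that bin's median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (an arbitrary choice).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


bins = [[2, 3, 6, 8], [14, 16, 18, 24], [26, 28, 30, 32]]
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

For Bin 1 the median is (3 + 6) / 2 = 4.5, and boundary smoothing gives 2, 2, 8, 8 because 3 is closer to 2 while 6 is closer to 8.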

Data cleaning steps

There are six major steps for data cleaning.
1. Monitoring the Errors
It is very important to monitor where errors come from and to identify which source is responsible for most of them.
2. Standardization of the Mining Processes
We standardize the point of entry and check its importance. When we standardize the data process, it leads to a