Data Preprocessing

Badal Kumar
May 15, 2020

  • Data preprocessing is the process of converting raw data into meaningful, usable data using a range of techniques.
  • Data preprocessing (also called data preparation or data cleaning) is the process of detecting and correcting (or removing) corrupt or inaccurate records in a dataset. It involves identifying incorrect, incomplete, or irrelevant parts of the data and then modifying, replacing, or deleting the dirty or coarse data.

What is Data Preprocessing?

When we talk about data, we usually think of large datasets with a huge number of rows and columns. While that is a likely scenario, it is not always the case: data can come in many different forms, such as structured tables, images, audio files, and videos.

Machines don’t understand free text, image, or video data as it is; they understand 1s and 0s. So it won’t be enough to put on a slideshow of all our images and expect our machine learning model to get trained just by that!

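Why is Data Preprocessing Important?

Data in the real world is dirty. It is often:

  • Incomplete: some attribute values are missing
  • Noisy: it contains errors or outliers
  • Inconsistent: the same thing is recorded in conflicting ways
  • Duplicate: the same record appears more than once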

Major steps in Data Preprocessing

  • Data Cleaning
  • Data Integration
  • Data Reduction
  • Data Transformation
  • Data Discretization

Data Cleaning

Data cleaning means filling in missing values, smoothing out noise while identifying outliers, and correcting inconsistencies in the data.
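To make this concrete, here is a minimal sketch of two of these cleaning steps, using only the Python standard library. The shirt records, the choice of the median as the fill value, and the function name `clean` are assumptions made for illustration; smoothing noise and handling outliers are not covered here.

```python
from statistics import median

# Hypothetical raw records: (shirt_id, price); None marks a missing price.
raw_rows = [
    ("s1", 20.0),
    ("s2", None),
    ("s3", 30.0),
    ("s3", 30.0),  # exact duplicate of the previous row
    ("s4", 40.0),
]


def clean(rows):
    """Drop exact duplicate rows, then fill missing prices with the median."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique_rows.append(row)

    # The median is computed from the prices that are actually present.
    known_prices = [price for _, price in unique_rows if price is not None]
    fill_value = median(known_prices)
    return [
        (shirt_id, fill_value if price is None else price)
        for shirt_id, price in unique_rows
    ]


cleaned_rows = clean(raw_rows)
print(cleaned_rows)
# [('s1', 20.0), ('s2', 30.0), ('s3', 30.0), ('s4', 40.0)]
```

The median is just one reasonable fill strategy; depending on the data, the mean, the most frequent value, or simply dropping the row may be a better fit.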

Data Integration

Data integration is a technique that merges data from multiple sources into a coherent data store, such as a data warehouse.
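As a small illustration, the sketch below merges two hypothetical sources (a shirt catalog and a separate ratings feed) into one record per shirt. The data, the key name `shirt_id`, and the function `integrate` are made up for this example, and a real warehouse load would involve far more than a dictionary merge.

```python
# Hypothetical sources: a catalog keyed by shirt id and a separate ratings feed.
catalog = {
    "s1": {"color": "red", "price": 20.0},
    "s2": {"color": "blue", "price": 25.0},
}
ratings = {
    "s1": {"rating": "good"},
    "s3": {"rating": "poor"},
}


def integrate(left, right):
    """Merge two keyed sources into one record per key (an outer join)."""
    merged = {}
    for key in sorted(left.keys() | right.keys()):
        record = {"shirt_id": key}
        record.update(left.get(key, {}))   # fields from the first source, if any
        record.update(right.get(key, {}))  # fields from the second source, if any
        merged[key] = record
    return merged


combined = integrate(catalog, ratings)
for record in combined.values():
    print(record)
```

Shirts that appear in only one source keep whatever fields that source provides, so the merged records can still be incomplete and may need the cleaning step described earlier.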

Data Reduction

Data reduction is a technique used to reduce the size of the data by, for instance, aggregating records, eliminating redundant features, or clustering.
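Aggregation is the easiest of these to show in a few lines. The sketch below rolls hypothetical daily sales records up into monthly totals, so five rows become two. The data and the function name `aggregate_by_month` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical daily sales records: (date as "YYYY-MM-DD", units sold).
daily_sales = [
    ("2020-04-29", 3),
    ("2020-04-30", 5),
    ("2020-05-01", 2),
    ("2020-05-02", 4),
    ("2020-05-03", 1),
]


def aggregate_by_month(rows):
    """Reduce daily rows to one total per month."""
    totals = defaultdict(int)
    for date, units in rows:
        month = date[:7]  # "YYYY-MM" prefix of the date string
        totals[month] += units
    return dict(totals)


monthly_sales = aggregate_by_month(daily_sales)
print(monthly_sales)
# {'2020-04': 8, '2020-05': 7}
```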

Data Transformation

Data transformation means that data are transformed or consolidated into forms appropriate for ML model training. For example, normalization may be applied, where data are scaled to fall within a smaller range such as 0.0 to 1.0.
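One common way to scale values into the 0.0 to 1.0 range is min-max normalization, which maps each value x to (x - min) / (max - min). The sketch below is a plain-Python version under that assumption; the price list and the function name `min_max_scale` are hypothetical.

```python
def min_max_scale(values):
    """Scale numeric values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


prices = [10.0, 20.0, 25.0, 50.0]  # hypothetical shirt prices
scaled_prices = min_max_scale(prices)
print(scaled_prices)
# [0.0, 0.25, 0.375, 1.0]
```

Note that min-max scaling is sensitive to outliers: a single very large price would squeeze all the other values close to 0.0.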

Data Discretization

  • Data discretization transforms numeric data by mapping values to interval labels or concept labels.
  • It can be used to reduce the number of values of a continuous attribute by dividing the range of the attribute into intervals.
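The sketch below shows one simple way to do this, equal-width binning, where the range of the attribute is split into intervals of the same size. The prices, the labels "low", "medium", and "high", and the function name `equal_width_bins` are assumptions chosen for illustration; other binning schemes (for example, equal-frequency bins) are equally valid.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


prices = [10.0, 18.0, 25.0, 33.0, 40.0]  # hypothetical shirt prices
labels = ["low", "medium", "high"]       # hypothetical concept labels
bin_indexes = equal_width_bins(prices, n_bins=len(labels))
price_bands = [labels[i] for i in bin_indexes]
print(price_bands)
# ['low', 'low', 'medium', 'high', 'high']
```

Here five distinct prices are reduced to three labels, which is exactly the reduction in the number of values described above.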

Features in Machine Learning

  • A feature is an attribute or property shared by all of the independent units on which analysis or prediction is to be done.
  • Feature engineering is the process of using domain knowledge to create or extract new features from existing ones in order to improve the performance of a machine learning model.
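As a tiny example of feature engineering, the sketch below derives a new feature, the price actually paid, from two existing ones. It assumes the discount is stored as a fraction of the price (0.25 meaning 25% off); the records and the names `add_final_price` and `final_price` are hypothetical.

```python
# Hypothetical shirt records; discount is assumed to be a fraction of the price.
shirts = [
    {"price": 20.0, "discount": 0.25},
    {"price": 40.0, "discount": 0.5},
]


def add_final_price(record):
    """Return a copy of the record with a derived 'final_price' feature."""
    enriched = dict(record)  # copy so the original record is left untouched
    enriched["final_price"] = record["price"] * (1 - record["discount"])
    return enriched


engineered = [add_final_price(shirt) for shirt in shirts]
print(engineered)
# [{'price': 20.0, 'discount': 0.25, 'final_price': 15.0},
#  {'price': 40.0, 'discount': 0.5, 'final_price': 20.0}]
```

Whether such a derived feature actually helps depends on the model and the problem, so it is worth checking its effect rather than assuming it.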

There are different types of features that we can come across when we deal with data.

Statistical Data Types

Let’s take an example dataset and then look at each of these categories.

Here is an example from e-commerce: an e-catalog containing information about shirts (the shirts are the objects we are interested in).

Each of these shirts can be described by various attributes: color, pattern, size, rating, price, and discount. There is variety even in this small set of attributes: some are numbers, some are labels, and some are a special kind of label, like size, that has an increasing or decreasing order. All of these need to be categorized into different buckets, which we will do below, starting with categorical features.

Categorical

Categorical features are those which describe the object under consideration using a finite set of discrete classes. So, color takes one of ’n’ categories. Similarly, the pattern takes one of a fixed set of categories; we cannot have infinitely many patterns here. Sizes and ratings likewise belong to fixed sets of categories.

Within these, we again need to distinguish between nominal and ordinal. Let’s look at Color and Pattern. These are just classes, and the interesting thing here is that there is no natural ordering in these attributes. We cannot say that Red is greater than Blue, which in turn is greater than Green, or that Plain is greater than Checkered, which is greater than Striped, in the numerical sense in which 5 is greater than 4. Since this kind of natural ordering is not possible for attributes like Color and Pattern, they are categorized as Nominal features.

  • Nominal: Nominal features are categorical features in which there is no natural ordering among the values that an attribute can take.
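Because nominal values have no order, turning them into numbers like 1, 2, 3 would suggest an ordering that is not there. One common workaround, which goes slightly beyond what is covered above, is one-hot encoding: each category gets its own 0/1 column. The sketch below is a minimal plain-Python version; the color list and the function name `one_hot` are assumptions for illustration.

```python
def one_hot(values):
    """Encode nominal values as 0/1 vectors, one position per category."""
    # Sorting only fixes the column order; it does not imply any ranking.
    categories = sorted(set(values))
    encoded = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, encoded


colors = ["red", "blue", "green", "blue"]  # hypothetical shirt colors
categories, encoded_colors = one_hot(colors)
print(categories)      # ['blue', 'green', 'red']
print(encoded_colors)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```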

Now, compare this with the other kind of categorical features that we have, namely Size and Rating. They are again a fixed set of labels, so they are still categories, but there is a natural ordering among them. We know that small is less than medium, which in turn is less than large. Similarly, for ratings, we know that poor is less than okay, which is less than good, and so on. So, the qualitative attributes for which there is a natural ordering are classified as Ordinal attributes.

  • Ordinal: Ordinal features are categorical features in which there is a natural ordering among the values that a feature can take.
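For ordinal features, mapping labels to integers is reasonable as long as the integers follow the natural order. The sketch below assumes the order small < medium < large from the discussion above; the names `SIZE_ORDER` and `size_rank` are hypothetical, and the ranks only express order, not how far apart the sizes are.

```python
# The natural order of the labels, from smallest to largest.
SIZE_ORDER = ["small", "medium", "large"]
size_rank = {label: rank for rank, label in enumerate(SIZE_ORDER)}

sizes = ["medium", "small", "large", "small"]  # hypothetical shirt sizes
encoded_sizes = [size_rank[size] for size in sizes]
print(encoded_sizes)
# [1, 0, 2, 0]

# The ranks let us compare and sort labels in their natural order.
sorted_sizes = sorted(sizes, key=size_rank.get)
print(sorted_sizes)
# ['small', 'small', 'medium', 'large']
```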

Numerical

Numerical features are those which have numerical values and which are used to count or measure certain properties of a population. For example, Discount could be 0.01, 0.05, 0.1, 1, 1.2, and so on. It can actually take an infinite number of values, and that’s the key thing we need to understand here.

Features such as Price, Discount, Number of buttons, and Days for Delivery all have numerical values.

Within these attributes, Number of buttons and Days for Delivery happen to be non-fractional in this data (all values are integers). Numerical data that can take only whole-number values like this is referred to here as Interval data. Note that this distinction is more commonly called discrete versus continuous; in the standard levels of measurement, "interval" and "ratio" instead describe whether a numeric scale has a true zero.

  • Interval: Interval features, as the term is used here, are numerical features that can take only whole-number (integer) values.

Now compare this with ratio data, where we have fractions. A price could be, say, $23.99 or Rs. 525.50. Similarly, the discount could be a fractional number. Data that can also take fractional values is referred to here as Ratio numerical data.

  • Ratio: Ratio features, as the term is used here, are numerical features that can take fractional values (real numbers).

Not every value of Price has to be fractional, but as long as the feature can take some fractional values, we would call it a ratio feature.
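A quick way to see which of the two groups a numeric column falls into is to check whether any observed value is fractional. The sketch below does exactly that; the column values and the function name `numeric_kind` are hypothetical. As noted above, the check is only a hint: a price column whose observed values all happen to be whole numbers would be reported as whole-number, even though prices can be fractional.

```python
def numeric_kind(values):
    """Return 'whole-number' if every observed value is integral, else 'fractional'."""
    if all(float(value).is_integer() for value in values):
        return "whole-number"
    return "fractional"


buttons = [6, 7, 8, 6]          # hypothetical button counts
prices = [23.99, 525.50, 30.0]  # hypothetical prices

print(numeric_kind(buttons))  # whole-number
print(numeric_kind(prices))   # fractional
```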

So, as is clear from the above discussion, knowing the data types helps us perform the correct analysis on the data.

In this article, I wanted to give a solid introduction to the concepts of data preprocessing, which is a crucial step in any machine learning process. I hope this was useful to you.

References: PadhAI

Badal Kumar

I'm working as a System Engineer at Infosys Limited