Creating Data Transformation Pipelines

Swapnil Sharma
2 min read · Apr 24, 2021


During analysis we often need to apply several transformations to the data before an algorithm can consume it. For example, we may need to scale numerical values, impute missing values, and apply one-hot encoding. Doing these steps one by one is tedious, and it is easy to forget one of the required transformations.

An efficient way to handle this is to create a transformation pipeline. Below we will see how to create one using the scikit-learn library.

Let me illustrate the process by building a pipeline that imputes missing values and applies one-hot encoding.

First, we need to import the imputer, the encoder and the column transformer from the sklearn library:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

Let's say we have a data frame ‘car_sales’ that has the following attributes: Price, Odometer_Reading, Make, Color, No_of_Doors

where Price and Odometer_Reading are numerical and Make, Color, No_of_Doors are categorical.

So we need to create one list for the numerical attributes and one for the categorical attributes:

num_features = ['Price', 'Odometer_Reading']
cat_features = ['Make','Color','No_of_Doors']

Now let's create an instance of the column transformer:

transformer = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_features),
    ("one_hot", OneHotEncoder(), cat_features),
])

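The transformer above does two things: it fills missing values in the numerical attributes with each column's median, and it performs one-hot encoding on the categorical attributes. Each step runs only on its own list of columns, and the results are concatenated side by side. By default, any column not named in either list is dropped from the output.

Now we just have to fit the transformer and apply it: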

transformed_car_sales = transformer.fit_transform(car_sales)

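That's it! ‘transformed_car_sales’ now holds the imputed numerical columns followed by the one-hot encoded columns, and is ready for the algorithm. Note that, by default, the result is a NumPy array (or a SciPy sparse matrix when the output is mostly zeros) rather than a data frame.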
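To try the steps above end to end, here is a minimal sketch on a tiny data frame. The values in ‘car_sales’ are made up purely for illustration; only the column names come from the example above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the values are invented for this sketch.
car_sales = pd.DataFrame({
    "Price": [15000.0, 22000.0, np.nan, 9000.0],
    "Odometer_Reading": [60000.0, np.nan, 30000.0, 120000.0],
    "Make": ["Toyota", "Honda", "Toyota", "BMW"],
    "Color": ["White", "Blue", "White", "Red"],
    "No_of_Doors": [4, 4, 3, 5],
})

num_features = ["Price", "Odometer_Reading"]
cat_features = ["Make", "Color", "No_of_Doors"]

transformer = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_features),
    ("one_hot", OneHotEncoder(), cat_features),
])

transformed_car_sales = transformer.fit_transform(car_sales)

# The result may come back as a sparse matrix; convert it for easy inspection.
if hasattr(transformed_car_sales, "toarray"):
    transformed_car_sales = transformed_car_sales.toarray()

# 2 numerical columns + 3 makes + 3 colors + 3 door counts = 11 columns
print(transformed_car_sales.shape)
```

The first two output columns are the imputed numerical attributes, and the remaining columns are the one-hot indicators, in the order the transformers were listed.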

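In fact, you can define as many transformations as needed. For example, we could also have included a scaling transformation for the numerical attributes. To apply several transformations to the same columns, chain them in a Pipeline (from sklearn.pipeline) and use that pipeline as the transformer for those columns. Care should be taken that the steps are in the correct order, because each step receives the output of the previous one.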
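As a rough sketch of how scaling could be added, the numerical columns below go through a small Pipeline that imputes first and then scales. The choice of StandardScaler, the step names, and the toy values are assumptions made for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: the values are invented for this sketch.
car_sales = pd.DataFrame({
    "Price": [15000.0, 22000.0, np.nan, 9000.0],
    "Odometer_Reading": [60000.0, np.nan, 30000.0, 120000.0],
    "Make": ["Toyota", "Honda", "Toyota", "BMW"],
    "Color": ["White", "Blue", "White", "Red"],
    "No_of_Doors": [4, 4, 3, 5],
})

num_features = ["Price", "Odometer_Reading"]
cat_features = ["Make", "Color", "No_of_Doors"]

# Steps run in the order listed: impute missing values, then scale.
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),  # assumed scaler; others would work too
])

scaled_transformer = ColumnTransformer([
    ("num", num_pipeline, num_features),
    ("one_hot", OneHotEncoder(), cat_features),
])

scaled_car_sales = scaled_transformer.fit_transform(car_sales)

# The result may come back as a sparse matrix; convert it for easy inspection.
if hasattr(scaled_car_sales, "toarray"):
    scaled_car_sales = scaled_car_sales.toarray()

# The two numerical columns come first and are now standardized.
num_part = scaled_car_sales[:, :2].astype(float)
print(num_part.mean(axis=0), num_part.std(axis=0))
```

After fitting, the numerical columns should have a mean close to 0 and a standard deviation close to 1, while the one-hot columns are unchanged.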
