Python HOW: Scikit-learn 0.20 Optimal Pipeline and Best Practices

By the end of this article you’ll be a master Sklearn plumber. You’ll know how to pipe in numerical and categorical attributes without having to use Pandas get_dummies or Sklearn FeatureUnion.
1. Install/Update
At the time of writing this post, 0.21.2 was the latest release of sklearn. Check the docs for dependencies and either install or update
conda install scikit-learn==0.21.2
conda update scikit-learn==0.21.2
2. Toy dataset
We’ll use a sample dataset of audience churn with 1000 instances and 19 attributes (10 numerical and 9 categorical). You can download AudienceChurn.dataSample.csv from here (click Clone or download > Download ZIP > extract), and you can read its description here.
Let’s read the csv file into a DataFrame and print its information:
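A minimal sketch of this step (assuming the CSV has been extracted into your working directory):

```python
import pandas as pd

# Read the sample churn data into a DataFrame
df = pd.read_csv('AudienceChurn.dataSample.csv')

# Print column names, dtypes, and non-null counts
df.info()
```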
You can see that there are 11 numerical attributes (9 float64 and 2 int64), with ‘churned’ being the target class, and 9 categorical attributes (9 object). Both the numerical and categorical attributes have some nulls.
3. Split Data
Next, we will separate the target class from the rest of the attributes and split the data 70/30 for training/testing. The key point here is to keep the data as pandas Series/DataFrames:
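A sketch of this step, assuming the DataFrame df from the previous step and the target column ‘churned’ described above (the random_state is an arbitrary choice):

```python
from sklearn.model_selection import train_test_split

# Separate the target class from the rest of the attributes,
# keeping both as pandas objects (DataFrame/Series)
X = df.drop('churned', axis=1)
y = df['churned']

# 70/30 split for training/testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```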
4. Build Pipelines for Attributes
Let’s build 2 separate transformation pipelines, one for numerical attributes and another for categorical attributes, as different steps are needed for each (you can build as many pipelines as you need, really).
For the numerical attributes transformer, we will use a list of 2 steps:
- Step 1: impute the data using SimpleImputer. I call this step ‘imputer’
- Step 2: scale the data using StandardScaler. I call this step ‘scaler’
For the categorical attributes transformer, we will also use a list of 2 steps:
- Step 1: impute the data using SimpleImputer. I call this step ‘imputer’
- Step 2: encode the data using OneHotEncoder. I call this step ‘onehot’
Note: you can choose any name you like for your steps. If you would rather have this done for you, you can use make_pipeline instead, which names the steps after their functions automatically.
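A sketch of both pipelines along those lines (the step names match the ones above; the imputation strategies are just my assumptions):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numerical attributes: impute missing values, then scale
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Categorical attributes: impute missing values, then one-hot encode
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
```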
5. Compose Pipelines into One using Column Transformer
Now that we have both pipelines, we can compose them into one using ColumnTransformer (new in version 0.20) with 2 transformers:
- Transformer 1, which I call ‘num’, consists of the numerical attributes pipeline num_transformer and a list of the numerical attributes we want to transform (which I got from X by selecting non-object data types)
- Transformer 2, which I call ‘cat’, consists of the categorical attributes pipeline cat_transformer and a list of the categorical attributes we want to transform (which I got from X by selecting object data types)
Note: you can choose specific attributes to feed into any transformer by passing them as a list (e.g. ['productions', 'tickets', …]). In this case, the remainder argument tells the ColumnTransformer what to do with the rest of the attributes in the DataFrame that are not fed in (‘drop’ will drop them, while ‘passthrough’ will keep them without applying any transformations).
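A sketch of the composed preprocessor, with the attribute lists selected from X by dtype as described above:

```python
from sklearn.compose import ColumnTransformer

# Numerical attributes: non-object dtypes in X
num_attribs = X.select_dtypes(exclude=['object']).columns.tolist()
# Categorical attributes: object dtypes in X
cat_attribs = X.select_dtypes(include=['object']).columns.tolist()

# Compose the two pipelines into a single transformer
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, num_attribs),
    ('cat', cat_transformer, cat_attribs)],
    remainder='drop')
```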
6. Build a Full Pipeline for the Model and the Composed Attributes
Let’s fit a simple DecisionTreeClassifier to our data (decision trees don’t really require feature scaling or centring). To do this, we can simply add a second Pipeline with a list of 2 steps:
- Step 1, which I call ‘preprocessor’, contains the composed attributes preprocessor from the previous section
- Step 2, which I call ‘classifier’, contains the model object we created, tree
Note: I use presort=True to speed up training (don’t use it for larger training sets, as it will slow down training considerably).
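A sketch of the model object and the full pipeline (random_state is my addition for reproducibility; presort was removed from scikit-learn in later releases):

```python
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# The model object (presort=True to speed up training, as noted above)
tree = DecisionTreeClassifier(presort=True, random_state=42)

# Full pipeline: composed preprocessing, then the classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', tree)])
```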
7. Create the Ultimate Grid Search!
The last piece of the jigsaw is to create a GridSearchCV with the previous pipeline and a dictionary of parameters params for the exhaustive grid search with cross-validation (a process known as hyper-parameter tuning, which you can read more about here).
Note: I use roc_auc (Area Under the Receiver Operating Characteristic Curve) as the scoring metric.
Pay attention to the way the params dictionary is written:
- A key is a string of the name of the model step + 2 underscores + the specific parameter to search, for example classifier__criterion
- A value is a list of all the parameter values you want to search
In fact, you can also search possible parameters for any of the steps in the initial Pipeline in Sec. 4 by traversing back to that step, separating transformer/step names using 2 underscores. For example, to search the strategy parameter for the imputer step in num_transformer we can write: ‘preprocessor__num__imputer__strategy’: [‘median’, ‘mean’]
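A sketch of the grid search along those lines (the particular parameter values and cv=5 are illustrative choices, not the article’s exact grid):

```python
from sklearn.model_selection import GridSearchCV

# Keys: '<step name>__<parameter>' (2 underscores); values: lists to search
params = {
    'classifier__criterion': ['gini', 'entropy'],
    'preprocessor__num__imputer__strategy': ['median', 'mean']}

# Exhaustive grid search with cross-validation, scored by ROC AUC
classifier_gs = GridSearchCV(pipeline, params, cv=5, scoring='roc_auc')
```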
Note: you can create a RandomizedSearchCV in a similar way to GridSearchCV.
8. Fit Model
Now you can call the fit method on classifier_gs using the training DataFrames. This will run the entire pipeline, transforming the training data and then fitting the model (the fitted transformers are saved so they can later be applied to any test data):
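Something like the following, using the training DataFrames from Sec. 3:

```python
# Transforms the training data with the preprocessor,
# then fits (and cross-validates) the classifier on it
classifier_gs.fit(X_train, y_train)
```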
9. Access Results
Most of the results are saved as GridSearchCV attributes. The ones we are interested in are best_score_, best_params_, and best_estimator_.
But how can we access the attributes of a step or transformer in any of the previous pipelines?
- A step in a Pipeline is accessed by its name through the named_steps attribute. For example, to access OneHotEncoder we can use: named_steps['onehot']
- A transformer in a ColumnTransformer is accessed by its name through the named_transformers_ attribute. For example, to access cat_transformer we can use: named_transformers_['cat']
For example, to get the encoded feature names from the onehot step in cat_transformer in Sec. 4, we traverse back to that step from the best_estimator_ as follows:
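A sketch of that traversal, assuming the cat_attribs list from Sec. 5 (in the scikit-learn version used here, OneHotEncoder exposes get_feature_names; later releases renamed it get_feature_names_out):

```python
# best_estimator_ -> 'preprocessor' step -> 'cat' transformer -> 'onehot' step
onehot = (classifier_gs.best_estimator_
          .named_steps['preprocessor']
          .named_transformers_['cat']
          .named_steps['onehot'])

# Encoded feature names derived from the categorical attribute names
encoded_cat_attribs = onehot.get_feature_names(cat_attribs)
```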
And to plot the decision tree in the classifier step in Sec. 6 (you need to first download graphviz and make sure the executable is in your system PATH):
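A sketch along those lines, assuming the variables from the previous snippets (the ColumnTransformer outputs the numerical attributes first, then the encoded categorical ones):

```python
import graphviz
from sklearn.tree import export_graphviz

# The fitted tree lives in the 'classifier' step of the best pipeline
best_tree = classifier_gs.best_estimator_.named_steps['classifier']

# Feature names in the order the ColumnTransformer outputs them
feature_names = list(num_attribs) + list(encoded_cat_attribs)

# Export the tree to DOT format and render it with graphviz
dot_data = export_graphviz(best_tree, out_file=None,
                           feature_names=feature_names,
                           class_names=[str(c) for c in best_tree.classes_],
                           filled=True, rounded=True)
graphviz.Source(dot_data)
```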

The decision tree, plotted with the proper encoded feature names (instead of the default X0, …, Xn)
TL;DR: full code
The full code for the optimal Pipeline is on my GitHub here. It uses a decision tree classifier with sample churn data as an example.
Happy coding!