Python HOW: Scikit-learn 0.20 Optimal Pipeline and Best Practices


By the end of this article you’ll be a master Sklearn plumber: you’ll know how to pipe in numerical and categorical attributes without having to use Pandas get_dummies or Sklearn FeatureUnion.

1. Install/Update

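conda install scikit-learn==0.21.2

If scikit-learn is already installed, the same command switches it to 0.21.2; conda update accepts only a bare package name and cannot pin a version.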

2. Toy dataset
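
A minimal sketch of a toy dataset for this step. The column names (age, salary, gender, city, target) and all values are made up for illustration; any small table with a mix of numerical and categorical attributes and a few missing values would do.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed so the toy data is reproducible
n = 40

# Balanced binary target: 20 zeros and 20 ones, alternating.
target = np.tile([0, 1], n // 2)

df = pd.DataFrame({
    # numerical attributes (hypothetical)
    "age": rng.randint(18, 65, size=n).astype(float),
    "salary": 40000 + 20000 * target + rng.normal(0, 8000, size=n),
    # categorical attributes (hypothetical)
    "gender": rng.choice(["F", "M"], size=n),
    "city": rng.choice(["Paris", "London", "Berlin"], size=n),
    "target": target,
})

# Knock out a few values so the imputers have something to do later.
df.loc[[3, 17], "age"] = np.nan
df.loc[[5, 29], "city"] = np.nan

print(df.head())
print(df.isna().sum())
```

The target is tied loosely to salary so that a model has something to learn, and the classes are balanced so that stratified splits and ROC AUC scoring behave well on so few rows.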

3. Split Data
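
One way to split the data, sketched on the same hypothetical toy table (rebuilt here so the snippet runs on its own). The 25% test size and the random_state are arbitrary choices, not values from the original article.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data (same recipe as in the toy dataset sketch).
rng = np.random.RandomState(0)
n = 40
target = np.tile([0, 1], n // 2)
df = pd.DataFrame({
    "age": rng.randint(18, 65, size=n).astype(float),
    "salary": 40000 + 20000 * target + rng.normal(0, 8000, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "city": rng.choice(["Paris", "London", "Berlin"], size=n),
    "target": target,
})

X = df.drop(columns="target")
y = df["target"]

# stratify=y keeps the class ratio identical in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Splitting before any preprocessing matters here: the imputers, scaler and encoder are later fitted inside the pipeline on the training part only, so nothing leaks from the test set.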

4. Build Pipelines for Attributes
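
A sketch of one pipeline per attribute type. The step names ("imputer", "scaler", "encoder") and the chosen strategies are assumptions for illustration; the idea is only that numerical and categorical columns each get their own chain of transformers.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numerical attributes: fill missing values with the median, then standardize.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical attributes: fill missing values with a placeholder category,
# then one-hot encode. handle_unknown="ignore" avoids errors when a category
# appears at prediction time that was not seen during fit.
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Quick sanity check on tiny hand-made arrays.
num_out = num_pipeline.fit_transform(np.array([[1.0], [np.nan], [3.0]]))
cat_out = cat_pipeline.fit_transform(
    np.array([["a"], ["b"], [np.nan]], dtype=object)
)
print(num_out.ravel())
print(cat_out.shape)
```

OneHotEncoder works directly on string columns, which is what removes the need for Pandas get_dummies.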

5. Compose Pipelines into One using Column Transformer
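
A sketch of how ColumnTransformer can route each group of columns to its own pipeline. The transformer names ("num", "cat") and the four-row table are hypothetical; the pipelines are the same as in the per-attribute sketch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["age", "salary"]
cat_cols = ["gender", "city"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Each entry is (name, transformer, columns the transformer applies to).
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
])

# Tiny hypothetical table to show the resulting shape.
X = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "salary": [40000.0, 52000.0, 61000.0, 45000.0],
    "gender": ["F", "M", "F", "M"],
    "city": ["Paris", "London", np.nan, "Paris"],
})

Xt = preprocessor.fit_transform(X)
print(Xt.shape)  # 2 numerical + 2 gender + 3 city columns
```

Selecting columns by name inside ColumnTransformer is what replaces the older FeatureUnion plus custom column-selector pattern.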

6. Build a Full Pipeline for the Model and the Composed Attributes

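Because the dataset is small, we can set presort=True on the decision tree to speed up training. Don’t use it for larger training sets, where it slows training down considerably. Note that presort was deprecated in scikit-learn 0.22 and removed in 0.24, so this tip applies to the 0.21.2 release installed above.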
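
A sketch of the full pipeline, assuming a DecisionTreeClassifier as the model and the hypothetical toy data from earlier. The step names ("preprocessor", "model") are illustrative. presort is left out of the code so that the snippet also runs on releases where that parameter no longer exists.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data.
rng = np.random.RandomState(0)
n = 40
y = np.tile([0, 1], n // 2)
X = pd.DataFrame({
    "age": rng.randint(18, 65, size=n).astype(float),
    "salary": 40000 + 20000 * y + rng.normal(0, 8000, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "city": rng.choice(["Paris", "London", "Berlin"], size=n),
})
X.loc[[3, 17], "age"] = np.nan

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["age", "salary"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["gender", "city"]),
])

# On scikit-learn 0.21.x, presort=True could be passed to the classifier here.
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", DecisionTreeClassifier(random_state=42)),
])

full_pipeline.fit(X, y)
predictions = full_pipeline.predict(X)
print(predictions[:10])
```

With preprocessing and model in one object, a single fit or predict call runs every step in order on raw DataFrame input.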

7. Create the Ultimate Grid Search!

I used roc_auc (Area Under the Receiver Operating Characteristic Curve) as the scoring metric.
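
A sketch of a grid search over both preprocessing and model parameters. The grid values and cv=3 are arbitrary choices for illustration; the point is the double-underscore naming, which reaches through nested steps (pipeline step, then ColumnTransformer entry, then inner pipeline step, then parameter).

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same hypothetical pipeline layout as in the full pipeline sketch.
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["age", "salary"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["gender", "city"]),
])
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", DecisionTreeClassifier(random_state=42)),
])

# Keys follow the pattern <step>__<sub-step>__...__<parameter>.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__max_depth": [2, 3, 5],
    "model__min_samples_leaf": [1, 3],
}

grid_search = GridSearchCV(
    full_pipeline, param_grid, scoring="roc_auc", cv=3
)

# Every valid key can be listed with get_params(), which helps catch typos.
valid_keys = full_pipeline.get_params().keys()
print(sorted(k for k in valid_keys if k in param_grid))
```

Because the preprocessing lives inside the estimator being searched, each cross-validation fold refits the imputers, scaler and encoder on its own training part.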

8. Fit Model
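
A sketch of the fit step end to end. The data, column names, grid values and split settings are the same hypothetical ones used in the earlier sketches, repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data.
rng = np.random.RandomState(0)
n = 40
y = pd.Series(np.tile([0, 1], n // 2), name="target")
X = pd.DataFrame({
    "age": rng.randint(18, 65, size=n).astype(float),
    "salary": 40000 + 20000 * y.values + rng.normal(0, 8000, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "city": rng.choice(["Paris", "London", "Berlin"], size=n),
})
X.loc[[3, 17], "age"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["age", "salary"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["gender", "city"]),
])
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", DecisionTreeClassifier(random_state=42)),
])
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__max_depth": [2, 3, 5],
}
grid_search = GridSearchCV(full_pipeline, param_grid, scoring="roc_auc", cv=3)

# Fits every parameter combination on every fold, then (refit=True by default)
# refits the best combination on the whole training set.
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```

Only the training part is passed to fit; the test part stays untouched until the final evaluation.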

9. Access Results
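
A sketch of how the results can be read back from a fitted grid search, including the encoded feature names for the tree. It reuses the hypothetical data and pipeline from the earlier sketches. The name lookup is written defensively because the encoder method is get_feature_names on older releases such as 0.21 and get_feature_names_out on newer ones.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data.
rng = np.random.RandomState(0)
n = 40
y = pd.Series(np.tile([0, 1], n // 2), name="target")
X = pd.DataFrame({
    "age": rng.randint(18, 65, size=n).astype(float),
    "salary": 40000 + 20000 * y.values + rng.normal(0, 8000, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "city": rng.choice(["Paris", "London", "Berlin"], size=n),
})
X.loc[[3, 17], "age"] = np.nan
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

num_cols, cat_cols = ["age", "salary"], ["gender", "city"]
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), num_cols),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", DecisionTreeClassifier(random_state=42)),
])
param_grid = {"model__max_depth": [2, 3, 5]}
grid_search = GridSearchCV(full_pipeline, param_grid, scoring="roc_auc", cv=3)
grid_search.fit(X_train, y_train)

# Best parameters and the mean cross-validated ROC AUC they achieved.
print(grid_search.best_params_, grid_search.best_score_)

# ROC AUC on the held-out test set, using the refitted best pipeline.
test_auc = roc_auc_score(y_test, grid_search.predict_proba(X_test)[:, 1])
print(test_auc)

# Recover the encoded feature names in the order the tree sees them.
best = grid_search.best_estimator_
encoder = (best.named_steps["preprocessor"]
           .named_transformers_["cat"].named_steps["encoder"])
get_names = getattr(encoder, "get_feature_names_out", None) or encoder.get_feature_names
feature_names = num_cols + [str(name) for name in get_names(cat_cols)]

# Text rendering of the tree with readable names instead of feature_0, ...
tree = best.named_steps["model"]
print(export_text(tree, feature_names=feature_names))
```

The names line up with the tree's inputs because ColumnTransformer concatenates its outputs in the order the transformers are listed, numerical first and one-hot columns second.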

The decision tree with properly encoded feature names (instead of the default X0, …, Xn)

TL;DR: full code

Written by

I’m an end-to-end data scientist and a Python educator. Most of my articles start after saying “I wish someone had written about this!” Maybe I should?
