Python HOW: Scikit-learn 0.20 Optimal Pipeline and Best Practices

By the end of this article you’ll be a master Sklearn plumber. You’ll know how to pipe in numerical and categorical attributes without having to use Pandas get_dummies or Sklearn FeatureUnion.
1. Install/Update
At the time of writing this post, 0.21.2 was the latest release of sklearn. Check the docs for dependencies and either install or update
conda install scikit-learn==0.21.2
conda update scikit-learn==0.21.2
2. Toy dataset
We’ll use a sample dataset of audience churn with 1000 instances and 19 attributes (10 numerical and 9 categorical). You can download AudienceChurn.dataSample.csv from here (click Clone or download > Download ZIP > extract), and you can read its description here.
Let’s read the csv file into a DataFrame and print its information:
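A minimal sketch of this step (assuming the CSV has been extracted into your working directory):

```python
import pandas as pd

# Read the sample churn data into a DataFrame
df = pd.read_csv('AudienceChurn.dataSample.csv')

# Print column names, dtypes, and non-null counts
df.info()
```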
You can see that there are 11 numerical attributes (9 float64 and 2 int64), with ‘churned’ being the target class, and 9 categorical attributes (9 object). Both the numerical and categorical attributes have some nulls.
3. Split Data
Next, we will separate the target class from the rest of the attributes and split the data 70/30 for training/testing. The key point here is to keep the data as pandas Series/DataFrames:
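A sketch of this step, assuming the DataFrame df from the previous step and the target column ‘churned’ described above (the random_state is an arbitrary choice):

```python
from sklearn.model_selection import train_test_split

# Separate the target class from the rest of the attributes,
# keeping both as pandas objects (DataFrame/Series)
X = df.drop('churned', axis=1)
y = df['churned']

# 70/30 split for training/testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```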
4. Build Pipelines for Attributes
Let’s build 2 separate transformation pipelines, one for numerical attributes and another for categorical attributes, as different steps are needed for each (you can build as many pipelines as you need, really).
For the numerical attributes transformer, we will use a list of 2 steps:
- Step 1: impute the data using SimpleImputer. I call this step ‘imputer’
- Step 2: scale the data using StandardScaler. I call this step ‘scaler’
For the categorical attributes transformer, we will also use a list of 2 steps:
- Step 1: impute the data using SimpleImputer. I call this step ‘imputer’
- Step 2: encode the data using OneHotEncoder. I call this step ‘onehot’
Note: you can choose any name you like for your steps. If you would rather have this done for you, you can use make_pipeline instead, which names the steps after their functions automatically.
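A sketch of both pipelines along those lines (the step names match the ones above; the imputation strategies are just my assumptions):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numerical attributes: impute missing values, then scale
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Categorical attributes: impute missing values, then one-hot encode
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
```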
5. Compose Pipelines into One using Column Transformer
Now that we have both pipelines, we can compose them into one using ColumnTransformer (new in version 0.20) with 2 transformers:
- Transformer 1, which I call ‘num’, consists of the numerical attributes pipeline num_transformer and a list of the numerical attributes we want to transform (which I got from X by selecting non-object data types)
- Transformer 2, which I call ‘cat’, consists of the categorical attributes pipeline cat_transformer and a list of the categorical attributes we want to transform (which I got from X by selecting object data types)
Note: you can choose specific attributes to feed into any transformer by passing them as a list (e.g. ['productions', 'tickets', …]). In this case, the remainder argument tells the ColumnTransformer what to do with the rest of the attributes in the DataFrame that are not fed in (‘drop’ will drop them, while ‘passthrough’ will keep them without applying any transformations).
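A sketch of the composed preprocessor, with the attribute lists selected from X by dtype as described above:

```python
from sklearn.compose import ColumnTransformer

# Numerical attributes: non-object dtypes in X
num_attribs = X.select_dtypes(exclude=['object']).columns.tolist()
# Categorical attributes: object dtypes in X
cat_attribs = X.select_dtypes(include=['object']).columns.tolist()

# Compose the two pipelines into a single transformer
preprocessor = ColumnTransformer(transformers=[
    ('num', num_transformer, num_attribs),
    ('cat', cat_transformer, cat_attribs)],
    remainder='drop')
```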
6. Build a Full Pipeline for the Model and the Composed Attributes
Let’s fit a simple DecisionTreeClassifier to our data (decision trees don’t really require feature scaling or centring). To do this, we can simply add a second Pipeline with a list of 2 steps:
- Step 1, which I call ‘preprocessor’, contains the composed attributes preprocessor from the previous section
- Step 2, which I call ‘classifier’, contains the model object we created, tree
Note: I use presort=True to speed up training (don’t use it for larger training sets, as it will slow down training considerably).
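A sketch of the model object and the full pipeline (random_state is my addition for reproducibility; presort was removed from scikit-learn in later releases):

```python
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# The model object (presort=True to speed up training, as noted above)
tree = DecisionTreeClassifier(presort=True, random_state=42)

# Full pipeline: composed preprocessing, then the classifier
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', tree)])
```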
7. Create the Ultimate Grid Search!
The last piece of the jigsaw is to create a GridSearchCV with the previous pipeline and a dictionary of parameters params for the exhaustive grid search with cross-validation (a process known as hyper-parameter tuning, which you can read more about here).
Note: I use roc_auc (Area Under the Receiver Operating Characteristic Curve) as the scoring metric.
Pay attention to the way the params dictionary is written:
- A key is a string of the name of the model step + 2 underscores + the specific parameter to search, for example classifier__criterion
- A value is a list of all the parameter values you want to search
In fact, you can also search possible parameters for any of the steps in the initial Pipeline in Sec. 4 by traversing back to that step, separating transformer/step names using 2 underscores. For example, to search the strategy parameter for the imputer step in num_transformer we can write: ‘preprocessor__num__imputer__strategy’: [‘median’, ‘mean’]
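A sketch of the grid search along those lines (the particular parameter values and cv=5 are illustrative choices, not the article’s exact grid):

```python
from sklearn.model_selection import GridSearchCV

# Keys: '<step name>__<parameter>' (2 underscores); values: lists to search
params = {
    'classifier__criterion': ['gini', 'entropy'],
    'preprocessor__num__imputer__strategy': ['median', 'mean']}

# Exhaustive grid search with cross-validation, scored by ROC AUC
classifier_gs = GridSearchCV(pipeline, params, cv=5, scoring='roc_auc')
```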
Note: you can create a RandomizedSearchCV in a similar way to GridSearchCV.
8. Fit Model
Now you can call the fit method on classifier_gs using the training DataFrames. This will run the entire pipeline, transforming the training data and then fitting the model (the fitted transformers are saved so they can later be applied to any test data):
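Something like the following, using the training DataFrames from Sec. 3:

```python
# Transforms the training data with the preprocessor,
# then fits (and cross-validates) the classifier on it
classifier_gs.fit(X_train, y_train)
```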
9. Access Results
Most of the results are saved as GridSearchCV attributes. The ones we are interested in are best_score_, best_params_, and best_estimator_.
But how can we access the attributes of a step or transformer in any of the previous pipelines?
- A step in a Pipeline is accessed by its name through the named_steps attribute. For example, to access OneHotEncoder we can use: named_steps['onehot']
- A transformer in a ColumnTransformer is accessed by its name through the named_transformers_ attribute. For example, to access cat_transformer we can use: named_transformers_['cat']
For example, to get the encoded feature names from the onehot step in cat_transformer in Sec. 4, we traverse back to that step from the best_estimator_ as follows:
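A sketch of that traversal, assuming the cat_attribs list from Sec. 5 (in the scikit-learn version used here, OneHotEncoder exposes get_feature_names; later releases renamed it get_feature_names_out):

```python
# best_estimator_ -> 'preprocessor' step -> 'cat' transformer -> 'onehot' step
onehot = (classifier_gs.best_estimator_
          .named_steps['preprocessor']
          .named_transformers_['cat']
          .named_steps['onehot'])

# Encoded feature names derived from the categorical attribute names
encoded_cat_attribs = onehot.get_feature_names(cat_attribs)
```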
And to plot the decision tree in the classifier step in Sec. 6 (you need to first download graphviz and make sure the executable is in your system PATH):
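A sketch along those lines, assuming the variables from the previous snippets (the ColumnTransformer outputs the numerical attributes first, then the encoded categorical ones):

```python
import graphviz
from sklearn.tree import export_graphviz

# The fitted tree lives in the 'classifier' step of the best pipeline
best_tree = classifier_gs.best_estimator_.named_steps['classifier']

# Feature names in the order the ColumnTransformer outputs them
feature_names = list(num_attribs) + list(encoded_cat_attribs)

# Export the tree to DOT format and render it with graphviz
dot_data = export_graphviz(best_tree, out_file=None,
                           feature_names=feature_names,
                           class_names=[str(c) for c in best_tree.classes_],
                           filled=True, rounded=True)
graphviz.Source(dot_data)
```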

The decision tree, plotted with the proper encoded feature names (instead of the default X0, …, Xn)
TL;DR: full code
The full code for the optimal Pipeline is on my GitHub here. It uses a decision tree classifier with sample churn data as an example.
Happy coding!