Python HOW: Scikit-learn Optimal Pipeline and Best Practices

Gabriel Harris Ph.D.
4 min read · Jun 10, 2019

Photo by Pixabay from Pexels

By the end of this article you’ll be a master Sklearn plumber: you’ll know how to pipe numerical and categorical attributes into a single pipeline without resorting to Pandas get_dummies or Sklearn FeatureUnion.
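A minimal sketch of what such plumbing can look like, assuming sklearn.compose.ColumnTransformer (available since scikit-learn 0.20) does the routing. The toy DataFrame, its column names, and the choice of StandardScaler, OneHotEncoder and LogisticRegression are illustrative assumptions, not the dataset or model used later in the article.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: column names and values are made up for illustration.
toy = pd.DataFrame({
    "age": [23, 45, 31, 52, 38, 27],
    "visits": [3, 10, 1, 7, 4, 2],
    "plan": ["free", "paid", "free", "paid", "paid", "free"],
    "region": ["north", "south", "south", "east", "north", "east"],
    "churn": [1, 0, 1, 0, 0, 1],
})
numeric_cols = ["age", "visits"]
categorical_cols = ["plan", "region"]

# Route each group of columns to its own transformer:
# scale the numerical ones, one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Chain preprocessing and the estimator so both are fitted together.
model = Pipeline([
    ("preprocess", preprocess),
    ("classify", LogisticRegression(solver="lbfgs")),
])

X = toy[numeric_cols + categorical_cols]
y = toy["churn"]
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the encoding lives inside the pipeline, the same categories learned during fit are applied at predict time, which is the main thing get_dummies on raw DataFrames does not guarantee.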

TL;DR: full code

1. Install/Update

At the time of writing, 0.21.2 was the latest release of sklearn. Check the docs for dependencies, then either install or update:

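# fresh install, pinned to the release used here
conda install scikit-learn=0.21.2
# or update an existing install (conda update takes a package name only, without a version pin)
conda update scikit-learn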
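To confirm which release ended up in your environment, a quick check from Python can help. This is a small sketch; the variable names are illustrative, and the 0.21 threshold simply mirrors the release mentioned above.

```python
import sklearn

# Version string of the installed package, e.g. "0.21.2".
installed = sklearn.__version__

# Compare only the major and minor components against 0.21.
major, minor = (int(part) for part in installed.split(".")[:2])
meets_article_version = (major, minor) >= (0, 21)

print("scikit-learn", installed)
if not meets_article_version:
    print("Older than 0.21: consider updating before following along.")
```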

2. Toy dataset

We’ll use a sample audience-churn dataset with 1,000 instances and 19 attributes: 10 numerical and 9 categorical. You can download AudienceChurn.dataSample.csv from here (click Clone or download > Download ZIP, then extract the archive), and you can read its description here.

Let’s read the CSV file into a DataFrame and print its information:
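A minimal sketch of that step, assuming the extracted AudienceChurn.dataSample.csv sits in the current working directory. The helper name load_and_describe is hypothetical and only wraps two standard pandas calls.

```python
from pathlib import Path

import pandas as pd


def load_and_describe(csv_source):
    """Read a CSV into a DataFrame and print its column summary."""
    df = pd.read_csv(csv_source)
    # info() prints column names, non-null counts, dtypes and memory usage.
    df.info()
    return df


# Assumed location of the downloaded file; adjust the path to your setup.
csv_path = Path("AudienceChurn.dataSample.csv")
if csv_path.exists():
    df = load_and_describe(csv_path)
else:
    print(f"{csv_path} not found; download and extract it first.")
```

In the info() output, numeric dtypes (int64, float64) typically mark the numerical attributes and object marks the categorical ones, which is a convenient way to check the 10/9 split described above.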
