{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dimensionality Reduction. UMAP\n",
"\n",
"## Introduction\n",
"\n",
"This notebook is intended to be a gentle introduction into the topic of dimensionality reduction. This is a powerful technique used to **explore the structure of high-dimensional data (i.e. lots of features) in a lower dimensional subspace**.\n",
"\n",
"For example, if a data set has 1000 dimensions/features, there is no way for us to visualise that data in 1000 dimenions because as humans we live and interact in a 3D world. However, we will see that there are ways to plot a meaningful representation of the data in 2 or 3 dimensions.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## UMAP\n",
"\n",
"UMAP, which stands for **Uniform Manifold Approximation and Projection** was proposed by Leland McInnes, John Healy and James Melville in their 2018 paper: https://arxiv.org/abs/1802.03426\n",
"\n",
"It learns a non-linear mapping that preserves clusters but its main advantage is that it is significantly faster that alternatives like t-SNE. It also tends to do better at preserving global structure of the data compared to t-SNE and PCA.\n",
"\n",
"UMAP is designed to be compatible with scikit-learn, making use of the same API and able to be added to sklearn pipelines. If you are already familiar with sklearn you should be able to use UMAP as a drop in replacement for t-SNE and other dimension reduction classes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install seaborn\n",
"#!pip install umap-learn"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import load_digits\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import pandas as pd\n",
"%matplotlib inline\n",
"\n",
"# for better graphics:\n",
"sns.set(style='white', context='notebook', rc={'figure.figsize':(14,10)})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Penguin dataset\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step is to get some data to work with. To ease us into things we’ll start with the [penguin dataset](https://github.com/allisonhorst/palmerpenguins). It is small both in number of points and number of features, and will let us get an idea of what the dimension reduction is doing."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | species | \n", "island | \n", "bill_length_mm | \n", "bill_depth_mm | \n", "flipper_length_mm | \n", "body_mass_g | \n", "sex | \n", "year | \n", "
---|---|---|---|---|---|---|---|---|
0 | \n", "Adelie | \n", "Torgersen | \n", "39.1 | \n", "18.7 | \n", "181.0 | \n", "3750.0 | \n", "male | \n", "2007 | \n", "
1 | \n", "Adelie | \n", "Torgersen | \n", "39.5 | \n", "17.4 | \n", "186.0 | \n", "3800.0 | \n", "female | \n", "2007 | \n", "
2 | \n", "Adelie | \n", "Torgersen | \n", "40.3 | \n", "18.0 | \n", "195.0 | \n", "3250.0 | \n", "female | \n", "2007 | \n", "
3 | \n", "Adelie | \n", "Torgersen | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "2007 | \n", "
4 | \n", "Adelie | \n", "Torgersen | \n", "36.7 | \n", "19.3 | \n", "193.0 | \n", "3450.0 | \n", "female | \n", "2007 | \n", "
UMAP(random_state=0, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})
\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"
\\n\"+\n \"