Text Analysis with UMAP#

In this notebook we will use UMAP to embed text. We will work with the 20 newsgroups dataset, a collection of forum posts labelled by topic. We will embed these documents and see that similar documents (i.e. posts in the same newsgroup) end up close together.

You can use this embedding for downstream tasks, such as visualizing your corpus or running a clustering algorithm over it.

import pandas as pd
import umap
import umap.plot

# Used to get the data
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Some plotting libraries
import matplotlib.pyplot as plt
%matplotlib notebook
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE
output_notebook(resources=INLINE)
dataset = fetch_20newsgroups(subset='all',
                             shuffle=True, random_state=42)
print(f'{len(dataset.data)} documents')
print(f'{len(dataset.target_names)} categories')
18846 documents
20 categories

Here are the categories of documents. As you can see, many are closely related to one another (e.g. ‘comp.sys.ibm.pc.hardware’ and ‘comp.sys.mac.hardware’), while others have little in common (e.g. ‘sci.med’ and ‘rec.sport.baseball’).

dataset.target_names
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Let’s check a sample of them:

for idx, document in enumerate(dataset.data[:2]):
    category = dataset.target_names[dataset.target[idx]]

    print(f'Category: {category}')
    print('---------------------------')
    # Print the first 500 characters of the post
    print(document[:500])
    print('---------------------------')
Category: rec.sport.hockey
---------------------------
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killin
---------------------------
Category: comp.sys.ibm.pc.hardware
---------------------------
From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)
Subject: Which high-performance VLB video card?
Summary: Seek recommendations for VLB video card
Nntp-Posting-Host: midway.ecn.uoknor.edu
Organization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA
Keywords: orchid, stealth, vlb
Lines: 21

  My brother is in the market for a high-performance video card that supports
VESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:

  - Diamond Stealth Pro Local 
---------------------------

Now we will create a dataframe with the target labels to be used in plotting. With interactive plotting, this lets us see the newsgroup of a point when hovering over it, which helps us evaluate (by eye) how good the embedding looks.

category_labels = [dataset.target_names[x] for x in dataset.target]
hover_df = pd.DataFrame(category_labels, columns=['category'])

Representing text in Machine Learning#

We need to convert the text into a numerical representation. There are many ways to do this, but we will use a simple bag of words representation. This is a simple count of the number of times each word appears in a document, without considering the order of the words.

(Figure: bag-of-words representation, turning a document into a vector of word counts.)
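
As a quick illustration, here is a minimal sketch on two made-up sentences (note that get_feature_names_out requires a recent scikit-learn; older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents, invented for this example
toy_docs = ["the cat sat on the mat",
            "the dog sat on the log"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

print(toy_vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(toy_matrix.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]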

We will use sklearn's CountVectorizer class to do this for us, along with a couple of other preprocessing steps:

  1. Split the text into tokens (i.e. words) by splitting on whitespace.

  2. Remove English stopwords (the, and, etc.) to reduce noise in the data.

  3. Remove all words that occur fewer than 5 times in the entire corpus (via the min_df parameter).

Reference: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

vectorizer = CountVectorizer(min_df=5, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(dataset.data)

Question: How many words are there in our final vocabulary?
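
One quick way to check is to inspect the shape of the document-term matrix (one row per document, one column per vocabulary word):

# Rows are documents, columns are vocabulary words
print(word_doc_matrix.shape)
# (18846, 34880) -> 34880 words in the vocabulary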

Now we are going to use UMAP to reduce the matrix from 34880 dimensions down to 2 (since n_components=2). UMAP needs a distance metric; we will use the Hellinger distance, which quantifies the difference between two probability distributions. Each document's word counts can be viewed as being generated by a multinomial distribution, so the Hellinger distance is a natural way to compare documents.
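
Concretely, if the word counts of two documents are normalized into probability distributions $p = (p_1, \ldots, p_k)$ and $q = (q_1, \ldots, q_k)$ over the vocabulary, the Hellinger distance is

$$H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2},$$

which ranges from 0 (identical distributions) to 1 (distributions with disjoint support).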

TLDR: Hellinger distance is a good distance metric for comparing texts represented as bags of words.

embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)

Let’s plot the embedding:

f = umap.plot.interactive(embedding, labels=dataset.target, hover_data=hover_df, point_size=3)
show(f)

As you can see, this does reasonably well. There is some separation, and groups that you would expect to be similar (such as ‘rec.sport.baseball’ and ‘rec.sport.hockey’) are close together. The big clump in the middle corresponds to a number of very similar newsgroups, like ‘comp.sys.ibm.pc.hardware’ and ‘comp.sys.mac.hardware’.

Applications#

Now that we have an embedding, there are several things we can do with it:

  • Explore/visualize your corpus to identify topics/trends

  • Cluster the embedding to find groups of related documents (see the sketch after this list)

  • Look for nearest neighbours to find related documents

  • Look for anomalous documents
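
As an example of the clustering item, here is a minimal sketch using k-means on the 2-D coordinates (the choice of KMeans and of 20 clusters is an illustrative assumption, not part of the original analysis):

from sklearn.cluster import KMeans

# The fitted UMAP model stores the 2-D coordinates in .embedding_
coords = embedding.embedding_

# 20 clusters -- a rough heuristic, one per newsgroup
kmeans = KMeans(n_clusters=20, random_state=42)
cluster_ids = kmeans.fit_predict(coords)

# See which newsgroup is most common in each cluster
clustered = pd.DataFrame({'cluster': cluster_ids, 'category': category_labels})
print(clustered.groupby('cluster')['category'].agg(lambda s: s.mode().iloc[0]))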

Exercise 1#

Use the Disneyland reviews dataset (CSV found on Blackboard) to create a UMAP embedding of the text of the reviews and visualize it.

  • As the color of the plot, you can use the review rating.

  • You may need to modify the hover_df structure to include the Review_Text, so you can read the text of the review when hovering over the plot.

Can you see any patterns in the plot?
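
If you want a starting point, here is a hedged sketch; the filename 'DisneylandReviews.csv', the encoding, and the column names 'Rating' and 'Review_Text' are assumptions about the CSV and may need adjusting:

# Starter sketch -- filename, encoding and column names are assumptions
reviews = pd.read_csv('DisneylandReviews.csv', encoding='latin-1')

# Include the rating and review text in the hover data
hover_df = reviews[['Rating', 'Review_Text']].copy()

vectorizer = CountVectorizer(min_df=5, stop_words='english')
review_matrix = vectorizer.fit_transform(reviews['Review_Text'])

review_embedding = umap.UMAP(n_components=2, metric='hellinger').fit(review_matrix)
f = umap.plot.interactive(review_embedding, labels=reviews['Rating'],
                          hover_data=hover_df, point_size=3)
show(f)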

Exercise 2#

Use the Philosophy text collection (.zip file found on Blackboard) to create a UMAP embedding of the text of the documents and visualize it.

  • As the color of the plot, you can use the author, which is the first word of the filename.

  • You may need to modify the hover_df structure to include the filename, so you can read the name of the book when hovering over the plot.

Can you identify schools of philosophy in the plot?

import glob

import pandas as pd

text_files = glob.glob('./phil_txts/*.txt')

data = []

for text_file in text_files:
    print(text_file)
    with open(text_file, 'r', encoding='iso-8859-1') as f:
        text = f.read()

        data.append({
            "text": text,
            # The author is the first part of the filename,
            # e.g. 'kant_pure_reason.txt' -> 'kant'
            "author": text_file.split('/')[-1].split('_')[0]
        })
./phil_txts/husserl_cartesian_meditations.txt
./phil_txts/wittgenstein_certainty.txt
./phil_txts/beauvoir_second_sex.txt
./phil_txts/lewis_papers_1.txt
./phil_txts/lenin_essential_works.txt
./phil_txts/foucault_birth_of_clinic.txt
./phil_txts/lewis_papers_2.txt
./phil_txts/hobbes_leviathan.txt
./phil_txts/foucault_history_of_madness.txt
./phil_txts/deleuze_difference_repetition.txt
./phil_txts/kant_pure_reason.txt
./phil_txts/kant_critique_judgement.txt
./phil_txts/derrida_writing_difference.txt
./phil_txts/anselm_proslogion.txt
./phil_txts/wittgenstein_tractatus.txt
./phil_txts/marx_manifesto.txt
./phil_txts/keynes_theory_of_employment.txt
./phil_txts/husserl_crisis_of_euro_sciences.txt
./phil_txts/wittgenstien_philosophical_investigations.txt
./phil_txts/hegel_phenomenology_of_spirit.txt
./phil_txts/kierkegaard_fear_and_trembling.txt
./phil_txts/merleau-ponty_phenomenology_of_perception.txt
./phil_txts/marx_kapital.txt
./phil_txts/heidegger_off_the_beaten_track.txt
./phil_txts/hegel_science_of_logic.txt
./phil_txts/sartre_being_and_nothingness.txt
./phil_txts/davis_women_race_class.txt
./phil_txts/anselm_de_veritate.txt
./phil_txts/heidegger_being_and_time.txt
./phil_txts/kripke_naming_necessity.txt
./phil_txts/kant_critique_practical_reason.txt
./phil_txts/descartes_meditations.txt
./phil_txts/fichte_system_of_ethics.txt
./phil_txts/hegel_elements_of_right.txt
./phil_txts/plato_complete_works.txt
./phil_txts/kripke_philosophical_troubles.txt
./phil_txts/popper_logic_of_science.txt
./phil_txts/popper_open_society.txt
./phil_txts/smith_wealth_of_nations.txt
./phil_txts/quine_quintessence.txt
./phil_txts/deleuze_guattari_anti-oedipus.txt
./phil_txts/leibniz_monadology.txt
./phil_txts/foucault_order_of_things.txt
./phil_txts/chuang_tzu_book.txt
./phil_txts/aristotle_complete_works_v1.txt
./phil_txts/malebranche_search_truth.txt
./phil_txts/aristotle_complete_works_v2.txt
./phil_txts/husserl_idea_of_phenomenology.txt
df = pd.DataFrame(data)
# Strip newlines inside each text; Series.replace would only match cells
# that are exactly '\n', so we use .str.replace for substring replacement
df['text'] = df['text'].str.replace('\n', ' ', regex=False)
category_labels = df['author'].tolist()
hover_df = pd.DataFrame(category_labels, columns=['author'])
vectorizer = CountVectorizer(min_df=5, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(df['text'])
embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)
/Users/victorgallego/miniforge3/lib/python3.9/site-packages/umap/umap_.py:126: UserWarning: A few of your vertices were disconnected from the manifold.  This shouldn't cause problems.
Disconnection_distance = 1 has removed 94 edges.
It has only fully disconnected 1 vertices.
Use umap.utils.disconnected_vertices() to identify them.
  warn(
f = umap.plot.interactive(embedding, labels=df['author'], hover_data=hover_df, point_size=10)
show(f)