Exercise 1: Eurovision Song Lyrics Analysis

In this exercise, we will analyze the lyrics of the songs that participated in the Eurovision Song Contest from 1956 to 2023.

Let’s first have a look at the data:

import pandas as pd

dataset_df = pd.read_csv('eurovision_lyrics.csv')
dataset_df.tail()
     | host_city | lyrics | artist | language | country | eurovision_number | year | title | lyrics_english | host_country
1716 | Liverpool | YA EA EA YA EA EA (Ole!) Ay ven a mí niño mío.... | Blanca Paloma | Spanish | Spain | 67 | 2023 | Eaea | Ya ea Ya ea (Ole!) Oh, come to me, child of mi... | United Kingdom
1717 | Liverpool | I don't wanna go But baby we both know This is... | Loreen | English | Sweden | 67 | 2023 | Tattoo | I don't wanna go But baby we both know This is... | United Kingdom
1718 | Liverpool | When we were boys We played pretend Army tanks... | Remo Forrer | English | Switzerland | 67 | 2023 | Watergun | When we were boys We played pretend Army tanks... | United Kingdom
1719 | Liverpool | Sometimes gotta let it go Sometimes gotta look... | TVORCHI | English/Ukrainian | Ukraine | 67 | 2023 | Heart Of Steel | Sometimes gotta let it go Sometimes gotta look... | United Kingdom
1720 | Liverpool | When you said you were leavin' To work on your... | Mae Muller | English | United Kingdom | 67 | 2023 | I Wrote A Song | When you said you were leavin' To work on your... | United Kingdom

a) Do a UMAP visualization of the song lyrics, in 2D:

  • As the text column, use lyrics_english.

  • For the color of the points in the plot, use the country column.

  • For the hover text in the plot, use the artist column.

You may have to remove missing data.
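UMAP works on numeric vectors, so the lyrics first have to be turned into a word-count matrix, and rows without text have to be dropped before that. The following is a minimal pure-Python sketch of both steps on hypothetical toy lyrics (not taken from the dataset); in the exercise itself, pandas and scikit-learn do the same job at scale.

```python
from collections import Counter

# Toy lyrics (hypothetical); None marks a missing English translation.
toy_lyrics = ["love love tonight", None, "tonight we dance", "dance with love"]

# Step 1: drop missing entries, as dropna() does for the real DataFrame.
docs = [text for text in toy_lyrics if text is not None]

# Step 2: build a sorted vocabulary over all remaining documents.
vocabulary = sorted({word for text in docs for word in text.split()})

# Step 3: one row per document, one column per vocabulary word.
def count_vector(text):
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

count_matrix = [count_vector(text) for text in docs]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row of count_matrix plays the role of one song in the word-document matrix that is passed to UMAP.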

category_labels = dataset_df['artist'].tolist()
hover_df = pd.DataFrame(category_labels, columns=['artist'])
dataset_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1721 entries, 0 to 1720
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   host_city          1721 non-null   object
 1   lyrics             1721 non-null   object
 2   artist             1721 non-null   object
 3   language           1721 non-null   object
 4   country            1721 non-null   object
 5   eurovision_number  1721 non-null   int64 
 6   year               1721 non-null   int64 
 7   title              1721 non-null   object
 8   lyrics_english     1719 non-null   object
 9   host_country       1721 non-null   object
dtypes: int64(2), object(8)
memory usage: 134.6+ KB
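# Drop the two songs without an English translation, then renumber the rows.
dataset_df.dropna(inplace=True)
dataset_df.reset_index(drop=True, inplace=True)
hover_df = dataset_df[['artist']]  # rebuild so hover rows match the remaining songs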
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = CountVectorizer(min_df=1, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(dataset_df['lyrics_english'])
import umap
import umap.plot

embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)
import matplotlib.pyplot as plt
%matplotlib inline
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE
output_notebook(resources=INLINE)
f = umap.plot.interactive(embedding, labels=dataset_df['country'], hover_data=hover_df, point_size=10)
show(f)

b) Is there any relationship between the content of a song's lyrics and its country?

c) Describe how, given a song, you could find the most similar song.

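import numpy as np

# Assumption: song_features is the dense word-count matrix from part a).
song_features = word_doc_matrix.toarray().astype(float)

# Row of the query song: Loreen's 2023 entry.
mask = (dataset_df['artist'] == 'Loreen') & (dataset_df['year'] == 2023)
query_idx = dataset_df.index[mask][0]

# L2-normalize each row so that a dot product equals the cosine similarity.
norms = np.linalg.norm(song_features, axis=1, keepdims=True)
song_features /= np.maximum(norms, 1e-12)
query = song_features[query_idx]

# Rank all songs by descending similarity; the query itself normally comes first.
ranking = (song_features @ query).argsort()[::-1]
dataset_df.iloc[ranking[:10]]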
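
The retrieval idea in part c) can be checked by hand on toy data. The sketch below is a pure-Python illustration, not the notebook's implementation: the count vectors and the helper names l2_normalize and most_similar are hypothetical. It normalizes every vector to unit length, ranks by dot product (cosine similarity) in descending order, and leaves out the query song itself.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def most_similar(query_idx, vectors, top_k=3):
    """Return (index, cosine similarity) pairs of the top_k nearest songs."""
    unit = [l2_normalize(v) for v in vectors]
    query = unit[query_idx]
    sims = [sum(a * b for a, b in zip(query, row)) for row in unit]
    # Highest similarity first; skip the query song itself.
    ranked = sorted(range(len(unit)), key=lambda i: sims[i], reverse=True)
    return [(i, sims[i]) for i in ranked if i != query_idx][:top_k]

# Hypothetical word-count vectors for four songs.
toy_counts = [
    [2, 1, 0, 0],
    [4, 2, 0, 0],  # same word proportions as song 0
    [0, 0, 3, 1],  # no words in common with song 0
    [1, 0, 1, 0],
]

neighbours = most_similar(0, toy_counts, top_k=2)
print(neighbours)
```

Song 1 ranks first because it has the same word proportions as song 0, even though its raw counts are twice as large; this is why the vectors are normalized before comparing them.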