Exercise 1: Eurovision Song Lyrics Analysis

In this exercise, we will analyze the lyrics of the songs that participated in the Eurovision Song Contest from 1956 to 2023.

Let’s first have a look at the data:

import pandas as pd

dataset_df = pd.read_csv('eurovision_lyrics.csv')

dataset_df.tail()

	host_city	lyrics	artist	language	country	eurovision_number	year	title	lyrics_english	host_country
1716	Liverpool	YA EA EA YA EA EA (Ole!) Ay ven a mí niño mío....	Blanca Paloma	Spanish	Spain	67	2023	Eaea	Ya ea Ya ea (Ole!) Oh, come to me, child of mi...	United Kingdom
1717	Liverpool	I don't wanna go But baby we both know This is...	Loreen	English	Sweden	67	2023	Tattoo	I don't wanna go But baby we both know This is...	United Kingdom
1718	Liverpool	When we were boys We played pretend Army tanks...	Remo Forrer	English	Switzerland	67	2023	Watergun	When we were boys We played pretend Army tanks...	United Kingdom
1719	Liverpool	Sometimes gotta let it go Sometimes gotta look...	TVORCHI	English/Ukrainian	Ukraine	67	2023	Heart Of Steel	Sometimes gotta let it go Sometimes gotta look...	United Kingdom
1720	Liverpool	When you said you were leavin' To work on your...	Mae Muller	English	United Kingdom	67	2023	I Wrote A Song	When you said you were leavin' To work on your...	United Kingdom

a) Do a UMAP visualization of the song lyrics, in 2D:

As the text column, use lyrics_english.
For the color of the points in the plot, use the country column.
For the hover text in the plot, use the artist column.

You may have to remove missing data.

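# Hover table with one row per plotted song. Rows with missing values are dropped here too,
# so that the hover data has the same length and order as the data that is embedded below.
hover_df = dataset_df.dropna()[['artist']].reset_index(drop=True)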

dataset_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1721 entries, 0 to 1720
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   host_city          1721 non-null   object
 1   lyrics             1721 non-null   object
 2   artist             1721 non-null   object
 3   language           1721 non-null   object
 4   country            1721 non-null   object
 5   eurovision_number  1721 non-null   int64 
 6   year               1721 non-null   int64 
 7   title              1721 non-null   object
 8   lyrics_english     1719 non-null   object
 9   host_country       1721 non-null   object
dtypes: int64(2), object(8)
memory usage: 134.6+ KB

dataset_df.dropna(inplace=True)
dataset_df.reset_index(drop=True, inplace=True)

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = CountVectorizer(min_df=1, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(dataset_df['lyrics_english'])

import umap
import umap.plot

embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)
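
The Hellinger metric compares two count vectors as word distributions, so a long and a short song with the same word proportions are treated as identical. The following is a minimal NumPy sketch of the standard definition. The function name `hellinger_distance` is our own, and UMAP's internal implementation may differ in details.

```python
import numpy as np

def hellinger_distance(x, y):
    """Hellinger distance between two non-negative count vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sum_x, sum_y = x.sum(), y.sum()

    # Conventions for empty documents (an assumption of this sketch)
    if sum_x == 0 and sum_y == 0:
        return 0.0
    if sum_x == 0 or sum_y == 0:
        return 1.0

    # Overlap (Bhattacharyya coefficient) of the two normalized distributions
    overlap = np.sqrt(x * y).sum() / np.sqrt(sum_x * sum_y)

    # Clip at zero to guard against tiny negative values from rounding
    return float(np.sqrt(max(0.0, 1.0 - overlap)))

# Same word proportions, different lengths: distance is (numerically) zero
print(hellinger_distance([2, 0, 1], [4, 0, 2]))

# No shared words: distance is one
print(hellinger_distance([1, 0], [0, 1]))
```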

import matplotlib.pyplot as plt
%matplotlib inline
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE
output_notebook(resources=INLINE)

f = umap.plot.interactive(embedding, labels=dataset_df['country'], hover_data=hover_df, point_size=10)
show(f)

b) Is there any relationship between the contents of the lyrics of a song, and its country?

c) Describe how, given a song, you could find the most similar song.

# Show the rows for the query artist, so that their row index can be read off
dataset_df[dataset_df['artist'] == 'Loreen']

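# song_features is not defined in this section; it is assumed here to be a dense float
# array with one row per song, in the same row order as dataset_df.
query = song_features[1716]

# L2-normalize the query and every song vector, so that dot products are cosine similarities
import numpy as np
query /= np.linalg.norm(query)

song_features /= np.linalg.norm(song_features, axis=1)[:, None]

# Song indices ordered from most to least similar; the query song itself normally comes first
(query @ song_features.T).argsort()[::-1]

# The 10 most similar songs (reverse the ascending argsort before slicing)
dataset_df.iloc[(query @ song_features.T).argsort()[::-1][:10]]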
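
The same idea can be checked on a small example. Below is a minimal sketch with a hypothetical feature matrix (`toy_features` and the helper `most_similar` are made up for illustration): it L2-normalizes the rows, ranks all songs by cosine similarity to the query, and leaves the query itself out of the result.

```python
import numpy as np

def most_similar(features, query_index, top_k=3):
    """Return indices and cosine similarities of the top_k rows closest to the query row."""
    features = np.asarray(features, dtype=float)

    # L2-normalize every row; rows of all zeros are left unchanged
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = features / norms

    # Dot products of unit vectors are cosine similarities
    similarities = unit @ unit[query_index]

    # Sort from most to least similar and drop the query itself
    order = similarities.argsort()[::-1]
    order = order[order != query_index]
    return order[:top_k], similarities[order[:top_k]]

# Hypothetical feature vectors: row 1 points in the same direction as row 0
toy_features = np.array([
    [2.0, 0.0, 1.0],
    [4.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [1.0, 1.0, 1.0],
])

neighbours, scores = most_similar(toy_features, query_index=0, top_k=2)
print(neighbours, scores)
```

Note the `[::-1]` before slicing: `argsort` sorts in ascending order, so taking the first entries without reversing would return the least similar songs.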