Exercise 1: Eurovision Song Lyrics Analysis
In this exercise, we will analyze the lyrics of the songs that participated in the Eurovision Song Contest from 1956 to 2023.
Let’s first have a look at the data:
import pandas as pd
dataset_df = pd.read_csv('eurovision_lyrics.csv')
dataset_df.tail()
| | host_city | lyrics | artist | language | country | eurovision_number | year | title | lyrics_english | host_country |
|---|---|---|---|---|---|---|---|---|---|---|
| 1716 | Liverpool | YA EA EA YA EA EA (Ole!) Ay ven a mí niño mío.... | Blanca Paloma | Spanish | Spain | 67 | 2023 | Eaea | Ya ea Ya ea (Ole!) Oh, come to me, child of mi... | United Kingdom |
| 1717 | Liverpool | I don't wanna go But baby we both know This is... | Loreen | English | Sweden | 67 | 2023 | Tattoo | I don't wanna go But baby we both know This is... | United Kingdom |
| 1718 | Liverpool | When we were boys We played pretend Army tanks... | Remo Forrer | English | Switzerland | 67 | 2023 | Watergun | When we were boys We played pretend Army tanks... | United Kingdom |
| 1719 | Liverpool | Sometimes gotta let it go Sometimes gotta look... | TVORCHI | English/Ukrainian | Ukraine | 67 | 2023 | Heart Of Steel | Sometimes gotta let it go Sometimes gotta look... | United Kingdom |
| 1720 | Liverpool | When you said you were leavin' To work on your... | Mae Muller | English | United Kingdom | 67 | 2023 | I Wrote A Song | When you said you were leavin' To work on your... | United Kingdom |
a) Do a UMAP visualization of the song lyrics, in 2D:
- As the text column, use lyrics_english.
- For the color of the points in the plot, use the country column.
- For the hover text in the plot, use the artist column.
You may have to remove missing data.
# Hover text: one artist name per song (the column is named after what it holds)
hover_df = pd.DataFrame({'artist': dataset_df['artist'].tolist()})
dataset_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1721 entries, 0 to 1720
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 host_city 1721 non-null object
1 lyrics 1721 non-null object
2 artist 1721 non-null object
3 language 1721 non-null object
4 country 1721 non-null object
5 eurovision_number 1721 non-null int64
6 year 1721 non-null int64
7 title 1721 non-null object
8 lyrics_english 1719 non-null object
9 host_country 1721 non-null object
dtypes: int64(2), object(8)
memory usage: 134.6+ KB
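# Drop the two songs without an English translation, then rebuild the hover data to match
dataset_df.dropna(inplace=True)
dataset_df.reset_index(drop=True, inplace=True)
hover_df = dataset_df[['artist']]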
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vectorizer = CountVectorizer(min_df=1, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(dataset_df['lyrics_english'])
import umap
import umap.plot
embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)
import matplotlib.pyplot as plt
%matplotlib inline
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE
output_notebook(resources=INLINE)
f = umap.plot.interactive(embedding, labels=dataset_df['country'], hover_data=hover_df, point_size=10)
show(f)
b) Is there any relationship between the content of a song's lyrics and its country?
c) Describe how, given a song, you could find the most similar song.
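import numpy as np
# Assumption: song_features is the dense word-count matrix from part a); its definition is not shown here
song_features = word_doc_matrix.toarray().astype(float)
# L2-normalize each row so that dot products are cosine similarities
norms = np.linalg.norm(song_features, axis=1)
norms[norms == 0] = 1.0  # avoid dividing by zero for songs with no counted words
song_features /= norms[:, None]
# Use Loreen's last entry in the dataset as the query song, located by position
query_idx = np.flatnonzero((dataset_df['artist'] == 'Loreen').to_numpy())[-1]
query = song_features[query_idx]
# Sort by descending similarity; the first hit is the query song itself
ranking = (query @ song_features.T).argsort()[::-1]
dataset_df.iloc[ranking[:10]]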
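The idea behind part c) can also be sketched without any libraries. The following is a minimal illustration, not the exercise's reference solution: it assumes a crude whitespace tokenizer, and the helper names (`bag_of_words`, `cosine_similarity`, `most_similar`) and the toy lyrics are hypothetical, made up for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lower-cased, whitespace-separated tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_idx, texts):
    """Return the index of the text most similar to texts[query_idx], excluding itself."""
    vectors = [bag_of_words(t) for t in texts]
    scores = [
        (cosine_similarity(vectors[query_idx], v), i)
        for i, v in enumerate(vectors)
        if i != query_idx
    ]
    return max(scores)[1]


# Hypothetical toy lyrics, not taken from the dataset
toy_lyrics = [
    "love love me tonight",
    "dance all night long",
    "love me forever tonight",
]
best = most_similar(0, toy_lyrics)
```

Here the first and third toy texts share three distinct words while the first and second share none, so `best` points at the third text. On the real data the same ranking is obtained more efficiently with the normalized matrix product used above.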