Exercise 1: Eurovision Song Lyrics Analysis
In this exercise, we will analyze the lyrics of the songs that participated in the Eurovision Song Contest from 1956 to 2023.
Let’s first have a look at the data:
import pandas as pd
dataset_df = pd.read_csv('eurovision_lyrics.csv')
dataset_df.tail()
| | host_city | lyrics | artist | language | country | eurovision_number | year | title | lyrics_english | host_country |
|---|---|---|---|---|---|---|---|---|---|---|
| 1716 | Liverpool | YA EA EA YA EA EA (Ole!) Ay ven a mí niño mío.... | Blanca Paloma | Spanish | Spain | 67 | 2023 | Eaea | Ya ea Ya ea (Ole!) Oh, come to me, child of mi... | United Kingdom |
| 1717 | Liverpool | I don't wanna go But baby we both know This is... | Loreen | English | Sweden | 67 | 2023 | Tattoo | I don't wanna go But baby we both know This is... | United Kingdom |
| 1718 | Liverpool | When we were boys We played pretend Army tanks... | Remo Forrer | English | Switzerland | 67 | 2023 | Watergun | When we were boys We played pretend Army tanks... | United Kingdom |
| 1719 | Liverpool | Sometimes gotta let it go Sometimes gotta look... | TVORCHI | English/Ukrainian | Ukraine | 67 | 2023 | Heart Of Steel | Sometimes gotta let it go Sometimes gotta look... | United Kingdom |
| 1720 | Liverpool | When you said you were leavin' To work on your... | Mae Muller | English | United Kingdom | 67 | 2023 | I Wrote A Song | When you said you were leavin' To work on your... | United Kingdom |
a) Do a UMAP visualization of the song lyrics, in 2D:
As the text column, use lyrics_english.
For the color of the points in the plot, use the country column.
For the hover text in the plot, use the artist column.
You may have to remove missing data.
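# Hover labels: one artist per song that has English lyrics, so the rows
# stay aligned with the data once rows with missing lyrics are dropped
hover_df = dataset_df.dropna(subset=['lyrics_english'])[['artist']].reset_index(drop=True)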
dataset_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1721 entries, 0 to 1720
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 host_city 1721 non-null object
1 lyrics 1721 non-null object
2 artist 1721 non-null object
3 language 1721 non-null object
4 country 1721 non-null object
5 eurovision_number 1721 non-null int64
6 year 1721 non-null int64
7 title 1721 non-null object
8 lyrics_english 1719 non-null object
9 host_country 1721 non-null object
dtypes: int64(2), object(8)
memory usage: 134.6+ KB
dataset_df.dropna(inplace=True)
dataset_df.reset_index(drop=True, inplace=True)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vectorizer = CountVectorizer(min_df=1, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(dataset_df['lyrics_english'])
import umap
import umap.plot
embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)
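The metric='hellinger' argument above compares songs as distributions over words rather than as raw counts. As a rough illustration, the sketch below implements the textbook Hellinger distance for two count vectors with NumPy. The hellinger function is a hypothetical helper written for this example, not UMAP's internal code.

```python
import numpy as np

def hellinger(x, y):
    """Textbook Hellinger distance between two non-negative count vectors."""
    p = np.asarray(x, dtype=float)
    q = np.asarray(y, dtype=float)
    # Turn counts into probability distributions over the vocabulary
    p = p / p.sum()
    q = q / q.sum()
    # Bhattacharyya coefficient: 1 for identical distributions, 0 for disjoint ones
    bc = np.sum(np.sqrt(p * q))
    # Clamp at 0 to guard against tiny negative values from rounding
    return float(np.sqrt(max(0.0, 1.0 - bc)))

# Same word proportions, different song length: distance close to 0
print(hellinger([1, 2, 3], [2, 4, 6]))
# No words in common: distance 1
print(hellinger([1, 0], [0, 1]))
```

Because the counts are normalized first, a long song and a short song with the same word proportions end up close together.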
import matplotlib.pyplot as plt
%matplotlib inline
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE
output_notebook(resources=INLINE)
f = umap.plot.interactive(embedding, labels=dataset_df['country'], hover_data=hover_df, point_size=10)
show(f)
b) Is there any relationship between the content of a song's lyrics and its country?
c) Describe how, given a song, you could find the most similar song.
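import numpy as np
# Assumption: song_features is not defined earlier in this section; here it is
# taken to be the dense word-count matrix from part a), cast to float
song_features = word_doc_matrix.toarray().astype(float)
# l2 normalize each row, so that dot products are cosine similarities
song_features /= np.linalg.norm(song_features, axis=1)[:, None]
# Look up the query song instead of hard-coding a row number, which shifts after dropna
is_query = (dataset_df['artist'] == 'Loreen') & (dataset_df['title'] == 'Tattoo')
query = song_features[np.flatnonzero(is_query)[0]]
# Sort by similarity, most similar first (the query song itself ranks first)
ranking = (query @ song_features.T).argsort()[::-1]
dataset_df.iloc[ranking[:10]]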
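The same idea can be checked on a tiny example where the answer is obvious. This is a minimal sketch with a made-up 4 x 3 feature matrix (toy_features and the other names are illustrative): normalize each row to unit length, take dot products with the query row, and sort in descending order.

```python
import numpy as np

# Made-up word counts for four "songs" over a three-word vocabulary
toy_features = np.array([
    [2.0, 0.0, 1.0],  # song 0: the query
    [4.0, 0.0, 2.0],  # song 1: same word proportions as song 0
    [0.0, 3.0, 0.0],  # song 2: no words in common with song 0
    [1.0, 1.0, 1.0],  # song 3: partial overlap
])

# L2-normalize the rows, so a dot product equals the cosine similarity
unit = toy_features / np.linalg.norm(toy_features, axis=1, keepdims=True)

# Cosine similarity of every song to song 0
similarities = unit @ unit[0]

# Descending order, skipping the query song itself
ranking = np.argsort(similarities)[::-1]
neighbours = [int(i) for i in ranking if i != 0]
print(neighbours)
```

Song 1 comes out as the nearest neighbour because cosine similarity ignores song length and only compares word proportions.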