Exercise 1: Hate Speech Detection in X (aka Twitter) 🤬
In this exercise, you will work with a dataset of tweets. Your task is to build a model that predicts, for each tweet, whether it contains hate speech.
a) Load the data (file hate_speech_tweets.csv) and inspect the columns. Answer the following questions:
How many classes do we have in the dataset? How many examples do we have for each class?
Which class refers to hate speech?
import pandas as pd
dataset = pd.read_csv('hate_speech_tweets.csv')
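One possible way to answer the questions in a) is sketched below. Because the real CSV is not reproduced here, the sketch uses a small invented DataFrame as a stand-in; only the column names `tweet` and `label` come from the exercise, and the rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for hate_speech_tweets.csv (assumption: the real file is
# loaded with pd.read_csv and has 'tweet' and 'label' columns).
toy = pd.DataFrame({
    "tweet": ["good morning", "nice game", "lovely day", "you blorp", "total blorp"],
    "label": [0, 0, 0, 1, 1],
})

print(toy.columns.tolist())                  # column names
n_classes = toy["label"].nunique()           # number of distinct classes
class_counts = toy["label"].value_counts()   # number of examples per class
print(n_classes)
print(class_counts)
```

On the real data, the same calls on `dataset` give the number of classes and the examples per class. Reading a few tweets from each class is probably the simplest way to decide which label marks hate speech.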
b) Perform a suitable train-test split, with 10% of the data for testing.
from sklearn.model_selection import train_test_split
X = dataset['tweet']
y = dataset['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, stratify=y)
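To see why `stratify=y` matters on imbalanced data, the following sketch checks the class proportions in both splits. The data is invented (90 examples of class 0 and 10 of class 1), and the short variable names are chosen only to avoid clashing with the exercise's own variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented imbalanced toy data: 90 examples of class 0, 10 of class 1.
toy = pd.DataFrame({
    "tweet": [f"tweet number {i}" for i in range(100)],
    "label": [0] * 90 + [1] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["tweet"], toy["label"],
    test_size=0.1, random_state=42, stratify=toy["label"],
)

# With stratification, both splits keep the original 90/10 class ratio.
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```

Without `stratify`, a 10% test split of a rare class could by chance contain very few or no positive examples, which would make the test metrics unreliable.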
c) Build an appropriate scikit-learn pipeline to process the text feature (a single column) and train a model.
Note 1: in the processing step, do not filter out any words from the text, as infrequent words are important for this task.
Note 2: you only need to try one combination of components here; in e) you will have the opportunity to change something to improve the metrics.
Discuss your approach when building the pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
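model = Pipeline([
    # Bag-of-words counts; the defaults (min_df=1, no stop words) keep rare words.
    ('vect', CountVectorizer()),
    # Linear classifier that works directly on the sparse count matrix.
    ('clf', LogisticRegression())
])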
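As a quick check of the "do not filter out any words" note, the sketch below fits a default `CountVectorizer` on three invented sentences and inspects the vocabulary. It relies on the documented defaults `min_df=1`, `max_df=1.0` and `stop_words=None`; the sentences themselves are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented documents: "zyzzyva" appears once, "the" appears often.
docs = ["the cat sat on the mat", "the dog sat", "a zyzzyva appeared"]

vec = CountVectorizer()  # defaults: min_df=1, max_df=1.0, stop_words=None
vec.fit(docs)
vocab = sorted(vec.vocabulary_)
print(vocab)
```

Both the rare word and the very frequent word are kept. Note that the default `token_pattern` only matches tokens of two or more word characters, so the single-letter word "a" is dropped; that is a tokenization effect rather than a frequency filter.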
d) Evaluate your model using the test set.
What is the average F1-score?
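from sklearn.metrics import classification_report
# Fit on the training split, then predict on the held-out test split.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# The report lists per-class scores plus macro and weighted averages.
print(classification_report(y_test, y_pred))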
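"Average F1-score" can mean different things, so it may help to see how the macro and weighted averages differ on an imbalanced problem. The labels and predictions below are invented for illustration and are not results from the exercise dataset.

```python
from sklearn.metrics import f1_score

# Invented labels and predictions: 8 examples of class 0, 2 of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Macro: unweighted mean of the per-class F1-scores.
macro_f1 = f1_score(y_true, y_hat, average="macro")
# Weighted: per-class F1-scores weighted by class support.
weighted_f1 = f1_score(y_true, y_hat, average="weighted")

print(macro_f1)     # 0.6875 (mean of 0.875 and 0.5)
print(weighted_f1)  # 0.8 (0.8 * 0.875 + 0.2 * 0.5)
```

The macro average treats both classes equally, so it is more sensitive to poor performance on the minority class. In a hate speech setting that is usually the class of interest.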
e) Try to improve the model. Can you change one or more components of your previous pipeline to improve the metrics?
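A minimal sketch of one possible variation is shown below: TF-IDF weighting with unigrams and bigrams, and a class-weighted logistic regression. These choices are assumptions for illustration, not a prescribed solution, and the toy texts are invented ("blorp" stands in for an offensive term). Whether any of this helps on the real data has to be checked on the test metrics.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy data; in the exercise you would use X_train / y_train instead.
texts = [
    "have a great day", "what a lovely match", "see you tomorrow",
    "thanks for the help", "you are a blorp", "all of them are blorp",
]
labels = [0, 0, 0, 0, 1, 1]

improved = Pipeline([
    # TF-IDF weighting over unigrams and bigrams; min_df=1 keeps rare words.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    # class_weight="balanced" reweights classes inversely to their frequency.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
improved.fit(texts, labels)
preds = improved.predict(["lovely day", "you are blorp"])
print(preds)
```

Changing one component at a time and comparing the classification reports makes it easier to attribute any difference in F1-score to a specific change.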