Abusive Text Classification

Problem Statement

This assignment is divided into three parts based on the machine learning model used to classify abusive text. The first part implements the model using the KNN algorithm, the second using an LSTM neural network, and the third by fine-tuning the mBERT and MuRIL transformer models.

Dataset

The dataset contains text messages in the Hindi language with binary labels: 0 represents non-abusive text and 1 represents abusive text.

Size of dataset: (20184, 2), i.e. 20,184 labelled messages in two columns (text and label).
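As a quick sanity check, the shape and label balance can be inspected with pandas; the file name and column names below are assumptions for illustration:

    import pandas as pd

    # "hindi_abuse.csv" and the column names are illustrative assumptions.
    df = pd.read_csv("hindi_abuse.csv", names=["text", "label"])
    print(df.shape)                     # (20184, 2)
    print(df["label"].value_counts())   # counts of 0 (non-abusive) and 1 (abusive)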

Data Preprocessing

  • Created a set of Hindi stop words compiled from several resources.
  • Removed the stop words from the dataset sentences.
  • Removed punctuation marks from the sentences.
  • Converted emojis to their text-equivalent representations using the emot library.
  • Removed digits from the text (a minimal sketch of this pipeline follows the list).
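A minimal sketch of these steps, assuming the sentences arrive as plain strings. The stop-word set here is an illustrative subset, and emoji_to_text() is a hypothetical stand-in for the emot-based conversion step:

    import re
    import string

    # Illustrative subset; the real list was compiled from several resources.
    HINDI_STOP_WORDS = {"है", "और", "का", "की", "के", "में", "को", "से", "पर", "यह"}

    def emoji_to_text(sentence: str) -> str:
        """Hypothetical stand-in for the step that replaces each emoji with
        its textual description (the actual code uses the emot library)."""
        return sentence  # placeholder

    def preprocess(sentence: str) -> str:
        sentence = emoji_to_text(sentence)
        # Remove ASCII punctuation; the Devanagari danda (।) could be appended.
        sentence = sentence.translate(str.maketrans("", "", string.punctuation))
        # Remove digits.
        sentence = re.sub(r"\d+", "", sentence)
        # Drop Hindi stop words.
        return " ".join(t for t in sentence.split() if t not in HINDI_STOP_WORDS)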

Part 1: KNN

  • Tokenized the preprocessed data and computed tf-idf values using scikit-learn's TfidfVectorizer.
  • Split the dataset 80:20 into training and test sets.
  • Fitted the classifier on the training data and evaluated it on the test data (see the sketch after this list).
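A sketch of this pipeline, assuming texts and labels hold the preprocessed sentences and their 0/1 labels; k=18 is taken from the Results table below:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # texts and labels are assumed to come from the preprocessing step above.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42
    )

    vectorizer = TfidfVectorizer()
    X_train_tfidf = vectorizer.fit_transform(X_train)  # fit only on training data
    X_test_tfidf = vectorizer.transform(X_test)

    knn = KNeighborsClassifier(n_neighbors=18)  # k=18, as reported in Results
    knn.fit(X_train_tfidf, y_train)

    preds = knn.predict(X_test_tfidf)
    print(accuracy_score(y_test, preds))
    print(f1_score(y_test, preds, average="macro"))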

Part 2: LSTM

  • Created a class for the LSTM architecture with the following layers, activation function, and dimensions (a runnable sketch appears after this list):
    LSTM(
        (embedding): Embedding(29941, 300)
        (lstm): LSTM(300, 600, num_layers=2, batch_first=True, dropout=0.3)
        (fc): Linear(in_features=600, out_features=1, bias=True)
        (dropout): Dropout(p=0.3, inplace=False)
        (sig): Sigmoid()
    )
    
  • Created a vectorized dataset using word_tokenizer().
  • Padded all sentences to a fixed maximum length.
  • Split the dataset in an 80:20 ratio for training and testing.
  • Trained the model with the following hyper-parameters:
    vocab_size = len(vocab)
    embedding_dim = 300
    hidden_dim = 600
    num_layers = 2
    epochs = 10
    lr = 0.001
    
  • Also implemented early stopping by maintaining a patience counter (see the training sketch after this list).
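Below is a minimal PyTorch sketch of a class matching the printed architecture, plus a training loop with counter-based early stopping. The loss function (BCELoss), optimizer (Adam), patience value, and the train_loader/val_loader arguments are assumptions for illustration, not details from the write-up:

    import torch
    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        """Mirrors the architecture printed above."""
        def __init__(self, vocab_size, embedding_dim=300, hidden_dim=600,
                     num_layers=2, dropout=0.3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dim)
            self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                                batch_first=True, dropout=dropout)
            self.fc = nn.Linear(hidden_dim, 1)
            self.dropout = nn.Dropout(dropout)
            self.sig = nn.Sigmoid()

        def forward(self, x):
            embedded = self.embedding(x)              # (batch, seq_len, 300)
            _, (hidden, _) = self.lstm(embedded)      # hidden: (num_layers, batch, 600)
            out = self.dropout(hidden[-1])            # final hidden state of top layer
            return self.sig(self.fc(out)).squeeze(1)  # probability of the abusive class

    def evaluate(model, loader, criterion):
        """Mean validation loss over a DataLoader of (inputs, labels) batches."""
        model.eval()
        total = 0.0
        with torch.no_grad():
            for inputs, labels in loader:
                total += criterion(model(inputs), labels.float()).item()
        return total / len(loader)

    def train(model, train_loader, val_loader, epochs=10, lr=0.001, patience=3):
        """Training loop with counter-based early stopping; BCELoss, Adam,
        and patience=3 are assumed, not stated in the write-up."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.BCELoss()
        best_val, counter = float("inf"), 0
        for epoch in range(epochs):
            model.train()
            for inputs, labels in train_loader:
                optimizer.zero_grad()
                loss = criterion(model(inputs), labels.float())
                loss.backward()
                optimizer.step()
            val_loss = evaluate(model, val_loader, criterion)
            if val_loss < best_val:
                best_val, counter = val_loss, 0   # improvement: reset the counter
            else:
                counter += 1                      # no improvement: bump the counter
                if counter >= patience:
                    break                         # early stop

    model = LSTMClassifier(vocab_size=29941)      # vocab size from the printout above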

Part 3: mBERT and MuRIL

  • Tokenized and encoded the dataset using Hugging Face's "bert-base-multilingual-cased" and "google/muril-base-cased" tokenizers.
  • Fine-tuned the pretrained mBERT and MuRIL models on the same classification task (a sketch follows this list).
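A sketch of the fine-tuning setup with the Hugging Face Trainer. The Dataset wrapper, max_length, epoch count, batch size, and the train_texts/val_texts split variables are illustrative assumptions:

    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    MODEL_NAME = "google/muril-base-cased"  # or "bert-base-multilingual-cased" for mBERT

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

    class AbuseDataset(torch.utils.data.Dataset):
        """Wraps tokenized encodings and labels in the form the Trainer expects."""
        def __init__(self, texts, labels):
            self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
            self.labels = labels

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item

    # train_texts/train_labels and val_texts/val_labels are assumed to come
    # from the same 80:20 split used elsewhere in the assignment.
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="finetune-out",      # illustrative values
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=AbuseDataset(train_texts, train_labels),
        eval_dataset=AbuseDataset(val_texts, val_labels),
    )
    trainer.train()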

Results

Model        Accuracy (%)   Macro F1 (%)
KNN (k=18)   64             62
LSTM         78.30          78.09
mBERT        82.44          82.17
MuRIL        85.29          84.98