Who likes spam? No one! This project implements a spam detection system in Java using a generative learning approach, the Naïve Bayes algorithm. It preprocesses raw email data, extracts features, trains the classifier, and evaluates its performance. A simple GUI is also included to check that the model works as expected.
- Source: SpamAssassin Public Corpus
- Format: CSV with HTML-formatted email content and labels (0 = ham, 1 = spam)
- Preprocessing:
  - Removes headers and metadata from emails.
  - Strips HTML tags, special characters, and stopwords.
- Feature Extraction:
  - Converts email content into a bag-of-words representation.
  - Optionally uses TF-IDF for weighting.
- Model:
  - Naïve Bayes classifier with Laplace smoothing.
- Accuracy: 98.79%
- Precision: 100%
- Recall: 96.31%
- F1-Score: 98.12%
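The steps listed above (strip HTML and stopwords, build bag-of-words counts, classify with Laplace-smoothed Naïve Bayes) can be sketched in a few dozen lines. This is a minimal illustration, not the project's actual code: the class name `NaiveBayesSketch`, its methods, and the tiny stopword list are all assumptions made for the example. It uses raw word counts and leaves out the optional TF-IDF weighting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: all names here are hypothetical, not the project's classes.
class NaiveBayesSketch {
    // Tiny placeholder stopword list (assumption); a real list would be much longer.
    private static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "is", "to", "and", "of", "for", "at"));

    // token -> {count in ham, count in spam}; the key set doubles as the vocabulary.
    private final Map<String, int[]> wordCounts = new HashMap<>();
    private final int[] totalWords = new int[2]; // tokens seen per class
    private final int[] docCounts = new int[2];  // emails seen per class

    /** Strips HTML tags and non-letter characters, lowercases, and drops stopwords. */
    static List<String> preprocess(String rawEmail) {
        String text = rawEmail.replaceAll("<[^>]*>", " ")     // remove HTML tags
                              .toLowerCase()
                              .replaceAll("[^a-z\\s]", " ");  // remove digits and special characters
        List<String> tokens = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty() && !STOPWORDS.contains(tok)) {
                tokens.add(tok);
            }
        }
        return tokens;
    }

    /** Adds one labeled email (0 = ham, 1 = spam) to the bag-of-words counts. */
    void train(String rawEmail, int label) {
        docCounts[label]++;
        for (String tok : preprocess(rawEmail)) {
            wordCounts.computeIfAbsent(tok, k -> new int[2])[label]++;
            totalWords[label]++;
        }
    }

    /** Returns the label with the higher log-posterior; assumes both classes were trained. */
    int predict(String rawEmail) {
        List<String> tokens = preprocess(rawEmail);
        int totalDocs = docCounts[0] + docCounts[1];
        int vocabSize = wordCounts.size();
        double bestScore = Double.NEGATIVE_INFINITY;
        int bestLabel = 0;
        for (int c = 0; c < 2; c++) {
            double score = Math.log((double) docCounts[c] / totalDocs); // log prior
            for (String tok : tokens) {
                int[] counts = wordCounts.get(tok);
                int count = (counts == null) ? 0 : counts[c];
                // Laplace (add-one) smoothing keeps unseen words from zeroing the product.
                score += Math.log((count + 1.0) / (totalWords[c] + vocabSize));
            }
            if (score > bestScore) {
                bestScore = score;
                bestLabel = c;
            }
        }
        return bestLabel;
    }

    /** F1 is the harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("<html><b>WIN</b> free money now!!!</html>", 1);
        nb.train("Free prize, click now", 1);
        nb.train("Meeting at noon tomorrow", 0);
        nb.train("Please review the project report", 0);
        System.out.println(nb.predict("free money prize"));         // 1 (spam)
        System.out.println(nb.predict("project meeting tomorrow")); // 0 (ham)
        System.out.println(f1(1.0, 0.9631)); // about 0.9812, matching the reported F1-Score
    }
}
```

Summing log-probabilities instead of multiplying raw probabilities avoids floating-point underflow on long emails. The `f1` helper also shows that the reported precision (100%) and recall (96.31%) are consistent with the reported F1-Score of 98.12%.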
- Clone the repository:
git clone https://github.com/username/spam-detector.git
cd spam-detector
- Compile the code:
javac src/main/*.java
- Run the model trainer, or launch the GUI to test the model visually (the large dataset requires a 4 GB heap):
java -Xmx4G -cp src/main ModelTrainer
java -Xmx4G -cp src/main SpamDetectorGUI
Special thanks to the maintainers of the SpamAssassin Public Corpus for providing valuable email data for this project.
This project is licensed under the MIT License.