This project develops a phishing detection system using Natural Language Processing (NLP) techniques. The goal is to identify potentially malicious content in emails and messages, giving users an additional layer of security.
- Python
- Natural Language Processing (NLP) libraries (e.g., NLTK, spaCy)
- Machine Learning algorithms (e.g., SVM, Random Forest)
- Hyperopt (hyperparameter optimization)
- MLflow (for model and artifact version control)
The model was trained on a diverse dataset comprising both phishing and legitimate messages. The dataset was carefully curated to ensure a representative sample.
- Text Preprocessing
- Baseline Model Training
- Creating and setting up an MLflow experiment
- Creating Text Preprocessing Pipeline
- Hyperparameter optimization with Hyperopt
- Registering best model
- Getting predictions from the best model
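As an illustration of the text preprocessing step listed above, here is a minimal sketch using only the Python standard library. The regular expressions, the placeholder tokens (`_url_`, `_email_`, `_num_`), and the function name `preprocess` are assumptions made for this example; the project's actual preprocessing rules may differ, for instance by using NLTK or spaCy for tokenization and lemmatization.

```python
import re
from typing import List

# Hypothetical patterns chosen for this sketch; not taken from the project code.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
NUMBER_RE = re.compile(r"\d+")
TOKEN_RE = re.compile(r"[a-z_]+")


def preprocess(text: str) -> List[str]:
    """Lowercase the text, replace URLs, emails, and numbers with
    placeholder tokens, then split into alphabetic tokens."""
    text = text.lower()
    # Replace URLs first so digits and '@' inside them are not matched later.
    text = URL_RE.sub(" _url_ ", text)
    text = EMAIL_RE.sub(" _email_ ", text)
    text = NUMBER_RE.sub(" _num_ ", text)
    # Keep only runs of lowercase letters and underscores (drops punctuation).
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    sample = "Verify your account at http://bad.example/login NOW!"
    print(preprocess(sample))
```

Replacing links and addresses with placeholders keeps the signal that a message contains them without letting the model memorize specific URLs.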
The best model achieved an accuracy of about 98% on the test dataset, demonstrating its effectiveness in identifying phishing attempts.
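For reference, accuracy is the fraction of test messages whose predicted label matches the true label. The sketch below shows the calculation on made-up labels (1 for phishing, 0 for legitimate); the function name `accuracy` and the sample data are illustrative only and are not the project's results.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the share of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


if __name__ == "__main__":
    # Illustrative labels only: 1 = phishing, 0 = legitimate.
    true_labels = [1, 0, 1, 1]
    predicted = [1, 0, 0, 1]
    print(accuracy(true_labels, predicted))
```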