This project focuses on classifying emails into categories based on their content, utilizing a Multinomial Naive Bayes classifier. The goal is to preprocess and clean email data, then train a model to predict labels for new emails. The analysis includes text preprocessing, feature extraction using CountVectorizer, and evaluation of model performance with a classification report.
Python
Programming LanguagePandas
– Data handling and manipulationNLTK
– Natural language processing (text preprocessing)Scikit-learn
– Machine learning (model building and evaluation)Google Colab
– Cloud environment for executing and running the codeThe results of the email classification project show that the Multinomial Naive Bayes model performs with 50% accuracy on the test dataset. The classification report indicates that the model has high precision for class 0 “Normal Email” (1.00) but poor recall (0.33), and vice versa for class 1 “Spam Email” (high recall of 1.00 but low precision of 0.33). The F1-scores for both classes are 0.50, indicating an imbalance between precision and recall. The weighted average F1-score is also 0.50, reflecting the model’s overall performance on the small dataset. While the model provides some useful insights, it suggests room for improvement, especially in handling class imbalance and refining text preprocessing methods.
nltk.download('stopwords') #Downloaded stop words
emails = pd.read_csv("emails.tsv", sep='\t') #Loaded the TSV file containing my emails.
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
def preprocess_text(text):
emails['text'] = emails['text'].fillna("")
text = str(text).lower()
text = re.sub(r'\W', ' ', text)
words = text.split()
stop_words = set(stopwords.words('english'))
words = [stemmer.stem(word) for word in words if word not in stop_words]
#words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
return ' '.join(words)
emails['processed_text'] = emails['text'].apply(preprocess_text) #Applies the pre processing for the emals.
X_train, X_test, y_train, y_test = train_test_split(emails['processed_text'], emails['label'], test_size=0.3, random_state=42)
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
model = MultinomialNB() #Load Multinomial model.
model.fit(X_train_vec, y_train)
y_pred = model.predict(X_test_vec)
print("Classification Report:")
print(classification_report(y_test, y_pred))
This project demonstrates the application of machine learning for text classification, utilizing natural language processing techniques and a simple probabilistic model to classify emails. It showcases the effectiveness of Multinomial Naive Bayes for text data and provides a foundation for future improvements and more advanced approaches.