Banknote Authentication Using PCA and KNN

Overview

This project focuses on detecting counterfeit banknotes using a combination of Principal Component Analysis (PCA) for dimensionality reduction and K-Nearest Neighbors (KNN) for classification. The goal is to preprocess banknote data, reduce feature complexity, and optimize KNN for improved accuracy in distinguishing genuine from fake banknotes.

Methodology

Data Preprocessing – rows with a missing label are dropped, missing feature values are imputed with column means, and the six geometric features are standardized.

Dimensionality Reduction with PCA – the standardized features are projected onto five principal components.

KNN Model Optimization – the number of neighbors k is tuned with 5-fold cross-validation over k = 5–20 (see the sketch below).
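
As a rough illustration of these three steps, they can be chained with scikit-learn's Pipeline. This is a minimal sketch, not the project's actual code: the choice of n_components=5 and k=6 mirrors the Code section below, while SimpleImputer is an assumed stand-in for the mean imputation done there with fillna.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Preprocessing, dimensionality reduction, and classification in one estimator.
banknote_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),    # stand-in for X.fillna(X.mean())
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("knn", KNeighborsClassifier(n_neighbors=6)),  # best_k found below
])
# Usage: banknote_pipeline.fit(X_train, y_train); banknote_pipeline.predict(X_test)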

Technologies Used

Python – Programming Language

Pandas – Data Handling and Manipulation

Scikit-learn – Machine Learning (PCA, KNN, Model Evaluation)

Matplotlib & Seaborn – Data Visualization

Results & Evaluation

PCA Component Loadings

The matrix printed by pca.components_ contains the principal component loadings, showing how much each original feature contributes to each principal component. The loadings indicate that features related to the height, length, and margins of the banknotes play a crucial role in distinguishing genuine from counterfeit notes.
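
For readability, the raw loading matrix can be wrapped in a labeled DataFrame. A small convenience sketch, assuming the fitted pca object and the features list from the Code section below:

import pandas as pd

# Rows are principal components, columns are the original banknote features.
loadings = pd.DataFrame(
    pca.components_,
    columns=features,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
print(loadings.round(3))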

Best K-Value for KNN

5-fold cross-validation over k = 5–20 identified best_k = 6 as the value with the highest mean accuracy.
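
The project performs this search with an explicit loop over k (shown in the Code section below). As a hedged alternative, not the project's own code, scikit-learn's GridSearchCV carries out the same search more compactly:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search k = 5..20 with 5-fold cross-validated accuracy, as in the manual loop.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(5, 21))},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print(grid.best_params_)  # expected to agree with the manual search (k = 6)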

Confusion Matrix Analysis

                              Predicted False    Predicted True
                              (Counterfeit)      (Genuine)
Actual False (Counterfeit)          108                2
Actual True (Genuine)                 0              190

Only 2 of the 110 counterfeit notes were misclassified as genuine, and none of the 190 genuine notes were misclassified.
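
Seaborn, listed under Technologies but not used in the Code section, can render this matrix as an annotated heatmap. A minimal sketch, assuming the y_test and y_pred arrays from the Code section:

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Counterfeit", "Genuine"],
            yticklabels=["Counterfeit", "Genuine"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Banknote Authentication Confusion Matrix")
plt.show()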

Classification Report Analysis

Overall Accuracy: 99% — the confusion matrix gives (108 + 190) / 300 = 298/300 correct classifications (≈ 99.3%), indicating the model performs exceptionally well.

Visualization

Figure: Cumulative explained variance by principal components (PCA).

Figure: Cross-validated accuracy across k values (selection of best_k).

Summary

This project demonstrates how dimensionality reduction and hyperparameter tuning can enhance classification models for fraud detection. PCA reduces feature complexity, and KNN’s optimization ensures high accuracy in banknote authentication. The PCA transformation successfully reduced dimensionality while preserving key information, improving efficiency without loss of classification power. KNN with best_k = 6 provided optimal performance, maximizing recall and minimizing misclassification. The model is highly effective for banknote authentication, making it a strong candidate for real-world fraud detection applications.

Potential improvements: Since performance is already near perfect, further enhancements could explore different classification models (e.g., SVM, ensemble methods) to see if results hold across different techniques.
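
As a rough sketch of that comparison, not code from the project, alternative classifiers can be scored with the same 5-fold cross-validation used for KNN:

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Score each alternative on the same PCA-transformed training data.
for name, model in [("SVM (RBF kernel)", SVC()),
                    ("Random Forest", RandomForestClassifier(random_state=42))]:
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")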

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

# "dollar" is the banknote DataFrame loaded earlier.
features = ["diagonal", "height_left", "height_right", "margin_low", "margin_up", "length"]

# Drop rows with a missing label first so X and y stay aligned.
dollar = dollar.dropna(subset=["is_genuine"])
X = dollar[features].fillna(dollar[features].mean())  # impute missing features with column means
y = dollar["is_genuine"]

# Standardize the features so PCA is not dominated by scale differences.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce the six standardized features to five principal components.
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

# Cumulative explained variance across all six components,
# plotted to justify keeping five of them.
full_pca = PCA(n_components=len(features)).fit(X_scaled)
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(features) + 1),
         np.cumsum(full_pca.explained_variance_ratio_), marker='o')
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Explained Variance by Principal Components")
plt.grid()
plt.show()

# Principal component loadings: rows are components, columns are the original features.
print(pca.components_)

# Transformed data as a labeled DataFrame, for inspection.
pca_df = pd.DataFrame(X_pca, columns=["PC1", "PC2", "PC3", "PC4", "PC5"])

# Hold out 20% of the PCA-transformed data for final evaluation.
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

k_values = range(5, 21)
accuracy_scores = []

# 5-fold cross-validation on the training set for each candidate k.
for k in k_values:
    knn_cv = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn_cv, X_train, y_train, cv=5)
    accuracy_scores.append(scores.mean())
    # The same loop can rank k by precision or recall instead:
    # cross_val_score(knn_cv, X_train, y_train, cv=5, scoring='precision').mean()
    # cross_val_score(knn_cv, X_train, y_train, cv=5, scoring='recall').mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = k_values[accuracy_scores.index(max(accuracy_scores))]
print(f"Best k value: {best_k}")

plt.plot(k_values, accuracy_scores, marker='o')

plt.xlabel('k value')
plt.ylabel('Cross-Validated Accuracy')
plt.title('Cross-Validation for Optimal k in KNN')
plt.grid()
plt.show()

# Train the final KNN model with the best k and evaluate on the held-out test set.
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
