This project focuses on detecting counterfeit banknotes using a combination of Principal Component Analysis (PCA) for dimensionality reduction and K-Nearest Neighbors (KNN) for classification. The goal is to preprocess banknote data, reduce feature complexity, and optimize KNN for improved accuracy in distinguishing genuine from fake banknotes.
- Python – Programming Language
- Pandas – Data Handling and Manipulation
- Scikit-learn – Machine Learning (PCA, KNN, Model Evaluation)
- Matplotlib & Seaborn – Data Visualization
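Concretely, these pieces chain into a single scikit-learn workflow. The sketch below is a minimal illustration rather than the project's exact code; it assumes a feature matrix X and label series y prepared as in the full code further down:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Scale -> project onto 5 principal components -> KNN classification.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=5),
                         KNeighborsClassifier(n_neighbors=6))

# 5-fold cross-validated accuracy of the whole pipeline.
print(cross_val_score(pipeline, X, y, cv=5).mean())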
The matrix printed by pca.components_ (see the code below) contains the principal component loadings, showing how much each original feature contributes to each principal component. The key takeaway: features related to the height, length, and margins of the banknotes play a crucial role in separating genuine notes from counterfeits.
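To make the loadings easier to read, the raw pca.components_ array can be wrapped in a labeled DataFrame. This is a small sketch that reuses the pca object and features list defined in the code below:

import pandas as pd

# Rows are principal components, columns are the original banknote features.
loadings = pd.DataFrame(
    pca.components_,
    columns=features,
    index=[f"PC{i+1}" for i in range(pca.n_components_)],
)
print(loadings.round(3))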
Confusion matrix on the 300-note test set (rows = actual, columns = predicted; note that is_genuine = False means counterfeit and True means genuine):

|                            | Predicted False (Counterfeit) | Predicted True (Genuine) |
|----------------------------|-------------------------------|--------------------------|
| Actual False (Counterfeit) | 108                           | 2                        |
| Actual True (Genuine)      | 0                             | 190                      |
Overall accuracy: 99% (298/300 correct classifications); only two of the 300 test notes were misclassified, indicating the model performs exceptionally well.
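Since Seaborn is already part of the stack, the confusion matrix can also be rendered as an annotated heatmap. A minimal sketch, assuming the y_test and y_pred produced by the evaluation code below:

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Annotated heatmap: rows are actual labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Counterfeit", "Genuine"],
            yticklabels=["Counterfeit", "Genuine"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("KNN Confusion Matrix")
plt.show()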
This project demonstrates how dimensionality reduction and hyperparameter tuning can strengthen classification models for fraud detection. The PCA transformation reduced the feature space while preserving the information needed for classification, improving efficiency without loss of classification power. Cross-validated tuning selected best_k = 6 for KNN, maximizing recall and minimizing misclassification. The resulting model is highly effective at banknote authentication, making it a strong candidate for real-world fraud detection applications.
Potential improvements: since performance is already near perfect, further work could test whether the results hold across other classification techniques, e.g., SVM or ensemble methods.
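As a starting point for that comparison, the same PCA features and cross-validation setup can be reused with other estimators. A hedged sketch, with SVC and RandomForestClassifier standing in for the SVM and ensemble options, using the X_train and y_train defined in the code below:

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Cross-validate alternative classifiers on the PCA-transformed training data.
for name, model in [("SVM (RBF kernel)", SVC()),
                    ("Random Forest", RandomForestClassifier(random_state=42))]:
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")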
features = ["diagonal", "height_left", "height_right", "margin_low", "margin_up", "length"]
X = dollar[features]
dollar = dollar.dropna(subset=['is_genuine'])
y = dollar["is_genuine"]
X = X.fillna(X.mean())
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)
# Cumulative explained variance across all components, to justify keeping 5.
full_pca = PCA(n_components=len(features)).fit(X_scaled)
plt.figure(figsize=(8, 5))
plt.plot(np.cumsum(full_pca.explained_variance_ratio_), marker='o')
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Explained Variance by Principal Components")
plt.grid()
plt.show()

# Principal component loadings: one row per component, one column per feature.
print(pca.components_)
# DataFrame view of the transformed data, convenient for inspection.
pca_df = pd.DataFrame(X_pca, columns=["PC1", "PC2", "PC3", "PC4", "PC5"])

# 80/20 train-test split on the PCA-transformed features.
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)
# Tune k with 5-fold cross-validation on the training set.
# (Pass scoring='precision' or scoring='recall' to cross_val_score to tune
# for those metrics instead of accuracy.)
k_values = range(5, 21)
accuracy_scores = []
for k in k_values:
    knn_cv = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn_cv, X_train, y_train, cv=5)
    accuracy_scores.append(scores.mean())

best_k = k_values[accuracy_scores.index(max(accuracy_scores))]
print(f"Best k value: {best_k}")
plt.plot(k_values, accuracy_scores, marker='o')
plt.xlabel('k value')
plt.ylabel('Cross-Validated Accuracy')
plt.title('Cross-Validation for Optimal k in KNN')
plt.grid()
plt.show()
# Refit KNN with the best k on the full training set and evaluate on held-out data.
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))