Car quality classification

Comparative analysis of ML classifiers (Decision Tree, Random Forest, SVM, k-NN & Naive Bayes) on the UCI Car Evaluation dataset, focusing on class imbalance and performance metrics.

The dataset contains 1,728 car instances evaluated across six categorical features: purchase price, maintenance cost, number of doors, seating capacity, luggage size, and safety rating. The target variable is car "acceptability," which falls into one of four classes: unacc, acc, good, or vgood.
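The notebook's exact preprocessing isn't reproduced here, but loading and ordinal-encoding can be sketched as follows. The column names follow the UCI attribute documentation; the two inline rows are stand-ins for reading the full `car.data` file:

```python
import pandas as pd

# The UCI car.data file ships without a header row, so column names
# are supplied manually (per the dataset's attribute documentation).
cols = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

# Two inline rows stand in for pd.read_csv("car.data", names=cols):
df = pd.DataFrame(
    [["vhigh", "vhigh", "2", "2", "small", "low", "unacc"],
     ["low", "low", "4", "4", "big", "high", "vgood"]],
    columns=cols,
)

# Ordinal mapping for the price-like features ('low' < 'med' < 'high' < 'vhigh'):
price_order = {"low": 0, "med": 1, "high": 2, "vhigh": 3}
df["buying_enc"] = df["buying"].map(price_order)
print(df[["buying", "buying_enc"]])
```

Ordinal encoding preserves the natural ordering of levels like purchase price, which matters for the distance-based models discussed later.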

A preliminary inspection of the data shows a strong class imbalance. The majority class, unacc, accounts for approximately 70% of all observations, while the other three classes — especially good and vgood — are significantly underrepresented (each below 5%). This imbalance has implications for model evaluation: using accuracy alone would favor predictions biased toward the dominant class. To address this, I used additional metrics such as precision, recall, and F1-score, which better reflect a model's performance on minority classes.
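To make this concrete, a trivial majority-class baseline shows why accuracy alone is misleading here. The class counts below (1210 / 384 / 69 / 65) are the published distribution of the UCI dataset:

```python
from collections import Counter

# Label distribution of the Car Evaluation dataset (1,728 instances):
labels = ["unacc"] * 1210 + ["acc"] * 384 + ["good"] * 69 + ["vgood"] * 65
majority = Counter(labels).most_common(1)[0][0]  # 'unacc'

# A classifier that always predicts the majority class:
preds = [majority] * len(labels)
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
recall_vgood = sum(p == y for p, y in zip(preds, labels) if y == "vgood") / 65

print(f"accuracy={accuracy:.2f}, vgood recall={recall_vgood:.2f}")
# accuracy=0.70, vgood recall=0.00
```

A 70%-accurate model that never identifies a single `vgood` car is exactly the failure mode that per-class precision, recall, and F1 expose.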

The Decision Tree classifier achieved a surprisingly high accuracy of 99%, with strong precision and recall across all classes, including the minority ones. While this performance is impressive, it's likely due in part to the clean and structured nature of the dataset, which was derived from a rule-based model.
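The effect — a single tree recovering a rule-derived target almost perfectly — can be sketched on synthetic data. The six ordinal features and the decision rule below are illustrative stand-ins, not the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Six ordinal-encoded features with four levels each, mimicking the car data.
X = rng.integers(0, 4, size=(1728, 6))
# Hypothetical deterministic rule: "high safety AND low price" -> positive.
y = (X[:, 5] > 1) & (X[:, 0] < 2)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```

Because the labels are a pure function of axis-aligned thresholds, the tree reconstructs the rule exactly — the same mechanism that makes 99% on the real (also rule-derived) dataset plausible rather than suspicious.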

The Random Forest classifier came in just behind at 97% accuracy. While both tree-based models performed exceptionally well, the Decision Tree's near-perfect score may partly reflect overfitting to the dataset's rule-based structure. Random Forest, by averaging across many trees, generalizes more robustly and is less sensitive to small variations in the data. On a problem this clean, both models reach near-perfect scores; on more complex or noisy datasets, Random Forest would likely outperform the single tree.
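The robustness claim can be demonstrated by adding label noise to the same kind of synthetic rule-based data (again an illustrative stand-in, not the real dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.integers(0, 4, size=(1728, 6))
y = (X[:, 5] > 1) & (X[:, 0] < 2)   # hypothetical rule-based target
y = y ^ (rng.random(len(y)) < 0.05)  # flip ~5% of labels as noise

rf = RandomForestClassifier(n_estimators=200, random_state=42)
scores = cross_val_score(rf, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
```

Averaging over bootstrapped trees keeps cross-validated accuracy close to the noise ceiling, where a single fully grown tree is more prone to memorizing the flipped labels.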

The k-NN classifier achieved 91% overall accuracy, but struggled significantly with underrepresented classes like vgood (recall 0.38). This is due in part to the imbalanced dataset, as well as the limitations of using ordinal-encoded categorical features with a distance-based model. The performance of k-NN could potentially improve with proper feature engineering or alternative encoding methods, but in its current form, it underperforms compared to tree-based methods for this dataset.

Naive Bayes was the worst-performing model in my comparison, achieving only 64% accuracy and severely misclassifying minority classes. This reflects a mismatch between the algorithm's assumptions and the data: the Gaussian variant assumes continuous, normally distributed, and conditionally independent features, while the Car Evaluation dataset is fully categorical, has interdependent attributes, and is better suited to rule-based or tree-based methods.
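scikit-learn does ship a Naive Bayes variant matched to categorical inputs, `CategoricalNB`, which models per-category frequencies instead of Gaussians. A sketch on the same synthetic stand-in data (not the real dataset, so this doesn't claim to fix the reported 64%) contrasts the two:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB, GaussianNB

rng = np.random.default_rng(2)
# Synthetic stand-in: six ordinal-encoded features with four levels each.
X = rng.integers(0, 4, size=(1728, 6))
y = (X[:, 5] > 1) & (X[:, 0] < 2)   # hypothetical rule-based target

gauss = cross_val_score(GaussianNB(), X.astype(float), y, cv=5).mean()
cat = cross_val_score(CategoricalNB(), X, y, cv=5).mean()
print(f"GaussianNB: {gauss:.2f}  CategoricalNB: {cat:.2f}")
```

`CategoricalNB` removes the normality mismatch, though the conditional-independence assumption remains and would still limit it on genuinely interdependent features.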

Among the models tested, the Decision Tree classifier produced the highest overall performance, with 99% accuracy and strong F1-scores across all classes. However, this model also showed signs of potential overfitting due to the structured nature of the dataset. The Random Forest model offered slightly lower accuracy (97%), but was more stable and balanced across minority classes like vgood and good, making it a better choice for generalization.

In contrast, SVM underperformed (93% accuracy), struggling particularly with good, due to its reliance on distance metrics that are not well-suited to ordinal-encoded categorical data. k-NN faced similar issues, showing a strong bias toward the majority class and weaker recall on underrepresented labels.

Naive Bayes performed the worst (64% accuracy), largely due to its inappropriate assumptions of feature independence and normally distributed continuous inputs — which do not match this dataset’s categorical structure.

For baseline comparisons, all models were initially trained with scikit-learn's default parameters, making each algorithm's out-of-the-box performance directly comparable before any hyperparameter tuning. In a real-world deployment, models like the Decision Tree and SVM would benefit from grid search with cross-validation (e.g., over max depth or kernel type), while k-NN would require careful selection of k to balance sensitivity and generalization.
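Such a tuning step could be sketched with `GridSearchCV`, again on synthetic stand-in data; the parameter values in the grid are illustrative choices, not the notebook's:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.integers(0, 4, size=(500, 6))
y = (X[:, 5] > 1) & (X[:, 0] < 2)   # hypothetical rule-based target

# Illustrative grid over tree complexity:
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

The same pattern applies to the SVM (`C`, `kernel`) and to k-NN (`n_neighbors`), with 5-fold cross-validation guarding against tuning to a single lucky split.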