How to handle imbalanced datasets effectively in classification problems?

Naomi Teng
Updated on March 30, 2026

I’m working on a classification problem where one class heavily outweighs the others (around 90:10 ratio). My model is achieving high accuracy, but it’s clearly biased toward the majority class.

Here’s a simplified version:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# X, y: features and labels, roughly 90:10 class ratio (defined elsewhere)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

Accuracy looks good, but recall and precision for the minority class are poor.

What I want to understand:

  • What are the best techniques to handle imbalance (SMOTE, class weights, etc.)?
  • When should I prefer resampling vs adjusting model parameters?
  • Which evaluation metrics should I focus on in such cases?

Would appreciate practical advice based on real-world experience.

7 days ago

I’m still learning this, but from what I’ve understood, handling imbalanced datasets is less about just fixing the data and more about choosing the right approach based on the problem.

Some things that seem to work:

  • Resampling techniques like oversampling (SMOTE) or undersampling to balance the classes
  • Metrics suited to imbalance, like F1-score and precision-recall, instead of plain accuracy
  • Class weights in the model, so the minority class gets more importance during training (sketched below, together with PR metrics)
  • Ensemble methods, which tend to be more robust to imbalance
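
For the class-weight and metrics bullets, here’s a minimal sketch of what I mean, building on the code in the question. The make_classification call is just a stand-in for your X and y (I picked n_samples=5000 and a ~90:10 split arbitrarily to mimic your ratio), and the stratify/random_state arguments are my additions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, average_precision_score

# Stand-in for the real data: ~90:10 class ratio
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)

# stratify=y keeps the class ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# class_weight='balanced' reweights classes inversely to their frequency,
# so minority-class errors cost more during training
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Average precision (area under the precision-recall curve) is a better
# single number than accuracy here; it needs scores, not hard labels
y_score = model.predict_proba(X_test)[:, 1]
print("Average precision:", average_precision_score(y_test, y_score))

For random forests specifically, class_weight='balanced_subsample' is also worth trying, since it recomputes the weights per bootstrap sample.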

Also, I’ve noticed that balancing too aggressively can lead to overfitting, especially with synthetic data: if you oversample before splitting, near-duplicates of the same minority points can land in both train and test, which inflates the scores.
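
One thing that seems to avoid that failure mode is keeping SMOTE inside an imbalanced-learn pipeline, so it only ever runs on the training portion of each cross-validation fold and the synthetic points never leak into the evaluation split. A rough sketch, with the same stand-in data as above (imblearn is the imbalanced-learn package):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)

# SMOTE is applied to the training data of each CV fold only; the test
# folds stay untouched, so scores reflect the real (imbalanced) data
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', RandomForestClassifier(random_state=42)),
])

# Score on PR AUC rather than accuracy
scores = cross_val_score(pipe, X, y, scoring='average_precision', cv=5)
print("Mean average precision:", scores.mean())

The same pattern works for undersampling (e.g. RandomUnderSampler) if the dataset is large enough to afford throwing majority samples away.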

Would love to know how others decide between resampling and just adjusting the model.
