How to test multiple machine learning pipelines with just a few lines of Python

Introduction

Managing pipelines

from atom import ATOMClassifier
from sklearn.datasets import make_classification

# Create an imbalanced dataset
X, y = make_classification(
    n_samples=5000,
    n_features=30,
    n_informative=20,
    weights=(0.95,),
)

# Load the dataset into atom
atom = ATOMClassifier(X, y, test_size=0.2, verbose=2)
<< ================== ATOM ================== >>
Algorithm task: binary classification.

Dataset stats ====================== >>
Shape: (5000, 31)
Scaled: False
Outlier values: 582 (0.5%)
---------------------------------------
Train set size: 4000
Test set size: 1000
---------------------------------------
|    | dataset     | train       | test       |
|---:|:------------|:------------|:-----------|
|  0 | 4731 (17.6) | 3777 (16.9) | 954 (20.7) |
|  1 | 269 (1.0)   | 223 (1.0)   | 46 (1.0)   |
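
The call that triggers the feature selection step is not shown in this excerpt. Going by the settings logged further down (RFE with a random forest solver, keeping 12 features), it was presumably something along these lines:

# Assumed call: select the 12 most useful features with recursive
# feature elimination, using a random forest to rank them
atom.feature_selection(strategy="RFE", solver="RF", n_features=12)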
Fitting FeatureSelector...
Performing feature selection...
 --> The RFE selected 12 features from the dataset.
   >>> Dropping feature Feature 2 (rank 3).
   >>> Dropping feature Feature 3 (rank 8).
   >>> Dropping feature Feature 5 (rank 10).
   >>> Dropping feature Feature 7 (rank 17).
   >>> Dropping feature Feature 8 (rank 12).
   >>> Dropping feature Feature 11 (rank 19).
   >>> Dropping feature Feature 13 (rank 13).
   >>> Dropping feature Feature 14 (rank 11).
   >>> Dropping feature Feature 15 (rank 15).
   >>> Dropping feature Feature 17 (rank 4).
   >>> Dropping feature Feature 19 (rank 16).
   >>> Dropping feature Feature 20 (rank 2).
   >>> Dropping feature Feature 21 (rank 6).
   >>> Dropping feature Feature 23 (rank 5).
   >>> Dropping feature Feature 24 (rank 9).
   >>> Dropping feature Feature 25 (rank 18).
   >>> Dropping feature Feature 26 (rank 7).
   >>> Dropping feature Feature 27 (rank 14).
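
The training run below fits a single random forest (acronym "RF") and scores it on balanced accuracy, so the call was presumably:

# Assumed call: train a random forest on the master branch
atom.run(models="RF", metric="balanced_accuracy")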
Training ===================================== >>
Models: RF
Metric: balanced_accuracy


Results for Random Forest:         
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.5326
Time elapsed: 0.733s
-------------------------------------------------
Total time: 0.733s


Final results ========================= >>
Duration: 0.733s
------------------------------------------
Random Forest --> balanced_accuracy: 0.5326
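
The branch overview that follows (and the one printed later for the undersample branch) summarizes the pipeline and models attached to the current branch. Assuming the branch's status() method, it can be requested with:

# Assumed call: print the current branch's pipeline and models
atom.branch.status()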
Branch: master
 --> Pipeline: 
   >>> FeatureSelector
     --> strategy: RFE
     --> solver: RandomForestClassifier(n_jobs=1, random_state=1)
     --> n_features: 12
     --> max_frac_repeated: 1.0
     --> max_correlation: 1.0
     --> kwargs: {}
 --> Models: RF
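
The oversampling branch is created by assigning a new name to atom.branch, after which SMOTE is applied on that branch only. The calls behind the log below were presumably:

# Assumed calls: fork a new branch and oversample the minority class
atom.branch = "oversample"
atom.balance(strategy="smote")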
New branch oversample successfully created!
Oversampling with SMOTE...
 --> Adding 7102 samples to class: 1.
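
Training on the new branch only requires a fresh model tag; the "_os" suffix keeps the oversampled model apart from the first one. The run was presumably:

# Assumed call: train a second random forest on the oversampled data
atom.run(models="RF_os", metric="balanced_accuracy")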

Training ===================================== >>
Models: RF_os
Metric: balanced_accuracy


Results for Random Forest:         
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.7737
Time elapsed: 1.325s
-------------------------------------------------
Total time: 1.325s


Final results ========================= >>
Duration: 1.341s
------------------------------------------
Random Forest --> balanced_accuracy: 0.7737
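
The undersampling branch must split off from master rather than from the oversampled data, which is why its final pipeline contains only the feature selector and the NearMiss balancer. Assuming ATOM's "_from_" branch syntax, the calls were presumably:

# Assumed calls: fork a branch from master (not from oversample)
# and undersample the majority class with NearMiss
atom.branch = "undersample_from_master"
atom.balance(strategy="nearmiss")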
New branch undersample successfully created!

Undersampling with NearMiss...
 --> Removing 7102 samples from class: 0.
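
A third random forest, tagged "_us", is then trained on the undersampled branch, presumably with:

# Assumed call: train a third random forest on the undersampled data
atom.run(models="RF_us", metric="balanced_accuracy")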
Training ===================================== >>
Models: RF_us
Metric: balanced_accuracy


Results for Random Forest:         
Fit ---------------------------------------------
Train evaluation --> balanced_accuracy: 1.0
Test evaluation --> balanced_accuracy: 0.6888
Time elapsed: 0.189s
-------------------------------------------------
Total time: 0.189s


Final results ========================= >>
Duration: 0.189s
------------------------------------------
Random Forest --> balanced_accuracy: 0.6888
Branch: undersample
 --> Pipeline: 
   >>> FeatureSelector
     --> strategy: RFE
     --> solver: RandomForestClassifier(n_jobs=1, random_state=1)
     --> n_features: 12
     --> max_frac_repeated: 1.0
     --> max_correlation: 1.0
     --> kwargs: {}
   >>> Balancer
     --> strategy: NearMiss
     --> kwargs: {}
 --> Models: RF_us

Conclusion

Original post: https://towardsdatascience.com/how-to-test-multiple-machine-learning-pipelines-with-just-a-few-lines-of-python-1a16cb4686d
