PyCaret: Your One-Stop-Shop for Your Machine Learning Needs (Student Overview)

Christian Corrales
8 min read · Feb 25, 2021

In the world of data science there is a multitude of algorithms to choose from, based on your business needs or the experiments your heart desires to conduct. But what if you’re a data science bootcamp student who isn’t sure where to start, or decision overload creeps in? This is where PyCaret may be of service to you. PyCaret is an open-source machine learning library in Python for testing, training, and deploying supervised and unsupervised machine learning models in a low-code environment. The library can replace hundreds of lines of code with only a few words. PyCaret can help budding data science students, aspiring data professionals, and even seasoned data scientists set up machine learning experiments faster and analyze them more efficiently. Essentially, it’s a one-stop shop for data practitioners to spot-check standard machine learning algorithms with a single function call.

What is PyCaret?

The name comes from an R library called caret (Classification And REgression Training), created by Max Kuhn to help streamline the process of model development. In 2019, amid the popular shift from R to Python, Moez Ali recognized the need for a similar tool in a Python environment. And that’s how PyCaret was born.

PyCaret is a Python wrapper around machine learning libraries and frameworks such as scikit-learn, XGBoost, and LightGBM. The library performs end-to-end machine learning experiments, whether that’s imputing missing values, encoding categorical data, feature engineering, hyperparameter tuning, or building ensemble models. All operations performed are stored in a pipeline that is ready for deployment.

The goal of the package is to automate the major steps of evaluating and comparing machine learning algorithms in just a few lines of code:

  • Defining the data transforms to perform (setup())
  • Evaluating and comparing standard models (compare_models())
  • Tuning model hyperparameters (tune_model())
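
Put together, that spot-check workflow fits in a handful of lines. Below is a minimal sketch using PyCaret’s built-in juice sample dataset; any DataFrame with a labeled target column works the same way:

from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, tune_model

# load a built-in sample classification dataset
data = get_data('juice')
# define the data transforms and the train-test split
clf = setup(data = data, target = 'Purchase')
# train and rank the standard models
best = compare_models()
# tune the winner's hyperparameters
tuned = tune_model(best)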

Installation

Using the command line interface or notebook environment, run the below cell of code to install PyCaret.

pip install pycaret

If you are using Azure notebooks or Google Colab, run the below cell of code to install PyCaret.

!pip install pycaret

When you install PyCaret, all dependencies are installed automatically; the complete list is available in the PyCaret documentation.
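
If you would rather install the optional dependencies (extra models and interpretation tools) up front, the documentation also provides a fuller install target:

pip install pycaret[full]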

Environment Setup

  1. Importing a Module
# importing library
import pycaret
from pycaret.classification import *

Based on the problem you are trying to solve, you will need to import the module that best fits your problem statement. PyCaret has six different modules available: regression, classification, clustering, natural language processing (NLP), anomaly detection, and association rule mining. In the example above we use the classification module.

2. Preprocessing

# loading data for preprocessing
classification_setup = setup(data = Your_Data, target = 'Your_Target')

PyCaret performs some basic preprocessing tasks, like ignoring ID and date columns, imputing missing values, encoding categorical variables, and splitting the dataset into train and test sets for the rest of the modeling steps. When you run the setup function, it first infers and displays the data types; if you confirm them by pressing Enter, it creates the environment for you.

Note that, by default, PyCaret will evaluate models using 10-fold cross-validation, sort results by classification accuracy, and return the single best model.
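
The setup function also accepts optional arguments for more control. The sketch below uses two documented parameters, train_size and session_id; the target name 'Purchase' assumes the juice sample dataset from earlier:

# a more explicit setup call: custom split and a fixed seed
classification_setup = setup(data = data,
                             target = 'Purchase',
                             train_size = 0.7,  # 70/30 train-test split
                             session_id = 123)  # fixed seed for reproducibility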

Comparing Models

# comparing models 
best_model = compare_models(sort = 'Recall')

This trains a baseline version of each available model type, yields a detailed comparison of metrics across the trained models, and highlights the best result for each metric.

The evaluation metrics used are:

  • For Classification: Accuracy, AUC, Recall, Precision, F1, Kappa
  • For Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE
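
compare_models can also restrict the comparison to specific algorithms or keep more than one winner; the sketch below assumes the include and n_select parameters from the current API:

# compare only logistic regression, decision tree, and random forest,
# keeping the top three models ranked by recall
top3 = compare_models(include = ['lr', 'dt', 'rf'], n_select = 3, sort = 'Recall')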

Next, you can also take a look under the hood at the hyperparameters of the best model.

# print to view hyperparameters
print(best_model)

Creating the Model

# create a standalone model after comparing models
dt = create_model('dt')

Check the hyperparameters of the preferred model:

# print hyperparameters to compare to the vanilla model
print(dt)

You can also inspect the returned model object’s attributes by typing a period (.) after the variable name.
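
For example, because create_model returns a fitted scikit-learn estimator, its standard attributes and methods are available directly:

# the returned object is a fitted scikit-learn estimator
print(dt.get_params())  # all hyperparameters as a dictionary
print(dt.feature_importances_)  # impurity-based feature importances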

Model Tuning

We can tune the hyperparameters of a machine learning model by using the tune_model function.

# creating a tuned model
tuned_dt = tune_model(dt)

Again, PyCaret evaluates the model using 10-fold cross-validation and returns a table of k-fold cross-validated scores for the trained model. We can then print the hyperparameters to see the optimal values the tuning process settled on.

# print to view tuned hyperparameters
print(tuned_dt)
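
By default, tune_model optimizes Accuracy over a random search grid; both can be changed through the documented optimize and n_iter parameters:

# tune for recall instead of accuracy, with a larger search budget
tuned_dt_recall = tune_model(dt, optimize = 'Recall', n_iter = 50)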

Combining Models (Optional)

We can combine our trained models in various ways. We can create ensemble models with methods such as bagging (bootstrap aggregating) and boosting; both are invoked with the ensemble_model function. We can further apply blending and stacking to combine diverse estimators by passing a list of them to blend_models or stack_models. If desired, one could create ensemble models and combine them via blending or stacking, all in a single line of code. Below are a few examples (the blending and stacking calls reuse a random forest that we create and tune first):

# Creating a bagged decision tree ensemble model
bagged_dt = ensemble_model(dt)
# Creating a boosted decision tree ensemble model
boosted_dt = ensemble_model(dt, method = 'Boosting')
# Creating and tuning a random forest to use below
rf = create_model('rf')
tuned_rf = tune_model(rf)
# Blending estimators
blender = blend_models(estimator_list = [boosted_dt, bagged_dt, tuned_rf], method = 'soft')
# Stacking bagged, boosted, and tuned estimators
stacker = stack_models(estimator_list = [boosted_dt, bagged_dt, tuned_rf], meta_model = rf)

AutoML (Optional)

We can select the best model of the session against a specific metric by using the automl function. AutoML techniques generally reduce human oversight of the model selection process, which may not be ideal or appropriate in every context. However, they can be a useful tool for quickly identifying the highest-performing option on a particular metric, such as recall or precision.

# Select the best model based on the chosen metric
best = automl(optimize = 'AUC')

Model Analysis and Plotting

After tuning and deciding on a model, we can visualize performance and evaluate the model using the plot_model function. Here are a few examples, using an AdaBoost classifier that we create first:

# create an AdaBoost model to visualize
adaboost = create_model('ada')
# AUC plot
plot_model(adaboost, plot = 'auc')
# Decision Boundary
plot_model(adaboost, plot = 'boundary')
# Precision Recall Curve
plot_model(adaboost, plot = 'pr')
# Validation Curve
plot_model(adaboost, plot = 'vc')
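
If you need the figures for a report, plot_model can also write them to disk through its save parameter:

# save the AUC plot as a PNG in the working directory
plot_model(adaboost, plot = 'auc', save = True)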

Or, if you want to see a variety of plots (confusion matrix, AUC, feature importance, and much more) in one place, simply use the evaluate_model function; see below for an example using the tuned decision tree from earlier.
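
# interactive widget exposing every available plot for the model
evaluate_model(tuned_dt)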

Result Interpretation

PyCaret implements SHAP (SHapley Additive exPlanations) through the interpret_model function, which helps in debugging the model by showing what the model considers important.

# create an XGBoost model to interpret
classification_xgb = create_model('xgboost')
# SHAP summary plot
interpret_model(classification_xgb)
# correlation plot
interpret_model(classification_xgb, plot = 'correlation')
# interpretation of a particular data point
interpret_model(classification_xgb, plot = 'reason', observation = 0)

Predict using the Model

Now we can make predictions on the test/hold-out dataset, or on entirely unseen data, using the predict_model function. Make sure any unseen data is in the same format as the data provided when setting up the environment earlier. PyCaret builds a pipeline of all the steps, passes the unseen data through it, and gives us the results.

# create a model
rf = create_model('rf')
# predict test / hold-out dataset
rf_holdout_pred = predict_model(rf)

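To score genuinely new data, pass it through the data parameter; new_data below is a placeholder for any DataFrame with the same columns as the training data, minus the target:

# generate predictions on new, unlabeled data
new_predictions = predict_model(rf, data = new_data)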

Saving Model and Deployment

Once all the training is complete, the entire pipeline, including all preprocessing transformations and tuning, can be saved to disk and loaded back later. See below for an example:

# save the model
save_model(tuned_dt, 'decision_tree_1')
# load model
dt_model = load_model(model_name = 'decision_tree_1')
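
Beyond saving locally, the same pipeline can be pushed to cloud storage with deploy_model. The sketch below assumes PyCaret’s AWS integration; the bucket name is a placeholder, and AWS credentials must already be configured on your machine:

# deploy the pipeline to an S3 bucket (placeholder bucket name)
deploy_model(tuned_dt, model_name = 'decision_tree_1', platform = 'aws', authentication = {'bucket': 'your-s3-bucket'})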

Final Thoughts

Although PyCaret is a one-stop shop for your machine learning pipeline, we should not rely on one package alone. As students and practitioners of data, we must still apply domain knowledge and technical understanding to fully leverage tools such as PyCaret. That said, an efficient, low-code tool can be an invaluable resource for saving both cost and time. As a student, I used both PyCaret and hand-coded algorithms in a recent classification project. Being able to apply PyCaret’s evaluation and interpretation functions to my bespoke models as well as to PyCaret’s built-in models helped me greatly in understanding which models worked best and how to better optimize my own machine learning pipeline.

I’d like to leave you with one last thought: you can have a hammer, a screwdriver, and screws in your tool bag, but if you don’t know how and when to use them, it’ll be like using a screwdriver to hammer a nail into the wall.
