Grid search, random search, and Bayesian optimization are techniques for machine learning model hyperparameter tuning. This tutorial covers how to tune XGBoost hyperparameters using Python. You will learn:

- What are the differences between grid search, random search, and Bayesian optimization?
- How to use grid search cross-validation to tune the hyperparameters for the XGBoost model?
- How to use random search cross-validation to tune the hyperparameters for the XGBoost model?
- How to use Bayesian optimization Hyperopt to tune the hyperparameters for the XGBoost model?
- How to compare the results from grid search, random search, and Bayesian optimization Hyperopt?


Let’s get started!

### Step 0: Grid Search Vs. Random Search Vs. Bayesian Optimization

Grid search, random search, and Bayesian optimization have the same goal of choosing the best hyperparameters for a machine learning model. But they have differences in algorithm and implementation. Understanding these differences is essential for deciding which algorithm to use.

- Grid search is an exhaustive way to search hyperparameters. It evaluates every combination of hyperparameters for the model. Therefore, it can take a long time to run when there are a lot of hyperparameter combinations to compare.
- Random search picks a fixed number of hyperparameter combinations randomly, so not every single combination is evaluated. Therefore, a wider range of values and a longer list of hyperparameters can be assessed within a given time. The downside is that the random selection may miss the top-performing hyperparameter combinations.
- Bayesian optimization utilizes the results from the previous step to decide which hyperparameter combination to evaluate next. The major difference between Bayesian optimization and grid/random search is that grid search and random search consider each hyperparameter combination independently, while Bayesian optimization is dependent on the previous evaluation results.
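The difference between exhaustive and sampled search can be seen on a toy problem. The sketch below is a minimal illustration using only the standard library; `toy_score` is a hypothetical stand-in for "train a model and return its cross-validation score", and the parameter names `a` and `b` are made up. Bayesian optimization is not sketched here because it needs a surrogate model.

```python
import itertools
import random

# Hypothetical stand-in for "train a model and return its CV score".
# The peak is at a=0.5, b=3; a real objective would be much noisier.
def toy_score(a, b):
    return 1.0 - (a - 0.5) ** 2 - 0.01 * (b - 3) ** 2

a_values = [0.1, 0.3, 0.5, 0.7, 0.9]
b_values = [1, 2, 3, 4, 5]

# Grid search: evaluate every combination (5 * 5 = 25 evaluations).
grid = list(itertools.product(a_values, b_values))
grid_best = max(grid, key=lambda p: toy_score(*p))

# Random search: evaluate only a fixed budget of randomly drawn combinations.
rng = random.Random(0)
budget = 8
sampled = rng.sample(grid, budget)
random_best = max(sampled, key=lambda p: toy_score(*p))

print(f"grid search:   {len(grid)} evaluations, best={grid_best}")
print(f"random search: {len(sampled)} evaluations, best={random_best}")
```

Grid search always finds the best point on the grid, at the cost of evaluating all of it. Random search uses about a third of the evaluations here and may or may not land on the best combination.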

### Step 1: Install And Import Libraries

In the first step, let’s import the Python libraries needed for this tutorial.

For this tutorial, we will need to import `datasets` to get the breast cancer dataset. `pandas` and `numpy` are for data processing. `StandardScaler` is for standardizing the dataset.

`train_test_split`, `XGBClassifier`, and `precision_recall_fscore_support` are for model training and performance evaluation.

`GridSearchCV`, `RandomizedSearchCV`, and `hyperopt` are the hyperparameter tuning algorithms. `StratifiedKFold` and `cross_val_score` are for the cross-validation.

```python
# Dataset
from sklearn import datasets

# Data processing
import pandas as pd
import numpy as np

# Standardize the data
from sklearn.preprocessing import StandardScaler

# Model and performance evaluation
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import precision_recall_fscore_support as score

# Hyperparameter tuning
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV, RandomizedSearchCV
from hyperopt import tpe, STATUS_OK, Trials, hp, fmin, space_eval
```

### Step 2: Read In Data

In the second step, the breast cancer data from the `sklearn` library is loaded and transformed into a pandas dataframe.

The information summary shows that the dataset has 569 records and 31 columns.

```python
# Load the breast cancer dataset
data = datasets.load_breast_cancer()

# Put the data in pandas dataframe format
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df['target'] = data.target

# Check the data information
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         569 non-null    float64
 15  compactness error        569 non-null    float64
 16  concavity error          569 non-null    float64
 17  concave points error     569 non-null    float64
 18  symmetry error           569 non-null    float64
 19  fractal dimension error  569 non-null    float64
 20  worst radius             569 non-null    float64
 21  worst texture            569 non-null    float64
 22  worst perimeter          569 non-null    float64
 23  worst area               569 non-null    float64
 24  worst smoothness         569 non-null    float64
 25  worst compactness        569 non-null    float64
 26  worst concavity          569 non-null    float64
 27  worst concave points     569 non-null    float64
 28  worst symmetry           569 non-null    float64
 29  worst fractal dimension  569 non-null    float64
 30  target                   569 non-null    int64
dtypes: float64(30), int64(1)
memory usage: 137.9 KB
```

The target variable distribution shows 63% of ones and 37% of zeros in the dataset. In scikit-learn's encoding of this dataset, 1 means the tumor is benign, and 0 means the tumor is malignant.

```python
# Check the target value distribution
df['target'].value_counts(normalize=True)
```

```
1    0.627417
0    0.372583
Name: target, dtype: float64
```

### Step 3: Train Test Split

In step 3, we split the dataset into an 80% training dataset and a 20% testing dataset. `random_state` makes the random split results reproducible.

```python
# Train test split
X_train, X_test, y_train, y_test = train_test_split(df[df.columns.difference(['target'])],
                                                    df['target'],
                                                    test_size=0.2,
                                                    random_state=42)

# Check the number of records in training and testing dataset.
print(f'The training dataset has {len(X_train)} records.')
print(f'The testing dataset has {len(X_test)} records.')
```

The training dataset has 455 records, and the testing dataset has 114 records.

```
The training dataset has 455 records.
The testing dataset has 114 records.
```

### Step 4: Standardization

Standardization rescales the features to the same scale. It is calculated by subtracting the mean and dividing by the standard deviation. After standardization, each feature has zero mean and unit standard deviation.

Standardization should be fit on the training dataset only to prevent test dataset information from leaking into the training process. Then, the test dataset is standardized using the fitting results from the training dataset.
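As a minimal sketch of the fit-on-train, transform-both idea, the example below standardizes a single feature by hand with z = (x - mean) / std. The feature values are hypothetical and the code uses only the standard library; it illustrates the arithmetic and is not a replacement for `StandardScaler`.

```python
import statistics

# Hypothetical single-feature values, for illustration only.
train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

# Fit: learn the mean and standard deviation from the training data only.
mean = statistics.fmean(train)
std = statistics.pstdev(train)  # population standard deviation (ddof=0), as StandardScaler uses

# Transform: apply the training statistics to both datasets.
train_scaled = [(x - mean) / std for x in train]
test_scaled = [(x - mean) / std for x in test]

print(train_scaled)
print(test_scaled)
```

The scaled training values have zero mean and unit standard deviation by construction. The scaled test values generally do not, because they are rescaled with the training statistics rather than their own.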

There are different types of scalers. StandardScaler and MinMaxScaler are most commonly used. For a dataset with outliers, we can use RobustScaler.

In this tutorial, we will use `StandardScaler`.

```python
# Initiate scaler
sc = StandardScaler()

# Standardize the training dataset
X_train_transformed = pd.DataFrame(sc.fit_transform(X_train), index=X_train.index, columns=X_train.columns)

# Standardize the testing dataset
X_test_transformed = pd.DataFrame(sc.transform(X_test), index=X_test.index, columns=X_test.columns)

# Summary statistics after standardization
X_train_transformed.describe().T
```

We can see that after using StandardScaler, all the features have zero mean and unit standard deviation.

Let’s get the summary statistics for the training data before standardization as well, and we can see that the mean and standard deviation can be very different in scale. For example, the area error has a mean value of 40 and a standard deviation of 47. On the other hand, the compactness error has a mean of about 0.023 and a standard deviation of 0.019.

```python
# Summary statistics before standardization
X_train.describe().T
```

### Step 5: XGBoost Classifier With No Hyperparameter Tuning

In step 5, we will create an XGBoost classification model with default hyperparameters. This serves as a baseline model to compare against.

This is a list of the hyperparameters we can tune. Usually, a subset of essential hyperparameters will be tuned.

- `base_score` is the starting prediction score for all the instances at model initiation. This number does not have much impact on the final results when there is a sufficient number of iterations. Therefore, `base_score` is not a good choice for hyperparameter tuning.
- `booster` specifies which booster to use for the model. The `gbtree` and `dart` boosters use tree-based models, and the `gblinear` booster uses linear functions.
- `colsample_bylevel` is the subsample ratio of columns for each depth level from the set of columns for the current tree.
- `colsample_bynode` is the subsample ratio of columns for each node (split) from the set of columns for the current level.
- `colsample_bytree` is the subsample ratio of columns for each tree from the set of all columns in the training dataset.
- `gamma` is a value greater than or equal to zero. It is the minimum loss reduction required for a split.
- `learning_rate` is also called `eta`. It is a value between 0 and 1. It is the step size shrinkage for the feature weights to make the boosting process more conservative.
- `max_delta_step` puts an absolute cap on the weight before applying the `eta` correction. The default value of 0 means that there is no restriction on the maximum value of the weight. A positive number might help for a dataset with highly imbalanced classes. A value between 1 and 10 is usually used, but it can take any value greater than or equal to 0.
- `max_depth` is the maximum depth of a tree, and it can take the value of any integer greater than or equal to 0. 0 means no limit to the tree depth. A larger value for `max_depth` builds more complex models and tends to overfit.
- `min_child_weight` is the minimum sum of instance weight needed in a child for partitioning. It takes a value greater than or equal to 0.
- `missing` is the value in the input data that needs to be considered as a missing value. The default value is `None`, meaning that only `np.nan` is considered to be a missing value.
- `n_estimators` is the number of gradient boosted trees.
- `n_jobs` takes in the number of parallel threads for the model. `n_jobs=-1` means using all the available cores for parallel processing.
- `nthread` is the number of parallel threads for running XGBoost.
- `'objective': 'binary:logistic'` means that logistic regression for binary classification is used as the learning objective and the model outputs a probability.
- `random_state` sets a seed for model reproducibility.
- `reg_alpha` provides L1 regularization to the weight. Higher values result in more conservative models. The default value of 0 means no L1 regularization.
- `reg_lambda` provides L2 regularization to the weight. Higher values result in more conservative models. XGBoost applies L2 regularization by default.
- `scale_pos_weight` controls the balance of positive and negative weights. It's useful for unbalanced classes.
- `seed` sets a random number seed.
- `silent` decides whether to print out information during model training.
- `subsample` is the percentage of randomly sampled training data before growing trees. It happens in every boosting iteration. It is greater than 0 and less than or equal to 1. The default value of 1 means all the data in the training dataset will be used to build trees. A value of less than 1 helps to prevent overfitting.
- `verbosity` controls how many messages are printed. The valid values are 0 (silent), 1 (warning), 2 (info), and 3 (debug).

```python
# Initiate XGBoost Classifier
xgboost = XGBClassifier()

# Print default setting
xgboost.get_params()
```

```
{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 100,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'binary:logistic',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': None,
 'subsample': 1,
 'verbosity': 1}
```

When training the model, `seed=0` makes sure that we get reproducible results. After running the baseline XGBoost model, we predicted the testing dataset using `.predict` and calculated the predicted probabilities using `.predict_proba`.

```python
# Train the model
xgboost = XGBClassifier(seed=0).fit(X_train_transformed, y_train)

# Make prediction
xgboost_predict = xgboost.predict(X_test_transformed)

# Get predicted probability (on the standardized test data, matching the training data)
xgboost_predict_prob = xgboost.predict_proba(X_test_transformed)[:, 1]
```

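We want to capture as many actual cancer patients as possible for this particular dataset, so we will use recall as the performance metric to optimize. Note that `recall[1]` in the code below is the recall for the class labeled 1, which is the benign class in scikit-learn's encoding of this dataset; the recall for the malignant class (label 0) is `recall[0]`.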

```python
# Get performance metrics
precision, recall, fscore, support = score(y_test, xgboost_predict)

# Print result
print(f'The recall value for the baseline xgboost model is {recall[1]:.4f}')
```

The baseline XGBoost model gave us a recall of 97.18%.

```
The recall value for the baseline xgboost model is 0.9718
```
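For reference, recall for a class is the number of true positives divided by the sum of true positives and false negatives for that class. The sketch below computes it by hand on hypothetical labels and predictions; `recall_for_class` is an illustrative helper, not a scikit-learn function.

```python
def recall_for_class(y_true, y_pred, label):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp / (tp + fn)

# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

print(f"recall for class 1: {recall_for_class(y_true, y_pred, 1):.4f}")
print(f"recall for class 0: {recall_for_class(y_true, y_pred, 0):.4f}")
```

The two per-class values correspond to the two entries of the `recall` array returned by `precision_recall_fscore_support`, indexed by class label.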

### Step 6: Grid Search for XGBoost

In step 6, we will use grid search to find the best hyperparameter combinations for the XGBoost model. Grid search is an exhaustive hyperparameter search method. It trains models for every combination of specified hyperparameter values. Therefore, it can take a long time to run if we test out more hyperparameters and values.

For this reason, we would like to keep the grid search space relatively small so the process can finish in a reasonable timeframe. The search space specifies the hyperparameters and the values for which grid search builds models. We use three hyperparameters for grid search in this example.

- `colsample_bytree` is the percentage of columns to be randomly sampled for each tree.
- `reg_alpha` provides L1 regularization to the weight. Higher values result in more conservative models.
- `reg_lambda` provides L2 regularization to the weight. Higher values result in more conservative models.

Scoring is the metric to evaluate the cross-validation results for each model. Since recall is the evaluation metric for the model, we set `scoring = ['recall']`. The scoring option can take more than one metric in the list.

`StratifiedKFold` is used for the cross-validation. It helps us keep the class ratio in the folds the same as the training dataset. `n_splits=3` means we are doing 3-fold cross-validation. `shuffle=True` means the data are shuffled before splitting. `random_state=0` makes the shuffle reproducible.
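To make the idea of stratification concrete, here is a hand-rolled sketch that deals the samples of each class evenly across the folds. The function `stratified_folds` and the labels are hypothetical, and this is not how scikit-learn implements `StratifiedKFold` internally; it only shows why each fold ends up with the same class ratio.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits, seed=0):
    """Assign each sample index to a fold, spreading every class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before dealing
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Hypothetical labels: 12 ones and 6 zeros (a 2:1 class ratio).
labels = [1] * 12 + [0] * 6
folds = stratified_folds(labels, n_splits=3)
for i, fold in enumerate(folds):
    print(i, Counter(labels[idx] for idx in fold))
```

Each of the three folds receives four ones and two zeros, so the 2:1 ratio of the full label list is preserved in every fold.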

```python
# Define the search space
param_grid = {
    # Percentage of columns to be randomly sampled for each tree
    "colsample_bytree": [0.3, 0.5, 0.8],
    # reg_alpha provides L1 regularization to the weight, higher values result in more conservative models
    "reg_alpha": [0, 0.5, 1, 5],
    # reg_lambda provides L2 regularization to the weight, higher values result in more conservative models
    "reg_lambda": [0, 0.5, 1, 5]
}

# Set up score
scoring = ['recall']

# Set up the k-fold cross-validation
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
```

We specified a few options for `GridSearchCV`.

- `estimator=xgboost` means we are using XGBoost as the model.
- `param_grid=param_grid` takes our pre-defined search space for the grid search.
- `scoring=scoring` sets the performance evaluation metric. Because we set the scoring to 'recall', the model will use recall as the evaluation metric.
- `refit='recall'` enables refitting the model with the best parameters on the whole training dataset.
- `n_jobs=-1` means parallel processing using all the processors.
- `cv=kfold` takes the `StratifiedKFold` we defined.
- `verbose` controls the number of messages returned by the grid search. The higher the number, the more information is returned. `verbose=0` means silent.

After fitting `GridSearchCV` on the training dataset, we will have 48 hyperparameter combinations. Since 3-fold cross-validation is used, there are 144 models trained in total.
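The counts of 48 combinations and 144 fits can be checked directly from the grid. The sketch below enumerates the combinations with the standard library, using the same values as the search space in this step.

```python
import itertools

# Same values as the grid search space defined in this step.
param_grid = {
    "colsample_bytree": [0.3, 0.5, 0.8],
    "reg_alpha": [0, 0.5, 1, 5],
    "reg_lambda": [0, 0.5, 1, 5],
}
n_splits = 3

# Every combination is one value from each list: 3 * 4 * 4.
combinations = list(itertools.product(*param_grid.values()))
n_combinations = len(combinations)

# Each combination is trained once per cross-validation fold.
n_fits = n_combinations * n_splits
print(n_combinations, n_fits)  # 48 144
```

Because `refit` is enabled, `GridSearchCV` also fits one additional model with the best combination on the whole training dataset after the search.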

```python
# Define grid search
grid_search = GridSearchCV(estimator=xgboost,
                           param_grid=param_grid,
                           scoring=scoring,
                           refit='recall',
                           n_jobs=-1,
                           cv=kfold,
                           verbose=0)

# Fit grid search
grid_result = grid_search.fit(X_train_transformed, y_train)

# Print grid search summary
grid_result
```

```python
# Print the best score and the corresponding hyperparameters
print(f'The best score is {grid_result.best_score_:.4f}')
print('The best score standard deviation is', round(grid_result.cv_results_['std_test_recall'][grid_result.best_index_], 4))
print(f'The best hyperparameters are {grid_result.best_params_}')
```

The grid search cross-validation results show that sampling 80% of the features for each tree, using L1 regularization with a penalty coefficient of 0.5, and applying no L2 regularization gave us the best results. The best recall is 98.95%, and the standard deviation of the score is 0.86%.

```
The best score is 0.9895
The best score standard deviation is 0.0086
The best hyperparameters are {'colsample_bytree': 0.8, 'reg_alpha': 0.5, 'reg_lambda': 0}
```

```python
# Make prediction using the best model
grid_predict = grid_search.predict(X_test_transformed)

# Get predicted probabilities
grid_predict_prob = grid_search.predict_proba(X_test_transformed)[:, 1]

# Get performance metrics (the target is not standardized, so compare against y_test)
precision, recall, fscore, support = score(y_test, grid_predict)

# Print result
print(f'The recall value for the xgboost grid search is {recall[1]:.4f}')
```

We can see that the grid search recall value is the same as the baseline XGBoost model at 97.18%.

```
The recall value for the xgboost grid search is 0.9718
```

### Step 7: Random Search for XGBoost

In step 7, we are using a random search for XGBoost hyperparameter tuning. Since random search randomly picks a fixed number of hyperparameter combinations, we can afford to try more hyperparameters and more values. Therefore, we added three more parameters to the search space.

- `learning_rate` shrinks the weights to make the boosting process more conservative.
- `max_depth` is the maximum depth of the tree. Increasing it increases the model complexity.
- `gamma` specifies the minimum loss reduction required to do a split.

If at least one of the parameters is a distribution, sampling with replacement is used for a random search. If all parameters are provided as a list, sampling without replacement is used. Each list is treated as a uniform distribution.

```python
# Define the search space
param_grid = {
    # Learning rate shrinks the weights to make the boosting process more conservative
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 1],
    # Maximum depth of the tree, increasing it increases the model complexity
    "max_depth": range(3, 21, 3),
    # Gamma specifies the minimum loss reduction required to make a split
    "gamma": [i/10.0 for i in range(0, 5)],
    # Percentage of columns to be randomly sampled for each tree
    "colsample_bytree": [i/10.0 for i in range(3, 10)],
    # reg_alpha provides L1 regularization to the weight, higher values result in more conservative models
    "reg_alpha": [1e-5, 1e-2, 0.1, 1, 10, 100],
    # reg_lambda provides L2 regularization to the weight, higher values result in more conservative models
    "reg_lambda": [1e-5, 1e-2, 0.1, 1, 10, 100]
}

# Set up score
scoring = ['recall']

# Set up the k-fold cross-validation
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
```
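To get a feel for how large the search space above is, the sketch below counts its combinations and draws one random candidate. It uses only the standard library. The single draw is a simplified illustration that picks each value independently, and it does not reproduce how scikit-learn samples without replacement when all parameters are lists.

```python
import math
import random

# Same values as the random search space above.
search_space = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 1],
    "max_depth": list(range(3, 21, 3)),
    "gamma": [i / 10.0 for i in range(0, 5)],
    "colsample_bytree": [i / 10.0 for i in range(3, 10)],
    "reg_alpha": [1e-5, 1e-2, 0.1, 1, 10, 100],
    "reg_lambda": [1e-5, 1e-2, 0.1, 1, 10, 100],
}

# Total number of combinations an exhaustive grid search would have to evaluate.
total = math.prod(len(values) for values in search_space.values())
n_iter = 48
print(f"{total} combinations, {n_iter / total:.2%} covered by {n_iter} samples")

# One simplified random draw: pick a value uniformly from each list.
rng = random.Random(0)
candidate = {name: rng.choice(values) for name, values in search_space.items()}
print(candidate)
```

An exhaustive grid search over this space would need 37,800 combinations, so a budget of 48 samples covers only a small fraction of it.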

The same scoring metric and cross-validation values used in grid search are used for the random search. But for a random search, we need to specify a value for `n_iter`, the number of parameter combinations sampled. So we are randomly testing 48 combinations for this example.

```python
# Define random search
random_search = RandomizedSearchCV(estimator=xgboost,
                                   param_distributions=param_grid,
                                   n_iter=48,
                                   scoring=scoring,
                                   refit='recall',
                                   n_jobs=-1,
                                   cv=kfold,
                                   verbose=0)

# Fit random search
random_result = random_search.fit(X_train_transformed, y_train)

# Print random search summary
random_result
```

```python
# Print the best score and the corresponding hyperparameters
print(f'The best score is {random_result.best_score_:.4f}')
print('The best score standard deviation is', round(random_result.cv_results_['std_test_recall'][random_result.best_index_], 4))
print(f'The best hyperparameters are {random_result.best_params_}')
```

```
The best score is 0.9895
The best score standard deviation is 0.0086
The best hyperparameters are {'reg_lambda': 0.1, 'reg_alpha': 0.01, 'max_depth': 15, 'learning_rate': 0.1, 'gamma': 0.1, 'colsample_bytree': 0.5}
```

After finishing the random search cross-validation, we printed out the best score, standard deviation, and the best parameters. Although the best parameters are different from the grid search, the best cross-validation score (0.9895) and its standard deviation (0.0086) match the grid search results to four decimal places.

```python
# Make prediction using the best model
random_predict = random_search.predict(X_test_transformed)

# Get predicted probabilities
random_predict_prob = random_search.predict_proba(X_test_transformed)[:, 1]

# Get performance metrics
precision, recall, fscore, support = score(y_test, random_predict)

# Print result
print(f'The recall value for the xgboost random search is {recall[1]:.4f}')
```

With random search, the recall value on the test dataset increased from 97.18% to 98.59%.

```
The recall value for the xgboost random search is 0.9859
```

### Step 8: Bayesian Optimization For XGBoost

In step 8, we will apply Hyperopt Bayesian optimization to XGBoost hyperparameter tuning. According to the documentation on the Hyperopt GitHub page, there are four key elements for Hyperopt:

- the space over which to search
- the objective function to minimize
- the database in which to store all the point evaluations of the search
- the search algorithm to use

For the search space, the same space as the random search is used for the Hyperopt Bayesian optimization.

```python
# Space
space = {
    'learning_rate': hp.choice('learning_rate', [0.0001, 0.001, 0.01, 0.1, 1]),
    'max_depth': hp.choice('max_depth', range(3, 21, 3)),
    'gamma': hp.choice('gamma', [i/10.0 for i in range(0, 5)]),
    'colsample_bytree': hp.choice('colsample_bytree', [i/10.0 for i in range(3, 10)]),
    'reg_alpha': hp.choice('reg_alpha', [1e-5, 1e-2, 0.1, 1, 10, 100]),
    'reg_lambda': hp.choice('reg_lambda', [1e-5, 1e-2, 0.1, 1, 10, 100])
}
```

`StratifiedKFold` is used to split the training dataset into k folds and keep the ratio between the classes in each fold the same as the training dataset. It is used for the cross-validation.

- `n_splits=3` means that the training dataset is split into 3 folds. This is because our dataset is small. For a larger dataset, usually 5 or 10 folds are used.
- `shuffle=True` means that the dataset will be shuffled before splitting into folds. Note that the samples within each split will not be shuffled.
- `random_state=0` makes the split reproducible.

```python
# Set up the k-fold cross-validation
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
```

Then an objective function is defined.

- `XGBClassifier` is used as the model algorithm. `seed=0` makes the model results reproducible. `**params` takes in the hyperparameter values.
- `cross_val_score` produces k scores, one for each of the k folds. We get the mean of the k scores and output the average value. `estimator` takes the estimator to fit the data. `X` takes the training dataset feature matrix and `y` takes the target variable for the training dataset. `cv` determines the cross-validation splitting strategy. We set `cv=kfold`, which is the output from the `StratifiedKFold`. `scoring='recall'` means that `recall` is the key metric for the model. `n_jobs=-1` enables parallel model training.
- Next, `loss` is defined. Because the model's goal is to maximize recall, it is the same as minimizing negative recall, so we set `loss = - score`.
- The function returns a dictionary with `loss`, `params`, and `status`.

```python
# Objective function
def objective(params):
    xgboost = XGBClassifier(seed=0, **params)
    score = cross_val_score(estimator=xgboost,
                            X=X_train_transformed,
                            y=y_train,
                            cv=kfold,
                            scoring='recall',
                            n_jobs=-1).mean()
    # Loss is negative score
    loss = - score
    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'status': STATUS_OK}
```
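The sign flip in the objective can be checked on a toy example without Hyperopt or XGBoost. In the sketch below, `toy_recall` is a hypothetical stand-in for the cross-validated recall, and `STATUS_OK` is a local placeholder string rather than the constant imported from Hyperopt. Minimizing the negative score selects the same candidate as maximizing the score.

```python
# Local placeholder for illustration; Hyperopt exposes its own STATUS_OK constant.
STATUS_OK = "ok"

def toy_recall(params):
    """Hypothetical stand-in for a cross-validated recall score."""
    return 0.99 - 0.1 * abs(params["learning_rate"] - 0.1)

def objective(params):
    score = toy_recall(params)
    # Minimizing the negative score is the same as maximizing the score.
    return {"loss": -score, "params": params, "status": STATUS_OK}

candidates = [{"learning_rate": lr} for lr in [0.0001, 0.001, 0.01, 0.1, 1]]
results = [objective(p) for p in candidates]

best_by_loss = min(results, key=lambda r: r["loss"])
best_by_score = max(candidates, key=toy_recall)

print(best_by_loss["params"], best_by_score)
```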

`fmin` is the function to search the best hyperparameters with the smallest loss value.

- `fn` takes in the objective function.
- `space` is for the search space of the hyperparameters.
- `algo` is for the type of search algorithm. Hyperopt currently has three algorithms: random search, Tree of Parzen Estimators (TPE), and adaptive TPE. We are using TPE as the search algorithm.
- `max_evals` specifies the maximum number of evaluations.
- `trials` stores the information for the evaluations.

```python
# Optimize
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=48, trials=Trials())
```

Output:

```
100%|██████████| 48/48 [00:11<00:00, 4.23it/s, best loss: -0.9859649122807017]
```

After the Bayesian optimization search, we get a best loss of about -0.986, meaning that the best cross-validated recall is about 98.6%.

We can print out the index for the parameters using `print(best)`. To get the values of the best parameters, we can use `space_eval` and pass in the search space and `best`.

```python
# Print the index of the best parameters
print(best)

# Print the values of the best parameters
print(space_eval(space, best))
```

Output:

```
{'colsample_bytree': 1, 'gamma': 4, 'learning_rate': 0, 'max_depth': 5, 'reg_alpha': 0, 'reg_lambda': 1}
{'colsample_bytree': 0.4, 'gamma': 0.4, 'learning_rate': 0.0001, 'max_depth': 18, 'reg_alpha': 1e-05, 'reg_lambda': 0.01}
```
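The relationship between the two printed dictionaries is a plain list lookup: each number in the first dictionary is a position in the corresponding `hp.choice` list. The sketch below reproduces that mapping by hand for the values shown in the output above. It is a conceptual illustration for a search space made only of `hp.choice` entries, not Hyperopt's implementation of `space_eval`.

```python
# The choice lists, matching the search space defined for Hyperopt.
choices = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 1],
    "max_depth": list(range(3, 21, 3)),
    "gamma": [i / 10.0 for i in range(0, 5)],
    "colsample_bytree": [i / 10.0 for i in range(3, 10)],
    "reg_alpha": [1e-5, 1e-2, 0.1, 1, 10, 100],
    "reg_lambda": [1e-5, 1e-2, 0.1, 1, 10, 100],
}

# Indices as printed by print(best) in the output above.
best_indices = {"colsample_bytree": 1, "gamma": 4, "learning_rate": 0,
                "max_depth": 5, "reg_alpha": 0, "reg_lambda": 1}

# Look up each index in its list to recover the hyperparameter value.
best_values = {name: choices[name][i] for name, i in best_indices.items()}
print(best_values)
```

For example, `max_depth` index 5 points at the sixth entry of `range(3, 21, 3)`, which is 18.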

Next, we apply the best hyperparameters to the `XGBClassifier` and make predictions.

```python
# Look up the best hyperparameter values once
best_params = space_eval(space, best)

# Train model using the best parameters
xgboost_bo = XGBClassifier(seed=0,
                           colsample_bytree=best_params['colsample_bytree'],
                           gamma=best_params['gamma'],
                           learning_rate=best_params['learning_rate'],
                           max_depth=best_params['max_depth'],
                           reg_alpha=best_params['reg_alpha'],
                           reg_lambda=best_params['reg_lambda']
                           ).fit(X_train_transformed, y_train)

# Make prediction using the best model
bayesian_opt_predict = xgboost_bo.predict(X_test_transformed)

# Get predicted probabilities
bayesian_opt_predict_prob = xgboost_bo.predict_proba(X_test_transformed)[:, 1]

# Get performance metrics
precision, recall, fscore, support = score(y_test, bayesian_opt_predict)

# Print result
print(f'The recall value for the xgboost Bayesian optimization is {recall[1]:.4f}')
```

Output:

```
The recall value for the xgboost Bayesian optimization is 0.9859
```

The recall value on the test dataset is 98.59%, the same as the random search result.
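To compare the approaches side by side, the test-set recall values reported in steps 5 through 8 can be collected into one small table. The sketch below simply hard-codes the numbers printed in this tutorial; in a notebook you could instead fill the dictionary from the `recall[1]` value computed after each method.

```python
# Test-set recall values as reported in steps 5 through 8 of this tutorial.
test_recall = {
    "Baseline (default hyperparameters)": 0.9718,
    "Grid search": 0.9718,
    "Random search": 0.9859,
    "Bayesian optimization (Hyperopt)": 0.9859,
}

baseline = test_recall["Baseline (default hyperparameters)"]

# Print the methods from highest to lowest recall, with the gain over the baseline.
for name, value in sorted(test_recall.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<36} {value:.4f}  ({value - baseline:+.4f} vs. baseline)")
```

On a test dataset of only 114 records, differences this small may come down to very few predictions, so they are best read as indicative rather than conclusive.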

### Summary

In this tutorial, we covered how to tune XGBoost hyperparameters using Python. You learned

- What are the differences between grid search, random search, and Bayesian optimization?
- How to use grid search cross-validation to tune the hyperparameters for the XGBoost model?
- How to use random search cross-validation to tune the hyperparameters for the XGBoost model?
- How to use Bayesian optimization to tune the hyperparameters for the XGBoost model?
- How to compare the results from grid search, random search, and Bayesian optimization?

In practice, random search and Bayesian optimization often perform better than grid search for the same number of evaluations because they can tune more hyperparameters over wider ranges of values.

For more information about data science and machine learning, please check out my YouTube channel and Medium Page or follow me on LinkedIn.
