MLflow is an open-source platform for machine learning lifecycle management. MLflow on Databricks offers an integrated experience for running, tracking, and serving machine learning models. In this tutorial, we will cover:

- How to use MLflow to track different model versions
- How to retrieve MLflow experiment information programmatically
- How to retrieve MLflow information using the Databricks UI

**Resources for this post:**

- Databricks notebook with code
- More video tutorials on Databricks and PySpark
- More blog posts on Databricks and PySpark
- If you prefer the video version of the tutorial, please check out the video on YouTube

### Step 1: Import Libraries

In step 1, we will import the libraries. `pandas`, `numpy`, and the PySpark SQL functions are for data processing. `matplotlib` is for visualization, and `make_regression` is for creating the synthetic modeling dataset.

From the `pyspark.ml` library, we imported `VectorAssembler` for feature formatting, `LinearRegression` for model training, `RegressionEvaluator` for model evaluation, and `Pipeline` for pipeline creation and loading.

We also imported `mlflow`, `mlflow.spark` for the Spark model flavor, and `MlflowClient` to query model information. If you are using the Databricks Runtime for Machine Learning, MLflow is preinstalled. Otherwise, the MLflow package can be installed from PyPI (`pip install mlflow`).

```python
# Data processing
import pandas as pd
import numpy as np
from pyspark.sql.functions import log, col, exp

# Visualization
import matplotlib.pyplot as plt

# Create synthetic dataset
from sklearn.datasets import make_regression

# Modeling
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline

# MLflow
import mlflow
import mlflow.spark
from mlflow.tracking import MlflowClient
```

### Step 2: Create Dataset For Linear Regression

In step 2, we will create a synthetic dataset for the linear regression model.

Using `make_regression`, a dataset with one million records is created. The dataset has two features, both of them informative. It has a noise level of 10 and a bias of 2. `random_state` ensures the randomly created dataset is reproducible. The random state does not have to be 42; it can be any number.

```python
# Create a synthetic dataset
X, y = make_regression(n_samples=1000000,
                       n_features=2,
                       n_informative=2,
                       noise=10,
                       bias=2,
                       random_state=42)
```
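As a quick check of the `random_state` behavior, the sketch below (using a tiny 10-row dataset for speed) shows that the same seed reproduces the same data, while a different seed does not:

```python
import numpy as np
from sklearn.datasets import make_regression

# Two calls with the same random_state produce identical data
X1, y1 = make_regression(n_samples=10, n_features=2, noise=10, bias=2, random_state=42)
X2, y2 = make_regression(n_samples=10, n_features=2, noise=10, bias=2, random_state=42)
print(np.array_equal(X1, X2) and np.array_equal(y1, y2))  # True

# A different seed produces different data
X3, y3 = make_regression(n_samples=10, n_features=2, noise=10, bias=2, random_state=0)
print(np.array_equal(y1, y3))  # False
```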

After the dataset is created, we can scale the values to the desired ranges. In this example, the first feature is scaled to values between 1 and 100, the second feature is scaled to values between 1000 and 5000, and the target is scaled to values between 3000 and 80,000.

```python
# Scale feature 1 to values between 1 and 100
X[:, 0] = np.interp(X[:, 0], (X[:, 0].min(), X[:, 0].max()), (1, 100))

# Scale feature 2 to values between 1000 and 5000
X[:, 1] = np.interp(X[:, 1], (X[:, 1].min(), X[:, 1].max()), (1000, 5000))

# Scale dependent variable to values between 3000 and 80000
y = np.interp(y, (y.min(), y.max()), (3000, 80000))
```

The output of `make_regression` is in array format. We convert it into a pandas dataframe, and then the pandas dataframe is converted into a Spark dataframe.

```python
# Convert the data from numpy array to a pandas dataframe
pdf = pd.DataFrame({'feature1': X[:, 0], 'feature2': X[:, 1], 'dependent_variable': y})

# Convert pandas dataframe to spark dataframe
sdf = spark.createDataFrame(pdf)

# Check data summary statistics
display(sdf.summary())
```

`summary()` gives us the summary statistics of the dataset.

Next, let’s check the scatterplot between the features and the dependent variable. The visualization was created using the Databricks notebook built-in functionality. To learn more about it, please check out my previous tutorial on Databricks Dashboard For Big Data.

```python
display(sdf.select('dependent_variable', 'feature1'))
```

```python
display(sdf.select('dependent_variable', 'feature2'))
```

We can see that both features’ scatterplots show positive trends, but their shape and slope are different.

### Step 3: Train Test Split

After creating the modeling dataset, in step 3, we will make the train test split.

Using `randomSplit`, we split the dataset into 80% training and 20% testing. `seed=42` makes the random split reproducible. However, we need to make sure that the same cluster and partition number are used when reproducing the split.

After the split, we got 800,299 records in the training dataset and 199,701 records in the testing dataset.

```python
# Train test split
trainDF, testDF = sdf.randomSplit([.8, .2], seed=42)

# Print the number of records
print(f'There are {trainDF.cache().count()} records in the training dataset.')
print(f'There are {testDF.cache().count()} records in the testing dataset.')
```

### Step 4: Linear Regression With Raw Data – Model 1

In step 4, we will create the first model using linear regression. In this model, the features and the dependent variable created in the synthetic dataset will be used directly, so let's give it the run name `LR-Raw-Data`.

Firstly, a linear regression model is trained using spark ML. To learn more details, please check out my previous tutorial on Databricks Linear Regression With Spark ML.

Then the parameters of the model are logged. For this experiment, we plan to create two versions of the model: one with the raw dependent variable, and the other with the logarithm of the dependent variable. Model 1 uses the raw dependent variable. To log this information, we saved the name of the dependent variable in a parameter called `target_variable`.

We also logged the elastic net parameter. A value of 0.5 means 50% LASSO (L1) regularization and 50% Ridge (L2) regularization. To learn more about regularization, please check out my previous tutorial on LASSO (L1) Vs Ridge (L2) Vs Elastic Net Regularization.
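To make the mixing concrete, here is a minimal NumPy sketch of an elastic net penalty with hypothetical coefficient values. It assumes the Spark-style parameterization, where `elasticNetParam` (α) weights the L1 term against a halved squared-L2 term; this is illustrative, not Spark's internal code:

```python
import numpy as np

def elastic_net_penalty(weights, reg_param, alpha):
    """Elastic net penalty, Spark-style parameterization (assumption):
    reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = np.abs(weights).sum()          # LASSO term
    l2_sq = (weights ** 2).sum()        # Ridge term (squared L2 norm)
    return reg_param * (alpha * l1 + (1 - alpha) / 2 * l2_sq)

w = np.array([0.5, -2.0])  # hypothetical model coefficients
print(elastic_net_penalty(w, reg_param=0.1, alpha=0.5))  # half LASSO, half Ridge
print(elastic_net_penalty(w, reg_param=0.1, alpha=1.0))  # pure LASSO
print(elastic_net_penalty(w, reg_param=0.1, alpha=0.0))  # pure Ridge
```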

After that, we logged the model for this run, made predictions on the testing dataset, and saved the predictions as a CSV file artifact.

For model performance evaluation, we calculated RMSE, R squared, MSE, and MAE, and logged them as the experiment metrics.
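As a refresher on what these metrics measure, they can be computed directly from the predictions. A small NumPy sketch with toy values (not the tutorial's actual results):

```python
import numpy as np

y_true = np.array([3000.0, 4500.0, 8000.0, 6000.0])  # toy actual values
y_pred = np.array([3100.0, 4400.0, 7900.0, 6200.0])  # toy predictions

err = y_true - y_pred
mse = np.mean(err ** 2)                          # mean squared error
rmse = np.sqrt(mse)                              # root mean squared error
mae = np.mean(np.abs(err))                       # mean absolute error
ss_res = np.sum(err ** 2)                        # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print(rmse, r2, mse, mae)
```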

Finally, a plot for the dependent variable distribution is created and logged as an artifact.

```python
with mlflow.start_run(run_name="LR-Raw-Data") as run:

    # Define pipeline
    vecAssembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="dependent_variable",
                          predictionCol="prediction", elasticNetParam=0.5)
    pipeline = Pipeline(stages=[vecAssembler, lr])
    pipelineModel = pipeline.fit(trainDF)

    # Log parameters
    mlflow.log_param("target_variable", "dependent_variable")
    mlflow.log_param("elasticNetParam", 0.5)

    # Log the model for this run
    mlflow.spark.log_model(pipelineModel, "SparkML-linear-regression")

    # Make predictions
    predDF = pipelineModel.transform(testDF)

    # Save the predictions as csv
    predDF.toPandas().to_csv('predictions.csv', index=False)

    # Log the saved predictions as an artifact
    mlflow.log_artifact("predictions.csv")

    # Evaluate predictions
    regressionEvaluator = RegressionEvaluator(predictionCol="prediction", labelCol="dependent_variable")
    rmse = regressionEvaluator.setMetricName("rmse").evaluate(predDF)
    r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)
    mse = regressionEvaluator.setMetricName("mse").evaluate(predDF)
    mae = regressionEvaluator.setMetricName("mae").evaluate(predDF)

    # Log metrics
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("mae", mae)

    # Create a plot for the testing dataset
    testDF.toPandas().hist(column="dependent_variable", bins=100)

    # Log the plot as an artifact
    plt.savefig("dependent_variable.png")
    mlflow.log_artifact("dependent_variable.png")
```

### Step 5: Linear Regression With Log Target – Model 2

Taking the logarithm is a commonly used data transformation technique, usually applied to turn non-normally distributed data into a normal distribution. Our synthetic data is already normally distributed, so it does not need the logarithm, but we will create a model version with the log transformation for illustration purposes.
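A quick sketch of the round trip: the log transform compresses the target's range, and `exp` recovers the original scale, which is why the code below exponentiates the predictions before evaluation:

```python
import numpy as np

y = np.array([3000.0, 10000.0, 80000.0])  # values in the target's range

log_y = np.log(y)          # the model is trained on this compressed scale
recovered = np.exp(log_y)  # predictions are mapped back via exp

print(np.allclose(recovered, y))  # True: exp undoes log
```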

Firstly, we name the run `LR-Log-Target`, indicating that the target variable is the logarithm transformation of the raw data.

Then the logarithm of the dependent variable is created for both the training dataset and the testing dataset.

The rest of the code is very similar to the previous model, except for three changes. The first change is that the information related to the dependent variable is updated to the logarithm version. The second change is that `elasticNetParam` is changed from 0.5 to 0.8. The third change is related to the model prediction: because the target label for model training is in logarithm form, the predicted values are in logarithm form too, so we need to take the exponential of the predictions before evaluating model performance.

```python
with mlflow.start_run(run_name="LR-Log-Target") as run:

    # Take the log of the target variable
    logTrainDF = trainDF.withColumn("log_dv", log(col("dependent_variable")))
    logTestDF = testDF.withColumn("log_dv", log(col("dependent_variable")))

    # Define pipeline
    vecAssembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="log_dv",
                          predictionCol="log_prediction", elasticNetParam=0.8)
    pipeline = Pipeline(stages=[vecAssembler, lr])
    pipelineModel = pipeline.fit(logTrainDF)

    # Log parameters
    mlflow.log_param("target_variable", "log_dv")
    mlflow.log_param("elasticNetParam", 0.8)

    # Log model
    mlflow.spark.log_model(pipelineModel, "SparkML-linear-regression")

    # Make predictions and convert them back to the original scale
    predDF = pipelineModel.transform(logTestDF)
    expDF = predDF.withColumn("prediction", exp(col("log_prediction")))

    # Save the predictions as csv
    predDF.toPandas().to_csv('predictions.csv', index=False)

    # Log the saved predictions as an artifact
    mlflow.log_artifact("predictions.csv")

    # Evaluate predictions
    regressionEvaluator = RegressionEvaluator(predictionCol="prediction", labelCol="dependent_variable")
    rmse = regressionEvaluator.setMetricName("rmse").evaluate(expDF)
    r2 = regressionEvaluator.setMetricName("r2").evaluate(expDF)
    mse = regressionEvaluator.setMetricName("mse").evaluate(expDF)
    mae = regressionEvaluator.setMetricName("mae").evaluate(expDF)

    # Log metrics
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("mae", mae)

    # Create a plot for the testing dataset
    logTestDF.toPandas().hist(column="log_dv", bins=100)

    # Log the plot as an artifact
    plt.savefig("log_dependent_variable.png")
    mlflow.log_artifact("log_dependent_variable.png")
```

### Step 6: Get Model Experiments Information Programmatically

In step 6, we will get model experiments information programmatically.

`list_experiments()` from `MlflowClient()` lists all the experiments and their IDs. (Note that newer MLflow versions replace `list_experiments()` with `search_experiments()`.)

```python
# List MLflow experiments
MlflowClient().list_experiments()
```

There are two types of experiments, workspace experiments and notebook experiments.

- Workspace experiments are created from the Databricks Machine Learning UI or the MLflow API. Any notebook can log runs to a workspace experiment by experiment ID or experiment name. We can access a workspace experiment from the workspace menu.
- Notebook experiments are associated with specific notebooks. The experiment is automatically created by Databricks. We can access a notebook experiment from the notebook.

The experiment in this tutorial is a notebook experiment.

To get all the runs for an experiment, use `mlflow.search_runs` and pass in the `experiment_id`.

```python
# Get all runs for a given experiment
experiment_id = run.info.experiment_id
runs_df = mlflow.search_runs(experiment_id)

# Display information
runs_df.T
```

We can see that the model with the raw data has better performance than the log version of the data.
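`mlflow.search_runs` returns a pandas dataframe whose metric columns are prefixed with `metrics.`, so runs can be compared and sorted like any other dataframe. A sketch with made-up values mimicking that layout:

```python
import pandas as pd

# Toy stand-in for the dataframe returned by mlflow.search_runs
# (metric values are made up for illustration)
runs_df = pd.DataFrame({
    'tags.mlflow.runName': ['LR-Raw-Data', 'LR-Log-Target'],
    'metrics.rmse': [1644.0, 2200.0],
    'metrics.r2': [0.96, 0.93],
})

# Rank runs by RMSE (lower is better) and pick the best one
best = runs_df.sort_values('metrics.rmse').iloc[0]
print(best['tags.mlflow.runName'])  # LR-Raw-Data
```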

When there are a lot of runs in an experiment, we may want to sort the runs in descending time order and get the information for the latest runs. `max_results` controls the number of runs to keep.

```python
# Get the latest run
runs = MlflowClient().search_runs(experiment_id,
                                  order_by=["attributes.start_time desc"],
                                  max_results=1)

# Get the metrics from the latest run
runs[0].data.metrics
```

```
Out[33]: {'mae': 1166.6333634351645,
 'mse': 2702618.876696308,
 'r2': 0.9609755978839164,
 'rmse': 1643.9643781713485}
```

### Step 7: Get Model Experiments Information Using UI

In step 7, we will talk about how to use Databricks UI to retrieve the experiment information.

#### Step 7.1: Access the experiment information within a notebook

To access the experiment information within the notebook, we can click the **Experiment** icon on the upper right of the notebook. The two versions of the models are shown on the right sidebar of the notebook.

#### Step 7.2: Access Experiment UI

To open the full experiment UI on a new page, click the blue **Experiment UI** at the bottom of the sidebar.

#### Step 7.3: Access Experiment Run Information

To check the details of a run, click either the blue **Start Time** (6 hours ago in this example) or the Spark model icon under **Models** for the run.

We can get detailed information about this run by clicking the list of options on the left-hand side of the UI. It also provides example code for making predictions with the logged model using Spark and pandas dataframes.

### Summary

In this tutorial, we talked about how to use MLflow to track Spark ML linear regression models. We covered:

- How to use MLflow to track different model versions
- How to retrieve MLflow experiment information programmatically
- How to retrieve MLflow information using the Databricks UI

For more information about data science and machine learning, please check out my YouTube channel and Medium page, or follow me on LinkedIn.

### Recommended For You

- GrabNGoInfo Machine Learning Tutorials Inventory
- One-Class SVM For Anomaly Detection
- Multivariate Time Series Forecasting with Seasonality and Holiday Effect Using Prophet in Python
- Hyperparameter Tuning For XGBoost
- Recommendation System: User-Based Collaborative Filtering
- Four Oversampling And Under-Sampling Methods For Imbalanced Classification Using Python
- How to detect outliers | Data Science Interview Questions and Answers
- LASSO (L1) Vs Ridge (L2) Vs Elastic Net Regularization For Classification Model