Hyperparameter Tuning with MLflow and HyperOpt

Hyperparameters are parameters that control model training and unlike other parameters (like node weights) they are not learned. Examples of such parameters are the learning rate or the number of layers in a Neural Network.

Choosing the right values for those Hyperparameters is crucial for good training but it is not easy to just guess them. Hyperparameter tuning (or Optimization) is the process of optimizing the hyperparameter to maximize an objective (e.g. model accuracy on validation set). Different approaches can be used for this:

There are many tools that automate this process given an evaluation/objective function (e.g. maximize accuracy).

The rest of this article explores using HyperOpt with MLflow to find the best hyperparameter to use for training a TensorFlow model.

Hyperparameters Tuning with MLflow

The best practices for organizing runs in MLflow and tracking for hyperparameter tuning looks like this (source):

Tuning MLflow runs MLflow logging
Hyperparameter tuning algorithm Parent run Metadata, e.g., numFolds for CrossValidator
Fit & evaluate model with hyperparameter setting #1 Child run 1 Hyperparameters #1, evaluation metric #1
Fit & evaluate model with hyperparameter setting #2 Child run 2 Hyperparameters #2, evaluation metric #2

This translates to an MLflow project with the following steps:

The resulting MLproject file looks like this

# MLproject

name: HyperparameterTF

conda_env: conda.yaml

  # Step for model training
      data: {type: string, default: "./datasets/xyz.csv"}
      epochs: {type: int, default: 10}
      batch_size: {type: int, default: 32}
      learning_rate: {type: float, default: 3e-4}
    command: "python train.py {data}
                --batch-size {batch_size}
                --epochs {epochs}
                --learning-rate {learning_rate}"
  # Main step for launching hyper-parameters tuning
      data: {type: string, default: "./datasets/xyz.csv"}
      max_runs: {type: int, default: 12}
      epochs: {type: int, default: 32}
      metric: {type: string, default: "rmse"}
      algo: {type: string, default: "tpe.suggest"}
    command: "python search.py {training_data}
                --max-runs {max_runs}
                --epochs {epochs}
                --metric {metric}
                --algo {algo}"

The conda.yaml file referenced in the MLproject is simply used to declare all the needed dependencies, it may look like this:

# conda.yaml

name: hyperparam_tensorflow
  - defaults
  - python=3.6
  - numpy=1.14.3
  - pandas=0.22.0
  - pip:
    - mlflow>=1.0
    - hyperopt==0.1
    - tensorflow==2.0.0

Train step

The train step is implemented by the train.py where the hyperprameters are used to build the model and train it. On a high level, this is what it does:

The train.py looks like this:

# train.py

import pandas as pd
import tensorflow as tf
import mlflow.tensorflow

# Enable auto-logging to MLflow

@click.command(help="Trains a TensorFlow model on CSV input.")
@click.option("--epochs", type=click.INT, default=10, help="Number of train steps.")
@click.option("--batch-size", type=click.INT, default=32, help="Batch size.")
@click.option("--learning-rate", type=click.FLOAT, default=1e-2, help="Learning rate.")
def run(data, epochs, batch_size, learning_rate):
  # Read data and split on train/validation sets
  df = pd.read_csv(data)
  train, valid = train_test_split(df, random_state=31)
  train_x, train_y = ... # separate label and features
  valid_x, valid_y = ... # separate label and features
  # Build and train the model
  with mlflow.start_run():
    # build model architecture
    model = _create_model()
    # train model
    model.fit(train_x, train_y,
      validation_data=(valid_x, valid_y),

if __name__ == '__main__':

Main step

In the main step is where most of the interesting stuff happening and the actual best practices described earlier are implemented. On a high level, it does the following:

The search.py file implements the logic for the main step and look this this:

# search.py

from hyperopt import fmin, hp, tpe, rand

import mlflow.projects
from mlflow.tracking.client import MlflowClient

@click.command(help="Hyperparameter search with Hyperopt.")
@click.option("--max-runs", type=click.INT, default=10, help="Maximum number of runs.")
@click.option("--epochs", type=click.INT, default=500, help="Number of train steps.")
@click.option("--metric", type=click.STRING, default="rmse", help="Metric to optimize.")
@click.option("--algo", type=click.STRING, default="tpe.suggest", help="Search algorithm.")
def search(data, max_runs, epochs, metric, algo):
  tracking_client = mlflow.tracking.MlflowClient()
  # initial value for the parameter to be optimized
  _inf = np.finfo(np.float64).max
  # define the search space for hyper-parameters
  space = [
    hp.uniform('lr', 1e-5, 1e-1),
  with mlflow.start_run() as run:
    exp_id = run.info.experiment_id
    # run the optimization algorithm
    best = fmin(
      fn=train_fn(epochs, exp_id, _inf, _inf),
      algo=tpe.suggest if algo == "tpe.suggest" else rand.suggest,
    mlflow.set_tag("best params", str(best))
    # find all runs generated by this search
    client = MlflowClient()
    query = "tags.mlflow.parentRunId = '{run_id}' ".format(run_id=run.info.run_id)
    runs = client.search_runs([exp_id], query)
    # iterate over all runs to find best one
    best_train, best_valid = _inf, _inf
    best_run = None
    for r in runs:
      if r.data.metrics["val_rmse"] < best_val_valid:
        best_run = r
        best_train = r.data.metrics["train_rmse"]
        best_valid = r.data.metrics["val_rmse"]
    # log best run metrics as the final metrics of this run.
    mlflow.set_tag("best_run", best_run.info.run_id)
      "train_{}".format(metric): best_train,
      "val_{}".format(metric): best_valid

def train_fn(epochs, null_train_loss, null_valid_loss):

if __name__ == '__main__':

The definition of the train_fn where the call to the train step is perfomed looks like this:

# search.py

def train_fn(epochs, null_train_loss, null_valid_loss):
  # Actual training function
  def train(params):
    lr = params
    with mlflow.start_run(nested=True) as child_run:
      # run the training Step and wait it finishes
      p = mlflow.projects.run(
          "data": data,
          "epochs": str(epochs),
          "learning_rate": str(lr)
      succeeded = p.wait()
    # If training finished successfully log the metrics
    if succeeded:
      training_run = tracking_client.get_run(p.run_id)
      metrics = training_run.data.metrics
      train_loss = metrics["train_{}".format(metric)]
      valid_loss = metrics["val_{}".format(metric)]
      # reported failed run
      tracking_client.set_terminated(p.run_id, "FAILED")
      train_loss = null_train_loss
      valid_loss = null_valid_loss
    # log the metrics from this run
      "train_{}".format(metric): train_loss,
      "val_{}".format(metric)  : valid_loss
    # return validation loss which will be used by the optimization algorithm
    return valid_loss
  return train


After running the project, e.g. directly call the train step as follows, we can visualize the results (in this case the validation losses) of every run from the MLflow UI.

$ mlflow run -e train --experiment-name hyperopt .
Training vs. validation losses for every MLflow run