WANNA Job#
wanna.core.models.training_custom_job.BaseCustomJobModel
(*, name, project_id, zone=None, region, labels=None, description=None, service_account=None, network=None, bucket, tags=None, metadata=None, enable_web_access=False, base_output_directory=None, tensorboard_ref=None, timeout_seconds=86400, encryption_spec=None, env_vars=None)

name - [str] Custom name for this instance
project_id - [str] (optional) Overrides GCP Project ID from the gcp_profile segment
zone - [str] (optional) Overrides zone from the gcp_profile segment
region - [str] (optional) Overrides region from the gcp_profile segment
labels - [Dict[str, str]] (optional) Custom labels to apply to this instance
service_account - [str] (optional) Overrides service account from the gcp_profile segment
network - [str] (optional) Overrides network from the gcp_profile segment
tags - [Dict[str, str]] (optional) Tags to apply to this instance
metadata - [str] (optional) Custom metadata to apply to this instance
enable_web_access - [bool] Whether you want Vertex AI to enable interactive shell access to training containers. Default is False
bucket - [str] Overrides bucket from the gcp_profile segment
base_output_directory - [str] (optional) Path to where outputs will be saved
tensorboard_ref - [str] (optional) Name of the Vertex AI Experiment
timeout_seconds - [int] Job timeout. Defaults to 60 * 60 * 24 s = 24 hours
encryption_spec - [str] (optional) The Cloud KMS resource identifier. Has the form: projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key. The key needs to be in the same region as where the compute resource is created
env_vars - [Dict[str, str]] (optional) Environment variables to be propagated to the job
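For orientation, here is a minimal sketch of how these fields could appear in a wanna.yaml job definition. Only the field names come from the model above; the jobs: key, the job name and all values are illustrative assumptions:

jobs:
  - name: my-training-job            # hypothetical job name
    region: europe-west1             # overrides region from the gcp_profile segment
    bucket: my-staging-bucket        # overrides bucket from the gcp_profile segment
    timeout_seconds: 86400           # 24 hours, the default
    enable_web_access: false         # no interactive shell access to training containers
    tensorboard_ref: my-experiment   # name of the Vertex AI Experiment
    env_vars:
      LOG_LEVEL: INFO                # propagated to the job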
Hyper-parameter tuning#
wanna.core.models.training_custom_job.HyperparameterTuning
(*, metrics, parameters, max_trial_count=15, parallel_trial_count=3, search_algorithm=None, encryption_spec=None)

metrics - Dictionary of type [str, Literal["minimize", "maximize"]]
parameters - List[HyperParamater] defined per var_name, type, min, max, scale
max_trial_count - [int] Defaults to 15
parallel_trial_count - [int] Defaults to 3
search_algorithm - [str] (optional) Can be "grid" or "random"
encryption_spec - [str] (optional) The Cloud KMS resource identifier. Has the form: projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key. The key needs to be in the same region as where the compute resource is created
A custom job can be converted to a hyper-parameter tuning job simply by adding one extra parameter called hp_tuning.
This will start a series of jobs (instead of just one) and try to find the best combination of hyper-parameters
with respect to a target metric that you specify.
Read the official documentation for more information.
In general, you have to define which hyper-parameters are tunable, which metric you want to optimize, and how many trials you want to run. You also need to adjust your training script so that it accepts the hyper-parameters as script arguments and reports the optimized metric back to Vertex AI.
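As a sketch, assuming the hp_tuning block sits directly under a job definition in wanna.yaml (the jobs: key, the job name and the metric and parameter values are illustrative only):

jobs:
  - name: my-training-job                  # hypothetical job name
    # ... existing custom job configuration ...
    hp_tuning:                             # adding this block turns the custom job into a tuning job
      metrics: {'accuracy': 'maximize'}
      parameters:
        - var_name: learning_rate
          type: double
          min: 0.001
          max: 1
          scale: log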
Setting hyper-parameter space#
Your code should accept script arguments with names matching the wanna.yaml config.
For example, if you want to fine-tune the learning rate in your model:
In the wanna.yaml config:

hp_tuning:
  parameters:
    - var_name: learning_rate
      type: double
      min: 0.001
      max: 1
      scale: log
And the Python script should accept the same argument with the same type:

import argparse

# Vertex AI passes each trial's hyper-parameter value as a command-line argument
parser = argparse.ArgumentParser()
parser.add_argument(
    '--learning_rate',
    required=True,
    type=float,
    help='learning rate')
args = parser.parse_args()
Currently, you can use parameters of type double, integer, discrete and categorical.
Each of them must be specified by var_name, type and additionally:

- double: min, max and scale (linear/log)
- integer: min, max and scale (linear/log)
- discrete: values (list of possible values) and scale (linear/log)
- categorical: values (list of possible values)
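As an illustration, a sketch with one parameter of each type (the variable names and value ranges are made up for this example):

hp_tuning:
  parameters:
    - var_name: learning_rate     # double: continuous range, here searched on a log scale
      type: double
      min: 0.001
      max: 1
      scale: log
    - var_name: epochs            # integer: whole numbers within a range
      type: integer
      min: 5
      max: 50
      scale: linear
    - var_name: batch_size        # discrete: a fixed list of numeric values
      type: discrete
      values: [16, 32, 64, 128]
      scale: linear
    - var_name: optimizer         # categorical: a fixed list of string values
      type: categorical
      values: ['adam', 'sgd']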
Setting target metric#
You can choose to either maximize or minimize your optimized metric. Example in wanna.yaml:

hp_tuning:
  metrics: {'accuracy': 'maximize'}
  parameters:
    ...
Your Python script must report the metric back during training; for this, use the cloudml-hypertune library.
import hypertune

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='accuracy',  # must match the metric name defined in hp_tuning.metrics
    metric_value=0.987,
    global_step=1000)
Setting number of trials and search algorithm#
The number of trials can be influenced by max_trial_count and parallel_trial_count.
The search through the hyper-parameter space can be grid or random; if neither of those two is set,
the default Bayesian optimization will be used.
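A minimal sketch of these settings in wanna.yaml (the trial counts and algorithm choice are example values, not defaults):

hp_tuning:
  max_trial_count: 32          # total number of trials to run
  parallel_trial_count: 4      # how many trials run at the same time
  search_algorithm: random     # 'grid' or 'random'; omit to use the default Bayesian optimization
  metrics: {'accuracy': 'maximize'}
  parameters:
    ...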