WANNA Job#
wanna.core.models.training_custom_job.BaseCustomJobModel
(*, name, project_id, zone=None, region, labels=None, description=None, service_account=None, network=None, bucket, tags=None, metadata=None, enable_web_access=False, base_output_directory=None, tensorboard_ref=None, timeout_seconds=86400, encryption_spec=None, env_vars=None)

name - [str] Custom name for this instance
project_id - [str] (optional) Overrides GCP Project ID from the gcp_profile segment
zone - [str] (optional) Overrides zone from the gcp_profile segment
region - [str] (optional) Overrides region from the gcp_profile segment
labels - [Dict[str, str]] (optional) Custom labels to apply to this instance
service_account - [str] (optional) Overrides service account from the gcp_profile segment
network - [str] (optional) Overrides network from the gcp_profile segment
tags - [Dict[str, str]] (optional) Tags to apply to this instance
metadata - [str] (optional) Custom metadata to apply to this instance
enable_web_access - [bool] Whether you want Vertex AI to enable interactive shell access to training containers. Default is False
bucket - [str] Overrides bucket from the gcp_profile segment
base_output_directory - [str] (optional) Path to where outputs will be saved
tensorboard_ref - [str] (optional) Name of the Vertex AI Experiment
timeout_seconds - [int] Job timeout. Defaults to 60 * 60 * 24 s = 24 hours
encryption_spec - [str] (optional) The Cloud KMS resource identifier. Has the form: projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key. The key needs to be in the same region as where the compute resource is created
env_vars - [Dict[str, str]] (optional) Environment variables to be propagated to the job
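For orientation, here is a minimal sketch of how these fields could appear in a wanna.yaml job definition. Only the field names come from the model above; the jobs: key, the job name and all values are illustrative assumptions:

jobs:
  - name: my-training-job            # hypothetical job name
    region: europe-west1             # overrides region from the gcp_profile segment
    bucket: my-staging-bucket        # overrides bucket from the gcp_profile segment
    timeout_seconds: 86400           # 24 hours, the default
    enable_web_access: false         # no interactive shell access to training containers
    tensorboard_ref: my-experiment   # name of the Vertex AI Experiment
    env_vars:
      LOG_LEVEL: INFO                # propagated to the job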
Hyper-parameter tuning#
wanna.core.models.training_custom_job.HyperparameterTuning
(*, metrics, parameters, max_trial_count=15, parallel_trial_count=3, search_algorithm=None, encryption_spec=None)

metrics - Dictionary of type [str, Literal["minimize", "maximize"]]
parameters - List[HyperParamater] defined per var_name, type, min, max, scale
max_trial_count - [int] Defaults to 15
parallel_trial_count - [int] Defaults to 3
search_algorithm - [str] (optional) Can be "grid" or "random"
encryption_spec - [str] (optional) The Cloud KMS resource identifier. Has the form: projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key. The key needs to be in the same region as where the compute resource is created
A custom job can be converted to a hyper-parameter tuning job simply by adding one extra parameter called hp_tuning.
This will start a series of jobs (instead of just one) and try to find the best combination of hyper-parameters
with respect to a target metric that you specify.
Read the official documentation for more information.
In general, you have to define which hyper-parameters are tunable, which metric you want to optimize, and how many trials you want to run. You also need to adjust your training script so that it accepts the hyper-parameters as script arguments and reports the optimized metric back to Vertex AI.
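As a sketch, assuming the hp_tuning block sits directly under a job definition in wanna.yaml (the jobs: key, the job name and the metric and parameter values are illustrative only):

jobs:
  - name: my-training-job                  # hypothetical job name
    # ... existing custom job configuration ...
    hp_tuning:                             # adding this block turns the custom job into a tuning job
      metrics: {'accuracy': 'maximize'}
      parameters:
        - var_name: learning_rate
          type: double
          min: 0.001
          max: 1
          scale: log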
Setting hyper-parameter space#
Your code should accept script arguments with names matching the wanna.yaml config.
For example, if you want to fine-tune the learning rate in your model:
In the wanna.yaml config:

hp_tuning:
  parameters:
    - var_name: learning_rate
      type: double
      min: 0.001
      max: 1
      scale: log
And the Python script should accept the same argument with the same type:

import argparse

# Vertex AI passes each trial's hyper-parameter value as a command-line argument
parser = argparse.ArgumentParser()
parser.add_argument(
    '--learning_rate',
    required=True,
    type=float,
    help='learning rate')
args = parser.parse_args()
Currently, you can use parameters of type double, integer, discrete and categorical.
Each of them must be specified by var_name, type and additionally:

- double: min, max and scale (linear/log)
- integer: min, max and scale (linear/log)
- discrete: values (list of possible values) and scale (linear/log)
- categorical: values (list of possible values)
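As an illustration, a sketch with one parameter of each type (the variable names and value ranges are made up for this example):

hp_tuning:
  parameters:
    - var_name: learning_rate     # double: continuous range, here searched on a log scale
      type: double
      min: 0.001
      max: 1
      scale: log
    - var_name: epochs            # integer: whole numbers within a range
      type: integer
      min: 5
      max: 50
      scale: linear
    - var_name: batch_size        # discrete: a fixed list of numeric values
      type: discrete
      values: [16, 32, 64, 128]
      scale: linear
    - var_name: optimizer         # categorical: a fixed list of string values
      type: categorical
      values: ['adam', 'sgd']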
Setting target metric#
You can choose to either maximize or minimize your optimized metric. Example in wanna.yaml:

hp_tuning:
  metrics: {'accuracy': 'maximize'}
  parameters:
    ...
Your Python script must report the metric back during training; for this, use the cloudml-hypertune library.
import hypertune

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='accuracy',  # must match the metric name defined in hp_tuning.metrics
    metric_value=0.987,
    global_step=1000)
Setting number of trials and search algorithm#
The number of trials can be influenced by max_trial_count and parallel_trial_count.
The search through the hyper-parameter space can be grid or random; if neither of those two is set,
the default Bayesian optimization will be used.
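A minimal sketch of these settings in wanna.yaml (the trial counts and algorithm choice are example values, not defaults):

hp_tuning:
  max_trial_count: 32          # total number of trials to run
  parallel_trial_count: 4      # how many trials run at the same time
  search_algorithm: random     # 'grid' or 'random'; omit to use the default Bayesian optimization
  metrics: {'accuracy': 'maximize'}
  parameters:
    ...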