WANNA Managed notebook#
It offers a simple way of deploying Jupyter notebooks on GCP with minimal environment setup, automatic mounting of all GCS buckets in the project, connection to Dataproc clusters, and more.
class wanna.core.models.notebook.ManagedNotebookModel(*, name, project_id, zone=None, region=None, labels=None, description=None, service_account=None, network=None, bucket=None, tags=None, metadata=None, owner=None, machine_type='n1-standard-4', gpu=None, data_disk=None, kernel_docker_image_refs=None, tensorboard_ref=None, subnet=None, internal_ip_only=True, idle_shutdown=True, idle_shutdown_timeout=180)

- name - [str] Custom name for this instance
- project_id - [str] (optional) Overrides GCP Project ID from the gcp_profile segment
- zone - [str] (optional) Overrides zone from the gcp_profile segment
- region - [str] (optional) Overrides region from the gcp_profile segment
- labels - [Dict[str, str]] (optional) Custom labels to apply to this instance
- service_account - [str] (optional) Overrides service account from the gcp_profile segment
- network - [str] (optional) Overrides network from the gcp_profile segment
- tags - [Dict[str, str]] (optional) Tags to apply to this instance
- metadata - [Optional[Dict[str, Any]]] (optional) Custom metadata to apply to this instance
- owner - [str] This can be either a single user email address, in which case that user is the only one able to access the notebook, or a service account, in which case everyone who has the iam.serviceAccounts.actAs permission on the specified service account will be able to connect
- machine_type - [str] (optional) GCP Compute Engine machine type
- gpu - [GPU] (optional) The hardware GPU accelerator used on this instance
- data_disk - [Disk] (optional) Data disk configuration to attach to this instance
- kernels - [List[str]] (optional) Custom kernels given as links to a container registry
- tensorboard_ref - [str] (optional) Reference to a Vertex Experiments tensorboard
- subnet - [str] (optional) Subnetwork of a given network
- internal_ip_only - [bool] (optional) Public or private (default) IP address
- idle_shutdown - [bool] (optional) Turn the notebook off after the idle timeout; can be true (default) or false
- idle_shutdown_timeout - [int] (optional) Time in minutes, between 10 and 1440, defaults to 180
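The snippet below is a minimal sketch of constructing the model directly in Python, assuming it validates like a standard Pydantic model; the project ID, zone, and owner values are placeholders chosen for illustration.

from wanna.core.models.notebook import ManagedNotebookModel

# Placeholder values for illustration; in a real setup most of these come
# from the gcp_profile segment of your wanna.yaml.
notebook = ManagedNotebookModel(
    name="example",
    project_id="my-gcp-project",
    zone="europe-west1-b",
    owner="jane.doe@example.com",
    machine_type="n1-standard-4",
    idle_shutdown=True,
    idle_shutdown_timeout=180,
)
print(notebook.name, notebook.machine_type)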
Dataproc clusters and metastore#
If you want to run Spark jobs on a Dataproc cluster and also have a Hive Metastore service available as the default catalog for Spark SQL:
- Create a Dataproc Metastore in your GCP project & region in the Google Cloud UI
- Create a Dataproc cluster connected to this metastore, with a subnet specified. e.g.:
gcloud dataproc clusters create cluster-test --enable-component-gateway --region europe-west1 --subnet cloud-lab --zone europe-west1-b --single-node --optional-components JUPYTER --dataproc-metastore projects/cloud-lab-304213/locations/europe-west1/services/jacek-test
- Run your managed notebook. As the kernel, use PySpark on the remote Dataproc cluster that you have just created
- Test your Spark session for access. This example creates a database in your metastore:
from pyspark.sql import SparkSession

# Obtain (or create) a Spark session backed by the remote Dataproc cluster.
spark = SparkSession \
    .builder \
    .appName("MetastoreTest") \
    .getOrCreate()

# Creating a database goes through the attached Hive Metastore.
query = """CREATE DATABASE testdb"""
spark.sql(query)
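To confirm that the database was registered in the metastore, you can list the visible databases from the same session; this is a small follow-up sketch using standard Spark SQL, and the cleanup statement is optional.

# "testdb" created above should appear in the metastore listing.
spark.sql("SHOW DATABASES").show()

# Optional cleanup once the test is done.
spark.sql("DROP DATABASE IF EXISTS testdb")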
Tensorboard integration#
tb-gcp-uploader is needed to upload the logs to the tensorboard instance. A detailed tutorial on this tool can be found here.
If you set the tensorboard_ref in the WANNA yaml config, we will export the tensorboard resource name as AIP_TENSORBOARD_LOG_DIR.
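Inside the notebook you can then read the exported value from the environment. The snippet below is a minimal sketch; it only assumes the variable has been exported as described above.

import os

# Present only when tensorboard_ref is set in the WANNA config.
tb_log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR")
if tb_log_dir is None:
    raise RuntimeError("AIP_TENSORBOARD_LOG_DIR is not set; check tensorboard_ref in your wanna.yaml")

print(f"Tensorboard logs should be uploaded to: {tb_log_dir}")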
Example#
managed-notebooks:
  - name: example
    owner: jacek.hebda@avast.com
    machine_type: n1-standard-1
    labels:
      notebook_usecase: wanna-notebook-sample
    tags:
    metadata:
    gpu:
      count: 1
      accelerator_type: NVIDIA_TESLA_T4
    data_disk:
      disk_type: pd_standard
      size_gb: 100
    tensorboard_ref:
    kernels:
    network:
    subnet:
    idle_shutdown: True
    idle_shutdown_timeout: 180
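If you want a quick syntax check of such a config before deploying, one possible approach (a sketch, not part of WANNA's documented workflow) is to load the file and validate each entry against the model shown above; the project_id used here is a placeholder that would normally come from the gcp_profile segment.

import yaml
from wanna.core.models.notebook import ManagedNotebookModel

with open("wanna.yaml") as f:
    config = yaml.safe_load(f)

for entry in config["managed-notebooks"]:
    # Drop empty keys and supply a placeholder project_id.
    fields = {key: value for key, value in entry.items() if value is not None}
    notebook = ManagedNotebookModel(project_id="my-gcp-project", **fields)
    print(notebook.name, notebook.machine_type)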