WANNA Managed notebook#

WANNA managed notebooks offer a simple way of deploying Jupyter Notebooks on GCP, with minimal environment setup, automatic mounting of all GCS buckets in the project, connection to Dataproc clusters, and more.

class wanna.core.models.notebook.ManagedNotebookModel(*, name, project_id, zone=None, region=None, labels=None, description=None, service_account=None, network=None, bucket=None, tags=None, metadata=None, owner=None, machine_type='n1-standard-4', gpu=None, data_disk=None, kernel_docker_image_refs=None, tensorboard_ref=None, subnet=None, internal_ip_only=True, idle_shutdown=True, idle_shutdown_timeout=180)
  • name - [str] Custom name for this instance
  • project_id - [str] (optional) Overrides GCP Project ID from the gcp_profile segment
  • zone - [str] (optional) Overrides zone from the gcp_profile segment
  • region - [str] (optional) Overrides region from the gcp_profile segment
  • labels - [Dict[str, str]] (optional) Custom labels to apply to this instance
  • service_account - [str] (optional) Overrides service account from the gcp_profile segment
  • network - [str] (optional) Overrides network from the gcp_profile segment
  • tags - [Dict[str, str]] (optional) Tags to apply to this instance
  • metadata - [Optional[Dict[str, Any]]] (optional) Custom metadata to apply to this instance
  • owner - [str] Either a single user email address, in which case only that user can access the notebook, or a service account, in which case everyone with the iam.serviceAccounts.actAs permission on that service account can connect.
  • machine_type - [str] (optional) GCP Compute Engine machine type
  • gpu - [GPU] (optional) The hardware GPU accelerator used on this instance
  • data_disk - [Disk] (optional) Data disk configuration to attach to this instance
  • kernels - [List[str]] (optional) Custom kernels, given as links to container images in a container registry (see the sketch after this list)
  • tensorboard_ref - [str] (optional) Reference to a Vertex AI Experiments Tensorboard
  • subnet - [str] (optional) Subnetwork of the given network
  • internal_ip_only - [bool] (optional) Public or private (default) IP address
  • idle_shutdown - [bool] (optional) Shut down the notebook after the idle timeout; can be true (default) or false
  • idle_shutdown_timeout - [int] (optional) Idle timeout in minutes, between 10 and 1440, defaults to 180
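
The full example at the end of this page leaves kernels and tensorboard_ref empty, so the sketch below shows how they could be filled in. The container image URI, project, and Tensorboard reference name are placeholders, not real resources:

managed-notebooks:
  - name: example-custom-kernel
    owner: jane.doe@example.com  # hypothetical owner
    machine_type: n1-standard-4
    kernels:
      # placeholder URI of a container image providing a custom kernel
      - europe-west1-docker.pkg.dev/your-project/your-repo/custom-kernel:latest
    tensorboard_ref: your-tensorboard  # name of a Tensorboard defined elsewhere in your WANNA config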

Dataproc clusters and metastore#

If you want to run Spark jobs on a Dataproc cluster and also have a Hive Metastore service available as your default Spark SQL engine:

  • Create a Dataproc Metastore in your GCP project & region in the Google Cloud UI
  • Create a Dataproc cluster connected to this metastore, with a subnet specified. e.g.:
gcloud dataproc clusters create cluster-test \
    --enable-component-gateway \
    --region europe-west1 \
    --subnet cloud-lab \
    --zone europe-west1-b \
    --single-node \
    --optional-components JUPYTER \
    --dataproc-metastore projects/cloud-lab-304213/locations/europe-west1/services/jacek-test
  • Run your managed notebook. As the kernel, use PySpark on the remote Dataproc cluster that you have just created
  • Test your Spark session for access. This example creates a database in your metastore:
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session backed by the Dataproc cluster
spark = SparkSession \
    .builder \
    .appName("MetastoreTest") \
    .getOrCreate()

# Creating a database exercises the Hive Metastore configured above
query = """CREATE DATABASE testdb"""
spark.sql(query)
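
To confirm that the database was registered in the metastore, you can list the databases visible to the session created above (a quick sanity check, not part of the original steps):

# "testdb" should now appear alongside "default"
spark.sql("SHOW DATABASES").show()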

Tensorboard integration#

tb-gcp-uploader is needed to upload the logs to the tensorboard instance. A detailed tutorial on this tool can be found here.

If you set the tensorboard_ref in the WANNA yaml config, we will export the tensorboard resource name as AIP_TENSORBOARD_LOG_DIR.
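
A minimal sketch of how this could look inside the notebook, assuming tb-gcp-uploader is installed in the kernel, your training writes TensorBoard event files to a local ./logs directory, and the experiment name is a placeholder (consult the tb-gcp-uploader tutorial for the exact flags):

import os
import subprocess

# Tensorboard resource name exported by WANNA when tensorboard_ref is set
tensorboard = os.environ["AIP_TENSORBOARD_LOG_DIR"]

# Upload the local TensorBoard event files to the managed Tensorboard instance
subprocess.run(
    [
        "tb-gcp-uploader",
        "--tensorboard_resource_name", tensorboard,
        "--logdir", "./logs",                        # assumed local log directory
        "--experiment_name", "wanna-notebook-demo",  # hypothetical experiment
        "--one_shot", "true",                        # upload once and exit
    ],
    check=True,
)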

Example#

managed-notebooks:
  - name: example
    owner: jacek.hebda@avast.com
    machine_type: n1-standard-1
    labels:
      notebook_usecase: wanna-notebook-sample
    tags:
    metadata:
    gpu:
      count: 1
      accelerator_type: NVIDIA_TESLA_T4
    data_disk:
      disk_type: pd_standard
      size_gb: 100
    tensorboard_ref:
    kernels:
    network:
    subnet: 
    idle_shutdown: True
    idle_shutdown_timeout: 180