WANNA Managed notebook#
It offers a simple way of deploying Jupyter notebooks on GCP with minimal environment setup, automatic mounting of all GCS buckets in the project, connection to Dataproc clusters, and more.
class wanna.core.models.notebook.ManagedNotebookModel(*, name, project_id, zone=None, region=None, labels=None, description=None, service_account=None, network=None, bucket=None, tags=None, metadata=None, owner=None, machine_type='n1-standard-4', gpu=None, data_disk=None, kernel_docker_image_refs=None, tensorboard_ref=None, subnet=None, internal_ip_only=True, idle_shutdown=True, idle_shutdown_timeout=180)

- name - [str] Custom name for this instance
- project_id - [str] (optional) Overrides GCP Project ID from the gcp_profile segment
- zone - [str] (optional) Overrides zone from the gcp_profile segment
- region - [str] (optional) Overrides region from the gcp_profile segment
- labels - [Dict[str, str]] (optional) Custom labels to apply to this instance
- service_account - [str] (optional) Overrides service account from the gcp_profile segment
- network - [str] (optional) Overrides network from the gcp_profile segment
- tags - [Dict[str, str]] (optional) Tags to apply to this instance
- metadata - [Optional[Dict[str, Any]]] (optional) Custom metadata to apply to this instance
- owner - [str] This can be either a single user email address, in which case that user is the only one able to access the notebook, or a service account, in which case everyone who has the iam.serviceAccounts.actAs permission on the specified service account will be able to connect
- machine_type - [str] (optional) GCP Compute Engine machine type
- gpu - [GPU] (optional) The hardware GPU accelerator used on this instance
- data_disk - [Disk] (optional) Data disk configuration to attach to this instance
- kernels - [List[str]] (optional) Custom kernels given as links to a container registry
- tensorboard_ref - [str] (optional) Reference to a Vertex Experiments tensorboard
- subnet - [str] (optional) Subnetwork of a given network
- internal_ip_only - [bool] (optional) Public or private (default) IP address
- idle_shutdown - [bool] (optional) Turn the notebook off after the idle timeout; can be true (default) or false
- idle_shutdown_timeout - [int] (optional) Time in minutes, between 10 and 1440, defaults to 180
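The snippet below is a minimal sketch of constructing the model directly in Python, assuming it validates like a standard Pydantic model; the project ID, zone, and owner values are placeholders chosen for illustration.

from wanna.core.models.notebook import ManagedNotebookModel

# Placeholder values for illustration; in a real setup most of these come
# from the gcp_profile segment of your wanna.yaml.
notebook = ManagedNotebookModel(
    name="example",
    project_id="my-gcp-project",
    zone="europe-west1-b",
    owner="jane.doe@example.com",
    machine_type="n1-standard-4",
    idle_shutdown=True,
    idle_shutdown_timeout=180,
)
print(notebook.name, notebook.machine_type)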
Dataproc clusters and metastore#
If you want to run Spark jobs on a Dataproc cluster and also have a Hive Metastore service available as the default catalog for Spark SQL:
- Create a Dataproc Metastore in your GCP project & region in the Google Cloud UI
- Create a Dataproc cluster connected to this metastore, with a subnet specified. e.g.:
gcloud dataproc clusters create cluster-test --enable-component-gateway --region europe-west1 --subnet cloud-lab --zone europe-west1-b --single-node --optional-components JUPYTER --dataproc-metastore projects/cloud-lab-304213/locations/europe-west1/services/jacek-test
- Run your managed notebook. As the kernel, use PySpark on the remote Dataproc cluster that you have just created
- Test your Spark session for access. This example creates a database in your metastore:
from pyspark.sql import SparkSession

# Obtain (or create) a Spark session backed by the remote Dataproc cluster.
spark = SparkSession \
    .builder \
    .appName("MetastoreTest") \
    .getOrCreate()

# Creating a database goes through the attached Hive Metastore.
query = """CREATE DATABASE testdb"""
spark.sql(query)
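To confirm that the database was registered in the metastore, you can list the visible databases from the same session; this is a small follow-up sketch using standard Spark SQL, and the cleanup statement is optional.

# "testdb" created above should appear in the metastore listing.
spark.sql("SHOW DATABASES").show()

# Optional cleanup once the test is done.
spark.sql("DROP DATABASE IF EXISTS testdb")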
Tensorboard integration#
tb-gcp-uploader is needed to upload the logs to the tensorboard instance. A detailed tutorial on this tool can be found here.
If you set the tensorboard_ref in the WANNA yaml config, we will export the tensorboard resource name as AIP_TENSORBOARD_LOG_DIR.
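Inside the notebook you can then read the exported value from the environment. The snippet below is a minimal sketch; it only assumes the variable has been exported as described above.

import os

# Present only when tensorboard_ref is set in the WANNA config.
tb_log_dir = os.environ.get("AIP_TENSORBOARD_LOG_DIR")
if tb_log_dir is None:
    raise RuntimeError("AIP_TENSORBOARD_LOG_DIR is not set; check tensorboard_ref in your wanna.yaml")

print(f"Tensorboard logs should be uploaded to: {tb_log_dir}")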
Example#
managed-notebooks:
  - name: example
    owner: jacek.hebda@avast.com
    machine_type: n1-standard-1
    labels:
      notebook_usecase: wanna-notebook-sample
    tags:
    metadata:
    gpu:
      count: 1
      accelerator_type: NVIDIA_TESLA_T4
    data_disk:
      disk_type: pd_standard
      size_gb: 100
    tensorboard_ref:
    kernels:
    network:
    subnet:
    idle_shutdown: True
    idle_shutdown_timeout: 180
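If you want a quick syntax check of such a config before deploying, one possible approach (a sketch, not part of WANNA's documented workflow) is to load the file and validate each entry against the model shown above; the project_id used here is a placeholder that would normally come from the gcp_profile segment.

import yaml
from wanna.core.models.notebook import ManagedNotebookModel

with open("wanna.yaml") as f:
    config = yaml.safe_load(f)

for entry in config["managed-notebooks"]:
    # Drop empty keys and supply a placeholder project_id.
    fields = {key: value for key, value in entry.items() if value is not None}
    notebook = ManagedNotebookModel(project_id="my-gcp-project", **fields)
    print(notebook.name, notebook.machine_type)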