diff --git a/pathwaysutils/experimental/shared_pathways_service/README.md b/pathwaysutils/experimental/shared_pathways_service/README.md
index a46dcd4..baa3e2d 100644
--- a/pathwaysutils/experimental/shared_pathways_service/README.md
+++ b/pathwaysutils/experimental/shared_pathways_service/README.md
@@ -8,54 +8,123 @@ service that manages scheduling and error handling.
 
 ## Requirements
 
-Make sure that your GKE cluster is running the Resource Manager and Worker pods.
-You can follow the steps
-here
-to confirm the status of these pods. If you haven't started the Pathways pods
-yet, you can use [pw-service-example.yaml](yamls/pw-service-example.yaml).
-Make sure to modify the following values to deploy these pods:
-
-- A unique Jobset name for the cluster's Pathways pods
+### 1. Create a GKE cluster with TPUs
+
+Make sure you have a GKE cluster with at least one TPU slice (v5e, v5p, or v6e).
+
+### 2. Deploy the Pathways head pod
+
+Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
+Make sure to modify the following values to deploy the Pathways pods:
+
+- A unique Jobset name for the head pod
 - GCS bucket path
 - TPU type and topology
 - Number of slices
 
-These fields are highlighted in the YAML file with trailing comments for easier
-understanding.
+### 3. Verify that the pods created in [Step 2](#2-deploy-the-pathways-head-pod) are running
 
-## Instructions
+Verify that the Shared Pathways Service components have started, specifically the Pathways resource manager (RM)
+and the Pathways workers.
+
+```shell
+# Set the environment variables.
+$ PROJECT=
+$ CLUSTER_NAME=
+$ REGION=  # e.g., us-central2
+
+# Get credentials for your cluster.
+$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default
+```
+
+#### Option 1: List all pods
+
+```shell
+$ kubectl get pods
+
+# Sample expected output (1 head pod and 1 or more worker pods)
+NAME                                       READY   STATUS    RESTARTS   AGE
+pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s   # HEAD POD
+pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s   # WORKER 0
+pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s   # WORKER 1
+```
+
+#### Option 2: Check the status of the specific pods that belong to your Pathways Service
+
+```shell
+# e.g., pathways-cluster
+$ JOBSET_NAME=  # same Jobset name as you used in pw-service-example.yaml
 
-1. Clone `pathwaysutils`.
+# e.g., pathways-cluster-pathways-head-0-0-zzmn2
+$ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep head)
 
-`git clone https://github.com/AI-Hypercomputer/pathways-utils.git`
+# e.g., pathways-cluster-worker-0-0-bdzq4
+$ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-')
+```
 
-2. Install portpicker
+#### Option 3: Check project logs
 
-`pip install portpicker`
+Find the detailed instructions here.
 
-3. Import `isc_pathways` and move your workload under
-`with isc_pathways.connect()` statement. Refer to
-[run_connect_example.py](run_connect_example.py) for reference. Example code:
+
+### 4. Find the Pathways service address
+
+Find the address of the Pathways service from the logs. The command below checks the worker pod logs.
+
+```shell
+$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"
+I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
 ```
-    from pathwaysutils.experimental.shared_pathways_service import isc_pathways
-
-    with isc_pathways.connect(
-        cluster="my-cluster",
-        project="my-project",
-        region="region",
-        gcs_bucket="gs://user-bucket",
-        pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
-        expected_tpu_instances={"tpuv6e:2x2": 2},
-    ) as tm:
-      import jax.numpy as jnp
-      import pathwaysutils
-      import pprint
-
-      pathwaysutils.initialize()
-      orig_matrix = jnp.zeros(5)
-      ...
+
+## Instructions
+
+### 1. Clone `pathwaysutils` and install the [listed requirements](https://github.com/AI-Hypercomputer/pathways-utils/blob/main/requirements.txt)
+
+```shell
+git clone https://github.com/AI-Hypercomputer/pathways-utils.git
+```
+
+### 2. Use the `isc_pathways` Context Manager
+
+In your script:
+
+1. Import `isc_pathways`.
+2. Add a `with isc_pathways.connect(...)` statement. The function takes the following values:
+    - Cluster name
+    - Project name
+    - Region
+    - GCS bucket name
+    - Pathways service address (see [Step 4](#4-find-the-pathways-service-address) for how to find the RM address)
+
+3. Write your ML code under this context manager (the `with` block) to run your JAX code on the underlying TPUs.
+
+See [run_connect_example.py](run_connect_example.py) for reference. Example code:
+
+```shell
+python3 pathwaysutils/experimental/shared_pathways_service/run_connect_example.py \
+--cluster="my-cluster" \
+--project="my-project" \
+--region="cluster-region" \
+--gcs_bucket="gs://user-bucket" \
+--pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001" \
+--tpu_type="tpuv6e:2x2" \
+--tpu_count=1  # number of slices
 ```
 
 The connect block will deploy a proxy pod dedicated to your client and connect
 your local runtime environment to the proxy pod via port-forwarding.
+
+4. You can start another client that uses the same `pathways_service` (similar to [Step 3](#ml-code)). If the Shared Pathways
+Service finds available TPUs that match your request, your workload starts running on those resources.
+However, if all TPUs are occupied, expect your script to wait until TPUs become available again.
+
+## Troubleshooting
+
+- Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways)
+if your Pathways pods do not come up.
+
+- Known issues: the service's cleanup is not fully graceful yet. You can safely ignore a
+`Segmentation fault` error that appears after your ML job completes.
diff --git a/pathwaysutils/sidecar/python/requirements.txt b/pathwaysutils/sidecar/python/requirements.txt
index 6606607..517a98a 100644
--- a/pathwaysutils/sidecar/python/requirements.txt
+++ b/pathwaysutils/sidecar/python/requirements.txt
@@ -3,3 +3,4 @@ jax>=0.5.1
 tensorflow-datasets
 tiktoken
 grain-nightly>=0.0.1
+portpicker
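
The updated README shows only the command-line invocation of `run_connect_example.py`. For readers who want to call the context manager directly from Python, here is a minimal sketch of that flow, adapted from the `isc_pathways.connect(...)` example the old README contained; the `expected_tpu_instances` argument and the body of the `with` block are carried over from that older example as assumptions and may not map one-to-one onto the new `--tpu_type`/`--tpu_count` flags.

```python
# Minimal sketch, adapted from the example previously shown in the README.
# The expected_tpu_instances mapping (TPU type/topology -> number of slices)
# is an assumption carried over from that example.
from pathwaysutils.experimental.shared_pathways_service import isc_pathways

with isc_pathways.connect(
    cluster="my-cluster",
    project="my-project",
    region="cluster-region",
    gcs_bucket="gs://user-bucket",
    pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
    expected_tpu_instances={"tpuv6e:2x2": 1},
):
  # As in the original example, JAX is imported and initialized inside the
  # `with` block, after connect() has deployed the proxy pod and set up
  # port-forwarding to it.
  import jax
  import jax.numpy as jnp
  import pathwaysutils

  pathwaysutils.initialize()
  print(jax.devices())        # TPU devices reached through the Pathways proxy
  orig_matrix = jnp.zeros(5)
  print(orig_matrix + 1.0)
```

Importing JAX inside the block mirrors the original example; presumably this ensures the client initializes against the proxy's backend rather than a local one.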