**pathwaysutils/experimental/shared_pathways_service/README.md** (104 additions, 35 deletions)

…service that manages scheduling and error handling.

## Requirements

### 1. Create a GKE cluster with TPUs

Make sure you have a GKE cluster with at least one TPU slice (v5e, v5p, or v6e).

<a name="pw-service-yaml"></a>

### 2. Deploy the Pathways head pod

Start the Shared Pathways Service by using [pw-service-example.yaml](yamls/pw-service-example.yaml).
Make sure to modify the following values to deploy the Pathways pods:

- A unique Jobset name for the head pod
- GCS bucket path
- TPU type and topology
- Number of slices

These fields are highlighted in the YAML file with trailing comments for easier
understanding.
### 3. Verify that the pods created in [Step 2](#2-deploy-the-pathways-head-pod) are running

Verify that the Shared Pathways Service components are started, specifically the Pathways resource manager (RM) and
Pathways workers.

```shell
# Set the environment variables.
$ PROJECT=<your-project>
$ CLUSTER_NAME=<your-cluster>
$ REGION=<cluster-region> # e.g., us-central2

# Get credentials for your cluster.
$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT \
    && kubectl config view \
    && kubectl config set-context --current --namespace=default
```

#### Option 1: List all pods

```shell
$ kubectl get pods

# Sample expected output (1 Head pod and 1 or more Worker pods)
NAME READY STATUS RESTARTS AGE
pathways-cluster-pathways-head-0-0-zzmn2 2/2 Running 0 3m49s # HEAD POD
pathways-cluster-worker-0-0-bdzq4 1/1 Running 0 3m36s # WORKER 0
pathways-cluster-worker-1-0-km2rf 1/1 Running 0 3m36s # WORKER 1
```
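If you prefer a scripted sanity check over eyeballing the listing, you can count the head and worker pods by name. The snippet below runs on the sample output from above, so it needs no cluster; substitute real `kubectl get pods --no-headers` output in practice.

```shell
# Count head and worker pods in a captured `kubectl get pods` listing.
# (Pod names here are the sample ones from above.)
LISTING="pathways-cluster-pathways-head-0-0-zzmn2 2/2 Running 0 3m49s
pathways-cluster-worker-0-0-bdzq4 1/1 Running 0 3m36s
pathways-cluster-worker-1-0-km2rf 1/1 Running 0 3m36s"

HEADS=$(echo "$LISTING" | grep -c 'pathways-head')
WORKERS=$(echo "$LISTING" | grep -c 'worker-')
echo "heads=$HEADS workers=$WORKERS"
```

A healthy deployment shows exactly one head pod and one worker pod per slice.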

#### Option 2: Check the status of the specific pods that belong to your Pathways Service

```shell
# e.g., pathways-cluster
$ JOBSET_NAME=<your-jobset-name> # same as in pw-service-example.yaml (see Step 2)

# e.g., pathways-cluster-pathways-head-0-0-zzmn2
$ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep head)

# e.g., pathways-cluster-worker-0-0-bdzq4
$ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-')
```
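The jsonpath query above returns a space-separated list of running pod names; the `sed`/`grep` tail then picks out a single name. You can verify that text-processing step locally on sample output, with no cluster needed:

```shell
# Simulate the space-separated name list that the jsonpath query returns
# (sample pod names from the listing above).
NAMES="pathways-cluster-pathways-head-0-0-zzmn2 pathways-cluster-worker-0-0-bdzq4 pathways-cluster-worker-1-0-km2rf"

# Same post-processing as above: one name per line, then filter.
HEAD_POD_NAME=$(echo "$NAMES" | sed 's/ /\n/g' | grep head)
WORKER0_POD_NAME=$(echo "$NAMES" | sed 's/ /\n/g' | grep 'worker-0-0-')
echo "$HEAD_POD_NAME"
echo "$WORKER0_POD_NAME"
```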

#### Option 3: Check project logs

Find the detailed instructions
<a href="https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways#health_monitoring" target="_blank">here</a>.

<a name="find-pw-service"></a>
### 4. Find the Pathways service address

Find the address of the Pathways service in the logs. The command below checks the worker pod logs.

```shell
$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"

I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'
```
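To capture just the address into a variable (for use as the `pathways_service` value later), you can strip it out of that log line. This is a convenience sketch that assumes the `argv` log format shown above; adjust the `sed` pattern if your log format differs.

```shell
# Parse the address out of a captured log line (format as in the sample above).
LOG_LINE="I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'"

PATHWAYS_SERVICE=$(echo "$LOG_LINE" | sed -n "s/.*--resource_manager_address=\([^']*\).*/\1/p")
echo "$PATHWAYS_SERVICE"
```

In practice, pipe the `kubectl logs ... | grep` output from above into the same `sed` expression instead of using a hard-coded `LOG_LINE`.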

## Instructions

### 1. Clone `pathwaysutils` and install the requirements

Clone the repository and install the
[listed requirements](https://github.com/AI-Hypercomputer/pathways-utils/blob/main/requirements.txt):

```shell
git clone https://github.com/AI-Hypercomputer/pathways-utils.git
pip install -r pathways-utils/requirements.txt
```

### 2. Use the `isc_pathways` Context Manager

In your script,

1. Import `isc_pathways`.
2. Add a `with isc_pathways.connect(...)` statement. The function takes the following values:
   - Cluster name
   - Project name
   - Region
   - GCS bucket name
   - Pathways service address (see [Step 4](#4-find-the-pathways-service-address) for how to find the RM address)
<a name="ml-code"></a>
3. Write your ML code under this context manager (the `with` block) to run your JAX code on the underlying TPUs.

See [run_connect_example.py](run_connect_example.py) for reference. Example code:

```shell
python3 pathwaysutils/experimental/shared_pathways_service/run_connect_example.py \
  --cluster="my-cluster" \
  --project="my-project" \
  --region="cluster-region" \
  --gcs_bucket="gs://user-bucket" \
  --pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001" \
  --tpu_type="tpuv6e:2x2" \
  --tpu_count=1 # number of slices
```

The `connect` block deploys a proxy pod dedicated to your client and connects
your local runtime environment to the proxy pod via port-forwarding.

4. You can start another client that uses the same `pathways_service` (similar to [Step 3](#ml-code)). If the Shared Pathways
   Service finds available TPU(s) that match your request, your workload starts running on those resources.
   If all TPUs are occupied, expect your script to block until TPUs become available again.

## Troubleshooting
- Refer to [this guide](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/troubleshooting-pathways)
  if your Pathways pods do not come up.

- Known issue: the service's cleanup process is not fully graceful yet. If you see a
  `Segmentation fault` error after your ML job completes, you can safely ignore it.
**pathwaysutils/sidecar/python/requirements.txt** (1 addition: `portpicker`)

```
jax>=0.5.1
tensorflow-datasets
tiktoken
grain-nightly>=0.0.1
portpicker
```