Scalable orchestration for distributed ETL, ML training, hyperparameter tuning, and ML inference workloads
A PoC demonstrating distributed workload execution with Ray as the primary compute framework and Prefect for workflow orchestration, supporting cloud-native (Kubernetes) deployments.
NOTE: This PoC focuses on local development (Docker Compose) and production-like testing (Kubernetes on Kind). For actual production deployments, managed Kubernetes services (AWS EKS, GCP GKE, Azure AKS) with Ray are recommended for their auto-scaling, reliability and operational simplicity. See Deployment Comparison for details.
Choose your deployment environment based on your needs:
Docker Compose (local development)

Prerequisites: Docker + Docker Compose. Consider using the existing dev container for local setup.

Direct Ray execution:

```bash
# 1. Start distributed cluster
make compose-start
# 2. Run workloads directly on Ray
make compose-etl-ray # Distributed ETL
make compose-train-ray # Distributed ML training
make compose-tune-ray # Distributed hyperparameter tuning
make compose-serve-start-ray # Inference serving
make test-inference-api # Test inference service API
# 3. View dashboards
make open-ray # Ray Dashboard (http://localhost:8265)
make open-mlflow # MLflow UI (http://localhost:5000)
# 4. Stop and cleanup
make compose-clean
```
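Under the hood, the `compose-*-ray` targets submit work to the Ray head's job API. The sketch below shows the same kind of submission done from Python with the Ray Jobs SDK, assuming the head's job server is reachable at localhost:8265 (the dashboard address above); the entrypoint script and working directory are illustrative placeholders, not the repo's actual modules.

```python
# Minimal sketch: submit a workload to the compose cluster via the Ray Jobs SDK.
# The entrypoint and working_dir below are illustrative placeholders.
import time
from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://localhost:8265")  # Ray head job server
job_id = client.submit_job(
    entrypoint="python etl_job.py",          # hypothetical ETL entrypoint
    runtime_env={"working_dir": "./jobs"},   # ship local code to the cluster
)

# Wait for a terminal state, then print the job's driver logs.
while client.get_job_status(job_id) not in {
    JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED
}:
    time.sleep(5)
print(client.get_job_logs(job_id))
```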

With Prefect orchestration:

```bash
# 1. Start distributed cluster (includes Prefect)
make compose-start
# 2. Run orchestrated ML pipeline
make compose-run-pipeline-prefect # Full pipeline: ETL → Tune → Train
make compose-run-etl-prefect # Distributed ETL only
make compose-deploy-model-prefect # Deploy model / Inference serving
make test-inference-api # Test inference service API
# 3. View dashboards
make open-prefect # Prefect UI (http://localhost:4200)
make open-ray # Ray Dashboard (http://localhost:8265)
make open-mlflow # MLflow UI (http://localhost:5000)
# 4. Schedule workflows (optional)
make compose-deploy-schedules-prefect # Deploy daily/hourly schedules
# 5. Stop and cleanup
make compose-clean
```
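The `compose-run-pipeline-prefect` target drives the ETL → Tune → Train sequence through Prefect, which adds retries, scheduling, and run observability on top of Ray. A rough sketch of that pattern is shown below; it is not the repo's actual flow, and the script names, task settings, and head address are assumptions.

```python
# Rough shape of a Prefect-orchestrated pipeline where each step is handed to
# Ray via the `ray job submit` CLI. Script names and the address are placeholders.
import subprocess
from prefect import flow, task

RAY_ADDRESS = "http://localhost:8265"  # assumed Ray head address in compose

@task(retries=2, retry_delay_seconds=30)
def run_on_ray(script: str) -> None:
    # Each Prefect task delegates the heavy lifting to the Ray cluster.
    subprocess.run(
        ["ray", "job", "submit", "--address", RAY_ADDRESS,
         "--working-dir", ".", "--", "python", script],
        check=True,
    )

@flow(name="ml-pipeline")
def ml_pipeline() -> None:
    run_on_ray("etl_job.py")    # distributed ETL
    run_on_ray("tune_job.py")   # hyperparameter tuning
    run_on_ray("train_job.py")  # final training run

if __name__ == "__main__":
    ml_pipeline()
```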

Kubernetes on Kind (production-like testing)

Prerequisites: Docker + Kind cluster. Consider using the existing dev container for local setup.

Direct Ray execution:

```bash
# 1. Deploy complete ML stack to Kind cluster
make k8s-deploy
# 2. Port-forward dashboards in separate terminal (required for access)
make k8s-forward
# 3. Run workloads directly on Ray cluster
make k8s-etl-ray # Distributed ETL
make k8s-train-ray # Distributed ML training
make k8s-tune-ray # Distributed hyperparameter tuning
make k8s-serve-start-ray # Inference serving
make test-inference-api # Test inference service API
# 4. View dashboards
make open-ray # Ray Dashboard (http://localhost:8265)
make open-mlflow # MLflow UI (http://localhost:5000)
# 5. Cleanup
make k8s-clean
```
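`make test-inference-api` exercises the deployed Ray Serve endpoint over HTTP. A minimal sketch of such a check, assuming the service is reachable at localhost:8000 (Ray Serve's default HTTP port); the route and payload shape are illustrative, not the repo's actual request schema.

```python
# Minimal sketch of an inference API check: POST a JSON payload to the Ray Serve
# endpoint and print the prediction. Port, route, and payload are assumptions.
import requests

response = requests.post(
    "http://localhost:8000/",                 # assumed Ray Serve HTTP endpoint
    json={"features": [5.1, 3.5, 1.4, 0.2]},  # illustrative feature vector
    timeout=10,
)
response.raise_for_status()
print(response.json())
```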

With Prefect orchestration:

```bash
# 1. Deploy complete ML stack to Kind cluster
make k8s-deploy
# 2. Port-forward dashboards in separate terminal (required for access)
make k8s-forward
# 3. Run orchestrated ML pipeline
make k8s-run-pipeline-prefect # Full pipeline: ETL → Tune → Train
make k8s-run-etl-prefect # Distributed ETL only
make k8s-deploy-model-prefect # Deploy model / Inference serving
make test-inference-api # Test inference service API
# 4. View dashboards
make open-prefect # Prefect UI (http://localhost:4200)
make open-ray # Ray Dashboard (http://localhost:8265)
make open-mlflow # MLflow UI (http://localhost:5000)
# 5. Schedule workflows (optional)
make k8s-deploy-schedules-prefect # Deploy daily/hourly schedules
# 6. Cleanup
make k8s-clean
```
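The `*-deploy-schedules-prefect` targets register scheduled deployments with the Prefect server. The general Prefect mechanism looks roughly like the sketch below; the flow name and cron expression are illustrative, not the repo's actual schedule definitions.

```python
# General pattern for a scheduled Prefect deployment; the names and cron string
# are placeholders, not the repo's configuration.
from prefect import flow

@flow(name="nightly-etl")
def nightly_etl() -> None:
    ...  # e.g. submit the ETL job to Ray, as in the pipeline sketch above

if __name__ == "__main__":
    # Registers the deployment with the Prefect server and keeps polling for
    # scheduled runs (daily at 02:00 in this example).
    nightly_etl.serve(name="nightly-etl-daily", cron="0 2 * * *")
```

The diagram below shows how the orchestration, compute, and tracking layers fit together: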
```text
┌─────────────────────────────────────────────────────────────┐
│                    Workflow Orchestration                   │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                    Prefect Server                     │  │
│  │  • Workflow scheduling (cron, intervals, events)      │  │
│  │  • DAG management & dependencies                      │  │
│  │  • Retry logic & error handling                       │  │
│  │  • Task monitoring & observability                    │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                               ↓
                    (submits jobs via CLI)
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                 Distributed Compute Engine                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │             Ray Cluster (Head + Workers)              │  │
│  │  • Job submission & scheduling                        │  │
│  │  • Distributed execution (ETL, Train, Tune)           │  │
│  │  • Resource management (CPU/GPU allocation)           │  │
│  │  • Model serving (Ray Serve)                          │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                               ↓
                  (logs metrics & artifacts)
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                Experiment Tracking & Storage                │
│  ┌────────────────────┐  ┌──────────────────────────────┐   │
│  │   MLflow Server    │  │       S3 / LocalStack        │   │
│  │  • Run tracking    │  │  • Model checkpoints         │   │
│  │  • Metrics logging │  │  • Training artifacts        │   │
│  │  • Model registry  │  │  • ETL results               │   │
│  └────────────────────┘  └──────────────────────────────┘   │
│  ┌────────────────────┐                                     │
│  │     PostgreSQL     │                                     │
│  │  • MLflow metadata │                                     │
│  └────────────────────┘                                     │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                     Deployment Targets                      │
│       ┌──────────────┐        ┌─────────────┐               │
│       │    Docker    │        │ Kubernetes  │               │
│       │   Compose    │        │ (Kind/EKS/  │               │
│       │ (Local Dev)  │        │  GKE/AKS)   │               │
│       └──────────────┘        └─────────────┘               │
└─────────────────────────────────────────────────────────────┘
```
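The "logs metrics & artifacts" step above is plain MLflow client usage from inside the Ray workloads. A minimal sketch, assuming the stack's MLflow server at localhost:5000; the experiment name, parameters, and metric values are illustrative placeholders.

```python
# Minimal sketch of a workload reporting to the stack's MLflow server.
# Experiment name, params, and metric values are illustrative placeholders.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # MLflow server from this stack
mlflow.set_experiment("demo-training")            # hypothetical experiment name

with mlflow.start_run(run_name="example-run"):
    mlflow.log_params({"lr": 1e-3, "batch_size": 64})
    for epoch, loss in enumerate([0.9, 0.6, 0.4]):  # stand-in training loop
        mlflow.log_metric("loss", loss, step=epoch)
    mlflow.log_dict({"classes": ["a", "b"]}, "label_map.json")  # small artifact
```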
| Environment | Use Case | Command | Production Ready |
|---|---|---|---|
| Docker Compose | Local dev, testing | `make compose-start` | ❌ No |
| Kubernetes (Kind) | Production-like testing | `make k8s-deploy` | |
| Managed Kubernetes | Production workloads | See Deployment Comparison | ✅ Recommended |
For production workloads, use managed Kubernetes services with the KubeRay Operator:
- AWS: Amazon EKS
- GCP: Google GKE
- Azure: Azure AKS
Advantages:
- Auto-scaling (nodes + pods)
- Self-healing & high availability
- Managed control plane
- Production-grade observability
See Deployment Comparison for detailed guidance.
The following deployment patterns are not covered in this PoC but are valid production options:
Ray on SLURM (HPC clusters):
- Use case: Maximum throughput for large-scale LLM pretraining
- Pattern: Ray's `symmetric-run` command on SLURM
- Network: InfiniBand for multi-node GPU training
- Reference: See `scripts/ray_on_slurm.sh` (example only)
Ray on VMs:
- Use case: Fixed-size workloads, maximum flexibility
- Pattern: Ray on VMs provisioned with Terraform/Ansible/Puppet/Chef
- Tradeoff: Manual scaling vs. Kubernetes auto-scaling
For more details, see:
```text
Usage: make [target]
Common targets:
open-ray Open Ray Dashboard
open-mlflow Open MLflow Dashboard
open-prefect Open Prefect Dashboard
test-inference-api Test inference service API
Docker Compose targets:
compose-start Start all services
compose-stop Stop all services
compose-rebuild Rebuild all images
compose-logs Show logs
compose-clean Stop and remove everything
compose-etl-ray Run ETL (dashboard logs)
compose-train-ray Run PyTorch training (dashboard logs)
compose-tune-ray Run hyperparameter tuning (dashboard logs)
compose-serve-start-ray Deploy inference service
compose-serve-stop-ray Stop inference service
compose-run-pipeline-prefect Run ML training pipeline (Prefect)
compose-deploy-model-prefect Deploy model (Prefect)
compose-run-etl-prefect Run ETL only (Prefect)
compose-deploy-schedules-prefect Deploy Prefect schedules
Kubernetes targets:
k8s-deploy Deploy to Kind cluster
k8s-clean Cleanup Kind cluster
k8s-forward Port-forward dashboards
k8s-etl-ray Run ETL on K8s
k8s-train-ray Run PyTorch training on K8s
k8s-tune-ray Run hyperparameter tuning on K8s
k8s-serve-start-ray Deploy inference service on K8s
k8s-serve-stop-ray Stop inference service on K8s
k8s-run-pipeline-prefect Run ML pipeline via Prefect on K8s
k8s-deploy-model-prefect Deploy model via Prefect on K8s
k8s-run-etl-prefect Run ETL only via Prefect on K8s
k8s-deploy-schedules-prefect Deploy Prefect schedules on K8s
```