Scalable orchestration for distributed ETL, ML training, hyperparameter tuning, and ML inference workloads
A PoC demonstrating distributed workload execution with Ray as the primary compute framework and Prefect for workflow orchestration, supporting cloud-native (Kubernetes) deployments.
NOTE: This PoC focuses on local development (Docker Compose) and production-like testing (Kubernetes on Kind). For actual production deployments, managed Kubernetes services (AWS EKS, GCP GKE, Azure AKS) with Ray are recommended for their auto-scaling, reliability and operational simplicity. See Deployment Comparison for details.
Choose your deployment environment based on your needs:
Docker Compose (local development)

Prerequisites: Docker + Docker Compose. Consider using the existing dev container for local setup.

Direct Ray execution:

```bash
# 1. Start distributed cluster
make compose-start
# 2. Run workloads directly on Ray
make compose-etl-ray # Distributed ETL
make compose-train-ray # Distributed ML training
make compose-tune-ray # Distributed hyperparameter tuning
make compose-serve-start-ray # Inference serving
make test-inference-api # Test inference service API
# 3. View dashboards
make open-ray # Ray Dashboard (http://localhost:8265)
make open-mlflow # MLflow UI (http://localhost:5000)
# 4. Stop and cleanup
make compose-clean
```
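Under the hood, the `compose-*-ray` targets submit work to the Ray head's job API. The sketch below shows the same kind of submission done from Python with the Ray Jobs SDK, assuming the head's job server is reachable at localhost:8265 (the dashboard address above); the entrypoint script and working directory are illustrative placeholders, not the repo's actual modules.

```python
# Minimal sketch: submit a workload to the compose cluster via the Ray Jobs SDK.
# The entrypoint and working_dir below are illustrative placeholders.
import time
from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://localhost:8265")  # Ray head job server
job_id = client.submit_job(
    entrypoint="python etl_job.py",          # hypothetical ETL entrypoint
    runtime_env={"working_dir": "./jobs"},   # ship local code to the cluster
)

# Wait for a terminal state, then print the job's driver logs.
while client.get_job_status(job_id) not in {
    JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED
}:
    time.sleep(5)
print(client.get_job_logs(job_id))
```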

With Prefect orchestration:

```bash
# 1. Start distributed cluster (includes Prefect)
make compose-start
# 2. Run orchestrated ML pipeline
make compose-run-pipeline-prefect # Full pipeline: ETL → Tune → Train
make compose-run-etl-prefect # Distributed ETL only
make compose-deploy-model-prefect # Deploy model / Inference serving
make test-inference-api # Test inference service API
# 3. View dashboards
make open-prefect # Prefect UI (http://localhost:4200)
make open-ray # Ray Dashboard (http://localhost:8265)
make open-mlflow # MLflow UI (http://localhost:5000)
# 4. Schedule workflows (optional)
make compose-deploy-schedules-prefect # Deploy daily/hourly schedules
# 5. Stop and cleanup
make compose-clean
```
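The `compose-run-pipeline-prefect` target drives the ETL → Tune → Train sequence through Prefect, which adds retries, scheduling, and run observability on top of Ray. A rough sketch of that pattern is shown below; it is not the repo's actual flow, and the script names, task settings, and head address are assumptions.

```python
# Rough shape of a Prefect-orchestrated pipeline where each step is handed to
# Ray via the `ray job submit` CLI. Script names and the address are placeholders.
import subprocess
from prefect import flow, task

RAY_ADDRESS = "http://localhost:8265"  # assumed Ray head address in compose

@task(retries=2, retry_delay_seconds=30)
def run_on_ray(script: str) -> None:
    # Each Prefect task delegates the heavy lifting to the Ray cluster.
    subprocess.run(
        ["ray", "job", "submit", "--address", RAY_ADDRESS,
         "--working-dir", ".", "--", "python", script],
        check=True,
    )

@flow(name="ml-pipeline")
def ml_pipeline() -> None:
    run_on_ray("etl_job.py")    # distributed ETL
    run_on_ray("tune_job.py")   # hyperparameter tuning
    run_on_ray("train_job.py")  # final training run

if __name__ == "__main__":
    ml_pipeline()
```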

Kubernetes on Kind (production-like testing)

Prerequisites: Docker + Kind cluster. Consider using the existing dev container for local setup.

Direct Ray execution:

```bash
# 1. Deploy complete ML stack to Kind cluster
make k8s-deploy
# 2. Port-forward dashboards in separate terminal (required for access)
make k8s-forward
# 3. Run workloads directly on Ray cluster
make k8s-etl-ray # Distributed ETL
make k8s-train-ray # Distributed ML training
make k8s-tune-ray # Distributed hyperparameter tuning
make k8s-serve-start-ray # Inference serving
make test-inference-api # Test inference service API
# 4. View dashboards
make open-ray # Ray Dashboard (http://localhost:8265)
make open-mlflow # MLflow UI (http://localhost:5000)
# 5. Cleanup
make k8s-clean
```
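`make test-inference-api` exercises the deployed Ray Serve endpoint over HTTP. A minimal sketch of such a check, assuming the service is reachable at localhost:8000 (Ray Serve's default HTTP port); the route and payload shape are illustrative, not the repo's actual request schema.

```python
# Minimal sketch of an inference API check: POST a JSON payload to the Ray Serve
# endpoint and print the prediction. Port, route, and payload are assumptions.
import requests

response = requests.post(
    "http://localhost:8000/",                 # assumed Ray Serve HTTP endpoint
    json={"features": [5.1, 3.5, 1.4, 0.2]},  # illustrative feature vector
    timeout=10,
)
response.raise_for_status()
print(response.json())
```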

With Prefect orchestration:

```bash
# 1. Deploy complete ML stack to Kind cluster
make k8s-deploy
# 2. Port-forward dashboards in separate terminal (required for access)
make k8s-forward
# 3. Run orchestrated ML pipeline
make k8s-run-pipeline-prefect # Full pipeline: ETL → Tune → Train
make k8s-run-etl-prefect # Distributed ETL only
make k8s-deploy-model-prefect # Deploy model / Inference serving
make test-inference-api # Test inference service API
# 4. View dashboards
make open-prefect # Prefect UI (http://localhost:4200)
make open-ray # Ray Dashboard (http://localhost:8265)
make open-mlflow # MLflow UI (http://localhost:5000)
# 5. Schedule workflows (optional)
make k8s-deploy-schedules-prefect # Deploy daily/hourly schedules
# 6. Cleanup
make k8s-clean
```
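The `*-deploy-schedules-prefect` targets register scheduled deployments with the Prefect server. The general Prefect mechanism looks roughly like the sketch below; the flow name and cron expression are illustrative, not the repo's actual schedule definitions.

```python
# General pattern for a scheduled Prefect deployment; the names and cron string
# are placeholders, not the repo's configuration.
from prefect import flow

@flow(name="nightly-etl")
def nightly_etl() -> None:
    ...  # e.g. submit the ETL job to Ray, as in the pipeline sketch above

if __name__ == "__main__":
    # Registers the deployment with the Prefect server and keeps polling for
    # scheduled runs (daily at 02:00 in this example).
    nightly_etl.serve(name="nightly-etl-daily", cron="0 2 * * *")
```

The diagram below shows how the orchestration, compute, and tracking layers fit together: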
```text
┌─────────────────────────────────────────────────────────────┐
│                    Workflow Orchestration                   │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                    Prefect Server                     │  │
│  │  • Workflow scheduling (cron, intervals, events)      │  │
│  │  • DAG management & dependencies                      │  │
│  │  • Retry logic & error handling                       │  │
│  │  • Task monitoring & observability                    │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                               ↓
                    (submits jobs via CLI)
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                 Distributed Compute Engine                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │             Ray Cluster (Head + Workers)              │  │
│  │  • Job submission & scheduling                        │  │
│  │  • Distributed execution (ETL, Train, Tune)           │  │
│  │  • Resource management (CPU/GPU allocation)           │  │
│  │  • Model serving (Ray Serve)                          │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                               ↓
                  (logs metrics & artifacts)
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                Experiment Tracking & Storage                │
│  ┌────────────────────┐  ┌──────────────────────────────┐   │
│  │   MLflow Server    │  │       S3 / LocalStack        │   │
│  │  • Run tracking    │  │  • Model checkpoints         │   │
│  │  • Metrics logging │  │  • Training artifacts        │   │
│  │  • Model registry  │  │  • ETL results               │   │
│  └────────────────────┘  └──────────────────────────────┘   │
│  ┌────────────────────┐                                     │
│  │     PostgreSQL     │                                     │
│  │  • MLflow metadata │                                     │
│  └────────────────────┘                                     │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                     Deployment Targets                      │
│       ┌──────────────┐        ┌─────────────┐               │
│       │    Docker    │        │ Kubernetes  │               │
│       │   Compose    │        │ (Kind/EKS/  │               │
│       │ (Local Dev)  │        │  GKE/AKS)   │               │
│       └──────────────┘        └─────────────┘               │
└─────────────────────────────────────────────────────────────┘
```
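The "logs metrics & artifacts" step above is plain MLflow client usage from inside the Ray workloads. A minimal sketch, assuming the stack's MLflow server at localhost:5000; the experiment name, parameters, and metric values are illustrative placeholders.

```python
# Minimal sketch of a workload reporting to the stack's MLflow server.
# Experiment name, params, and metric values are illustrative placeholders.
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")  # MLflow server from this stack
mlflow.set_experiment("demo-training")            # hypothetical experiment name

with mlflow.start_run(run_name="example-run"):
    mlflow.log_params({"lr": 1e-3, "batch_size": 64})
    for epoch, loss in enumerate([0.9, 0.6, 0.4]):  # stand-in training loop
        mlflow.log_metric("loss", loss, step=epoch)
    mlflow.log_dict({"classes": ["a", "b"]}, "label_map.json")  # small artifact
```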
| Environment | Use Case | Command | Production Ready |
|---|---|---|---|
| Docker Compose | Local dev, testing | `make compose-start` | ❌ No |
| Kubernetes (Kind) | Production-like testing | `make k8s-deploy` | |
| Managed Kubernetes | Production workloads | See Deployment Comparison | ✅ Recommended |
For production workloads, use managed Kubernetes services with the KubeRay Operator:
- AWS: Amazon EKS
- GCP: Google GKE
- Azure: Azure AKS
Advantages:
- Auto-scaling (nodes + pods)
- Self-healing & high availability
- Managed control plane
- Production-grade observability
See Deployment Comparison for detailed guidance.
The following deployment patterns are not covered in this PoC but are valid production options:
Ray on SLURM (HPC clusters):
- Use case: Maximum throughput for large-scale LLM pretraining
- Pattern: Ray's `symmetric-run` command on SLURM
- Network: InfiniBand for multi-node GPU training
- Reference: See `scripts/ray_on_slurm.sh` (example only)
Ray on VMs:
- Use case: Fixed-size workloads, maximum flexibility
- Pattern: Ray on VMs provisioned with Terraform/Ansible/Puppet/Chef
- Tradeoff: Manual scaling vs. Kubernetes auto-scaling
For more details, see:
```text
Usage: make [target]
Common targets:
open-ray Open Ray Dashboard
open-mlflow Open MLflow Dashboard
open-prefect Open Prefect Dashboard
test-inference-api Test inference service API
Docker Compose targets:
compose-start Start all services
compose-stop Stop all services
compose-rebuild Rebuild all images
compose-logs Show logs
compose-clean Stop and remove everything
compose-etl-ray Run ETL (dashboard logs)
compose-train-ray Run PyTorch training (dashboard logs)
compose-tune-ray Run hyperparameter tuning (dashboard logs)
compose-serve-start-ray Deploy inference service
compose-serve-stop-ray Stop inference service
compose-run-pipeline-prefect Run ML training pipeline (Prefect)
compose-deploy-model-prefect Deploy model (Prefect)
compose-run-etl-prefect Run ETL only (Prefect)
compose-deploy-schedules-prefect Deploy Prefect schedules
Kubernetes targets:
k8s-deploy Deploy to Kind cluster
k8s-clean Cleanup Kind cluster
k8s-forward Port-forward dashboards
k8s-etl-ray Run ETL on K8s
k8s-train-ray Run PyTorch training on K8s
k8s-tune-ray Run hyperparameter tuning on K8s
k8s-serve-start-ray Deploy inference service on K8s
k8s-serve-stop-ray Stop inference service on K8s
k8s-run-pipeline-prefect Run ML pipeline via Prefect on K8s
k8s-deploy-model-prefect Deploy model via Prefect on K8s
k8s-run-etl-prefect Run ETL only via Prefect on K8s
k8s-deploy-schedules-prefect Deploy Prefect schedules on K8s
```