Use uv for Megatron Setup #560

Open
FurtherAI wants to merge 1 commit into main from austin/megatron-pyproject-setup

@FurtherAI
Collaborator

  • Move everything possible from projects/art/src/art/megatron/setup.sh to uv. This makes it cleaner to set up a venv for dev.
  • Also fix a bug in Megatron training loop: release the packed tensors before clearing the directory so that it loops and shuts down cleanly.
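The ordering in the second bullet can be illustrated with a minimal sketch. It assumes the packed tensors are backed by files in the directory being cleared, which this description does not confirm; `run_step`, `packed.bin`, and the use of `mmap` are hypothetical stand-ins, not the code in this PR.

```python
import mmap
import shutil
import tempfile
from pathlib import Path


def run_step(work_dir: Path) -> bool:
    """Map a packed file, use it, release it, then clear the directory.

    Hypothetical stand-in for one training-loop iteration; the real
    packed tensors and their storage format are not shown in this PR.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    packed_path = work_dir / "packed.bin"
    packed_path.write_bytes(b"\x00" * 4096)

    handle = open(packed_path, "rb")
    packed = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        _ = packed[:16]  # stand-in for the work done on the packed data
    finally:
        # Release the mapping and the file handle first ...
        packed.close()
        handle.close()

    # ... and only then remove the directory that backs them.
    shutil.rmtree(work_dir)
    return not work_dir.exists()
```

Releasing in a `finally` block keeps the order the same even when the step raises, so the cleanup never runs against files that are still mapped.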

@FurtherAI FurtherAI force-pushed the austin/megatron-pyproject-setup branch from 345ccb9 to 1a802f6 Compare February 14, 2026 02:33
FurtherAI added a commit that referenced this pull request Feb 18, 2026
## Summary
I am trying to add the Megatron dependencies to uv. This makes CI expensive, so I am also trying to build a cached image that contains the Megatron-related dependencies and does the heavy lifting. A dedicated workflow can update the image whenever the Megatron dependencies are modified.

This PR adds that workflow to main so that #560 can test the workflow and CI.
@FurtherAI FurtherAI force-pushed the austin/megatron-pyproject-setup branch 4 times, most recently from 0078f51 to 05bf351 Compare February 19, 2026 07:12
- Replace prebuilt-image bootstrap with uv-cache release-asset workflow and fingerprint gating.
- Use compute/build helpers (`compute_uv_fingerprint.py`, `build_and_push_uv_cache.sh`) and document cache refresh in CONTRIBUTING.
- Harden containerized Prek workflow for deterministic execution (git installed before checkout, explicit safe.directory, POSIX-safe shell checks).
- Enforce fingerprint-only cache restore by removing moving `*-current.tar.zst` fallback from restore and publish paths.
- Keep Prek dependency scope at `--all-extras --group dev` to preserve megatron coverage in CI.
- Store full uv cache as fingerprinted chunked release assets (`.tar.zst.part-###`) so CI reuses wheel/build payloads without GH single-asset size limits.
- Download cache parts with parallelism 8 in CI restore path, then reassemble in deterministic order.
- Bump cache fingerprint schema/layout contract for the chunked-asset format.
- Keep immutable cache retention policy (latest 4 fingerprints).
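The chunked-asset scheme in the list above could look roughly like the following sketch. It is not the contents of `build_and_push_uv_cache.sh`; the function names and the chunk size are assumptions for illustration, and only the `.part-###` suffix comes from the commit message.

```python
import tempfile
from pathlib import Path


def split_into_parts(archive: Path, chunk_size: int) -> list[Path]:
    """Split an archive into files named <archive>.part-000, .part-001, ..."""
    parts: list[Path] = []
    with archive.open("rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            part = archive.with_name(f"{archive.name}.part-{index:03d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def reassemble(parts: list[Path], dest: Path) -> None:
    """Concatenate parts in name order, regardless of download order."""
    with dest.open("wb") as out:
        # Zero-padded suffixes sort lexicographically (up to 1000 parts).
        for part in sorted(parts, key=lambda p: p.name):
            out.write(part.read_bytes())
```

Sorting by the zero-padded suffix is what makes reassembly deterministic when the parts arrive out of order from parallel downloads.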
@FurtherAI FurtherAI force-pushed the austin/megatron-pyproject-setup branch from 05bf351 to 30126ec Compare February 19, 2026 08:35
@FurtherAI
Collaborator Author

The big problem with the initial version was that CI took forever and used a lot of memory, but that is mostly solved. After a number of iterations, the solution I arrived at is as follows:

  • Prebuild the uv cache and uv.lock, including the Megatron dependencies and builds, locally on a machine that has more resources than the CI runner (many cores and plenty of RAM). This takes 15-20 min.
  • Push this as an artifact, keyed with a hash of pyproject.toml, uv.lock, the base image, and the Python version.
  • During CI, warm the uv cache with the prebuilt version so that uv sync ... runs quickly and the pre-commit checks can run.
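As a rough illustration of the cache key described above, the sketch below hashes the two files together with the base image and the Python version. It is not the actual `compute_uv_fingerprint.py`; the function name, the choice of SHA-256, and the separator bytes are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def compute_fingerprint(files: list[Path], base_image: str, python_version: str) -> str:
    """Hash file contents plus environment identifiers into one hex key."""
    digest = hashlib.sha256()
    for path in files:
        # Include the file name and a separator so inputs cannot run together.
        digest.update(path.name.encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    digest.update(base_image.encode())
    digest.update(b"\0")
    digest.update(python_version.encode())
    return digest.hexdigest()
```

Any change to pyproject.toml, uv.lock, the base image, or the Python version yields a different key, so CI would restore a cache only when it matches the current inputs exactly.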

This does increase the total CI time from ~2 min to ~6 min (the extra time comes from installing the Docker container and the uv cache). The benefit is that it keeps our dependencies in line (we have already found an issue with conflicting required versions of numpy) and ensures our environments are reproducible.

@FurtherAI FurtherAI requested a review from bradhilton February 19, 2026 17:01