[DECOUPLED-MODE] Adding Decoupling Logic #2865

gulsumgudukbay · 2025-12-21T06:49:51Z

Description

This PR is the second part of the decoupling support. It adds logic for decoupling support, along with some test modifications for decoupling to be enabled.

Details:

Update decoupled_base_test.yml
Add decoupling locig to src/MaxText/decode.py, src/MaxText/elastic_train.py, src/MaxText/experimental/rl/grpo_trainer.py, src/MaxText/gcp_workload_monitor.py, src/MaxText/max_utils.py, src/MaxText/maxengine.py, src/MaxText/maxengine_config.py, src/MaxText/maxengine_server.py, src/MaxText/metric_logger.py, src/MaxText/prefill_packing.py, src/MaxText/profiler.py, src/MaxText/sft/hooks.py, src/MaxText/sft/sft_trainer.py, src/MaxText/train.py, src/MaxText/utils/gcs_utils.py, src/MaxText/utils/goodput_utils.py, src/MaxText/vertex_tensorboard.py
Update src/MaxText/gcloud_stub.py to add IS_STUB variables, and add google_cloud_mldiagnostics stub
Update tests to support decoupled mode (add markers, update file paths, make them use decoupled_base_test.yml config file).

Tests

All unit tests pass in decoupled mode.
UT results:
== 306 passed, 170 skipped, 25 deselected, 6588 warnings in 975.16s (0:16:15) ==

Train test:
python -m MaxText.train MaxText/configs/base.yml run_name=test hardware=gpu steps=5 model_name=llama2-7b attention=cudnn_flash_te enable_checkpointing=False ici_expert_parallelism=1 ici_fsdp_parallelism=-1 ici_data_parallelism=1 remat_policy=minimal scan_layers=True dataset_type=synthetic logits_dot_in_fp32=False dtype=bfloat16 weight_dtype=bfloat16 per_device_batch_size=1 max_target_length=2048 shardy=False

works.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

(cherry picked from commit e8cc951)

(cherry picked from commit 0b58e96)

(cherry picked from commit 14f0508)

…ts library (cherry picked from commit 6f0b361)

(cherry picked from commit e43e370)

(cherry picked from commit 1c14d6c)

…ck, todo: remove this after updating jax. Configure ICI data parallelism for decoupled mode

tests/test_env_smoke.py

tests/sft_hooks_test.py

src/MaxText/sft/sft_trainer.py

src/MaxText/sft/hooks.py

src/MaxText/experimental/rl/grpo_trainer.py

…re marks are used

SurbhiJainUSC · 2026-01-08T17:09:07Z

__init__.py

@@ -0,0 +1,14 @@
+"""Top-level shim for importing test_utils


License header please.

SurbhiJainUSC · 2026-01-08T17:13:44Z

__init__.py

@@ -0,0 +1,14 @@
+"""Top-level shim for importing test_utils
+
+This shim lets test modules import `maxtext.tests`.


Can this __init__.py be moved inside /tests, since it contains the logic relevant for testing only?

SurbhiJainUSC · 2026-01-08T17:22:44Z

tests/train_smoke_test.py


  def test_tiny_config(self):
    test_tmpdir = os.environ.get("TEST_TMPDIR")  # pylint: disable=unused-variable
+    dataset_path = get_test_dataset_path()


This can be moved to a setUp() method, and it will be called automatically before each individual test methods.

SurbhiJainUSC · 2026-01-08T17:25:55Z

tests/train_int8_smoke_test.py

+    decoupled = is_decoupled()
+    dataset_path = get_test_dataset_path()
+    base_output_directory = (
+        os.environ.get("LOCAL_BASE_OUTPUT", get_test_base_output_directory())


Why don't we use this if logic in train_smoke_test.py?

SurbhiJainUSC · 2026-01-08T17:28:36Z

tests/train_compile_test.py

 from MaxText.globals import MAXTEXT_PKG_DIR
+from maxtext.tests.test_utils import get_test_config_path
+
+pytestmark = [pytest.mark.tpu_only]


These tests are suppose to run on CPUs. Why are we adding tpu_only marker?

SurbhiJainUSC · 2026-01-08T17:30:36Z

tests/tfds_data_processing_test.py

 from MaxText.input_pipeline import input_pipeline_interface
+from maxtext.tests.test_utils import get_test_config_path, get_test_dataset_path, get_test_base_output_directory
+
+MAXTEXT_ASSETS_ROOT = os.path.join("src", MAXTEXT_PKG_DIR, "assets")


MAXTEXT_ASSETS_ROOT is already initialized here: https://github.com/ROCm/maxtext/blob/4799cef7af51bb45272390e9b66bbae58790c0bd/src/MaxText/globals.py#L31

SurbhiJainUSC · 2026-01-08T17:32:55Z

tests/test_env_smoke.py

+_defects: list[str] = []
+
+
+def _import(name: str):


This method is not called anywhere, I guess?

gulsumgudukbay and others added 25 commits December 21, 2025 06:16

adding necessary files

2beebcf

add decoupling logic config patch to tests

f5755bb

add correct ICI parallelism to tests for decoupled mode

ef9c62e

Add tpu_only marker to train compile test

b6be265

fixing more UTs

583eee9

adding decoupling logic, biggest change

91e51d5

add tensorboardX stub

4647288

adding tokamax changes along with some UT fixes

4e7afd6

fixing little UT issues:

ebfef26

fixing train_tests

9bf8105

removing CI workflows for now to upstream decoupling changes

4d6d2b0

making jetstream and tunix optional and add is_stub variables

ad66362

removing tunix from decoupling logic

0e97a31

(cherry picked from commit e8cc951)

removing tunix from decoupled mode logic

227b980

addressing PR comments

06365cb

(cherry picked from commit 0b58e96)

fixing pylint issues

fe14ece

(cherry picked from commit 14f0508)

renaming datasets to local_datasets to avoid confusion with HF datase…

58b135b

…ts library (cherry picked from commit 6f0b361)

pyink fixes

c6d31bd

(cherry picked from commit e43e370)

Rename GCE_MARKERS to GCP_MARKERS

a9bfd32

(cherry picked from commit 1c14d6c)

updating dataset paths and fix gcloud_Stub

b9c7dbf

making decoupled mode work with upstream updates

7476b28

updates for upstream sync UTs

c8d7898

remove context_parallel_strategy config param, todo: add it back later

71f862a

make jax_remove_size_one_mesh_axis_from_type param setting in try blo…

8c27a9e

…ck, todo: remove this after updating jax. Configure ICI data parallelism for decoupled mode

Fix decoupled rampup and mesh configs for bare-metal tests

c5dbfdc

gulsumgudukbay requested review from bvandermoon, gobbleturk, jacoguzo, richjames0 and shralex as code owners December 21, 2025 06:49