
Conversation

@Prayer3th

Description

Support Qwen3-Next with the Muon optimizer.
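
At a high level, enabling Muon means applying it to matrix-shaped weights while vectors, biases, and embeddings stay on AdamW. Below is a minimal optax sketch of that split, assuming optax's contrib Muon implementation is available; the learning rates and the shape-only labeling rule are illustrative, and the PR itself does the routing through MuonDimensionNumbers instead, as the snippets further down show.

import jax
import optax

def label_params(params):
  # Label matrix-shaped weights for Muon; everything else gets AdamW.
  # A real integration (like this PR) also excludes 2D embeddings by name.
  return jax.tree_util.tree_map(
      lambda p: "muon" if p.ndim == 2 else "adamw", params
  )

# multi_transform routes each parameter to the optimizer matching its label.
optimizer = optax.multi_transform(
    {
        "muon": optax.contrib.muon(learning_rate=0.02),  # assumed API and LR
        "adamw": optax.adamw(learning_rate=3e-4),  # illustrative LR
    },
    label_params,
)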

Tests

tests/muon_test.py
[screenshot of test results]

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@google-cla

google-cla bot commented Dec 23, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Collaborator
@RissyRan left a comment

LGTM! A few minor comments. Thank you for the change!

model_name_arg = sys.argv[1]
scan_layers_arg = sys.argv[2].lower() == "true"
get_model_mdn(model_name_arg, scan_layers_arg, verbose=True)
get_model_mdn(model_name_arg, scan_layers_arg, verbose=True)
Collaborator

nit: extra line, similar for other files.
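
For reference, the de-duplicated tail of the script would read as follows (a sketch: the __main__ guard and sys import are assumed, and get_model_mdn is the function from the diff above):

import sys

if __name__ == "__main__":
  model_name_arg = sys.argv[1]
  scan_layers_arg = sys.argv[2].lower() == "true"
  get_model_mdn(model_name_arg, scan_layers_arg, verbose=True)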

An instance of `MuonDimensionNumbers` if a specific mapping is found,
`None` for excluded parameters, or a default `mdn` for standard weights.
1. Exclusions: Skip vectors/biases/embeddings (AdamW).
2. MoE: Handle both DeepSeek style (MoeBlock_0) and Qwen3-Next style (routed_experts).
Collaborator

Nit: quote the checkpoint names, i.e.: MoE: Handle both DeepSeek style ("MoeBlock_0") and Qwen3-Next style ("routed_experts").

return any(x in path for x in tuples)


def transform_logic(path: Tuple[str, ...]) -> Optional[mdn]:
Collaborator

@shuningjin Could we have a refactor PR as a follow-up to pass in the model name? I have seen checkpoint name divergence; it would be better to transform weights based on the model rather than the checkpoint path.
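
A minimal sketch of the suggested refactor, keying the MoE block name off the model name instead of probing the checkpoint path. Every name below is an illustrative stand-in rather than the PR's actual code; mdn mirrors the PR's alias for MuonDimensionNumbers.

from typing import NamedTuple, Optional, Tuple

class mdn(NamedTuple):  # stand-in for the PR's MuonDimensionNumbers alias
  reduction_axes: Tuple[int, ...] = (-2,)
  output_axes: Tuple[int, ...] = (-1,)

EXCLUDED_KEYS = ("bias", "embedding", "scale")  # these stay on AdamW

MOE_BLOCK_BY_MODEL = {  # illustrative model-name -> checkpoint-name map
    "deepseek3": "MoeBlock_0",
    "qwen3-next": "routed_experts",
}

def transform_logic(path: Tuple[str, ...], model_name: str) -> Optional[mdn]:
  # 1. Exclusions: vectors/biases/embeddings fall back to AdamW.
  if any(key in path for key in EXCLUDED_KEYS):
    return None
  # 2. MoE: look up the block name for this model instead of testing
  #    every known checkpoint naming style against the path.
  moe_block = MOE_BLOCK_BY_MODEL.get(model_name, "")
  if moe_block and moe_block in path:
    return mdn(reduction_axes=(-2,), output_axes=(-1,))
  # 3. Default: standard 2D weights use the default dimension numbers.
  return mdn()

With this shape, transform_logic(("decoder", "routed_experts", "wi_0"), "qwen3-next") resolves the Qwen3-Next expert mapping directly from the model name.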

Collaborator
@RissyRan left a comment

Could you post a screenshot or log of training Qwen3 with the Muon optimizer end-to-end?
