@mresvanis mresvanis commented Jan 15, 2026

This PR adds support for the Fabric Manager (FM) Shared NVSwitch virtualization model. The new flow applies when NVSwitch devices are detected and the newly introduced FABRIC_MANAGER_FABRIC_MODE env var is set to 1 (shared-nvswitch).

Nothing changes when FABRIC_MANAGER_FABRIC_MODE=0 (the default FM mode, full-passthrough): the current flow for detected NVSwitch devices is kept as is.
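
For illustration only, the mode selection described above could be sketched as follows. This is not the code from this PR; the function name `resolve_fabric_mode` is hypothetical, and rejecting values other than 0 and 1 is an assumption of the sketch, not something the PR description states.

```python
import os

# Mode values as named in the PR description.
FULL_PASSTHROUGH = 0  # default FM mode
SHARED_NVSWITCH = 1


def resolve_fabric_mode(env=None):
    """Return the fabric mode requested via FABRIC_MANAGER_FABRIC_MODE.

    Falls back to 0 (full-passthrough) when the variable is unset.
    """
    env = os.environ if env is None else env
    raw = env.get("FABRIC_MANAGER_FABRIC_MODE", "0").strip()
    if raw not in ("0", "1"):
        # Assumption of this sketch: unknown values are treated as an error.
        raise ValueError(f"unsupported FABRIC_MANAGER_FABRIC_MODE: {raw!r}")
    return int(raw)
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real environment.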

Relates to: NVIDIA/gpu-operator#2045

Changes

  • Add env var FABRIC_MANAGER_FABRIC_MODE to control fabric manager FABRIC_MODE (0, the default, selects full-passthrough; 1 selects shared-nvswitch).
  • Update the fabric manager config to use the shared-nvswitch fabric mode.
  • Configure UNIX socket communication instead of TCP.
  • Create GPU physical module ID to PCIe address mapping JSON file via nvidia-smi.
  • Do not start nvidia-persistenced since GPU devices should be bound to vfio-pci by the vfio-manager in the next step.

copy-pr-bot bot commented Jan 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@mresvanis force-pushed the fabric-manager-configuration branch 6 times, most recently from 9332a4c to 8dea1b9 on January 22, 2026 13:20
The changes include:

- add the `FABRIC_MANAGER_FABRIC_MODE` env var that configures FM with
  either full-passthrough (0) or shared-nvswitch (1) fabric mode. It
  defaults to 0.
- when fabric manager mode is set to 0, the flow is unchanged, i.e.
  the fabric manager daemon runs with its default configuration.
- when fabric manager mode is set to 1:
  - edit the fabric manager configuration file and set `FABRIC_MODE=1`.
  - persist the mapping of physical GPU module IDs to their PCIe
    addresses by creating a JSON file on disk (the physical GPU module
    IDs are available through nvidia-smi).
  - disable `nvidia-persistenced`, as the GPU devices should be
    unbound from the NVIDIA driver and bound to vfio-pci (a step
    executed by the vfio-manager).
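
The module ID mapping step above might look roughly like the following sketch. It parses report text that has already been captured, and it assumes a layout in which each GPU section starts with a `GPU <PCIe address>` header line and contains a `Module ID : N` line; that layout, the sample report, the function names and the JSON shape are all assumptions, not details taken from this commit.

```python
import json
import re

# Assumed layout: an unindented "GPU <domain:bus:device.function>" header
# per GPU, followed by indented "Key : Value" lines.
GPU_HEADER = re.compile(
    r"^GPU\s+([0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])\s*$"
)
MODULE_ID = re.compile(r"^\s*Module ID\s*:\s*(\d+)\s*$")

# Illustrative report text, not captured from real hardware.
SAMPLE_REPORT = """\
GPU 00000000:1B:00.0
    Product Name                          : Example GPU
    Module ID                             : 1
GPU 00000000:43:00.0
    Product Name                          : Example GPU
    Module ID                             : 2
"""


def module_id_to_pci(report_text):
    """Map each physical GPU module ID to the PCIe address of its section."""
    mapping = {}
    current_pci = None
    for line in report_text.splitlines():
        header = GPU_HEADER.match(line)
        if header:
            current_pci = header.group(1)
            continue
        module = MODULE_ID.match(line)
        if module and current_pci is not None:
            mapping[int(module.group(1))] = current_pci
    return mapping


def write_mapping(mapping, path):
    """Persist the mapping as JSON (keys become strings, as JSON requires)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({str(k): v for k, v in sorted(mapping.items())}, fh, indent=2)
        fh.write("\n")
```

Keeping parsing separate from the file write means the parser can be checked against saved report text on a machine without GPUs.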

Signed-off-by: Michail Resvanis <mresvani@redhat.com>
@mresvanis force-pushed the fabric-manager-configuration branch from 8dea1b9 to 078ef34 on January 22, 2026 13:59