-
Notifications
You must be signed in to change notification settings - Fork 77
Make IB more configurable #703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR enhances InfiniBand (IB) configurability by adding support for custom port selection, GID index specification, and custom libibverbs library loading. The changes enable more flexible deployment in heterogeneous IB environments where different ports or GID indices may be required.
Key changes:
- Added
port,gidIndex, anddeviceIndexfields toEndpointConfig::Ibstructure - Introduced
MSCCLPP_IBV_SOenvironment variable for custom libibverbs library paths - Added
--ib_gid_indexCLI option to mp_unit_tests for test-time configuration
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| include/mscclpp/core.hpp | Added new IB configuration fields (port, gidIndex, deviceIndex) to EndpointConfig::Ib with default values |
| include/mscclpp/env.hpp | Added ibvSo field to support MSCCLPP_IBV_SO environment variable |
| src/include/ib.hpp | Updated IbQp constructor to accept port and gidIndex; removed port from IbQpInfo struct; added portNum_ and gidIndex_ member variables |
| src/ib.cc | Refactored IbQp initialization and validation to use configurable port and GID index; updated port usability checks |
| src/endpoint.cc | Updated createQp call to pass port and gidIndex from config |
| src/env.cpp | Added reading and logging of MSCCLPP_IBV_SO environment variable; refactored template logic |
| src/ibverbs_wrapper.cc | Implemented custom libibverbs.so loading via MSCCLPP_IBV_SO environment variable |
| test/mp_unit/mp_unit_tests.hpp | Fixed license header capitalization (MIT License) |
| test/mp_unit/mp_unit_tests.cc | Enhanced argument parser with flexible syntax; added --ib_gid_index option; fixed license header |
| test/mp_unit/port_channel_tests.cu | Updated to use EndpointConfig with gidIndex from command-line argument |
| test/mp_unit/ib_tests.cu | Changed IB device selection from division to modulo; updated test data type to uint64_t; added gidIndex parameter |
| python/csrc/core_py.cpp | Added Python bindings for new IB configuration fields with both camelCase and snake_case accessors |
| python/test/_cpp/proxy_test.cpp | Changed nanobind casts from const reference to value type; switched to emplace_back |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (mscclpp::env()->ibvSo != "") { | ||
| void* handle = ::dlopen(mscclpp::env()->ibvSo.c_str(), RTLD_NOW); | ||
| if (handle) { | ||
| globalIBVerbsHandle.reset(handle); |
Copilot
AI
Dec 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a custom libibverbs library is specified via MSCCLPP_IBV_SO but fails to load, the code silently falls through to the default library search without any warning or error. This could lead to confusion when the user expects a specific library to be used but a different one is loaded instead. Consider logging a warning or error message if the custom library path is specified but fails to load.
| globalIBVerbsHandle.reset(handle); | |
| globalIBVerbsHandle.reset(handle); | |
| } else { | |
| WARN(NET, "Failed to load custom libibverbs library specified by MSCCLPP_IBV_SO ('", mscclpp::env()->ibvSo, "'): ", std::string(::dlerror())); |
| MSCCLPP_CUDATHROW(cudaSetDevice(cudaDevId)); | ||
|
|
||
| int ibDevId = (gEnv->rank % gEnv->nRanksPerNode) / mscclpp::getIBDeviceCount(); | ||
| int ibDevId = (gEnv->rank % gEnv->nRanksPerNode) % mscclpp::getIBDeviceCount(); |
Copilot
AI
Dec 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The operator changed from division (/) to modulo (%). This changes the device selection logic significantly. Previously, if there were multiple ranks per node and multiple IB devices, the division would distribute ranks across IB devices (e.g., with 8 ranks and 4 devices: ranks 0-1 use device 0, ranks 2-3 use device 1, etc.). Now with modulo, ranks are distributed round-robin (e.g., ranks 0,4 use device 0, ranks 1,5 use device 1, etc.). Please verify this behavioral change is intentional, as it may affect multi-device configurations.
| int ibDevId = (gEnv->rank % gEnv->nRanksPerNode) % mscclpp::getIBDeviceCount(); | |
| int ibDevId = (gEnv->rank % gEnv->nRanksPerNode) / (gEnv->nRanksPerNode / mscclpp::getIBDeviceCount()); |
| .def_prop_rw( | ||
| "ib_device_index", [](EndpointConfig& self) { return self.ib.deviceIndex; }, | ||
| [](EndpointConfig& self, int v) { self.ib.deviceIndex = v; }) | ||
| .def_prop_rw( | ||
| "ib_port", [](EndpointConfig& self) { return self.ib.port; }, | ||
| [](EndpointConfig& self, int v) { self.ib.port = v; }) | ||
| .def_prop_rw( | ||
| "ib_gid_index", [](EndpointConfig& self) { return self.ib.gidIndex; }, | ||
| [](EndpointConfig& self, int v) { self.ib.gidIndex = v; }) |
Copilot
AI
Dec 12, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming convention is inconsistent: the C++ struct uses camelCase (deviceIndex, gidIndex) but the Python property names use snake_case (ib_device_index, ib_gid_index). While the def_rw attributes correctly use camelCase to match the C++ members, the def_prop_rw properties add an "ib_" prefix and use snake_case. This creates two different ways to access the same fields in Python, which could be confusing. Consider using a consistent naming convention.
portandgidIndexfield in the IB endpoint config (anddeviceIndexfield for future usages)MSCCLPP_IBV_SOenv variable to specify a custom libibverbs.so--ib_gid_indexCLI option tomp_unit_tests