Skip to content

Conversation

@chhwang
Copy link
Contributor

@chhwang chhwang commented Dec 12, 2025

  • Added port and gidIndex field in the IB endpoint config (and deviceIndex field for future usages)
  • Added MSCCLPP_IBV_SO env variable to specify a custom libibverbs.so
  • Added --ib_gid_index CLI option to mp_unit_tests
  • Other minor fixes

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances InfiniBand (IB) configurability by adding support for custom port selection, GID index specification, and custom libibverbs library loading. The changes enable more flexible deployment in heterogeneous IB environments where different ports or GID indices may be required.

Key changes:

  • Added port, gidIndex, and deviceIndex fields to EndpointConfig::Ib structure
  • Introduced MSCCLPP_IBV_SO environment variable for custom libibverbs library paths
  • Added --ib_gid_index CLI option to mp_unit_tests for test-time configuration

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
include/mscclpp/core.hpp Added new IB configuration fields (port, gidIndex, deviceIndex) to EndpointConfig::Ib with default values
include/mscclpp/env.hpp Added ibvSo field to support MSCCLPP_IBV_SO environment variable
src/include/ib.hpp Updated IbQp constructor to accept port and gidIndex; removed port from IbQpInfo struct; added portNum_ and gidIndex_ member variables
src/ib.cc Refactored IbQp initialization and validation to use configurable port and GID index; updated port usability checks
src/endpoint.cc Updated createQp call to pass port and gidIndex from config
src/env.cpp Added reading and logging of MSCCLPP_IBV_SO environment variable; refactored template logic
src/ibverbs_wrapper.cc Implemented custom libibverbs.so loading via MSCCLPP_IBV_SO environment variable
test/mp_unit/mp_unit_tests.hpp Fixed license header capitalization (MIT License)
test/mp_unit/mp_unit_tests.cc Enhanced argument parser with flexible syntax; added --ib_gid_index option; fixed license header
test/mp_unit/port_channel_tests.cu Updated to use EndpointConfig with gidIndex from command-line argument
test/mp_unit/ib_tests.cu Changed IB device selection from division to modulo; updated test data type to uint64_t; added gidIndex parameter
python/csrc/core_py.cpp Added Python bindings for new IB configuration fields with both camelCase and snake_case accessors
python/test/_cpp/proxy_test.cpp Changed nanobind casts from const reference to value type; switched to emplace_back

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if (mscclpp::env()->ibvSo != "") {
void* handle = ::dlopen(mscclpp::env()->ibvSo.c_str(), RTLD_NOW);
if (handle) {
globalIBVerbsHandle.reset(handle);
Copy link

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a custom libibverbs library is specified via MSCCLPP_IBV_SO but fails to load, the code silently falls through to the default library search without any warning or error. This could lead to confusion when the user expects a specific library to be used but a different one is loaded instead. Consider logging a warning or error message if the custom library path is specified but fails to load.

Suggested change
globalIBVerbsHandle.reset(handle);
globalIBVerbsHandle.reset(handle);
} else {
WARN(NET, "Failed to load custom libibverbs library specified by MSCCLPP_IBV_SO ('", mscclpp::env()->ibvSo, "'): ", std::string(::dlerror()));

Copilot uses AI. Check for mistakes.
MSCCLPP_CUDATHROW(cudaSetDevice(cudaDevId));

int ibDevId = (gEnv->rank % gEnv->nRanksPerNode) / mscclpp::getIBDeviceCount();
int ibDevId = (gEnv->rank % gEnv->nRanksPerNode) % mscclpp::getIBDeviceCount();
Copy link

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The operator changed from division (/) to modulo (%). This changes the device selection logic significantly. Previously, if there were multiple ranks per node and multiple IB devices, the division would distribute ranks across IB devices (e.g., with 8 ranks and 4 devices: ranks 0-1 use device 0, ranks 2-3 use device 1, etc.). Now with modulo, ranks are distributed round-robin (e.g., ranks 0,4 use device 0, ranks 1,5 use device 1, etc.). Please verify this behavioral change is intentional, as it may affect multi-device configurations.

Suggested change
int ibDevId = (gEnv->rank % gEnv->nRanksPerNode) % mscclpp::getIBDeviceCount();
int ibDevId = (gEnv->rank % gEnv->nRanksPerNode) / (gEnv->nRanksPerNode / mscclpp::getIBDeviceCount());

Copilot uses AI. Check for mistakes.
Comment on lines +185 to +193
.def_prop_rw(
"ib_device_index", [](EndpointConfig& self) { return self.ib.deviceIndex; },
[](EndpointConfig& self, int v) { self.ib.deviceIndex = v; })
.def_prop_rw(
"ib_port", [](EndpointConfig& self) { return self.ib.port; },
[](EndpointConfig& self, int v) { self.ib.port = v; })
.def_prop_rw(
"ib_gid_index", [](EndpointConfig& self) { return self.ib.gidIndex; },
[](EndpointConfig& self, int v) { self.ib.gidIndex = v; })
Copy link

Copilot AI Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming convention is inconsistent: the C++ struct uses camelCase (deviceIndex, gidIndex) but the Python property names use snake_case (ib_device_index, ib_gid_index). While the def_rw attributes correctly use camelCase to match the C++ members, the def_prop_rw properties add an "ib_" prefix and use snake_case. This creates two different ways to access the same fields in Python, which could be confusing. Consider using a consistent naming convention.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants