Skip to content

Conversation

@XuehaiPan
Copy link
Contributor

@XuehaiPan XuehaiPan commented Jan 22, 2026

Description

This PR fixes a segfault that can occur during Python interpreter shutdown when tp_traverse/tp_clear callbacks call py::cast after pybind11 internals have been destroyed.

Fixes #5958 (comment)

Background

PR #5958 introduced internals_shutdown() that resets internals during interpreter finalization. However, because GC order is not guaranteed during shutdown, tp_traverse/tp_clear can run after internals have been reset. When these callbacks call py::cast, it recreates an empty internals struct, fails type lookup, and throws pybind11::cast_error, terminating the process.

Solution

This PR takes the "leak internals" approach for correctness:

  1. Revert internals destruction: The explicit destructors from Destruct internals during interpreter finalization #5958 are removed. internals and local_internals revert to the previous "leak on shutdown" behavior. The leak only occurs at process exit, so it's benign in practice.

  2. Re-entrancy detection: A new create_pp_content_once() method tracks which interpreters have already created internals. If code attempts to recreate internals after destruction (e.g., late shutdown code calling py::cast), it triggers a hard failure with a diagnostic message rather than silently creating empty internals.

  3. New capsule helpers: get_internals_capsule() and get_local_internals_capsule() allow tests (and potentially user code) to verify capsule lifetime.

Changes

include/pybind11/detail/internals.h:

  • internals::~internals() -> = default (no explicit destruction)
  • local_internals no longer stores istate or defines a destructor
  • Internals capsule passes dtor=nullptr to intentionally leak at shutdown
  • Added PYBIND11_DTOR_USE_DELETE sentinel for atomic_get_or_create_in_state_dict() to distinguish "use delete" from "leak"
  • Added create_pp_content_once() with re-entrancy detection
  • Added get_internals_capsule(), get_local_internals_capsule(), and get_local_internals_key() helpers

Tests:

  • Refactored OwnsPythonObjects -> ContainerOwnsPythonObjects with a vector-based container
  • Added add_gc_checkers_with_weakrefs() to verify capsule lifetime during GC
  • Added test_py_cast_useable_on_shutdown that exercises the shutdown GC path in a subprocess
  • Moved check_script_success_in_subprocess() to tests/env.py for reuse

Documentation:

  • Updated docs/advanced/classes.rst example to match the new test code

Notes

  • No internals version bump: PYBIND11_INTERNALS_VERSION remains 11 for ABI compatibility with v3.0.1
  • The re-entrancy guard calls pybind11_fail() which terminates the process. This is intentional to surface problematic code paths rather than silently failing.

Suggested changelog entry:


📚 Documentation preview 📚: https://pybind11--5972.org.readthedocs.build/

@XuehaiPan XuehaiPan marked this pull request as draft January 22, 2026 17:38
@XuehaiPan XuehaiPan marked this pull request as ready for review January 22, 2026 19:17
@rwgk
Copy link
Collaborator

rwgk commented Jan 22, 2026

@XuehaiPan Is your plan to work on adding a new unit test here after you have the fix? — Short-term the fix is most important, long-term the test will be more important.

I could try to use some LLM to generate a new test, based on what you provided under 5958. Do you think that'd be useful?

@b-pass
Copy link
Collaborator

b-pass commented Jan 22, 2026

It might be better to mostly revert #5958 if the behavior it has is not what we want. The problem described is expected (but undesirable) behavior from that PR. It is not a use-after-free. It's also possible for OptTree to fix this by manually holding a reference to internals capsule, we could make that easier with something like get_internals_capsule() from this PR.

One option is to go back to something more like the first commit on #5958, before I went into releasing internals itself. The original goal of that PR was to get rid of the leaks of the default_metaclass and instance_base.

@XuehaiPan
Copy link
Contributor Author

I think releasing some part of the internal but keeping the internal leaked would be a safest choice.

@rwgk
Copy link
Collaborator

rwgk commented Jan 23, 2026

I think releasing some part of the internal but keeping the internal leaked would be a safest choice.

@XuehaiPan did you see my question above (​https://github.com/pybind/pybind11/pull/5972#issuecomment-3786298624​)?

Subinterpreter support added a lot of complexity; if we don’t capture these issues in unit tests, pybind11 will regress over time. I’m very skeptical of fixes that don’t start from a failing unit test.

@rwgk
Copy link
Collaborator

rwgk commented Jan 23, 2026

The original goal of that PR was to get rid of the leaks of the default_metaclass and instance_base.

Those leaks are currently dwarfed by CPython leaks documented here (pure‑C reproducer):

#5958 (comment).

Until those are fixed in CPython, or we find a workaround that sidesteps them, the savings from #5958 are marginal. If reverting #5958 is what it takes to ship v3.0.2, that seems like a reasonable short‑term compromise.

@XuehaiPan
Copy link
Contributor Author

If reverting #5958 is what it takes to ship v3.0.2, that seems like a reasonable short‑term compromise.

This makes sense to me because most users only use the main interpreter. The leak will be freed anyway on program shutdown, while the correctness is more important.

I think releasing some part of the internal but keeping the internal leaked would be a safest choice.

@XuehaiPan did you see my question above (​pybind/pybind11/pull/5972#issuecomment-3786298624​)?

Subinterpreter support added a lot of complexity; if we don’t capture these issues in unit tests, pybind11 will regress over time. I’m very skeptical of fixes that don’t start from a failing unit test.

I am trying to add a repro. It is hard to find a consistent repro for the order bugs since there is no guarantee.

@b-pass
Copy link
Collaborator

b-pass commented Jan 23, 2026

Until those are fixed in CPython, or we find a workaround that sidesteps them, the savings from #5958 are marginal. If reverting #5958 is what it takes to ship v3.0.2, that seems like a reasonable short‑term compromise.

I think reverting is a better solution than this PR.

@XuehaiPan
Copy link
Contributor Author

I am trying to add a repro. It is hard to find a consistent repro for the order bugs since there is no guarantee.

I added a test that fails on the master branch. But I don't think it can test all cases since there is no guarantee for the GC order.

@rwgk
Copy link
Collaborator

rwgk commented Jan 24, 2026

This looks great at first glance. I'll start understanding/reviewing this now with LLM assistance.

@rwgk
Copy link
Collaborator

rwgk commented Jan 24, 2026

Below is the result of me discussing this PR with Cursor (GPT-5.2 Codex High).

I didn't study the code changes in this PR in detail.

In my mind, these are the most important goals:

  • Avoid an internal version bump before the v3.0.2 release.
  • Fix the segfault reported by @XuehaiPan before we make the v3.0.2 release.

High level approach: pare this PR back, to more-or-less roll back 5958, but keep as much of the new unit test as possible.

Based on that, Cursor came up with the Options A, B, C below.

@XuehaiPan @b-pass, could you please scrutinize Cursor's analysis, does it make sense?

Based on the analysis, Option A looks best to me. WDYT?


PR 5972 internals version bump considerations (v3.0.2)

Executive summary

For v3.0.2 you want no internals version bump and no new ABI breaks, but you also want to
eliminate the shutdown crash reported under #5958 (comment r2713504794). The only explicit bump in
PR 5972 is the macro change:

  • PYBIND11_INTERNALS_VERSION 11 -> 12

Nothing else in 5972 inherently requires a bump (no struct layout changes or ABI‑critical type
changes). The reason a bump is attractive is behavioral, not ABI: the shutdown fix in 5972
only protects types built with the new code. A bump prevents mixed old/new extensions from sharing
internals, which otherwise leaves the old types still vulnerable.

So the decision point is:

  • No bump means mixed v3.0.1 / v3.0.2 extensions will share internals and old types can still
    hit the crash
    , unless you fully avoid destroying internals at shutdown (leak them).
  • Bump means the fix applies to all modules in a process, but violates the patch‑release goal.

Below are the minimal patch sets and recommendations.


Root cause (why the crash happens)

PR 5958 introduced internals_shutdown() which does pp->reset() when the internals capsule is
destroyed. During interpreter shutdown, GC order is not guaranteed, so tp_traverse/tp_clear
can run after pp->reset() and call py::cast. That recreates an empty internals and fails
type lookup, leading to an uncaught cast_error and termination. This is the crash reported in
comment r2713504794.


What in 5972 is related to the crash fix

Core behavioral change (the shutdown‑safety fix)

These changes allow pybind11 types to hold a reference to the internals capsules, keeping them
alive until those types are destroyed:

  • get_internals_capsule() / get_local_internals_capsule() helpers in
    include/pybind11/detail/internals.h
  • type‑side reference holding (final branch uses weakref callbacks in
    include/pybind11/pybind11.h)

This prevents internals_shutdown() from running before all pybind11 types are collected,
which avoids the crash for newly built extensions.

Non‑essential or unrelated changes

These are not required for the crash fix and add risk for a patch release:

  • PYBIND11_INTERNALS_VERSION bump to 12 (the only actual bump)
  • atomic_get_or_create_in_state_dict() default‑destructor sentinel (changes default leak vs
    delete semantics)
  • GIL acquisition before Py_CLEAR in internals/local_internals dtors (safety improvement but not
    part of the crash mechanism)
  • test refactors unrelated to the crash (tests/env.py helper move, test_multiple_interpreters)
  • doc updates (docs/advanced/classes.rst)

Mixed v3.0.1 / v3.0.2 (no bump) analysis

If you remove the bump, v3.0.1 and v3.0.2 modules will share internals (same capsule key).

Implications:

  • v3.0.2 types are protected (they hold capsule refs).
  • v3.0.1 types are not protected (they do not hold refs).
  • Therefore, the crash can still happen if only v3.0.1 types remain alive when the capsule
    is destroyed, or if they outlive v3.0.2 types.
  • This is not a new bug, but it means the fix is partial in mixed environments.

Minimal patch sets for v3.0.2 (no bump)

Option A (safest for mixed environments, no bump)

Fully avoid destroying internals at shutdown (i.e., revert the risky part of 5958):

  • Make internals_shutdown() a no‑op (or skip pp->reset()).
  • Alternatively, change the capsule destructor for internals to leak the payload (no reset).
  • Add a regression test that exercises py::cast during shutdown and only asserts that the
    subprocess exits cleanly (no capsule‑lifetime assertions).

Pros

  • Fixes the crash for all modules, including v3.0.1 binaries.
  • No internals bump.
  • Minimal behavioral surprise.

Cons

  • Reintroduces the leak fixed in 5958 (at least for shared internals).

This is the only way to guarantee safety in mixed 3.0.1/3.0.2 processes without a bump.

Option B (partial fix, no bump)

Keep the shutdown behavior from 5958, but add the capsule‑ref protection from 5972:

Keep

  • get_internals_capsule() / get_local_internals_capsule() helpers
  • type‑side ref holding (either weakref callbacks in pybind11.h or the earlier inc/dec in
    class.h)
  • add/keep the new regression test that exercises py::cast during shutdown

Drop

  • internals version bump
  • atomic_get_or_create_in_state_dict() sentinel change
  • unrelated test/doc refactors

Pros

  • Fixes the crash for extensions rebuilt with v3.0.2.
  • Preserves the 5958 leak reduction for those modules.

Cons

  • Does not fix the crash for v3.0.1 modules in the same process.
  • Mixed builds remain risky; the fix is incomplete by design.

Option C (post‑release, with bump)

After v3.0.2, keep the shutdown‑safety fix and bump internals version so old modules do not
share internals with new ones. This makes the fix complete in any process.


Recommended course (for v3.0.2)

Given the requirement of full ABI compatibility and avoiding regressions for mixed builds, the
best tradeoff is:

  1. Do not bump PYBIND11_INTERNALS_VERSION for v3.0.2.
  2. Avoid destroying internals at shutdown (Option A).
    • This ensures correctness for all modules, including v3.0.1 binaries.
    • It matches the “leak is acceptable, correctness is critical” consensus in 5972 discussion.
  3. Add a regression test that reproduces the shutdown GC path and asserts “no crash” rather
    than capsule‑lifetime specifics. This keeps the test robust across future design changes.

Then, after v3.0.2, do the larger cleanup:

  • Apply the capsule‑ref fix (or another safe design).
  • Bump internals version for the future set of PRs already planned.
  • Add/restore tests that validate the stronger invariant under a bumped ABI.

Notes on test strategy

The current new test in 5972 uses a weakref callback to confirm the capsule is still alive at type
destruction time. That test is tightly coupled to the “hold capsule refs” design. If you choose
Option A (leak internals), a more resilient test is:

  • Run a subprocess that creates a pybind11 type with a custom tp_traverse/tp_clear that calls
    py::cast, forms a self‑cycle, then exits.
  • Assert the subprocess exits cleanly (no crash).

This directly targets the failure mode without enforcing a particular lifetime mechanism.


Final answer to “what in 5972 requires an internals bump?”

Strictly speaking, nothing in 5972 requires a bump. There is no ABI‑level change to the
internals layout or the capsule payload. The bump is a risk‑management choice to prevent
old and new extensions from sharing internals when the shutdown fix only protects new types. If
you remove the bump, you must accept partial protection (only new modules) or disable
internals destruction at shutdown
to keep mixed processes safe.

XuehaiPan and others added 3 commits January 25, 2026 21:20
This reverts commit 8f25a25.
Prevent re-creation of internals after destruction during interpreter
shutdown. If pybind11 code runs after internals have been destroyed,
fail early with a clear error message instead of silently creating
new empty internals that would cause type lookup failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@XuehaiPan XuehaiPan changed the title Prevent internals destruction before all pybind11 types are destroyed Revert internals destruction and add test for internals recreation Jan 25, 2026
@XuehaiPan
Copy link
Contributor Author

This PR is now ready for code review.

@rwgk
Copy link
Collaborator

rwgk commented Jan 25, 2026

This is a very complex PR. In an attempt to understand it, I worked with Cursor (GPT-5.2 Codex High) on a draft comprehensive PR description (below). This all makes sense to me high-level. @XuehaiPan @b-pass could you please check for correctness? @XuehaiPan could you please add the corrected version to the actual PR description?

I'll look in more detail after the comprehensive PR description is posted.


DRAFT comprehensive PR description

Summary

This PR changes pybind11’s interpreter‑shutdown behavior to avoid failures when tp_traverse /
tp_clear calls into py::cast after internals have been torn down. It does this by reverting
internals destruction
, intentionally leaking internals across shutdown, and adding a guard
against re‑creating internals after destruction
. It also adds a regression test that exercises
the shutdown GC path, plus a small test‑infra refactor and a documentation update.

Importantly, the internals version bump has been reverted; the default
PYBIND11_INTERNALS_VERSION remains 11 for ABI compatibility with v3.0.1.


Motivation / background

PR #5958 introduced internals_shutdown() that resets internals during interpreter finalization.
Because GC order is not guaranteed, tp_traverse / tp_clear can run after internals have
been reset and call py::cast, which recreates an empty internals and then fails type lookup.
This can throw pybind11::cast_error during shutdown and terminate the process. The goal here is
to make shutdown safe again (for v3.0.2) without bumping the internals ABI version.


Runtime behavior changes (net effect vs upstream/master)

1) Revert internals/local_internals destruction during shutdown

The explicit destructors added in #5958 are removed. internals and local_internals no longer
DECREF their cached PyTypes during finalization, so they revert to the previous “leak on shutdown”
behavior.

  • internals::~internals() → default
  • local_internals no longer stores istate or defines a destructor

This eliminates the immediate failure mode caused by pp->reset() before GC has finished.

2) Intentionally leak internals at interpreter shutdown

Internals capsule creation now passes dtor=nullptr to the state‑dict helper, which means the
capsule’s payload is intentionally leaked instead of destroyed at shutdown. This is explicitly
documented in comments to explain why we prefer leaking to early destruction.

3) Re‑entrancy detection when (re)creating internals

A new create_pp_content_once() path tracks which pp pointers have already created internals
content. If creation is attempted again for the same interpreter (e.g., due to late shutdown code
calling py::cast), it triggers a hard failure (pybind11_fail) with a diagnostic message.

This provides explicit detection of the failure mode that used to silently recreate empty
internals, so the behavior is not “accidentally” masked.

4) atomic_get_or_create_in_state_dict() semantics refined

The helper now distinguishes between:

  • default: use delete when the interpreter shuts down
  • nullptr: explicitly leak
  • function: use custom destructor

This enables the explicit “leak internals” behavior without changing default behavior for other
callers.

5) New internals capsule helpers

New helpers return the internals capsules (global and local) and the local‑internals key. These
are primarily used by tests to assert capsule lifetime or to add future runtime protections.


Test changes

New shutdown‑path regression test (via existing test module)

tests/test_custom_type_setup.cpp is refactored:

  • OwnsPythonObjects is replaced with ContainerOwnsPythonObjects
  • tp_traverse / tp_clear now iterate a vector of py::object
  • adds add_gc_checkers_with_weakrefs() that installs weakref callbacks asserting that the
    internals capsules remain alive during object finalization

tests/test_custom_type_setup.py adds a new test:

  • runs in a subprocess
  • creates a self‑cycle
  • installs weakref GC checkers
  • verifies that shutdown GC does not destroy internals before the object is collected

Test infrastructure refactor

  • A shared check_script_success_in_subprocess() helper is added in tests/env.py.
  • tests/test_multiple_interpreters.py now uses this helper and adds env to sys.path in the
    subprocess preamble.

Documentation update

The docs/advanced/classes.rst example is updated to match the new container‑based custom type
used in the tests.


Notes

  • The new re‑entrancy guard (pybind11_fail) introduces a hard‑failure path if internals are
    ever recreated after destruction for the same interpreter. That is a deliberate behavioral change.
  • The shutdown fix is implemented by leaking internals, which trades memory for safety. This
    aligns with the intended “correctness over leaks” policy for the v3.0.2 patch release.

Copy link
Collaborator

@rwgk rwgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple very easy suggestions for changes to comments.

@XuehaiPan XuehaiPan requested a review from rwgk January 27, 2026 10:41
@XuehaiPan
Copy link
Contributor Author

@rwgk @b-pass This PR is ready for review. The PR description has been updated. My downstream tests have also passed.

Comment on lines 726 to 729
// Detect re-creation of internals after destruction during interpreter shutdown.
// If pybind11 code (e.g., tp_traverse/tp_clear calling py::cast) runs after internals have
// been destroyed, a new empty internals would be created, causing type lookup failures.
// See also get_or_create_pp_in_state_dict() comments.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR reverted the "re-creation during shutdown" scenario, so why is this code necessary now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two reasons:

  1. Add an explicit check to prevent potential bugs in the future. Maybe we can eventually not leak the internals.
  2. The reverted code still creates the internals raw pointers multiple times (pp->reset(new InternalsType())). This PR adds a lock to ensure consistency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants