Revert internals destruction and add test for internals recreation #5972

XuehaiPan · 2026-01-22T17:29:14Z

Description

This PR fixes a segfault that can occur during Python interpreter shutdown when tp_traverse/tp_clear callbacks call py::cast after pybind11 internals have been destroyed.

Fixes #5958 (comment)

Background

PR #5958 introduced internals_shutdown() that resets internals during interpreter finalization. However, because GC order is not guaranteed during shutdown, tp_traverse/tp_clear can run after internals have been reset. When these callbacks call py::cast, it recreates an empty internals struct, fails type lookup, and throws pybind11::cast_error, terminating the process.

Solution

This PR takes the "leak internals" approach for correctness:

Revert internals destruction: The explicit destructors from Destruct internals during interpreter finalization #5958 are removed. internals and local_internals revert to the previous "leak on shutdown" behavior. The leak only occurs at process exit, so it's benign in practice.
Re-entrancy detection: A new create_pp_content_once() method tracks which interpreters have already created internals. If code attempts to recreate internals after destruction (e.g., late shutdown code calling py::cast), it triggers a hard failure with a diagnostic message rather than silently creating empty internals.
New capsule helpers: get_internals_capsule() and get_local_internals_capsule() allow tests (and potentially user code) to verify capsule lifetime.

Changes

include/pybind11/detail/internals.h:

internals::~internals() -> = default (no explicit destruction)
local_internals no longer stores istate or defines a destructor
Internals capsule passes dtor=nullptr to intentionally leak at shutdown
Added PYBIND11_DTOR_USE_DELETE sentinel for atomic_get_or_create_in_state_dict() to distinguish "use delete" from "leak"
Added create_pp_content_once() with re-entrancy detection
Added get_internals_capsule(), get_local_internals_capsule(), and get_local_internals_key() helpers

Tests:

Refactored OwnsPythonObjects -> ContainerOwnsPythonObjects with a vector-based container
Added add_gc_checkers_with_weakrefs() to verify capsule lifetime during GC
Added test_py_cast_useable_on_shutdown that exercises the shutdown GC path in a subprocess
Moved check_script_success_in_subprocess() to tests/env.py for reuse

Documentation:

Updated docs/advanced/classes.rst example to match the new test code

Notes

No internals version bump: PYBIND11_INTERNALS_VERSION remains 11 for ABI compatibility with v3.0.1
The re-entrancy guard calls pybind11_fail() which terminates the process. This is intentional to surface problematic code paths rather than silently failing.

Suggested changelog entry:

Revert internals destruction from Destruct internals during interpreter finalization #5958 to fix segfault when py::cast is called during interpreter shutdown via tp_traverse/tp_clear callbacks.

📚 Documentation preview 📚: https://pybind11--5972.org.readthedocs.build/

rwgk · 2026-01-22T19:28:50Z

@XuehaiPan Is your plan to work on adding a new unit test here after you have the fix? — Short-term the fix is most important, long-term the test will be more important.

I could try to use some LLM to generate a new test, based on what you provided under 5958. Do you think that'd be useful?

b-pass · 2026-01-22T23:57:34Z

It might be better to mostly revert #5958 if the behavior it has is not what we want. The problem described is expected (but undesirable) behavior from that PR. It is not a use-after-free. It's also possible for OptTree to fix this by manually holding a reference to internals capsule, we could make that easier with something like get_internals_capsule() from this PR.

One option is to go back to something more like the first commit on #5958, before I went into releasing internals itself. The original goal of that PR was to get rid of the leaks of the default_metaclass and instance_base.

XuehaiPan · 2026-01-23T04:01:45Z

I think releasing some part of the internal but keeping the internal leaked would be a safest choice.

rwgk · 2026-01-23T06:16:54Z

I think releasing some part of the internal but keeping the internal leaked would be a safest choice.

@XuehaiPan did you see my question above (https://github.com/pybind/pybind11/pull/5972#issuecomment-3786298624)?

Subinterpreter support added a lot of complexity; if we don’t capture these issues in unit tests, pybind11 will regress over time. I’m very skeptical of fixes that don’t start from a failing unit test.

rwgk · 2026-01-23T06:28:00Z

The original goal of that PR was to get rid of the leaks of the default_metaclass and instance_base.

Those leaks are currently dwarfed by CPython leaks documented here (pure‑C reproducer):

#5958 (comment).

Until those are fixed in CPython, or we find a workaround that sidesteps them, the savings from #5958 are marginal. If reverting #5958 is what it takes to ship v3.0.2, that seems like a reasonable short‑term compromise.

XuehaiPan · 2026-01-23T07:36:36Z

If reverting #5958 is what it takes to ship v3.0.2, that seems like a reasonable short‑term compromise.

This makes sense to me because most users only use the main interpreter. The leak will be freed anyway on program shutdown, while the correctness is more important.

I think releasing some part of the internal but keeping the internal leaked would be a safest choice.

@XuehaiPan did you see my question above (pybind/pybind11/pull/5972#issuecomment-3786298624)?

Subinterpreter support added a lot of complexity; if we don’t capture these issues in unit tests, pybind11 will regress over time. I’m very skeptical of fixes that don’t start from a failing unit test.

I am trying to add a repro. It is hard to find a consistent repro for the order bugs since there is no guarantee.

b-pass · 2026-01-23T22:50:54Z

Until those are fixed in CPython, or we find a workaround that sidesteps them, the savings from #5958 are marginal. If reverting #5958 is what it takes to ship v3.0.2, that seems like a reasonable short‑term compromise.

I think reverting is a better solution than this PR.

XuehaiPan · 2026-01-24T05:02:46Z

I am trying to add a repro. It is hard to find a consistent repro for the order bugs since there is no guarantee.

I added a test that fails on the master branch. But I don't think it can test all cases since there is no guarantee for the GC order.

This reverts commit c5ec1cf. This reverts commit 72c2e0a.

rwgk · 2026-01-24T19:04:34Z

This looks great at first glance. I'll start understanding/reviewing this now with LLM assistance.

rwgk · 2026-01-24T20:19:11Z

Below is the result of me discussing this PR with Cursor (GPT-5.2 Codex High).

I didn't study the code changes in this PR in detail.

In my mind, these are the most important goals:

Avoid an internal version bump before the v3.0.2 release.
Fix the segfault reported by @XuehaiPan before we make the v3.0.2 release.

High level approach: pare this PR back, to more-or-less roll back 5958, but keep as much of the new unit test as possible.

Based on that, Cursor came up with the Options A, B, C below.

@XuehaiPan @b-pass, could you please scrutinize Cursor's analysis, does it make sense?

Based on the analysis, Option A looks best to me. WDYT?

PR 5972 internals version bump considerations (v3.0.2)

Executive summary

For v3.0.2 you want no internals version bump and no new ABI breaks, but you also want to
eliminate the shutdown crash reported under #5958 (comment r2713504794). The only explicit bump in
PR 5972 is the macro change:

PYBIND11_INTERNALS_VERSION 11 -> 12

Nothing else in 5972 inherently requires a bump (no struct layout changes or ABI‑critical type
changes). The reason a bump is attractive is behavioral, not ABI: the shutdown fix in 5972
only protects types built with the new code. A bump prevents mixed old/new extensions from sharing
internals, which otherwise leaves the old types still vulnerable.

So the decision point is:

No bump means mixed v3.0.1 / v3.0.2 extensions will share internals and old types can still
hit the crash, unless you fully avoid destroying internals at shutdown (leak them).
Bump means the fix applies to all modules in a process, but violates the patch‑release goal.

Below are the minimal patch sets and recommendations.

Root cause (why the crash happens)

PR 5958 introduced internals_shutdown() which does pp->reset() when the internals capsule is
destroyed. During interpreter shutdown, GC order is not guaranteed, so tp_traverse/tp_clear
can run after pp->reset() and call py::cast. That recreates an empty internals and fails
type lookup, leading to an uncaught cast_error and termination. This is the crash reported in
comment r2713504794.

What in 5972 is related to the crash fix

Core behavioral change (the shutdown‑safety fix)

These changes allow pybind11 types to hold a reference to the internals capsules, keeping them
alive until those types are destroyed:

get_internals_capsule() / get_local_internals_capsule() helpers in
include/pybind11/detail/internals.h
type‑side reference holding (final branch uses weakref callbacks in
include/pybind11/pybind11.h)

This prevents internals_shutdown() from running before all pybind11 types are collected,
which avoids the crash for newly built extensions.

Non‑essential or unrelated changes

These are not required for the crash fix and add risk for a patch release:

PYBIND11_INTERNALS_VERSION bump to 12 (the only actual bump)
atomic_get_or_create_in_state_dict() default‑destructor sentinel (changes default leak vs
delete semantics)
GIL acquisition before Py_CLEAR in internals/local_internals dtors (safety improvement but not
part of the crash mechanism)
test refactors unrelated to the crash (tests/env.py helper move, test_multiple_interpreters)
doc updates (docs/advanced/classes.rst)

Mixed v3.0.1 / v3.0.2 (no bump) analysis

If you remove the bump, v3.0.1 and v3.0.2 modules will share internals (same capsule key).

Implications:

v3.0.2 types are protected (they hold capsule refs).
v3.0.1 types are not protected (they do not hold refs).
Therefore, the crash can still happen if only v3.0.1 types remain alive when the capsule
is destroyed, or if they outlive v3.0.2 types.
This is not a new bug, but it means the fix is partial in mixed environments.

Minimal patch sets for v3.0.2 (no bump)

Option A (safest for mixed environments, no bump)

Fully avoid destroying internals at shutdown (i.e., revert the risky part of 5958):

Make internals_shutdown() a no‑op (or skip pp->reset()).
Alternatively, change the capsule destructor for internals to leak the payload (no reset).
Add a regression test that exercises py::cast during shutdown and only asserts that the
subprocess exits cleanly (no capsule‑lifetime assertions).

Pros

Fixes the crash for all modules, including v3.0.1 binaries.
No internals bump.
Minimal behavioral surprise.

Cons

Reintroduces the leak fixed in 5958 (at least for shared internals).

This is the only way to guarantee safety in mixed 3.0.1/3.0.2 processes without a bump.

Option B (partial fix, no bump)

Keep the shutdown behavior from 5958, but add the capsule‑ref protection from 5972:

Keep

get_internals_capsule() / get_local_internals_capsule() helpers
type‑side ref holding (either weakref callbacks in pybind11.h or the earlier inc/dec in
class.h)
add/keep the new regression test that exercises py::cast during shutdown

Drop

internals version bump
atomic_get_or_create_in_state_dict() sentinel change
unrelated test/doc refactors

Pros

Fixes the crash for extensions rebuilt with v3.0.2.
Preserves the 5958 leak reduction for those modules.

Cons

Does not fix the crash for v3.0.1 modules in the same process.
Mixed builds remain risky; the fix is incomplete by design.

Option C (post‑release, with bump)

After v3.0.2, keep the shutdown‑safety fix and bump internals version so old modules do not
share internals with new ones. This makes the fix complete in any process.

Recommended course (for v3.0.2)

Given the requirement of full ABI compatibility and avoiding regressions for mixed builds, the
best tradeoff is:

Do not bump PYBIND11_INTERNALS_VERSION for v3.0.2.
Avoid destroying internals at shutdown (Option A).
- This ensures correctness for all modules, including v3.0.1 binaries.
- It matches the “leak is acceptable, correctness is critical” consensus in 5972 discussion.
Add a regression test that reproduces the shutdown GC path and asserts “no crash” rather
than capsule‑lifetime specifics. This keeps the test robust across future design changes.

Then, after v3.0.2, do the larger cleanup:

Apply the capsule‑ref fix (or another safe design).
Bump internals version for the future set of PRs already planned.
Add/restore tests that validate the stronger invariant under a bumped ABI.

Notes on test strategy

The current new test in 5972 uses a weakref callback to confirm the capsule is still alive at type
destruction time. That test is tightly coupled to the “hold capsule refs” design. If you choose
Option A (leak internals), a more resilient test is:

Run a subprocess that creates a pybind11 type with a custom tp_traverse/tp_clear that calls
py::cast, forms a self‑cycle, then exits.
Assert the subprocess exits cleanly (no crash).

This directly targets the failure mode without enforcing a particular lifetime mechanism.

Final answer to “what in 5972 requires an internals bump?”

Strictly speaking, nothing in 5972 requires a bump. There is no ABI‑level change to the
internals layout or the capsule payload. The bump is a risk‑management choice to prevent
old and new extensions from sharing internals when the shutdown fix only protects new types. If
you remove the bump, you must accept partial protection (only new modules) or disable
internals destruction at shutdown to keep mixed processes safe.

This reverts commit 8f25a25.

Prevent re-creation of internals after destruction during interpreter shutdown. If pybind11 code runs after internals have been destroyed, fail early with a clear error message instead of silently creating new empty internals that would cause type lookup failures. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

XuehaiPan · 2026-01-25T18:38:07Z

This PR is now ready for code review.

rwgk · 2026-01-25T20:25:29Z

This is a very complex PR. In an attempt to understand it, I worked with Cursor (GPT-5.2 Codex High) on a draft comprehensive PR description (below). This all makes sense to me high-level. @XuehaiPan @b-pass could you please check for correctness? @XuehaiPan could you please add the corrected version to the actual PR description?

I'll look in more detail after the comprehensive PR description is posted.

DRAFT comprehensive PR description

Summary

This PR changes pybind11’s interpreter‑shutdown behavior to avoid failures when tp_traverse /
tp_clear calls into py::cast after internals have been torn down. It does this by reverting
internals destruction, intentionally leaking internals across shutdown, and adding a guard
against re‑creating internals after destruction. It also adds a regression test that exercises
the shutdown GC path, plus a small test‑infra refactor and a documentation update.

Importantly, the internals version bump has been reverted; the default
PYBIND11_INTERNALS_VERSION remains 11 for ABI compatibility with v3.0.1.

Motivation / background

PR #5958 introduced internals_shutdown() that resets internals during interpreter finalization.
Because GC order is not guaranteed, tp_traverse / tp_clear can run after internals have
been reset and call py::cast, which recreates an empty internals and then fails type lookup.
This can throw pybind11::cast_error during shutdown and terminate the process. The goal here is
to make shutdown safe again (for v3.0.2) without bumping the internals ABI version.

Runtime behavior changes (net effect vs `upstream/master`)

1) Revert internals/local_internals destruction during shutdown

The explicit destructors added in #5958 are removed. internals and local_internals no longer
DECREF their cached PyTypes during finalization, so they revert to the previous “leak on shutdown”
behavior.

internals::~internals() → default
local_internals no longer stores istate or defines a destructor

This eliminates the immediate failure mode caused by pp->reset() before GC has finished.

2) Intentionally leak internals at interpreter shutdown

Internals capsule creation now passes dtor=nullptr to the state‑dict helper, which means the
capsule’s payload is intentionally leaked instead of destroyed at shutdown. This is explicitly
documented in comments to explain why we prefer leaking to early destruction.

3) Re‑entrancy detection when (re)creating internals

A new create_pp_content_once() path tracks which pp pointers have already created internals
content. If creation is attempted again for the same interpreter (e.g., due to late shutdown code
calling py::cast), it triggers a hard failure (pybind11_fail) with a diagnostic message.

This provides explicit detection of the failure mode that used to silently recreate empty
internals, so the behavior is not “accidentally” masked.

4) `atomic_get_or_create_in_state_dict()` semantics refined

The helper now distinguishes between:

default: use delete when the interpreter shuts down
nullptr: explicitly leak
function: use custom destructor

This enables the explicit “leak internals” behavior without changing default behavior for other
callers.

5) New internals capsule helpers

New helpers return the internals capsules (global and local) and the local‑internals key. These
are primarily used by tests to assert capsule lifetime or to add future runtime protections.

Test changes

New shutdown‑path regression test (via existing test module)

tests/test_custom_type_setup.cpp is refactored:

OwnsPythonObjects is replaced with ContainerOwnsPythonObjects
tp_traverse / tp_clear now iterate a vector of py::object
adds add_gc_checkers_with_weakrefs() that installs weakref callbacks asserting that the
internals capsules remain alive during object finalization

tests/test_custom_type_setup.py adds a new test:

runs in a subprocess
creates a self‑cycle
installs weakref GC checkers
verifies that shutdown GC does not destroy internals before the object is collected

Test infrastructure refactor

A shared check_script_success_in_subprocess() helper is added in tests/env.py.
tests/test_multiple_interpreters.py now uses this helper and adds env to sys.path in the
subprocess preamble.

Documentation update

The docs/advanced/classes.rst example is updated to match the new container‑based custom type
used in the tests.

Notes

The new re‑entrancy guard (pybind11_fail) introduces a hard‑failure path if internals are
ever recreated after destruction for the same interpreter. That is a deliberate behavioral change.
The shutdown fix is implemented by leaking internals, which trades memory for safety. This
aligns with the intended “correctness over leaks” policy for the v3.0.2 patch release.

rwgk

A couple very easy suggestions for changes to comments.

tests/test_custom_type_setup.cpp

tests/test_custom_type_setup.py

include/pybind11/detail/internals.h

XuehaiPan · 2026-01-27T11:11:18Z

@rwgk @b-pass This PR is ready for review. The PR description has been updated. My downstream tests have also passed.

b-pass · 2026-01-27T13:45:03Z

include/pybind11/detail/internals.h

+        // Detect re-creation of internals after destruction during interpreter shutdown.
+        // If pybind11 code (e.g., tp_traverse/tp_clear calling py::cast) runs after internals have
+        // been destroyed, a new empty internals would be created, causing type lookup failures.
+        // See also get_or_create_pp_in_state_dict() comments.


This PR reverted the "re-creation during shutdown" scenario, so why is this code necessary now?

I think there are two reasons:

Add an explicit check to prevent potential bugs in the future. Maybe we can eventually not leak the internals.

The reverted code still creates the internals raw pointers multiple times (pp->reset(new InternalsType())). This PR adds a lock to ensure consistency.

XuehaiPan added 4 commits January 22, 2026 21:26

Bump internals version

33fb4a6

Prevent internals destruction before all pybind11 types are destroyed

a4a6a1e

Merge remote-tracking branch 'upstream/master' into fix-segfault

6d8fa8a

Use Py_XINCREF and Py_XDECREF

1d49006

XuehaiPan marked this pull request as draft January 22, 2026 17:38

XuehaiPan added 6 commits January 23, 2026 02:08

Hold GIL before decref

b147430

Use weakrefs

05576f1

Remove unused code

740f693

Move code location

d9227ce

Move code location

7c5d505

Move code location

436d812

XuehaiPan marked this pull request as ready for review January 22, 2026 19:17

Try add tests

ce9ca7f

XuehaiPan added 6 commits January 24, 2026 16:28

Fix PYTHONPATH

fed1749

Fix PYTHONPATH

a407438

Skip tests for subprocess

3df427c

Revert to leak internals

72c2e0a

Revert to leak internals

c5ec1cf

Revert "Revert to leak internals"

8f25a25

This reverts commit c5ec1cf. This reverts commit 72c2e0a.

Revert internals version bump

97e12d2

XuehaiPan and others added 3 commits January 25, 2026 21:20

Reapply to leak internals

cdefbf3

This reverts commit 8f25a25.

Fix C++11/C++14 support

2c83462

XuehaiPan changed the title ~~Prevent internals destruction before all pybind11 types are destroyed~~ Revert internals destruction and add test for internals recreation Jan 25, 2026

XuehaiPan added 6 commits January 25, 2026 23:46

Add lock under multiple interpreters

6922d7d

Try fix tests

d61f17c

Try fix tests

26e7509

Try fix tests

6820ead

Update comments and assertion messages

15bcbf8

Update comments and assertion messages

b0d350e

rwgk reviewed Jan 25, 2026

View reviewed changes

tests/test_custom_type_setup.cpp Show resolved Hide resolved

tests/test_custom_type_setup.py Outdated Show resolved Hide resolved

XuehaiPan added 5 commits January 26, 2026 11:57

Update comments

33ffa8e

Update lock scope

708ca55

Use original pointer type for Windows

85dc7e6

Change hard error to warning

404457e

Update lock scope

51a70ab

rwgk reviewed Jan 26, 2026

View reviewed changes

include/pybind11/detail/internals.h Outdated Show resolved Hide resolved

XuehaiPan added 7 commits January 27, 2026 01:05

Update lock scope to resolve deadlock

5f96327

Remove scope release of GIL

40731b7

Update comments

aa1767c

Lock pp on reset

dea5660

Mark content created after assignment

691241a

Update comments

552f8b0

Simplify implementation

79a80d2

XuehaiPan requested a review from rwgk January 27, 2026 10:41

Update lock scope when delete unique_ptr

56926f6

b-pass reviewed Jan 27, 2026

View reviewed changes

Revert internals destruction and add test for internals recreation #5972

Are you sure you want to change the base?

Revert internals destruction and add test for internals recreation #5972

Conversation

XuehaiPan commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Background

Solution

Changes

Notes

Suggested changelog entry:

Uh oh!

rwgk commented Jan 22, 2026

Uh oh!

b-pass commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

XuehaiPan commented Jan 23, 2026

Uh oh!

rwgk commented Jan 23, 2026

Uh oh!

rwgk commented Jan 23, 2026

Uh oh!

XuehaiPan commented Jan 23, 2026

Uh oh!

b-pass commented Jan 23, 2026

Uh oh!

XuehaiPan commented Jan 24, 2026

Uh oh!

rwgk commented Jan 24, 2026

Uh oh!

rwgk commented Jan 24, 2026

PR 5972 internals version bump considerations (v3.0.2)

Executive summary

Root cause (why the crash happens)

What in 5972 is related to the crash fix

Core behavioral change (the shutdown‑safety fix)

Non‑essential or unrelated changes

Mixed v3.0.1 / v3.0.2 (no bump) analysis

Minimal patch sets for v3.0.2 (no bump)

Option A (safest for mixed environments, no bump)

Option B (partial fix, no bump)

Option C (post‑release, with bump)

Recommended course (for v3.0.2)

Notes on test strategy

Final answer to “what in 5972 requires an internals bump?”

Uh oh!

XuehaiPan commented Jan 25, 2026

Uh oh!

rwgk commented Jan 25, 2026

DRAFT comprehensive PR description

Summary

Motivation / background

Runtime behavior changes (net effect vs upstream/master)

1) Revert internals/local_internals destruction during shutdown

2) Intentionally leak internals at interpreter shutdown

3) Re‑entrancy detection when (re)creating internals

4) atomic_get_or_create_in_state_dict() semantics refined

5) New internals capsule helpers

Test changes

New shutdown‑path regression test (via existing test module)

Test infrastructure refactor

Documentation update

Notes

Uh oh!

rwgk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

XuehaiPan commented Jan 27, 2026

Uh oh!

b-pass Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

XuehaiPan Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

XuehaiPan commented Jan 22, 2026 •

edited

Loading

b-pass commented Jan 22, 2026 •

edited

Loading

Runtime behavior changes (net effect vs `upstream/master`)

4) `atomic_get_or_create_in_state_dict()` semantics refined