Skip to content

Conversation

@puzpuzpuz
Copy link

The race condition happens when per-cpu arenas and background threads features are enabled:

  1. Thread A calls malloc() first -> starts malloc_init_hard()
  2. While Thread A is still in malloc_init_hard(), Threads B, C, D also call malloc()
  3. These threads trigger arena creation (with percpu_arena, each CPU gets its own arena)
  4. arena_init() -> arena_new_create_background_thread() -> background_thread_create(arena_ind=12, etc.)
  5. Thread A hasn't reached line 2232 (background_thread_create(tsd, 0)) yet
  6. So a non-zero arena can mark thread slot 0 as "started" before arena 0 does

The initialization has locks, but arena creation for different arenas can proceed in parallel as soon as malloc_init_state = initialized and init_lock is released, and the background_thread_create(tsd, 0) call happens late in malloc_init_hard() (after arenas are already being created by other threads).

This bug won't be noticeable for applications that have single-threaded initialization path, but in case of JVM it's easily reproducible. When background thread initialization silently fails due to the race, RSS grows infinitely until the process gets OOM killed.

@meta-cla
Copy link

meta-cla bot commented Jan 10, 2026

Hi @puzpuzpuz!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@meta-cla
Copy link

meta-cla bot commented Jan 10, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the cla signed label Jan 10, 2026
@puzpuzpuz puzpuzpuz force-pushed the background-thread-race branch from e52a78e to 05ebbdd Compare January 12, 2026 11:05
@puzpuzpuz
Copy link
Author

The build seems to be timing out on ARM. Not sure if that has anything to do with the patch.

@lexprfuncall
Copy link

This bug won't be noticeable for applications that have single-threaded initialization path, but in case of JVM it's easily reproducible. When background thread initialization silently fails due to the race, RSS grows infinitely until the process gets OOM killed.

Thank you for the patch! These kinds of initialization bugs can be very tricky to root cause. I saw a similar one recently where the page size at compile time did not match the page size at runtime caused a multi-threaded Python process to deadlock when it called mallctl just to see if jemalloc was in use.

I will take a closer look a the fix alongside the team next week. If possible, I would like to add a test for this. (Might require a follow-up commit.) As is, multi-threaded initialization does not have as much coverage as it deserves.

@lexprfuncall lexprfuncall self-assigned this Jan 16, 2026
@puzpuzpuz puzpuzpuz force-pushed the background-thread-race branch 2 times, most recently from e52a78e to 3c2d2da Compare January 20, 2026 10:24
@puzpuzpuz
Copy link
Author

If possible, I would like to add a test for this. (Might require a follow-up commit.) As is, multi-threaded initialization does not have as much coverage as it deserves.

I've added some multi-threaded tests in 3c2d2da. PTAL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants