pymalloc allocates small objects from contiguous regions called arenas. On 64-bit platforms each arena is 1 MiB, obtained via mmap(MAP_PRIVATE|MAP_ANONYMOUS) and backed by 256 standard 4 KiB pages. Each page needs its own TLB entry, and the first-level dTLB on typical x86_64 cores holds only 64-128 entries for 4 KiB pages, so a single arena already exceeds its capacity, and any non-trivial Python program touches many arenas.
Most modern operating systems support "huge pages": memory pages much larger than the default 4 KiB. On x86_64 Linux the standard huge page size is 2 MiB. A single 2 MiB huge page is covered by one TLB entry instead of 512 entries for the equivalent range of 4 KiB pages. This dramatically reduces TLB pressure for workloads that touch large contiguous allocations. On Linux, explicit huge pages are allocated via mmap with the MAP_HUGETLB flag (available since kernel 2.6.32) from a pre-reserved pool configured through /proc/sys/vm/nr_hugepages. On Windows, the equivalent is VirtualAlloc with MEM_LARGE_PAGES.
I'd like to propose adding a ./configure --with-pymalloc-hugepages option that increases ARENA_BITS from 20 to 21 (1 MiB -> 2 MiB) and makes _PyMem_ArenaAlloc() try mmap(MAP_HUGETLB) first, falling back to regular mmap if the huge page pool is exhausted. On Windows the equivalent would be VirtualAlloc(MEM_LARGE_PAGES) with fallback. _PyMem_ArenaFree() needs no changes since munmap handles huge pages identically. All derived constants (ARENA_SIZE, MAX_POOLS_IN_ARENA, radix tree bit widths, nfp2lasta sizing) adjust automatically from ARENA_BITS.
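For concreteness, here is a minimal sketch of the Linux allocation path this would imply. The function name, the standalone ARENA_BITS define, and the error handling are illustrative only, not the actual obmalloc patch:

#include <stddef.h>
#include <sys/mman.h>

#define ARENA_BITS 21                          /* 2 MiB arenas under the proposed flag */
#define ARENA_SIZE ((size_t)1 << ARENA_BITS)

/* Sketch only: try to back the arena with a single 2 MiB huge page,
   falling back to ordinary 4 KiB pages when the hugetlb pool is exhausted
   or MAP_HUGETLB is not available. */
static void *
arena_alloc_hugepage_first(void)
{
    void *p;
#ifdef MAP_HUGETLB
    p = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;
#endif
    /* Fallback: the same call a non-hugepages build makes. */
    p = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

Whether the huge-page attempt is gated only at configure time or also probed at runtime is left open here; the sketch just shows the fallback shape.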
The flag is opt-in and off by default. MAP_HUGETLB requires the kernel to have huge pages pre-allocated; without them the fallback path produces identical behavior to a non-hugepages build. On Linux, huge pages are managed through /proc/sys/vm/nr_hugepages. To allocate 128 huge pages (256 MiB on x86_64 where the default huge page size is 2 MiB):
# Allocate (requires root)
echo 128 | sudo tee /proc/sys/vm/nr_hugepages
# Verify
grep HugePages /proc/meminfo
# HugePages_Total: 128
# HugePages_Free: 128
# Make persistent across reboots by adding to /etc/sysctl.conf:
# vm.nr_hugepages = 128
Each arena consumes one huge page. If the pool runs out, obmalloc falls back to regular 4K pages transparently.
I benchmarked on an i9-14900KS, Linux 6.18.3, GCC 15.2.1 on main with nr_hugepages=128. Measured with perf stat -r 100 using cpu_core counters. GC disabled during benchmarks.
Wall-clock results:
| Benchmark | Default | Hugepages | Change |
|---|---|---|---|
| list_of_tuples (1M 3-tuples) | 0.172s | 0.121s | -29.5% |
| fragmentation (500K alloc/free/realloc) | 0.162s | 0.119s | -26.5% |
| mixed_sizes (500K, 12 size classes) | 0.141s | 0.106s | -25.1% |
| bulk_small_alloc (1M bytearrays) | 0.205s | 0.160s | -22.1% |
| class_instances (500K __slots__) | 0.120s | 0.096s | -20.0% |
| arena_pressure (10x200K objects) | 0.509s | 0.448s | -12.1% |
| random_walk (1M, shuffled access) | 0.822s | 0.759s | -7.6% |
dTLB miss reductions:
| Benchmark | dTLB Load Miss | dTLB Store Miss | Page Faults |
|---|---|---|---|
| fragmentation | -95.9% | -94.7% | -94.5% |
| random_walk | -93.1% | -98.9% | -91.6% |
| bulk_small_alloc | -91.4% | -94.5% | -93.5% |
| list_of_tuples | -88.0% | -93.7% | -94.1% |
| class_instances | -84.3% | -91.8% | -92.1% |
| mixed_sizes | -80.8% | -76.5% | -78.2% |
The perf command used per benchmark:
EVENTS="dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-load-misses,cache-misses,cache-references,instructions,cycles,page-faults"
perf stat -r 10 -e "$EVENTS" ./python bench_obmalloc.py fragmentation

bench_obmalloc.py:
import sys, gc

def bench_small_object_churn():
    objs = []
    for _ in range(200_000): objs.append(bytearray(64))
    for _ in range(200_000): objs.append(bytearray(64)); objs.pop(0)

def bench_bulk_small_alloc():
    objs = [bytearray(48) for _ in range(1_000_000)]
    for o in objs: o[0] = 1

def bench_dict_churn():
    for _ in range(500_000): d = {"a": 1, "b": 2, "c": 3, "d": 4}; del d

def bench_mixed_sizes():
    sizes = [8, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384, 512]
    objs = [bytearray(sizes[i % 12]) for i in range(500_000)]

def bench_fragmentation():
    objs = [bytearray(128) for _ in range(500_000)]
    for i in range(0, len(objs), 2): objs[i] = None
    for i in range(0, len(objs), 2): objs[i] = bytearray(128)

def bench_list_of_tuples():
    objs = [(i, i+1, i+2) for i in range(1_000_000)]

def bench_class_instances():
    class Pt:
        __slots__ = ('x', 'y', 'z')
        def __init__(s, x, y, z): s.x = x; s.y = y; s.z = z
    objs = [Pt(i, i+1, i+2) for i in range(500_000)]

def bench_arena_pressure():
    layers = [[bytearray(256) for _ in range(200_000)] for _ in range(10)]

def bench_random_walk():
    import random; random.seed(42)
    objs = [bytearray(64) for _ in range(1_000_000)]
    idx = list(range(len(objs))); random.shuffle(idx)
    for i in idx: objs[i][0] = i & 0xff

BENCHMARKS = dict(small_object_churn=bench_small_object_churn,
                  bulk_small_alloc=bench_bulk_small_alloc, dict_churn=bench_dict_churn,
                  mixed_sizes=bench_mixed_sizes, fragmentation=bench_fragmentation,
                  list_of_tuples=bench_list_of_tuples, class_instances=bench_class_instances,
                  arena_pressure=bench_arena_pressure, random_walk=bench_random_walk)

if __name__ == "__main__":
    gc.collect(); gc.disable(); BENCHMARKS[sys.argv[1]](); gc.enable()

Full reproduction:
./configure && make -j$(nproc) && cp python python_default
./configure --with-pymalloc-hugepages && make -j$(nproc) && cp python python_hugepages
echo 128 | sudo tee /proc/sys/vm/nr_hugepages
EVENTS="dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-load-misses,cache-misses,cache-references,instructions,cycles,page-faults"
for b in bulk_small_alloc mixed_sizes fragmentation list_of_tuples class_instances arena_pressure random_walk; do
echo "=== $b ==="
perf stat -r 10 -e "$EVENTS" ./python_default bench_obmalloc.py "$b"
perf stat -r 10 -e "$EVENTS" ./python_hugepages bench_obmalloc.py "$b"
done