pymalloc allocates small objects from contiguous regions called arenas. On 64-bit platforms each arena is 1 MiB, obtained via mmap(MAP_PRIVATE|MAP_ANONYMOUS) and backed by 256 standard 4 KiB pages. Each page needs its own TLB entry, and the first-level dTLB on typical x86_64 cores holds only 64-128 entries for 4 KiB pages, so a single arena already exceeds its capacity, and any non-trivial Python program touches many arenas.
Most modern operating systems support "huge pages": memory pages much larger than the default 4 KiB. On x86_64 Linux the standard huge page size is 2 MiB. A single 2 MiB huge page is covered by one TLB entry instead of 512 entries for the equivalent range of 4 KiB pages. This dramatically reduces TLB pressure for workloads that touch large contiguous allocations. On Linux, explicit huge pages are allocated via mmap with the MAP_HUGETLB flag (available since kernel 2.6.32) from a pre-reserved pool configured through /proc/sys/vm/nr_hugepages. On Windows, the equivalent is VirtualAlloc with MEM_LARGE_PAGES.
I'd like to propose adding a ./configure --with-pymalloc-hugepages option that increases ARENA_BITS from 20 to 21 (1 MiB -> 2 MiB) and makes _PyMem_ArenaAlloc() try mmap(MAP_HUGETLB) first, falling back to regular mmap if the huge page pool is exhausted. On Windows the equivalent would be VirtualAlloc(MEM_LARGE_PAGES) with fallback. _PyMem_ArenaFree() needs no changes since munmap handles huge pages identically. All derived constants (ARENA_SIZE, MAX_POOLS_IN_ARENA, radix tree bit widths, nfp2lasta sizing) adjust automatically from ARENA_BITS.
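For concreteness, here is a minimal sketch of the Linux allocation path this would imply. The function name, the standalone ARENA_BITS define, and the error handling are illustrative only, not the actual obmalloc patch:

#include <stddef.h>
#include <sys/mman.h>

#define ARENA_BITS 21                          /* 2 MiB arenas under the proposed flag */
#define ARENA_SIZE ((size_t)1 << ARENA_BITS)

/* Sketch only: try to back the arena with a single 2 MiB huge page,
   falling back to ordinary 4 KiB pages when the hugetlb pool is exhausted
   or MAP_HUGETLB is not available. */
static void *
arena_alloc_hugepage_first(void)
{
    void *p;
#ifdef MAP_HUGETLB
    p = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;
#endif
    /* Fallback: the same call a non-hugepages build makes. */
    p = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

Whether the huge-page attempt is gated only at configure time or also probed at runtime is left open here; the sketch just shows the fallback shape.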
The flag is opt-in and off by default. MAP_HUGETLB requires the kernel to have huge pages pre-allocated; without them the fallback path produces identical behavior to a non-hugepages build. On Linux, huge pages are managed through /proc/sys/vm/nr_hugepages. To allocate 128 huge pages (256 MiB on x86_64 where the default huge page size is 2 MiB):
# Allocate (requires root)
echo 128 | sudo tee /proc/sys/vm/nr_hugepages
# Verify
grep HugePages /proc/meminfo
# HugePages_Total: 128
# HugePages_Free: 128
# Make persistent across reboots by adding to /etc/sysctl.conf:
# vm.nr_hugepages = 128
Each arena consumes one huge page. If the pool runs out, obmalloc falls back to regular 4K pages transparently.
I benchmarked on an i9-14900KS, Linux 6.18.3, GCC 15.2.1 on main with nr_hugepages=128. Measured with perf stat -r 100 using cpu_core counters. GC disabled during benchmarks.
Wall-clock results:
| Benchmark | Default | Hugepages | Change |
|---|---|---|---|
| list_of_tuples (1M 3-tuples) | 0.172s | 0.121s | -29.5% |
| fragmentation (500K alloc/free/realloc) | 0.162s | 0.119s | -26.5% |
| mixed_sizes (500K, 12 size classes) | 0.141s | 0.106s | -25.1% |
| bulk_small_alloc (1M bytearrays) | 0.205s | 0.160s | -22.1% |
| class_instances (500K __slots__) | 0.120s | 0.096s | -20.0% |
| arena_pressure (10x200K objects) | 0.509s | 0.448s | -12.1% |
| random_walk (1M, shuffled access) | 0.822s | 0.759s | -7.6% |
dTLB miss reductions:
| Benchmark | dTLB Load Miss | dTLB Store Miss | Page Faults |
|---|---|---|---|
| fragmentation | -95.9% | -94.7% | -94.5% |
| random_walk | -93.1% | -98.9% | -91.6% |
| bulk_small_alloc | -91.4% | -94.5% | -93.5% |
| list_of_tuples | -88.0% | -93.7% | -94.1% |
| class_instances | -84.3% | -91.8% | -92.1% |
| mixed_sizes | -80.8% | -76.5% | -78.2% |
The perf command used per benchmark:
EVENTS="dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-load-misses,cache-misses,cache-references,instructions,cycles,page-faults"
perf stat -r 10 -e "$EVENTS" ./python bench_obmalloc.py fragmentation

bench_obmalloc.py:
import sys, gc

def bench_small_object_churn():
    objs = []
    for _ in range(200_000): objs.append(bytearray(64))
    for _ in range(200_000): objs.append(bytearray(64)); objs.pop(0)

def bench_bulk_small_alloc():
    objs = [bytearray(48) for _ in range(1_000_000)]
    for o in objs: o[0] = 1

def bench_dict_churn():
    for _ in range(500_000): d = {"a": 1, "b": 2, "c": 3, "d": 4}; del d

def bench_mixed_sizes():
    sizes = [8, 16, 24, 32, 48, 64, 96, 128, 192, 256, 384, 512]
    objs = [bytearray(sizes[i % 12]) for i in range(500_000)]

def bench_fragmentation():
    objs = [bytearray(128) for _ in range(500_000)]
    for i in range(0, len(objs), 2): objs[i] = None
    for i in range(0, len(objs), 2): objs[i] = bytearray(128)

def bench_list_of_tuples():
    objs = [(i, i+1, i+2) for i in range(1_000_000)]

def bench_class_instances():
    class Pt:
        __slots__ = ('x', 'y', 'z')
        def __init__(s, x, y, z): s.x = x; s.y = y; s.z = z
    objs = [Pt(i, i+1, i+2) for i in range(500_000)]

def bench_arena_pressure():
    layers = [[bytearray(256) for _ in range(200_000)] for _ in range(10)]

def bench_random_walk():
    import random; random.seed(42)
    objs = [bytearray(64) for _ in range(1_000_000)]
    idx = list(range(len(objs))); random.shuffle(idx)
    for i in idx: objs[i][0] = i & 0xff

BENCHMARKS = dict(small_object_churn=bench_small_object_churn,
                  bulk_small_alloc=bench_bulk_small_alloc, dict_churn=bench_dict_churn,
                  mixed_sizes=bench_mixed_sizes, fragmentation=bench_fragmentation,
                  list_of_tuples=bench_list_of_tuples, class_instances=bench_class_instances,
                  arena_pressure=bench_arena_pressure, random_walk=bench_random_walk)

if __name__ == "__main__":
    gc.collect(); gc.disable(); BENCHMARKS[sys.argv[1]](); gc.enable()

Full reproduction:
./configure && make -j$(nproc) && cp python python_default
./configure --with-pymalloc-hugepages && make -j$(nproc) && cp python python_hugepages
echo 128 | sudo tee /proc/sys/vm/nr_hugepages
EVENTS="dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses,iTLB-load-misses,cache-misses,cache-references,instructions,cycles,page-faults"
for b in bulk_small_alloc mixed_sizes fragmentation list_of_tuples class_instances arena_pressure random_walk; do
echo "=== $b ==="
perf stat -r 10 -e "$EVENTS" ./python_default bench_obmalloc.py "$b"
perf stat -r 10 -e "$EVENTS" ./python_hugepages bench_obmalloc.py "$b"
done