
Add NumPy-aware default hashing, regression test, and xxhash benchmark #337

Draft
shaypal5 wants to merge 6 commits into master from codex/add-test-and-benchmark-for-numpy-array-performance


shaypal5 (Member) commented Feb 17, 2026

Motivation

  • Closes issue #43 (Numpy array default hasher).
  • Large NumPy ndarrays were being hashed via generic pickle serialization, which is inefficient and can be incorrect for array content comparison at scale. This change hashes array content deterministically and efficiently, without importing NumPy at module import time.
  • Provide a regression test to ensure equal arrays produce cache hits and content-changed arrays produce misses for both memory and pickle backends.
  • Provide a simple benchmark to compare the current default hasher against an xxhash-based reference on very large arrays to validate performance tradeoffs.

Description

  • Implement NumPy-aware hashing helpers in src/cachier/config.py: _is_numpy_array, _hash_numpy_array, and _update_hash_for_value, and replace the old pickle+SHA256 approach with an incremental blake2b-based default hasher (_default_hash_func) that treats ndarray metadata and raw bytes specially.
  • Add tests/test_numpy_hash.py which verifies cache hits for identical large arrays and misses when array content changes, parametrized across memory and pickle backends.
  • Add scripts/benchmark_numpy_hash.py which benchmarks the default hasher vs an xxhash reference implementation on configurable large NumPy arrays and prints median timings and the ratio.
  • Keep NumPy import lazy/dynamic so missing optional dependencies do not break import-time behavior.

Testing

  • Ran the regression test with pytest -q tests/test_numpy_hash.py, result: 2 passed.
  • Ran linting with ruff check src/cachier/config.py tests/test_numpy_hash.py scripts/benchmark_numpy_hash.py, result: all checks passed.
  • Ran static typing with mypy src/cachier/config.py, result: no issues.
  • Ran the benchmark with python scripts/benchmark_numpy_hash.py --elements 10000000 --runs 5 after installing xxhash, result: cachier_default median 0.326273s, xxhash_reference median 0.229056s, ratio 1.42x (default divided by reference, so the default hasher took about 1.42 times as long as the xxhash reference). The benchmark requires numpy and xxhash in the environment.
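
For readers who want to see how a median-and-ratio comparison like the one reported above can be computed, here is a small sketch. It is not scripts/benchmark_numpy_hash.py: the function names, the 1 MB byte payload (used instead of a NumPy array so the example needs only the standard library), and the SHA-256 fallback when xxhash is not installed are all assumptions for illustration.

```python
import hashlib
import statistics
import time

try:
    import xxhash  # optional third-party dependency

    def reference_hex(payload):
        return xxhash.xxh64(payload).hexdigest()
except ImportError:
    def reference_hex(payload):
        # Fallback stand-in when xxhash is missing; NOT comparable to xxhash speed.
        return hashlib.sha256(payload).hexdigest()


def blake2b_hex(payload):
    """Hash the payload with blake2b, the family used by the new default hasher."""
    return hashlib.blake2b(payload).hexdigest()


def median_seconds(hash_func, payload, runs=5):
    """Time hash_func(payload) `runs` times and return the median wall-clock time."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        hash_func(payload)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


payload = bytes(1_000_000)  # small stand-in for a large array's raw bytes
default_median = median_seconds(blake2b_hex, payload)
reference_median = median_seconds(reference_hex, payload)
print(f"default median:   {default_median:.6f}s")
print(f"reference median: {reference_median:.6f}s")

# The ratio in the PR description follows directly from the two reported medians.
reported_ratio = 0.326273 / 0.229056
print(f"reported ratio: {reported_ratio:.2f}x")
```

The median is used rather than the mean so that a single slow run (for example, a cold cache on the first iteration) does not skew the comparison.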

Codex Task

shaypal5 marked this pull request as draft February 18, 2026 22:46
