Add NumPy-aware default hashing, regression test, and xxhash benchmark #337
Draft
Motivation
- … `memory` and `pickle` backends.
- … `xxhash`-based reference on very large arrays to validate performance tradeoffs.

Description
- In `src/cachier/config.py`, add `_is_numpy_array`, `_hash_numpy_array`, and `_update_hash_for_value`, and replace the old pickle+SHA256 approach with an incremental `blake2b`-based default hasher (`_default_hash_func`) that treats ndarray metadata and raw bytes specially.
- Add `tests/test_numpy_hash.py`, which verifies cache hits for identical large arrays and misses when array content changes, parametrized across the `memory` and `pickle` backends.
- Add `scripts/benchmark_numpy_hash.py`, which benchmarks the default hasher against an `xxhash` reference implementation on configurable large NumPy arrays and prints median timings and the ratio.

Testing
- `pytest -q tests/test_numpy_hash.py`, result: `2 passed`.
- `ruff check src/cachier/config.py tests/test_numpy_hash.py scripts/benchmark_numpy_hash.py`, result: all checks passed.
- `mypy src/cachier/config.py`, result: no issues.
- `python scripts/benchmark_numpy_hash.py --elements 10000000 --runs 5` after installing `xxhash`, result: `cachier_default median: 0.326273s`, `xxhash_reference median: 0.229056s`, ratio `1.42x` (benchmark succeeded; requires `numpy` and `xxhash` in the environment).
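
Illustrative sketch

For reviewers who want a picture of the hashing approach named in the Description, here is a minimal sketch of an incremental `blake2b` hasher that feeds ndarray metadata (dtype, shape) and raw bytes into the digest instead of pickling the whole array. It is not the code in `src/cachier/config.py`: the names `sketch_update_hash` and `sketch_hash_call`, the tag bytes, the 32-byte digest size, and the pickle fallback for non-array values are all assumptions made for illustration, and the sketch only targets plain numeric dtypes.

```python
import hashlib
import pickle

import numpy as np


def sketch_update_hash(hasher, value):
    """Feed one value into a running hash (hypothetical helper, not the PR's code)."""
    if isinstance(value, np.ndarray):
        # Metadata goes in first, so arrays that share the same bytes but
        # differ in dtype or shape produce different digests.
        hasher.update(b"ndarray|")
        hasher.update(value.dtype.str.encode("utf-8"))
        hasher.update(b"|")
        hasher.update(repr(value.shape).encode("utf-8"))
        hasher.update(b"|")
        # tobytes() returns the elements in C order regardless of memory
        # layout. It copies the data; a zero-copy buffer view would be a
        # possible optimization that this sketch does not attempt.
        hasher.update(value.tobytes())
    elif isinstance(value, (list, tuple)):
        # Recurse into containers so nested arrays take the fast path too.
        hasher.update(type(value).__name__.encode("utf-8"))
        hasher.update(b"|%d|" % len(value))
        for item in value:
            sketch_update_hash(hasher, item)
    else:
        # Assumed fallback: pickle anything that is not an array or sequence.
        hasher.update(b"pickle|")
        hasher.update(pickle.dumps(value, protocol=4))


def sketch_hash_call(args, kwargs):
    """Return a hex digest for a call's arguments (hypothetical entry point)."""
    hasher = hashlib.blake2b(digest_size=32)
    sketch_update_hash(hasher, tuple(args))
    # Sort keyword arguments by name so their order does not affect the key.
    # Names are unique, so the sort never has to compare the values.
    sketch_update_hash(hasher, sorted(kwargs.items()))
    return hasher.hexdigest()


if __name__ == "__main__":
    big = np.arange(1_000_000, dtype=np.float64)
    same = big.copy()
    changed = big.copy()
    changed[-1] += 1.0
    print(sketch_hash_call((big,), {}) == sketch_hash_call((same,), {}))
    print(sketch_hash_call((big,), {}) == sketch_hash_call((changed,), {}))
```

Hashing dtype and shape ahead of the bytes is what keeps, for example, a `float64` array and its `int64` view from colliding, which is the kind of case the regression test's hit/miss checks are meant to guard.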