Description
The zero check in the encoding path is less than ideal
zarr-python/src/zarr/core/codec_pipeline.py, line 416 at b873691:
    if not chunk_spec.config.write_empty_chunks and chunk_array.all_equal(
For lazy arrays, it eventually calls:
- np.broadcast_arrays
- np.array_equal
where the latter can instantiate a full array in memory.
The problem is that some lazy array implementations will not keep this intermediate, fully materialized array in memory for the operation two lines further down that eventually writes the array.
So if the instantiation of the lazy array is costly, we are:
- Instantiating it once to check for zeros.
- Instantiating it a second time to write it to the store.
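To make the double cost concrete, here is a minimal, self-contained sketch (a toy ExpensiveLazyArray of my own, not real zarr or dask code) in which the fill-value check and the subsequent write each trigger a separate full materialization:

import numpy as np

class ExpensiveLazyArray:
    """Toy stand-in for a lazy backend: every conversion to a numpy
    array recomputes the full chunk from scratch."""

    def __init__(self, shape, dtype=np.float64):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.materializations = 0

    def __array__(self, dtype=None, copy=None):
        self.materializations += 1  # count each full instantiation
        return np.zeros(self.shape, dtype=dtype or self.dtype)


lazy = ExpensiveLazyArray((1024, 1024))

# 1) The fill-value check: roughly what all_equal() ends up doing.
data, fill = np.broadcast_arrays(np.asarray(lazy), 0)
is_empty = np.array_equal(data, fill)

# 2) The write path materializes the chunk again, because the array
#    built for the check was not kept anywhere.
chunk_bytes = np.asarray(lazy).tobytes()

print(is_empty, lazy.materializations)  # True 2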
I would love to avoid this 'double cost' in a "pythonic way".
I would have loved to suggest refactoring the "zero check" to use np.array_equiv (https://numpy.org/devdocs/reference/generated/numpy.array_equiv.html#numpy.array_equiv):
>>> import numpy as np
>>> np.array_equiv(np.zeros((3, 4)), 0)
True
but looking at the source, it seems to recreate the same problem: https://github.com/numpy/numpy/blob/main/numpy/_core/numeric.py#L2552-L2598
I'm not sure what the right solution is.
One could cache the local array.
I notice that the original authors already considered this operation inefficient, leaving a note like:
# use array_equal to obtain equal_nan=True functionality
# Since fill-value is a scalar, isn't there a faster path than allocating a new array for fill value
# every single time we have to write data?
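On that point, for a scalar fill value the extra allocation could be avoided by comparing against the scalar directly. A minimal sketch of such a fast path, assuming the fill value is a plain numeric scalar (all_equal_scalar is a hypothetical helper, not part of NDBuffer; datetime and object dtypes are ignored for brevity):

import numpy as np

def all_equal_scalar(arr, fill, equal_nan: bool = True) -> bool:
    """Check whether every element equals a scalar fill value without
    broadcasting the fill value into a full second array; only a
    boolean comparison mask is allocated."""
    arr = np.asarray(arr)
    if equal_nan and arr.dtype.kind in "fc" and np.isnan(fill):
        # NaN != NaN under ==, so a NaN fill only matches all-NaN data.
        return bool(np.isnan(arr).all())
    return bool((arr == fill).all())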
Where this is a problem for me:
The numpy broadcast operation is actually really difficult to implement efficiently if you want high performance that respects the chunking of the array. My experience with dask from 7 years ago reminds me that efficient rechunking was hard for them to implement as well.
So implementing np.broadcast_arrays for a lazy backend just feels like a lot of work...
zarr-python/src/zarr/core/buffer/core.py, line 543 at b873691:
    _data, other = np.broadcast_arrays(self._data, other)
The zero check feels like it makes sense... but it is "a lot of work" for some array backends.
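As a stopgap for backends where the check is too expensive, it can also simply be switched off, since it is gated by the write_empty_chunks flag visible in the codec_pipeline.py snippet above. A hedged sketch, assuming zarr-python 3's create_array accepts write_empty_chunks through its config parameter (the store path and array parameters here are made up for illustration):

import numpy as np
import zarr

# Assumption: write_empty_chunks=True in the array config disables the
# fill-value comparison in the codec pipeline, so every chunk is written
# and a lazy source only has to be materialized once, at write time.
z = zarr.create_array(
    store="data/example.zarr",
    shape=(1024, 1024),
    chunks=(256, 256),
    dtype="float64",
    config={"write_empty_chunks": True},
)
z[:] = np.zeros((1024, 1024))  # no all_equal() check before writing

The trade-off is that genuinely empty chunks are then stored too, so this only pays off when chunks are rarely empty or when materializing the source dominates the cost.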
Suggested patch:
diff --git a/src/zarr/core/buffer/core.py b/src/zarr/core/buffer/core.py
index f0d01566..54a2c09f 100644
--- a/src/zarr/core/buffer/core.py
+++ b/src/zarr/core/buffer/core.py
@@ -540,6 +540,11 @@ class NDBuffer:
         # use array_equal to obtain equal_nan=True functionality
         # Since fill-value is a scalar, isn't there a faster path than allocating a new array for fill value
         # every single time we have to write data?
+
+        # The array_equal operation below will effectively force the array
+        # into memory, so we cache the materialized result here and reuse it
+        # for the write instead of instantiating the lazy array a second time.
+        self._data = np.asarray(self._data)
         _data, other = np.broadcast_arrays(self._data, other)
         return np.array_equal(
             self._data,