feat: Add lazy table loading via anndata.experimental.read_lazy #1055
Summary
This PR adds support for lazy loading of tables in SpatialData using anndata's experimental `read_lazy()` function.
Motivation
Currently, all elements in SpatialData (images, labels, points) are loaded lazily using Dask, except for tables, which are always loaded into memory. For large datasets, particularly Mass Spectrometry Imaging (MSI) data where tables can contain millions of pixels with hundreds of thousands of m/z bins, this creates memory bottlenecks.
Changes
Core lazy loading support:
- `lazy: bool = False` parameter to `SpatialData.read()` and `read_zarr()`
- `lazy: bool = False` parameter to `_read_table()` in io_table.py
- `anndata.experimental.read_lazy()` when `lazy=True`
- `_is_lazy_anndata()` helper function to detect lazy AnnData objects
- `read_lazy`
Query API compatibility fixes:
- `_filter_table_by_element_names()` to handle lazy `Dataset2D` obs
- `_filter_table_by_elements()` to handle lazy `Dataset2D` obs
- `get_values()` to handle lazy `Dataset2D` obs
- `_inplace_fix_subset_categorical_obs()` to handle lazy tables
The query fixes ensure that `bounding_box_query`, `aggregate`, and other APIs work correctly with lazy-loaded tables. The issue was that `pd.DataFrame(table.obs)` doesn't correctly convert lazy `Dataset2D` objects - it produces a malformed DataFrame. The fix uses `table.obs.to_memory()` for lazy tables instead.
Usage
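A minimal sketch of how calling code might handle a table whose obs may or may not be lazy, following the `to_memory()` approach described in the query fixes. The helper name `obs_to_memory` and the stand-in objects are hypothetical and not part of this PR; the only assumption about lazy obs is that they expose a `to_memory()` method, as the description states for `Dataset2D`.

```python
from types import SimpleNamespace


def obs_to_memory(table):
    """Return table.obs as an in-memory object.

    Hypothetical helper (not from the PR): a lazy obs is assumed to expose
    to_memory(); an eager obs is already in memory and is returned unchanged.
    """
    obs = table.obs
    to_memory = getattr(obs, "to_memory", None)
    return to_memory() if callable(to_memory) else obs


class FakeLazyObs:
    """Stand-in for a lazy obs so the sketch runs without anndata installed."""

    def __init__(self, data):
        self._data = data

    def to_memory(self):
        # A real lazy obs would materialize its columns here.
        return dict(self._data)


# Stand-in tables: one with a lazy obs, one with an eager (in-memory) obs.
lazy_table = SimpleNamespace(obs=FakeLazyObs({"region": ["a", "b"]}))
eager_table = SimpleNamespace(obs={"region": ["a", "b"]})

lazy_obs = obs_to_memory(lazy_table)
eager_obs = obs_to_memory(eager_table)
```

With a table read through the `lazy=True` path, the same call would be expected to go through the lazy obs's own `to_memory()`. Checking for the method rather than a concrete type keeps the sketch independent of anndata's experimental module layout.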
Benchmark Results
Test configuration: 100,000 pixels x 100,000 m/z bins, 3,000 peaks/pixel (~296M non-zeros)
Reproducible Example
Requirements
- `anndata >= 0.12` for lazy loading support
Real-world use case
This feature was developed for Thyra, a Mass Spectrometry Imaging converter. MSI datasets can have:
With lazy loading, users can work with these datasets without loading the full table into memory.
Test plan
- `test_lazy_read_basic` - Verify `lazy=True` creates a SpatialData object without errors
- `test_lazy_false_loads_normally` - Verify `lazy=False` maintains current behavior
- `test_read_zarr_lazy_parameter` - Verify the `lazy` parameter is passed through correctly