
Conversation

Tomatokeftes commented Jan 27, 2026

Summary

This PR adds support for lazy loading of tables in SpatialData using anndata's experimental read_lazy() function.

Motivation

Currently, all elements in SpatialData (images, labels, points) are loaded lazily using Dask, except for tables, which are always loaded eagerly into memory. For large datasets, particularly Mass Spectrometry Imaging (MSI) data, where tables can contain millions of pixels with hundreds of thousands of m/z bins, this creates a memory bottleneck.

Changes

Core lazy loading support:

  • Add lazy: bool = False parameter to SpatialData.read() and read_zarr()
  • Add lazy: bool = False parameter to _read_table() in io_table.py
  • Use anndata.experimental.read_lazy() when lazy=True
  • Add _is_lazy_anndata() helper function to detect lazy AnnData objects
  • Modify validation to skip eager checks for lazy tables (otherwise validation would pull the whole table into memory and defeat lazy loading)
  • Add a fallback with a warning when the installed anndata version doesn't support read_lazy (see the sketch after this list)
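
A minimal sketch of how these pieces could fit together (illustrative only: _read_table_lazy is a hypothetical wrapper name, not the PR's exact code, and the _is_lazy_anndata body is a duck-typing heuristic inferred from the Dataset2D behavior described under the query fixes):

import warnings

import pandas as pd
from anndata import AnnData, read_zarr

def _is_lazy_anndata(adata: AnnData) -> bool:
    # Heuristic sketch: lazy AnnData objects expose obs as a Dataset2D
    # rather than a pandas DataFrame.
    return not isinstance(adata.obs, pd.DataFrame)

def _read_table_lazy(path: str, lazy: bool = False) -> AnnData:
    # Hypothetical standalone wrapper; the PR's actual _read_table does more.
    if not lazy:
        return read_zarr(path)
    try:
        from anndata.experimental import read_lazy
    except ImportError:
        warnings.warn(
            "Installed anndata does not provide experimental.read_lazy; "
            "falling back to eager loading (requires anndata >= 0.12).",
            UserWarning,
            stacklevel=2,
        )
        return read_zarr(path)
    return read_lazy(path)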

Query API compatibility fixes:

  • Fix _filter_table_by_element_names() to handle lazy Dataset2D obs
  • Fix _filter_table_by_elements() to handle lazy Dataset2D obs
  • Fix get_values() to handle lazy Dataset2D obs
  • Fix _inplace_fix_subset_categorical_obs() to handle lazy tables

The query fixes ensure that bounding_box_query, aggregate, and other APIs work correctly with lazy-loaded tables. The issue was that pd.DataFrame(table.obs) does not correctly convert a lazy Dataset2D and instead produces a malformed DataFrame. The fix uses table.obs.to_memory() for lazy tables, as sketched below.
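
The conversion pattern might look like the following (a sketch; _obs_to_dataframe is a hypothetical helper name, and Dataset2D.to_memory() returning a pandas DataFrame is taken from the PR description):

import pandas as pd

def _obs_to_dataframe(table) -> pd.DataFrame:
    # Eager tables: obs is already a pandas DataFrame, so a plain copy works.
    if isinstance(table.obs, pd.DataFrame):
        return pd.DataFrame(table.obs)
    # Lazy tables: obs is a Dataset2D; to_memory() materializes it while
    # preserving all column data.
    return table.obs.to_memory()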

Usage

from spatialdata import SpatialData

# Load tables lazily (keeps large tables out of memory)
sdata = SpatialData.read("large_dataset.zarr", lazy=True)

# Access table - data is loaded on-demand
table = sdata.tables["my_table"]
# table.X is now backed by Dask/Zarr, not loaded into memory

# Query APIs work with lazy tables
from spatialdata import bounding_box_query
result = bounding_box_query(sdata, min_coordinate=[0, 0], max_coordinate=[100, 100], 
                            axes=("x", "y"), target_coordinate_system="global")

Benchmark Results

Test configuration: 100,000 pixels x 100,000 m/z bins, 3,000 peaks/pixel (~296M non-zeros)

Metric    Lazy Loading    Eager Loading    Improvement
Memory    15.4 MB         2,270.7 MB       99% savings
Time      0.13s           1.57s            12x faster

Reproducible Example

import numpy as np
from scipy import sparse
import anndata as ad
import psutil
import tempfile
from pathlib import Path

# Create synthetic sparse data (100k pixels x 100k m/z bins, 3000 peaks/pixel)
rng = np.random.default_rng(42)
n_pixels, n_mz, peaks_per_pixel = 100000, 100000, 3000
nnz = int(n_pixels * peaks_per_pixel)

X = sparse.csc_matrix(
    (rng.lognormal(7, 1.5, nnz).astype(np.float32),
     (rng.integers(0, n_pixels, nnz), rng.integers(0, n_mz, nnz))),
    shape=(n_pixels, n_mz)
)

# Create and write AnnData
adata = ad.AnnData(X=X)
adata.obs_names = [f"pixel_{i}" for i in range(n_pixels)]
adata.var_names = [f"mz_{i}" for i in range(n_mz)]

zarr_path = Path(tempfile.mkdtemp()) / "test.zarr"
adata.write_zarr(str(zarr_path))

# Compare lazy vs eager loading
from anndata.experimental import read_lazy
from anndata import read_zarr

def get_mem():
    return psutil.Process().memory_info().rss / 1e6

mem_before = get_mem()
adata_lazy = read_lazy(str(zarr_path))
print(f"Lazy:  +{get_mem() - mem_before:.1f} MB")

mem_before = get_mem()
adata_eager = read_zarr(str(zarr_path))
print(f"Eager: +{get_mem() - mem_before:.1f} MB")

Requirements

  • Requires anndata >= 0.12 for lazy loading support
  • Falls back to eager loading with a warning if anndata version is older

Real-world use case

This feature was developed for Thyra, a Mass Spectrometry Imaging converter. MSI datasets can have:

  • Millions of pixels (observations)
  • Hundreds of thousands of m/z bins (variables)
  • Resulting in tables that exceed available RAM

With lazy loading, users can work with these datasets without loading the full table into memory.
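
For instance, a downstream workflow might slice a lazy table and materialize only the block it needs. A minimal sketch, assuming this PR's lazy flag and that table.X is exposed as a Dask array (per the usage notes above):

import dask.array as da
from spatialdata import SpatialData

sdata = SpatialData.read("large_dataset.zarr", lazy=True)
table = sdata.tables["my_table"]

block = table.X[:1_000, :]      # still lazy: no chunks are read yet
if isinstance(block, da.Array):
    block = block.compute()     # reads only the chunks covering the slice
print(type(block))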

Test plan

  • test_lazy_read_basic - Verify lazy=True creates a SpatialData object without errors
  • test_lazy_false_loads_normally - Verify lazy=False maintains current behavior
  • test_read_zarr_lazy_parameter - Verify lazy parameter is passed through correctly
  • All 29 relational query tests pass (verifies query API fixes)
  • All 277 spatial query tests pass (verifies bounding_box_query works)
  • Manual testing: lazy_loading, table_ops, query, aggregate, and compute all work

Tomatokeftes and others added 2 commits January 27, 2026 11:58
Add a `lazy` parameter to `SpatialData.read()` and `read_zarr()` that enables
lazy loading of tables using anndata's experimental `read_lazy()` function.

This is particularly useful for large datasets (e.g., Mass Spectrometry Imaging
with millions of pixels) where loading tables into memory is not feasible.

Changes:
- Add `lazy: bool = False` parameter to `read_zarr()` in io_zarr.py
- Add `lazy: bool = False` parameter to `_read_table()` in io_table.py
- Add `lazy: bool = False` parameter to `SpatialData.read()` in spatialdata.py
- Add `_is_lazy_anndata()` helper to detect lazy AnnData objects
- Skip eager validation for lazy tables to preserve lazy loading benefits
- Add tests for lazy loading functionality

Requires anndata >= 0.12 for lazy loading support. Falls back to eager loading
with a warning if anndata version does not support read_lazy.
Tomatokeftes marked this pull request as draft January 27, 2026 11:03
- Simplify if/return pattern in _is_lazy_anndata (SIM103)
- Add missing TableModel import in test fixture (F821)
- Use modern np.random.Generator instead of np.random.rand (NPY002)
codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 84.09091% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.14%. Comparing base (ef88b5c) to head (e182dde).

Files with missing lines                          Patch %   Lines
src/spatialdata/_io/io_table.py                   63.63%    4 Missing ⚠️
src/spatialdata/_core/query/relational_query.py   66.66%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1055      +/-   ##
==========================================
- Coverage   92.21%   92.14%   -0.08%     
==========================================
  Files          49       49              
  Lines        7593     7614      +21     
==========================================
+ Hits         7002     7016      +14     
- Misses        591      598       +7     
Files with missing lines                          Coverage Δ
src/spatialdata/_core/spatialdata.py              91.96% <100.00%> (ø)
src/spatialdata/_io/io_zarr.py                    92.38% <100.00%> (+0.07%) ⬆️
src/spatialdata/_utils.py                         84.61% <100.00%> (+0.10%) ⬆️
src/spatialdata/models/models.py                  88.70% <100.00%> (+0.10%) ⬆️
src/spatialdata/_core/query/relational_query.py   91.08% <66.66%> (-0.54%) ⬇️
src/spatialdata/_io/io_table.py                   83.67% <63.63%> (-6.58%) ⬇️
Tomatokeftes marked this pull request as ready for review January 27, 2026 13:01
When lazy AnnData objects (from anndata.experimental.read_lazy) are
subset, their obs attribute is a Dataset2D object, not a pandas
DataFrame. Using pd.DataFrame(table.obs) produces a malformed DataFrame.

This fix uses table.obs.to_memory() for lazy tables to properly convert
Dataset2D to DataFrame while preserving all column data.

Files modified:
- relational_query.py: _filter_table_by_element_names,
  _filter_table_by_elements, get_values
- _utils.py: _inplace_fix_subset_categorical_obs