
Conversation

Tomatokeftes commented Jan 27, 2026

Summary

This PR adds support for lazy loading of tables in SpatialData using anndata's experimental read_lazy() function.

Motivation

Currently, all elements in SpatialData (images, labels, points) are loaded lazily using Dask, except for tables, which are always loaded eagerly into memory. For large datasets, particularly Mass Spectrometry Imaging (MSI) data, where tables can contain millions of pixels with hundreds of thousands of m/z bins, this creates a memory bottleneck.

Changes

Core lazy loading support:

  • Add lazy: bool = False parameter to SpatialData.read() and read_zarr()
  • Add lazy: bool = False parameter to _read_table() in io_table.py
  • Use anndata.experimental.read_lazy() when lazy=True
  • Add _is_lazy_anndata() helper function to detect lazy AnnData objects
  • Modify validation to skip eager checks for lazy tables (otherwise validation would pull the whole table into memory and defeat lazy loading)
  • Add a fallback with a warning when the installed anndata version doesn't support read_lazy (see the sketch after this list)
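
A minimal sketch of how these pieces could fit together (illustrative only: _read_table_lazy is a hypothetical wrapper name, not the PR's exact code, and the _is_lazy_anndata body is a duck-typing heuristic inferred from the Dataset2D behavior described under the query fixes):

import warnings

import pandas as pd
from anndata import AnnData, read_zarr

def _is_lazy_anndata(adata: AnnData) -> bool:
    # Heuristic sketch: lazy AnnData objects expose obs as a Dataset2D
    # rather than a pandas DataFrame.
    return not isinstance(adata.obs, pd.DataFrame)

def _read_table_lazy(path: str, lazy: bool = False) -> AnnData:
    # Hypothetical standalone wrapper; the PR's actual _read_table does more.
    if not lazy:
        return read_zarr(path)
    try:
        from anndata.experimental import read_lazy
    except ImportError:
        warnings.warn(
            "Installed anndata does not provide experimental.read_lazy; "
            "falling back to eager loading (requires anndata >= 0.12).",
            UserWarning,
            stacklevel=2,
        )
        return read_zarr(path)
    return read_lazy(path)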

Query API compatibility fixes:

  • Fix _filter_table_by_element_names() to handle lazy Dataset2D obs
  • Fix _filter_table_by_elements() to handle lazy Dataset2D obs
  • Fix get_values() to handle lazy Dataset2D obs
  • Fix _inplace_fix_subset_categorical_obs() to handle lazy tables

The query fixes ensure that bounding_box_query, aggregate, and other APIs work correctly with lazy-loaded tables. The issue was that pd.DataFrame(table.obs) does not correctly convert a lazy Dataset2D and instead produces a malformed DataFrame. The fix uses table.obs.to_memory() for lazy tables, as sketched below.
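
The conversion pattern might look like the following (a sketch; _obs_to_dataframe is a hypothetical helper name, and Dataset2D.to_memory() returning a pandas DataFrame is taken from the PR description):

import pandas as pd

def _obs_to_dataframe(table) -> pd.DataFrame:
    # Eager tables: obs is already a pandas DataFrame, so a plain copy works.
    if isinstance(table.obs, pd.DataFrame):
        return pd.DataFrame(table.obs)
    # Lazy tables: obs is a Dataset2D; to_memory() materializes it while
    # preserving all column data.
    return table.obs.to_memory()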

Usage

from spatialdata import SpatialData

# Load tables lazily (keeps large tables out of memory)
sdata = SpatialData.read("large_dataset.zarr", lazy=True)

# Access table - data is loaded on-demand
table = sdata.tables["my_table"]
# table.X is now backed by Dask/Zarr, not loaded into memory

# Query APIs work with lazy tables
from spatialdata import bounding_box_query
result = bounding_box_query(sdata, min_coordinate=[0, 0], max_coordinate=[100, 100], 
                            axes=("x", "y"), target_coordinate_system="global")

Benchmark Results

Test configuration: 100,000 pixels x 100,000 m/z bins, 3,000 peaks/pixel (~296M non-zeros)

Metric    Lazy Loading    Eager Loading    Improvement
Memory    15.4 MB         2,270.7 MB       99% savings
Time      0.13s           1.57s            12x faster

Reproducible Example

import numpy as np
from scipy import sparse
import anndata as ad
import psutil
import tempfile
from pathlib import Path

# Create synthetic sparse data (100k pixels x 100k m/z bins, 3000 peaks/pixel)
rng = np.random.default_rng(42)
n_pixels, n_mz, peaks_per_pixel = 100000, 100000, 3000
nnz = int(n_pixels * peaks_per_pixel)

X = sparse.csc_matrix(
    (rng.lognormal(7, 1.5, nnz).astype(np.float32),
     (rng.integers(0, n_pixels, nnz), rng.integers(0, n_mz, nnz))),
    shape=(n_pixels, n_mz)
)

# Create and write AnnData
adata = ad.AnnData(X=X)
adata.obs_names = [f"pixel_{i}" for i in range(n_pixels)]
adata.var_names = [f"mz_{i}" for i in range(n_mz)]

zarr_path = Path(tempfile.mkdtemp()) / "test.zarr"
adata.write_zarr(str(zarr_path))

# Compare lazy vs eager loading
from anndata.experimental import read_lazy
from anndata import read_zarr

def get_mem():
    return psutil.Process().memory_info().rss / 1e6

mem_before = get_mem()
adata_lazy = read_lazy(str(zarr_path))
print(f"Lazy:  +{get_mem() - mem_before:.1f} MB")

mem_before = get_mem()
adata_eager = read_zarr(str(zarr_path))
print(f"Eager: +{get_mem() - mem_before:.1f} MB")

Requirements

  • Requires anndata >= 0.12 for lazy loading support
  • Falls back to eager loading with a warning if anndata version is older

Real-world use case

This feature was developed for Thyra, a Mass Spectrometry Imaging converter. MSI datasets can have:

  • Millions of pixels (observations)
  • Hundreds of thousands of m/z bins (variables)
  • Resulting in tables that exceed available RAM

With lazy loading, users can work with these datasets without loading the full table into memory.
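
For instance, a downstream workflow might slice a lazy table and materialize only the block it needs. A minimal sketch, assuming this PR's lazy flag and that table.X is exposed as a Dask array (per the usage notes above):

import dask.array as da
from spatialdata import SpatialData

sdata = SpatialData.read("large_dataset.zarr", lazy=True)
table = sdata.tables["my_table"]

block = table.X[:1_000, :]      # still lazy: no chunks are read yet
if isinstance(block, da.Array):
    block = block.compute()     # reads only the chunks covering the slice
print(type(block))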

Test plan

  • test_lazy_read_basic - Verify lazy=True creates a SpatialData object without errors
  • test_lazy_false_loads_normally - Verify lazy=False maintains current behavior
  • test_read_zarr_lazy_parameter - Verify lazy parameter is passed through correctly
  • All 29 relational query tests pass (verifies query API fixes)
  • All 277 spatial query tests pass (verifies bounding_box_query works)
  • Manual testing: lazy_loading, table_ops, query, aggregate, and compute all work

Tomatokeftes and others added 2 commits January 27, 2026 11:58
Add a `lazy` parameter to `SpatialData.read()` and `read_zarr()` that enables
lazy loading of tables using anndata's experimental `read_lazy()` function.

This is particularly useful for large datasets (e.g., Mass Spectrometry Imaging
with millions of pixels) where loading tables into memory is not feasible.

Changes:
- Add `lazy: bool = False` parameter to `read_zarr()` in io_zarr.py
- Add `lazy: bool = False` parameter to `_read_table()` in io_table.py
- Add `lazy: bool = False` parameter to `SpatialData.read()` in spatialdata.py
- Add `_is_lazy_anndata()` helper to detect lazy AnnData objects
- Skip eager validation for lazy tables to preserve lazy loading benefits
- Add tests for lazy loading functionality

Requires anndata >= 0.12 for lazy loading support. Falls back to eager loading
with a warning if anndata version does not support read_lazy.
Tomatokeftes marked this pull request as draft January 27, 2026 11:03
- Simplify if/return pattern in _is_lazy_anndata (SIM103)
- Add missing TableModel import in test fixture (F821)
- Use modern np.random.Generator instead of np.random.rand (NPY002)
codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 84.09091% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.14%. Comparing base (ef88b5c) to head (e182dde).

Files with missing lines                          Patch %   Lines
src/spatialdata/_io/io_table.py                   63.63%    4 Missing ⚠️
src/spatialdata/_core/query/relational_query.py   66.66%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1055      +/-   ##
==========================================
- Coverage   92.21%   92.14%   -0.08%     
==========================================
  Files          49       49              
  Lines        7593     7614      +21     
==========================================
+ Hits         7002     7016      +14     
- Misses        591      598       +7     
Files with missing lines                          Coverage Δ
src/spatialdata/_core/spatialdata.py              91.96% <100.00%> (ø)
src/spatialdata/_io/io_zarr.py                    92.38% <100.00%> (+0.07%) ⬆️
src/spatialdata/_utils.py                         84.61% <100.00%> (+0.10%) ⬆️
src/spatialdata/models/models.py                  88.70% <100.00%> (+0.10%) ⬆️
src/spatialdata/_core/query/relational_query.py   91.08% <66.66%> (-0.54%) ⬇️
src/spatialdata/_io/io_table.py                   83.67% <63.63%> (-6.58%) ⬇️
Tomatokeftes marked this pull request as ready for review January 27, 2026 13:01
When lazy AnnData objects (from anndata.experimental.read_lazy) are
subset, their obs attribute is a Dataset2D object, not a pandas
DataFrame. Using pd.DataFrame(table.obs) produces a malformed DataFrame.

This fix uses table.obs.to_memory() for lazy tables to properly convert
Dataset2D to DataFrame while preserving all column data.

Files modified:
- relational_query.py: _filter_table_by_element_names,
  _filter_table_by_elements, get_values
- _utils.py: _inplace_fix_subset_categorical_obs