feat: add plan_splits function for distributed compute by hamersaw · Pull Request #5863 · lance-format/lance

hamersaw · 2026-01-30T21:16:45Z

Adding a plan_splits function to the Scanner to provide a single solution for partitioning a Lance dataset for efficient distributed compute. This function (1) filters the dataset (using index lookups / deletion vectors), producing a mapping of fragment IDs to valid row ranges, and (2) bin packs these fragment row ranges into "splits" that target a configurable partition size (in row count or bytes).
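To make step (2) concrete, here is a minimal Python sketch of greedy bin packing over a fragment-ID-to-row-ranges mapping. It is an illustration only, not the Rust implementation in this PR: `SplitSketch` and `bin_pack` are hypothetical names, ranges are assumed to be half-open `(start, end)` pairs, and only a row-count target is modelled.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SplitSketch:
    """Hypothetical stand-in for a split: fragment ID -> [start, end) row ranges."""

    ranges: dict[int, list[tuple[int, int]]] = field(default_factory=dict)
    row_count: int = 0


def bin_pack(fragment_ranges, max_row_count):
    """Greedily pack per-fragment row ranges into splits of at most max_row_count rows.

    A range that does not fit in the current split is cut, and the remainder
    spills into the next split.
    """
    if max_row_count <= 0:
        raise ValueError("max_row_count must be positive")
    splits = []
    current = SplitSketch()
    for frag_id in sorted(fragment_ranges):  # deterministic fragment order
        for start, end in fragment_ranges[frag_id]:
            while start < end:
                # Take as many rows as still fit in the current split.
                take = min(end - start, max_row_count - current.row_count)
                current.ranges.setdefault(frag_id, []).append((start, start + take))
                current.row_count += take
                start += take
                if current.row_count == max_row_count:
                    splits.append(current)
                    current = SplitSketch()
    if current.row_count > 0:
        splits.append(current)
    return splits
```

With this sketch, a split may span several fragments, and a single fragment may be spread over several splits; whether the real implementation cuts ranges the same way is not stated in the description.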

The most important areas to agree on:

We explicitly disable refine filters when injecting the FilteredReadExec. This is because all refine filters are provided within the pre-computed plan.
Within the FilteredReadExec we maintain the plan when with_children is called. The problem is that if DataFusion attempts an optimization, it can clear the plan that we explicitly set. Since this flow injects the FilteredReadExec as a leaf node, it should be OK to copy the plan over.

Example usage through the Python API:

import lance
from lance import Split
import pyarrow as pa

# insert initial table and create index
dataset = lance.write_dataset(
    pa.Table.from_pylist([{"id": 1, "name": "Alice", "age": 20, "weight": 130.5},
                          {"id": 2, "name": "Bob", "age": 30, "weight": 180.0},
                          {"id": 3, "name": "David", "age": 42, "weight": 200.2}]),
    "memory://lance.test")

dataset.insert(
    pa.Table.from_pylist([{"id": 4, "name": "Ricky", "age": 22, "weight": 150.0},
                          {"id": 5, "name": "Carl", "age": 29, "weight": 120.3}],
    ))

dataset.create_scalar_index(
    column="age",
    index_type="BTREE"
)

# insert more data (unindexed)
dataset.insert(
    pa.Table.from_pylist([{"id": 6, "name": "Carla", "age": 37, "weight": 150.0},
                          {"id": 7, "name": "Eve", "age": 29, "weight": 120.3}],
    ))


scanner = dataset.scanner(columns=["weight", "_rowid", "name"], filter="age >= 30 AND weight <= 200.0")

# evaluate splits
splits = scanner.plan_splits(max_row_count=2)
for split in splits:
    # serialize and deserialize split
    split_bytes = split.to_bytes()
    new_split = Split.from_bytes(split_bytes, dataset._ds)

    # read split data
    scanner = dataset.scanner(columns=new_split.output_columns).with_filtered_read_exec(new_split.filtered_read_exec)
    reader = scanner.to_reader()

    table = reader.read_all()
    print(table.to_pydict())

- Add FilteredReadPlan struct using RowAddrTreeMap for row selection
- Add get_or_create_plan API for lazy plan computation via OnceCell
- Support providing pre-computed plan to FilteredReadExec::try_new
- Centralize plan creation in get_or_create_plan_impl
- Make RowAddrSelection public in lance-core

- Add FilteredReadInternalPlan (private) using BTreeMap<u32, Vec<Range<u64>>> for efficient local execution without bitmap conversion
- Keep FilteredReadPlan (public) using RowAddrTreeMap for distributed execution
- Local path: plan_scan() → internal plan → ScopedFragmentRead (zero conversions)
- External API: get_or_create_plan() converts internal → external once
- with_plan() converts external → internal for distributed workers
- Add bitmap_to_ranges() utility in lance-core for efficient bitmap conversion
- Use BTreeMap for rows to maintain deterministic fragment order

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
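The bitmap-to-ranges conversion mentioned in this commit message can be illustrated with a short Python sketch. This is not the lance-core `bitmap_to_ranges()` utility itself (its Rust signature is not shown here); it assumes the bitmap is given as an iterable of row offsets and that ranges are half-open `(start, end)` pairs.

```python
def bitmap_to_ranges(row_offsets):
    """Collapse a set of row offsets into sorted, half-open (start, end) ranges."""
    ranges = []
    for offset in sorted(set(row_offsets)):
        if ranges and ranges[-1][1] == offset:
            # The offset directly follows the last range, so extend it by one row.
            ranges[-1] = (ranges[-1][0], offset + 1)
        else:
            # A gap was found, so start a new single-row range.
            ranges.append((offset, offset + 1))
    return ranges


print(bitmap_to_ranges([0, 1, 2, 5, 7, 8]))  # [(0, 3), (5, 6), (7, 9)]
```

Contiguous runs collapse into a single range, which is why a range-based internal plan can be cheaper to execute locally than one that is expanded row by row.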

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

codecov · 2026-01-30T22:38:26Z

Codecov Report

❌ Patch coverage is 93.34187% with 52 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/scanner.rs	92.49%	22 Missing and 16 partials ⚠️
rust/lance/src/dataset/split.rs	94.81%	10 Missing and 4 partials ⚠️

rust/lance/src/dataset/scanner.rs

jackye1995 · 2026-02-11T07:55:01Z

rust/lance/src/dataset/scanner.rs

We should just let SplitPlanningOptions implement Default, instead of setting it inline.

I think the semantics of this make Default not work exactly how I envisioned it. Basically, the user can set max_size_bytes and/or max_row_count (if both are set, the minimum is used). However, if the user sets neither, we fall back to a default. If we use Default to set this and the user only wants to limit the maximum number of rows, they will need to unset max_size_bytes.

We could update this to some kind of build logic? For example, splitting_options.with_max_rows(N).build() would let us check whether neither is set and apply the default there. I'm not convinced this is any cleaner / clearer than the current logic, because we still need a fallback in the plan_splits code if neither is set (a default value or a failure).
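The semantics described in this thread can be sketched in Python as follows. This is only an illustration of the "use the minimum if both are set, fall back to a default if neither is set" rule; the `bytes_per_row` estimate and the `DEFAULT_MAX_ROW_COUNT` value are assumptions made for the example and are not taken from the PR.

```python
# Hypothetical fallback value; the actual default used by plan_splits is not stated here.
DEFAULT_MAX_ROW_COUNT = 1_000_000


def max_rows_per_split(max_row_count=None, max_size_bytes=None, bytes_per_row=None):
    """Resolve the effective row limit for a split from the optional user settings."""
    limits = []
    if max_row_count is not None:
        limits.append(max_row_count)
    if max_size_bytes is not None:
        if bytes_per_row is None or bytes_per_row <= 0:
            raise ValueError("bytes_per_row is required to apply max_size_bytes")
        # Convert the byte budget to rows, keeping at least one row per split.
        limits.append(max(1, max_size_bytes // bytes_per_row))
    if not limits:
        # Neither option was set, so fall back to the default.
        return DEFAULT_MAX_ROW_COUNT
    # If both options were set, the stricter (smaller) limit wins.
    return min(limits)
```

Keeping both options as "unset by default" is what lets the function tell "the user set nothing" apart from "the user set only one limit", which is the concern raised about implementing Default.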

rust/lance/src/dataset/scanner.rs

java/src/main/java/org/lance/ipc/FilteredReadPlan.java

rust/lance/src/io/exec/filtered_read.rs

LuQQiu and others added 5 commits January 29, 2026 14:23

fix: remove redundant clone in test

05b9bf6

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

small fix

20dacb8

initial commit

884fe00

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

github-actions bot added the enhancement (New feature or request) label Jan 30, 2026

hamersaw mentioned this pull request Jan 30, 2026

feat: add scanner.plan_splits function #5792

Closed

hamersaw added 4 commits January 30, 2026 15:19

working for filterable scans

43dbf3d

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

Merge branch 'main' into feature/plan-splits

dc80d78

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

removed dead code

d660136

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

added rough python bindings for testing

2be3da2

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

github-actions bot added the python label Jan 30, 2026

hamersaw added 3 commits February 3, 2026 15:51

working e2e

b95f631

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

using an enum for Splits that allows us to add FTS and vector specific implementations in the future

e204900

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

adding java bindings

eb33d51

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

github-actions bot added the java label Feb 4, 2026

made java FilteredReadPlan serializable

0d6d6d1

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

hamersaw mentioned this pull request Feb 4, 2026

feat: using scanner.planSplits to prune fragments / rows and bin pack spark partitions lance-format/lance-spark#202

Draft

hamersaw added 5 commits February 5, 2026 13:50

adding unit tests

abf6764

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

added bin_pack unit tests

4902d16

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

docs updates

5dbdac0

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

hopefully python docs correct

860d3c8

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

Merge remote-tracking branch 'upstream/main' into feature/plan-splits

4003156

hamersaw marked this pull request as ready for review February 5, 2026 21:33

jackye1995 reviewed Feb 11, 2026

hamersaw mentioned this pull request Feb 16, 2026

feat: add proto serialization for FilteredReadExec #5954

Merged

hamersaw added 3 commits February 16, 2026 15:17

SplitOptions -> SplittingOptions

80067ed

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

moved max_rows_per_split computation to split module

a3efb5e

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

moved all split structs to split module

7e5fd85

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

LuQQiu reviewed Feb 16, 2026

rust/lance/src/io/exec/filtered_read.rs (outdated)

hamersaw added 15 commits February 16, 2026 20:53

reverted FilteredReadPlan to bitmap internal

bd186ce

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

added a with_filtered_read_plan function that executes the regular scan flow rather than creating a new execution path

7da204d

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

removed execute_filtered_read_plan function from scanner

2664fa3

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

filled out splits unit tests

096a8d3

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

Merge remote-tracking branch 'upstream/main' into feature/plan-splits

503f25c

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

updated Splits to Split

e5f6e15

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

implemented split proto serializing

267f9a1

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

correctly add and position output columns in python and java APIs

d2c848d

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

remove fragment option for split

cabd2cd

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

added serialize / deserialize Split from python / java

86a8b9f

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

fixed java tests

0f806af

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

throwing not supported if offset on scanner

46ef0c5

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

Merge remote-tracking branch 'upstream/main' into feature/plan-splits

8a6ab24

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

Revert "throwing not supported if offset on scanner"

8fa1c52

This reverts commit 46ef0c5.

lint

0adcfe6

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

hamersaw requested review from LuQQiu and jackye1995 February 18, 2026 10:00

hamersaw wants to merge 36 commits into lance-format:main from hamersaw:feature/plan-splits

3 participants
