fix: invalidate index fragment bitmaps after data replacement and stale merge#5929
fix: invalidate index fragment bitmaps after data replacement and stale merge#5929wjones127 wants to merge 4 commits intolance-format:mainfrom
Conversation
…le merge After a merge_insert that modifies indexed columns (RewriteColumns path), followed by compaction + optimize_indices, subsequent queries could fail with "non-existent fragment" errors. Two root causes: 1. optimize_indices merged stale old B-tree data: `merge_indices_with_unindexed_frags` used raw `fragment_bitmap` instead of `effective_fragment_bitmap`, carrying entries for pruned fragments. When the effective bitmap was empty (fully pruned), `index.update()` still called `combine_old_new()` which merged stale entries. Now uses effective bitmap and rebuilds from scratch when old data is fully stale. 2. DataReplacement had no index invalidation: the DataReplacement handler in `build_manifest()` didn't call `prune_updated_fields_from_indices`, so replacing an indexed column's data left the index bitmap unchanged. Fixes lance-format#5321 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code ReviewThis PR fixes index fragment bitmap corruption after P0: Potential Issue in Scalar Index Rebuild LogicIn super::scalar::build_scalar_index(
dataset.as_ref(),
column.name.as_str(),
&new_uuid.to_string(),
¶ms,
true, // include_row_id
None, // fragments (None means all)
Some(new_data_stream), // additional_data_stream
)
.await?However, the Recommendation: Verify that P1: Replaced Fields ExtractionIn let replaced_fields: Vec<u32> = new_datafiles
.first()
.map(|f| {
f.fields.iter().filter(...).collect()
})
.unwrap_or_default();This is safe given the earlier validation (lines 2037-2058) that ensures all new data files have the same fields. However, adding a comment here would make this dependency explicit. TestsThe test coverage is thorough:
Overall the fix logic is sound and addresses the reported issue correctly. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
ef90fe2 to
ed6c6b5
Compare
ed6c6b5 to
6ba3166
Compare
When `optimize_indices` calls `index.update()`, BTreeIndex's `combine_old_new()` reads all stored data including entries for invalidated fragments, causing "non-existent fragment" errors. The bitmap pruning from the previous fix only affects the bitmap, not the stored index data. Pass the valid fragment bitmap into `ScalarIndex::update()` so BTreeIndex can filter old data rows whose fragment ID (upper 32 bits of the row address) is not in the bitmap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
westonpace
left a comment
There was a problem hiding this comment.
Thanks for figuring this out. It makes me wonder if the update path is even worth the complexity it brings at this point. Btree is the only one that really makes use of it.
| /// | ||
| /// If `valid_old_fragments` is provided, old index data for fragments not in the bitmap | ||
| /// will be filtered out during the merge. |
There was a problem hiding this comment.
So this only matters for btree because it is the only index that scans the old portion of the index instead of rescanning the whole column?
There was a problem hiding this comment.
Actually, bitmap does this too. I need to fix that.
There was a problem hiding this comment.
Fixed bitmap. Other indices should be done as a follow up, but I'm not aware of a code path where they cause a bug yet. Many of them like ZoneMap and BloomFilter given inexact results.
rust/lance-index/src/scalar/btree.rs
Outdated
There was a problem hiding this comment.
Minor nit: I've been trying to slowly switch over to using string constants like BTREE_IDS_COLUMN and avoiding integral indexing just because it seems less prone to error in the future.
That's a fair point. There's probably not a huge advantage to scanning the index vs scanning the fragments. |
This fixes a bug where queries or merge insert could result in the error
this fragment does not exist in the dataset.This would happen because the index referenced row addresses that were no longer in the dataset.
Here's one way this would happen:
This PR makes sure:
This rules are described in the spec at https://lance.org/format/table/index/#loading-an-index
Fixes #5321