feat: support FTS query execution for LSM scanner #5905

Open
touch-of-grey wants to merge 1 commit into lance-format:main from touch-of-grey:LsmFTSQueryPlan

@touch-of-grey
Contributor

Based on the previous discussion, this PR separates out the FTS query plan, since it requires global BM25 statistics. @jackye1995, please take a look.

It computes global BM25 statistics and then uses the same scorer to rank results across the different inverted indexes, similar to how Lucene does it.
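
As a rough illustration of the idea (a sketch under stated assumptions, not the code in this PR): per-index statistics are summed into one global view, and a single scorer built from that view scores hits from every index. The names `IndexStats`, `merge_stats`, and `Bm25Scorer` are hypothetical, and the `k1 = 1.2`, `b = 0.75` defaults and the IDF form are the commonly cited Lucene-style BM25 choices rather than anything confirmed for Lance.

```rust
/// Corpus statistics for one inverted index.
/// Hypothetical shape for illustration, not Lance's actual type.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct IndexStats {
    pub num_docs: u64,
    pub total_tokens: u64,
}

/// Sum per-index statistics into a single global view.
pub fn merge_stats(parts: &[IndexStats]) -> IndexStats {
    parts.iter().fold(IndexStats::default(), |acc, s| IndexStats {
        num_docs: acc.num_docs + s.num_docs,
        total_tokens: acc.total_tokens + s.total_tokens,
    })
}

/// BM25 scorer built once from global statistics and shared by all indexes.
pub struct Bm25Scorer {
    k1: f32,
    b: f32,
    avg_doc_len: f32,
    num_docs: u64,
}

impl Bm25Scorer {
    pub fn new(global: IndexStats) -> Self {
        let avg_doc_len = if global.num_docs == 0 {
            0.0
        } else {
            global.total_tokens as f32 / global.num_docs as f32
        };
        // Assumed defaults; a real implementation may expose these as parameters.
        Self { k1: 1.2, b: 0.75, avg_doc_len, num_docs: global.num_docs }
    }

    /// IDF in the form ln(1 + (N - df + 0.5) / (df + 0.5)).
    /// `doc_freq` must be the global document frequency of the term.
    pub fn idf(&self, doc_freq: u64) -> f32 {
        let n = self.num_docs as f32;
        let df = doc_freq as f32;
        (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
    }

    /// Score one (term, document) pair.
    pub fn score(&self, term_freq: u32, doc_len: u32, doc_freq: u64) -> f32 {
        let tf = term_freq as f32;
        // Avoid dividing by zero when the corpus is empty.
        let len_ratio = if self.avg_doc_len > 0.0 {
            doc_len as f32 / self.avg_doc_len
        } else {
            1.0
        };
        let norm = self.k1 * (1.0 - self.b + self.b * len_ratio);
        self.idf(doc_freq) * tf * (self.k1 + 1.0) / (tf + norm)
    }
}
```

Because every index is scored with the same `Bm25Scorer`, two documents with identical term frequency and length get identical scores no matter which index they came from, which is what makes the merged ranking comparable.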

github-actions bot added the enhancement (New feature or request) label on Feb 7, 2026
jackye1995 self-requested a review on February 7, 2026, 08:18
@jackye1995
Contributor

Thanks, I will take a look tomorrow morning.

Contributor

"Cross-generation scoring" is too specific to LSM; make the comment more generic.

Contributor

"Cross-generation scoring" is too specific to LSM; make the comment more generic.

Contributor

Why do we need a dedicated method for global stats? Can we use the existing mechanism and just allow an optional BM25 override?
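
One possible reading of this suggestion, as a minimal sketch: the search path keeps using the index's own statistics unless the caller supplies an override. `Bm25Stats`, `FtsSearchParams`, `bm25_override`, and `effective_stats` are hypothetical names for illustration, not existing Lance APIs.

```rust
/// Statistics a BM25 scorer needs (hypothetical shape).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Bm25Stats {
    pub num_docs: u64,
    pub avg_doc_len: f32,
}

/// Search parameters with an optional statistics override (hypothetical).
#[derive(Clone, Copy, Debug, Default)]
pub struct FtsSearchParams {
    /// When set, replaces the statistics stored in the index being searched.
    pub bm25_override: Option<Bm25Stats>,
}

/// Pick the statistics to score with: the override if present,
/// otherwise the index's local statistics.
pub fn effective_stats(local: Bm25Stats, params: &FtsSearchParams) -> Bm25Stats {
    params.bm25_override.unwrap_or(local)
}
```

With this shape, a single-index query passes the default parameters and behaves as before, while a multi-index scan computes the global statistics once and passes them to each index search.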

}

/// Add a bloom filter for staleness detection.
pub fn with_bloom_filter(mut self, generation: u64, bloom_filter: Arc<Sbbf>) -> Self {
Contributor

I think that with the latest design, the bloom filter can simply be a bloom filter index in the flushed memtable, with a zone size equal to the row count.

Contributor Author

Agreed. This impacts both FTS and vector search. I can raise a separate PR for it later.

Contributor

this implementation is missing

Contributor

This should not fall back; the index should always have a tokenizer set.

Contributor

import at top

Contributor

We should make sure the same session cache is used when opening both the dataset and the flushed memtables.

Contributor

This is actually quite expensive. We should make sure we are not blocked on loading the BM25 stats; they should be computed while the plan is being formed and executed.
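
A minimal sketch of the overlap being asked for, assuming the stats load can run independently of planning. It uses a plain `std::thread` to stay self-contained; real code in an async engine would more likely spawn a task on its runtime. `GlobalStats`, `load_stats`, `build_plan`, and `plan_with_background_stats` are hypothetical stand-ins, not functions from this PR.

```rust
use std::thread;

/// Aggregated corpus statistics (hypothetical shape).
#[derive(Debug, PartialEq)]
pub struct GlobalStats {
    pub num_docs: u64,
    pub total_tokens: u64,
}

/// Stand-in for the expensive step: read (num_docs, total_tokens)
/// from every index and sum them.
fn load_stats(per_index: Vec<(u64, u64)>) -> GlobalStats {
    let (num_docs, total_tokens) = per_index
        .into_iter()
        .fold((0, 0), |(d, t), (nd, nt)| (d + nd, t + nt));
    GlobalStats { num_docs, total_tokens }
}

/// Stand-in for plan construction: here it only splits and lowercases the query.
fn build_plan(query: &str) -> Vec<String> {
    query.split_whitespace().map(str::to_lowercase).collect()
}

/// Start loading stats in the background, build the plan meanwhile,
/// and wait for the stats only at the point where scoring needs them.
pub fn plan_with_background_stats(
    query: &str,
    per_index: Vec<(u64, u64)>,
) -> (Vec<String>, GlobalStats) {
    let stats_handle = thread::spawn(move || load_stats(per_index));
    let plan = build_plan(query);
    let stats = stats_handle.join().expect("stats loader panicked");
    (plan, stats)
}
```

The point of the shape is that the `join` happens as late as possible, so planning work is not serialized behind the stats load.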

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>