feat: support FTS query execution for LSM scanner #5905
touch-of-grey wants to merge 1 commit into lance-format:main
Thanks, I will take a look tomorrow morning.
rust/lance-index/src/scalar.rs
Outdated
Cross-generation scoring is too specific to LSM; make the comment more generic.
Why do we need a dedicated method for global stats? Can we use only the existing mechanism and just allow an optional BM25 override?
    }

    /// Add a bloom filter for staleness detection.
    pub fn with_bloom_filter(mut self, generation: u64, bloom_filter: Arc<Sbbf>) -> Self {
I think with the latest design, we can also make the bloom filter just a bloom filter index in the flushed memtable. It will have a zone size equal to the row count.
Agree. This impacts both FTS and vector search. I can raise a separate PR for it later.
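For readers unfamiliar with the idea, here is a minimal sketch of per-generation bloom filters used for staleness detection. It assumes that a row from an older generation is superseded when a newer generation contains the same key; that rule, `ToyBloom`, and `maybe_stale` are all hypothetical illustrations and not code from this PR. `ToyBloom` is a toy stand-in for `Sbbf`, not its real implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter standing in for `Sbbf` (hypothetical, for illustration only).
pub struct ToyBloom {
    bits: Vec<bool>,
}

impl ToyBloom {
    pub fn new(num_bits: usize) -> Self {
        // Keep at least one bit so the modulo below never divides by zero.
        Self { bits: vec![false; num_bits.max(1)] }
    }

    /// Derive two bit positions for a key by hashing it with two seeds.
    fn slots(&self, key: &str) -> [usize; 2] {
        let mut out = [0usize; 2];
        for (seed, slot) in out.iter_mut().enumerate() {
            let mut h = DefaultHasher::new();
            (seed as u64, key).hash(&mut h);
            *slot = (h.finish() % self.bits.len() as u64) as usize;
        }
        out
    }

    pub fn insert(&mut self, key: &str) {
        for i in self.slots(key) {
            self.bits[i] = true;
        }
    }

    /// False means "definitely absent"; true means "possibly present".
    pub fn may_contain(&self, key: &str) -> bool {
        self.slots(key).iter().all(|&i| self.bits[i])
    }
}

/// Assumption: a row read from `row_generation` may be stale if any strictly
/// newer generation's filter may contain the same key.
pub fn maybe_stale(key: &str, row_generation: u64, filters: &[(u64, ToyBloom)]) -> bool {
    filters
        .iter()
        .any(|(generation, f)| *generation > row_generation && f.may_contain(key))
}
```

Because bloom filters can return false positives, a `true` result here only means the row might be stale; a definitive answer would need an exact lookup in the newer generation.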
this implementation is missing
This should not fall back; the index should always have a tokenizer set.
We should make sure we use the same session cache when opening the dataset and the flushed memtables.
This is actually quite expensive. We should make sure we are not blocked on loading the BM25 stats; we should compute them while forming the plan and during execution.
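One way to read this suggestion, as a rough sketch: start the statistics load in the background, build the plan in the meantime, and wait for the stats only when they are needed. A plain OS thread stands in here for whatever async task mechanism the real code uses, and `load_global_stats`, `build_plan`, and `plan_with_background_stats` are hypothetical names, not functions from this PR.

```rust
use std::thread;

/// Hypothetical stand-in for the expensive BM25 statistics load.
/// Returns (number of documents, average document length).
fn load_global_stats(doc_lengths: Vec<u32>) -> (u64, f32) {
    let num_docs = doc_lengths.len() as u64;
    let total: u64 = doc_lengths.iter().map(|&l| l as u64).sum();
    let avgdl = if num_docs == 0 { 0.0 } else { total as f32 / num_docs as f32 };
    (num_docs, avgdl)
}

/// Hypothetical stand-in for plan construction: lowercase the query tokens.
fn build_plan(query: &str) -> Vec<String> {
    query.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Kick off the stats load first, plan while it runs, then join.
pub fn plan_with_background_stats(
    query: &str,
    doc_lengths: Vec<u32>,
) -> (Vec<String>, (u64, f32)) {
    // The stats computation runs on its own thread.
    let handle = thread::spawn(move || load_global_stats(doc_lengths));
    // Planning proceeds without waiting for the stats.
    let plan = build_plan(query);
    // Block only at the point where the stats are actually required.
    let stats = handle.join().expect("stats thread panicked");
    (plan, stats)
}
```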
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9c97b23 to 097007a
Based on the previous discussion, this separates out the FTS query plan, since it requires global BM25. @jackye1995 please take a look.
This calculates global BM25 statistics and then uses the same scorer to rank results across the different inverted indexes, similar to how Lucene does it.
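As an illustration of the idea only, the sketch below sums per-index statistics into corpus-wide statistics and scores every document with a single BM25 scorer built from them, so scores from different inverted indexes are comparable. `IndexStats`, `merge_stats`, and `Bm25` are hypothetical names and shapes, not the types in this PR, and the defaults `k1 = 1.2`, `b = 0.75` are the commonly used BM25 parameters rather than values confirmed by the source.

```rust
use std::collections::HashMap;

/// Per-index statistics (hypothetical shape, one value per inverted index).
#[derive(Debug, Clone, Default)]
pub struct IndexStats {
    pub num_docs: u64,
    pub total_tokens: u64,
    /// term -> number of documents containing the term
    pub doc_freq: HashMap<String, u64>,
}

/// Sum per-index statistics into corpus-wide ("global") statistics.
pub fn merge_stats(parts: &[IndexStats]) -> IndexStats {
    let mut global = IndexStats::default();
    for p in parts {
        global.num_docs += p.num_docs;
        global.total_tokens += p.total_tokens;
        for (term, df) in &p.doc_freq {
            *global.doc_freq.entry(term.clone()).or_insert(0) += df;
        }
    }
    global
}

/// One BM25 scorer shared by all indexes, built from the merged statistics.
pub struct Bm25 {
    stats: IndexStats,
    k1: f32,
    b: f32,
}

impl Bm25 {
    pub fn new(stats: IndexStats) -> Self {
        Self { stats, k1: 1.2, b: 0.75 }
    }

    /// Inverse document frequency in the non-negative form ln(1 + (N - df + 0.5) / (df + 0.5)).
    fn idf(&self, term: &str) -> f32 {
        let n = self.stats.num_docs as f32;
        let df = self.stats.doc_freq.get(term).copied().unwrap_or(0) as f32;
        (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
    }

    /// Score one term occurrence count in one document of length `doc_len`.
    pub fn score(&self, term: &str, term_freq: u32, doc_len: u32) -> f32 {
        if self.stats.num_docs == 0 || self.stats.total_tokens == 0 {
            return 0.0;
        }
        let avgdl = self.stats.total_tokens as f32 / self.stats.num_docs as f32;
        let tf = term_freq as f32;
        let norm = 1.0 - self.b + self.b * doc_len as f32 / avgdl;
        self.idf(term) * tf * (self.k1 + 1.0) / (tf + self.k1 * norm)
    }
}
```

With per-index statistics, the same term could get a different IDF in each index, which would make a merged ranking inconsistent; scoring against the merged statistics avoids that.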