-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[WIP] Revisiting JVector codec #15472
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Lucene's HNSW merging has exactly this optimization (reusing incoming HNSW graph from largest of the segments being merged, as long as there are no (not many now?) deletions, as a starting point for the merged HNSW graph) I think? So preserving this from jVector would make the comparison more fair ... |
Nice! But, sigh, I see your PR there is blocked on the the usual GitHub penalize-new-contributors "feature"/tax of insisting that a maintainer simply approve the GH automation actions that would smoke test the PR and maybe give you some feedback on simple things to fix. |
|
@abernardi597 there was also a previous PR #14892 which implemented a Lucene Codec wrapping jVector, also inspired by OpenSearch's integration, but a while ago (early summer 2025). I suspect OpenSearch's jVector integration made many improvements since then. Anyways, how does your PR here compare to that original PR? Did you start from that one, or intentionally not start from it to do everything fresh, or something in between? |
Does Lucene's HNSW have an analogue for |
+1 thank you -- making it easy-ish for anyone to benchmark jVector against Faiss wrapped Codec and Lucene's HNSW implementation would be awesome.
Plus we now have Cohere v3 vectors, 1024 dims instead of 768 from Cohere v2. And they are unit-sphere normalized, unlike Cohere v2. |
I've been working on some modifications to further align the two implementations. I am working on this graph-reuse bit, though it looks like Lucene also does a smart merge where it inserts key nodes from the smaller graph such that it can re-use adjacency information from the small graph to seed the graph search when inserting the remaining nodes. JVector does not do this at the moment, but would likely benefit from such a change (possibly as an upstream contribution).
I looked at the original PR as a starting point, but found that there were several key changes in the upstream OpenSearch implementation that could be brought in. Merging those commits seemed unwieldy, so I opted to start from scratch by checking out the codec into the
We have found that There is also a
Combining this with the single-thread-indexing mentioned above lets me run a more apples-to-apples test, with 32x indexing threads and 32x merge threads with a final force-merge for both codecs:
I'm nearly at a point where I can re-use the largest graph at merge-time, but I'm working through an elusive duplicate neighbor bug.
Apologies on the delay here, I am working on re-applying my changes on top of these awesome improvements! |
| moduleApi project(':lucene:facet') | ||
| moduleTestImplementation project(':lucene:test-framework') | ||
|
|
||
| moduleImplementation('io.github.jbellis:jvector:4.0.0-rc.5') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before merging this has to be cleaned up. Lucene does not want delectations of external dependencies with version numbers here. Needs to move to version.toml file.
Description
I took a stab at bringing the OpenSearch JVector codec into Lucene as a codec in
sandbox(see issue #14681) to see how a DiskANN-insipired index might compare to the current generation of HNSW.I made quite a few changes along the way and wanted to cut this PR to share some of those changes/results and maybe solicit some feedback from interested parties. Most notably, I did remove the incremental graph building functionality that is used to speed up merges, though I'd like to add it back and look at the improvements in merge-time for JVector indices. I also made a PR for JVector (datastax/jvector#577) to fix a byte-order inconsistency to better leverage Lucene's bulk-read for floats.
I hooked it up to
lucene-util(PR incoming) for comparison, trying to play into the strengths of each codec while also maintaining similar levels of parallelism. I ran HNSW using 32x indexing threads and force-merging into 1 segment while using 1x indexing thread for JVector backed by a 32x concurrencyForkJoinPoolfor its SIMD operations andForkJoinPool.commonPool()for its other parallel operations. I also fixedoversample=1for both and usedneighborOverflow=2andalpha=2for JVector.These results are from the 768-dim cohere dataset using PQ for quantization in JVector and OSQ in Lucene using a
m7g.16xlargeEC2 instance.This PR is not really intended to be merged, in light of some of the feedback on the previous PR (#14892) that suggests Lucene should try to incorporate some of the learnings rather than add yet another KNN engine.