
Conversation

@abernardi597 commented Dec 4, 2025

Description

I took a stab at bringing the OpenSearch JVector codec into Lucene as a sandbox codec (see issue #14681) to see how a DiskANN-inspired index might compare to the current generation of HNSW.
I made quite a few changes along the way and wanted to cut this PR to share some of those changes/results and maybe solicit some feedback from interested parties. Most notably, I did remove the incremental graph building functionality that is used to speed up merges, though I'd like to add it back and look at the improvements in merge-time for JVector indices. I also made a PR for JVector (datastax/jvector#577) to fix a byte-order inconsistency to better leverage Lucene's bulk-read for floats.
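The byte-order point is easy to illustrate: Lucene's index format has been little-endian since 9.0, so a codec that stores vectors in the matching order can decode a whole vector with a single bulk copy instead of per-float byte swaps. A minimal sketch of the idea (class and method names here are mine, not JVector's or Lucene's API):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: why a consistent byte order enables bulk float reads.
// If the stored order does not match the reader's order, every float
// must be individually byte-swapped; when they match, one bulk copy
// suffices. Illustrative only -- not the JVector codec's actual code.
public class BulkFloatRead {

  // Encode a vector little-endian, as Lucene's IndexOutput would.
  static byte[] encodeLE(float[] vec) {
    ByteBuffer buf = ByteBuffer.allocate(vec.length * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.asFloatBuffer().put(vec);
    return buf.array();
  }

  // Bulk-decode in one call -- only valid when the stored order matches.
  static float[] decodeLE(byte[] bytes, int dim) {
    float[] out = new float[dim];
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer().get(out);
    return out;
  }

  public static void main(String[] args) {
    float[] v = {1.5f, -2.25f, 3.0f};
    float[] back = decodeLE(encodeLE(v), v.length);
    for (int i = 0; i < v.length; i++) {
      if (v[i] != back[i]) throw new AssertionError("round trip failed");
    }
    System.out.println("round trip ok");
  }
}
```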

I hooked it up to lucene-util (PR incoming) for comparison, trying to play to the strengths of each codec while maintaining similar levels of parallelism. I ran HNSW with 32 indexing threads and a force-merge into 1 segment, and JVector with 1 indexing thread backed by a 32-way ForkJoinPool for its SIMD operations and ForkJoinPool.commonPool() for its other parallel operations. I also fixed oversample=1 for both, and used neighborOverflow=2 and alpha=2 for JVector.

These results are from the 768-dim Cohere dataset on an m7g.16xlarge EC2 instance, using PQ for quantization in JVector and OSQ in Lucene.

| recall | latency (ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | visited | index (s) | index_docs/s | force_merge (s) | num_segments | index_size (MB) | vec_disk (MB) | vec_RAM (MB) | indexType | metric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.965 | 1.408 | 1.399 | 0.994 | 100000 | 100 | 50 | 64 | 250 | no | 4968 | 5.99 | 16700.07 | 10.10 | 1 | 298.17 | 292.969 | 292.969 | HNSW | COSINE |
| 0.939 | 2.186 | 2.155 | 0.986 | 100000 | 100 | 50 | 64 | 250 | no | 3485 | 19.58 | 5107.77 | 0.01 | 1 | 318.80 | 292.969 | 292.969 | JVECTOR | COSINE |
| 0.963 | 1.409 | 1.401 | 0.994 | 100000 | 100 | 50 | 64 | 250 | 8 bits | 5028 | 8.75 | 11431.18 | 12.95 | 1 | 372.84 | 367.737 | 74.768 | HNSW | COSINE |
| 0.939 | 9.524 | 9.516 | 0.999 | 100000 | 100 | 50 | 64 | 250 | 8 bits | 3525 | 886.28 | 112.83 | 0.01 | 1 | 392.79 | 367.737 | 74.768 | JVECTOR | COSINE |
| 0.899 | 0.967 | 0.959 | 0.992 | 100000 | 100 | 50 | 64 | 250 | 4 bits | 5076 | 8.84 | 11314.78 | 9.07 | 1 | 335.80 | 331.116 | 38.147 | HNSW | COSINE |
| 0.937 | 3.469 | 3.457 | 0.997 | 100000 | 100 | 50 | 64 | 250 | 4 bits | 3437 | 148.70 | 672.51 | 0.01 | 1 | 356.17 | 331.116 | 38.147 | JVECTOR | COSINE |
| 0.669 | 0.681 | 0.673 | 0.988 | 100000 | 100 | 50 | 64 | 250 | 1 bits | 5895 | 8.04 | 12439.36 | 8.84 | 1 | 308.42 | 303.459 | 10.490 | HNSW | COSINE |
| 0.730 | 1.056 | 1.044 | 0.989 | 100000 | 100 | 50 | 64 | 250 | 1 bits | 2672 | 51.39 | 1945.90 | 0.01 | 1 | 328.70 | 303.459 | 10.490 | JVECTOR | COSINE |

This PR is not really intended to be merged, in light of feedback on the previous PR (#14892) suggesting that Lucene should incorporate some of the learnings rather than add yet another KNN engine.

@mikemccand (Member)

> I did remove the incremental graph building functionality that is used to speed up merges, though I'd like to add it back and look at the improvements in merge-time for JVector indices.

Lucene's HNSW merging has exactly this optimization, I think: it reuses the incoming HNSW graph from the largest of the segments being merged (as long as that segment has no, or perhaps now just not many, deletions) as the starting point for the merged HNSW graph. So preserving this from jVector would make the comparison more fair ...
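For readers who haven't looked at that code path, the seed-selection part of the heuristic can be sketched like this (a toy stand-in with made-up types, not Lucene's actual IncrementalHnswGraphMerger implementation):

```java
import java.util.List;

// Toy sketch of merge-time graph reuse: pick the largest deletion-free
// segment's graph to seed the merged graph; if none qualifies, the
// merged graph must be built from scratch. The Segment record below is
// a stand-in, not a real Lucene type.
public class GraphReuse {
  record Segment(String name, int liveDocs, boolean hasDeletions) {}

  // Returns the segment whose graph can seed the merge, or null.
  static Segment chooseInitGraph(List<Segment> segments) {
    Segment best = null;
    for (Segment s : segments) {
      if (s.hasDeletions()) continue; // a reused graph must cover all docs
      if (best == null || s.liveDocs() > best.liveDocs()) best = s;
    }
    return best;
  }

  public static void main(String[] args) {
    var segs = List.of(
        new Segment("_0", 50_000, false),
        new Segment("_1", 80_000, true),   // largest, but has deletions
        new Segment("_2", 60_000, false));
    System.out.println(chooseInitGraph(segs).name()); // prints _2
  }
}
```

The remaining vectors from the other segments are then inserted into the reused graph one by one, which is why the trick saves so much merge time when one segment dominates.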

@mikemccand (Member) commented Dec 11, 2025

> I also made a PR for JVector (datastax/jvector#577) to fix a byte-order inconsistency to better leverage Lucene's bulk-read for floats.

Nice!

But, sigh, I see your PR there is blocked on the usual GitHub penalize-new-contributors "feature"/tax: a maintainer must approve the GitHub Actions runs that would smoke-test the PR and maybe give you some feedback on simple things to fix.

@mikemccand (Member)

@abernardi597 there was also a previous PR #14892 which implemented a Lucene Codec wrapping jVector, also inspired by OpenSearch's integration, but a while ago (early summer 2025). I suspect OpenSearch's jVector integration has made many improvements since then.

Anyways, how does your PR here compare to that original PR? Did you start from that one, or intentionally not start from it to do everything fresh, or something in between?

@mikemccand (Member)

> I also fixed oversample=1 for both and used neighborOverflow=2 and alpha=2 for JVector.

Does Lucene's HNSW have an analogue for neighborOverflow=2 and alpha=2 that you are trying to match to make the comparison as apples/apples as possible?

@mikemccand (Member)

> I hooked it up to lucene-util (PR incoming) for comparison

+1, thank you -- making it easy-ish for anyone to benchmark jVector against the Faiss-wrapped Codec and Lucene's HNSW implementation would be awesome.

knnPerfTest.py got a number of improvements recently (autologger, factoring away non-differentiating columns, preserving index-time and force-merge-time across invocations, etc.).

Plus, we now have Cohere v3 vectors: 1024 dims instead of the 768 of Cohere v2. And they are unit-sphere normalized, unlike Cohere v2.

@abernardi597 (Author) commented Dec 12, 2025

> Lucene's HNSW merging has exactly this optimization

I've been working on some modifications to further align the two implementations.
For example, I have added changes to do single-threaded graph construction on the indexing thread (instead of buffering all the docs until building the graph in parallel at flush-time).

I am working on the graph-reuse bit, though it looks like Lucene also does a smarter merge: it inserts key nodes from the smaller graph so that it can reuse the smaller graph's adjacency information to seed the graph search when inserting the remaining nodes. JVector does not do this at the moment, but it would likely benefit from such a change (possibly as an upstream contribution).

> how does your PR here compare to that original PR?

I looked at the original PR as a starting point, but found that several key changes in the upstream OpenSearch implementation could be brought in. Merging those commits seemed unwieldy, so I opted to start from scratch by checking the codec out into the sandbox. I then fixed the build and style issues before changing how the codec actually works to get more functional parity with Lucene's HNSW codecs, specifically getting the extra KNN tests passing and moving toward the single-indexing-thread model mentioned above.

> Does Lucene's HNSW have an analogue for neighborOverflow=2 and alpha=2

We have found that alpha=2 is actually partially responsible for the increase in index/search time. alpha is a hyper-parameter that relaxes the diversity check by a multiplicative factor, with alpha=1 matching HNSW's diversity check. We found that alpha=2 produced graphs in which every node was saturated with edges (maxConn edges), which was really slowing down both construction and search.
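For readers unfamiliar with the knob, here is a sketch of an alpha-relaxed diversity check in the DiskANN/Vamana style (illustrative only, not JVector's actual implementation). With alpha=1 the condition reduces to the standard HNSW heuristic; alpha>1 lets more candidates pass, which is how every node ends up saturated at maxConn edges:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of alpha-relaxed neighbor selection (robust prune). A
// candidate is kept only while it is closer to the node than alpha
// times its distance to every already-kept neighbor; raising alpha
// relaxes that cutoff and densifies the graph.
public class DiversityCheck {

  static double dist(float[] a, float[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return Math.sqrt(s);
  }

  // Candidates must be sorted by increasing distance to `node`.
  static List<float[]> selectNeighbors(
      float[] node, List<float[]> sortedCandidates, int maxConn, double alpha) {
    List<float[]> kept = new ArrayList<>();
    for (float[] c : sortedCandidates) {
      if (kept.size() == maxConn) break;
      boolean diverse = true;
      for (float[] n : kept) {
        if (dist(node, c) >= alpha * dist(n, c)) { // alpha > 1 relaxes this
          diverse = false;
          break;
        }
      }
      if (diverse) kept.add(c);
    }
    return kept;
  }

  public static void main(String[] args) {
    float[] node = {0, 0};
    List<float[]> cands =
        List.of(new float[] {1, 0}, new float[] {0.7071f, 0.7071f});
    // With alpha=1 the second candidate fails the diversity check;
    // with alpha=2 it is kept, so the node gets a denser neighborhood.
    System.out.println(selectNeighbors(node, cands, 4, 1.0).size()); // 1
    System.out.println(selectNeighbors(node, cands, 4, 2.0).size()); // 2
  }
}
```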

There is also a hierarchyEnabled flag that adds layers to the graph in much the same fashion as the H in HNSW.
Enabling the hierarchy with alpha=1 and also allowing 2*maxConn for level=0 gives somewhat more promising results:

| recall | latency (ms) | netCPU | avgCpuCount | nDoc | topK | fanout | maxConn | beamWidth | quantized | visited | index (s) | index_docs/s | force_merge (s) | num_segments | index_size (MB) | vec_disk (MB) | vec_RAM (MB) | indexType | metric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.926 | 2.262 | 2.2 | 0.973 | 100000 | 100 | 50 | 64 | 250 | no | 3277 | 12.07 | 8282.26 | 0.01 | 1 | 319.21 | 292.969 | 292.969 | JVECTOR | COSINE |
| 0.926 | 10.106 | 9.95 | 0.985 | 100000 | 100 | 50 | 64 | 250 | 8 bits | 3238 | 196.74 | 508.27 | 0.01 | 1 | 393.46 | 367.737 | 74.768 | JVECTOR | COSINE |
| 0.926 | 3.444 | 3.386 | 0.983 | 100000 | 100 | 50 | 64 | 250 | 4 bits | 3189 | 75.32 | 1327.6 | 0.01 | 1 | 356.71 | 331.116 | 38.147 | JVECTOR | COSINE |
| 0.739 | 1.15 | 1.122 | 0.976 | 100000 | 100 | 50 | 64 | 250 | 1 bits | 2581 | 22.24 | 4496.4 | 0.01 | 1 | 329.15 | 303.459 | 10.49 | JVECTOR | COSINE |
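For context on the hierarchyEnabled flag mentioned above: assuming it follows HNSW's usual scheme (I haven't verified JVector's exact formula), each node draws a maximum level from a geometric-like distribution, and the base layer gets twice the neighbor budget, e.g.:

```java
// Sketch of HNSW-style hierarchy mechanics: level = floor(-ln(U)/ln(M))
// with U uniform in (0, 1], so most nodes sit on level 0 and each
// higher level is roughly M times rarer; level 0 conventionally allows
// 2*maxConn neighbors. Illustrative, not JVector's code.
public class HierarchyLevels {

  // Draw a node's top level from a uniform sample `unif` in (0, 1].
  static int randomLevel(double unif, int maxConn) {
    return (int) Math.floor(-Math.log(unif) / Math.log(maxConn));
  }

  // The base layer allows twice the neighbor budget of upper layers.
  static int maxNeighbors(int level, int maxConn) {
    return level == 0 ? 2 * maxConn : maxConn;
  }

  public static void main(String[] args) {
    java.util.Random rng = new java.util.Random(42);
    int above0 = 0;
    for (int i = 0; i < 100_000; i++) {
      // 1.0 - nextDouble() maps [0, 1) to (0, 1], avoiding log(0).
      if (randomLevel(1.0 - rng.nextDouble(), 64) > 0) above0++;
    }
    // Roughly 1/64 of nodes rise above level 0 for maxConn = 64.
    System.out.println(above0 + " of 100000 nodes above level 0");
  }
}
```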

Combining this with the single-threaded indexing mentioned above lets me run a more apples-to-apples test: 32 indexing threads and 32 merge threads, with a final force-merge, for both codecs:

| recall | latency (ms) | netCPU | avgCpuCount | numDoc | topK | fanout | maxConn | beamWidth | quantized | visited | index (s) | index_docs/s | force_merge (s) | total_index (s) | num_segments | index_size (MB) | vec_disk (MB) | vec_RAM (MB) | indexType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.960 | 1.796 | 1.764 | 0.982 | 200000 | 100 | 50 | 64 | 250 | no | 5596 | 12.81 | 15612.80 | 24.58 | 37.39 | 1 | 596.91 | 585.938 | 585.938 | HNSW |
| 0.904 | 2.416 | 2.371 | 0.981 | 200000 | 100 | 50 | 64 | 250 | no | 3321 | 15.42 | 12972.69 | 73.05 | 88.47 | 1 | 686.89 | 585.938 | 585.938 | JVECTOR |
| 0.894 | 1.391 | 1.363 | 0.980 | 200000 | 100 | 50 | 64 | 250 | 4 bits | 5661 | 18.55 | 10784.00 | 21.93 | 40.48 | 1 | 672.01 | 662.231 | 76.294 | HNSW |
| 0.903 | 3.923 | 3.862 | 0.984 | 200000 | 100 | 50 | 64 | 250 | 4 bits | 3274 | 15.56 | 12850.99 | 107.83 | 123.39 | 1 | 760.87 | 662.231 | 76.294 | JVECTOR |
| 0.661 | 0.887 | 0.867 | 0.977 | 200000 | 100 | 50 | 64 | 250 | 1 bits | 6552 | 17.25 | 11594.20 | 19.71 | 36.96 | 1 | 617.32 | 606.918 | 20.981 | HNSW |
| 0.724 | 1.252 | 1.229 | 0.982 | 200000 | 100 | 50 | 64 | 250 | 1 bits | 2704 | 15.52 | 12888.26 | 35.89 | 51.41 | 1 | 705.95 | 606.918 | 20.981 | JVECTOR |

I'm nearly at the point where I can reuse the largest graph at merge-time, but I'm working through an elusive duplicate-neighbor bug.

> making it easy-ish for anyone to benchmark jVector against Faiss wrapped Codec and Lucene's HNSW implementation would be awesome

Apologies on the delay here, I am working on re-applying my changes on top of these awesome improvements!

```gradle
moduleApi project(':lucene:facet')
moduleTestImplementation project(':lucene:test-framework')

moduleImplementation('io.github.jbellis:jvector:4.0.0-rc.5') {
```
Contributor

Before merging, this has to be cleaned up: Lucene does not want declarations of external dependencies with version numbers here. This needs to move to the version.toml file.
