Enforce fallback support for float vector retrieval in quantized KNN vector formats #15476

Pulkitg64 · 2025-12-05T10:51:12Z

Description

In PR #14792, we added fallback support to Lucene99ScalarQuantizedVectorReader to retrieve float vectors when full-precision vectors are not present in the index. However, this fallback support was removed by mistake during the OSQ implementation as part of the Lucene104 codec, requiring us to re-add it in PR #15415.

To ensure such support is enforced in future codecs, we need a mechanism to detect when it's missing. This PR addresses this by moving the relevant test case to BaseKnnVectorsFormatTestCase, ensuring that all subclasses must implement this functionality. The test will fail for any new codec that lacks the required fallback support.

…nVectorsFormatTestCase

benwtrent

Can we call it something other than quantizedFormat? There are may formats that I would call quantized that would return false here (e.g. all HNSW ones right?).

Maybe something like "supportsFloatVectorFallback()"?

Pulkitg64 · 2025-12-09T05:00:26Z

Can we call it something other than quantizedFormat? There are may formats that I would call quantized that would return false here (e.g. all HNSW ones right?).

Maybe something like "supportsFloatVectorFallback()"?

Makes sense. Changed in next revision.

benwtrent · 2025-12-09T12:02:00Z

lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java

+  }
+
+  /** Replaces a raw vector file with an empty one that has valid header/footer. */
+  private void replaceWithEmptyVectorFile(Directory dir, String fileName) throws Exception {


Could you make these helper methods protected ? They make certain assumptions about the layout of the format (e.g. file extensions, etc.) that might not actually be true.

What if we change the flat float format in the future, and we change the file names, or we write additional info to the metadata? Then the bwc tests and the mainline tests would get out of sync and it would get frustrating to correct.

I am fine with the "default" being the same as the current flat float format, but lets make these helper methods overridable

+1, it's risky/messy, yet necessary, abstraction violation.

I think ideally the Codec (KnnVectorsFormat/Writer) would expose this method (writing empty full precision vectors), maybe, since it's the only place that knows the file layout. And it'd optionally write both empty and full flat vector files, somehow. I think we need a clean solution to #13158... and once we have that, we fix these base test methods.

mikemccand · 2025-12-08T18:05:53Z

lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java

+    for (int i = 0; i < numVectors; i++) {
+      float[] vec = randomVector(dim);
+      if (normalize) {
+        float[] copy = new float[vec.length];


Why make a copy? Nothing uses vec after this?

mikemccand · 2025-12-09T14:56:15Z

lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java

+  }
+
+  /** Replaces a raw vector file with an empty one that has valid header/footer. */
+  private void replaceWithEmptyVectorFile(Directory dir, String fileName) throws Exception {


+1, it's risky/messy, yet necessary, abstraction violation.

I think ideally the Codec (KnnVectorsFormat/Writer) would expose this method (writing empty full precision vectors), maybe, since it's the only place that knows the file layout. And it'd optionally write both empty and full flat vector files, somehow. I think we need a clean solution to #13158... and once we have that, we fix these base test methods.

mikemccand

Thank you for following up to fix the testing issue @Pulkitg64! This way future new KN Codec components shouldn't accidentally lose this capability.

mikemccand · 2025-12-09T15:07:15Z

...t/org/apache/lucene/backward_codecs/lucene102/TestLucene102BinaryQuantizedVectorsFormat.java

  }
+
+  @Override
+  protected boolean supportsFloatVectorFallback() {


At first I thought "huh, maybe base class should just have default return false; impl", but then realized that's bad: we want new formats to have to explicitly think about this question "do I support regenerated float[] from compressed/quantized forms" rather than inherit dangerous default (which would risk losing the feature again).

If Lucene had PQ (product quantization) working in a new KnnVectorsWriter, which I think is both dimensionality reducing, and quantizing scalar values, it could in theory re-hydrate float[] I think?

we want new formats to have to explicitly think about this question

Yes that's right :) and If any new test class implements the function and returns true, then the test case will fail with empty float vector error. This would trigger the user to implement fallback in the codec.

mikemccand · 2025-12-09T15:08:46Z

...est/org/apache/lucene/backward_codecs/lucene99/TestLucene99ScalarQuantizedVectorsFormat.java

+  }
+
+  @Override
+  protected Codec getCodecForQuantizedTest() {


Should we also rename this Quantized -> FloatVectorFallback or so?

Changed in new revision.

mikemccand · 2025-12-09T15:12:52Z

lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java


+  protected abstract boolean supportsFloatVectorFallback();
+
+  protected int getQuantizationBits() {


Add comment explaining that this is for computing epsilon tolerance of float quantization errors in the test case?

mikemccand · 2025-12-09T15:17:09Z

lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java

+  }
+
+  /** Updates vector metadata file to indicate zero vector length. */
+  protected void updateVectorMetadataFile(Directory dir, String fileName) throws Exception {


Hmm, can we move these methods down into the actual FlatVectorsWriters impls that write this specific format? These Codec specific details really shouldn't be in the abstract base class...

Is the problem that we have two different FlatVectorsFormats that share this same impl? If so, maybe we can share somehow -- make static methods?

Thank you for the suggestion Mike! I want to make sure I understand the approach correctly before proceeding. If I understand correctly, the idea is to move this function to the FlatVectorWriter class. I see a few potential challenges with this approach:

We'd need to determine how to invoke this function from tests, which may require exposing additional APIs.

Alternatively, we could make this part of the codec itself to create empty vector files during flush, but this functionality isn't currently supported and might be too invasive a change for this particular fix.

For given time, would it make sense to instead focus on addressing #13158 (creating a read-only index)? (I think I have some idea which we can pursue to solve the issue)

Please let me know if you meant something else in your comment.

Oh sorry, I meant move these methods down into the test classes that correspond to the format you are tweaking, i.e. TestXXVectorsFormat classes. I think specifically to TestLucene99HnswVectorsFormat.java?

My logic is that how the subclasses of this test class go and truncate to empty full precision vector files is a Codec implementation dependent thing, whereas this Base class should be agnostic to such Codec specifics. Multiple Codecs should be able to re-use this class. E.g. TestSimpleTextKnnVectorsFormat would need to implement its own way to zero out the vector files? Maybe we could add that here/now? This way it exercises that this base test class is in fact generic enough to test across codecs... (and this is sort to purpose of the SimpleTextCodec -- to show that the Codec API really is generic enough to correctly support a wildly different kind of Codec implementation. Plus it's a super coo way to debug indexing problems since you can go and look at the text files.)

(Separately: we don't seem to have BaseFlatVectorsFormatTestCase nor separate test cases for each flat format? Does BaseKnnVectorsFormatTestCase test the FlatVectorsFormat too maybe? Anyway, let's not try to solve that here -- PNP!).

Alternatively, we could make this part of the codec itself to create empty vector files during flush, but this functionality isn't currently supported and might be too invasive a change for this particular fix.

Yeah let's not try to do that here -- that indeed belongs under #13158 -- I think having Codec write/own (inventory) the empty vector files seems like a good approach, but it's likely complicated.

Moved testReadQuantizedVectorWithEmptyRawVectors test case to BaserKn…

796b16d

…nVectorsFormatTestCase

github-actions bot added module:core/codecs module:test-framework module:sandbox labels Dec 5, 2025

Pulkitg64 changed the title ~~Moved testReadQuantizedVectorWithEmptyRawVectors test case to BaseKnnVectorsFormatTestCase~~ Enforce fallback support for float vector retrieval in KNN vector formats Dec 5, 2025

Pulkitg64 changed the title ~~Enforce fallback support for float vector retrieval in KNN vector formats~~ Enforce fallback support for float vector retrieval in quantized KNN vector formats Dec 5, 2025

Added entry to Changes.txt

249a1b3

github-actions bot added this to the 10.4.0 milestone Dec 5, 2025

benwtrent reviewed Dec 8, 2025

View reviewed changes

Pulkit Gupta added 2 commits December 9, 2025 04:55

Merge branch 'main' into test

32724f2

Rename function

a3af294

benwtrent reviewed Dec 9, 2025

View reviewed changes

Made helper functions protected

34b8e76

mikemccand reviewed Dec 9, 2025

View reviewed changes

CHanged function name and remove redundanct vector copying

07f6072


		protected abstract boolean supportsFloatVectorFallback();

		protected int getQuantizationBits() {

Enforce fallback support for float vector retrieval in quantized KNN vector formats #15476

Are you sure you want to change the base?

Enforce fallback support for float vector retrieval in quantized KNN vector formats #15476

Uh oh!

Conversation

Pulkitg64 commented Dec 5, 2025

Description

Uh oh!

benwtrent left a comment

Choose a reason for hiding this comment

Uh oh!

Pulkitg64 commented Dec 9, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mikemccand left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Pulkitg64 Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Pulkitg64 Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Pulkitg64 Dec 10, 2025 •

edited

Loading

Pulkitg64 Dec 10, 2025 •

edited

Loading