Skip to content

Conversation

@Pulkitg64
Copy link
Contributor

Description

In PR #14792, we added fallback support to Lucene99ScalarQuantizedVectorReader to retrieve float vectors when full-precision vectors are not present in the index. However, this fallback support was removed by mistake during the OSQ implementation as part of the Lucene104 codec, requiring us to re-add it in PR #15415.

To ensure such support is enforced in future codecs, we need a mechanism to detect when it's missing. This PR addresses this by moving the relevant test case to BaseKnnVectorsFormatTestCase, ensuring that all subclasses must implement this functionality. The test will fail for any new codec that lacks the required fallback support.

@Pulkitg64 Pulkitg64 changed the title Moved testReadQuantizedVectorWithEmptyRawVectors test case to BaseKnnVectorsFormatTestCase Enforce fallback support for float vector retrieval in KNN vector formats Dec 5, 2025
@Pulkitg64 Pulkitg64 changed the title Enforce fallback support for float vector retrieval in KNN vector formats Enforce fallback support for float vector retrieval in quantized KNN vector formats Dec 5, 2025
@github-actions github-actions bot added this to the 10.4.0 milestone Dec 5, 2025
Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call it something other than quantizedFormat? There are may formats that I would call quantized that would return false here (e.g. all HNSW ones right?).

Maybe something like "supportsFloatVectorFallback()"?

@Pulkitg64
Copy link
Contributor Author

Can we call it something other than quantizedFormat? There are may formats that I would call quantized that would return false here (e.g. all HNSW ones right?).

Maybe something like "supportsFloatVectorFallback()"?

Makes sense. Changed in next revision.

}

/** Replaces a raw vector file with an empty one that has valid header/footer. */
private void replaceWithEmptyVectorFile(Directory dir, String fileName) throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make these helper methods protected ? They make certain assumptions about the layout of the format (e.g. file extensions, etc.) that might not actually be true.

What if we change the flat float format in the future, and we change the file names, or we write additional info to the metadata? Then the bwc tests and the mainline tests would get out of sync and it would get frustrating to correct.

I am fine with the "default" being the same as the current flat float format, but lets make these helper methods overridable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, it's risky/messy, yet necessary, abstraction violation.

I think ideally the Codec (KnnVectorsFormat/Writer) would expose this method (writing empty full precision vectors), maybe, since it's the only place that knows the file layout. And it'd optionally write both empty and full flat vector files, somehow. I think we need a clean solution to #13158... and once we have that, we fix these base test methods.

for (int i = 0; i < numVectors; i++) {
float[] vec = randomVector(dim);
if (normalize) {
float[] copy = new float[vec.length];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make a copy? Nothing uses vec after this?

}

/** Replaces a raw vector file with an empty one that has valid header/footer. */
private void replaceWithEmptyVectorFile(Directory dir, String fileName) throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, it's risky/messy, yet necessary, abstraction violation.

I think ideally the Codec (KnnVectorsFormat/Writer) would expose this method (writing empty full precision vectors), maybe, since it's the only place that knows the file layout. And it'd optionally write both empty and full flat vector files, somehow. I think we need a clean solution to #13158... and once we have that, we fix these base test methods.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for following up to fix the testing issue @Pulkitg64! This way future new KN Codec components shouldn't accidentally lose this capability.

}

@Override
protected boolean supportsFloatVectorFallback() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I thought "huh, maybe base class should just have default return false; impl", but then realized that's bad: we want new formats to have to explicitly think about this question "do I support regenerated float[] from compressed/quantized forms" rather than inherit dangerous default (which would risk losing the feature again).

If Lucene had PQ (product quantization) working in a new KnnVectorsWriter, which I think is both dimensionality reducing, and quantizing scalar values, it could in theory re-hydrate float[] I think?

Copy link
Contributor Author

@Pulkitg64 Pulkitg64 Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want new formats to have to explicitly think about this question

Yes that's right :) and If any new test class implements the function and returns true, then the test case will fail with empty float vector error. This would trigger the user to implement fallback in the codec.

}

@Override
protected Codec getCodecForQuantizedTest() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also rename this Quantized -> FloatVectorFallback or so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in new revision.


protected abstract boolean supportsFloatVectorFallback();

protected int getQuantizationBits() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment explaining that this is for computing epsilon tolerance of float quantization errors in the test case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

/** Updates vector metadata file to indicate zero vector length. */
protected void updateVectorMetadataFile(Directory dir, String fileName) throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, can we move these methods down into the actual FlatVectorsWriters impls that write this specific format? These Codec specific details really shouldn't be in the abstract base class...

Is the problem that we have two different FlatVectorsFormats that share this same impl? If so, maybe we can share somehow -- make static methods?

Copy link
Contributor Author

@Pulkitg64 Pulkitg64 Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion Mike! I want to make sure I understand the approach correctly before proceeding. If I understand correctly, the idea is to move this function to the FlatVectorWriter class. I see a few potential challenges with this approach:

  1. We'd need to determine how to invoke this function from tests, which may require exposing additional APIs.
  2. Alternatively, we could make this part of the codec itself to create empty vector files during flush, but this functionality isn't currently supported and might be too invasive a change for this particular fix.

For given time, would it make sense to instead focus on addressing #13158 (creating a read-only index)? (I think I have some idea which we can pursue to solve the issue)

Please let me know if you meant something else in your comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I meant move these methods down into the test classes that correspond to the format you are tweaking, i.e. TestXXVectorsFormat classes. I think specifically to TestLucene99HnswVectorsFormat.java?

My logic is that how the subclasses of this test class go and truncate to empty full precision vector files is a Codec implementation dependent thing, whereas this Base class should be agnostic to such Codec specifics. Multiple Codecs should be able to re-use this class. E.g. TestSimpleTextKnnVectorsFormat would need to implement its own way to zero out the vector files? Maybe we could add that here/now? This way it exercises that this base test class is in fact generic enough to test across codecs... (and this is sort to purpose of the SimpleTextCodec -- to show that the Codec API really is generic enough to correctly support a wildly different kind of Codec implementation. Plus it's a super coo way to debug indexing problems since you can go and look at the text files.)

(Separately: we don't seem to have BaseFlatVectorsFormatTestCase nor separate test cases for each flat format? Does BaseKnnVectorsFormatTestCase test the FlatVectorsFormat too maybe? Anyway, let's not try to solve that here -- PNP!).

Alternatively, we could make this part of the codec itself to create empty vector files during flush, but this functionality isn't currently supported and might be too invasive a change for this particular fix.

Yeah let's not try to do that here -- that indeed belongs under #13158 -- I think having Codec write/own (inventory) the empty vector files seems like a good approach, but it's likely complicated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants