-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Enforce fallback support for float vector retrieval in quantized KNN vector formats #15476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…nVectorsFormatTestCase
benwtrent
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call it something other than quantizedFormat? There are may formats that I would call quantized that would return false here (e.g. all HNSW ones right?).
Maybe something like "supportsFloatVectorFallback()"?
Makes sense. Changed in next revision. |
| } | ||
|
|
||
| /** Replaces a raw vector file with an empty one that has valid header/footer. */ | ||
| private void replaceWithEmptyVectorFile(Directory dir, String fileName) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make these helper methods protected ? They make certain assumptions about the layout of the format (e.g. file extensions, etc.) that might not actually be true.
What if we change the flat float format in the future, and we change the file names, or we write additional info to the metadata? Then the bwc tests and the mainline tests would get out of sync and it would get frustrating to correct.
I am fine with the "default" being the same as the current flat float format, but lets make these helper methods overridable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, it's risky/messy, yet necessary, abstraction violation.
I think ideally the Codec (KnnVectorsFormat/Writer) would expose this method (writing empty full precision vectors), maybe, since it's the only place that knows the file layout. And it'd optionally write both empty and full flat vector files, somehow. I think we need a clean solution to #13158... and once we have that, we fix these base test methods.
| for (int i = 0; i < numVectors; i++) { | ||
| float[] vec = randomVector(dim); | ||
| if (normalize) { | ||
| float[] copy = new float[vec.length]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make a copy? Nothing uses vec after this?
| } | ||
|
|
||
| /** Replaces a raw vector file with an empty one that has valid header/footer. */ | ||
| private void replaceWithEmptyVectorFile(Directory dir, String fileName) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, it's risky/messy, yet necessary, abstraction violation.
I think ideally the Codec (KnnVectorsFormat/Writer) would expose this method (writing empty full precision vectors), maybe, since it's the only place that knows the file layout. And it'd optionally write both empty and full flat vector files, somehow. I think we need a clean solution to #13158... and once we have that, we fix these base test methods.
mikemccand
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for following up to fix the testing issue @Pulkitg64! This way future new KN Codec components shouldn't accidentally lose this capability.
| } | ||
|
|
||
| @Override | ||
| protected boolean supportsFloatVectorFallback() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first I thought "huh, maybe base class should just have default return false; impl", but then realized that's bad: we want new formats to have to explicitly think about this question "do I support regenerated float[] from compressed/quantized forms" rather than inherit dangerous default (which would risk losing the feature again).
If Lucene had PQ (product quantization) working in a new KnnVectorsWriter, which I think is both dimensionality reducing, and quantizing scalar values, it could in theory re-hydrate float[] I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we want new formats to have to explicitly think about this question
Yes that's right :) and If any new test class implements the function and returns true, then the test case will fail with empty float vector error. This would trigger the user to implement fallback in the codec.
| } | ||
|
|
||
| @Override | ||
| protected Codec getCodecForQuantizedTest() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also rename this Quantized -> FloatVectorFallback or so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in new revision.
|
|
||
| protected abstract boolean supportsFloatVectorFallback(); | ||
|
|
||
| protected int getQuantizationBits() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment explaining that this is for computing epsilon tolerance of float quantization errors in the test case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| } | ||
|
|
||
| /** Updates vector metadata file to indicate zero vector length. */ | ||
| protected void updateVectorMetadataFile(Directory dir, String fileName) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, can we move these methods down into the actual FlatVectorsWriters impls that write this specific format? These Codec specific details really shouldn't be in the abstract base class...
Is the problem that we have two different FlatVectorsFormats that share this same impl? If so, maybe we can share somehow -- make static methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the suggestion Mike! I want to make sure I understand the approach correctly before proceeding. If I understand correctly, the idea is to move this function to the FlatVectorWriter class. I see a few potential challenges with this approach:
- We'd need to determine how to invoke this function from tests, which may require exposing additional APIs.
- Alternatively, we could make this part of the codec itself to create empty vector files during flush, but this functionality isn't currently supported and might be too invasive a change for this particular fix.
For given time, would it make sense to instead focus on addressing #13158 (creating a read-only index)? (I think I have some idea which we can pursue to solve the issue)
Please let me know if you meant something else in your comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry, I meant move these methods down into the test classes that correspond to the format you are tweaking, i.e. TestXXVectorsFormat classes. I think specifically to TestLucene99HnswVectorsFormat.java?
My logic is that how the subclasses of this test class go and truncate to empty full precision vector files is a Codec implementation dependent thing, whereas this Base class should be agnostic to such Codec specifics. Multiple Codecs should be able to re-use this class. E.g. TestSimpleTextKnnVectorsFormat would need to implement its own way to zero out the vector files? Maybe we could add that here/now? This way it exercises that this base test class is in fact generic enough to test across codecs... (and this is sort to purpose of the SimpleTextCodec -- to show that the Codec API really is generic enough to correctly support a wildly different kind of Codec implementation. Plus it's a super coo way to debug indexing problems since you can go and look at the text files.)
(Separately: we don't seem to have BaseFlatVectorsFormatTestCase nor separate test cases for each flat format? Does BaseKnnVectorsFormatTestCase test the FlatVectorsFormat too maybe? Anyway, let's not try to solve that here -- PNP!).
Alternatively, we could make this part of the codec itself to create empty vector files during flush, but this functionality isn't currently supported and might be too invasive a change for this particular fix.
Yeah let's not try to do that here -- that indeed belongs under #13158 -- I think having Codec write/own (inventory) the empty vector files seems like a good approach, but it's likely complicated.
Description
In PR #14792, we added fallback support to Lucene99ScalarQuantizedVectorReader to retrieve float vectors when full-precision vectors are not present in the index. However, this fallback support was removed by mistake during the OSQ implementation as part of the Lucene104 codec, requiring us to re-add it in PR #15415.
To ensure such support is enforced in future codecs, we need a mechanism to detect when it's missing. This PR addresses this by moving the relevant test case to BaseKnnVectorsFormatTestCase, ensuring that all subclasses must implement this functionality. The test will fail for any new codec that lacks the required fallback support.