test: add nested data query tests#5901
Draft
wjones127 wants to merge 9 commits intolance-format:mainfrom
Draft
Conversation
Contributor
Code ReviewThis PR adds comprehensive test coverage for nested data types (List, List, Struct, List<Struct>) with various index configurations. The tests are well-structured and follow existing patterns in the query test suite. No P0 or P1 issues identified. The code is well-organized, follows the established testing patterns, and covers important edge cases (nulls at various levels, empty lists, duplicates). A few minor observations (not blocking):
The PR achieves its stated goal of adding query tests for nested data types with good coverage of edge cases. |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Add tests for List<str>, List<int>, Struct, and List<Struct<str>> covering scan, take, and filter (including NOT/OR variants) with and without indices (LabelList, BTree, Bitmap). Data includes null list elements, null lists, null struct fields, and null struct elements in lists to catch regressions like lance-format#5867. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- test_query_list_str: Fails with LabelList index (issue lance-format#5682) - test_query_struct: Fails due to struct-level nulls not preserved (issue lance-format#1120) - test_query_list_struct: Fails due to list-of-struct not properly handled (issue lance-format#838) Added comprehensive tests for empty lists and nulls on both sides of OR expressions. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
LabelList index still has issues with null element handling despite PR lance-format#5867 and PR lance-format#5914. Tests pass without LabelList index. Re-enable when fully fixed. Issue: lance-format#5682 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
3225b12 to
c6bff84
Compare
Both test_query_struct and test_query_list_struct are expected to be fixed in Lance 2.1+. Re-enable tests when minimum version is 2.1 or later. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Added a with_file_version() method to DatasetTestCases to allow running tests with specific Lance file format versions. This enables testing features that are only available in newer file format versions. Also added test_query_struct_v2_1 which tests struct-level null preservation with Lance 2.1 format. This test now passes, confirming issue lance-format#1120 is fixed in Lance 2.1+. - struct-level nulls are now preserved with V2_1 format (issue lance-format#1120) - list-of-struct still has issues even with V2_1 (issue lance-format#838 remains open) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Investigation revealed that test_query_list_struct fails due to a panic in the repdef encoder (lance-encoding/src/repdef.rs:630), NOT due to list-of-struct filtering limitations (issue lance-format#838). Key findings: - Python API: Successfully writes and reads List<Struct> data - Rust test: Panics during encode with ListBuilder + StructBuilder - Issue is specific to how Rust builders construct validity patterns - This is SEPARATE from lance-format#838 (filtering/selection support) Test status: - test_query_struct_v2_1: PASSES (confirms issue lance-format#1120 fixed in V2.1) - test_query_list_struct: FAILS (encoder panic, separate from lance-format#838) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
CRITICAL FINDING: The List<Struct> encoding issue is related to issue lance-format#1120. Root Cause: Struct-level nulls (null struct elements in a list) are NOT preserved on round-trip. This is the same issue as lance-format#1120 (struct-level nulls not preserved) but for list elements. It's SEPARATE from lance-format#838 (filtering/selection support). Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
CRITICAL FINDING: Behavior differs significantly by file format version: 2.0 (DEFAULT): Silently LOSES null struct elements (data corruption) 2.1: PANICS on READ (assertion in decoder) 2.2: PANICS on READ (assertion in decoder) The panic is in the decoder (struct.rs:382), not the encoder. This indicates that 2.1/2.2 changed how null structs are encoded, but the decoder wasn't properly updated to handle the new encoding. This is a REGRESSION - 2.0's silent corruption is "better" than 2.1/2.2's crash, though both are wrong. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Attempted to reproduce the exact panic from the Rust test:
thread 'lance-cpu' panicked at rust/lance-encoding/src/repdef.rs:630:9:
assertion failed: self.current_len == 0 ||
self.current_len == validity.len() + self.current_num_specials
Created multiple test patterns:
- test_repdef_panic.py: Various list/struct combinations
- test_repdef_null_first.py: Null struct elements at offset 0
- test_repdef_builder_order.py: Full sequence from Rust test
Result: All Python/PyArrow tests PASS
- Python creates structures that encode/decode correctly
- Issue is specific to Rust ListBuilder + StructBuilder combination
- The 'special' value tracking in definition levels must differ
Key insight: PyArrow's higher-level API abstracts away the
intermediate builder states. Rust builders expose these states
differently, creating mismatches in:
- current_len tracking
- definition level special count
- offset/validity patterns
The Rust test's incremental builder pattern likely creates a
state where current_len != validity.len() + current_num_specials
Recommendation: Debug the Rust test with print statements in
repdef.rs:628-631 to see:
1. What value current_len has
2. What value validity.len() + current_num_specials produces
3. Which specific append() call triggers the mismatch
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add query tests for nested data types (
List<str>,List<int>,Struct,List<Struct<str>>), covering scan, take, and filter with and without indices (LabelList, BTree, Bitmap).Test data includes null list elements, null lists, null struct fields, and null struct elements in lists. Some of these cases are expected to fail due to known issues (e.g. LabelList dropping rows with null elements, struct-level nulls not preserved on round-trip).
Test plan
cargo test -p lance --test integration_tests --features slow_tests nested🤖 Generated with Claude Code