Skip to content

test: add nested data query tests#5901

Draft
wjones127 wants to merge 9 commits intolance-format:mainfrom
wjones127:wjones/nested-query-tests
Draft

test: add nested data query tests#5901
wjones127 wants to merge 9 commits intolance-format:mainfrom
wjones127:wjones/nested-query-tests

Conversation

@wjones127
Copy link
Contributor

Summary

Add query tests for nested data types (List<str>, List<int>, Struct, List<Struct<str>>), covering scan, take, and filter with and without indices (LabelList, BTree, Bitmap).

Test data includes null list elements, null lists, null struct fields, and null struct elements in lists. Some of these cases are expected to fail due to known issues (e.g. LabelList dropping rows with null elements, struct-level nulls not preserved on round-trip).

Test plan

  • cargo test -p lance --test integration_tests --features slow_tests nested

🤖 Generated with Claude Code

@github-actions github-actions bot added the chore label Feb 5, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Feb 5, 2026

Code Review

This PR adds comprehensive test coverage for nested data types (List, List, Struct, List<Struct>) with various index configurations. The tests are well-structured and follow existing patterns in the query test suite.

No P0 or P1 issues identified. The code is well-organized, follows the established testing patterns, and covers important edge cases (nulls at various levels, empty lists, duplicates).

A few minor observations (not blocking):

  • The tests appropriately exercise null handling at multiple levels (null elements, null lists, null struct fields, null struct elements) which is valuable for catching regressions like fix: handle NULL elements in LABEL_LIST index results and explain_plan #5867
  • The test data setup in test_query_list_struct is verbose due to the builder API, but this is unavoidable given Arrow's struct builder ergonomics
  • The comment noting LabelList doesn't support struct elements explains why that test case omits index configurations

The PR achieves its stated goal of adding query tests for nested data types with good coverage of edge cases.

@codecov
Copy link

codecov bot commented Feb 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

wjones127 and others added 3 commits February 10, 2026 09:36
Add tests for List<str>, List<int>, Struct, and List<Struct<str>>
covering scan, take, and filter (including NOT/OR variants) with and
without indices (LabelList, BTree, Bitmap).

Data includes null list elements, null lists, null struct fields, and
null struct elements in lists to catch regressions like lance-format#5867.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- test_query_list_str: Fails with LabelList index (issue lance-format#5682)
- test_query_struct: Fails due to struct-level nulls not preserved (issue lance-format#1120)
- test_query_list_struct: Fails due to list-of-struct not properly handled (issue lance-format#838)

Added comprehensive tests for empty lists and nulls on both sides of OR expressions.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
LabelList index still has issues with null element handling despite PR lance-format#5867 and PR lance-format#5914.
Tests pass without LabelList index. Re-enable when fully fixed.

Issue: lance-format#5682

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@wjones127 wjones127 force-pushed the wjones/nested-query-tests branch from 3225b12 to c6bff84 Compare February 10, 2026 17:43
wjones127 and others added 6 commits February 10, 2026 10:11
Both test_query_struct and test_query_list_struct are expected to be fixed
in Lance 2.1+. Re-enable tests when minimum version is 2.1 or later.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Added a with_file_version() method to DatasetTestCases to allow running tests
with specific Lance file format versions. This enables testing features that
are only available in newer file format versions.

Also added test_query_struct_v2_1 which tests struct-level null preservation
with Lance 2.1 format. This test now passes, confirming issue lance-format#1120 is fixed
in Lance 2.1+.

- struct-level nulls are now preserved with V2_1 format (issue lance-format#1120)
- list-of-struct still has issues even with V2_1 (issue lance-format#838 remains open)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Investigation revealed that test_query_list_struct fails due to a panic in the
repdef encoder (lance-encoding/src/repdef.rs:630), NOT due to list-of-struct
filtering limitations (issue lance-format#838).

Key findings:
- Python API: Successfully writes and reads List<Struct> data
- Rust test: Panics during encode with ListBuilder + StructBuilder
- Issue is specific to how Rust builders construct validity patterns
- This is SEPARATE from lance-format#838 (filtering/selection support)

Test status:
- test_query_struct_v2_1: PASSES (confirms issue lance-format#1120 fixed in V2.1)
- test_query_list_struct: FAILS (encoder panic, separate from lance-format#838)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
CRITICAL FINDING: The List<Struct> encoding issue is related to issue lance-format#1120.

Root Cause:
Struct-level nulls (null struct elements in a list) are NOT preserved on round-trip.

This is the same issue as lance-format#1120 (struct-level nulls not preserved) but for list
elements. It's SEPARATE from lance-format#838 (filtering/selection support).

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
CRITICAL FINDING: Behavior differs significantly by file format version:

2.0 (DEFAULT): Silently LOSES null struct elements (data corruption)
2.1: PANICS on READ (assertion in decoder)
2.2: PANICS on READ (assertion in decoder)

The panic is in the decoder (struct.rs:382), not the encoder.

This indicates that 2.1/2.2 changed how null structs are encoded, but
the decoder wasn't properly updated to handle the new encoding.

This is a REGRESSION - 2.0's silent corruption is "better" than
2.1/2.2's crash, though both are wrong.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Attempted to reproduce the exact panic from the Rust test:
  thread 'lance-cpu' panicked at rust/lance-encoding/src/repdef.rs:630:9:
  assertion failed: self.current_len == 0 ||
      self.current_len == validity.len() + self.current_num_specials

Created multiple test patterns:
- test_repdef_panic.py: Various list/struct combinations
- test_repdef_null_first.py: Null struct elements at offset 0
- test_repdef_builder_order.py: Full sequence from Rust test

Result: All Python/PyArrow tests PASS
- Python creates structures that encode/decode correctly
- Issue is specific to Rust ListBuilder + StructBuilder combination
- The 'special' value tracking in definition levels must differ

Key insight: PyArrow's higher-level API abstracts away the
intermediate builder states. Rust builders expose these states
differently, creating mismatches in:
- current_len tracking
- definition level special count
- offset/validity patterns

The Rust test's incremental builder pattern likely creates a
state where current_len != validity.len() + current_num_specials

Recommendation: Debug the Rust test with print statements in
repdef.rs:628-631 to see:
1. What value current_len has
2. What value validity.len() + current_num_specials produces
3. Which specific append() call triggers the mismatch

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments