refactor: use dict entries and encoded size instead of cardinality for dict decision #5891

Open

Xuanwo wants to merge 1 commit into main from luban/mind-into

Conversation

Xuanwo (Collaborator) commented Feb 5, 2026

This PR changes how we decide whether or not to use dictionary encoding. Instead of relying on cardinality, we now use dict entries and encoded size.


Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

github-actions bot (Contributor) commented Feb 5, 2026

Code Review

Summary

This PR changes the dictionary encoding decision logic from using pre-computed cardinality statistics to using a budget-based approach with max_dict_entries and max_encoded_size limits. It also adds early-out sampling to skip dictionary encoding on near-unique data.
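
To make the new decision flow concrete, here is a minimal sketch of a budget-based check of the kind the PR describes. The names max_dict_entries, max_encoded_size, and the early-out behavior follow the wording above; the types and signatures are illustrative, not the actual Lance implementation.

    // Illustrative sketch only, not the actual Lance code: abort dictionary
    // building as soon as either the entry count or the encoded size budget
    // is exhausted, instead of pre-computing cardinality.
    use std::collections::HashMap;

    struct DictBudget {
        max_dict_entries: usize, // cap on distinct values in the dictionary
        max_encoded_size: usize, // cap on total bytes held by the dictionary
    }

    /// Returns Some(dictionary) when the data fits the budget, None otherwise.
    fn try_build_dict<'a>(
        values: impl Iterator<Item = &'a [u8]>,
        budget: &DictBudget,
    ) -> Option<HashMap<Vec<u8>, u32>> {
        let mut dict: HashMap<Vec<u8>, u32> = HashMap::new();
        let mut encoded_size = 0usize;

        for value in values {
            if !dict.contains_key(value) {
                encoded_size += value.len();
                // Early out: give up the moment a budget is exceeded rather
                // than scanning the whole block.
                if dict.len() + 1 > budget.max_dict_entries
                    || encoded_size > budget.max_encoded_size
                {
                    return None;
                }
                let next_code = dict.len() as u32;
                dict.insert(value.to_vec(), next_code);
            }
        }
        Some(dict)
    }

The appeal of this shape is that the worst case does bounded work: once either limit trips, the encoder falls back without finishing the scan.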

P0 Issues

  1. Hash collision risk in sample_is_near_unique for variable-width data (primitive.rs:287-318)

    The function uses xxh3_64 hashes in a HashSet<u64> to estimate the uniqueness of strings. A hash collision undercounts unique values, which could cause the function to misclassify high-cardinality data as "not near-unique" and waste CPU on a dictionary encoding attempt that is later abandoned. While xxh3 has excellent distribution, any 64-bit hash admits collisions in principle, so the sampled unique ratio can undercount data whose actual cardinality is high.

    Recommendation: This is acceptable for a sampling heuristic, since false negatives (missing some unique data) just mean we try dictionary encoding and bail out. However, document this trade-off explicitly (a sketch of such a documented heuristic follows this list).

  2. Magic constants without documentation (primitive.rs:66-67)

    DEFAULT_SAMPLE_SIZE: usize = 4096 and DEFAULT_SAMPLE_UNIQUE_RATIO: f64 = 0.98 are introduced inside should_dictionary_encode. These affect encoding behavior but aren't documented or exposed as configurable.

    Recommendation: Move these to module-level constants with documentation explaining their purpose and tuning considerations (see the sketch after this list).
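
As a rough illustration of both recommendations above, the sketch below hoists the two constants to module level with doc comments and documents the hash-collision trade-off on the sampling helper. It assumes the xxhash-rust crate's xxh3_64; the function shape is a guess at what sample_is_near_unique might look like, not the code under review.

    use std::collections::HashSet;
    use xxhash_rust::xxh3::xxh3_64;

    /// Number of leading values sampled when probing for near-unique data.
    /// Larger samples are more accurate but cost more CPU per block.
    const DEFAULT_SAMPLE_SIZE: usize = 4096;

    /// If at least this fraction of sampled values hash to distinct 64-bit
    /// values, the block is treated as near-unique and dictionary encoding
    /// is skipped.
    const DEFAULT_SAMPLE_UNIQUE_RATIO: f64 = 0.98;

    /// Estimates whether a block is near-unique by hashing a prefix of its
    /// values.
    ///
    /// Trade-off (documented as recommended above): a 64-bit hash collision
    /// undercounts unique values, so high-cardinality data can occasionally
    /// be classified as "not near-unique". That false negative only means we
    /// attempt dictionary encoding and bail out once the budget is exceeded.
    fn sample_is_near_unique<'a>(values: impl Iterator<Item = &'a [u8]>) -> bool {
        let mut seen: HashSet<u64> = HashSet::with_capacity(DEFAULT_SAMPLE_SIZE);
        let mut sampled = 0usize;

        for value in values.take(DEFAULT_SAMPLE_SIZE) {
            seen.insert(xxh3_64(value));
            sampled += 1;
        }

        sampled > 0
            && (seen.len() as f64) / (sampled as f64) >= DEFAULT_SAMPLE_UNIQUE_RATIO
    }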

P1 Issues

  1. Lazy cardinality computation may cause unexpected latency (statistics.rs:189-214)

    The change makes VariableWidthBlock::get_stat(Stat::Cardinality) lazily compute cardinality, acquiring a write lock inside what callers may expect to be a read-only operation. If other code paths depend on cardinality being pre-computed (e.g., for progress estimation or logging), they may experience unexpected blocking.

    Recommendation: Verify that no other code paths rely on cardinality being eagerly computed. The test coverage seems adequate, but consider a brief audit (the locking pattern in question is sketched after this list).

  2. Potential division edge case (primitive.rs:181-183)

    let threshold_cardinality = num_values
        .checked_div(divisor.max(1))
        .unwrap_or(0)
        .min(max_cardinality);

    Using divisor.max(1) is good, but if someone sets LANCE_ENCODING_DICT_DIVISOR=0 via the env var, it silently becomes 1 rather than erroring. This is likely fine but may surprise users who expect an error (a stricter parsing sketch follows this list).
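
For the first P1 item, the locking pattern in question looks roughly like the sketch below (a generic double-checked RwLock cache, not the actual statistics.rs code): the fast path takes a read lock, and only a cache miss escalates to the write lock that can block callers of an apparently read-only get_stat.

    use std::sync::RwLock;

    struct VariableWidthStats {
        cardinality: RwLock<Option<u64>>, // lazily computed, cached thereafter
    }

    impl VariableWidthStats {
        fn cardinality(&self, compute: impl FnOnce() -> u64) -> u64 {
            // Fast path: the statistic is already cached.
            if let Some(card) = *self.cardinality.read().unwrap() {
                return card;
            }
            // Slow path: take the write lock and compute. Callers expecting a
            // read-only operation can block here, which is the latency the
            // review flags.
            let mut guard = self.cardinality.write().unwrap();
            *guard.get_or_insert_with(compute)
        }
    }

For the second item, if an error is preferred over silently clamping a zero divisor, the env-var read could validate explicitly. The variable name LANCE_ENCODING_DICT_DIVISOR comes from the review; the helper and its default value are hypothetical.

    fn dict_divisor_from_env() -> Result<u64, String> {
        match std::env::var("LANCE_ENCODING_DICT_DIVISOR") {
            // Unset: fall back to some default divisor (value illustrative).
            Err(_) => Ok(100),
            Ok(raw) => match raw.parse::<u64>() {
                Ok(0) => Err("LANCE_ENCODING_DICT_DIVISOR must be >= 1".to_string()),
                Ok(divisor) => Ok(divisor),
                Err(err) => Err(format!("invalid LANCE_ENCODING_DICT_DIVISOR: {err}")),
            },
        }
    }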

Positive Notes

  • Good use of budget-based approach that aborts dictionary encoding early if limits are exceeded
  • The sampling heuristic should reduce CPU waste on high-cardinality data
  • Tests cover the new budget limits well

Testing Suggestions

Consider adding a test that verifies dictionary encoding still works correctly when the sample suggests near-uniqueness but the actual data has lower cardinality (the edge case where the sampling step misses repeated patterns).
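
A rough shape for such a test, with encode_block and decode_block as hypothetical stand-ins for whatever round-trip helpers the test suite already uses:

    #[test]
    fn near_unique_sample_with_low_overall_cardinality_roundtrips() {
        // The sampled prefix (first 4096 values) is fully unique, so the
        // heuristic should report "near-unique", but the block as a whole is
        // dominated by a single repeated value.
        let mut values: Vec<String> = (0..4096).map(|i| format!("unique-{i}")).collect();
        values.extend(std::iter::repeat("repeated".to_string()).take(100_000));

        let encoded = encode_block(&values);  // hypothetical helper
        let decoded = decode_block(&encoded); // hypothetical helper

        // Whatever encoding the heuristic picks, the round trip must be lossless.
        assert_eq!(decoded, values);
    }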
