refactor: use dict entries and encoded size instead of cardinality for dict decision #5891

Open

Xuanwo wants to merge 1 commit into main from luban/mind-into

Conversation

Xuanwo (Collaborator) commented Feb 5, 2026

This PR changes how we decide whether or not to use dictionary encoding. Instead of relying on cardinality, we now use dict entries and encoded size.


Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

github-actions bot (Contributor) commented Feb 5, 2026

Code Review

Summary

This PR changes the dictionary encoding decision logic from using pre-computed cardinality statistics to using a budget-based approach with max_dict_entries and max_encoded_size limits. It also adds early-out sampling to skip dictionary encoding on near-unique data.
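
To make the new decision flow concrete, here is a minimal sketch of a budget-based check of the kind the PR describes. The names max_dict_entries, max_encoded_size, and the early-out behavior follow the wording above; the types and signatures are illustrative, not the actual Lance implementation.

    // Illustrative sketch only, not the actual Lance code: abort dictionary
    // building as soon as either the entry count or the encoded size budget
    // is exhausted, instead of pre-computing cardinality.
    use std::collections::HashMap;

    struct DictBudget {
        max_dict_entries: usize, // cap on distinct values in the dictionary
        max_encoded_size: usize, // cap on total bytes held by the dictionary
    }

    /// Returns Some(dictionary) when the data fits the budget, None otherwise.
    fn try_build_dict<'a>(
        values: impl Iterator<Item = &'a [u8]>,
        budget: &DictBudget,
    ) -> Option<HashMap<Vec<u8>, u32>> {
        let mut dict: HashMap<Vec<u8>, u32> = HashMap::new();
        let mut encoded_size = 0usize;

        for value in values {
            if !dict.contains_key(value) {
                encoded_size += value.len();
                // Early out: give up the moment a budget is exceeded rather
                // than scanning the whole block.
                if dict.len() + 1 > budget.max_dict_entries
                    || encoded_size > budget.max_encoded_size
                {
                    return None;
                }
                let next_code = dict.len() as u32;
                dict.insert(value.to_vec(), next_code);
            }
        }
        Some(dict)
    }

The appeal of this shape is that the worst case does bounded work: once either limit trips, the encoder falls back without finishing the scan.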

P0 Issues

  1. Hash collision risk in sample_is_near_unique for variable-width data (primitive.rs:287-318)

    The function uses xxh3_64 hashes in a HashSet<u64> to estimate the uniqueness of strings. A hash collision undercounts unique values, which could cause the function to misclassify high-cardinality data as "not near-unique" and waste CPU on a dictionary encoding attempt that is later abandoned. While xxh3 has excellent distribution, any 64-bit hash admits collisions in principle, so the sampled unique ratio can undercount data whose actual cardinality is high.

    Recommendation: This is acceptable for a sampling heuristic, since false negatives (missing some unique data) just mean we try dictionary encoding and bail out. However, document this trade-off explicitly (a sketch of such a documented heuristic follows this list).

  2. Magic constants without documentation (primitive.rs:66-67)

    DEFAULT_SAMPLE_SIZE: usize = 4096 and DEFAULT_SAMPLE_UNIQUE_RATIO: f64 = 0.98 are introduced inside should_dictionary_encode. These affect encoding behavior but aren't documented or exposed as configurable.

    Recommendation: Move these to module-level constants with documentation explaining their purpose and tuning considerations (see the sketch after this list).
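
As a rough illustration of both recommendations above, the sketch below hoists the two constants to module level with doc comments and documents the hash-collision trade-off on the sampling helper. It assumes the xxhash-rust crate's xxh3_64; the function shape is a guess at what sample_is_near_unique might look like, not the code under review.

    use std::collections::HashSet;
    use xxhash_rust::xxh3::xxh3_64;

    /// Number of leading values sampled when probing for near-unique data.
    /// Larger samples are more accurate but cost more CPU per block.
    const DEFAULT_SAMPLE_SIZE: usize = 4096;

    /// If at least this fraction of sampled values hash to distinct 64-bit
    /// values, the block is treated as near-unique and dictionary encoding
    /// is skipped.
    const DEFAULT_SAMPLE_UNIQUE_RATIO: f64 = 0.98;

    /// Estimates whether a block is near-unique by hashing a prefix of its
    /// values.
    ///
    /// Trade-off (documented as recommended above): a 64-bit hash collision
    /// undercounts unique values, so high-cardinality data can occasionally
    /// be classified as "not near-unique". That false negative only means we
    /// attempt dictionary encoding and bail out once the budget is exceeded.
    fn sample_is_near_unique<'a>(values: impl Iterator<Item = &'a [u8]>) -> bool {
        let mut seen: HashSet<u64> = HashSet::with_capacity(DEFAULT_SAMPLE_SIZE);
        let mut sampled = 0usize;

        for value in values.take(DEFAULT_SAMPLE_SIZE) {
            seen.insert(xxh3_64(value));
            sampled += 1;
        }

        sampled > 0
            && (seen.len() as f64) / (sampled as f64) >= DEFAULT_SAMPLE_UNIQUE_RATIO
    }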

P1 Issues

  1. Lazy cardinality computation may cause unexpected latency (statistics.rs:189-214)

    The change makes VariableWidthBlock::get_stat(Stat::Cardinality) lazily compute cardinality, acquiring a write lock inside what callers may expect to be a read-only operation. If other code paths depend on cardinality being pre-computed (e.g., for progress estimation or logging), they may experience unexpected blocking.

    Recommendation: Verify that no other code paths rely on cardinality being eagerly computed. The test coverage seems adequate, but consider a brief audit (the locking pattern in question is sketched after this list).

  2. Potential division edge case (primitive.rs:181-183)

    let threshold_cardinality = num_values
        .checked_div(divisor.max(1))
        .unwrap_or(0)
        .min(max_cardinality);

    Using divisor.max(1) is good, but if someone sets LANCE_ENCODING_DICT_DIVISOR=0 via the env var, it silently becomes 1 rather than erroring. This is likely fine but may surprise users who expect an error (a stricter parsing sketch follows this list).
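
For the first P1 item, the locking pattern in question looks roughly like the sketch below (a generic double-checked RwLock cache, not the actual statistics.rs code): the fast path takes a read lock, and only a cache miss escalates to the write lock that can block callers of an apparently read-only get_stat.

    use std::sync::RwLock;

    struct VariableWidthStats {
        cardinality: RwLock<Option<u64>>, // lazily computed, cached thereafter
    }

    impl VariableWidthStats {
        fn cardinality(&self, compute: impl FnOnce() -> u64) -> u64 {
            // Fast path: the statistic is already cached.
            if let Some(card) = *self.cardinality.read().unwrap() {
                return card;
            }
            // Slow path: take the write lock and compute. Callers expecting a
            // read-only operation can block here, which is the latency the
            // review flags.
            let mut guard = self.cardinality.write().unwrap();
            *guard.get_or_insert_with(compute)
        }
    }

For the second item, if an error is preferred over silently clamping a zero divisor, the env-var read could validate explicitly. The variable name LANCE_ENCODING_DICT_DIVISOR comes from the review; the helper and its default value are hypothetical.

    fn dict_divisor_from_env() -> Result<u64, String> {
        match std::env::var("LANCE_ENCODING_DICT_DIVISOR") {
            // Unset: fall back to some default divisor (value illustrative).
            Err(_) => Ok(100),
            Ok(raw) => match raw.parse::<u64>() {
                Ok(0) => Err("LANCE_ENCODING_DICT_DIVISOR must be >= 1".to_string()),
                Ok(divisor) => Ok(divisor),
                Err(err) => Err(format!("invalid LANCE_ENCODING_DICT_DIVISOR: {err}")),
            },
        }
    }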

Positive Notes

  • Good use of budget-based approach that aborts dictionary encoding early if limits are exceeded
  • The sampling heuristic should reduce CPU waste on high-cardinality data
  • Tests cover the new budget limits well

Testing Suggestions

Consider adding a test that verifies dictionary encoding still works correctly when the sample suggests near-uniqueness but the actual data has lower cardinality (the edge case where the sampling step misses repeated patterns).
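
A rough shape for such a test, with encode_block and decode_block as hypothetical stand-ins for whatever round-trip helpers the test suite already uses:

    #[test]
    fn near_unique_sample_with_low_overall_cardinality_roundtrips() {
        // The sampled prefix (first 4096 values) is fully unique, so the
        // heuristic should report "near-unique", but the block as a whole is
        // dominated by a single repeated value.
        let mut values: Vec<String> = (0..4096).map(|i| format!("unique-{i}")).collect();
        values.extend(std::iter::repeat("repeated".to_string()).take(100_000));

        let encoded = encode_block(&values);  // hypothetical helper
        let decoded = decode_block(&encoded); // hypothetical helper

        // Whatever encoding the heuristic picks, the round trip must be lossless.
        assert_eq!(decoded, values);
    }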
