chore: Clarify rehash setting in hash utils#20154
chore: Clarify rehash setting in hash utils#20154kumarUjjawal wants to merge 4 commits intoapache:mainfrom
Conversation
datafusion/common/src/hash_utils.rs
Outdated
| dict_hashes: &[u64], | ||
| dict_values: &dyn Array, | ||
| idx: usize, | ||
| // `multi_col` is the historical name for what is now referred to as `rehash` elsewhere |
There was a problem hiding this comment.
It's better to fix the naming than add a comment trying to explain the discrepency
There was a problem hiding this comment.
Made some changes, also did a minor refactor, there is another opportunity for the refactor to centralize the common “compute per-row nested hash → apply rehash (init vs combine) → hash null rows” logic in a shared helper, so each nested hasher mostly just computes a row_hash and delegates the buffer update/null handling. What do you think?
There was a problem hiding this comment.
I'd be careful of this approach, as we'd want to pull as many checks outside the hotloop as we can; checking rehash each time we compute a hash is inefficient compared to checking once before the loop, as many other functions here do
There was a problem hiding this comment.
Oh, Thanks for the heads up. Does the changes look good?
There was a problem hiding this comment.
I don't think we should take the approach of having a apply_row_hash function as it currently is, per my reasoning above
datafusion/common/src/hash_utils.rs
Outdated
| hashes_buffer: &mut [u64], | ||
| rehash: bool, | ||
| ) -> Result<()> { | ||
| // Hashing for nested types follows the same initialize-vs-combine convention as |
There was a problem hiding this comment.
We don't need to repeat this explanation in each function; the top level explanation is sufficient
datafusion/common/src/hash_utils.rs
Outdated
| } | ||
| } | ||
|
|
||
| // Hash null struct values consistently with other array types |
There was a problem hiding this comment.
I thought other array types ignore nulls when hashing? Except for DataType::Null
Which issue does this PR close?
rehashsetting in hash utils #20150Rationale for this change
rehash controls whether a column initializes per-row hashes or combines into an existing accumulator when
hashing multiple columns. Nested and dictionary hashing previously didn’t follow this convention consistently, making behavior harder to reason about.
What changes are included in this PR?
• Documented rehash semantics and how create_hashes applies it.
• Renamed dictionary hashing parameter multi_col to rehash for consistency.
• Updated nested-type hashers (list/struct/map/union/fixed-size-list) to accept and honor rehash, and made null
handling consistent with other types.
• Updated/added tests to reflect the corrected semantics.
Are these changes tested?
Yes
Are there any user-facing changes?
No