-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
When preserving file partitioning we claim "hash" partitioned on the columns values. Then when applying a dynamic filter for partitioned joins it creates a filter based on a true hash function, not key/value based like file partitioning. This causes partitions to be filtered incorrectly:
Declare file partitioned: Hash([partition_col], N)
Filter with hash-based routing:
CASE hash(key) % N
WHEN 0 THEN <partition 0 bounds>
WHEN 1 THEN <partition 1 bounds>
...
Since file partitioned is using value based partitioning and dynamic filtering is using true hash based end of in this scenario:
┌─────┬──────────────────────┬───────────────┐
│ Key │ Hive Partition Index │ hash(key) % 3 │
├─────┼──────────────────────┼───────────────┤
│ "A" │ 0 │ 2 │
├─────┼──────────────────────┼───────────────┤
│ "B" │ 1 │ 2 │
├─────┼──────────────────────┼───────────────┤
│ "C" │ 2 │ 2 │
└─────┴──────────────────────┴───────────────┘
thus, when the filter evaluates row f_dkey=A, the filter computes it to map to partition 2 bounds even though it is actually in partition 0. So it doesn't match and is incorrectly filtered out.
To Reproduce
see PR: #20175
Expected behavior
For now dynamic filtering will be disabled for partitioned hash joins when preserve file partitions is on.
A long term fix is to introduce a new type of partitioning for the file partitioning to safely distinguish the two. Something like KeyPartitoned or ValuePartitioned is suiting.
Oracle calls this ListPartitioning
Additional context
No response
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working