Skip to content

Epic: CometNativeScan improvements (per-partition serde, cleanup, DPP, AQE DPP, V2 operator) #3510

@mbutrovich

Description

@mbutrovich

I opened #3446 initially to apply a bunch of changes from #3349 to CometNativeScan. The PR got away from me though, particularly due to DPP. I had it working for the static DPP rules, but the dynamic ones that come from AQE are tricky. I am opening this issue to track the things I'd like to tackle in smaller chunks rather than one giant PR:

  • Per-partition serde to no longer send every SparkFilePartition of tasks to every partition. This simply reduces serde overhead for large scans. Addressed by feat: CometNativeScan per-partition plan serde #3511.
  • DPP (non-AQE) for V1 operator. These runtime filters are created by Spark's PlanDynamicPruningFilters and are easier for Comet support since this rule runs before Comet's rules.
  • DPP (AQE) for V1 operator. These runtime filters are created by Spark's PlanAdaptiveDynamicPruningFilters and are difficult for Comet support since this rule runs after Comet's rules. I'll summarize my learning from feat: CometNativeScan per-partition plan data, add DPP [iceberg] #3446:
    • Comet's rules replace things like BroadcastHashJoin with CometBroadcastHashJoin, which PlanAdaptiveDynamicPruningFilters does not recognize.
    • We can't modify Spark rules, so we could wait until after PlanAdaptiveDynamicPruningFilters runs. This requires registering new Comet rules after where they currently run. I tried to create a simple rule to defer just BroadcastHashJoin replacement until later, but this became too complicated with multiple scan implementations. I think when we pare down our scan implementations, we can revisit a broader redesign of Comet rules in a way that works better with AQE. We will need this for stronger Spark 4.0 support.
  • CometNativeBatchScan operator. See Add native_datafusion V2 DataSource API reader #3481. Will not support DPP, so pretty quick to implement.
  • Minor refactor tasks: consider renaming these (CometNativeScan, CometNativeBatchScan) since we now have other scan types. Don't wrap them in CometScan or CometBatchScan. This might allow us to collapse our CometScanRule and CometExecRule. Part of the larger rule refactoring discussed above.

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions