@viirya commented Jan 9, 2026

Implement schema validation in project_with_partition to ensure the input schema matches the Iceberg table schema before calculating partition values. This prevents subtle bugs from schema mismatches and provides clear error messages when schemas don't match.

Changes:

  • Add helper functions to recursively strip metadata from Arrow schemas
  • Implement schema validation that compares input schema with expected Iceberg table schema, ignoring metadata differences
  • Add comprehensive tests for metadata stripping and schema validation
  • Closes #1752 (Validate input schema inside project for DataFusion)

The implementation follows the approach suggested in issue #1752:

  • Recursively visits schema and removes metadata from all fields
  • Compares cleaned schemas using Arrow's built-in equality operator
  • Returns helpful error messages showing both schemas on mismatch
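
For illustration, the three steps above can be sketched roughly as follows. This is a minimal stand-alone sketch, not the code in this PR: `DataType`, `Field`, and `Schema` are simplified, hypothetical stand-ins for the Arrow types, and `strip_metadata_from_field`, `strip_metadata_from_schema`, `validate_schema`, and the `field` helper are assumed names used only for this example.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-ins for Arrow's DataType, Field, and Schema.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<Field>),
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    metadata: BTreeMap<String, String>,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
    metadata: BTreeMap<String, String>,
}

// Convenience constructor for the example (nullable fields only).
fn field(name: &str, data_type: DataType, meta: &[(&str, &str)]) -> Field {
    let metadata = meta
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect();
    Field { name: name.to_string(), data_type, nullable: true, metadata }
}

// Step 1: recursively drop metadata from nested fields.
fn strip_metadata_from_datatype(data_type: &DataType) -> DataType {
    match data_type {
        DataType::List(item) => DataType::List(Box::new(strip_metadata_from_field(item))),
        DataType::Struct(fields) => {
            DataType::Struct(fields.iter().map(strip_metadata_from_field).collect())
        }
        primitive => primitive.clone(),
    }
}

fn strip_metadata_from_field(f: &Field) -> Field {
    Field {
        name: f.name.clone(),
        data_type: strip_metadata_from_datatype(&f.data_type),
        nullable: f.nullable,
        metadata: BTreeMap::new(),
    }
}

fn strip_metadata_from_schema(schema: &Schema) -> Schema {
    Schema {
        fields: schema.fields.iter().map(strip_metadata_from_field).collect(),
        metadata: BTreeMap::new(),
    }
}

// Steps 2 and 3: compare the cleaned schemas with `==` and report both on mismatch.
fn validate_schema(input: &Schema, expected: &Schema) -> Result<(), String> {
    let input_clean = strip_metadata_from_schema(input);
    let expected_clean = strip_metadata_from_schema(expected);
    if input_clean == expected_clean {
        Ok(())
    } else {
        Err(format!(
            "Input schema does not match table schema.\nExpected: {expected_clean:?}\nActual: {input_clean:?}"
        ))
    }
}
```

With these stand-ins, two schemas that differ only in field metadata (for example a field-id entry on one side) validate successfully, while a differing type, name, or nullability yields an error that prints both cleaned schemas.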

Which issue does this PR close?

  • Closes #.

What changes are included in this PR?

Are these changes tested?

@viirya viirya force-pushed the feat/datafusion-schema-validation branch from 4e6566f to 62ab07a Compare January 9, 2026 08:57
///
/// # Returns
/// A new Arrow data type with all metadata removed from nested structures
fn strip_metadata_from_datatype(data_type: &DataType) -> DataType {
Contributor

Two suggestions:

  1. Move this part to the arrow module. We plan to move arrow out of the core library, so it would be better to keep all arrow-related code in the same module.
  2. Use ArrowSchemaVisitor to do it.
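
To illustrate the second suggestion, here is a rough sketch of metadata stripping written as a post-order visitor. It is only a sketch under stated assumptions: the `TypeVisitor` trait, the `visit_type` driver, and the simplified `DataType` and `Field` types are hypothetical stand-ins, and the trait is not the actual `ArrowSchemaVisitor` signature.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-ins for Arrow's DataType and Field.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<Field>),
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    metadata: BTreeMap<String, String>,
}

// Hypothetical post-order visitor: children are visited first, and their
// results are handed to the callback for the enclosing type.
trait TypeVisitor {
    type T;
    fn primitive(&mut self, data_type: &DataType) -> Self::T;
    fn list(&mut self, element: &Field, element_result: Self::T) -> Self::T;
    fn struct_type(&mut self, fields: &[Field], results: Vec<Self::T>) -> Self::T;
}

// Driver that walks the type tree and dispatches to the visitor.
fn visit_type<V: TypeVisitor>(data_type: &DataType, visitor: &mut V) -> V::T {
    match data_type {
        DataType::List(element) => {
            let result = visit_type(&element.data_type, visitor);
            visitor.list(element, result)
        }
        DataType::Struct(fields) => {
            let mut results = Vec::with_capacity(fields.len());
            for f in fields {
                results.push(visit_type(&f.data_type, visitor));
            }
            visitor.struct_type(fields, results)
        }
        primitive => visitor.primitive(primitive),
    }
}

// Visitor that rebuilds each type with empty field metadata.
struct MetadataStripper;

impl TypeVisitor for MetadataStripper {
    type T = DataType;

    fn primitive(&mut self, data_type: &DataType) -> DataType {
        data_type.clone()
    }

    fn list(&mut self, element: &Field, element_result: DataType) -> DataType {
        DataType::List(Box::new(Field {
            name: element.name.clone(),
            data_type: element_result,
            metadata: BTreeMap::new(),
        }))
    }

    fn struct_type(&mut self, fields: &[Field], results: Vec<DataType>) -> DataType {
        let stripped = fields
            .iter()
            .zip(results)
            .map(|(f, data_type)| Field {
                name: f.name.clone(),
                data_type,
                metadata: BTreeMap::new(),
            })
            .collect();
        DataType::Struct(stripped)
    }
}
```

The traversal lives in one place (`visit_type`), so the stripping logic only has to say how to rebuild each node from its already-cleaned children.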

Member Author

Okay. I will update this today.

@viirya viirya force-pushed the feat/datafusion-schema-validation branch 3 times, most recently from d9b253a to a5b4783 Compare January 9, 2026 18:22
Implement schema validation in project_with_partition to ensure the input
schema matches the Iceberg table schema before calculating partition values.
This prevents subtle bugs from schema mismatches and provides clear error
messages when schemas don't match.

Changes:
- Add helper functions to recursively strip metadata from Arrow schemas
- Implement schema validation that compares input schema with expected
  Iceberg table schema, ignoring metadata differences
- Add comprehensive tests for metadata stripping and schema validation
- Closes apache#1752

The implementation follows the approach suggested in issue apache#1752:
- Recursively visits schema and removes metadata from all fields
- Compares cleaned schemas using Arrow's built-in equality operator
- Returns helpful error messages showing both schemas on mismatch

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@viirya viirya force-pushed the feat/datafusion-schema-validation branch from a5b4783 to d3a1c7a Compare January 9, 2026 22:07