Skip to content

fix: Use table schema rather than file schema when projecting properties#20174

Draft
alamb wants to merge 3 commits intoapache:mainfrom
alamb:alamb/fix_properties_bug
Draft

fix: Use table schema rather than file schema when projecting properties#20174
alamb wants to merge 3 commits intoapache:mainfrom
alamb:alamb/fix_properties_bug

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Feb 5, 2026

Which issue does this PR close?

Rationale for this change

This is a regression we found in our unit tests.

When upgrading to DataFusion 52 we hit a bug in our test cases where pre-sorted data was being resorted

What changes are included in this PR?

Fix the bug

Are these changes tested?

Yes

I tried for a while with codex to get a .slt test purely via SQL but was not successful. To trigger the bug the output_ordering needs to not yet be projected before eq_properties() runs (that’s the only case the project_orderings(...) fix addresses). The SQL planner seems to always build output_ordering in the base schema, so the bug doesn’t manifest.

Here is what I tried

Details

# Create a table ordered by (a, b, c) using inline data. A filtered ORDER BY on b
# should not introduce an extra SortExec after projection reorders the columns.
statement ok
COPY (
  VALUES
    (1, 1, 1),
    (1, 1, 2),
    (1, 2, 1),
    (1, 2, 3),
    (2, 1, 1),
    (2, 1, 2),
    (2, 2, 1)
) TO 'test_files/scratch/order/ordered_abc/part-0.csv'
STORED AS CSV;

statement ok
CREATE EXTERNAL TABLE ordered_abc (
  a INT,
  b INT,
  c INT
)
STORED AS CSV
WITH ORDER (a ASC, b ASC, c ASC)
LOCATION 'test_files/scratch/order/ordered_abc/'
OPTIONS ('format.has_header' 'false');

query TT
EXPLAIN SELECT b, a FROM ordered_abc WHERE a = 1 ORDER BY b;
----
logical_plan
01)Sort: ordered_abc.b ASC NULLS LAST
02)--Projection: ordered_abc.b, ordered_abc.a
03)----Filter: ordered_abc.a = Int32(1)
04)------TableScan: ordered_abc projection=[a, b], partial_filters=[ordered_abc.a = Int32(1)]
physical_plan
01)SortPreservingMergeExec: [b@0 ASC NULLS LAST]
02)--ProjectionExec: expr=[b@1 as b, a@0 as a]
03)----FilterExec: a@0 = 1
04)------RepartitionExec: partitioning=RoundRobinBatch(2), input_partitions=1, maintains_sort_order=true
05)--------DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/order/ordered_abc/part-0.csv]]}, projection=[a, b], output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], file_type=csv, has_header=false

statement ok
drop table ordered_abc;

Are there any user-facing changes?

@github-actions github-actions bot added the datasource Changes to the datasource crate label Feb 5, 2026
self.output_ordering.clone(),
)
.with_constraints(self.constraints.clone());
let orderings = project_orderings(&self.output_ordering, schema);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the fix (to call project_orderings) -- the rest is tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some more thought I am not sure if this is correct 🤔 I need to do some more research into when the projection is applied

@@ -685,11 +685,10 @@ impl DataSource for FileScanConfig {

fn eq_properties(&self) -> EquivalenceProperties {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed the equivalence properties are computed on demand each time during planning, which can be quite substantial. I'll see if I can find some way to avoid doing so as a follow on

@alamb alamb force-pushed the alamb/fix_properties_bug branch from 6554aac to 35209a9 Compare February 5, 2026 21:08
let object_store_url = ObjectStoreUrl::parse("test:///").unwrap();

let table_schema = TableSchema::new(Arc::clone(&file_schema), vec![]);
let table_schema = TableSchema::from(Arc::clone(&file_schema));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a driveby cleanup to reduce test duplication (codex actually copied a lot of this extra bloat into the new tests, so instead I went and cleaned it up at the source)

I can break it into a new PR if you prefer

self.output_ordering.clone(),
)
.with_constraints(self.constraints.clone());
let orderings = project_orderings(&self.output_ordering, schema);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some more thought I am not sure if this is correct 🤔 I need to do some more research into when the projection is applied

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Existing file ordering is lost in some cases with projections

1 participant