GH-49186: [R] Support dplyr::filter_out() in Arrow dplyr backend #49256

Open
larry77 wants to merge 2 commits into apache:main from larry77:r-filter-out

@larry77 larry77 commented Feb 12, 2026

Rationale for this change

dplyr::filter_out() is a new dplyr function that is not yet implemented in the Arrow R dplyr backend.

What changes are included in this PR?

This PR adds support for dplyr::filter_out() in the Arrow R dplyr backend.

The implementation reuses the existing filter() machinery and extends
set_filters() with an exclude flag. When exclude = TRUE, the predicate
is transformed to match dplyr semantics (drop rows where predicate is TRUE,
keep rows where predicate is FALSE or NA).

Multiple filter_out() predicates are combined before exclusion so that
filter_out(a, b) matches dplyr semantics (i.e. drop rows where a & b is TRUE).
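
The semantics described above can be sketched in base R, without Arrow or dplyr. This is only an illustration under stated assumptions: `filter_out_sketch()` is a hypothetical helper, not the code in this PR, and it operates on an in-memory data frame rather than on a lazy Arrow query.

```r
# Hypothetical helper (not part of arrow or dplyr): drop rows where the
# combined predicate is TRUE; keep rows where it is FALSE or NA.
filter_out_sketch <- function(df, ...) {
  preds <- list(...)
  # Combine all predicates with `&` before excluding anything
  combined <- Reduce(`&`, preds)
  # `%in% TRUE` maps NA to FALSE, so only definite TRUE rows are dropped
  to_drop <- combined %in% TRUE
  df[!to_drop, , drop = FALSE]
}

df <- data.frame(x = c(1, 2, NA, 4), y = c("a", "b", "c", NA))

# Single predicate: the row with x = NA is kept because the predicate is NA
res1 <- filter_out_sketch(df, df$x > 1)

# Two predicates: only rows where both are TRUE are dropped
res2 <- filter_out_sketch(df, df$x > 1, df$y == "b")
```

In this sketch `res1` keeps the rows with `x = 1` and `x = NA`, while `res2` drops only the row where both predicates are TRUE (`x = 2`, `y = "b"`).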

This works for arrow_table(), RecordBatchReader, and open_dataset(), and
preserves lazy evaluation for larger-than-memory datasets.

Tests are added to verify basic behavior, NA handling, and multiple predicates.

Note: local test run hits a with_language() locale issue ('.cache' not found),
which appears environment-specific and unrelated to these changes.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, only the new filter_out() support.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@larry77 larry77 changed the title from "r: add support for dplyr::filter_out() in Arrow backend" to "GH-49257 : [R] Support dplyr::filter_out() in Arrow dplyr backend" Feb 12, 2026
@github-actions

⚠️ GitHub issue #49257 has been automatically assigned in GitHub to PR creator.

@larry77 larry77 changed the title from "GH-49257 : [R] Support dplyr::filter_out() in Arrow dplyr backend" to "GH-49257: [R] Support dplyr::filter_out() in Arrow dplyr backend" Feb 12, 2026
Member

@thisisnic thisisnic left a comment

Thanks for making this PR @larry77! This is looking good, a few notes from me:

  • Would you mind adding filter_out to supported_dplyr_methods in arrow/r/R/arrow-package.R?
  • Tests look good; I went to see if the core dplyr ones did much different but I think those cases you included cover everything.
  • The code in the body of filter_out() appears to be copied from filter(); could we instead extract out that code into its own function and then use it in both?

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting review" label Feb 13, 2026
@thisisnic thisisnic changed the title from "GH-49257: [R] Support dplyr::filter_out() in Arrow dplyr backend" to "GH-49186: [R] Support dplyr::filter_out() in Arrow dplyr backend" Feb 13, 2026
@github-actions

⚠️ GitHub issue #49186 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Feb 13, 2026
@larry77
Author

larry77 commented Feb 13, 2026

> Thanks for making this PR @larry77! This is looking good, a few notes from me:
>
> * Would you mind adding `filter_out` to `supported_dplyr_methods` in `arrow/r/R/arrow-package.R`?
> * Tests look good; I went to see if the core dplyr ones did much different but I think those cases you included cover everything.
> * The code in the body of `filter_out()` appears to be copied from `filter()`; could we instead extract out that code into its own function and then use it in both?

Thanks for the review!

– Added filter_out to supported_dplyr_methods in r/R/arrow-package.R.
– Refactored filter() and filter_out() to share a common implementation (no duplicated body).
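
For readers following along, the shape of that refactor can be sketched in base R. This is a rough illustration only: every name below is hypothetical and none of it is the actual arrow code, which builds a lazy query instead of subsetting a data frame. It assumes a single shared body controlled by an `exclude` flag, with both verbs as thin wrappers.

```r
# Hypothetical shared helper; the names are illustrative and are not the
# functions used in the arrow package.
filter_rows_sketch <- function(df, ..., exclude = FALSE) {
  combined <- Reduce(`&`, list(...))
  is_true <- combined %in% TRUE  # NA counts as "not TRUE"
  keep <- if (exclude) !is_true else is_true
  df[keep, , drop = FALSE]
}

# Thin wrappers: both verbs delegate to the same body
keep_rows_sketch <- function(df, ...) filter_rows_sketch(df, ..., exclude = FALSE)
drop_rows_sketch <- function(df, ...) filter_rows_sketch(df, ..., exclude = TRUE)

df <- data.frame(x = c(1, 2, NA))
kept <- keep_rows_sketch(df, df$x > 1)     # rows where the predicate is TRUE
dropped <- drop_rows_sketch(df, df$x > 1)  # rows where it is FALSE or NA
```

With this layout the two results partition the input: every row lands in exactly one of `kept` or `dropped`, including the row where the predicate is NA.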

Local tests pass aside from the existing locale-related .cache flake in the “missing column” test, which seems environment-specific.
