GH-49186: [R] Support dplyr::filter_out() in Arrow dplyr backend #49256
larry77 wants to merge 2 commits into apache:main
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also:
thisisnic
left a comment
Thanks for making this PR @larry77! This is looking good, a few notes from me:
- Would you mind adding filter_out to supported_dplyr_methods in arrow/r/R/arrow-package.R?
- Tests look good; I went to see if the core dplyr ones did much different but I think those cases you included cover everything.
- The code in the body of filter_out() appears to be copied from filter(); could we instead extract out that code into its own function and then use it in both?
Thanks for the review! Added filter_out to supported_dplyr_methods in r/R/arrow-package.R. Local tests pass aside from the existing locale-related .cache flake in the “missing column” test, which seems environment-specific.
Rationale for this change
dplyr::filter_out() is a new dplyr function that the Arrow R dplyr backend does not yet implement.
What changes are included in this PR?
This PR adds support for dplyr::filter_out() in the Arrow R dplyr backend.
The implementation reuses the existing filter() machinery and extends
set_filters() with an exclude flag. When exclude = TRUE, the predicate is
transformed to match dplyr semantics (drop rows where the predicate is TRUE,
keep rows where the predicate is FALSE or NA).
Multiple filter_out() predicates are combined before exclusion so that
filter_out(a, b) matches dplyr semantics (i.e. drop rows where a & b is TRUE).
This works for arrow_table(), RecordBatchReader, and open_dataset(), and
preserves lazy evaluation for larger-than-memory datasets.
Tests are added to verify basic behavior, NA handling, and multiple predicates.
Note: local test run hits a with_language() locale issue ('.cache' not found),
which appears environment-specific and unrelated to these changes.
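For reviewers, the intended semantics can be sketched in base R, independent of Arrow. This is only an illustration of the rule described above (drop rows where the combined predicate is TRUE, keep rows where it is FALSE or NA); the helper keep_mask() and the sample data are hypothetical and are not the code added in this PR.

```r
# Illustrative data; column names and values are made up for this sketch.
df <- data.frame(
  x = c(1, 5, NA, 10),
  y = c("a", "b", "c", "d")
)

# Hypothetical helper (not part of arrow or dplyr): a row is kept when the
# predicate is FALSE or NA, and dropped only when it is TRUE.
keep_mask <- function(pred) {
  is.na(pred) | !pred
}

# Single predicate: rows with x > 3 are dropped, the NA row is kept.
single <- keep_mask(df$x > 3)

# Multiple predicates are combined with & before exclusion, so a row is
# dropped only when every predicate is TRUE for it.
combined <- keep_mask(df$x > 3 & df$y == "d")

df[single, ]    # rows 1 and 3
df[combined, ]  # rows 1, 2 and 3
```

Note that keeping NA rows is the opposite of filter() with a negated predicate, which would drop them; that difference is why the predicate needs an explicit transformation rather than a plain negation.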
Are these changes tested?
Yes
Are there any user-facing changes?
Yes, only the new function: dplyr::filter_out() now works in the Arrow dplyr backend.