Skip to content

nom parser instead of ad-hoc in examples#20122

Open
cj-zhukov wants to merge 5 commits intoapache:mainfrom
cj-zhukov:cj-zhukov/nom-parser-instead-of-ad-hoc-in-examples
Open

nom parser instead of ad-hoc in examples#20122
cj-zhukov wants to merge 5 commits intoapache:mainfrom
cj-zhukov:cj-zhukov/nom-parser-instead-of-ad-hoc-in-examples

Conversation

@cj-zhukov
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@cj-zhukov
Copy link
Contributor Author

High-Level Overview

This PR is an exploratory step to evaluate whether using a parser combinator library (nom) improves the clarity and robustness of the example documentation parsing logic.

In a previous PR #19750, the parsing of subcommands and example metadata in main.rs docs was implemented using ad-hoc string manipulation. While that approach works, this PR experiments with replacing that logic using nom for two functions:

  • parse_subcommand_line
  • parse_metadata_line

Personally, I found the nom-based implementation easier to read, reason about, and maintain. Expressing the grammar declaratively with a parser tool feels more natural for this kind of structured input, and the intent of the parsing logic is clearer compared to manual string slicing and conditionals.

That said, this PR is intentionally limited in scope. nom is currently used only for these two parsing helpers, and introducing a new dependency for such a narrow use case may not be justified on its own. The main open question is whether DataFusion would benefit from using nom more broadly for similar parsing tasks in the future.

If the project sees value in adopting nom for other parsing needs, this PR could serve as a small, contained starting point. Otherwise, it may be reasonable to stick with the existing ad-hoc approach to avoid dependency overhead.

Feedback on whether this trade-off is worthwhile is very welcome.

@cj-zhukov
Copy link
Contributor Author

I'd like to keep the parser simple for now. Currently, it can't handle extra symbols like () in the description of an example. In practice, only one group udf has this case, so I updated its README to remove the parentheses.

I'm happy to improve the parser in the future to handle such cases more robustly if needed. For now, this keeps the code readable and avoids unnecessary complexity.

@cj-zhukov
Copy link
Contributor Author

@Jefffrey since you helped with previous PRs related to example docs generation, it would be great if you could take a look at this one as well. Your feedback or any improvements would be much appreciated.

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cj-zhukov I have some feeling the examples_docs can be renamed to reflect it is a parser for examples, perhaps it would be good to split into smaller utility files as well. But this can be addressed in the following PR.

One thing to note is: nom not very actively developing

| udwf | [`udf/simple_udwf.rs`](examples/udf/simple_udwf.rs) | Simple UDWF example |
| Subcommand | File Path | Description |
| ---------- | ------------------------------------------------------- | --------------------------------------------- |
| adv_udaf | [`udf/advanced_udaf.rs`](examples/udf/advanced_udaf.rs) | Advanced User Defined Aggregate Function UDAF |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the braces were nice since it was just highlighting the abbreviation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - I agree that supporting extra symbols like parentheses in the description would be better.

I’ll update the parser in this PR to handle that case more robustly.

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to see we use nom to do the work. (I had some good experiences using it to build sql parser years ago)

@xudong963
Copy link
Member

nom not very actively developing

IMO, It has become relatively mature.

@cj-zhukov
Copy link
Contributor Author

Thanks @cj-zhukov I have some feeling the examples_docs can be renamed to reflect it is a parser for examples, perhaps it would be good to split into smaller utility files as well. But this can be addressed in the following PR.

One thing to note is: nom not very actively developing

Thanks for the feedback! I agree that examples_docs could be renamed to better reflect that it’s focused on parsing example metadata, and that the code could be split into smaller, more focused utility modules.

To keep this PR scoped and easy to review, I’d prefer to address renaming and refactoring in a follow-up PR. I’ll open one specifically for improving naming and structure.

Thanks also for the note about nom - I’m keeping its usage minimal here so it should be easy to adjust if needed in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants