fix clean copyright by hust-nj · Pull Request #29 · togethercomputer/RedPajama-Data

hust-nj · 2023-04-27T16:00:02Z

I think there are two main problems with the current clean_copyright_comments function, defined in

RedPajama-Data/data_prep/github/github_clean_dedup_local.py

Line 27 in 567ac9a

def clean_copyright_comments(content: str):

First, it fails to remove the copyright notice from the C-style code shown below. The cause is the early return in

RedPajama-Data/data_prep/github/github_clean_dedup_local.py

Line 37 in 567ac9a

return content

Example input whose copyright line is not removed:

// Copyright

int main() {
    return 0;
    
    /* comment */
}
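To make the failure mode concrete, here is a minimal sketch of how an early return after a block-comment match can leave a leading `// Copyright` line in place. It is a simplified, hypothetical model of the control flow described above, not the repository's code: the function name `strip_with_early_return` and both regex patterns are assumptions chosen for illustration.

```python
import re

# Assumed patterns for illustration only; not taken from the repository.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
COPYRIGHT = re.compile(r"copyright", re.IGNORECASE)


def strip_with_early_return(content: str) -> str:
    """Hypothetical model: returns as soon as any block comment is found."""
    match = BLOCK_COMMENT.search(content)
    if match:
        start, end = match.span()
        # Only cut the block comment if it mentions copyright.
        if COPYRIGHT.search(content[start:end]):
            content = content[:start] + content[end:]
        # Early return: leading line comments are never examined.
        return content

    # Fallback path: drop leading line comments and blank lines.
    lines = content.split("\n")
    skip = 0
    for line in lines:
        if line.startswith("//") or not line:
            skip += 1
        else:
            break
    return "\n".join(lines[skip:])


SAMPLE = "// Copyright\n\nint main() {\n    return 0;\n\n    /* comment */\n}\n"
NO_BLOCK = "// Copyright\n\nint main() {}\n"
```

In this model, `SAMPLE` comes back unchanged because the unrelated `/* comment */` triggers the early return, while `NO_BLOCK` loses its header because the fallback path is reached.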

Second, in my experiments the regex sometimes takes a long time on large files. I think we only need to search for the copyright notice in the first 100 lines.
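One way to bound the search, sketched under assumptions: the helper name `head_mentions_copyright` and the `max_lines` parameter are hypothetical, and the default of 100 is just the figure proposed in this comment, not a validated threshold.

```python
import re

# Assumed pattern for illustration only.
COPYRIGHT = re.compile(r"copyright", re.IGNORECASE)


def head_mentions_copyright(content: str, max_lines: int = 100) -> bool:
    """Search for a copyright mention in the first `max_lines` lines only."""
    # maxsplit stops splitting after `max_lines` newlines, so the rest of a
    # large file stays in one unsplit tail piece that is never scanned.
    parts = content.split("\n", max_lines)
    head = "\n".join(parts[:max_lines])
    return COPYRIGHT.search(head) is not None


# A 150-line input whose only copyright mention is on the last line.
LATE_NOTICE = "\n".join(["x"] * 149 + ["// Copyright"])
```

With the default limit the late notice in `LATE_NOTICE` is not seen, which is the trade-off of any fixed cutoff; raising `max_lines` to 200 finds it.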

mauriceweber · 2023-05-02T15:12:54Z

Hi @hust-nj! Thanks for bringing this to our attention! I will review your PR ASAP.

mauriceweber · 2023-05-09T06:45:41Z

Hi @hust-nj, I had a look at your PR. Here's some feedback:

I would prefer not to limit the copyright search to the first 100 lines. What is the choice of 100 lines based on?
Your current implementation also removes comments at the beginning of any file, which we would like to keep. For example, this:

// A comment

int main() {
    return 0;
    
    /* comment */
}

yields

int main() {
    return 0;
    
    /* comment */
}
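A possible way to keep ordinary leading comments, sketched under assumptions: strip the leading comment block only when it actually mentions copyright. The function name `strip_leading_copyright_header`, the prefix list, and the pattern are hypothetical and are not the PR's or the repository's implementation.

```python
import re

# Assumed pattern and comment prefixes for illustration only.
COPYRIGHT = re.compile(r"copyright", re.IGNORECASE)
LINE_COMMENT_PREFIXES = ("//", "#", "--")


def strip_leading_copyright_header(content: str) -> str:
    """Drop the leading comment block only if it mentions copyright."""
    lines = content.split("\n")
    end = 0
    # Count leading lines that are line comments or blank.
    for line in lines:
        if line.startswith(LINE_COMMENT_PREFIXES) or not line.strip():
            end += 1
        else:
            break
    header = "\n".join(lines[:end])
    if COPYRIGHT.search(header):
        return "\n".join(lines[end:])
    # No copyright mention: keep the leading comments untouched.
    return content


PLAIN = "// A comment\n\nint main() {}\n"
LICENSED = "// Copyright 2023\n\nint main() {}\n"
```

Under this sketch `PLAIN` is returned unchanged, while `LICENSED` loses its header. Note that a header mixing a copyright line with other comments would be removed as a whole, so a finer-grained rule may still be needed.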

fix clean copyright

c09b93d

mauriceweber self-requested a review May 2, 2023 15:12

mauriceweber marked this pull request as draft May 9, 2023 06:50

hust-nj wants to merge 1 commit into togethercomputer:main from hust-nj:main
