-
Notifications
You must be signed in to change notification settings - Fork 273
feat: implement coarse versioning for future matching against #4590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/gemini review |
|
|
||
| # Matches RPM versions: optional epoch, alternating alphanumeric segments. | ||
| rpm_version_strategy = st.from_regex( | ||
| re.compile(r'^([0-9]+:)?(([0-9]+|[A-Za-z]+)((?![0-9A-Za-z])[ -~])*)+$', |
Check failure
Code scanning / CodeQL
Inefficient regular expression High
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hypothesis uses the regex to generate patterns. It doesn't match with them afaik.
|
|
||
| # Matches RPM versions: optional epoch, alternating alphanumeric segments. | ||
| rpm_version_strategy = st.from_regex( | ||
| re.compile(r'^([0-9]+:)?(([0-9]+|[A-Za-z]+)((?![0-9A-Za-z])[ -~])*)+$', |
Check failure
Code scanning / CodeQL
Inefficient regular expression High
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hypothesis uses the regex to generate patterns. It doesn't match with them afaik.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This is a substantial and well-executed pull request that introduces a coarse versioning scheme to optimize database queries. The implementation is thorough, covering a wide range of ecosystems, and the addition of property-based testing with hypothesis is a fantastic way to ensure the correctness of the version ordering logic. The refactoring of _sort_key methods to raise ValueError on invalid versions is a solid design improvement. I've identified one potential issue in the Debian coarse versioning logic and a minor opportunity for optimization in osv/models.py. Overall, this is an excellent contribution.
| parts = re.findall(r'^$|\d+|\D+', v) | ||
| int_parts = [] | ||
| for couple in batched(parts, 2): | ||
| if not couple[0].isdecimal(): | ||
| # This is probably handled by is_valid | ||
| break | ||
| int_parts.append(int(couple[0])) | ||
| if len(couple) == 1: | ||
| break | ||
| sep = couple[1] | ||
| if sep == '.': | ||
| continue | ||
| if sep[0] == '.': | ||
| # Bigger than the max int, so we overflow | ||
| int_parts.append(9999999999) | ||
| break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for parsing Debian versions in coarse_version appears to be incorrect for versions containing alphabetic characters. For an input like '10.20.a30', re.findall(r'^$|\d+|\D+', v) will produce ['10', '.', '20', '.', 'a', '30']. The loop processing these parts will break when it encounters the non-decimal part 'a'. This results in int_parts being [10, 20], which then produces a coarse version ending in .00000000.
However, the corresponding test expects a result ending in .99999999, implying that the alphabetic character should cause an 'overflow' to a maximal value for that component to preserve ordering. The current implementation doesn't seem to handle this case; it only considers special separators for overflow, not alphabetic components in the version string itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For an input like
'10.20.a30',re.findall(r'^$|\d+|\D+', v)will produce['10', '.', '20', '.', 'a', '30'].
Um, no:
>>> re.findall(r'^$|\d+|\D+', '10.20.a30')
['10', '.', '20', '.a', '30']which would find its way to the int_parts.append(9999999999) line
osv/ecosystems/ecosystems_base.py
Outdated
| while len(components) < 3: | ||
| components.append(pad_value) | ||
|
|
||
| return f'00:{components[0]:08d}.{components[1]:08d}.{components[2]:08d}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, let's make this construction a function that just takes in 3 int args. then we can replace 374, 337, and 300 with this func.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made coarse_version_from_ints be the generic construction function for this, and consolidated some of the logic into it.
osv/models.py
Outdated
|
|
||
| _MAX_GIT_VERSIONS_TO_INDEX = 5000 | ||
|
|
||
| MIN_COARSE_VERSION = '00:00000000.00000000.00000000' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can use the helper function suggested in ecosystems.
another-rex
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Implemented a string-comparable encoding of ecosystem versions to allow us to do filtering on the database queries.
This should help reduce the number of entities needed to fetch and compare in many ecosystems.
I've also slightly changed the behaviour of
_sort_key, which previously was returning a 'maximal' value for invalid versions. Now it should raise aValueError, and thesort_keywrapper (which was handling0values) converts those into maximal version objects.I've added
hypothesisto do some fuzzing tests to make sure the logic retains the ordering rules of the ecosystems. (Please let me know if after this is merged you run into more failure cases so I can fix them). Fuzzing also helped identify some potential uncaught errors in our version parsing (particularly, usingisdigitinstead ofisdecimal), which I've fixed where applicable.I need to write a script to populate the existing AffectedVersions entities with the newly generated values, then work on making the API use the coarse versions for querying.
We have some PRs open on adding new ecosystems - they don't necessarily need the coarse version method, but I'm happy to merge those first then add the code for this later.