-
Notifications
You must be signed in to change notification settings - Fork 70
fix test failures on 1.12, avoid race condition in multithreaded partitioned writes #582
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #582 +/- ##
==========================================
- Coverage 87.43% 86.98% -0.46%
==========================================
Files 26 27 +1
Lines 3288 3396 +108
==========================================
+ Hits 2875 2954 +79
- Misses 413 442 +29 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| x = x.data | ||
| len = length(x) | ||
| validity = ValidityBitmap(x) | ||
| # XXX This is a race condition if two workers hit this block at the same time, then they'll create |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@quinnj I think there is a race condition baked into the current architecture that can't be addressed without a very large refactoring. The current architecture creates the locks on a worker thread if they don't already exist, which means that threads are competing for the creation of the initial lock. The locks should be created before any tasks are spawned.
| using FilePathsBase | ||
| using DataFrames | ||
| import Random: randstring | ||
| using TestSetExtensions: ExtendedTestSet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given how long the Arrow tests take, it's useful to have some indication of progress so that we can tell if tests have hung. ExtendedTestSet shows a . for each completed test.
(We also get colored diffs of arrays when tests fail, which is nice.)
|
@kou @ericphanson Could I get review on this please? It's not a pretty solution, but it works. |
ericphanson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think correct writing of multiply partitioned files is better than fast-but-wrong so this seems like an improvement. In general it looks like there's a lot to cleanup here with the threading, e.g. we shouldn't really be spawning tasks without holding onto them so we can fetch them and recover any errors they threw. We should prefer structured concurrency primitives like threaded map / asyncmap etc. But that's unrelated to this PR.
|
I'll merge this in a few days if nobody has more comments. We can try 2.8.4 RC2 release with this, right? See also: The 2.8.4 RC1 vote: https://lists.apache.org/thread/7g3wr39wlbs8dj08hpb87mf9z2mlqrft |
@kou Yes, exactly. 😄 |
There is race condition fundamental to the current architecture for creating and writing dictionary encodings. The relevant lock is created on a worker thread and thus there is a race to create the lock and initialize the relevant data structure. This race condition has existed for a long time and consistently occurs when testing on 1.12, but I have occasionally been able to see it occur on Julia 1.10.
Reworking this goes well beyond what I currently have time for, so I have simply disabled multithreaded writing as a stopgap. This may seem extreme, but:
Typefrom Base on Julia 1.12 #543 and addresses another source of potential correctness issues on Julia 1.12+.