
✨ feat(soft): detect and break stale locks #476

gaborbernat merged 5 commits into tox-dev:main from gaborbernat:56 on Feb 14, 2026



gaborbernat commented on Feb 13, 2026

When a process holding a SoftFileLock crashes, the lock file is left behind and blocks every other process forever.
This is especially painful on CI and in long-running daemons where manual cleanup is impractical.

The lock file now stores {pid}\n{hostname}\n on acquire. On contention (EEXIST), the competing process reads this
metadata, verifies the hostname matches, and probes whether the holding PID is still alive (os.kill(pid, 0) on Unix,
OpenProcess(SYNCHRONIZE) on Windows). If the holder is confirmed dead, the stale lock is broken via an atomic
rename + unlink sequence that avoids races between concurrent breakers.
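The flow described above can be sketched roughly as follows. This is a minimal, Unix-only illustration and not the code merged in this PR: the helper names `write_lock_info`, `pid_alive`, and `try_break_stale`, as well as the `.break.<pid>` suffix, are assumptions made for the example.

```python
import os
import socket
from pathlib import Path


def write_lock_info(lock_path: str) -> None:
    """Create the lock file exclusively and record "{pid}\\n{hostname}\\n"."""
    fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        os.write(fd, f"{os.getpid()}\n{socket.gethostname()}\n".encode())
    finally:
        os.close(fd)


def pid_alive(pid: int) -> bool:
    """Unix-only probe: signal 0 checks for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def try_break_stale(lock_path: str) -> bool:
    """Return True only if the holder was judged dead and the file was removed."""
    try:
        pid_text, hostname = Path(lock_path).read_text().splitlines()[:2]
        pid = int(pid_text)
    except (OSError, ValueError):
        return False  # empty, foreign, or vanished lock file: leave it alone
    if pid <= 0 or hostname != socket.gethostname() or pid_alive(pid):
        return False  # unverifiable, cross-host, or still alive
    # Rename to a name unique to this breaker first; of several concurrent
    # breakers, only one rename of the original path can succeed.
    break_path = f"{lock_path}.break.{os.getpid()}"  # hypothetical naming
    try:
        os.rename(lock_path, break_path)
    except OSError:
        return False  # another process already broke or released the lock
    os.unlink(break_path)
    return True
```

A gap remains between reading the metadata and renaming the file, so a fuller implementation would likely re-check the renamed file before unlinking it. The sketch omits that step for brevity.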

Stale detection is Unix/macOS only. On Windows, Python's C runtime (_wopen) cannot set FILE_SHARE_DELETE, so
any read handle on the lock file blocks DeleteFileW during release -- causing a livelock under threaded contention.
Windows already distinguishes EACCES (holder alive, fd open) from EEXIST (file exists, no active holder), and in
practice EEXIST resolves quickly as the releasing thread deletes the file. Cross-host stale locks are also left
untouched since PID liveness cannot be verified remotely.
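To illustrate the EACCES versus EEXIST distinction mentioned above, here is a hedged sketch of how an exclusive-create attempt might be classified. The function name `try_create` and its status strings are hypothetical, and the mapping of `PermissionError` to "holder alive" on Windows simply follows the description in this PR rather than anything verified here.

```python
from __future__ import annotations

import os
import sys


def try_create(lock_path: str) -> tuple[str, int | None]:
    """Attempt exclusive creation and classify the outcome.

    Returns ("acquired", fd), ("held", None) or ("exists", None).
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    try:
        fd = os.open(lock_path, flags)
    except FileExistsError:
        # EEXIST: the file is there, but nothing proves a live holder.
        return "exists", None
    except PermissionError:
        if sys.platform == "win32":
            # EACCES on Windows: per the description, the holder still has it open.
            return "held", None
        raise  # on Unix this is a genuine permission problem
    return "acquired", fd
```

Only the `"exists"` outcome would be a candidate for stale detection, and on Unix/macOS alone according to the description.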

SoftFileLock leaves orphaned lock files when the holding process
crashes. Other processes then block forever waiting for a lock that
will never be released.

Now the lock file stores the holder's PID and hostname. On contention,
if the holder is on the same host and its PID no longer exists, the
stale lock is atomically renamed away and removed, allowing acquisition
to proceed on the next retry. All detection errors are suppressed to
preserve backward compatibility with empty or foreign lock files.
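As an illustration of the error suppression mentioned above, a parser for the lock file content could treat anything unexpected as "no information" instead of raising. This is a sketch under that assumption, and `parse_lock_info` is a hypothetical name.

```python
from __future__ import annotations


def parse_lock_info(raw: bytes) -> tuple[int, str] | None:
    """Return (pid, hostname), or None when the content is not in the expected format."""
    try:
        # Expected layout: "{pid}\n{hostname}\n"
        pid_line, hostname = raw.decode("utf-8").split("\n")[:2]
        pid = int(pid_line)
    except (UnicodeDecodeError, ValueError):
        # Empty files, binary data, or locks written by other tools land here.
        return None
    if pid <= 0 or not hostname:
        return None
    return pid, hostname
```

Returning `None` lets the caller fall back to the old behaviour, which is to keep waiting, whenever the lock file was written by an older version or a different tool.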
gaborbernat enabled auto-merge (squash) February 13, 2026 23:31
gaborbernat force-pushed the 56 branch 7 times, most recently from 4a3c34f to 1d21380 February 14, 2026 00:30

On Windows, opening the lock file for reading during stale detection
blocked concurrent file deletion in _release, causing a livelock under
heavy threaded contention (100 threads × 100 iterations).

The fix uses CreateFileW with FILE_SHARE_DELETE when reading lock info
on Windows, allowing _release's unlink to succeed even while another
thread reads the file. On Unix, os.open/os.read/os.close is used
instead of Path.read_text for consistent low-level fd handling.
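For the Unix side, the low-level read described above might look like the sketch below. The function name `read_lock_bytes` and the 256-byte limit are assumptions for illustration. The Windows `CreateFileW` path is not shown.

```python
import os


def read_lock_bytes(lock_path: str, limit: int = 256) -> bytes:
    """Read up to `limit` bytes from the lock file using raw file descriptors."""
    fd = os.open(lock_path, os.O_RDONLY)
    try:
        # A single read suffices here: "{pid}\n{hostname}\n" is tiny.
        return os.read(fd, limit)
    finally:
        os.close(fd)  # always release the descriptor, even if the read fails
```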
The ty type checker does not narrow sys.platform across method
boundaries, so ctypes.windll in a separate _read_lock_info_win method
was flagged as unresolved on Linux. Inlined the Windows branch into
_read_lock_info under the sys.platform guard.

On Windows, even with FILE_SHARE_DELETE, a file marked for deletion
keeps its name visible until the last handle closes. With 100 threads
competing, there is always a reader holding a handle, which prevents
the file from fully disappearing and causes a livelock.

Skip stale detection entirely for lock files younger than 2 seconds.
During normal threaded contention files are sub-second old, while
genuinely stale locks persist much longer. This eliminates the read
handle contention without sacrificing stale lock recovery.
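The age gate described above could be sketched as follows. Measuring age via `st_mtime` is an assumption, as are the names `MIN_STALE_AGE_SECONDS` and `old_enough_to_inspect`. Only the 2-second threshold comes from the commit message.

```python
from __future__ import annotations

import os
import time

MIN_STALE_AGE_SECONDS = 2.0  # threshold stated in the commit message


def old_enough_to_inspect(lock_path: str, now: float | None = None) -> bool:
    """Return True only if the lock file is at least 2 seconds old."""
    try:
        mtime = os.stat(lock_path).st_mtime  # stat opens no handle on the file
    except OSError:
        return False  # file vanished: nothing to inspect
    current = time.time() if now is None else now
    return current - mtime >= MIN_STALE_AGE_SECONDS
```

Because the check runs before any read, young lock files under normal contention are never opened for reading at all.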
gaborbernat merged commit e35c3af into tox-dev:main on Feb 14, 2026
32 checks passed