-
Notifications
You must be signed in to change notification settings - Fork 799
Description
Describe your environment
OS: (e.g, Ubuntu) RHEL8
Python version: (e.g., Python 3.9.10) python 3.11
SDK version: (e.g., 1.25.0) - 1.25.0
API version: (e.g., 1.25.0) - 1.25.0
What happened?
While using multiple threads with each thread trying to create a metric, I run into the race condition raising the KeyError
Why it manifests as KeyError: one thread marks an instrument ID as “already created” before the SDK has finished constructing and storing the actual instrument object. A second thread then tries to look up the instrument by ID and hits a missing dict entry.
This is a classic TOCTOU race (“time-of-check to time-of-use”) caused by split-brain state between two separate registries:
Registry A (API layer): stores a set of instrument IDs (“have we created this before?”)
Registry B (SDK layer): a dict mapping instrument ID → instrument instance (“here is the created object”)
Because these are two different objects under different lock context, one thread can “publish” the fact that an instrument exists (by inserting into the set) before it “publishes” the actual instrument instance (by inserting into the dict). Another thread then sees “already registered” and immediately does a dict lookup that can legitimately be missing at that moment → KeyError.
What should have been done is to maintain a single source of truth (one registry) and do a lock-protected get-or-create on that one structure, so “existence” and “retrievability” are the same invariant.
Steps to Reproduce
Here is a sample snippet: But unfortunately, this happens only in highly loaded machine
diff --git a/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py b/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py
index 6d8d5dc..693ee3b 100644
--- a/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py
+++ b/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py
@@ -226,7 +226,8 @@ class Meter(APIMeter):
)
with self._instrument_id_instrument_lock:
return self._instrument_id_instrument[instrument_id]
-
+ import time
+ time.sleep(1)
instrument = _Histogram(
name,
self._instrumentation_scope,
In [1]: import threading
...: from opentelemetry.sdk.metrics import MeterProvider
...:
...: def main():
...: meter = MeterProvider().get_meter("race-demo")
...:
...: N = 100
...: label = "py.test.foo.shiva.logical_equal_duration"
...:
...: barrier = threading.Barrier(N)
...: results = [None] * N
...: errors = [None] * N
...:
...: def worker(i: int):
...: try:
...: barrier.wait()
...: results[i] = meter.create_histogram(label)
...: except Exception as e:
...: errors[i] = e
...:
...: threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
...: for t in threads:
...: t.start()
...: for t in threads:
...: t.join()
...:
...: errs = [e for e in errors if e is not None]
...: if errs:
...: print("Errors:", {type(e).__name__ for e in errs})
...: raise errs[0]
...:
...: print("unique instrument objects:", len({id(x) for x in results}))
...: print("OK")
...:
...: if __name__ == "__main__":
...: main()
...:
An instrument with name py.test.foo.shiva.logical_equal_duration, type Histogram, unit and description has been created already.
An instrument with name py.test.foo.shiva.logical_equal_duration, type Histogram, unit and description has been created already.
An instrument with name py.test.foo.shiva.logical_equal_duration, type Histogram, unit and description has been created already.
...
...
Errors: {'KeyError'}
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[1], line 36
33 print("OK")
35 if __name__ == "__main__":
---> 36 main()
Cell In[1], line 30, in main()
28 if errs:
29 print("Errors:", {type(e).__name__ for e in errs})
---> 30 raise errs[0]
32 print("unique instrument objects:", len({id(x) for x in results}))
33 print("OK")
Cell In[1], line 17, in main.<locals>.worker(i)
15 try:
16 barrier.wait()
---> 17 results[i] = meter.create_histogram(label)
18 except Exception as e:
19 errors[i] = e
File /codemill/dhamodha/projects/open/opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py:228, in Meter.create_histogram(self, name, unit, description)
219 _logger.warning(
220 "An instrument with name %s, type %s, unit %s and "
221 "description %s has been created already.",
(...) 225 description,
226 )
227 with self._instrument_id_instrument_lock:
--> 228 return self._instrument_id_instrument[instrument_id]
229 import time
230 time.sleep(1)
KeyError: 'py.test.foo.shiva.logical_equal_duration,_Histogram,,'
Expected Result
I'd expect only warning saying that the instrument is already registered instead of KeyError
Actual Result
Errors: {'KeyError'}
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[1], line 36
33 print("OK")
35 if __name__ == "__main__":
---> 36 main()
Cell In[1], line 30, in main()
28 if errs:
29 print("Errors:", {type(e).__name__ for e in errs})
---> 30 raise errs[0]
32 print("unique instrument objects:", len({id(x) for x in results}))
33 print("OK")
Cell In[1], line 17, in main.<locals>.worker(i)
15 try:
16 barrier.wait()
---> 17 results[i] = meter.create_histogram(label)
18 except Exception as e:
19 errors[i] = e
File /codemill/dhamodha/projects/open/opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py:228, in Meter.create_histogram(self, name, unit, description)
219 _logger.warning(
220 "An instrument with name %s, type %s, unit %s and "
221 "description %s has been created already.",
(...) 225 description,
226 )
227 with self._instrument_id_instrument_lock:
--> 228 return self._instrument_id_instrument[instrument_id]
229 import time
230 time.sleep(1)
KeyError: 'py.test.foo.shiva.logical_equal_duration,_Histogram,,'
Additional context
No response
Would you like to implement a fix?
None
Tip
React with 👍 to help prioritize this issue. Please use comments to provide useful context, avoiding +1 or me too, to help us triage it. Learn more here.