While working on the sqreen.io Python agent, I discovered a rather nasty, but fun to analyse bug that lead me deep into Python internals.
The nasty bug
I was working on a piece of code that spawns a thread, retrieves some JSON, and stores it for later use. Basically, the code looked something like this:
from threading import Thread, Event
class MyThread(Thread):
def __init__(self):
self.event = Event()
def run(self):
data = self.retrieve_json()
self.data = data
self.event.set()
print("Data retrieval done!")
while True:
pass
thread = MyThread()
thread.start()
timeout = thread.event.wait(5)
if timeout is not None:
print("JSON retrieval was too slow")
Everything seemed to work, and the error message displayed correctly when network latency was too high:
$> python2 app.py
Data retrieval done!
Next, I wanted to convert the dict keys and values from unicode to latin-1 strings:
class MyThread(Thread):
def run(self):
data = self.retrieve_json()
# Convert unicode strings into latin1 strings because of WSGI
self.data = {key.encode('latin1'): value.encode('latin1') for (key, value) in data.items()}
self.event.set()
print("Data retrieval done!")
while True:
pass
I’m pretty sure you’ve all made this type of transformation once when receiving JSON, haven’t you?
After adding this code, the error message started showing up no matter what value I gave to timeout, anywhere from 5 seconds or 1 hour. What made this even weirder was that the “Data retrieval done!” message always printed right after the error message:
$> python2 app.py
JSON retrieval was too slow
Data retrieval done!
However, when I ran this in Python 3 everything worked as expected:
$> python3 app.py
Data retrieval done!
Something very strange was going on, and I decided to investigate.
The isolation
I tried to pinpoint the problem and came up with this minimal reproducing script, let’s call it str_encode_thread.py
:
import time
import threading
def to_latin_1(event, value):
print(time.time(), "Before encode")
value.encode('latin-1')
print(time.time(), "After encode")
event.set()
print(time.time(), "Event set")
event = threading.Event()
thread = threading.Thread(target=to_latin_1, args=(event, u'42'))
thread.start()
event.wait(10)
If you launch this script with Python 3, you’ll see:
$> python3 str_encode_thread.py
1467046041.643346 Before encode
1467046041.643446 After encode
1467046041.643483 Event set
And with Python 2, you’ll see:
$> python2 str_encode_thread.py
(1467046099.495261, 'Before encode')
(1467046099.495937, 'After encode')
(1467046099.495961, 'Event set')
So far, so good. The fun comes when we try to import this script. If you import the script with Python 3, everything will still be normal:
$> time python3 -c "import str_encode_thread"
1467046239.921475 Before encode
1467046239.921578 After encode
1467046239.921602 Event set
python3 -c "import str_encode_thread" 0,05s user 0,01s system 84% cpu 0,081 total
When imported in Python 2 however our nasty friend shows up again:
$> time python2 -c "import str_encode_thread"
(1467046276.826755, 'Before encode')
(1467046286.830485, 'After encode') # <- 10 seconds (╯°□°)╯︵ ┻━┻
(1467046286.830519, 'Event set')
python2 -c "import str_encode_thread" 0,03s user 0,03s system 0% cpu 10,109 total
If you look closely at the After encode
message, you can see that it’s printed 10 seconds after the Before encode
message, indicating that the thread is somehow freezing.
Down the rabbit hole
Why would a simple value.encode('latin1')
block a thread?
Fortunately, there is a very nice package named faulthandler which is very helpful in such a situation. It can display a traceback on unix signals, user signals or python faults. It’s part of the standard library since Python 3.3, and can be installed via pip in older Python versions.
In our case, we will use dump_traceback_later
to see where the tread is blocking.
The new script with faulthandler is not that different (don’t forget to pip install faulthandler
):
import time
import threading
import faulthandler
faulthandler.dump_traceback_later(5)
def to_latin_1(event, value):
print(time.time(), "Before encode")
value.encode('latin-1')
print(time.time(), "After encode")
event.set()
print(time.time(), "Event set")
event = threading.Event()
thread = threading.Thread(target=to_latin_1, args=(event, u'42'))
thread.start()
event.wait(10)
Let’s run our new script:
$> time python2 -c "import str_encode_thread"
(1468083195.13011, 'Before encode')
Timeout (0:00:05)!
Thread 0x0000700000401000 (most recent call first):
File "/usr/lib/python2.7/encodings/__init__.py", line 100 in search_function
File "str_encode_thread.py", line 11 in to_latin_1
File "/usr/lib/python2.7/threading.py", line 754 in run
File "/usr/lib/python2.7/threading.py", line 801 in __bootstrap_inner
File "/usr/lib/python2.7/threading.py", line 774 in __bootstrap
Current thread 0x00007fff71258000 (most recent call first):
File "/usr/lib/python2.7/threading.py", line 359 in wait
File "/usr/lib/python2.7/threading.py", line 614 in wait
File "str_encode_thread.py", line 20 in <module>
File "<string>", line 1 in <module>
(1468083205.133041, 'After encode')
(1468083205.133106, 'Event set')
python2 -c "import str_encode_thread" 0,03s user 0,02s system 0% cpu 10,043 total
What can we see with the traceback?
The main thread 0x00007fff71258000
, is blocked on line 20 of our module which is event.wait(10)
, that’s perfectly normal.
The other thread 0x0000700000401000
, is running on line 11 of our module value.encode('latin-1')
, and calling the function search_function
of the builtin module encoding
which freezes our thread. What does this function do and why is it blocking?
The source code tells us that it tries to import a module which begins with encodings.
.
In [1]: import sys
In [2]: before_modules = sys.modules.keys()
In [3]: '42'.encode('latin-1')
Out[3]: '42'
In [4]: after_modules = sys.modules.keys()
In [5]: print("Imported modules", set(after_modules) - set(before_modules))
('Imported modules', set(['encodings.latin_1']))
Gotcha! str.encode(X)
tries to import the module encodings.X
. So the fix should be easy: import the module in the main thread before using the function, and it should work, right?
import time
import threading
import encodings.latin_1
def to_latin_1(event, value):
print(time.time(), "Before encode")
value.encode('latin-1')
print(time.time(), "After encode")
event.set()
print(time.time(), "Event set")
event = threading.Event()
thread = threading.Thread(target=to_latin_1, args=(event, u'42'))
thread.start()
event.wait(10)
Let’s try this again with python2:
$> time python2 -c "import str_encode_thread"
(1467049098.995541, 'Before encode')
(1467049109.000293, 'After encode')
(1467049109.000324, 'Event set')
python2 -c "import str_encode_thread" 0,03s user 0,02s system 0% cpu 10,054 total
We still have the same problem! What’s wrong?
We have to go deeper
We know that str.encode
tries to import a module which is likely freezing our thread.
At this point, I went to my favorite IRC channel (#python-fr) and asked for help, and we finally found out!
What does import module
do? If we trust the Python documentation, it basically does this:
import imp
import sys
def __import__(name, globals=None, locals=None, fromlist=None):
# Fast path: see if the module has already been imported.
try:
return sys.modules[name]
except KeyError:
pass
# If any of the following calls raises an exception,
# there's a problem we can't handle -- let the caller handle it.
fp, pathname, description = imp.find_module(name)
try:
return imp.load_module(name, fp, pathname, description)
finally:
# Since we may exit via an exception, close fp explicitly.
if fp:
fp.close()
Nothing very fancy there: return the module from sys.modules
if it has already been imported or try to locate the source file for the module, load it and return it.
Nothing should block here either, but if we continue to read the imp
module documentation, we can find a very interesting function named lock_held
:
On multithreaded platforms, a thread executing an import holds an internal lock until the import is complete. This lock blocks other threads from doing an import until the original import completes, which in turn prevents other threads from seeing incomplete module objects constructed by the original thread while in the process of completing its import (and the imports, if any, triggered by that).
We are getting closer…
Locks and threads, a love-hate relationship
When googling python import lock
, we can quickly find this section about Importing in threaded code which simply tells you that it’s a terrible idea, and is likely to create deadlocks.
Let’s sum up what we’ve found:
import imp
import time
import threading
def to_latin_1(event, value):
print(time.time(), "Before encode", threading.current_thread().name, imp.lock_held())
value.encode('latin-1')
print(time.time(), "After encode")
event.set()
print(time.time(), "Event set")
event = threading.Event()
thread = threading.Thread(target=to_latin_1, args=(event, u'42'))
thread.start()
event.wait(10)
$> python2 -c "import str_encode_thread"
(1467050582.565935, 'Before encode', 'Thread-1', True)
(1467050592.567373, 'After encode')
(1467050592.567418, 'Event set')
When we imported our module str_encode_thread
, the import lock was acquired by the MainThread, so when our custom thread tried to execute the value.encode('latin-1')
the stdlib tried to import the encoding.latin_1
module which is blocked since the import lock is already held by the MainThread. We then block the MainThread by doing event.wait(10)
and then both threads are blocked for 10 seconds, too bad… While the wait times out, the import str_encode_thread
ends, releasing the import lock which wakes up our custom thread.
If we hadn’t set a maximum timeout for the event, we could have deadlocked our Python interpreter forever!
When dealing with threads, there is no magic solution. You might wonder what would have happened if you manually imported a module without using a lock, but it would only “solve” the deadlock by raising additional fun bugs.
For example, when you call time.strptime
, it tries to import the _strptime
module in a non-safe manner, which may trigger some hard-to-debug issues, like this boto issue and the corresponding Python bug page. This problem can be solved by doing import _strptime
earlier in the main thread so the thread won’t block due to the lock or see an incomplete module.
Python 3 is the future
But why did it work in Python 3? In addition to breaking all your import names and forcing you to put .encode
and .decode
everywhere in your code (which is a good thing!), the import lock was refactored in Python 3, and global import locking is now handled per-module. You’re now less likely to trigger the problem.
We should all use Python 3
It may have been a very nasty bug, but it was pretty fun to dig into one of the lesser known mechanisms of Python.
TL;DR: In Python 2, there is a single reentrant lock that will block your threads if they try to import something while the lock is held by another thread. In Python 3, the import lock is now per module.