Bug Solving 2: uWSGI boogaloo
May 5th, 2023
1252 words (~6 minutes)
Just another fun bug romp. I don’t have any overarching morals this time (unlike my last bug solving session), just a straight description of the debugging process. This one jumps around a lot, so I won’t quite explain everything here, though I hope to link to enough outside sources that it’s still helpful.
ALWeaver performance issues
At the beginning of April, we hosted LIT Con, the Suffolk LIT Lab legal tech conference. One of the talks showed off the Weaver, a tool that helps make prototype docassemble interviews quickly.
However, earlier in March, my colleague had mentioned that the Weaver crashed a few times when he tried to have all of the students in his class access it at once. If it crashed then, it was likely to crash when we did the same thing with ~75 people. We had a few ideas about what was causing the issue:
- we had recently added `spacy` to one of the libraries we were using, which loaded a huge (~1GB) NLP model into RAM. Maybe the RAM usage was skyrocketing?
- maybe the server just couldn’t handle that many people, and the CPU would get locked up?
- something completely different?
The first step was to reproduce the crash by running the Weaver with a lot of simultaneous users.
To start, I did the tests manually:
- I opened 4 browser windows to the locally running copy of the Weaver and tried to run everything.
- Went through once while watching cpu usage with `top`. `pdftoppm` does seem to take up some cpu time, but not a ton.
- Went through again 4 times, and 3 of them got a 502 error from nginx immediately. There’s this in the logs:
terminate called after throwing an instance of 'std::runtime_error'
what(): pybind11_object_dealloc(): Tried to deallocate unregistered instance!
DAMN ! worker 1 (pid: 570254) died, killed by signal 6 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 573197)
That doesn’t look like something I’d be able to find and fix easily. Hopefully it’s a one-off thing (spoiler alert: it was not).
`top` now shows this at rest:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
570184 www-data 20 0 1382508 722812 67940 S 0.0 2.9 0:06.79 celery
570185 www-data 20 0 1382512 722736 67924 S 0.0 2.9 0:06.91 celery
570183 www-data 20 0 1382508 722628 67936 S 0.0 2.9 0:06.90 celery
127112 www-data 20 0 1481444 641084 83916 S 0.0 2.6 6:23.12 uwsgi
573197 www-data 20 0 1748708 585700 14924 S 0.3 2.4 0:02.07 uwsgi
570151 www-data 20 0 711112 262160 44092 S 0.0 1.1 0:06.08 celery
570238 www-data 20 0 799316 149492 49956 S 0.0 0.6 0:05.17 python
570188 www-data 20 0 619076 128828 32628 S 0.3 0.5 0:05.00 celery
570075 rabbitmq 20 0 2652672 120676 60900 S 0.0 0.5 0:06.93 beam.smp
That means there are now 5 processes using more than 500MB of RAM each, most of them celery, the background task runner that docassemble uses. I assumed this was because of the `spacy` model being loaded. That’s a decent amount of memory pressure on a server with only 4GB of memory. So, I did a bit of refactoring to remove that dependency and move it to a different server, where the model would only be loaded once. After all the refactoring, the celery workers each use 220MB of RAM; this seems to be normal, as importing other docassemble Python modules (`docassemble_pattern.en`) can take up 190MB.
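The refactor itself isn’t very interesting, but the shape of it is worth sketching. This is not the actual Weaver code, and the service URL and response format are made up for illustration; the point is that each celery/uwsgi worker stops loading the model itself and instead asks one long-lived process that has already loaded it:

```python
# Rough sketch of the refactor, with a hypothetical NLP service endpoint.
# Before: every worker did `spacy.load(...)` at import time (~1GB each).
# After: workers send text to one server that loaded the model once.
import requests

NLP_SERVICE_URL = "https://nlp.example.com/parse"  # hypothetical endpoint

def parse_text(text: str) -> dict:
    """Ask the remote spacy service for a parse instead of loading the model locally."""
    resp = requests.post(NLP_SERVICE_URL, json={"text": text}, timeout=10)
    resp.raise_for_status()
    return resp.json()
```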
The main downside: it still crashes in the same place! I ran 4 windows through once, then started over, then once more, and all 4 windows got an immediate 502, same as before.
At this point, I decided to make a quick script that would use our integration tester, ALKiln, to run puppeteer (really just a Chrome window) through the Weaver. That script was pretty simple, and it made it much easier to start 6 of those at once and wait until everything crashed:
#!/usr/bin/bash
mydir=$RANDOM;
echo $mydir
mkdir $mydir;
cd $mydir;
cp /home/brycew/Developer/LITLab/code/runtime_config.json .
alkiln-run --sources /home/brycew/Developer/LITLab/code/docassemble-ALWeaver/docassemble/ALWeaver/data/sources/ > bash_console.txt &
cd ..
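To actually kick off several of those at once, I just launched the script repeatedly in the background. Something like this hypothetical launcher (assuming the script above is saved as `run_weaver.sh`) captures the idea:

```python
# Hypothetical launcher: start several copies of the ALKiln script in
# parallel and wait for all of them to finish (or for the server to crash).
import subprocess

NUM_RUNNERS = 6  # 6 simultaneous runs was enough to reproduce the crash

procs = [subprocess.Popen(["bash", "run_weaver.sh"]) for _ in range(NUM_RUNNERS)]
for proc in procs:
    proc.wait()
```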
At this point I saw something in the logs that I thought was useful:
Traceback (most recent call last):
...
File "/usr/share/docassemble/local3.10/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 747, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL error: decryption failed or bad record mac
[SQL: SELECT "user".id AS user_id, ... FROM "user" LEFT OUTER JOIN ... WHERE "user".id = %(pk_1)s]
[parameters: {'pk_1': 1}]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
...
File "/usr/share/docassemble/local3.10/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 666, in _revalidate_connection
self._invalid_transaction()
File "/usr/share/docassemble/local3.10/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 656, in _invalid_transaction
raise exc.PendingRollbackError(
sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. Please rollback() fully before proceeding (Background on this error at: https://sqlalche.me/e/20/8s2b)
I’d seen an error like this before, and Googling led to a blog post that matched our problem, but not exactly. The post describes uWSGI with more than one process; there’s only one uwsgi process in docassemble. So how is a second process getting the connection?
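For reference, the situation those posts describe is two forked processes talking over the same SSL-wrapped postgres connection, which garbles the encrypted stream and produces the “decryption failed or bad record mac” error. The usual fix is to have each worker throw away the connection pool it inherited and open its own connections. A minimal sketch of that generic pattern (not docassemble’s code; names here are placeholders):

```python
# Generic multi-process uWSGI fix from those blog posts (not docassemble's code):
# after forking, each worker drops the connection pool inherited from the
# master so it opens fresh connections instead of sharing sockets.
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

try:
    from uwsgidecorators import postfork

    @postfork
    def _reset_pool():
        # close=False leaves the parent's sockets alone; the child just
        # forgets about them and will lazily create its own connections.
        engine.dispose(close=False)
except ImportError:
    # Not running under uWSGI (e.g. local dev); nothing to reset.
    pass
```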
Looking around a bit more, these lines appear right before the above stack trace:
terminate called after throwing an instance of 'std::runtime_error'
what(): pybind11_object_dealloc(): Tried to deallocate unregistered instance!
DAMN ! worker 1 (pid: 666329) died, killed by signal 6 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 683455)
So the new theory of what’s happening is:
- docassemble creates SQLAlchemy connections in the main app
- something is throwing a `std::runtime_error`. It has to be from C++, the only language with that syntax and runtime error.
- after respawning, the new uWSGI worker is reusing the same SQLAlchemy connection, which causes the above error, and crashes until docassemble can recover
I tried, to no avail, to track down where the C++ error could be coming from; I couldn’t find pybind being used in uWSGI at all.
After sleeping on it, I realized that the error might not be in uWSGI at all! I looked through our interview code, checking which docassemble operations were being performed and which external libraries, if any, we or docassemble were calling that might use C++ code built with pybind. The only external library we were using was `pikepdf`, in `pdf_concatenate`.
I got stuck for a bit, looking around in the pybind11 source code and the `pikepdf` source, until I realized the other obvious part: `pybind11_object_dealloc` isn’t being called manually! It happens automatically when the object’s destructor is called!
I happened to look through the `pikepdf` documentation, and it suggests closing PDF objects when you’re done with them. This should have been obvious, but it wasn’t; our code, and quite a bit of code in docassemble, never closed the `pikepdf` objects. Even though the documentation only mentioned issues with references and copying pages around, I made a patch to close all of the PDF files after use. With that patch, it actually works much better now! Pre-patch, with 6 browser windows I could consistently cause the error; after the patch I can run up to 10 browsers, and they all complete safely (though a bit slowly).
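The patch itself isn’t reproduced here, but the general fix is the one the `pikepdf` docs recommend: treat every `Pdf` as something that needs to be closed, ideally via a context manager. A minimal sketch of concatenation done that way (not the actual docassemble patch):

```python
# Sketch of the general fix: close every pikepdf.Pdf once we're done with it,
# so the pybind11-backed objects are torn down deterministically.
import pikepdf

def concatenate_pdfs(paths: list[str], out_path: str) -> None:
    """Copy the pages of each input PDF into one output file, closing everything."""
    combined = pikepdf.new()
    try:
        for path in paths:
            # The context manager closes each source even if something throws.
            with pikepdf.open(path) as src:
                combined.pages.extend(src.pages)  # pages are copied into `combined`
        combined.save(out_path)
    finally:
        combined.close()
```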