A tale of inexplicable GDB racy FAILs

A few weeks ago I finally identified and addressed the origin of what was causing some inexplicable intermittent failures in a test I had added originally for all-stop-on-top-of-non-stop in the GDB buildbots for months: the testsuite sporadically kills the wrong process due to PID-reuse races — Fun! Read on for details of the investigation that led to identifying the culprit, and ultimately, the fix.

I’m proud of this fix, as the investigation was somewhat painful and spawned several months.

A while ago, I had added a new test to gdb that constantly spawns new short-lived threads, and then has GDB attach / detach to that process multiple times. That is attach-many-short-lived-threads.exp. At the time, that exposed a number of problems, both in GDB and in glibc’s libthread_db.so that triggered most frequently with all-stop-on-top-of-non-stop. Those were themselves hair-pulling painful to track, but that’s a different story.

Unfortunately, even after fixing the original problems that motivated the test, the buildbots kept showing that it still occasionally randomly failed. The failure logs usually made me suspect of a kernel / ptrace race/problem — ptrace attach would frequently fail saying the process is zombie, when it had just been started… I had tried finding the culprit in the kernel’s source before, but found nothing. Eventually, I shrugged, and for weeks (months?), I was just ignoring the problem.

That is, until I found a buildbot gdb.log showing that the test FAILed because the inferior process got a SIGTERM signal. Ah! At the same time, I was just discussing with Patrick Palka upstream making GDB be better at handling SIGTERM it itself gets, and I knew that there’s a test in the testsuite that spawns GDB and kills it with SIGTERM, in a loop, 50 times (gdb-sigterm.exp). How apropos. So I started suspecting that. Trying to catch the problem in action, I set up to run both the attach-many-short-lived-threads.exp and the gdb-sigterm.exp tests in a loop, in parallel, one in each terminal. And surprise, after a while (sometimes a long while), the attach-many-short-lived-threads.exp would indeed FAIL with a rogue SIGTERM.

But something was still not quite right… Looking at the test, I couldn’t really explain how gdb-sigterm.exp’s signals could end up in the wrong process.

But knowing that the problem was that something was killing the test process was a very good hint already.

In GDB’s testsuite, we had several tests that spawn a process in order for GDB to attach to it with the “attach” command. That process was spawned with Tcl “exec&”. After the test is done, we’d kill the process with “kill -9 $pid”. “Hmm. That’s suspicious.”, I thought. Studying Tcl’s docs and sources, I concluded that that’s a very bad pattern — with “exec&”, Tcl takes care of reaping the wait status in the background, so by the time you issue the “kill -9 $pid”, that PID might have been reused by another process already. I hacked extra logging to GDB’s testsuite, and indeed I saw that most of the time, that “kill” would fail with “error: no process”. Clearly a very dangerous thing to do! And it can well explain some of the racy crashes I was seeing.

So I fixed that by making all affected tests in the GDB testsuite use expect’s “spawn” instead of TCL’s “exec&”. With “spawn”, we control when to reap the process, so we can “kill -9” at will, because the kernel won’t reuse a PID until the exit status is reaped.

Here’s the GDB patch, if you’re curious:
[PATCH] testsuite: tcl exec& -> ‘kill -9 $pid’ is racy (attach-many-short-lived-thread.exp races and others).

Still, that didn’t explain the rogue SIGTERM… Back to staring at gdb-sigterm.exp.

I then played with SystemTap scripts that printed information of senders/receivers of SIGTERM signals, and still I couldn’t manage to map the PIDs of the signal sender to the gdb-sigterm.exp DejaGnu/runtest process tree… And then I noticed that running multiple instances of the attach-many-short-lived-threads.exp test in parallel alone would cause it to FAIL as well… SystemTap was showing that the process that was sending the signal was a “sh”, and its parent was … “init”. WTH!

So gdb-sigterm.exp was very much looking like a red herring.

Grepping around for other sources of rogue signals, I found this in DejaGnu:

    exec sh -c "exec > /dev/null 2>&1 && (kill -2 $pgid || kill -2 $pid) && sleep 5 && (kill $pgid || kill $pid) && sleep 5 && (kill -9 $pgid || kill -9 $pid) &"

Ah-ha!

Note the “kill $pgid || kill $pid” — recall that kill with no explicit signal switch/number means “kill with SIGTERM”.

I then spent a while improving the SystemTap scripts, and eventually clearly saw that this was indeed the culprit. The reason that the parent of the “sh” I was seeing before was “init” is that that shell command runs in the background, and by the time it was running and killing the wrong process, the parent DejaGnu/runtest had already gone…

Eventually I came up with a DejaGnu fix for that: [PATCH] DejaGnu kills the wrong process due to PID-reuse races.

And with that in place, all rogue signals are gone.

Ben Elliston pushed the patch to DejaGnu master the next day. Yay!

Turns out that that patch has been fixing several different racy failures. E.g., https://sourceware.org/ml/gdb-patches/2015-08/msg00443.html.

Pedro Alves | thinko's reign