Listing active signal handlers (or dispositions) with GDB

A colleague at Red Hat asked whether it is possible to get GDB to list the currently active signal handlers, a.k.a. the signal dispositions: the actions the inferior process takes on receipt of each specific signal.

You could do it manually, like this, once for each signal.  Here it's done for SIGINT (which is signal number 2):


(gdb) set $p = (struct sigaction *) malloc(sizeof (struct sigaction))
(gdb) print sigaction(2, 0, $p)
$1 = 0
(gdb) print $p->__sigaction_handler.sa_handler
$2 = (__sighandler_t) 0x3ff797bb200 <g_unix_signal_handler>

but that’s… rather cumbersome.  Read on for something nicer.
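For reference, here is a minimal standalone C sketch of what those GDB commands do under the hood: passing NULL as the new action makes sigaction() a pure query that reports the current disposition without changing anything.

#include <signal.h>
#include <stdio.h>

int
main (void)
{
  struct sigaction oldact;

  /* With a NULL new action, sigaction() only reads the current
     disposition of SIGINT; nothing is installed.  */
  if (sigaction (SIGINT, NULL, &oldact) == 0)
    {
      if (oldact.sa_handler == SIG_DFL)
        puts ("SIGINT: SIG_DFL");
      else if (oldact.sa_handler == SIG_IGN)
        puts ("SIGINT: SIG_IGN");
      else
        /* Casting a function pointer to void * for printing is a
           common POSIX-ism, not strictly portable ISO C.  */
        printf ("SIGINT: handler at %p\n", (void *) oldact.sa_handler);
    }
  return 0;
}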

Knowing what action a signal will take when it is received is important information. For example, it’s a common problem to have a library that “steals” a signal from the main program, or to have libraries with conflicting signal handlers. E.g., toolkits such as Qt and language interpreter libraries such as Python often want to install a SIGCHLD handler to track the lifetime of processes they spawn, or SIGINT handlers to handle Ctrl-C gracefully, etc. GDB itself has tripped on this problem internally. See, for example, Bug 14382 – gdb hangs after plotting with matplotlib.

Answer the question already!


Back to the original question, the short answer is “no”: GDB currently does not have a built-in command to list the currently registered signal handlers. I’m not aware of any way for a debugger to get this information out of the Linux kernel directly, actually.

However, we can still script the sigaction calls above and wrap it all in a nice user-friendly command. I whipped up something quickly for my colleague, using GDB/CLI scripting. You can find the script here:

https://github.com/palves/misc/blob/master/gdb/signals.gdb

This adds a new info signal-dispositions command to GDB.  Download it somewhere and source it from your ~/.gdbinit to make it always handy and available.

Example output (of gdb debugging itself, on x86-64 Fedora):

(gdb) info signal-dispositions
Number  Name       Description               Disposition
1       SIGHUP     Hangup                    handle_sighup(int) in section .text of build/gdb/gdb
2       SIGINT     Interrupt                 rl_signal_handler in section .text of build/gdb/gdb
3       SIGQUIT    Quit                      rl_signal_handler in section .text of build/gdb/gdb
4       SIGILL     Illegal instruction       SIG_DFL
5       SIGTRAP    Trace/breakpoint trap     SIG_DFL
6       SIGABRT    Aborted                   SIG_DFL
7       SIGBUS     Bus error                 SIG_DFL
8       SIGFPE     Floating point exception  handle_sigfpe(int) in section .text of build/gdb/gdb
9       SIGKILL    Killed                    SIG_DFL
10      SIGUSR1    User defined signal 1     SIG_DFL
11      SIGSEGV    Segmentation fault        SIG_DFL
12      SIGUSR2    User defined signal 2     SIG_DFL
13      SIGPIPE    Broken pipe               SIG_IGN
14      SIGALRM    Alarm clock               rl_signal_handler in section .text of build/gdb/gdb
15      SIGTERM    Terminated                rl_signal_handler in section .text of build/gdb/gdb
16      SIGSTKFLT  Stack fault               SIG_DFL
17      SIGCHLD    Child exited              sigchld_handler(int) in section .text of build/gdb/gdb
18      SIGCONT    Continued                 tui_cont_sig(int) in section .text of build/gdb/gdb
19      SIGSTOP    Stopped (signal)          SIG_DFL
20      SIGTSTP    Stopped                   rl_signal_handler in section .text of build/gdb/gdb
21      SIGTTIN    Stopped (tty input)       rl_signal_handler in section .text of build/gdb/gdb
22      SIGTTOU    Stopped (tty output)      rl_signal_handler in section .text of build/gdb/gdb
23      SIGURG     Urgent I/O condition      SIG_DFL
24      SIGXCPU    CPU time limit exceeded   GC_restart_handler in section .text of /lib64/libgc.so.1
25      SIGXFSZ    File size limit exceeded  SIG_IGN
26      SIGVTALRM  Virtual timer expired     SIG_DFL
27      SIGPROF    Profiling timer expired   SIG_DFL
28      SIGWINCH   Window changed            tui_sigwinch_handler(int) in section .text of build/gdb/gdb
29      SIGIO      I/O possible              SIG_DFL
30      SIGPWR     Power failure             GC_suspend_handler in section .text of /lib64/libgc.so.1
31      SIGSYS     Bad system call           SIG_DFL
34      SIG34      Real-time signal 0        SIG_DFL
35      SIG35      Real-time signal 1        SIG_DFL
[...]

(gdb) info signal-dispositions 2 5
Number  Name       Description               Disposition
2       SIGINT     Interrupt                 rl_signal_handler in section .text of build/gdb/gdb
5       SIGTRAP    Trace/breakpoint trap     SIG_DFL

I wrote it as a GDB CLI script, just because that was quicker to prototype. Using GDB’s Python or Guile scripting would allow for error handling, nicer formatting control, and better argument handling. I’m too lazy^Wbusy at the moment to rewrite it, though.

Could we do better?


I think we could.  The ideal solution would let the debugger retrieve the information without running code in the inferior’s address space, which is always risky: the inferior might be messed up already, and it’s not desirable for seemingly innocent commands to potentially mess it up further.

For example, we could have the kernel expose the set of signal actions in /proc/PID/status or some new /proc file or /proc directory — e.g., /proc/pid/sigaction/$signo, with one entry per signal.

And then for core debugging, the kernel could dump the same info in an ELF core note, similarly to how mapped files end up in the NT_FILE note.
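As an aside, /proc/PID/status does already expose a coarse version of this today: the SigIgn and SigCgt lines are hex bitmasks saying which signals are ignored and which are caught, though not where the handlers live, which is exactly the missing piece. Here is a small C sketch that decodes those masks (assuming the Linux convention that bit N-1 corresponds to signal N):

#include <inttypes.h>
#include <stdio.h>

int
main (void)
{
  FILE *f = fopen ("/proc/self/status", "r");
  char line[256];
  uint64_t ign = 0, cgt = 0;

  if (f == NULL)
    return 1;

  /* Pick out the SigIgn/SigCgt hex masks; other lines won't match.  */
  while (fgets (line, sizeof line, f) != NULL)
    {
      sscanf (line, "SigIgn: %" SCNx64, &ign);
      sscanf (line, "SigCgt: %" SCNx64, &cgt);
    }
  fclose (f);

  for (int sig = 1; sig <= 64; sig++)
    {
      uint64_t bit = UINT64_C (1) << (sig - 1);

      if (ign & bit)
        printf ("signal %d: SIG_IGN\n", sig);
      else if (cgt & bit)
        printf ("signal %d: caught (handler address not available)\n", sig);
    }
  return 0;
}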

A tale of inexplicable GDB racy FAILs

A few weeks ago I finally identified and fixed the origin of some inexplicable intermittent failures that had been showing up in the GDB buildbots for months, in a test I had originally added for all-stop-on-top-of-non-stop: the testsuite sporadically kills the wrong process due to PID-reuse races. Fun! Read on for details of the investigation that led to identifying the culprit, and ultimately, the fix.

I’m proud of this fix, as the investigation was somewhat painful and spanned several months.

A while ago, I had added a new test to GDB that constantly spawns new short-lived threads, and then has GDB attach to and detach from that process multiple times. That is attach-many-short-lived-threads.exp. At the time, that exposed a number of problems, both in GDB and in glibc’s libthread_db.so, that triggered most frequently with all-stop-on-top-of-non-stop. Those were themselves hair-pullingly painful to track down, but that’s a different story.

Unfortunately, even after fixing the original problems that motivated the test, the buildbots kept showing that it still occasionally failed at random. The failure logs usually made me suspect a kernel/ptrace race: ptrace attach would frequently fail saying the process was a zombie, when it had just been started… I had tried finding the culprit in the kernel’s sources before, but found nothing. Eventually, I shrugged, and for weeks (months?), I just ignored the problem.

That is, until I found a buildbot gdb.log showing that the test FAILed because the inferior process got a SIGTERM signal. Ah! At the same time, I was discussing with Patrick Palka upstream how to make GDB better at handling a SIGTERM it receives itself, and I knew that there’s a test in the testsuite that spawns GDB and kills it with SIGTERM, in a loop, 50 times (gdb-sigterm.exp). How apropos. So I started suspecting that test. Trying to catch the problem in action, I set out to run both the attach-many-short-lived-threads.exp and gdb-sigterm.exp tests in a loop, in parallel, one in each terminal. And surprise: after a while (sometimes a long while), attach-many-short-lived-threads.exp would indeed FAIL with a rogue SIGTERM.

But something was still not quite right… Looking at the test, I couldn’t really explain how gdb-sigterm.exp’s signals could end up in the wrong process.

Still, knowing that something was killing the test process was already a very good hint.

In GDB’s testsuite, we had several tests that spawned a process for GDB to attach to with the “attach” command. That process was spawned with Tcl’s “exec&”. After the test was done, we’d kill the process with “kill -9 $pid”. “Hmm. That’s suspicious”, I thought. Studying Tcl’s docs and sources, I concluded that this is a very bad pattern: with “exec&”, Tcl takes care of reaping the wait status in the background, so by the time you issue the “kill -9 $pid”, that PID may already have been reused for another process. I hacked extra logging into GDB’s testsuite, and indeed I saw that most of the time, that “kill” would fail with “error: no process”. Clearly a very dangerous thing to do! And it could well explain some of the racy crashes I was seeing.

So I fixed that by making all affected tests in the GDB testsuite use expect’s “spawn” instead of Tcl’s “exec&”. With “spawn”, we control when to reap the process, so we can “kill -9” at will: the kernel won’t reuse a PID until its exit status is reaped.

Here’s the GDB patch, if you’re curious:
[PATCH] testsuite: tcl exec& -> ‘kill -9 $pid’ is racy (attach-many-short-lived-thread.exp races and others).
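To see why controlling the reaping makes the “kill -9” safe, here is a minimal C sketch of the kill-then-reap ordering (a hypothetical stand-in for the test harness, not the actual testsuite code):

#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

int
main (void)
{
  pid_t pid = fork ();

  if (pid == 0)
    {
      /* Child: stand-in for the test's inferior process.  */
      pause ();
      _exit (0);
    }

  /* ... exercise the child here ... */

  /* Safe: even if the child already exited, it remains a zombie
     (its PID reserved) until we reap it below, so this kill() can
     never hit an unrelated process.  */
  kill (pid, SIGKILL);
  waitpid (pid, NULL, 0);

  /* Only now may the kernel hand the PID to someone else.  Signaling
     it after this point is the race Tcl's "exec&" runs into, since
     Tcl reaps the status in the background behind your back.  */
  return 0;
}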

Still, that didn’t explain the rogue SIGTERM… Back to staring at gdb-sigterm.exp.

[Image: head-scratching gorilla]

I then played with SystemTap scripts that printed information about the senders and receivers of SIGTERM signals, and still I couldn’t map the PIDs of the signal senders to the gdb-sigterm.exp DejaGnu/runtest process tree… And then I noticed that running multiple instances of the attach-many-short-lived-threads.exp test in parallel, on its own, would cause it to FAIL as well… SystemTap was showing that the process sending the signal was a “sh”, and its parent was… “init”. WTH!

So gdb-sigterm.exp was very much looking like a red herring.

Grepping around for other sources of rogue signals, I found this in DejaGnu:

    exec sh -c "exec > /dev/null 2>&1 && (kill -2 $pgid || kill -2 $pid) && sleep 5 && (kill $pgid || kill $pid) && sleep 5 && (kill -9 $pgid || kill -9 $pid) &"

Ah-ha!

Note the “kill $pgid || kill $pid”: recall that kill with no explicit signal option means “kill with SIGTERM”.

I then spent a while improving the SystemTap scripts, and eventually saw clearly that this was indeed the culprit. The reason the parent of that “sh” was “init” is that the shell command runs in the background, and by the time it was running and killing the wrong process, its parent DejaGnu/runtest process had already exited…

Eventually I came up with a DejaGnu fix for that: [PATCH] DejaGnu kills the wrong process due to PID-reuse races.

And with that in place, all rogue signals are gone.

Ben Elliston pushed the patch to DejaGnu master the next day. Yay!

It turns out that patch has been fixing several different racy failures. E.g., https://sourceware.org/ml/gdb-patches/2015-08/msg00443.html.
