List active signal handlers with GDB

Listing active signal handlers (or dispositions) with GDB

A colleague at Red Hat asked whether it is possible to get GDB to list the currently active signal handlers, aka the signal dispositions, i.e., the actions taken by the inferior process on receipt of each specific signal.

You could do it manually, like this, once for each signal. Here done for SIGINT (which is 2):

    (gdb) set $p = (struct sigaction *) malloc(sizeof (struct sigaction))
    (gdb) print sigaction(2, 0, $p)
    $1 = 0
    (gdb) print $p->__sigaction_handler.sa_handler
    $2 = (__sighandler_t) 0x3ff797bb200 <g_unix_signal_handler>

but that’s… rather cumbersome. Read on for something nicer.

Knowing what action a signal will take when it is received is important information. For example, it’s a common problem to have a library that “steals” a signal from the main program, or have libraries with conflicting signal handlers. E.g., toolkits such as Qt and language interpreter libraries such as Python often want to install a SIGCHLD handler to track the lifetime of processes they spawn. Or SIGINT handlers to handle Ctrl-C gracefully. Etc. This is a problem that GDB itself has tripped on internally, even. See Bug 14382 – gdb hangs after plotting with matplotlib, for example.
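To make the “stealing” concrete, here is a minimal sketch in Python rather than C, just because it is short. The two handler names are made up for illustration; the point is only that a process has a single disposition per signal, so whoever registers last wins, and the previous handler is silently replaced unless the newcomer bothers to chain to it.

```python
import signal


def app_handler(signum, frame):
    """Handler the main program wants for SIGUSR1 (hypothetical)."""


def library_handler(signum, frame):
    """Handler some library installs later on (hypothetical)."""


# Remember the starting disposition so the demo can clean up after itself.
original = signal.getsignal(signal.SIGUSR1)

# The program registers its handler first...
signal.signal(signal.SIGUSR1, app_handler)

# ...and then the library registers its own. signal.signal() returns the
# handler it replaced, which is the only trace left of the program's one.
replaced = signal.signal(signal.SIGUSR1, library_handler)
active = signal.getsignal(signal.SIGUSR1)

print("replaced:", replaced.__name__)
print("active:  ", active.__name__)

# Restore whatever was there before (None means it was not set from Python).
if original is not None:
    signal.signal(signal.SIGUSR1, original)
```

A well-behaved library would keep the returned handler around and call it from its own, but nothing forces it to.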

Answer the question already!

Back to the original question, the short answer is “no”: GDB currently does not have a built-in command to list the currently registered signal handlers. I’m not aware of any way for a debugger to get this information out of the Linux kernel directly, actually.

However, we can still script the sigaction calls above and wrap it all in a nice user-friendly command. I whipped up something quickly for my colleague, using GDB/CLI scripting. You can find the script here:

https://github.com/palves/misc/blob/master/gdb/signals.gdb

This adds a new info signal-dispositions command to GDB. Download it somewhere and source it from your ~/.gdbinit to make it always handy and available.

Example output (of gdb debugging itself, on x86-64 Fedora):

    (gdb) info signal-dispositions
    Number Name      Description              Disposition
    1      SIGHUP    Hangup                   handle_sighup(int) in section .text of build/gdb/gdb
    2      SIGINT    Interrupt                rl_signal_handler in section .text of build/gdb/gdb
    3      SIGQUIT   Quit                     rl_signal_handler in section .text of build/gdb/gdb
    4      SIGILL    Illegal instruction      SIG_DFL
    5      SIGTRAP   Trace/breakpoint trap    SIG_DFL
    6      SIGABRT   Aborted                  SIG_DFL
    7      SIGBUS    Bus error                SIG_DFL
    8      SIGFPE    Floating point exception handle_sigfpe(int) in section .text of build/gdb/gdb
    9      SIGKILL   Killed                   SIG_DFL
    10     SIGUSR1   User defined signal 1    SIG_DFL
    11     SIGSEGV   Segmentation fault       SIG_DFL
    12     SIGUSR2   User defined signal 2    SIG_DFL
    13     SIGPIPE   Broken pipe              SIG_IGN
    14     SIGALRM   Alarm clock              rl_signal_handler in section .text of build/gdb/gdb
    15     SIGTERM   Terminated               rl_signal_handler in section .text of build/gdb/gdb
    16     SIGSTKFLT Stack fault              SIG_DFL
    17     SIGCHLD   Child exited             sigchld_handler(int) in section .text of build/gdb/gdb
    18     SIGCONT   Continued                tui_cont_sig(int) in section .text of build/gdb/gdb
    19     SIGSTOP   Stopped (signal)         SIG_DFL
    20     SIGTSTP   Stopped                  rl_signal_handler in section .text of build/gdb/gdb
    21     SIGTTIN   Stopped (tty input)      rl_signal_handler in section .text of build/gdb/gdb
    22     SIGTTOU   Stopped (tty output)     rl_signal_handler in section .text of build/gdb/gdb
    23     SIGURG    Urgent I/O condition     SIG_DFL
    24     SIGXCPU   CPU time limit exceeded  GC_restart_handler in section .text of /lib64/libgc.so.1
    25     SIGXFSZ   File size limit exceeded SIG_IGN
    26     SIGVTALRM Virtual timer expired    SIG_DFL
    27     SIGPROF   Profiling timer expired  SIG_DFL
    28     SIGWINCH  Window changed           tui_sigwinch_handler(int) in section .text of build/gdb/gdb
    29     SIGIO     I/O possible             SIG_DFL
    30     SIGPWR    Power failure            GC_suspend_handler in section .text of /lib64/libgc.so.1
    31     SIGSYS    Bad system call          SIG_DFL
    34     SIG34     Real-time signal 0       SIG_DFL
    35     SIG35     Real-time signal 1       SIG_DFL
    [...]

    (gdb) info signal-dispositions 2 5
    Number Name      Description              Disposition
    2      SIGINT    Interrupt                rl_signal_handler in section .text of build/gdb/gdb
    5      SIGTRAP   Trace/breakpoint trap    SIG_DFL

I wrote it as a GDB CLI script, just because that was quicker to prototype. Using GDB Python or Guile scripting would allow for error handling, nicer formatting control and better argument handling. I’m too lazy^Wbusy at the moment to rewrite it though.
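To give an idea of what “better argument handling” could look like, here is a rough sketch of just the argument-parsing half, in plain Python so it runs outside GDB too. The function name parse_signal_args is made up, and it leans on the host’s signal module for the name-to-number mapping, which assumes native debugging where host and target number signals the same way. Inside GDB, something like this would presumably be called from the invoke method of a gdb.Command subclass; that part is not shown.

```python
import signal


def parse_signal_args(arg_string):
    """Turn e.g. '2 SIGTRAP term' into a sorted list of signal numbers.

    Accepts decimal numbers and signal names, with or without the
    'SIG' prefix, in any letter case. Raises ValueError on bad input.
    """
    numbers = set()
    for token in arg_string.split():
        if token.isdigit():
            signo = int(token)
        else:
            name = token.upper()
            if not name.startswith("SIG"):
                name = "SIG" + name
            # signal.Signals is an enum of the host's signal names.
            member = signal.Signals.__members__.get(name)
            if member is None:
                raise ValueError(f"unknown signal name: {token!r}")
            signo = int(member)
        if not 1 <= signo < signal.NSIG:
            raise ValueError(f"signal number out of range: {token!r}")
        numbers.add(signo)
    return sorted(numbers)


print(parse_signal_args("2 5"))
print(parse_signal_args("sigint TRAP"))
```

With this, `info signal-dispositions int trap` and `info signal-dispositions 2 5` could mean the same thing, and a typo would produce an error message instead of garbage.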

Could we do better?

I think we could. The ideal solution would let the debugger retrieve the information without running code in the inferior address space, which is always risky: the inferior might be messed up already, and it’s not desirable for seemingly innocent commands to potentially mess it up further.

For example, we could have the kernel expose the set of signal actions in /proc/PID/status or some new /proc file or /proc directory — e.g., /proc/pid/sigaction/$signo, with one entry per signal.

And then for core debugging, the kernel could dump the same info in an ELF core note, similarly to how mapped files end up in the NT_FILE note.
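On the /proc idea: as far as I know, /proc/PID/status already carries part of the picture, in the form of the SigIgn and SigCgt hex bitmasks. They say whether a signal is ignored or caught, but not which function catches it, so they are no substitute for the proposal above. Still, decoding them shows what a no-inferior-calls approach looks like. The sketch below works on a made-up sample of the file’s text, and the helper names are mine; bit N-1 of each mask stands for signal N.

```python
def decode_signal_mask(hex_mask):
    """Return the signal numbers whose bits are set in a hex mask string."""
    mask = int(hex_mask, 16)
    # Bit 0 is signal 1, bit 1 is signal 2, and so on, up to signal 64.
    return [signo for signo in range(1, 65) if mask & (1 << (signo - 1))]


def caught_and_ignored(status_text):
    """Extract the SigIgn and SigCgt sets from /proc/PID/status contents."""
    result = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("SigIgn", "SigCgt"):
            result[key] = decode_signal_mask(value.strip())
    return result


# Made-up sample: SIGPIPE (13) ignored; SIGINT (2) and SIGCHLD (17) caught.
SAMPLE_STATUS = """Name:\tdemo
SigBlk:\t0000000000000000
SigIgn:\t0000000000001000
SigCgt:\t0000000000010002
"""

print(caught_and_ignored(SAMPLE_STATUS))
```

For a live process you would feed it the contents of /proc/PID/status instead of the sample string.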

A tale of inexplicable GDB racy FAILs

A few weeks ago I finally identified and fixed the cause of some inexplicable intermittent failures that had been showing up in the GDB buildbots for months, in a test I had originally added for all-stop-on-top-of-non-stop: the testsuite sporadically kills the wrong process due to PID-reuse races. Fun! Read on for details of the investigation that led to identifying the culprit and, ultimately, the fix.

I’m proud of this fix, as the investigation was somewhat painful and spanned several months.

A while ago, I had added a new test to gdb that constantly spawns new short-lived threads, and then has GDB attach / detach to that process multiple times. That is attach-many-short-lived-threads.exp. At the time, that exposed a number of problems, both in GDB and in glibc’s libthread_db.so that triggered most frequently with all-stop-on-top-of-non-stop. Those were themselves hair-pulling painful to track, but that’s a different story.

Unfortunately, even after fixing the original problems that motivated the test, the buildbots kept showing that it still occasionally failed at random. The failure logs usually made me suspect a kernel / ptrace race or bug: ptrace attach would frequently fail saying the process is a zombie, when it had just been started… I had tried finding the culprit in the kernel’s source before, but found nothing. Eventually, I shrugged, and for weeks (months?), I was just ignoring the problem.

That is, until I found a buildbot gdb.log showing that the test FAILed because the inferior process got a SIGTERM signal. Ah! At the same time, I happened to be discussing upstream with Patrick Palka how to make GDB better at handling a SIGTERM that it receives itself, and I knew that there’s a test in the testsuite that spawns GDB and kills it with SIGTERM, in a loop, 50 times (gdb-sigterm.exp). How apropos. So I started suspecting that. Trying to catch the problem in action, I set out to run both the attach-many-short-lived-threads.exp and the gdb-sigterm.exp tests in a loop, in parallel, one in each terminal. And surprise, after a while (sometimes a long while), the attach-many-short-lived-threads.exp would indeed FAIL with a rogue SIGTERM.

But something was still not quite right… Looking at the test, I couldn’t really explain how gdb-sigterm.exp’s signals could end up in the wrong process.

But knowing that the problem was that something was killing the test process was a very good hint already.

In GDB’s testsuite, we had several tests that spawn a process in order for GDB to attach to it with the “attach” command. That process was spawned with Tcl “exec&”. After the test is done, we’d kill the process with “kill -9 $pid”. “Hmm. That’s suspicious.”, I thought. Studying Tcl’s docs and sources, I concluded that that’s a very bad pattern: with “exec&”, Tcl takes care of reaping the wait status in the background, so by the time you issue the “kill -9 $pid”, that PID might have been reused by another process already. I hacked extra logging into GDB’s testsuite, and indeed I saw that most of the time, that “kill” would fail with “error: no process”. Clearly a very dangerous thing to do! And it can well explain some of the racy crashes I was seeing.

So I fixed that by making all affected tests in the GDB testsuite use expect’s “spawn” instead of Tcl’s “exec&”. With “spawn”, we control when to reap the process, so we can “kill -9” at will, because the kernel won’t reuse a PID until the exit status is reaped.
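The “won’t reuse a PID until the exit status is reaped” part is easy to see for yourself. Here is a small sketch, in Python rather than Tcl/expect, and assuming a Linux-like system where os.waitid is available. The child exits right away, the parent waits for that to happen without reaping it (WNOWAIT), and the PID is still there to be signalled, as a zombie. The function name is made up.

```python
import os
import subprocess
import sys


def pid_survives_until_reaped():
    """Return (signalable_as_zombie, exit_status) for a short-lived child."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])

    # Block until the child has exited, but leave it unreaped: WNOWAIT
    # keeps the exit status (and hence the PID) reserved for us.
    os.waitid(os.P_PID, child.pid, os.WEXITED | os.WNOWAIT)

    try:
        os.kill(child.pid, 0)  # signal 0: deliver nothing, just check the PID
        signalable = True
    except ProcessLookupError:
        signalable = False

    # Reap the child. Only from here on may the kernel recycle the PID,
    # so only from here on would a kill by number be dangerous.
    exit_status = child.wait()
    return signalable, exit_status


print(pid_survives_until_reaped())
```

The “exec&” pattern is the opposite situation: something else does that final wait behind your back, at a time you don’t control.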

Here’s the GDB patch, if you’re curious:
[PATCH] testsuite: tcl exec& -> ‘kill -9 $pid’ is racy (attach-many-short-lived-thread.exp races and others).

Still, that didn’t explain the rogue SIGTERM… Back to staring at gdb-sigterm.exp.

I then played with SystemTap scripts that printed information about the senders and receivers of SIGTERM signals, and still I couldn’t manage to map the PIDs of the signal sender to the gdb-sigterm.exp DejaGnu/runtest process tree… And then I noticed that running multiple instances of the attach-many-short-lived-threads.exp test in parallel alone would cause it to FAIL as well… SystemTap was showing that the process that was sending the signal was a “sh”, and its parent was … “init”. WTH!

So gdb-sigterm.exp was very much looking like a red herring.

Grepping around for other sources of rogue signals, I found this in DejaGnu:

    exec sh -c "exec > /dev/null 2>&1 && (kill -2 $pgid || kill -2 $pid) && sleep 5 && (kill $pgid || kill $pid) && sleep 5 && (kill -9 $pgid || kill -9 $pid) &"

Ah-ha!

Note the “kill $pgid || kill $pid” — recall that kill with no explicit signal switch/number means “kill with SIGTERM”.

I then spent a while improving the SystemTap scripts, and eventually clearly saw that this was indeed the culprit. The reason that the parent of the “sh” I was seeing before was “init” is that that shell command runs in the background, and by the time it was running and killing the wrong process, the parent DejaGnu/runtest had already gone…
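For contrast, here is a sketch of the same SIGINT, then SIGTERM, then SIGKILL escalation done by a parent that still holds the child and has not reaped it, so the PID cannot have been recycled in between. This is not the DejaGnu patch, just an illustration in Python of the safe shape of the thing; stop_child and the child script are made up, and unlike the DejaGnu snippet it signals only the child, not its process group. The demo child ignores SIGINT and SIGTERM on purpose so that the escalation goes all the way.

```python
import signal
import subprocess
import sys


def stop_child(proc, grace=5.0):
    """Escalate SIGINT -> SIGTERM -> SIGKILL on a child we have not reaped.

    Waits up to `grace` seconds after each signal and returns the child's
    exit status (negative signal number if it died from a signal).
    """
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        proc.send_signal(sig)
        try:
            return proc.wait(timeout=grace)  # reaps the child if it exited
        except subprocess.TimeoutExpired:
            continue  # still alive: try the next, harsher signal
    return proc.wait()  # SIGKILL was sent; just wait for it to take effect


# A stubborn child that ignores the two polite signals.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGINT, signal.SIG_IGN)\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)

proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdout=subprocess.PIPE, text=True)
proc.stdout.readline()  # wait until the child has set up its dispositions
exit_status = stop_child(proc, grace=0.2)
proc.stdout.close()
print("exit status:", exit_status)
```

The key difference from the backgrounded shell one-liner is that every signal here is sent before the child’s status has been collected, and the sleeps in between don’t outlive the process that owns the child.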

Eventually I came up with a DejaGnu fix for that: [PATCH] DejaGnu kills the wrong process due to PID-reuse races.

And with that in place, all rogue signals are gone.

Ben Elliston pushed the patch to DejaGnu master the next day. Yay!

Turns out that that patch has been fixing several different racy failures. E.g., https://sourceware.org/ml/gdb-patches/2015-08/msg00443.html.

Tagged buildbot, dejagnu, gdb, hairpull, racyfails, systemtap

Pedro Alves | thinko's reign
