[vox-tech] C question: Determining where a signal was raised

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Sat Oct 23 15:48:18 PDT 2004


On Sat, 23 Oct 2004, Peter Jay Salzman wrote:

> My apologies to clc readers; I'm aware this really isn't a C question.  ;-)

It could be regarded as such, if the answer you wanted was "portable C
doesn't support what you are looking for without trace or setjmp/longjmp
scaffolding in your code."

> I have some code that traps floating point errors.  The signal is trapped
> with this code:
> 
>       struct sigaction action;
> 
>       memset(&action, 0, sizeof(action));
>       action.sa_sigaction = fpe_callback;    /* which callback function  */
>       sigemptyset(&action.sa_mask);          /* other signals to block   */
>       action.sa_flags = SA_SIGINFO;          /* give details to callback */
> 
>       if (sigaction(SIGFPE, &action, 0))
>          die("Failed to register signal handler.");
> 
> 
> and, when the signal is raised, the callback function is:
> 
>    void fpe_callback(int sig_number, siginfo_t *info, void *data)

Yeah, non-ANSI function parameters.

>    {
>       data = data;      /* used for SIGIO (see F_SETSIG in fcntl) */
> 
>       if (sig_number != SIGFPE) {
>          fprintf(stderr, "%s:%d %s error: "
>             "received wrong signal number %d not %d\n",
>             __FILE__, __LINE__, __FUNCTION__, sig_number, SIGFPE);
>          exit(2);
>       }
> 
>       fprintf(stderr, "%s:%d %s warn: ", __FILE__, __LINE__, __FUNCTION__);
>       fpe_print_cause(stderr, info);
> 
>       exit(1);
>    }
> 
> 
> The function fpe_print_cause() does nothing more than print the cause of the
> floating point error:
> 
> 
>    void fpe_print_cause(FILE *file, siginfo_t *info)
>    {
>       if (info->si_signo != SIGFPE) {      // should never happen
>          die("Somehow got a wrong signo = %d\n", info->si_signo);
>       } else {
>          fprintf(file,
>             "FPE reason %d = \"%s\", from address %p\n",
>             info->si_code,
>             info->si_code == FPE_INTDIV ? "integer divide by zero"     :
>             info->si_code == FPE_INTOVF ? "integer overflow"           :
>             info->si_code == FPE_FLTDIV ? "FP divide by zero"          :
>             info->si_code == FPE_FLTOVF ? "FP overflow"                :
>             info->si_code == FPE_FLTUND ? "FP underflow"               :
>             info->si_code == FPE_FLTRES ? "FP inexact result"          :
>             info->si_code == FPE_FLTINV ? "FP invalid operation"       :
>             info->si_code == FPE_FLTSUB ? "subscript out of range"     :
>             "unknown",
>             (void *) info->si_addr
>          );
>       }
>    }
> 
> 
> 
> The *intent* of fpe_callback() is to print the function and line number that
> was executing when the FPE signal was raised.  However, the function and line
> number that gets printed is fpe_callback().  Useless information.

But you learned something about __FILE__, __LINE__ and __FUNCTION__
macros.

> Is there a way to grab the function, file, and line number of the code that
> was executing when the FPE signal was raised?

Not directly, nor in a standard-conforming way, but for your purposes, the
address at which the processor was running should be on the stack stored
in your corefile.

> Running the executable under GDB is not an option because sometimes it can
> take many, many hours for the FPE to raise.  Also, I thought I was being
> crafty by replacing:
> 
>    exit(1);
> 
> with:
> 
>    abort();
> 
> A core file is generated, which should've given me details of where the code
> was when the FPE was generated, but it looks like the stack blew chunks:
> 
>    p@lucifer$ gdb avataralt core 
>    Using host libthread_db library "/lib/tls/libthread_db.so.1".
>    Core was generated by `./avataralt'.
>    Program terminated with signal 6, Aborted.
> 
>    warning: current_sos: Can't read pathname for load map: Input/output error
> 
>    Reading symbols from /lib/tls/libm.so.6...done.
>    Loaded symbols for /lib/tls/libm.so.6
>    Reading symbols from /lib/tls/libc.so.6...done.
>    Loaded symbols for /lib/tls/libc.so.6
>    Reading symbols from /lib/ld-linux.so.2...done.
>    Loaded symbols for /lib/ld-linux.so.2
>    #0  0x4006cee9 in raise () from /lib/tls/libc.so.6
>    (gdb) bt
>    #0  0x4006cee9 in raise () from /lib/tls/libc.so.6
>    #1  0x4017aedc in ?? () from /lib/tls/libc.so.6
>    #2  0x00003ffe in ?? ()
>    #3  0x4006e781 in abort () from /lib/tls/libc.so.6
>    #4  0x00000000 in ?? ()
>       (snip)
>    #46 0x40016c40 in ?? () from /lib/ld-linux.so.2
>    #47 0x000000a3 in ?? ()
>    #48 0x40016e78 in _r_debug ()
>    #49 0xbfff8b74 in ?? ()
>    #50 0x4000ba16 in _dl_map_object_deps () from /lib/ld-linux.so.2
>    Previous frame inner to this frame (corrupt stack?)
> 
> To be honest, I don't have the slightest clue what happens to the stack
> when an asynchronous signal handler executes or when a long jump happens.
> I assume this is why GDB thinks the stack was corrupt...

For software interrupts (synchronous signals), complicated stuff happens.  
In a non-memory-protected environment (like DOS) the processor registers
would be saved onto your stack, and then some C library code would run,
and it would call your signal handler, and if you returned from the signal
handler, it would try to "return" (restore processor state) and continue
executing the offending code.  The presumption would be that you had fixed
whatever went wrong using some non-standard-conforming techniques.

In Linux, the kernel steps in, storing the processor state on the stack
just as it would at any context switch, but when the kernel is about to
restore context, it runs any handlers you have specified first (in
usermode) due to the way the context switch was triggered. I haven't dug through
the code to see exactly what steps are being taken, but comments on the
members of siginfo_t (Linux only, possibly version specific, definitely
processor specific) indicate that when processing a SIGFPE
info->si_addr (the _sigfault._addr union member) will be the address at
which the exception was raised.
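
As a rough illustration of reading that address from a handler, here is a
minimal sketch.  The names fpe_report, install_fpe_report, is_arith_fault
and the two globals are made up for this example; only sigaction, siginfo_t
and the FPE_* codes come from POSIX.  It assumes a hosted POSIX system, and
it leans on snprintf inside the handler for brevity even though that call
is not formally async-signal-safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical globals recording what the handler saw last. */
static void *volatile fpe_last_addr;
static volatile sig_atomic_t fpe_count;

/* True when the code is one of the POSIX arithmetic-fault reasons. */
static int is_arith_fault(int code)
{
    switch (code) {
    case FPE_INTDIV: case FPE_INTOVF: case FPE_FLTDIV: case FPE_FLTOVF:
    case FPE_FLTUND: case FPE_FLTRES: case FPE_FLTINV: case FPE_FLTSUB:
        return 1;
    default:
        return 0;
    }
}

/* Print the address carried in siginfo_t. */
static void fpe_report(int sig, siginfo_t *info, void *context)
{
    char msg[96];
    int len;

    (void) sig;
    (void) context;

    fpe_last_addr = info->si_addr;
    fpe_count++;

    /* snprintf is not formally async-signal-safe; used here for brevity. */
    len = snprintf(msg, sizeof msg, "SIGFPE code %d at address %p\n",
                   info->si_code, info->si_addr);
    if (len > 0 && write(STDERR_FILENO, msg, (size_t) len) < 0)
        _exit(3);                       /* cannot even report: give up */

    /* For a real arithmetic fault, returning would re-run the faulting
       instruction, so terminate.  A signal sent with raise() or kill()
       carries no fault address and is safe to return from. */
    if (is_arith_fault(info->si_code))
        _exit(1);
}

/* Register fpe_report for SIGFPE; returns 0 on success, -1 on error. */
static int install_fpe_report(void)
{
    struct sigaction action;

    memset(&action, 0, sizeof action);
    action.sa_sigaction = fpe_report;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_SIGINFO;       /* ask for the siginfo_t argument */
    return sigaction(SIGFPE, &action, NULL);
}
```

Call install_fpe_report() once at startup.  The printed address can then be
handed to a tool such as addr2line or to gdb's "info line *ADDRESS" to get
a file and line, assuming the program was built with -g and is not
position-independent.  Swapping _exit(1) for abort() should also leave a
core file behind.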

Setjmp is comparatively straightforward, since it can be implemented
without kernel context switches.  Basically, a bunch of processor register
values and a limited amount of stack data are stored off into the jmp_buf,
and it returns zero.  When you call longjmp, it restores those processor
register values, so execution resumes as if setjmp had just returned a
second time, this time with a nonzero value.  Restoring the stack pointer
effectively discards the stack frames created by function calls made after
the original setjmp call.
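
A small sketch of that behaviour, using only standard C.  The names
recover_point, failure_code, run_guarded, no_failure and deep_failure are
invented for the example and are not from the code under discussion.

```c
#include <setjmp.h>

static jmp_buf recover_point;       /* saved registers live here */
static volatile int failure_code;   /* set just before jumping back */

/* Work that completes normally. */
static void no_failure(void)
{
}

/* Innermost function: abandons its own frame and its caller's frame. */
static void fail_now(void)
{
    failure_code = 42;
    longjmp(recover_point, 1);      /* never returns to its caller */
}

/* Adds one more stack frame between run_guarded and the longjmp. */
static void deep_failure(void)
{
    fail_now();
}

/* Returns 0 if work() returned normally, otherwise the failure code. */
static int run_guarded(void (*work)(void))
{
    if (setjmp(recover_point) != 0)
        return failure_code;        /* arrived here via longjmp */

    work();                         /* setjmp returned 0: first pass */
    return 0;
}
```

Here run_guarded(no_failure) yields 0 and run_guarded(deep_failure) yields
42.  In the second case the frames for deep_failure and fail_now are simply
abandoned when longjmp restores the saved stack pointer.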

Note that looking at the stack from where you were in the corefile, no
setjmp/longjmp effects would be present, but the kernel has manipulated
the stack without following normal C calling conventions, possibly to go
through kernel space again on the way back to restarting the problem code.

> Trace code will work, but I'm looking for something more elegant than
> sprinkling trace code all over the place.  I'm so busy, the last thing I want
> to do is start putting junk in my code that needs to be taken out.  If I'm
> going to spend time on this, I at least want a return on my investment and
> learn something I didn't know when I woke up this morning...   ;-)

Check out the processor execution address in the siginfo struct mentioned
above, using gdb.
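
If poking at a corefile is inconvenient, a different technique from the one
described above is to have the program dump its own call stack.  The sketch
below assumes glibc, whose <execinfo.h> provides backtrace() and
backtrace_symbols_fd(); neither is standard C or POSIX.  The name
dump_backtrace and the MAX_FRAMES limit are made up for this example.

```c
#include <execinfo.h>   /* glibc extension: backtrace and friends */
#include <unistd.h>

#define MAX_FRAMES 64   /* arbitrary limit chosen for this sketch */

/* Write the current call stack to fd, one frame per line.
   Returns the number of frames captured. */
static int dump_backtrace(int fd)
{
    void *frames[MAX_FRAMES];
    int count = backtrace(frames, MAX_FRAMES);

    /* Writes straight to the descriptor, so no malloc is needed here. */
    backtrace_symbols_fd(frames, count, fd);
    return count;
}
```

Calling dump_backtrace(STDERR_FILENO) once at startup gives the library a
chance to load its unwinder before any signal arrives; after that it could
be called from the SIGFPE handler.  Function names only appear if the
program is linked with -rdynamic; otherwise the raw addresses still have to
be translated with gdb or addr2line.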

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------


