[gclist] LinuxThreads+signals => SIGSEGV in system call?

Fergus Henderson fjh@cs.mu.OZ.AU
Thu, 30 Apr 1998 06:56:46 +1000 (EST)


Synopsis:
---------

I'm trying to port the Boehm (et al) conservative garbage collector
to work with LinuxThreads.  I've got it to the point where
it works *some* of the time.  The problem is that it sometimes
fails, apparently getting a segmentation fault in a system call.
It generates a core file and when I examine the core file in gdb,
the current instruction pointer is always just past the `int $0x80'
instruction that invokes the system call.

Any suggestions on how I can go about debugging this?


Details:
--------

There are two tricky parts to the port.  One part is determining where
the thread stacks are so that the collector can include them in its
root set.  I have this part figured out.  My code has to depend on
some of the implementation details of LinuxThreads, but otherwise this
part is not too hard.

The other tricky part is implementing the GC_stop_world() function,
which must suspend all the other threads.  The way I have implemented
this is to send each of them a "SIG_SUSPEND" signal.  The signal
handler first calls sem_post() to tell the main thread that the
signalled thread is ready to suspend, and then calls sigsuspend() (or
sleep() -- I tried both).  When the GC is done, the collector
calls GC_start_world(), which sends all the threads a "SIG_RESTART"
signal.  The SIG_RESTART handler doesn't do anything except return; the
effect of the signal is just to terminate the call to sigsuspend() or
sleep().

(Normally I'd use SIGUSR1 and SIGUSR2 for my SIG_SUSPEND and
SIG_RESTART signals, but LinuxThreads already uses those, so I'm
currently reusing SIGIO and SIGPWR for SIG_SUSPEND and SIG_RESTART.)
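
In outline, the scheme looks something like the sketch below.  This is
a simplified illustration, not the actual port: the gc_* function
names, the suspend_ack semaphore and the demo worker are hypothetical
names made up for the example.  Blocking every signal while the
SIG_SUSPEND handler runs (via sa_mask), so that a SIG_RESTART arriving
early stays pending until sigsuspend() unblocks it, is an assumption of
the sketch rather than something the description above specifies.

```c
#define _GNU_SOURCE              /* exposes SIGPWR etc. on glibc */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

#define SIG_SUSPEND SIGIO        /* signal choices as in the post */
#define SIG_RESTART SIGPWR       /* SIGPWR is Linux-specific */

static sem_t suspend_ack;        /* hypothetical: one post per parked thread */

static void suspend_handler(int sig)
{
    int saved_errno = errno;
    sigset_t wait_mask;
    (void)sig;
    sigfillset(&wait_mask);
    sigdelset(&wait_mask, SIG_RESTART);
    sem_post(&suspend_ack);      /* tell the collector this thread is parked */
    sigsuspend(&wait_mask);      /* returns once the SIG_RESTART handler has run */
    errno = saved_errno;
}

static void restart_handler(int sig) { (void)sig; }  /* only ends sigsuspend() */

int gc_install_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sigfillset(&sa.sa_mask);     /* assumption: keep SIG_RESTART pending in the handler */
    sa.sa_handler = suspend_handler;
    if (sigaction(SIG_SUSPEND, &sa, NULL) != 0)
        return -1;
    sa.sa_handler = restart_handler;
    if (sigaction(SIG_RESTART, &sa, NULL) != 0)
        return -1;
    return sem_init(&suspend_ack, 0, 0);
}

int gc_stop_world(const pthread_t *threads, int n)
{
    int i, sent = 0;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        if (pthread_kill(threads[i], SIG_SUSPEND) == 0)
            sent++;
    for (i = 0; i < sent; i++)   /* wait for one acknowledgement per signal sent */
        while (sem_wait(&suspend_ack) != 0 && errno == EINTR)
            ;
    return sent == n ? 0 : -1;
}

int gc_start_world(const pthread_t *threads, int n)
{
    int i, rc = 0;
    for (i = 0; i < n; i++)
        if (pthread_kill(threads[i], SIG_RESTART) != 0)
            rc = -1;
    return rc;
}

static atomic_int demo_run = 1;
static atomic_ulong demo_ticks;

static void *demo_worker(void *arg)
{
    (void)arg;
    while (atomic_load(&demo_run))
        atomic_fetch_add(&demo_ticks, 1);
    return NULL;
}

/* Demo: returns 0 if the worker made no progress while the world was stopped. */
int gc_demo(void)
{
    struct timespec delay = {0, 50 * 1000 * 1000};   /* 50 ms */
    unsigned long before, after;
    pthread_t t;
    if (gc_install_handlers() != 0 ||
        pthread_create(&t, NULL, demo_worker, NULL) != 0 ||
        gc_stop_world(&t, 1) != 0)
        return -1;
    before = atomic_load(&demo_ticks);
    nanosleep(&delay, NULL);
    after = atomic_load(&demo_ticks);
    gc_start_world(&t, 1);
    atomic_store(&demo_run, 0);
    pthread_join(t, NULL);
    return after == before ? 0 : 1;
}
```

gc_demo() exists only to exercise the sketch: it parks one busy worker,
checks that its counter stays put for 50 ms, and then restarts it.
Depending on the C library, the file may need to be built with -pthread.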

Anyway, that's all well and good, and when I run the collector's test
case, about 50% of the time it works.  But the other 50% or so, it dies,
sometimes due to failed assertions, but more often due to what is
apparently a segmentation fault in a system call.

Does Linux for some reason not like code that suspends inside a
signal handler?  If so, why doesn't Linux allow this?
Or alternatively, what else could be causing this problem,
and how can I go about debugging it?

I'm using LinuxThreads 0.6, libc 5.3.12, kernel 2.1.35, gcc 2.7.2,
and gdb 4.16.

--
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
WWW: <http://www.cs.mu.oz.au/~fjh>  |  of excellence is a lethal habit"
PGP: finger fjh@128.250.37.3        |     -- the last words of T. S. Garp.