[gclist] LinuxThreads+signals => SIGSEGV in system call?

Giuliano Carlini GCARLINI@us.oracle.com
29 Apr 98 15:03:34 -0700


--=_ORCL_20178951_0_0
Content-Transfer-Encoding:7bit
Content-Type:text/plain; charset="us-ascii"

Are you running the incremental collector?

If I remember previous discussion correctly, it might be failing due to the
write barrier. If you've implemented the write barrier by write protecting
pages, things get complicated during system calls which may write to your
memory. On some systems, writes to protected memory by system calls don't
call the signal handler properly. Is it possible to check with gdb if the
fault address is on a write protected page? Again if I remember correctly,
one solution used on these systems is to intercept the system call. If one
of the arguments is a pointer to write protected memory, touch it. Then make
the real call.
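
A minimal sketch of that workaround, using read() as the intercepted call.
The names touch_pages and gc_safe_read are made up for illustration and are
not part of the collector. The sketch also assumes nothing re-protects the
buffer between the touch and the real call; a threaded collector would have
to guarantee that some other way.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: store one byte back into every page of
   [buf, buf + len).  If a page is write protected, the fault happens
   here in user mode, where the SIGSEGV handler runs normally, instead
   of inside the kernel during the system call. */
static void touch_pages(void *buf, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile char *p = (volatile char *)buf;
    volatile char *end = p + len;

    assert(page > 0);
    while (p < end) {
        *p = *p;  /* rewrite the same value: contents are unchanged */
        /* Advance to the first byte of the next page. */
        p = (volatile char *)(((uintptr_t)p / (uintptr_t)page + 1)
                              * (uintptr_t)page);
    }
}

/* Hypothetical wrapper: touch the destination buffer, then make the
   real call. */
ssize_t gc_safe_read(int fd, void *buf, size_t count)
{
    touch_pages(buf, count);
    return read(fd, buf, count);
}
```

Every system call that writes through a user pointer would need a wrapper
of this kind.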

Since this is Linux, another possibility exists. Hack the kernel so that
user code can watch which pages have been dirtied. I believe something like
this exists for the Solaris version of the collector. It has two benefits.
First, it works with system calls. Second, it's a lot more efficient since
you're not faulting from user to system mode and back again every time the
write barrier is tripped.
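
For comparison, here is a rough sketch of the page-protection barrier, which
shows where the per-write cost comes from: the first store to each protected
page traps into the kernel and back out through a SIGSEGV handler. It assumes
Linux and a toy heap of HEAP_PAGES pages. All names are invented for the
example, and it is not the collector's actual code.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HEAP_PAGES 4                  /* hypothetical toy heap size */

static char *heap;                    /* start of the protected region */
static size_t page_size;
static volatile sig_atomic_t dirty[HEAP_PAGES];
static volatile sig_atomic_t faults;  /* number of barrier traps taken */

/* SIGSEGV handler: record the dirtied page, then unprotect it so the
   faulting store is restarted and succeeds. */
static void barrier_handler(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)heap;
    size_t idx;

    (void)sig; (void)ctx;
    if (addr < base || addr >= base + HEAP_PAGES * page_size)
        _exit(1);                     /* a genuine crash, not the barrier */
    idx = (addr - base) / page_size;
    dirty[idx] = 1;
    faults++;
    mprotect(heap + idx * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Map the toy heap, install the handler, write-protect every page. */
int barrier_init(void)
{
    struct sigaction sa;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    heap = mmap(NULL, HEAP_PAGES * page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (heap == MAP_FAILED)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = barrier_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return mprotect(heap, HEAP_PAGES * page_size, PROT_READ);
}

/* Store through a volatile pointer so the compiler keeps the store in
   program order relative to reads of the counters above. */
static void demo_store(size_t off, char c)
{
    *(volatile char *)(heap + off) = c;
}
```

Two stores to the same page cost one trap; a store to a second page costs
another. A kernel that simply reported dirty pages would avoid the traps
altogether.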

Of course, this is probably a lot more work than you are willing to do. I
mention it only because I've seen some wailing and gnashing of teeth over
the lack of memory primitives that interact well with garbage collection,
persistent heaps, etc. For someone interested in this, Linux might be a good
test bed for devising and implementing primitives that behave better.

g

--=_ORCL_20178951_0_0
Content-Type:message/rfc822

Date: 29 Apr 98 13:56:46
From:Fergus Henderson <fjh@cs.mu.OZ.AU>
To:gclist@iecc.com
Subject:[gclist] LinuxThreads+signals => SIGSEGV in system call?
Message-Id:<199804292056.GAA28778@mundook.cs.mu.OZ.AU>
Newsgroups:comp.programming.threads,comp.os.linux.development.apps,comp.os.linux.development.system
Sender:owner-gclist@iecc.com
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding:7bit
Content-Type:text/plain; charset="us-ascii"

Synopsis:
---------

I'm trying to port the Boehm (et al) conservative garbage collector
to work with LinuxThreads.  I've got it to the point where
it works *some* of the time.  The problem is that it sometimes
fails, apparently getting a segmentation fault in a system call.
It generates a core file and when I examine the core file in gdb,
the current instruction pointer is always just past the `int $0x80'
instruction that invokes the system call.

Any suggestions on how I can go about debugging this?


Details:
--------

There are two tricky parts to the port.  One part is determining where
the thread stacks are so that the collector can include them in its
root set.  This part I have got figured out.  My code must depend on
some of the implementation details of LinuxThreads, but otherwise this
part is not too hard.

The other tricky part is implementing the GC_stop_world() function,
which must suspend all the other threads.  The way I have implemented
this is to send them all a "SIG_SUSPEND" signal, and to have the signal
handler first call sem_post() to tell the main thread that they're
ready to suspend, and then call sigsuspend() (or sleep() -- I tried
both) inside the signal handler.  When the GC is done, the collector
calls GC_start_world() which sends all the threads a "SIG_RESTART"
signal.  The SIG_RESTART handler doesn't do anything except return; the
effect of the signal is just to terminate the call to sigsuspend() or
sleep().
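
In outline, the scheme looks roughly like the sketch below.  This is a
simplified illustration, not the actual port: the demo_* names are invented,
and it assumes a pthreads implementation where SIGUSR1 and SIGUSR2 are free
for application use.  One detail the sketch adds on top of the description
above is that SIG_RESTART is blocked while the suspend handler runs, so a
restart that arrives early stays pending until sigsuspend() unblocks it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define SIG_SUSPEND SIGUSR1  /* assumption: free for application use */
#define SIG_RESTART SIGUSR2  /* assumption: free for application use */

static sem_t suspend_ack;    /* posted once by each thread as it parks */
static volatile unsigned long mutator_ticks;
static volatile int mutator_quit;

/* Does nothing; its only effect is to make sigsuspend() return. */
static void restart_handler(int sig) { (void)sig; }

static void suspend_handler(int sig)
{
    sigset_t wait_mask;

    (void)sig;
    sem_post(&suspend_ack);              /* tell the collector we are parked */
    sigfillset(&wait_mask);
    sigdelset(&wait_mask, SIG_RESTART);  /* wake only for SIG_RESTART */
    sigsuspend(&wait_mask);
}

int demo_gc_signals_init(void)
{
    struct sigaction sa;

    if (sem_init(&suspend_ack, 0, 0) != 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, SIG_RESTART); /* hold restarts until sigsuspend() */
    sa.sa_handler = suspend_handler;
    if (sigaction(SIG_SUSPEND, &sa, NULL) != 0)
        return -1;
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = restart_handler;
    return sigaction(SIG_RESTART, &sa, NULL);
}

/* Suspend the n given threads (the caller must not be among them) and
   wait until each one has acknowledged. */
void demo_stop_world(const pthread_t *threads, int n)
{
    int i;

    for (i = 0; i < n; i++)
        pthread_kill(threads[i], SIG_SUSPEND);
    for (i = 0; i < n; i++)
        while (sem_wait(&suspend_ack) != 0)
            ;                            /* retry if interrupted */
}

void demo_start_world(const pthread_t *threads, int n)
{
    int i;

    for (i = 0; i < n; i++)
        pthread_kill(threads[i], SIG_RESTART);
}

/* Stand-in for an application thread: counts until told to quit. */
void *demo_mutator(void *arg)
{
    (void)arg;
    while (!mutator_quit) {
        mutator_ticks++;
        usleep(1000);
    }
    return NULL;
}
```

A thread can be interrupted inside a system call (here, the sleep inside
usleep()), in which case the suspend handler runs on top of the interrupted
call.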

(Normally I'd use SIGUSR1 and SIGUSR2 for my SIG_SUSPEND and
SIG_RESTART signals, but LinuxThreads already uses those, so I'm
currently reusing SIGIO and SIGPWR for SIG_SUSPEND and SIG_RESTART.)

Anyway, that's all well and good, and when I run the collector's test
case, about 50% of the time it works.  But the other 50% or so, it dies,
sometimes due to failed assertions, but more often due to what is
apparently a segmentation fault in a system call.

Is the problem due to Linux for some reason not liking code that
suspends inside a signal handler?   If so, why doesn't Linux allow this?
Or alternatively, what else could be causing this problem,
and how can I go about debugging it?

I'm using LinuxThreads 0.6, libc 5.3.12, kernel 2.1.35, gcc 2.7.2,
and gdb 4.16.

--
Fergus Henderson <fjh@cs.mu.oz.au>  |  "I have always known that the pursuit
WWW: <http://www.cs.mu.oz.au/~fjh>  |  of excellence is a lethal habit"
PGP: finger fjh@128.250.37.3        |     -- the last words of T. S. Garp.

--=_ORCL_20178951_0_0--