[Bug-openmcl] Debugging crashes of dppccl

Gary Byers gb at clozure.com
Fri Mar 5 00:40:02 MST 2004



--On Thursday, March 4, 2004 3:21 PM -0800 Erik Pearson 
<erik at adaptations.com> wrote:


> Well, that is not quite true. The last time this happened the message
> "Abort trap" was printed.
>
> This is running under the bleeding edge, checked out a couple of days ago
> with a few manual tweaks to the threading code.
>

I was going to say that OpenMCL never itself calls abort() and never does
anything that would cause that message to be printed.  I'd have been wrong;
in "ccl:lisp-kernel;lisp-exceptions.c", there's something like:

#define MACH_CHECK_ERROR(x) if (x != KERN_SUCCESS) {abort();}

MACH_CHECK_ERROR is used in three places that I can see, all of which are
related to setting up a thread's exception handling.  None of these things
should ever fail, but things that can't happen should probably throw
themselves at the mercy of the kernel debugger instead of aborting the whole
process ...

I'm not sure if my kernel sources would even compile at the moment.  If you
get a chance, could you please try replacing the macro above with something
like:

#define MACH_CHECK_ERROR(context,x) if (x != KERN_SUCCESS) \
    {Bug(NULL, "Mach error while %s : %d", context, x);}

and replacing the three calls to it (all in lisp-exceptions.c) with:

    MACH_CHECK_ERROR("allocating thread exception_ports",kret);
    MACH_CHECK_ERROR("renaming exception_port",kret);
    MACH_CHECK_ERROR("adding send right to exception_port",kret);

If that's what's causing the abort() you're getting, it should result in
some sort of message instead.  You probably can't do much in the debugger
((b)acktrace might work), but the message would at least tell us which case
is failing.
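
To try the reporting idea outside the lisp kernel, here's a minimal
standalone sketch.  The assumptions: kern_return_t and KERN_SUCCESS are
redefined locally so it builds without the Mach headers, and Bug() is a
hypothetical stub that just records and prints the message (the real one
drops into the kernel debugger).  The do/while(0) wrapper is my addition
rather than anything in lisp-exceptions.c; it makes the macro behave as a
single statement and evaluates its argument only once.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Local stand-ins so the sketch compiles without <mach/mach.h>. */
typedef int kern_return_t;
#define KERN_SUCCESS 0

/* Bookkeeping for the stub below; not part of the real kernel. */
static char last_bug_message[256];
static int bug_count = 0;

/* Hypothetical stub for the lisp kernel's Bug(): it formats the
   message printf-style, remembers it, and prints it to stderr
   instead of entering the kernel debugger. */
static void Bug(void *xp, const char *format, ...)
{
    va_list args;

    (void)xp;                 /* the real Bug() takes exception context */
    assert(format != NULL);
    va_start(args, format);
    vsnprintf(last_bug_message, sizeof last_bug_message, format, args);
    va_end(args);
    bug_count++;
    fprintf(stderr, "%s\n", last_bug_message);
}

/* Report the failing step instead of calling abort().  The result is
   copied into a local so that x is evaluated exactly once. */
#define MACH_CHECK_ERROR(context, x)                                \
    do {                                                            \
        kern_return_t mach_check_result = (x);                      \
        if (mach_check_result != KERN_SUCCESS) {                    \
            Bug(NULL, "Mach error while %s : %d",                   \
                (context), mach_check_result);                      \
        }                                                           \
    } while (0)
```

With this, MACH_CHECK_ERROR("renaming exception_port", kret) stays silent
when kret is KERN_SUCCESS and otherwise reports both the step and the
numeric error code.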

If I had to bet (just because I've seen it before), I'd bet on the second
(port rename) case.  In Mach, a "port name" is just a 32-bit number; what
we're doing here is allocating a port and making that port be the place
where the kernel will send exception messages, then "renaming" that port to
be the new thread's TCR.  The callback function that'll be called when an
exception occurs on a thread will receive the thread's exception port as an
argument; making the TCR -be- the exception port is a sleazy way to get our
hands on the thread's TCR.
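
The "port name is the TCR" trick can be sketched in a few lines.  Everything
here is illustrative: the TCR struct, the helper names and the handler are
hypothetical, and uintptr_t stands in for the 32-bit Mach port name so the
cast stays lossless on a 64-bit host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a Mach port name; pointer-sized for this sketch only. */
typedef uintptr_t port_name_t;

/* Hypothetical, heavily simplified thread context record. */
typedef struct tcr {
    int thread_id;
} TCR;

/* At thread setup the exception port would be renamed to this value. */
static port_name_t tcr_to_port_name(TCR *tcr)
{
    return (port_name_t)tcr;
}

/* In the exception callback the port name is cast straight back. */
static TCR *port_name_to_tcr(port_name_t name)
{
    return (TCR *)name;
}

/* Hypothetical handler: it receives only the exception port, yet it
   recovers the faulting thread's TCR without any lookup table. */
static int handle_exception(port_name_t exception_port)
{
    TCR *tcr = port_name_to_tcr(exception_port);

    assert(tcr != NULL);
    return tcr->thread_id;
}
```

The payoff is that the handler needs no port-to-thread map; the cost is that
the port name space and the TCR address space now have to stay in sync.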

If a TCR's deallocated, its address is still a valid Mach port name; the
function darwin_exception_cleanup() is what makes the TCR stop being a valid
port.  If we're somehow freeing the TCR without calling
darwin_exception_cleanup() on it, and subsequently allocating a new TCR at
the same address, the old TCR's address is still a valid port name and the
rename call for the new TCR will fail.
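
That failure mode is easy to model with a toy name table.  This is a
simulation under stated assumptions, not Mach code: the sim_* functions and
SIM_* codes are made up for illustration, and the only behavior borrowed
from Mach is that a rename is refused when the new name is already in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up result codes for the simulation (not Mach's values). */
enum { SIM_SUCCESS = 0, SIM_NAME_EXISTS, SIM_INVALID_NAME, SIM_NO_SPACE };

#define SIM_MAX_PORTS 8

/* Toy port name space: a fixed table of names that are in use. */
typedef struct {
    uintptr_t names[SIM_MAX_PORTS];
    int in_use[SIM_MAX_PORTS];
    uintptr_t next_name;
} sim_port_space;

/* Return the slot holding name, or -1 if the name is not in use. */
static int sim_find(const sim_port_space *s, uintptr_t name)
{
    for (int i = 0; i < SIM_MAX_PORTS; i++)
        if (s->in_use[i] && s->names[i] == name)
            return i;
    return -1;
}

/* Allocate a port under a fresh, small, unused name. */
static int sim_port_allocate(sim_port_space *s, uintptr_t *name_out)
{
    assert(s != NULL && name_out != NULL);
    for (int i = 0; i < SIM_MAX_PORTS; i++) {
        if (!s->in_use[i]) {
            do { s->next_name++; } while (sim_find(s, s->next_name) >= 0);
            s->names[i] = s->next_name;
            s->in_use[i] = 1;
            *name_out = s->next_name;
            return SIM_SUCCESS;
        }
    }
    return SIM_NO_SPACE;
}

/* Rename a port; refused if new_name already names something. */
static int sim_port_rename(sim_port_space *s, uintptr_t old_name,
                           uintptr_t new_name)
{
    int slot = sim_find(s, old_name);

    if (slot < 0)
        return SIM_INVALID_NAME;
    if (sim_find(s, new_name) >= 0)
        return SIM_NAME_EXISTS;
    s->names[slot] = new_name;
    return SIM_SUCCESS;
}

/* Plays the role of the cleanup step: the name stops being valid. */
static int sim_port_destroy(sim_port_space *s, uintptr_t name)
{
    int slot = sim_find(s, name);

    if (slot < 0)
        return SIM_INVALID_NAME;
    s->in_use[slot] = 0;
    return SIM_SUCCESS;
}

/* Thread setup: allocate a port, then rename it to the TCR address. */
static int sim_setup_exception_port(sim_port_space *s, uintptr_t tcr_address)
{
    uintptr_t fresh;
    int result = sim_port_allocate(s, &fresh);

    if (result != SIM_SUCCESS)
        return result;
    return sim_port_rename(s, fresh, tcr_address);
}
```

Calling sim_setup_exception_port() twice with the same address and no
sim_port_destroy() in between makes the second call return SIM_NAME_EXISTS,
which is the scenario described above; destroying the stale name first lets
the setup succeed again.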

(The only reason that I know this is that Apple's Exception Manager has the 
same bug.)

If it's one of the other cases, I don't have a good theory; if it's not one 
of the cases above, I get to give my speech about How Horrible It Is To 
Just Call abort().





