[Bug-openmcl] Debugging crashes of dppccl
Gary Byers
gb at clozure.com
Sat Mar 20 23:43:23 MST 2004
On Thu, 11 Mar 2004, Erik Pearson wrote:
> >>
> >> I seem to be able to run a lot more iterations before crashing; I was
> >> about to send this message saying that I wasn't able to get a
> >> segfault, but I just got one (on the 12,273rd iteration). In my case, it
> >> looked like:
> >>
> >> Unhandled exception 11 at 0x020f1ee0, context->regs at #xf01356c8
> >> Read operation to unmapped address 0xf056000c
> >> While executing: #<Function %STACK-AREA-USABLE-SIZE #x060ef4de>
> >> ? for help
> >> [2244] OpenMCL kernel debugger: Run 2273, result=(1089 NIL NIL NIL)
> >> b
> >>
As it turns out, this is pretty mundane. (Bad, but mundane.)
A new TCR is created from the lisp side via CCL::NEW-TCR (in
"ccl:level-1;l1-lisp-threads.lisp"); the function's just a little
wrapper that calls into the kernel to create a TCR with the right
stack sizes and then turns the pointer that the kernel returns
(which is guaranteed to have at least its low 2 bits clear) into
a fixnum (which is guaranteed to have its low 2 bits clear). The
way that it does this is:
  (ash (%ptr-to-int ptr) -2)
-2 would be better as (- target::fixnumshift), but that's not the real
problem: %PTR-TO-INT returns an (UNSIGNED-BYTE 32), and shifting that
right by 2 bits might return a bignum (and will if the most significant
bit of the pointer address is set.)
That bit will be set if malloc feels like allocating memory "up there",
which it will more likely do in the current bleeding-edge version (in which
the lisp is reserving most of the low 2GB of the address space.)
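The overflow is easy to check with ordinary integer arithmetic. What follows is only a sketch in portable Common Lisp, not OpenMCL source: it assumes a 32-bit word with a 2-bit fixnum tag (so fixnums are 30-bit signed integers), and the names *MODEL-FIXNUM-BITS*, MODEL-FIXNUM-P and NAIVE-POINTER->FIXNUM, along with the two sample addresses, are made up for illustration.

```lisp
;;; Portable model of the arithmetic only. Assumes a 32-bit word with a
;;; 2-bit fixnum tag, so fixnums are 30-bit signed integers. All names
;;; and sample addresses are hypothetical; none of this is OpenMCL code.

(defparameter *model-fixnum-bits* 30)

(defun model-fixnum-p (n)
  "True if N fits in a signed field of *MODEL-FIXNUM-BITS* bits."
  (let ((limit (ash 1 (1- *model-fixnum-bits*))))
    (and (<= (- limit) n) (< n limit))))

(defun naive-pointer->fixnum (address)
  "Shift an unsigned 32-bit ADDRESS right by 2, as the buggy code did."
  (ash address -2))

;; An address in the low 2GB shifts down to something that fits.
(model-fixnum-p (naive-pointer->fixnum #x10000000))   ; => T

;; An address with bit 31 set does not: #x3C000000 is bigger than the
;; largest 30-bit signed value (#x1FFFFFFF), so it would need a bignum.
(model-fixnum-p (naive-pointer->fixnum #xf0000000))   ; => NIL
```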
I've botched this macptr->fixnum calculation often enough that there's
a tiny little LAP function (CCL::MACPTR->FIXNUM) that does the dirty
work. Using MACPTR->FIXNUM in NEW-TCR seems to keep things from
crashing when we first try to reference memory relative to that TCR
(via CCL::%FIXNUM-REF, which doesn't do a very good job of indirecting
through bignums.)
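For comparison, here is a sketch of the value a pointer-to-fixnum conversion has to produce to stay out of bignum territory. It is portable Common Lisp that models the arithmetic, not the LAP code itself, and it assumes (this is my reading, not something shown above) that the conversion amounts to treating the 32-bit word as signed before dropping the two tag bits. SIGNED-POINTER->FIXNUM and FIXNUM->ADDRESS are hypothetical names, and the 32-bit word and 30-bit fixnum sizes are assumptions too.

```lisp
;;; Model only: assumes 32-bit words and 30-bit signed fixnums.
;;; The function names are hypothetical, not OpenMCL functions.

(defun signed-pointer->fixnum (address)
  "Reinterpret the unsigned 32-bit ADDRESS as signed, then shift out
the two low (tag) bits. The result always fits in 30 signed bits."
  (let ((signed (if (logbitp 31 address)
                    (- address (ash 1 32))
                    address)))
    (ash signed -2)))

(defun fixnum->address (n)
  "Inverse of SIGNED-POINTER->FIXNUM: recover the unsigned 32-bit address."
  (ldb (byte 32 0) (ash n 2)))

;; A "high" address becomes a small negative integer instead of a bignum.
(signed-pointer->fixnum #xf0000000)                     ; => -67108864

;; Nothing is lost: the original address comes back out.
(fixnum->address (signed-pointer->fixnum #xf0000000))   ; => 4026531840
```

The second result is #xF0000000 printed in decimal, so the round trip is exact for any address whose low 2 bits are clear.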
This isn't a subtle timing-sensitive resource contention problem (it's
just a bug that's exposed by the fact that malloc's more likely to
return a "signed" address). I suspect that there may still be a
subtle timing problem or two, but I haven't yet been able to reproduce
one. I'll keep trying.