[info-mcl] process crash
Gary Byers
gb at clozure.com
Sat Aug 1 15:09:38 EDT 2009
I'm fairly sure that adding a couple of lines near the end of the LAP
function %SAVE-STACK-GROUP-CONTEXT (in "ccl:level-1;PPC;ppc-stack-groups.lisp")
will fix the problem that causes PROCESS-RUN-FUNCTION to sometimes crash,
as suggested by the following diff:
Index: level-1/PPC/ppc-stack-groups.lisp
===================================================================
--- level-1/PPC/ppc-stack-groups.lisp (revision 223)
+++ level-1/PPC/ppc-stack-groups.lisp (working copy)
@@ -1685,6 +1685,8 @@
(svset data sg.ts-overflow-limit sg t)
; Prevent stack overflow when we reenter this code (may not be necessary)
(set-global rzero cs-overflow-limit)
+ (set-global rzero vs-overflow-limit)
+ (set-global rzero ts-overflow-limit)
(set-global rzero db-link))
(blr))
The problem (at least the problem that I'm aware of) has to do with how
stack-overflow is detected in RMCL. An (R)MCL thread has 3 stacks, where
the thread's "control stack" is the hardware stack (addressed by r1 on the
PPC) and the "value" and "temp" stacks are used for (roughly) fixed- and
variable-sized lisp objects. In native MCL, overflow on the temp and
value stacks could be detected by write-protecting some guard pages
at the end of the stack and handling the resulting exception. Since
exceptions don't work under Rosetta, in RMCL it's necessary to check
for overflow in software (by comparing the appropriate stack pointer
to a global variable and signaling a stack-overflow if the stack pointer
is "unsigned less than" the limit).
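To make the comparison concrete, here is a rough sketch of that check in
Python rather than PPC LAP. The names stack_overflowed and MASK32 are made up
for illustration and don't appear in the MCL sources; the sketch assumes
32-bit addresses and stacks that grow toward lower addresses.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit addresses


def stack_overflowed(stack_pointer, overflow_limit):
    """Return True if the stack pointer is "unsigned less than" the limit.

    Both values are treated as unsigned 32-bit quantities. A limit of 0
    can never trigger, since no unsigned value is less than 0.
    """
    return (stack_pointer & MASK32) < (overflow_limit & MASK32)


if __name__ == "__main__":
    print(stack_overflowed(0x1000, 0x2000))      # pointer below the limit
    print(stack_overflowed(0x3000, 0x2000))      # pointer above the limit
    print(stack_overflowed(0x3000, 0))           # limit of 0 never triggers
```

With a limit of 0 the check is effectively disabled, which is what makes 0
usable as a "no limit known yet" value.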
Only one thread can run at a time in (R)MCL, and part of context-switching
between threads (stack-groups, actually ...) involves copying some global
state into the outgoing thread, making the global state "thread neutral",
and then copying the incoming thread's state to the global variables. (These
global variables include the current thread's stack overflow limits.)
One of the first things that a new thread does is to try to determine what
its stack overflow limits should be. Until it's done this (and set the
appropriate global variables), any software stack-overflow checks that the
new thread does have to use the "thread-neutral" value 0 (no stack pointer
can be "unsigned less-than" 0.) Because the outgoing thread neglected
to zero out the global temp- and value-stack limits (in the +-prefixed lines
in the patch above), the first few stack-overflow checks in a new thread
compared the stack pointer to the previously active thread's limit. This
is completely wrong, but it has a 50% chance of being accidentally right
(depending on the relative addresses of the outgoing thread's stack and
the incoming thread's.) Roughly half the time, the new thread would
get a spurious stack overflow before it'd even finished initializing
itself and this generally led to an immediate crash.
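The failure mode can be modelled in a few lines of Python. This is only a toy
model under stated assumptions, not the actual LAP code: the dictionary of
globals, the addresses, and the function names are all invented for
illustration. Only the three limit names come from the patch.

```python
# Toy model of the context switch; everything except the three limit
# names is hypothetical.
LIMIT_NAMES = ("cs-overflow-limit", "vs-overflow-limit", "ts-overflow-limit")

BEFORE_PATCH = ("cs-overflow-limit",)  # only the control-stack limit zeroed
AFTER_PATCH = LIMIT_NAMES              # all three limits zeroed


def save_context(global_limits, saved, zeroed):
    """Copy the global limits into the outgoing thread's record, then
    zero the globals named in `zeroed` to make them thread neutral."""
    for name in LIMIT_NAMES:
        saved[name] = global_limits[name]
    for name in zeroed:
        global_limits[name] = 0


def overflow_detected(stack_pointer, limit):
    """Software check: overflow if the pointer is below the limit."""
    return stack_pointer < limit


def first_check_in_new_thread(new_value_sp, zeroed):
    """Switch away from an outgoing thread (made-up limits), then run the
    new thread's first value-stack check against the global limit."""
    global_limits = {
        "cs-overflow-limit": 0x20001000,
        "vs-overflow-limit": 0x30001000,
        "ts-overflow-limit": 0x40001000,
    }
    save_context(global_limits, {}, zeroed)
    return overflow_detected(new_value_sp, global_limits["vs-overflow-limit"])


if __name__ == "__main__":
    low, high = 0x10008000, 0x50008000  # new stack below / above the old one
    print(first_check_in_new_thread(low, BEFORE_PATCH))   # spurious overflow
    print(first_check_in_new_thread(high, BEFORE_PATCH))  # accidentally fine
    print(first_check_in_new_thread(low, AFTER_PATCH))    # fine
    print(first_check_in_new_thread(high, AFTER_PATCH))   # fine
```

In this model, whether the unpatched switch fails depends only on whether the
new thread's stack happens to sit below the outgoing thread's limit; with all
three limits zeroed the first check can't fire in either case.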
AFAICT, this doesn't have anything to do with event-processing per se,
but it may be the case that switching to a new thread from the event
thread would fail and switching to the new thread from (e.g.) the
listener thread would succeed, and this has to do with the more-or-less
arbitrary relative addresses of the incoming and outgoing threads' stacks.
I haven't seen PROCESS-RUN-FUNCTION fail since making the 2-line change
above. That's not conclusive (since I hadn't seen it fail until Andrew
told me about the discussion on this list a few weeks ago), but the explanation
above seems to be consistent with the (somewhat unpredictable) behavior
that people here have reported, and the 2-line fix above seems to fix the
problem.
On Sat, 1 Aug 2009, Terje Norderhaug wrote:
> On Jul 20, 2009, at 4:22 PM, Alexander Repenning wrote:
>> Just wondering... has anybody found any work around the process problems in
>> RMCL. At least so far I have not been able to use process-run-function in a
>> way that is NOT causing a crash.
>
> The enclosed patch attempts to work around the process problems in RMCL, as a
> potential remedy until we have a proper fix. It exploits the observation that
> processes can be started in the event handler without crashes. The patch
> advises process-run-function and should not require changes to any other
> code.
>
> -- Terje Norderhaug
>