;;-*- mode: lisp; package: ccl -*-
#|
From: gb@clozure.com
Subject: Re: [info-mcl] process crash
Date: August 1, 2009 12:09:38 PM PDT
To: info-mcl@clozure.com
Reply-To: info-mcl@clozure.com

I'm fairly sure that adding a couple of lines near the end of the LAP
function %SAVE-STACK-GROUP-CONTEXT (in
"ccl:level-1;PPC;ppc-stack-groups.lisp") will fix the problem that
causes PROCESS-RUN-FUNCTION to sometimes crash, as suggested by the
following diff:

Index: level-1/PPC/ppc-stack-groups.lisp
===================================================================
--- level-1/PPC/ppc-stack-groups.lisp (revision 223)
+++ level-1/PPC/ppc-stack-groups.lisp (working copy)
@@ -1685,6 +1685,8 @@
     (svset data sg.ts-overflow-limit sg t)
     ; Prevent stack overflow when we reenter this code (may not be necessary)
     (set-global rzero cs-overflow-limit)
+    (set-global rzero vs-overflow-limit)
+    (set-global rzero ts-overflow-limit)
     (set-global rzero db-link))
   (blr))

The problem (at least the problem that I'm aware of) has to do with how
stack-overflow is detected in RMCL. An (R)MCL thread has 3 stacks,
where the thread's "control stack" is the hardware stack (addressed by
r1 on the PPC) and the "value" and "temp" stacks are used for (roughly)
fixed- and variable-sized lisp objects.

In native MCL, overflow on the temp and value stacks could be detected
by write-protecting some guard pages at the end of the stack and
handling the resulting exception. Since exceptions don't work under
Rosetta, in RMCL it's necessary to check for overflow in software (by
comparing the appropriate stack pointer to a global variable and
signaling a stack-overflow if the stack pointer is "unsigned less than"
the limit.)

Only one thread can run at a time in (R)MCL, and part of
context-switching between threads (stack-groups, actually ...) involves
copying some global state into the outgoing thread, making the global
state "thread neutral", and then copying the incoming thread's state to
the global variables.
(These global variables include the current thread's stack overflow
limits.)

One of the first things that a new thread does is to try to determine
what its stack overflow limits should be. Until it's done this (and set
the appropriate global variables), any software stack-overflow checks
that the new thread does have to use the "thread-neutral" value 0 (no
stack pointer can be "unsigned less-than" 0.)

Because the outgoing thread neglected to zero out the global temp- and
value-stack limits (in the +-prefixed lines in the patch above), the
first few stack-overflow checks in a new thread compared the stack
pointer to the previously active thread's limit. This is completely
wrong, but it has a 50% chance of being accidentally right (depending
on the relative addresses of the outgoing thread's stack and the
incoming thread's.) Roughly half the time, the new thread would get a
spurious stack overflow before it'd even finished initializing itself
and this generally led to an immediate crash.

AFAICT, this doesn't have anything to do with event-processing per se,
but it may be the case that switching to a new thread from the event
thread would fail and switching to the new thread from (e.g.) the
listener thread would succeed, and this has to do with the more-or-less
arbitrary relative addresses of the incoming and outgoing threads'
stacks.

I haven't seen PROCESS-RUN-FUNCTION fail since making the 2-line change
above. That's not conclusive (since I hadn't seen it fail until Andrew
told me about the discussion on this list a few weeks ago), but the
explanation above seems to be consistent with the (somewhat
unpredictable) behavior that people here have reported, and the 2-line
fix above seems to fix the problem.
|#

(in-package :ccl)

#+rmcl
(let ((*WARN-IF-REDEFINE-KERNEL* nil)
      (*warn-if-redefine* nil))
  (defppclapfunction %save-stack-group-context ((sg arg_z))
    (let ((address imm0)
          (data imm1))
      ; Update active pointer for vsp area to include the 8 words pushed by .SPffcall
      (ref-global address current-vs)
      (la data (- (* 4 8)) vsp)         ; .SPffcall pushes the 8 saved registers on the VSP
      (stw data ppc::area.active address)
      (ref-global data cs-overflow-limit)
      (svset data sg.cs-overflow-limit sg t)
      (ref-global data vs-overflow-limit)
      (svset data sg.vs-overflow-limit sg t)
      (ref-global data ts-overflow-limit)
      (svset data sg.ts-overflow-limit sg t)
      ; Prevent stack overflow when we reenter this code (may not be necessary)
      (set-global rzero cs-overflow-limit)
      (set-global rzero vs-overflow-limit)
      (set-global rzero ts-overflow-limit)
      (set-global rzero db-link))
    (blr))
  ) ; end redefine
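
#|
A minimal sketch of the software overflow check described in the message
at the top of this file, written in portable Common Lisp rather than PPC
LAP. It is illustrative only and is not part of the patch:
SKETCH-STACK-OVERFLOW-P and the addresses below are hypothetical, and
32-bit stack pointers are assumed.

(defun sketch-stack-overflow-p (stack-pointer limit)
  ;; Model the "unsigned less than" comparison on 32-bit values.
  (< (ldb (byte 32 0) stack-pointer)
     (ldb (byte 32 0) limit)))

(let ((stale-limit #x20000000)   ; hypothetical limit left by the outgoing thread
      (low-sp      #x10000000)   ; hypothetical new stack below that limit
      (high-sp     #x30000000))  ; hypothetical new stack above that limit
  (list (sketch-stack-overflow-p low-sp 0)               ; zeroed limit never trips
        (sketch-stack-overflow-p low-sp stale-limit)     ; spurious overflow
        (sketch-stack-overflow-p high-sp stale-limit)))  ; accidentally right
;; => (NIL T NIL)

With the limit zeroed, the check cannot fire no matter where the new
thread's stack lives. With a stale limit, the result depends only on
whether the new stack happens to sit below or above the old limit.
|#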