[Openmcl-devel] Need advice to debug segfault when running concurrent selects in clsql/postgresql
gb at clozure.com
Thu Oct 31 15:45:13 CDT 2013
On 10/31/13 3:48 AM, Paul Meurer wrote:
> On 31.10.2013 at 01:15, Gary Byers <gb at clozure.com> wrote:
>> On Wed, 30 Oct 2013, Paul Meurer wrote:
>>> I run it now with --no-init and in the shell, with no difference.
>>> Immediate failure with :consing in *features*,
>>> bogus objects etc. after several rounds without :consing.
>> So, I can't rant and rave about the sorry state of 3rd-party CL
>> libraries, and
>> anyone reading this won't be subjected to me doing so ?
>> Oh well.
>> I was able to reproduce the problem by running your test 100 times,
> I am not able to provoke it at all on the MacBook, and I tried a lot.
>> so apparently
>> I won't be able to blame this on some aspect of your machine. (Also
>> since my ability to diagnose problems that only occur on 16-core
>> machines depends
>> on my ability to borrow such machines for a few months.)
> I think you can do without a 16-core machine. I am able to reproduce
> the failure quite reliably on an older 4-core machine with Xeon CPUs
> and SuSE, with slightly different code (perhaps to get the timing right):
For the last several years (since the Pentium II ?), x86 processors have treated x86
instructions as a kind of bytecode that's dynamically translated
into code for a (largely undocumented) RISC-y microengine. Different x86
implementations do this translation a little differently
(and may implement somewhat different microengines); some sequences of
x86 instructions (bytecodes) may be treated as
a single micro operation in some implementations and not others, and the
factors that govern this can be quite complex.
(Agner Fog has done a lot of research into this - as far as I know, it's
all based on reverse-engineering - and maintains his
This is potentially relevant here: if the
GC misinterprets a thread's state when that thread is stopped at
a particular x86 instruction (e.g., when entering or returning from
foreign code), it may be the case that some x86 implementations
never (or very rarely) see that particular instruction as a separate
instruction and other implementations always/often do.
I tried 100 iterations of your original test on a Core i7 laptop, and
was just about to conclude that I couldn't reproduce the
problem when it failed; I believe you if you say that you haven't been
able to get it to fail on anything but a Xeon. I'd be a little
more confident in this theory than I am if I understood why it ever
failed on my laptop (does the translation behave differently
in some cases than in others on the same machine ?), but I suspect that
if I read Agner Fog's papers carefully I'd understand that
a bit better.
I think that the Intel ATOM (which was used in netbooks a few years ago
and which they're still trying to refine so that it could
be used on mobile devices) is different from both the Xeon and the
Core-2/Core-i machines at this level, and am curious
about whether it fails on an ATOM-based netbook. (I don't have any
working Xeons, but still have a netbook and can use
something else to prop that door open ...)
> If you really need a 16-core machine to debug this I can give you
> access to mine. :-)
Thanks, but I'd need physical access to the machine, possibly for many
after the problem's solved.
>> It's unlikely that this change directly avoids the bug (whatever it
>> is); it's more
>> likely that it affects timing (exactly what happens when.) I don't
>> yet know what
>> the bug is, but I think that it's likely that it's fair to
>> characterize the bug
>> as being "timing-sensitive". (For example: from the GC's point of
>> view, whether
>> a thread is running Lisp or foreign code when that thread is
>> suspended by the GC.
If anyone actually cares, this sentence should probably read "... is
suspended by the
GC is significant; it affects whether the values in the thread's
registers should be
interpreted as "references to Lisp objects" or as "random bit patterns
of no interest
to the GC."
>> The transition between Lisp and foreign code takes a few
>> instructions, and if
>> a thread is suspended in the middle of that instruction sequence and
>> the GC
>> misinterprets its state, very bad things like what you're seeing could
>> That's not supposed to be possible, but something broadly similar
>> seems to be
I was able to attach GDB to the crashed CCL after I provoked the crash
on my laptop,
and what I can tell about what's happened is consistent with this theory.
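
To make the theory concrete, here is a toy model in Python. It is only a
sketch of the general idea, not CCL's actual code: the names (ThreadState,
save_refs, set_flag_foreign, load_raw_args) and the three-step transition are
assumptions made up for illustration. It treats the Lisp-to-foreign transition
as a short sequence of steps and asks, after each step, whether a GC that
suspended the thread right there would misread its registers.

```python
from dataclasses import dataclass

LISP, FOREIGN = "lisp", "foreign"


@dataclass
class ThreadState:
    """Hypothetical per-thread state, as seen when the GC suspends the thread."""
    flag: str = LISP            # what the GC believes the thread is running
    regs_are_raw: bool = False  # True once registers hold raw foreign bits
    refs_saved: bool = False    # True once live Lisp refs are saved where the GC scans


# The (assumed) individual steps of a Lisp -> foreign transition.
def save_refs(t):
    t.refs_saved = True


def set_flag_foreign(t):
    t.flag = FOREIGN


def load_raw_args(t):
    t.regs_are_raw = True


SAFE_ORDER = [save_refs, set_flag_foreign, load_raw_args]
RACY_ORDER = [save_refs, load_raw_args, set_flag_foreign]


def gc_would_misread(t):
    """Would a GC suspending the thread in this state misinterpret it?"""
    if t.flag == LISP:
        # GC treats registers as Lisp references; raw bits would be followed.
        return t.regs_are_raw
    # GC ignores registers; that is only safe if live refs were saved elsewhere.
    return not t.refs_saved


def hazard_points(steps):
    """Return the indices of steps after which a suspension would be misread."""
    t = ThreadState()
    hazards = []
    for i, step in enumerate(steps):
        step(t)
        if gc_would_misread(t):
            hazards.append(i)
    return hazards
```

In this model hazard_points(SAFE_ORDER) is empty, while
hazard_points(RACY_ORDER) reports a single bad suspension point: the one
instruction-sized window between loading raw arguments and updating the flag.
A window that narrow would only be hit occasionally, and how often could
plausibly depend on timing and on the particular CPU.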