[Bug-openmcl] Debugging crashes of dppccl
Gary Byers
gb at clozure.com
Sat Mar 6 15:24:04 MST 2004
On Sat, 6 Mar 2004, Erik Pearson wrote:
> Gary,
>
> Comments below on latest testing, thrown at your mercy while I muddle
> through the code.
>
> Erik.
>
>
> Okay, here is a short version of what I've done with this in the last day.
>
> I installed the debugging patches above, and first found that it was indeed
> the renaming bug that was revealed. (This was after finding that Bug()
> wouldn't work in this context because things were too broken -- after
> changing that to a raw fprintf() the debugging lines were printed.)
>
Ahem. Bug() wants a TCR, so that it can use it to suspend other TCRs.
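For debugging in a context where thread state may be too broken for Bug(), something like the following could serve as a fallback. This is only a sketch: raw_debug is a hypothetical name, not part of the lisp kernel, and it relies on nothing beyond stdio.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper (not from the lisp kernel): format a message to
   `out` and flush at once, without touching any per-thread runtime
   state.  Returns the number of characters written, or a negative
   value on error, exactly as vfprintf does. */
int raw_debug(FILE *out, const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    n = vfprintf(out, fmt, args);
    va_end(args);
    fflush(out);   /* make sure the line is out before any crash */
    return n;
}
```

A call such as raw_debug(stderr, "bad: renaming exception_port, %d, 0x%x\n", 13, 0x105540u) would produce a line shaped like the ones quoted below.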
> This is with egc off, and without using the Bug() call, as it doesn't work
> at this point. So no debugging output.
>
> bad: renaming exception_port, 13, 0x105540
> bad: adding send right to exception_port, 268435466, 0x105540
> Couldn't setup exception handler - error = -308
The most likely theory is that 0x105540 was recently the address of another
(defunct) TCR, and that it was freed without losing its status as a port.
That shouldn't happen if all of the cleanup code runs reliably every time.
We can possibly catch this earlier (by somehow making sure that a newly-
malloced TCR isn't already the name of a Mach port and either destroying
the port or mallocing another TCR in that case.)
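The "malloc another TCR" variant of that check might look roughly like the sketch below. Everything here is an assumption made for illustration: the port namespace is simulated with a small table (a real version would have to ask Mach whether the name is in use), and note_port_name, name_is_port and alloc_tcr_avoiding_ports are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the task's port namespace: a table of
   names that still denote a port. */
#define MAX_STALE 16
uintptr_t stale_names[MAX_STALE];
size_t n_stale = 0;

/* Simulate a leaked port: remember that `name` is still a port name. */
void note_port_name(uintptr_t name)
{
    assert(n_stale < MAX_STALE);
    stale_names[n_stale++] = name;
}

/* Is `name` currently registered as a port name (in the simulation)? */
bool name_is_port(uintptr_t name)
{
    for (size_t i = 0; i < n_stale; i++) {
        if (stale_names[i] == name) return true;
    }
    return false;
}

/* Allocate a block whose address is not already a port name.
   Colliding blocks are held until a usable one is found, so that
   malloc cannot hand the same address straight back, and are freed
   afterwards.  Returns NULL if no usable block could be obtained. */
void *alloc_tcr_avoiding_ports(size_t size)
{
    void *rejected[MAX_STALE];
    size_t n_rejected = 0;
    void *p = NULL;

    while (n_rejected < MAX_STALE) {
        p = malloc(size);
        if (p == NULL || !name_is_port((uintptr_t)p)) break;
        rejected[n_rejected++] = p;   /* address collides: keep it aside */
        p = NULL;
    }
    for (size_t i = 0; i < n_rejected; i++) free(rejected[i]);
    return p;
}
```

This only papers over the symptom; it says nothing about why the old port name survived the TCR that owned it.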
As to why this is happening at all: I suspect that the GC accessing
all_areas without holding the corresponding lock is a likely culprit.
I'm not sure -how- that would cause the things that you're seeing to
go wrong.
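The general discipline in question, where every reader and writer of a shared list takes the same lock, can be sketched as follows. The struct layout and the names area_list, area_list_lock, add_area and count_areas are assumptions for illustration only; they are not the kernel's actual definitions.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified area record; the real structure differs. */
typedef struct area {
    struct area *succ;   /* next area in the list */
    int code;            /* placeholder payload */
} area;

area *area_list = NULL;
pthread_mutex_t area_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writers (e.g. thread creation) link a new area in under the lock. */
void add_area(area *a)
{
    pthread_mutex_lock(&area_list_lock);
    a->succ = area_list;
    area_list = a;
    pthread_mutex_unlock(&area_list_lock);
}

/* Readers (e.g. a GC pass) hold the same lock for the whole walk, so
   the list cannot change underneath them. */
size_t count_areas(void)
{
    size_t n = 0;

    pthread_mutex_lock(&area_list_lock);
    for (area *a = area_list; a != NULL; a = a->succ) n++;
    pthread_mutex_unlock(&area_list_lock);
    return n;
}
```

If a reader skips the lock, a concurrent add or remove can leave it following a half-updated or stale link, which is the kind of failure suspected here.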
>
> I haven't debugged this too much further, since I ran into the issue below
> after turning on egc. BTW this happens at a semi-predictable point in the
> code, which is usually about 1300 iterations during which there are two
> "monitoring checks" of websites, which involve one thread and network
> connection each. If this is run for one monitoring check, then it runs for
> approximately twice as long. This makes me think that some resource is
> being exhausted, or some limit is being reached.
>
>
> If I turn egc back on, it does not error out at this point, but gets a
> "Can't find active area" error.
Is there any way that you can isolate this to a reproducible test case?
If I'd need to run the whole shebang to see the behavior, is that possible?