[Openmcl-devel] Porting the OpenMCL Compiler
gb at clozure.com
Wed Jul 6 06:58:42 EDT 2005
On Wed, 6 Jul 2005, James Bielman wrote:
> I've been spending a fair bit of time studying the OpenMCL internals
> and I now think I'm dangerous enough to consider a port to ARM
> (probably Linux first, then hopefully Windows CE).
I actually have a Zaurus (under a pile of papers on my desk, I think ...)
and occasionally think of the same thing.
Any processor with a "SoftWare Interrupt if Not Equal" (SWINE)
instruction can't be all bad ...
> Obviously this is a huge task and I certainly don't expect to get very
> far anytime soon, but hopefully I can learn a lot from the process
> either way.
> I'm planning to start out implementing kernel subprimitives and the
> LAP assembler, but I'd like to get some advice on register usage,
> since the ARM has far fewer registers than PowerPC.
> Basically, there are 16 GPRs, r0-r15, except r15 is the PC, r14 is the
> link register, and r13 is typically the control stack pointer. So,
> apart from any tricks to be done with reusing lr for other purposes
> (although I'd think that, being a GPR, it could fulfill the same
> purpose as the LOC-PC register on PPC?), we are left with 13 GPRs.
Yes; I don't think that you'd need a separate LOC-PC register on the
ARM (or if you did, you wouldn't need it very often.)
> I don't know if there are any guidelines about how many registers are
> necessary for it to be worth partitioning into boxed and unboxed. If
> this isn't enough then obviously life gets more complicated...
On the 68K (which only had 16 registers), I remember people who wrote
LAP code saying that there never seemed to be enough immediate
registers. (I don't remember how many there were, and they came in
two flavors, so the problem was often that there weren't enough
unboxed data or address registers but more than enough of the other.)
I think that when the PPC compiler does (SETF (SBIT bv idx) val) -
and neither "idx" nor "val" is a constant - there are about 4 live values
in immediate registers. You can do things a little differently
(re-calculate some of these values), and it may be more convenient
to spill some of these values to a stack than it's been on the PPC,
but it's also desirable to make primitives fast (and, on the ARM,
> I wrote a little Lisp program to loop over all the fbound symbols in
> the OpenMCL image and disassemble them to a file, then grepped the
> output (hopefully correctly) to count register usage. Here are the
> results:
> ARG_Z 132643
> IMM0 50129
> VSP 48012
> ARG_Y 34665
> FN 33836
> SAVE0 32224
> TEMP3 19587
> SAVE1 18077
> ALLOCPTR 11647
> SAVE2 11001
> TEMP4 9398
> LOC-PC 9239
> ARG_X 8769
> NARGS 8744
> SAVE3 7772
> TEMP0 7249
> TSP 5528
> SAVE4 5155
> TEMP2 4285
> SAVE5 3717
> SAVE7 3431
> SP 2963
> SAVE6 2747
> IMM1 2436
> ALLOCBASE 2045
> IMM2 839
> IMM3 517
> TEMP1 446
> RCONTEXT 293
> IMM4 237
> IMM5 13
> Based on this (and some possibly incorrect common sense), here's what
> I've got so far:
The static breakdown is interesting. I'm not sure how to obtain a
dynamic breakdown; I'd -guess- that it'd show similar results, but any
differences would also be interesting.
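For anyone wanting to reproduce a static count like the one quoted above, here is a minimal sketch in Python rather than Lisp. It assumes the disassembly has already been written out as text with one instruction per line and `;` comments. The register set, the sample listing, and the function name `count_registers` are all hypothetical, and this is not the program James actually used.

```python
import re
from collections import Counter

# Hypothetical subset of register names; a real run would list every
# symbolic register name the disassembler can print.
REGISTERS = {"arg_x", "arg_y", "arg_z", "imm0", "imm1", "vsp", "fn", "nargs"}

# Identifiers may contain underscores and hyphens (e.g. arg_z, loc-pc).
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def count_registers(disassembly: str) -> Counter:
    """Count how often each known register name appears as an operand."""
    counts = Counter()
    for line in disassembly.splitlines():
        code = line.split(";", 1)[0]  # ignore trailing comments
        for token in TOKEN.findall(code):
            name = token.lower()
            if name in REGISTERS:
                counts[name] += 1
    return counts


# Made-up listing, only here to exercise the counter.
SAMPLE = """\
(twnei nargs 0)      ; arg count check
(mr arg_y arg_z)
(lwz imm0 4 arg_z)   ; imm0 holds an unboxed value
(stwu arg_z -4 vsp)
"""

if __name__ == "__main__":
    for name, n in count_registers(SAMPLE).most_common():
        print(f"{name.upper():8} {n}")
```

A dynamic breakdown would need execution counts per instruction (from a profiler or simulator), which this static pass does not attempt.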
> r0 imm0 unboxed temp reg
> r1 imm1 unboxed temp reg
> r2 temp0 boxed temp reg
> r3 temp1 boxed temp reg
> r4 save0 boxed caller-save reg
> r5 save1 boxed caller-save reg
> r6 arg_y second to last argument
> r7 arg_z last argument
> r8 nargs number of function arguments
> r9 allocptr heap free pointer
> r10 fn current function object
> r11 rcontext thread context register
> r12 vsp value stack pointer
> r13 sp control stack pointer
> r14 lr link register
> r15 pc program counter
Back in the 80s, people did some performance studies (there was one or
more from the University of Utah and there were some by Benjamin Zorn
at the University of Colorado that I remember) of lisp programs;
someone determined that the mean number of arguments to a function was
a little under 2 (counted dynamically) and a little over 2, so 2 argument
registers sounds about right.
Some PPC Linuces (I don't know about ARM Linux or WinCE) want to keep
thread-specific information in a register (DarwinPPC64 wants to keep
a pointer to the current pthread in R13), and the C runtime often
gets confused and upset if this convention is violated. (OpenMCL's
been trying to get away with violating it while lisp code is running,
but weird things happen during exception handling and the next
release will avoid Angering The TLS Gods.) If the OS supports the
concept of thread-local storage (TLS), it may be possible to make
the lisp TCR be a thread-local variable (with a known offset within
the block of thread-local variables that the ABI's thread-pointer points
to), which would keep both the OS and Lisp happy without burning a
register.
It's nice to be able to cons inline, but I'm not sure if it's
On the PPC, "nargs" is only used in limited contexts (#args/#values),
but (as far as the GC is concerned) it's just an immediate register.
The sole reason why nargs isn't used (e.g., as another imm register
in (SETF SBIT)) more generally is to simplify the interpretation
of certain PPC trap instructions:
  (twnei nargs 0)   ; means "trap if the current function got other
                    ; than 0 arguments"
  (twnei imm0 0)    ; means "the object whose tag was extracted to imm0
                    ; isn't a fixnum, but it really should be."
If you worked harder at interpreting such traps, you could remove
this restriction and make "nargs" a general-purpose immediate register.
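To make the "interpretation by register" idea concrete, here is a rough Python sketch of a trap handler that picks an error message purely from which register the trap instruction names. The function name `interpret_twnei` and the message strings are hypothetical, and the operand encoding is simplified (it is treated as the plain expected value); this is not how the OpenMCL kernel is actually written.

```python
FIXNUM_TAG = 0  # in the example above, the extracted tag is compared with 0


def interpret_twnei(reg: str, operand: int) -> str:
    """Guess what a (twnei reg operand) trap meant from the register used.

    This only works because nargs is reserved for argument counts: the
    same instruction on an imm register is read as a failed tag check.
    """
    if reg == "nargs":
        return f"wrong number of arguments (expected {operand})"
    if reg.startswith("imm"):
        if operand == FIXNUM_TAG:
            return "type error: expected a fixnum"
        return f"type error: expected an object with tag {operand}"
    return "unrecognized trap"


if __name__ == "__main__":
    print(interpret_twnei("nargs", 0))
    print(interpret_twnei("imm0", 0))
```

If nargs were used as an ordinary immediate register, the first branch would become ambiguous, and the handler would need extra information (for example, the trap's position within the function) to tell the two cases apart.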
There might also be ways of getting some flexibility (in some sense
of the word) and still keeping a preemptively scheduled GC happy.
Suppose that you were about to enter a loop where you really needed
a bunch of IMM regs and had no use for some node regs (in that loop).
You -might- be able to do something like:
  (li save0 0)             ; I'm lapsing into PPC assembler here ...
  (li save1 0)
  ;; now set some bits in the TCR somewhere that says that
  ;; save0/save1 are IMM regs, temporarily.
  (load save0 unboxed-stuff)
  (add save0 save1 save0)  ; etc
  (li save0 0)
  (li save1 0)
  ;; clear those hypothetical TCR bits.
The GC'd have to cooperate somehow, and you'd have to be very
disciplined about using this, but it looks like it could be
made to work safely (and might be very useful.)
(I'm thinking about doing a port to a totally bizarre register-starved
architecture, and would think seriously about this approach.)
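The protocol sketched above can be modelled in a few lines of Python to check the ordering. Everything here is hypothetical: the `TCR` class, its `unboxed_overrides` field, the `NODE_REGS` tuple, and the helper names are made up for illustration, and a real GC would read these bits out of a suspended thread's context rather than a dictionary.

```python
from contextlib import contextmanager

# Registers the GC normally treats as holding tagged lisp objects
# (a made-up subset, for illustration only).
NODE_REGS = ("temp0", "temp1", "save0", "save1", "arg_y", "arg_z", "fn")


class TCR:
    """Stand-in for the thread context record."""

    def __init__(self):
        self.unboxed_overrides = set()  # the hypothetical "TCR bits"


@contextmanager
def borrowed_as_immediate(tcr, regs, names):
    """Temporarily let node registers hold unboxed values."""
    for name in names:
        regs[name] = 0                      # (li saveN 0): safe either way
    tcr.unboxed_overrides.update(names)     # now tell the GC to skip them
    try:
        yield
    finally:
        for name in names:
            regs[name] = 0                  # scrub the unboxed bits first
        tcr.unboxed_overrides.difference_update(names)


def gc_roots(tcr):
    """Registers the GC would scan if it interrupted the thread right now."""
    return [r for r in NODE_REGS if r not in tcr.unboxed_overrides]


if __name__ == "__main__":
    tcr, regs = TCR(), dict.fromkeys(NODE_REGS, 0)
    with borrowed_as_immediate(tcr, regs, ["save0", "save1"]):
        regs["save0"] = 0xDEADBEEF          # raw bits, not a lisp object
        print("during:", gc_roots(tcr))
    print("after: ", gc_roots(tcr))
```

The ordering is the point: each register is zeroed before its classification changes in either direction, so a preemptive GC never sees raw bits in a register it believes is boxed.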
> This is assuming the temp stack pointer could be put in memory
> somewhere, perhaps in the tcr? Also, this doesn't seem like very many
> immediate registers, but according to the register counts for PowerPC,
> maybe this isn't so bad?
It's certainly true that the compiler rarely uses more than one or
two imm regs at a time (I think that (SETF (SBIT ...)) is either the
worst case or very close to it.) I guess that the question becomes
"if you -need- more than 2 imm regs, how bad is it not to have them ?",
and this probably comes up in a few LAP functions and subprimitives.
My intuition is to want at least 3 and possibly 4, but there may be
ways of avoiding the hard cases.
> I'm not sure what to use for NFN or FNAME either, hmm.
NFN's basically an extra argument on the PPC (we sort of call the
code vector and pass NFN as an argument.) Splitting things up that
way has some nice properties (code-vectors are position-independent
and can be kept in readonly memory, FUNCTIONs are very orthogonal
and easy for the GC to deal with.) On the ARM, it -may- be better
to keep the code and constants in the same object, so the whole
NFN/FN thing may disappear. (The "current function" is "where the
PC is", sort of.)
FNAME is part of the canonical calling sequence only so that
if you call the thing that goes in the function cell of non-fbound
symbols it can say what symbol had no function definition.
> Also, I'm curious why OpenMCL uses registers for the last arguments
> instead of the first, is there a sneaky reason why this is so?
It was a little more dramatic on the 68K than it is on the PPC, and
it only really makes a difference when calling functions that take
arguments that have to be passed on the stack. CL requires that
arguments be evaluated from left to right (or at least that this is
true with respect to side-effects.) Suppose that we have a call to
the function FOO which takes 5 arguments, in this case the results
of calling FN0, FN1, FN2, FN3, and FN4 (each of which takes 0 args
but which are assumed to have side-effects, so we must call them
in that order.)
If we passed the first 3 arguments in registers (arg_a, arg_b, arg_c)
and remaining arguments on the stack, we'd get code like this:
  (vpush canonical-result-reg)  ; it doesn't matter here what
                                ; register is "canonical-result-reg"
  (load-word arg_a 4 vsp)
  (load-word arg_b 3 vsp)
  (load-word arg_c 2 vsp)
At this point, we can make the call to FOO; there are two outgoing
arguments on the top of the stack and 3 words (used to evaluate the
first 3 outgoing args) underneath them. If we did things this way,
the caller would have to discard those 3 words either before the
call (moving the outgoing arguments down) or after. (You can certainly
avoid this worst-case scenario by using other temporary stack locations
to hold the first 3 args or by using non-volatile registers instead
of stack temporaries, but neither of these strategies is absolutely
In the "last N args in registers" case, the naive approach is:
  (vpush arg_z)  ; it's handy for the return value to go
                 ; in arg_z
That's probably only slightly better (VPOP is 2 instructions on the
PPC, so we're only saving a VPUSH and a load to get to this point),
but we also don't have anything under the outgoing arguments that'd
have to be cleaned up eventually. That seems to add up to a slight
win; on the 68K, VPOP was a smaller/faster instruction than the LOADS
would have been, so the difference was more pronounced.
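The difference between the two conventions can be checked with a small Python model of the value stack. This is a sketch under assumptions: it models the naive code generation described above (push each result as it is produced), the function names are hypothetical, and the "last N" sequence is a reconstruction from the description (the last result stays in arg_z and the two before it are popped), not a transcript of compiler output.

```python
def first_args_in_registers(results, nregs=3):
    """First nregs args go in registers, the rest stay on the stack."""
    stack = []
    for value in results:
        stack.append(value)          # (vpush canonical-result-reg)
    # (load-word arg_a 4 vsp) etc.: copy the oldest words into registers.
    registers = stack[:nregs]
    outgoing = stack[nregs:]         # the args FOO will find on the stack
    stale_words = len(registers)     # copies left underneath the outgoing args
    return registers, outgoing, stale_words


def last_args_in_registers(results, nregs=3):
    """Last nregs args go in registers, earlier ones stay on the stack."""
    stack = []
    for value in results[:-1]:
        stack.append(value)          # (vpush arg_z) after each call but the last
    registers = [results[-1]]        # the final result is already in arg_z
    for _ in range(min(nregs - 1, len(stack))):
        registers.insert(0, stack.pop())   # vpop into arg_y, then arg_x
    return registers, stack, 0       # nothing stale is left on the stack


if __name__ == "__main__":
    calls = ["fn0", "fn1", "fn2", "fn3", "fn4"]
    print(first_args_in_registers(calls))
    print(last_args_in_registers(calls))
```

With five arguments, the first model leaves three dead words under the two outgoing stack arguments, while the second leaves exactly the two outgoing arguments on the stack and nothing else.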