[Openmcl-devel] Unicode in OpenMCL
gb at clozure.com
Fri Jun 25 23:24:03 EDT 2004
On Fri, 25 Jun 2004, Steve Jenson wrote:
> On Jun 24, 2004, at 12:31 PM, Duncan Rose wrote:
> > If this is on the cards, why not move straight to 4-octet ISO 10646
> > (UCS-4)?
> > At least this is a "natural" size (for want of a better term), being 32
> > bits already...
> I agree that UCS-4 is probably best. It's important to note that
> openmcl can currently print and store UTF-8 without harming it. I
> tested it with some vicious Plane 1 characters:
Just to pick a nit (and to address Duncan's question):
UCS-4 differs from UTF-32 in that the former defines some "private use"
codepoint ranges that aren't within the 21-bit Unicode range. According to:
there is or was a proposal on the table to amend the standard that defines
UCS-4 to eliminate these ranges. If/when that passes (if it hasn't already
done so), UCS-4 and UTF-32 will be nearly synonymous.
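
To make the range difference concrete, here is a minimal sketch in Python (not OpenMCL code). It assumes the pre-amendment UCS-4 code space is the full 31-bit range and that Unicode/UTF-32 stops at #x10FFFF. The sample value 0x60000000 is an assumption, picked as a code point that should fall in one of the UCS-4 private-use groups; the exact boundaries of those groups are not checked here, and surrogate exclusion is ignored for simplicity.

```python
# Sketch only: compares the two code spaces discussed above.
UNICODE_MAX = 0x10FFFF    # highest Unicode code point (fits in 21 bits)
UCS4_MAX = 0x7FFFFFFF     # assumed limit of pre-amendment UCS-4 (31 bits)


def in_utf32_range(code_point):
    """True if code_point lies within the 21-bit Unicode range."""
    return 0 <= code_point <= UNICODE_MAX


def in_ucs4_range(code_point):
    """True if code_point lies within the (assumed) 31-bit UCS-4 range."""
    return 0 <= code_point <= UCS4_MAX


# Hypothetical sample values, chosen for illustration only.
plane1_sample = 0x10438          # a Plane 1 code point
ucs4_only_sample = 0x60000000    # assumed to be in a UCS-4 private-use group

for cp in (plane1_sample, ucs4_only_sample):
    print(hex(cp), in_utf32_range(cp), in_ucs4_range(cp))
```

Every UTF-32 value is also a UCS-4 value under these assumptions; the reverse fails only for values above #x10FFFF, which is the gap the proposed amendment would close.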
It's desirable - from an implementation point of view - that
CHAR-CODE-LIMIT be a fixnum (and therefore that all valid CHAR-CODEs
be fixnums), and it's desirable that a CHARACTER be an immediate
object (if there are M bits of CHAR-CODE information and N bits of tag
information that identify the object as being a CHARACTER, then (+ M
N) needs to be <= 32 on a 32-bit implementation.) It'd be awkward to
make N less than 4 in OpenMCL, and there are some reasons for preferring
N=8, which is basically what the current implementation uses.
M <= 24 is (conveniently) a few more bits than we need to represent
any 21-bit Unicode code point, but even 28 bits wouldn't be enough
to represent some code point that's legal (if "private") in pre-amended
UCS-4.
[Hey, I said that this was a minor nit!]
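
The bit budget above can be checked with a little arithmetic. This is a sketch in Python rather than Lisp, and it only models the constraint M + N <= 32. The names WORD_BITS, max_code_bits and fits_as_immediate are made up for illustration, and 0x60000000 is an assumed example of a pre-amendment UCS-4 private-use value.

```python
# Sketch only: models "M code bits + N tag bits must fit in one 32-bit word".
WORD_BITS = 32


def max_code_bits(tag_bits):
    """M, the number of bits left for the character code, given N tag bits."""
    return WORD_BITS - tag_bits


def fits_as_immediate(code_point, tag_bits):
    """True if code_point can be stored alongside tag_bits in one word."""
    return code_point.bit_length() <= max_code_bits(tag_bits)


UNICODE_MAX = 0x10FFFF          # needs 21 bits
UCS4_PRIVATE = 0x60000000       # assumed private-use value; needs 31 bits

for tag_bits in (8, 4):
    print("N =", tag_bits, "M =", max_code_bits(tag_bits),
          "Unicode fits:", fits_as_immediate(UNICODE_MAX, tag_bits),
          "UCS-4 private fits:", fits_as_immediate(UCS4_PRIVATE, tag_bits))
```

With N=8 there are 24 bits for the code, three more than the 21 that Unicode needs. Even with N=4 and 28 code bits, a 31-bit UCS-4 value does not fit.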
> (defvar *utf8* "ð¸ð¹ð¶ð°ð¹: ð
> (format t "~A" *utf8*)
> $ openmcl -l testing-utf8.lisp
> ; loading system definition from
> ; /Users/stevej/.openmcl/asdf-install/asdf-install.asd into #<Package
> ; registering #<SYSTEM ASDF-INSTALL #x63BD006> as ASDF-INSTALL
> ð¸ð¹ð¶ð°ð¹: ð
> Welcome to OpenMCL Version (Beta: Darwin) 0.14.2-p1!
> If those characters look like blocks to you, then you should get the
> Code2001 font: http://home.att.net/~jameskass/code2001.htm
Well, the good news is that it's not stripping off the high bit or
anything like that, but of course the actual interpretation of that
UTF-8 encoded string is happening elsewhere (perhaps in your terminal
program or in Emacs.) It'd certainly be nice to have this work for
the right reason ...
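
A small sketch of what is probably going on, in Python rather than Lisp, and assuming an 8-bit-clean string type that stores one character per octet (modelled here with Latin-1). The Plane 1 code point is an arbitrary example, not one of the characters from the test file.

```python
# Sketch only: UTF-8 octets surviving a trip through an 8-bit-clean string.
original = "\U00010438"                      # arbitrary Plane 1 code point
utf8_octets = original.encode("utf-8")       # four octets for a Plane 1 char

# An 8-bit string type sees four separate "characters", not one.
as_8bit_string = utf8_octets.decode("latin-1")
print(len(original), len(as_8bit_string))

# Writing those octets back out unchanged lets a UTF-8-aware terminal
# (or Emacs) reassemble the original character.
round_tripped = as_8bit_string.encode("latin-1").decode("utf-8")
print(round_tripped == original)

# If the high bit were stripped instead, the data would be destroyed.
stripped = bytes(octet & 0x7F for octet in utf8_octets)
print(stripped == utf8_octets)
```

The round trip succeeds only because no octet is altered along the way. The string itself never holds a single Plane 1 character, which is why the output looks right without the implementation understanding UTF-8.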