[Openmcl-devel] Unicode Composition
gb at clozure.com
Tue Dec 18 17:41:27 CST 2012
The function that Matt mentioned - CCL::PRECOMPOSE-SIMPLE-STRING -
exists because OSX filenames are stored in some canonical decomposed
form; people generally seem to expect to see and deal with namestrings
in composed form (even if they know that Unicode says that the two forms
are equivalent). Different revisions of the Unicode standard may have
different notions about what characters can be combined or decomposed,
and I believe that OSX pathnames use rules from a
particular (now possibly quite old) version of the standard. (Better
that than having to explain why filenames change whenever OSX decides
to use newer rules.)
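
For anyone who wants to see the composed/decomposed distinction concretely, here is a small sketch. It uses Python's standard unicodedata module rather than anything in CCL, and it applies whatever Unicode version that Python build ships with, which is an assumption and not necessarily the rules OSX uses for filenames.

```python
import unicodedata

# "é" as a single code point, and as "e" followed by COMBINING ACUTE ACCENT.
composed = "\u00e9"
decomposed = "e\u0301"

# The two strings render alike but are different code point sequences.
print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == composed)   # True

# A Vietnamese letter carrying two diacritics: one code point when
# composed, a base letter plus two combining marks when decomposed.
e_circumflex_dot_below = "\u1ec7"
print(len(unicodedata.normalize("NFD", e_circumflex_dot_below)))  # 3
```

A filename that comes back from the filesystem in decomposed form therefore looks right on screen but compares unequal to the composed string a user typed, which is the mismatch described above.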
Given that background, I think that it'd be useful for anyone thinking
of fleshing out support for Unicode normalization in CCL to look at that
function and recognize that this stuff isn't rocket science: the code's
pretty simple and the process is entirely data-driven, and the complexity
that's there mostly has to do with accessing a sparse representation of
that data. The data in question came from (me looking at) the OSX
filesystem sources, and unicode.org provides the data (in some neutral text
format) for each revision of the standard.
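
To illustrate what "entirely data-driven" can mean here, the following is a rough sketch in Python. It is not the code in CCL::PRECOMPOSE-SIMPLE-STRING. The names build_pair_table and precompose are made up for the example, a plain dictionary stands in for the sparse table, and the data comes from Python's unicodedata module rather than from the OSX sources. It is also not full NFC: it ignores composition exclusions, combining-class reordering and blocking, and algorithmic Hangul composition.

```python
import sys
import unicodedata

def build_pair_table():
    """Map (first, second) character pairs to their precomposed character.

    Hypothetical helper: the table is derived from the canonical
    decomposition data that Python's unicodedata module exposes.
    """
    table = {}
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Skip characters with no decomposition and compatibility
        # decompositions, which are tagged like "<compat> ...".
        if not decomp or decomp.startswith("<"):
            continue
        parts = decomp.split()
        if len(parts) == 2:
            first, second = (chr(int(p, 16)) for p in parts)
            table[(first, second)] = chr(cp)
    return table

PAIRS = build_pair_table()

def precompose(s):
    """Greedily fold each character into the previous output character
    whenever the pair has an entry in the table."""
    out = []
    for ch in s:
        if out and (out[-1], ch) in PAIRS:
            out[-1] = PAIRS[(out[-1], ch)]
        else:
            out.append(ch)
    return "".join(out)

print(precompose("e\u0301") == "\u00e9")         # True
print(precompose("e\u0323\u0302") == "\u1ec7")   # True
```

The loop itself is a few lines; essentially all of the behavior lives in the table, so pointing the same loop at data from a different revision of the standard changes the results without changing the code.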
Getting that data (and especially getting it into a usable format)
requires some thought and effort; putting more thought and effort into
it than CCL::PRECOMPOSE-SIMPLE-STRING does might be worthwhile in some
cases, but that's ... not rocket science either.
If you're thinking about using ICU or some third-party lisp library to
implement the missing bits ... I guess that all I can ask is that people
try to show some consideration for others before saying that out loud.
(Laptops aren't cheap, coffee and laptops are natural enemies and coffee
always wins, and let's just say that some people like to drink coffee
while reading the day's email.)
On Tue, 18 Dec 2012, R. Matthew Emerson wrote:
> On Dec 18, 2012, at 9:33 AM, Martial B <martialhb at gmail.com> wrote:
>> I would like to know if there is a plan to build Unicode NFC support into Clozure CL. Is it useful for you? I am dealing with French and Vietnamese characters, and not being able to get a char/codepoint (but instead an array of bytes for each char-unit and its diacritic elements) annoys me a bit (I am new to ccl, switching from sbcl). I will try to normalize strings with an external call to something like icu4c, but I'd be glad to know if there's a move in this direction in the future.
> There's an internal function ccl::precompose-simple-string that might help you. It might be worth creating a ticket requesting exported functions to produce the various Unicode normalization forms.