[Openmcl-devel] Plans for Unicode support within OpenMCL?
bsder at allcaps.org
Sat Apr 1 16:49:44 EST 2006
Takehiko Abe wrote:
> If having multiple string types is not desirable, I think UTF-16
> string is a good compromise. The 4-fold increase of string size
> is too much.
I disagree, personally. UTF-32 doesn't really bother me. However, it
doesn't completely help either: indexing on fully compliant Unicode
strings is always O(n) worst case. That said, I have seen some pretty
smart speed optimizations in languages with boxed types (i.e. store the
fact that a string is actually ASCII-only, BMP-only, or non-combining
and change the internal encoding accordingly; this means that you need
to do at least one O(n) scan over a string to set up the optimization).
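As a sketch of that kind of optimization (the function name and return
values here are hypothetical, not from any particular implementation), a
single O(n) scan can record which representation a string needs:

```python
def classify(s):
    """One O(n) scan records the cheapest representation a string needs.

    Returns "ascii", "bmp", or "full". An "ascii" or "bmp" string can be
    stored in 1 or 2 bytes per code point and then indexed in O(1).
    """
    kind = "ascii"
    for ch in s:
        cp = ord(ch)
        if cp > 0xFFFF:
            return "full"      # needs 4 bytes per code point
        if cp > 0x7F:
            kind = "bmp"       # needs at least 2 bytes per code point
    return kind

print(classify("hello"))           # ascii
print(classify("caf\u00e9"))       # bmp
print(classify("a\U000E0000"))     # full
```

Once the classification is cached on the string object, later indexing
can take the fast path without rescanning.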
The big advantage that everybody forgets about with UTF-8 is that all
the nice low-level, null-termination-expecting routines like strcpy(),
strncpy(), etc. work just fine with UTF-8. That's not true for any
other encoding. UTF-8 also has no endianness confusion.
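The null-termination point is easy to check directly: a UTF-8 encoding
never contains a 0x00 byte unless the string itself contains NUL,
whereas UTF-16 and UTF-32 routinely do. A quick Python illustration:

```python
s = "a\u0308\u03b1"   # "a" + combining umlaut + Greek alpha

utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-be")
utf32 = s.encode("utf-32-be")

# UTF-8 introduces no embedded NUL bytes, so C's strcpy()/strlen()
# treat the encoded string correctly.
assert b"\x00" not in utf8

# UTF-16 and UTF-32 pad small code points with zero bytes, so a
# null-terminated C string routine would stop at the first one.
assert b"\x00" in utf16
assert b"\x00" in utf32

print(len(utf8), len(utf16), len(utf32))
```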
UTF-8 might be the easiest method to make OpenMCL Unicode compliant as
there is no need to upgrade the entire system all at once. Declare the
system UTF-8 compliant, touch some of the basic text input/output
functions to handle encoding, create a Unicode reader macro, and *bingo*
instant Unicode compliance.
This does, of course, gloss over things like performance issues with
non-ASCII characters, indexing, combining characters, etc. However,
things run at roughly the same speed as before and you get 80%+ of the
Unicode compliance you need. Then you can start chopping off the rough
edges as people hit them rather than having to try to get them all at once.
> Unicode has combining characters and covers lots of scripts/writing
> systems. Handling them is inherently hard and having characters
> with unicode direct codepoints does not make it easier much, imo.
Yup. There is really no way around the fact that some Unicode string
operations are O(n). Once Unicode decided that combining characters
were a good idea, that die was cast.
Unicode strings are *not* vectors/arrays in spite of the fact that they
"almost" are. We all need to just suck it up and get over it.
Quoting the standard about combining characters:
Q: How should characters (particularly composite characters) be counted,
for the purposes of length, substrings, positions in a string, etc.?
A: In general, there are 3 different ways to count characters. Each is
illustrated with the following sample string.
"a" + umlaut + greek_alpha + \uE0000.
(the latter is a private use character)
1. Code Units: e.g. how many bytes are in the physical representation of
the string. Example:
In UTF-8, the sample has 9 bytes. [61 CC 88 CE B1 F3 A0 80 80]
In UTF-16BE, it has 10 bytes. [00 61 03 08 03 B1 DB 40 DC 00]
In UTF-32BE, it has 16 bytes. [00 00 00 61 00 00 03 08 00 00 03 B1 00 0E
00 00]
2. Codepoints: how many code points are in the string.
The sample has 4 code points. This is equivalent to the UTF-32BE count
divided by 4.
3. Graphemes: what end-users consider as characters.
A default grapheme cluster is specified by the Unicode Standard 4.0, and
is also described in UTR #18 Regular Expressions.
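The three counts for the FAQ's sample string can be reproduced in a few
lines of Python. Note that the grapheme count below is only a rough
approximation that skips combining marks; it is not the full default
grapheme cluster algorithm:

```python
import unicodedata

s = "a\u0308\u03b1\U000E0000"   # the FAQ's sample string

# 1. Code units: bytes in each encoding form.
print(len(s.encode("utf-8")))      # 9
print(len(s.encode("utf-16-be")))  # 10
print(len(s.encode("utf-32-be")))  # 16

# 2. Code points.
print(len(s))                      # 4

# 3. Graphemes (approximation: a combining mark joins the preceding base).
graphemes = sum(1 for ch in s if unicodedata.combining(ch) == 0)
print(graphemes)                   # 3
```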
The choice of which one to use depends on the tradeoffs between
efficiency and comprehension. For example, Java, Windows and ICU use #1
with UTF-16 for all low-level string operations, and then also supply
layers above that provide for #2 and #3 boundaries when circumstances
require them. This approach allows for efficient processing, with
allowance for higher-level usage. However, for a very high level
application, such as word-processing macros, graphemes alone will
probably be sufficient. [MD]