[info-mcl] (char-code ) inconsistencies

Ron Garret ron at flownet.com
Tue Mar 23 13:14:49 CDT 2010


A couple of people have written me privately to ask me to expand on this.

It turns out that RMCL's unicode support is apparently different from CCL's.  CCL has a global variable CCL:*DEFAULT-FILE-CHARACTER-ENCODING* that does the obvious thing.  But RMCL does not have this variable, nor anything equivalent that I could find.  Maybe Terje can shed some light on this.

However, as long as I'm writing I have a couple of other miscellaneous comments:

First, there is no such thing as an ASCII character with code > 127.  ASCII by definition is a seven-bit encoding (which is exactly why it causes so many problems).

Second, the OP wrote:

> Thus the problem surfaces when you try to coerce text (say, in an image file) to its ASCII
> equivalent, but it won't appear if you call (char-code ) from its inverse, (character ). 


This actually makes no sense.  The inverse of CHAR-CODE is not CHARACTER but CODE-CHAR.  The implementation of CHARACTER in RMCL actually accepts integers and acts like CODE-CHAR, but this is a bug in RMCL.  Be that as it may, CHAR-CODE and CODE-CHAR (or RMCL's CHARACTER) are indeed inverses of each other, as can be easily demonstrated:

? (dotimes (i 1000) (if (not (= i (char-code (character i)))) (print i)))
NIL
? 

Furthermore, what does "text in an image file" mean?

Most likely what is going on here is confusion between unicode code points (an abstract mapping of characters to numbers, which is what char-code returns in a unicode-enabled Lisp) and the values of the bytes in the underlying representation according to some encoding.  Nine times out of ten the confusion arises because the two most common encodings -- UTF-8 and Latin-1 -- assign different meanings to bytes in the range 128-255.  So if you take Latin-1 text and interpret it as UTF-8 or vice-versa the result is garbage.  Nine times out of ten if you see funny characters in non-Asian text that is what happened.
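
To make the distinction concrete, here is a small sketch in Python (chosen only because its codec names are easy to check; it is not RMCL or CCL code).  It assumes that the "roman" encoding mentioned later in the thread is Mac OS Roman, which Python calls "mac_roman"; the variable names are made up for the illustration.

```python
# The character from the original report: GREATER-THAN OR EQUAL TO.
ch = "\u2265"

# The code point is a property of the character, independent of any encoding.
code_point = ord(ch)                      # 8805, i.e. hex 2265

# The bytes depend entirely on which encoding is chosen.
utf8_bytes = ch.encode("utf-8")           # three bytes: E2 89 A5
macroman_bytes = ch.encode("mac_roman")   # one byte: B3, i.e. 179 (assumed legacy encoding)

# The same byte value 179 is a different character in a different encoding.
as_macroman = bytes([179]).decode("mac_roman")   # the >= sign again
as_latin1 = bytes([179]).decode("latin-1")       # SUPERSCRIPT THREE, code point 179

# Latin-1 bytes read as UTF-8: byte E9 is not a valid UTF-8 sequence on its own.
latin1_bytes = "caf\u00e9".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    latin1_as_utf8_failed = False
except UnicodeDecodeError:
    latin1_as_utf8_failed = True

# UTF-8 bytes read as Latin-1: no error, but the accented letter becomes two
# unrelated characters (the classic "funny characters" symptom).
garbled = "caf\u00e9".encode("utf-8").decode("latin-1")
```

Under that assumption, 179 and 8805 are not two conflicting answers from char-code: 179 is a byte value in one particular encoding, and 8805 is the code point of the character that byte stands for.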

It is also possible (since image files were mentioned) that the OP is trying to read binary data as text.  That is always a Bad Idea (tm).  In the non-unicode world you could often get away with it.  In the unicode world you usually can't.
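
As a rough illustration of why this goes wrong, the following Python sketch (again not Lisp; the file name and variable names are hypothetical) writes every possible byte value to a file, reads it back as binary, and then tries to read it as UTF-8 text.  It also shows why switching to Latin-1 can appear to work: Latin-1 maps each byte value to the code point with the same number.

```python
import os
import tempfile

# Stand-in for binary file contents: every byte value from 0 to 255.
payload = bytes(range(256))

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.bin")  # hypothetical file name

    with open(path, "wb") as f:
        f.write(payload)

    # Reading in binary mode returns the bytes untouched.
    with open(path, "rb") as f:
        raw = f.read()
    binary_round_trip_ok = (raw == payload)

    # Reading the same file as UTF-8 text fails, because arbitrary
    # bytes are generally not a valid UTF-8 sequence.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_read_failed = False
    except UnicodeDecodeError:
        text_read_failed = True

# Decoding as Latin-1 never fails: byte n becomes the character with code n.
latin1_codes = [ord(c) for c in raw.decode("latin-1")]
```

The Latin-1 behavior is a workaround rather than a fix: the data is still binary, and opening it as a byte stream avoids the question of encodings altogether.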

HTH,
rg



On Mar 22, 2010, at 9:26 PM, Ron Garret wrote:

> More likely all you need to do is change the default encoding to latin-1.
> 
> On Mar 22, 2010, at 8:23 PM, Raymond Lee wrote:
> 
>> Hi Peter and Terje,
>> 
>> Many thanks for your prompt and informative replies.  Because several uncompressed image 
>> file formats (e.g., .psd  and  .bis) use the extended ASCII character-integer mapping, I'll need 
>> to write a workaround for opening/saving such images.  But at least I know what the issue is 
>> now.  Thanks again for your help,
>> 
>> Ray Lee
>> ---- Original message ----
>>> Date: Mon, 22 Mar 2010 17:44:44 +0000
>>> From: peter <p2.edoc at googlemail.com>  
>>> Subject: Re: [info-mcl] (char-code ) inconsistencies  
>>> To: Discussion list for MCL users <info-mcl at clozure.com>
>>> 
>>> At 10:34 AM -0700 10/3/22, Terje Norderhaug wrote:
>>>> On Mar 22, 2010, at 10:06 AM, Raymond Lee wrote:
>>>>> 
>>>>> The (char-code ) function's behavior has 
>>>>> changed from MCL 5.x to RMCL 5.2.1 and now
>>>>> appears quite inconsistent for ASCII 
>>>>> characters > 127.  For example, in most fonts 
>>>>> ASCII
>>>>> character #  179 is mapped to the "greater than 
>>>>> or equal to" character.  But now in RMCL
>>>>> 
>>>>> (char-code (character "≥")) --> 8805
>>>>>         whereas
>>>>> (char-code (character 179)) --> 179
>>>>> 
>>>>> Thus the problem surfaces when you try to 
>>>>> coerce text (say, in an image file) to its ASCII
>>>>> equivalent, but it won't appear if you call 
>>>>> (char-code ) from its inverse, (character ). 
>>>>> Most ASCII
>>>>> characters > 127 exhibit similar problems.  Any 
>>>>> solutions or ideas come to mind?
>>>> 
>>>> This is because MCL 5.2 was upgraded to use 
>>>> Unicode, in which 8805 (2265 hex) is the 
>>>> "greater than or equal to" character.
>>> 
>>> Presumably hence char-code's fine, we're just 
>>> inputting unicode now where we were using roman.
>>> 
>>> Isn't the lisp function >= rather than ≥?
>>> _______________________________________________
>>> info-mcl mailing list
>>> info-mcl at clozure.com
>>> http://clozure.com/mailman/listinfo/info-mcl
> 
