[Openmcl-devel] *default-character-encoding* should be :utf-8
ron at flownet.com
Mon Mar 5 20:36:25 CST 2012
On Mar 5, 2012, at 5:14 PM, Gary Byers wrote:
> On Mon, 5 Mar 2012, Ron Garret wrote:
>> On Mar 4, 2012, at 5:53 PM, Gary Byers wrote:
>>> If your sources are in some legacy encoding - MacRoman is an example
>>> that still comes up from time to time - then you obviously need to
>>> process them with that encoding in effect or you'll lose information.
>> If you're using such legacy sources, your first step should be to
>> convert them to UTF-8 and then never touch the original again.
>> (The same goes for latin-1, except that latin-1 is not a legacy
>> encoding. It's in common use today, which is the main reason this
>> is a real problem.)
> I agree, but the people who have these legacy-encoded sources that really
> should have been converted to utf-8 long ago have all kinds of flimsy excuses
> for not wanting to do so. "It costs time", "it costs money"
Those really are flimsy excuses. Converting character encodings on modern processors can be done at a rate of gigabytes per minute. You could probably convert the entire corpus of all computer source code ever produced by humans for about $100.
> "it requires expertise", "it breaks backward compatibility"
Those are slightly less flimsy excuses. But expertise can be hired or acquired. Backwards compatibility can be a real concern in certain application domains, but I'd be surprised to learn that CCL is being used in any of them.
> ... Sheesh. It's almost as if these people live in the real world or something.
Those don't sound like real-world concerns to me. To the contrary, those sound more like the concerns of people who want to cling to the belief that it's still the 20th century, and OS 9 is still a viable operating system.
> At some point, people with legacy code do need to invest in its viability
> (and in many cases that point was probably "years ago.") It doesn't always
> happen, and this so-called "real world" thing that I keep hearing about seems
> to have something to do with that. Given that situation (and the general lack
> of awareness of encoding issues that sometimes accompanies it), a default
> encoding that loses less information (ISO-8859-1) has more practical value
> than one that loses as much information as UTF-8 can.
Maybe there's something I'm missing here. How does UTF-8 lose information?
> So, let's see. There doesn't seem to be as much of a performance hit
> for repeatedly doing READ-CHAR on utf-8 encoded files (whose contents are
> all STANDARD-CHAR/ASCII) as I'd remembered, so changing the default terminal
> and file encodings (in the trunk) seems like a worthwhile experiment. It may
> be easier to evaluate some of these things with those changes in effect, and
> it's entirely possible that the change is neither a particularly good nor
> a particularly bad idea.
You don't actually have to change the CCL defaults if you think it would upset a significant constituency. This is more of a social issue than a technical one. It's enough to just encourage people to put the following in their CCL-INIT files:
(setf CCL:*DEFAULT-FILE-CHARACTER-ENCODING* :utf-8)
and then have zero-tolerance for any source code that doesn't work as a result. I've had that line in my ccl-init for so long that I don't even know what the default encoding that ccl ships with is any more.
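For sources that have not been converted yet, the encoding can also be named per file, which leaves the global default at UTF-8. A minimal sketch, assuming a hypothetical file "legacy-source.lisp" that is still in ISO-8859-1 (the file name is made up for illustration; the second form belongs at the REPL or in a build script, not in the init file itself):

```lisp
;;; In the ccl-init file: make UTF-8 the default for file streams.
(setf ccl:*default-file-character-encoding* :utf-8)

;;; Elsewhere (REPL or build script): a not-yet-converted file can
;;; still be read correctly by overriding the encoding for that call.
;;; "legacy-source.lisp" is a hypothetical name, not a real file.
(load "legacy-source.lisp" :external-format :iso-8859-1)
```

COMPILE-FILE and WITH-OPEN-FILE take the same :EXTERNAL-FORMAT argument, so the override stays local to the files that actually need it.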