[Openmcl-devel] Extracting unicode from an external source via FFI
gb at clozure.com
Sat Feb 21 21:09:05 EST 2009
On Sun, 22 Feb 2009, John McAleely wrote:
> I'm attempting to get some unicode strings from an external source (a
> MySQL database) into a form I can use within CCL (This would be most
> convenient if the c data 'became' a native lisp string). I am having
> problems with reading them in, and want to ask what the options are
> within the CCL FFI. If anyone's been down this route before, I'd be
> grateful for pointers.
> I'm using:
> Welcome to Clozure Common Lisp Version 1.2-r72:73M-ccl (DarwinX8664)!
Some of this stuff has changed and/or had bugs fixed since then.
> (Note that the revision number reflects storage in my own subversion
> repository. I'm using an unmodified, locally built, version synced
> about a month ago.)
> My investigations to date (I'm also using clsql 4.0.3/uffi 1.6.0)
> suggest that data can make it from a lisp string into the SQL database
> (how I've not looked into yet - but the mysql command line sees the
> data correctly). When strings come back across the connection, they
> arrive garbled. A two character Chinese string in the SQL table
> becomes a six character lisp string.
> Rummaging into CLSQL/UFFI, I think that ultimately this bit of code
> reads strings from the mysql c interface:
> #+openmcl ,@(if length
> `((ccl:%str-from-ptr ,stored-obj ,length))
> `((ccl:%get-cstring ,stored-obj)))
> Having looked at the ccl code, there is a function near %get-cstring
> (defun %get-utf-8-cstring (pointer) ....)
> This seems interesting. I speculate:
> + The mysql_c interface is sending over c-style strings, in a
> character set of its choice.
> + The uffi code chooses to read this with %get-cstring, which chops
> the string into 8 bit bytes, and assumes each one is a character in
> some 256 element character set.
> + The CLSQL code then takes this and passes it back to me as a lisp
> string.
> + I wonder if I could convince mysql to use utf-8 within its c
> strings, and the uffi code to use %get-utf-8-cstring, then I could
> successfully read unicode from the database into lisp strings?
> So, if you've been down a similar path, does my speculation sound
> Does my gentle use of grep and google appear to have tumbled on the
> 'right' CCL functions for this work? Is there a 'better' CCL API for
> reading a foreign string in some unicode character set?
The most general thing (in 1.3-rc1) is exported but not documented:
(ccl:get-encoded-string encoding pointer noctets)
ENCODING is either a CHARACTER-ENCODING object or a keyword that names
such an object.
POINTER is a foreign pointer, presumed to point to a string encoded in
that character encoding.
NOCTETS is the number of octets (8-bit-bytes) in the encoded string,
not including any #\nul octets that may be used as end-of-string markers.
This function returns a lisp (SIMPLE-)string.
There is also
(ccl::get-encoded-cstring encoding pointer)
which determines the number of octets (scans forward from POINTER looking
for a 0-valued 8/16/32-bit element depending on the encoding) and calls
GET-ENCODED-STRING for you.
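
As a rough illustration of the two calls above, here is a minimal sketch.
ROUNDTRIP-UTF-8 is a made-up name, and CCL:WITH-ENCODED-CSTRS is used only
to fabricate a foreign UTF-8 buffer to read from; in real use the pointer
would come from C. It assumes a CCL recent enough to have
GET-ENCODED-STRING (1.3 or later).

```lisp
;; Sketch only: ROUNDTRIP-UTF-8 is a hypothetical helper, not part of CCL.
(defun roundtrip-utf-8 (string)
  "Copy STRING into foreign memory as NUL-terminated UTF-8, then decode
it again, once with an explicit octet count and once by scanning for NUL."
  (ccl:with-encoded-cstrs :utf-8 ((ptr string))
    ;; NOCTETS counts encoded octets (not characters) and excludes the NUL.
    (let ((noctets (length (ccl:encode-string-to-octets
                            string :external-format :utf-8))))
      (values (ccl:get-encoded-string :utf-8 ptr noctets)
              (ccl::get-encoded-cstring :utf-8 ptr)))))
```

Both returned values should be STRING= to the argument. The explicit-count
form is the one to prefer when the C side already reports a length, since
it does not depend on finding a terminator.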
(CCL:%GET-CSTRING ptr) is functionally equivalent to
(CCL::GET-ENCODED-CSTRING :ISO-8859-1 ptr), and
(CCL::%GET-UTF-8-CSTRING ptr) is functionally equivalent to
(CCL::GET-ENCODED-CSTRING :UTF-8 ptr)
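
The difference between the two readers accounts for the six-character
result described in the question. A small sketch, assuming the foreign
data really is UTF-8 (READ-BOTH-WAYS is a hypothetical name, and the
buffer is again fabricated with CCL:WITH-ENCODED-CSTRS):

```lisp
;; Sketch only: READ-BOTH-WAYS is a hypothetical helper, not part of CCL.
(defun read-both-ways (string)
  "Place STRING in foreign memory as UTF-8 and return the length of the
Lisp string obtained by reading it as ISO-8859-1 and as UTF-8."
  (ccl:with-encoded-cstrs :utf-8 ((ptr string))
    (values (length (ccl:%get-cstring ptr))          ; one char per octet
            (length (ccl::%get-utf-8-cstring ptr))))) ; decoded characters

;; Two Chinese characters, three UTF-8 octets each:
;; (read-both-ways (coerce (list (code-char #x4E2D) (code-char #x6587))
;;                         'string))
;; should return 6 and 2.
```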
CCL:%GET-CSTRING and CCL::%GET-UTF-8-CSTRING exist for some combination
of reasons involving -
- legacy issues
- bootstrapping issues
- performance, though I don't know how significant this is.
> Looking at the API docs, telling MySQL to use UTF8 seems
> straightforward, and I'm willing to hack UFFI/CLSQL to make this work.
> Before I start hacking, I thought I'd ask what my options are for
> interfacing into CCL's unicode support.
CCL::%GET-UTF-8-CSTRING needs to exist for bootstrapping reasons
and it's more concise than other things; it really should be exported
and it's almost certainly the right thing to use in situations like
the one that you describe.
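
For the UFFI conversion quoted above, a replacement might look roughly
like the following. This is only a sketch: FOREIGN-UTF-8-TO-LISP is a
hypothetical name, it assumes the MySQL connection has already been told
to send UTF-8, and it assumes that LENGTH (when supplied) is a count of
octets rather than characters.

```lisp
;; Sketch only: FOREIGN-UTF-8-TO-LISP is a hypothetical helper.
(defun foreign-utf-8-to-lisp (pointer &optional length)
  "Decode the UTF-8 data at POINTER into a Lisp string. LENGTH, if given,
is the number of octets to decode; otherwise scan for a terminating NUL."
  (if length
      (ccl:get-encoded-string :utf-8 pointer length)
      (ccl::%get-utf-8-cstring pointer)))
```

The two branches mirror the %STR-FROM-PTR and %GET-CSTRING cases in the
existing code, so the change would stay local to that one form.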