slate unicode support

Brian T. Rice water at tunes.org
Thu Jul 29 10:34:47 PDT 2004


Cool, I'm glad someone is thinking about this. You have the right 
perspective on most of this; I'm just replying with a few notes.

� wrote:

> Hi.
> 
> I've looked for different Unicode implementations as a basis for slate
> Unicode support. I've found four alternatives: the squeak Unicode
> Project; squeak m17n stuff; a Unicode implementation in
> haskell (part of an xml library), which can be found at
> http://www.ninebynine.org/Software/HaskellUtils/HXmlToolbox-4.00/hparser/Unicode.hs;
> and a scheme implementation from http://synthcode.com/scheme/.
> 
> I didn't look much at the squeak Unicode Project stuff, but it seemed
> a bit cumbersome to me. I might be wrong, I'll have to take a new look 
> at it.
> 
> I think the haskell implementation is simplest and cleanest, and the
> best base for slate Unicode support. It contains types for Unicode
> characters and strings and UTF8 characters and strings, and
> conversions between these and latin1. The problem is it thinks Unicode
> = UCS-2, so only BMP (first 65536 characters) is supported, and support
> for other planes will have to be written or adapted from other
> sources.
> 
> Squeak m17n stuff contains support and infrastructure for lots of
> different encodings, which would be good for inclusion later on, but
> too complicated at the moment.
> 
> I suggest using the haskell implementation as a basis, taking hints
> and ideas (and probably code, too) from the others, and dropping the
> xml stuff in it. Squeak m17n stuff seems to be the most complete, and
> best adapted for object-oriented languages, so I'm not yet sure about
> this.
> 
> Then we'll come to selecting the internal representation and external
> format. UTF8 is probably the best external format. So files would be
> stored in UTF8 etc. This has the advantage that if used with ascii
> text nothing has to be changed. Only texts containing characters with
> code > 127 are different. The same parser/lexer could read ascii and
> UTF8 files.
> 
> The selection of internal representation is somewhat more
> complicated. UTF8 has many disadvantages. String access cannot happen
> simply by index, because some characters are multiple bytes and some
> only one. Also mutating strings would mean copying the rest of the
> string from the mutating point. On the other hand, UTF8 is only 8
> bits, and a string could carry a tag saying whether it contains
> characters > 255. The vm wouldn't need many changes for this.
> 
> 32 bits would be enough for all unicode characters, but that's lots of
> load to carry with you everywhere. 16 bits is sufficient for most of
> the time, but because of problems in Unicode standard regarding
> Chinese, Japanese, Korean and some others, more is needed sometimes.
> 
> I think a String should be an array of 16bit characters, with a tag
> saying whether it contains characters outside the BMP, and some additional
> slots for extra information in that case. So most of the strings could
> be treated as simple 16bit arrays, and checks would be made if they
> contain characters in other planes. One way to store this information
> is in a RunArray, which contains the plane for every character in a
> string. This would be the most efficient way I could come up with.
> 
> Characters would be stored as UnicodeCharacters as 32bit. So a String
> would not be an array of Characters. We would also need some RunArrays
> (or mappings or whatever) for information about characters in
> different codepoints, so one could check if U+326A is a letter for
> example.

Sure. To keep things simple, the 16-bit element array could be 
implemented as a ByteArray without too many problems (SmallInteger 
addition and shifting can occur without non-stack consing).
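
A rough sketch of that idea, in Python rather than Slate, assuming
big-endian storage and a hypothetical class name (Wide16Array) that is
not part of any Slate library. Each 16-bit element occupies two bytes of
a byte array, and reads and writes use only shifts, masks and addition.
It deliberately covers only the BMP; the plane tag and RunArray from the
proposal above are left out.

```python
class Wide16Array:
    """Fixed-length array of 16-bit elements stored big-endian in a bytearray.

    Hypothetical illustration only; not Slate's actual representation.
    """

    def __init__(self, length):
        self.data = bytearray(2 * length)

    def __len__(self):
        return len(self.data) // 2

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Combine the two bytes with one shift and one addition.
        return (self.data[2 * index] << 8) + self.data[2 * index + 1]

    def __setitem__(self, index, value):
        if not 0 <= index < len(self):
            raise IndexError(index)
        if not 0 <= value <= 0xFFFF:
            raise ValueError("element does not fit in 16 bits")
        self.data[2 * index] = value >> 8        # high byte
        self.data[2 * index + 1] = value & 0xFF  # low byte


def from_bmp_text(text):
    """Store the code points of `text`; raises ValueError outside the BMP."""
    arr = Wide16Array(len(text))
    for i, ch in enumerate(text):
        arr[i] = ord(ch)
    return arr


units = from_bmp_text("a\u326a")
```

Indexing stays constant-time because every element has the same width,
which is the property the UTF8 representation lacks.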

> Regardless of what is chosen as the internal format, String should be
> lifted from the vm to image, leaving ByteArray management for vm.
> I don't know whether lexer/parser need changes. From a quick glance it
> looks like the lexer doesn't use the String and Character methods in
> library but some of its own.

Re: Lifting String into the image: agreed. The lexer and parser only 
depend on being able to grab characters from a ReadStream, so there 
should not be a problem. The lexer's behavior should stay as it is, since 
Slate's perspective on character use is particular to its syntax, and not 
general.
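
To illustrate why a stream-based lexer is indifferent to the external
encoding, here is a minimal Python sketch. It is not Slate's lexer:
read_tokens and open_utf8 are hypothetical names, and the tokenizer only
splits on whitespace. The point is that the decoding layer hands over
whole characters, so the same code reads ascii and UTF8 input.

```python
import io


def read_tokens(stream):
    """Split a character stream into whitespace-separated tokens,
    pulling one decoded character at a time."""
    tokens, current = [], []
    while True:
        ch = stream.read(1)  # one character, or "" at end of stream
        if ch == "":
            break
        if ch.isspace():
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def open_utf8(raw_bytes):
    """Wrap raw bytes in a text stream that decodes UTF-8."""
    return io.TextIOWrapper(io.BytesIO(raw_bytes), encoding="utf-8")


ascii_tokens = read_tokens(open_utf8(b"foo bar"))
utf8_tokens = read_tokens(open_utf8("x \u326a y".encode("utf-8")))
```

The ascii input goes through the UTF-8 decoder unchanged, since ascii
bytes are valid UTF-8 as they stand.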

> The C compiler obviously needs changes. Maybe the variable/method
> names could be stored in basic ascii for [A-Za-z0-9] and some special
> notation would be used for other characters, for example
> my{some-unicode-character here}method would be stored as myU4324method
> or something similar. The names are never read back, so it doesn't
> matter for anything other than debugging. One could also use the hashes
> of the names or whatever.

This is a reasonable solution, but the feature can wait.
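
For when it does get picked up, a minimal Python sketch of the naming
scheme described above. The function name mangle and the exact notation
(a "U" followed by at least four uppercase hex digits) are assumptions
based on the myU4324method example, not an existing convention in the C
compiler.

```python
def mangle(name):
    """Keep ASCII letters and digits; spell every other character
    as 'U' plus its code point in hex (hypothetical notation)."""
    out = []
    for ch in name:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append("U%04X" % ord(ch))
    return "".join(out)


mangled = mangle("my\u4324method")
```

The mapping is not reversible (a name that already contains the text
"U4324" collides with the mangled form), which is acceptable only
because the names are never read back.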

> Switching to unicode would possibly also break some other things, for
> example the regexp library.

It's broken, anyway. Or really, the lexer hasn't been ported/written.

> Ok, this turned out to be a long post. Sorry :) What do you think?

Go for it! I'll assist however I can.
