slate unicode support

Lyndon Tremblay humasect at shaw.ca
Thu Jul 29 09:29:13 PDT 2004


Hello.

> Hi.
>
> I've looked at different Unicode implementations as a basis for Slate
> Unicode support. I've found four alternatives: the Squeak Unicode
> Project; the Squeak m17n stuff; a Unicode implementation in
> Haskell (part of an XML library), which can be found at
> http://www.ninebynine.org/Software/HaskellUtils/HXmlToolbox-4.00/hparser/Unicode.hs;
> and a Scheme implementation from http://synthcode.com/scheme/.
>

You might also want to take a look at GNUstep's implementation of multibyte
characters, classes, and conversions. It is quite mature at this point.

AFAIK it handles conversions to and from the common encodings, with a
simple enough interface.

> I didn't look much at the Squeak Unicode Project stuff, but it seemed
> a bit cumbersome to me. I might be wrong; I'll have to take a new look
> at it.
>
> I think the Haskell implementation is the simplest and cleanest, and the
> best base for Slate Unicode support. It contains types for Unicode
> characters and strings and UTF8 characters and strings, and
> conversions between these and Latin-1. The problem is that it assumes
> Unicode = UCS-2, so only the BMP (the first 65536 characters) is
> supported, and support for the other planes will have to be written or
> adapted from other sources.

=)
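
Adding the other planes on top of a 16-bit representation mostly means
handling surrogate pairs. A minimal C sketch (my own, not taken from the
Haskell code) of combining a UTF-16 surrogate pair into a full code point:

#include <stdint.h>

/* Lead surrogates are 0xD800..0xDBFF, trail surrogates 0xDC00..0xDFFF.
   Together they address code points U+10000..U+10FFFF. */
uint32_t combine_surrogates(uint16_t lead, uint16_t trail)
{
    return 0x10000
         + ((uint32_t)(lead  - 0xD800) << 10)
         +  (uint32_t)(trail - 0xDC00);
}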

>
> The Squeak m17n stuff contains support and infrastructure for lots of
> different encodings, which would be good for inclusion later on, but is
> too complicated at the moment.
>
> I suggest using the Haskell implementation as a basis, taking hints
> and ideas (and probably code, too) from the others, and dropping the
> XML stuff in it. Then again, the Squeak m17n stuff seems to be the most
> complete, and the best adapted to object-oriented languages, so I'm not
> yet sure about this.
>
> Then we'll come to selecting the internal representation and external
> format. UTF8 is probably the best external format, so files would be
> stored in UTF8, etc. This has the advantage that pure ASCII text needs
> no changes at all; only texts containing characters with codes > 127
> come out differently. The same parser/lexer could read both ASCII and
> UTF8 files.
>
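
In UTF-8, code points below 0x80 are stored as single unchanged bytes, and
everything above becomes a multi-byte sequence, which is exactly why
existing ASCII files need no conversion. A throwaway C sketch (mine, not
from any of the implementations mentioned):

#include <stdio.h>

/* Encode one code point as UTF-8; returns the byte count (1..4). */
static int utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                       /* plain ASCII, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main(void)
{
    unsigned char buf[4];
    int i, n = utf8_encode(0x326A, buf);   /* U+326A takes 3 bytes */
    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]);           /* prints: E3 89 AA */
    printf("\n");
    return 0;
}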
> The selection of the internal representation is somewhat more
> complicated. UTF8 has many disadvantages. Strings cannot be accessed
> simply by index, because some characters take multiple bytes and some
> only one. Mutating a string would also mean copying the rest of the
> string from the mutation point onwards. On the other hand, UTF8 is only
> 8 bits wide, and a string could carry a tag saying whether it contains
> characters > 255. The vm wouldn't need many changes for this.

In UTF-8 it is at least 16 bits (two 8-bit bytes) per non-ASCII character.
How about a run-length-encoded indexing map? Perhaps that would be
fundamentally useful here; see the sketch below.

Though I think that once a string is decoded from UTF-8, it should be a
normal 16-bit array, 16 bits per character regardless of "that character's
charset control code". One may not really want multiple character
encodings per string... and especially not character-encoding knowledge
inside the Character class/framework, but only higher up, at the String
level.
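
Roughly what I mean by the indexing map, in throwaway C (the names and
layout are made up, not anything Slate has today): next to the UTF-8
bytes, keep runs of (bytes per character, character count), so indexing
skips whole runs instead of scanning byte by byte.

#include <stddef.h>

struct Run { unsigned char width; size_t count; };

/* Translate a character index into a byte offset in the UTF-8 data. */
size_t byte_offset(const struct Run *runs, size_t nruns, size_t index)
{
    size_t offset = 0, i;
    for (i = 0; i < nruns; i++) {
        if (index < runs[i].count)
            return offset + index * runs[i].width;
        offset += runs[i].count * runs[i].width;
        index  -= runs[i].count;
    }
    return offset;   /* index past the end: one past the last byte */
}

For ASCII-heavy text the map stays tiny (often a single run), so the
common case costs almost nothing.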

>
> 32 bits would be enough for all Unicode characters, but that's a lot of
> weight to carry around everywhere. 16 bits is sufficient most of the
> time, but because of problems in the Unicode standard regarding
> Chinese, Japanese, Korean and some others, more is needed sometimes.
>

Backward compatibility is also nice =) ISO-2022-JP arriving on a stream
that expects UTF-8, for example; so, conversions from everything to
everything. I'm not sure about NSString's internal representation, but
I'm sure it is abstract enough. I know this code well, so working with it
is something I could easily be convinced of.

> I think a String should be an array of 16bit characters, with a tag
> saying whether it contains characters outside the BMP, and some
> additional slots for extra information in that case. So most strings
> could be treated as simple 16bit arrays, with extra checks made only
> when they contain characters in other planes. One way to store this
> information is in a RunArray holding the plane of every character in
> the string. This is the most efficient way I could come up with.
>
> Characters would be stored as 32bit UnicodeCharacters, so a String
> would not be an array of Characters. We would also need some RunArrays
> (or mappings or whatever) holding information about the characters at
> different codepoints, so one could check whether U+326A is a letter,
> for example.
>

Naw, as it stands currently they should still be 16bit: two 8bit units
representing UTF-8. I am not sure about UTF-16, but UTF-7 would fit here
nicely too.
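
For comparison, here is how I read the 16-bit-plus-plane proposal quoted
above, in throwaway C (all names made up):

#include <stddef.h>
#include <stdint.h>

/* BMP-only strings leave `planes` NULL; only strings that reach the
   other planes carry per-character plane info (a RunArray in the
   proposal; a flat array here for brevity). */
struct UString {
    uint16_t *units;    /* low 16 bits of each code point */
    uint8_t  *planes;   /* NULL when every character is in plane 0 */
    size_t    length;
};

/* Full 32-bit code point at a character index. */
uint32_t char_at(const struct UString *s, size_t i)
{
    uint32_t plane = s->planes ? s->planes[i] : 0;
    return (plane << 16) | s->units[i];
}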


> Regardless of what is chosen as the internal format, String should be
> lifted from the vm into the image, leaving ByteArray management to the
> vm. I don't know whether the lexer/parser need changes. From a quick
> glance it looks like the lexer doesn't use the String and Character
> methods in the library but some of its own.
>
> The C compiler obviously needs changes. Maybe the variable/method
> names could be stored as plain ASCII for [A-Za-z0-9], with some special
> notation for other characters; for example,
> my{some-unicode-character-here}method could be stored as myU4324method
> or something similar. The names are never read back, so it doesn't
> matter for anything other than debugging. One could also use hashes
> of the names or whatever.

In Objective-C, the @"StringContents" syntax produces an instance of the
class NXConstantString (or of something else, selected by a command-line
switch). If modifying a C compiler is desired, a look into GCC's handling
of this could be a simple starting point.

I think a plain old C conversion function applied to each literal found
in the C code would be sufficient, without much extra hassle.
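
The mangling itself is trivial either way; a throwaway C sketch of the
U<codepoint> scheme suggested above (hypothetical names, and taking the
name as an array of code points for simplicity):

#include <stdio.h>

/* Keep [A-Za-z0-9] as-is; expand anything else to U<hex codepoint>. */
void mangle(const unsigned long *name, size_t len, char *out)
{
    size_t i;
    for (i = 0; i < len; i++) {
        unsigned long cp = name[i];
        if ((cp >= 'A' && cp <= 'Z') ||
            (cp >= 'a' && cp <= 'z') ||
            (cp >= '0' && cp <= '9'))
            *out++ = (char)cp;
        else
            out += sprintf(out, "U%lX", cp);
    }
    *out = '\0';
}

int main(void)
{
    unsigned long name[] = { 'm', 'y', 0x4324, 'm', 'e',
                             't', 'h', 'o', 'd' };
    char buf[64];
    mangle(name, sizeof name / sizeof name[0], buf);
    printf("%s\n", buf);   /* prints: myU4324method */
    return 0;
}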

>
> Switching to Unicode would possibly also break some other things, for
> example the regexp library.
>
> Ok, this turned out to be a long post. Sorry :) What do you think?
>
>
> Olli
>

--Lyndon



