Slate Unicode support

Fri Jul 30 04:20:19 PDT 2004

> You also might want to take a look at GNUstep's implementation of multibyte
> characters, classes, and conversions. It is quite mature currently.

Well, GNUstep is GPL, so we can't use their code in Slate. Still worth
looking at for ideas, I guess.

> > I think the haskell implementation is simplest and cleanest, and the
> > best base for slate Unicode support. It contains types for Unicode
> > characters and strings and UTF8 characters and strings, and
> > conversions between these and latin1. The problem is it thinks Unicode
> > = UCS-2, so only BMP (first 65536 characters) is supported, and support
> > for other planes will have to be written or adapted from other
> > sources.
> 
> =)

Having looked at it again, that seems not to be the case: it is not
limited to UCS-2 and handles Unicode properly. It is very simple, which
I think is good at this point. I'll have to take a closer look at it (and
at GNUstep and others, especially Squeak m17n) when I have the time
(which is rare these days).

> > The selection of internal representation is somewhat more
> > complicated. UTF8 has many disadvantages. String access cannot happen
> > simply by index, because some characters are multiple bytes and some
> > only one. Also mutating strings would mean copying the rest of the
> > string from the mutating point. On the other hand, UTF8 is only 8
> > bits, and a string could carry a tag saying whether it contains
> > characters > 255. The VM wouldn't need many changes for this.
> 
> It is 16bits per uchar. How about a run-length encoding indexing map?
> Perhaps that would be fundamentally useful here...
> 
> Though, I think if one string is encoded at utf-8, it is 16bits per char
> regardless of "that character's charset controlcode". It would be a normal
> 16bit array.

The problem is that a character encoded in UTF-8 can take 1..4 bytes
(up to 6 in the original definition), so two 8-bit units can't store
every UTF-8 character. UTF-16 has the same problem as UTF-8: some
characters need more than one code unit. That's why I thought of the
RunArray (= RLE) scheme, but with 16-bit arrays.
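
To make the size problem concrete, here is a small sketch in Python (used
only for illustration, nothing here is Slate code). It reports how many
bytes a few sample code points need in UTF-8 and UTF-16. The sample
characters and the function name are my own choice.

```python
# Encoded sizes of a few sample code points (samples chosen for illustration).
SAMPLES = [0x41, 0xE4, 0x326A, 0x1D11E]

def encoded_sizes(cp):
    """Return (utf8_bytes, utf16_bytes) needed to encode one code point."""
    ch = chr(cp)
    # "utf-16-be" is used so that no byte-order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

for cp in SAMPLES:
    u8, u16 = encoded_sizes(cp)
    print(f"U+{cp:04X}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

U+1D11E lies outside the BMP, so it needs four bytes in both encodings;
neither encoding gives a fixed width per character.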

One could also use 8-bit arrays and store more than one byte in the
RunArray, but I don't know which would be better. Either way, you can't
store every Unicode character in fewer than 21 bits.
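
A rough sketch of how I imagine the RunArray idea, again in Python just to
show the shape. This is one possible reading, not a design decision: the
class name RunString, the 1/2/3-byte widths and the per-run byte buffers
are all assumptions of mine. Each run holds characters of one fixed width,
so indexing is a binary search for the run plus a multiplication.

```python
from bisect import bisect_right

class RunString:
    """Hypothetical run-indexed string: each run stores fixed-width characters."""

    def __init__(self, text):
        self.starts = []   # character index at which each run begins
        self.widths = []   # bytes per character inside the run
        self.chunks = []   # one bytearray per run, big-endian fixed width
        self.length = 0
        for ch in text:
            cp = ord(ch)
            # 3 bytes = 24 bits, enough for the 21 bits a code point needs.
            width = 1 if cp < 0x100 else 2 if cp < 0x10000 else 3
            if not self.widths or self.widths[-1] != width:
                # Width changed: open a new run.
                self.starts.append(self.length)
                self.widths.append(width)
                self.chunks.append(bytearray())
            self.chunks[-1] += cp.to_bytes(width, "big")
            self.length += 1

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        run = bisect_right(self.starts, i) - 1      # run containing index i
        width = self.widths[run]
        off = (i - self.starts[run]) * width        # byte offset inside the run
        return chr(int.from_bytes(self.chunks[run][off:off + width], "big"))
```

A string that is mostly Latin-1 stays at one byte per character, and only
the runs with wider characters pay for them. Replacing a character with a
wider one would mean splitting a run, which this sketch does not attempt.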

> > Characters would be stored as UnicodeCharacters as 32bit. So a String
> > would not be an array of Characters. We would also need some RunArrays
> > (or mappings or whatever) for information about characters in
> > different codepoints, so one could check if U+326A is a letter for
> > example.
> >
> 
> Naw, as it stands currently, they should still be 16bit, two 8bit chars
> representing utf-8. I am not sure about utf-16, but utf-7 will fit here
> nicely too.

As I wrote above, 21 bits are needed, regardless of the encoding used.
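
As for the character-information tables mentioned in the quoted text, a
sorted range table with binary search is one way they might look. This is
a Python sketch only; the three ranges are hand-written ASCII samples and
the names RANGES, STARTS and lookup are made up for the example. A real
table would have to be generated from the Unicode data files.

```python
from bisect import bisect_right

# Hypothetical, hand-written sample ranges: (first, last, property).
# They cover only ASCII digits and letters, purely for illustration.
RANGES = [
    (0x0030, 0x0039, "digit"),
    (0x0041, 0x005A, "letter"),
    (0x0061, 0x007A, "letter"),
]
STARTS = [first for first, _, _ in RANGES]

def lookup(cp, default="unknown"):
    """Return the property of code point cp, or default if no range covers it."""
    i = bisect_right(STARTS, cp) - 1   # last range starting at or before cp
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return default
```

With this sample table lookup(0x326A) only returns the default, because
the table has no entry for that range yet.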

> Backward compatibility is also nice =) iso-2022-jp on a utf-8-expectant
> stream for example; so conversions to all from all.

Sure, eventually. At this point I think we should concentrate on
getting UTF-8 working. Let's keep things simple until we have it
running, and then extend.
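
For a sense of how small the first step is, here is a minimal UTF-8
decoder sketch in Python (not Slate, and not taken from any of the
implementations discussed above). The function name decode_utf8 is
hypothetical. It follows the current 4-byte definition of UTF-8 and
rejects overlong forms, surrogates and truncated input.

```python
def decode_utf8(data):
    """Decode a bytes object into a list of code points; ValueError on bad input."""
    out, i, n = [], 0, len(data)
    while i < n:
        b = data[i]
        # need = number of continuation bytes, low = smallest legal value
        if b < 0x80:
            need, cp, low = 0, b, 0
        elif 0xC2 <= b <= 0xDF:
            need, cp, low = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, low = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4:
            need, cp, low = 3, b & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + need >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        for k in range(1, need + 1):
            c = data[i + k]
            if c & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + k}")
            cp = (cp << 6) | (c & 0x3F)
        # Reject overlong encodings, surrogates and values past U+10FFFF.
        if cp < low or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point at offset {i}")
        out.append(cp)
        i += need + 1
    return out
```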

> > The C compiler obviously needs changes. Maybe the variable/method
> > names could be stored in basic ASCII for [A-Za-z0-9] and some special
> > notation would be used for other characters, for example
> > my{some-unicode-character here}method would be stored as myU4324method
> > or something similar. The names are never read back, so it doesn't
> > matter for anything else than debugging. One could also use the hashes
> > of the names or what ever.
> 
> In Objective-C, the format @"StringContents" is used as the instance of
> class NXConstantString (or something else, through command-line switch). If
> modifying a C compiler is desired, a look into GCC's handling of this could
> be some start toward simplistically achieved completion.
> 
> I think a plain old C conversion function for each literal found in C code,
> would be sufficiently lacking extra hassle.

I'm not quite sure I understand. I meant modifying the Slate->C
compiler, not the C->native compiler. Or what did you mean? Anyway,
this is low priority for now.
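
To show what I meant by the special notation for names in the generated C
code, here is a sketch in Python. The function name mangle and the exact
"U" plus hex format are assumptions; it just follows the myU4324method
example from the quoted text.

```python
def mangle(name):
    """Replace every character outside [A-Za-z0-9] with 'U' + hex code point."""
    out = []
    for ch in name:
        if ch.isascii() and ch.isalnum():
            out.append(ch)                   # plain ASCII letter or digit: keep
        else:
            out.append(f"U{ord(ch):04X}")    # anything else: U + hex digits
    return "".join(out)

print(mangle("my\u4324method"))
```

This is not reversible, and a name that already contains "U4324" as
plain text would collide with a mangled one. Since the names are never
read back, that may only matter for debugging, but a real scheme would
probably want an escape that cannot clash.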

Olli