slate unicode support
Olli Pietiläinen
ollip at freeshell.org
Thu Jul 29 06:21:20 PDT 2004
Hi.
I've looked at different Unicode implementations as a basis for Slate
Unicode support. I've found four alternatives: the Squeak Unicode
Project; the Squeak m17n stuff; a Unicode implementation in
Haskell (part of an XML library), which can be found at
http://www.ninebynine.org/Software/HaskellUtils/HXmlToolbox-4.00/hparser/Unicode.hs;
and a Scheme implementation from http://synthcode.com/scheme/.
I didn't look much at the Squeak Unicode Project stuff, but it seemed
a bit cumbersome to me. I might be wrong; I'll have to take another
look at it.
I think the Haskell implementation is the simplest and cleanest, and
the best base for Slate Unicode support. It contains types for Unicode
characters and strings and for UTF-8 characters and strings, plus
conversions between these and Latin-1. The problem is that it assumes
Unicode = UCS-2, so only the BMP (the first 65536 characters) is
supported, and support for the other planes will have to be written or
adapted from other sources.
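
To make that limit concrete, here is a small Haskell sketch of my own
(not code from that library): a code point above U+FFFF needs a
surrogate pair in UTF-16, so it simply does not fit into one 16-bit
unit.

    import Data.Word (Word16)

    -- Encode a code point as UTF-16 units: anything above the BMP
    -- becomes a surrogate pair, which is exactly what a UCS-2-only
    -- implementation cannot represent in a single 16-bit unit.
    toUtf16 :: Int -> [Word16]
    toUtf16 cp
      | cp <= 0xFFFF = [fromIntegral cp]
      | otherwise    = [ fromIntegral (0xD800 + (c `div` 0x400))
                       , fromIntegral (0xDC00 + (c `mod` 0x400)) ]
      where c = cp - 0x10000
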
The Squeak m17n stuff contains support and infrastructure for lots of
different encodings, which would be good to include later on, but it
is too complicated at the moment.
I suggest using the Haskell implementation as a basis, taking hints
and ideas (and probably code, too) from the others, and dropping the
XML stuff from it. Then again, the Squeak m17n stuff seems to be the
most complete and the best adapted to object-oriented languages, so
I'm not yet sure about this.
Then we come to selecting the internal representation and the
external format. UTF-8 is probably the best external format, so files
would be stored in UTF-8, and so on. This has the advantage that plain
ASCII text doesn't change at all; only texts containing characters
beyond ASCII are encoded differently. The same parser/lexer could read
both ASCII and UTF-8 files.
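
As a rough illustration of why plain ASCII files stay valid (my own
sketch, not anything from the Slate sources): code points below 128
encode to the same single byte, and only higher code points grow into
multi-byte sequences.

    import Data.Bits (shiftR, (.&.), (.|.))
    import Data.Word (Word8)

    -- UTF-8 encoding of a single code point: ASCII stays one identical
    -- byte, everything else grows to a 2-4 byte sequence.
    encodeUtf8 :: Int -> [Word8]
    encodeUtf8 c
      | c < 0x80    = [fromIntegral c]
      | c < 0x800   = [0xC0 .|. top 6,  cont 0]
      | c < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
      where
        top n  = fromIntegral (c `shiftR` n)
        cont n = 0x80 .|. fromIntegral ((c `shiftR` n) .&. 0x3F)
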
Choosing the internal representation is somewhat more complicated.
UTF-8 has many disadvantages. Strings cannot be accessed simply by
index, because some characters take multiple bytes and some only one.
Also, mutating a string would mean copying the rest of the string from
the mutation point onward. On the other hand, UTF-8 units are only 8
bits, and a string could carry a tag saying whether it contains any
multi-byte characters. The VM wouldn't need many changes for this.
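
To illustrate the indexing problem (again just a sketch of the general
issue, not Slate code): finding the n-th character of UTF-8 text means
walking over all the bytes before it, because characters start at
irregular byte offsets.

    import Data.Bits ((.&.))
    import Data.Word (Word8)

    -- Continuation bytes in UTF-8 look like 10xxxxxx.
    isContinuation :: Word8 -> Bool
    isContinuation b = b .&. 0xC0 == 0x80

    -- Byte offset of the n-th character: an O(n) walk rather than an
    -- O(1) index, which is the main cost of UTF-8 as an in-memory
    -- format.
    charOffset :: [Word8] -> Int -> Int
    charOffset bytes n =
      [ i | (i, b) <- zip [0 ..] bytes, not (isContinuation b) ] !! n
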
32 bits would be enough for all Unicode characters, but that's a lot
of load to carry around everywhere. 16 bits is sufficient most of the
time, but because of problems in the Unicode standard regarding
Chinese, Japanese, Korean and some others, more is sometimes needed.
I think a String should be an array of 16-bit characters, with a tag
saying whether it contains characters outside the BMP, and some
additional slots for extra information in that case. Most strings
could then be treated as simple 16-bit arrays, and checks would only
be made if they contain characters in other planes. One way to store
this information is a RunArray that records the plane of every
character in the string. This is the most efficient way I could come
up with.
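
A sketch of what that could look like (the names and details are
mine, not an actual Slate design): 16-bit code units plus an optional
run-length list giving the plane of each character.

    import Data.Word (Word16)

    -- Hypothetical representation: 16-bit code units, plus a
    -- run-length list of (plane, length) pairs only when the string
    -- leaves the BMP.
    data UniString = UniString
      { units  :: [Word16]           -- low 16 bits of each code point
      , planes :: Maybe [(Int, Int)] -- (plane, length) runs; Nothing = BMP
      }

    -- Reconstruct the full 32-bit code point at index i.
    codePointAt :: UniString -> Int -> Int
    codePointAt s i =
      planeAt (planes s) i * 0x10000 + fromIntegral (units s !! i)
      where
        planeAt Nothing _ = 0
        planeAt (Just ((p, len) : rest)) k
          | k < len   = p
          | otherwise = planeAt (Just rest) (k - len)
        planeAt (Just []) _ = 0
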
Characters would be stored as 32-bit UnicodeCharacters, so a String
would not be an array of Characters. We would also need some RunArrays
(or mappings or whatever) holding information about characters at
different code points, so that one could check, for example, whether
U+326A is a letter.
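
For the character-information side, a run-length table over code point
ranges is one way to do it. The sketch below is mine, and the two
sample rows are only illustrative; the real data would come from the
Unicode character database. (U+326A is one of the circled Hangul
signs, which Unicode classifies as a symbol rather than a letter, as
far as I can tell.)

    -- Hypothetical run-length classification table; real ranges would
    -- be generated from the Unicode character database.
    data Category = Letter | Digit | Symbol | Other deriving (Eq, Show)

    categoryTable :: [((Int, Int), Category)]
    categoryTable =
      [ ((0x0041, 0x005A), Letter)   -- A..Z
      , ((0x3260, 0x327F), Symbol)   -- circled Hangul block, covers U+326A
      ]

    categoryOf :: Int -> Category
    categoryOf cp =
      case [ c | ((lo, hi), c) <- categoryTable, lo <= cp, cp <= hi ] of
        (c : _) -> c
        []      -> Other

    isLetter :: Int -> Bool
    isLetter cp = categoryOf cp == Letter
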
Regardless of what is chosen as the internal format, String should be
lifted from the VM into the image, leaving ByteArray management to the
VM.
I don't know whether the lexer/parser needs changes. From a quick
glance it looks like the lexer doesn't use the String and Character
methods in the library but some of its own.
The C compiler obviously needs changes. Maybe variable/method names
could be stored in plain ASCII for [A-Za-z0-9], and some special
notation would be used for other characters; for example,
my{some-unicode-character-here}method would be stored as myU4324method
or something similar. The names are never read back, so this matters
only for debugging. One could also use hashes of the names, or
whatever.
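
Here is a sketch of what such a mangling could look like; the 'U' plus
hex code point format is just my reading of the notation above, not an
existing convention.

    import Data.Char (isAscii, isAlphaNum, ord)
    import Numeric (showHex)

    -- Keep plain ASCII letters and digits; replace anything else with
    -- 'U' followed by the hex code point, so the stored name is plain
    -- ASCII.
    mangle :: String -> String
    mangle = concatMap esc
      where
        esc ch
          | isAscii ch && isAlphaNum ch = [ch]
          | otherwise                   = 'U' : showHex (ord ch) ""
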
Switching to Unicode would possibly also break some other things, for
example the regexp library.
Ok, this turned out to be a long post. Sorry :) What do you think?
Olli