slate unicode support

Fri Jul 30 04:36:10 PDT 2004

On Thu, 29 Jul 2004, Brian T. Rice wrote:

>> I think a String should be an array of 16-bit characters, with a tag
>> indicating whether it contains characters outside the BMP, and some
>> additional slots for extra information in that case. So most strings
>> could be treated as simple 16-bit arrays, with a check made for
>> characters in other planes. One way to store this information
>> is in a RunArray, which contains the plane for every character in a
>> string. This is the most efficient way I could come up with.
>> 
>> Characters would be stored as 32-bit UnicodeCharacters. So a String
>> would not be an array of Characters. We would also need some RunArrays
>> (or mappings or whatever) for information about the characters at
>> different code points, so one could check whether U+326A is a letter,
>> for example.
>
> Sure. To keep things simple, the 16-bit element array could be implemented as 
> a ByteArray without too many problems (SmallInteger addition and shifting can 
> occur without non-stack consing).

Yes, definitely.
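
To make the layout concrete, here is a minimal sketch in Python rather than Slate. Everything in it (the PlaneString class, its slot names, the big-endian byte order) is hypothetical and only illustrates the idea: the low 16 bits of each character live in a byte array, and a run-length list records the plane of each character. For simplicity the sketch always keeps the run list, whereas the proposal would allocate it only when the non-BMP tag is set.

```python
from dataclasses import dataclass, field


@dataclass
class PlaneString:
    """Hypothetical string layout: 16-bit units plus run-length plane info."""
    units: bytearray = field(default_factory=bytearray)  # 2 bytes per character
    plane_runs: list = field(default_factory=list)       # [plane, run_length] pairs
    has_non_bmp: bool = False                            # the "tag" from the proposal

    @classmethod
    def from_code_points(cls, code_points):
        s = cls()
        for cp in code_points:
            s.append(cp)
        return s

    def append(self, cp):
        plane, low = cp >> 16, cp & 0xFFFF
        # Store the low 16 bits big-endian, using only shifts and masks.
        self.units.append(low >> 8)
        self.units.append(low & 0xFF)
        if plane != 0:
            self.has_non_bmp = True
        # Extend the current run if the plane is unchanged, else start a new one.
        if self.plane_runs and self.plane_runs[-1][0] == plane:
            self.plane_runs[-1][1] += 1
        else:
            self.plane_runs.append([plane, 1])

    def __len__(self):
        return len(self.units) // 2

    def code_point_at(self, index):
        low = (self.units[2 * index] << 8) | self.units[2 * index + 1]
        if not self.has_non_bmp:
            return low  # fast path: plain 16-bit array access
        # Slow path: walk the runs to find the plane of this character.
        for plane, length in self.plane_runs:
            if index < length:
                return (plane << 16) | low
            index -= length
        raise IndexError(index)
```

A string that stays within the BMP never leaves the fast path, so the cost of the run lookup is paid only by strings that actually contain characters from other planes.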

>> Regardless of what is chosen as the internal format, String should be
>> lifted from the VM into the image, leaving ByteArray management to the
>> VM. I don't know whether the lexer/parser need changes. From a quick
>> glance it looks like the lexer doesn't use the String and Character
>> methods in the library but some of its own.
>
> Re: Lifting String into the image: agreed. The lexer and parser only depend 
> on being able to grab characters from a ReadStream, so there should not be a 
> problem. The lexer's behavior should stay as it is, since Slate's perspective 
> on character use is particular to the syntax, and not general.

What do you mean by the last sentence? That one shouldn't use any letters
other than [A-Za-z0-9] in slot/method/whatever names? How would using, for
example, Chinese or Tamil letters break the syntax?

>> The C compiler obviously needs changes. Maybe the variable/method
>> names could be stored in basic ASCII for [A-Za-z0-9] and some special
>> notation would be used for other characters, for example
>> my{some-unicode-character here}method would be stored as myU4324method
>> or something similar. The names are never read back, so it doesn't
>> matter for anything other than debugging. One could also use the hashes
>> of the names or whatever.
>
> This is a reasonable solution, though the feature can wait.
>

That's what I was thinking too: this feature should be implemented last.
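
For the record, the escaping I had in mind could look roughly like the sketch below (Python, not Slate; mangle_name is a made-up helper, not anything in the current compiler). ASCII letters and digits pass through unchanged and every other character becomes U followed by its code point in hex. The mapping is not reversible, since a name that already contains the text "U4324" would collide with an escaped one, but as the names are never read back that should only matter for debugging.

```python
def mangle_name(name):
    """Hypothetical escaping of a Slate name into an ASCII-only C identifier part."""
    parts = []
    for ch in name:
        if ch.isascii() and ch.isalnum():
            parts.append(ch)                  # [A-Za-z0-9] is kept as-is
        else:
            parts.append("U%04X" % ord(ch))   # anything else becomes U + hex code point
    return "".join(parts)
```

With this, a name written as "my" + U+4324 + "method" comes out as myU4324method, matching the example quoted above.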

Olli