New i18n code

Brian Rice water at tunes.org
Thu Feb 10 15:11:23 PST 2005


Excellent! I'll incorporate this into CVS shortly. I'll also add a 
README with your notes so that this usage information and guide is easy 
to track.

On Feb 10, 2005, at 2:52 PM, Olli Pietiläinen wrote:

> Hi.
>
> My recent work on the i18n code can be found at
> http://ollip.freeshell.org/slate-i18n.tar.bz2. I don't want to flood
> everyone's mailboxes by attaching it here.
>
> Much has changed from the previous version, the biggest difference
> being the character data handling. Character data is now parsed from
> UnicodeData.txt and stored in a "two-phase-table" (a kind of trie),
> that is, a table of tables. The first table is indexed by the bits
> of the code point above the lowest 7, and the resulting table is
> indexed by the lowest 7 bits. Empty blocks all point to a shared
> empty block, and blocks with duplicate data point to shared blocks
> too. Every duplicate item in the blocks (the properties for single
> characters) is shared as well. This reduces the image size from over
> 8 MB with a flat table to only 4 MB. I think this can still be
> tweaked to take even less memory, with no impact on access speed.
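
A rough Python sketch of the lookup scheme described above, for
illustration only. It is not the Slate code: the function names, the
dict input and the toy data are assumptions, and only the 7-bit split
and the sharing of identical blocks come from the description.

```python
BLOCK_BITS = 7                    # the lowest 7 bits index inside a block
BLOCK_SIZE = 1 << BLOCK_BITS      # 128 entries per block
MAX_CODE_POINT = 0x10FFFF

def build_two_stage_table(properties, default=None):
    """Build (index, blocks) from a dict {code_point: property}.

    Identical blocks, including the all-default "empty" block, are
    stored once and referenced from several first-stage slots.
    """
    blocks = []   # second stage: the unique blocks
    seen = {}     # block contents -> position in `blocks`
    index = []    # first stage: high bits -> position in `blocks`
    for high in range((MAX_CODE_POINT >> BLOCK_BITS) + 1):
        base = high << BLOCK_BITS
        block = tuple(properties.get(base + low, default)
                      for low in range(BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks

def lookup(index, blocks, code_point):
    """Two array accesses, whatever the code point is."""
    block = blocks[index[code_point >> BLOCK_BITS]]
    return block[code_point & (BLOCK_SIZE - 1)]

# Toy data (hypothetical): three code points with a general category.
index, blocks = build_two_stage_table({0x41: "Lu", 0x61: "Ll", 0x4E00: "Lo"})
print(lookup(index, blocks, 0x41), len(blocks))   # Lu 3
```

With the toy data only three blocks are stored: the one holding U+0041
and U+0061, the one holding U+4E00, and a single empty block shared by
every other first-stage slot.
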
>
> This should be much faster than the old code-based hack. At least
> access time is constant, and maintenance is much easier. The drawback
> is a larger image, but I think the growth is still modest. If a
> smaller image is wanted, only the needed parts of the data can be
> used. That's easy with the two-phase-table: just make the unneeded
> blocks point to the shared empty block.
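
The pruning idea could look roughly like this in Python. This is a
sketch under assumptions: the first-stage table is taken to be a plain
list of block numbers, and keep_only and its parameters are
hypothetical names, not part of the i18n code.

```python
BLOCK_BITS = 7   # same 7-bit split as in the description above

def keep_only(index, empty_block, wanted_ranges):
    """Return a copy of the first-stage table in which every block that
    lies entirely outside the wanted (first, last) code point ranges
    points at the shared empty block instead."""
    pruned = []
    for high, block_id in enumerate(index):
        first = high << BLOCK_BITS
        last = first + (1 << BLOCK_BITS) - 1
        # A block is needed if it overlaps at least one wanted range.
        needed = any(lo <= last and first <= hi for lo, hi in wanted_ranges)
        pruned.append(block_id if needed else empty_block)
    return pruned

# Hypothetical four-slot index where block 1 is the shared empty block;
# keep only the first 128 code points.
print(keep_only([0, 1, 2, 1], 1, [(0x00, 0x7F)]))   # [0, 1, 1, 1]
```

The saving comes afterwards: blocks that no first-stage slot refers to
any more can be left out of the image.
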
>
> New additions are normalization to all four normalization forms and
> UTF-16 encoding/decoding (including UTF-16BE and UTF-16LE).
> Normalization makes strings that look the same to the user also look
> the same to the system, and is required by many operations like
> sorting. There are also lots of small fixes and minor enhancements.
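
To illustrate what normalization and the UTF-16 variants do, here is a
short example using Python's standard library rather than the Slate
code; the sample strings are chosen for illustration.

```python
import unicodedata

composed = "\u00e9"       # "é" as a single code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# The two strings render identically but compare unequal as stored.
assert composed != decomposed

# After normalization to any one of the four forms they compare equal.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert (unicodedata.normalize(form, composed)
            == unicodedata.normalize(form, decomposed))

# UTF-16: "A" plus U+1D11E, which needs a surrogate pair (D834 DD1E).
text = "A\U0001D11E"
big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")
assert big_endian == bytes([0x00, 0x41, 0xD8, 0x34, 0xDD, 0x1E])
assert little_endian == bytes([0x41, 0x00, 0x34, 0xD8, 0x1E, 0xDD])

# Plain "utf-16" writes a byte order mark and reads it back on decoding.
assert text.encode("utf-16").decode("utf-16") == text
```
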
>
> utils.slate includes stuff that I think should be elsewhere. Take a
> look and put each piece where you think it belongs, or leave it
> there. splitPreservingEmptys: should probably be merged into
> splitWith: in sequence.slate. splitWith: has the keyword
> &includeEmpty:, which currently doesn't do anything. I don't know
> where Int16(Read|Write)Stream LittleEndian/BigEndian might belong, or
> if they should be named differently.
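
One possible reading of a split that honours an include-empty option,
sketched in Python. The message does not spell out the intended
semantics of splitWith: or &includeEmpty:, so the behaviour below and
the names split_with and include_empty are assumptions.

```python
def split_with(seq, is_separator, include_empty=False):
    """Split seq at every item for which is_separator(item) is true.

    Empty pieces (between adjacent separators, or at either end) are
    dropped unless include_empty is set.
    """
    pieces, current = [], []
    for item in seq:
        if is_separator(item):
            if current or include_empty:
                pieces.append(current)
            current = []
        else:
            current.append(item)
    if current or include_empty:
        pieces.append(current)
    return pieces

def is_comma(ch):
    return ch == ","

print(split_with("a,,b,", is_comma))
# [['a'], ['b']]
print(split_with("a,,b,", is_comma, include_empty=True))
# [['a'], [], ['b'], []]
```
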
>
> layout-builder.slate is a bit of a hack, but it's not supposed to be
> used by users. It's used to generate the cross-link data for the
> tables, and that should change only when the implementation changes
> or the Unicode Consortium releases a new version of the standard. The
> cross-links are stored in Links1.data and Links2.data, which are read
> by the table building routines.
>
> Also, there's no mappings.slate anymore; its functionality is handled
> by properties.slate.
>
> Usage is simple: load 'src/i18n/init.slate' and run
> buildUnicodeTable. Most of the things that can be done with the old
> strings can be done with UnicodeStrings too, although I haven't
> checked most of the old functionality after my changes.
>
> Olli

--
Brian T. Rice
LOGOS Research and Development
http://tunes.org/~water/



