New i18n code

Olli Pietiläinen ollip at freeshell.org
Thu Feb 10 14:52:17 PST 2005


Hi.

My recent work on the i18n code can be found at
http://ollip.freeshell.org/slate-i18n.tar.bz2. I don't want to flood
everyone's mailboxes by attaching it here.

Much has changed since the previous version; the biggest difference is how
character data is handled. Character data is now parsed from
UnicodeData.txt and stored in a "two-phase-table" (a kind of trie), that
is, a table of tables. The first table is indexed by the bits of the code
point above the lowest 7, and the block it yields is indexed by the lowest
7 bits. Empty blocks all point to a single shared empty block, and blocks
with duplicate data point to shared blocks too. Every duplicate item within
the blocks (the properties of a single character) is shared as well. This
reduces the image size from over 8 MB with a flat table to only 4 MB. I
think this can still be tweaked to take even less memory, with no impact on
access speed.
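
To make the layout concrete, here is a minimal sketch of such a table in
Python (not Slate). It is an illustration under assumptions, not the code in
the tarball: the function names, the dict input, and the use of tuples for
blocks are all made up for the example, and it does not show the sharing of
individual property items.

```python
BLOCK_BITS = 7
BLOCK_SIZE = 1 << BLOCK_BITS      # 128 entries per second-level block
BLOCK_MASK = BLOCK_SIZE - 1
MAX_CODE_POINT = 0x10FFFF


def build_two_stage_table(properties, default=None):
    """Build the first-level table; its entries reference shared blocks.

    `properties` maps code point -> property value. This dict is a
    hypothetical stand-in for data parsed from UnicodeData.txt.
    """
    shared_blocks = {}   # block contents -> the single stored copy
    first_stage = []
    for start in range(0, MAX_CODE_POINT + 1, BLOCK_SIZE):
        block = tuple(properties.get(cp, default)
                      for cp in range(start, start + BLOCK_SIZE))
        # Identical blocks (including all-empty ones) are stored only once.
        first_stage.append(shared_blocks.setdefault(block, block))
    return first_stage


def lookup(first_stage, code_point):
    """Two index operations, so access time does not depend on the data."""
    block = first_stage[code_point >> BLOCK_BITS]
    return block[code_point & BLOCK_MASK]
```

With only a few code points defined, nearly all of the 8704 first-level
entries end up referencing the same empty block, which is where the memory
saving comes from.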

This should be much faster than the old code-based hack. At least access
time is constant, and maintenance is much easier. The drawback is a larger
image, but I think the growth is still modest. If a smaller image is
wanted, only the needed parts of the data can be kept. That's easy with
the two-phase-table: just make the unneeded blocks point to a shared
empty block.
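
Dropping unneeded data could look roughly like the following Python sketch.
The names (prune_blocks, EMPTY_BLOCK, keep_ranges) are hypothetical and the
first-level table is assumed to be a plain list of 128-entry blocks; the
Slate code may organize this differently.

```python
BLOCK_BITS = 7
BLOCK_SIZE = 1 << BLOCK_BITS

EMPTY_BLOCK = (None,) * BLOCK_SIZE   # the single shared "no data" block


def prune_blocks(first_stage, keep_ranges):
    """Return a copy of first_stage that keeps only the wanted blocks.

    keep_ranges is an iterable of inclusive (first, last) code point
    pairs. Every block that overlaps none of them is replaced by a
    reference to EMPTY_BLOCK, so its data can be dropped from the image.
    """
    keep = set()
    for first, last in keep_ranges:
        keep.update(range(first >> BLOCK_BITS, (last >> BLOCK_BITS) + 1))
    return [block if index in keep else EMPTY_BLOCK
            for index, block in enumerate(first_stage)]
```

Because whole blocks are kept or dropped, the granularity is 128 code
points: keeping a single character keeps the rest of its block too.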

New additions are normalization to all four normalization forms and UTF-16
encoding/decoding (including UTF-16BE and UTF-16LE). Normalization makes
strings that look the same to the user also look the same to the system, and
it is required by many operations such as sorting. There are also lots of
small fixes and minor enhancements.
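
For anyone unfamiliar with these features, the behaviour can be illustrated
with Python's standard library. This only shows what normalization and the
UTF-16 variants do; it says nothing about the Slate API, and the helper
names are made up for the example.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalized_forms(text):
    """Return the text in each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def utf16_round_trip(text, codec):
    """Encode text with the given UTF-16 codec and decode it again."""
    data = text.encode(codec)
    return data, data.decode(codec)


composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings render identically but are different code point sequences.
print(composed == decomposed)
# After normalizing both to the same form they compare equal.
print(normalized_forms(composed)["NFD"] == normalized_forms(decomposed)["NFD"])

# U+1D11E lies outside the BMP, so UTF-16 encodes it as a surrogate pair.
print(utf16_round_trip("\U0001d11e", "utf-16-be")[0].hex())
```

The "utf-16-be" and "utf-16-le" codecs fix the byte order, while plain
"utf-16" writes a byte order mark first and reads it back when decoding.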

utils.slate includes stuff that I think should be elsewhere. Take a look at
it and move things to where you think they belong, or leave them there.
splitPreservingEmptys: should probably be merged into splitWith: in
sequence.slate. splitWith: has the keyword &includeEmpty:, which currently
doesn't do anything. I don't know where Int16(Read|Write)Stream
LittleEndian/BigEndian might belong, or whether they should be named
differently.
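
As a rough idea of what a merged split could do, here is a Python sketch in
which one function covers both behaviours through a flag. It is only a guess
at the intended semantics; split_with and include_empty are hypothetical
stand-ins for splitWith: and &includeEmpty:, not a description of what
sequence.slate actually does.

```python
def split_with(sequence, is_separator, include_empty=False):
    """Split sequence at every item for which is_separator returns true.

    With include_empty=False, empty pieces (from adjacent, leading, or
    trailing separators) are dropped. With include_empty=True they are
    kept, which is the assumed behaviour of splitPreservingEmptys:.
    """
    pieces = []
    current = []
    for item in sequence:
        if is_separator(item):
            if current or include_empty:
                pieces.append(current)
            current = []
        else:
            current.append(item)
    # The last piece has no separator after it, so flush it here.
    if current or include_empty:
        pieces.append(current)
    return pieces
```

Keeping empty pieces matters for data such as the semicolon-separated fields
of UnicodeData.txt, where an empty field still occupies a position.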

layout-builder.slate is a bit of a hack, but it's not supposed to be used by
users. It's used to generate the cross-link data for the tables, which
should change only when the implementation changes or the Unicode Consortium
releases a new version of the standard. The cross-links are stored in
Links1.data and Links2.data, which are read by the table building routines.

Also, there's no mappings.slate anymore; its functionality is handled by
properties.slate.

Usage is simple: load 'src/i18n/init.slate' and run buildUnicodeTable. Most
of the things that can be done with the old strings can be done with
UnicodeStrings too, although I haven't checked most of the old functionality
after my changes.


Olli



