New i18n code

Lendvai Attila Attila.Lendvai at netvisor.hu
Fri Feb 11 02:42:15 PST 2005


Yeah, great, thanks for this! :)

I'll take a look and make the windoze console code use UTF-16...

- 101

:: Excellent! I'll incorporate this into CVS shortly. I'll also add a
:: README with your notes so that this usage information and guide is
:: easy to track.
:: 
:: On Feb 10, 2005, at 2:52 PM, Olli Pietiläinen wrote:
:: 
:: > Hi.
:: >
:: > My recent work on the i18n code can be found at
:: > http://ollip.freeshell.org/slate-i18n.tar.bz2. I don't want to flood
:: > everyone's mailboxes by attaching it here.
:: >
:: > Much has changed from the previous version, the biggest difference
:: > being the character data handling. Character data is now parsed from
:: > UnicodeData.txt and stored in a "two-phase-table" (a kind of trie),
:: > that is, a table of tables. The first table is indexed by the bits of
:: > the code point above the lowest 7, and the resulting table is indexed
:: > by the lowest 7 bits. Empty blocks all point to a shared empty block,
:: > and all blocks with duplicate data point to shared blocks too. Every
:: > duplicate item in the blocks (the properties for single characters)
:: > is shared as well. This reduces the image size from over 8 MB with a
:: > flat table to only 4 MB. I think this can still be tweaked to take
:: > even less memory, with no impact on access speed.
:: >
:: > This should be much faster than the old code-based hack. At least
:: > access time is constant, and maintenance is much easier. The drawback
:: > is a larger image, but I think that's still not much. If a smaller
:: > image is wanted, only the needed parts of the data can be used.
:: > That's easy with the two-phase-table: just make the unneeded blocks
:: > point to a shared empty block.
:: >
:: > New additions are normalization to all four normalization forms and
:: > UTF-16 (including UTF-16BE and UTF-16LE) encoding/decoding.
:: > Normalization makes strings that look the same to the user also look
:: > the same to the system, and is required by many operations like
:: > sorting. There are also lots of small fixes and minor enhancements.
:: >
:: > utils.slate includes stuff that I think should be elsewhere. Take a
:: > look and put each piece where you think it belongs, or leave it
:: > there. splitPreservingEmptys: should probably be incorporated with
:: > splitWith: in sequence.slate. splitWith: has the keyword
:: > &includeEmpty: which currently doesn't do anything. I don't know
:: > where Int16(Read|Write)Stream LittleEndian/BigEndian might belong, or
:: > if it should be named differently.
:: >
:: > layout-builder.slate is a bit of a hack, but it's not supposed to be
:: > used by users. It's used to generate the cross-link data for the
:: > tables, and that should change only when the implementation is
:: > changed or the Unicode Consortium releases a new version of the
:: > standard. The cross-links are stored in Links1.data and Links2.data,
:: > which are read by the table building routines.
:: >
:: > Also, there's no mappings.slate anymore; its functionality is handled
:: > by properties.slate.
:: >
:: > Usage is simple: load 'src/i18n/init.slate' and run
:: > buildUnicodeTable. Most of the things that can be done with the old
:: > strings can be done with UnicodeStrings too, although I haven't
:: > checked most of the old functionality after my changes.
:: >
:: > Olli
:: 
:: --
:: Brian T. Rice
:: LOGOS Research and Development
:: http://tunes.org/~water/
:: 
:: 
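For anyone who wants to picture the two-phase-table described in the quoted message, here is a minimal sketch in Python rather than Slate. It is not the code from the tarball: the names (build_two_stage_table, lookup, BLOCK_BITS) are made up for illustration, and the input is assumed to be a plain mapping from code point to some property value. It only shows the idea of splitting the code point at bit 7 and sharing identical blocks, including one shared empty block.

```python
BLOCK_BITS = 7                    # the lowest 7 bits index into a block
BLOCK_SIZE = 1 << BLOCK_BITS      # 128 entries per block
BLOCK_MASK = BLOCK_SIZE - 1
MAX_CODE_POINT = 0x10FFFF


def build_two_stage_table(properties, default=None):
    """Build (index, blocks) from a {code_point: value} mapping.

    Identical blocks, including the all-default "empty" block, are
    stored once and referenced from the first-stage index.
    """
    empty = (default,) * BLOCK_SIZE
    blocks = [empty]              # block 0 is the shared empty block
    seen = {empty: 0}             # block contents -> position in blocks

    # Group the entries by block number so sparse input stays cheap.
    populated = {}
    for cp, value in properties.items():
        populated.setdefault(cp >> BLOCK_BITS, {})[cp & BLOCK_MASK] = value

    index = []
    for high in range((MAX_CODE_POINT >> BLOCK_BITS) + 1):
        entries = populated.get(high)
        if entries is None:
            index.append(0)       # nothing here: point at the empty block
            continue
        block = tuple(entries.get(low, default) for low in range(BLOCK_SIZE))
        if block not in seen:     # first time this content appears
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def lookup(index, blocks, cp):
    """Constant-time lookup: two array accesses, no searching."""
    return blocks[index[cp >> BLOCK_BITS]][cp & BLOCK_MASK]
```

Dropping unneeded ranges, as suggested in the message, would amount to setting the matching entries of index back to 0 in this sketch.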

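The point about normalization making look-alike strings compare equal can be shown with a short example. This uses Python's standard unicodedata module, not the Slate i18n code, so it says nothing about how the Slate implementation works internally; the helper name same_after_normalizing is made up for illustration.

```python
import unicodedata

# The four normalization forms defined by the Unicode standard.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def same_after_normalizing(a, b, form="NFC"):
    """Compare two strings after bringing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u00e9"       # e with acute accent as a single code point
decomposed = "e\u0301"    # plain e followed by a combining acute accent

# Both render identically, but the raw code point sequences differ,
# so a naive comparison (or sort) treats them as different strings.
raw_equal = composed == decomposed
normalized_equal = all(
    same_after_normalizing(composed, decomposed, form) for form in FORMS
)
```

Here raw_equal ends up False while normalized_equal is True, which is why sorting and searching usually normalize first.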

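As a rough illustration of what UTF-16BE/UTF-16LE encoding involves, here is a small hand-written encoder in Python. It is a sketch only: the function name utf16_encode and its parameters are invented for this example and are not taken from the Slate sources, and it does not reject lone surrogates in the input.

```python
import struct


def utf16_encode(text, byteorder="big", bom=False):
    """Encode text as UTF-16 in the given byte order.

    Code points above U+FFFF are written as a surrogate pair.
    """
    if byteorder not in ("big", "little"):
        raise ValueError("byteorder must be 'big' or 'little'")
    fmt = ">H" if byteorder == "big" else "<H"   # one 16-bit code unit
    out = bytearray()
    if bom:
        out += struct.pack(fmt, 0xFEFF)          # byte order mark
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000                        # leaves a 20-bit value
            out += struct.pack(fmt, 0xD800 | (cp >> 10))    # high surrogate
            out += struct.pack(fmt, 0xDC00 | (cp & 0x3FF))  # low surrogate
        else:
            out += struct.pack(fmt, cp)
    return bytes(out)
```

The result can be compared against Python's built-in codecs, e.g. utf16_encode(s, "little") against s.encode("utf-16-le"), as a sanity check.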

