Unicode

Mike Wilson cmwilson@uncc.edu
Fri, 9 May 1997 16:40:10 -0400


(I have an interest in this area, but neither experience nor a large
knowledge base, so grain-of-salt time.)

How about giving text objects encoding-system and language properties?

Text manipulation functions should be able to look at the encoding-system
property and do the right thing without needing a `standard' encoding
system and without effort by the high-level programmer.
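
A rough sketch of the idea in Python, nothing more. Every name here
(`Text`, `char_length`, `concat`) is made up for illustration, and the
fallback to UTF-8 for mixed encodings is just one assumption among many
possible policies. The point is only that the functions consult the
encoding-system property instead of assuming a standard encoding.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Text:
    """Hypothetical text object: raw bytes plus the two proposed properties."""
    data: bytes
    encoding: str                   # e.g. "ascii", "utf-16-be", "euc-jp"
    language: Optional[str] = None  # e.g. "en", "ja", or "multilingual"


def char_length(t: Text) -> int:
    """Count characters by decoding with the object's own encoding."""
    return len(t.data.decode(t.encoding))


def concat(a: Text, b: Text) -> Text:
    """Join two texts, converting only when the encodings differ."""
    if a.encoding == b.encoding:
        # Same encoding: join the bytes directly (assumes a stateless,
        # BOM-free encoding such as ASCII or EUC-JP).
        data, enc = a.data + b.data, a.encoding
    else:
        # Assumption: mixed input falls back to UTF-8 as a common encoding.
        enc = "utf-8"
        data = (a.data.decode(a.encoding) + b.data.decode(b.encoding)).encode(enc)
    lang = a.language if a.language == b.language else "multilingual"
    return Text(data, enc, lang)
```

The high-level programmer only ever calls `char_length` or `concat`; the
ASCII-plus-ASCII case never leaves ASCII.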

If you receive mail in English from the outside in ASCII, why convert
it to Unicode, store it at twice its size, then convert your English
response back to ASCII to send it out?
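
To put a number on the size point, here is a small check in Python. It
assumes a 16-bit Unicode encoding, with Python's `utf-16-be` codec
standing in for it; the sample message is invented.

```python
message = "Thanks for the report, I will look into it."

as_ascii = message.encode("ascii")       # one byte per character
as_utf16 = message.encode("utf-16-be")   # 16-bit code units, no byte-order mark

# For pure ASCII text the 16-bit form is exactly twice as large.
print(len(as_ascii), len(as_utf16))

# The round trip back to ASCII recovers the original bytes unchanged.
round_trip = as_utf16.decode("utf-16-be").encode("ascii")
```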

What about linguists who deal with languages not covered by Unicode,
e.g. ancient Japanese, ancient Egyptian, or Klingon?
They'll need to use non-standard Unicode, which counts to me
as a unique encoding system.

Pure Unicode doesn't include language information; that is left to a
higher-level standard of language tags, or not specified at all.
Perhaps as a performance issue, if Unicode text is known to be
monolingual, it can be handled without looking for language tags.
Tagged Unicode can be handled as a separate encoding system.

The language property can be used to specify the language of monolingual
text or set to `multilingual' or some such for texts of that nature.
In addition, things such as `file' names can be sorted in the correct
order for their own language when displayed, without assuming everything
to be in the language of the user's locale.
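
As an illustration of sorting by the text's own language rather than the
user's locale, here is a toy sketch in Python. The two key functions are
deliberately simplified stand-ins for real collation rules, and all the
names (`german_key`, `swedish_key`, `COLLATION_KEYS`, `sort_names`) are
hypothetical.

```python
def german_key(s: str) -> str:
    """Toy German dictionary order: umlauts sort with their base letters."""
    table = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
    return s.lower().translate(table)


def swedish_key(s: str) -> list:
    """Toy Swedish order: å, ä and ö are separate letters after z."""
    alphabet = "abcdefghijklmnopqrstuvwxyzåäö"
    # Characters outside the toy alphabet sort after everything else.
    return [alphabet.index(c) if c in alphabet else len(alphabet)
            for c in s.lower()]


# Maps a language property value to the key function for that language.
COLLATION_KEYS = {"de": german_key, "sv": swedish_key}


def sort_names(names, language):
    """Sort by the language property; unknown languages fall back to
    lowercase code-point order."""
    return sorted(names, key=COLLATION_KEYS.get(language, str.lower))


names = ["zoo", "äpple", "bil"]
print(sort_names(names, "de"))  # ['äpple', 'bil', 'zoo']
print(sort_names(names, "sv"))  # ['bil', 'zoo', 'äpple']
```

The same three names come out in a different order depending on the
language property, which is the whole point.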

Also, knowing the language can give hints to a word processor
as to which Input Method to use.  If a document is known to be in
Japanese, then it can default to a Japanese input method.
No need to hope the text is tagged and, if it is, to go through
the whole thing counting the languages used.  Just knowing whether
the text is monolingual or multilingual probably buys a lot,
performance-wise.
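
A minimal sketch of that shortcut in Python, under stated assumptions:
the `Document` layout (a language property plus a list of optionally
tagged runs), the input method names, and the fallback to plain keyboard
input are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Document:
    """Hypothetical document: a language property plus tagged text runs."""
    language: str  # "ja", "en", ... or "multilingual"
    runs: List[Tuple[Optional[str], str]] = field(default_factory=list)


# Made-up input method names, for illustration only.
INPUT_METHODS = {"ja": "kana-kanji", "en": "direct"}


def languages_used(doc: Document) -> Set[str]:
    if doc.language != "multilingual":
        return {doc.language}  # fast path: the text is never scanned
    # Slow path: walk every run and collect its language tag, if any.
    return {tag for tag, _ in doc.runs if tag is not None}


def default_input_method(doc: Document) -> str:
    langs = languages_used(doc)
    if len(langs) == 1:
        return INPUT_METHODS.get(next(iter(langs)), "direct")
    return "direct"  # assumption: fall back to plain keyboard input
```

For a document marked `ja`, `default_input_method` answers without
touching the text; only a `multilingual` document pays for the scan.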

Anyway, this approach has always seemed the most flexible given the
diverse language encoding arena we all live in.  (And of course,
in many people's eyes, Unicode has not `won'.  Word has it that
the Japanese Ministry of International Trade and Industry has
commissioned research on yet another encoding system to address what
the participating countries consider problems with Unicode.)

Mike

Mike Wilson     cmwilson@uncc.edu     Debian Linux!
dare no koto mo uranjainai yo  tada otonatachi ni homerareru youna
baka niwa naritakunai          soshite naifu wo motte tatte ta
--The Blue Hearts [shounen no shi]