Unicode (Was: mikemac's proposal)

Gilbert Baumann gilbert@ma2s2.mathematik.uni-karlsruhe.de
10 May 1997 04:24:24 +0200


"Peter.VanEynde" <s950045@uia.ua.ac.be> writes:
> 
> On Fri, 9 May 1997, Pierpaolo Bernardi wrote:
> 
> > That's great.  I actually was suggesting to use Unicode _if_ we were
> > going to write something from scratch, and to use whatever there was
> > available if using cmucl.  Of course, Unicode in cmucl would be the best.
> 
> With Unicode support in the kernel and libc6 the time should be right to
> leave ascii.
> 
> > If you don't already know it, the following paper may be useful:
> ...
> I'll look at them, thanks!
> 
> > How do you plan to make readtables in Unicode work?  or are you
> > considering unicode only for characters and strings, and not for
> > source code?
> Huh?
> 

Hey, I want to code using Unicode. I want to have the Greek glyphs
available, so that you do not have to type out lambda but can merely
insert the lambda symbol. Furthermore, umlauts should not be forbidden
in symbols. And I see absolutely no problem with Unicode readtables:
simply compress them, since nearly all characters above 0177 (octal)
will be constituents anyway.
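
One possible reading of "compress", as a rough sketch only: keep an
explicit syntax table for the first 128 codes and treat every code
above that as a constituent. The names *LOW-SYNTAX* and SYNTAX-TYPE
are made up for illustration, the table only approximates standard
syntax, and the code assumes that CHAR-CODE agrees with ASCII for
these characters.

```lisp
;; Hypothetical sketch: a "compressed" syntax table.
;; Only codes below 128 get an explicit entry.
(defparameter *low-syntax*
  (let ((table (make-array 128 :initial-element :constituent)))
    ;; Tab, Linefeed, Page, Return, Space (ASCII codes assumed).
    (dolist (code '(9 10 12 13 32))
      (setf (aref table code) :whitespace))
    (dolist (ch '(#\( #\) #\' #\" #\; #\` #\,))
      (setf (aref table (char-code ch)) :terminating-macro))
    (setf (aref table (char-code #\#)) :non-terminating-macro
          (aref table (char-code #\\)) :single-escape
          (aref table (char-code #\|)) :multiple-escape)
    table))

(defun syntax-type (code)
  "Return the syntax type for the character code CODE."
  (if (< code 128)
      (aref *low-syntax* code)
      ;; Everything above 0177 (octal) defaults to constituent.
      :constituent))
```

With this, (syntax-type 955), the code of the Greek small lambda,
falls through to the default and costs no table space.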

BTW, could somebody tell me why the readtable is in the global
variable *READTABLE* and not associated with the stream?  That would
be much more useful. The same goes for all the other variables that
control the printing and reading of Lisp data. This has always
bothered me.
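
A per-stream association can be approximated in portable code, since
*READTABLE* is a special variable and can be rebound around each
READ. The following is only a sketch; *STREAM-READTABLES*,
STREAM-READTABLE and READ-WITH-STREAM-READTABLE are invented names,
not part of any standard or implementation.

```lisp
;; Hypothetical sketch: remember a readtable per stream and rebind
;; *READTABLE* around READ. Note that the table keeps its streams
;; alive until the entries are removed.
(defvar *stream-readtables* (make-hash-table :test #'eq))

(defun stream-readtable (stream)
  "Readtable registered for STREAM, or the current *READTABLE*."
  (gethash stream *stream-readtables* *readtable*))

(defun (setf stream-readtable) (readtable stream)
  (setf (gethash stream *stream-readtables*) readtable))

(defun read-with-stream-readtable (stream &optional (eof-error-p t) eof-value)
  "Like READ, but uses the readtable registered for STREAM."
  (let ((*readtable* (stream-readtable stream)))
    (read stream eof-error-p eof-value)))
```

For example, registering a copy of the standard readtable whose
READTABLE-CASE is :PRESERVE makes symbols read from that one stream
keep their case, while every other stream is unaffected.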

> As the lisp compiler is written in lisp I can't see how it should lack
> support. The right approach is to redefine what we mean with character,
> right? (I should read up in the HyperSpec on this) The only problem I
> still have is if all characters should be Unicode or if we should support
> a "compressed" character that's ascii only. (and 7 bits only to avoid
> problems)

I want Unicode strings to behave as naturally as 7-bit strings do.
Unicode strings should go through all the string functions as
smoothly as ASCII strings do in current implementations. Also,
symbol names should be Unicode strings. And furthermore, CHAR-UPCASE
and CHAR-DOWNCASE should work perfectly well with Unicode
characters. Another issue is STRING-UPCASE and STRING-DOWNCASE. There
are characters (for instance the German sz ligature) which simply do
not exist as upper-case characters. (sz is substituted by two #\S's
when the word is written in all upper-case letters.) I opt for real
16-bit Unicode characters. Nowadays space is not so scarce that we
cannot support them.
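
The sz case shows why upcasing a string cannot always be done
character by character: the result may be longer than the input. A
minimal sketch of a length-changing upcase follows. FULL-STRING-UPCASE
is an invented name, and the code assumes that character code 223 is
the sz ligature (as in ISO Latin-1 and Unicode); in an implementation
without that character, CODE-CHAR returns NIL and the special case is
simply never taken.

```lisp
;; Hypothetical sketch: upcase a string, expanding the sz ligature
;; to "SS". All other characters go through CHAR-UPCASE.
(defun full-string-upcase (string)
  (let ((sharp-s (code-char 223))) ; assumption: 223 is the sz ligature
    (with-output-to-string (out)
      (loop for ch across string
            do (if (and sharp-s (char= ch sharp-s))
                   (write-string "SS" out)
                   (write-char (char-upcase ch) out))))))
```

Whether STRING-UPCASE itself should behave like this is a separate
question, since existing programs may rely on it preserving the
length of the string.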

Passing these strings down some stream is another issue: you will
have to specify the format of the stream. It may be coded as UTF-8,
7-bit ASCII, ISO-LATIN-n, etc. There is already the :EXTERNAL-FORMAT
keyword to the OPEN function in the ANSI standard.
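
To make the stream-format point concrete, here is a sketch of what a
UTF-8 external format would have to do for each 16-bit character
code on output. UTF-8-OCTETS is an invented name, and the sketch
assumes plain 16-bit codes with no surrogate handling. Which values
besides :DEFAULT an implementation accepts for :EXTERNAL-FORMAT is
implementation-defined.

```lisp
;; Hypothetical sketch: encode one 16-bit character code as a list
;; of UTF-8 octets (one to three octets).
(defun utf-8-octets (code)
  (check-type code (integer 0 #xFFFF))
  (cond ((< code #x80)
         ;; 7-bit ASCII passes through unchanged.
         (list code))
        ((< code #x800)
         (list (logior #xC0 (ash code -6))
               (logior #x80 (logand code #x3F))))
        (t
         (list (logior #xE0 (ash code -12))
               (logior #x80 (logand (ash code -6) #x3F))
               (logior #x80 (logand code #x3F))))))
```

ASCII text therefore keeps its size, while the Greek small lambda
(code 955) becomes the two octets 206 and 187.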

Thus, I do not see any real problem. Just extend BASE-CHAR to hold
16 bits. I checked with my version of the draft ANSI standard (the
dpANS.info file) and it explicitly states that BASE-CHAR could
contain any number of characters >=96, and in particular 2**16
characters. It should not break any Common Lisp program.

> <flame>
> As a multi-lingual European I'm often amazed how much software _still_
> thinks that everybody is an American. The amount of work I have to do to
> get X and xemacs to accept dead-keys is frustrating. And even then it
> doesn't work 100%. (like ispell) :-(
> </flame>

And I get really angry when some program simply uses the 8th bit to
store additional info, or strips it. Some time ago I wanted
characters from the upper half of ISO-Latin-1 in some Emacs info
file. But they just don't get through makeinfo; judging by the
output, the 8th bit was probably just stripped.

Gilbert.