Fwd: Re: Slate i18n bootstrap/nagging :) (fwd)

Tue Sep 27 15:44:31 PDT 2005

Olli seems to have taken several months' break from computers and
unsubscribed himself. He sent me the message and I'm forwarding it here:

Forwarded From: Olli Pietiläinen <ollip at freeshell.org>

> ---------- Forwarded message ----------
> Date: Tue, 27 Sep 2005 23:16:41 +0300 (EEST)
> From: Olli Pietiläinen <ollip at freeshell.org>
> To: Lendvai Attila <Attila.Lendvai at netvisor.hu>
> Cc: slate at tunes.org, Olli Pietiläinen <ollip at freeshell.org>
> Subject: Re: Slate i18n bootstrap/nagging :)
> 
> Hi!
> 
> Disclaimer: I don't know what has changed in slate since I last worked with
> it, so some of this might be outdated.
> 
> Your ideas about the bootstrap look like what I was planning too. The plan
> was that there would be an abstract String, from which UnicodeString,
> ASCIIString etc. would derive. UnicodeString would be the default string
> representation, generated by generateLiteral:. So, also the basis for
> UnicodeString (which is basically just an array) would have to be
> bootstrapped.
> 
> The string shouldn't delegate to ByteArray, it's abstract. UnicodeString
> doesn't even store its contents as a ByteArray, so no ByteArrays are ever
> needed in storing strings in the default case. Can the Sequence delegation
> be added later, after loading sequence?
> 
> Anyway, bootstrapping String means you have to bootstrap also Character and
> Symbol.
> 
> How to store symbols? The easiest way would be to convert the strings to
> UTF-8 before interning them (or generating a literal), which wouldn't
> require any changes to the current internal symbol handling. Before
> converting to UTF-8, the strings should be normalized (to NFC), see
> normalization.slate. Otherwise there might be clashes between strings that
> look the same to the user but have a different internal representation. This
> is simple, no big deal.
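
The normalize-then-encode step Olli describes can be sketched in Python as follows. This is only an illustration of the idea, not Slate code: the names `intern_symbol` and `_symbol_table` are hypothetical, and the real symbol table lives inside the Slate VM.

```python
import unicodedata

# Hypothetical stand-in for the VM's symbol table: maps UTF-8 bytes to the
# single canonical bytes object that represents the interned symbol.
_symbol_table = {}

def intern_symbol(text):
    """Normalize to NFC, encode as UTF-8, then intern the resulting bytes."""
    key = unicodedata.normalize("NFC", text).encode("utf-8")
    # setdefault returns the already-stored object if an equal key exists.
    return _symbol_table.setdefault(key, key)

composed = "\u00e4"      # 'ä' as one precomposed code point
decomposed = "a\u0308"   # 'a' followed by a combining diaeresis

# The two spellings look the same to the user but differ internally...
assert composed != decomposed
# ...and still intern to the very same symbol once normalized.
assert intern_symbol(composed) is intern_symbol(decomposed)
```

Without the normalization step the two spellings would produce different UTF-8 byte sequences and therefore two distinct symbols, which is the clash mentioned above.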
> 
> The case with Character is the same as with String: an abstract Character,
> from which UnicodeCharacter etc. derive.
> 
> There's still one problem, though. Acquiring information (isAlphabetic,
> isNumeric etc.) about Unicode characters requires a huge database. This
> database could either be built on the first loading of the image, or it
> could be copied directly from the originating image. The latter would be
> much easier, but I don't know how to do it (probably not a big deal).
> Loading it at the bootstrap would be much more complicated, since the data
> is read from a text file, which involves lots of string parsing. A
> chicken-and-egg problem. The data is ASCII, so it would be possible to first
> load ASCIIString, then build the Unicode database with that. However, I'd go
> for directly passing the db from the old image, since generating the
> database at every bootstrap also takes time.
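
To make the "build it from an ASCII text file" option concrete, here is a rough Python sketch. It assumes the text file uses the semicolon-separated layout of the Unicode Character Database's UnicodeData.txt (the message does not name the file), and it approximates isAlphabetic/isNumeric by general category only, which is cruder than the real Unicode properties. All function names are hypothetical.

```python
# Three sample records in the UnicodeData.txt layout (assumed input format):
# field 0 is the code point in hex, field 2 is the general category.
SAMPLE = (
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;\n"
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n"
    "00E4;LATIN SMALL LETTER A WITH DIAERESIS;Ll;0;L;0061 0308;;;;N;"
    "LATIN SMALL LETTER A DIAERESIS;;00C4;;00C4\n"
)

def build_category_table(ascii_bytes):
    """Parse ASCII-only database text into {code point: general category}."""
    table = {}
    # Decoding as ASCII mirrors the point that an ASCII-only string type is
    # enough to parse the database during bootstrap.
    for line in ascii_bytes.decode("ascii").splitlines():
        if not line:
            continue
        fields = line.split(";")
        table[int(fields[0], 16)] = fields[2]
    return table

def is_alphabetic(table, code_point):
    # Simplification: treat every letter category (L*) as alphabetic.
    return table.get(code_point, "Cn").startswith("L")

def is_numeric(table, code_point):
    # Simplification: treat every number category (N*) as numeric.
    return table.get(code_point, "Cn").startswith("N")

table = build_category_table(SAMPLE.encode("ascii"))
assert is_alphabetic(table, 0x00E4)
assert is_numeric(table, 0x0031)
```

The alternative Olli prefers, passing the finished table over from the old image, would skip `build_category_table` entirely at bootstrap and only need the lookup functions.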
> 
> The source should possibly be read in as UTF-8. This wouldn't break any old
> code.
> 
> I think this is all I can come up with now. Good luck, I'll be glad to help.
> 
> Olli
> 
> 
> 
> On Tue, 27 Sep 2005, Lendvai Attila wrote:
> 
> >
> > Hi Olli!
> >
> > Long time since you've been around Slate... what's up with you? Recently I
> > was looking at how your i18n code could be bootstrapped and I think I
> > could start playing with it or help you in the bootstrapping of it.
> >
> > As a first question, I don't know what internal representation we should
> > use: utf16 or utf8?
> >
> > Otherwise here's a quick (possibly partial) list of things to do:
> >
> > In bootstrap.slate:
> >   - gen generatePrototype: 'String' &layout: VM Bootstrap ByteArray...
> >     must be altered to build the proper String prototype (configurable
> >     internal representation as an optional quest that makes it even more
> >     interesting :)
> >   - gen@(VM Bootstrap Generator traits) generateLiteral: s@(String traits)
> >     must be prepared to encode the string in [utf8/utf16/whatever
> >     configured] into the generated ByteArray
> >
> > FFI:
> >   - extend the FFI to support encoding conversion for String parameters
> >     in both directions, to and from C (with a configurable condition/error
> >     raised when the conversion turns out to be "lossy")
> >
> > Without the FFI task done things can work as before if we use UTF8
> > because it's compatible with ANSI as long as no special chars are used.
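
Both FFI points above can be illustrated with a small Python sketch: a conversion helper that raises a configurable error when the target encoding cannot represent the string, plus a check that pure 7-bit text has identical bytes in UTF-8 and ASCII. The names `LossyConversion` and `to_c_bytes` are hypothetical and only stand in for whatever condition and FFI hook Slate would actually use.

```python
class LossyConversion(Exception):
    """Hypothetical condition signalled when a conversion would lose data."""

def to_c_bytes(text, encoding="ascii", allow_lossy=False):
    """Encode a string for a C call, refusing lossy conversions by default."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        if not allow_lossy:
            raise LossyConversion(
                f"cannot represent {text!r} in {encoding}"
            ) from exc
        # Caller opted in: unencodable characters become '?'.
        return text.encode(encoding, errors="replace")

# Plain 7-bit text has the same bytes in UTF-8 and ASCII, so existing C
# calls keep working even before any conversion support is written.
assert "hello".encode("utf-8") == to_c_bytes("hello")

# A "special char" is where the encodings diverge and the error kicks in.
try:
    to_c_bytes("\u00e4")
    raised = False
except LossyConversion:
    raised = True
assert raised
```

Passing `allow_lossy=True` corresponds to configuring the condition away and accepting the replacement characters.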
> >
> > An open issue: Should the new String still delegate to ByteArray? If not
> > (my vote) is it possible for it to have a slot with a ByteArray storing
> > the encoded representation? (I think yes) Because if it delegates to
> > Sequence as it does currently in the i18n code then Sequence must also be
> > bootstrapped (not preferred, but maybe inevitable).
> >
> > And what about the current bootstrap process? How mature is it? Is it
> > worth adding the new String code to it or will it change a lot later? I'm
> > thinking of the old dual-heap idea I was pondering about, with an
> > automatic heap minimizer that extracts the new bootstrapped heap from the
> > user heap... So something magic that renders the current one obsolete.
> >
> > Novo? Should we/I try to proceed with these or find something else for
> > now? Opinions about String representation/layout?
> >
> > That's it for now,
> >
> > - 101
> >
> > PS: I cc this to Olli, in case he's not on the list anymore...