[TDP] Alternative to mark-up languages (was: On Tunes Distributed Publishing)

Tue Mar 27 00:01:58 PDT 2007

Tom Novelli, Sat, Mar 24, 2007 2:15 PM, wrote:

> Thanks, Massimo..

Well, I don't know if you will thank me at the end
of this e-mail...

> That's A LOT of material... It'll take a long time
> to digest,

Skim at least this:

  NetBook - a data model to support knowledge exploration
    http://www.vldb.org/dblp/db/conf/vldb/Shasha85.html

It is quite a different approach from that of the Web, so it helps
to think outside the box.

> .. giving you a welcome respite. :-)

On the contrary, after what you wrote below you put me under pressure
to write even more. :-S

Tom, I don't want to be harsh, but you make my intention really hard
to pursue (when too much is too much).

> Speaking of markup.... The WWW has improved considerably since Fare
> wrote "Tunes vs. the WWW".

Sorry, but... this is plain bullshit to me, as you could predict from
reading my take on "TUNES vs the WWW":

  http://tunes.org/wiki/TUNES_vs_the_WWW

> XML, CSS, separation of structure & presentation, _useful_ client
> side scripting...

You seem seduced by the "Angle-Bracket Crowd" (ABC).

As I said elsewhere, CSS is the only piece of the puzzle that
I would salvage: the concept more than the implementation, which has
to deal with these stupid tags.

> There's light at the end of the tunnel.

No, they are digging a hole in the ground, a hole without
a bottom. You cannot "reform" the mass of crap coming out
of the W3C at an alarming rate. They make hard what is
simple and impossible what is hard.

> I just discovered that modern web browsers have a built-in "Design
> Mode" WYSIWYG editor, which is quite easy to incorporate into a CMS.
> I'm tempted to abandon the Wiki-text approach...

You need to think of a (possibly small) set of /input methods/ that
are *simple to type*, nothing more.

Something like: *bold* /italic/ _underline_ `fixed´ (vs proportional),
as in Gnus, the GNU news reader (IIRC), possibly with immediate feedback.
(Note: this is probably a modal input method; we need to ascertain this
and, if so, come up with something better.)

Each of these /input methods/ adds, under the hood, a "style" annotation
(bold, italic, underline..) to the corresponding part of the text. That is,
users specify ad-hoc "style". But most of the "style" will be applied
from a "stylesheet" by pattern matching/parsing (don't worry, I will
return to this in more detail in the section "AN ALTERNATIVE TO MARK-UP
LANGUAGES" below).
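To make the idea concrete, here is a minimal sketch in Python of such an
input method. It is only an illustration under my own assumptions: the
function name "annotate", the delimiter table and the (start, end, style)
triples are made up for this example and are not taken from Gnus or from
any existing system. The delimiters are stripped from the text and survive
only as annotations on character intervals.

```python
import re

# Hypothetical delimiter-to-style table, following the convention above.
STYLES = {"*": "bold", "/": "italic", "_": "underline"}

# A delimiter opens at the start of a word and the same delimiter closes
# at the end of a word, so the slashes in a URL are left alone.
PATTERN = re.compile(r"(?<!\w)([*/_])(\w(?:[^*/_\n]*\w)?)\1(?!\w)")


def annotate(source):
    """Return (plain_text, annotations); an annotation is (start, end, style)."""
    plain = []
    annotations = []
    length = 0   # length of the plain text produced so far
    cursor = 0   # current position in the source text
    for match in PATTERN.finditer(source):
        before = source[cursor:match.start()]
        plain.append(before)
        length += len(before)
        content = match.group(2)
        style = STYLES[match.group(1)]
        annotations.append((length, length + len(content), style))
        plain.append(content)
        length += len(content)
        cursor = match.end()
    plain.append(source[cursor:])
    return "".join(plain), annotations
```

The offsets refer to the plain text, not to what was typed, so the same
triples could equally come from a stylesheet rule instead of the keyboard.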

Most of the *superficial* structure (which is what XML is concerned with
when the ABC speak about structure) is quite evident, and so parseable, in
any text (a paragraph is trivially delimited by blank lines; headings
can probably be parsed, otherwise an /input method/ suffices; OCR programs
can usually discover columns automatically, and so on).
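As an illustration of how little is needed for the paragraph case, here is
a sketch in Python (the function name is my own invention). It returns the
paragraphs as character intervals, without touching the text itself.

```python
import re

# A paragraph is a maximal run of consecutive lines that each contain at
# least one non-whitespace character.
PARAGRAPH = re.compile(r"[^\n]*\S[^\n]*(?:\n[^\n]*\S[^\n]*)*")


def paragraph_regions(text):
    """Return the (start, end) offsets of paragraphs separated by blank lines."""
    return [match.span() for match in PARAGRAPH.finditer(text)]
```

For example, paragraph_regions("one\ntwo\n\nthree") gives two intervals,
one per paragraph; no tag had to be typed to obtain them.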

A SHORT NOTE ON "INFORMATION" AND "STRUCTURE"

"Unstructured" and "semi-structured", terms (ab-)used by the "Angle-Bracket
Crowd", are at odds with "information".

Something is structured or not; the fact that some ("deep") structure
is difficult for us (or for machines) to recognize (parse) does not
change this.

If something has no structure at all (no observable pattern), it is not
information: it is real white noise, generated by a random process
(quite difficult to find: ask cryptographers). See Information Theory.
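A crude way to see the point, sketched in Python: a general-purpose
compressor shrinks text whose pattern it can recognize and fails on data
whose pattern it cannot. This is only a rough proxy of my own choosing,
not a measure of information content, and the names below are made up for
the example.

```python
import random
import zlib


def compression_ratio(data):
    """Compressed size divided by original size (lower: more pattern found)."""
    return len(zlib.compress(data, 9)) / len(data)


structured = b"a paragraph is delimited by blank lines\n\n" * 100

# Pseudo-random bytes stand in for white noise here. They come from a
# deterministic generator, so they do have a (deep) structure; zlib is
# simply unable to recognize it.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(structured)))

structured_ratio = compression_ratio(structured)
noise_ratio = compression_ratio(noise)
```

Note that the "noise" here is itself an example of structure that is hard
to recognize: whoever knows the generator and the seed can reproduce it
exactly.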

ON BETTER HUMAN-COMPUTER INTERACTION: ARCHY

A more general /input method/, from which we can draw inspiration,
free from modes and other shortcomings, is the one developed by
Jef Raskin in "The Humane Interface":

  Summary of The Humane Interface
  http://jef.raskincenter.org/humane_interface/summary_of_thi.html

There is a prototype in Python, Archy:

  http://rchi.raskincenter.org/
  http://en.wikipedia.org/wiki/Archy

Here are some demos in Flash:

  Demos
  http://rchi.raskincenter.org/index.php?title=Demos

In Archy, search and selection are unified. Commands are executed
(see demo "EX6 - Commands") by first selecting objects (parameters)
and then, while holding down CAPS-LOCK (a quasi-mode), typing a verb
(command), which appears in a translucent window (verbs that partially
match are presented while typing), so you always see what you
have selected. A command is executed when CAPS-LOCK is released.

Another way is to first select an expression or a Python program
and then issue the CALC or RUN command, respectively. Think of it
as a free-form spreadsheet. No more static menus: every list or
sequence of text that you write can be a list of commands.
See:

  http://rchi.raskincenter.org/index.php?title=Using_Archy#Some_Archy_Commands

Archy dispenses with the concept of "applications" and prefers
a much more modular and consistent approach instead: sets of commands.
You add and learn each command gradually, as you need it.

It offers true interoperability: your "content" is not segregated
in application-specific file formats.

This avoids redundancy: no more 5 spell checkers, one is enough
and it is system-wide; no more dozens of micro-editors (fields in forms,
for example), each lacking some functionality. True fine-grain
reuse: what O-O always promised and never really delivered
(note the "fine-grain").

Unfortunately developers do not understand that this (the tyranny of
file formats) is *exactly* the reason for which DBMSs were invented,
and the Archy developers were using -- guess what -- XML! But this is
changing, probably because of performance problems, and they are
moving toward...

  [Archy_dev] Development plan: moving Archy to object database
    http://raskincenter.org/pipermail/archy_dev_raskincenter.org/2005-November/000180.html

Depressing. I predict that an unmaintainable, fragile mess
will follow.

Anyway, the message is: I want to think outside the web browser,
the application and the desktop boxes.

RIDICULOUS SEMANTIC MARKUP

Regarding so-called "semantic markup", quoting, for example,
a Wikipedia article:

  Semantic HTML
    http://en.wikipedia.org/wiki/HTML#Semantic_HTML

  Second, semantic HTML frees authors from the need to
  concern themselves with presentation details. When writing
  the number two, for example, should it be written out in
  words ("two"), or should it be written as a numeral (2)? A
  semantic markup might enter something like
  <number>2</number> and leave presentation details to the
  stylesheet designers.

I don't see how tags are needed at all here. Parsing "2" or "two"
and then converting to the other form (respectively "two" or "2")
is trivial; one doesn't need the obfuscation of <number>2</number>.
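Just to show how trivial, a sketch in Python limited to zero through ten
(the table and the function names are mine; a real version would need a
proper number grammar and some care about context):

```python
WORDS = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]
TO_NUMERAL = {word: str(value) for value, word in enumerate(WORDS)}
TO_WORD = {numeral: word for word, numeral in TO_NUMERAL.items()}


def as_numeral(token):
    """'two' -> '2'; anything not recognized is returned unchanged."""
    return TO_NUMERAL.get(token.lower(), token)


def as_word(token):
    """'2' -> 'two'; anything not recognized is returned unchanged."""
    return TO_WORD.get(token, token)
```

The stylesheet then only has to say which of the two functions to apply;
the author types whichever form comes naturally.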

  Similarly, an author might wonder where to break out
  quotations into separate indented blocks of text - with
  purely semantic HTML, such details would be left up to
  stylesheet designers. Authors would simply indicate
  quotations when they occur in the text, and not concern
  themselves with presentation.

Again, an input method or even, I guess, an algorithm that works
with indentation levels can do the same, without bothering authors
with <blockquote></blockquote> and the like.
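A sketch of such an algorithm in Python, under the assumption (mine) that
a quotation is any run of lines indented by at least two spaces, as in this
very e-mail. The function name and the threshold are illustrative only.

```python
def quoted_blocks(text, min_indent=2):
    """Return (first_line, last_line) pairs, 0-based, of indented runs."""
    blocks = []
    start = last = None
    for number, line in enumerate(text.split("\n")):
        if not line.strip():
            continue  # a blank line neither opens nor closes a block
        indent = len(line) - len(line.lstrip(" "))
        if indent >= min_indent:
            if start is None:
                start = number
            last = number
        elif start is not None:
            blocks.append((start, last))
            start = None
    if start is not None:
        blocks.append((start, last))
    return blocks
```

How the resulting blocks are displayed (indented, boxed, in italics) is
then a matter for the stylesheet.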

THE MAJOR SIN OF XML

The context here is the use of XML as a mark-up language for
documents. The use of XML for data management is even worse,
an inconceivable idiocy.

As Erik Naggum succinctly describes on comp.lang.lisp, in the
"The Next Generation of Lisp Programmers" thread:

  http://tinyurl.com/2wvzxh

  .. the edges are much too visible and verbose with the stupid
  tags that very quickly consume more of the large volume of
  characters in SGML/XML documents than the contents.

In the same thread:

  http://groups.google.com/group/comp.lang.lisp/msg/758de50bbb0acf55

  * Ziv Caspi
  | It is not obvious that use of \i{foo} (or {i foo}) is always better than use
  | of <i>foo</i>.  In TeX and LaTeX, for example, once the scope gets "too
  | big", a switch is made to the \start{}...\stop{} way of doing things.  While
  | redundant, it helps in catching the types of mistakes people (as opposed to
  | computers and geeks) tend to make.

  This is a good point, but my counter-argument is that your editor should
  make these things easier for you if the element contents becomes too large.

> If we use XML instead, I think we can deal with links and footnotes
> more robustly.

I am afraid to ask how XML can possibly help because I'm thinking
of the horrible hacks to which XML can force you.

To be more specific:

1) How can XML help the user to input links and footnotes
   without bothering him with its idiotic, repellent,
   redundant syntax?

2) How can XML simplify our software so that such links and
   footnotes go directly into a DBMS, to facilitate their
   update and manipulation? XML adds the burden of parsing
   itself to that of parsing anything else.

   I want to be able to do something simple like (just
   a trivial example):

    UPDATE URIs
       SET base_uri = 'http://citeseer.ist.psu.edu' -- new URI
     WHERE base_uri = 'http://citeseer.nj.nec.com'; -- old URI

   which updates the base URI of all links that refer to
   Citeseer. Or even:

    SELECT DISTINCT base_uri  -- (§)
      FROM URIs
     WHERE NOT ping(base_uri);

   This lists all URIs that do not respond to ping. "ping()"
   is a function from the type of base_uri to boolean; the
   type of base_uri can be something like the PostgreSQL
   URI Type:

     http://pgfoundry.org/projects/uri

   §) note the ugliness of SQL: if you omit DISTINCT it can
      possibly return a NON-relation (a table where base_uri
      is not unique -- not a key -- forcing "ping()" to be
      executed more than once on some sites).

3) Are footnotes anything special? I think not: they are yet
   another type of "annotation", not really different from "style"
   annotations, comments, links... (the idea is that you
   establish a relationship between n "entities").
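The two queries of point 2 can be tried out with a few lines of Python and
SQLite. This is a sketch under explicit assumptions: the table layout, the
sample rows and above all the "ping" function are made up here, and "ping"
only consults a fixed set instead of touching the network.

```python
import sqlite3

# Hypothetical stand-in for a real reachability check.
UNREACHABLE = {"http://dead.example"}


def ping(base_uri):
    """Return 1 if the site is considered reachable, 0 otherwise."""
    return 0 if base_uri in UNREACHABLE else 1


conn = sqlite3.connect(":memory:")
conn.create_function("ping", 1, ping)  # make ping() callable from SQL
conn.execute("CREATE TABLE URIs (base_uri TEXT NOT NULL, path TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO URIs VALUES (?, ?)",
    [
        ("http://citeseer.nj.nec.com", "/a.html"),
        ("http://citeseer.nj.nec.com", "/b.html"),
        ("http://dead.example", "/x.html"),
        ("http://dead.example", "/y.html"),
    ],
)

# Re-base every link that refers to Citeseer.
relocated = conn.execute(
    "UPDATE URIs SET base_uri = ? WHERE base_uri = ?",
    ("http://citeseer.ist.psu.edu", "http://citeseer.nj.nec.com"),
).rowcount

# List each unreachable base URI once.
dead = [row[0] for row in conn.execute(
    "SELECT DISTINCT base_uri FROM URIs WHERE NOT ping(base_uri)")]
```

Here "relocated" counts the links that were re-based and "dead" holds the
unreachable base URIs without duplicates.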

XQuery? No, thanks:

  ON JIM GRAY'S “CALL TO ARMS”
  by Lee Fesperman
    http://www.dbdebunk.com/page/page/2341185.htm

  .. XML and XQuery are insolubly flawed, the former being just
  another incarnation of the discredited hierarchical data
  model that RM [Relational data-Model] superseded. XQuery
  brings the excessive complexity that hierarchy instilled
  in the products that preceded SQL. A Relational query
  language uses ***only one method*** to access and manipulate
  information — by data values. ***XQuery has at least three***
  — by value, by navigating the hierarchy and through external
  linkage. This adds complexity, but no power, exactly the
  problem that Codd intended to eliminate with RM. ..

Emphasis mine.

AN ALTERNATIVE TO MARK-UP LANGUAGES

I invite you to read about and/or play with LAPIS:

  http://groups.csail.mit.edu/uid/lapis/

  Lightweight Structure is the ability to recognize text
  structure automatically, using an extensible library of
  patterns and parsers. Structure can be detected in lots
  of ways: grammars (e.g. Java or HTML), regular
  expressions, even manual selections by the user. With
  lightweight structure, it doesn't matter how the
  structure was detected--whether by a regular expression,
  or by a grammar, or by a hand-coded parser. All that
  matters is the result: a region set, which is a set of
  intervals in the text.

Also from:

  Lightweight Structure in Text
  PhD thesis by Robert C. Miller

  http://www.cs.cmu.edu/%7Ercm/papers/thesis/

  Pattern matching is heavily used for searching, filtering,
  and transforming text, but existing pattern languages
  offer few opportunities for reuse. Lightweight structure
  is a new approach that solves the reuse problem.
  Lightweight structure has three parts: a model of text
  structure as contiguous segments of text, or regions; an
  extensible library of structure abstractions (e.g., HTML
  elements, Java expressions, or English sentences) that can
  be implemented by any kind of pattern or parser; and a
  region algebra for composing and reusing structure
  abstractions. Lightweight structure does for text pattern
  matching what procedure abstraction does for programming,
  enabling construction of a reusable library

Note carefully: "region set" and "region algebra". I bet that,
by modelling these sets relationally, it is possible to "implement"
this region algebra with relational algebra (i.e. a morphism
between the two algebras exists). I need to investigate this.
If someone can help, I would greatly appreciate it.
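To show what I have in mind, here is a sketch in Python where a region set
is simply a relation, i.e. a set of (start, end) tuples. The two operators
and their names are my own guess at what containment operators look like;
the actual operators of LAPIS and their exact semantics may differ. Each
set comprehension below is a theta-join followed by a projection, while
union, intersection and difference are the ordinary set operations.

```python
def inside(a, b):
    """Regions of a that lie within some region of b."""
    return {(s, e) for (s, e) in a for (bs, be) in b if bs <= s and e <= be}


def containing(a, b):
    """Regions of a that enclose some region of b."""
    return {(s, e) for (s, e) in a for (bs, be) in b if s <= bs and be <= e}


text = "Call 555-1234 now.\n\nOr write."
paragraphs = {(0, 18), (20, 29)}   # found, say, by a blank-line parser
numbers = {(5, 13)}                # found, say, by a regular expression

with_number = containing(paragraphs, numbers)
without_number = paragraphs - with_number
```

This proves nothing about the full algebra, of course; it only suggests
that the containment operators fit the relational mould.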

Now, imagine a typical scenario: a user writes and edits his text
without interrupting the flow of his thoughts. The computer
parses his text incrementally, without disturbing him.

After this, the user can choose to see what the computer was
able to parse and possibly correct and augment the result,
which in the end is stored in a DB as an annotation (a foreign
key-primary key constraint) to the text, as a value with a
proper type (so you can query, for example: give me all US
phone numbers or e-mail addresses cited on such and such sites.
I can imagine people complicating the subject: if you are thinking
"ontologies", wait and read below).

In turn, these "semantic annotations" (just to mock the
Angle-Bracket Crowd; as we have seen, it is just parsing) can
help "stylistic" annotations: one can decide to
display the names of nations in bold and with a box around them.
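A sketch of this scenario in Python with SQLite, restricted to e-mail
addresses. Everything here is an assumption made for illustration: the
schema, the table and column names, the deliberately naive regular
expression and the two helper functions.

```python
import re
import sqlite3

# Naive pattern, good enough for the example only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE fragment (
        id   INTEGER PRIMARY KEY,
        site TEXT NOT NULL,
        body TEXT NOT NULL);
    CREATE TABLE annotation (
        fragment_id INTEGER NOT NULL REFERENCES fragment(id),
        lo    INTEGER NOT NULL,   -- start offset in fragment.body
        hi    INTEGER NOT NULL,   -- end offset in fragment.body
        type  TEXT NOT NULL,
        value TEXT NOT NULL);
""")


def store(site, body):
    """Store a fragment and annotate every e-mail address found in it."""
    fragment_id = conn.execute(
        "INSERT INTO fragment (site, body) VALUES (?, ?)", (site, body)
    ).lastrowid
    for match in EMAIL.finditer(body):
        conn.execute(
            "INSERT INTO annotation VALUES (?, ?, ?, 'email', ?)",
            (fragment_id, match.start(), match.end(), match.group()),
        )


def emails_on(site):
    """All e-mail addresses cited on the given site, in alphabetical order."""
    rows = conn.execute(
        "SELECT a.value FROM annotation AS a "
        "JOIN fragment AS f ON f.id = a.fragment_id "
        "WHERE a.type = 'email' AND f.site = ? ORDER BY a.value",
        (site,),
    )
    return [row[0] for row in rows]
```

A stylistic rule such as "display every e-mail address in bold" is then
just another query over the same annotation table.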

More schematically, what I propose is:

1) to move the matching/parsing/classification/indexing phase *before*
   publishing (possibly during editing), vs the current web model, where
   only tags are parsed, every time a "document" is retrieved;

2) to parse/classify MEANINGFUL patterns of text (possibly of other
   media of expression too), no longer wasting machine and HUMAN resources
   on parsing/writing these stupid tags; note that it is possible to extend
   a basic framework (similar to LAPIS) in different directions, beyond
   parsing (summarization, Bayesian classifiers, ...);

Also, as I mentioned in my last e-mail ("On Tunes Distributed Publishing"
http://lists.tunes.org/archives/tunes/2007-March/004067.html)

3) to give up the idea of "documents" (the same goes for
   "applications", be they web or desktop applications),
   giving users/agents (programs) the possibility to
   retrieve only the fragments that *they* deem interesting;

4) to use RDBMS technology(§) to store, annotate, manipulate and
   "re-assemble" (see the paper about NetBook at the beginning of this
   e-mail) these fragments in arbitrary structures (note that
   by "fragment" *I* mean any chunk of text, down to a
   single character).

   §) I mean: the theory and some advanced implementation
      techniques, not the bloat/administrative burden of
      traditional DBMSs (which, nonetheless, can be used for
      prototyping).
      I'm speaking about a self-tuning/self-administering *lite*
      massively distributed/decentralized data management system.
      There is ongoing research toward adapting techniques
      developed in the past for Distributed DBMSs to scale to the
      Internet, on overlay networks (P2P) with millions of nodes.
      Try this Google search:

      http://www.google.com/search?q=Data+Management+in+%22Peer-to-Peer%22+Systems

   Essentially: Information Retrieval (and beyond, see NetBook)
   with data management services provided by a DBMS.

   SQL sucks, of course. We need to monitor "new" developments;
   for example: on The Third Manifesto front (http://www.thethirdmanifesto.com/)
   and on that of Deductive DBMSs.

Note that this is complementary to all the research done in
the field of information extraction from web sites. See, for
example:

  An SQL-Wrapper for the WWW: How to send SQL-Queries to
  the DBLP Web site
    http://davis.wpi.edu/dsrg/SQL_WRAPPER/index.html

  Monadic datalog and the expressive power of languages for
  Web information extraction
    http://portal.acm.org/citation.cfm?id=962450

It is ironic that most of the data present on the web comes from
DBMSs: DBMS -> HTML/XML -parsing-> DBMS (meaning: the resource
seen as a DBMS).

Suggestion: give (read-only) access to your DBMS on the 'net.
A decent DBMS will take care of security with views and ACLs.
If you don't trust your DBMS, write a wrapper.

A NOTE ON MODULAR PARSING

This will also be relevant to our Meta-Translator Subproject:

  Concrete Syntax for Objects. Domain-Specific Language Embedding
  and Assimilation without Restrictions

  http://swerl.tudelft.nl/bin/view/EelcoVisser/ConcreteSyntaxForObjects

  [p. 13, top-left]
  An interesting formal result is that there is no proper
  subclass of the context-free grammars that is closed under
  union. As a consequence, grammar languages that are
  restricted to some proper subset of the context-free
  languages cannot be modular since the combination of two,
  for example LR, syntax definitions is not guaranteed to be
  in this subset. Therefore, SDF supports the full class of
  context-free grammars.
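The practical side of this can be seen with a toy example in Python. It is
a sketch under heavy simplifications of my own: grammars are dictionaries
without empty productions, LL(1) is used instead of LR because it is easier
to check, and all names are invented. It shows only that combining two
syntax *definitions* can leave the class; it does not prove the formal
result about language classes (the union language here would still be
LL(1) after left-factoring).

```python
def first_sets(grammar):
    """FIRST sets for a grammar with no empty productions."""
    first = {nonterminal: set() for nonterminal in grammar}
    changed = True
    while changed:
        changed = False
        for nonterminal, alternatives in grammar.items():
            for alternative in alternatives:
                head = alternative[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[nonterminal]:
                    first[nonterminal] |= new
                    changed = True
    return first


def is_ll1(grammar):
    """True if no two alternatives of a nonterminal share a first terminal."""
    first = first_sets(grammar)
    for alternatives in grammar.values():
        seen = set()
        for alternative in alternatives:
            head = alternative[0]
            starts = first[head] if head in grammar else {head}
            if seen & starts:
                return False
            seen |= starts
    return True


def union(g1, start1, g2, start2, start="S"):
    """Grammar for the union of two languages; assumes disjoint names."""
    combined = {**g1, **g2}
    combined[start] = [(start1,), (start2,)]
    return combined


g1 = {"A": [("a", "b")]}   # generates "ab"
g2 = {"B": [("a", "c")]}   # generates "ac"
both = union(g1, "A", g2, "B")
```

Each of g1 and g2 passes the check alone, while their plain union does not:
a one-token lookahead cannot choose between A and B.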

Another appetizer from the abstract:

  .. In this paper we describe MetaBorg, a method for
  providing concrete syntax for domain abstractions to
  application programmers. The method consists of embedding
  domain-specific languages in a general purpose host
  language and assimilating the embedded domain code into
  the surrounding host code. ..

  .. Indeed, MetaBorg can be considered a method for
  promoting APIs to the language level. ..

ON "ONTOLOGIES", "SCHEMA RECONCILIATION" AND THE FRAUD
  OF "SELF-DESCRIBING" XML

Essentially, the problem underlying these terms can
be illustrated by a trivial example (oriented toward
the framework I'm speaking about):

given a fully distributed/decentralized system, if peers X,
Y and Z possess information about an "entity" with

  name:string=John (attribute:type=value)

are these "entities" the same? Are attribute "name" and
type "string" the same?

Now, there is no magic: one can diligently wrap his
data in hordes of XML tags and spread "ontologies' buds"
all around, but this will solve nothing. Consider:

a) you need agreement on semantic and pragmatic issues
   before exchanging information; there is no such
   thing as a "self-describing" mark-up. Refer to:

     http://en.wikipedia.org/wiki/Semantics
     http://en.wikipedia.org/wiki/Pragmatics

b) Jean-Luc Delatre in this conversation:

     RDF is a monkey wrench?
       http://tinyurl.com/26x6z4

     There is no single "true" Ontology, not even finitely
     many of them because Ontologies are NOT models of the real
     world but steps in a process of organizing evolving
     consensus.

   Also Marcia J. Bates in:

     After the Dot-Bomb: Getting Web Information Retrieval
     Right This Time
       http://firstmonday.org/issues/issue7_7/bates/

     2. Succumbing to the "ontology" fallacy.

        ...

        Long-term solutions to the problems of indexing the
        Web will probably involve multiple overlapping methods
        of classifying and indexing knowledge, so that people
        coming from every possible angle can find their way to
        resources that work best for them. Instead of calling
        it an "ontology," label the system of description what
        it really is - a classification, thesaurus, set of
        concept clusters, or whatever (see also Soergel,
        1999.).

        ...

     3. Using standard dictionaries or Roget's Thesaurus for
        information retrieval.

        ...

        However, linguists are not experts in information
        retrieval. Through decades of experimentation, the IR
        community has learned how ineffectual such conventional
        dictionary and thesaurus sources are for real-world
        information retrieval. Instead, another type of thesaurus
        has been developed specifically for the case where the
        purpose is information description for later retrieval.
        These IR thesauri number in the hundreds, and have been
        developed for virtually every kind of subject matter.
        Many "ontologists" are truly re-inventing the wheel - an
        already-developed thesaurus for their subject matter may
        be hiding in the stacks of their local university
        library.

Summarizing:

1) the problem seems really hard, nearly impossible if you expect
   to solve it once and for all;
2) various techniques from information sciences can help;
3) an extensible framework like the one I sketched here could
   be, even in its initial form, better than a simple keyword
   search.

ON DECENTRALIZATION <> DISTRIBUTION

I like these short definitions:

  Dr. Rohit Khare
    http://www.ics.uci.edu/~rohit/

  "Decentralization <> Distribution", as shown in this
  summary quad chart. Basically:

  * distributed systems are about letting multiple parties
    come to consensus on a single decision;

  * decentralization requires permitting independent parties
    to make their own decisions.

What did I leave out? Oh yes, URLs:

  Tunes Distributed Publishing
    http://tunes.org/Interfaces/tunesvswww.html

  .. using URLs is like having to work with pointers,
  not objects [in TUNES sense], in an environment with
  an unreliable buggy moving GC ..

Probably the subject of another e-mail. But I think
I have explained this satisfactorily:

  .. Tunes is a "Meta-browsing" system, that allows arbitrary
  extraction of information from the document view as a semantically
  meaningful object instead of a meaningless syntactical stream
  of bits.

What do you people think?

> I'm all for doing this the easy way, as there's no sense investing
> more than necessary in such an immature system.

No sense at all in investing in another stupid "CMS".

> A big example -
> http://oneclick.mozdev.org/sidebar/cune/cuneAform.htm
>
> A simple version I'm trying out - http://tom.bespin.org/edit.html

People can do incredible things even with the most stupid
technologies, given enough time and resources. I ask: can you
imagine what these same people could do with tools and a
mental framework that empower them (vs dumbing them down)?

> Does this look promising?
>
> - Tom

Not to me, sorry.

P.S.: Tom, I don't see how it is possible to access the old static
      content pages from the home page. This is definitely
      NOT acceptable.

Regards.

--
Massimo Dentico