[TDP] Alternative to mark-up languages (was: On Tunes Distributed
Publishing)
Massimo Dentico
m.dentico at virgilio.it
Tue Mar 27 00:01:58 PDT 2007
Tom Novelli, Sat, Mar 24, 2007 2:15 PM, wrote:
> Thanks, Massimo..
Well, I don't know if you will thank me at the end
of this e-mail...
> That's A LOT of material... It'll take a long time
> to digest,
Skim at least this:
NetBook - a data model to support knowledge exploration
http://www.vldb.org/dblp/db/conf/vldb/Shasha85.html
It is quite a different approach from the Web's, so it helps
to think outside the box.
> .. giving you a welcome respite. :-)
On the contrary: after what you wrote below, you have put me under
pressure to write even more. :-S
Tom, I don't want to be harsh, but you make my intention really hard
to pursue (when too much is too much).
> Speaking of markup.... The WWW has improved considerably since Fare
> wrote "Tunes vs. the WWW".
Sorry, but... this is plain bullshit to me. Predictable, from
reading my take on "TUNES vs the WWW":
http://tunes.org/wiki/TUNES_vs_the_WWW
> XML, CSS, separation of structure & presentation, _useful_ client
> side scripting...
You seem seduced by the "Angle-Bracket Crowd" (ABC).
As I said elsewhere, CSS is the only piece of the puzzle that
I salvage: the concept more than the implementation, which needs
to deal with these stupid tags.
> There's light at the end of the tunnel.
No, they are digging a hole in the ground, a hole without
a bottom. You cannot "reform" the mass of crap coming out
of the W3C at an alarming rate. They make hard what is
simple and impossible what is hard.
> I just discovered that modern web browsers have a built-in "Design
> Mode" WYSIWYG editor, which is quite easy to incorporate into a CMS.
> I'm tempted to abandon the Wiki-text approach...
You need to think of a (possibly little) set of /input methods/ that
are *simple to type*, nothing more.
Something like: *bold* /italic/ _underline_ `fixed` (vs proportional),
as in the Gnus GNU News Reader (IIRC), possibly with immediate feedback.
(Note: it is probably a modal input method; we need to ascertain this
and, if necessary, come up with something better.)
Each of these /input methods/ adds, under the hood, a "style" annotation
(bold, italic, underline...) to the corresponding part of the text. That is,
users specify ad-hoc "style". But most of the "style" will be applied
from a "stylesheet" by pattern matching/parsing (don't worry, I will
return to this in more detail in section "AN ALTERNATIVE TO MARK-UP
LANGUAGES" below).
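A minimal sketch of what "under the hood" could mean here (the marker set and the (start, end, style) representation are my assumptions, not a spec):

```python
import re

# Sketch (my assumptions, not a spec): recognize the *bold* /italic/
# _underline_ input methods and record, "under the hood", a style
# annotation for each region -- a (start, end, style) triple over the
# plain text -- instead of emitting tags into it.
MARKERS = {'*': 'bold', '/': 'italic', '_': 'underline'}

def style_annotations(text):
    annotations = []
    for marker, style in MARKERS.items():
        m = re.escape(marker)
        for match in re.finditer(m + r'(\w[^*/_]*?)' + m, text):
            # annotate the region covered by the contents only
            annotations.append((match.start(1), match.end(1), style))
    return sorted(annotations)

print(style_annotations("a *bold* word and an /italic/ one"))
```

The text itself stays untouched; the "style" lives beside it, exactly like any other annotation.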
Most of the *superficial* structure (the structure XML is concerned with
when the ABC speaks about structure) is quite evident, and so parseable, in
any text (a paragraph is trivially delimited by blank lines; headings
can probably be parsed, otherwise an /input method/ suffices; OCR programs
can usually discover columns automatically; and so on).
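For instance, the blank-line rule is a one-line split; here is a sketch, with a crude all-caps heading heuristic thrown in purely as an assumption:

```python
import re

# Sketch: paragraphs fall out of a blank-line split; the all-caps
# heading heuristic is a crude assumption, just to show that headings
# too can often be guessed without any tags.
def parse_blocks(text):
    blocks = []
    for chunk in re.split(r'\n\s*\n', text.strip()):
        kind = 'heading' if chunk.isupper() and len(chunk) < 60 else 'paragraph'
        blocks.append((kind, chunk))
    return blocks

sample = "A SHORT NOTE\n\nFirst paragraph.\n\nSecond paragraph."
print(parse_blocks(sample))
```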
A SHORT NOTE ON "INFORMATION" AND "STRUCTURE"
"Unstructured" and "semi-structured", terms (ab)used by the "Angle-Bracket
Crowd", are at odds with "information".
Something is structured or it is not; the fact that some ("deep") structure
is difficult for us (or for machines) to recognize (parse) does not
change this.
If it has no structure at all (no observable pattern), it is not
information; it is true white noise, generated by a random process
(quite difficult to find: ask cryptographers). See Information Theory.
ON BETTER HUMAN-COMPUTER INTERACTION: ARCHY
A more general /input method/, from which we can draw inspiration,
exempt from modes and other shortcomings, is that developed by
Jef Raskin in "The Humane Interface":
Summary of The Humane Interface
http://jef.raskincenter.org/humane_interface/summary_of_thi.html
There is a prototype in Python, Archy:
http://rchi.raskincenter.org/
http://en.wikipedia.org/wiki/Archy
Here some demos in Flash:
Demos
http://rchi.raskincenter.org/index.php?title=Demos
In Archy, search and selection are unified; commands are executed
(see demo "EX6 - Commands") by first selecting objects (parameters)
and then, holding down CAPS-LOCK (a quasi-mode), typing a verb
(command), which appears in a translucent window (verbs that partially
match are presented while typing), so you always see what you
have selected. A command is executed when CAPS-LOCK is released.
Another way is to first select an expression or a Python program
and then issue the CALC or RUN commands, respectively. Think of it
as a free-form spreadsheet. No more static menus: every list or
sequence of text that you write can be a list of commands.
See:
http://rchi.raskincenter.org/index.php?title=Using_Archy#Some_Archy_Commands
Archy dispenses with the concept of "applications" and prefers
a much more modular and consistent approach instead, sets of commands:
you add and learn, gradually, each command you need.
It offers true interoperability: your "content" is not segregated
in application-specific file formats.
This avoids redundancy: no more 5 spell checkers, one is enough
and it is system-wide; no more dozens of micro-editors (fields in forms,
for example), each with some functionality lacking. True fine-grained
reuse: what O-O always promised and never really delivered
(note the "fine-grained").
Unfortunately the developers did not understand that this (the tyranny
of file formats) is *exactly* the reason for which DBMSs were invented
-- and they were using, guess what, XML! But this is changing, probably
because of performance problems, and they are moving toward...
[Archy_dev] Development plan: moving Archy to object database
http://raskincenter.org/pipermail/archy_dev_raskincenter.org/2005-November/000180.html
Depressing. I predict that an unmaintainable, fragile mess
will follow.
Anyway, the message is: I want to think outside the web browser,
the application and the desktop boxes.
RIDICULOUS SEMANTIC MARKUP
Regarding so-called "semantic markup", quoting, for example,
a Wikipedia article:
Semantic HTML
http://en.wikipedia.org/wiki/HTML#Semantic_HTML
Second, semantic HTML frees authors from the need to
concern themselves with presentation details. When writing
the number two, for example, should it be written out in
words ("two"), or should it be written as a numeral (2)? A
semantic markup might enter something like
<number>2</number> and leave presentation details to the
stylesheet designers.
I don't see how tags are needed at all here. Parsing "2" or "two"
and then converting to the other form (respectively "two" or "2")
is trivial; one does not need the obfuscation of <number>2</number>.
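Just how trivial, a toy sketch can show (the ten-word lookup is obviously a stand-in for a real number parser):

```python
# Toy sketch: a ten-word lookup stands in for a real number parser,
# only to show that no <number> tag is needed for either rendering.
WORDS = ['zero', 'one', 'two', 'three', 'four',
         'five', 'six', 'seven', 'eight', 'nine']

def numeral_to_word(token):
    return WORDS[int(token)] if token.isdigit() and int(token) < 10 else token

def word_to_numeral(token):
    return str(WORDS.index(token)) if token in WORDS else token

print(numeral_to_word("2"), word_to_numeral("two"))  # -> two 2
```

A "stylesheet" could then choose either rendering by pattern matching on the text, with no tags in sight.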
Similarly, an author might wonder where to break out
quotations into separate indented blocks of text - with
purely semantic HTML, such details would be left up to
stylesheet designers. Authors would simply indicate
quotations when they occur in the text, and not concern
themselves with presentation.
Again, an input method or even, I guess, an algorithm that works
with indentation levels can do the same, without bothering authors
with <blockquote></blockquote> and the like.
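A sketch of such an algorithm, assuming a 4-space indentation convention purely for illustration:

```python
# Sketch: detect indented runs of lines and annotate them as
# quotations; the 4-space threshold is an assumption for illustration.
def find_quotations(lines, indent=4):
    regions, start = [], None
    for i, line in enumerate(lines):
        indented = bool(line.strip()) and line.startswith(' ' * indent)
        if indented and start is None:
            start = i
        elif not indented and start is not None:
            regions.append((start, i))  # half-open range of line numbers
            start = None
    if start is not None:
        regions.append((start, len(lines)))
    return regions

text = ["Author's prose.",
        "    An indented",
        "    quotation.",
        "More prose."]
print(find_quotations(text))  # -> [(1, 3)]
```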
THE MAJOR SIN OF XML
The context here is the use of XML as a mark-up language for
documents. The use of XML for data management is even worse,
an inconceivable idiocy.
As Erik Naggum succinctly described on comp.lang.lisp, in
the "The Next Generation of Lisp Programmers" thread:
http://tinyurl.com/2wvzxh
.. the edges are much too visible and verbose with the stupid
tags that very quickly consume more of the large volume of
characters in SGML/XML documents than the contents.
In the same thread:
http://groups.google.com/group/comp.lang.lisp/msg/758de50bbb0acf55
* Ziv Caspi
| It is not obvious that use of \i{foo} (or {i foo}) is always better than use
| of <i>foo</i>. In TeX and LaTeX, for example, once the scope gets "too
| big", a switch is made to the \start{}...\stop{} way of doing things. While
| redundant, it helps in catching the types of mistakes people (as opposed to
| computers and geeks) tend to make.
This is a good point, but my counter-argument is that your editor should
make these things easier for you if the element contents becomes too large.
> If we use XML instead, I think we can deal with links and footnotes
> more robustly.
I am afraid to ask how XML can possibly help, because I'm thinking
of the horrible hacks into which XML can force you.
To be more specific:
1) how can XML help the user to input links and footnotes
without bothering him with its idiotic, repellent,
redundant syntax?
2) How can XML simplify our software so that such links and
footnotes go directly into a DBMS, to facilitate their
update and manipulation? XML adds the burden of parsing
itself to that of parsing anything else.
I want to be able to do something simple like (just
a trivial example):
UPDATE URIs
SET base-uri = "http://citeseer.ist.psu.edu" -- new URI
WHERE base-uri = "http://citeseer.nj.nec.com" -- old URI
which updates the base URI of all links that refer to
Citeseer. Or even:
SELECT DISTINCT base-uri -- (§)
FROM URIs
WHERE NOT ping(base-uri)
This lists all URIs that do not respond to a ping. "ping()"
is a function from the type of base-uri to boolean; the
type of base-uri can be something like the PostgreSQL
URI Type:
http://pgfoundry.org/projects/uri
§) note the ugliness of SQL: if you omit DISTINCT it can
possibly return a NON-relation (a table where base-uri
is not unique -- not a key), forcing the execution of
"ping()" more than once for some sites.
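Both statements can be tried out almost verbatim in SQLite (base_uri instead of base-uri, since most SQL dialects require quoting hyphenated names), with "ping()" registered as an ordinary user-defined function -- stubbed here, not a real network probe:

```python
import sqlite3

# Runnable sketch of the two statements above in SQLite. "ping()" is
# registered as an ordinary user-defined SQL function; here it is a
# stub that always fails, not a real network probe.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE URIs (base_uri TEXT)")
conn.executemany("INSERT INTO URIs VALUES (?)",
                 [("http://citeseer.nj.nec.com",),
                  ("http://example.org",)])

# the UPDATE: repoint every Citeseer link at the new base URI
conn.execute("UPDATE URIs SET base_uri = ? WHERE base_uri = ?",
             ("http://citeseer.ist.psu.edu", "http://citeseer.nj.nec.com"))

conn.create_function("ping", 1, lambda uri: False)  # stub: nothing answers
dead = [row[0] for row in conn.execute(
    "SELECT DISTINCT base_uri FROM URIs WHERE NOT ping(base_uri)")]
print(dead)
```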
3) Are footnotes anything special? I think not; they are yet
another type of "annotation", not really different from "style"
annotations, comments, links... (the idea is that you
establish a relationship between n "entities").
XQuery? No, thanks:
ON JIM GRAY'S “CALL TO ARMS”
by Lee Fesperman
http://www.dbdebunk.com/page/page/2341185.htm
.. XML and XQuery are insolubly flawed, the former being just
another incarnation of the discredited hierarchical data
model that RM [Relational data-Model] superseded. XQuery
brings the excessive complexity that hierarchy instilled
in the products that preceded SQL. A Relational query
language uses ***only one method*** to access and manipulate
information — by data values. ***XQuery has at least three***
— by value, by navigating the hierarchy and through external
linkage. This adds complexity, but no power, exactly the
problem that Codd intended to eliminate with RM. ..
Emphasis mine.
AN ALTERNATIVE TO MARK-UP LANGUAGES
I invite you to read about and/or play with LAPIS:
http://groups.csail.mit.edu/uid/lapis/
Lightweight Structure is the ability to recognize text
structure automatically, using an extensible library of
patterns and parsers. Structure can be detected in lots
of ways: grammars (e.g. Java or HTML), regular
expressions, even manual selections by the user. With
lightweight structure, it doesn't matter how the
structure was detected--whether by a regular expression,
or by a grammar, or by a hand-coded parser. All that
matters is the result: a region set, which is a set of
intervals in the text.
Also from:
Lightweight Structure in Text
PhD thesis by Robert C. Miller
http://www.cs.cmu.edu/%7Ercm/papers/thesis/
Pattern matching is heavily used for searching, filtering,
and transforming text, but existing pattern languages
offer few opportunities for reuse. Lightweight structure
is a new approach that solves the reuse problem.
Lightweight structure has three parts: a model of text
structure as contiguous segments of text, or regions; an
extensible library of structure abstractions (e.g., HTML
elements, Java expressions, or English sentences) that can
be implemented by any kind of pattern or parser; and a
region algebra for composing and reusing structure
abstractions. Lightweight structure does for text pattern
matching what procedure abstraction does for programming,
enabling construction of a reusable library
Note carefully: "region set" and "region algebra". I bet that,
by modelling these sets relationally, it is possible to "implement"
this region algebra with relational algebra (i.e., a morphism
between the two algebras exists). I need to investigate this.
If someone can help, I will appreciate it greatly.
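To make the bet concrete: a region set is just a relation of (start, end) pairs, and typical region-algebra operators come out as joins with inequality predicates (operator names loosely after LAPIS; this is my sketch, not Miller's exact definitions):

```python
# My sketch (not Miller's exact definitions): a region set is a
# relation of (start, end) pairs, and region-algebra operators become
# joins with inequality predicates -- i.e. plain relational algebra
# over the interval attributes.
def region_in(inner, outer):
    """Regions of `inner` contained in some region of `outer`."""
    return {(s, e) for (s, e) in inner
            for (os_, oe) in outer if os_ <= s and e <= oe}

def region_overlaps(a, b):
    """Regions of `a` overlapping some region of `b`."""
    return {(s, e) for (s, e) in a
            for (bs, be) in b if s < be and bs < e}

words = {(0, 4), (5, 9), (10, 14)}   # e.g. matches of a word pattern
sentence = {(0, 9)}                  # e.g. output of a sentence parser
print(region_in(words, sentence))    # the two word regions in the sentence
```

It does not matter how `words` or `sentence` were produced (regexp, grammar, manual selection): only the region sets take part in the algebra, which is exactly the LAPIS point.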
Now, imagine a typical scenario: a user writes and edits his text
without interrupting the flow of his thoughts. The computer
parses his text incrementally, without disturbing him.
After this, the user can choose to see what the computer was
able to parse and, if necessary, correct and augment the result,
which in the end is stored in a DB as an annotation (a foreign
key-primary key constraint) on the text, as a value with a
proper type (so you can query, for example: give me all US
phone numbers or e-mail addresses cited on such and such sites.
I can imagine people complicating the subject: if you are
thinking "ontologies", wait and read below).
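A sketch of the "propose typed annotations" step (the e-mail regex is a deliberately crude assumption; in the framework sketched here an extensible library of patterns/parsers would sit behind it):

```python
import re

# Sketch of the "propose typed annotations" step. The e-mail regex is
# a deliberately crude assumption; an extensible library of patterns
# and parsers would sit behind it in a real system.
EMAIL = re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+')

def propose_annotations(text):
    # each hit becomes a (start, end, type, value) row, ready to be
    # stored in a DB and queried by type later
    return [(m.start(), m.end(), 'email', m.group())
            for m in EMAIL.finditer(text)]

note = "Write to tom@example.net, not to the list."
anns = propose_annotations(note)
print(anns)
```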
In turn, these "semantic annotations" (just to mock the
Angle-Bracket Crowd; as we have seen, it is parsing) can
help "stylistic" annotations: one can decide to
display nations' names in bold and with a box around them.
More schematically, what I propose is:
1) to move the matching/parsing/classification/indexing phase *before*
publishing (possibly during editing), vs the current web model, where
only the tags are parsed, every time a "document" is retrieved;
2) to parse/classify MEANINGFUL patterns of text (and possibly other
media of expression), no longer wasting machine and HUMAN resources
on parsing/writing these stupid tags; note that it is possible to extend
a basic framework (similar to LAPIS) in different directions, beyond
parsing (summarization, Bayesian classifiers, ...);
Also, as I mentioned in my last e-mail ("On Tunes Distributed Publishing"
http://lists.tunes.org/archives/tunes/2007-March/004067.html)
3) to give up the idea of "documents" (the same goes for
"applications", be they web or desktop applications),
giving users/agents (programs) the possibility to
retrieve only the fragments that *they* deem interesting;
4) to use RDBMS technology(§) to store, annotate, manipulate and
"re-assemble" (see the paper about NetBook at the beginning of
this e-mail) these fragments in arbitrary structures (note that
by "fragment" *I* mean any chunk of text, down to a
single character).
§) I mean: the theory and some advanced implementation
techniques, not the bloat/administrative burden of
traditional DBMSs (that, nonetheless, can be used for
prototyping).
I'm speaking about a self-tuning/administering *lite*
massively distributed/decentralized data management system.
There is ongoing research toward adapting the techniques
developed in the past for Distributed DBMSs to Internet scale,
on overlay networks (P2P) with millions of nodes.
Try this Google search:
http://www.google.com/search?q=Data+Management+in+%22Peer-to-Peer%22+Systems
Essentially: Information Retrieval (and beyond, see NetBook)
with data management services provided by a DBMS.
SQL sucks, of course. We need to monitor "new" developments;
for example: on The Third Manifesto front (http://www.thethirdmanifesto.com/)
and on that of Deductive DBMSs.
Note that this is complementary to all the research done in
the field of information extraction from web sites. See, for
example:
An SQL-Wrapper for the WWW: How to send SQL-Queries to
the DBLP Web site
http://davis.wpi.edu/dsrg/SQL_WRAPPER/index.html
Monadic datalog and the expressive power of languages for
Web information extraction
http://portal.acm.org/citation.cfm?id=962450
It is ironic that most of the data present on the web comes from
DBMSs: DBMS -> HTML/XML -parsing-> DBMS (intended as: the resource
seen as a DBMS).
Suggestion: give (read-only) access to your DBMS on the 'net.
A decent DBMS will take care of security with views and ACLs.
If you don't trust your DBMS, write a wrapper.
A NOTE ON MODULAR PARSING
This will be relevant for our Meta-Translator Subproject also:
Concrete Syntax for Objects. Domain-Specific Language Embedding
and Assimilation without Restrictions
http://swerl.tudelft.nl/bin/view/EelcoVisser/ConcreteSyntaxForObjects
[p. 13, top-left]
An interesting formal result is that there is no proper
subclass of the context-free grammars that is closed under
union. As a consequence, grammar languages that are
restricted to some proper subset of the context-free
languages cannot be modular since the combination of two,
for example LR, syntax definitions is not guaranteed to be
in this subset. Therefore, SDF supports the full class of
contextfree grammars.
Another appetizer from the abstract:
.. In this paper we describe MetaBorg, a method for
providing concrete syntax for domain abstractions to
application programmers. The method consists of embedding
domain-specific languages in a general purpose host
language and assimilating the embedded domain code into
the surrounding host code. ..
.. Indeed, MetaBorg can be considered a method for
promoting APIs to the language level. ..
ON "ONTOLOGIES", "SCHEMA RECONCILIATION" AND THE FRAUD
OF "SELF-DESCRIBING" XML
Essentially, the problem underlying these terms can
be illustrated by a trivial example (oriented toward
the framework I'm speaking about):
given a fully distributed/decentralized system, if peers X,
Y and Z possess information about an "entity" with
name:string=John (attribute:type=value)
are these "entities" the same? Are the attribute "name" and
the type "string" the same?
Now, there is no magic: one can diligently wrap his
data in hordes of XML tags, spread "ontology buds"
all around, but this will solve nothing. Consider:
a) you need agreement on semantic and pragmatic issues
before exchanging information; there is no such
thing as a "self-describing" mark-up. Refer to:
http://en.wikipedia.org/wiki/Semantics
http://en.wikipedia.org/wiki/Pragmatics
b) Jean-Luc Delatre in this conversation:
RDF is a monkey wrench?
http://tinyurl.com/26x6z4
There is no single "true" Ontology, not even finitely
many of them because Ontologies are NOT models of the real
world but steps in a process of organizing evolving
consensus.
Also Marcia J. Bates in:
After the Dot-Bomb: Getting Web Information Retrieval
Right This Time
http://firstmonday.org/issues/issue7_7/bates/
2. Succumbing to the "ontology" fallacy.
...
Long-term solutions to the problems of indexing the
Web will probably involve multiple overlapping methods
of classifying and indexing knowledge, so that people
coming from every possible angle can find their way to
resources that work best for them. Instead of calling
it an "ontology," label the system of description what
it really is - a classification, thesaurus, set of
concept clusters, or whatever (see also Soergel,
1999.).
...
3. Using standard dictionaries or Roget's Thesaurus for
information retrieval.
...
However, linguists are not experts in information
retrieval. Through decades of experimentation, the IR
community has learned how ineffectual such conventional
dictionary and thesaurus sources are for real-world
information retrieval. Instead, another type of thesaurus
has been developed specifically for the case where the
purpose is information description for later retrieval.
These IR thesauri number in the hundreds, and have been
developed for virtually every kind of subject matter.
Many "ontologists" are truly re-inventing the wheel - an
already-developed thesaurus for their subject matter may
be hiding in the stacks of their local university
library.
Summarizing:
1) the problem seems really hard, nearly impossible if you expect
to solve it once and for all;
2) various techniques from information sciences can help;
3) an extensible framework like what I sketched here could
be, in its starting form, better than a simple keyword
search.
ON DECENTRALIZATION <> DISTRIBUTION
I like these short definitions:
Dr. Rohit Khare
http://www.ics.uci.edu/~rohit/
"Decentralization <> Distribution", as shown in this
summary quad chart. Basically:
* distributed systems are about letting multiple parties
come to consensus on a single decision;
* decentralization requires permitting independent parties
to make their own decisions.
What did I leave out? Oh yes, URLs:
Tunes Distributed Publishing
http://tunes.org/Interfaces/tunesvswww.html
.. using URLs is like having to work with pointers,
not objects [in TUNES sense], in an environment with
an unreliable buggy moving GC ..
Probably the subject of another e-mail. But I think
I explained this satisfactorily:
.. Tunes is a "Meta-browsing" system, that allows arbitrary
extraction of information from the document view as a semantically
meaningful object instead of a meaningless syntactical stream
of bits.
What do you people think?
> I'm all for doing this the easy way, as there's no sense investing
> more than necessary in such an immature system.
No sense at all in investing in another stupid "CMS".
> A big example -
> http://oneclick.mozdev.org/sidebar/cune/cuneAform.htm
>
> A simple version I'm trying out - http://tom.bespin.org/edit.html
People can do incredible things even with the most stupid
technologies, given enough time and resources. I ask: can you
imagine what these same people could do with tools and a
mental framework that empower them (vs dumbing them down)?
> Does this look promising?
>
> - Tom
Not to me, sorry.
P.S.: Tom, I don't see how it is possible to access old static
content pages from the home page. This definitely is
NOT acceptable.
Regards.
--
Massimo Dentico