Language ideas

Justin Sampson
Mon, 01 Apr 1996 15:25:16 -0800

I've been playing with a fairly general grammar-based computational model
that may be appropriate for Tunes. The basic idea is to specify a set of
grammar productions for parsing a string of symbols, and a set of
syntax-guided transformations to be applied once the string is parsed.
As it stands now, I've only been able to think of meaningful definitions
for using unambiguous, context-free grammars; the transformations, though
allow the model to simulate a Turing machine, so it is powerful. I would
like, however, to allow unrestricted grammars (for generality) and
ambiguity (for nondeterminism), so it still needs some work.

This is similar to the idea behind a parser generator, but it fully
defines the semantics of a language as well as the syntax; and unlike
programs such as yacc, no coding is necessary in a different language.
I'll describe in more detail what I'm thinking of, but please tell me if
you've heard of anything similar.

(I'm not about to give a course in formal language theory, so I apoligize
 to anyone without a computer science background.)

The first part of the computation definition is simply a context-free
grammar. This is used to parse the input string at the start of and
throughout the computation.

The real meat of the definition is in the second part, the transformation
productions. Each transformation consists of a non-terminal symbol from
the grammar and two strings of terminals and non-terminals which may be
derived from the given non-terminal. Additionally, non-terminals in the
first string may be bound to variable names, to be used in the second
string; these are analogous to the parameters of a function.

After the input string is initially parsed, the transformations are
applied from the bottom up (since there are probably many leaves of the
parse tree, this is naturally parallelizable). For each node of the parse
tree whose children are all leaves, if there is a transformation matching
the node, it is applied (as defined below). If there is no matching
transformation, the children are pushed up the tree, taking the place of
the parent node; the process is continued until the root of the parse
tree is replaced.

For a transformation to match a node, the non-terminal at that node must
be the same as the non-terminal associated with the transformation, and
the children of the node must be derivable from the first string of the
transformation (an example will be given later). If a non-terminal in the
string is bound to a variable, the symbols derived from that non-terminal
are substituted for the variable name in the second string of the
transformation. The second string, with any substitutions, then replaces
the leaves in the tree, and the string is parsed and transformed again.

I hope that makes sense. Here's an example:

The notation I use should be familiar. Non-terminal symbols are enclosed
in angled-braces (e.g., <expr>). Terminal symbols are enclosed in quotes
("+"). Variables are simple identifiers. A non-terminal bound to a
variable is signified by putting the variable name within the brackets
of the non-terminal (<number x>). The empty string is <>. A grammar
production starts with a non-terminal, followed by "::=", and the various
derivatives are separated by "|". A production is terminated by a period.
A transformation rule is given as "non-terminal ... string1 => string2 ."
Multiple transformations with the same initial non-terminal may be
separated by bars ("non-terminal ... string1 => string2 | string3 =>
string4 ."). Similarly, grammar productions and transformation rules may
be combined by replacing the "." of the production with the "..." of the

This describes the simple computation of a sum of unary numbers:

  <expr>   ::= <expr> "+" <number> | <number>
           ... <number x> "+" <number y> => x y .
  <number> ::= "1" <number> | <> .

To operate on "111+1+11", the system first parses the string. Since there
are no transformations bound to <number>, I will treat entire numbers as
units. So, the string is parsed as (111+1)+11; the <expr> ::= "111+1" is
transformed first, by binding "111" to x and "1" to y, and concatenating
them by the rule "x y" to get "1111". This is parsed again, but since
there are no transformations on numbers, the string is eventually popped
up to replace the <expr> node in the parse tree, leaving "1111+11", which
matches the transformation rule; x takes the value "1111", y takes the
value "11", and they are concatenated to give "111111", which is the
final result (3+1+2=6).

More complex expressions are formed similarly; to add parentheses, for

  <expr>   ::= <expr> "+" <term> | <term>
           ... <number x> "+" <number y> => x y .
  <term>   ::= <number> | "(" <expr> ")"
           ... "(" <number n> ")" => n .
  <number> ::= "1" <number> | <> .

System objects such as I/O streams can be implemented as partially-
evaluated strings. Two-way communication between portions of the parse
tree are possible, but would be more natural with an unrestricted
grammar. Variable binding in transformations becomes tricky when the
grammar is unrestricted, though. Allowing ambiguity may be useful for
some strange computations--and that would include allowing the second
string of a transformation to be not derivable from the non-terminal,
forcing the entire string to be re-parsed after any transformation.
Unrestricted grammars would require three strings for a transformation,
rather than a non-terminal symbol and two strings, which may also affect
the ability to locally restrict re-parsing.

The specification of productions and rules could be combined with the
string to be operated upon and fed to a universal machine. This would
allow any language to be used, and would make the specification language
reflective. For example, the input file could be something like:

       <expr>   ::= <expr> "+" <term> | <term>
                ... <number x> "+" <number y> => x y .
       <term>   ::= <number> | "(" <expr> ")"
                ... "(" <number n> ")" => n .
       <number> ::= "1" <number> | <> .

The output would then be "11111111111111111".

I have more ideas regarding modularity and implementation efficiency, but
this is getting long, so I'll wait for comments.

- Justin Sampson - -