Other useful information...
The directory examples/glr
contains several examples
from the small to the large. Please consult these or use them as a
base for your experiments.
If you run the examples with GHC, each run will produce a file
out.daVinci. This is a graph in the format expected by the daVinci
graph visualization tool.
(See http://www.informatik.uni-bremen.de/~davinci/
for more information. Educational use licenses are currently
available without charge.)
We highly recommend looking at graphs of parse results - it really helps to understand the results. The graph files are created with Sven Panne's library for communicating with daVinci, supplemented with some extensions due to Callaghan. Copies of this code are included in the examples directory, for convenience. If you are trying to view large and complex graphs, contact Paul Callaghan (there are tools and techniques to make the graphs more manageable).
GLR parsing (and related techniques) aren't just for badly written grammars or for things like natural language (NL) where ambiguity is inescapable. There are applications where ambiguity can represent possible alternatives in pattern-matching tasks, and the flexibility of these parsing techniques and the resulting graphs support deep analyses. Below, we briefly discuss some examples, a mixture from our recent work and from the literature.
Combinations of structures within gene sequences can be expressed as a grammar, for example a "start" combination followed by a "promoter" combination then the gene proper. A recent undergraduate project has used this GLR implementation to detect candidate matches in data, and then to filter these matches with a mixture of local and global information.
Rhythmic patterns in (English) poetry obey certain rules, and in more modern poetry can break rules in particular ways to achieve certain effects. The standard rhythmic patterns (eg. iambic pentameter) can be encoded as a grammar, and deviations from the patterns also encoded as rules. The neutral reading can be parsed with this grammar, to give a forest of alternative matches. The forest can be analysed to give a preferred reading, and to highlight certain technical features of the poetry. An undergraduate project in Durham has used this implementation for this purpose, with promising results.
Recent work has phrased the translation problem in compilers from intermediate representation to an instruction set for a given processor as a matching problem. Different constructs at the intermediate level can map to several combinations of machine instructions. This knowledge can be expressed as a grammar, and instances of the problem solved by parsing. The parse forest represents competing solutions, and allows selection of optimum solutions according to various measures.
The extra flexibility of GLR parsing can simplify parsing of formal languages where a degree of `informality' is allowed, for example Html parsing. Modern browsers contain complex parsers which are designed to extract useful information from Html text which doesn't follow the rules precisely, e.g. missing start tags or missing end tags. Html with missing tags can be described by an ambiguous grammar, and it should be a simple matter to extract a usable interpretation from a forest of parses. Notice the technique: we widen the scope of the grammar, parse with GLR, then extract a reasonable solution. This is arguably simpler than pushing an LR(1) or LL(1) parser past its limits, and also more maintainable.
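As a rough illustration, a missing end tag can be admitted by making it optional in the grammar. The fragment below is a hypothetical Happy-style sketch (the token and nonterminal names are invented, and this is not a complete Html grammar):

```
-- a paragraph element with an optional end tag; with nested content,
-- input such as <p>a<p>b admits several bracketings, and the GLR
-- parse forest records all of them for later selection
para : open_p content close_p    { () }  -- well-formed paragraph
     | open_p content            { () }  -- end tag omitted
```

A later pass over the forest can then prefer, say, the reading with the fewest omitted tags.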
Ambiguity is inescapable in the syntax of most human languages. In realistic systems, parse forests are useful to encode competing analyses in an efficient way, and they also provide a framework for further analysis and disambiguation. Note that ambiguity can have many forms, from simple phrase attachment uncertainty to more subtle forms involving mixtures of word senses. If some degree of ungrammaticality is to be tolerated in a system, which can be done by extending the grammar with productions incorporating common forms of infelicity, the degree of ambiguity increases further. For systems used on arbitrary text, such as on newspapers, it is not uncommon that many sentences permit several hundred or more analyses. With such grammars, parse forest techniques are essential. Many recent NLP systems use such techniques, including Durham's earlier LOLITA system - which was mostly written in Haskell.
The original implementation was developed by Ben Medlock, as his undergraduate final year project, using ideas from Peter Ljungloef's Licentiate thesis (see http://www.cs.chalmers.se/~peb/parsing, and we recommend the thesis for its clear analysis of parsing algorithms). Ljungloef's version produces lists of parse trees, but Medlock adapted this to produce an explicit graph containing parse structure information. He also incorporated the code into Happy.
After Medlock's graduation, Callaghan extended the code to incorporate semantic information, and made several improvements to the original code, such as improved local packing and support for hidden left recursion. The performance of the code was significantly improved, after changes of representation (eg to a chart-style data structure) and technique. Medlock's code was also used in several student projects, including analysis of gene sequences (Fischer) and analysis of rhythmic patterns in poetry (Henderson).
The current code implements the standard GLR algorithm extended
to handle hidden left recursion. Such recursion, as in the grammar
below from Rekers [1992], causes the standard algorithm to loop
because the empty reduction A -> is always possible and the LR
parser will not change state. Alternatively, there is a problem
because an unknown (at the start of parsing) number of A items
are required, to match the number of i tokens in the input.

   S -> A Q i | +
   A ->
The solution to this is not surprising. Problematic recursions
are detected as zero-span reductions in a state which has a
goto table entry looping to itself. A special
symbol is pushed to the stack on the first such reduction,
and such reductions are done at most once for any token
alternative for any input position.
When popping from the stack, if the last token being popped
is such a special symbol, then two stack tails are returned: one
corresponding to a conventional pop (which removes the
symbol) and the other to a duplication of the special symbol
(the stack is not changed, but a copy of the symbol is returned).
This allows sufficient copies of the empty symbol to appear
on some stack, hence allowing the parse to complete.
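The pop behaviour described above can be pictured with a toy Haskell sketch. This is not Happy's internal code - the Sym type and popN function are invented for illustration - but it shows the key idea: popping past the special symbol yields two alternative stack tails.

```haskell
-- Toy model of GLR stack popping with the special empty-reduction
-- marker. Popping n symbols normally yields one stack tail, but if
-- the last symbol popped is the special marker, two tails result:
-- one with the marker removed, one with the marker duplicated
-- (the stack is left unchanged).
data Sym = Tok Char | EmptyMarker deriving (Eq, Show)

popN :: Int -> [Sym] -> [[Sym]]
popN 0 st = [st]
popN n (s : rest)
  | n == 1 && s == EmptyMarker = [rest, s : rest]  -- both tails
  | otherwise                  = popN (n - 1) rest
popN _ [] = []                                     -- underflow: no tails
```

Returning both tails is what lets later reductions consume further copies of the marker, so enough empty symbols become available for the parse to complete.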
The forest is held in a chart-style data structure, and this supports local ambiguity packing (chart parsing is discussed in Ljungloef's thesis, among other places). A limited amount of packing of live stacks is also done, to avoid some repetition of work.
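Local ambiguity packing can be sketched as follows - again hypothetical Haskell, not the actual chart representation used in the implementation. Forest nodes are keyed by (category, start position, end position), and competing derivations of the same key are packed into a single node rather than duplicating the subtree:

```haskell
-- Hypothetical sketch of local ambiguity packing: a forest maps each
-- (category, start, end) key to its alternative derivations, where
-- each alternative lists the keys of its children.
type Key    = (String, Int, Int)
type Forest = [(Key, [[Key]])]

-- Record one more derivation for a node, packing it with any
-- alternatives already stored under the same key.
addDeriv :: Key -> [Key] -> Forest -> Forest
addDeriv k kids f = case lookup k f of
  Just alts -> (k, kids : alts) : filter ((/= k) . fst) f
  Nothing   -> (k, [kids]) : f
```

Because anything built on top of a packed node is shared between its alternatives, the forest can stay compact even when the number of distinct parses is very large.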
[Rekers 1992] Parser Generation for Interactive Environments, PhD thesis, University of Amsterdam, 1992.
You might have noticed this GLR-related option. It is an experimental feature intended to restrict the amount of structure retained in the forest by discarding everything not required for the semantic results. It may or may not work, and may be fixed in a future release.
The parser supports hidden left recursion, but makes no attempt
to handle cyclic grammars that have rules which do not consume any
input. If you have a grammar like this, for example with rules
like S -> S or S -> A S | x; A -> empty, the implementation
will loop until you run out of stack - but if it is going to
happen, it often happens quite quickly!
The code has been used and tested frequently over the past few years, including being used in several undergraduate projects. It should be fairly stable, but as usual, can't be guaranteed bug-free. One day I will write it in Epigram!
If you have suggestions for improvements, or requests for features, please contact Paul Callaghan. There are some changes I am considering, and some views and/or encouragement from users will be much appreciated. Further information can be found on Callaghan's GLR parser page.