This section discusses the options for including semantic information in grammars.
Semantic information may be attached to productions in the conventional way, but when more than one analysis is possible, the use of the semantic information must change. Two schemes have been implemented, which we call tree decoding and label decoding. The former is for simple applications, where there is not much ambiguity and hence where the effective unpacking of the parse forest isn't a factor. This mode is quite similar to the standard mode in Happy. The latter is for serious applications, where sharing is important and where processing of the forest (eg filtering) is needed. Here, the emphasis is about providing rich labels in nodes of the the parse forest, to support such processing.
The default mode is labelling. If you want the tree decode mode,
use the --decode
flag.
Tree decoding corresponds to unpacking the parse forest to individual
trees and collecting the list of semantic results computed from
each of these. It is a mode intended for simple applications,
where there is limited ambiguity.
You may access semantic results from components of a reduction
using the dollar variables.
As a working example, the following is taken from the
expr-tree
grammar in the examples.
Note that the type signature is required, else the types in use
can't be determined by the parser generator.
E :: {Int} -- type signature needed : E '+' E { $1 + $3 } | E '*' E { $1 * $3 } | i { $1 }
This mode works by converting each of the semantic rules into
functions (abstracted over the dollar variables mentioned),
and labelling each Branch
created from a
reduction of that rule with the function value.
This amounts to delaying the action of the
rule, since we must wait until we know the results of all of
the sub-analyses before computing any of the results. (Certain
cases of packing can add new analyses at a later stage.)
At the end of parsing, the functions are applied across relevant
sub-analyses via a recursive descent. The main interface to this
is via the class and entry function below. Typically,
decode
should be called on the root of the
forest, also supplying a function which maps node names to their
list of analyses (typically a partial application of lookup in
the forest value).
The result is a list of semantic values.
Note that the context of the call to decode
should (eventually) supply a concrete type to allow selection
of appropriate instance. Ie, you have to indicate in some way
what type the semantic result should have.
Decode_Result a
is a synonym generated by
Happy: for non-monadic semantics,
it is equivalent to a
; when monads are
in use, it becomes the declared monad type.
See the full expr-eval
example for more
information.
class TreeDecode a where decode_b :: (ForestId -> [Branch]) -> Branch -> [Decode_Result a] decode :: TreeDecode a => (ForestId -> [Branch]) -> ForestId -> [Decode_Result a]
The GLR parser generator identifies the types involved in each
semantic rule, hence the types of the functions, then creates
a union containing distinct types. Values of this union are
stored in the branches. (The union is actually a bit more complex:
it must also distinguish patterns of dollar-variable usage, eg
a function \x y -> x + y
could be applied to
the first and second constituents, or to the first and third.)
The parser generator also creates instances of the
TreeDecode
class, which unpacks the semantic
function and applies it across the decodings of the possible
combinations of children. Effectively, it does a cartesian product
operation across the lists of semantic results from each of the
children. Eg [1,2] "+" [3,4]
produces
[4,5,5,6]
.
Information is extracted from token values using the patterns
supplied by the user when declaring tokens and their Haskell
representation, so the dollar-dollar convention works also.
The decoding process could be made more efficient by using memoisation techniques, but this hasn't been implemented since we believe the other (label) decoding mode is more useful. (If someone sends in a patch, we may include it in a future release -- but this might be tricky, eg require higher-order polymorphism? Plus, are there other ways of using this form of semantic function?)
The labelling mode aims to label branches in the forest with
information that supports subsequent processing, for example
the filtering and prioritisation of analyses prior to extraction
of favoured solutions. As above, code fragments are given in
braces and can contain dollar-variables. But these variables
are expanded to node names in the graph, with the intention of
easing navigation.
The following grammar is from the expr-tree
example.
E :: {Tree ForestId Int} : E '+' E { Plus $1 $3 } | E '*' E { Times $1 $3 } | i { Const $1 }
Here, the semantic values provide more meaningful labels than
the plain structural information. In particular, only the
interesting parts of the branch are represented, and the
programmer can clearly select or label the useful constituents
if required. There is no need to remember that it is the first
and third child in the branch which we need to extract, because
the label only contains those values (the `noise' has been dropped).
Consider also the difference between concrete and abstract syntax.
The labels are oriented towards abstract syntax.
Tokens are handled slightly differently here: when they appear
as children in a reduction, their informational content can
be extracted directly, hence the Const
value
above will be built with the Int
value from
the token, not some ForestId
.
Note the useful technique of making the label types polymorphic in the position used for forest indices. This allows replacement at a later stage with more appropriate values, eg. inserting lists of actual subtrees from the final decoding.
Use of these labels is supported by a type class
LabelDecode
, which unpacks values of the
automatically-generated union type GSem
to the original type(s). The parser generator will create
appropriate instances of this class, based on the type information
in the grammar file. (Note that omitting type information leads
to a default of ()
.)
Observe that use of the labels is often like traversing an abstract
syntax, and the structure of the abstract syntax type usually
constrains the types of constituents; so once the overall type
is fixed (eg. with a type cast or signature) then there are no
problems with resolution of class instances.
class LabelDecode a where unpack :: GSem -> a
Internally, the semantic values are packed in a union type as
before, but there is no direct abstraction step. Instead, the
ForestId
values (from the dollar-variables)
are bound when the corresponding branch is created from the
list of constituent nodes. At this stage, token information
is also extracted, using the patterns supplied by the user
when declaring the tokens.
You can use the %monad
directive in the
tree-decode mode.
Essentially, the decoding process now creates a list of monadic
values, using the monad type declared in the directive.
The default handling of the semantic functions is to apply the
relevant return
function to the value being
returned. You can over-ride this using the {% ... }
convention. The declared (>>=)
function is
used to assemble the computations.
Note that no attempt is made to share the results of monadic computations from sub-trees. (You could possibly do this by supplying a memoising lookup function for the decoding process.) Hence, the usual behaviour is that decoding produces whole monadic computations, each part of which is computed afresh (in depth-first order) when the whole is computed. Hence you should take care to initialise any relevant state before computing the results from multiple solutions.
This facility is experimental, and we welcome comments or
observations on the approach taken!
An example is provided (examples/glr/expr-monad
).
It is the standard example of arithmetic expressions, except that
the IO
monad is used, and a user exception is
thrown when the second argument to addition is an odd number.
Running this example will show a zero (from the exception handler)
instead of the expected number amongst the results from the other
parses.