9.2 Extendable grammars
Grammar rules can be added with the Grammar command. This
command is just an interface towards Camlp4, providing the
semantic actions so that they build the expected AST. A simple grammar
command has the following syntax:
Grammar entry nonterminal := rule-name LMP ->
action .
The components have the following meaning:
-
a grammar name: defined by a parser entry and a non-terminal.
Non-terminals are packed in an entry (also called
universe). One can have two non-terminals of the same name if they
are in different entries. A non-terminal can have the same name as
its entry.
- a rule (sometimes called production), formed by a name, a left
member of production and an action, which generalizes constructive
expressions.
grammar |
::= |
Grammar entry gram-entry with ... with gram-entry |
entry |
::= |
ident |
gram-entry |
::= |
rule-name [: entry-type] := [production | ... | production] |
rule-name |
::= |
ident |
entry-type |
::= |
Ast | List |
production |
::= |
rule-name [ [prod-item ... prod-item] ] ->
action |
rule-name |
::= |
ident |
prod-item |
::= |
string |
|
| |
[entry :] entry-name [( meta )] |
action |
::= |
[ [ast ... ast] ] |
|
| |
let pattern = action in action |
|
| |
case action [: entry-type] of [case | ... | case] esac |
case |
::= |
[pattern ... pattern] -> action |
pattern |
::= |
ast |
Figure 9.2: Syntax of the grammar extension command
The exact syntax of the Grammar command is defined
in Fig. 9.2.
It is possible to extend a grammar with several rules at once.
Grammar entry nonterminal |
:= |
production1 |
|
| |
· · · |
|
| |
productionn . |
Productions are entered in reverse order (i.e. productionn before
production1), so that the first rules have priority over the last
ones. The set of rules can be read as an usual pattern matching.
Also, we can extend several grammars of a given universe at
the same time. The order of non-terminals does not matter since they
extend different grammars.
Grammar |
entry |
nonterminal1 |
:= |
production11 |
|
|
|
| |
· · · |
|
|
|
| |
|
|
with |
· · · |
|
|
|
with |
nonterminalp |
:= |
production1p |
|
|
|
| |
· · · |
|
|
|
| |
|
9.2.1 Grammar entries
Let us describe the four predefined entries. Each of them (except prim) possesses an initial grammar for starting the parsing process.
-
prim
: it is the entry of the primitive grammars. Most of
them cannot be defined by the extendable grammar mechanism. They are
encoded inside the system. The entry contains the following
non-terminals:
-
var
: variable grammar. Parses an identifier and builds
an AST which is a variable.
ident
: identifier grammar. Parses an identifier and
builds an AST which is an identifier such as {x}
.
number
: number grammar. Parses a positive integer.
string
: string grammar. Parses a quoted string.
path
: section path grammar.
ast
: AST grammar.
astpat
: AST pattern grammar.
astact
: action grammar.
The primitive grammars are used as the other grammars; for instance
the variables of terms are parsed by prim:var($id)
.
command
: it is the term entry. It allows to have a
pretty syntax for terms. Its initial grammar is command
command. This entry contains several non-terminals, among them command0 to command10 which stratify the terms according to
priority levels (0
to 10
). These priority levels allow
us also to specify the order of associativity of operators.
vernac
: it is the vernacular command entry, with vernac vernac as initial grammar. Thanks to it, the developers can
define the syntax of new commands they add to the system. As to
users, they can change the syntax of the predefined vernacular
commands.
tactic
: it is the tactic entry with tactics tactic
as initial grammar. This entry allows to define the syntax of new
tactics. See chapter 10 about user-defined tactics
for more details.
The user can define new entries and new non-terminals, using the
grammar extension command. A grammar does not have to be explicitly
defined. But the grammars in the left member of rules must all be
defined, possibly by the current grammar command. It may be convenient
to define an empty grammar, just so that it may be called by other
grammars, and extend this empty grammar later. Assume that the command command13 does not exist. The next command defines it with
zero productions.
Coq < Grammar command command13 := .
The grammars of new entries do not have an initial grammar. To use
them, they must be called (directly or indirectly) by grammars of
predefined entries. We give an example of a (direct) call of the
grammar newentry nonterm by command command. This
following rule allows to use the syntax a&b
for the conjunction
a/\b
.
Coq < Grammar newentry nonterm :=
Coq < ampersand [ "&" command:command($c) ] -> [$c].
Coq < Grammar command command :=
Coq < new_and [ command8($a) newentry:nonterm($b) ] -> [<<$a/\$b>>].
9.2.2 Left member of productions (LMP)
A LMP is composed of a combination of terminals (enclosed between
double quotes) and grammar calls specifying the entry. It is enclosed
between ``[
'' and ``]
''. The empty LMP, represented by
[ ]
, corresponds to e in formal language theory.
A grammar call is done by entry:nonterminal($id)
where:
-
entry
and nonterminal
specifies the entry of the grammar, and the non-terminal.
$id
is a metavariable that will receive the AST or AST
list resulting from the call to the grammar.
The elements entry
and $id
are optional. The grammar
entry can be omitted if it is the same as the entry of the
non-terminal we are extending. Also, $id
is omitted if we do
not want to get back the AST result. Thus a grammar call can be
reduced to a non-terminal.
Each terminal must contain exactly one token. This token does not need
to be already defined. If not, it will be automatically
added. Nevertheless, any string cannot be a token (e.g. blanks should
not appear in tokens since parsing would then depend on
indentation). We introduce the notion of valid token, as a
sequence, without blanks, of characters taken from the following list:
< > / \ - + = ; , | ! @ # % ^ & * ( ) ? : ~ $ _ ` ' a..z A..Z 0..9
that do not start with a character from
$ _ a..z A..Z ' 0..9
When an LMP is used in the parsing process of an expression, it is
analyzed from left to right. Every token met in the LMP should
correspond to the current token of the expression. As for the grammars
calls, they are performed in order to recognize parts of the initial
expression.
Warning: Unlike destructive expressions, if a variable appears several times in
the LMP, the last binding hides the previous ones. Comparison can be
performed only in the actions.
Example 1: Defining a syntax for inequality
The rule below allows us to use the syntax t1#t2
for the term
~t1=t2
.
Coq < Grammar command command1 :=
Coq < not_eq [ command0($a) "#" command0($b) ] -> [<<~($a=$b)>>].
The level 1 of the grammar of terms is extended with one rule named
not_eq. When this rule is selected, its LMP calls the
grammar command
command0
. This grammar recognizes a term
that it binds to the metavariable $a
. Then it meets the token
``#
'' and finally it calls the grammar command
command0
. This grammar returns the recognized term in
$b
. The action constructs the term ~($a=$b)
. Note that
we use the command quotation on the right-hand side.
Warning: Metavariables are identifiers preceded by the ``$
'' symbol.
They cannot be replaced by identifiers. For instance, if we enter a
rule with identifiers and not metavariables, an error occurs:
Coq < Grammar command command1 :=
Coq < not_eq [ command0(a) "#" command0(b) ] -> [<<~(a=b)>>].
Warning: Could not globalize a
Warning: Could not globalize b
Toplevel input, characters 49-50
> not_eq [ command0(a) "#" command0(b) ] -> [<<~(a=b)>>].
> ^
Error: This ident is not a metavariable.
For instance, let us give the statement of the symmetry of #
:
Coq < Goal (A:Set)(a,b:A) a#b -> b#a.
1 subgoal
============================
(A:Set; a,b:A)~a=b->~b=a
Conversely, one can force ~a=b
to be printed a#b
by
giving pretty-printing rules. This is explained in section 9.3.
Example 2: Redefining vernac commands
Thanks to the following rule, ``|- term.'' will have the same
effect as ``Goal term.''.
Coq < Grammar vernac vernac :=
Coq < thesis [ "|" "-" command:command($term) "." ]
Coq < -> [<:vernac:<Goal $term.>>].
This rule allows putting blanks between the bar and the
dash, as in
Coq < | - (A:Prop)A->A.
1 subgoal
============================
(A:Prop)A->A
Assuming the previous rule has not been entered, we can
forbid blanks with a rule that declares ``|-
'' as a single
token:
Coq < Grammar vernac vernac :=
Coq < thesis [ "|-" command:command($term) "." ]
Coq < -> [<:vernac:<Goal $term.>>].
Coq < | - (A:Prop)A->A.
Toplevel input, characters 0-1
> | - (A:Prop)A->A.
> ^
Syntax error: illegal begin of vernac
If both rules were entered, we would have three tokens
|
, -
and |-
. The lexical ambiguity on the string
|-
is solved according to the longest match rule (see lexical
conventions page X), i.e. |-
would be one single
token. To enforce the use of the first rule, a blank must be inserted
between the bar and the dash.
Remark: The vernac commands should always be terminated by a period. When a
syntax error is detected, the top-level discards its input until it
reaches a period token, and then resumes parsing.
Example 3: Redefining tactics
We can give names to repetitive tactic sequences. Thus in this example
``IntSp'' will correspond to the tactic Intros followed by
Split.
Coq < Grammar tactic simple_tactic :=
Coq < intros_split [ "IntSp" ] -> [<:tactic:<Intros; Split>>].
Let us check that this works.
Coq < Goal (A,B:Prop)A/\B -> B/\A.
1 subgoal
============================
(A,B:Prop)A/\B->B/\A
Coq < IntSp.
2 subgoals
A : Prop
B : Prop
H : A/\B
============================
B
subgoal 2 is:
A
Note that the same result can be obtained in a simpler way with Tactic Definition (see page X). However,
this macro can only define tactics which arguments are terms.
Example 4: Priority, left and right associativity of operators
The disjunction has a higher priority than conjunction. Thus
A/\B\/C
will be parsed as (A/\B)\/C
and not as
A/\(B\/C)
. The priority is done by putting the rule for the
disjunction in a higher level than that of conjunction: conjunction is
defined in the non-terminal command6 and disjunction in command7 (see file Logic.v in the library). Notice that
the character ``\
'' must be doubled (see lexical conventions
for quoted strings on page X).
Coq < Grammar command command6 :=
Coq < and [ command5($c1) "/\\" command6($c2) ] -> [<<(and $c1 $c2)>>].
Coq < Grammar command command7 :=
Coq < or [ command6($c1) "\\/" command7($c2) ] -> [<<(or $c1 $c2)>>].
Thus conjunction and disjunction associate to the right since in both
cases the priority of the right term (resp. command6 and command7) is higher than the priority of the left term (resp. command5 and command6). The left member of a conjunction cannot
be itself a conjunction, unless you enclose it inside parenthesis.
The left associativity is done by calling recursively the
non-terminal. Camlp4 deals with this recursion by first trying
the non-left-recursive rules. Here is an example taken from the
standard library, defining a syntax for the addition on integers:
Coq < Grammar znatural expr :=
Coq < expr_plus [ expr($p) "+" expr($c) ] -> [<<(Zplus $p $c)>>].
9.2.3 Actions
Every rule should generate an AST corresponding to the syntactic
construction that it recognizes. This generation is done by an
action. Thus every rule is associated to an action. The syntax has
been defined in Fig. 9.2. We give some examples.
Simple actions
A simple action is an AST enclosed between ``[
'' and
``]
''. It simply builds the AST by interpreting it as a
constructive expression in the environment defined by the LMP. This
case has been illustrated in all the previous examples. We will later
see that grammars can also return AST lists.
Local definitions
When an action should generate a big term, we can use
let pattern = action1 in action2 expressions to
construct it progressively. The action action1 is first
computed, then it is matched against pattern which may bind
metavariables, and the result is the evaluation of action2
in this new context.
Example 5:
From the syntax t1*+t2
, we generate the term
(plus (plus t1 t2) (mult t1 t2)).
Coq < Grammar command command1 :=
Coq < mult_plus [ command0($a) "*" "+" command0($b) ]
Coq < -> let $p1=[<<(plus $a $b)>>] in
Coq < let $p2=[<<(mult $a $b)>>] in
Coq < [<<(plus $p1 $p2)>>].
Let us give an example with this syntax:
Coq < Goal (O*+O)=O.
1 subgoal
============================
(plus (plus O O) (mult O O))=O
Conditional actions
We recall the syntax of conditional actions:
case action of pattern1 ->
action1 | ··· | patternn ->
actionn esac
The action to execute is chosen according to the value of
action. The matching is performed from left to right. The
selected action is the one associated to the first pattern that
matches the value of action. This matching operation will
bind the metavariables appearing in the selected pattern. The pattern
matching does need being exhaustive, and no warning is emitted. When the
pattern matching fails a message reports in which grammar rule the
failure happened.
Example 6: Overloading the ``+'' operator
The internal representation of an expression such as A+B depends
on the shape of A and B:
-
{P}+{Q}
uses sumbool
- otherwise,
A+{Q}
uses sumor
- otherwise,
A+B
uses sum.
The trick is to build a temporary AST: {A}
generates the node
(SQUASH A)
. When we parse A+B
, we remove the SQUASH in A and B:
Coq < Grammar command command1 :=
Coq < squash [ "{" lcommand($lc) "}" ] -> [(SQUASH $lc)].
Coq < Grammar command lassoc_command4 :=
Coq < squash_sum
Coq < [ lassoc_command4($c1) "+" lassoc_command4($c2) ] ->
Coq < case [$c2] of
Coq < (SQUASH $T2) ->
Coq < case [$c1] of
Coq < (SQUASH $T1) -> [<<(sumbool $T1 $T2)>>]
Coq < | $_ -> [<<(sumor $c1 $T2)>>]
Coq < esac
Coq < | $_ -> [<<(sum $c1 $c2)>>]
Coq < esac.
The problem is that sometimes, the intermediate SQUASH
node cannot re-shaped, then we have a very specific error:
Coq < Check {True}.
Toplevel input, characters 6-12
> Check {True}.
> ^^^^^^
Error: Unrecognizable braces expression.
Example 7: Comparisons and non-linear patterns
The patterns may be non-linear: when an already bound metavariable
appears in a pattern, the value yielded by the pattern matching must
be equal, up to renaming of bound variables, to the current
value. Note that this does not apply to the wildcard $_
. For
example, we can compare two arguments:
Coq < Grammar command command10 :=
Coq < refl_equals [ command9($c1) "||" command9($c2) ] ->
Coq < case [$c1] of $c2 -> [<<(refl_equal ? $c2)>>] esac.
Coq < Check ([x:nat]x || [y:nat]y).
(refl_equal nat->nat [y:nat]y)
: ([y:nat]y)=([y:nat]y)
The metavariable $c1
is bound to [x:nat]x
and
$c2
to [y:nat]y
. Since these two values are equal, the
pattern matching succeeds. It fails when the two terms are not equal:
Coq < Check ([x:nat]x || [z:bool]z).
Toplevel input, characters 7-28
> Check ([x:nat]x || [z:bool]z).
> ^^^^^^^^^^^^^^^^^^^^^
Error: during interpretation of grammar rule refl_equals,
Grammar case failure. The ast (LAMBDALIST nat [x]x)
does not match any of the patterns : $c2
with constraints :
$c1 = (LAMBDALIST nat [x]x)
$c2 = (LAMBDALIST bool [z]z)
9.2.4 Grammars of type List
Assume we want to define an non-terminal ne_identarg_list that
parses an non-empty list of identifiers. If the grammars could only
return AST's, we would have to define it this way:
Coq < Grammar tactic my_ne_ident_list :=
Coq < ident_list_cons [ identarg($id) my_ne_ident_list($l) ] ->
Coq < case [$l] of
Coq < (IDENTS ($LIST $idl)) -> [(IDENTS $id ($LIST $idl))]
Coq < esac
Coq < | ident_list_single [ identarg($id) ] -> [(IDENTS $id)].
But it would be inefficient: every time an identifier is read, we
remove the ``boxing'' operator IDENTS, and put it back once the
identifier is inserted in the list.
To avoid these awkward trick, we allow grammars to return AST
lists. Hence grammars have a type (Ast or List), just like
AST's do. Type-checking can be done statically.
The simple actions can produce lists by putting a list of constructive
expressions one beside the other. As usual, the $LIST operator
allows to inject AST list variables.
Coq < Grammar tactic ne_identarg_list : List :=
Coq < ne_idl_cons [ identarg($id) ne_identarg_list($idl) ]
Coq < -> [ $id ($LIST $idl) ]
Coq < | ne_idl_single [ identarg($id) ] -> [ $id ].
Note that the grammar type must be recalled in every extension
command, or else the system could not discriminate between a single
AST and an AST list with only one item. If omitted, the default type
is Ast. The following command fails because ne_identarg_list is already defined with type List but the
Grammar command header assumes its type is Ast.
Coq < Grammar tactic ne_identarg_list :=
Coq < list_two [ identarg($id1) identarg($id2) ] -> [ $id1 $id2 ].
Toplevel input, characters 15-31
> Grammar tactic ne_identarg_list :=
> ^^^^^^^^^^^^^^^^
Error: Entry tactic:ne_identarg_list already exists with another type
All rules of a same grammar must have the same type. For instance, the
following rule is refused because the command command1
grammar
has been already defined with type Ast, and cannot be extended
with a rule returning AST lists.
Coq < Grammar command command1 :=
Coq < carret_list [ command0($c1) "^" command0($c2)] -> [ $c1 $c2 ].
Toplevel input, characters 146-153
> carret_list [ command0($c1) "^" command0($c2)] -> [ $c1 $c2 ].
> ^^^^^^^
Error: entry type is Ast, but the right hand side is a list
9.2.5 Limitations
The extendable grammar mechanism have four serious limitations. The
first two are inherited from Camlp4.
-
Grammar rules are factorized syntactically: Camlp4 does not
try to expand non-terminals to detect further factorizations. The
user must perform the factorization himself.
- The grammar is not checked to be LL(1) when
adding a rule. If it is not LL(1), the parsing may fail on an input
recognized by the grammar, because selecting the appropriate rule
may require looking several tokens ahead. Camlp4 always selects
the most recent rule (and all those that factorize with it)
accepting the current token.
- There is no command to remove a grammar rule. However there is a
trick to do it. It is sufficient to execute the ``Reset''
command on a constant defined before the rule we want to remove.
Thus we retrieve the state before the definition of the constant,
then without the grammar rule. This trick does not apply to grammar
extensions done in Objective Caml.
- Grammar rules defined inside a section are automatically removed
after the end of this section: they are available only inside it.
The command Print Grammar prints the rules of a grammar. It is
displayed by Camlp4. So, the actions are not printed, and the
recursive calls are printed SELF
. It is
sometimes useful if the user wants to understand why parsing fails, or
why a factorization was not done as expected.
Coq < Print Grammar command command8.
[ LEFTA
[ Command.command7; "<->"; SELF
| Command.command7; "->"; SELF
| Command.command7 ] ]
Getting round the lack of factorization
The first limitation may require a non-trivial work, and may lead to
ugly grammars, hardly extendable. Sometimes, we can use a trick to
avoid these troubles. The problem arises in the Gallina syntax, to
make Camlp4 factorize the rules for application and product. The
natural grammar would be:
Coq < Grammar command command0 :=
Coq < parenthesis [ "(" command10($c) ")" ] -> [$c]
Coq < | product [ "(" prim:var($id) ":" command($c1) ")" command0($c2) ] ->
Coq < [(PROD $c1 [$id]$c2)]
Coq < with command10 :=
Coq < application [ command91($c1) ne_command91_list($lc) ] ->
Coq < [(APPLIST $c1 ($LIST $lc))]
Coq < | inject_com91 [ command91($c) ] -> [$c].
Coq < Check (x:nat)nat.
Toplevel input, characters 8-9
> Check (x:nat)nat.
> ^
Syntax error: ')' expected after [Command.command10] (in [Command.command0])
But the factorization does not work, thus the product rule is never
selected since identifiers match the command10 grammar. The
trick is to parse the ident as a command10 and check a
posteriori that the command is indeed an identifier:
Coq < Grammar command command0 :=
Coq < product [ "(" command10($c) ":" command($c1) ")" command0($c2) ] ->
Coq < [(PROD $c1 [$c]$c2)].
Coq < Check (x:nat)nat.
nat->nat
: Set
We could have checked it explicitly with a case in
the right-hand side of the rule, but the error message in the
following example would not be as relevant:
Coq < Check (S O:nat)nat.
Toplevel input, characters 7-10
> Check (S O:nat)nat.
> ^^^
Error: during interpretation of grammar rule product,
the variable $c was bound to (APPLIST S O)
instead of a (list of) variable(s).
This trick is not similar to the SQUASH node in which
we could not detect the error while parsing. Here, the error pops out
when trying to build an abstraction of $c2 over the value of
$c. Since it is not bound to a variable, the right-hand side of
the product grammar rule fails.