9.2 Extendable grammars

Grammar rules can be added with the Grammar command. This command is just an interface towards Camlp4, providing the semantic actions so that they build the expected AST. A simple grammar command has the following syntax:

Grammar entry nonterminal := rule-name LMP -> action .

The components have the following meaning:

a grammar name: defined by a parser entry and a non-terminal. Non-terminals are packed in an entry (also called universe). One can have two non-terminals of the same name if they are in different entries. A non-terminal can have the same name as its entry.
a rule (sometimes called production), formed by a name, a left member of production and an action, which generalizes constructive expressions.

grammar ::= Grammar entry gram-entry with ... with gram-entry

entry ::= ident

gram-entry ::= rule-name [: entry-type] := [production | ... | production]

rule-name ::= ident

entry-type ::= Ast | List

production ::= rule-name [ [prod-item ... prod-item] ] -> action

rule-name ::= ident

prod-item ::= string

| [entry :] entry-name [( meta )]

action ::= [ [ast ... ast] ]

| let pattern = action in action

| case action [: entry-type] of [case | ... | case] esac

case ::= [pattern ... pattern] -> action

pattern ::= ast

Figure 9.2: Syntax of the grammar extension command

The exact syntax of the Grammar command is defined in Fig. 9.2. It is possible to extend a grammar with several rules at once.

`Grammar` entry nonterminal	`:=`	production₁
	`\|`	· · ·
	`\|`	production_n `.`

Productions are entered in reverse order (i.e. production_n before production₁), so that the first rules have priority over the last ones. The set of rules can be read as an usual pattern matching.

Also, we can extend several grammars of a given universe at the same time. The order of non-terminals does not matter since they extend different grammars.

Grammar entry nonterminal₁ := production₁¹

| ·
·
·

|

production

n₁

with ·
·
·

with nonterminal_p := production₁^p

| ·
·
·

|

production

n_p

.

9.2.1 Grammar entries

Let us describe the four predefined entries. Each of them (except prim) possesses an initial grammar for starting the parsing process.

prim : it is the entry of the primitive grammars. Most of them cannot be defined by the extendable grammar mechanism. They are encoded inside the system. The entry contains the following non-terminals:
- var : variable grammar. Parses an identifier and builds an AST which is a variable.
- ident : identifier grammar. Parses an identifier and builds an AST which is an identifier such as {x}.
- number : number grammar. Parses a positive integer.
- string : string grammar. Parses a quoted string.
- path : section path grammar.
- ast : AST grammar.
- astpat : AST pattern grammar.
- astact : action grammar.
The primitive grammars are used as the other grammars; for instance the variables of terms are parsed by prim:var($id).
command : it is the term entry. It allows to have a pretty syntax for terms. Its initial grammar is command command. This entry contains several non-terminals, among them command0 to command10 which stratify the terms according to priority levels (0 to 10). These priority levels allow us also to specify the order of associativity of operators.
vernac : it is the vernacular command entry, with vernac vernac as initial grammar. Thanks to it, the developers can define the syntax of new commands they add to the system. As to users, they can change the syntax of the predefined vernacular commands.
tactic : it is the tactic entry with tactics tactic as initial grammar. This entry allows to define the syntax of new tactics. See chapter 10 about user-defined tactics for more details.

The user can define new entries and new non-terminals, using the grammar extension command. A grammar does not have to be explicitly defined. But the grammars in the left member of rules must all be defined, possibly by the current grammar command. It may be convenient to define an empty grammar, just so that it may be called by other grammars, and extend this empty grammar later. Assume that the command command13 does not exist. The next command defines it with zero productions.

Coq < Grammar command command13 := .

The grammars of new entries do not have an initial grammar. To use them, they must be called (directly or indirectly) by grammars of predefined entries. We give an example of a (direct) call of the grammar newentry nonterm by command command. This following rule allows to use the syntax a&b for the conjunction a/\b.

Coq < Grammar newentry nonterm :=
Coq < ampersand [ "&" command:command($c) ] -> [$c].

Coq < Grammar command command :=
Coq < new_and [ command8($a) newentry:nonterm($b) ] -> [<<$a/\$b>>].

9.2.2 Left member of productions (LMP)

A LMP is composed of a combination of terminals (enclosed between double quotes) and grammar calls specifying the entry. It is enclosed between ``['' and ``]''. The empty LMP, represented by [ ], corresponds to e in formal language theory.

A grammar call is done by entry:nonterminal($id) where:

entry and nonterminal specifies the entry of the grammar, and the non-terminal.
$id is a metavariable that will receive the AST or AST list resulting from the call to the grammar.

The elements entry and $id are optional. The grammar entry can be omitted if it is the same as the entry of the non-terminal we are extending. Also, $id is omitted if we do not want to get back the AST result. Thus a grammar call can be reduced to a non-terminal.

Each terminal must contain exactly one token. This token does not need to be already defined. If not, it will be automatically added. Nevertheless, any string cannot be a token (e.g. blanks should not appear in tokens since parsing would then depend on indentation). We introduce the notion of valid token, as a sequence, without blanks, of characters taken from the following list:

< > / \ - + = ; , | ! @ # % ^ & * ( ) ? : ~ $ _ ` ' a..z A..Z 0..9

that do not start with a character from

$ _ a..z A..Z ' 0..9

When an LMP is used in the parsing process of an expression, it is analyzed from left to right. Every token met in the LMP should correspond to the current token of the expression. As for the grammars calls, they are performed in order to recognize parts of the initial expression.
Warning: Unlike destructive expressions, if a variable appears several times in the LMP, the last binding hides the previous ones. Comparison can be performed only in the actions.

Example 1: Defining a syntax for inequality

The rule below allows us to use the syntax t1#t2 for the term ~t1=t2.

Coq < Grammar command command1 :=
Coq < not_eq [ command0($a) "#" command0($b) ] -> [<<~($a=$b)>>].

The level 1 of the grammar of terms is extended with one rule named not_eq. When this rule is selected, its LMP calls the grammar command command0. This grammar recognizes a term that it binds to the metavariable $a. Then it meets the token ``#'' and finally it calls the grammar command command0. This grammar returns the recognized term in $b. The action constructs the term ~($a=$b). Note that we use the command quotation on the right-hand side.
Warning: Metavariables are identifiers preceded by the ``$'' symbol. They cannot be replaced by identifiers. For instance, if we enter a rule with identifiers and not metavariables, an error occurs:

Coq < Grammar command command1 :=
Coq <   not_eq [ command0(a) "#" command0(b) ] -> [<<~(a=b)>>].
Warning: Could not globalize a
Warning: Could not globalize b
Toplevel input, characters 49-50
>   not_eq [ command0(a) "#" command0(b) ] -> [<<~(a=b)>>].
>                     ^
Error: This ident is not a metavariable.

For instance, let us give the statement of the symmetry of #:

Coq < Goal (A:Set)(a,b:A) a#b -> b#a.
1 subgoal

  ============================
   (A:Set; a,b:A)~a=b->~b=a

Conversely, one can force ~a=b to be printed a#b by giving pretty-printing rules. This is explained in section 9.3.

Example 2: Redefining vernac commands

Thanks to the following rule, ``|- term.'' will have the same effect as ``Goal term.''.

Coq < Grammar vernac vernac :=
Coq < thesis [ "|" "-" command:command($term) "." ]
Coq < -> [<:vernac:<Goal $term.>>].

This rule allows putting blanks between the bar and the dash, as in

Coq < | - (A:Prop)A->A.
1 subgoal

  ============================
   (A:Prop)A->A

Assuming the previous rule has not been entered, we can forbid blanks with a rule that declares ``|-'' as a single token:

Coq < Grammar vernac vernac :=
Coq < thesis [ "|-" command:command($term) "." ]
Coq < -> [<:vernac:<Goal $term.>>].

Coq < | - (A:Prop)A->A.
Toplevel input, characters 0-1
> | - (A:Prop)A->A.
> ^
Syntax error: illegal begin of vernac

If both rules were entered, we would have three tokens |, - and |-. The lexical ambiguity on the string |- is solved according to the longest match rule (see lexical conventions page X), i.e. |- would be one single token. To enforce the use of the first rule, a blank must be inserted between the bar and the dash.
Remark: The vernac commands should always be terminated by a period. When a syntax error is detected, the top-level discards its input until it reaches a period token, and then resumes parsing.

Example 3: Redefining tactics

We can give names to repetitive tactic sequences. Thus in this example ``IntSp'' will correspond to the tactic Intros followed by Split.

Coq < Grammar tactic simple_tactic :=
Coq < intros_split [ "IntSp" ] -> [<:tactic:<Intros; Split>>].

Let us check that this works.

Coq < Goal (A,B:Prop)A/\B -> B/\A.
1 subgoal

  ============================
   (A,B:Prop)A/\B->B/\A

Coq < IntSp.
2 subgoals

  A : Prop
  B : Prop
  H : A/\B
  ============================
   B
subgoal 2 is:
A

Note that the same result can be obtained in a simpler way with Tactic Definition (see page X). However, this macro can only define tactics which arguments are terms.

Example 4: Priority, left and right associativity of operators

The disjunction has a higher priority than conjunction. Thus A/\B\/C will be parsed as (A/\B)\/C and not as A/\(B\/C). The priority is done by putting the rule for the disjunction in a higher level than that of conjunction: conjunction is defined in the non-terminal command6 and disjunction in command7 (see file Logic.v in the library). Notice that the character ``\'' must be doubled (see lexical conventions for quoted strings on page X).

Coq < Grammar command command6 :=
Coq < and [ command5($c1) "/\\" command6($c2) ] -> [<<(and $c1 $c2)>>].

Coq < Grammar command command7 :=
Coq < or [ command6($c1) "\\/" command7($c2) ] -> [<<(or $c1 $c2)>>].

Thus conjunction and disjunction associate to the right since in both cases the priority of the right term (resp. command6 and command7) is higher than the priority of the left term (resp. command5 and command6). The left member of a conjunction cannot be itself a conjunction, unless you enclose it inside parenthesis.

The left associativity is done by calling recursively the non-terminal. Camlp4 deals with this recursion by first trying the non-left-recursive rules. Here is an example taken from the standard library, defining a syntax for the addition on integers:

Coq < Grammar znatural expr :=
Coq < expr_plus [ expr($p) "+" expr($c) ] -> [<<(Zplus $p $c)>>].

9.2.3 Actions

Every rule should generate an AST corresponding to the syntactic construction that it recognizes. This generation is done by an action. Thus every rule is associated to an action. The syntax has been defined in Fig. 9.2. We give some examples.

Simple actions

A simple action is an AST enclosed between ``['' and ``]''. It simply builds the AST by interpreting it as a constructive expression in the environment defined by the LMP. This case has been illustrated in all the previous examples. We will later see that grammars can also return AST lists.

Local definitions

When an action should generate a big term, we can use let pattern = action₁ in action₂ expressions to construct it progressively. The action action₁ is first computed, then it is matched against pattern which may bind metavariables, and the result is the evaluation of action₂ in this new context.

Example 5:

From the syntax t1*+t2, we generate the term (plus (plus t1 t2) (mult t1 t2)).

Coq < Grammar command command1 :=
Coq <   mult_plus [ command0($a) "*" "+" command0($b) ]
Coq <      -> let $p1=[<<(plus $a $b)>>] in
Coq <         let $p2=[<<(mult $a $b)>>] in
Coq <           [<<(plus $p1 $p2)>>].

Let us give an example with this syntax:

Coq < Goal (O*+O)=O.
1 subgoal

  ============================
   (plus (plus O O) (mult O O))=O

Conditional actions

We recall the syntax of conditional actions:

case action of pattern₁ -> action₁ | ··· | pattern_n -> action_n esac

The action to execute is chosen according to the value of action. The matching is performed from left to right. The selected action is the one associated to the first pattern that matches the value of action. This matching operation will bind the metavariables appearing in the selected pattern. The pattern matching does need being exhaustive, and no warning is emitted. When the pattern matching fails a message reports in which grammar rule the failure happened.

Example 6: Overloading the ``+'' operator

The internal representation of an expression such as A+B depends on the shape of A and B:

{P}+{Q} uses sumbool
otherwise,A+{Q} uses sumor
otherwise, A+B uses sum.

The trick is to build a temporary AST: {A} generates the node (SQUASH A). When we parse A+B, we remove the SQUASH in A and B:

Coq < Grammar command command1 :=
Coq <   squash [ "{" lcommand($lc) "}" ] -> [(SQUASH $lc)].

Coq < Grammar command lassoc_command4 :=
Coq <   squash_sum
Coq <   [ lassoc_command4($c1) "+" lassoc_command4($c2) ] ->
Coq <       case [$c2] of
Coq <         (SQUASH $T2) ->
Coq <             case [$c1] of
Coq <               (SQUASH $T1) -> [<<(sumbool $T1 $T2)>>]
Coq <             | $_           -> [<<(sumor $c1 $T2)>>]
Coq <             esac
Coq <       | $_           -> [<<(sum $c1 $c2)>>]
Coq <       esac.

The problem is that sometimes, the intermediate SQUASH node cannot re-shaped, then we have a very specific error:

Coq < Check {True}.
Toplevel input, characters 6-12
> Check {True}.
> ^^^^^^
Error: Unrecognizable braces expression.

Example 7: Comparisons and non-linear patterns

The patterns may be non-linear: when an already bound metavariable appears in a pattern, the value yielded by the pattern matching must be equal, up to renaming of bound variables, to the current value. Note that this does not apply to the wildcard $_. For example, we can compare two arguments:

Coq < Grammar command command10 :=
Coq <   refl_equals [ command9($c1) "||" command9($c2) ] ->
Coq <               case [$c1] of $c2 -> [<<(refl_equal ? $c2)>>] esac.

Coq < Check ([x:nat]x || [y:nat]y).
(refl_equal nat->nat [y:nat]y)
     : ([y:nat]y)=([y:nat]y)

The metavariable $c1 is bound to [x:nat]x and $c2 to [y:nat]y. Since these two values are equal, the pattern matching succeeds. It fails when the two terms are not equal:

Coq < Check ([x:nat]x || [z:bool]z).
Toplevel input, characters 7-28
> Check ([x:nat]x || [z:bool]z).
>        ^^^^^^^^^^^^^^^^^^^^^
Error: during interpretation of grammar rule refl_equals,
Grammar case failure. The ast (LAMBDALIST nat [x]x)
does not match any of the patterns : $c2
with constraints :
  $c1 = (LAMBDALIST nat [x]x)
  $c2 = (LAMBDALIST bool [z]z)

9.2.4 Grammars of type `List`

Assume we want to define an non-terminal ne_identarg_list that parses an non-empty list of identifiers. If the grammars could only return AST's, we would have to define it this way:

Coq < Grammar tactic my_ne_ident_list :=
Coq <   ident_list_cons [ identarg($id) my_ne_ident_list($l) ] ->
Coq <     case [$l] of
Coq <       (IDENTS ($LIST $idl)) -> [(IDENTS $id ($LIST $idl))]
Coq <     esac
Coq < | ident_list_single [ identarg($id) ] -> [(IDENTS $id)].

But it would be inefficient: every time an identifier is read, we remove the ``boxing'' operator IDENTS, and put it back once the identifier is inserted in the list.

To avoid these awkward trick, we allow grammars to return AST lists. Hence grammars have a type (Ast or List), just like AST's do. Type-checking can be done statically.

The simple actions can produce lists by putting a list of constructive expressions one beside the other. As usual, the $LIST operator allows to inject AST list variables.

Coq < Grammar tactic ne_identarg_list : List :=
Coq < ne_idl_cons [ identarg($id) ne_identarg_list($idl) ]
Coq < -> [ $id ($LIST $idl) ]
Coq < | ne_idl_single [ identarg($id) ] -> [ $id ].

Note that the grammar type must be recalled in every extension command, or else the system could not discriminate between a single AST and an AST list with only one item. If omitted, the default type is Ast. The following command fails because ne_identarg_list is already defined with type List but the Grammar command header assumes its type is Ast.

Coq < Grammar tactic ne_identarg_list :=
Coq < list_two [ identarg($id1) identarg($id2) ] -> [ $id1 $id2 ].
Toplevel input, characters 15-31
> Grammar tactic ne_identarg_list :=
> ^^^^^^^^^^^^^^^^
Error: Entry tactic:ne_identarg_list already exists with another type

All rules of a same grammar must have the same type. For instance, the following rule is refused because the command command1 grammar has been already defined with type Ast, and cannot be extended with a rule returning AST lists.

Coq < Grammar command command1 :=
Coq <   carret_list [ command0($c1) "^" command0($c2)] -> [ $c1 $c2 ].
Toplevel input, characters 146-153
>   carret_list [ command0($c1) "^" command0($c2)] -> [ $c1 $c2 ].
>                                                       ^^^^^^^
Error: entry type is Ast, but the right hand side is a list

9.2.5 Limitations

The extendable grammar mechanism have four serious limitations. The first two are inherited from Camlp4.

Grammar rules are factorized syntactically: Camlp4 does not try to expand non-terminals to detect further factorizations. The user must perform the factorization himself.
The grammar is not checked to be LL(1) when adding a rule. If it is not LL(1), the parsing may fail on an input recognized by the grammar, because selecting the appropriate rule may require looking several tokens ahead. Camlp4 always selects the most recent rule (and all those that factorize with it) accepting the current token.
There is no command to remove a grammar rule. However there is a trick to do it. It is sufficient to execute the ``Reset'' command on a constant defined before the rule we want to remove. Thus we retrieve the state before the definition of the constant, then without the grammar rule. This trick does not apply to grammar extensions done in Objective Caml.
Grammar rules defined inside a section are automatically removed after the end of this section: they are available only inside it.

The command Print Grammar prints the rules of a grammar. It is displayed by Camlp4. So, the actions are not printed, and the recursive calls are printed SELF. It is sometimes useful if the user wants to understand why parsing fails, or why a factorization was not done as expected.

Coq < Print Grammar command command8.
[ LEFTA
  [ Command.command7; "<->"; SELF
  | Command.command7; "->"; SELF
  | Command.command7 ] ]

Getting round the lack of factorization

The first limitation may require a non-trivial work, and may lead to ugly grammars, hardly extendable. Sometimes, we can use a trick to avoid these troubles. The problem arises in the Gallina syntax, to make Camlp4 factorize the rules for application and product. The natural grammar would be:

Coq < Grammar command command0 :=
Coq <   parenthesis [ "(" command10($c) ")" ] -> [$c]
Coq < | product [ "(" prim:var($id) ":" command($c1) ")" command0($c2) ] ->
Coq <             [(PROD $c1 [$id]$c2)]
Coq < with command10 :=
Coq <   application [ command91($c1) ne_command91_list($lc) ] ->
Coq <                [(APPLIST $c1 ($LIST $lc))]
Coq < | inject_com91 [ command91($c) ] -> [$c].

Coq < Check (x:nat)nat.
Toplevel input, characters 8-9
> Check (x:nat)nat.
>         ^
Syntax error: ')' expected after [Command.command10] (in [Command.command0])

But the factorization does not work, thus the product rule is never selected since identifiers match the command10 grammar. The trick is to parse the ident as a command10 and check a posteriori that the command is indeed an identifier:

Coq < Grammar command command0 :=
Coq <   product [ "(" command10($c) ":" command($c1) ")" command0($c2) ] ->
Coq <             [(PROD $c1 [$c]$c2)].

Coq < Check (x:nat)nat.
nat->nat
     : Set

We could have checked it explicitly with a case in the right-hand side of the rule, but the error message in the following example would not be as relevant:

Coq < Check (S O:nat)nat.
Toplevel input, characters 7-10
> Check (S O:nat)nat.
> ^^^
Error: during interpretation of grammar rule product,
the variable $c was bound to (APPLIST S O)
instead of a (list of) variable(s).

This trick is not similar to the SQUASH node in which we could not detect the error while parsing. Here, the error pops out when trying to build an abstraction of $c2 over the value of $c. Since it is not bound to a variable, the right-hand side of the product grammar rule fails.

grammar	::=	`Grammar` entry gram-entry `with` ... `with` gram-entry
entry	::=	ident
gram-entry	::=	rule-name [`:` entry-type] `:=` [production `\|` ... `\|` production]
rule-name	::=	ident
entry-type	::=	`Ast` \| `List`
production	::=	rule-name `[` [prod-item ... prod-item] `] ->` action
rule-name	::=	ident
prod-item	::=	string
	\|	[entry `:`] entry-name [`(` meta `)`]
action	::=	`[` [ast ... ast] `]`
	\|	`let` pattern `=` action `in` action
	\|	`case` action [`:` entry-type] `of` [case `\|` ... `\|` case] `esac`
case	::=	[pattern ... pattern] `->` action
pattern	::=	ast