Chapter 7Choosing among alternatives

Where it’s practical to write a grammar that is unambiguous, that’s best. But it isn’t always practical. Sometimes it’s difficult, and sometimes it’s impossible. If the data is actually ambiguous, you may have to reflect that in your grammar.

Invisible XML doesn’t consider ambiguity an error, but it also doesn’t provide any mechanism for controlling it. All parses are considered equal and the processor’s only obligation is to provide one of them.

That may not suit your needs. CoffeeSacks provides a way to examine the alternatives and select one. You can supply a choose-alternative function in the parser options. This must be a function that takes a single argument, a list of elements, and returns an integer.

Note

More than one approach to resolving ambiguity may be applied. For example, CoffeeFilter supports a “priority” pragma that lets the grammar author identify a preferred resolution. When the choose-alternative function is called, the first element in the list is always the currently selected alternative, if some previous process has been applied.

The function must return an integer between 0 and the number of alternatives provided. The number is the alternative selected. A value of 0 indicates that the function did not select an alternative. The implementation may then go on to try other methods for selecting an alternative.

This function always selects the first alternative:

  |<xsl:function name="f:first" as="xs:integer">
  |  <xsl:param name="alts" as="element()+"/>
  |  <xsl:sequence select="1"/>
  |</xsl:function>

You could pass it in parse options like this:

  |<xsl:variable name="parser"
  |              select="cs:make-parser($grammar,
  |                      map{'choose-alternative': f:first#1})"/>

Of course, unconditionally choosing the first option isn’t very interesting. To explore further, consider this simple, ambiguous grammar:

1 | number-list: (number, -#a)+, number? .
  |-number: hex | decimal .
  | hex: hex-digit+ .
  | decimal: decimal-digit+ .
5 |-hex-digit: ["0"-"9" | "a"-"f" | "A"-"F" ] .
  |-decimal-digit: ["0"-"9" ] .

If we parse the following input,

  |bad
  |cafe
  |42

We might get this result:

1 |<number-list xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
  |  <hex>bad</hex>
  |  <hex>cafe</hex>
  |  <hex>42</hex>
5 |</number-list>

Of course, we might equally get this result:

1 |<number-list xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
  |  <hex>bad</hex>
  |  <hex>cafe</hex>
  |  <decimal>42</decimal>
5 |</number-list>

The ambiguity here is between “decimal” and “hexidecimal”. Here’s a function that will always select the decimal alternative:

  |<xsl:function name="f:choose-alternative" as="xs:integer">
  |  <xsl:param name="alternatives" as="element()+"/>
  |  <xsl:sequence select="$alternatives[decimal]/@alternative"/>
  |</xsl:function>

In order to understand how this works, we need to look at what’s passed to the function. The function get’s an XML description of the current state of the parse, where each element in the list represents an alternative. For the grammar and input above, what’s passed to the function is a sequence of two elements:

 1 |(<number alternative="1" name="number" from="9" to="11" mark="-">
   |   <hex name="hex" from="9" to="11" mark="^">
   |     <a:nonterminal name="$2_hex-digit-plus" mark="-"></a:nonterminal>
   |   </hex>
 5 | </number>,
   | <number alternative="2" name="number" from="9" to="11" mark="-">
   |   <decimal name="decimal" from="9" to="11" mark="^">
   |     <a:nonterminal name="$3_decimal-digit-plus" mark="-"></a:nonterminal>
   |   </decimal>
10 | </number>)

Each alternative is identified with its number, name, the range of tokens it covers, and its mark. The content of each alternative is the sequence of literals and nonterminals that are the “right hand side” of this alternative. For example, the first alternative above includes:

 1 |<number alternative="1"                          
   |        name="number"                            
   |        from="9" to="11"                         
   |        mark="-">                                
 5 |   <hex name="hex"                                         
   |        from="9" to="11"                                   
   |        mark="^">                                          
   |      <a:nonterminal name="$2_hex-digit-plus"    
   |                     mark="-">                             
10 |
   |      </a:nonterminal>
   |   </hex>
   |</number>

The alternatives are numbered. Return this number to select this alternative. The sequence of alternatives is also ordered by alternative number.

The name of the nonterminal is provided. The name is always provided in an attribute for convenience in the case where the element in the tree and the nonterminal have different names (see ).

This is the range of characters (tokens) in the input covered by this nonterminal.

This is the mark associated with the nonterminal.

When the nonterminal has been created by the parser, the element name and the nonterminal name are different. See Section 7.1, “Grammar details”.

An ellipsis appears where content will appear “further down” in the tree.

The sequence of nodes passed to the function are the alternatives, but that isn’t the whole story. The alternatives will include additional context if they are not the root of the result tree. For example, the whole tree associated with the first alternative in this case is:

 1 |<number-list xmlns:a="https://nineml.org/ns/describe-ambiguity"
   |             a:version="1.0"
   |             name="number-list"
   |             mark="^">
 5 |   <number name="number"/>
   |   <number name="number"/>
   |   <number alternative="1" name="number" from="9" to="11" mark="^">
   |      <hex name="hex" from="9" to="11" mark="^">
   |         <a:nonterminal name="$2_hex-digit-plus" mark="-"></a:nonterminal>
10 |      </hex>
   |   </number>
   |</number-list>

The context includes all of the ancestors of the point where a choice must be made and the number of preceding siblings. Preceding siblings are identified simply by name.

This example doesn’t include any matched literal tokens, but if they are present, they are wrapped in a:literal elements.

Which aspects of the context (if any) are relevant is going to depend on your grammar and the input.

7.1Grammar details

Invisible XML is an example of an extended Backus-Naur form (EBNF) grammar. The underlying parsing technlogies, either Earley or GLL, operate on simple Backus-Naur form (BNF) grammars. What that means in practice is that the first thing CoffeeGrinder does to your Invisible XML grammar is convert it into a plain BNF. That process introduces new nonterminals.

Mostly this is a behind-the-scenes transformation that isn’t relevant to the user, but it’s inescapable when describing ambiguity. Ultimately, the ambiguity is in the BNF and that’s the only grammar that the parser can describe.

The oddly named nonterminals seen above come from the fact that our iXML grammar has been transformed into this BNF:

 1 |                   $$ ::= number-list
   |          number-list ::= $1_number-plus, $4_number-optionⁿ
   |               number ::= hex
   |               number ::= decimal
 5 |                  hex ::= $2_hex-digit-plus
   |              decimal ::= $3_decimal-digit-plus
   |            hex-digit ::= ['0'-'9'; 'a'-'f'; 'A'-'F']
   |        decimal-digit ::= ['0'-'9']
   |       $1_number-plus ::= number, #A
10 |       $1_number-plus ::= number, #A, $1_number-plus
   |    $2_hex-digit-plus ::= hex-digit
   |    $2_hex-digit-plus ::= hex-digit, $2_hex-digit-plus
   |$3_decimal-digit-plus ::= decimal-digit
   |$3_decimal-digit-plus ::= decimal-digit, $3_decimal-digit-plus
15 |    $4_number-optionⁿ ::= ε
   |    $4_number-optionⁿ ::= number

This grammar is simpler in the sense that the rules for matching tokens in the input are simpler. There are no repetitions, alternatives, or other features on the “right hand side” of each production.

It appears more complicated partly because there are more productions and there can be multiple productions for any given nonterminal. It also appears more complicated because it’s impossible to generate semantically meaningful names for the new nonterminals introduced.