M.Sc. thesis


The sixth chapter of the thesis I submitted in partial fulfilment of an M.Sc. in Informatics at the University of Edinburgh

This chapter looks at the conclusions that can be made from the preceding material. The hypothesis used as the basis of this work was given in the first chapter, and stated that introducing an intermediate step before an XPath expression is evaluated against an XML document instance would be a sagacious measure. This step would provide an abstract evaluation of the XPath expression against the document type — an XML schema — to determine whether a concrete evaluation against the XML document could, in principle, be satisfied.

In support of the hypothesis this thesis set out the method involved in converting any given XML schema into a form that could be used to evaluate an XPath expression. XSV, an XML Schema validation tool from the University of Edinburgh, was used to convert XML schemas into a collection of finite-state automata that represented the content models of the elements and attributes defined in a schema. The XPath expression can then be modelled as an input sentence for the automata. The input sentence is made up of symbols, in this case location steps, that are used to progress through the automata. Once the end of the input sentence has been reached, the XPath expression can be said to have been evaluated successfully, and therefore in principle could return a non-empty node set when evaluated against the XML document itself. If the end of the input sentence cannot be reached then the evaluation has failed and has thus proved that further evaluation against the XML document would be pointless.

The project was successful in providing a partial implementation of the most important type of XPath expression, the location path. XPath defines this construct as having three components parts, the axis specifier, the node test, and a (possibly empty) set of predicates. The first of these contains one of thirteen axes, six of which have been implemented here. These six are all but two of the eight forward axes, i.e. those that allow movement downwards through the document tree. The remaining unsupported axes are those that allow movement upwards through the tree. By focusing on the forward axes the project could use a system of one-dimensional node sets in place of a document tree to evaluate the location paths. This system meant that possible problems with left-recursion in trees was avoided.

Node tests are supported when using node name literals. This allows the XPath expression to evaluate against specific elements and attributes, but not against wildcard structures. Multiple namespaces are fully supported in node tests by using corresponding prefixes passed by the user. Predicates are supported only in part, mainly due to the complexity of their structure. A predicate can contain any type of XPath expression, meaning that XPath expressions have a recursive design. A method to evaluate expressions recursively was devised, and certain additional expression types were implemented for use in predicates. Currently there is support for predicates that contain location paths, binary or expressions, binary and expressions, and unary expressions for existence-checking.

As is apparent from the previous paragraphs, a full XPath implementation was not created. Although this was the original intention of the project the time-span did not allow for it. XPath is a very complex language, and this was underestimated at the start of the project. In order to complete the XPath evaluator, the reverse axes are needed, along with wildcards in node tests, and support for the expression types other than location paths. It would be possible to augment the current system with the necessary additions to complete the implementation, the most important being to include a method of recording the path taken through the document tree in order to evaluate the expression. However a more interesting and possibly more efficient process would be to use ordered cyclic graphs as well as finite-state automata to keep track of the path taken through evaluation. It is suggested that any further work looks at expanding the system to include such graphs.

The system has shown that the evaluation of XPath expressions against document types rather than the document itself is a feasible proposition. By evaluating expressions in such a way it is possible to decide whether the evaluation against the document is a worthwhile proposition or not. In doing so, this sort of evaluation can be used to filter out pointless and impossible evaluations before they take place. Inserting this sort of evaluation as an intermediate step before a concrete evaluation took place would be of some advantage to an XPath evaluator — providing efficiency gains for example. Such a measure should be considered worthwhile, which leaves the hypothesis satisfied.

This is a chapter of the thesis I submitted in partial fulfilment of an M.Sc. in Informatics at the University of Edinburgh. Read the full thesis.