M.Sc. thesis

Literature review

The second chapter of the thesis I submitted in partial fulfilment of an M.Sc. in Informatics at the University of Edinburgh

To the best of my knowledge there has been no previous work on evaluating XPath expressions against document types. However, much research has focused separately on the three main parts of this project: automata theory, XML Schema, and XPath. This chapter details some of the work related to this thesis, and discusses its relevance.

The finite automaton came about during the middle years of the last century, based on Turing’s model of algorithmic computation1. Turing’s paper is considered by many to be the foundation of modern computing; in it he defined a machine with finite control and an input-and-output tape. The machine could, in one move, read a symbol from the tape, write a different symbol back onto the tape, change state, and move the position of the tape in either direction.

From this work came McCulloch and Pitts’s model of a neuron2. This automaton-like binary device took excitatory and inhibitive input from other neurons and fired once its activation passed a given threshold. Based on this Kleene3 defined finite automata (see section 3.1 for a definition) along with regular expressions, and proved them to be equivalent. From this comes the axiom that any regular language may be represented as a finite automaton.

The part of XML Schema that constrains the validity of elements and attributes, the content model, is itself a regular language. Thompson and Tobin4 have shown that by modifying Aho and Ulman’s algorithm for converting regular expressions to finite-state automata5, XML schemas can be represented as finite-state automata themselves. Thompson and Tobin use this to implement an XML Schema validation tool, but it can also be used to evaluate an input language against the XML Schema, as will be seen in this thesis.

Recent database research has seen a shift away from traditional relational database management systems towards XML and semi-structured data, with XML Schema used for data description. Arenas et al.6 have shown that XML Schema has problems with the integrity constraints that are integral to database design (i.e. constraint checking is intractable and NP-hard). This, along with other problems (e.g. a lack of required restriction7), has seen an updated version of XML Schema drafted that will eventually supersede the current standard.

XPath has also seen much research interest since its publication by the W3C. Neven8 introduced tree-automata theory for use with XML, and showed how to use such automata to parse and accept XPath expressions. XPath is a complex language, and some have realized that in many cases, a full XPath parser is not necessary. Benedikt et al.9 have discussed several fragments of XPath (for instance only implementing either downward or upward axes), and concentrated on simplifying and optimizing XPath expressions.

On papers that discuss practical implementations rather than theory, there are a number of techniques for processing XPath expressions. Gottlob et al.10 11 introduced algorithms for evaluating XPath expressions using top-down and bottom-up parsing techniques. Interestingly they note that the XPath axes (with the exception of attribute and namespace) can be defined using using two primitives, firstchild and nextsibling, and their inverses. Altinel and Franklin12 describe XFilter, a system for evaluating large numbers of XPath expressions using what is essentially a highly-optimized non-deterministic finite-state automata. A similar system, XTrie, is described by Chan et al.13 XTrie is a more efficient system than XFilter; it identifies common sub-strings in the XPath expression and organizes them in a trie14, and also uses extra optimization techniques at run time. A related system is detailed by Green et al.15; This again models the XPath expression as a finite-state automaton — although here it is deterministic — and evaluates it against streamed XML documents.

These systems all evaluate against the XML document itself, and pay no attention to the related DTD or XML schema. The latter three also model the XPath expression as finite-state automata and use the document as the input language. This project will differ by using the document type and not the document, and by modelling the document type as finite-state automata with the XPath expression as the input language.

  1. A. M. Turing. On computable numbers, with an application to the Entsheidungsproblem. Proc. of the London Mathematical Society, 42:230–265, 1936. A correction was published in the following volume, pp. 544–546. ↩︎

  2. W. S. McCulloch and W. Pitts. A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, pages 115–133, 1943. Reprinted in Neurocomputing: Foundations of Research, ed. by J. A. Anderson and E. Rosen- feld, MIT Press 1988. ↩︎

  3. S. C. Kleene. Representation of events in nerve nets and finite automata. In C. Shannon and J. McCarthy, editors, Automata Studies, pages 3–41. Princeton University Press, Princeton, NJ., 1956. ↩︎

  4. H. S. Thompson and R. Tobin. Using finite state automata to implement W3C XML Schema content model validation and restriction checking. In Proc. of the European XML Conference, p. 11, 2003. https://www.ltg.ed.ac.uk/~ht/XML_Europe_2003.html ↩︎

  5. A. Aho and J. Ullman. Principles of Compiler Design. Addison-Wesley, Reading, Mass., 1977. ↩︎

  6. M. Arenas, W. Fan, and L. Libkin. What’s hard about XML Schema constraints? Lecture Notes in Computer Science, 2453:269–278, 2002. ↩︎

  7. The lack of required restriction is mentioned in H. S. Thompson. W3C XML Pointer, XML Base and XML Linking, 2003. http://www.w3.org/XML/Linking. ↩︎

  8. F. Neven. Automata theory for XML researchers. SIGMOD Record, 31(3):39–46, 2002. ↩︎

  9. M. Benedikt, W. Fan, and G. M. Kuper. Structural properties of XPath fragments. Lecture Notes in Computer Science, 2572:79–95, 2003. ↩︎

  10. G. Gottlob, C. Koch, and R. Pichler. Efficient algorithms for processing XPath queries. In Proc. of the 28th Int’l Conf. on Very Large Databases, pages 95–106, Hong Kong, China, 2002. Morgan Kaufmann Publishers. ↩︎

  11. G. Gottlob, C. Koch, and R. Pichler. XPath processing in a nutshell. SIGMOD Record, 32(1):12–19, 2003. ↩︎

  12. M. Altinel and M. J. Franklin. Efficient filtering of XML documents for selective dissemination of information. VLDB Journal, 9:53–64, 2000. ↩︎

  13. C. Chan, P. Felber, M. Garofalakis, and R. Rastogi. Efficient filtering of XML documents with XPath expressions. In R. Agrawal, K. Dittrich, and A. H. H. Ngu, editors, Proc. of the 18th Int’l Conf. on Data Engineering, pages 235–244, San Jose, Calif., 2002. IEEE Comput. Soc. ↩︎

  14. A tree for storing strings. There is one node for each common prefix, with the strings stored in leaf nodes. ↩︎

  15. T. J. Green, G. Miklau, M. Onizuka, and D. Suciu. Processing XML streams with deterministic automata. Lecture Notes in Computer Science, 2572:173–189, 2003. ↩︎

This is a chapter of the thesis I submitted in partial fulfilment of an M.Sc. in Informatics at the University of Edinburgh. Read the full thesis.