REX: XML Shallow Parsing with Regular Expressions

Robert D. Cameron
School of Computing Science
Simon Fraser University

CMPT TR 1998-17

November, 1998

Copyright 1998, Robert D. Cameron. Permission to copy for individual use is permitted. Multiple copies may be made for use in classrooms, discussion groups, or committee meetings, provided that notice of the intent and extent of the copying is sent to the author (e-mail is satisfactory). Reproduction or republication in any general distribution medium is not permitted without explicit consent of the author. All copying requires that the integrity of the paper be preserved and that this copyright notice be reproduced in full.

Contents

Abstract

The syntax of XML is simple enough that it is possible to parse an XML document into a list of its markup and text items using a single regular expression. Such a shallow parse of an XML document can be very useful for the construction of a variety of lightweight XML processing tools. However, complex regular expressions can be difficult to construct and even more difficult to read. Using a form of literate programming for regular expressions, this paper documents a set of XML shallow parsing expressions that can be used a basis for simple, correct, efficient, robust and language-independent XML shallow parsing. Complete shallow parser implementations of less than 50 lines each in Perl, JavaScript and Lex/Flex are also given.

I. Introduction

The syntax of Extensible Markup Language (XML) is simple enough that it is possible to contemplate lightweight XML parsers based entirely on regular expression technology. This is no accident; the ease of writing programs to process XML documents was an important design goal for XML [XML 1.0]. Indeed, early working drafts of the XML specification considered a "trivial text grammar" that could be used as the basis for the "simplest form of XML processor" [WD-961114, WD-970807]; a version thereof also appears in the SGML FAQ Book [DeRose]. Unfortunately, these grammars never seem to have been debugged and developed to the point of usability for parsing; the trivial text grammar was dropped from the final version of the XML 1.0 specification.

Perhaps the most intriguing application of regular expressions to XML parsing is that a single regular expression can serve as a basis for completely parsing a document into a list of its individual markup and text components. Such an operation is called a shallow parse of the document, because the internal structure of the markup terms is not parsed in the initial instance. For example, if $XML_SPE is a Perl variable denoting a shallow parsing regular expression, then the following Perl subroutine computes and returns the shallow parse of an XML document as an array of strings.

sub ShallowParse { 
  my($XML_document) = @_;
  return $XML_document =~ /$XML_SPE/g;
}

Shallow parsing can be implemented with similar ease using other languages or tools that provide regular expression support. Full implementations of shallow parsing in Perl, JavaScript and Lex/Flex are provided as Appendices to this paper.

Once shallow parsing has returned the list of strings representation of an XML document, further parsing operations can be applied to extract the deep structure of markup items. Regular expressions can also be applied to a number of these subsequent processing steps. In general, the regular expressions for subsequent processing can be relatively simple adaptations of corresponding structures from the overall shallow parsing expression. In some cases, completion of the parsing task also requires some recursive parsing procedures to deal with, for example, the recursive structure of element content models.

The combination of the initial shallow parser of XML documents together with the secondary parsing operations suggests a third form of XML processor API (application programmer interface) to supplement the event-based APIs such as those based on SAX and the tree-based APIs such as those based on expat. The shallow parser approach has some interesting advantages. The list of strings representation is very simple and may often be directly coded using a native data structure in the host language. In comparison with the tree-based approach this simplifies the interface by eliminating the need for a programmed data structure and corresponding access routines. The representation of markup terms by their string representation may also be considerably more compact than corresponding parsed representations. Another advantage is that it is easy to regenerate the original XML document or components thereof using simple string concatenation operations. Finally, the shallow parsing approach is fault-tolerant: it provides a correct parse for correct XML documents and a useful parse for attempted but incorrect XML documents. In fact, shallow parsing need never fail; given any document whatsoever, an appropriate list of strings representation of that document can be produced. For the implementation of lightweight XML-to-XML filters and fault-tolerant XML editing tools, these advantages may strongly favour the shallow parser approach. A full comparative analysis of the three approaches is beyond the scope of this paper, however.

Implementation of shallow parsing using regular expressions has the additional benefit that it is a relatively language-independent technique. Modern scripting languages now generally provide built-in regular expression support, while other languages can make use of widely available regular expression libraries or lexical analyzer generators. Thus, regular-expression based designs for XML processing tools can often be contemplated without a premature commitment to implementation language. Furthermore, once an implementation has been developed, rehosting of that implementation in a different language may be relatively straightforward, at least insofar as the regular-expression based processing is concerned. However, language independence is not automatic; a number of nonstandard extensions or special processing modes exist in the various regular expression support packages. To maximize language independence, the regular expressions developed in this paper generally avoid nonstandard features.

The goal of this paper, then, is to present a language-independent toolkit of XML regular expressions that can be used as the basis of simple, correct, and efficient XML processing tools. Section II of this paper briefly discusses important preliminaries of regular expression notation, character set issues and the efficiency of regular-expression based processing. Section III describes the most important regular expression developed in this paper: a single regular expression that can be used to produce a shallow parse of an entire XML document. Subsequent to shallow parsing there are a variety of other tasks that can be aided by carefully constructed regular-expressions; these are discussed in section IV. Section V concludes the main body of the paper; the Appendix gives complete code for sample regular-expression based XML tools implemented in JavaScript.

II. Preliminaries

II.1 Regular Expression Notation

Regular expressions are presented in this paper using a grammatical notation adapted from the XML specification. This notation allows a free-format presentation (spacing is irrelevant) and avoids the readability problems of excessive backslash escapes for special characters. Compared to the notation of the XML specification, restrictions are imposed in order to ensure that the grammar is indeed regular and can be easily translated into the syntax used by typical regular expression support facilities.

The following conventions are used for regular expression construction.

Overall, the translation of regular expressions given using these conventions to the equivalent notation of a regular expression support package is relatively straightforward. The appendices illustrates the translation in several instances.

II.2 Regular Expression Processing on ISO 10646/Unicode Character Sets

An apparent problem with the use of regular expressions as a basis for XML processing is that many regular expression packages are based on single-octet (8-bit) extended ASCII (e.g., ISO Latin 1) character sets, while XML requires support for the full ISO 10646/Unicode character set using various alternative multi-octet encodings. Over time, this problem may be solved by the general upgrading of regular expression support packages to provide explicit support for ISO 10646 character sets. However, for some applications, including XML parsing, it is possible to fairly convenient use of 8-bit regular expression packages on the UTF-8 encoding of ISO 10646, provided that some care is taken in the treatment of non-ASCII characters.

UTF-8 is a particular encoding of ISO 10646 designed for interoperability with some forms of ASCII-based single-octet processing. UTF-8 is a variable length (1 to 6 octet) encoding with two important properties with respect to the ASCII representation. First, characters in the 7-bit ASCII character subset (high-order bit 0) have the same bit representation as single octets in UTF-8. Second, octets 2 through 6 in UTF-8 always have a high-order bit 1 to avoid confusion with 7-bit ASCII characters. Thus, ASCII-based applications that make processing decisions only on the occurrence of 7-bit characters and otherwise pass through 8-bit characters (non-ASCII octets) unaltered can often be used with equivalent effect on the UTF-8 representation.

There are two important techniques that allow UTF-8 processing with 8-bit extended ASCII regular expression packages. First, a simple restricted form of regular expression construction can ensure that they are suitable for processing UTF-8 documents directly. In short, the restriction is that non-ASCII characters may only be accepted by negative character set constructors in combinations equivalent to "[^A] [^B]*", where A and B represent possibly different sets of 7-bit ASCII characters. This restriction is explained in subsection II.2.1 below. The second technique is to directly write regular expressions over UTF-8 octets using hexadecimal character code notation. For XML processing, this technique is only relevant for processing of the non-ASCII name characters that may appear in names and name tokens. Processing of XML name characters is discussed in subsection II.2.2.

Encodings other than UTF-8 are also permitted for XML documents. However, as UTF-8 is a universal encoding format for ISO 10646, it is always possible to convert the alternatively encoded documents to the UTF-8 representation. If output documents are then to be generated based on the UTF-8 representation, an important technicality is to ensure that any encoding declaration in the XML document is rewritten to use the string "UTF-8".

II.2.1 Extended ASCII Processing of UTF-8

ASCII-based processing for UTF-8 applications can be formalized based on the following notion of ASCII-based automata without counting. An ASCII-based automaton without counting is a deterministic finite automaton over a character set that includes the 7-bit ASCII set as a subset and meets the constraints that (a) all super-ASCII transitions are total, and (b) all super-ASCII acceptance states include a super-ASCII self transition. A super-ASCII transition is one which accepts any non-ASCII character as well as possibly some ASCII characters; a super-ASCII transition is total if it accepts all non-ASCII characters. A super-ASCII acceptance state is one reached by a super-ASCII transition. In essence, constraint (a) means that the automaton is ASCII-based, that is, transition decisions are based on the presence or absence of specific 7-bit ASCII characters. If the test is based on the absence of specific ASCII characters, then all non-ASCII characters are acceptable and the super-ASCII transition is total. Constraint (b) means that non-ASCII characters are accepted without counting. Once a non-ASCII character is accepted, the automaton accepts arbitrarily more non-ASCII characters in the same state. An exit transition for the state may only be defined based on the recognition of specific ASCII characters.

The essential result that enables ASCII-based processing for UTF-8 applications is that an ASCII-based automaton without counting defined over an 8-bit extended ASCII character set implements an equivalent automaton over UTF-8. In the 8-bit case, note that the super-ASCII transitions are the ones that accept octets with the high-order bit set. Consider then what happens if a UTF-8 document is supplied to an 8-bit ASCII-based automaton without counting. In essence, the automaton is being applied to the octet stream of the UTF-8 document which is not necessarily the same as the character stream. However, octets with zero in the high-order bit will be accepted and processed correctly as the ASCII characters they represent. When an octet with high-order bit set to one is encountered, it is the start of a multi-octet sequence coding for a non-ASCII character of ISO 10646. The automaton will initially treat this as a single-octet non-ASCII character. However, the character will only be accepted if a super-ASCII transition exists in the current state. If so, a super-ASCII acceptance state is entered. This state will have a super-ASCII self transition that will accept additional octets with the high-order bit set indefinitely. Thus the remaining octets of the UTF-8 sequence coding for the non-ASCII character (which octets must all have their high-order bits set) will be accepted without leaving this state. In essence, the state has been entered with the acceptance of the full multi-octet coding of the non-ASCII character. As a result, the 8-bit ASCII-based automaton accepts UTF-8 octet sequences in precisely the fashion required for implementation of the corresponding automaton over the UTF-8 character set.

The constraints for UTF-8 processing using 8-bit extended ASCII automata can be expressed as corresponding constraints on the use of 8-bit ASCII regular expression packages. Again, the requirement that all super-ASCII transitions are total is equivalent to the statement that transitions decisions are based on the presence or absence of specific ASCII characters. In regular expression terms, this corresponds to a requirement that only the 7-bit ASCII characters be used in literal strings or character set constructors. A super-ASCII transition then arises whenever a character set is constructed in negative form, for example, "[^>;.]". Such an expression accepts any single character except those listed. Because only 7-bit ASCII characters are listed, every non-ASCII character is acceptable and the resulting super-ASCII transition is total.

The second constraint for UTF-8 processing using 8-bit ASCII regular expression packages arises from the requirement that super-ASCII acceptance states have super-ASCII self-transitions. In regular expression terms, a super-ASCII self-transition corresponds to a zero-or-more repetition of a negated character set expression. The prototypical regular expression for super-ASCII transitions in a super-ASCII acceptance state is thus "[^A][^B]*", where A and B represent possibly different sets of 7-bit ASCII characters. In this case, the initial super-ASCII transition based on "[^A]" enters a state with a super-ASCII self-transition based on "[^B]". However, other regular expression constructions can have a similar effect. Note, for example that that the form "[^A]+" is acceptable because it is equivalent to "[^A][^A]*". A more complex example that may also be implemented by an ASCII-based DFA without counting is "([^A]|R)([^B]*|R2)", provided that neither subexpression R nor R2 initially matches a non-ASCII character.

In general, except for the processing of names and name tokens as discussed in the next subsection, XML parsing actions may be implemented ASCII-based automata without counting. Every regular expression defined in sections III and IV of this paper meet the restrictions outlined here.

II.2.2 Processing of Non-ASCII Name Characters

XML documents can largely be processed using ASCII-based techniques compatible with the constraints described above. The only difficulty is that names and name tokens are permitted to include some, but not all, non-ASCII characters from the ISO 10646 as name characters (NameChar in the XML specification). The super-ASCII transitions of the Unicode subautomata for name recognition are thus nontotal and cannot be directly implemented using the techniques of the previous section.

However, there are two relatively straightforward approaches to dealing with the issue of processing non-ASCII name characters using extended-ASCII regular expression packages. The first is to use direct hexadecimal coding of UTF-8 octets to define regular expressions for all and only the legal name structures of XML. Construction of these expressions is tedious, but not conceptually complex. The second approach is to relax the restrictions on non-ASCII name characters. That is, because non-ASCII characters play no role in delimiting names within correct XML markup, every non-ASCII character may safely be accepted as a name character during XML shallow parsing. Using this approach, correct XML documents will be correctly parsed, and documents that are erroneous by virtue of an illegal non-ASCII name character will nevertheless be usefully parsed in a way that supports the implementation of robust XML filters.

For simplicity and its support of robustness, then, REX adopts the approach of relaxing the restrictions on non-ASCII name characters. The following definitions result.

These definitions meet the constraints for implementation via ASCII-based DFAs without counting, allowing correct regular-expression based processing of UTF-8 documents to be implemented using 8-bit extended ASCII regular expression packages.

II.3 Efficient Processing with Regular Expressions

Although higher-level declarative programming techniques are often associated with inefficient implementations, regular-expression based processing can be made quite efficient. Indeed, there are good arguments why regular-expression based implementations may be more efficient than carefully hand-crafted implementations under certain circumstances.

In general, regular-expression based processing is implemented by compilation of the regular expressions into finite automata. The automata may either be nondeterministic (NFAs) or deterministic (DFAs) depending on the implementation strategy. NFAs may be more compact than DFAs, but next state transitions may be ambiguous. NFA interpretation may hence require backtracking strategies to explore alternative state transition possibilities.

When the number of states is not excessive, DFAs can be a very efficient approach to regular-expression processing. In essence, each DFA state has a lookup table giving the next state transition for each possible input character. For 8-bit extended ASCII automata, complete lookup tables of 256 entries are tolerably small, provided that the total number of states is not excessive. Input characters can thus be examined and next states computed with a single memory reference. Long strings can be scanned very efficiently, with each input character being touched only once, and each touch being very lightweight.

DFA-based implementation may indeed outperform carefully coded manual implementations. A manual implementation may also be able to achieve a touch-once property, but usually will involve heavier touches. For example, character and line counting is often an integral part of such implementations so that error messages can be helpfully correlated with the input document when necessary. However, this sacrifices the speed of processing correct documents for the convenience of error analysis of incorrect documents. This paper develops regular-expression based approaches that reverse the trade-off; correct documents can be parsed very efficiently, while error analysis and reporting may be somewhat slower.

Furthermore, when regular expression support is built-in to a scripting language such as Perl or JavaScript, it may not be possible to manually implement scanners with equivalent speed. In essence, interpretation and/or compilation of manually written source code statements may involve considerably more overhead than the built-in string processing code associated with regular-expression support.

For XML parsing, it is generally possible to develop regular expressions for which reasonably compact DFAs may be generated. Furthermore, it is also possible to write these expressions in a deterministic form that allows NFA-based implementation without excessive backtracking. In essence, the regular expressions can be developed so that, at each decision point, there is an unambiguous choice dependent on the next input character. Sections III and IV follow this approach in developing the regular expressions for XML parsing.

II.4. Literate Regular-Expression Programming

Literate programming is a term coined by Knuth for a particular style of program development in which high quality typesettable program documentation is integrated with program source code [Literate Programming]. Literate programming is supported by a representation that combines both a particular programming language and a particular typesetting language, for example, Pascal and TeX in the original web system (unrelated to the World-Wide Web). Two programs, generally referred to as tangle and weave operate on this representation to produce the required language processor input, or typesetter input, respectively. By construction, literate programming leads to program documentation that is consistent with the program it documents. This is a particularly valuable property both for program validation and subsequent program maintenance.

This paper employs a form of literate programming adapted for use with regular expressions. In essence, each named regular expression in this paper is defined in terms of a structured representation intended to serve as input to programs retangle and reweave. The reweave program weaves the regular expression definitions into the tapestry of this paper in the notation of Section II.1 above. The retangle program produces programming language implementations of these expressions tangled together with all the necessary quotes, backslash escapes, string operators and so on. Furthermore, the retangle program is parametric in the target programming language of the generated output. As implemented for this paper, retangle generates the Perl, JavaScript and Lex/Flex implementations presented in the Appendices.

Perhaps the most intriguing aspect of literate regular-expression programming as used in this paper is that it is based on an XML representation. This provides an interesting opportunity to test the XML shallow parsing expressions in the implementation of the reweave and retangle tools. Indeed, after a bootstrapping process using hand-written shallow parsers, the reweave and retangle have in fact been rewritten to use the very regular expressions documented in this paper.

Overall, the use of these literate regular-expression programming techniques have a considerable benefit in ensuring that the regular expressions developed herein are consistent, reliable and well-documented. By being based on a common XML representation, the Perl, JavaScript and Lex/Flex implementations of the regular expressions are consistent with each other and with the forms actually appearing in this paper. Errors identified in any one setting are fed back to the original XML encoding and result in regenerated versions for each implementation and the paper. Furthermore, regenerated versions are then tested with the retangle and reweave programs. If those tests produce erroneous results, the errors must be corrected once again until stable, consistent and correct processing is once again achieved.

III. General Properties of Shallow Parsing

In summary form, shallow parsing may be described as the parsing of an XML document into a list of its individual markup and character data strings. However, this statement can be interpreted in a variety of ways with respect to the actual markup and text strings that are returned and the behaviour of shallow parsing in the case of an XML document containing errors. To clarify these issues, this section analyzes the general properties desirable for XML shallow parsing, together with some of the alternatives.

III.1 Input Partitioning

The first desirable property of shallow parsing is that the output list of strings is an ordered partition of the input document. That is, every character from the input document is represented exactly once in an output string, and the order of the character occurrences is preserved. In other words, the concatenation of the output list of strings precisely reproduces the input document.

The input partitioning property for shallow parsing is particularly useful for the construction of lightweight XML-to-XML filters. These filters generally take an XML document and make some small changes at one or more locations. The partition property allows the implementation of such filters without the imposition of additional changes by the parsing process. This will often be considered highly desirable by the users of such filters, particularly when the filter is used as part of a document editing or transformation task.

It is possible to contemplate a shallow parsing system that does not preserve all input characters, for example, deleting some white space characters within markup items and/or expanding some references. However, these effects may be considered undesirable for some types of XML filters.

III.2 String Classification

Each of strings in the list returned by an XML shallow parser may be classified as of one three types: markup strings, error strings, or parsed character data strings. The following breakdown is particularly simple and useful.

Character, general and parameter-entity references are not extracted in this approach. A second process of reference extraction can be used for this purpose. An alternative design would be to separate out character and general entity references from text items in the initial shallow parse. However, reference extraction would still be needed as a separate process for application to markup items, for example, extracting character and general references from the literal attribute values of element start tags. Furthermore, early reference extraction from text strings complicates the specification and implementation of the shallow parser and leads to a somewhat heavier weight representation. More importantly, reference extraction from text may be irrelevant to a large class of lightweight XML filters, that is, those filters that can preserve references because the input and output document type definitions are the same. For such filters, early reference extraction from text is simply an unnecessary task that needs to be reversed. For these reasons, the approach taken in this paper is to avoid premature reference extraction; it is also more consistent with the overall philosophy of shallow parsing.

Another implication of this approach is that XML document type declarations are returned as single strings, even though document type declarations may contain nested markup items such as processing instructions, comments or markup declarations. When necessary, these nested items must be extracted and processed in a secondary parsing process.

III.3 Error String Properties

Shallow parsing is intended to support processing of partial, invalid or otherwise ill-formed XML documents. In fact, it is not difficult to arrange that, given any document whatsoever, the shallow parser returns a viable list of strings representing its interpretation as an XML document. For a fully robust shallow parser, there are no fatal errors. Errors are identified during shallow processing whenever the opening left angle bracket of a markup item has been found, but there is either an internal syntactic error in the item or the corresponding closing bracket cannot be found. The simplest approach to error identification in these cases is to return the character string consisting of a single left angle bracket as a universal error token. In the shallow parse representation, this would generally be followed a text string object containing both the remainder of the attempted markup item as well as the following actual text up to the next occurrence of an angle bracket.

However, it is possible to design regular expressions for shallow parsing that improve upon this. In most cases involving erroneous markup items, it is possible to return at least the full opening delimiter of a markup item, possibly followed by its "name" (where relevant). Thus, strings such as "<!--", "<![CDATA[", "<?xml", "<!DOCTYPE", "<H3" and "</H3" can serve as tokens denoting errors in particular types of structure. In some cases, longer error tokens can also be usefully returned to more precisely locate an error. For example, given the erroneous comment <!-- Embedded double hyphen (--) not allowed --> the error token "<!-- Embedded double hyphen (--" usefully locates the precise point of error. Nevertheless, a single character "<" error token can still occur; it usually denotes an erroneous occurrence of an opening angle bracket in text data, such as "x < 0".

III.4 Markup String Properties

The final general issue with respect to shallow parsing specification is what properties can be assumed when a fully delimited markup string is returned. In general, regular expressions cannot enforce all of the well-formedness constraints for correct XML markup, let alone the context-sensitive validity constraints. Furthermore, in support of fault-tolerant XML tools it may be desirable to relax certain constraints that are technically possible to enforce. For example, various different contexts within the XML grammar impose different constraints on the characters that may be contained within quoted strings. However, most of the constraints do not affect the construction of XML-to-XML filters and so their enforcement within the XML shallow parser may be unnecessarily limiting.

The specific properties that can be assumed for each type of markup item are documented in the relevant subsection of section IV.

III.5 Parsed Character Data Properties

Parsed character data strings returned by the shallow parser consist of maximal length arbitrary text and whitespace strings found between markup items (correct or erroneous) in the document source. The maximal length property means that the strings always extend from one markup or error item (or beginning of file) to another markup or error item (or end of file). Thus, no two character data strings will occur consecutively in the shallow parse. Alternatively, a shallow parser could be designed that further divides parsed character data in some fashion, for example, at line boundaries. However, this leads to a heavier weight representation. In the case of line boundary divisions, the strategy is also incomplete, in that such divisions occuring within markup or error items would not be identified.

Three additional observations about parsed character data may be of interest. First, character data strings may consist entirely of whitespace. This is necessary to preserve the input partitioning property when consecutive markup items are found, say, with an intervening blank line. Second, as noted previously, parsed character data may also include character or general entity references (which in turn may be either correctly or incorrectly formed). Third, parsed character data strings returned in shallow parsing may also include "]]>" sequences even though those sequences are not permitted in XML. The principal implication of this is that a check for this sequence must be made whenever a text object is enclosed in a CDATA section. However, a general purpose CDATA construction operation must always make this test anyway, because the text argument for such an operation might come from a source other than a text string of an unexpanded input document.

IV. The XML Shallow Parsing Expression

Based on the general analysis of the previous section, this section develops the overall XML shallow parsing expression XML_SPE. This expression is intended to be repeatedly matched against an XML document to return the list of individual text, error and markup items that comprise the document shallow parse.

The overall structure of XML_SPE is given by the following definitions.

If the first character of the input to be matched against XML_SPE is any character but a left angle bracket, then the text scanning expression TextSE is used to extract a text item. Otherwise, a markup item (or related error item) should be parsed, the type of which is determined by one or more characters following the opening angle bracket. Once the correct markup type is determined, further parsing of the item is handled by a "completion expression" (CE) for the type.

In the overall structure of XML_SPE, note that the completion expressions are each optional. If no string matching the relevant completion expression is found, then the opening delimiter itself may be returned as an error token. Thus any of the tokens "<!", "<!--", "<![CDATA[", "<!DOCTYPE", "<?" or "</" may be returned to indicate malformed markup of a particular type. Furthermore, a solitary "<" may be returned as a general error token in the event that it is followed neither by one of the special characters ("!", "?" or "/") nor by a string matching the completion expression for element tags.

Note also that XML_SPE always matches at least one character for any input. In this way, global matching of XML_SPE against an input document is guaranteed to produce a shallow parse satisfying the input partitioning property.

The remainder of this section develops the regular expressions for extraction of each type of markup item. Each expression is developed in standalone form as well as in the completion expression form for use within XML_SPE as defined here.

IV.1 Comments

An XML comment opens with a "<!--" delimiter and generally closes with the first subsequent occurrence of the closing "-->" delimiter. An explicitly stated exception is that a double hyphen is not permitted within the body of a comment. This rule ensures that unterminated comments are detected if a new comment opening delimiter is encountered.

There is an additional restriction that comments cannot be terminated with the "--->" sequence, that is, that the body of the comment cannot terminate with a hyphen. This rule may be inferred from the grammar rule for comments given in the specification [XML 1.0].

Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'

Here Char stands for the set of all characters and the unquoted "-" denotes substraction from this set. This rule does implement the stated restriction on double hyphens, but also has the additional effect of disallowing a hyphen at the end. This is compatible with the SGML treatment of comments.

In fact, this grammar rule can be directly converted to a regular expression by substitution of "[^-]" for each occurrence of "(Char - '-')". However, the result is nondeterministic with respect to hyphens found after the opening delimiter. In essence, in a direct NFA-based implementation it forces a choice of matching the hyphen within the comment body or as the first character of the closing delimiter. Backtracking to reverse this decision may consequently be required.

An equivalent deterministic regular expression may be constructed as follows.

The steps used in matching this expression to an XML comment may be summarized as follows.

  1. Match the opening delimiter ("<!--").
  2. Scan past any nonhyphens to the first occurrence of a hyphen (Until1Hyphen).
  3. If the first character after this hyphen is not another hyphen, scan past this and other nonhyphens until the next hyphen. Repeat until the first character past the hyphen is another hyphen, i.e., until the first double hyphen is found.
  4. Accept the right angle bracket to close the comment.

The comment matching process scans until the first occurrence of a double hyphen. Because a double hyphen is not legal within the body of a comment, it must signal the position of the closing delimiter. If the closing angle bracket does not follow immediately, the comment scan fails.

Applied to shallow parsing, CommentRE will either match complete and correct comments or fail to match. For robustness, the shallow parsing expression returns two types of comment error token.

As mentioned previously, the comment completion expression is optional, allowing the opening delimiter to be identified as an error string even if the rest of the structure is not matched (unterminated comment). Within the completion expression, the optional closing angle bracket means that a comment erroneously terminated with a double hyphen may nevertheless be identified. Under the longest substring match rule, then, CommentSPE matches the full structure of a legal XML comment when it exists, an erroneously terminated comment ending in a double hyphen as a second priority, or the opening comment delimiter, otherwise.

The properties of comment shallow parsing may be summarized as follows.

IV.2 Processing Instructions

XML processing instructions (including the XML declaration as a special case) have the following syntax.

Here "Name" and "S" are nonterminals in the XML grammar for names and whitespace. After the opening "<?" delimiter, a processing instruction terminates with the first occurrence of the corresponding "?>" sequence (this sequence cannot occur in a name or whitespace).

A deterministic regular expression that precisely captures this syntax may be constructed as follows.

After the opening delimiter and the processing instruction target (name), the tail of the processing instruction may immediately terminate with the closing delimiter or may first have arbitrary content following whitespace. The key to the scanning process for this arbitrary content is to use the UntilQMs subexpression to scan for question mark sequences (one or more question marks) until a closing angle bracket is found immediately after such a sequence. This may be contrasted to the scan for a single hyphen in comment matching. Technically, a processing instruction may terminate with "?>", or "??>" or "???>" or so on. Failure to properly scan for question mark sequences is an easy mistake to make. For example, the following erroneous scanning expression was found in early working drafts of the XML specification [WD-970807, WD-961114] as well as the SGML FAQ book [DeRose].

This scanning expression was designed as a minimal form to find the closing delimiter of a processing instruction, but fails to accept any processing instruction that terminates with exactly two question marks before the right angle bracket.

A further subtle point in the PI_RE expression is the use of S1 expression to match a single whitespace character rather than matching one or more whitespace characters with S. The reason for this is to avoid nondeterminism; the extra whitespace characters are also matched by the [^?] subexpression that immediately follows. In turn, this means that NFA-based implementations of this expression can avoid backtracking.

PI_RE suffices to identify all and only syntactically correct processing instructions. In the event of an incorrect or unterminated processing expression, the shallow parser should return the opening "<?" delimiter followed by any legal portion of the target name that exists. The following shallow parsing expression is suitable for the task.

The properties of shallow parsing of processing instructions may be summarized as follows.

IV.3 CDATA Sections

CDATA sections open with the delimiter "<![CDATA[" and have arbitrary content terminated with the first occurrence of the "]]>" delimiter. The key to the scanning process in this case is to scan for sequences of two or more right square brackets followed immediately by a right angle bracket. The precise regular expression for CDATA sections can thus be developed as follows.

Although the UntilRSBs expression is more complex, it plays a similar role here to that of UntilHyphen in comment scanning and UntilQMs in processing-instruction scanning.

Following the previous examples, adapting this expression for error signalling can be accomodated by making the tail part of the CDATA section optional.

This gives a shallow parsing expression which may match and return two types of string only.

IV.4 Element Tagging

The element tagging structures consist of element start tags, empty element tags and element end tags. Element start tags and empty element tags have a common syntactic structure distinguished only by the occurrence of "/" before the closing ">" delimiter.

There are several design alternatives for element tag regular expressions ranging from minimal to maximal enforcement of syntactic constraints. On the minimal side, after matching the opening delimiter and a name start character, a scanning expression could match arbitrary text up to the first unquoted right angle bracket. On the maximal side, a regular expression can be designed to enforce every constraint given in the grammar above. Each approach gives regular expressions that can be used to correctly extract element tagging from well-formed XML documents. The approaches differ in terms of fault-tolerance and support for subsequent processing. The scanning expression approach allows markup items to be identified even in the presence of internal syntactic errors, while the maximal enforcement approach ensures that markup items are in a form enabling extraction of internal components.

The pure scanning expression approach suffers from the problem that, if a closing quote mark or a closing angle bracket is missing, scanning can continue into subsequent text and markup. This can often generate misleading results. For example, the erroneous construction <E1 Att1='error><e2 Att2='>'> would be parsed as the element tag <E1 Att1='error><e2 Att2='> followed by "'>" as text. The element tag might then be erroneously interpreted as a correct E1 element start tag with "error><e2 Att2=" as the the value of attribute Att1.

However, it is possible to avoid these problems by designing a scanning expression that terminates upon occurrence of an opening angle bracket of a new markup item. Note that such brackets are legal neither within quoted attribute value strings nor unquoted elsewhere in the body of the element tag.

This scanning expression is maximally tolerant of internal syntactic faults without overscanning into subsequent markup items. Such fault tolerance may be useful to applications intended to work with erroneous XML documents.

A further problem with the scanning approach is that one cannot count on returned markup items to be sufficiently well structured to support extraction of attribute value lists. This represents an impediment to element tag processing in the "normal case," that is, that element tags are correctly formed. In order to remove this impediment, the following expression may be used as the basis for shallow parsing.

This expression ensures that identified markup items are well structured for attribute value extraction. However, a scanning philosophy is still maintained for attribute value strings. This allows applications to robustly handle documents in which such strings contain ill-formed references.

Finally, to adapt ElemTagRE for error handling, it is useful to make the closing delimiter optional.

This allows the return of an error string consisting of the opening delimiter and element name together with as many attribute value pairs that exist in the correct form.

A minor point about this expression is that there is a slight nondeterminism in whitespace processing. In a nonoptimizing NFA-based implementation this may require some backtracking if whitespace exists immediately before the final "/>" or ">" delimiter. However, a DFA-based implementation can be achieved without state expansion.

The regular expression for element end tags follows directly from the grammar in the XML specification.

The shallow parsing expression modifies this structure to match appropriate error strings.

Either the opening end tag delimiter followed by the element name and possible white space, or the opening delimiter alone may be identified as error strings.

IV.5 Document Type Declarations

Document type declarations are complex syntactic objects for which regular-expression based parsing would not normally be attempted. However, from the point of view of scanning and locating the appropriate closing ">" delimiter, they do in fact have a regular structure. By generally ignoring the internal structure of document type declarations except as specifically relevant to delimiter determination, a reasonably straightforward scanning expression can be developed. This in turn permits implementation by a DFA of a reasonably small number of states and is also tolerant of minor internal syntactic errors.

As discussed in the example of element tagging, however, it is desirable to modify the pure scanning approach in three ways. First, it is desirable to avoid overscanning through a faulty construct into subsequent markup items. Thus, although other kinds of syntactic error might be tolerated so long as the delimiters can be correctly identified, any inappropriate occurrence of an opening markup delimiter (angle bracket) should immediately terminate the scan. Second, in order to facilitate subsequent processing, it is desirable to ensure that a string returned as a fully-delimited document type declaration indeed has sufficient internal structure to enable correct extraction of the individual components. Third, when internal syntactic errors do exist, it is desirable to return a partial document type declaration consisting of a correct prefix prior to the point of error.

Document type declarations have two basic parts: an identificaton part specifying the document type name and possible external identifiers, and an optional body consisting of a sequence of declarations and other items enclosed in square brackets. The top-level structure of a shallow parsing expression for document type declarations enforces this syntax.

Optional whitespace is included in places specified by the XML grammar. The closing angle bracket is optional for shallow parsing; if it is indeed missing, the almost complete document type declaration is returned as an error string.

The identification part of a document type declaration consists of a document type name, possibly followed by an external identifier consisting of a keyword SYSTEM or PUBLIC and one or two quoted string literals. Whitespace is required to separate these elements from each other and the opening "<!DOCTYPE" delimiter. From the perspective of scanning for the opening "[" of the document type body or the closing ">" of the overall document type declaration, occurrences of these symbols within the quoted strings must be ignored. Beyond this, it is reasonable to enforce the existence of a document type name since this is required of all correct document type declarations. The existence and structure of the external identifier part can be modelled as a list of zero or more whitespace separated names (keywords) or quoted string literals.

This ensures that external identifier part can be extracted as well-formed list of symbols while tolerating possible errors in the actual symbols used.

Within the square brackets of a document type declaration, four types of item may be found, with or without whitespace to separate.

Here DeclBody denotes a common lexical structure for element, attlist, entity and notation declarations, although their detailed syntax varies considerably. For shallow parsing purposes, the bodies of these declarations have the common property that they may contain an internal occurrence of a closing ">" delimiter only within a quoted string. This syntax is captured by the following declaration completion expression.

This expression scans through arbitrary content searching for the closing ">" delimiter. Quoted strings are skipped and the scan may terminate with failure if an erroneous "]" or "<" delimiter is encountered.

Combining MarkupDeclCE with the previously developed expressions for comments and processing instructions and parameter-entity reference expression gives an overall expression for scanning items in the body of document type declarations.

This completes the definition of the scanning expression for document type declarations.

V. Secondary Regular-Expression Based Processes

Shallow parsing performs the primary task of parsing an XML document into its invididual markup and character data components. Subsequent to this, other processing tasks are generally required, depending on the specific XML application. Some of these secondary tasks can also use regular-expression based techniques for parsing of the individual markup and text items. In most cases, the regular expressions for these secondary parsing tasks can be developed by adaptation of those documented in the previous section. However, there is one additional task unaddressed by the previously developed expressions, namely that of reference extraction.

V.1 Reference Extraction

Reference extraction is the process of extracting character, entity and parameter-entity references from quoted strings or parsed character data. The quoted string literals may be either literal entity values from entity declarations, or literal attribute values from attribute list declarations, element start tags or empty element tags. Character and entity references may be found in any of these contexts, while parameter-entity references may occur in literal entity values only. Parsed character data is the text data found between markup terms during shallow parsing; only character and entity references occur in parsed character data.

The XML grammatical productions for each of these reference types are in fact in regular form as given in the XML specification.

Direct use of regular expressions based on these forms can allow syntactically correct references to be extracted.

From the standpoint of fault tolerance, however, the reference extraction expressions can be improved to account for syntactically incorrect references as well. A useful technique is to extract maximal length reference prefixes. For example, in the text item "A carriage return (&#13:)", the character reference is incorrect because it is terminated with a colon rather than a semicolon. In this case, the maximal length reference prefix "&#13" could be returned by the extraction process to indicate the error. Of course, whenever a reference is syntactically correct, the maximal length prefix is the full reference itself. Correct references can always be identified by the presence of the terminating semicolon in the maximal length prefix. Using the longest match rule, maximal length prefixes can be determined using regular expressions that match all possible prefixes. The simplest case is the all-prefix expression for parameter-entity references.

The shortest prefix is just "%", so everything after this is optional. Longer prefixes must have one or more characters of a name. However, the Name construct may be used directly, since every nonempty prefix of a name is also a valid name. After the name characters, the prefix may be extended with a semicolon to give a full parameter-entity reference.

An all-prefix expression for character and general entity references may be built using similar techniques.

After the opening "&", there may or may not be either a name part or a numeric part. A name part, if it exists, may have a following semicolon to complete the reference. A numeric part starts with the "x" character followed by a possible decimal sequence or a hexadecimal part.

From an efficiency perspective, the all-prefix expressions have the attractive property that their DFA implementations require no more states than do the corresponding implementations of the simpler expressions for full references. In essence, the DFAs for the full reference expressions are easily converted to DFAs for the all-prefix form by making every state except for the start state (before the "&" or "%") into acceptance states.

Finally, the extraction expressions can be easily converted into expressions suitable for parsing and determination of replacement texts.

Using these expressions input strings can be parsed into lists of alternating pure text and reference prefix strings. In JavaScript, for example, if textString is a string object, and Text_PE_g is a global match version of the Text_PE expression, then textString.match(Text_PE_g) parses textString into such a list of strings. The replacement text for the entire string can then be determined by replacing the references with their appropriate values and reconcatenating the string list.

VI. Conclusion

Regular-expression based processing allows for very simple implementations of a variety of lightweight XML processing tools. These include lightweight XML-to-XML filters that may generate output texts only slightly modified from their inputs (with limited entity replacement, for example), tools for special purpose XML processing (constrained to a fixed document type or set of document types, for example), and fault-tolerant editing or construction tools for working with partial, ill-formed or invalid XML documents before they meet the requirements of conforming XML processors. Implementation of these kinds of tools is not well supported, and may even be hindered, by XML toolkits based on either validating or nonvalidating processors conforming to the XML specification.

The simplicity of the shallow parsing model based on regular expressions suggest suggests some interesting possible directions for development of XML. First of all, a shallow parsing representation such as that produced by REX could be a useful reference representation for a revised XML specification. Such a reference representation would have the advantage of providing a language-independent approach to shallow parsing encoded in the standard, with a language-independent implementation framework based on regular expressions. Furthermore, it may be possible to relax certain XML restrictions that can be easily accommodated by regular-expression processing, such as that restriction that attributed values must always be quoted. However, possibilities such as these must be carefully weighed by the overall XML development community.

References

[XML 1.0]
Bray, Tim, Paoli, Jean, and Sperberg-McQueen, C. M. Extensible Markup Language (XML) 1.0, W3C Recommendation REC-xml-19980210, World-Wide Web Consortium, February 10, 1998. URL: http://www.w3.org/TR/1998/REC-xml-19980210.
[WD-970807]
Bray, Tim, Paoli, Jean, and Sperberg-McQueen, C. M. Extensible Markup Language (XML), W3C Working Draft WD-xml-970807, World-Wide Web Consortium, August 7, 1997. URL: http://www.w3.org/TR/WD-xml-970807.
[WD-961114]
Bray, Tim, and Sperberg-McQueen, C. M. Extensible Markup Language (XML), W3C Working Draft WD-xml-961114, World-Wide Web Consortium, November 14, 1996. URL: http://www.w3.org/TR/WD-xml-961114.
[DeRose]
DeRose, Steven J. The SGML FAQ Book, Kluwer, Boston, 1997.
[Literate Programming]
Knuth, Donald E. Literate Programming, CSLI Lecture Notes Number 27, Center for the Study of Language and Information, Stanford University, 1992.

Appendix A: Shallow Parsing in Perl

# REX/Perl 1.0 
# Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
# Technical Report TR 1998-17, School of Computing Science, Simon Fraser 
# University, November, 1998.
# Copyright (c) 1998, Robert D. Cameron. 
# The following code may be freely used and distributed provided that
# this copyright and citation notice remains intact and that modifications
# or additions are clearly identified.

$TextSE = "[^<]+";
$UntilHyphen = "[^-]*-";
$Until2Hyphens = "$UntilHyphen(?:[^-]$UntilHyphen)*-";
$CommentCE = "$Until2Hyphens>?";
$UntilRSBs = "[^\\]]*](?:[^\\]]+])*]+";
$CDATA_CE = "$UntilRSBs(?:[^\\]>]$UntilRSBs)*>";
$S = "[ \\n\\t\\r]+";
$NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
$NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
$Name = "(?:$NameStrt)(?:$NameChar)*";
$QuoteSE = "\"[^\"]*\"|'[^']*'";
$DT_IdentSE = "$S$Name(?:$S(?:$Name|$QuoteSE))*";
$MarkupDeclCE = "(?:[^\\]\"'><]+|$QuoteSE)*>";
$S1 = "[\\n\\r\\t ]";
$UntilQMs = "[^?]*\\?+";
$PI_Tail = "\\?>|$S1$UntilQMs(?:[^>?]$UntilQMs)*>";
$DT_ItemSE = "<(?:!(?:--$Until2Hyphens>|[^-]$MarkupDeclCE)|\\?$Name(?:$PI_Tail))|%$Name;|$S";
$DocTypeCE = "$DT_IdentSE(?:$S)?(?:\\[(?:$DT_ItemSE)*](?:$S)?)?>?";
$DeclCE = "--(?:$CommentCE)?|\\[CDATA\\[(?:$CDATA_CE)?|DOCTYPE(?:$DocTypeCE)?";
$PI_CE = "$Name(?:$PI_Tail)?";
$EndTagCE = "$Name(?:$S)?>?";
$AttValSE = "\"[^<\"]*\"|'[^<']*'";
$ElemTagCE = "$Name(?:$S$Name(?:$S)?=(?:$S)?(?:$AttValSE))*(?:$S)?/?>?";
$MarkupSPE = "<(?:!(?:$DeclCE)?|\\?(?:$PI_CE)?|/(?:$EndTagCE)?|(?:$ElemTagCE)?)";
$XML_SPE = "$TextSE|$MarkupSPE";


sub ShallowParse { 
  my($XML_document) = @_;
  return $XML_document =~ /$XML_SPE/g;
}

Appendix B: Shallow Parsing in JavaScript

// REX/Javascript 1.0 
// Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
// Technical Report TR 1998-17, School of Computing Science, Simon Fraser 
// University, November, 1998.
// Copyright (c) 1998, Robert D. Cameron. 
// The following code may be freely used and distributed provided that
// this copyright and citation notice remains intact and that modifications
// or additions are clearly identified.

TextSE = "[^<]+";
UntilHyphen = "[^-]*-";
Until2Hyphens = UntilHyphen + "([^-]" + UntilHyphen + ")*-";
CommentCE = Until2Hyphens + ">?";
UntilRSBs = "[^]]*]([^]]+])*]+";
CDATA_CE = UntilRSBs + "([^]>]" + UntilRSBs + ")*>";
S = "[ \\n\\t\\r]+";
NameStrt = "[A-Za-z_:]|[^\\x00-\\x7F]";
NameChar = "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]";
Name = "(" + NameStrt + ")(" + NameChar + ")*";
QuoteSE = '"[^"]' + "*" + '"' + "|'[^']*'";
DT_IdentSE = S + Name + "(" + S + "(" + Name + "|" + QuoteSE + "))*";
MarkupDeclCE = "([^]\"'><]+|" + QuoteSE + ")*>";
S1 = "[\\n\\r\\t ]";
UntilQMs = "[^?]*\\?+";
PI_Tail = "\\?>|" + S1 + UntilQMs + "([^>?]" + UntilQMs + ")*>";
DT_ItemSE = "<(!(--" + Until2Hyphens + ">|[^-]" + MarkupDeclCE + ")|\\?" + Name + "(" + PI_Tail + "))|%" + Name + ";|" + S;
DocTypeCE = DT_IdentSE + "(" + S + ")?(\\[(" + DT_ItemSE + ")*](" + S + ")?)?>?";
DeclCE = "--(" + CommentCE + ")?|\\[CDATA\\[(" + CDATA_CE + ")?|DOCTYPE(" + DocTypeCE + ")?";
PI_CE = Name + "(" + PI_Tail + ")?";
EndTagCE = Name + "(" + S + ")?>?";
AttValSE = '"[^<"]' + "*" + '"' + "|'[^<']*'";
ElemTagCE = Name + "(" + S + Name + "(" + S + ")?=(" + S + ")?(" + AttValSE + "))*(" + S + ")?/?>?";
MarkupSPE = "<(!(" + DeclCE + ")?|\\?(" + PI_CE + ")?|/(" + EndTagCE + ")?|(" + ElemTagCE + ")?)";
XML_SPE = TextSE + "|" + MarkupSPE;


function ShallowParse(XMLdoc) {
  return XMLdoc.match(new RegExp(XML_SPE, "g"));
}

Appendix C: Shallow Parsing in Lex/Flex

/* REX/Lex 1.0 
   Robert D. Cameron "REX: XML Shallow Parsing with Regular Expressions",
   Technical Report TR 1998-17, School of Computing Science, Simon Fraser 
   University, November, 1998.
   Copyright (c) 1998, Robert D. Cameron. 
   The following code may be freely used and distributed provided that
   this copyright and citation notice remains intact and that modifications
   or additions are clearly identified.  */
TextSE	[^<]+
UntilHyphen	[^-]*-
Until2Hyphens	{UntilHyphen}([^-]{UntilHyphen})*-
CommentCE	{Until2Hyphens}>?
UntilRSBs	[^\]]*]([^\]]+])*]+
CDATA_CE	{UntilRSBs}([^\]>]{UntilRSBs})*>
S	[ \n\t\r]+
NameStrt	[A-Za-z_:]|[^\x00-\x7F]
NameChar	[A-Za-z0-9_:.-]|[^\x00-\x7F]
Name	({NameStrt})({NameChar})*
QuoteSE	\"[^"]*\"|'[^']*'
DT_IdentSE	{S}{Name}({S}({Name}|{QuoteSE}))*
MarkupDeclCE	([^\]"'><]+|{QuoteSE})*>
S1	[\n\r\t ]
UntilQMs	[^?]*\?+
PI_Tail	\?>|{S1}{UntilQMs}([^>?]{UntilQMs})*>
DT_ItemSE	\<(!(--{Until2Hyphens}>|[^-]{MarkupDeclCE})|\?{Name}({PI_Tail}))|%{Name};|{S}
DocTypeCE	{DT_IdentSE}({S})?(\[({DT_ItemSE})*]({S})?)?>?
DeclCE	--({CommentCE})?|\[CDATA\[({CDATA_CE})?|DOCTYPE({DocTypeCE})?
PI_CE	{Name}({PI_Tail})?
EndTagCE	{Name}({S})?>?
AttValSE	\"[^<"]*\"|'[^<']*'
ElemTagCE	{Name}({S}{Name}({S})?=({S})?({AttValSE}))*({S})?\/?>?
MarkupSPE	\<(!({DeclCE})?|\?({PI_CE})?|\/({EndTagCE})?|({ElemTagCE})?)
XML_SPE	{TextSE}|{MarkupSPE}

Appendix D: An Interactive Shallow Parsing Demo

This appendix contains an interactive regular expression demo using the regular expression support facilities of Javascript 1.2.

Choose an XML regular expression:

RE Name
RE Value
Input
Skipped
Match:
Remain: