The Design of RELAX NG

Abstract

RELAX NG is a new schema language for XML. This paper discusses various aspects of the design of RELAX NG including the treatment of attributes, datatyping, mixed content, unordered content, namespaces, cross-references and modularity.

Composability

RELAX NG is designed to be highly composable. A schema language (or indeed a programming language) provides a number of atomic objects and a number of methods of composition. The methods of composition can be used to combine atomic objects into compound objects which can in turn be composed into further compound objects. The composability of the language is the degree to which the various methods of composition can be applied uniformly to all the various objects of the language, both atomic and compound. For example, RELAX NG provides a choice element that can be applied uniformly to elements, attributes, datatypes and enumerated values. This is not mere syntactic overloading. The choice element has a single uniform semantic in all these cases and can have a single implementation. Another example is the grammar element, which is the container for definitions. The grammar element is just another pattern and can be composed in just the same way as other patterns. Composability improves ease of learning and ease of use. Composability also tends to improve the ratio between complexity and power: for a given amount of complexity, a more composable language will be more powerful than a less composable one.

XML syntax

RELAX NG uses XML instance syntax to express schemas. Although this makes for a rather verbose schema language, it has some major advantages. Since a user of an XML schema language must necessarily already learn XML instance syntax, using XML instance syntax for the schema language reduces the learning burden on a schema user. It also allows XML tools and technologies to be applied to the schema. For example, a schema can be used to specify the syntax of the schema language. Another important benefit of XML syntax is extensibility. RELAX NG has an open syntax that allows the RELAX NG defined elements and attributes to be annotated with elements and attributes from other namespaces. RELAX NG DTD Compatibility [12] uses this annotation mechanism to extend RELAX NG with a mechanism for declaring default values for attributes. RelaxNGCC [23] uses this annotation mechanism to allow users to embed Java code in RELAX NG schemas, which gets executed as an XML document is parsed against the schema. An unofficial non-XML syntax for RELAX NG has also been developed [8]. The non-XML syntax can be used for authoring RELAX NG schemas by hand and can then be transformed into the standard RELAX NG XML syntax for interchange.

Attributes

One of the questions most frequently asked by newcomers to the SGML and XML world is how to choose whether to use an attribute or element to represent something. Reasonable people differ on the answer to this question. In many cases, the choice is somewhat arbitrary and is largely a matter of the taste of the document designer. RELAX NG therefore aims to treat attributes as uniformly as possible with elements. In this respect RELAX NG is very different from many XML schema languages such as XML DTDs, W3C XML Schema and RELAX, which each provide separate facilities for dealing with elements and with attributes. This aspect of RELAX NG comes from TREX. One inspiration for this was XSLT and XPath, which are two successful technologies that take an approach that is even-handed as between elements and attributes. This uniform treatment is a significant factor in simplifying the language: there is one set of facilities that is applied uniformly to elements and attributes rather than two distinct sets of facilities.

The mechanism that RELAX NG uses to give attributes uniform treatment is to extend DTD-style content models to include attributes as well as elements. The content of an XML element consists of a sequence of elements and strings. Accordingly, a model for the content of an XML element can be understood to be denoting a set of such sequences. The XML DTD content model operators (|,*+?) correspond to operations on such sets. For example, the choice operator (|) corresponds to set union. To extend content models to include attributes, we first augment the content of an XML element with its attributes; instead of the content of an element, we use an attribute-set/content pair. The content is a sequence of elements and strings as before. The attribute set is a set of name/value pairs. An extended content model thus denotes a set of attribute-set/content pairs. Each of the content model operators can be applied in a natural way to these sets of pairs. For example, the choice operator corresponds to set union just as before. The sequence operator (,) concatenates the content sequences occurring in its operands, but unions the attribute-sets. This ensures that the extended content models respect the unorderedness of attributes. At first, it might seem that the repetition operators (*+) make no sense for attributes, but in fact, when wildcards are allowed for attribute names, repetition of attributes becomes necessary. Although the theory underlying extended content models is a little tricky, in practice they are both easy to use and very powerful.
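
The set-of-pairs view described above can be made concrete with a small sketch. It is an illustration only, not how any RELAX NG validator is implemented: it handles finite models, represents an element by its name alone, and all function names are hypothetical. It also assumes that a group whose operands carry the same attribute name simply contributes nothing, whereas RELAX NG itself rules such schemas out by a separate restriction.

```python
from itertools import product

# A value is a pair (attributes, content):
#   attributes: frozenset of (name, value) pairs, so order is irrelevant
#   content:    tuple of child items, so order matters
# A model denotes a set of such pairs. Only finite models are sketched.

def attribute(name, value):
    """Model matching exactly one attribute and empty content."""
    return {(frozenset({(name, value)}), ())}

def element(name):
    """Model matching one child element (represented by its name only)."""
    return {(frozenset(), (name,))}

def choice(m1, m2):
    """The | operator: plain set union."""
    return m1 | m2

def group(m1, m2):
    """The , operator: concatenate content, union the attribute sets."""
    result = set()
    for (a1, c1), (a2, c2) in product(m1, m2):
        if {n for n, _ in a1} & {n for n, _ in a2}:
            continue  # assumption: a duplicated attribute name yields no match
        result.add((a1 | a2, c1 + c2))
    return result

id_attr = attribute("id", "x1")
title = element("title")
para = element("para")

# Where an attribute appears in a sequence makes no difference ...
assert group(id_attr, title) == group(title, id_attr)
# ... but the order of elements does.
assert group(title, para) != group(para, title)
```

Because the attribute component is a set and the content component is a sequence, the same group operator is order-insensitive for attributes and order-sensitive for elements without any special casing.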

The extension of content models to handle attributes creates a difficulty for attribute defaulting. For example, if the content model allows either attribute A or attribute B or neither attribute, how would attribute defaulting handle the absence of both attribute A and attribute B? Would defaulting add an attribute A or an attribute B? RELAX NG's solution is simply not to do attribute defaulting. Attribute defaulting is a kind of transformation: a transformation that adds attributes. But it is a very limited kind of transformation. Not only can it do nothing but add attributes, but it can only add an attribute when the value of the attribute to be added does not depend on the context, although it is often necessary for attributes to be defaulted in a context-dependent way, for example by inheritance. Although there is certainly a need for special-purpose transformation languages, it is not clear why this kind of transformation should alone be privileged by being included in a schema language. Omission of attribute defaulting is also consistent with the policy of equal treatment for elements and attributes. W3C XML Schema provides defaulting for both elements and attributes, but its defaulting for elements is quite different from its defaulting for attributes: defaulting for elements adds content to elements that were specified as empty, whereas defaulting for attributes adds attributes that were not specified at all. For compatibility with XML DTDs, RELAX NG DTD Compatibility defines an annotation that can be used to specify default attribute values. However, this can only be used for content models that do not go beyond XML DTDs in their use of attributes.

Infoset modification

The omission of support for default attributes from RELAX NG is part of a general policy in RELAX NG of not modifying or augmenting the infoset [16]. RELAX NG validation does not involve changing the information about the document that is passed to an application. One reason for this is that the processes of validation and infoset modification need to be capable of being performed independently. In some situations, there is a need to ensure that a document is valid with respect to some schema but no need to perform any additional processing at that stage and hence no need for an augmented infoset. In other situations, a document is already known to be valid but an augmented infoset is needed for additional processing.

The fact that RELAX NG validation does not involve infoset modification does not imply that applications cannot derive useful information from RELAX NG schemas. For example, it is possible to use a RELAX NG schema to assign types to elements and attributes; Sun's Multi-Schema Validator [19] supports this. Type assignment requires additional restrictions on RELAX NG schemas beyond those imposed by RELAX NG itself. There is a range of possible restrictions that can be imposed to facilitate type assignment: the more severe the restriction, the easier type assignment becomes; assigning types to elements which can contain subelements requires different restrictions than merely assigning datatypes to the string values of elements and attributes. By layering type assignment on top of RELAX NG validation, applications that do not require type assignment do not need to pay the cost for a feature that they do not use. It is also more flexible. For example, it allows there to be two schemas for a document: one strict schema that captures as many constraints as possible but does not satisfy the restrictions necessary for type assignment and another looser schema that cannot express all the constraints but which can be used for type assignment. One advantage to not performing type assignment with RELAX NG is that RELAX NG works well with existing APIs such as DOM and SAX. Type assignment would require major changes to report the types assigned to elements and attributes.

Datatyping

XML DTDs have a built-in, limited, rather ad hoc set of datatypes, which can be applied only to the values of attributes and not to the content of elements. RELAX NG differs in two major respects. Firstly, it allows datatypes to be specified uniformly for both attribute values and element content; this is in accordance with the philosophy of uniform treatment for elements and attributes. Secondly, RELAX NG decouples the schema language from the set of datatypes. RELAX NG is not tied to a single set of datatypes. The philosophy of RELAX NG is, like XML, to restrict itself purely to syntax. This restriction allows RELAX NG to be both simple and general. Defining specific datatypes is not simply a matter of syntax but involves semantics as well. The issue of what datatypes are useful is both more application-dependent and more open-ended than the purely syntactic issues that RELAX NG deals with. Instead, RELAX NG introduces the concept of a datatype library, which provides a semantic model for a collection of datatypes. Any collection of datatypes that can fit into the RELAX NG semantic model can potentially be used as a RELAX NG datatype library. In particular, the datatypes defined by W3C XML Schema Part 2 [1] can be used as a datatype library; the RELAX NG TC has published a set of guidelines [11] for this in order to promote interoperability. A vendor-independent Java interface has been developed for datatype libraries [10]. Any collection of datatypes implemented using this interface can be dynamically plugged in to any RELAX NG validator that supports this interface.

W3C XML Schema Part 2 defines both a collection of primitive datatypes and methods for deriving datatypes. With RELAX NG, the functionality relating to specific primitive datatypes is factored out into independent datatype libraries. However, the functionality relating to deriving datatypes is included in RELAX NG. W3C XML Schema Part 2 provides three methods for deriving datatypes. Derivation by restriction is provided in RELAX NG by allowing a reference to a datatype in a datatype library to specify a list of named parameters. Derivation by union is provided in RELAX NG by allowing the choice element to be applied to datatypes just as it is to elements or attributes. Derivation by list is provided in RELAX NG by the list element. The list element allows the normal RELAX NG content model operators (group, interleave, choice, oneOrMore, zeroOrMore, optional) to be used for specifying the sequence of tokens comprising the list. It is more powerful than W3C XML Schema Part 2, which allows only a minimum and maximum length for the number of tokens in the sequence to be specified.
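
The three derivation methods can be pictured by treating a datatype as a predicate on strings. The sketch below is a rough illustration under several assumptions: the integer function is a hypothetical stand-in for a datatype supplied by a library, only two parameter names are handled, and the list derivation supports only a fixed group of token types rather than the full set of content model operators.

```python
def integer(s):
    """Hypothetical library datatype: optional sign followed by ASCII digits."""
    body = s[1:] if s[:1] in "+-" else s
    return body.isascii() and body.isdigit()

def value(literal):
    """An enumerated value: matches exactly one string."""
    return lambda s: s == literal

def choice(*types):
    """Derivation by union: the same choice that applies to elements."""
    return lambda s: any(t(s) for t in types)

def restricted(base, **params):
    """Derivation by restriction through named parameters.
    Assumption: base is integer-like; only two parameters are sketched."""
    def check(s):
        if not base(s):
            return False
        n = int(s)
        lo = params.get("minInclusive")
        hi = params.get("maxInclusive")
        return (lo is None or n >= lo) and (hi is None or n <= hi)
    return check

def list_of(*token_types):
    """Derivation by list: split on whitespace, match tokens against a group."""
    def check(s):
        tokens = s.split()
        return len(tokens) == len(token_types) and all(
            t(tok) for t, tok in zip(token_types, tokens))
    return check

percent = restricted(integer, minInclusive=0, maxInclusive=100)
unit = choice(value("cm"), value("in"))
length = list_of(integer, unit)                 # a number, then a unit
percent_or_auto = choice(percent, value("auto"))
```

Even this restricted form of list goes beyond a length-bounded list of a single item type, since the number and the unit are tokens of different types in a fixed order.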

Mixed content

SGML does not restrict the occurrence of #PCDATA in content models. However, SGML suffers from the infamous pernicious mixed content bug, which causes certain content models involving #PCDATA to treat whitespace between tags as significant in surprising ways. This bug in SGML motivated XML to drastically restrict the use of #PCDATA in content models. Unfortunately, this prohibits many perfectly reasonable content models. RELAX NG restores the generality of SGML by removing the restriction on #PCDATA. (In RELAX NG, #PCDATA is represented by a text element.) It solves the pernicious mixed content bug by observing that the pernicious mixed content bug only arises in SGML because SGML parsers need to report whether whitespace is significant or insignificant. RELAX NG does not modify or augment the infoset and it therefore does not need to decide whether whitespace in mixed content is significant. RELAX NG can therefore lift the restriction imposed by XML without reintroducing the problem that motivated the imposition of the restriction.

Unordered content

SGML provides an & operator: A & B matches A followed by B or B followed by A. XML removed the & operator. RELAX NG reintroduces it with a twist. In SGML, a content model of A & B* requires all the B elements to be consecutive: the required A element cannot occur in between two B elements. Usually, users use the & operator because they want to allow child elements to occur in any order, so this restriction is undesirable. In RELAX NG, the corresponding operator has interleaving semantics. It matches any interleaving of a sequence containing a single A element and a sequence containing zero or more B elements; it thus allows the A element to occur anywhere, including between two B elements.
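
The difference between the two readings of A & B* can be checked by brute force. The sketch below is illustrative only; the function names are hypothetical, children are represented by bare element names, and the enumeration of interleavings is exponential, so it says nothing about how a real validator would proceed.

```python
def interleavings(xs, ys):
    """Yield every interleaving of two tuples, preserving each one's order."""
    if not xs:
        yield tuple(ys)
        return
    if not ys:
        yield tuple(xs)
        return
    for rest in interleavings(xs[1:], ys):
        yield (xs[0],) + rest
    for rest in interleavings(xs, ys[1:]):
        yield (ys[0],) + rest

def sgml_and(children):
    """SGML-style A & B*: A then all the Bs, or all the Bs then A."""
    bs = len(children) - 1
    return list(children) in (["A"] + ["B"] * bs, ["B"] * bs + ["A"])

def relaxng_interleave(children):
    """Interleave of a single A with zero or more Bs."""
    bs = len(children) - 1
    return bs >= 0 and tuple(children) in set(
        interleavings(("A",), ("B",) * bs))

# A between two Bs: rejected under the SGML reading, accepted by interleave.
assert not sgml_and(["B", "A", "B"])
assert relaxng_interleave(["B", "A", "B"])
```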

XML removed the & operator mainly because of the & operator's reputation for implementation complexity. The most difficult part of implementing the & operator in SGML is detecting whether a content model including & is 1-unambiguous. Unlike SGML, XML and W3C XML Schema, RELAX NG does not restrict content models to be 1-unambiguous, so this implementation difficulty is removed. The classic implementation technique for SGML and XML content models is to construct a Glushkov automaton. The 1-unambiguity restriction is helpful for this technique because it ensures that the Glushkov automaton is deterministic. An interleaving operator causes difficulty with this technique. However, there is an alternative implementation technique available [17] based on derivatives of regular expressions [4]. This handles content models that are not 1-unambiguous without any additional effort and can deal with interleaving without difficulty. RELAX NG imposes restrictions on the use of interleave which are sufficient to ensure that a derivative-based implementation will not exhibit exponential behavior.
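
A much simplified sketch of the derivative idea follows. It is not the algorithm of [17]: it matches only a flat sequence of element names, ignores attributes, text and nested content, and does no memoization. The pattern representation and function names are assumptions made for the illustration.

```python
# Patterns are tuples. EMPTY matches the empty sequence; NOT_ALLOWED matches nothing.
EMPTY = ("empty",)
NOT_ALLOWED = ("notAllowed",)

def elem(name):
    return ("element", name)

def choice(p, q):
    if p == NOT_ALLOWED:
        return q
    if q == NOT_ALLOWED:
        return p
    return ("choice", p, q)

def _pair(kind, p, q):
    # Shared simplification for group and interleave.
    if NOT_ALLOWED in (p, q):
        return NOT_ALLOWED
    if p == EMPTY:
        return q
    if q == EMPTY:
        return p
    return (kind, p, q)

def group(p, q):
    return _pair("group", p, q)

def interleave(p, q):
    return _pair("interleave", p, q)

def zero_or_more(p):
    return ("zeroOrMore", p)

def nullable(p):
    """Does the pattern match the empty sequence?"""
    kind = p[0]
    if kind in ("empty", "zeroOrMore"):
        return True
    if kind == "choice":
        return nullable(p[1]) or nullable(p[2])
    if kind in ("group", "interleave"):
        return nullable(p[1]) and nullable(p[2])
    return False  # element, notAllowed

def deriv(p, name):
    """The pattern that matches what may follow after seeing `name`."""
    kind = p[0]
    if kind == "element":
        return EMPTY if p[1] == name else NOT_ALLOWED
    if kind == "choice":
        return choice(deriv(p[1], name), deriv(p[2], name))
    if kind == "group":
        d = group(deriv(p[1], name), p[2])
        return choice(d, deriv(p[2], name)) if nullable(p[1]) else d
    if kind == "interleave":
        # The next item may come from either operand.
        return choice(interleave(deriv(p[1], name), p[2]),
                      interleave(p[1], deriv(p[2], name)))
    if kind == "zeroOrMore":
        return group(deriv(p[1], name), p)
    return NOT_ALLOWED  # empty, notAllowed

def matches(p, names):
    for name in names:
        p = deriv(p, name)
    return nullable(p)

a_and_bs = interleave(elem("A"), zero_or_more(elem("B")))
# Not 1-unambiguous: after an A it is not yet known which branch applies.
ambiguous = choice(group(elem("A"), elem("B")), group(elem("A"), elem("C")))
```

Here matches(a_and_bs, ["B", "A", "B"]) succeeds, and matches(ambiguous, ["A", "C"]) succeeds without any lookahead, because the derivative simply carries both branches forward as a choice.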

Namespaces

The XML Namespaces Recommendation [2] was published after the XML 1.0 Recommendation. XML Namespaces are layered on top of XML 1.0 and do not affect the semantics of XML 1.0 including the semantics of DTD validation. This means that DTD validation is not namespace-aware: it treats prefixes of element and attribute names as significant. On a namespace-aware view, it is the namespace URIs to which the prefixes are bound that should be significant rather than the prefixes themselves. RELAX NG validation is namespace-aware. For many applications of XML, namespaces are critical. However, there are also many other uses of XML, particularly in closed environments, where namespaces are not needed. RELAX NG therefore tries to ensure that none of the complexity related to namespaces affects users that do not make use of the namespaces support.

The mechanisms introduced by the XML Namespaces Recommendation are purely syntactic. The XML 1.0 Recommendation provides a syntax for representing a tree of elements and attributes in which each element and attribute is labeled with a simple, unstructured name. The XML Namespaces Recommendation extends this to allow the label of elements and attributes to be qualified with a namespace URI. This is all it does [7]. It makes no guarantees that the namespace URI refers to anything. The namespace URI is just part of the label of an element or attribute. RELAX NG takes this same syntactic view. It makes no assumptions about the usage of XML namespaces that go beyond what is specified in the XML Namespaces Recommendation. This contrasts with the approach of W3C XML Schema, which assumes that a namespace URI is associated with a schema. The advantages of the RELAX NG approach are simplicity and generality. RELAX NG has no problems representing vocabularies such as XSLT and RDF that make atypical use of XML namespaces.

The purpose of XML Namespaces is to enable extensibility. Extensions defined by a particular organization can be clearly identified by using a namespace URI controlled by that organization. To support this, a schema language needs to be able to specify that a schema is open to various kinds of extension at various points. For example, a schema language needs to be able to say that an arbitrary attribute is allowed on an element provided the name of the attribute is namespace qualified. RELAX NG provides very general support for this through the idea of a name class. A name class denotes a set of names, where a name is a pair consisting of a namespace URI and a local name. There are three kinds of atomic name class: a single specific name, any name with a particular namespace URI and any name whatsoever. Name classes can be composed using set union and set difference. RDF provides a good example of where this flexibility is needed: in RDF the name of an element specifying a property can be anything with a non-null namespace URI except rdf:Description, rdf:RDF, rdf:ID, rdf:about, rdf:aboutEach, rdf:bagID, rdf:parseType or rdf:resource. An important feature of RELAX NG is that the name of an element is specified independently of its attributes and content. When the name of an element is specified as an open name class, all the normal facilities of RELAX NG remain available for specifying its attributes and content. For example, in XSLT an element such as xsl:if can contain certain specific elements from the XSLT namespace and arbitrary elements from non-XSLT namespaces. These non-XSLT elements specify literal result elements; although the names of these elements can be from any non-XSLT namespace, their contents have the same constraints as elements from the XSLT namespace such as xsl:if. 
RELAX NG's support for extensibility avoids making any assumptions about what extensibility policies are appropriate for schemas, but instead provides general facilities that are sufficient to describe almost any extensibility policy that a schema author may choose.
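
Name classes as sets of names, closed under union and difference, can be sketched with predicates. The function names below are hypothetical, and the sketch assumes that an absent namespace is represented by the empty string. The RDF example follows the list of excluded names given above.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A name is a (namespace URI, local name) pair.
# A name class is modelled as a predicate on names.

def name(ns, local):
    """Atomic class: one specific name."""
    return lambda n: n == (ns, local)

def ns_name(ns):
    """Atomic class: any name with a particular namespace URI."""
    return lambda n: n[0] == ns

def any_name():
    """Atomic class: any name whatsoever."""
    return lambda n: True

def union(*classes):
    return lambda n: any(c(n) for c in classes)

def difference(base, excluded):
    return lambda n: base(n) and not excluded(n)

reserved = union(*(name(RDF, local) for local in (
    "Description", "RDF", "ID", "about", "aboutEach",
    "bagID", "parseType", "resource")))

# Anything with a non-null namespace URI, except the reserved RDF names.
# Assumption: "" stands for the null namespace.
property_name = difference(any_name(), union(ns_name(""), reserved))
```

The resulting class constrains only the name; what attributes and content such an element may have is specified separately, as for any other element.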

Customization

The main mechanism provided by XML DTDs for customization is overriding the parameter entity definitions. RELAX NG also supports definition overriding, but provides two improvements. One is that RELAX NG makes the order of definitions within a grammar insignificant. A definition is not required to come before references to that definition. Overriding definitions are distinguished by being placed within the include element that references the schema containing the overridden definition, in a similar fashion to the internal subset of a DOCTYPE declaration. This gives schema authors the freedom to order their definitions as they see fit. It also makes it explicit when a definition is overriding another definition. The other improvement is that RELAX NG allows multiple definitions to be combined together. This is similar to the way that XML 1.0 DTDs allow multiple attribute list declarations for a single element type. Unlike XML 1.0 DTDs, RELAX NG requires schema authors to indicate explicitly when a definition is to be combined (by using a combine attribute) and allows combination of arbitrary patterns using either the interleave or choice operator rather than restricting the facility to just attributes.
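
The two customization facilities can be pictured with a toy grammar model. In this sketch a definition is a hypothetical (name, combine, pattern) triple, patterns are opaque values, and the rule enforced for multiple definitions (they must agree on one combine method, with at most one definition omitting it) reflects my reading of the RELAX NG specification rather than anything stated in this paper.

```python
def merge(definitions):
    """Collapse (name, combine, pattern) triples into one pattern per name.
    combine is None, "choice" or "interleave"."""
    grouped = {}
    for name, combine, pattern in definitions:
        grouped.setdefault(name, []).append((combine, pattern))
    grammar = {}
    for name, entries in grouped.items():
        methods = {c for c, _ in entries if c is not None}
        missing = sum(1 for c, _ in entries if c is None)
        if len(entries) > 1 and (len(methods) != 1 or missing > 1):
            raise ValueError(
                f"definitions of {name!r} must agree on one combine method")
        pattern = entries[0][1]
        for _, nxt in entries[1:]:
            pattern = (next(iter(methods)), pattern, nxt)
        grammar[name] = pattern
    return grammar

def include(included, overrides):
    """Definitions placed inside the include replace same-named included ones."""
    replaced = {name for name, _, _ in overrides}
    kept = [d for d in included if d[0] not in replaced]
    return merge(kept + list(overrides))

base = [("inline", None, "text"), ("block", None, ("element", "p"))]

# Combination: extend inline with an extra alternative, stated explicitly.
extended = merge(base + [("inline", "choice", ("element", "em"))])

# Overriding: replace inline entirely from within the include.
customized = include(base, [("inline", None, ("element", "b"))])
```

Grouping by name before merging is what makes the position of a definition irrelevant: a definition may appear before or after the references to it.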

Inheritance

One of the most significant differences between RELAX NG and W3C XML Schema is that RELAX NG does not have any concept of inheritance. The support for inheritance in W3C XML Schema is probably the major contributor to the considerable complexity of W3C XML Schema Part 1. Yet, the inheritance mechanisms in W3C XML Schema do not allow W3C XML Schema to express any constraints that cannot be expressed in RELAX NG. Although W3C XML Schema has a very complex type system with two type hierarchies, one for elements (called substitution groups) and one for complex types, it supports only single inheritance. However, modern object-oriented languages, such as Java and C#, support multiple inheritance (at least for interfaces). Thus, in general the inheritance structure of a class hierarchy cannot be represented in a schema. Inheritance has proven to be very useful in modeling languages such as UML. However, I would argue that trying to make an XML schema language also be a modeling language is not a good idea. An XML schema language has to be concerned with syntactic details, such as whether to use elements or attributes, which are irrelevant to the conceptual model. Instead, I believe it is better to use a standard modeling language such as UML, which provides full multiple inheritance, to do conceptual modeling, and then generate schemas and class definitions from the model [5]. If a schema language is used in this way, then there is no need for it to support inheritance; the role of the schema language is purely to describe the XML syntax used to represent the conceptual model. RELAX NG has the advantage in this role that it provides more flexibility in the choice of syntax. For example, in W3C XML Schema the xsi:type attribute is a special case; it is the only attribute that can affect the content model of an element. In RELAX NG, any attribute can affect the content model in a quite general way. 
Thus, in situations where W3C XML Schema forces the use of the xsi:type attribute, RELAX NG allows the schema designer to choose the attribute name (or indeed choose to use a subelement instead of an attribute).

Identity constraints

The RELAX NG TC spent a considerable amount of time considering what support RELAX NG should provide for enforcing identity (uniqueness and cross-reference) constraints. In the end, the conclusion was that identity constraints were better separated out into a separate specification. Accordingly, RELAX NG itself provides no support for identity constraints. RELAX NG DTD Compatibility [12] provides support for traditional XML ID/IDREF attributes. There were a number of reasons for preferring separation. One reason is the relative difference in maturity. RELAX NG is based on finite tree automata; this is an area of computer science that has been studied for many years and is accordingly mature and well understood. The use of grammars for specifying document structures is based on more than 15 years of practical experience. By contrast, the area of identity constraints (beyond simple ID/IDREF constraints) is much less mature and is still the subject of active research. Another reason is that it is often desirable to perform grammar processing separately from identity constraint processing. For example, it may be known that a particular document is valid with respect to a grammar but not known that it satisfies identity constraints. The type system of the language that was used to generate a document may well be able to guarantee that it is valid with respect to the grammar; it is unlikely that it will be able to guarantee that it satisfies the identity constraints. A document assembled from a number of components may be guaranteed to be valid with respect to a grammar because of the validity of the components, but this will often not be the case with identity constraints. Even when a document is known to satisfy the identity constraints as well as be valid with respect to the grammar, it may be necessary to perform identity constraint processing in order to allow application programs to follow references. 
Another reason is that no single identity constraint language is suitable for all applications. Different applications have identity constraints of vastly different complexity. Some applications have complex constraints that span multiple documents [22]. Other applications need only a modest increment on the XML ID/IDREF mechanism. A solution that is sufficient for those applications with complex requirements is likely to be overkill for those applications with simpler requirements.

W3C XML Schema [24] provides a quite sophisticated identity-constraint mechanism. However, it has something of the feel of a specification within a specification. An element or attribute in the instance that participates in an identity constraint plays one of three roles: it can be a scope in which the constraint is enforced; it can be a target, which is an object which is unique in the scope; or it can be a field, which is part of the key which identifies the target within its scope. In W3C XML Schema, the target and field are identified using an XPath. However, the scope is identified by including the identity constraint specification in the declaration of the scoping element. This leads to the restriction that although identity constraints are hierarchical, there is no way to specify a reference to a key in another part of the hierarchy. A better approach would be to use a path also to specify the scope, thus completely decoupling the specification of identity constraints from the rest of the schema and opening the way to more complete constraints on key references.

Associating schemas with documents

With XML 1.0, the XML document uses a DOCTYPE declaration to identify the DTD with respect to which it is valid. There is no provision for a document to be validated with respect to a DTD that is specified independently of the document. This is unsatisfactory for interchange. When a document recipient receives a document from an untrusted sender, the recipient may need to check that the document is valid with respect to a particular DTD. The recipient cannot assume that the DOCTYPE declaration of the document correctly identifies that DTD. The recipient may want to validate against a DTD different from that used by the author: for example, the recipient may validate against a generalized, public DTD, whereas the author may validate against a restrictive, private DTD that is a subset of the public DTD. Unlike XML 1.0, RELAX NG does not tie a document to a single schema. The RELAX NG validation process has two inputs: a document and a schema against which to validate the document.

In fact, RELAX NG does not define any mechanism for associating a document with a RELAX NG schema. Although it is useful to be able to specify rules for determining the schema to be used to validate a particular document, this problem is not specific to RELAX NG. Validation is just one of many processes that can be applied to an XML document. For example, a user may wish to perform XInclude [21] processing or XSLT processing. A user may wish to perform validation before or after any of these other processes. The problem of associating a schema with a document is really just a special case of the problem of associating processing with a document. What is needed is a solution that can specify a series of processes to be applied to a document.

Database NULLs

XML users coming from the database world sometimes wish to represent database NULLs explicitly in an XML document. There are two plausible ways that a document might do this. One is to use an attribute to signal that an element has null content. This requires that the schema language be able to specify a choice between data and the presence of a particular attribute. Another way is to use an element to signal that an element has null content. This requires that the schema language be able to specify a choice between data and the presence of a particular element. W3C XML Schema does not meet either of the above requirements; thus without some special treatment for database NULLs, it would be awkward to explicitly represent database NULLs in a document. W3C XML Schema provides a special built-in xsi:nil attribute to deal with this. The situation with RELAX NG is different. RELAX NG can handle both of the above requirements. Thus, there is no need for RELAX NG to provide any special facility for database NULLs. If it is desired to standardize a representation of NULL, then this can be done without changing RELAX NG. Indeed, it is possible for RELAX NG explicitly to model the semantics of xsi:nil that are built into W3C XML Schema.

Acknowledgements

Murata Makoto and other members of the OASIS RELAX NG TC participated in the design of RELAX NG.

James Clark (jjc@thaiopensource.com)
