One of many approaches to managing XML data in an application is XML/Object binding, in which classes are created in an object-oriented programming language to encapsulate portions of the XML data format, together with conversion methods that translate in both directions between XML data and class instances.
There are applications in existence which automate the task of generating such classes and conversions, but they are fettered by the differences between the XML paradigm and the object-oriented paradigm, a problem which is sometimes referred to as the XML/Object impedance mismatch (an allusion to analogous problems in other domains).
This article explains the background to the problem in the context of W3C XML Schema language, and proposes a practical solution.
- XML and XML Schema Overview
- XML/Object Impedance Mismatch
XML and XML Schema Overview
The nature of the topic requires reference to specific details of XML and XML Schema. An overview of XML and XML Schema concepts that have some bearing on the XML/Object impedance mismatch is provided here. Skip to the next section if you’re already familiar with the terminology.
XML documents are a tree of nodes. The main node types of interest here are elements and text.
There is a single element that serves as the root node called the root or document element. Elements may contain any other nodes nested inside them. Elements have names, and may also contain attributes, which are name/value pairs. Attribute values are text.
<?xml version="1.0" encoding="utf-8"?>
<DocumentElement attributeName="Attribute value">
  <!-- Comment -->
</DocumentElement>
The names of attributes and elements may exist inside a namespace, or they may be unqualified. The use of a namespace in an XML instance document must be declared using a special xmlns attribute. Optionally, a prefix may be assigned to a namespace, allowing elements and attributes from more than one namespace to be referenced in the same context.
The use of namespaces allows elements and attributes with the same name to be defined in different XML data formats, without risk of conflicts should one data format choose to reference data from another format. The choice of the name of the prefix given to a namespace is entirely up to the application writing the XML. Namespaces are often URIs, but they are merely treated as identifiers, not as addresses from which to retrieve data.
<?xml version="1.0" encoding="utf-8"?>
<!-- Comment -->
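The point that a prefix is only a local alias for a namespace can be sketched in a few lines of Python using the standard library's ElementTree. The namespace urn:example:stock and the element and attribute names are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# Two documents binding the same (hypothetical) namespace to different prefixes.
DOC_A = '<a:Item xmlns:a="urn:example:stock" a:code="X1"/>'
DOC_B = '<stock:Item xmlns:stock="urn:example:stock" stock:code="X1"/>'

root_a = ET.fromstring(DOC_A)
root_b = ET.fromstring(DOC_B)

# ElementTree expands every name to "{namespace}local-name" and drops the prefix,
# so both documents produce identical element and attribute names.
print(root_a.tag)
print(root_a.tag == root_b.tag)
print(root_a.attrib == root_b.attrib)
```

The namespace string is never dereferenced here; it serves only as an identifier that qualifies the names.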
XML Schema is a language for defining the format of an XML document; XML Schemas are themselves XML documents, and they may import other schemas. The XML namespace prefix commonly given to XML Schema elements is xs, and that convention is used here. XML Schemas have a document element of xs:schema.
Kinds of Definitions
XML Schema allows for definitions of elements and attributes, as would be expected, but also allows the definition of simple types, complex types, model groups and attribute groups.
XML Schema considers all complex and simple types to be derived from a built-in xs:anyType type, which can contain any XML content.
The xs:simpleType element is used to define a type that can be used for either a text node or an attribute value in an XML instance document.
xs:anySimpleType is the base type for all simple types (it derives directly from xs:anyType), and there are 44 other built-in simple types in XML Schema 1.0, for representing data such as time, numeric values, strings and binary data.
User-defined simple types
New simple types can be defined by deriving from built-in types. Derivation may happen by restriction, where facets, such as xs:enumeration (to specify a permitted value) and xs:minLength, are used to constrain the content from a base type.
It is also possible to define a type that is a list of instances of another simple type, or a union of other simple types which allows any of the members of the union to be used in an instance document. In XML documents, list types are represented as strings with whitespace-separated members. This means that the members cannot have embedded whitespace themselves.
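As a minimal sketch of why list members cannot contain whitespace, the following Python function splits the lexical form of a list value the way a binding might. The function name parse_list and the sample values are hypothetical.

```python
def parse_list(lexical_value, item_parser=str):
    """Parse the lexical form of a list-type value: members separated by whitespace."""
    # str.split() with no arguments splits on any run of whitespace.
    return [item_parser(token) for token in lexical_value.split()]

# A list of integers, written with mixed whitespace between members.
sizes = parse_list("10  20\n30", int)
print(sizes)

# An intended two-member list whose first member contains a space
# is indistinguishable from a three-member list.
names = parse_list("New York Boston")
print(names)
```

The second call shows the limitation: nothing in the lexical form can mark "New York" as a single member.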
Complex types are defined with the xs:complexType element, and derive (ultimately) from the xs:anyType type. They represent the content of an element, and define whether or not it may contain text nodes, and what child elements and attributes there may be.
Complex types can be defined to derive from other complex types by either extension or restriction. Restriction is used to disallow content that is permitted in the base type, but it cannot add content. The restricted type must still be a valid instance of the base type, so only optional content may be restricted away.
Complex type extension can add new attributes, and append another content model for child nodes. This new content model appears immediately after the base type’s content model in instance documents.
Complex types may have mixed content, which means that text nodes may be interspersed with the child elements; it is also possible to define complex types to have simple content, which allows them to have attributes, and a child text node, but not child elements.
Elements are defined with the xs:element element. They may be declared as global, in which case they may be referenced in other element definitions. Only global elements can be the root element of XML instance documents.
Other element definitions define elements that have local scope. The same element name may be used by different local element definitions throughout the schema, provided they do not conflict with any global element names, or any other local element definitions in the same scope.
If an element definition is not a reference to a global element, then its content can be defined as either being a named global simple or complex type, or as an anonymous simple or complex type that is defined inline.
It is possible to declare that child elements and attributes belonging to particular namespaces may appear in an element in an instance document, without specifying what the names of those elements and attributes shall be. xs:any is used to declare such elements, and xs:anyAttribute is used to declare such attributes. Both these definitions have a namespace attribute, which can specify that content appears in
- Any namespace, or
- Any namespace other than the target namespace of the schema, or
- A set of specific namespaces, which may include
- The global namespace (that is, no namespace)
- The target namespace of the schema
A particle in XML Schema is an identifiable portion of a content model (but not necessarily a named portion). The standard requires that the particle to which any element encountered in an instance document belongs must be uniquely identifiable without considering any subsequent content. This is called Unique Particle Attribution, or UPA.
The particle may be a single element, or it may be a pattern of elements. A particle has a minimum and a maximum occurrence count. The minimum occurrence count may be any finite non-negative number. The maximum occurrence count may be any finite non-negative number (including 0), or unbounded.
Patterns of elements are defined by grouping particles into compositors. Compositors are themselves particles and so have occurrence counts. Compositors may be anonymous, or they may be given a name by placing them in a model-group, which is defined with an xs:group element.
A complex type, and hence any element supporting child elements, always has exactly one root compositor (possibly implicit, and possibly empty). There are three kinds of compositors: xs:sequence, xs:all and xs:choice. Sequence and choice allow other compositors to be nested inside them, whereas an all compositor allows only elements.
Sequence specifies the order of its nested particle definitions, all allows any order of its nested element definitions, and choice specifies that exactly one of its nested particle definitions may appear in the content model.
<xs:sequence minOccurs="0" maxOccurs="unbounded">
  <xs:element name="C" type="xs:string"/>
</xs:sequence>
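To make the relationship between particles, compositors and occurrence counts concrete, here is a small Python sketch of how a tool might model them in memory. The class and function names are hypothetical, and None is used as an assumed stand-in for an unbounded maximum.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class ElementParticle:
    name: str
    min_occurs: int = 1
    max_occurs: Optional[int] = 1  # None stands for "unbounded"

@dataclass
class Compositor:
    kind: str  # "sequence", "choice" or "all"
    particles: List[Union["Compositor", ElementParticle]] = field(default_factory=list)
    min_occurs: int = 1  # compositors are particles too, so they carry counts
    max_occurs: Optional[int] = 1

def occurrence_allowed(particle, count):
    """Return True if 'count' repetitions satisfy the particle's occurrence range."""
    if count < particle.min_occurs:
        return False
    return particle.max_occurs is None or count <= particle.max_occurs

# A sequence that may repeat any number of times, containing one element C.
repeated = Compositor("sequence", [ElementParticle("C")], min_occurs=0, max_occurs=None)
print(occurrence_allowed(repeated, 0))
print(occurrence_allowed(repeated.particles[0], 2))
```

Note that the compositor has no name of its own, which becomes relevant when classes have to be generated for it.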
Complex types derived by extension implicitly have a sequence root compositor, where the child particles are the base type’s content model and the appended content model of the derived type. This is true even where the base complex type’s root compositor is a choice.
It is possible to specify that an attribute has a default value. This is the value that an attribute shall be considered to have by an application if the attribute is not otherwise specified in the instance document. An attribute may alternatively have a fixed value, which is also a default, but if the attribute is specified, it must have the fixed value.
Elements that are defined to be an instance of a simple type, or that are instances of complex types with simple content (no child elements) may have a default or fixed value, but the behaviour is different from that for attributes. Default or fixed values are automatically applied only for an empty element, or element with only insignificant whitespace. The type of the element determines whether whitespace is significant.
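The difference between attribute defaults and element defaults can be sketched as follows. The helper names and the sample order document are hypothetical, and the sketch assumes an element type for which whitespace is insignificant.

```python
import xml.etree.ElementTree as ET

def attribute_value(element, name, default):
    """Attribute default: used whenever the attribute is absent."""
    return element.get(name, default)

def element_value(parent, name, default):
    """Element default: used only when the element is present but empty."""
    child = parent.find(name)
    if child is None:
        return None  # an absent element does not acquire the default
    text = child.text or ""
    # Assumption: whitespace is insignificant for this element's type.
    return default if text.strip() == "" else text

order = ET.fromstring("<order><currency> </currency></order>")
print(attribute_value(order, "priority", "normal"))  # attribute absent
print(element_value(order, "currency", "EUR"))       # element present but empty
print(element_value(order, "note", "none"))          # element absent
```

A binding has to preserve this asymmetry, since an absent element and an empty element mean different things to a validator.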
XML/Object Impedance Mismatch
Managing XML Data
XML data can be manipulated in software using libraries that support reading and writing the nodes of an XML document tree. Such libraries implement APIs such as W3C DOM (an in-memory tree of the XML nodes) and SAX (Simple API for XML, which executes user-defined call-back routines when an XML node is encountered during parsing). Users of the APIs interact with XML content directly.
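For comparison, here is a brief sketch of the two API styles using the DOM and SAX implementations in Python's standard library. The library document and the TitleCollector class are made up for illustration.

```python
import xml.sax
from xml.dom import minidom

XML = '<library><book title="A"/><book title="B"/></library>'

# DOM: the whole document is loaded into an in-memory tree, then navigated.
document = minidom.parseString(XML)
dom_titles = [book.getAttribute("title")
              for book in document.getElementsByTagName("book")]

# SAX: the parser invokes user-defined call-backs as each node is encountered.
class TitleCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        if name == "book":
            self.titles.append(attrs.getValue("title"))

handler = TitleCollector()
xml.sax.parseString(XML.encode("utf-8"), handler)
sax_titles = handler.titles

print(dom_titles, sax_titles)
```

In both cases the application code deals in element and attribute names as strings, which is the direct interaction with XML content that object binding aims to hide.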
Other approaches to interacting with XML data in an application include XQuery, a query language for extracting XML data, XSLT (Extensible Stylesheet Language Transformations) which is a programming language generally used for transforming XML data into another format, and Microsoft’s LINQ (Language Integrated Query) to XML which provides means both to query and construct XML documents. This article is concerned with the XML/Object binding approach to managing XML data.
XML/Object binding is the production of an object-oriented representation of the structure and types of an XML data format for a programming language. For example, instead of reading and creating a DOM XmlNode instance for an XML element spreadsheet-document, an application may instead operate on instances of a SpreadsheetDocument class, with the details of converting to and from XML encapsulated by that class. Reasons for preferring the object binding approach might include type-safety and access to the structure and types of the data in a manner that is idiomatic for the programming language being used.
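A hand-written binding of this kind might look like the following Python sketch. The title and sheets attributes are assumptions invented for the example, as the article does not define the content of spreadsheet-document.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class SpreadsheetDocument:
    title: str        # hypothetical attribute of spreadsheet-document
    sheet_count: int  # hypothetical attribute, typed as an integer in the binding

    @classmethod
    def from_xml(cls, text):
        """Build an instance from XML, hiding the parsing details from callers."""
        root = ET.fromstring(text)
        if root.tag != "spreadsheet-document":
            raise ValueError(f"unexpected document element: {root.tag}")
        return cls(title=root.get("title", ""),
                   sheet_count=int(root.get("sheets", "0")))

    def to_xml(self):
        """Serialize the instance back to XML."""
        root = ET.Element("spreadsheet-document",
                          {"title": self.title, "sheets": str(self.sheet_count)})
        return ET.tostring(root, encoding="unicode")

doc = SpreadsheetDocument.from_xml('<spreadsheet-document title="Budget" sheets="3"/>')
print(doc.sheet_count + 1)  # an int, not a string
print(doc.to_xml())
```

Callers work with typed members such as sheet_count and never see element or attribute names.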
Code Generation from XML Schemas
Whilst definitions of XML data formats are used primarily for validation of XML instance documents, if a machine-readable definition of an XML data format is available, the opportunity exists to automate the production of object bindings too. Several dozen free and commercial applications exist that can produce object bindings automatically from XML Schemas.
Automatic object binding for XML Schemas is useful if the details of managing XML are to be abstracted away, and familiar types and structures are to be used in an object-oriented programming language when operating on XML data. However, the differences between XML Schema’s type systems and content models on the one hand, and common programming languages’ type systems and content models on the other hand, mean that determining a working binding between the two is complicated, and determining a “palatable” binding without additional user input beyond the XML Schemas may be impossible. The incongruity between XML and object representations of data is sometimes referred to as the XML/Object impedance mismatch. Some of the difficulties are explained below.
XML Schema has a richer type system than commonly used object-oriented languages such as C++, C# or Java. It supports global, named definitions of attributes, attribute groups, elements, model groups, complex types and simple types, local elements and attributes, and anonymous local type and compositor definitions.
This wide variety of types introduces the problem of name conflicts, because, for example, a group, complex type, attribute group and element may all legally have the same name in XML Schema, but an object-binding algorithm which generated classes to represent each of these schema types with names based on the schema names would then conflict. Similarly, a local element declaration may conflict with a local attribute definition, with the likely consequence that class members conflict in name in the target language.
|Conflicting element and attribute names in same scope|
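A naive generator that names class members directly after schema names would run into this kind of clash. The following Python sketch shows one assumed way a tool could spot it; the function name and the input format are hypothetical.

```python
from collections import defaultdict

def find_member_conflicts(declarations):
    """Given (kind, schema_name) pairs declared in one complex type, return the
    names that would be used for more than one class member."""
    kinds_by_name = defaultdict(list)
    for kind, schema_name in declarations:
        kinds_by_name[schema_name].append(kind)
    return {name: kinds for name, kinds in kinds_by_name.items() if len(kinds) > 1}

declarations = [
    ("element", "Conflict"),
    ("attribute", "Conflict"),  # legal in XML Schema, but clashes as a member name
    ("element", "Other"),
]
conflicts = find_member_conflicts(declarations)
print(conflicts)
```

The schema itself is valid; the conflict only exists once both declarations are mapped into a single member namespace in the target language.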
Types may be declared nested inside other declarations, and although an obvious object mapping would avail of nested class support, usage of nested classes is more awkward than regular classes in languages such as C++ and C# due to verbosity in referencing them.
|Nesting of local definitions|
An idiomatic binding might avoid the nested classes and make these global, but this introduces the issue of what these classes should then be named, and possible conflicts with the names of other types.
Choices and Unions
In the case of XML Schema choice and union constructs, multiple alternative content models are permitted. The object binding might use a common base class, and rely on RTTI (Run-Time Type Identification) to disambiguate between the alternatives, but the common base class may well be object, or in the case of C++, which lacks a universal base class, not exist at all. Choice also poses the additional difficulty that it is anonymous, so a name must be invented if binding to a field in the object binding, e.g. Items, and as a single element may have multiple choices, these invented names must then be mangled to keep them distinct.
|Run-time type identification of choice|
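As a rough Python sketch of this style of binding, consider a choice between two alternatives held in a single member. The Circle, Rectangle and Drawing classes are hypothetical, as is the invented member name items.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Circle:
    radius: float

@dataclass
class Rectangle:
    width: float
    height: float

@dataclass
class Drawing:
    # The choice is anonymous in the schema, so "items" is an invented member name.
    items: List[Union[Circle, Rectangle]]

def describe(item):
    # Run-time type checks distinguish the alternatives of the choice.
    if isinstance(item, Circle):
        return f"circle of radius {item.radius}"
    if isinstance(item, Rectangle):
        return f"rectangle {item.width} x {item.height}"
    raise TypeError(f"unexpected choice member: {type(item).__name__}")

drawing = Drawing(items=[Circle(1.5), Rectangle(2, 3)])
descriptions = [describe(item) for item in drawing.items]
print(descriptions)
```

Every consumer of the member has to repeat this kind of type test, and nothing in the member's declared type stops an unrelated object from being stored in it at run time.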
XML Schema supports type derivation by extension, which maps naturally to inheritance in object-oriented languages.
However, it also supports type derivation by restriction (for both complex and simple types) and type derivation by list and union (for simple types only). Inheritance in OOP does not support these constructs, so bindings in the target language cannot readily retain the inheritance tree.
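The natural case, derivation by extension, can be sketched in Python as ordinary subclassing, with the derived type's content appended after the base type's content. The Address types and their members are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Address:
    """Binding of a hypothetical base complex type."""
    street: str
    city: str

    def child_elements(self):
        return [("street", self.street), ("city", self.city)]

@dataclass
class InternationalAddress(Address):
    """Binding of a type derived by extension: new content follows the base content."""
    country: str

    def child_elements(self):
        return super().child_elements() + [("country", self.country)]

address = InternationalAddress("1 Main Street", "Dublin", "Ireland")
print(isinstance(address, Address))
print(address.child_elements())
```

There is no equally direct counterpart for restriction: a subclass can add members, but it cannot remove or narrow members of its base class, which is what a restricted type does to its base type's content.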
XML Schema supports definition of the pattern of particles that may appear nested inside an element. A particle in this context means an element, a named model-group or one of the anonymous compositors: sequence, all and choice. The sequence compositor specifies that the particles it contains must appear in a particular order; the all compositor indicates that all of the particles are required but can appear in any order; and the choice compositor indicates that exactly one of the contained particles must appear as content.
The particles, including the compositors and model-groups, may be repeated, and the choice and sequence compositors may be nested indefinitely. Representing the structure as classes for repeated or nested compositors would require introducing classes for constructs that are anonymous in the XML Schema, introducing the problem of determining a suitable name for them.
|Possible object binding of repeated anonymous compositors|
As mentioned earlier, the choice compositor does not have a single natural object-oriented representation anyway, and the most appropriate binding may vary according to the nature of the data being represented.
Semantics vs Syntax
Lastly, the intended meaning of a construct in XML Schema cannot, AI aside, be “understood” by an automated application.
A trivial example of such a scenario would be a definition of a simple type for representing globally unique identifiers (GUIDs/UUIDs). This could take several forms, such as a length restriction of the built-in xs:hexBinary datatype, or as a pattern restriction of the xs:token datatype. Likely attempts at generating code for such a type in the target language would be as a byteblock, a string or some heavyweight type that accurately represents the XML Schema definition (from a validation point of view), but still fails to capture the intended meaning of the type.
The target language may have a suitable type for GUIDs, e.g. System.Guid in .NET, and that class may be what a user would prefer in the binding. However, even if an automated binding application had heuristics to recognize some of the ways a GUID might be represented as an XML Schema simple type, it could not recognize them all. The representations it did recognize could very well conflict with definitions of types not intended to be GUIDs, or, even where they were intended for GUIDs, fail to map correctly to the target language, e.g. due to endianness differences.
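The GUID scenario can be illustrated with Python's uuid module standing in for the target language's GUID type. The two conversion functions are hypothetical, and the sample value is arbitrary.

```python
import uuid

def guid_from_hex_binary(lexical):
    """Interpret a 16-byte xs:hexBinary value (32 hex digits) as a GUID."""
    return uuid.UUID(bytes=bytes.fromhex(lexical))

def guid_from_token(lexical):
    """Interpret a pattern-restricted xs:token in 8-4-4-4-12 form as a GUID."""
    return uuid.UUID(lexical)

HEX_FORM = "00112233445566778899AABBCCDDEEFF"
TOKEN_FORM = "00112233-4455-6677-8899-aabbccddeeff"

# Two different schema representations can denote the same identifier...
print(guid_from_hex_binary(HEX_FORM) == guid_from_token(TOKEN_FORM))

# ...but the same bytes read with a little-endian field layout give a different one.
little_endian = uuid.UUID(bytes_le=bytes.fromhex(HEX_FORM))
print(little_endian)
```

Nothing in either schema definition states which byte order was intended, so a tool can only guess, whereas the schema author knows.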
Given the many problems that are encountered when attempting to determine a scheme for automatically generating an object-oriented view of XML data, a handful of which have been described above, a one-size-fits-all canonical binding of all of XML Schema to the object-oriented paradigm does not seem likely to produce “palatable” output, that is, a generated object-oriented output whose style is broadly in keeping with the idioms and patterns of the target programming language and target application, and whose interface uses meaningfully named identifiers for types and members.
From a theoretical stand-point, this doesn’t warrant much optimism about the general problem of a fully-automatic XML/Object compiler.
However, for practical purposes, I don’t see this as being as huge an obstacle as some would make out, even for sophisticated use cases, because a tool doesn’t really need to be fully-automatic in order to provide significant effort-saving and code-maintainability benefits.
I perceive the practical solution as lying in the following combination of features:
- Schema annotations:
- The user must be able to easily annotate the schema with non-schema directives to control the way that the tool maps schema types and structures to the target language.
- Automatic detection and reporting of problems:
- The user needs to know what problems exist with mapping to the target language, and where they are in the schema, so that they can add those annotations.
- Extensibility of the compiler and run-time:
- Even with annotations, there may be further refinements needed for a given case. A user must be able to extend the code generation machinery to achieve the output they want, without having to change the generator implementation directly, and without having to rewrite all generator functionality from scratch.
- Similarly, it must be possible for a user to control the treatment of the object-bindings at run-time rather than being locked into, for example, a single XML library.
An XML/Object binding compiler that operates on XML Schemas should be capable, on the one hand, of producing correctly working output code from an unmodified XML Schema, but also allow for human intervention to improve the quality of the generated output.
The human intervention envisaged is emphatically not hand-editing of the generated code, but instead annotation of the schema by the user with directives specifically related to object-binding (and not affecting the XML Schema definitions themselves, merely supplementing them).
This is supported by some existing tools, such as Java’s JAXB generator.
XML Schema language allows annotations for application-specific purposes to be embedded within schema documents, both as custom attributes on definition elements, and as custom elements inside the xs:appinfo element. These annotations must appear in an XML namespace other than the namespace of XML Schemas. In the example below, the use of both attribute and element annotations is shown:
<xs:element name="A" type="CustomType">
  <xs:annotation><xs:appinfo>
    <customAnnotation:TypeInformation>
      <customAnnotation:TypeName value="UserType" />
      <customAnnotation:ToXml value="UserType.ConvertToXml" />
      <customAnnotation:FromXml value="UserType.ConvertFromXml" />
    </customAnnotation:TypeInformation>
  </xs:appinfo></xs:annotation>
</xs:element>
<xs:element name="Conflict" customAnnotation:name="AlternativeName"/>
<xs:attribute name="Conflict" type="xs:string"/>
The annotations shown in this example are merely for illustration, and aren’t meant to document the specific syntax of any particular implementation.
The customAnnotation:TypeInformation element here might allow information about how to bind a type in the target language to be provided directly by a user, overriding the default binding scheme.
The customAnnotation:name attribute might be used to resolve the conflict between the element and attribute names.
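To give a feel for how a tool might consume such annotations, here is a Python sketch that reads both styles from a small schema. The annotation namespace urn:example:custom-annotation and the function names are assumptions, and the annotation vocabulary is the illustrative one used above, not that of any real tool.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
CA = "urn:example:custom-annotation"  # hypothetical annotation namespace

SCHEMA = f"""<xs:schema xmlns:xs="{XS}" xmlns:customAnnotation="{CA}">
  <xs:element name="A" type="CustomType">
    <xs:annotation>
      <xs:appinfo>
        <customAnnotation:TypeInformation>
          <customAnnotation:TypeName value="UserType"/>
        </customAnnotation:TypeInformation>
      </xs:appinfo>
    </xs:annotation>
  </xs:element>
  <xs:element name="Conflict" customAnnotation:name="AlternativeName"/>
</xs:schema>"""

def binding_names(schema_text):
    """Map each element's schema name to the name to use in the binding."""
    root = ET.fromstring(schema_text)
    names = {}
    for element in root.iter(f"{{{XS}}}element"):
        schema_name = element.get("name")
        # Attribute-style annotation: an override in the annotation namespace.
        names[schema_name] = element.get(f"{{{CA}}}name") or schema_name
    return names

def type_overrides(schema_text):
    """Collect user-specified target types from element-style annotations."""
    root = ET.fromstring(schema_text)
    path = (f"{{{XS}}}annotation/{{{XS}}}appinfo/"
            f"{{{CA}}}TypeInformation/{{{CA}}}TypeName")
    overrides = {}
    for element in root.iter(f"{{{XS}}}element"):
        type_name = element.find(path)
        if type_name is not None:
            overrides[element.get("name")] = type_name.get("value")
    return overrides

print(binding_names(SCHEMA))
print(type_overrides(SCHEMA))
```

Because the annotations live in their own namespace, a schema validator ignores them and the schema's meaning for validation is unchanged.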
Detection and Reporting of Problems
It’s all very well to be able to annotate a schema to resolve mapping problems, but for large schemas or schemas that were written by a third-party, tracking down poor quality bindings will take some effort.
Whilst there will be many ways in which a user may wish to customize the generated output to match the naming conventions, type-system, etc. of the application that will be using it, and identifying such cases is outside the scope of automation, the fundamental XML/Object impedance mismatch problems can nonetheless be identified automatically by a tool.
To achieve this, I see the solution in an XML/Object compiler which has both strict and lax modes of operation:
- Lax mode:
- The compiler will generate working code from any XML schema, even in the face of issues that are likely to give rise to poor-quality output (such as choosing heavy-weight bindings that may not be needed or desired, assigning generated names to classes or properties).
- However, all such issues will be reported as warnings by the tool, indicating the reason for the problem, the schema definition and location that is problematic and one or more suggested annotations that might help resolve the problem.
- Strict mode:
- The compiler will terminate execution when it encounters an impedance mismatch problem that has not been addressed by a suitable user-supplied annotation.
This approach achieves three desirable outcomes:
- Supporting agile development:
- Functional but potentially-inelegant code can be generated with minimal effort.
- Improvements can be introduced iteratively.
- Feedback on progress made resolving issues:
- The compiler warnings guide the user to fixing the problems with the mapping to the target language.
- Warnings disappear as problems are addressed.
- Enforcement of quality in production code:
- Once all problems are resolved, the compiler can be switched to strict mode.
- This causes builds to fail if regressions are introduced (most likely if the schema is under active development).
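The interplay of the two modes can be sketched in a few lines of Python. Everything here is hypothetical: the MappingIssue record, the check_mapping function and the way annotated locations are identified are assumptions chosen to keep the sketch short.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MappingIssue:
    location: str    # where in the schema the problem lies
    reason: str      # why the default binding is poor
    suggestion: str  # an annotation that might resolve it

class StrictModeError(Exception):
    """Raised in strict mode for an issue with no resolving annotation."""

def check_mapping(issues, annotated_locations, strict=False):
    """Return warning messages in lax mode; raise on the first unresolved
    issue in strict mode. Annotated issues are considered resolved."""
    warnings = []
    for issue in issues:
        if issue.location in annotated_locations:
            continue
        message = f"{issue.location}: {issue.reason}; consider {issue.suggestion}"
        if strict:
            raise StrictModeError(message)
        warnings.append(message)
    return warnings

issues = [MappingIssue("complexType 'Order', line 12",
                       "element and attribute are both named 'Conflict'",
                       "a name annotation on one of them")]

# Lax mode: code generation proceeds and the issue is reported.
for line in check_mapping(issues, annotated_locations=set()):
    print("warning:", line)

# Strict mode passes only once every issue has been annotated.
print(check_mapping(issues, {"complexType 'Order', line 12"}, strict=True))
```

Switching a build from lax to strict is then a single flag change, made once the warning list is empty.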
Even with a rich set of supported annotations, a user of an XML/Object compiler might hit a “brick-wall” where the compiler cannot produce the right kind of output for the application. The inability of the tool to support a particular feature should not be an insurmountable problem that requires expensive resolutions such as throwing away the generator to opt for an alternative, or even to give up on automatic generation altogether and to write the object-bindings by hand.
Extensibility can be provided within the implementation of the compiler, by supporting plug-ins, through a documented API, for all the major subcomponents, either as a complete replacement for, or as a refinement of, the built-in implementations.
The generated output itself should also allow for a user to take full control over which XML parsers and writers are used, how files or streams of XML data are read in and written out, and all the small details of the XML data management, such as how the files should be formatted, and any processing instructions that should be added.
This way, users with the most sophisticated use-cases can adapt a tool which nearly but not quite supports what they need out-of-the-box, to achieve exactly what they need, without the need for invasive changes to the tool or drastic measures like hand-editing generated code or writing all the bindings by hand.
Some further reading:
- For another take on the subject, see Ralf Lämmel & Erik Meijer’s excellent paper Revealing the X/O impedance mismatch which goes into detail on many of the mismatches between the XML and OO paradigms.
- The W3C’s XML Schema Patterns for Databinding Working Group produced a catalogue of XML Schema structures that were common in real-world schemas, but problematic for tools, in their Advanced XML Schema Patterns for Databinding document.