Chapter 24. XML

Every now and then, an idea comes along that in retrospect seems just so simple and obvious that everyone wonders why it hadn’t been seen all along. Often when that happens, it turns out that the idea isn’t really all that new after all. The Java revolution began by drawing on ideas from generations of programming languages that came before it. XML—the Extensible Markup Language—does for content what Java did for programming: draws on some old ideas and uses them to provide a portable way to describe data.

XML is a simple, common format for representing structured information as text. The concept of XML follows the success of HTML as a universal document presentation format and generalizes it to handle any kind of data. In the process, XML has not only recast HTML, but has transformed the way many businesses think about their information. In the context of a world driven more and more by documents and data exchange, XML is an important foundation technology.

The Butler Did It

This chapter is one of the longest in this book and deals with many APIs and concepts. Part of the reason is that XML tools have evolved a great deal over time to support working with XML at different levels of abstraction. In this chapter, we’ll introduce the APIs that we think remain important and useful, starting at the bottom level and working our way up. First we’ll cover basic XML concepts and low-level APIs such as the event-driven SAX (Simple API for XML) and the model-building DOM (Document Object Model). We’ll also discuss related technologies such as XML Schema validation, XPath queries, and XSL (Extensible Stylesheet Language) transformation. Later in this chapter, we’ll discuss the higher-level JAXB (Java Architecture for XML Binding) API for mapping plain Java objects directly to XML and back.

This means that for some of you, the most useful material may be toward the end of this chapter, where we cover the high-level tools. So we want to reassure you that things get more interesting as the chapter progresses. When we reach the section on JAXB, we’ll see that we can take plain old Java objects (POJOs) and write them to XML by adding (in the simplest case) a one-line annotation. The following snippet shows Java Person and Address classes and the corresponding XML that they would map to by default.

@XmlRootElement
public class Person {
    public String name;
    public Address address;
    public int age;
}
public class Address {
    public String city, street;
    public int number, zip;
}

<person>
    <name>Pat Niemeyer</name>
    <address>
        <city>St. Louis</city>
        <street>Java St.</street>
        <number>1234</number>
        <zip>54321</zip>
    </address>
    <age>42</age>
</person>

But before we go there, let’s take a step back and talk about the motivation and “rules” of XML documents and some of the ways we can parse and generate them.

A Bit of Background

XML and HTML are called markup languages because of the way they add structure to plain-text documents—by surrounding parts of the text with tags that indicate structure or meaning, much as someone with a pen might highlight a sentence and add a note. While HTML predefines a set of tags and their structure, XML is a blank slate in which the author gets to define the tags, the rules, and their meanings.

Both XML and HTML owe their lineage to Standard Generalized Markup Language (SGML)—the mother of all markup languages. SGML has been used in the publishing industry for decades (including at O’Reilly). But it wasn’t until the Web captured the world that it came into the mainstream through HTML. HTML started as a very small application of SGML, and if HTML has done anything at all, it has proven that simplicity reigns.

Text Versus Binary

When Tim Berners-Lee began postulating the Web back at CERN in the late 1980s, he wanted to organize project information using hypertext with links embedded in plain text.[49] When the Web needed a protocol, HTTP (a simple, text-based client-server protocol) was invented. So, what exactly is so enchanting about the idea of plain text? Why, for example, didn’t Tim turn to the Microsoft Word format as the basis for web documents? Surely a binary, non-human-readable format and a similarly machine-oriented protocol would be more efficient? Since the Web’s inception, there have been literally trillions of HTTP transactions. Was it really a good idea for them to use (English) words like “GET” and “POST” as part of the protocol?

The answer, as we’ve all seen, is yes! Whatever humans can read and understand, human developers can work with more easily. There is a time and place for a high level of optimization (and obscurity), but when the goal is universal acceptance and cross-platform portability, simplicity and transparency are paramount. This is the first fundamental proposition of XML: simple and nominally human-readable data.

A Universal Parser

Using text to exchange data is not exactly a new idea, either, but historically, for every new document format that came along, a new parser would have to be written. A parser is an application that reads a document and understands its formatting conventions, usually enforcing some rules about the content. For example, the Java Properties class has a parser for the standard properties file format (Chapter 11). In our simple spreadsheet in Chapter 18, we wrote a parser capable of understanding basic mathematical expressions. As we’ve seen, depending on complexity, parsing can be quite tricky.

With XML, we can represent data without having to write this kind of custom parser. This isn’t to say that it’s reasonable to use XML for everything (e.g., typing math expressions into our spreadsheet), but for the common types of information that we exchange on the Net, we shouldn’t have to write parsers that deal with basic syntax and string manipulation. In conjunction with document-verifying components (Document Type Definitions [DTDs] or XML Schema), much of the complex error checking is also done automatically. This is the second fundamental proposition of XML: standardized parsing and validation.

The State of XML

The APIs we’ll discuss in this chapter are powerful and popular. They are being used around the world to build enterprise-scale systems every day. In recent years, JAXB (Java-to-XML binding) has been vastly streamlined and simplified, primarily through the use of Java annotations to replace configuration files and support a “code first” methodology. However, as with any popular technology, its limitations have become better recognized, and some complexity has crept into what began as simple concepts. In the area of browser-based applications, some have turned to JavaScript Object Notation (JSON) as an even lighter-weight approach that maps natively to JavaScript, especially for transient communications between client and server. However, XML tools are still widely used in this area as well. Google’s Protocol Buffers encoding scheme is another example of a system-to-system communication format that has been used in place of XML, in this case where very high performance trumps flexibility. But XML remains the most powerful general format for document and data exchange, with the widest array of tool support.

The XML APIs

All the basic APIs for working with XML are now bundled with the standard release of Java. This includes the javax.xml standard extension packages for working with the Simple API for XML (SAX), the Document Object Model (DOM), XML binding (JAXB), and Extensible Stylesheet Language (XSL) transforms, as well as APIs such as XPath and XInclude. If you are using an older version of Java, you can still use many of these tools, but you will have to download these packages separately.

XML and Web Browsers

All modern web browsers support XML explicitly, both in terms of simple rendering of XML content and also client-side transformation of XML into HTML for display. If you load an XML document in your browser, it will generally be displayed as a tree with controls that let you collapse and expand nodes (like an outline). Displaying XML in this way is used mainly for debugging, but JavaScript can also support client-side XSL transformation directly in the browser. XSL is a language for transforming XML into other documents; we’ll talk about it later in this chapter.

When viewed in older browsers or in contexts that do not explicitly format XML for viewing, the browser will generally simply display the text of the document with all the tags (structural information) stripped off. This is the prescribed behavior for working with unknown XML markup in a viewing environment. Remember that you can always use the “view source” option to display the text of a file in your browser if you want to see the original source.

XML Basics

The basic syntax of XML is extremely simple. If you’ve worked with HTML, you’re already halfway there. As with HTML, XML represents information as text using tags to add structure. A tag begins with a name sandwiched between less than (<) and greater than (>) characters. Unlike HTML, XML tags must always be balanced; in other words, an opening tag must always be followed by a closing tag. A closing tag looks just like the opening tag but starts with a less than sign and a slash (</). An opening tag, closing tag, and any content in between are collectively referred to as an element of the XML document. Elements can contain other elements, but they must be properly nested (all tags started within an element must be closed before the element itself is closed). Elements can also contain plain text or a mixture of elements and text (called mixed content). Comments are enclosed between <!-- and --> markers. Here are a few examples:

<!-- Simple -->
<Sentence>This is text.</Sentence>

<!-- Element -->
<Paragraph><Sentence>This is text.</Sentence></Paragraph>

<!-- Mixed -->
<Paragraph>
        <Sentence>This <verb>is</verb> text.</Sentence>
</Paragraph>

<!-- Empty -->
<PageBreak></PageBreak>

An empty tag can be written more compactly in a special form using a single tag ending with a slash and a greater-than sign (/>):

<PageBreak/>
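To make the balancing and nesting rules concrete, here is a minimal sketch that asks the parser bundled with Java whether a piece of text follows them. It uses the DOM parser that we cover later in this chapter; the class name WellFormedCheck and the method isWellFormed() are our own inventions for illustration, not part of any standard API.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only
class WellFormedCheck {

    // Returns true if the text parses as XML, false if the parser rejects it
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler, which prints to System.err;
            // DefaultHandler simply rethrows fatal errors
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // unbalanced or improperly nested tags, etc.
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Properly nested
        System.out.println(isWellFormed(
            "<Paragraph><Sentence>This is text.</Sentence></Paragraph>"));
        // Closing tags in the wrong order
        System.out.println(isWellFormed(
            "<Paragraph><Sentence>This is text.</Paragraph></Sentence>"));
        // Compact empty element
        System.out.println(isWellFormed("<PageBreak/>"));
    }
}
```

Running the main() method should report the first and third snippets as acceptable and reject the second, where the closing tags are out of order.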

Attributes

An XML element can contain attributes, which are simple name-value pairs supplied inside the start tag.

<Document type="LEGAL" id="42">...</Document>
<Image name="truffle.jpg"/>

The attribute value must always be enclosed in quotes. You can use double (") or single (') quotes. Single quotes are useful if the value contains double quotes.

Attributes are intended to be used for simple, unstructured properties or compact identifiers associated with the element data. It is always possible to make an attribute into a child element, so, strictly speaking, there is no real need for attributes. But they often make the XML easier to read and more logical. In the case of the Document element in our preceding snippet, the attributes type and id represent metadata about the document. We might expect that a Java class representing the Document would have an enumeration of document types such as LEGAL. In the case of the Image element, the attribute is simply a more compact way of including the filename. As a rule, attributes should be compact, with little significant internal structure (URLs push the envelope); by contrast, child elements can have arbitrary complexity.

The id attribute in the previous example may have special significance when used with a corresponding idref attribute. Together, these standard attributes are used with document validation to enforce referential integrity in documents. When validated, an id attribute value must be unique within the document and an idref attribute value must refer to a valid id within the document.
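As a rough illustration of treating attributes as simple metadata, the following sketch reads the type and id attributes of a Document element and maps the type onto a Java enum. It borrows the DOM parser discussed later in this chapter. The class AttributeDemo, the DocType enum, and its MEDICAL constant are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class, for illustration only
class AttributeDemo {

    // LEGAL comes from the snippet above; MEDICAL is an invented second value
    enum DocType { LEGAL, MEDICAL }

    // Parse the text and return its root element
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Map the type attribute onto the enum
    static DocType typeOf(String xml) {
        return DocType.valueOf(parseRoot(xml).getAttribute("type"));
    }

    // Read the id attribute as a number
    static int idOf(String xml) {
        return Integer.parseInt(parseRoot(xml).getAttribute("id"));
    }

    public static void main(String[] args) {
        String xml = "<Document type=\"LEGAL\" id=\"42\">...</Document>";
        System.out.println(typeOf(xml) + " " + idOf(xml));
    }
}
```

The same methods work whether the attribute values were written with double or single quotes; the quoting style is not visible once the document has been parsed.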

XML Documents

An XML document begins with a header like the following and one root element:

<?xml version="1.0" encoding="UTF-8"?>
<MyDocument>
</MyDocument>

The header identifies the version of XML and the character encoding used. The root element is simply the top of the element hierarchy, which can be considered a tree. If you omit this header or have XML text without a single root element (as in our earlier simple examples), technically what you have is called an XML fragment.

Encoding

The default encoding for an XML document is UTF-8, the ASCII-friendly 8-bit Unicode encoding. This encoding preserves ASCII values, so English text is unaltered by it. It also allows Unicode values to be stored in a reasonably efficient way. An XML document may specify another encoding using the encoding attribute of the XML header.

Within an XML document, certain characters are necessarily sacrosanct: for example, the < and > characters that indicate element tags. When you need to include these in your text, you must encode them. XML provides an escape mechanism called “entities” that allows for encoding special structures. XML has five predefined entities, as shown in Table 24-1.

Table 24-1. XML entities

Entity     Encodes
&amp;      & (ampersand)
&lt;       < (less than)
&gt;       > (greater than)
&quot;     " (quotation mark)
&apos;     ' (apostrophe)

An alternative to encoding text in this way is to use a special “unparsed” section of text called a character data (CDATA) section. A CDATA section starts with the cryptic string <![CDATA[ and ends with ]]>, like this:

<![CDATA[  Learning Java, O'Reilly & Associates ]]>

The CDATA section looks a little like a comment, but the data is still part of the document, just opaque to the parser.
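To see that entities and CDATA sections are two spellings of the same text, consider this small sketch. It parses one element written with entities and one written with a CDATA section and compares the resulting text. It relies on the DOM parser covered later in this chapter; EntityDemo and textOf() are names made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class, for illustration only
class EntityDemo {

    // Parse the text and return the character content of its root element
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String escaped = textOf(
            "<title>Learning Java, O&apos;Reilly &amp; Associates</title>");
        String cdata = textOf(
            "<title><![CDATA[Learning Java, O'Reilly & Associates]]></title>");

        System.out.println(escaped);
        System.out.println(escaped.equals(cdata));
    }
}
```

In both cases the application sees the plain apostrophe and ampersand; the escaping exists only in the serialized document.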

There is one more alternative, which is to use a special <include> directive to include the contents of a URL or file either as pre-escaped text or optionally parsed as XML. XML includes are very convenient, and we’ll talk about them later in this chapter.

Namespaces

You’ve probably seen that HTML has a <body> tag that is used to structure web pages. Suppose for a moment that we are writing XML for a funeral home that also uses the tag <body> for some other, more macabre, purpose. This could be a problem if we want to mix HTML with our mortuary information.

If you consider HTML and the funeral home tags to be languages in this case, the elements (tag names) used in a document are really the vocabulary of those languages. An XML namespace is a way of saying whose dictionary you are using for a given element, allowing us to mix them freely. (Later, we’ll talk about XML Schemas, which enforce the grammar and syntax of the language.)

A namespace is specified with the xmlns attribute, whose value is a Uniform Resource Identifier (URI) that uniquely defines the set (and usually the meaning) of tags from that namespace:

<element xmlns="namespaceURI">

Recall from Chapter 14 that a URI is not necessarily a URL. URIs are more general than URLs. In practical terms, a URI is to be treated as a unique string. Often, the URI is in fact also a URL for a document describing the namespace, but when true it is only by convention.

An xmlns namespace attribute can be applied to an element and affects all its (nested) children; this is called a default namespace for the element:

<body xmlns="http://funeral-procedures.org/">

Often it is desirable to mix and match namespaces on a tag-by-tag basis. To do this, we can use the special xmlns attribute to define a special identifier for the namespace and use that identifier as a prefix on the tags in question. For example:

<funeral xmlns:fun="http://funeral-procedures.org/">
     <html><head></head><body></body></html>
     <fun:body>Corpse #42</fun:body>
</funeral>

In the preceding snippet of XML, we’ve qualified the body tag with the prefix “fun:”, which we defined in the <funeral> tag. In this case, we should qualify the root tag as well, reflexively:

<fun:funeral xmlns:fun="http://funeral-procedures.org/">

The XML parser factories supplied with Java have a switch to specify whether you want the parser to interpret namespaces. This switch defaults to off for historical reasons.

parserFactory.setNamespaceAware( true );

We’ll talk more about parsing in the sections on SAX and DOM later in this chapter.
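In the meantime, here is a minimal sketch of what the namespace switch changes, using the DOM parser factory (which has the same setNamespaceAware() method). We parse the same funeral document twice, once with the switch on and once with it off, and look at the root element. The class name NamespaceDemo and the parseRoot() helper are our own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical example class, for illustration only
class NamespaceDemo {

    static final String XML =
        "<fun:funeral xmlns:fun=\"http://funeral-procedures.org/\">"
        + "<fun:body>Corpse #42</fun:body>"
        + "</fun:funeral>";

    // Parse the sample document with the namespace switch on or off
    static Element parseRoot(boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Switch on: the prefix is resolved to its URI
        Element aware = parseRoot(true);
        System.out.println(aware.getNamespaceURI());
        System.out.println(aware.getLocalName());

        // Switch off: "fun:funeral" is just a name containing a colon
        Element unaware = parseRoot(false);
        System.out.println(unaware.getNamespaceURI());
        System.out.println(unaware.getNodeName());
    }
}
```

With the switch on, the root element reports the funeral-procedures URI and the local name funeral. With it off, the parser reports no namespace at all and treats the whole string fun:funeral as the element name.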

Validation

A document that conforms to the basic rules of XML with proper encoding and balanced tags is called a well-formed document. Just because a document is syntactically correct, however, doesn’t mean that it makes sense. Two related sets of tools, DTDs and XML Schemas, define ways to provide a grammar for your XML elements. They allow you to create syntactic rules, such as “a City element can appear only once inside an Address element and comes before a State element.” XML Schema goes further to provide a flexible language for describing the validity of data content of the tags, including both simple and compound data types made of numbers and strings.

A document that is checked against a DTD or XML Schema description and follows the rules is called a valid document. A document can be well formed without being valid, but not vice versa.
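The difference between well formed and valid can be sketched with the javax.xml.validation package that ships with Java. The tiny schema below is our own invention and encodes only the rule quoted above: an Address contains a City followed by a State. The class name ValidationDemo and the isValid() method are likewise hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical example class, for illustration only
class ValidationDemo {

    // An invented schema: Address must contain City, then State
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"Address\"><xs:complexType><xs:sequence>"
        + "<xs:element name=\"City\" type=\"xs:string\"/>"
        + "<xs:element name=\"State\" type=\"xs:string\"/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    // Returns true only if the text is well formed AND follows the schema
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator()
                .validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed, or breaks a schema rule
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well formed and valid
        System.out.println(isValid(
            "<Address><City>St. Louis</City><State>MO</State></Address>"));
        // Well formed, but State comes before City
        System.out.println(isValid(
            "<Address><State>MO</State><City>St. Louis</City></Address>"));
    }
}
```

Both documents have balanced, properly nested tags, so both are well formed. Only the first follows the grammar, so only the first is valid.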

HTML to XHTML

To speak very loosely, we could say that the most popular and widely used form of XML in the world today is HTML. The terminology is loose because HTML is not really well-formed XML. HTML tags violate XML’s rule forbidding unbalanced elements; the common <p> tag is typically used without a closing tag, for example. HTML attributes also don’t require quotes. XML tags are also case-sensitive; <P> and <p> are two different tags in XML. We could generously say that HTML is “forgiving” with respect to details like this, but as a developer, you know that sloppy syntax results in ambiguity. XHTML is an alternate, strict XML version of HTML that is clear and unambiguous. This form of HTML works in modern browsers. Fortunately, if you want to switch, you don’t have to manually clean up all your HTML documents; Tidy is an open source program that automatically converts HTML to XHTML, validates it, and corrects common mistakes.

SAX

SAX is a low-level, event-style API for parsing XML documents. SAX originated in Java, but has been implemented in many languages. We’ll begin our discussion of the Java XML APIs here at this lower level, and work our way up to higher-level (and often more convenient) APIs as we go.

The SAX API

To use SAX, we’ll draw on classes from the org.xml.sax package, a de facto standard that grew out of the XML-DEV mailing list rather than a formal W3C specification. This package holds interfaces common to all implementations of SAX. To perform the actual parsing, we’ll need the javax.xml.parsers package, which is the standard Java package for accessing XML parsers. The javax.xml.parsers package is part of the Java API for XML Processing (JAXP), which allows different parser implementations to be used with Java in a portable way.

To read an XML document with SAX, we first register an implementation of the org.xml.sax.ContentHandler interface with the parser. The ContentHandler has methods that are called in response to parts of the document. For example, the ContentHandler’s startElement() method is called when an opening tag is encountered, and the endElement() method is called when the tag is closed. Attributes are provided with the startElement() call. Text content of elements is passed through a separate method called characters(). The characters() method may be invoked repeatedly to supply more text as it is read, but it often gets the whole string in one bite. The following are the signatures of these methods of the ContentHandler interface.

public void startElement(
    String namespace, String localname, String qname, Attributes atts );
public void characters(
    char[] ch, int start, int len );
public void endElement(
    String namespace, String localname, String qname );

The qname parameter is the qualified name of the element: this is the element name, prefixed with any namespace prefix that may be applied. When you’re working with namespaces, the namespace and localname parameters are also supplied, providing the namespace and unqualified element name separately.

The ContentHandler interface also contains methods called in response to the start and end of the document, startDocument() and endDocument(), as well as those for handling namespace mapping, special XML instructions, and whitespace that is not part of the text content and may optionally be ignored. We’ll confine ourselves to the three previous methods for our examples. As with many other Java interfaces, a simple implementation, org.xml.sax.helpers.DefaultHandler, is provided for us that allows us to override only the methods in which we’re interested.

JAXP

To perform the parsing, we’ll need to get a parser from the javax.xml.parsers package. JAXP abstracts the process of getting a parser through a factory pattern, allowing different parser implementations to be plugged into the Java platform. The following snippet constructs a SAXParser object and then gets an XMLReader used to parse a file:

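    import javax.xml.parsers.*;
    import org.xml.sax.XMLReader;

    // Ask the factory for a parser, then get its underlying XMLReader
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();
    XMLReader reader = saxParser.getXMLReader();
    
    // Register our handler (defined elsewhere) and parse the file
    reader.setContentHandler( myContentHandler );
    reader.parse( "myfile.xml" );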

You might expect the SAXParser to have the parse method. The XMLReader intermediary was added to support changes in the SAX API between 1.0 and 2.0. Later, we’ll discuss some options that can be set to govern how XML parsers operate. These options are normally set through methods on the parser factory (e.g., SAXParserFactory) and not the parser itself. This is because the factory may wish to use different implementations to support different required features.
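Putting the handler methods and the JAXP factory together, the following sketch logs each SAX event as it arrives. It extends DefaultHandler, overrides the three methods we listed earlier, and parses XML held in a string rather than a file. The class name EventLogger and its log() method are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, for illustration only
class EventLogger extends DefaultHandler {

    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String namespace, String localname,
            String qname, Attributes atts) {
        events.add("start " + qname + " (" + atts.getLength() + " attributes)");
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        // May be called more than once for a single run of text
        String text = new String(ch, start, len).trim();
        if (!text.isEmpty()) {
            events.add("text " + text);
        }
    }

    @Override
    public void endElement(String namespace, String localname, String qname) {
        events.add("end " + qname);
    }

    // Parse the text and return the list of events that were seen
    static List<String> log(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            EventLogger logger = new EventLogger();
            reader.setContentHandler(logger);
            reader.parse(new InputSource(new StringReader(xml)));
            return logger.events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        log("<animal animalClass=\"mammal\"><name>Cocoa</name></animal>")
            .forEach(System.out::println);
    }
}
```

The events arrive in document order: the start of animal, the start of name, the text, and then the two end events in reverse order of opening. That ordering is what lets us track nesting with a stack later in this chapter.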

SAX’s strengths and weaknesses

The primary motivation for using SAX instead of the higher-level APIs that we’ll discuss later is that it is lightweight and event-driven. SAX doesn’t require maintaining the entire document in memory. So if, for example, you need to grab the text of just a few elements from a document, or if you need to extract elements from a large stream of XML, you can do so efficiently with SAX. The event-driven nature of SAX also allows you to take actions as the beginning and end tags are parsed. This can be useful for directly manipulating your own models without first going through another representation. The primary weakness of SAX is that you are operating on a tag-by-tag level with no help from the parser to maintain context. We’ll talk about how to overcome this limitation next. Later, we’ll also talk about the new XPath API, which combines many of the benefits of both SAX and DOM in a form that is easier to use.
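As a sketch of the “grab just a few elements” use case, the handler below collects the text of every species element and ignores everything else, so nothing but the matching text is held in memory. The element name anticipates the zoo example that follows; the class name SpeciesExtractor and its extract() method are hypothetical. Note that the handler has to keep a little context of its own (the buffer) to know when it is inside the element of interest.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, for illustration only
class SpeciesExtractor extends DefaultHandler {

    final List<String> species = new ArrayList<>();
    private StringBuilder buffer; // non-null only while inside <species>

    @Override
    public void startElement(String namespace, String localname,
            String qname, Attributes atts) {
        if ("species".equals(qname)) {
            buffer = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        // Text may arrive in several pieces, so accumulate it
        if (buffer != null) {
            buffer.append(ch, start, len);
        }
    }

    @Override
    public void endElement(String namespace, String localname, String qname) {
        if ("species".equals(qname)) {
            species.add(buffer.toString());
            buffer = null;
        }
    }

    // Parse the text and return the content of each <species> element
    static List<String> extract(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            SpeciesExtractor handler = new SpeciesExtractor();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.species;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "<inventory>"
            + "<animal><name>Song Fang</name><species>Giant Panda</species></animal>"
            + "<animal><name>Cocoa</name><species>Gorilla</species></animal>"
            + "</inventory>"));
    }
}
```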

Building a Model Using SAX

The ContentHandler mechanism for receiving SAX events is very simple. It should be easy to see how one could use it to capture the value or attributes of a single element in a document. What may be harder to see is how one could use SAX to populate a real Java object model. Creating or pushing data into Java objects from XML is such a common activity that it’s worth considering how the SAX API applies to this problem. The following example, SAXModelBuilder, does just this, reading an XML description and creating Java objects on command. This example is a bit unusual in that we resort to using some reflection to do the job, but this is a case where we’re trying to interact with Java objects dynamically.

In this section, we’ll start by creating some XML along with corresponding Java classes that serve as the model for this XML. The focus of the example code here is to create the generic model builder that uses SAX to read the XML and populate the model classes with their data. The idea is that the developer is creating only XML and model classes, with no custom code, to do the parsing. You might use code like this to read configuration files for an application or to implement a custom XML “language” for describing workflows. The advantage is that there is no real parsing code in the application at all, only in the generic builder tool. Finally, later in this chapter when we discuss the more powerful JAXB APIs, we’ll reuse the Java object model from this example simply by adding a few annotations.

Creating the XML file

The first thing we’ll need is a nice XML document to parse. Luckily, it’s inventory time at the zoo! The following document, zooinventory.xml, describes two of the zoo’s residents, including some vital information about their diets:

<?xml version="1.0" encoding="UTF-8"?>
    <inventory>
        <animal animalClass="mammal">
            <name>Song Fang</name>
            <species>Giant Panda</species>
            <habitat>China</habitat>
            <food>Bamboo</food>
            <temperament>Friendly</temperament>
            <weight>45.0</weight>
        </animal>
        <animal animalClass="mammal">
            <name>Cocoa</name>
            <species>Gorilla</species>
            <habitat>Central Africa</habitat>
            <foodRecipe>
                <name>Gorilla Chow</name>
                <ingredient>fruit</ingredient>
                <ingredient>shoots</ingredient>
                <ingredient>leaves</ingredient>
            </foodRecipe>
            <temperament>Know-it-all</temperament>
            <weight>45.0</weight>
        </animal>
    </inventory>

The document is fairly simple. The root element, <inventory>, contains two <animal> elements as children. <animal> contains several simple text elements for things like name, species, and habitat. It also contains either a simple <food> element or a complex <foodRecipe> element. Finally, note that the <animal> element has one attribute, animalClass, that describes the zoological classification of the creature (e.g., mammal, bird, or fish). This gives us a representative set of XML features to play with in our examples.

The model

Now let’s make a Java object model for our zoo inventory. This part is very mechanical—we simply create a class for each of the complex element types in our XML; anything other than a simple string or number. Best practices would probably be to use the standard JavaBeans property design pattern here—that is, to use a private field (instance variable) plus a pair of get and set methods for each property. However, because these classes are just simple data holders and we want to keep our example small, we’re going to opt to use public fields. Everything we’re going to do in this example and, much more importantly, everything we’re going to do when we reuse this model in the later JAXB binding example, can be made to work with either field or JavaBeans-style method-based properties equivalently. In this example, it would just be a matter of how we set the values and later in the JAXB case, it would be a matter of where we put the annotations. So here are the classes:

    import java.util.ArrayList;
    import java.util.List;

    // Each public class goes in its own source file
    public class Inventory {
        public List<Animal> animal = new ArrayList<>();
    }

    public class Animal 
    {
        public static enum AnimalClass { mammal, reptile, bird, fish, amphibian,
            invertebrate }    

        public AnimalClass animalClass;    
        public String name, species, habitat, food, temperament;    
        public Double weight;
        public FoodRecipe foodRecipe;
    
        public String toString() {
            return name + "(" + animalClass + ", " + species + ")";
        }
    }

    public class FoodRecipe
    {
        public String name;
        public List<String> ingredient = new ArrayList<>();
    
        public String toString() { return name + ": " + ingredient.toString(); }
    }

As you can see, for the cases where we need to represent a sequence of elements (e.g., animal in inventory), we have used a List collection. Also note that the property that will serve to hold our animalClass attribute (e.g., mammal) is represented as an enum type. We’ve also thrown in simple toString() methods for later use. One more thing: we’ve chosen to name our collections in the singular form here (e.g., “animal,” as opposed to “animals”) just because it is convenient. We’ll talk about mapping names more in the JAXB example.

The SAXModelBuilder

Let’s get down to business and write our builder tool. Now we could do this by using the SAX API in combination with some “hardcoded” knowledge about the incoming tags and the classes we want to output (imagine a whole bunch of switches or if/then statements); however, we’re going to do better than that and make a more generic model builder that maps our XML to classes by name. The SAXModelBuilder that we create in this section receives SAX events from parsing an XML file and dynamically constructs objects or sets properties corresponding to the names of the element tags. Our model builder is small, but it handles the most common structures: nested elements and elements with simple text or numeric content. We treat attributes as equivalent to element data as far as our model classes go, and we support three basic types: String, Double, and Enum.

Here is the code:

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
import java.lang.reflect.*;

public class SAXModelBuilder extends DefaultHandler
{
    Stack<Object> stack = new Stack<>();

    public void startElement( String namespace, String localname, String qname,
        Attributes atts ) throws SAXException
    {
        // Construct the new element and set any attributes on it
        Object element;
        try {
            String className = Character.toUpperCase( qname.charAt( 0 ) ) +
                qname.substring( 1 );
            // Class.newInstance() is deprecated in newer Java releases
            element = Class.forName( className )
                .getDeclaredConstructor().newInstance();
        } catch ( Exception e ) {
            element = new StringBuffer();
        }

        for( int i=0; i<atts.getLength(); i++) {
            try {
                setProperty( atts.getQName( i ), element, atts.getValue( i ) );
            } catch ( Exception e ) { throw new SAXException( "Error: ", e ); }
        }

        stack.push( element );
    }

    public void endElement( String namespace, String localname, String qname )
        throws SAXException
    {
        // Add the element to its parent
        if ( stack.size() > 1) {
            Object element = stack.pop();
            try {
                setProperty( qname, stack.peek(), element );
            } catch ( Exception e ) { throw new SAXException( "Error: ", e ); }
        }
    }

    public void characters(char[] ch, int start, int len )
    {
        // Receive element content text
        String text = new String( ch, start, len );
        if ( text.trim().length() == 0 ) { return; }
        ((StringBuffer)stack.peek()).append( text );
    }

    void setProperty( String name, Object target, Object value )
        throws SAXException, IllegalAccessException, NoSuchFieldException
    {
        Field field = target.getClass().getField( name );

        // Convert values to field type
        if ( value instanceof StringBuffer ) {
            value = value.toString();
        }
        if ( field.getType() == Double.class ) {
            value = Double.parseDouble( value.toString() );
        }
        if ( Enum.class.isAssignableFrom( field.getType() ) ) {
            value = Enum.valueOf( (Class<Enum>)field.getType(),
                value.toString() );
        }

        // Apply to field
        if ( field.getType() == value.getClass() ) {
            field.set( target, value );
        } else
        if ( Collection.class.isAssignableFrom( field.getType() ) ) {
            Collection collection = (Collection)field.get( target );
            collection.add( value );
        } else {
            throw new RuntimeException( "Unable to set property..." );
        }
    }

    public Object getModel() { return stack.pop(); }
}

The code may be a little hard to digest at first: we are using reflection to construct the objects and set the properties on the fields. But the gist of it is really just that the three methods, startElement(), characters(), and endElement(), are called in response to the tags of the input and we store the data as we receive it. Let’s take a look.

The SAXModelBuilder extends DefaultHandler to help us implement the ContentHandler interface. Because SAX events follow the hierarchical structure of the XML document, we use a simple stack to keep track of which object we are currently parsing. At the start of each element, the model builder attempts to create an instance of a class with the same name (uppercase) as the element and push it onto the top of the stack. Each nested opening tag creates a new object on the stack until we encounter a closing tag. Upon reaching an end of an element, we pop the current object off the stack and attempt to apply its value to its parent (the enclosing XML element), which is the new top of the stack. For elements with simple content that do not have a corresponding class, we place a StringBuffer on the stack as a stand-in to hold the character content until the tag is closed. In this case, the name of the tag indicates the property on the parent that should get the text and upon seeing the closing tag, we apply it in the same way. Attributes are applied to the current object on the stack within the startElement() method using the same technique. The final closing tag leaves the top-level element (inventory in this case) on the stack for us to retrieve.

To set values on our objects, we use our setProperty() method. It uses reflection to look for a field matching the name of the tag within the specified object. It also handles some simple type conversions based on the type of the field found. If the field is of type Double, we parse the text to a number; if it is an Enum type, we find the matching enum value represented by the text. Finally, if the field is not a simple field but is a Collection representing an XML sequence, then we invoke its add() method to add the child to the collection instead of trying to assign to the field itself.
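To see those reflection calls in isolation, here is a minimal sketch that is separate from the SAXModelBuilder itself. The ReflectionSketch class, its Pet model, and the setField() helper are hypothetical names invented for illustration, and the sketch handles only String, Double, and Collection fields.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class ReflectionSketch {

    // Hypothetical model class; the fields are public so getField() can find them
    public static class Pet {
        public String name;
        public Double weight;
        public List<String> toys = new ArrayList<>();
    }

    // Look up a public field by name, convert the text, then assign or append it
    @SuppressWarnings("unchecked")
    static void setField( Object target, String name, String text ) {
        try {
            Field field = target.getClass().getField( name );
            Object value = text;
            if ( field.getType() == Double.class ) {
                value = Double.valueOf( text );
            }
            if ( Collection.class.isAssignableFrom( field.getType() ) ) {
                // Collections are added to rather than replaced
                ((Collection<Object>)field.get( target )).add( value );
            } else {
                field.set( target, value );
            }
        } catch ( ReflectiveOperationException e ) {
            throw new IllegalArgumentException( "Cannot set " + name, e );
        }
    }

    public static void main( String [] args ) {
        Pet pet = new Pet();
        setField( pet, "name", "Cocoa" );
        setField( pet, "weight", "350.5" );
        setField( pet, "toys", "tire swing" );
        System.out.println( pet.name + ", " + pet.weight + ", " + pet.toys );
    }
}
```

The field name plays the role of the XML tag name here, which is why this style of binding ties the document vocabulary so closely to the Java class.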

Test drive

Finally, we can test drive the model builder with the following class, TestSAXModelBuilder, which calls the SAX parser, setting an instance of our SAXModelBuilder as the content handler. The test class then prints some of the information parsed from the zooinventory.xml file:

    import org.xml.sax.*;
    import javax.xml.parsers.*;
    
    public class TestSAXModelBuilder
    {
        public static void main( String [] args ) throws Exception
        {
            SAXParserFactory factory = SAXParserFactory
                .newInstance();
            SAXParser saxParser = factory.newSAXParser();
            XMLReader parser = saxParser.getXMLReader();
            SAXModelBuilder mb = new SAXModelBuilder();
            parser.setContentHandler( mb );
    
            parser.parse( new InputSource("zooinventory.xml") );
            Inventory inventory = (Inventory)mb.getModel();
            System.out.println("Animals = "+inventory.animal);
            Animal cocoa = (Animal)(inventory.animal.get(1));
            FoodRecipe recipe = cocoa.foodRecipe;
            System.out.println( "Recipe = "+recipe );
        }
    }

The output should look like this:

Animals = [Song Fang(mammal, Giant Panda), Cocoa(mammal, Gorilla)]
Recipe = Gorilla Chow: [fruit, shoots, leaves]

In the following sections, we’ll generate the equivalent output using different tools.

Limitations and possibilities

To make our model builder more complete, we could use more robust naming conventions for our tags and model classes (taking into account packages and mixed capitalization, etc.). More generally, we might want to introduce arbitrary mappings (bindings) between names and classes or properties. And of course, there is the problem of taking our model and going the other way, using it to generate an XML document. You can see where this is going: JAXB will do all of that for us, coming up later in this chapter.

XMLEncoder/Decoder

Java includes a standard tool for serializing JavaBeans classes to XML. The java.beans package’s XMLEncoder and XMLDecoder classes are analogous to java.io’s ObjectOutputStream and ObjectInputStream. Instead of using the native Java serialization format, they store the object state in a high-level XML format. We say that they are analogous, but the XML encoder is not a general replacement for Java object serialization. Instead, it is specialized to work with objects that follow the JavaBeans design patterns (setter and getter methods for properties), and it can only store and recover the state of the object that is expressed through a bean’s public properties in this way.

When you call it, the XMLEncoder attempts to construct an in-memory copy of the graph of beans that you are serializing using only public constructors and JavaBean properties. As it works, it writes out the steps required as “instructions” in an XML format. Later, the XMLDecoder executes these instructions and reproduces the result. The primary advantage of this process is that it is highly resilient to changes in the class implementation. While standard Java object serialization can accommodate many kinds of “compatible changes” in classes, it requires some help from the developer to get it right. Because the XMLEncoder uses only public APIs and writes instructions in simple XML, it is expected that this form of serialization will be the most robust way to store the state of JavaBeans. The process is referred to as long-term persistence for JavaBeans.
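As a rough illustration of that round trip, the following sketch writes a small bean to an in-memory buffer and reads it back. The XMLEncoderSketch class and its Pet bean are hypothetical; only XMLEncoder and XMLDecoder come from java.beans.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class XMLEncoderSketch {

    // Hypothetical bean: public class, no-arg constructor, getter/setter pairs
    public static class Pet {
        private String name;
        private int weight;
        public String getName() { return name; }
        public void setName( String name ) { this.name = name; }
        public int getWeight() { return weight; }
        public void setWeight( int weight ) { this.weight = weight; }
    }

    // Write the bean as XML "instructions" into a byte array
    static byte [] encode( Object bean ) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try ( XMLEncoder encoder = new XMLEncoder( bytes ) ) {
            encoder.writeObject( bean );
        }
        return bytes.toByteArray();
    }

    // Replay the instructions to rebuild an equivalent bean
    static Object decode( byte [] xml ) {
        try ( XMLDecoder decoder =
                new XMLDecoder( new ByteArrayInputStream( xml ) ) ) {
            return decoder.readObject();
        }
    }

    public static void main( String [] args ) {
        Pet pet = new Pet();
        pet.setName( "Cocoa" );
        pet.setWeight( 350 );

        byte [] xml = encode( pet );
        System.out.println( new String( xml, StandardCharsets.UTF_8 ) );

        Pet copy = (Pet)decode( xml );
        System.out.println( copy.getName() + " weighs " + copy.getWeight() );
    }
}
```

Printing the encoded form is a quick way to see the property-setting instructions for yourself.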

It might seem at first like this would obviate the need for our SAXModelBuilder example. Why not simply write our XML in the format that XMLDecoder understands and use it to build our model? Although XMLEncoder is very efficient at eliminating redundancy, you would see that its output is still very verbose (about two to three times larger than our original XML) and not very human-friendly. Although it’s possible to write it by hand, this XML format wasn’t designed for that. Finally, although XMLEncoder can be customized for how it handles specific object types, it suffers from the same problem that our model builder does, in that “binding” (the namespace of tags) is determined strictly by our Java class names. As we’ve said before, what is really needed is a more general tool to map our own classes to XML and back.

DOM

In the last section, we used SAX to parse an XML document and build a Java object model representing it. In that case, we created specific Java types for each of our complex elements. If we were planning to use our model extensively in an application, this technique would give us a great deal of flexibility. But often it is sufficient (and much easier) to use a “generic” model that simply represents the content of the XML in a neutral form. The Document Object Model (DOM) is just that. The DOM API parses an XML document into a generic representation consisting of classes with names such as Element and Attribute that hold their own values. You could use this to inspect the document structure and pull out the parts you want in a way that is perhaps more convenient than the low-level SAX. The tradeoff is that the entire document is parsed and read into memory—but for most applications, that is fine.

As we saw in our zoo example, once you have an object model, using the data is a breeze. So a generic DOM would seem like an appealing solution, especially when working mainly with text. One catch in this case is that DOM didn’t evolve first as a Java API and it doesn’t map well to Java. DOM is very complete and provides access to every facet of the original XML document, but it’s so generic (and language-neutral) that it’s cumbersome to use in Java. Later, we’ll also mention a native Java alternative to DOM called JDOM that is more pleasant to use.

The DOM API

The core DOM classes belong to the org.w3c.dom package. The result of parsing an XML document with DOM is a Document object from this package (see Figure 24-1). The Document is both a factory and a container for a hierarchical collection of Node objects, representing the document structure. A node has a parent and may have children, which can be traversed using its getChildNodes(), getFirstChild(), or getLastChild() methods. A node may also have “attributes” associated with it, which consist of a named map of nodes.

Figure 24-1. The parsed DOM

Subtypes of Node (Element, Text, and Attr) represent elements, text, and attributes in XML. Some types of nodes have a text “value.” For example, the value of a Text node is the text it holds within its parent element. The same is true of an attribute, cdata, or comment node. An Element has no value of its own (its getNodeValue() returns null), because its text lives in child Text nodes. The value of a node can be accessed by the getNodeValue() and setNodeValue() methods. We’ll also make use of Node’s getTextContent() method, which retrieves the plain-text content of the node and all of its child nodes.

The Element node provides “random” access to its child elements through its getElementsByTagName() method, which returns a NodeList (a simple collection type). You can also fetch an attribute by name from the Element using the getAttribute() method.

The javax.xml.parsers package contains a factory for DOM parsers, just as it does for SAX parsers. An instance of DocumentBuilderFactory can be used to create a DocumentBuilder object to parse the file and produce a Document result.
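Before the full example, here is a small sketch of the Node methods just described, working on XML held in a string rather than a file. The DomWalkSketch class and its helper methods are hypothetical names, and the XML fragment is a cut-down stand-in for one animal entry.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomWalkSketch {

    // Parse XML text (rather than a file) into a DOM Document
    static Document parse( String xml ) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse( new InputSource( new StringReader( xml ) ) );
        } catch ( Exception e ) {
            throw new IllegalArgumentException( "Unparseable XML", e );
        }
    }

    // Collect "name=text" for each child node that is an element
    static List<String> describeChildren( Element parent ) {
        List<String> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for ( int i = 0; i < children.getLength(); i++ ) {
            Node child = children.item( i );
            // Whitespace between tags shows up as Text nodes; skip them
            if ( child.getNodeType() == Node.ELEMENT_NODE ) {
                result.add( child.getNodeName() + "=" + child.getTextContent() );
            }
        }
        return result;
    }

    public static void main( String [] args ) {
        String xml = "<animal animalClass=\"mammal\">\n"
            + "  <name>Cocoa</name>\n"
            + "  <species>Gorilla</species>\n"
            + "</animal>";
        Element animal = parse( xml ).getDocumentElement();
        System.out.println( animal.getAttribute( "animalClass" ) );
        System.out.println( describeChildren( animal ) );

        // An element has no node value; its text lives in a child Text node
        Node name = animal.getElementsByTagName( "name" ).item( 0 );
        System.out.println( name.getNodeValue() + " / "
            + name.getFirstChild().getNodeValue() );
    }
}
```

The check for ELEMENT_NODE matters in practice: indentation in the source document appears in the child list as whitespace-only Text nodes.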

Test-Driving DOM

Here is our TestDOM class:

    import javax.xml.parsers.*;
    import org.w3c.dom.*;
    
    public class TestDOM
    {
        public static void main( String [] args ) throws Exception
        {
            DocumentBuilderFactory factory = DocumentBuilderFactory
                .newInstance();
            DocumentBuilder parser = factory.newDocumentBuilder();
            Document document = parser.parse( "zooinventory.xml" );
    
            Element inventory = document.getDocumentElement();
            NodeList animals = inventory.getElementsByTagName("animal");
            System.out.println("Animals = ");
            for( int i=0; i<animals.getLength(); i++ ) {
                Element item = (Element)animals.item( i );
                String name = item.getElementsByTagName( "name" ).item( 0 )
                    .getTextContent();
                String species = item.getElementsByTagName( "species" )
                    .item( 0 ).getTextContent();
                String animalClass = item.getAttribute( "animalClass" );
                System.out.println( "  "+ name +" ("+animalClass+", "
                    +species+")" );
            }
    
            Element cocoa = (Element)animals.item( 1 );
            Element recipe = (Element)cocoa.getElementsByTagName( "foodRecipe" )
                .item( 0 );
            String recipeName = recipe.getElementsByTagName( "name" ).item( 0 )
                .getTextContent();
            System.out.println("Recipe = " + recipeName );
            NodeList ingredients = recipe.getElementsByTagName("ingredient");
            for(int i=0; i<ingredients.getLength(); i++) {
                System.out.println( "  " + ingredients.item( i )
                    .getTextContent() );
            }
        }
    }

TestDOM creates an instance of a DocumentBuilder and uses it to parse our zooinventory.xml file. We use the Document getDocumentElement() method to get the root element of the document, from which we will begin our traversal. From there, we ask for all the animal child nodes. The getElementsByTagName() method returns a NodeList object, which we then use to iterate through our creatures. For each animal, we use the Element getElementsByTagName() method to retrieve the name and species child element information. Each of those queries can potentially return a list of matching elements, but we only allow for one here by taking the first element returned and asking for its text content. We also use the getAttribute() method to retrieve the animalClass attribute from the element.

Next, we use getElementsByTagName() to retrieve the element called foodRecipe from the second animal. We use it to fetch a NodeList for all of the tags matching ingredient and print them as before. The output should contain the same information as our SAX-based example. But as you can see, the tradeoff in not having to create our own model classes is that we have to suffer through the use of the generic model and produce code that is considerably harder to read and less flexible.

Generating XML with DOM

Thus far, we’ve used the SAX and DOM APIs to parse XML. But what about generating XML? Sure, it’s easy to generate trivial XML documents simply by printing the appropriate strings. But if we plan to create a complex document on the fly, we might want some help with all those quotes and closing tags. We may also want to validate our model against an XML DTD or Schema before writing it out. What we can do is to build a DOM representation of our object in memory and then transform it to text. This is also useful if we want to read a document and then make some alterations to it. To do this, we’ll make use of the javax.xml.transform package. This package does a lot more than just printing XML. As its name implies, it’s part of a general transformation facility. It includes the XSL/XSLT languages for generating one XML document from another. (We’ll talk about XSL later in this chapter.)

We won’t discuss the details of constructing a DOM in memory here, but it follows fairly naturally from what you’ve learned about traversing the tree in our previous example. The following example, PrintDOM, simply parses our zooinventory.xml file to a DOM and then prints that DOM back to the screen. The same output code would print any DOM whether read from a file or created in memory using the factory methods on the DOM Document and Element, etc.

    import javax.xml.parsers.*;
    import org.xml.sax.InputSource;
    import org.w3c.dom.*;
    import javax.xml.transform.*;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    
    public class PrintDOM {
        public static void main( String [] args ) throws Exception 
        {
            DocumentBuilder parser = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder();
            Document document = parser.parse(
                new InputSource("zooinventory.xml") );
            Transformer transformer =  TransformerFactory.newInstance()
                .newTransformer();
            Source source = new DOMSource( document );
            Result output = new StreamResult( System.out );
            transformer.transform( source, output );
        }
    }

Note that the imports are almost as long as the entire program! Here, we are using an instance of a Transformer object in its simplest capacity to copy from a source to an output. We’ll return to the Transformer later when we discuss XSL, at which point it will be doing a lot more work for us.
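For readers curious about the in-memory case, the sketch below builds a tiny document with the Document factory methods and serializes it with the same Transformer technique. The BuildDOMSketch class, its helper methods, and the sample name text are hypothetical; the ampersand is there only to show that escaping is handled for us.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BuildDOMSketch {

    // Build a small inventory document in memory, one node at a time
    static Document build() {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element inventory = document.createElement( "inventory" );
            document.appendChild( inventory );

            Element animal = document.createElement( "animal" );
            animal.setAttribute( "animalClass", "mammal" );
            inventory.appendChild( animal );

            Element name = document.createElement( "name" );
            name.setTextContent( "Song Fang & Co." );
            animal.appendChild( name );
            return document;
        } catch ( ParserConfigurationException e ) {
            throw new IllegalStateException( e );
        }
    }

    // Serialize the DOM to a string instead of System.out
    static String toXML( Document document ) {
        try {
            Transformer transformer =
                TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes" );
            StringWriter text = new StringWriter();
            transformer.transform(
                new DOMSource( document ), new StreamResult( text ) );
            return text.toString();
        } catch ( TransformerException e ) {
            throw new IllegalStateException( e );
        }
    }

    public static void main( String [] args ) {
        System.out.println( toXML( build() ) );
    }
}
```

Every node is created by the Document and then attached to a parent explicitly, which is part of what makes DOM construction feel verbose in Java.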

JDOM

As we promised earlier, we’ll now describe an easier DOM API: JDOM, created by Jason Hunter and Brett McLaughlin, two fellow O’Reilly authors (Java Servlet Programming and Java and XML, respectively). It is a more natural Java DOM that uses real Java collection types such as List for its hierarchy and provides marginally more streamlined methods for building documents. You can get the latest JDOM from http://www.jdom.org/. Here’s the JDOM version of our standard “test” program:

    import org.jdom.*;
    import org.jdom.input.*;
    import org.jdom.output.*;
    import java.util.*;
    
    public class TestJDOM {
        public static void main( String[] args ) throws Exception {
            Document doc = new SAXBuilder().build("zooinventory.xml");
            List animals = doc.getRootElement().getChildren("animal");
            System.out.println("Animals = ");
            for( int i=0; i<animals.size(); i++ ) {
                String name = ((Element)animals.get(i)).getChildText("name");
                String species = ((Element)animals.get(i))
                    .getChildText("species");
                System.out.println( "  "+ name +" ("+species+")" );
            }
            Element foodRecipe = ((Element)animals.get(1))
                .getChild("foodRecipe");
            String name = foodRecipe.getChildText("name");
            System.out.println("Recipe = " + name );
            List ingredients = foodRecipe.getChildren("ingredient");
            for(int i=0; i<ingredients.size(); i++)
                System.out.println( "  "+((Element)ingredients.get(i))
                    .getText() );
        }
    }

The JDOM Element class has some convenient getChild() and getChildren() methods, as well as a getChildText() method for retrieving node text by element name.

Now that we’ve covered the basics of SAX and DOM, we’re going to look at a new API that, in a sense, straddles the two. XPath allows us to target only the parts of a document that we want and gives us the option of getting at those components in DOM form.

XPath

XPath is an expression language for addressing parts of an XML document. You can think of XPath expressions as sort of like regular expressions for XML. They let you pull out parts of an XML document based on patterns. In the case of XPath, the patterns are more concerned with structural information than with character content and the values returned may be either simple text or “live” DOM nodes. With XPath, we can query an XML document for all of the elements with a certain name or in a certain parent-child relationship. We can also apply fairly sophisticated tests or predicates to the nodes, which allows us to construct complex queries such as this one: give me all of the Animals with a Weight greater than the number 400 and a Temperament of irritable whose animalClass attribute is mammal.

The full XPath specification has many features and includes both a compact and more verbose syntax. We won’t try to cover it all here, but the basics are easy and it’s important to know them because XPath expressions are at the core of XSL transformations and other APIs that refer to parts of XML documents. The full specification does not make great bedtime reading, but can be found at http://www.w3.org/TR/xpath.

Nodes

An XPath expression addresses a Node in an XML document tree. The node may be an element (possibly with children) like <animal>...</animal> or it may be a lower-level document node representing an attribute (e.g., animalClass="mammal"), a CDATA block, or even a comment. All of the structure of an XML document is accessible through the XPath syntax. Once we’ve addressed the node, we can either reduce the content to a text string (as we might with a simple text content element like name) or we can access it as a proper DOM tree to further read or manipulate it.

Table 24-2 shows the most basic node-related syntax.

Table 24-2. Basic node-related syntax

| Syntax | Example | Description |
| --- | --- | --- |
| /Name | /inventory/animal | All animal nodes under /inventory. |
| //Name | //animal | All animal nodes anywhere in document. A foodRecipe/animal would also match. |
| Name/* | /inventory/* | All child nodes of inventory (animals and any other elements directly under inventory). |
| @Name | //animal/@animalClass | All animalClass attributes of animals. |
| . | /inventory/animal/. | The current node (all animals). |
| .. | /inventory/animal/.. | The parent node (inventory). |

Nodes are addressed with a slash-separated path based on name. For example, /inventory/animal refers to the set of all animal nodes under the inventory node. If we want to list the names of all animals, we would use /inventory/animal/name. The // syntax matches a node anywhere in a document, at any level of nesting, so //name would match the name elements of animals, foodRecipes, and possibly many other elements. We could be more specific, using //animal/name to match only name elements whose parent is an animal element. Names are case-sensitive, so they must match the tags in the document exactly. The at sign (@) matches attributes. This becomes much more useful with predicates, which we describe next. Finally, the familiar . and .. notation can be used to “move” relative to a node; read on to see how this is used.

Predicates

Predicates let us apply a test to a node. Nodes that pass the test are included in the result set or used to select other nodes (child or parent) relative to them. There are many types of tests available in XPath. Table 24-3 lists a few examples.

Table 24-3. Predicates

| Syntax | Example | Description |
| --- | --- | --- |
| [n] | /inventory/animal[1] | Select the nth element of a set. (Starts with 1 rather than 0.) For example, select the first animal in the inventory. |
| [@name=value] | //animal[@animalClass="mammal"] | Match nodes with the specified attribute value. For example, animals with the animalClass attribute "mammal". |
| [element=value] | //animal[name="Cocoa"] | Match nodes with a child node whose text value is specified. For example, match the animal with a name element containing the simple text "Cocoa". |
| =, !=, >, < | //animal[weight > 400] | Predicates may also test for inequality and numeric greater-/lesser-than value. |
| and, or | //animal[@animalClass="mammal" or @animalClass="reptile"] | Predicates may use logical AND and OR to test. For example, animals whose animalClass is mammal or reptile. |

Predicates can be compounded (AND’ed) using this syntax or simply by adding more predicates, like so:

        //animal[@animalClass="mammal"][weight > 400]

Here, we’ve asked for animals with an animalClass attribute of "mammal" and a weight element containing a number greater than 400.

We can now also see the usefulness of the .. operator. Suppose we want to find all of the animals with a foodRecipe that uses fruit as an ingredient (the comparison is case-sensitive, so the text must match the document exactly):

        //animal/foodRecipe[ingredient="fruit"]/..

The .. means that instead of returning the matching foodRecipe node itself, we return its parent—the animal element. The . (current node) operator is useful in other cases where we use XPath functions to manipulate values in more refined ways. We’ll say a few words about functions next.
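The path and predicate forms above can be tried out from Java with a short sketch like the following, which uses the standard javax.xml.xpath API covered later in this chapter. The XPathPredicateSketch class, its select() helper, and the miniature inventory (including the weights and the reptile named Rex) are hypothetical and exist only to give the expressions something to match.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathPredicateSketch {

    // Hypothetical miniature of the zoo inventory
    static final String XML =
        "<inventory>"
        + "<animal animalClass=\"mammal\"><name>Song Fang</name>"
        +   "<weight>350</weight>"
        +   "<foodRecipe><ingredient>bamboo</ingredient></foodRecipe></animal>"
        + "<animal animalClass=\"mammal\"><name>Cocoa</name>"
        +   "<weight>450</weight>"
        +   "<foodRecipe><ingredient>fruit</ingredient></foodRecipe></animal>"
        + "<animal animalClass=\"reptile\"><name>Rex</name>"
        +   "<weight>900</weight></animal>"
        + "</inventory>";

    // Evaluate the expression and return the text of each matched node
    static List<String> select( String expression ) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList)xpath.evaluate( expression,
                new InputSource( new StringReader( XML ) ),
                XPathConstants.NODESET );
            List<String> result = new ArrayList<>();
            for ( int i = 0; i < nodes.getLength(); i++ ) {
                result.add( nodes.item( i ).getTextContent() );
            }
            return result;
        } catch ( XPathExpressionException e ) {
            throw new IllegalArgumentException( "Bad: " + expression, e );
        }
    }

    public static void main( String [] args ) {
        // First animal only (positions start at 1)
        System.out.println( select( "/inventory/animal[1]/name" ) );
        // Two predicates in a row act as a logical AND
        System.out.println( select(
            "//animal[@animalClass='mammal'][weight > 400]/name" ) );
        // Step back up to the parent animal with ..
        System.out.println( select(
            "//animal/foodRecipe[ingredient='fruit']/../name" ) );
        // Attribute nodes can be selected too
        System.out.println( select( "//animal/@animalClass" ) );
    }
}
```

Single quotes inside the expressions avoid having to escape double quotes in the Java string literals; XPath accepts either.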

Functions

The XPath specification includes not only the basic node traversal and predicate syntax we’ve shown, but also the ability to invoke more open-ended functions that operate on nodes and the node context. These XPath functions cover a wide range of duties and we’ll just give a couple of examples here. The functions fall into a few general categories.

Some functions select node types other than an element. For example, there is no special syntax for selecting an XML comment. Instead you invoke a special method called comment(), like this:

/inventory/comment()

This expression returns any XML comment nodes that are children of the inventory element. XPath also offers a more verbose syntax that duplicates all of the (compact) syntax we’ve discussed, using named axes such as self::node() and parent::node() (corresponding to . and ..).

Other functions look at the context of nodes—for example, last() and count().

/inventory/animal[last()]

This expression selects the last animal child element of inventory in the same way that [n] selects the nth.

//foodRecipe[count(ingredient)>2]

This expression matches all of the foodRecipe elements with more than two ingredients. (Cool, eh?)

Finally, there are many string-related functions. Some are useful for simple tests, but others are really useful only in the context of XSL, where they help out the language (in an awkward way) with basic formatting and string manipulation. For example, the contains() and starts-with() methods can be used to look at the text values inside XML documents:

//animal[starts-with(name,"S")]

This expression matches animals whose name starts with the character S (e.g., Song Fang). The contains() method, similarly, can be used to look for a substring in text.
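Here is a brief sketch that exercises several of these functions through the string-returning form of evaluate(), which is described in the next section. The XPathFunctionSketch class, its first() helper, and the two-animal document are hypothetical stand-ins for the real inventory.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathFunctionSketch {

    // Hypothetical inventory with a comment and two recipes of different sizes
    static final String XML =
        "<inventory><!-- zoo stock -->"
        + "<animal><name>Song Fang</name><foodRecipe>"
        +   "<ingredient>bamboo</ingredient></foodRecipe></animal>"
        + "<animal><name>Cocoa</name><foodRecipe>"
        +   "<ingredient>fruit</ingredient><ingredient>shoots</ingredient>"
        +   "<ingredient>leaves</ingredient></foodRecipe></animal>"
        + "</inventory>";

    // Evaluate to a string: the text of the first match, or a converted number
    static String first( String expression ) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource( new StringReader( XML ) ) );
        } catch ( XPathExpressionException e ) {
            throw new IllegalArgumentException( "Bad: " + expression, e );
        }
    }

    public static void main( String [] args ) {
        // Text of the comment directly under inventory
        System.out.println( first( "/inventory/comment()" ) );
        // Name of the last animal
        System.out.println( first( "/inventory/animal[last()]/name" ) );
        // Owner of the first recipe with more than two ingredients
        System.out.println(
            first( "//foodRecipe[count(ingredient)>2]/../name" ) );
        // String tests on element text
        System.out.println( first( "//animal[starts-with(name,'S')]/name" ) );
        System.out.println( first( "//animal[contains(name,'oco')]/name" ) );
        // A function can also be the whole expression
        System.out.println( first( "count(//ingredient)" ) );
    }
}
```

Note that count() appears both inside a predicate and as a top-level expression; in the second case the numeric result is simply converted to text.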

The XPath API

Now that we’ve got a taste for the syntax, let’s look at how to use the API. The procedure is similar to that of the Java regular expression API for strings. We use a factory to create an XPath object. We can then either evaluate expressions with it or “compile” an expression down to an XPathExpression for better performance if we’re going to use it more than once.

XPath xpath = XPathFactory.newInstance().newXPath();
InputSource source = new InputSource( filename );
         
String result = xpath.evaluate( "//animal/name", source );
// Song Fang

Here we’ve used the simplest form of the evaluate() method, which returns only the first match and takes the value as a string. This method is useful for pulling simple text values from elements. However, if we want the full set of values (e.g., the names of all the animals matched by this expression), we need to return the results as a set of Node objects instead.

The return type of (the overloaded forms of) evaluate() is controlled by identifiers of the XPathConstants class. We can get the result as one of the following: STRING, BOOLEAN, NUMBER, NODE, or NODESET. The default is STRING, which strips out child element tags and returns just the text of the matching nodes. BOOLEAN and NUMBER are conveniences for getting primitive types. NODE and NODESET return org.w3c.dom.Node and NodeList objects, respectively. We need the NodeList to get all the values.

NodeList elements = (NodeList)xpath.evaluate(
    expression, inputSource, XPathConstants.NODESET );
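The “compile” option mentioned earlier follows the same pattern. The sketch below compiles one expression and reuses it against two documents, and also shows the NUMBER return type. The XPathCompileSketch class, its helpers, and the sample documents are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathCompileSketch {

    // Compile the expression once, then run it against each document in turn
    static List<String> allText( String expression, String... documents ) {
        try {
            XPathExpression compiled =
                XPathFactory.newInstance().newXPath().compile( expression );
            List<String> result = new ArrayList<>();
            for ( String xml : documents ) {
                NodeList nodes = (NodeList)compiled.evaluate(
                    new InputSource( new StringReader( xml ) ),
                    XPathConstants.NODESET );
                for ( int i = 0; i < nodes.getLength(); i++ ) {
                    result.add( nodes.item( i ).getTextContent() );
                }
            }
            return result;
        } catch ( XPathExpressionException e ) {
            throw new IllegalArgumentException( "Bad: " + expression, e );
        }
    }

    // NUMBER results arrive as a Double
    static double number( String expression, String xml ) {
        try {
            return (Double)XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource( new StringReader( xml ) ),
                XPathConstants.NUMBER );
        } catch ( XPathExpressionException e ) {
            throw new IllegalArgumentException( "Bad: " + expression, e );
        }
    }

    public static void main( String [] args ) {
        String zooA = "<inventory><animal><name>Song Fang</name></animal>"
            + "<animal><name>Cocoa</name></animal></inventory>";
        String zooB = "<inventory><animal><name>Rex</name></animal></inventory>";

        System.out.println( allText( "//animal/name", zooA, zooB ) );
        System.out.println( number( "count(//animal)", zooA ) );
    }
}
```

Whether compiling pays off depends on how often the expression is reused; for a one-off query the plain evaluate() call is simpler.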

Next, let’s put this together in a useful example.

XMLGrep

This simple example can be used as a command-line utility, like grep, for testing XPath expressions against a file. It applies an XPath expression and then prints the resulting elements as XML text using the same technique we used in our PrintDOM example. Nodes that are not elements (e.g., attributes, comments, and so on) are simply printed with their toString() method, which normally serves well enough to identify them, but you can expand the example to your taste. Here it is:

    import org.w3c.dom.*;
    import org.xml.sax.InputSource;
    import javax.xml.xpath.*;
    import javax.xml.transform.*;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
     
    public class XMLGrep {
     
        public static void printXML( Element element )
            throws TransformerException {
            
            Transformer transformer =
                TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty( OutputKeys.OMIT_XML_DECLARATION,
                "yes" );
            Source source = new DOMSource( element );
            Result output = new StreamResult( System.out );
            transformer.transform( source, output );
            System.out.println();
        }
         
        public static void main( String [] args ) throws Exception {
            if ( args.length != 2 ) {
                System.out.println( "usage: XMLGrep expression file.xml" );
                System.exit(1);
            }
            String expression = args[0], filename = args[1];
             
            XPath xpath = XPathFactory.newInstance().newXPath();
            InputSource inputSource = new InputSource( filename );
             
            NodeList elements = (NodeList)xpath.evaluate(
                expression, inputSource, XPathConstants.NODESET );
             
            for( int i=0; i<elements.getLength(); i++ ) {
                if ( elements.item(i) instanceof Element ) {
                    printXML( (Element)elements.item(i) );
                } else {
                    System.out.println( elements.item(i) );
                }
            }
        }
     
    }

There are again a lot of imports in this example. The transform code in our printXML() method is drawn from the PrintDOM example with one addition. We’ve set a property on the transformer to omit the standard XML declaration that would normally be output for us at the head of our document. Since we may print more than one (root) element, the output is not well-formed XML anyway.

Run the example by passing an XPath expression and the name of an XML file as arguments:

% java XMLGrep "//animal[starts-with(name,'C')]" zooinventory.xml

This example really is useful for trying out XPath. Please give it a whirl. Mastering these expressions (and learning more) will give you great power over XML documents and, again, form the basis for learning about XSL transformations.

XInclude

XInclude is a very simple “import” facility for XML documents. With the XInclude directive, you can easily include one XML document in another either as XML or as plain (and escaped) text. This means that you can break down your documents into as many files as you see fit and reference the pieces in a simple, standard way. We should note that it is also possible to do this in another way, using XML entity declarations, but they are fraught with problems. XInclude is simpler and does what its name implies, including the specified document at the current location; you just have to declare the proper namespace for the new <include> element. Here is an example:

        <Book xmlns:xi="http://www.w3.org/2001/XInclude">
          <Title>Learning Java</Title>
          <xi:include href="chapter1.xml"/>
          <xi:include href="chapter2.xml"/>
          <xi:include href="chapter3.xml"/>
          ...
        </Book>

We’ve used the namespace identifier xi to qualify the <include> elements that we use to import the chapters of our book. By default, the file is imported as XML content, which means that the parser incorporates the included document as part of our document. The resulting DOM or SAX view will show the merged documents as one. Alternatively, we can use the parse attribute to specify that we want the target included as text only. In this case, the text is automatically escaped for us like a CDATA section. For example, we could use it to include an XML example in our book without danger of it being interpreted as part of our file:

        <Example>
          <Title>The Zoo Inventory Example</Title>
          <xi:include parse="text" href="zooinventory.xml"/>
        </Example>

Here, the entire zooinventory.xml file will be included as nicely escaped text for us (not added to our document as XML).

XInclude also allows for “fallback” content to be specified using a nested fallback element. The fallback element holds the content to be used if the included file can’t be found; to point to another file instead, it can itself contain a nested include. For example:

<xi:include parse="text" href="zooinventory.xml">
    <xi:fallback><xi:include parse="text" href="filenotfound.xml"/></xi:fallback>
</xi:include>
 
<xi:include parse="text" href="example.xml">
    <xi:fallback>This example is missing...</xi:fallback>
</xi:include>

In the first case, if zooinventory.xml is not found, the filenotfound.xml file will be included. In the second case, the “missing” text will be included instead of the file. If there is no fallback specified, a parse-time fatal error occurs. An empty fallback element can be used to suppress any error. Fallbacks may also be nested within fallbacks to combine these behaviors.

Enabling XInclude

Getting XInclude to work for us requires simply turning on a couple of flags before we begin parsing our file. First, because the XInclude facility uses namespaces, we have to turn on namespace processing in our parser factory. Second, we have to explicitly tell the parser to interpret the include directives. To modify our PrintDOM example to perform the includes before printing the result, we turn these flags on the factory before creating a DocumentBuilder instance:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
 
// enable XInclude processing
factory.setNamespaceAware( true );
factory.setXIncludeAware( true );
 
DocumentBuilder parser = factory.newDocumentBuilder();
Document document = parser.parse( input );

Both of those options should really be the defaults these days. But they have historically come later to XML and so have been treated as special features that have to be enabled. We should also mention before we move on that XInclude can make use of XPath expressions (via a companion standard called XPointer) in order to include just selected parts of an XML document.
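To try the two flags without setting up files by hand, the following sketch writes a small “book” and one chapter to a temporary directory and parses the result. The XIncludeSketch class, the file names, and the chapter text are hypothetical; the second include deliberately points at a file that does not exist so that its fallback is used.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XIncludeSketch {

    // Write a two-file "book" to a temp directory and parse it with includes
    static Document parseBook() {
        try {
            Path dir = Files.createTempDirectory( "xinclude" );
            dir.toFile().deleteOnExit();

            Path chapter = dir.resolve( "chapter1.xml" );
            Files.write( chapter, "<Chapter>A New Language</Chapter>"
                .getBytes( StandardCharsets.UTF_8 ) );
            chapter.toFile().deleteOnExit();

            Path book = dir.resolve( "book.xml" );
            Files.write( book, (
                "<Book xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
                + "<Title>Learning Java</Title>"
                + "<xi:include href=\"chapter1.xml\"/>"
                + "<xi:include href=\"missing.xml\">"
                +   "<xi:fallback><Chapter>To be written</Chapter></xi:fallback>"
                + "</xi:include>"
                + "</Book>" ).getBytes( StandardCharsets.UTF_8 ) );
            book.toFile().deleteOnExit();

            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware( true );   // XInclude relies on namespaces
            factory.setXIncludeAware( true );    // resolve the include directives
            return factory.newDocumentBuilder().parse( book.toFile() );
        } catch ( Exception e ) {
            throw new IllegalStateException( e );
        }
    }

    public static void main( String [] args ) {
        NodeList chapters = parseBook().getElementsByTagName( "Chapter" );
        for ( int i = 0; i < chapters.getLength(); i++ ) {
            System.out.println( chapters.item( i ).getTextContent() );
        }
    }
}
```

Relative href values are resolved against the location of the including file, which is why the book is parsed from a File rather than from a string.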

Validating Documents

Words, words, mere words, no matter from the heart.

William Shakespeare, Troilus and Cressida

In this section, we talk about DTDs and XML Schema, two ways to enforce rules in an XML document. A DTD is a simple grammar guide for an XML document, defining which tags may appear where, in what order, with what attributes, etc. XML Schema is the next generation of DTD. With XML Schema, you can describe the data content of the document as well as the structure. XML Schemas are written in terms of primitives, such as numbers, dates, and simple regular expressions, and also allow the user to define complex types in a grammar-like fashion. The word schema means a blueprint or plan for structure, so we’ll refer to DTDs and XML Schema collectively as schema where either applies.

DTDs, although much more limited in capability, are still widely used. This may be partly due to the complexity involved in writing XML Schemas by hand. The W3C XML Schema standard is verbose and cumbersome, which may explain why several alternative syntaxes have sprung up. The javax.xml.validation API performs XML validation in a pluggable way. Out of the box, it supports only W3C XML Schema, but new schema languages can be added in the future. Validating with a DTD is supported as an older feature directly in the SAX parser. We’ll use both in this section.

Using Document Validation

XML’s validation of documents is a key piece of what makes it useful as a data format. Using a schema is somewhat analogous to the way Java classes enforce type checking in the language. A schema defines document types. Documents conforming to a given schema are often referred to as instance documents of the schema.

This type safety provides a layer of protection that eliminates having to write complex error-checking code. However, validation may not be necessary in every environment. For example, when the same tool generates XML and reads it back in a short time span, validation may not be necessary. It is invaluable, though, during development. Sometimes document validation is used during development and turned off in production environments.

DTDs

The DTD language is fairly simple. A DTD is primarily a set of special tags that define each element in the document and, for complex types, provide a list of the elements it may contain. The DTD <!ELEMENT> tag consists of the name of the tag and either a special keyword for the data type or a parenthesized list of elements.

<!ELEMENT Name ( #PCDATA )>
<!ELEMENT Document ( Head, Body )>

The special identifier #PCDATA (parsed character data) indicates a string. When a list is provided, the elements are expected to appear in that order. The list may contain sublists, and alternatives may be expressed using a vertical bar (|) as an OR operator. Special notation can also be used to indicate how many of each item may appear; two examples of this notation are shown in Table 24-4.

Table 24-4. DTD notation defining occurrences

Character    Meaning
*            Zero or more occurrences
?            Zero or one occurrence

Attributes of an element are defined with the <!ATTLIST> tag. This tag enables the DTD to enforce rules about attributes. It accepts a list of identifiers and a default value:

<!ATTLIST animal animalClass (unknown | mammal | reptile) "unknown">

This ATTLIST says that the animal element has an animalClass attribute that can have one of several values (here unknown, mammal, or reptile). The default is unknown.

We won’t cover everything you can do with DTDs here. But the following example will guarantee zooinventory.xml follows the format we’ve described. Place the following in a file called zooinventory.dtd (or grab this file from http://oreil.ly/Java_4E):

<!ELEMENT inventory ( animal* )>
<!ELEMENT animal ( name, species, habitat, (food | foodRecipe), temperament, 
    weight )>
<!ATTLIST animal animalClass ( unknown | mammal | reptile | bird | fish ) 
    "unknown">
<!ELEMENT name ( #PCDATA )>
<!ELEMENT species ( #PCDATA )>
<!ELEMENT habitat ( #PCDATA )>
<!ELEMENT food ( #PCDATA )>
<!ELEMENT weight ( #PCDATA )>
<!ELEMENT foodRecipe ( name, ingredient+ )>
<!ELEMENT ingredient ( #PCDATA )>
<!ELEMENT temperament ( #PCDATA )>

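The DTD says that an inventory consists of any number of animal elements. An animal has name, species, and habitat elements followed by either a food or a foodRecipe element, and then temperament and weight elements. The structure of foodRecipe is defined further down: a name followed by one or more ingredient elements (the + notation means one or more occurrences).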

To use a DTD, we associate it with the XML document. We can do this by placing a DOCTYPE declaration in the XML document itself and allow the XML parser to recognize and enforce it. The Java validation API that we’ll talk about in the next section separates the roles of parsing and validation and can be used to validate arbitrary XML against any kind of schema, including DTDs. The problem is that out of the box, the validation API only implements the (newer) XML schema syntax. So we’ll have to rely on the parser to validate the DTD for us here.

In this case, when a validating parser encounters the DOCTYPE, it attempts to load the DTD and validate the document. There are several forms the DOCTYPE can have, but the one we’ll use is:

<!DOCTYPE inventory SYSTEM "zooinventory.dtd">

Both SAX and DOM parsers can automatically validate documents as they read them, provided that the documents contain a DOCTYPE declaration. However, you have to explicitly ask the parser factory to provide a parser that is capable of validation. To do this, just set the validating property of the parser factory to true before you ask it for an instance of the parser. For example:

...
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating( true );

Again, this setValidating() method is an older, more simplistic way to enable validation of documents that contain DTD references and it is tied to the parser. The new validation package that we’ll discuss later is independent of the parser and more flexible. You should not use the parser-validating method in combination with the new validation API unless you want to validate documents twice for some reason.

Try inserting the setValidating() line in our model builder example after the factory is created. Abuse the zooinventory.xml file by adding or removing an element or attribute and then see what happens when you run the example. You should get useful error messages from the parser indicating the problems and parsing should fail. To get more information about the validation, we can register an org.xml.sax.ErrorHandler object with the parser, but by default, Java installs one that simply prints the errors for us.
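If you would like to experiment without editing files, the following sketch shows the same idea in a self-contained form. It embeds a cut-down DTD directly in the DOCTYPE (an internal subset) and registers an ErrorHandler that collects messages rather than printing them. The class name, the shortened DTD, and the sample documents are assumptions made for this illustration; they are not part of the zoo inventory example files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class DtdValidationSketch {

    // A cut-down DTD placed inside the DOCTYPE instead of a separate file.
    static final String DOCTYPE =
        "<!DOCTYPE inventory [\n"
      + "  <!ELEMENT inventory ( animal* )>\n"
      + "  <!ELEMENT animal ( name, species )>\n"
      + "  <!ATTLIST animal animalClass ( unknown | mammal | reptile )"
      + " \"unknown\">\n"
      + "  <!ELEMENT name ( #PCDATA )>\n"
      + "  <!ELEMENT species ( #PCDATA )>\n"
      + "]>\n";

    static final String GOOD =
        "<inventory><animal animalClass=\"mammal\">"
      + "<name>Song Fang</name><species>Giant Panda</species>"
      + "</animal></inventory>";

    // Missing species element and an attribute value the DTD does not list.
    static final String BAD =
        "<inventory><animal animalClass=\"insect\">"
      + "<name>Cocoa</name>"
      + "</animal></inventory>";

    /** Parses the document with DTD validation on; returns any problems. */
    static List<String> validate(String body) {
        final List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            factory.setValidating(true);   // ask for a validating parser

            DocumentBuilder parser = factory.newDocumentBuilder();
            parser.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) {
                    problems.add("warning: " + e.getMessage());
                }
                public void error(SAXParseException e) {
                    // Validity errors arrive here; parsing continues.
                    problems.add("error: " + e.getMessage());
                }
                public void fatalError(SAXParseException e)
                        throws SAXParseException {
                    throw e;   // not well-formed; give up
                }
            });
            parser.parse(new InputSource(new StringReader(DOCTYPE + body)));
        } catch (Exception e) {
            problems.add("fatal: " + e.getMessage());
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println("good document: " + validate(GOOD));
        System.out.println("bad document:  " + validate(BAD));
    }
}
```

Collecting the messages in a list, as this sketch does, is only one option; a handler could just as well log them or throw on the first error to stop parsing early.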

XML Schema

Although DTDs can define the basic structure of an XML document, they don’t provide a very rich vocabulary for describing the relationships between elements and say very little about their content. For example, there is no reasonable way with DTDs to specify that an element is to contain a numeric type or even to govern the length of string data. The XML Schema standard addresses both the structural and data content of an XML document. It is the next logical step and it (or one of the competing schema languages with similar capabilities) should replace DTDs in the future.

XML Schema brings the equivalent of strong typing to XML by drawing on many predefined primitive element types and allowing users to define new complex types of their own. These schemas even allow for types to be extended and used polymorphically, like types in the Java language. Although we can’t cover XML Schema in any detail, we’ll present the equivalent W3C XML Schema for our zooinventory.xml file here:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="inventory">
  <xs:complexType>
    <xs:sequence>
       <xs:element minOccurs="0" maxOccurs="unbounded" ref="animal"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:element name="name" type="xs:string"/>

<xs:element name="animal">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="name"/>
      <xs:element name="species" type="xs:string"/>
      <xs:element name="habitat" type="xs:string"/>
      <xs:choice>
         <xs:element name="food" type="xs:string"/>
         <xs:element ref="foodRecipe"/>
      </xs:choice>
      <xs:element name="temperament" type="xs:string"/>
      <xs:element name="weight" type="xs:double"/>
    </xs:sequence>
    <xs:attribute name="animalClass" default="unknown">
      <xs:simpleType>
        <xs:restriction base="xs:token">
          <xs:enumeration value="unknown"/>
          <xs:enumeration value="mammal"/>
          <xs:enumeration value="reptile"/>
          <xs:enumeration value="bird"/>
          <xs:enumeration value="fish"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>

<xs:element name="foodRecipe">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="name"/>
      <xs:element maxOccurs="unbounded" name="ingredient" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

</xs:schema>

This schema would normally be placed into an XML Schema Definition file, which has a .xsd extension. The first thing to note is that this schema file is a normal, well-formed XML file that uses elements from the W3C XML Schema namespace. In it, we use nested element declarations to define the elements that will appear in our document. As with most languages, there is more than one way to accomplish this task. Here, we have broken out the “complex” animal and foodRecipe elements into their own separate element declarations and referred to them in their parent elements using the ref attribute. In this case, we did it mainly for readability; it would have been legal to have one big, deeply nested element declaration starting at inventory. However, referring to elements by reference in this way also allows us to reuse the same element declaration in multiple places in the document, if needed. Our name element is a small example of this. Although it didn’t do much for us here, we have broken out the name element and referred to it for both animal/name and foodRecipe/name. Breaking out name like this would allow us to use more advanced features of schema and write rules for what a name can be (e.g., how long, what kind of characters are allowed) in one place and reuse that “type” where needed.

Control directives like sequence and choice allow us to define the structure of the child elements allowed and attributes like minOccurs and maxOccurs let us specify cardinality (how many instances). The sequence directive says that the enclosed elements should appear in the specified order (if they are required). The choice directive allows us to specify alternative child elements like food or foodRecipe. We declared the legal values for our animalClass attribute using a restriction declaration and enumeration tags.

Simple types

Although we’ve not really exercised it here, the type attribute of our elements touches on the standardization of types in XML Schema. All of our “text” elements specify a type xs:string, which is a standard XML Schema string type (roughly equivalent to #PCDATA in our DTD). There are many other standard types covering things such as dates, times, periods, numbers, and even URLs. These are called simple types (though some of them are not so simple) because they are standardized or “built-in.” Table 24-5 lists W3C Schema simple types and their corresponding Java types. The correspondence will become useful later when we talk about JAXB and automated binding of XML to Java classes.

Table 24-5. W3C Schema simple types

Schema element type    Java type                    Example
xsd:string             java.lang.String             "This is text"
xsd:boolean            boolean                      true, false, 1, 0
xsd:byte               byte
xsd:unsignedByte       short
xsd:integer            java.math.BigInteger
xsd:int                int
xsd:unsignedInt        long
xsd:long               long
xsd:short              short
xsd:unsignedShort      int
xsd:decimal            java.math.BigDecimal
xsd:float              float
xsd:double             double
xsd:QName              javax.xml.namespace.QName    funeral:corpse
xsd:dateTime           java.util.Calendar           2004-12-27T15:39:05.000-06:00
xsd:base64Binary       byte[]                       PGZv
xsd:hexBinary          byte[]                       FFFF
xsd:time               java.util.Calendar           15:39:05.000-06:00
xsd:date               java.util.Calendar           2004-12-27
xsd:anySimpleType      java.lang.String

For example, we have a floating-point weight element like this in our animal:

<weight>400.5</weight>

We can now validate it in our schema by inserting the following entry at the appropriate place:

<xs:element name="weight" type="xs:double"/>

In addition to enforcing that the content of elements matches these simple types, XML Schema can give us much more control over the text and values of elements in our document using simple rules and patterns analogous to regular expressions.
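As a rough illustration of those rules, the sketch below restricts a name element to at most 20 characters that must match a simple pattern, and keeps weight as an xs:double. The schema fragment, the class name, and the particular limits are invented for this example; the maxLength and pattern facets themselves are standard parts of W3C XML Schema. To keep everything in one place, the schema is held in a Java string and checked with the javax.xml.validation API that is covered later in this section.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class SimpleTypeFacetsSketch {

    // Hypothetical schema: name is a restricted string, weight is a double.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"animal\"><xs:complexType><xs:sequence>"
      + "  <xs:element name=\"name\">"
      + "    <xs:simpleType><xs:restriction base=\"xs:string\">"
      + "      <xs:maxLength value=\"20\"/>"
      + "      <xs:pattern value=\"[A-Z][A-Za-z ]*\"/>"
      + "    </xs:restriction></xs:simpleType>"
      + "  </xs:element>"
      + "  <xs:element name=\"weight\" type=\"xs:double\"/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true if the XML text validates against the schema above. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(
                XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(
                new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler registered, the first validation error
            // is thrown as a SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String animal(String name, String weight) {
        return "<animal><name>" + name + "</name><weight>" + weight
            + "</weight></animal>";
    }

    public static void main(String[] args) {
        System.out.println(isValid(animal("Song Fang", "400.5")));
        System.out.println(isValid(animal("song fang", "400.5"))); // pattern
        System.out.println(isValid(animal("Song Fang", "heavy"))); // double
    }
}
```

The pattern syntax looks much like a Java regular expression, but it is XML Schema's own dialect and always has to match the entire value.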

Complex types

In addition to the predefined simple types listed in Table 24-5, we can define our own complex types in our schema. Complex types are element types that have internal structure and possibly child elements. Our inventory, animal, and foodRecipe elements are all complex types and their content must be declared with the complexType tag in our schema. Complex type definitions can be reused, similar to the way that element definitions can be reused in our schema; that is, we can break out a complex type definition and give it a name. We can then refer to that type by name in the type attributes of other elements. Because each of our complex types was used only once, in its corresponding element, we didn’t give them names. They were considered anonymous type definitions, declared and used in the same spot. For example, we could have separated our animal’s type from its element declaration, like so:

<xs:element name="inventory">
  <xs:complexType>
    <xs:sequence>
       <xs:element name="animal" maxOccurs="unbounded" 
           type="AnimalType"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
 
<xs:complexType name="AnimalType">
  <xs:sequence>
    <xs:element ref="name"/>
    <xs:element name="species" type="xs:string"/>
    <xs:element name="habitat" type="xs:string"/>
    ...

Declaring the AnimalType separately from the instance of the animal element declaration would allow us to have other, differently named elements with the same structure. For example, our inventory element may hold another element, mainAttraction, which is a type of animal with a different tag name.
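The following sketch tries out that idea. It declares a shortened AnimalType once and uses it for both an animal element and a mainAttraction element, then checks two small documents against it. The reduced schema (only name and species), the class name, and the sample documents are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class NamedTypeSketch {

    // One named complex type shared by two differently named elements.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"inventory\"><xs:complexType><xs:sequence>"
      + "  <xs:element name=\"mainAttraction\" type=\"AnimalType\""
      + "      minOccurs=\"0\"/>"
      + "  <xs:element name=\"animal\" type=\"AnimalType\""
      + "      minOccurs=\"0\" maxOccurs=\"unbounded\"/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "<xs:complexType name=\"AnimalType\"><xs:sequence>"
      + "  <xs:element name=\"name\" type=\"xs:string\"/>"
      + "  <xs:element name=\"species\" type=\"xs:string\"/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:schema>";

    static final String GOOD =
        "<inventory>"
      + "<mainAttraction><name>Cocoa</name><species>Gorilla</species>"
      + "</mainAttraction>"
      + "<animal><name>Song Fang</name><species>Giant Panda</species>"
      + "</animal>"
      + "</inventory>";

    // mainAttraction must follow AnimalType too, so a missing species fails.
    static final String BAD =
        "<inventory>"
      + "<mainAttraction><name>Cocoa</name></mainAttraction>"
      + "</inventory>";

    /** Returns true if the XML text validates against the schema above. */
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(
                new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;   // first validation error (or a broken schema)
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("good: " + isValid(GOOD));
        System.out.println("bad:  " + isValid(BAD));
    }
}
```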

There’s a lot more to say about W3C XML Schema, and schemas can get quite a bit more complex than our simple example. However, you can do a lot with the few pieces we’ve shown. Some tools are available to help you get started. We’ll talk about one called Trang in a moment. For more information about XML Schema, see the W3C’s site or XML Schema by Eric van der Vlist (O’Reilly). In the next section, we’ll show how to validate a file or DOM model against the XML Schema we’ve just created, using the new validation API.

Generating Schema from XML samples

Many tools can help you write XML Schema. One helpful tool is called Trang. It is part of an alternative schema language project called RELAX NG (which we mention later in this chapter), but Trang is very useful in and of itself. It is an open source tool that can not only convert between DTDs and XML Schema, but also create a rough DTD or XML Schema by reading an “example” XML document. This is a great way to sketch out a basic, starting schema for your documents.

The Validation API

To use our example’s XML schema, we need to exercise the new javax.xml.validation API. As we said earlier, the validation API is an alternative to the simple, parser-based validation supported through the setValidating() method of the parser factories. To use the validation package, we create an instance of a SchemaFactory, specifying the schema language. We can then validate a DOM or stream source against the schema.

The following example, Validate, is in the form of a simple command-line utility that you can use to test out your XML and schemas. Just give it the XML filename and an XML Schema file (.xsd file) as arguments:

    import javax.xml.XMLConstants;
    import javax.xml.validation.*;
    import org.xml.sax.*;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.transform.Source;
    import javax.xml.transform.stream.StreamSource;
     
    public class Validate
    {
        public static void main( String [] args ) throws Exception {
            if ( args.length != 2 ) {
                System.err.println("usage: Validate xmlfile.xml xsdfile.xsd");
                System.exit(1);
            }
            String xmlfile = args[0], xsdfile = args[1];
             
            SchemaFactory factory =
            SchemaFactory.newInstance( XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema( new StreamSource( xsdfile ) );
            Validator validator = schema.newValidator();
             
            ErrorHandler errHandler = new ErrorHandler() {
                public void error( SAXParseException e ) {
                    System.out.println(e);
                }
                public void fatalError( SAXParseException e ) {
                    System.out.println(e); 
                }
                public void warning( SAXParseException e ) { 
                    System.out.println(e); 
                }
            };
            validator.setErrorHandler( errHandler );
             
            try {
                validator.validate( new SAXSource(
                    new InputSource( xmlfile ) ) );
            } catch ( SAXException e ) {
                // Invalid Document, no error handler
            }
        }
    }
         

The schema types supported initially are listed as constants in the XMLConstants class. Right now, only W3C XML Schema is implemented and there is also another intriguing type in there that we’ll mention later. Our validation example follows the pattern we’ve seen before, creating a factory, then a Schema instance. The Schema represents the grammar and can create Validator instances that do the work of checking the document structure. Here, we’ve called the validate() method on a SAXSource, which comes from our file, but we could just as well have used a DOMSource to check an in-memory DOM representation:

validator.validate( new DOMSource(document) );

Any errors encountered will cause the validate method to throw a SAXException, but this is just a coarse means of detecting errors. More generally, and as we’ve shown in this example, we’d want to register an ErrorHandler object with the validator. The error handler can be told about many errors in the document and convey more information. When the error handler is present, the exceptions are given to it and not thrown from the validate method.

The errors generated by these parsers can be a bit cryptic. In some cases, the errors may not be able to report line numbers because the validation is not necessarily being done against a stream.
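To make the DOMSource variation concrete, here is a small sketch that parses a document into a DOM first and then validates the in-memory tree, collecting whatever the error handler is told. The tiny schema, the class name, and the sample documents are invented for this illustration. One detail that is easy to miss: the DOM should be built by a namespace-aware parser, or the validator will not be able to match elements against the schema.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class DomValidationSketch {

    // Hypothetical schema for a single animal element.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"animal\"><xs:complexType><xs:sequence>"
      + "  <xs:element name=\"name\" type=\"xs:string\"/>"
      + "  <xs:element name=\"weight\" type=\"xs:double\"/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    static final String GOOD =
        "<animal><name>Song Fang</name><weight>45.0</weight></animal>";
    static final String BAD =
        "<animal><name>Cocoa</name><weight>heavy</weight></animal>";

    /** Validates an in-memory DOM and returns the reported messages. */
    static List<String> validate(String xml) {
        final List<String> messages = new ArrayList<>();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);   // needed for schema validation
            Document document = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();

            // Because the handler does not rethrow, validation continues
            // and every reported problem ends up in the list.
            validator.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) {
                    messages.add("warning: " + e.getMessage());
                }
                public void error(SAXParseException e) {
                    messages.add("error: " + e.getMessage());
                }
                public void fatalError(SAXParseException e) {
                    messages.add("fatal: " + e.getMessage());
                }
            });
            validator.validate(new DOMSource(document));
        } catch (SAXException | ParserConfigurationException e) {
            messages.add("exception: " + e.getMessage());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return messages;
    }

    public static void main(String[] args) {
        System.out.println("good: " + validate(GOOD));
        System.out.println("bad:  " + validate(BAD));
    }
}
```

A single bad value may be reported more than once, because the validator can describe the same problem at both the datatype level and the element level.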

Alternative schema languages

In addition to DTDs and W3C XML Schema, several other popular schema languages are being used today. One interesting alternative that is tantalizingly referenced in the XMLConstants class is called RELAX NG. This schema language offers the most widely used features of XML Schema in a more human-readable format. In fact, it offers both a very compact, non-XML syntax and a regular XML-based syntax. RELAX NG doesn’t offer the same text pattern and value validation that W3C XML Schema does. Instead, these aspects of validation are left to other tools (many people consider this to be “business logic,” more appropriately implemented outside of the schema anyway). If you are interested in exploring other schema languages, be sure to check out RELAX NG and its useful schema conversion utility, Trang.

JAXB Code Binding and Generation

We’ve said that our ultimate goal in this chapter is automated binding of XML to Java classes. Now we’ll discuss the standard Java API for XML Binding, JAXB. (This should not be confused with JAXP, the parser API.) JAXB is a standard extension that was bundled with Java 6 through Java 10; on Java 11 and later it must be added to the project as a separate library. With JAXB, the developer does not need to create any fragile parsing code. An XML schema or Java code can be used as the starting point for transforming XML to Java and back. (“Schema first” and “code first” are both supported.) With JAXB, you can either mark up your Java classes with simple annotations that map (bind) them to XML or start with an XML schema and generate plain Java classes (POJOs) with the necessary annotations included. You can even derive an XML schema from your Java classes to use as a starting point or contract with non-Java systems.

At runtime, JAXB can read an XML document and parse it into the model that you have defined or you can go the other way, populating your object model and then writing it out to XML. In both cases, JAXB can validate the data to make sure it matches a schema. This may sound like the DOM interface, but in this case we’re not using generic classes—we’re using our own model. In this section, we’ll reuse the class model that we created for the SAX example with our zooinventory.xml file. We’ll use the familiar Inventory, Animal, and FoodRecipe classes directly, but this time you’ll see that we’ll be more focused on the schema and names and less on the parsing machinery.

Annotating Our Model

JAXB gives us a great deal of flexibility in mapping our Java classes to XML elements and there are a lot of special cases. But if we accept most of the default behavior for our model, we can get started with very little work. Let’s start by taking our zoo inventory classes and adding the necessary annotations to allow JAXB to bind it to XML:

@XmlRootElement
public class Inventory {
    public List<Animal> animal = new ArrayList<>();
}

Well, that was easy! Yes, in fact as we hinted at the beginning of the chapter, adding just the @XmlRootElement annotation to the “top level” or root class of our model will yield nearly the same XML that we used before. To generate the XML, we’ll use the following test harness:

    import java.util.Arrays;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.JAXBException;
    import javax.xml.bind.Marshaller;
    
    public class TestJAXBMarshall
    {
        public static void main( String [] args ) throws JAXBException {
            Inventory inventory = new Inventory();

            // First animal: a simple food string and no recipe
            inventory.animal.add( new Animal( Animal.AnimalClass.mammal,
                "Song Fang", "Giant Panda", "China", "Bamboo", "Friendly",
                45.0, null ) );

            // Second animal: a foodRecipe instead of a food string
            FoodRecipe recipe = new FoodRecipe();
            recipe.name = "Gorilla Chow";
            recipe.ingredient.addAll( Arrays.asList( "fruit", "shoots",
                "leaves" ) );
            inventory.animal.add( new Animal( Animal.AnimalClass.mammal,
                "Cocoa", "Gorilla", "Central Africa", null, "Know-it-all",
                45.0, recipe ) );
            
            marshall( inventory );
        }
        
        public static void marshall( Object jaxbObject ) throws JAXBException {
            JAXBContext context = JAXBContext.newInstance(
                jaxbObject.getClass() );
            Marshaller marshaller = context.createMarshaller();
            marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT,
                Boolean.TRUE);
            marshaller.marshal(jaxbObject, System.out);
        }
    }

We’ve taken the liberty of adding some constructors to shorten the code for creating the model, but it doesn’t change the behavior here. It’s just the four lines of our marshall() method that actually use JAXB to write out the XML. We first create a JAXBContext, passing in the class type to be marshalled. We’ve made our marshall() method somewhat reusable by getting the class type from the object passed in. However, it’s sometimes necessary to pass in additional classes to the newInstance() method in order for JAXB to be aware of all of the bound classes that may be needed. In that case, we’d simply pass more class types to the newInstance() method (it accepts a variable argument list of class types). We then create a Marshaller from the context and, for our purposes, set a flag indicating that we would like nice, human-readable output (the default output is one long line of XML). Finally, we tell the marshaller to send our object to System.out.

The output looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<inventory>
    <animal>
        <animalClass>mammal</animalClass>
        <name>Song Fang</name>
        <species>Giant Panda</species>
        <habitat>China</habitat>
        <food>Bamboo</food>
        <temperament>Friendly</temperament>
        <weight>45.0</weight>
    </animal>
    <animal>
        <animalClass>mammal</animalClass>
        <name>Cocoa</name>
        <species>Gorilla</species>
        <habitat>Central Africa</habitat>
        <temperament>Know-it-all</temperament>
        <weight>45.0</weight>
        <foodRecipe>
            <name>Gorilla Chow</name>
            <ingredient>fruit</ingredient>
            <ingredient>shoots</ingredient>
            <ingredient>leaves</ingredient>
        </foodRecipe>
    </animal>
</inventory>

As we said, it’s almost identical to the XML we worked with earlier. Admittedly, we chose to create our XML using the same (common) conventions that JAXB uses, so it’s not entirely magic. The first thing to notice is that JAXB automatically mapped our root class name to a lowercase XML element name (class Inventory to <inventory>) and used our field names for the nested elements (the animal list to <animal>). If we had used JavaBeans-style getter methods instead of public fields, the same would be true; for example, a getSpecies() method would produce a default element name of species.

If we wanted to map our class names and property names to completely different XML names, we could easily accomplish that using the name attribute of the @XmlRootElement and @XmlElement annotations. For example, we can call our Animal “creature” and rename temperament to “personality” like so:

@XmlRootElement(name="creature")
public class Animal 
{
    ...
    @XmlElement(name="personality")
    public String temperament;

The real difference between our generated XML and our earlier sample is that our animalClass attribute is not acting like an attribute. By default, it has been mapped to an element, like the other properties of Animal. We can rectify that with another annotation, @XmlAttribute:

public class Animal 
{
    @XmlAttribute
    public AnimalClass animalClass;    
    ...

// Produces...

<inventory>
    <animal animalClass="mammal">
        <name>Song Fang</name>

Also note that JAXB has shown the food element in the first animal and the foodRecipe in the second. JAXB will ignore a field or property that is null (as is the case here) unless you specify that the property is “nillable” using @XmlElement(nillable=true). That behavior automatically supported the alternation between our two properties.

There are many additional annotations that provide support for mapping Java classes, fields, and properties to other features of XML. Table 24-6 attempts to provide a concise description of what each annotation is used for. Some of the usages get a little complex, so you may want to refer to the Javadoc for more details.

Table 24-6. JAXB Annotations
Annotation Description
@XmlAccessorOrder Used on a package or class to set alphabetic ordering of marshalled fields and properties. (The default ordering is undefined.) See @XmlType to specify the ordering yourself. As a reminder: package-level annotations in Java are placed on a (lonely) package statement in a special file named package-info.java within the corresponding package structure. (See “Annotations”.)
@XmlAccessorType Used on a package or class to specify whether fields and properties are marshalled by default. You can choose: only fields, only properties (getters/setters), none (only those annotated by the user), or all public fields and properties. See @XmlTransient to exclude items.
@XmlAnyAttribute Designates a Java Map object to receive any unbound XML attribute name-value pairs for an entity (i.e., the Map will collect any leftover attributes for which no corresponding property or field can be found).
@XmlAnyElement Designates a Java List or Array object to receive any unbound XML elements for an entity (i.e., the List will accumulate any leftover elements for which no corresponding property or field can be found).
@XmlAttachmentRef Designates a javax.activation.DataHandler object to handle an XML MIME attachment.
@XmlAttribute Binds a Java field or property to an XML attribute. The name attribute can be used to specify an XML attribute name that is different from the name of the field or property. Use the required attribute to specify whether the attribute is required.
@XmlElement Binds a Java field or property to an XML element. The name attribute can be used to specify an XML element name different from the name of the field or property. Use the required attribute to specify whether the element is required.
@XmlElements Used on a Java collection to specify distinct element names for contained items based on their Java type. Holds a list of @XmlElement annotations with name and type attributes that explicitly map Java types in the collection to XML element names (e.g., in our example, inventory contains animal elements because our List property is named “animal”). If we chose to have subclasses of Animal in our inventory collection, we could map them to XML element names such as gorilla and lemur. See @XmlElementRef.
@XmlElementRef Similar to @XmlElements, used to generate individualized names for Java types in a collection. However, instead of the names for each type being specified directly, they are determined at runtime by the individual types’ Java type bindings (e.g., in our example, inventory contains animal elements because our List is named “animal”). Using @XmlElementRef, we could subclass Animal and have our inventory contain elements like gorilla and lemur, with the names determined by @XmlRootElement annotations on the respective subclasses. See important class binding info in @XmlElementRefs.
@XmlElementRefs Used on a Java collection to provide a list of @XmlElementRef annotations with type attributes that explicitly specify the Java types that may appear in the collection. The effect is the same as using a simple @XmlElementRef on the collection, but we actively tell JAXB the class names that have bindings. If not supplied in this way, we have to provide the full list of bound classes to the JAXBContext.newInstance() method in order for them to be recognized.
@XmlElementWrapper Used on a Java collection to cause the sequence of XML elements to be wrapped in the specified element instead of appearing directly inline in the XML (e.g., our animal elements appear directly in inventory). Using this annotation, we could nest them all within a new animals element.
@XmlEnum Binds a Java Enum to XML and allows @XmlEnumValue annotations to be used to map the enum values for XML if required.
@XmlEnumValue Binds an individual Java Enum value to a string to be used in the XML (e.g., our mammal enum value could be mapped to “mammalia”).
@XmlID Supports referential integrity by designating a Java property or field of a class as being the XML ID attribute (a unique key) for the XML element within the document.
@XmlIDREF Supports referential integrity by designating a Java property or field as an idref attribute pointing to an element with an @XmlID. The annotated property or field must contain an instance of a Java type containing an @XmlID annotation. When marshalled, the attribute name will be the property name and the value will be the contained XML ID value.
@XmlInlineBinaryData Binds a Java byte array to receive base64 binary data encoded in the XML.
@XmlList Used on a Java collection to map items to a single simple content element with a whitespace-separated list of values instead of a series of elements.
@XmlMimeType Used with a Java Image or Source type to specify a MIME type for XML base64-encoded binary data bound to it.
@XmlMixed Binds a Java object collection to XML “mixed content” (i.e., XML containing both text and element tags within it). Text will be added to the collection as String objects interleaved with the usual Java types representing the other elements.
@XmlRootElement Binds a Java class to an XML element and optionally provides a name. This is the minimum annotation required on your class to make it possible to marshal it to XML and back.
@XmlElementDecl Used in binding XML schema elements to methods in Java object factories created in some code generation scenarios.
@XmlRegistry Used with @XmlElementDecl in designating Java object factories used in some code generation scenarios.
@XmlSchema Binds a Java package to a default XML namespace.
@XmlNs Used with @XmlSchema to bind a Java package to one or more XML namespace prefixes.
@XmlSchemaType Used on a Java property, field, or package. Specifies a Java type to be used for a standard XML schema built-in type, such as date or a numeric type.
@XmlSchemaTypes Used on a Java package. Holds a list of @XmlSchemaType annotations mapping Java types to built-in XML schema types.
@XmlTransient Designates that a Java property or field should not be marshaled to the XML. This can be used in conjunction with defaults that marshal all properties or fields to exclude individual items. See @XmlAccessorType.
@XmlType Binds a Java class to an XML schema type. Additionally, the propOrder attribute may be used to explicitly list the order in which elements are marshalled to XML.
@XmlValue Designates that a Java property or field contains the “simple” XML content for the Java type; that is, instead of marshalling the class as an XML element containing a nested element for the property, the value of the annotated property will appear directly as the content. The Java type may have only one property designated as @XmlValue.

Unmarshalling from XML

Creating our object model from XML just requires a few lines to create an Unmarshaller from our JAXBContext and a cast to the Java type of our root element:

JAXBContext context = JAXBContext.newInstance( Inventory.class );
Unmarshaller unmarshaller = context.createUnmarshaller();
Inventory inventory = (Inventory)unmarshaller.unmarshal(
    new File("zooinventory.xml") );

The Unmarshaller has a setValidating() method, like the SAXParserFactory, but it is deprecated. Instead, we can use the setSchema() method to supply an XML Schema representation if we want validation as part of the parsing process. Alternatively, we could just validate the document against the schema separately. See “XML Schema”.
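Validating separately can be sketched with the javax.xml.validation API alone. The following is a minimal sketch rather than a definitive recipe: the class name ValidateSeparately, the isValid() helper, and the tiny inline schema and documents are all assumptions made for illustration, not part of our zoo inventory example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class ValidateSeparately
{
    // A deliberately tiny schema, invented for this sketch: an inventory
    // element holding one or more animal elements that contain text.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='inventory'><xs:complexType><xs:sequence>"
      + "<xs:element name='animal' type='xs:string' maxOccurs='unbounded'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Returns true if the XML text is valid according to the schema above.
    static boolean isValid( String xml ) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance( XMLConstants.W3C_XML_SCHEMA_NS_URI );
            // The same Schema object could instead be handed to
            // Unmarshaller.setSchema() to validate while unmarshalling.
            Schema schema = factory.newSchema(
                new StreamSource( new StringReader( XSD ) ) );
            Validator validator = schema.newValidator();
            validator.validate(
                new StreamSource( new StringReader( xml ) ) );
            return true;
        } catch ( SAXException e ) {
            return false;  // the document (or the schema) was rejected
        } catch ( IOException e ) {
            throw new IllegalStateException( e );
        }
    }

    public static void main( String [] args ) {
        System.out.println( isValid(
            "<inventory><animal>Song Fang</animal></inventory>" ) );
        System.out.println( isValid( "<inventory><zebra/></inventory>" ) );
    }
}
```

Run as is, the sketch should print true for the first document and false for the second, because zebra is not an element the schema allows inside inventory.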

Generating a Java Model from an XML Schema

If you are starting with an XML Schema (xsd file), you can generate annotated Java classes from the schema using the JAXB xjc command-line tool that comes with the JDK.

xjc zooinventory.xsd

// Output
parsing a schema...
compiling a schema...
generated/Animal.java
generated/FoodRecipe.java
generated/Inventory.java
generated/ObjectFactory.java

By default, the output is placed in the default package in a directory named generated. You can control the package name with the -p switch and the directory with -d. See the xjc documentation for more options.

Studying the generated classes will give you some hints as to how many of the annotations are used, although xjc is a little more verbose than it has to be. Also note that xjc produces a class called ObjectFactory that contains factory methods for each type, such as createInventory() and createAnimal(). If you look at these methods, you’ll see that they really just call new on the plain Java objects, so they seem superfluous. The ObjectFactory is mainly there for legacy reasons: in earlier versions of JAXB, before annotations, the generated classes were not as simple to construct. Additionally, the ObjectFactory contains a helper method to create a JAXBElement type, which may be useful in special situations. For the most part, you can ignore these.

Generating an XML Schema from a Java Model

You can also generate an XML Schema directly from your annotated Java classes using the JAXB XML Schema binding generator: schemagen. The schemagen command-line tool comes with the JDK. It can generate a schema starting with Java source or class files. Use the -classpath argument to specify the location of the classes or source files and then provide the name of the root class in your hierarchy:

schemagen -classpath . Inventory

Having worked our way through the options for bridging XML to Java, we’ll now turn our attention to transformations on XML itself with XSL, the styling language for XML.

Transforming Documents with XSL/XSLT

Earlier in this chapter, we used a Transformer object to copy a DOM representation of an example back to XML text. We mentioned that we were not really tapping the potential of the Transformer. Now, we’ll give you the full story.

The javax.xml.transform package is the API for using the XSL/XSLT transformation language. XSL stands for Extensible Stylesheet Language. Like Cascading Stylesheets (CSS) for HTML, XSL allows us to “mark up” XML documents by adding tags that provide presentation information. XSL Transformation (XSLT) takes this further by adding the ability to completely restructure the XML and produce arbitrary output. XSL and XSLT together make up their own programming language for processing an XML document as input and producing another (usually XML) document as output. (From here on in, we’ll refer to them collectively as XSL.)

XSL is extremely powerful, and new applications for its use arise every day. For example, consider a website that is frequently updated and that must provide access to a variety of mobile devices and traditional browsers. Rather than recreating the site for these and additional platforms, XSL can transform the content to an appropriate format for each platform. More generally, rendering content from XML is simply a better way to preserve your data and keep it separate from your presentation information. XSL can be used to render an entire website in different styles from files containing “pure data” in XML, much like a database. Multilingual sites also benefit from XSL to lay out text in different ways for different audiences.

You can probably guess the caveat that we’re going to issue: XSL is a big topic worthy of its own books (see, for example, O’Reilly’s Java and XSLT by Eric Burke), and we can only give you a taste of it here. Furthermore, some people find XSL difficult to understand at first glance because it requires thinking in terms of recursively processing document tags. In recent years, much of the impetus behind XSL as a way to produce web-based content has fallen away in favor of using more JavaScript on the client. However, XSL remains a powerful way to transform XML and is widely used in other document-oriented applications.

XSL Basics

XSL is an XML-based standard, so it should come as no surprise that the language is based on XML. An XSL stylesheet is an XML document using special tags defined by the XSL namespace to describe the transformation. The most basic XSL operations involve matching parts of the input XML document and generating output based on their contents. One or more XSL templates live within the stylesheet and are called in response to tags appearing in the input. XSL is often used in a purely input-driven way, whereby input XML tags trigger output in the order in which they appear, using only the information they contain. But more generally, the output can be constructed from arbitrary parts of the input, drawing from it like a database, composing elements and attributes. The XSLT transformation part of XSL adds things like conditionals and iteration to this mix, which enable any kind of output to be generated based on the input.

An XSL stylesheet contains a stylesheet tag as its root element. By convention, the stylesheet defines a namespace prefix xsl for the XSL namespace. Within the stylesheet are one or more template tags, each containing a match attribute that describes the element upon which it operates.

<xsl:stylesheet
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

   <xsl:template match="/">
     I found the root of the document!
   </xsl:template>

</xsl:stylesheet>

When a template matches an element, it has an opportunity to handle all the children of the element. The simple stylesheet shown here has one template that matches the root of the input document and simply outputs some plain text. By default, input not matched is simply copied to the output with its tags stripped (HTML convention). But here we match the root so we consume the entire input and nothing but our message appears on the output.
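To see both behaviors side by side, here is a small sketch that runs two stylesheets over the same input using the javax.xml.transform API. The class name DefaultRules, the transform() helper, and the two-element sample document are assumptions made for this illustration. Both stylesheets also declare a text output method, which is our addition so that no XML declaration is written to the result.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class DefaultRules
{
    // Sample input, written without whitespace between the tags.
    static final String XML =
        "<inventory><animal><name>Song Fang</name>"
      + "<species>Giant Panda</species></animal></inventory>";

    // No templates at all, so only the default behavior applies.
    static final String EMPTY_XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "</xsl:stylesheet>";

    // One template matching the root, which consumes the entire input.
    static final String ROOT_XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>I found the root of the document!"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet text to the XML text and returns the result.
    static String transform( String xsl, String xml ) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer( new StreamSource( new StringReader( xsl ) ) );
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource( new StringReader( xml ) ),
                new StreamResult( out ) );
            return out.toString();
        } catch ( TransformerException e ) {
            throw new IllegalStateException( e );
        }
    }

    public static void main( String [] args ) {
        System.out.println( transform( EMPTY_XSL, XML ) );
        System.out.println( transform( ROOT_XSL, XML ) );
    }
}
```

The first line printed should be Song FangGiant Panda (the input text with its tags stripped), and the second should be only the message from the root template.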

The match attribute can refer to elements using the XPath notation that we described earlier. This is a hierarchical path starting with the root element. For example, match="/inventory/animal" would match only the animal elements from our zooinventory.xml file. In XSL, the path may be absolute (starting with “/”) or relative, in which case, the template detects whenever that element appears in any subcontext (equivalent to “//” in XPath).
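The difference between a relative and an absolute match can be sketched with a document that has name elements at two different depths. In this illustration the class name MatchPaths, the stylesheet() and transform() helpers, and the sample document are all assumptions. The single template simply wraps whatever it matches in square brackets.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class MatchPaths
{
    // Two name elements: one under animal, one under foodRecipe.
    static final String XML =
        "<inventory><animal><name>Song Fang</name>"
      + "<foodRecipe><name>Gorilla Chow</name></foodRecipe>"
      + "</animal></inventory>";

    // Builds a stylesheet whose only template brackets what it matches.
    static String stylesheet( String match ) {
        return "<xsl:stylesheet version='1.0'"
             + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
             + "<xsl:output method='text'/>"
             + "<xsl:template match='" + match + "'>"
             + "[<xsl:value-of select='.'/>]"
             + "</xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Applies the stylesheet text to the XML text and returns the result.
    static String transform( String xsl, String xml ) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer( new StreamSource( new StringReader( xsl ) ) );
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource( new StringReader( xml ) ),
                new StreamResult( out ) );
            return out.toString();
        } catch ( TransformerException e ) {
            throw new IllegalStateException( e );
        }
    }

    public static void main( String [] args ) {
        // Relative: matches a name element wherever it appears.
        System.out.println( transform( stylesheet( "name" ), XML ) );
        // Absolute: matches only the name directly under an animal.
        System.out.println(
            transform( stylesheet( "/inventory/animal/name" ), XML ) );
    }
}
```

With the relative pattern, both names should come out bracketed: [Song Fang][Gorilla Chow]. With the absolute pattern, only the animal's name is bracketed, and the recipe name falls through to the default behavior as plain text: [Song Fang]Gorilla Chow.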

Within the template, we can put whatever we want as long as it is well-formed XML (if not, we can use a CDATA section or XInclude). But the real power comes when we use parts of the input to generate output. The XSL value-of tag is used to output the content of an element or of one of its children. For example, the following template would match an animal element and output the value of its name child element:

<xsl:template match="animal">
   Name: <xsl:value-of select="name"/>
</xsl:template>

The select attribute uses an XPath expression relative to the current node. In this case, we tell it to print the value of the name element within animal. We could have used a relative path to a more deeply nested element within animal or even an absolute path to another part of the document. To refer to the “current” element (in this case, the animal element itself), a select expression can use “.” as the path. The select expression can also retrieve attributes from the elements that it references.
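As a rough illustration of these select expressions, the following sketch pulls a child element, an attribute, and a more deeply nested element out of a single animal. The class name SelectPaths, the transform() helper, and the one-animal sample document are assumptions for this example. The xsl:text elements are there only to control the spacing of the output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SelectPaths
{
    static final String XML =
        "<inventory><animal animalClass='mammal'><name>Cocoa</name>"
      + "<foodRecipe><name>Gorilla Chow</name></foodRecipe>"
      + "</animal></inventory>";

    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='animal'>"
        // A child element of the current node
      + "<xsl:value-of select='name'/>"
      + "<xsl:text> (</xsl:text>"
        // An attribute of the current node
      + "<xsl:value-of select='@animalClass'/>"
      + "<xsl:text>) eats </xsl:text>"
        // A more deeply nested element, still relative to animal
      + "<xsl:value-of select='foodRecipe/name'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet text to the XML text and returns the result.
    static String transform( String xsl, String xml ) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer( new StreamSource( new StringReader( xsl ) ) );
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource( new StringReader( xml ) ),
                new StreamResult( out ) );
            return out.toString();
        } catch ( TransformerException e ) {
            throw new IllegalStateException( e );
        }
    }

    public static void main( String [] args ) {
        System.out.println( transform( XSL, XML ) );
    }
}
```

For this input, the program should print: Cocoa (mammal) eats Gorilla Chow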

If we try to add the animal template to our simple example, it won’t generate any output. What’s the problem? If you recall, we said that a template matching an element has the opportunity to process all its children. We already have a template matching the root (“/”), so it is consuming all the input. The answer to our dilemma—and this is where things get a little tricky—is to delegate the matching to other templates using the apply-templates tag. The following example correctly prints the names of all the animals in our document:

<xsl:stylesheet
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

   <xsl:template match="/">
      Found the root!
      <xsl:apply-templates/>
   </xsl:template>

   <xsl:template match="animal">
      Name: <xsl:value-of select="name"/>
   </xsl:template>

</xsl:stylesheet>

We still have the opportunity to add output before and after the apply-templates tag. But upon invoking it, the template matching continues from the current node. Next, we’ll use what we have so far and add a few bells and whistles.

Transforming the Zoo Inventory

Your boss just called, and it’s now imperative that your zoo clients have access to the zoo inventory through the Web, today! After reading Chapter 15, you should be thoroughly prepared to build a nice “zoo app.” Let’s get started by creating an XSL stylesheet to turn our zooinventory.xml into HTML:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

   <xsl:template match="/inventory">
      <html><head><title>Zoo Inventory</title></head>
      <body><h1>Zoo Inventory</h1>
      <table border="1">
      <tr><td><b>Name</b></td><td><b>Species</b></td>
      <td><b>Habitat</b></td><td><b>Temperament</b></td>
      <td><b>Diet</b></td></tr>
         <xsl:apply-templates/>
            <!-- Process the animals in the inventory -->
      </table>
      </body>
      </html>
   </xsl:template>

   <xsl:template match="inventory/animal">
      <tr><td><xsl:value-of select="name"/></td>
          <td><xsl:value-of select="species"/></td>
          <td><xsl:value-of select="habitat"/></td>
          <td><xsl:value-of select="temperament"/></td>
          <td><xsl:apply-templates select="food|foodRecipe"/>
            <!-- Process food or foodRecipe --></td></tr>
   </xsl:template>

   <xsl:template match="foodRecipe">
      <table>
      <tr><td><em><xsl:value-of select="name"/></em></td></tr>
      <xsl:for-each select="ingredient">
         <tr><td><xsl:value-of select="."/></td></tr>
      </xsl:for-each>
      </table>
   </xsl:template>

</xsl:stylesheet>

The stylesheet contains three templates. The first matches /inventory and outputs the beginning of our HTML document (the header) along with the start of a table for the animals. It then delegates using apply-templates before closing the table and adding the HTML footer. The next template matches inventory/animal, printing one row of an HTML table for each animal. Although there are no other animal elements in the document, it still doesn’t hurt to specify that we will match an animal only in the context of an inventory, because, in this case, we are relying on inventory to start and end our table. (This template makes sense only in the context of an inventory.) Finally, we provide a template that matches foodRecipe and prints a small, nested table for that information. foodRecipe makes use of the "for-each" operation to loop over child nodes with a select specifying that we are only interested in ingredient children. For each ingredient, we output its value in a row.

There is one more thing to note in the animal template. Our apply-templates element has a select attribute that limits the elements affected. In this case, we are using the "|" regular expression-like syntax to say that we want to apply templates for only the food or foodRecipe child elements. Why do we do this? Because we didn’t match the root of the document (only inventory), we still have the default stylesheet behavior of outputting the plain text of nodes that aren’t matched anywhere else. We take advantage of this behavior to print the text of the food element. But we don’t want to output the text of all of the other elements of animal that we’ve already printed explicitly, so we process only the food and foodRecipe elements. Alternatively, we could have been more verbose, adding a template matching the root and another template just for the food element. That would also mean that new tags added to our XML would, by default, be ignored and not change the output. This may or may not be the behavior you want, and there are other options as well. As with all powerful tools, there is usually more than one way to do something.
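The effect of the restricted select can be seen in isolation with a stripped-down sketch that produces plain text instead of HTML. Everything here is an assumption made for illustration: the class name FoodSelect, the transform() helper, the two-animal sample document, and the simplified templates, which only mimic the structure of the templates described above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FoodSelect
{
    // One animal with a simple food element, one with a foodRecipe.
    static final String XML =
        "<inventory>"
      + "<animal><name>Song Fang</name><food>Bamboo</food>"
      + "<weight>45.0</weight></animal>"
      + "<animal><name>Cocoa</name><foodRecipe><name>Gorilla Chow</name>"
      + "<ingredient>fruit</ingredient><ingredient>shoots</ingredient>"
      + "</foodRecipe><weight>45.0</weight></animal>"
      + "</inventory>";

    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='inventory/animal'>"
      + "<xsl:value-of select='name'/><xsl:text> eats </xsl:text>"
        // Only food and foodRecipe are processed further; weight is not.
      + "<xsl:apply-templates select='food|foodRecipe'/>"
      + "<xsl:text>;</xsl:text>"
      + "</xsl:template>"
        // food has no template of its own, so its text is output by default.
      + "<xsl:template match='foodRecipe'>"
      + "<xsl:value-of select='name'/><xsl:text>:</xsl:text>"
      + "<xsl:for-each select='ingredient'>"
      + "<xsl:text> </xsl:text><xsl:value-of select='.'/>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet text to the XML text and returns the result.
    static String transform( String xsl, String xml ) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer( new StreamSource( new StringReader( xsl ) ) );
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource( new StringReader( xml ) ),
                new StreamResult( out ) );
            return out.toString();
        } catch ( TransformerException e ) {
            throw new IllegalStateException( e );
        }
    }

    public static void main( String [] args ) {
        System.out.println( transform( XSL, XML ) );
    }
}
```

The result should be Song Fang eats Bamboo;Cocoa eats Gorilla Chow: fruit shoots; and the weight values never appear, because apply-templates was told to look only at food and foodRecipe.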

XSLTransform

Now that we have a stylesheet, let’s apply it! The following simple program, XSLTransform, uses the javax.xml.transform package to apply the stylesheet to an XML document and print the result. You can use it to experiment with XSL and our example code.

    import javax.xml.transform.*;
    import javax.xml.transform.stream.*;
    
    public class XSLTransform 
    {
        public static void main( String [] args ) throws Exception {
            if ( args.length < 2 || !args[0].endsWith(".xsl") ) {
                System.err.println("usage: XSLTransform file.xsl file.xml");
                System.exit(1);
            }
            String xslFile = args[0], xmlFile = args[1];
    
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = 
                factory.newTransformer( new StreamSource( xslFile ) );
            StreamSource xmlsource = new StreamSource( xmlFile );
            StreamResult output = new StreamResult( System.out );
            transformer.transform( xmlsource, output );
        }
    }

Run XSLTransform, passing the XSL stylesheet and XML input, as in the following command:

% java XSLTransform zooinventory.xsl zooinventory.xml > zooinventory.html

The output should look like Figure 24-2.

Figure 24-2. Image of the zoo inventory table

Constructing the transform is a similar process to that of getting a SAX or DOM parser. The difference from our earlier use of the TransformerFactory is that this time, we construct the transformer, passing it the XSL stylesheet source. The resulting Transformer object is then a dedicated machine that knows how to take input XML and generate output according to its rules.

One important thing to note about the Transformer used in XSLTransform is that it is not guaranteed to be thread-safe. In our example, we run the transform only once. If you are planning to run the same transform many times, you should take the additional step of getting a Templates object for the transform first, then using it to create Transformers.

Templates templates =
    factory.newTemplates( new StreamSource( args[0] ) );
Transformer transformer = templates.newTransformer();

The Templates object holds the parsed representation of the stylesheet in a compiled form and makes the process of getting a new Transformer much faster. The transformers themselves may also be more highly optimized in this case. The XSL transformer actually generates bytecode for very efficient “translets” that implement the transform. This means that instead of the transformer reading a description of what to do with your XML, it actually produces a small compiled program to execute the instructions!
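One way this might look in a multithreaded program is sketched below: the stylesheet is compiled into a Templates object once, and each task then asks it for a private Transformer. The class name SharedTemplates, the transformAll() helper, the pool size of four, and the one-template stylesheet are all assumptions chosen for the illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SharedTemplates
{
    // A tiny stylesheet that outputs the name of each animal as text.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='animal'>"
      + "<xsl:value-of select='name'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Transforms every document on a pool thread, in the order given.
    static List<String> transformAll( List<String> documents ) {
        ExecutorService pool = Executors.newFixedThreadPool( 4 );
        try {
            // Compile the stylesheet once; Templates may be shared by threads.
            Templates templates = TransformerFactory.newInstance()
                .newTemplates( new StreamSource( new StringReader( XSL ) ) );

            List<Future<String>> pending = new ArrayList<>();
            for ( String xml : documents ) {
                Callable<String> task = () -> {
                    // Each task gets its own Transformer, never a shared one.
                    Transformer transformer = templates.newTransformer();
                    StringWriter out = new StringWriter();
                    transformer.transform(
                        new StreamSource( new StringReader( xml ) ),
                        new StreamResult( out ) );
                    return out.toString();
                };
                pending.add( pool.submit( task ) );
            }

            List<String> results = new ArrayList<>();
            for ( Future<String> future : pending )
                results.add( future.get() );
            return results;
        } catch ( Exception e ) {
            throw new IllegalStateException( e );
        } finally {
            pool.shutdown();
        }
    }

    public static void main( String [] args ) {
        System.out.println( transformAll( Arrays.asList(
            "<inventory><animal><name>Song Fang</name></animal></inventory>",
            "<inventory><animal><name>Cocoa</name></animal></inventory>" ) ) );
    }
}
```

The main() method here should print [Song Fang, Cocoa]. Collecting the Future objects in submission order keeps the results lined up with the input documents regardless of which thread finishes first.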

XSL in the Browser

With our XSLTransform example, you can see how you’d go about rendering XML to an HTML document on the server side. But as mentioned in the introduction, modern web browsers support XSL on the client side as well. Browsers can automatically download an XSL stylesheet and use it to transform an XML document. To make this happen, just add a standard XSL stylesheet reference in your XML. You can put the stylesheet directive next to your DOCTYPE declaration in the zooinventory.xml file:

<?xml-stylesheet type="text/xsl" href="zooinventory.xsl"?>

As long as the zooinventory.xsl file is available at the same location (base URL) as the zooinventory.xml file, the browser will use it to render HTML on the client side.

Web Services

As we saw in our web services examples in Chapters 14 and 15, one of the most interesting uses for XML is web services. A web service is simply an application service supplied over the network, making use of XML to describe the request and response. Normally, web services run over HTTP and use an XML-based protocol called Simple Object Access Protocol (SOAP), a W3C standard. The combination of XML and HTTP provides a widely accessible interface for services.

SOAP and other XML-based remote procedure call mechanisms can be used in place of Java RMI for cross-platform communications. Web services are widely used and it is likely that they will continue to grow in importance in coming years. To learn more about Java APIs related to web services, check out the networking chapters of this book and take a look at http://java.sun.com/webservices/.

That’s it for our brief introduction to XML. There is a lot more to learn about this exciting area, and many of the APIs are evolving rapidly. We hope we’ve given you a good start.

The End of the Book

With this chapter, we also wrap up the main part of our book. We hope that you’ve enjoyed Learning Java. This, the fourth edition of Learning Java, is really the sixth edition of the series that began seventeen years ago with Exploring Java. It’s been a long and amazing trip watching Java develop in that time, and we thank those of you who have come along with us over the years. As always, we welcome your feedback to help us keep making this book better in the future. Ready for another decade of Java? We are!



[49] To read Berners-Lee’s original proposal to CERN, go to http://www.w3.org/History/1989/proposal.html.