Javanotes 9.0, Section 11.5 — A Brief Introduction to XML

Amanda Shelton

Section 11.5

A Brief Introduction to XML

When data is saved to a file or transmitted over
a network, it must be represented in some way that will allow the same data
to be rebuilt later, when the file is read or the transmission is received.
We have seen that there are good reasons to prefer textual, character-based
representations in many cases, but there are many ways to represent a given
collection of data as text. In this section, we’ll take a brief look at
one type of character-based data representation that has become increasingly
common.

XML (eXtensible Markup Language) is a syntax for creating
data representation languages. There are two aspects or levels of XML.
On the first level, XML specifies a strict but relatively simple syntax.
Any sequence of characters that follows that syntax is a
well-formed XML document. On the second level, XML
provides a way of placing further restrictions on what can appear in
a document. This is done by associating a DTD (Document
Type Definition) with an XML document. A DTD is
essentially a list of things that are allowed to appear in the
XML document. A well-formed XML document that has an associated DTD and that follows
the rules of the DTD is said to be a valid XML
document. The idea is that XML is a general format for data representation,
and a DTD specifies how to use XML to represent a particular kind of data.
(There are also alternatives to DTDs, such as XML schemas,
for defining valid XML documents, but let’s ignore them here.)

There is nothing magical about XML. It’s certainly not perfect. It’s a
very verbose language, and some people think it’s ugly. On the other hand,
it’s very flexible. It can be used to represent almost any type of data.
It was built from the start to support all languages and alphabets. Most
important, it has become an accepted standard. There is support in just
about any programming language for processing XML documents. There are
standard DTDs for describing many different kinds of data. There are many
ways to design a data representation language, but XML is one that has happened to
come into widespread use. In fact, it has found its way into almost every
corner of information technology. For example: There are XML languages for representing
mathematical expressions (MathML), musical notation (MusicXML), molecules and
chemical reactions (CML), vector graphics (SVG), and many other kinds of information. XML is
used by OpenOffice and recent versions of Microsoft Office
in the document format for office applications such as word processing,
spreadsheets, and presentations. XML site syndication languages (RSS, ATOM)
make it possible for web sites, newspapers, and blogs to make a list of
recent headlines available in a standard format that can be used by other
web sites and by web browsers; the same format is used to publish podcasts.
And XML is a common format for the electronic exchange of business information.

My purpose here is not to tell you everything there is to know about XML.
I will just explain a few ways in which it can be used in your own programs.
In particular, I will not say anything further about DTDs and
valid XML. For many purposes, it is sufficient to use well-formed XML
documents with no associated DTDs.

11.5.1 Basic XML Syntax

If you know HTML, the language for writing web pages, then XML will look familiar.
An XML document looks a lot like an HTML document.
HTML is not itself an XML language, since it does not follow all the strict XML syntax
rules, but the basic ideas are similar. Here is a short, well-formed XML document:

<?xml version="1.0"?>
<simplepaint version="1.0">
   <background red='1' green='0.6' blue='0.2'/>
   <curve>
      <color red='0' green='0' blue='1'/>
      <symmetric>false</symmetric>
      <point x='83' y='96'/>
      <point x='116' y='149'/>
      <point x='159' y='215'/>
      <point x='216' y='294'/>
      <point x='264' y='359'/>
      <point x='309' y='418'/>
      <point x='371' y='499'/>
      <point x='400' y='543'/>
   </curve>
   <curve>
      <color red='1' green='1' blue='1'/>
      <symmetric>true</symmetric>
      <point x='54' y='305'/>
      <point x='79' y='289'/>
      <point x='128' y='262'/>
      <point x='190' y='236'/>
      <point x='253' y='209'/>
      <point x='341' y='158'/>
   </curve>
</simplepaint>

The first line, which is optional, merely identifies this as an XML document.
This line can also specify other information, such as the character encoding that
was used to encode the characters in the document into binary form. If this
document had an associated DTD, it would be specified in a “DOCTYPE” directive
on the next line of the file.

Aside from the first line, the document is made up of elements,
attributes, and textual content. An element starts with a
tag, such as <curve> and ends with a
matching end-tag such as </curve>.
Between the tag and end-tag is the content of the
element, which can consist of text and nested elements. (In the example, the
only textual content is the true or false in
the <symmetric> elements.) If an element has
no content, then the opening tag and end-tag can be combined into a single
empty tag, such as <point x='83' y='96'/>,
with a "/" before the final ">".
This is an abbreviation for <point x='83' y='96'></point>.
A tag can include attributes such as the x and y
in <point x='83' y='96'/> or the
version in <simplepaint version="1.0">.
A document can also include a few other things, such as comments, that I
will not discuss here.

The author of a well-formed XML document gets to choose the tag names
and attribute names, and meaningful names can be chosen to describe the
data to a human reader. (For a valid XML document
that uses a DTD, it’s the author of the DTD who gets to choose the tag names.)

Every well-formed XML document follows a strict syntax. Here are some
of the most important syntax rules:
Tag names and attribute names in XML are case sensitive. A name must begin with
a letter or underscore and can contain letters, digits, and certain other characters.
Spaces and ends-of-line
are significant only in textual content. Every tag must
either be an empty tag or have a matching end-tag. By “matching” here,
I mean that elements must be properly nested; if a tag is inside some element,
then the matching end-tag must also be inside that element. A document
must have a root element, which contains all the other
elements. The root element in the above example has tag name simplepaint.
Every attribute must have a value, and that value must be enclosed in quotation
marks; either single quotes or double quotes can be used for this. The
special characters < and &, if they appear
in attribute values or textual content, must be written as &lt;
and &amp;. "&lt;" and "&amp;"
are examples of entities. The entities &gt;,
&quot;, and &apos; are also defined, representing
>, double quote, and single quote. (Additional entities can
be defined in a DTD.)
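
When a program writes text of its own into an XML document, these replacements have to be made before the text is output. Here is a minimal sketch of one way to do that. The class name XmlEscape and the method name escape are made up for this illustration and are not part of any standard API:

```java
class XmlEscape {

    // Replaces the five characters that have predefined XML entities.
    // The ampersand must be handled first, since the other
    // replacements introduce new ampersands into the string.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;")
                   .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        String title = "Q&A: is 1 < 2?";
        // prints <note title='Q&amp;A: is 1 &lt; 2?'/>
        System.out.println("<note title='" + escape(title) + "'/>");
    }
}
```

Escaping all five characters is more than is strictly required in every context, but it lets the same method be used for textual content and for attribute values in either kind of quote.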

While this description will not enable you to understand everything that you
might encounter in XML documents, it should allow you to design well-formed
XML documents to represent data structures used in Java programs.
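
One way to get a feel for the rules is to let a parser check some small samples. The following sketch assumes only the standard parsing classes that are covered in the next subsection. The class name WellFormedCheck and the method name isWellFormed are invented for this example:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedCheck {

    // Returns true if the standard parser accepts xml as a well-formed document.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        }
        catch (SAXException e) {
            return false;  // The text violates the XML syntax rules.
        }
        catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);  // Not expected for an in-memory string.
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a><b/></a>",       // properly nested
            "<a><b></a></b>",    // end-tags in the wrong order
            "<A></a>",           // names are case sensitive
            "<a x=1/>",          // attribute value is not quoted
            "<a/><b/>"           // more than one root element
        };
        for (String sample : samples)
            System.out.println(sample + "   " + isWellFormed(sample));
    }
}
```

Only the first sample should be reported as well-formed. The parser may also print its own error message to System.err for each sample that it rejects.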

11.5.2 Working With the DOM

The sample XML file shown above was designed to store information
about simple drawings made by the user. The drawings in question are ones that
could be made using the sample program SimplePaint2.java
from Subsection 7.3.3.
We’ll look at another version of that program that can save
the user’s drawing using an XML format for the data file.
The new version is SimplePaintWithXML.java.
The sample XML document shown earlier in this section
can be used with that program. I designed the format of that document
to represent all the data needed to reconstruct a picture in
SimplePaint2. The document encodes the background color
of the picture and a list of curves. Each <curve>
element contains the data from one object of type CurveData.

It is easy enough to write data in a customized XML format, although we
have to be very careful to follow all the syntax rules. Here is how SimplePaintWithXML
writes the data for a SimplePaint2 picture to a
PrintWriter, out. This produces
an XML file with the same structure as the example shown above:

out.println("<?xml version=\"1.0\"?>");
out.println("<simplepaint version=\"1.0\">");
out.println("   <background red='" + backgroundColor.getRed() + "' green='" +
        backgroundColor.getGreen() + "' blue='" + backgroundColor.getBlue() + "'/>");
for (CurveData c : curves) {
    out.println("   <curve>");
    out.println("      <color red='" + c.color.getRed() + "' green='" +
            c.color.getGreen() + "' blue='" + c.color.getBlue() + "'/>");
    out.println("      <symmetric>" + c.symmetric + "</symmetric>");
    for (Point2D pt : c.points)
        out.println("      <point x='" + pt.getX() + "' y='" + pt.getY() + "'/>");
    out.println("   </curve>");
}
out.println("</simplepaint>");
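
The same println technique can be tried out without the rest of the paint program. The sketch below is a stand-alone illustration, not code from SimplePaintWithXML: it writes a hypothetical <pointlist> document to a string, with each point given as a two-element array so that no graphics classes are needed. The names PointListWriter and toXml are made up for the example:

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.List;

class PointListWriter {

    // Writes a list of points, each given as an {x, y} pair,
    // as a <pointlist> element containing one <point> per pair.
    static String toXml(List<double[]> points) {
        StringWriter text = new StringWriter();
        PrintWriter out = new PrintWriter(text);
        out.println("<?xml version=\"1.0\"?>");
        out.println("<pointlist>");
        for (double[] pt : points)
            out.println("   <point x='" + pt[0] + "' y='" + pt[1] + "'/>");
        out.println("</pointlist>");
        out.flush();  // Make sure everything has reached the StringWriter.
        return text.toString();
    }

    public static void main(String[] args) {
        List<double[]> points =
                List.of(new double[] {83, 96}, new double[] {116, 149});
        System.out.print(toXml(points));
    }
}
```

Note that Java writes a double value such as 83 as 83.0. That causes no problem when the file is read back, since Double.parseDouble() accepts either form.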

Reading the data back into the program is another matter. To reconstruct
the data structure represented by the XML document, it is necessary to
parse the document and extract the data from it. This could be difficult to do by
hand. Fortunately, Java has
a standard API for parsing and processing XML documents. (Actually, it
has two, but we will only look at one of them.)

A well-formed XML document has a certain structure, consisting of elements
containing attributes, nested elements, and textual content. It’s possible to
build a data structure in the computer’s memory that corresponds to the structure
and content of the document. Of course, there are many ways to do this, but there
is one common standard representation known as the Document Object Model,
or DOM. The DOM specifies how to build data structures to represent XML documents,
and it specifies some standard methods for accessing the data in that structure.
The data structure is a kind of tree whose structure mirrors the structure of
the document. The tree is constructed from nodes of various
types. There are nodes to represent elements, attributes, and text. (The tree
can also contain several other types of node, representing aspects of XML that
we can ignore here.) Attributes and text can be processed without directly
manipulating the corresponding nodes, so we will be concerned almost entirely
with element nodes.

(The sample program XMLDemo.java lets you experiment with
XML and the DOM. It has a text area where you can enter an XML document.
Initially, the input area contains the sample XML document from this section.
When you click a button named “Parse XML Input”, the program will attempt
to read the XML from the input box and build a DOM representation of that
document. If the input is not well-formed XML, an error message is displayed.
If it is legal, the program will traverse the DOM representation and
display a list of elements, attributes, and textual content that it
encounters. The program uses a few techniques for processing XML that I won’t discuss here.)

In Java, the DOM representation of an XML document file can be created
with just two statements. If selectedFile is a variable of
type File that represents the XML file, and
xmldoc is of type Document, then

DocumentBuilder docReader 
                 = DocumentBuilderFactory.newInstance().newDocumentBuilder();
xmldoc = docReader.parse(selectedFile);

will open the file, read its contents, and build the DOM representation.
The classes DocumentBuilder and DocumentBuilderFactory
are both defined in the package javax.xml.parsers.
The method docReader.parse() does the actual work. It
will throw an exception if it can’t read the file or if the file does
not contain a legal XML document. If it succeeds, then the value returned
by docReader.parse() is an object that represents the entire
XML document. (This is a very complex task! It has been coded once and
for all into a method that can be used very easily in any Java program. We see
the benefit of using a standardized syntax.)
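
The XML does not have to come from a file. As a sketch, assuming the text is already in a String, the same parser can read it through an InputSource that wraps a StringReader. The class name ParseFromString and its parse method are made up for this illustration, and wrapping every failure in an IllegalArgumentException is just one possible design:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ParseFromString {

    // Builds the DOM representation of XML text that is already in memory.
    static Document parse(String xml) {
        try {
            DocumentBuilder docReader =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return docReader.parse(new InputSource(new StringReader(xml)));
        }
        catch (Exception e) {
            // Covers I/O errors, parser setup errors, and XML syntax errors.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document xmldoc = parse("<simplepaint version='1.0'><curve/></simplepaint>");
        // Print the tag name of the outermost element: simplepaint
        System.out.println(xmldoc.getDocumentElement().getTagName());
    }
}
```

This is convenient for small experiments, since test documents can be written directly into the program.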

The structure of the DOM data structure is defined in the package
org.w3c.dom, which contains several data types that represent
an XML document as a whole and the individual nodes in a document.
The “org.w3c” in the name refers to the World Wide Web Consortium,
W3C, which is the standards organization for the Web.
DOM, like XML, is a general standard, not just a Java standard.
The data types that we need here are Document,
Node, Element, and NodeList.
(They are defined as interfaces rather than classes,
but that fact is not relevant here.) We can use methods that are defined
in these data types to access the data in the DOM representation of
an XML document.

An object of type Document represents an entire
XML document. The return value of docReader.parse()—xmldoc
in the above example—is of type Document.
We will only need one method from this interface: If xmldoc
is of type Document, then

xmldoc.getDocumentElement()

returns a value of type Element that represents the
root element of the document. (Recall that this is the top-level element
that contains all the other elements.) In the sample XML document from earlier
in this section, the root element consists of the tag
<simplepaint version="1.0">, the end-tag
</simplepaint>, and everything in between.
The elements that are nested inside
the root element are represented by their own nodes, which are said to
be children of the root node.
An object of type Element
contains several useful methods. If element is of type
Element, then we have:

element.getTagName() — returns a String
containing the name that is used in the element’s tag. For example, the name
of a <curve> element is the string “curve”.
element.getAttribute(attrName) — if attrName
is the name of an attribute in the element, then this method returns the value
of that attribute. For the element, <point x="83" y="42"/>,
element.getAttribute("x") would return the string "83". Note that the
return value is always a String, even if the attribute
is supposed to represent a numerical value. If the element has no attribute with
the specified name, then the return value is an empty string.
element.getTextContent() — returns a String
containing all of the textual content that is contained in the element. Note that this
includes text that is contained inside other elements that are nested inside the element.
element.getChildNodes() — returns a value of type
NodeList that contains all the Nodes that
are children of the element. The list includes nodes representing other elements and
textual content that are directly nested in the element (as well as some other
types of node that I don’t care about here). The getChildNodes()
method makes it possible to traverse the entire DOM data structure by starting
with the root element, looking at children of the root element, children of
the children, and so on. (There is a similar method that returns the
attributes of the element, but I won’t be using it here.)
element.getElementsByTagName(tagName) — returns
a NodeList that contains all the nodes representing
all elements that are nested inside element and which have the
given tag name. Note that this includes elements that are nested to any level,
not just elements that are directly contained inside element.
The getElementsByTagName() method allows you to reach into the
document and pull out specific data that you are interested in.
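
The methods just listed can be seen at work in a small experiment. In this sketch, the XML is parsed from a string, and the <group> element is a hypothetical addition, not part of the SimplePaint format, that is there only to show that nesting depth does not matter to getTextContent() and getElementsByTagName(). The class name and the helper rootOf are made up for the example:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ElementMethodsDemo {

    // Parses a string and returns the root element of the document.
    static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        }
        catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = rootOf(
                "<curve><symmetric>true</symmetric>"
              + "<group><point x='83' y='96'/></group></curve>");
        System.out.println(root.getTagName());       // curve
        System.out.println(root.getTextContent());   // true (text from the nested element)
        Element point = (Element) root.getElementsByTagName("point").item(0);
        System.out.println(point.getAttribute("x"));            // 83
        System.out.println(point.getAttribute("z").isEmpty());  // true (no such attribute)
        System.out.println(point.hasAttribute("z"));            // false
    }
}
```

Since getAttribute() returns an empty string both for a missing attribute and for an attribute whose value is empty, hasAttribute() is the method to use when that difference matters.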

An object of type NodeList represents a list of
Nodes. Unfortunately, it does not use the API defined for lists
in the Java Collection Framework. Instead, a value, nodeList,
of type NodeList has two methods:
nodeList.getLength() returns the number of nodes in the
list, and nodeList.item(i) returns the node at position
i, where the positions are numbered 0, 1, …,
nodeList.getLength() - 1. Note that the
return value of nodeList.item() is of type Node,
and it might have to be type-cast to a more specific node type before it is used.
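
As an illustration of getLength() and item(), here is a sketch of a recursive method that visits every element in a DOM tree and collects the tag names. It is only one possible way to organize a traversal, and the names TagNameLister, tagNames, collect, and rootOf are made up for the example:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TagNameLister {

    // Returns the tag names of element and of all elements nested inside it,
    // in the order in which their start tags appear in the document.
    static List<String> tagNames(Element element) {
        List<String> names = new ArrayList<>();
        collect(element, names);
        return names;
    }

    private static void collect(Element element, List<String> names) {
        names.add(element.getTagName());
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);        // item() returns a plain Node.
            if (child instanceof Element)
                collect((Element) child, names);  // Type-cast, then recurse.
        }
    }

    // Parses a string and returns the root element of the document.
    static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        }
        catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tagNames(rootOf("<a><b><c/></b><d/></a>")));  // [a, b, c, d]
    }
}
```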

Knowing just this much, you can do the most common types of processing of
DOM representations. Let’s look at a few code fragments. Suppose that
in the course of processing a document you come across an Element
node that represents the element

<background red='1' green='0.6' blue='0.2'/>

This element might be encountered either while traversing the
document with getChildNodes() or in the result of
a call to getElementsByTagName("background").
Our goal is to reconstruct the data structure represented by the document, and
this element represents part of that data. In this
case, the element represents a color, and the red, green, and blue components
are given by the attributes of the element. If element is a variable
that refers to the node, the color can be obtained by saying:

double r = Double.parseDouble( element.getAttribute("red") );
double g = Double.parseDouble( element.getAttribute("green") );
double b = Double.parseDouble( element.getAttribute("blue") );
Color bgColor = Color.color(r,g,b);
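
Because getAttribute() returns an empty string for a missing attribute, Double.parseDouble() will throw a NumberFormatException if, say, the blue attribute has been left out. If a default value is preferable to an error, a small helper can supply one. This is a sketch of one possible approach, not something taken from the sample program, and the names AttributeParsing, doubleAttribute, and rootOf are made up:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeParsing {

    // Returns the numeric value of the named attribute, or fallback if
    // the attribute is missing or its value is not a legal number.
    static double doubleAttribute(Element element, String name, double fallback) {
        String value = element.getAttribute(name);  // "" if there is no such attribute
        try {
            return Double.parseDouble(value);
        }
        catch (NumberFormatException e) {
            return fallback;
        }
    }

    // Parses a string and returns the root element of the document.
    static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        }
        catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element bg = rootOf("<background red='1' green='0.6'/>");
        System.out.println(doubleAttribute(bg, "green", 0.0));  // 0.6
        System.out.println(doubleAttribute(bg, "blue", 0.0));   // 0.0 (attribute is missing)
    }
}
```

Whether a silent default or an error is the better response to a damaged file depends on the application.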

Suppose now that element refers to the node that represents
the element

<symmetric>true</symmetric>

In this case, the element represents the value of a boolean
variable, and the value is encoded in the textual content of the element.
We can recover the value from the element with:

String bool = element.getTextContent();
boolean symmetric;
if (bool.equals("true"))
   symmetric = true;
else
   symmetric = false;

Next, consider an example that uses a NodeList.
Suppose we encounter an element that represents a list of Point2Ds:

<pointlist>
   <point x='17' y='42'/>   
   <point x='23' y='8'/>   
   <point x='109' y='342'/>   
   <point x='18' y='270'/>   
</pointlist>

Suppose that element refers to the node that represents
the <pointlist> element. Our goal is to build the list
of type ArrayList<Point2D> that is represented by the
element. We can do this by traversing the NodeList
that contains the child nodes of element:

ArrayList<Point2D> points = new ArrayList<>();
NodeList children = element.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
   Node child = children.item(i);   // One of the child nodes of element.
   if ( child instanceof Element ) {
      Element pointElement = (Element)child;  // One of the <point> elements.
      double x = Double.parseDouble( pointElement.getAttribute("x") );
      double y = Double.parseDouble( pointElement.getAttribute("y") );
      Point2D pt = new Point2D(x,y); // Create the Point represented by pointElement.
      points.add(pt);  // Add the point to the list of points.
   }
}

All the nested <point> elements are children of
the <pointlist> element. The if statement
in this code fragment is necessary because an element can have other
children in addition to its nested elements. In this example, we only
want to process the children that are elements.
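
The "other children" are easy to observe. In a document that is laid out on several lines, the spaces and ends-of-line between the nested tags are textual content, and the parser represents them as nodes of their own. The following sketch compares a spaced-out document with one that has no space between its tags. The names ChildCounter, countChildren, and rootOf are made up for this illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChildCounter {

    // Returns { number of child nodes, number of those that are elements }.
    static int[] countChildren(Element element) {
        NodeList children = element.getChildNodes();
        int elements = 0;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element)
                elements++;
        }
        return new int[] { children.getLength(), elements };
    }

    // Parses a string and returns the root element of the document.
    static Element rootOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        }
        catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element spaced = rootOf(
                "<pointlist>\n   <point x='17' y='42'/>\n   <point x='23' y='8'/>\n</pointlist>");
        Element packed = rootOf(
                "<pointlist><point x='17' y='42'/><point x='23' y='8'/></pointlist>");
        int[] s = countChildren(spaced), p = countChildren(packed);
        System.out.println("spaced: " + s[0] + " children, " + s[1] + " elements");
        System.out.println("packed: " + p[0] + " children, " + p[1] + " elements");
    }
}
```

Both documents contain two <point> elements, but the spaced-out version should report extra children for the whitespace around them.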

All these techniques can be employed to write the file input method for the
sample program SimplePaintWithXML.java. When building
the data structure represented by an XML file, my approach is to start
with a default data structure and then to modify and add to it as I
traverse the DOM representation of the file. It’s not a trivial process,
but I hope that you can follow it:

Color newBackground = Color.WHITE;
ArrayList<CurveData> newCurves = new ArrayList<>();
Element rootElement = xmldoc.getDocumentElement();
if ( ! rootElement.getNodeName().equals("simplepaint") )
    throw new Exception("File is not a SimplePaint file.");
String version = rootElement.getAttribute("version");
try {
    double versionNumber = Double.parseDouble(version);
    if (versionNumber > 1.0)
        throw new Exception("File requires a newer version of SimplePaint.");
}
catch (NumberFormatException e) {
    // A missing or non-numeric version attribute is ignored.
}
NodeList nodes = rootElement.getChildNodes();
for (int i = 0; i < nodes.getLength(); i++) {
   if (nodes.item(i) instanceof Element) {
      Element element = (Element)nodes.item(i);
      if (element.getTagName().equals("background")) {
         double r = Double.parseDouble(element.getAttribute("red"));
         double g = Double.parseDouble(element.getAttribute("green"));
         double b = Double.parseDouble(element.getAttribute("blue"));
         newBackground = Color.color(r,g,b);
      }
      else if (element.getTagName().equals("curve")) {
         CurveData curve = new CurveData();
         curve.color = Color.BLACK;
         curve.points = new ArrayList<>();
         newCurves.add(curve);
         NodeList curveNodes = element.getChildNodes();
         for (int j = 0; j < curveNodes.getLength(); j++) {
           if (curveNodes.item(j) instanceof Element) {
             Element curveElement = (Element)curveNodes.item(j);
             if (curveElement.getTagName().equals("color")) {
               double r = Double.parseDouble(curveElement.getAttribute("red"));
               double g = Double.parseDouble(curveElement.getAttribute("green"));
               double b = Double.parseDouble(curveElement.getAttribute("blue"));
               curve.color = Color.color(r,g,b);
             }
             else if (curveElement.getTagName().equals("point")) {
               double x = Double.parseDouble(curveElement.getAttribute("x"));
               double y = Double.parseDouble(curveElement.getAttribute("y"));
               curve.points.add(new Point2D(x,y));
             }
             else if (curveElement.getTagName().equals("symmetric")) {
               String content = curveElement.getTextContent();
               if (content.equals("true"))
                 curve.symmetric = true;
             }
           }
         }
      }
   }         
}
backgroundColor = newBackground;
curves = newCurves;

You can find the complete source code in SimplePaintWithXML.java.

XML has developed into an extremely important technology, and some applications
of it are very complex. But there is a core of simple ideas that can be easily
applied in Java. Knowing just the basics, you can make good use of XML in
your own Java programs.
