Windows Developers Journal
Thanks to Shaun Wilde for correcting a couple of mistakes in the code.
XML is a great way to package data. With the approval of the XML Schema standard by the World Wide Web Consortium, developers now have a good tool for validating the form and content of their XML data. This article gives an example of how to perform this validation by using version 4 of Microsoft’s MSXML parser.
Real World XML
Over the past few months I’ve been laboring away at a new book for Cisco Press which has the snappy title Developing Cisco IP Phone Productivity Services. In a nutshell, Cisco IP Phone Services are simply XML files served up to the phone over a standard HTTP link. XML objects can be used to display a wide variety of data on the phone, enabling programs running on a web server to use the phone as a flexible I/O device.
This means that anybody who can put together a web page can create nifty programs that interact with Cisco’s IP phones. Examples that have been created so far include flight tracking apps, appointment books, and joke-a-day services. A well-loved productivity application is shown in Figure 1.
Figure 1 - An XML Object Rendered on the Phone
When the programmers at Cisco’s Dallas engineering center were designing these services, they naturally chose to use XML to wrap up the various objects that were being sent back and forth. Yes, XML was in the air at the time, but there were also practical reasons. First among these was the availability of a compact XML parser that fit in the phone’s firmware footprint. Second was the general feel-good vibe that comes with using an accepted standard as an integral part of our product. An additional boon was the support for XML in off-the-shelf tools, such as Microsoft’s Internet Explorer or Altova’s XML Spy.
The menagerie of XML objects that the phone knows how to render and/or execute is fairly small. There are XML definitions for menus, graphics, directory listings, and so on. As an example, the simplest XML object supported by the phone is the CiscoIPPhoneText element. A typical example of this element might look like this:
Leading to a display on the phone like that shown in Figure 2.
Figure 2 - The CiscoIPPhoneText Object
Unfortunately, there isn’t much to do when the phone gets badly-formed XML. In a typical example, a web designer could inadvertently introduce a typo in the text used in the above example, leading to this XML file on the server:
We’ve all had times when a fat finger on the keyboard accidentally introduced an extra character, as shown above. The extra period in the closing tag for the Text element is enough to turn this into a badly-formed XML file.
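A mismatch like this is easy to catch mechanically. The following is a minimal sketch, not the phone's parser and not MSXML: a hypothetical tagsBalanced function that only checks that start and end tag names pair up exactly. It assumes there are no comments or CDATA sections and no '>' characters inside attribute values, and the sample strings in the note after it are made up for illustration.

```cpp
#include <cstddef>
#include <stack>
#include <string>

// Returns true when every start tag is closed, in order, by an end tag
// with exactly the same name. Simplified on purpose: declarations and
// processing instructions are skipped, and attribute values are assumed
// not to contain '>'.
bool tagsBalanced(const std::string& xml) {
    std::stack<std::string> open;
    std::size_t pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        const std::size_t end = xml.find('>', pos);
        if (end == std::string::npos) return false;   // unterminated tag
        const std::string tag = xml.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty()) return false;                // "<>" is never legal
        if (tag[0] == '?' || tag[0] == '!') continue; // <?xml ...?>, <!DOCTYPE ...>
        if (tag[0] == '/') {
            // End tag: its name must equal the most recently opened element.
            if (open.empty() || open.top() != tag.substr(1)) return false;
            open.pop();
        } else if (tag.back() == '/') {
            // Empty-element tag such as <br/>: nothing to remember.
        } else {
            // Start tag: the element name ends at the first whitespace.
            open.push(tag.substr(0, tag.find_first_of(" \t\r\n")));
        }
    }
    return open.empty();
}
```

Given `<CiscoIPPhoneText><Text>Hello</Text></CiscoIPPhoneText>` the function returns true. Change the closing tag to `</Text.>` and it returns false, because the name `Text.` no longer matches the open `Text` element.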
When the Cisco IP Phone attempts to render this XML object, its parser naturally rejects the input. With not many good options at this point, the phone simply displays a blank screen, as shown in Figure 3.
Figure 3 - The Cisco IP Phone Responds to Bad XML Input
This is a bad situation for the user of the IP Phone, but it’s even worse for the designer of the web page. There are virtually no clues here to identify the problem. In theory we could load up the phone firmware with additional debugging code to identify problems such as this one, but we are up against constraints in memory space, developer time, and parser intelligence.
The good news is that existing tools such as Internet Explorer can help a lot with problems such as this. Looking at the same XML served up by the web server, Internet Explorer produces the display shown in Figure 4.
Figure 4 - Internet Explorer Finds the Problem
This kind of display is great - it takes you directly to the problem and identifies it clearly. Unfortunately, there is an entire class of XML problems that are not detected by Internet Explorer. Specifically, Internet Explorer (as of today) can’t help diagnose an XML stream that is well-formed but invalid.
Valid vs. Well-Formed
The XML standard defines two levels of conformance for XML documents: well-formed and valid. A well-formed document is one that follows the basic syntax rules of XML, such as nesting of tags, placement of attributes, and so on. Creating well-formed documents is fairly easy for developers, because you can simply add new elements and attributes as the need arises, confident that the parser reading the document will know what to do.
On the other hand, a valid document is held to a somewhat higher standard. A valid document has to adhere to a well-defined set of rules regarding the type, order, and placement of tags that appear in it. In the early days of XML, these rules were typically defined in what was called a Document Type Definition, or DTD.
For various reasons, the DTD format first used with XML proved to be unpopular, and a revised schema definition system called XML Schema was born. XML Schema does all that the original DTDs did, with some enhancements. In addition to knowing about the placement and types of tags, XML Schema allows you to specify information about the text that is embedded inside the markup. So, for example, XML Schema allows me to specify that a value between the <Price> and </Price> tags should be a floating-point value between 1.0 and 9.99.
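To show what a content rule of this kind amounts to, here is a rough hand-written equivalent of the Price check. It is only a sketch: in a real schema the rule would be declared (XML Schema offers the minInclusive and maxInclusive facets for this) and enforced by the validating parser, not coded by hand. The function name isValidPrice is hypothetical.

```cpp
#include <cstdlib>
#include <string>

// Returns true when the text is a complete floating-point literal whose
// value lies in the inclusive range 1.0 .. 9.99. Note that strtod also
// accepts leading whitespace; trailing characters are rejected here.
bool isValidPrice(const std::string& text) {
    if (text.empty()) return false;
    char* end = nullptr;
    const double value = std::strtod(text.c_str(), &end);
    if (end == text.c_str()) return false;   // nothing was parsed
    if (*end != '\0') return false;          // junk after the number
    return value >= 1.0 && value <= 9.99;    // NaN fails both comparisons
}
```

Under these assumptions "4.50" and "9.99" pass, while "0.99", "10", and "cheap" are rejected. A parser that only checks for well-formed input would accept all five.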
The World Wide Web Consortium ratified XML Schema as an official recommendation in 2001, and we’re now seeing the first set of tools that support it. I’ve been using XML Spy 4.0, both to create schema files and to test them against XML data. Writing a parser that validates against XML Schema is considerably more difficult than writing a plain-vanilla parser, so it may be some time before a wide selection of tools is available.
The Invalid XML Problem
The difference between a well-formed and a valid file is easy to demonstrate with the XML objects used for the Cisco IP Phone. Figure 5 shows Internet Explorer happily parsing a well-formed XML document and displaying it on a PC. A quick look at this document makes it look as if all is well, but oddly enough, the IP Phone fails to parse and display the data.
Figure 5 - Well-formed, but not valid
This is the kind of problem that can be pretty tough to spot. In this particular case, the programmer of the web page used a lower case ‘p’ in the opening and closing tags for element CiscoIPPhoneText. Since XML is case-sensitive, this error tripped up the phone, which determined that this is not a recognized document. The end user was faced with the unfriendly blank screen.
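The sketch below shows why this slip is so hard to see and so easy for software to reject. It compares an element name against the expected one twice, once exactly (as XML requires) and once ignoring case, so that a tool can say "right name, wrong capitalization" instead of just failing. The NameMatch enum and compareElementName function are hypothetical helpers, not part of MSXML or the phone firmware.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

enum class NameMatch { Exact, CaseOnly, None };

// Classifies how 'actual' relates to 'expected'. XML names must match
// exactly; CaseOnly flags names that differ only in letter case.
NameMatch compareElementName(const std::string& expected,
                             const std::string& actual) {
    if (expected == actual) return NameMatch::Exact;
    const bool sameIgnoringCase =
        expected.size() == actual.size() &&
        std::equal(expected.begin(), expected.end(), actual.begin(),
                   [](unsigned char a, unsigned char b) {
                       return std::tolower(a) == std::tolower(b);
                   });
    return sameIgnoringCase ? NameMatch::CaseOnly : NameMatch::None;
}
```

Comparing CiscoIPPhoneText with CiscoIPphoneText yields NameMatch::CaseOnly: the names differ only in the case of one letter, yet to an XML parser they are two unrelated element names.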
The solution was to write a validation tool that compared the XML document to the schema defined for the Cisco IP Phones. For this particular project, I chose to use the MSXML parser from Microsoft. Version 4 is currently in an open beta, and will validate XML documents against DTDs and XML Schema. (A popular alternative is the Xerces parser, available for both C++ and Java.)
My simple version of this tool allows the user to input either a file name or a Uniform Resource Locator (URL), then press the Validate button. The data is read from the source, parsed, and the result shown in a dialog box. The XML document from Figure 5 results in the error display in Figure 6.
Figure 6 - A Validating Parser Finds the Error
As you can see, MSXML does a pretty good job of identifying the offending location in the XML, saving the programmer from quite a bit of head scratching. The next section of this article will show you how to put together a program that accomplishes the same thing as my Validator.
Visual C++ 6.0 has some language extensions that make working with COM objects a bit easier. (For example, using the #import directive to pull in a DLL directly, generating wrapper classes from its type library on the fly.) I didn’t use these in this application, which means the code you see here should be easily portable to other C++ compilers. Perhaps this philosophy will help when porting the code used here to other languages as well.
Installing MSXML 4.0
In October of 2001 Microsoft released MSXML 4.0 as a free download. All versions of the XML parser are found at Microsoft’s XML page, http://www.microsoft.com/XML . You will need to download and install the MSXML 4.0 SDK to build programs that use the new parser.
The install procedure for the parser by default copies header and library files to "C:\Program Files\MSXML 4.0", with the header files in the inc subdirectory and the library files in lib. The include file msxml2.h contains all the definitions you will need in your C++ code, and msxml2.lib contains links to the appropriate parser routines.
The Project Skeleton
My Validator project is a simple MFC dialog-based application. A complete listing is available by following the links below, so I won’t give you every detail. After creating the project using Microsoft’s Wizard, I made two minor modifications to accommodate the use of MSXML.
- First, I added "C:\Program Files\MSXML 4.0\inc" to my include path for both the debug and release versions of the program. This is where the header file msxml2.h is located, and it will be included in the main project.
- Second, I added the file C:\Program Files\MSXML 4.0\lib\msxml2.lib to the project. This library contains the CLSID and other definitions needed to access the functions in MSXML 4.0.
To this skeleton, I added some code to let the user either browse to select an XML file for input, or simply type in a URL. These pieces of infrastructure are shown in action in Figure 7.
Figure 7 - The I/O portions of Validator
The Real Work - Validation
There are two steps to perform in order to actually validate an XML document using MSXML 4.0.
After the necessary plumbing needed to instantiate a copy of the MSXML COM Object of type XMLDOMDocument2, I have to create an XMLSchemaCache object, then add a copy of the Cisco IP Phone schema to it. A reference to that schema container is then passed to the DOM Document. The fragment of code that accomplishes this is shown in Listing 1.
As is usually the case, C++ code that uses COM objects is unattractive, and a bit opaque. But in this simple example you should be able to look past that and see what is actually happening. Most importantly, you can see that this program expects to find a copy of the schema file in the current directory. This is the file that contains the schema for all the Cisco IP Phone XML objects.
Once the DOM Document has a reference to the schema, it is ready to do the actual parsing. Microsoft’s DOM Document object is flexible enough to load either a reference to a local file, or to a URL which will supply an HTTP stream. The fragment in Listing 2 does this by simply calling the load method of the object. It then pops up a dialog with one of two possible outcomes: the successfully parsed XML content, or a formatted error message.
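As an illustration of the "formatted error message" half, the sketch below builds a short report from the kind of information MSXML's parseError object exposes (its reason, line, linepos, and srcText properties). To stay compiler-neutral it works on a plain struct instead of the COM interface. ParseErrorInfo and formatParseError are hypothetical names, and this is not the formatting code used in Validator.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Plain-data stand-in for the values read from a parser's error object.
struct ParseErrorInfo {
    std::string reason;   // human-readable description of the failure
    long line;            // 1-based line number of the failure
    long linePos;         // 1-based column within that line
    std::string srcText;  // text of the offending line, if available
};

// Produces a location line, then the source line with a caret under the
// reported column when source text is available.
std::string formatParseError(const ParseErrorInfo& e) {
    std::ostringstream out;
    out << "Line " << e.line << ", column " << e.linePos << ": "
        << e.reason << "\n";
    if (!e.srcText.empty()) {
        out << e.srcText << "\n";
        if (e.linePos > 0) {
            // Indent the caret so it sits under column linePos.
            out << std::string(static_cast<std::size_t>(e.linePos - 1), ' ')
                << "^\n";
        }
    }
    return out.str();
}
```

In a program like Validator the fields would be copied out of the parser's error object after a failed load, and the resulting string handed to the dialog box.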
Figure 6 showed an example of what you see when the parser reports a failure. Figure 8 shows what it does upon success, which is to just present a copy of the parsed data found in the xml property of the DOM object.
Figure 8 - Successfully parsed XML Data
MSXML 4.0 is a great step forward for users of Microsoft’s parser. Adding validation to the tool helps both developers and the end users who benefit from better checking of the data. There is one feature I wish Microsoft had implemented differently. As the control works now, when parsing fails, the document only keeps a copy of one line from the bad input. It would be nice to keep the entire document, or at least a little more context, which would make interpretation of the error a little bit easier.
But given the price tag and the capabilities the control brings to my programs, I have to give it a better than passing grade.
Validator source files are here in Source.zip.
You will need to install version 4 of Microsoft’s XML parser in order to build and execute the program.