Windows Developers Journal
January, 2002

Thanks to Shaun Wilde for correcting a couple of mistakes in the code.


XML is a great way to package data. With the approval of the XML Schema standard by the World Wide Web Consortium, developers now have a good tool for validating the form and content of their XML data. This article gives an example of how to perform this validation by using version 4 of Microsoft’s MSXML parser.

Real World XML

Over the past few months I’ve been laboring away at a new book for Cisco Press which has the snappy title Developing Cisco IP Phone Productivity Services. In a nutshell, Cisco IP Phone Services are simply XML files served up to the phone over a standard HTTP link. XML objects can be used to display a wide variety of data on the phone, enabling programs running on a web server to use the phone as a flexible I/O device.

This means that anybody who can put together a web page can create nifty programs that interact with Cisco’s IP phones. Examples that have been created so far include flight tracking apps, appointment books, and joke-a-day services. A well-loved productivity application is shown in Figure 1.

A photo of a Cisco phone with a graphical display showing a blackjack game in progress. This is just a demonstration of the phone's ability to render its device-specific XML markup; the actual content seen in the image is only representative.
Figure 1 - An XML Object Rendered on the Phone

When the programmers at Cisco’s Dallas engineering center were designing these services, they naturally chose to use XML to wrap up the various objects that were being sent back and forth. Yes, XML was in the air at the time, but there were also practical reasons. First among these was the availability of a compact XML parser that fit in the phone’s firmware footprint. Second was the general feel-good vibe that comes with using an accepted standard as an integral part of our product. An additional boon was the support for XML in off-the-shelf tools, such as Microsoft’s Internet Explorer or Altova’s XML Spy.

Details

The menagerie of XML objects that the phone knows how to render and/or execute is fairly small. There are XML definitions for menus, graphics, directory listings, and so on. As an example, the simplest XML object supported by the phone is the CiscoIPPhoneText element. A typical example of this element might look like this:

<CiscoIPPhoneText>
  <Title>Magic Cube</Title>
  <Text>When you take the pebble from my hand</Text>
</CiscoIPPhoneText>

Leading to a display on the phone like that shown in Figure 2.

The rendering on the phone of the XML object shown above. The display has the text 'When you take the pebble from my hand', and a title bar that says 'Magic Cube'.
Figure 2 - The CiscoIPPhoneText Object

Unfortunately, there isn’t much to do when the phone gets badly-formed XML. In a typical example, a web designer could inadvertently introduce a typo in the text used in the above example, leading to this XML file on the server:

<CiscoIPPhoneText>
  <Title>Magic Cube</Title>
  <Text>When you take the pebble from my hand</Text.>
</CiscoIPPhoneText>

We’ve all had times when a fat finger on the keyboard accidentally introduced an extra character, as shown above. The extra period in the closing tag for the Text element is enough to turn this into a badly-formed XML file.

When the Cisco IP Phone attempts to render this XML object, its parser naturally rejects the input. With not many good options at this point, the phone simply displays a blank screen, as shown in Figure 3.

This shows the phone display responding to bad XML input - the rendering area is completely blank.
Figure 3 - The Cisco IP Phone Responds to Bad XML Input

This is a bad situation for the user of the IP Phone, but it’s even worse for the designer of the web page. There are virtually no clues here to identify the problem. In theory we could load up the phone firmware with additional debugging code to identify problems such as this one, but we are up against constraints in memory space, developer time, and parser intelligence.

The good news is that existing tools such as Internet Explorer can help a lot with problems such as this. Looking at the same XML served up by the web server, Internet Explorer produces the display shown in Figure 4.

A screenshot of Internet Explorer responding to the same bad XML that resulted in a blank screen on the phone. IE is more helpful, providing a display that identifies the error as an unmatched 'Text' tag on line 5, position 3.
Figure 4 - Internet Explorer Finds the Problem

This kind of display is great - it takes you directly to the problem and identifies it clearly. Unfortunately, there is an entire class of XML problems that are not detected by Internet Explorer. Specifically, Internet Explorer (as of today) can’t help diagnose an XML stream that is well-formed but invalid.

Valid vs. Well-Formed

The XML standard defines two levels of correctness for XML documents: well-formed and valid. A well-formed document is one that follows the basic syntax rules of XML, such as nesting of tags, placement of attributes, and so on. Creating well-formed documents is fairly easy for developers, because you can simply add new elements and attributes as the need arises, confident that the parser reading the document will know what to do.
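
As a rough illustration of the nesting rule, here is a minimal sketch of a tag-balance check in portable C++. The function name tagsAreBalanced is made up for this example, and the code is nowhere near a real XML parser: it assumes tags carry no attributes and it skips declarations and comments instead of parsing them.

```cpp
#include <cstddef>
#include <stack>
#include <string>

// Returns true when every opening tag is closed by a tag with exactly the
// same name, in properly nested order. This is only a sketch: it assumes
// tags carry no attributes and it skips declarations and comments rather
// than parsing them.
bool tagsAreBalanced(const std::string& xml)
{
    std::stack<std::string> open;
    std::size_t pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        const std::size_t end = xml.find('>', pos);
        if (end == std::string::npos)
            return false;                        // tag never terminated
        const std::string name = xml.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (name.empty())
            return false;                        // "<>" is not a tag
        if (name[0] == '?' || name[0] == '!')
            continue;                            // declaration or comment
        if (name[name.size() - 1] == '/')
            continue;                            // empty element, nothing to match
        if (name[0] == '/') {
            // A closing tag must match the most recent unclosed opening tag.
            if (open.empty() || open.top() != name.substr(1))
                return false;
            open.pop();
        } else {
            open.push(name);
        }
    }
    return open.empty();                         // nothing may be left open
}
```

Fed the CiscoIPPhoneText example from earlier, this check passes; with the closing tag mistyped as </Text.> it fails, because "Text." is not the same name as "Text".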

On the other hand, a valid document is held to a somewhat higher standard. A valid document has to adhere to a well-defined set of rules regarding the type, order, and placement of tags that appear in it. In the early days of XML, these rules were typically defined in what was called a Document Type Definition, or DTD.

For various reasons, the DTD format first used with XML proved to be unpopular, and a revised schema definition system called XML Schema was born. XML Schema does all that the original DTDs did, with some enhancements. In addition to knowing about the placement and types of tags, XML Schema allows you to specify information about the text that is embedded inside the markup. So, for example, XML Schema allows me to specify that a value between the <Price> and </Price> tags should be a floating-point value between 1.0 and 9.99.
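
To make the Price example concrete, the sketch below shows the kind of content check that a schema-validating parser performs on your behalf. The helper name isValidPrice is hypothetical, and the schema fragment in the comment is only one plausible way to express the constraint, not an excerpt from any real schema.

```cpp
#include <cstdlib>
#include <string>

// One way the constraint might be written in XML Schema (illustrative only):
//
//   <xs:element name="Price">
//     <xs:simpleType>
//       <xs:restriction base="xs:float">
//         <xs:minInclusive value="1.0"/>
//         <xs:maxInclusive value="9.99"/>
//       </xs:restriction>
//     </xs:simpleType>
//   </xs:element>

// Returns true when the text between <Price> and </Price> is a number in
// the inclusive range 1.0 to 9.99.
bool isValidPrice(const std::string& text)
{
    if (text.empty())
        return false;
    char* end = 0;
    const double value = std::strtod(text.c_str(), &end);
    if (end == text.c_str() || *end != '\0')
        return false;                 // not a number, or trailing junk
    return value >= 1.0 && value <= 9.99;
}
```

With a DTD, a check like this would have to live in application code. With XML Schema, the rule sits in the schema file and the parser enforces it.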

The World Wide Web Consortium ratified XML Schema as an official recommendation in 2001, and we’re now seeing the first set of tools that support it. I’ve been using XML Spy 4.0, both to create schema files and to test them against XML data. Writing a parser that validates against XML Schema is considerably more difficult than writing a plain-vanilla parser, so it may be some time before a wide selection of tools is available.

The Invalid XML Problem

The difference between a well-formed and a valid file is easy to demonstrate with the XML objects used for the Cisco IP Phone. Figure 5 shows Internet Explorer happily parsing a well-formed XML document and displaying it on a PC. A quick look at this document makes it look as if all is well, but oddly enough, the IP Phone fails to parse and display the data.

This image shows IE rendering XML data that is well formed, but not valid. It doesn't have any way of knowing that the tags in use are not valid for the Cisco phone, so it happily digests and displays the XML.
Figure 5 - Well-formed, but not valid

This is the kind of problem that can be pretty tough to spot. In this particular case, the programmer of the web page used a lowercase ‘p’ in the opening and closing tags of the CiscoIPPhoneText element. Since XML is case-sensitive, this error tripped up the phone, which determined that this is not a recognized document. The end user was faced with the unfriendly blank screen.
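
The effect of case sensitivity is easy to reproduce. The following sketch compares a root element name against a short list of object names using exact string comparison. The function isKnownPhoneObject and the three-entry list are assumptions made for illustration; the real schema defines more objects than this.

```cpp
#include <set>
#include <string>

// Returns true when rootElement exactly matches one of the listed names.
// The list is illustrative only and is not the complete set of objects
// defined by the phone's schema.
bool isKnownPhoneObject(const std::string& rootElement)
{
    static const char* const names[] = {
        "CiscoIPPhoneText",
        "CiscoIPPhoneMenu",
        "CiscoIPPhoneDirectory"
    };
    static const std::set<std::string> known(names, names + 3);
    // std::string comparison is case-sensitive, just like XML names.
    return known.count(rootElement) != 0;
}
```

Here isKnownPhoneObject("CiscoIPPhoneText") succeeds while isKnownPhoneObject("CiscoIpPhoneText") does not, even though the two names are hard to tell apart at a glance.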

The Solution

The solution was to write a validation tool that compared the XML document to the schema defined for the Cisco IP Phones. For this particular project, I chose to use the MSXML parser from Microsoft. Version 4 is currently in an open beta, and will validate XML documents against DTDs and XML Schema. (A popular alternative is the Xerces parser, available for both C++ and Java.)

My simple version of this tool allows the user to input either a file name or a Uniform Resource Locator (URL), then press the Validate button. The data is read from the source, parsed, and the result shown in a dialog box. The XML document from Figure 5 results in the error display in Figure 6.

This screen shot shows the validator throwing an error. The XML is well-formed, but the tag 'CiscoIpPhoneText' is not in the schema, so an error is produced when parsing.
Figure 6 - A Validating Parser Finds the Error

As you can see, MSXML does a pretty good job of identifying the offending location in the XML, saving the programmer from quite a bit of head scratching. The next section of this article will show you how to put together a program that accomplishes the same thing as my Validator.

Using MSXML

MSXML 4.0 is an ActiveX control, which means it can be used by quite a few different programming languages, ranging from JavaScript to Visual Basic to C++. I chose to write my validation program as a simple MFC dialog application using Visual C++ 6.0.

Visual C++ 6.0 has some language extensions that make working with COM objects a bit easier. (For example, the #import directive can pull in a DLL directly, generating wrapper classes from its type library on the fly.) I didn’t use these in this application, which means the code you see here should be easily portable to other C++ compilers. Perhaps this philosophy will help when porting the code used here to other languages as well.

Installing MSXML 4.0

In October of 2001 Microsoft released MSXML 4.0 as a free download. All versions of the XML parser are found at Microsoft’s XML page, http://www.microsoft.com/XML . You will need to download and install the MSXML 4.0 SDK to build programs that use the new parser.

The install procedure for the parser by default copies header and library files to "C:\Program Files\MSXML 4.0", with the header files in the inc subdirectory and the library files in lib. The include file msxml2.h contains all the definitions you will need in your C++ code, and msxml2.lib contains links to the appropriate parser routines.

The Project Skeleton

My Validator project is a simple MFC dialog-based application. A complete listing is available by following the links below, so I won’t give you every detail. After creating the project using Microsoft’s Wizard, I made two minor modifications to accommodate the use of MSXML.

  • First, I added "C:\Program Files\MSXML 4.0\inc" to my include path for both the debug and release versions of the program. This is where the header file msxml2.h is located, and it will be included in the main project.
  • Second, I added the file C:\Program Files\MSXML 4.0\lib\msxml2.lib to the project. This library contains the CLSID and other definitions needed to access the functions in MSXML 4.0.

To this skeleton, I added some code to let the user either browse to select an XML file for input, or simply type in a URL. These pieces of infrastructure are shown in action in Figure 7.

A screenshot showing the Validator dialogs for opening files.
Figure 7 - The I/O portions of Validator

The Real Work - Validation

There are two steps to perform in order to validate an XML document using MSXML 4.0: give the parser a schema, then load the document. After the plumbing needed to instantiate an MSXML COM object of type XMLDOMDocument2, I create an XMLSchemaCache object and add a copy of the Cisco IP Phone schema to it. A reference to that schema cache is then passed to the DOM Document. The fragment of code that accomplishes this is shown in Listing 1.

// Create the schema cache that will hold the Cisco IP Phone schema.
hr = CoCreateInstance( CLSID_XMLSchemaCache40,
                       NULL,
                       CLSCTX_SERVER,
                       IID_IXMLDOMSchemaCollection,
                       (LPVOID*)( &pIXMLDOMSchemaCollection ) );
if ( FAILED( hr ) )
    throw hr;
if ( pIXMLDOMSchemaCollection )
{
    // Load the schema from the current directory. The empty string
    // associates it with the empty namespace.
    hr = pIXMLDOMSchemaCollection->add( _bstr_t( _T("") ),
                                        _variant_t( _T("CiscoIPPhone.xsd") ) );
    if ( hr == INET_E_OBJECT_NOT_FOUND )
        throw "You need to have a copy of CiscoIPPhone.xsd in the \r\n"
              "same directory that you are executing Validator in. The \r\n"
              "program will not work properly until you rectify this. This\r\n"
              "schema file ships on the CD in the same directory\r\n"
              "as Validator.exe";
    if ( FAILED( hr ) )
        throw hr;
    // Hand the schema cache to the DOM document so that load() validates
    // against it.
    varValue.vt = VT_DISPATCH;
    varValue.pdispVal = pIXMLDOMSchemaCollection;
    hr = pIXMLDOMDocument2->putref_schemas( varValue );
}
Listing 1 - Loading the schema object and passing it to the DOM object

As is usually the case, C++ code that uses COM objects is unattractive, and a bit opaque. But in this simple example you should be able to look past that and see what is actually happening. Most importantly, you can see that this program expects to find a copy of CiscoIPPhone.xsd in the current directory. This file contains the schema for all the Cisco IP Phone XML objects.

Once the DOM Document has a reference to the schema, it is ready to do the actual parsing. Microsoft’s DOM Document object is flexible enough to load either a reference to a local file, or to a URL which will supply an HTTP stream. The fragment in Listing 2 does this by simply calling the load method of the object. It then pops up a dialog with one of two possible outcomes: the successfully parsed XML content, or a formatted error message.

CString file;
GetDlgItemText( IDC_EDIT1, file );
// load() accepts either a local file name or a URL.
hr = pIXMLDOMDocument2->load( _variant_t( (LPCTSTR) file ),
                              &sResult );
if ( SUCCEEDED( hr ) && ( sResult == VARIANT_TRUE ) )
{
    // Success: show the parsed document.
    pIXMLDOMDocument2->get_xml( &bstrValue );
    CViewer dlg( "Valid XML Input" );
    dlg.m_Label = "Parsed XML:";
    dlg.m_Text = bstrValue;
    dlg.DoModal();
} else {
    // Failure: ask the document for the details of the parse error.
    IXMLDOMParseError *pIParseError = NULL;
    hr = pIXMLDOMDocument2->get_parseError( &pIParseError );
    if ( FAILED( hr ) )
        throw hr;
    ostringstream out;
    out << "MS Parser reported this error:\r\n";
    long code = 0;
    pIParseError->get_errorCode( &code );
    // hex is sticky, so switch back to dec before printing line numbers.
    out << "Code = 0x" << hex << code << dec << "\r\n";
    long line_number = 0;
    long line_pos = 0;
    pIParseError->get_line( &line_number );
    pIParseError->get_linepos( &line_pos );
    out << "Source = Line : " << line_number
        << "  Char : " << line_pos << "\r\n";
    BSTR b = NULL;
    hr = pIParseError->get_reason( &b );
    if ( hr == S_OK )
        out << "Error Description = " << (const char *) _bstr_t( b ) << "\r\n";
    SysFreeString( b );    // free the reason before reusing b
    b = NULL;
    hr = pIParseError->get_srcText( &b );
    if ( hr == S_OK )
        out << "\r\nSource text where the error occurred:\r\n\r\n"
            << (const char *) _bstr_t( b )
            << "\r\n";
    SysFreeString( b );
    // Draw a marker under the offending column: dashes, then a caret.
    for ( int i = 1 ; i <= line_pos ; i++ )
    {
        if ( i == line_pos )
            out << '^';
        else
            out << "-";
    }
    out << "\r\n";
    pIParseError->Release();
    CViewer dlg( "Parser Error" );
    dlg.m_Text = out.str().c_str();
    dlg.m_Label = "Parser errors:";
    dlg.DoModal();
}
Listing 2 - Parsing the XML and displaying the results.
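
The message formatting in Listing 2 does not depend on COM at all, so it can be pulled out and tried on any compiler. The sketch below is one way to do that. The names caretLine and formatParseError are invented for this example, and the values passed in stand for whatever the parse error object reports. Note that the stream's hex setting is sticky, so the sketch switches back to decimal before printing the line and column.

```cpp
#include <sstream>
#include <string>

// Builds the marker drawn under the source text: dashes up to the column
// before the error, then a caret at the (1-based) error column.
std::string caretLine(long linePos)
{
    std::string marker;
    for (long i = 1; i < linePos; ++i)
        marker += '-';
    if (linePos >= 1)
        marker += '^';
    return marker;
}

// Formats a parse error report from values that would normally come from
// the parser's error object. All parameters are supplied by the caller.
std::string formatParseError(unsigned long code, long line, long linePos,
                             const std::string& reason,
                             const std::string& srcText)
{
    std::ostringstream out;
    out << "Code = 0x" << std::hex << code << std::dec << "\r\n";
    out << "Source = Line : " << line
        << "  Char : " << linePos << "\r\n";
    out << "Error Description = " << reason << "\r\n";
    out << "\r\n" << srcText << "\r\n";
    out << caretLine(linePos) << "\r\n";
    return out.str();
}
```

For a column of 3, caretLine returns two dashes followed by a caret, which lines up under the third character of the source text when both are shown in a fixed-width font.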

Figure 6 showed an example of what you see when the parser reports a failure. Figure 8 shows what it does upon success, which is to just present a copy of the parsed data found in the xml property of the DOM object.

A screen showing the Validator output when an input file is both well-formed and validated.
Figure 8 - Successfully parsed XML Data

Wrap-up

MSXML 4.0 is a great step forward for users of Microsoft’s parser. Adding validation helps developers directly, and helps end users through better checking of the data. There is one feature I wish Microsoft had implemented differently. As the control works now, when parsing fails, the document only keeps a copy of one line from the bad input. It would be nice to keep the entire document, or at least a little more context, which would make interpretation of the error a little bit easier.
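
If a tool keeps its own copy of the raw input text, recovering that extra context is not much work. The following is only a sketch of the idea in portable C++, not something MSXML provides: the function contextAround and its parameters are hypothetical, and it assumes the caller already has the whole document in a string along with the 1-based line number reported by the parser.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Returns the lines from errorLine - radius to errorLine + radius, each
// prefixed with its 1-based line number. The error line is flagged with
// "> " and the others with two spaces.
std::string contextAround(const std::string& document,
                          long errorLine, long radius)
{
    // Split the document into lines.
    std::vector<std::string> lines;
    std::istringstream in(document);
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);

    // Clamp the window to the lines that actually exist.
    long first = errorLine - radius;
    if (first < 1)
        first = 1;
    long last = errorLine + radius;
    if (last > static_cast<long>(lines.size()))
        last = static_cast<long>(lines.size());

    std::ostringstream out;
    for (long n = first; n <= last; ++n)
        out << (n == errorLine ? "> " : "  ")
            << n << ": " << lines[n - 1] << "\n";
    return out.str();
}
```

A radius of one or two lines is usually enough to show the enclosing element, which is often what you need to understand why a tag was rejected.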

But given the price tag and the capabilities the control brings to my programs, I have to give it a better than passing grade.

Source

Validator source files are here in Source.zip.

You will need to install version 4 of Microsoft’s XML parser in order to build and execute the program.