Windows:: Developer, January, 2002

This article describes a URL validation program I wrote in Java, using standard Internet and database language features. I'll show you why Java is the clear choice of languages for this job. Finally, I'll show you how to work around one particularly egregious shortcoming in Sun's JDK 1.3.

Introduction

For several years now I've maintained a web site called DataCompression.info. (Note: this site passed out of my hands in 2003.) I have one steady advertiser who helps me cover my hosting costs, but when you consider the time and energy required to keep the site up, you'll understand that it is a labor of love.

The Data Compression Library is simply a collection of links to sites related to data compression. I have a database of over 2000 URLs, categorized with one or more of thirty-some topics. On a periodic basis I run a program that passes over the database and generates a batch of pages I upload to a server. Figure 1 shows the main page for the site, and Figure 2 shows a typical topic page.


Figure 1
The Front Page of the Data Compression Library


Figure 2
A Typical Topic Page

Wish You Were Here

One of the biggest ongoing problems maintaining this site is dealing with bad links. Over the course of a year, I have to contend with literally hundreds of sites that change URLs, move within an existing domain, or just disappear from the face of the earth. My users tend to become annoyed when every other link ends up with an HTTP 404 error.

When I first started working on the site I was operating under the illusion that dead links could be dealt with through manual labor on my part and faithful error reporting from my users. It turns out that both of these assets are in short supply. After a couple of years I had more bad URLs than Microsoft has lawsuits. It was time for a change.

An Automated Solution

I knew exactly what I needed: an automated scanning program that I could run periodically. The scanner would update my database with its results and generate a report showing me which sites were repeatedly AWOL.

Despite the clear need for this program, I had been avoiding getting down to business for some time. My tool of choice for utility programs has always been Visual C++, and the thought of writing more database code in Microsoft's universe was far from appealing.

The problem in this case isn't C++, Visual Studio, or even Microsoft. I get along with all three on a daily basis. The problem is the unfriendly nature of Microsoft's C++ interface to ODBC, their database API of choice. At least from my perspective, it seems that every piece of data that is passed back and forth between my C++ code and the database has to go through at least one or two type transformations. The database code is never content to take a C++ string type, or even a simple character array. Instead, every string has to be converted to type _variant_t or _bstr_t, two monstrosities that Microsoft apparently cooked up for Visual Basic programmers. Visual Basic is weakly typed; C++ is strongly typed. It's not fair to make either of them use the other's type system, but the API developers in Redmond decided a Procrustean fit was just fine.

Data read in from the database typically comes in as a variant type, which means that before I can even use it, I first have to check the type, then examine the value, leading to lots of code that looks something like this:

C++:
  if ( pal.vt == VT_BOOL && pal.bVal )
  {
    score = pLinks->Fields->Item[ _bstr_t( "Score" ) ]->Value;
    if ( score.vt != VT_NULL && score.lVal >= 1 && score.lVal <= 5 )
    {
       char *files[] = { "0.gif", "1.gif", "2.gif", "3.gif", "4.gif" };

To complicate matters further, there is another familiar complaint I have with Microsoft's COM interfaces: error handling. Working with a database typically involves a fairly long series of API calls, each of which can return an error. Dealing with deeply nested error conditions leaves you with truly messy code:

C++:
  if ( open_database() == S_OK )
  {
    if ( open_table() == S_OK )
    {
      if ( read_data() == S_OK )
      {
        do_something();
        dispose_of_data();
      }
      else
        cout << "Error reading data";
    }
    else
      cout << "Error opening table";
    close_database();
  }
  else
    cout << "Error opening database";

I don't like code like this, whether I'm reading it or writing it.

Java to the Rescue

Fortunately, before I started writing my scanner in C++, I found myself knee-deep in Java programming at work. To my great pleasure, I saw that database programming in Java using JDBC worked around all of my annoyances with ODBC in a manner that made my job a lot easier.

Java's standard interface to databases is done through the Java Database Connectivity interface, known by the trademarked name JDBC. JDBC has two key points that make it look good in comparison to C++/ODBC. First, JDBC uses native types to interface to the database - eliminating the need to convert between variant or non-standard string types. Second, error handling in JDBC is managed via exceptions, which cuts out all those conditional tests I had to use with ODBC.

With these advantages, I found that I was able to implement my scanner in 100 lines of straightforward Java code. The Java code was easier to write, easier to debug, and has been easier to maintain.

Details

Figure 3 shows a screenshot from a Microsoft Access view of my URL database. That view shows all of the pertinent data from the URL table. The fields in each row that are updated by the scanner program are the Scan History, the Last Good Scan field, and the Last Scan Result field.


Figure 3 - A View of the URL Table

The job of the scanner program is to go through all the rows in the URL table. The program attempts to access the web page pointed to by the URL. If the page is found, a hyphen character is appended to the Scan History. If the page isn't found, the character F is appended. (The Scan History is always truncated to a maximum of 16 characters.)

If the page is found, the Last Good Scan field is set to the current date. In the event of a failure, the Java exception code is written to the Last Scan Result. This gives me the information I need to quickly scan through the database and determine which sites are going to give my users trouble.
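
The Scan History bookkeeping described above is simple enough to sketch in a few lines. This is a minimal illustration rather than the code from Listing 1: the class and method names are made up for the example, and it assumes that when the string grows past 16 characters the oldest results are the ones dropped.

```java
/** Sketch of the Scan History bookkeeping; all names here are illustrative. */
final class ScanHistory {
    static final int MAX_LENGTH = 16;   // maximum history length given above
    static final char GOOD = '-';
    static final char FAILED = 'F';

    /** Appends one scan result and keeps only the newest MAX_LENGTH characters. */
    static String append(String history, boolean pageFound) {
        // A row that has never been scanned may hold NULL in the database.
        String updated = (history == null ? "" : history) + (pageFound ? GOOD : FAILED);
        // Assumption: on overflow the oldest results (at the front) are dropped.
        if (updated.length() > MAX_LENGTH) {
            updated = updated.substring(updated.length() - MAX_LENGTH);
        }
        return updated;
    }

    public static void main(String[] args) {
        String history = "";
        boolean[] results = { true, true, false, true };
        for (int i = 0; i < results.length; i++) {
            history = append(history, results[i]);
        }
        System.out.println(history);   // prints --F-
    }
}
```

Running main() records two successes, a failure, and another success, leaving the history string --F-.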

The Java code to accomplish this is basically divided into two pieces: setup and scanning. The first step is done in the constructor of the HttpScan object, and the second is done in its member function scan(). I call both of these from the static main function in the HttpScan class:

JAVA:
  public static void main(String[] args)
  {
    try
    {
      HttpScan scanner = new HttpScan( "jdbc:odbc:CompressionLinks" );
      scanner.scan();
    }
    catch ( Exception e )
    {
      System.out.println("Failed to connect to database. Exception: " + e);
    }
  }

Note that both calls are wrapped in a try/catch block. This means I get to handle fatal errors by simply throwing or propagating an exception, knowing that it will be properly displayed when caught.

Connecting to the Database

JDBC ships with an ODBC bridge for Windows, which means the only database setup you need to perform is to set up an ODBC data source. Under Windows 2000, you can do this from the Administrative Tools section. Figure 4 shows part of the process. The text box labeled Data Source Name defines the name that you use to connect to the database with the call to DriverManager.getConnection().


Figure 4 - Setting up the ODBC Data Source

The full code for the connection to the database is in the HttpScan constructor. The steps taken sequentially are:

  1. Load the JDBC-ODBC bridge.
  2. Connect to the database.
  3. Load the recordset containing all of the http links in my database.
  4. Ask the database for the count of links in the database.
  5. Create an update statement that will be used throughout the program.

After the constructor executes, I have a count of records in member variable recordCount, a copy of all the records in member variable results, and a member variable called updateStatement that will be used to update the records in the database.
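
The five steps might look roughly like the sketch below. It is not the constructor from Listing 1: the class name HttpScanSetup, the table name Urls, and both queries are assumptions made for illustration. Only the bridge driver class name, the jdbc:odbc: URL form, and the Url column are taken from the JDK and from this article.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Sketch of the setup steps; the table name "Urls" and the queries are assumptions. */
final class HttpScanSetup {
    final Connection connection;
    final ResultSet results;          // every http link in the database
    final int recordCount;            // how many links there are
    final Statement updateStatement;  // reused for every update

    HttpScanSetup(String dataSource) throws ClassNotFoundException, SQLException {
        // 1. Load the JDBC-ODBC bridge driver.
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");

        // 2. Connect to the ODBC data source, e.g. "jdbc:odbc:CompressionLinks".
        connection = DriverManager.getConnection(dataSource);

        // 3. Load the recordset of http links. Each query gets its own Statement,
        //    because a Statement can have only one open ResultSet at a time.
        results = connection.createStatement().executeQuery(
            "SELECT * FROM Urls WHERE Url LIKE 'http:%'");

        // 4. Ask the database for the number of links.
        Statement countQuery = connection.createStatement();
        ResultSet countRow = countQuery.executeQuery(
            "SELECT COUNT(*) FROM Urls WHERE Url LIKE 'http:%'");
        countRow.next();
        recordCount = countRow.getInt(1);
        countQuery.close();

        // 5. Create the statement used for updates throughout the program.
        updateStatement = connection.createStatement();
    }
}
```

Because the constructor only declares its exceptions and never catches them, any failure in these steps travels straight up to the try/catch block in main().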

Scanning the records

Once the connection has been made and the recordset loaded, the member function scan() is called. Its job is to attempt to connect to every URL in the recordset. In Listing 1, you'll see that I declare that scan() throws an SQLException - which means I don't have to do any fancy recovery when a fatal error is encountered. I rely on the caller, main(), to catch the exception and print out the corresponding information.

At that point, the scanner enters the main loop, which executes as long as I can successfully call results.next(), the call that moves to the next available database record. Listing 1 shows the typical code used to get data from a record, which looks like this:

JAVA:
  key = results.getInt( "Key" );
  String url = results.getString( "Url" );
  history = results.getString( "ScanHistory" );

You can immediately see why I would prefer this code to the equivalent C++ code under Windows. All of the data is returned in native Java types, I pass in parameters in native types, and I allow error handling to be done via exceptions. The result? Code that's easy to read, easy to write!

Once I have the URL in hand, the next step is to create and open a URLConnection object, part of the standard java.net package. After opening the connection, I just check the responseCode member of the URLConnection object. If I receive a 200, it means that the URLConnection object was able to locate the web site and read in the page. Anything else means an error of some sort.

Note that the library code that loads the URL may well throw an exception, which I catch inside the scan() routine. When that happens, the Exception object will typically hold useful information as well.
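
A single URL check along these lines could be sketched as follows. The class UrlProbe and its two methods are hypothetical names for this example, and returning an empty string on success is just a convenient convention here, not necessarily what Listing 1 does.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

/** Sketch of the per-URL check; class and method names are illustrative. */
final class UrlProbe {
    /** Returns "" when the page answers with HTTP 200, otherwise an error description. */
    static String check(String address) {
        try {
            URLConnection connection = new URL(address).openConnection();
            if (!(connection instanceof HttpURLConnection)) {
                return "Not an HTTP URL";
            }
            HttpURLConnection http = (HttpURLConnection) connection;
            int responseCode = http.getResponseCode();  // connects and reads the status line
            http.disconnect();
            return classify(responseCode);
        } catch (IOException e) {
            // Covers malformed URLs, unknown hosts, refused connections, and so on.
            return e.toString();
        }
    }

    /** Only a 200 counts as success; every other code is reported as an error. */
    static String classify(int responseCode) {
        return responseCode == HttpURLConnection.HTTP_OK ? "" : "HTTP response " + responseCode;
    }
}
```

The string returned on failure is the kind of text that can be stored in the Last Scan Result field, whether it came from an unexpected response code or from an exception.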

Performing the Updates

The final task that the scan() loop performs is to update the database with the results for the given URL. This is done from one of four different places in the scan loop, where three points represent various failures, and one represents success.

The Update() routine takes four parameters: the record key, the 16-character history string, the status flag (which is '-' on success and 'F' on failure), and a string that contains any error message from the attempt to connect.

Again, this is where Java shines in comparison to the Visual C++ solution. As you can see from Listing 1, creating the SQL statement that performs the update is all done using native types and straightforward code.

Using the Program

Running the program is straightforward. The program is a simple console app that displays its results as it runs. I invoke the program from the command line with the following command:

java -classpath . HttpScan

Figure 5 shows the program in action:

Figure 5 - HttpScan in Action

I try to run the scanner program once a week or so. After each run, I generate a report showing which sites have failed several consecutive scans. From that point I have to go through the laborious process of determining whether the site has moved, gone under, or is just plain flaky.
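
The test behind such a report amounts to counting the failures at the tail of each Scan History string. Here is a small sketch of that check, assuming the newest result is the last character. FailureReport, its methods, and the threshold parameter are hypothetical names introduced for this example.

```java
/** Sketch of the "repeatedly failing" test; all names here are illustrative. */
final class FailureReport {
    /** Counts the 'F' entries at the end of a history string (newest result last). */
    static int consecutiveFailures(String history) {
        int count = 0;
        for (int i = history.length() - 1; i >= 0 && history.charAt(i) == 'F'; i--) {
            count++;
        }
        return count;
    }

    /** True when the most recent scans have failed at least threshold times in a row. */
    static boolean needsAttention(String history, int threshold) {
        return consecutiveFailures(history) >= threshold;
    }
}
```

With a threshold of three, a history of --F-FFF would be flagged, while FFF- would not, because its most recent scan succeeded.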

A Catch

When I first developed this program, I assumed that one complete run through the database would take on the order of an hour or two. The program reaches most URLs in a matter of seconds, with some of the more distant sites taking as long as sixty seconds. So I was quite disappointed when I let the program run overnight and found that it was stalled on one particular site after only running through roughly 100 URLs.

A little research into Sun's bug database showed that I was far from the first person to see this problem. For an example, look up bug 4304701, 4143518, or 4283433. Basically the problem is that socket connections used by URLConnection are opened without a timeout value, which means they can hang forever when network problems crop up.

Of course, there are quite a few possible ways to solve this problem, and no shortage of suggestions in the appropriate newsgroups. Some of the proposed solutions seem to be platform dependent, and may not work in all environments. The only simple solution I found that worked properly on Win32 platforms was to put my URLConnection attempts in a separate thread, and then to call Thread.stop() after an unreasonable amount of time with no results. But Thread.stop() is a deprecated function call, and has the ugly possibility of creating resource leaks.
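
For comparison, the thread-based idea can be sketched without Thread.stop() by simply abandoning a worker that does not finish in time. This is a variation on the approach described above, not the technique HttpScan ended up using, and the names TimedAttempt and runWithTimeout are made up for the example. It trades the hazards of Thread.stop() for a stuck thread that lingers until its socket finally gives up.

```java
/** Sketch of a timed attempt on a worker thread; names are illustrative. */
final class TimedAttempt {
    /**
     * Runs task on a worker thread and reports whether it finished in time.
     * timeoutMillis must be greater than zero, since join(0) waits forever.
     */
    static boolean runWithTimeout(Runnable task, long timeoutMillis) {
        Thread worker = new Thread(task, "url-check");
        worker.setDaemon(true);   // a stuck worker must not keep the JVM alive
        worker.start();
        try {
            worker.join(timeoutMillis);   // wait, but only for so long
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // preserve the interrupt for the caller
        }
        return !worker.isAlive();
    }
}
```

A caller would wrap each connection attempt in a Runnable, pass it in with a timeout such as 30000 milliseconds, and record a failure whenever the method returns false.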

My Solution

The solution I implemented in HttpScan was based on a post to the java.sun.com site by a person named Denis Haskin. Denis's workaround replaces the default SocketFactory used by the java.net classes with a slightly modified version. The modified version creates sockets that have a timeout set upon creation. Since classes such as URLConnection rely on the default socket factory for their sockets, inserting a replacement factory is an elegant workaround to the problem.

My version of the factory class is called TimeoutSocketFactory, and is shown in Listing 2. You can see that the class is incredibly simple. All it does is return a copy of a socket class I created called MySocketImpl. This class is derived from the PlainSocketImpl class defined in java.net, and differs in only one function: connect(). My version of the connect() function calls the base class connect(), then sets the connection to have a specified timeout. Since the base class defines all of the remaining behavior, I'm confident that my derived socket class won't cause any other problems.

Once I created this class and integrated it with my HttpScan program, things worked perfectly. I make a single call at the start of the program that creates this new socket factory using this code:

JAVA:
  try {
    Socket.setSocketImplFactory( new TimeoutSocketFactory( 30000 ) );
  }
  catch ( Exception e )
  {
    System.out.println( "Hijacking socket factory failed: " + e );
    System.exit( -1 );
  }

It now takes a bit over an hour to scan through the entire database, and a fair number of sites do actually generate a timeout.

There is one big catch to this technique. I have to derive my special socket class from PlainSocketImpl, the standard socket class used by the java.net package. Unfortunately, PlainSocketImpl is only visible in the java.net package. So my entire socket factory has to be hoisted into the java.net package, which is clearly coloring outside the lines.

In order to add a class to the package like this, I simply create a java/net directory and put the source file there. I then have to add the following switch to the VM when running the program:

-Xbootclasspath/a:.

This tells the VM to look the other way while I inject some additional code into the library space. When running with JBuilder, VM parameters are added to the Project Properties Dialog on the Run tab.

Sun understands that this is a problem, and documents some of the shortcomings of the current architecture in bug ID 4245730. In fact, the bug documentation is pessimistic enough to say "SocketImplFactories are a dead end." Personally, I'm happy with the workaround.

The JDK 1.4 Solution

By the time you read this, Sun may have released version 1.4 of Java. If they have, this program can use a simpler solution to this problem. In the 1.4 JDK, Sun has documented a pair of system properties that allow you to modify the default timeouts used by sockets. These two properties are sun.net.client.defaultConnectTimeout and sun.net.client.defaultReadTimeout, and both specify timeout values in milliseconds.

I was able to modify the static main() method of HttpScan to look like this with the 1.4 JDK:

JAVA:
  try
  {
    System.setProperty( "sun.net.client.defaultConnectTimeout", "30000" );
    System.setProperty( "sun.net.client.defaultReadTimeout", "30000" );
    HttpScan scanner = new HttpScan( "jdbc:odbc:CompressionLinks" );
    scanner.scan();
  }
  catch ( Exception e )
  {
    System.out.println("Failed to connect to database. Exception: " + e);
  }

I could then run the program successfully with the 1.4 JDK after removing my modified socket factory. Once that JDK is released and working well I will undoubtedly modify my program to use this more conventional method of controlling timeouts.

Summary

I was very pleased with how easily this project came together in Java. As I said at the start of the article, my issues with C++ all melted away when I saw the light with Sun. The two technologies used here, JDBC and java.net, are both powerful and easy to use. I hope that the code presented here helps you with your next project in this area.

Listing 1 - HttpScan.java

Listing 2 - java/net/TimeoutSocketFactory.java

Download - source.zip