Building a URL Scanner With Java 1.3
Windows::Developer, January 2002
This article describes a URL validation program I wrote in Java, using standard Internet and database language features. I’ll show you why Java is the clear choice of languages for this job. Finally, I’ll show you how to work around one particularly egregious shortcoming in Sun’s JDK 1.3.
Introduction
For several years now I’ve maintained a web site called DataCompression.info. (Note - this site passed out of my hands in 2003). I have one steady advertiser who helps me cover my hosting costs, but when you consider the time and energy required to keep the site up you’ll understand that it is a labor of love.
The Data Compression Library is simply a collection of links to sites related to data compression. I have a database of over 2000 URLs, categorized with one or more of thirty-some topics. On a periodic basis I run a program that passes over the database and generates a batch of pages I upload to a server. Figure 1 shows the main page for the site, and Figure 2 shows a typical topic page.
Figure 1 - The Front Page of the Data Compression Library
Figure 2 - A Typical Topic Page
Wish You Were Here
One of the biggest ongoing problems maintaining this site is dealing with bad links. Over the course of a year, I have to contend with literally hundreds of sites that change URLs, move within an existing domain, or just disappear from the face of the earth. My users tend to become annoyed when every other link ends up with an HTTP 404 error.
When I first started working on the site I was operating under the illusion that dead links could be dealt with through manual labor on my part and faithful error reporting from my users. It turns out that both of these assets are in short supply. After a couple of years I had more bad URLs than Microsoft has lawsuits. It was time for a change.
An Automated Solution
I knew exactly what I needed: an automated scanning program that I could run periodically. The scanner would update my database with its results, and generate a report showing me which sites were repeatedly AWOL.
Despite the clear need for this program, I had been avoiding getting down to business for some time. My tool of choice for utility programs has always been Visual C++, and the thought of writing more database code in Microsoft’s universe was far from appealing.
The problem in this case isn’t C++, Visual Studio, or even Microsoft. I get along with all three on a daily basis. The problem is the unfriendly nature of Microsoft’s C++ interface to ODBC, their database API of choice. At least from my perspective, it seems that every piece of data that is passed back and forth between my C++ code and the database has to go through at least one or two type transformations. The database code is never content to take a C++ string type, or even a simple character array. Instead, every string has to be converted to type _variant_t or _bstr_t, two monstrosities that Microsoft apparently cooked up for Visual Basic programmers. Visual Basic is weakly typed, C++ is strongly typed. It’s not fair to make either of them use the other’s type system, but the API developers in Redmond decided a Procrustean fit was just fine.
Data read in from the database typically comes in as a variant type, which means that before I can even use it, I first have to check the type, then examine the value, leading to lots of code that looks something like this:
To complicate matters further, there is another familiar complaint I have with Microsoft’s COM interfaces: error handling. Working with a database typically involves a fairly long series of API calls, each of which can return an error. Dealing with deeply nested error conditions leaves you with truly messy code:
I don’t like code like this, whether I’m reading it or writing it.
Java to the Rescue
Fortunately, before I started writing my scanner in C++, I found myself knee-deep in Java programming at work. To my great pleasure, I saw that database programming in Java using JDBC worked around all of my annoyances with ODBC in a manner that made my job a lot easier.
Java’s standard interface to databases is the Java Database Connectivity API, known by the trademarked name JDBC. JDBC has two key points that make it look good in comparison to C++/ODBC. First, JDBC uses native types to interface to the database - eliminating the need to convert between variant or non-standard string types. Second, error handling in JDBC is managed via exceptions, which cuts out all those conditional tests I had to use with ODBC.
With these conditions, I found that I was able to implement my scanner in 100 lines of straightforward Java code. The Java code was easier to write, easier to debug, and has been easier to maintain.
Details
Figure 3 shows a screenshot from a Microsoft Access view of my URL database. That view shows all of the pertinent data from the URL table. The fields in each row that are updated by the scanner program are the Scan History, the Last Good Scan field, and the Last Scan Result field.
Figure 3 - A View of the URL Table
The job of the scanner program is to go through all the rows in the URL table. The program attempts to access the web page pointed to by the URL. If the page is found, a hyphen character is appended to the Scan History. If the page isn’t found, the character F is appended. (The Scan History is always truncated to a maximum of 16 characters.)
If the page is found, the Last Good Scan field is set to the current date. In the event of a failure, the Java exception code is written to the Last Scan Result. This gives me the information I need to quickly scan through the database and determine which sites are going to give my users trouble.
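The history bookkeeping described above can be sketched in a few lines. This is only an illustration, not the code from Listing 1: the class and method names are made up, and it assumes that when the history grows past 16 characters the oldest entries are dropped so the most recent results are kept.

```java
// Sketch of the scan-history bookkeeping described in the text.
// Assumption: when the history exceeds 16 characters, the oldest
// entries are dropped so the most recent results survive.
class ScanHistory {
    static final int MAX_HISTORY = 16;
    static final char SUCCESS = '-';
    static final char FAILURE = 'F';

    /** Appends one scan result and trims the history to MAX_HISTORY characters. */
    static String append(String history, boolean pageFound) {
        String updated = (history == null ? "" : history)
                + (pageFound ? SUCCESS : FAILURE);
        if (updated.length() > MAX_HISTORY) {
            updated = updated.substring(updated.length() - MAX_HISTORY);
        }
        return updated;
    }

    public static void main(String[] args) {
        System.out.println(append("--F-", true));  // prints --F--
        System.out.println(append("", false));     // prints F
    }
}
```

Read this way, a history ending in a run of F characters marks a site that has failed its most recent scans.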
The Java code to accomplish this is basically divided into two pieces: setup and scanning. The first step is done in the constructor of the HttpScan object, and the second is done in its member function scan(). I call both of these from the static main function in the HttpScan class:
Note that both functions are wrapped in a try/catch block. This means I get to handle fatal errors by simply throwing or propagating an exception, knowing that it will be properly displayed when caught.
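The shape of that main function can be sketched roughly as follows. The class name HttpScanSkeleton and the empty method bodies are placeholders for illustration; the real work lives in Listing 1.

```java
import java.sql.SQLException;

// Hypothetical skeleton; the bodies are placeholders, not the code from Listing 1.
class HttpScanSkeleton {
    /** Setup: the real program connects to the database here. */
    HttpScanSkeleton() throws Exception {
        // load driver, connect, load recordset (omitted in this sketch)
    }

    /** Scanning: the real program walks the recordset here. */
    void scan() throws SQLException {
        // visit each URL and record the result (omitted in this sketch)
    }

    public static void main(String[] args) {
        try {
            HttpScanSkeleton scanner = new HttpScanSkeleton();
            scanner.scan();
        } catch (Exception e) {
            // Any fatal error from setup or scanning ends up here.
            System.err.println("Fatal error: " + e);
        }
    }
}
```

Because everything funnels into one catch block, neither the constructor nor scan() needs its own recovery code for fatal errors.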
Connecting to the Database
JDBC ships with an ODBC bridge for Windows, which means the only database setup you need to perform is to set up an ODBC data source. Under Windows 2000, you can do this from the Administrative Tools section. Figure 4 shows part of the process. The text box labeled Data Source Name defines the name that you use to connect to the database with the call to DriverManager.getConnection().
Figure 4 - Setting up the ODBC Data Source
The full code for the connection to the database is in the HttpScan constructor. The steps taken sequentially are:
- Load the JDBC-ODBC bridge.
- Connect to the database.
- Load the recordset containing all of the http links in my database.
- Ask the database for the count of links in the database.
- Create an update statement that will be used throughout the program.
After the constructor executes, I have a count of records in member variable recordCount, a copy of all the records in member variable results, and a member variable called updateStatement that will be used to update the records in the database.
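The five setup steps might look something like the sketch below. The member names match the ones mentioned above, but the class name, the table name Urls, the column name Url, and the SQL text are all assumptions made for illustration. The bridge driver class named in the constant is the one that shipped with JDK 1.3; the driver class and JDBC URL are passed in so a different driver can be substituted.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch of the five setup steps. The table name, column name, and SQL
// text are hypothetical; Listing 1 holds the real ones.
class ScannerSetup {
    static final String BRIDGE_DRIVER = "sun.jdbc.odbc.JdbcOdbcDriver"; // shipped with JDK 1.3
    static final String SELECT_LINKS = "SELECT * FROM Urls WHERE Url LIKE 'http:%'";
    static final String COUNT_LINKS = "SELECT COUNT(*) FROM Urls WHERE Url LIKE 'http:%'";

    final int recordCount;
    final ResultSet results;
    final Statement updateStatement;

    ScannerSetup(String driverClass, String jdbcUrl)
            throws ClassNotFoundException, SQLException {
        Class.forName(driverClass);                                   // 1. load the bridge
        Connection connection = DriverManager.getConnection(jdbcUrl); // 2. connect

        // 3. load the recordset of http links
        results = connection.createStatement().executeQuery(SELECT_LINKS);

        // 4. ask the database how many links there are
        ResultSet count = connection.createStatement().executeQuery(COUNT_LINKS);
        count.next();
        recordCount = count.getInt(1);
        count.close();

        // 5. one statement, reused for every update
        updateStatement = connection.createStatement();
    }
}
```

With an ODBC data source named, say, DataCompression (a made-up name), the call would be new ScannerSetup(ScannerSetup.BRIDGE_DRIVER, "jdbc:odbc:DataCompression").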
Scanning the records
Once the connection has been made and the recordset loaded, the member function scan() is called. Its job is to attempt to connect to every URL in the recordset. In Listing 1, you’ll see that I declare that scan() throws an SQLException - which means I don’t have to do any fancy recovery when a fatal error is encountered. I rely on the caller, main(), to catch the exception and print out the corresponding information.
At that point, the scanner enters the main loop, which executes as long as I can successfully call results.next(), which moves to the next available database record. In Listing 1, you can see some typical code used to get data from a record. It looks like this:
You can immediately see why I would prefer this code to the equivalent C++ code under Windows. All of the data is returned in native Java types, I pass in parameters in native types, and I allow error handling to be done via exception. The result? Code that’s easy to read, easy to write!
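As a rough illustration of that style, reading one row from a ResultSet might look like the sketch below. The column names ID, Url, and ScanHistory and the LinkRecord holder class are assumptions for this example, not the names used in Listing 1.

```java
import java.sql.ResultSet;
import java.sql.SQLException;

// Column names here are hypothetical stand-ins for the fields in Figure 3.
class RecordReader {
    /** Plain holder for the fields the scanner needs from one row. */
    static class LinkRecord {
        final int key;
        final String url;
        final String history;

        LinkRecord(int key, String url, String history) {
            this.key = key;
            this.url = url;
            this.history = history;
        }
    }

    /** Reads the current row; any database error simply propagates to the caller. */
    static LinkRecord read(ResultSet results) throws SQLException {
        int key = results.getInt("ID");
        String url = results.getString("Url");
        String history = results.getString("ScanHistory");
        // A row that has never been scanned may have no history yet.
        return new LinkRecord(key, url, history == null ? "" : history);
    }
}
```

Inside the main loop this would be called once per successful results.next(), with no type conversions and no per-call error checks.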
Once I have the URL in hand, the next step is to create and open a URLConnection object, part of the standard java.net package. After opening the connection, I just check the responseCode member of the URLConnection object. If I received a 200, it means that the URLConnection object was able to locate the web site and read in the page. Anything else means an error of some sort.
Note that the library code that loads the URL may well throw an exception, which I catch inside the scan() routine. When that happens, the Exception object will typically hold useful information as well.
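A minimal sketch of one such connection attempt follows. It uses the public getResponseCode() method of HttpURLConnection; the class name UrlProbe and the convention of returning an empty string on success are illustrative choices, not taken from Listing 1.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only: the names and the result format are illustrative, not from Listing 1.
class UrlProbe {
    /** Returns "" if the page was found, otherwise a short failure description. */
    static String probe(String address) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) new URL(address).openConnection();
            int code = connection.getResponseCode(); // performs the request
            connection.disconnect();
            return code == HttpURLConnection.HTTP_OK ? "" : "HTTP " + code;
        } catch (IOException e) {
            // Covers malformed URLs, unknown hosts, refused connections, and so on.
            return e.toString();
        } catch (ClassCastException e) {
            // openConnection() returned something other than an HTTP connection.
            return "not an http URL";
        }
    }
}
```

An empty result maps to the '-' flag and anything else to 'F', with the returned text available for the Last Scan Result field.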
Performing the Updates
The final task that the scan() loop performs is to update the database with the results for the given URL. This is done from one of four different places in the scan loop, where three points represent various failures, and one represents success.
The Update() routine takes four parameters: the record key, the 16 character history string, the status flag (which is ‘-’ on success and ‘F’ on failure), and a string which contains any error message from the attempt to connect.
Again, this is where Java shines in comparison to the Visual C++ solution. As you can see from Listing 1, creating the SQL statement that performs the update is all done using native types and straightforward code.
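To give a feel for it, here is a sketch of how such an UPDATE statement could be assembled. The table name Urls, the key column ID, and the bracketed column names (modeled on the fields in Figure 3) are assumptions, as is the extra date parameter, added here so the example is deterministic.

```java
// Sketch of building the UPDATE text. The table and column names are
// hypothetical; the bracketed names only mirror the fields in Figure 3.
class UpdateSql {
    /** Doubles single quotes so the text is safe inside an SQL string literal. */
    static String quote(String text) {
        StringBuffer out = new StringBuffer("'");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\'') out.append('\'');
            out.append(c);
        }
        return out.append('\'').toString();
    }

    static String build(int key, String history, char status, String message,
                        java.sql.Date today) {
        // Append the new flag and keep at most 16 characters (assumed: the newest 16).
        String updated = history + status;
        if (updated.length() > 16) updated = updated.substring(updated.length() - 16);

        StringBuffer sql = new StringBuffer("UPDATE Urls SET [Scan History] = ");
        sql.append(quote(updated));
        sql.append(", [Last Scan Result] = ").append(quote(message));
        if (status == '-') {
            // {d '...'} is the JDBC escape syntax for a date literal.
            sql.append(", [Last Good Scan] = {d '").append(today).append("'}");
        }
        return sql.append(" WHERE ID = ").append(key).toString();
    }
}
```

The resulting string would be handed to updateStatement.executeUpdate(). Doubling the quotes matters because exception messages can contain apostrophes; a PreparedStatement with parameters is another way to sidestep the quoting entirely.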
Using the Program
Running the program is straightforward. The program is a simple console app that displays its results as it runs. I invoke the program from the command line with the following command:
java -classpath . HttpScan
Figure 5 shows the program in action:
Figure 5 - HttpScan in Action
I try to run the scanner program once a week or so. After each run, I generate a report showing which sites have failed several consecutive scans. From that point I have to go through the laborious process of determining whether the site has moved, gone under, or is just plain flaky.
A Catch
When I first developed this program, I assumed that one complete run through the database would take on the order of an hour or two. The program reaches most URLs in a matter of seconds, with some of the more distant sites taking as long as sixty seconds. So I was quite disappointed when I let the program run overnight and found that it was stalled on one particular site after only running through roughly 100 URLs.
A little research into Sun’s bug database showed that I was far from the first person to see this problem. For examples, look up bugs 4304701, 4143518, or 4283433. Basically the problem is that socket connections used by URLConnection are opened without a timeout value, which means they can hang forever when network problems crop up.
Of course, there are quite a few possible ways to solve this problem, and no shortage of suggestions in the appropriate newsgroups. Some of the proposed solutions seem to be platform dependent, and may not work in all environments. The only simple solution I found that worked properly on Win32 platforms was to put my URLConnection attempts in a separate thread, and then to call Thread.stop() after an unreasonable amount of time with no results.
But Thread.stop() is a deprecated function call, and has the ugly possibility of creating resource leaks.
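For reference, the separate-thread idea can be sketched without Thread.stop() by simply abandoning a worker that takes too long. This is a swapped-in variant, not the approach described above: the stuck thread and its socket are left behind rather than killed, so it shares the leak problem. The class and method names are made up for the example.

```java
// Sketch of the "run it in another thread and give up after a while" idea.
// It abandons the worker (a daemon thread) instead of calling Thread.stop().
class TimedAttempt {
    /**
     * Runs task on a worker thread and returns true if it finished within
     * timeoutMillis. The timeout must be greater than zero, because
     * join(0) would wait forever.
     */
    static boolean runWithTimeout(Runnable task, long timeoutMillis) {
        Thread worker = new Thread(task);
        worker.setDaemon(true);          // a stuck worker will not keep the JVM alive
        worker.start();
        try {
            worker.join(timeoutMillis);  // wait, but only for so long
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (worker.isAlive()) {
            worker.interrupt();          // polite request; a blocked socket may ignore it
            return false;
        }
        return true;
    }
}
```

Each abandoned worker still holds its socket until the connection eventually fails, which is why a timeout on the socket itself is the cleaner fix.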
My Solution
The solution I implemented in HttpScan was based on a post to the java.sun.com site by a person named Denis Haskin. Denis’s workaround replaces the default SocketImplFactory used by the java.net classes with a slightly modified version. The modified version creates sockets that have a timeout set upon creation. Since classes such as URLConnection rely on the default socket factory for their sockets, inserting a replacement factory is an elegant workaround to the problem.
My version of the factory class is called TimeoutSocketFactory, and is shown in Listing 2. You can see that the class is incredibly simple. All it does is return an instance of a socket class I created called MySocketImpl. This class is derived from the PlainSocketImpl class defined in java.net, and differs in only one function: connect(). My version of the connect() function calls the base class connect(), then sets the connection to have a specified timeout. Since the base class defines all of the remaining behavior, I’m confident that my derived socket class won’t cause any other problems.
Once I created this class and integrated it with my HttpScan program, things worked perfectly. I make a single call at the start of the program that creates this new socket factory using this code:
It now takes a bit over an hour to scan through the entire database, and a fair number of sites do actually generate a timeout.
There is one big catch to this technique. I have to derive my special socket class from PlainSocketImpl, the standard socket class used by the java.net package. Unfortunately, PlainSocketImpl is only visible in the java.net package. So my entire socket factory has to be hoisted into the java.net package, which is clearly coloring outside the lines.
In order to add a class to the package like this, I simply create a java/net directory and put the source file there. I then have to add the following switch to the VM when running the program:
-Xbootclasspath/a:.
This tells the VM to look the other way while I inject some additional code into the library space. When running with JBuilder, VM parameters are added to the Project Properties Dialog on the Run tab.
Sun understands that this is a problem, and documents some of the shortcomings of the current architecture in bug ID 4245730. In fact, the bug documentation is pessimistic enough to say “SocketImplFactories are a dead end.” Personally, I’m happy with the workaround.
The JDK 1.4 Solution
By the time you read this, Sun may have released version 1.4 of Java. If they have, this program can use a simpler solution to this problem. In the 1.4 JDK, Sun has documented a pair of system properties that allow you to modify the default timeouts used by sockets. These two properties are sun.net.client.defaultConnectTimeout and sun.net.client.defaultReadTimeout, and both specify timeout values in milliseconds.
I was able to modify the static main() method of HttpScan to look like this with the 1.4 JDK:
I could then run the program successfully with the 1.4 JDK after removing my modified socket factory. Once that JDK is released and working well I will undoubtedly modify my program to use this more conventional method of controlling timeouts.
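The 1.4 change to main() amounts to setting the two properties before any scanning starts. A minimal sketch, assuming a 30-second limit for both (an arbitrary value chosen for illustration) and using a made-up class name:

```java
// Sketch for JDK 1.4 or later. The 30-second values are arbitrary examples.
class TimeoutProperties {
    /** Sets the default connect and read timeouts, both in milliseconds. */
    static void install(int connectMillis, int readMillis) {
        System.setProperty("sun.net.client.defaultConnectTimeout",
                String.valueOf(connectMillis));
        System.setProperty("sun.net.client.defaultReadTimeout",
                String.valueOf(readMillis));
    }

    public static void main(String[] args) {
        install(30000, 30000);  // set these before any connection is opened
        // ... create the scanner and call scan() here ...
    }
}
```

The same properties can also be supplied on the command line with -D switches, which leaves the source untouched.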
Summary
I was very pleased with how easily this project came together in Java. As I said at the start of the article, my issues with C++ all melted away when I saw the light with Sun. The two technologies used here, JDBC and java.net, are both powerful and easy to use. I hope that the code presented here helps you with your next project in this area.
- Listing 1 - HttpScan.java
- Listing 2 - java/net/TimeoutSocketFactory.java
- Download - source.zip