File Verification Using CRC
Dr. Dobb's Journal May, 1992 |
Recently I have found myself thinking a lot about file verification. By file verification, I mean the process of determining whether a file on my computer has been modified unexpectedly. Whether it happened through hardware failure, program error, or malicious tampering, I like to know when a file has had its contents altered. Likewise, I would like a convenient way to check the integrity of a file to verify that it hasn’t been changed. This article will show you one way to verify the file’s contents: checking its 32-bit CRC value.
The problem of file integrity has been on my mind because of several nearly simultaneous incidents. First of all, I recently ran dozens of relatively untested programs through my home systems while I was judging the Dr. Dobb’s Data Compression Contest. At least two of these programs caused inadvertent damage to the file systems on my computer, one under UNIX and one under MS-DOS. In both cases, I was able to spot a lot of the damage, but after I restored the data that looked bad, I was left feeling unsure about the rest of my system. Had other files been damaged in more subtle ways? I suddenly felt as though I couldn’t trust my system.
An even more alarming incident occurred a couple of weeks later. A programmer who supplies us with a product for resale called us up and casually mentioned that his office had been infested with the notorious Stoned virus. Had we by any chance noticed anything funny in our systems? We see funny things on our systems on an hourly basis, so suddenly we were once again in the position of not trusting any of the files on our computers. (Fortunately this turned out to be a false alarm).
Finally, as part of a recent product release at Greenleaf Software , we decided to implement a program that would allow our customers to download short patch files from our BBS to apply to the source code they purchased from us. I created a small program that could read in the patch file and make modifications to an existing source file, resulting in a corrected output file. However, to keep the program simple, we had to have a way to be sure that the input file we were patching was the file we expected it to be, and hadn’t been modified in any way. Our patch program would be capable of really fouling up the file if a programmer had just changed a few lines here and there before trying to patch it.
The solution to all of these problems consists of two parts. The first part of the solution is the use of the CRC-32 algorithm to provide a fingerprint method of file identification. The second is a general purpose program called CRCMAN that can develop a catalog of CRC values for all the files in a directory tree, and can later check the files in the same directory tree against the catalog.
CRC-32
CRC-32 is an acronym for the 32 bit Cyclical Redundancy Check algorithm. CRC-32 generally refers to a specific 32 bit CRC formula sanctioned by the CCITT, an international standards body primarily concerned with telecommunications. CRC-32 is used in communications protocols such as HDLC and ZMODEM to verify the integrity of blocks of data being transferred through various media.
CRC calculations are done using a technique with the formidable name of polynomial division. A block of data, regardless of how long, is treated as if each bit in the block is the coefficient in a long polynomial. For example, a single hexadecimal byte, F0H, would be considered to the polynomial:
Since the terms with coefficients of 0 drop out, the polynomial can be expressed as:
In the case where we are calculating the CRC of an entire file, the exponents will be very large, but this is not a problem. The actual value of the exponents do not come into play during the calculation of the CRC, so they can grow indefinitely without affecting the algorithm.
The calculation of the CRC is done by dividing a second polynomial, known as the generator polynomial into the message polynomial, producing a quotient and a remainder. The generator polynomial used by the CRC-32 is:
After dividing this generator polynomial int our message polynomial, we end up with a quotient and a remainder. We simply discard the quotient, and use the remainder as our 32 bit CRC.
Polynomial division to create a CRC was originally done using hardware shift registers and boolean glue logic. Fortunately for us, cookbook algorithms now exist to implement the CRC on desktop computers in a relatively fast and efficient manner. In this program, I use a table lookup version of the algorithm that exchanges a small increase in storage space for fast calculation.
The nuts and bolts of how the CRC calculations work have been discussed many places before, so I won’t go any further into the details in this article. Some excellent resources are listed in the references at the end of this book. If you are interested in exploring this topic further, they would make a good place to start.
The Qualities of the CRC-32
CRCMAN uses the CRC-32 algorithm to generate a 32 bit number for any given file. We then treat this 32 bit number as a somewhat unique fingerprint for that file. This fingerprint differs somewhat from the human fingerprint. It often said that no two people have identical fingerprints. This can’t be the case for our CRC fingerprint. Since there are more than 4,294,967,296 different files in the world, it is a foregone conclusion that some of them must have identical CRC values.
However, the CRC-32 does have attributes that make it very attractive for the verification of files. These include the following:
- Every bit in the file contributes to the CRC. This means that changing any bit in the file should change the CRC.
- Relatively small changes in the file should always result in changes in the CRC. We want to be sure that it would take an extremely unlikely combination of errors to produce an identical CRC.
- The histogram of output CRC values for input files should tend to be flat. For a given input file, we want the probability of a given CRC being produced to be nearly equal across the entire range of possible CRCs from 0 to FFFFFFFF.
These are the goals that the CCITT had in mind when selecting the CRC-32 algorithm, and we assume that they made a good choice. In practice, the chances of inadvertently damaging or modifying a file without modifying the CRC is vanishingly small, so for all practical purposes a program like CRCMAN can be considered to be infallible.
Another characteristic that we would like to see in the CRC-32 would be non-invertability. This characteristic isn’t really necessary if we are just using the CRC to guard against accidental file corruption, but it becomes much more important if we want to detect virus infestations of our files.
A typical virus might operate by modifying the MS-DOS command interpreter, COMMAND.COM
.
My version of COMMAND.COM
happens to have a CRC-32 of 02f8690cH. In the event that a
virus modifies this file, it will undoubtedly have a new CRC-32. The challenge to the virus
programmer would then be to add new bytes to the end of the file so that the original CRC was
restored.
Unfortunately, there are techniques for doing just this that are far better than brute force solutions. So when it comes to file security against outside attack, you are better off using a digest algorithm designed for this purpose, such as one from the Secure Hash Algorithm family (SHA). These calculations will be more computationally expensive, but will offer better protection against malicious agents.
Implementation
CRCMAN is a command line MS-DOS program that can perform one of two tasks It has two operating modes, one for building a list of CRC values, and another for checking files against that list. The command line for CRCMAN has one of two forms. The first form is used to create a CRC listing file that has the CRC and file name of every file in a directory tree. The syntax for invoking CRCMAN in this mode is:
CRCMAN -b directory crc-file-name
The directory parameter passed to CRCMAN is the name a of a root directory. CRCMAN will calculate the CRC-32 for every file in and under that directory, and store the results in the crc file named as the second parameter. The crc file created is an ordinary ASCII text file that can be edited and manipulated using any text editor. All it contains is a sequence of lines that contain a CRC-32 value followed by a file name.
For example, I ran CRCMAN on the directory that holds my work for this article on my MS-DOS machine with the following command:
CRCMAN -b . C:\CRCFILES\TEST.CRC
This created a CRC file named TEST.CRC, which had the following contents:
363476a4 .\\CRC.TXT
b97a5169 .\\CRC.BAK
d6a5f5f5 .\\CRCMAN.C
02f8690c .\\TEST\COMMAND.COM
88f2e4d6 .\\CRCMAN.EXE
23123e1c .\\TEMP.CRC
Later on, I can check the integrity of these files by running CRCMAN in its second mode, which takes this command line:
CRCMAN C:\CRCFILES\TEST.CRC
In this mode, CRCMAN just reads in each line of the CRC file, calculates the CRC of the file, and determines if it matches the stored CRC. When working on this article, I changed a few lines in the text file, then ran CRCMAN. It produced the following lines:
Checking .\\CRC.TXT . Error: Expected 76a414c6, got 86793634
Checking .\\CRC.BAK . Error: Expected 516914c6, got 76a4b97a
Checking .\\CRCMAN.C . OK
Checking .\\TEST\\COMMAND.COM .. OK
Checking .\\CRCMAN.EXE . OK
Checking .\\TEMP.CRC . OK
CRCMAN correctly detected the changes in the two files. In the current implementation of CRCMAN, all that happens when an error is detected is that an error message is printed out to the screen. In a production version of this program, the action taken can obviously grow to be as sophisticated as you like.
The Code
The complete C listing for CRCMAN is shown below. This version of the program is designed to run
under most MS-DOS C Compilers, as well as K&R implementations under UNIX, and the hybrid
Microsoft C compiler under UNIX. For the most part, the portability of the program doesn’t intrude
when you read the code, but there are a few exceptions. There are a few places in the code where
code is bracketed with #ifdef UNIX
statements. This puts the burden on a user compiling under
UNIX or XENIX to define the macro UNIX either in the program or on the command line.
The main()
routine of CRCMAN has to first perform the initialization of the table used when
calculating CRC-32 values. This routine, found in BuildCRCTable()
, initializes all the values
in the array CRCTable[]
. These are used later in the program any time a CRC value is calculated.
Once main()
has built the CRC table, it next checks to see which mode the user has selected,
based solely on the number of arguments passed on the command line. If argc
is equal to 2,
main()
assumes that it has been invoked with a single file name as an argument, and it calls
CheckFiles()
. If argc
is equal to 4 and the first argument is -b
, main()
assumes it has been invoked to build a CRC file, and it callls BuildCRCFile()
. If neither of
these turns out to be true, a usage message is printed out and the program exits.
Building the CRC File
Of the two possible jobs given to this program, building the CRC file is the more complex. Both tasks have to calculate the CRC-32 values for one or more files, but building the file has the additional job of navigating through the directory tree. Complicating this even more is the fact that navigating the directory tree has to be done differently under UNIX and MS-DOS.
BuildCRCFile()
sets things up for the task by opening up the output file that is going to
receive all the file names and CRC values. It then makes a call to the routine that does all the
work, ProcessAllFiles()
. This routine takes two arguments, a path name and a crc
file FILE
pointer.
ProcessAllFiles()
has the same flow of control under UNIX, Xenix, and various MS-DOS compilers.
Unfortunately, the function names and structures needed to implement the loop vary quite a bit
between the various environments. The pseudo code for this routine looks like this:
The underlying O/S calls to implement this pseudo-code are different under MS-DOS and UNIX. UNIX
uses a pair of functions called opendir()
and readdir()
to open the directory and get the next
file name. Under MS-DOS, there are a pair of MS-DOS function calls named findfirst()
and findnext()
. To complicate things under MS-DOS, Borland has implemented these system calls
using different function and structure names than those selected by other vendors.
The result of all these variations is a routine that has been coded using a rat’s nest of #ifdef
statments and macro definitions. However, if you understand the underlying pseudo-code, the C is
not too bad. One nice thing about this particular function is that it provides an excellent
example of a job that is truly easier to do using recursion. Implementing this same function
without being able to use recursion would be considerably more difficult.
Examining the body of ProcessAllFiles()
shows that near the bottom of the routine a call is
made to CalculateCRCFile()
. This routine calculates the CRC-32 file for the file. The result is
then printed out along with the file name to the CRC log file. As ProcessAllFiles()
does its
work, a complete listing of the entire directory tree is built up, for later use by CRCMAN in its
checking mode.
Calculating the File CRC
Calculating the CRC-32 for a given file is a relatively easy matter. The CalculateFileCRC()
routine repeatedly reads in blocks of 512 bytes, and passes them through the
CalculateBufferCRC()
routine. CalculateBufferCRC()
takes as an argument the CRC of the file
up to that point, and returns the new CRC for the file so far. This process repeats until the
entire file has been processed.
When calculating the file CRC, this routine initializes the CRC value to all 1’s, or 0xFFFFFFFFL. After the file calculation has completed, the bits in the CRC are inverted by XOR’ing with the same value, 0xFFFFFFFFL. This pre- and post-conditioning is intended to provide additional error immunity, and is used by protocols such as Zmodem, as well as programs like PKZIP and ARJ. The CRC values produced by CRCMAN consequently match up with those you would see in the listing of a ZIP file containing the same list of files.
Checking the Files
The second mode of operation for this program is the CRC check. Most of the work here is done in
the CheckFiles()
routine. It gets to bypass the directory tree navigation, since all of the
file names it needs to check are already stored in the CRC log file. All this routine has to do
is repeatedly read in a line from the CRC log file containing a 32 bit CRC value and a file name.
It then calculates the actual CRC for the file, and reports on whether the stored and calculated
CRC values match up.
The simple ASCII format of this file makes for easy maintenance of the CRC log files. For example,
if the log file contains the CRC values for the directory tree containing your Borland C++
compiler, you will probably regularly get reports from CRCMAN that your configuration file,
TURBOC.CFG
, has changed. Since you know this file is supposed to change, it is a
simple enough matter to edit the log file and delete the line that refers to the file.
Using the Program
CRCMAN can be set up to provide a quick way to check the integrity of any or all of the files on
your system. By calling CRCMAN with the -b parameter for every directory full of
executables, you create a set of CRC log files that can be periodically checked with a single
call to CRCMAN. CRCMAN operates quickly enough that you can even include it as part of your
AUTOEXEC.BAT
file under MS-DOS, without letting it slow down your work too much.
CRCMAN could also be modified to provide rudimentary virus checking with just a few modifications. To make things a little tougher on the virus programmer, you would probably want to store the length of the file along with CRC-32, and perhaps even store the file time stamp. While even these measures don’t give you an iron-clad guarantee against a virus attack, they would probably detect the significant majority of infestations.
Ultimately, the best way to use CRCMAN would be as a concurrent process that runs continuously on your machine at low priority. This is fairly easy to implement under OS/2 or UNIX, but is somewhat problematic under MS-DOS.
References
C Programmer’s Guide to Serial Communications, Joe Campbell, Howard W. Sams & Company, Indianapolis, Indiana, 1988.
Ramabadran, Tenkasi V., and Gaitonde, Sunil S., A Tutorial on CRC Computations, IEEE Micro, August 1988, pp. 62-75.