Nobody likes getting ripped off, and I’m no exception. I search the web from time to time to see who’s copying my stuff, and it’s always a little disheartening.

This week I ran a check to see who was copying my 20-year-old LZW Compression article. Mind you, I’m not talking about isolated quotes taken without attribution; for the most part I’m looking for people who have posted a wholesale copy of the article - a complete rip-off. Looking through the top 25 hits yields some interesting statistics:

  • About 30% of the people who copy my work are university faculty. They assign the article as reading for a class, and instead of simply posting a link, they scrape the article off the web and post a private copy.
  • Another 40% are people who are blatantly plagiarizing - they’ve incorporated my work into a paper or thesis. Unfortunately for them, Google now crawls PDF and PostScript files, which makes detection pretty easy.
  • The remainder are blogging programmers who, for some reason, delight in taking my article and posting it on their site, reformatted and unattributed, but often with my name and contact information still intact.

Taking Action

Finding these rip-off artists is easy, but getting the stolen material removed from the web is another matter. In the cases where I can clearly identify a person who owns the site, I usually start with a friendly email. Maybe 25% of the time this works, but the typical response is dead silence.

When the informal methods fail, the next step is the formal takedown notice. In the United States, web publishers enjoy protection from claims of copyright infringement under the Online Copyright Infringement Liability Limitation Act if they register a copyright agent who handles complaints, and if they respond to those complaints in a timely fashion.

This means that a site like Blogger.com, owned by Google, provides a formal mechanism for handling notices. When I can’t find a link to an abuse agent, I use the WHOIS database to find the hosting service, and send an email to their address. This generally works pretty well. For example, Scribd responded to my requests within a matter of hours, and it generally assumes that my complaints are legitimate unless the poster of the material puts up a decent defense.
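For readers who want to script the WHOIS step, here is a minimal sketch rather than the exact procedure described above. It assumes Python, the plain-text WHOIS protocol on TCP port 43, and `whois.iana.org` as a starting server. The helper names `whois_query` and `abuse_contacts` are made up for illustration, and keeping only lines that mention "abuse" is just a heuristic, since every registry formats its replies differently.

```python
import re
import socket
from typing import List


def whois_query(query: str, server: str = "whois.iana.org", timeout: float = 10.0) -> str:
    """Send one WHOIS query (plain text over TCP port 43) and return the raw reply.

    The default server is an assumption; a real lookup often has to follow
    the referral in the reply to a registry- or registrar-specific server.
    """
    with socket.create_connection((server, 43), timeout=timeout) as sock:
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


# Deliberately loose pattern: good enough to pull addresses out of WHOIS text.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def abuse_contacts(whois_text: str) -> List[str]:
    """Return lower-cased e-mail addresses from lines mentioning 'abuse', without duplicates."""
    found: List[str] = []
    for line in whois_text.splitlines():
        if "abuse" in line.lower():
            for addr in EMAIL_RE.findall(line):
                addr = addr.lower()
                if addr not in found:
                    found.append(addr)
    return found
```

Calling `abuse_contacts(whois_query("example.com"))` would be the typical use. To reach the hosting service rather than the domain registrar, the query usually has to be the site's IP address instead of its domain name.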

Things aren’t always so simple, though. Just as an example, CiteSeer, a very popular database of academic publications, has a cached copy of a stolen article that their crawler found. In their FAQ, under the question “How can I remove a copy of my article from your database?”, they give this unhelpful tidbit:

Papers within CiteSeerX corpus are crawled from the web. The only reason a papers of yours is in the CiteSeerX database is because it was/is available from the web.

No kidding. And this helps me remove your illegal copy how?

The Tough Cases

With enough perseverance, I’m usually able to remove a large percentage of the illegal copies. But some problems remain intractable. Overseas servers in countries where English is not widely spoken are particularly difficult. I could certainly sue Baidu.com in Federal Court, but I have a feeling that wouldn’t get me very far.

Even when I don’t succeed, there is some entertainment value in the excuses. Today I got an email from a gentleman in India who incorporated my work in a paper published in a peer-reviewed journal. He told me that he would work on taking it out, but right now he is busy taking care of his mother, who is in poor health. He hopes I will be patient.

Patient I will remain. Not like I have a choice.