Saturday, November 24, 2007

Attack of the Blog Scrapers

Like most bloggers, I try to keep track of who's linking to us here at The Opinionated Marketers. And in the last few weeks, the number of "blog scrapers" - spam blogs that lift our content, run it in whole or in part, sometimes with a link and sometimes not, usually with incorrect or no attribution at all - seems to have increased dramatically. The same thing has happened with my personal blog (puppy pictures, random notes and rants, etc. - if you're really curious I'll send you a link) and a news and politics blog I write for the Houston Chronicle.

Here's an example: Maureen wrote a post here recently on using church bulletins as marketing vehicles. Suddenly, this appears: something that looks (slightly) like someone saying "Hey, interesting post," except that this "blog" clearly isn't about anything at all and Maureen's name has suddenly become "Robert."

Every day my feeds of Google Blog Search and Technorati searches are crammed with this stuff.

I understand what these people are doing: they're gathering up content from all over the place and throwing it onto their own site as search engine bait, and then running Google AdSense ads on the page to make money off of ad clicks.

This is a problem in several ways. First of all, this kind of garbage makes blogs, blog search engines, and the web in general less useful for everybody. Second, as a content creator, I don't want to see my content being stolen and used as part of someone's spam scheme.

But what can be done? You could identify the host of each of these sites and send them a complaint asking them to remove the material; when I worked for a web hosting company, we actually did that when someone demonstrated that our customers were reproducing content they didn't own. Of course, we were a legitimate company, and I'm guessing when one of these sites is hosted by some tiny company in another country, it's unlikely that anything will happen. Google has a form for you to complain about advertiser behavior; if you complain about one of their advertisers stealing content, their response is that they will tell the advertiser. Google, I think they know that already.

This is Google's usual reaction to anything related to copyrights: "Leave us out of it, please." It's the wrong approach, because the more these spam blogs turn up in search results, the more useless the search tools become; Technorati seems to be accelerating toward an advanced state of uselessness already, and if Google doesn't address these issues, Google Blog Search will not be far behind.

Have any of you come across this? Is there anything to be done about it? Or do we all need to accept that when our content is out there in digital form, it's going to get stolen?

2 comments:

Boris said...

You raise an excellent point, John. And, frankly, I'm surprised that the major search engines aren't doing more to control the amount of content duplication that is going on. I don't think it would be that difficult to cut down on the duplication. But, for some reason, few of the engines are doing anything about it.

That said, I think you're being unnecessarily hard on Technorati. When I've used blog search at our search tool, Zuula, I've actually been impressed at the extent to which Technorati's results are relatively free of duplicate content.

Google's blog search results, in contrast, are typically very full of duplicate content, particularly when you re-order the results so the most recent results are displayed first.

Still, you raise an important point, and I hope the search engines start doing more to address it.

Des said...

When I wrote for a while for one of the blog networks, I used to flick these to the network managers and presumably they would deal with them.

On my own, I would not send a cease and desist notice - on the basis that the non-lawyer who seeks to represent him/herself has "a fool for an attorney and a damn fool for a client".

I agree that head in the sand is not a smart position for the search engines to adopt on this one.