Monday 27 May 2013



I run a website for a client on which they display a large database of information that they have gathered accurately and slowly over the years. They have started finding their data across the web in various places. More than likely it's due to a scraper going through their site page by page and extracting the information they need into a database of their own. And in case you're wondering, they know it's their data because of a single planted piece of data in each category on their site.

I've done a lot of research on this over the past couple of days, and I can tell you that there is no perfect catch-all solution. I have, however, found several things you can do to make the job a bit harder for scrapers. This is what I implemented for the client.
Ajaxified paginated data

If you have a lot of paginated data, and you paginate it by just appending a different number to the end of your URL, e.g. http://www.domain.com/category/programming/2, then you are making the crawler's job that much easier. The first problem is that the URLs follow an easily identifiable pattern, so setting a scraper loose on these pages is easy as pie. The second problem is that, regardless of the URLs of the subsequent pages in the category, there will more than likely be next and previous links for the scraper to latch on to.

By loading the paginated data through JavaScript, without a page reload, you significantly complicate the job for a lot of the scrapers out there. Google itself only recently started parsing JavaScript on the page. There is little disadvantage to loading the data like this. You provide a few fewer pages for Google to index, but, technically, paginated pages should all point to the root category page via canonicalization anyway. Ajaxify your paged data.
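
As a rough sketch of the server side of this, a small endpoint could return one page of results as JSON for your front-end script to fetch and render. The file name, table, and column names below are hypothetical, not from the client's site:

    <?php
    // ajax-items.php - hypothetical endpoint that returns one page of results
    // as JSON; the front-end fetches this instead of /category/programming/2.
    $perPage = 20;
    $page    = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;
    $offset  = ($page - 1) * $perPage;

    $pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT title, description FROM items LIMIT :lim OFFSET :off');
    $stmt->bindValue(':lim', $perPage, PDO::PARAM_INT);
    $stmt->bindValue(':off', $offset, PDO::PARAM_INT);
    $stmt->execute();

    header('Content-Type: application/json');
    echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));

Your category page then requests ajax-items.php?page=2 with JavaScript and swaps the results into the page, so there is no /2 URL for a scraper to walk.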
Randomize template output

Scrapers will often be customized slightly for your data specifically. They will latch on to a certain div id or class for the title, the 3rd cell in every row for your description, etc. There is an easily identifiable pattern for most scrapers to work with, since most data coming from the same table is displayed with the same template. Randomize your div ids and class names, and insert blank table columns at random with 0 width. Show your data in a table on one page, in styled divs on another, and in a combination on a third template. If you present your data predictably, it can be scraped predictably and accurately.
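
One way to sketch this (the class names and pools below are my own hypothetical example, not from the original post) is to pick each class from a pool of equivalents that your stylesheet treats identically, so the markup changes between requests but your own CSS keeps working:

    <?php
    // Hypothetical sketch: choose the title/description class names at random
    // from pools of equivalent classes (all styled the same in your CSS), so a
    // scraper cannot rely on a single selector. Assumes $row holds one record.
    $titleClasses = array('entry-title', 'item-heading', 'row-name');
    $descClasses  = array('entry-text', 'item-body', 'row-detail');

    $titleClass = $titleClasses[array_rand($titleClasses)];
    $descClass  = $descClasses[array_rand($descClasses)];
    ?>
    <div class="<?php echo $titleClass; ?>"><?php echo htmlspecialchars($row['title']); ?></div>
    <div class="<?php echo $descClass; ?>"><?php echo htmlspecialchars($row['description']); ?></div>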
HoneyPot

This is pretty neat in its simplicity. I’ve come across this method on several pages about preventing site scraping.

    Create a new file on your server called gotcha.html.
    In your robots.txt file, add the following:
    User-agent: *
    Disallow: /gotcha.html
    This tells all the robots and spiders out there indexing your site not to index the file gotcha.html. Any normal web crawler (Google and Bing, for example) will respect the wishes of your robots.txt file and not access that file. You may actually want to implement this step and wait 24 hours before going on to the next one, to make sure you don't accidentally block a legitimate crawler that was already mid-crawl when you updated your robots.txt file.
    Place a link to gotcha.html somewhere on your website. It doesn't matter where; I'd recommend the footer. Just make sure the link is not visible to users, by hiding it with CSS (display:none;).
    Now, log the IP and general information of the perp who visited this page and block them (a sketch of this is just after the list). Alternatively, you could come up with a script to provide them with incorrect and garbage data. Or maybe a nice personal message from you to them.
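
A minimal sketch of that last step, assuming you serve the honeypot as a PHP script (or rewrite gotcha.html to one) and keep a plain text file of banned IPs; the file paths here are hypothetical:

    <?php
    // gotcha.php - hypothetical honeypot handler. Log the visitor and add
    // their IP to a ban list that the rest of the site checks on each request.
    $ip = $_SERVER['REMOTE_ADDR'];
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'unknown';

    file_put_contents('/var/log/honeypot.log', date('c') . "\t$ip\t$ua\n", FILE_APPEND);
    file_put_contents('/var/www/banned-ips.txt', $ip . "\n", FILE_APPEND);

    // Then, near the top of your regular pages:
    // if (in_array($_SERVER['REMOTE_ADDR'], file('/var/www/banned-ips.txt', FILE_IGNORE_NEW_LINES))) {
    //     header('HTTP/1.1 403 Forbidden');
    //     exit;
    // }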

Regular web viewers won't be able to see the link, so it won't accidentally get clicked. Reputable crawlers (Google, for example) will respect the wishes of your robots.txt and not visit the file. So, the only computers that should stumble across this page are those with malicious intentions, or somebody viewing your source code and randomly clicking around (and oh well if that happens).

There are a couple of reasons this might not always work. First, a lot of scrapers don't function like normal web crawlers and don't just discover the data by following every link from every page on your site. Scrapers are often built to home in on certain pages and follow only certain structures. For example, a scraper might be started on a category page and then told only to visit URLs with the word /data in the slug. Second, if someone is running their scraper on the same network as others, on a shared IP, you will have to ban the whole network. You would have to have a very popular website indeed for this to be a problem.
Write data to images on the fly

Find a smaller field of data, not necessarily long strings of text, as those can make styling the page a bit more difficult. Output this data inside of an image; I feel quite confident there are methods in just about every programming language to write text to an image dynamically (in PHP, imagettftext). This is probably most effective with numerical values, since numbers carry much less SEO value anyway.
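
As a quick sketch with PHP's GD extension (the font path and dimensions are assumptions), a script like this could render a numeric value as a PNG, and your template would output an img tag pointing at it instead of the raw text:

    <?php
    // number-image.php - hypothetical script that renders a value as a PNG.
    $value = '1,234'; // in practice, look this up by a record id in $_GET['id']
    $font  = '/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf'; // assumed font path

    $img   = imagecreatetruecolor(80, 24);
    $white = imagecolorallocate($img, 255, 255, 255);
    $black = imagecolorallocate($img, 0, 0, 0);
    imagefilledrectangle($img, 0, 0, 79, 23, $white);
    imagettftext($img, 14, 0, 5, 18, $black, $font, $value);

    header('Content-Type: image/png');
    imagepng($img);
    imagedestroy($img);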
Alternative

This wasn't an option for this project, but you could require a login after a certain number of pageviews, or display only a limited amount of the data to users who aren't logged in. For example, if you have 10 columns, only display 5 of them to non-logged-in users.
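
A rough sketch of the column-limiting version, assuming a WordPress site (is_user_logged_in() is WordPress's own check; on anything else, substitute your session test) and a hypothetical set of column headings:

    <?php
    // Hypothetical sketch: show every column to logged-in users and only the
    // first five to everyone else.
    $columns = array('Name', 'Category', 'Price', 'Supplier', 'Stock',
                     'SKU', 'Weight', 'Origin', 'Rating', 'Updated');
    $visible = is_user_logged_in() ? $columns : array_slice($columns, 0, 5);
    ?>
    <tr>
    <?php foreach ($visible as $col) : ?>
        <th><?php echo htmlspecialchars($col); ?></th>
    <?php endforeach; ?>
    </tr>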
Don’t make this mistake

Don't bother trying to come up with some sort of solution based on the user-agent of the bot. That information can easily be spoofed by a scraper who knows what they're doing. The Googlebot user-agent, for example, can be easily emulated, and you more than likely don't want to ban Google.
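
To illustrate why, here is how trivially a scraper could claim to be Googlebot from PHP (a sketch; the URL is a placeholder):

    <?php
    // One line of configuration is all it takes to send Googlebot's user-agent.
    $context = stream_context_create(array('http' => array(
        'user_agent' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    )));
    echo file_get_contents('http://www.domain.com/category/programming/', false, $context);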


Source: http://www.techjunkie.com/preventing-site-scraping/
