Thursday 30 May 2013

How to scrape an entire website onto your hard drive

I got a new hard drive recently and decided to dump onto it some of those shady websites I know and love, for the purposes of building a sort of database that can be indexed and mined for connections.

The easiest way to scrape a website is the command wget in the Terminal:

    wget -m --tries=5 "http://www.debka.com"

which will crawl DEBKA's site and drop all the files into your current directory, inside a subdirectory named www.debka.com. wget digs out the links on every page it fetches and follows them, cloning every file it can find.

The --tries=5 keeps it from getting stuck retrying some jammed file forever. The -m flag signifies mirroring the site: it turns on recursive download with timestamping and no depth limit. wget is also helpful for getting files from places that download badly, using

    wget -c http://www.whatever.com/fatfile.zip

wget can also spoof the browser it reports itself as to the server. Just use

    wget --user-agent="Mozilla 4.0" -m http://whatever.com

when you need to pretend to be another kind of browser, because some servers try to block wget.
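Putting the pieces above together, here is a sketch of one combined, politer mirror command. The target URL is a placeholder, and --wait, --convert-links, and --page-requisites are standard GNU wget options; the command is built into a variable and printed so you can inspect it before running it.

```shell
#!/bin/sh
# Sketch: one combined mirror command (www.example.com is a placeholder).
# --wait=1 pauses a second between requests so you don't hammer the server;
# --convert-links rewrites links so the local copy browses offline;
# --page-requisites grabs the images and CSS each page needs.
TARGET="http://www.example.com"
CMD="wget --mirror --convert-links --page-requisites --wait=1 --tries=5 --user-agent='Mozilla 4.0' $TARGET"
echo "$CMD"   # printed for inspection; run it for real with: eval "$CMD"
```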

DEBKAfile is a site run by crusty Israeli intelligence officers and their scheming friends. There are lots of stories about various figures in the Mideast world. I certainly wouldn't believe everything on a site like that, but it is useful to know what crusty Israeli intel types want to publicize about Mugniyeh and the other kingpins they like to prattle on about.

If you are interested in an organization or individual that might get discredited or scandalized, the information published on the site might be short-lived, so it is good to get dumps or scrapes of sites before they get scrubbed by public relations flacks. Like how all those congressmen deleted their pictures with Mark Foley.

So I am going to yank down some choice bits of the internet and see if the data-mining software I've got spits out anything interesting. Solid. It's all public stuff, though, just snooping through like any other search engine. I'm just cloning some sites to make my own little haystack and see if there are any needles.


Source: http://www.hongpong.com/archives/2006/11/25/how_scrape_entire_website_your_hard_drive
