Thursday 30 May 2013

How to scrape an entire website onto your hard drive

I got a new hard drive recently and decided to dump onto it some of those shady websites I know and love, for the purposes of building a sort of database that can be indexed and used to infer connections.

The easiest way to scrape a website is the command wget in the Terminal:

    wget -m --tries=5 "http://www.debka.com"

which will scan through DEBKA's site and drop all the files into your current directory, inside another directory named www.debka.com. wget digs out the links to every page and file, and based on all the links it sees, it will clone all the files it can find.

The --tries=5 ensures it won't get stuck forever on some jammed file. The -m signifies mirroring the site. wget is also helpful for getting files from places that download badly, using the -c (continue) flag to resume a partial download:

    wget -c http://www.whatever.com/fatfile.zip

wget can also spoof what browser it reports itself as to the server. Just use

    wget --user-agent="Mozilla 4.0" -m http://whatever.com

when you need to pretend to be another kind of browser, because some servers try to block wget.

DEBKAfile is a site run by crusty Israeli intelligence officers and their scheming friends. There are lots of stories about various figures in the Mideast world. I wouldn't believe everything on a site like that, but it is certainly useful to know what crusty Israeli intel types want to publicize about Mugniyeh and the other kingpins they like to prattle on about.

If you are interested in an organization or individual that might get discredited or scandalized, the information published on the site might be short-lived, so it is good to get dumps or scrapes of sites before they get scrubbed by Public Relations flacks. Like how all the congressmen deleted their pictures with Mark Foley.

So I am going to yank down some choice bits of the internet and see if the data-mining software I've got spits out anything interesting. Solid. It's all public stuff, though, just snooping through like any other search engine. I'm just cloning some sites to make my own little haystack and see if there are any needles.
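Once the mirror is sitting on disk, even a crude script can start sifting the haystack. Here's a minimal sketch in Python; the directory name and the keywords are just placeholders, not anything my actual data-mining software does:

    import os
    from collections import Counter

    # Walk the wget mirror and count keyword hits per file --
    # a crude first pass at indexing the haystack.
    mirror_dir = "www.debka.com"          # directory wget created
    keywords = ["mugniyeh", "hezbollah"]  # placeholder search terms

    hits = Counter()
    for root, dirs, files in os.walk(mirror_dir):
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(root, name)
            with open(path, errors="ignore") as f:
                text = f.read().lower()
            for kw in keywords:
                if kw in text:
                    hits[kw] += 1

    for kw, count in hits.most_common():
        print(kw, count)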


Source: http://www.hongpong.com/archives/2006/11/25/how_scrape_entire_website_your_hard_drive

Monday 27 May 2013

Development and Wordpress Tutorials


I run a website for a client where they display a large database of information that they have gathered accurately and slowly over the years. They have been finding their data across the web in various places. More than likely it’s due to a scraper going through their site page by page and extracting the information they need into a database of its own. And in case you’re wondering, they know it’s their data because of a single planted piece of data in each category on their site.

I’ve done a lot of research on this over the past couple of days, and I can tell you that there is no perfect catch-all solution. However, I have found several things you can do that make accomplishing this a bit harder for scrapers. This is what I implemented for the client.

Ajaxified paginated data

If you have a lot of paginated data, and you are paginating your data by just appending a different number on to the end of your URL, i.e. http://www.domain.com/category/programming/2 – then you are making the crawler’s job that much easier. First problem: it’s an easily identifiable pattern, so setting a scraper loose on these pages is easy as pie. Second problem: regardless of the URL of the subsequent pages in the category, more than likely there will be next and previous links for them to latch on to.

Loading the paginated data through JavaScript, without a page reload, significantly complicates the job for a lot of the scrapers out there. Google itself only recently started parsing JavaScript on the page. There is little disadvantage to loading the data like this. You provide a few fewer pages for Google to index, but, technically, paginated data should all be pointing to the root category page via canonicalization anyway. Ajaxify your paged pages of data.
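As a rough illustration of the idea (the client’s site is WordPress/PHP, but the concept is the same anywhere), here is a minimal sketch of a JSON endpoint that JavaScript on the page could call instead of exposing crawlable page URLs. Flask, the route name, and the data are assumptions for the sketch:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Stand-ins for the real database query and page size.
    ITEMS = [{"title": "Post %d" % i} for i in range(100)]
    PER_PAGE = 10

    @app.route("/api/items")
    def items():
        # Client-side JavaScript calls this with ?page=N and renders
        # the JSON in place, so no crawlable /category/xxx/2 URL exists.
        page = int(request.args.get("page", 1))
        start = (page - 1) * PER_PAGE
        return jsonify(items=ITEMS[start:start + PER_PAGE])

    if __name__ == "__main__":
        app.run()
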
Randomize template output

Scrapers will often be slightly customized for your data specifically. They will latch on to a certain div id or class for the title, the 3rd cell in every row for your description, etc. There is an easily identifiable pattern for most scrapers to work with, as most data coming from the same table is displayed by the same template. Randomize your div ids and class names, and insert blank table columns at random with 0 width. Show your data in a table on one page, in styled divs on another, and in a combination on a third template. If you present your data predictably, it can be scraped predictably and accurately.
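Here is one way the randomization might look, sketched in Python with made-up template strings and class names:

    import random

    # Several equivalent templates for the same record; a scraper
    # keyed to one div class or table layout breaks on the others.
    TEMPLATES = [
        '<div class="{cls}"><span>{title}</span></div>',
        '<table><tr><td></td><td class="{cls}">{title}</td></tr></table>',
        '<p class="{cls}"><b>{title}</b></p>',
    ]
    CLASS_NAMES = ["row-a3f", "entry-9xk", "cell-q71"]  # rotated at random

    def render(title):
        template = random.choice(TEMPLATES)
        return template.format(cls=random.choice(CLASS_NAMES), title=title)

    print(render("My Data Row"))
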
HoneyPot

This is pretty neat in its simplicity. I’ve come across this method on several pages about preventing site scraping.

    Create a new file on your server called gotcha.html.
    In your robots.txt file, add the following:
    User-agent: *
    Disallow: /gotcha.html
    This tells all the robots and spiders out there indexing your site to not index the file gotcha.html. Any normal web crawler will respect the wishes of your robots.txt file and not access that file. i.e., Google and Bing. You may actually want to implement this step, and wait 24 hours before going to the next step. This will ensure that a crawler doesn’t accidentally get blocked by you due to the fact that it was already mid-crawl when you updated your robots.txt file.
    Place a link to gotcha.html somewhere on your website. It doesn’t matter where; I’d recommend the footer. However, make sure this link is not visible: hide it with CSS, display:none;
    Now, log the IP/general information of the perp who visited this page and block them. Alternatively, you could come up with a script to provide them with incorrect and garbage data. Or maybe a nice personal message from you to them.

Regular web viewers won’t be able to see the link, so it won’t accidentally get clicked. Reputable crawlers(Google for example), will respect the wishes of your robots.txt and not visit the file. So, the only computers that should stumble across this page are those with malicious intentions, or somebody viewing your source code and randomly clicking around(and oh well if that happens).
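For illustration, here is a minimal sketch of that last logging step in Python. A real WordPress site would do this in PHP; Flask, the route, and the log file name are just placeholders:

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/gotcha.html")
    def gotcha():
        # Only a robots.txt-ignoring bot, or a human poking through
        # the source, should ever reach this hidden page.
        with open("banned_ips.log", "a") as log:
            log.write("%s %s\n" % (request.remote_addr,
                                   request.headers.get("User-Agent", "-")))
        # Serve garbage data, or nothing at all.
        return "Nothing to see here.", 404

    if __name__ == "__main__":
        app.run()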

There are a couple of reasons this might not always work. First, a lot of scrapers don’t function like normal web crawlers, and don’t just discover the data by following every link from every page on your site. Scrapers are often built to home in on certain pages and follow only certain structures. For example, a scraper might be started on a category page, and then told only to visit URLs with the word /data in the slug. Second, if someone is running their scraper on the same network as others, and there is a shared IP being used, you will have to ban the whole network. You would have to have a very popular website indeed for this to be a problem.

Write data to images on the fly

Find a smaller field of data, not necessarily long strings of text, as those can make styling the page a bit more difficult. Output this data inside of an image. I feel quite confident there are methods in just about every programming language to write text to an image dynamically (in PHP, imagettftext). This is probably most effective with numerical values, as numbers provide a much less significant SEO advantage.
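In PHP that’s imagettftext; as a sketch of the same idea in Python, assuming the Pillow imaging library:

    from PIL import Image, ImageDraw

    def number_as_image(value, path):
        # Render a small numeric field as a PNG so the digits never
        # appear in the HTML a scraper sees.
        img = Image.new("RGB", (60, 20), "white")
        ImageDraw.Draw(img).text((5, 5), str(value), fill="black")
        img.save(path)

    number_as_image(1499, "price.png")  # then: <img src="price.png">
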
Alternative

This wasn’t an option for this project: requiring a login after a certain number of pageviews, or displaying only a limited amount of the data without being logged in. I.e., if you have 10 columns, only display 5 to non-logged-in users.

Don’t make this mistake

Don’t bother trying to come up with some sort of solution based on the user-agent of the bot. This information can easily be spoofed by a scraper who knows what they’re doing. The Google bot, for example, can easily be emulated, and you more than likely don’t want to ban Google.


Source: http://www.techjunkie.com/preventing-site-scraping/

Friday 24 May 2013

Is Web Scraping Legal?

I had a friend get in touch with me a while back about the legalities of web scraping. He found, and I’m finding too, a tremendous lack of information about web scraping. I think this is a result of there being so many strange ramifications depending on the many variables in the facts of each situation. I got interested in the legal issues involved in web scraping, and so I put together a hypothetical to test some of them out.

I am going to reiterate the disclaimer in the legal notice of this website: this is not legal advice. The situation I describe here is incredibly specific and is the product of my imagination. There is almost no chance this situation is going to be the same as yours. In fact, the situation here isn’t even a complete (or real) one. I’m not going to spend the time to come up with a technologically-savvy hypothetical. This will have to do. Your situation is going to contain facts, details, and nuances different and exclusive from the one here. If you’re reading this for educational purposes, great – this should be a wonderful starting point to better inform yourself. Talking to a lawyer about your specific situation should be the next step in informing yourself.

The Hypothetical Situation:

My home ski area publishes the status of their lifts online. I develop a program that jumps onto the site, downloads the page to memory, scans that page for the lift status, uploads that status to a database, and then dumps all the data. With my iPhone I can then hit my new app which connects to the database on the server and grabs the data to display on my iPhone. Now I can see what the wait time is for a lift on the other side of the mountain while I’m skiing, or I can decide if I want to stay home for the day if I’m looking at the app off-mountain. What exactly are the consequences of doing this? Can I get in trouble for web scraping?
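For concreteness, the moving parts of that hypothetical might look something like this sketch in Python. The URL, the page format, and the "Lift 5: 12 min" pattern are all invented for illustration:

    import re
    import sqlite3
    import urllib.request

    # 1. Jump onto the site and download the page to memory.
    URL = "http://example-ski-area.com/lifts"   # invented URL
    html = urllib.request.urlopen(URL).read().decode("utf-8", "ignore")

    # 2. Scan the page for the lift status (invented format).
    statuses = re.findall(r"(Lift \d+):\s*(\d+) min", html)

    # 3. Upload the status to a database; the iPhone app reads this
    #    database instead of hitting the resort's site directly.
    db = sqlite3.connect("lifts.db")
    db.execute("CREATE TABLE IF NOT EXISTS waits (lift TEXT, minutes INT)")
    db.execute("DELETE FROM waits")   # 4. dump the old data
    db.executemany("INSERT INTO waits VALUES (?, ?)", statuses)
    db.commit()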

Web scraping brings many possible areas of liability into focus. It can potentially implicate contracts, copyright, trademark, patent, internet law, various federal statutes, and some other areas, too. Let’s hit them one by one.

1. Contract: Terms of Use

Perhaps the easiest and most straightforward analysis. Terms of use, terms and conditions, end user license agreements – whatever the agreement may be called, you often agree to use a site according to its terms when you access and stay on the site. Courts routinely uphold terms of use despite the fact that you’ve probably never read them.

Terms of use – and thus violation of those terms – can be highly unique. What is allowed on one site may be prohibited on another site but may be permitted only to a limited extent on still another site. It is almost impossible to draw any sort of conclusion about whether web scraping will violate terms of use without having the site’s terms available. My home ski area actually doesn’t have any terms of use, for some reason, so this shouldn’t be a problem. However, if the terms of use said something like: “the site is only to be used for personal use” or “reproduction or display of any material from the site is prohibited” or “scraping content from this site is not allowed,” I’m probably in hot water.

2. Copyright

Copyright can be troublesome for this app. Information on a website can be protected by copyright. Copyright protection exists in creative material fixed in some semi-permanent medium. However, copyright protection does not extend to facts because they aren’t considered creative. This creative threshold is quite low, but facts don’t pass it; creative arrangements of facts, however, can qualify for protection. My home ski area’s website has a good deal of information on it. The lift status is displayed in a few ways: with a green/yellow/red icon next to the lift’s name, with wait time next to the lift’s name, or with a green/yellow/red line superimposed over the lift’s route on the trail map. However, that is merely the display. The data scraped is just that: raw data. Raw facts. Most likely, copyright protection does not extend to this, so I should be in the clear for the data itself.

However, I’m clear only if my app scrapes just the data. If it loads the entire page, culls the code for the data it needs, and discards the rest, a temporary copy has been made of the page. The page is almost certainly protected by copyright, and courts have found that even a temporary copy stored in RAM is a sufficiently permanent copy such that it can lead to infringement. So, the app may be infringing the ski area’s protection of the webpage that contains the lift data, even though I’m just trying to grab the data itself.

3. Trademark

Trademark law shouldn’t be much of a problem. Trademark law protects the public from becoming confused about the source of a product. My app will obviously display the name of the ski area so the user can look up a resort by name and find its lift wait times. The display of that name must not create the appearance that the mountain is sponsoring the app. This shouldn’t be too hard. It is necessary to use the name in the app, but it can be done carefully: stating something like “lift times at X Ski Area” should be sufficient to avoid a likelihood that a consumer would be confused by the use of the name. Something like “lift times at X Ski Area, provided by MyiPhoneAppName” would be even clearer. A disclaimer somewhere would be an additional safeguard against consumer confusion.

4. Patent

Patent infringement can be a tricky area. Patent owners can exclude anyone from making, using, or selling their technology. By accessing the site and interacting with the data, the app would arguably be using the technology. It is really difficult to analyze whether this app would infringe any patents without knowing exactly what the ski resort has patented (or licensed). Typically, software isn’t much of a patent-heavy industry, because it changes so rapidly that the time and money necessary to file and procure a patent just isn’t worth it. Further, a lot of ski resorts (and probably other places with online wait times) have this similar feature, which means either everyone is licensing it (doubtful), websites are stealing it (also doubtful), or the technology is in the public domain (most likely). I would venture to guess that my ski area doesn’t have a patent on the technology involved in the lift status display, but you never know.

Now, there may be other apps out there that use similar technology. My app could possibly be stepping on their patents if they have any. But do they have any? Again, hard to say. Only a really thorough freedom-to-operate opinion could tell me whether anyone has a patent on this technology and, if so, whether my app infringes it. Most likely though, because of the short-lived effective life of software and iPhone apps, there probably isn’t a patent on this sort of technology. If the technology has been around for a while, the chance that it is patented is even smaller.

5. Trespass to Chattels

Trespass to chattels is a physical-world legal wrong that has been adapted to the internet. In the tangible world, trespass to chattels is interference with someone’s personal property – trespassing on their stuff. The theory has been successfully applied to spammers, with ISPs claiming that the volume of spam ate up their bandwidth, reduced the quality of their service, and ultimately risked their business. The law has also been applied against bots that crawl sites looking for information, where those bots occupied only a small percentage of the site’s bandwidth but the risk of increased usage was feared. However, in California, where most of this law arises, the theory has been trimmed significantly, and actual damage or impairment is now required.

Central to the question of whether my app risks trespass to chattels is the coding. If the app has to jump onto the ski resort’s site every time to download information, then I risk having thousands of iPhones querying the site every day during the winter. The aggregated traffic from all these apps could cause some degradation of the site. However, if the app communicates instead with a central database, as described in the hypothetical, then the load on the site is reduced. Instead of having thousands of queries from thousands of iPhones, the site is touched only by one database several times a day, and the iPhones get all the information they need from the database without burdening the ski resort’s site.

6. Computer Fraud and Abuse Act

The Computer Fraud and Abuse Act (“CFAA”) is a federal statute that imposes civil liability where someone or something accesses a computer without authorization, or accesses a computer in a manner that exceeds the authorization that it did have. For example, if you hack into a database on a server that you were never given access to, you can be liable. If you had access to the server, but not the database, you’ve exceeded your authorized access, and can still be liable. Of course, you’re only liable if there is resulting loss or damage, but this is generally easy to find. There must be $5,000 in damage, and it can come in the form of lost revenue, repair costs, damage assessments, impairment to data, or costs of responding to the unauthorized access. The breadth of the types of damages, and the relative ease with which they can be shown (hire an IT guy to mull over your system, hire an attorney to respond to the hacker, etc.), make this element an easy one to satisfy.

The ski area gives people access to its site, of course – it wants people to visit, see the lift waits, and then come to the mountain. Those are people, though – not bots. Whether a bot has access may depend on the Terms of Use of the site and also the robots.txt file. And, even if access to the webpage is given, the ski area would almost certainly argue that scraping the data exceeded any access that was authorized. This is a big hurdle to overcome for my app.

7. Digital Millennium Copyright Act

The Digital Millennium Copyright Act (“DMCA”) is a controversial law that many see as an unnecessary clamp-down on fair use rights. The DMCA is designed to give copyright owners greater protection of their digital content. The DMCA creates liability for working around technological measures that protect copyrighted works (or trafficking in products that do so). For example, if you crack an RSA key to access someone’s computer and copy documents on it, you’ve not only committed copyright infringement, but you’ve also violated the DMCA for circumventing the protection that was blocking your access to the copyrighted work. A recent case has changed the law slightly, noting that the DMCA only prevents you from circumventing technological measures protecting copyrighted work and copying that work; if you circumvent the technology but only access the work, there is no DMCA liability. The case is incredibly new and there will probably be some fallout from it across the country. After all, merely “accessing” work online still necessarily requires a RAM copy to be made, and other courts have found that a RAM copy is sufficient to find copyright infringement.

My iPhone app probably doesn’t run afoul of the DMCA, though. The app doesn’t work around any technological measures protecting the lift status or the webpage. The web page source code can be viewed and scraped without bypassing any security measures. Therefore, the DMCA is probably not a problem. If, however, the lift status were hidden behind a CAPTCHA code, this would bring the activity under the DMCA.

So, it looks like my app has a couple of problems. Of course, there are some factors that balance in my favor. Does the ski area want to sue me, a skier, a customer, and a developer of a helpful iPhone app? If they do, they’ll have to spend some pricey legal fees, and they also risk the possibility that the public gets upset about a ski area suing its customers. It generally doesn’t fit with a ski area’s image.

Source: http://www.galvanilegal.com/is-web-scraping-legal

Friday 17 May 2013

Give An Opportunity To Website Data Scraping

For any company or organization, surveys and market research play an important role in strategic decision-making. Data mining and web scraping techniques are essential tools for gathering the knowledge and information needed for personal or business use. Many companies employ people to manually copy and paste data from web pages. This leads to wasted time and effort, since it is extremely expensive, even if highly reliable.

The extracted data is stored in the required form in a CSV file, a database, an XML file, or any other destination. Once the data is collected and stored, the data mining process can extract hidden patterns and trends from it. Data can also be stored for later use.

Here are some common examples of the process of data collection:
• Scraping competitors’ websites for pricing and product data
• Scraping pictures, videos and images from websites for use in web site design

Automatic Data Collection

Here are a few examples of automated data collection:
• Monitoring and storing price information
• Collecting mortgage rates from various financial institutions on a daily basis
• Regularly checking the weather forecast

The collected data can then be downloaded into a spreadsheet or database for analysis and comparison.
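As a sketch of the mortgage-rate example above (the URL and the rate pattern are invented), the daily collection step can be as simple as appending one row to a CSV file:

    import csv
    import datetime
    import re
    import urllib.request

    # Invented source; each day's rate is appended as one CSV row.
    URL = "http://example-bank.com/rates"
    html = urllib.request.urlopen(URL).read().decode("utf-8", "ignore")
    rate = re.search(r"(\d+\.\d+)\s*%", html).group(1)

    with open("rates.csv", "a", newline="") as f:
        csv.writer(f).writerow([datetime.date.today().isoformat(), rate])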

With data mining services, it is possible to get consistent information about competitors: prices, shipping, databases, profile data, and more.

Different technologies and processes designed to collect and analyze data have evolved over time. Web scraping, which has recently come onto the market as a business, is one of these processes.

Whether scraping data at this scale is clean and legal is debated. Many people consider the techniques unsavory; their main argument is that, as the practice grows over time, it can amount to something close to plagiarism.

Web scraping, then, can be defined as the process of collecting information from a wide range of websites and databases. The processing can be done either manually or with software, and its use in web mining, data mining and web crawling has increased. It cuts back on other important tasks while the business data is collected and analyzed. One of the most important qualities of these companies is that they must be experts in the service.

Some of the common techniques of web scraping are web crawling, text extraction, DOM parsing, and matching patterns in HTML pages or labeling their meaning.

The main issue to touch on is the importance of web scraping. Is the process important to business? The answer is yes. The number of the largest companies in the world that make use of it says it all.

Web scraping is highly recommended for extracting information from the internet for competition analysis. If you work in a specific market, it helps ensure that you stay on top of the patterns and trends of that market.

Source: http://data.ezinemark.com/give-an-opportunity-to-website-data-scraping-7d3895fd825a.html

Monday 6 May 2013

Scraping

Scraping, or "web scraping," is the process of extracting large amounts of information from a website. This may involve downloading several web pages or the entire site. The downloaded content may include just the text from the pages, the full HTML, or both the HTML and images from each page.

There are many different methods of scraping a website. The most basic is manually downloading web pages. This can be done by either copying and pasting the content from each page into a text editor or using your browser's File → Save As… command to save local copies of individual pages. Scraping can also be done automatically using web scraping software. This is the most common way to download a large number of pages from a website. In some cases, bots can be used to scrape a website at regular intervals.
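The programmatic equivalent of File → Save As is a one-liner in most languages, and a bot that repeats it at regular intervals is barely longer. A minimal sketch in Python, with a placeholder URL and interval:

    import time
    import urllib.request

    # Save a local copy of the page once an hour, like a tiny bot.
    while True:
        urllib.request.urlretrieve("http://example.com/page.html",
                                   "page-%d.html" % int(time.time()))
        time.sleep(3600)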

Web scraping may be done for several different purposes. For instance, you may want to archive a section of a website for offline access. By downloading several pages to your computer, you can read them at a later time without being connected to the Internet. Web developers sometimes scrape their own websites when testing for broken links and images within each page. Scraping can also be done for unlawful purposes, such as copying a website and republishing it under a different name. This type of scraping is viewed as a copyright violation and can lead to legal prosecution.

NOTE: While scraping a website for the purpose of republishing information is always wrong, scraping a site for other purposes may still violate the website's terms of use. Therefore, you should always read a website's terms of use before downloading content from the site.

Source: http://www.techterms.com/definition/scraping

Wednesday 1 May 2013

Is Data Scraping Unethical?

Perhaps the biggest challenge that website owners face, in addition to attracting visitors, is coming up with original content to publish on their websites.

Search engines are ravenously hungry creatures. They are constantly scraping the web, seeking content that they can add to their index, and if your site publishes good quality original content, the chances are very good that you will receive a higher ranking on the SERPs.

The process may not be as simple as it sounds, as there are perhaps millions of competitors in the same area, who may be competing for ranking on the same keyword(s).

Because of the challenges posed by the very time-consuming and labor-intensive tasks of continually creating and publishing original content, website owners may often seek shortcuts or use methods and applications that are frowned on by the search engines.

In order to remain competitive, another of the tasks with which website owners are faced is keeping an eye on the competition. You need to know what the competition is up to, and you need to be able to react to it, or else you can easily get left behind. One of the ways that you can do this is by developing applications that focus on data scraping. Obtaining the data may be harmless, but how it is used is where the questions often arise.

While the practice may appear to be fairly innocuous and can be useful, there are several instances where it may be questioned.

There is now a rapidly expanding industry for data mining. Reports are that we now create more data every day than we did over the last two decades, and the market for data mining continues to expand exponentially. Marketers are constantly scraping the web to build profiles of consumers, and we may be making it easier for them by leaving trails that they can easily follow.

It may be a bit disconcerting to know that every website you ever visit is actually logged, and that the data can be used to build a profile of your habits. Many users may find it intrusive that information that should be considered private is now available for public consumption.

Scraping can involve not only personal data, but also your buying behaviour as well as customary habits or hobbies. All of your online activities can be tracked, and although it may be stated otherwise, there are ways that your data can be shared with third parties without contravening any laws.

Detailed information such as cell-phone numbers, email addresses and even your posts on the social networks can easily be collected, tracked and analysed.

There is considerable debate as to the ownership of data that is posted on the social networks. To whom does it really belong, and who should be allowed access to it?

It is also not surprising that media outlets are using data scraping methods by employing what are called listening devices to monitor what is being said on the social networks in real time. It is one of the ways that they can observe what is being said about specific organisations, products or people.

The debate is sure to continue, but there is no doubt that it can be useful.

Source: http://www.twm.co.nz/is-data-scraping-unethical/

Note:

Rose Marley is an experienced web scraping consultant and writes articles on data scraping services, web data scraping, web scrapers, website scraping, eBay product scraping, Forms Data Entry, etc.

Let’s scrape the page using Python BeautifulSoup

Web scraping is the easiest job you can ever find. It is quite useful because even if you don’t have access to a website’s database, you can still get the data out using web scraping. For web data extraction we create a script that visits the websites and extracts the data you want, without any extra authentication, and with these scripts it’s easy to get more data in less time.
I have always relied on Python for tasks like this, and here too there is a good third-party library, Beautiful Soup. The official site has good documentation, and it is clearly understandable. For those who don’t want to read through all of it and just want to try something using Beautiful Soup and Python, here is a simple script with an explanation.
Task: Extract all U.S. university names and URLs from the University of Texas website in CSV (comma-separated values) format.
Dependencies: Python and Beautiful Soup

Script with explanation:

We have to use urllib2 to open the URL. Before we proceed further we should know this: web scraping will be effective only if we can find patterns the website uses for denoting content. For example, on the University of Texas website, if you view the source of the page you can see that all university names share a common format: each one is an anchor tag with the CSS class ‘institution’.
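Putting those pieces together, the whole script is only a few lines. This sketch follows the Python 2 / BeautifulSoup 3 style the post describes; the listing URL is an assumption and may have moved:

    import csv
    import urllib2
    from BeautifulSoup import BeautifulSoup

    # Assumed listing page; every university is an <a class="institution">.
    url = "http://www.utexas.edu/world/univ/alpha/"
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())

    writer = csv.writer(open("universities.csv", "w"))
    for eachuniversity in soup.findAll("a", {"class": "institution"}):
        # Name is the string inside the tag, link is its href attribute.
        writer.writerow([eachuniversity.string, eachuniversity["href"]])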

Here we opened the University of Texas page using urllib2.urlopen(url) and created a BeautifulSoup object using soup = BeautifulSoup(page.read()). Now we can manipulate the webpage using the methods of the soup object.

Here we used the findAll method, which searches through the soup object to match text, HTML tags and their attributes, or anything else within the page. We know that each university name and URL follows a pattern: an ‘a’ tag with the CSS class ‘institution’. That is why we used soup.findAll('a', {'class': 'institution'}). The CSS class ‘institution’ filters the search; if we simply gave findAll('a'), the script would have returned every link on the page. We could have done the same thing with regular expressions, but BeautifulSoup is the better tool in this case.

Here we traversed through the list of universities. During execution of the loop, eachuniversity['href'] gives us the link to the university, because in the initial pattern we saw that the link to each university is within the ‘a’ tag’s href attribute, and the name of the university is the string inside the ‘a’ tag, which is why we used eachuniversity.string.

Source: http://kochi-coders.com/2011/05/30/lets-scrape-the-page-using-python-beautifulsoup/
