Wednesday 1 May 2013

Let’s scrape the page using Python and BeautifulSoup

Web scraping is one of the easiest jobs you can find. It is quite useful because even if you don’t have access to a website’s database, you can still get the data out of the site by scraping it. For web data extraction we create a script that visits the websites and extracts the data you want, without any extra authentication, and with such scripts it is easy to get more data in less time from these websites.
I have always relied on Python for tasks like this, and here too there is a good third-party library: Beautiful Soup. The official site itself has good documentation, and it is clearly understandable. For those who don’t want to read that lengthy documentation and just want to try something using Beautiful Soup and Python, read this simple script with its explanation.
Task: Extract all U.S. university names and URLs from the University of Texas website in CSV (comma-separated values) format.
Dependencies: Python and Beautiful Soup

Script with explanation:

We have to use urllib2 to open the URL. Before we proceed further we should know this: web scraping is effective only if we can find the patterns a website uses for denoting content. For example, on the University of Texas site, if you view the source of the page you can see that all university names follow a common format, as shown in the screenshot below.
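The screenshot from the original post is not reproduced here. Going by the description in the text, the pattern it showed is an anchor tag carrying the CSS class ‘institution’; a hypothetical example of one such entry (the name and URL below are purely illustrative) would look like:

```
<a class="institution" href="http://www.example-university.edu/">Example University</a>
```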

Here we open the University of Texas page using urllib2.urlopen(url) and create a BeautifulSoup object with soup = BeautifulSoup(page.read()). Now we can manipulate the webpage using the methods of the soup object.
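The script itself only appeared as a screenshot in the original post. A minimal runnable sketch of this step, assuming Python 3 (where urllib2 became urllib.request) and the bs4 package, might look like the following; since the article does not state the exact URL, an inline HTML sample mimicking the ‘institution’ pattern stands in for the downloaded page:

```python
from bs4 import BeautifulSoup

# Stand-in for page.read(): in the original Python 2 script this came from
# urllib2.urlopen(url).read() on the University of Texas page.
# The sample entries below are illustrative, not scraped data.
html = """
<a class="institution" href="http://www.aamu.edu/">Alabama A&amp;M University</a>
<a class="institution" href="http://www.uaa.alaska.edu/">University of Alaska Anchorage</a>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("a")["href"])  # -> http://www.aamu.edu/
```

With a live page you would replace the sample string with the result of urllib.request.urlopen(url).read().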

Here we use the findAll method, which searches through the soup object for matching text, HTML tags, and their attributes, or anything else within the page. We know that each university name and URL follows a pattern: an ‘a’ tag with the CSS class ‘institution’.
That is why we use soup.findAll('a', {'class': 'institution'}). The CSS class ‘institution’ filters the search; if we simply called findAll('a'), the script would return all the links within the page. We could have done the same thing with a regular expression, but BeautifulSoup is better suited than a regexp in this case.
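A sketch of the difference the class filter makes, again with bs4 and an illustrative inline sample that mixes in one non-university link:

```python
from bs4 import BeautifulSoup

# Illustrative sample: two 'institution' links plus one unrelated link.
html = """
<a href="http://example.com/contact">Contact us</a>
<a class="institution" href="http://www.aamu.edu/">Alabama A&amp;M University</a>
<a class="institution" href="http://www.uaa.alaska.edu/">University of Alaska Anchorage</a>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.findAll("a")                               # every link on the page
universities = soup.findAll("a", {"class": "institution"})  # only the pattern we want

print(len(all_links), len(universities))  # -> 3 2
```

In current bs4 the method is named find_all; findAll is kept as a backward-compatible alias, so the article’s spelling still works.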

Here we traverse the list of universities. During each iteration of the loop, eachuniversity['href'] gives us the link to the university, because in the initial pattern we saw that the link to each university is in the ‘a’ tag’s href attribute; the name of the university is the string inside the ‘a’ tag, and that is why we use eachuniversity.string.
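Putting the loop together, here is one possible sketch of the final step that writes each university as a CSV line; the csv writer and the in-memory buffer are assumptions (the original screenshot may have built the output differently), and the sample data is illustrative:

```python
import csv
import io
from bs4 import BeautifulSoup

# Illustrative stand-in for the downloaded page.
html = """
<a class="institution" href="http://www.aamu.edu/">Alabama A&amp;M University</a>
<a class="institution" href="http://www.uaa.alaska.edu/">University of Alaska Anchorage</a>
"""
soup = BeautifulSoup(html, "html.parser")

output = io.StringIO()
writer = csv.writer(output)
for eachuniversity in soup.findAll("a", {"class": "institution"}):
    # Name from the tag's text, link from its href attribute.
    writer.writerow([eachuniversity.string, eachuniversity["href"]])

print(output.getvalue())
```

To write a real file instead, open it with open("universities.csv", "w", newline="") and pass that handle to csv.writer.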

Source: http://kochi-coders.com/2011/05/30/lets-scrape-the-page-using-python-beautifulsoup/

