Scraping A Website

Friday, 22 August 2014

What is the right way of storing screen-scraping data?

I'm working on a web site. It scrapes product details (names, features, prices, etc.) from various web sites, then processes and displays them. I'm considering running an update script every day to keep the data fresh.

    scrape data
    process them
    store on database
    read(from db) and display them

I'm already storing all the data in a SQL schema, but I'm not sure that is the right approach. After each update, all the old records vanish. If the newly scraped data comes in corrupted somehow, there is nothing to show.

So, is there a common way to archive the old data? Which is more convenient: separate SQL schemas or XML files? Or something else?
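
One possible approach, sketched here under assumptions rather than taken from an answer: keep every scrape run as a timestamped snapshot in the same database instead of overwriting, and always display the newest run that passed a sanity check. The table name `product_snapshots`, its columns, and the "every product has a price" check are hypothetical choices for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def init_db(conn):
    # One row per product per scrape run; old runs are never overwritten.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS product_snapshots (
               run_at TEXT NOT NULL,
               name   TEXT NOT NULL,
               price  REAL NOT NULL,
               PRIMARY KEY (run_at, name))"""
    )


def store_run(conn, products, run_at=None):
    """Store one scrape run as a new snapshot; reject runs that look corrupted."""
    # Hypothetical sanity check: an empty run or a missing price counts as corrupted.
    if not products or any(p.get("price") is None for p in products):
        return False  # nothing is written, so the previous snapshot stays current
    run_at = run_at or datetime.now(timezone.utc).isoformat()
    with conn:  # single transaction: either every row is stored or none
        conn.executemany(
            "INSERT INTO product_snapshots (run_at, name, price) VALUES (?, ?, ?)",
            [(run_at, p["name"], p["price"]) for p in products],
        )
    return True


def latest_products(conn):
    """Return (name, price) rows from the most recent stored run."""
    return conn.execute(
        "SELECT name, price FROM product_snapshots "
        "WHERE run_at = (SELECT MAX(run_at) FROM product_snapshots) "
        "ORDER BY name"
    ).fetchall()
```

With this layout a bad run is simply skipped, and older snapshots can be pruned on whatever schedule suits the site.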

Source: http://stackoverflow.com/questions/13686474/what-is-the-right-way-of-storing-screen-scraping-data

Scraping dynamic data

I am scraping profiles on ask.fm for a research question. The problem is that only the most recent questions are viewable, and I have to click "view more" to see the next 15.

The source code for clicking view more looks like this:

<input class="submit-button-more submit-button-more-active" name="commit" onclick="return Forms.More.allowSubmit(this)" type="submit" value="View more" />

What is an easy way of calling this 4 times before scraping the page? I want the most recent 60 posts on the site. Python is preferable.

You could probably use selenium to browse to the website and click on the button/link a few times. You can get that here:

 https://pypi.python.org/pypi/selenium

Or you might be able to do it with mechanize:

 http://wwwsearch.sourceforge.net/mechanize/

I have also heard good things about twill, but never used it myself:

 http://twill.idyll.org/
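
A minimal sketch of the Selenium route, assuming the button can be located by the `submit-button-more` class from the markup quoted in the question. The helper name `click_view_more` and the fixed pause are illustrative choices. The string `"css selector"` is the value of Selenium's `By.CSS_SELECTOR` constant, used directly so the helper works with any driver object passed in.

```python
import time


def click_view_more(driver, times=4, pause=2.0):
    """Click the "View more" button up to `times` times; return the clicks made."""
    clicks = 0
    for _ in range(times):
        buttons = driver.find_elements("css selector", "input.submit-button-more")
        if not buttons:
            break  # button is gone: nothing more to load
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # crude wait for the next batch of posts to load
    return clicks


# Usage (needs Selenium and a browser driver installed):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.get(profile_url)        # profile_url is a placeholder for the page to scrape
#   click_view_more(driver)
#   html = driver.page_source      # hand this to your parser
#   driver.quit()
```

A fixed sleep is the simplest way to wait; Selenium's explicit waits are more robust if the page loads slowly.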

Source: http://stackoverflow.com/questions/19437782/scraping-dynamic-data

Thursday, 21 August 2014

Web Scraping data from different sites

I am looking for a few ideas on how to solve a design problem I'm going to face when building a web scraper that scrapes multiple sites. Writing the scraper(s) is not the problem; matching the data from different sites (which may have small differences) is.

For the sake of being generic assume that I am scraping something like this from two or more different sites:

    public class Data {
        public int id;
        public String firstname;
        public String surname;
        ....
    }

If I scrape this from two different sites, I will encounter the situation where I could have the following:

Site A: id=100, firstname=William, surname=Doe

Site B: id=1974, firstname=Bill, surname=Doe

Essentially, I would like to consider these two sets of data the same (they are the same person but with their name slightly different on each site). I am looking for possible design solutions that can handle this.

The only idea I've come up with is scraping the data from a third location and using it as a reference list. Then when I scrape site A or B I can, over time, build up a list of failures and store them in a list for each scraper so that it knows how to resolve them (if I find id=100 then I know that the firstname will be William, etc.). I can't help but feel this is a rubbish idea!

If you need any more info, or if you think my description is a bit naff, let me know!

Thanks,

DMcB
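
One possible starting point, offered as a sketch and not as an answer from the original thread: normalize each name to a canonical form before comparing, so that "Bill Doe" and "William Doe" produce the same key. The `NICKNAMES` table is a tiny illustrative sample and the function names are hypothetical. Two different people can share a name, so a real system would compare more fields before merging records.

```python
# Illustrative nickname table; a real one would be much larger.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert"}


def match_key(firstname, surname):
    """Build a comparison key: canonical first name plus lower-cased surname."""
    first = firstname.strip().lower()
    return (NICKNAMES.get(first, first), surname.strip().lower())


def link_records(site_a, site_b):
    """Return (id_a, id_b) pairs for records whose match keys are equal."""
    index = {}
    for rec in site_a:
        key = match_key(rec["firstname"], rec["surname"])
        index.setdefault(key, []).append(rec)
    pairs = []
    for rec in site_b:
        key = match_key(rec["firstname"], rec["surname"])
        for other in index.get(key, []):
            pairs.append((other["id"], rec["id"]))
    return pairs


site_a = [{"id": 100, "firstname": "William", "surname": "Doe"}]
site_b = [{"id": 1974, "firstname": "Bill", "surname": "Doe"}]
print(link_records(site_a, site_b))  # [(100, 1974)]
```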

Source: http://stackoverflow.com/questions/23970057/web-scraping-data-from-different-sites

Wednesday, 20 August 2014

Scrape Data Point Using Python

I am looking to scrape a data point using Python off of the url http://www.cavirtex.com/orderbook .

The data point I am looking to scrape is the lowest bid offer, which at the current moment looks like this:

<tr>
<td>Jan. 19, 2014, 2:37 a.m.</td>
<td>0.0775/0.1146</td>
<td>860.00000</td>
<td>66.65 CAD</td>
</tr>

The relevant point being the 860.00 . I am looking to build this into a script which can send me an email to alert me of certain price differentials compared to other exchanges.

I'm quite noobie so if in your explanations you could offer your thought process on why you've done certain things it would be very much appreciated.

Thank you in advance!

Edit: This is what I have so far which will return me the name of the title correctly, I'm having trouble grabbing the table data though.

import urllib.request
from bs4 import BeautifulSoup

site = "http://cavirtex.com/orderbook"
hdr = {'User-Agent': 'Mozilla/5.0'}  # some sites reject the default Python user agent
req = urllib.request.Request(site, headers=hdr)
page = urllib.request.urlopen(req)
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly
print(soup.title)

Here is the code for scraping the lowest bid from the 'Buying BTC' table:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Start Firefox with a default profile (Selenium 4 style API)
browser = webdriver.Firefox()
browser.get('http://www.cavirtex.com/orderbook')

lowest_bid = float('inf')
# Every cell of the 'Buying BTC' table
elements = browser.find_elements(By.XPATH, '//div[@id="orderbook_buy"]/table/tbody/tr/td')

for element in elements:
    text = element.get_attribute('innerHTML').strip('|')
    try:
        bid = float(text)
    except ValueError:
        continue  # not a plain number (date, currency text, etc.)
    if lowest_bid > bid:
        lowest_bid = bid

browser.quit()
print(lowest_bid)

In order to install Selenium for Python on your Windows-PC, run from a command line:

pip install selenium (or pip install selenium --upgrade if you already have it).

If you want the 'Selling BTC' table instead, then change "orderbook_buy" to "orderbook_sell".

If you want the 'Last Trades' table instead, then change "orderbook_buy" to "orderbook_trades".

Note:

If you consider performance critical, then you can fetch the page over a plain URL connection instead of Selenium, and have your program run much faster. However, your code will probably end up being a lot "messier", due to the tedious HTML parsing that you'll be obliged to do...
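
As a rough illustration of that parsing step, here is a sketch that uses only the standard library and works on the table row quoted in the question. It assumes the table is present in the HTML the server returns; if the page builds the table with JavaScript, this approach will not see it. The names `CellCollector` and `lowest_numeric_cell` are made up for the example.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text content of every <td> cell in the HTML fed to it."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


def lowest_numeric_cell(html):
    """Return the smallest cell that parses as a plain number, or None."""
    parser = CellCollector()
    parser.feed(html)
    values = []
    for cell in parser.cells:
        try:
            values.append(float(cell.strip()))
        except ValueError:
            pass  # dates, "0.0775/0.1146", "66.65 CAD" and similar are skipped
    return min(values) if values else None


sample = """<tr>
<td>Jan. 19, 2014, 2:37 a.m.</td>
<td>0.0775/0.1146</td>
<td>860.00000</td>
<td>66.65 CAD</td>
</tr>"""
print(lowest_numeric_cell(sample))  # 860.0
```

In a real script the `html` argument would be the decoded response body fetched over the URL connection.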

Here is the code for sending the previous output in an email from yourself to yourself:

import smtplib, ssl

def SendMail(username, password, contents):
    # Log in and send the message from the account to itself
    server = Connect(username)
    try:
        server.login(username, password)
        server.sendmail(username, username, contents)
    except smtplib.SMTPException as error:
        print(error)
    Disconnect(server)

def Connect(username):
    # The provider name is the part between "@" and the next "."
    domain = username[username.index("@")+1:]
    serverName = domain[:domain.index(".")]
    while True:  # keep retrying until a connection is set up
        try:
            server = smtplib.SMTP(serverDict[serverName])
        except smtplib.SMTPException as error:
            print(error)
            continue
        try:
            server.ehlo()
            if server.has_extn("starttls"):
                # Upgrade to an encrypted connection when the server supports it
                server.starttls()
                server.ehlo()
        except (smtplib.SMTPException, ssl.SSLError) as error:
            print(error)
            Disconnect(server)
            continue
        break
    return server

def Disconnect(server):
    try:
        server.quit()
    except smtplib.SMTPException as error:
        print(error)

# SMTP host for each supported provider
serverDict = {
    "gmail"  : "smtp.gmail.com",
    "hotmail": "smtp.live.com",
    "yahoo"  : "smtp.mail.yahoo.com"
}

SendMail("your_username@your_provider.com", "your_password", str(lowest_bid))

The above code should work if your email provider is Gmail, Hotmail or Yahoo.

Please note that depending on your firewall configuration, it may ask your permission upon the first time you try it...

Source: http://stackoverflow.com/questions/21217034/scrape-data-point-using-python

Sunday, 17 August 2014

Has It Been Done Before? Optimize Your Patent Search Using Patent Scraping Technology

Since the US patent office opened in 1790, inventors across the United States have been submitting all sorts of great products and half-baked ideas to its database. Nowadays, many individuals get ideas for great products only to have the patent office do a patent search and tell them that their ideas have already been patented by someone else! Herein lies a question: how do I perform a patent search to find out if my invention has already been patented before I invest time and money into developing it?

The US patent office patent search database is available to anyone with internet access.

Performing a patent search with the patent searching tools on the US Patent Office web page can prove to be a very time-consuming process. For example, searching the database for "dog" and "food" yields 5745 patent search results. The straightforward approach to investigating the results for your particular idea is to go through all 5745 one at a time, looking for yours. Get some munchies and settle in; this could take a while! The patent search database sorts results by patent number instead of relevancy. This means that if your idea was recently patented, you will find it near the top, but if it wasn't, you could be searching for quite a while. Also, most patent search results have images associated with them. Downloading and displaying these images over the internet can be very time consuming, depending on your internet connection and the availability of the patent search database servers.

Because patent searches take such a long time, many companies and organizations are looking for ways to improve the process. Some organizations and companies will hire employees for the sole purpose of performing patent searches for them. Others contract out the job to small businesses that specialize in patent searches. The latest technology for performing patent searches is called patent scraping.

Patent scraping is the process of writing automated scripts that analyze a website and copy only the content you are interested in into easily accessible databases or spreadsheets on your computer. Because a script performs the patent search, you don't need a separate employee to get the data; you can let the scraper run while you perform other important tasks! Patent scraping technology can also extract text content from images. By saving the images and textual content to your computer, you can then search them very efficiently for content and relevancy, saving you lots of time that could be better spent actually inventing something!

To put a real-world face on this, let us consider the pharmaceutical industry. Many different companies are competing for the patent on the next big drug. It has become an indispensable tactic of the industry for one company to search for the patents that other companies are applying for, thus learning in which direction the other company's research and development team is heading. Using this information, the company can then choose either to pursue that direction heavily or to spin off in a different direction. It would quickly become very costly to maintain a team of researchers dedicated only to performing patent searches all day. Patent scraping technology is the means for figuring out what ideas and technologies are coming about before they make headline news. It is by utilizing patent scraping technology that the large companies stay up to date on the latest trends in technology.

While some companies choose to hire their own programming team to do their patent scraping scripts for them, it is much more cost effective to contract out the job to a qualified team of programmers dedicated to performing such services.

Source: http://ezinearticles.com/?Has-It-Been-Done-Before?-Optimize-Your-Patent-Search-Using-Patent-Scraping-Technology&id=171000

Wednesday, 13 August 2014

Business Intelligence Data Mining

Data mining can be technically defined as the automated extraction of hidden information from large databases for predictive analysis. In other words, it is the retrieval of useful information from large masses of data, which is also presented in an analyzed form for specific decision-making.

Data mining requires the use of mathematical algorithms and statistical techniques integrated with software tools. The final product is an easy-to-use software package that can be used even by non-mathematicians to effectively analyze the data they have. Data Mining is used in several applications like market research, consumer behavior, direct marketing, bioinformatics, genetics, text analysis, fraud detection, web site personalization, e-commerce, healthcare, customer relationship management, financial services and telecommunications.

Business intelligence data mining is used in market research, industry research, and for competitor analysis. It has applications in major industries like direct marketing, e-commerce, customer relationship management, healthcare, the oil and gas industry, scientific tests, genetics, telecommunications, financial services and utilities. BI uses various technologies like data mining, scorecarding, data warehouses, text mining, decision support systems, executive information systems, management information systems and geographic information systems for analyzing useful information for business decision making.

Business intelligence is a broader arena of decision-making that uses data mining as one of the tools. In fact, the use of data mining in BI makes the data more relevant in application. There are several kinds of data mining: text mining, web mining, social networks data mining, relational databases, pictorial data mining, audio data mining and video data mining, that are all used in business intelligence applications.

Some data mining tools used in BI are: decision trees, information gain, probability, probability density functions, Gaussians, maximum likelihood estimation, Gaussian Bayes classification, cross-validation, neural networks, instance-based (case-based, memory-based, non-parametric) learning, regression algorithms, Bayesian networks, Gaussian mixture models, K-means and hierarchical clustering, Markov models and so on.

Source: http://ezinearticles.com/?Business-Intelligence-Data-Mining&id=196648

Friday, 1 August 2014

Importance of Data Cleansing Services

Companies hold a huge amount of data that is essential to their decision making and strategies. Unfortunately, the data is sometimes inaccurate or incomplete because of the updates made from time to time. Because of this, companies look for ways to eradicate information they do not need. Data cleansing is one of the processes that can eliminate a company's unnecessary data. It identifies information that is fraudulent or inaccurate and deletes it or replaces it with accurate information. Unclean data has no place in a company because it causes inefficiencies and inaccuracies in decisions. After cleaning, the inconsistencies are gone and the data sets agree with each other.

Different techniques are used in data cleansing: data transformation, parsing (detecting syntax errors), duplicate elimination, and statistical methods. These techniques help ensure that the data is clean and good. There are also criteria for telling whether a data set is clean. These are the things that companies look for when getting data cleansing services.

Data should be accurate, which requires density, integrity, and consistency. It should also be complete, so that there are no gaps in the data set. Density is the relationship between the number of omitted values and the total number of values in the data set; a data set with few omitted values has good density. Data should also be uniform, with irregularities eliminated from the set. Consistency should also be present, meaning the set is free of syntactical errors. Cleaning should also establish the uniqueness of the set, which reflects the number of duplicates that were present before the cleaning. Lastly, the data should have integrity, which combines the criteria of soundness and completeness. If the above criteria are met, the data set is in its best state.
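
Two of these criteria are easy to put into numbers. The sketch below is illustrative only: it treats density as the share of omitted values (empty or missing) among all values, and counts duplicates as repeated records. The function names and the sample records are made up for the example.

```python
def missing_ratio(records, fields):
    """Share of omitted values (None or empty string) among all values."""
    total = len(records) * len(fields)
    if total == 0:
        return 0.0
    missing = sum(
        1 for rec in records for field in fields if rec.get(field) in (None, "")
    )
    return missing / total


def count_duplicates(records, fields):
    """Number of records that repeat an earlier record on the given fields."""
    seen = set()
    duplicates = 0
    for rec in records:
        key = tuple(rec.get(field) for field in fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


records = [
    {"name": "Ann", "city": "Oslo"},
    {"name": "Ann", "city": "Oslo"},   # duplicate of the first record
    {"name": "Bob", "city": ""},       # omitted city
    {"name": "Cy", "city": None},      # omitted city
]
fields = ["name", "city"]
print(missing_ratio(records, fields))     # 0.25
print(count_duplicates(records, fields))  # 1
```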

A data cleansing service typically offers several different services. Removal of duplicates is one of the most common features of data cleansing: identical records or data sets are tagged and identified, and the duplicates are eradicated. Data is also validated, and bogus data is eliminated. The set is also checked for outdated data, which is removed. Incomplete records are also identified so that they can be given attention. Once incomplete data is identified, the records can be improved so that they are assembled in order and organized as a set.

Aside from the benefits that companies get from data cleansing services, there are also problems with data cleansing. Sometimes data is lost when records with limited information are eradicated. As for the companies that offer the services, they should maintain good service, since data cleansing is expensive and time consuming.

Source: http://ezinearticles.com/?Importance-of-Data-Cleansing-Services&id=5013611