Wednesday 4 January 2012

Screen Scraping using Python

This post gives an idea of how to scrape a web page into usable data. I would like to share my experience with screen scraping.

I wrote a program in Python that parses a web page and extracts only the required data.
I tried it on this site: http://old.rajinifans.com/movie_database/movie_list_detail.php

Looking closely at the source of the page, I saw that the lines holding the data all follow almost the same pattern. A second observation confirmed that the data lines are separated by a constant number of lines.

import urllib
# Python 2: open the page and read its source as a list of lines
u=urllib.urlopen("http://old.rajinifans.com/movie_database/movie_list_detail.php")
c=u.readlines()
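
The snippet above targets Python 2. On Python 3 the same call lives in urllib.request, and the response arrives as bytes that need decoding. A minimal sketch, assuming the page can be read as UTF-8 (the helper names decode_lines and fetch_lines and the encoding are my assumptions, not something the site states):

```python
import io
from urllib.request import urlopen

def decode_lines(raw, encoding="utf-8"):
    # Split the raw bytes on newlines, keeping the line endings,
    # then decode each line. Undecodable bytes are replaced.
    return [line.decode(encoding, errors="replace")
            for line in io.BytesIO(raw).readlines()]

def fetch_lines(url, encoding="utf-8"):
    # Download the page and return its source as a list of text lines.
    with urlopen(url) as response:
        return decode_lines(response.read(), encoding)

# Usage (needs network access):
# c = fetch_lines("http://old.rajinifans.com/movie_database/movie_list_detail.php")
```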


First, the content of the page is read into a list of lines. I found the following pattern repeated throughout the page source:
color:black"><font size="1">Apoorva Raagangal</font></span></td>

So I stored the surrounding markup as strings and replaced each of them with an empty string, which leaves only the data:

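opentag=' color:black"><font size="1">'
closetag='</font></span></td>'
closetag2='&nbsp;&nbsp;&nbsp; <br>'  # not used for the name line
i=0      # line counter; must be set before the loop starts
a=481    # line number that is checked for the name
for line1 in c:
 i=i+1
 if(i==a):
  s=line1
  # strip the markup around the value
  s=s.replace(opentag,'')
  s=s.replace(closetag,'')
  print "Name : "+s.strip()+"\n"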


When the line number matches, the unwanted markup is removed from that line and the remaining part (the required data) is printed immediately.
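
Because the data lines sit a constant number of lines apart, the same line-number idea can be extended to later rows by stepping through the list. A small sketch on made-up input, written for Python 3; the sample lines, the starting line and the step of 3 are all hypothetical, and the real page needs its own values:

```python
OPEN_TAG = ' color:black"><font size="1">'
CLOSE_TAG = '</font></span></td>'

def clean(line):
    # Remove the known markup around the value and trim whitespace.
    return line.replace(OPEN_TAG, '').replace(CLOSE_TAG, '').strip()

def every_nth(lines, first, step):
    # first is a 1-based line number, as in the loop above;
    # step is the constant gap between data lines.
    return [clean(line) for line in lines[first - 1::step]]

# Hypothetical page fragment: a data line followed by two filler lines.
sample = [
    ' color:black"><font size="1">Apoorva Raagangal</font></span></td>\n',
    '<td>\n',
    '</tr>\n',
    ' color:black"><font size="1">Tamil</font></span></td>\n',
    '<td>\n',
    '</tr>\n',
]

for value in every_nth(sample, first=1, step=3):
    print(value)
```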

We can also write a regular expression for the repeating pattern and use it to strip the markup. Alternatively, the unwanted markup can be removed using either the string split method or the re module.
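
To make the two alternatives concrete, here is a small sketch (Python 3) using the sample line shown earlier. The regular expression and the helper names are my own illustration, and a regex like this only suits markup as regular as this page's:

```python
import re

line = ' color:black"><font size="1">Apoorva Raagangal</font></span></td>'

# Regular expression: capture whatever sits between the open and close markup.
PATTERN = re.compile(r'color:black"><font size="1">(.*?)</font></span></td>')

def extract_with_re(text):
    # Returns the captured value, or None when the line does not match.
    match = PATTERN.search(text)
    return match.group(1).strip() if match else None

def extract_with_split(text):
    # Keep what follows the opening markup, then what precedes the closing markup.
    # A line without the markup comes back unchanged apart from trimming.
    after_open = text.split('<font size="1">', 1)[-1]
    return after_open.split('</font>', 1)[0].strip()

print(extract_with_re(line))
print(extract_with_split(line))
```

The regex version has the advantage of reporting a non-matching line (it returns None), so it can double as the pattern check.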

To retrieve the whole data set, the line-number check can be dropped and a pattern check used to pick out the data lines instead. By repeating the if block for a few more line numbers, the first row of data is retrieved:

Name : Apoorva Raagangal

Language : Tamil

Producer : P.R.Govindarajan
                          Duraiswamy

Director : K.Balachander

Music By : M.S.Viswanathan

Lyrics By : Kannadasan

Playback Singers: Yesudas
                         Vani Jairam
                         Sasikala
                         Sheik Mohamed

Male artists : Kamalahasan
                       Rajnikant
                       Sundarrajan
                       Nagesh

Female Artists : Srividya
                      Jayasudha

Released On : 18.08.1975
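
For the full table, the line-number check can give way to a pattern check, as mentioned earlier. A rough sketch (Python 3) on made-up lines; the sample input is hypothetical, and on the real page the list c read from the site would take its place:

```python
OPEN_TAG = 'color:black"><font size="1">'
CLOSE_TAG = '</font></span></td>'

def extract_values(lines):
    # Keep only lines that contain both pieces of markup,
    # and return the text found between them.
    values = []
    for line in lines:
        start = line.find(OPEN_TAG)
        if start == -1:
            continue
        value_start = start + len(OPEN_TAG)
        end = line.find(CLOSE_TAG, value_start)
        if end == -1:
            continue
        values.append(line[value_start:end].strip())
    return values

# Hypothetical page fragment mixing data lines and other markup.
sample = [
    '<tr>\n',
    ' color:black"><font size="1">Apoorva Raagangal</font></span></td>\n',
    '<td>\n',
    ' color:black"><font size="1">Tamil</font></span></td>\n',
    '</tr>\n',
]

print(extract_values(sample))
```

This returns the values in page order without labels; pairing each value with its field name (Name, Language and so on) would still rely on the constant spacing between the lines.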


The pattern of the data differs from site to site. Depending on the pattern found, we can choose among various ways to scrape.