On Wednesday, 28 August 2013 01:49:59 UTC+5:30, MRAB wrote:
> On 27/08/2013 20:41, mukesh tiwari wrote:
> > Hello All,
> > I am doing web stuff for the first time in Python, so I am looking for
> > suggestions. I wrote this code to download the titles of web pages using
> > as few resources (server time, data downloaded) as possible, and it
> > should be reasonably fast. Initially I used BeautifulSoup for parsing,
> > but the person who is going to use this code asked me not to use it and
> > to use regular expressions instead (the reason being that BeautifulSoup
> > is not fast enough?). Also, initially I was downloading the whole page,
> > but in the end I restricted it to the first 30000 characters, which is
> > enough to get the title of almost all pages. Right now I can see only
> > two shortcomings of this code: first, when I kill it with SIGINT
> > (Ctrl-C) it dies instantly; I could modify it to process all the
> > elements left in the queue and then let it die. Second, there is one IO
> > call per iteration in the downloadurl function (maybe I could use an
> > async IO call, but I am not sure). I don't have much web programming
> > experience, so I am looking for suggestions to make the code more
> > robust. top-1m.csv is the file downloaded from Alexa [1]. Suggestions
> > for writing more idiomatic Python code are also welcome.
> >
> > -Mukesh Tiwari
> >
> > [1] http://www.alexa.com/topsites
> >
> > import urllib2, os, socket, Queue, thread, signal, sys, re
> >
> >
> > class Downloader():
> >
> >     def __init__(self):
> >         self.q = Queue.Queue(200)
> >         self.count = 0
> >
> >     def downloadurl(self):
> >         # Compile the pattern once instead of once per URL.
> >         regex = re.compile('<title.*>(.*?)</title>', re.IGNORECASE)
> >         # Open the output file in append mode and write the results
> >         # (improvement: think of writing in chunks).
> >         with open('titleoutput.dat', 'a+') as file:
> >             while True:
> >                 try:
> >                     url = self.q.get()
> >                     data = urllib2.urlopen(url, data=None, timeout=10).read(30000)
> >                     title = regex.search(data)
> >                     result = ', '.join([url, title.group(1)])
> >                     file.write(''.join([result, '\n']))
> >                 except urllib2.HTTPError as e:
> >                     print ''.join([url, ' ', str(e)])
> >                 except urllib2.URLError as e:
> >                     print ''.join([url, ' ', str(e)])
> >                 except Exception as e:
> >                     print ''.join([url, ' ', str(e)])
> >         # The with block calls file.close() automatically.
> >
> >     def createurl(self):
> >         # Check whether the progress file exists. If not, create it with
> >         # a default value of 0 chunks read.
> >         if os.path.exists('bytesread.dat'):
> >             f = open('bytesread.dat', 'r')
> >             self.count = int(f.readline())
> >         else:
> >             f = open('bytesread.dat', 'w')
> >             f.write('0\n')
> >             f.close()
> >         # Reading the data in 1024-byte chunks is fast, but on restart we
> >         # can miss a few sites around a chunk boundary (worth it, because
> >         # reading this way is very fast).
> >         with open('top-1m.csv', 'r') as file:
> >             prefix = ''
> >             file.seek(self.count * 1024)
> >             # The seek can land in the middle of a line, so discard
> >             # everything up to the next newline.
> >             if self.count:
> >                 file.readline()
> >             for lines in iter(lambda: file.read(1024), ''):
> >                 l = lines.split('\n')
> >                 n = len(l)
> >                 l[0] = ''.join([prefix, l[0]])
> >                 for i in xrange(n - 1):
> >                     self.q.put(''.join(['http://www.', l[i].split(',')[1]]))
> >                 prefix = l[n - 1]
> >                 self.count += 1
> >
> >     # Do a graceful exit from here: record how many chunks have been read.
> >     def handleexception(self, signal, frame):
> >         with open('bytesread.dat', 'w') as file:
> >             print ''.join(['Number of chunks read (probably unfinished) ', str(self.count)])
> >             file.write(''.join([str(self.count), '\n']))
> >         sys.exit(0)
> >
> > if __name__ == '__main__':
> >     u = Downloader()
> >     signal.signal(signal.SIGINT, u.handleexception)
> >     thread.start_new_thread(u.createurl, ())
> >     for i in xrange(5):
> >         thread.start_new_thread(u.downloadurl, ())
> >     while True: pass
> >
> My preferred method when working with background threads is to put a
> sentinel such as None at the end and then, when a worker gets an item
> from the queue and sees that it's the sentinel, it puts it back in the
> queue for the other workers to see, and then returns (terminates). The
> main thread can then call each worker thread's .join method to wait for
> it to finish. You currently have the main thread running in a 'busy
> loop', consuming processing time doing nothing!
Hi MRAB,
Thank you for the reply. I wrote the while loop only because there is no
thread.join in the thread module [1], but I take your point: I am simply
running a busy loop that does nothing. If I can somehow block the main
thread without wasting computation, that would be great; I have tried to
sketch both approaches below.

-Mukesh Tiwari

[1] http://docs.python.org/2/library/thread.html#module-thread
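
P.S. To check my understanding of the sentinel pattern you describe, here
is a minimal sketch using the higher-level threading module instead of
thread (the names worker and NUM_WORKERS and the queue size are
illustrative, not taken from my script above):

import threading
import Queue

NUM_WORKERS = 5     # same worker count as the script above
SENTINEL = None     # any unique object will do

q = Queue.Queue(200)

def worker():
    while True:
        url = q.get()
        if url is SENTINEL:
            q.put(SENTINEL)   # put it back so the other workers see it too
            return            # this worker terminates
        # ... fetch the url and extract the title here ...

threads = [threading.Thread(target=worker) for _ in xrange(NUM_WORKERS)]
for t in threads:
    t.start()

# The producer would q.put() every url here, then append the sentinel.
q.put(SENTINEL)

for t in threads:
    t.join()   # blocks without spinning, unlike `while True: pass`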
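
P.P.S. If I stay with the low-level thread module, I believe the usual
idiom is to have the main thread block on one lock per worker, which each
worker releases when it finishes. Again only a sketch, and the lock
bookkeeping is my guess at the idiom:

import thread

def worker(lock):
    try:
        pass  # ... pull urls from the queue and process them ...
    finally:
        lock.release()   # tell the main thread this worker is done

locks = []
for i in xrange(5):
    lock = thread.allocate_lock()
    lock.acquire()                           # each lock starts out held
    locks.append(lock)
    thread.start_new_thread(worker, (lock,))

for lock in locks:
    lock.acquire()   # blocks until the matching worker releases it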