On Wednesday, 28 August 2013 01:49:59 UTC+5:30, MRAB wrote:
> On 27/08/2013 20:41, mukesh tiwari wrote:
> > Hello All,
> > I am doing web stuff for the first time in Python, so I am looking for
> > suggestions. I wrote this code to download the titles of web pages using
> > as few resources (server time, data downloaded) as possible, and it
> > should be reasonably fast. Initially I used BeautifulSoup for parsing,
> > but the person who is going to use this code asked me not to use it and
> > to use regular expressions instead (the reason being that BeautifulSoup
> > is not fast enough?). Also, initially I was downloading the whole page,
> > but in the end I restricted it to the first 30000 characters, which is
> > enough to get the title of almost all pages. Right now I can see only
> > two shortcomings of this code: first, when I kill it with SIGINT
> > (Ctrl-C) it dies instantly; I could modify it to process all the
> > elements left in the queue and then let it die. Second, there is one IO
> > call per iteration in the downloadurl function (maybe I could use an
> > async IO call, but I am not sure). I don't have much web programming
> > experience, so I am looking for suggestions to make the code more
> > robust. top-1m.csv is the file downloaded from Alexa [1]. Suggestions
> > for writing more idiomatic Python code are also welcome.
> >
> > -Mukesh Tiwari
> >
> > [1] http://www.alexa.com/topsites
> >
> > import urllib2, os, socket, Queue, thread, signal, sys, re
> >
> >
> > class Downloader():
> >
> >     def __init__(self):
> >         self.q = Queue.Queue(200)
> >         self.count = 0
> >
> >     def downloadurl(self):
> >         # Compile the pattern once instead of once per URL.
> >         regex = re.compile('<title.*>(.*?)</title>', re.IGNORECASE)
> >         # Open the output file in append mode and write the results
> >         # (improvement: think of writing in chunks).
> >         with open('titleoutput.dat', 'a+') as file:
> >             while True:
> >                 try:
> >                     url = self.q.get()
> >                     data = urllib2.urlopen(url, data=None, timeout=10).read(30000)
> >                     title = regex.search(data)
> >                     result = ', '.join([url, title.group(1)])
> >                     file.write(''.join([result, '\n']))
> >                 except urllib2.HTTPError as e:
> >                     print ''.join([url, ' ', str(e)])
> >                 except urllib2.URLError as e:
> >                     print ''.join([url, ' ', str(e)])
> >                 except Exception as e:
> >                     print ''.join([url, ' ', str(e)])
> >         # The with block calls file.close() automatically.
> >
> >     def createurl(self):
> >         # Check whether the progress file exists. If not, create it with
> >         # a default value of 0 chunks read.
> >         if os.path.exists('bytesread.dat'):
> >             f = open('bytesread.dat', 'r')
> >             self.count = int(f.readline())
> >         else:
> >             f = open('bytesread.dat', 'w')
> >             f.write('0\n')
> >             f.close()
> >         # Reading the data in 1024-byte chunks is fast, but on restart we
> >         # can miss a few sites around a chunk boundary (worth it, because
> >         # reading this way is very fast).
> >         with open('top-1m.csv', 'r') as file:
> >             prefix = ''
> >             file.seek(self.count * 1024)
> >             # The seek can land in the middle of a line, so discard
> >             # everything up to the next newline.
> >             if self.count:
> >                 file.readline()
> >             for lines in iter(lambda: file.read(1024), ''):
> >                 l = lines.split('\n')
> >                 n = len(l)
> >                 l[0] = ''.join([prefix, l[0]])
> >                 for i in xrange(n - 1):
> >                     self.q.put(''.join(['http://www.', l[i].split(',')[1]]))
> >                 prefix = l[n - 1]
> >                 self.count += 1
> >
> >     # Do a graceful exit from here: record how many chunks have been read.
> >     def handleexception(self, signal, frame):
> >         with open('bytesread.dat', 'w') as file:
> >             print ''.join(['Number of chunks read (probably unfinished) ', str(self.count)])
> >             file.write(''.join([str(self.count), '\n']))
> >         sys.exit(0)
> >
> > if __name__ == '__main__':
> >     u = Downloader()
> >     signal.signal(signal.SIGINT, u.handleexception)
> >     thread.start_new_thread(u.createurl, ())
> >     for i in xrange(5):
> >         thread.start_new_thread(u.downloadurl, ())
> >     while True: pass
> >
> My preferred method when working with background threads is to put a
> sentinel such as None at the end and then, when a worker gets an item
> from the queue and sees that it's the sentinel, it puts it back in the
> queue for the other workers to see, and then returns (terminates). The
> main thread can then call each worker thread's .join method to wait for
> it to finish. You currently have the main thread running in a 'busy
> loop', consuming processing time doing nothing!
Hi MRAB,
Thank you for the reply. I wrote the while loop only because there is no
thread.join in the thread module [1], but I take your point: I am simply
running a busy loop that does nothing. If I can somehow block the main
thread without wasting computation, that would be great; I have tried to
sketch both approaches below.

-Mukesh Tiwari

[1] http://docs.python.org/2/library/thread.html#module-thread
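
P.S. To check my understanding of the sentinel pattern you describe, here
is a minimal sketch using the higher-level threading module instead of
thread (the names worker and NUM_WORKERS and the queue size are
illustrative, not taken from my script above):

import threading
import Queue

NUM_WORKERS = 5     # same worker count as the script above
SENTINEL = None     # any unique object will do

q = Queue.Queue(200)

def worker():
    while True:
        url = q.get()
        if url is SENTINEL:
            q.put(SENTINEL)   # put it back so the other workers see it too
            return            # this worker terminates
        # ... fetch the url and extract the title here ...

threads = [threading.Thread(target=worker) for _ in xrange(NUM_WORKERS)]
for t in threads:
    t.start()

# The producer would q.put() every url here, then append the sentinel.
q.put(SENTINEL)

for t in threads:
    t.join()   # blocks without spinning, unlike `while True: pass`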
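
P.P.S. If I stay with the low-level thread module, I believe the usual
idiom is to have the main thread block on one lock per worker, which each
worker releases when it finishes. Again only a sketch, and the lock
bookkeeping is my guess at the idiom:

import thread

def worker(lock):
    try:
        pass  # ... pull urls from the queue and process them ...
    finally:
        lock.release()   # tell the main thread this worker is done

locks = []
for i in xrange(5):
    lock = thread.allocate_lock()
    lock.acquire()                           # each lock starts out held
    locks.append(lock)
    thread.start_new_thread(worker, (lock,))

for lock in locks:
    lock.acquire()   # blocks until the matching worker releases it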