On Jan 28, 10:06 am, Bruno Desthuilliers
<bruno.42.desthuilli...@websiteburo.invalid> wrote:
> perfr...@gmail.com wrote:
> > hi,
> > i am doing a series of very simple string operations on lines i am
> > reading from a large file (~15 million lines). i store the result of
> > these operations in a simple instance of a class, and then put it
> > inside of a hash table. i found that this is unusually slow... for
> > example:
> >
> > class myclass(object):
> >     __slots__ = ("a", "b", "c", "d")
> >     def __init__(self, a, b, c, d):
> >         self.a = a
> >         self.b = b
> >         self.c = c
> >         self.d = d
> >     def __str__(self):
> >         return "%s_%s_%s_%s" % (self.a, self.b, self.c, self.d)
> >     def __hash__(self):
> >         return hash((self.a, self.b, self.c, self.d))
> >     def __eq__(self, other):
> >         return (self.a == other.a and
> >                 self.b == other.b and
> >                 self.c == other.c and
> >                 self.d == other.d)
> >     __repr__ = __str__
>
> If your class really looks like that, a tuple would be enough.
>
> > n = 15000000
> > table = defaultdict(int)
> > t1 = time.time()
> > for k in range(1, n):
>
> hint: use xrange instead.
>
> >     myobj = myclass('a' + str(k), 'b', 'c', 'd')
> >     table[myobj] = 1
>
> hint: if all you want is to ensure uniqueness, use a set instead.
>
> > t2 = time.time()
> > print "time: ", float((t2-t1)/60.0)
>
> hint: use timeit instead.
>
> > this takes a very long time to run: 11 minutes! for the sake of the
> > example i am not reading anything from file here, but in my real code
> > i do. also, i do 'a' + str(k), but in my real code this is some
> > simple string operation on the line i read from the file. however, i
> > found that the above code shows the real bottleneck, since reading my
> > file into memory (using readlines()) takes only about 4 seconds. i
> > then have to iterate over these lines, but i still think that is more
> > efficient than the 'for line in file' approach, which is even slower.
>
> Iterating over the file, while indeed a bit slower on a per-line
> basis, avoids useless memory consumption which can lead to disk
> swapping - so for "huge" files, it might still be better wrt/ overall
> performance.
>
> > in the above code, is there a way to optimize the creation of the
> > class instances? i am using defaultdicts instead of ordinary ones, so
> > i don't know how else to optimize that part of the code. is there a
> > way to perhaps optimize the way the class is written? if it takes
> > only 3 seconds to read in 15 million lines into memory, it doesn't
> > make sense to me that making them into simple objects while at it
> > would take that much more...
>
> Did you bench the creation of a list of 15,000,000 ints ?-)
>
> But anyway, creating 15,000,000 instances (which is not a small
> number) of your class takes many seconds - 23.466073989868164 seconds
> on my (already heavily loaded) machine. Building the same number of
> tuples only takes about 2.5 seconds - that is, almost 10 times less.
> FWIW, tuples have all the useful characteristics of your above class
> (wrt/ hashing and comparison).
>
> My 2 cents...
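For reference, here is a minimal sketch of Bruno's hints combined:
tuples as keys, xrange instead of range, and a set instead of a
defaultdict used only for membership. It is written in Python 2 style
to match the thread, and the scaled-down n is an assumption chosen so
the demo finishes quickly, not a figure from Bruno's benchmark:

import time

n = 1000000  # assumed value, scaled down from 15 million for a quick demo

t1 = time.time()
seen = set()
for k in xrange(1, n):
    # a tuple hashes and compares on its contents, which is exactly
    # what the hand-written __hash__/__eq__ on myclass provided
    seen.add(('a' + str(k), 'b', 'c', 'd'))
t2 = time.time()
print "tuples into a set: %.2f seconds" % (t2 - t1)

Running the same loop with myclass instances as keys should reproduce
the roughly 10x construction overhead Bruno measured.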
thanks for your insightful reply - changing to tuples made a big change!
--
http://mail.python.org/mailman/listinfo/python-list