perfr...@gmail.com wrote:
hi,

I am doing a series of very simple string operations on lines I am
reading from a large file (~15 million lines). I store the result of
these operations in a simple instance of a class and then put it
into a hash table. I found that this is unusually slow... for
example:

class myclass(object):
    __slots__ = ("a", "b", "c", "d")
    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d
    def __str__(self):
        return "%s_%s_%s_%s" %(self.a, self.b, self.c, self.d)
    def __hash__(self):
        return hash((self.a, self.b, self.c, self.d))
    def __eq__(self, other):
        return (self.a == other.a and
                self.b == other.b and
                self.c == other.c and
                self.d == other.d)
    __repr__ = __str__


If your class really looks like that, a tuple would be enough.
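
For instance, a plain tuple already hashes and compares on all of its fields, which is exactly what the __hash__ and __eq__ above reimplement - a quick sketch, with field values taken from the example above:

key = ('a1', 'b', 'c', 'd')
print hash(key)                      # plays the role of myclass.__hash__
print key == ('a1', 'b', 'c', 'd')   # plays the role of myclass.__eq__ -> True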

from collections import defaultdict
import time

n = 15000000
table = defaultdict(int)
t1 = time.time()
for k in range(1, n):

hint: use xrange instead.

    myobj = myclass('a' + str(k), 'b', 'c', 'd')
    table[myobj] = 1

hint: if all you want is to ensure uniqueness, use a set instead.

t2 = time.time()
print "time: ", float((t2-t1)/60.0)

hint: use timeit instead.
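
Putting those three hints together, a rough sketch might look like this (the 'a' + str(k) is still just a stand-in for your real per-line string operation):

import timeit

def build_keys(n=15000000):
    seen = set()                # a set is enough to ensure uniqueness
    for k in xrange(1, n):      # xrange doesn't build a 15-million-element list up front
        seen.add(('a' + str(k), 'b', 'c', 'd'))
    return seen

# timeit takes care of the clock and of repeating the measurement
print timeit.timeit(build_keys, number=1), "seconds"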

This takes a very long time to run: 11 minutes! For the sake of the
example I am not reading anything from the file here, but in my real
code I do. Also, I do 'a' + str(k), but in my real code this is some
simple string operation on the line I read from the file. However, I
found that the above code shows the real bottleneck, since reading my
file into memory (using readlines()) takes only about 4 seconds. I
then have to iterate over these lines, but I still think that is more
efficient than the 'for line in file' approach, which is even slower.

Iterating over the file, while indeed a bit slower on a per-line basis, avoids the useless memory consumption that can lead to disk swapping - so for "huge" files, it may still be better wrt/ overall performance.
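
That is, something along these lines (the filename and the per-line function are of course placeholders):

f = open('somefile.txt')
for line in f:           # lines are read lazily, one at a time
    handle(line)         # hypothetical per-line processing
f.close()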

In the above code, is there a way to optimize the creation of the
class instances? I am using defaultdicts instead of ordinary ones, so
I don't know how else to optimize that part of the code. Is there a
way to perhaps optimize the way the class is written? If it takes only
3 seconds to read 15 million lines into memory, it doesn't make sense
to me that making them into simple objects along the way would take
that much more...

Did you benchmark the creation of a 15,000,000-int list? ;-)
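
Something like this gives a baseline for the raw cost of just building that many objects (a rough sketch):

import time

t0 = time.time()
ints = range(1, 15000000)    # materialises a full list of ~15 million ints
print "15M ints:", time.time() - t0, "seconds"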

But anyway, creating 15,000,000 instances of your class (which is not a small number) takes many seconds - 23.466073989868164 seconds on my (already heavily loaded) machine. Building the same number of tuples takes only about 2.5 seconds - that is, almost 10 times less. FWIW, tuples have all the useful characteristics of your class above (wrt/ hashing and comparison).
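
For reference, that comparison can be reproduced with something along these lines (a sketch; it assumes the myclass definition quoted above is in scope, and the absolute numbers will obviously vary from one machine to another):

import timeit

def make_instances(n=15000000):
    return [myclass('a' + str(k), 'b', 'c', 'd') for k in xrange(n)]

def make_tuples(n=15000000):
    return [('a' + str(k), 'b', 'c', 'd') for k in xrange(n)]

print "class instances:", timeit.timeit(make_instances, number=1), "seconds"
print "tuples:         ", timeit.timeit(make_tuples, number=1), "seconds"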

My 2 cents...
