On Jan 15, 1:39 pm, Per Freem <perfr...@yahoo.com> wrote:
> hello
>
> i have an optimization questions about python. i am iterating through
> a file and counting the number of repeated elements. the file has on
> the order of tens of millions elements...
>
> i create a dictionary that maps elements of the file that i want to
> count to their number of occurs. so i iterate through the file and for
> each line extract the elements (simple text operation) and see if it
> has an entry in the dict:
>
> for line in file:
>     try:
>         elt = MyClass(line)  # extract elt from line...
>         my_dict[elt] += 1
>     except KeyError:
>         my_dict[elt] = 1
>
> i am using try/except since it is supposedly faster (though i am not
> sure about this? is this really true in Python 2.5?).
>
> the only 'twist' is that my elt is an instance of a class (MyClass)
> with 3 fields, all numeric. the class is hashable, and so my_dict[elt]
> works well.
> the __repr__ and __hash__ methods of my class simply return str()
> representation of self, while __str__ just makes everything numeric
> field into a concatenated string:
>
> class MyClass
>
>     def __str__(self):
>         return "%s-%s-%s" %(self.field1, self.field2, self.field3)
>
>     def __repr__(self):
>         return str(self)
>
>     def __hash__(self):
>         return hash(str(self))
>
> is there anything that can be done to speed up this simply code? right
> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> with lots of RAM (though this is all taking CPU power, not RAM at this
> point.)
>
> any general advice on how to optimize large dicts would be great too
>
> thanks for your help.
You can hash a tuple instead of a string, which should be a little
quicker. Note that __hash__ has to return an integer, so wrap the
tuple in hash():

    def __hash__(self):
        return hash((self.field1, self.field2, self.field3))

You could speed it up even more if you just saved a single attribute
"fields" as a tuple to begin with.

Also, you can use defaultdict and get rid of the try/except. I don't
think try/except is slow, but avoiding it should still give you a
speedup.

    from collections import defaultdict

    my_dict = defaultdict(int)

    for line in file:
        elt = MyClass(line)  # extract elt from line...
        my_dict[elt] += 1

You might even consider turning "MyClass" into just a function that
extracts the values from the line and returns a tuple, which should
give you even more of a boost since a tuple is completely implemented
in C.

Matt
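
For what it's worth, here is a rough sketch of both variants side by side: the class storing a single "fields" tuple, and the plain function returning a tuple. The original post doesn't say how a line is laid out, so the three whitespace-separated integers are an assumption, and extract() and count() are hypothetical helper names. The __eq__ method is also an assumption on my part: the post doesn't show one, but without it two instances holding the same values would not be treated as the same dict key.

```python
from collections import defaultdict


class MyClass(object):
    def __init__(self, line):
        # Assumed input format: three whitespace-separated integers.
        self.fields = tuple(int(part) for part in line.split())

    def __hash__(self):
        # __hash__ must return an integer, so hash the stored tuple.
        return hash(self.fields)

    def __eq__(self, other):
        # Instances with equal field values must compare equal,
        # otherwise the dict treats each instance as a distinct key.
        return isinstance(other, MyClass) and self.fields == other.fields


def extract(line):
    # Function-only variant (hypothetical name): returns a plain tuple,
    # which is already hashable and compares by value.
    return tuple(int(part) for part in line.split())


def count(lines, make_key):
    # Count occurrences of each key without try/except.
    counts = defaultdict(int)
    for line in lines:
        counts[make_key(line)] += 1
    return counts


sample = ["1 2 3", "4 5 6", "1 2 3"]
by_class = count(sample, MyClass)
by_tuple = count(sample, extract)
```

Both by_class and by_tuple hold the same counts; the tuple version just skips creating an instance per line. Whether that makes a real difference on your file is something to measure, e.g. with timeit on a slice of the data.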