>It never occurred to me to use the Python dict/set approach. Now I
>wonder if it would've worked better somehow. Of course my file was
>26,000 X larger than the one in this problem, and definitely would
>not fit in memory. I suspect that there were as many as a million
>duplicates for some messages in that file. Would the generator
>version above have helped me out, I wonder?

You could use a dbm file, which provides an external dict/set-like
interface through Python bindings. Because the keys are stored on disk
rather than in a Python object, this would use less memory.

1. Add records to the dbm file as keys
2. dbm (if configured correctly) will only keep unique keys
3. Count the keys

Cheers,
Ben
--
http://mail.python.org/mailman/listinfo/python-list
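
A minimal sketch of the three steps above, assuming Python 3's standard
`dbm` module and one record per string. The function name `count_unique`
and the path argument are made up for illustration, and which dbm backend
gets used (and how well it copes with a very large file) depends on the
installation.

```python
import dbm


def count_unique(records, db_path):
    """Count distinct records, using an on-disk dbm file as a set.

    `records` is any iterable of strings (for example, an open file),
    so the input never has to be held in memory all at once.
    """
    # Flag "n" always creates a new, empty database at db_path.
    with dbm.open(db_path, "n") as db:
        for record in records:
            # Step 1: store the record as a key; the value is unused.
            # Step 2: assigning an existing key overwrites it, so
            # duplicates collapse into a single entry.
            db[record.encode("utf-8")] = b""
        # Step 3: the number of keys is the number of unique records.
        return len(db)
```

Something like `count_unique(open("messages.txt"), "seen")` would then
stream the file line by line. Some backends limit key size, so very long
records might need to be hashed first.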