On Sat, 10 Nov 2007 13:56:35 -0800, Michael Bacarella wrote:

> The id2name.txt file is an index of primary keys to strings. They
> look like this:
>
> 11293102971459182412:Descriptive unique name for this record\n
> 950918240981208142:Another name for another record\n
>
> The file's properties are:
>
> # wc -l id2name.txt
> 8191180 id2name.txt
> # du -h id2name.txt
> 517M id2name.txt
>
> I'm loading the file into memory with code like this:
>
> id2name = {}
> for line in iter(open('id2name.txt').readline, ''):
>     id, name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name

That's an awfully complicated way to iterate over a file. Try this
instead:

id2name = {}
for line in open('id2name.txt'):
    id, name = line.strip().split(':')
    id = long(id)
    id2name[id] = name

On my system, it takes about a minute and a half to produce a
dictionary with 8191180 entries.

> This takes about 45 *minutes*
>
> If I comment out the last line in the loop body it takes only about
> 30 _seconds_ to run. This would seem to implicate the line
> id2name[id] = name as being excruciatingly slow.

No, dictionary access is one of the most highly-optimized, fastest,
most efficient parts of Python. What it indicates to me is that your
system is running low on memory, and is struggling to find room for
517MB worth of data.
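If you want to check that hypothesis, a quick sketch along these lines
will print the process's peak resident size before and after the load.
This is Unix-only (it uses the resource module), and it assumes the
Linux convention that ru_maxrss is reported in kilobytes; on Mac OS X
it is in bytes, so adjust accordingly:

import resource

def peak_rss_mb():
    # Peak resident set size of this process so far.
    # Assumes ru_maxrss is in kilobytes (true on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

print 'peak RSS before load: %.0f MB' % peak_rss_mb()
id2name = {}
for line in open('id2name.txt'):
    id, name = line.strip().split(':')
    id2name[long(id)] = name
print 'peak RSS after load: %.0f MB' % peak_rss_mb()

If the second figure is anywhere near your physical RAM, the box is
swapping, and that alone would account for the 45 minutes.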
> Is there a fast, functionally equivalent way of doing this?
>
> (Yes, I really do need this cached. No, an RDBMS or disk-based hash
> is not fast enough.)

You'll pardon me if I'm skeptical. Considering the convoluted, weird
way you had to iterate over a file, I wonder what other
less-than-efficient parts of your code you are struggling under. Nine
times out of ten, if a program runs too slowly, it's because you're
using the wrong algorithm.

--
Steven.
--
http://mail.python.org/mailman/listinfo/python-list