> That's an awfully complicated way to iterate over a file.  Try this
> instead:
>
>     id2name = {}
>     for line in open('id2name.txt'):
>         id,name = line.strip().split(':')
>         id = long(id)
>         id2name[id] = name
>
> > This takes about 45 *minutes*
>
> On my system, it takes about a minute and a half to produce a dictionary
> with 8191180 entries.

Doing something similar on my system is very fast as well.

$ cat dict-8191180.py
#!/usr/bin/python
v = {}
for i in xrange(8191180):
    v[i] = i

$ time ./dict-8191180.py

real    0m5.877s
user    0m4.953s
sys     0m0.924s

But...

> > If I comment out the last line in the loop body it takes only about 30
> > _seconds_ to run.  This would seem to implicate the line
> > id2name[id] = name as being excruciatingly slow.
>
> No, dictionary access is one of the most highly-optimized, fastest, most
> efficient parts of Python.  What it indicates to me is that your system is
> running low on memory, and is struggling to find room for 517MB worth of
> data.

If only it were so easy.

$ free
             total       used       free     shared    buffers     cached
Mem:       7390244    2103448    5286796          0      38996    1982756
-/+ buffers/cache:      81696    7308548
Swap:      2096472      10280    2086192

Here's your Python implementation running as badly as mine did.

$ wc -l id2name.txt
8191180 id2name.txt

$ cat cache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':',1)
    id = long(id)
    id2name[id] = name

$ time ./cache-id2name.py
^C

I let it go 30 minutes before killing it, since I had to leave.  Here it
is in top before I did the deed.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18802 root      25   0 1301m 1.2g 1148 R 99.9 17.1  36:05.99 cache-id2name.p

36 minutes, 05.99 seconds.

To rule out the file reading/parsing logic as the culprit, here's the same
thing, but with the dictionary insertion removed.

$ cat nocache-id2name.py
#!/usr/bin/python
id2name = {}
for line in open('id2name.txt'):
    id,name = line.strip().split(':',1)
    id = long(id)

$ time ./nocache-id2name.py

real    0m33.518s
user    0m33.070s
sys     0m0.415s

Here's a Perl implementation running very fast.
$ cat cache-id2name.pl
#!/usr/bin/perl

my %id2name = ();
my $line;
my $id;
my $name;
open(F,"<id2name.txt");
foreach $line (<F>) {
    chomp($line);
    ($id,$name) = split(/:/,$line,2);
    $id = int($id);
    $id2name{$id} = $name;
}

$ time ./cache-id2name.pl

real    0m46.363s
user    0m43.730s
sys     0m2.611s

So, you think Python's dict implementation degrades towards O(N)
performance when it's fed millions of 64-bit pseudo-random longs?

--
http://mail.python.org/mailman/listinfo/python-list
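[Editor's note: the closing question can be probed directly, without the
id2name.txt data set.  Below is a minimal benchmark sketch, not part of the
original thread; it is written for modern Python 3 (plain ints instead of
longs, time.perf_counter instead of time.time), and the script name, key
count, and seed are arbitrary choices.  It times dict insertion for
sequential keys versus pseudo-random 64-bit keys so the two costs can be
compared on one's own machine.]

```python
#!/usr/bin/env python3
# Hypothetical micro-benchmark (not from the thread): does dict insertion
# slow down when the keys are pseudo-random 64-bit integers rather than
# sequential ones?
import random
import time

N = 1_000_000  # smaller than the thread's 8191180-line data set, for a quick run

def fill(keys):
    """Insert every key into a fresh dict; return the dict and elapsed seconds."""
    d = {}
    t0 = time.perf_counter()
    for k in keys:
        d[k] = None
    return d, time.perf_counter() - t0

sequential = range(N)
rng = random.Random(42)                              # fixed seed: repeatable runs
scattered = [rng.getrandbits(64) for _ in range(N)]  # pseudo-random 64-bit keys

d1, t_seq = fill(sequential)
d2, t_rand = fill(scattered)

print("sequential: %d entries in %.2fs" % (len(d1), t_seq))
print("random:     %d entries in %.2fs" % (len(d2), t_rand))
```

If per-insertion cost were degrading towards O(N) for random keys, the second
timing would dwarf the first as N grows; comparable timings would point the
finger elsewhere (memory pressure, or something outside the dict itself).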