tplib.py from Python 2.3 and also dropped in the one from
Python 2.5 with no difference. Running on Linux kernel 2.6 (CentOS's,
specifically).
Please CC me on any responses as I'm not subscribed [since Python has
worked so flawlessly for me otherwise].
--
Michael Bacarella <[EMAIL PROTECTED]>
For various reasons I need to cache about 8GB of data from disk into core on
application startup.
Building this cache takes nearly 2 hours on modern hardware. I am surprised
to discover that the bottleneck here is CPU.
This is surprising because I expect something like this to
> > For various reasons I need to cache about 8GB of data from disk into
core on
> > application startup.
>
> Are you sure? On PC hardware, at least, doing this doesn't guarantee
> that accessing it is actually going to be any faster. Is just
> mmap()ing the file a problem for some reason?
>
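A minimal sketch of the mmap() approach suggested above (this is not code from the thread; the sample file and record are stand-ins for the real id2name.txt):

```python
import mmap

# Write a one-record stand-in for the thread's id2name.txt.
with open('sample.txt', 'wb') as f:
    f.write(b'11293102971459182412:Descriptive unique name\n')

# Map it read-only: repeat accesses are served from the kernel page
# cache with no copy into a Python-level structure.
with open('sample.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = mm.readline()  # bytes up to and including b'\n'
    mm.close()
```

Whether this helps depends on the access pattern; as discussed below, low locality of reference is exactly the case the poster says rules out on-demand caching.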
> Note that you're not doing the same thing at all. You're
> pre-allocating the array in the C code, but not in Python (and I don't
> think you can). Is there some reason you're growing a 8 gig array 8
> bytes at a time?
>
> They spend about the same amount of time in system, but Python spends 4.7
> > Very sure. If we hit the disk at all performance drops
> > unacceptably. The application has low locality of reference so
> > on-demand caching isn't an option. We get the behavior we want when
> > we pre-cache; the issue is simply that it takes so long to build
> > this cache.
>
> The way I
> > How do you feel about multithreading support?
> >
> > A multithreaded application in Python will only use a single CPU on
> > multi-CPU machines due to the big interpreter lock, whereas the
> > "right thing" happens in Java.
>
> Note that this is untrue for many common uses of threading (e.g. using
> threads to wait on blocking I/O, during which the GIL is released).
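A quick sketch of the rebuttal above (my example, not the thread's): the GIL is released while a thread blocks, so waiting-heavy threads overlap instead of serializing. `time.sleep` stands in for blocking I/O here.

```python
import threading
import time

def wait():
    # The GIL is released for the duration of the sleep, just as it is
    # during a blocking read() or recv().
    time.sleep(0.2)

start = time.time()
threads = [threading.Thread(target=wait) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start  # close to 0.2s total, not 5 * 0.2s
```

CPU-bound pure-Python threads, by contrast, really do serialize on one core, which is the case the original complaint describes.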
> In our company we are looking for one language to be used as default
> language. So far Python looks like a good choice (slacking behind
> Java). A few requirements that the language should be able cope with
> are:
How do you feel about multithreading support?
A multithreaded application in Pyt
> > > > A multithreaded application in Python will only use a single CPU
> > > > on multi-CPU machines due to the big interpreter lock, whereas
> > > > the "right thing" happens in Java.
> > >
> > > Note that this is untrue for many common uses of threading (e.g.
> > > using threads to wait
The id2name.txt file is an index of primary keys to strings. They look like
this:
11293102971459182412:Descriptive unique name for this record\n
950918240981208142:Another name for another record\n
The file's properties are:
# wc -l id2name.txt
8191180 id2name.txt
# du -h id2name.txt
517M
> That's an awfully complicated way to iterate over a file. Try this
> instead:
>
> id2name = {}
> for line in open('id2name.txt'):
>     id, name = line.strip().split(':')
>     id = long(id)
>     id2name[id] = name
>
> > This takes about 45 *minutes*
> >
> On my system, it takes about a minute an
> ----- Original Message -----
> From: Paul Rubin <http://[EMAIL PROTECTED]>
> To: python-list@python.org
> Sent: Sunday, November 11, 2007 12:45:44 AM
> Subject: Re: Populating a dictionary, fast
>
> Michael Bacarella <[EMAIL PROTECTED]> writes:
> > If on
> Steven D'Aprano wrote:
> > (2) More memory will help avoid paging. If you can't get more memory,
> > try more virtual memory. It will still be slow, but at least the
> > operating system doesn't have to try moving blocks around as much.
>
> Based on his previous post, it would seem he has 7GB
Firstly, thank you for all of your help so far, I really appreciate it.
> > So, you think Python's dict implementation degrades towards O(N)
> > performance when it's fed millions of 64-bit pseudo-random longs?
>
> No.
Yes.
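One way to probe the O(N) question above (a sketch in modern Python syntax, not the thread's exact test): insert random 64-bit keys at two sizes and compare per-key cost. With amortized O(1) inserts the cost stays roughly flat as the dict grows.

```python
import random
import time

random.seed(1)

def per_key_cost(n):
    # Pre-generate the keys so only dict insertion is timed.
    keys = [random.getrandbits(64) for _ in range(n)]
    start = time.time()
    d = {}
    for k in keys:
        d[k] = True
    return (time.time() - start) / n  # seconds per insert

small = per_key_cost(100000)
large = per_key_cost(400000)
# For an amortized O(1) hash table, large stays close to small rather
# than growing with the table size.
```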
I tried your code (with one change, time on feedback lines) and got the
same terrible performance against my data set.
> > This would seem to implicate the line id2name[id] = name as being
> > excruciatingly slow.
>
> As others have pointed out, there is no way that this takes 45
> minutes. Must be something with your system or setup.
>
> A functionally equivalent code for me runs in about 49 seconds!
> (it ends up usi
> > I tried your code (with one change, time on feedback lines) and got
> > the same terrible performance against my data set.
> >
> > To prove that my machine is sane, I ran the same against your
> > generated sample file and got _excellent_ performance. Start to
> > finish in under a minute.
> id2name[key >> 40][key & 0x10000000000] = name
Oops, typo. It's actually:
id2name[key >> 40][key & 0xffffffffff] = name
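A self-contained sketch of the two-level dict above: the top 24 bits of a 64-bit key pick a shard, and the low 40 bits (mask 0xffffffffff, i.e. 2**40 - 1, matching the hex(pow(2,40)-1) correction later in the thread) key the inner dict. The helper names are mine, not from the thread.

```python
id2name = {}

def insert(key, name):
    # setdefault creates the shard dict on first use of that prefix.
    id2name.setdefault(key >> 40, {})[key & 0xffffffffff] = name

def lookup(key):
    return id2name[key >> 40][key & 0xffffffffff]

# One of the sample records from the thread's id2name.txt.
insert(11293102971459182412, 'Descriptive unique name for this record')
```

Sharding keeps each individual dict smaller, at the cost of one extra lookup per access.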
--
http://mail.python.org/mailman/listinfo/python-list
> > and see it take about 45 minutes with this:
> >
> > $ cat cache-keys.py
> > #!/usr/bin/python
> > v = {}
> > for line in open('keys.txt'):
> >     v[long(line.strip())] = True
>
> On my system (windows vista) your code (using your data) runs in:
>
> 36 seconds with python 2.4
> 25 seconds with python 2.5
> > You can download the list of keys from here, it's 43M gzipped:
> > http://www.sendspace.com/file/9530i7
> >
> > and see it take about 45 minutes with this:
> >
> > $ cat cache-keys.py
> > #!/usr/bin/python
> > v = {}
> > for line in open('keys.txt'):
> >     v[long(line.strip())] = True
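The same key cache as the loop above can be written in one pass (a sketch in modern Python, where int subsumes the old long type; the sample file here is a stand-in for the thread's keys.txt):

```python
# Stand-in for keys.txt: one decimal 64-bit key per line.
with open('keys_sample.txt', 'w') as f:
    f.write('11293102971459182412\n950918240981208142\n')

def load_keys(path):
    # int() tolerates the trailing newline, so no strip() is needed;
    # dict.fromkeys avoids the per-line assignment statement.
    with open(path) as fh:
        return dict.fromkeys((int(line) for line in fh), True)

v = load_keys('keys_sample.txt')
```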
See end for solution.
> >> (3) Are you sure you need all eight-million-plus items in the cache
> >> all at once?
> >
> > Yes.
>
> I remain skeptical, but what do I know, I don't even know what you're
> doing with the data once you have it :-)
It's OK, I'd be skeptical too. ;)
> Shouldn't this be:
>
> id2name[key >> 40][key & 0xffffffffff] = name
Yes, exactly, I had done hex(pow(2,40)) when I meant hex(pow(2,40)-1)
I sent my correction a few minutes afterwards, but Mailman
queued it for moderator approval (something to do with replying to
myself?)
> On Nov 15, 2:11 pm, Istvan Albert <[EMAIL PROTECTED]> wrote:
> > There is nothing wrong with either creating or deleting
> > dictionaries.
>
> I suspect what happened is this: on 64 bit
> machines the data structures for creating dictionaries
> are larger (because pointers take twice as much space)
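The 64-bit point above can be seen directly (my illustration, not the thread's): dict storage is dominated by pointer-sized slots, so the same table is roughly twice as large on a 64-bit build as on a 32-bit one. `sys.getsizeof` reports the table size for the current build only.

```python
import sys

# A dict with 100,000 int keys; on a 64-bit CPython each hash-table slot
# holds 8-byte pointers, doubling the table versus a 32-bit build.
d = {i: True for i in range(100000)}
table_bytes = sys.getsizeof(d)  # the dict's own table, not the keys/values
print(table_bytes)
```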
> On Thu, 15 Nov 2007 15:51:25 -0500, Michael Bacarella wrote:
>
> > Since some people missed the EUREKA!, here's the executive summary:
> >
> > Python2.3: about 45 minutes
> > Python2.4: about 45 minutes
> > Python2.5: about _30 seconds_
> Do you really believe that you cannot create or delete a large
> dictionary with python versions less than 2.5 (on a 64 bit or multi-
> cpu system)? That a bug of this magnitude has not been noticed until
> someone posted on clp?
You're right, it is completely inappropriate for us to be showing