On 23 April 2013 17:57, Blind Anagram <blindanag...@nowhere.org> wrote:
> On 23/04/2013 15:49, Steven D'Aprano wrote:
>> On Tue, 23 Apr 2013 08:05:53 +0100, Blind Anagram wrote:
>>
>>> I did a lot of work comparing the overall performance of the sieve when
>>> using either lists or arrays and I found that lists were a lot faster
>>> for the majority use case when the sieve is not large.
>>
>> And when the sieve is large?
>
> I don't know, but since the majority use case is when the sieve is small,
> it makes sense to choose a list.
That's an odd comment given what you said at the start of this thread:

Blind Anagram wrote:
> I would be grateful for any advice people can offer on the fastest way
> to count items in a sub-sequence of a large list.
>
> I have a list of boolean values that can contain many hundreds of
> millions of elements for which I want to count the number of True values
> in a sub-sequence, one from the start up to some value (say hi).

>> I don't actually believe that the bottleneck is the cost of taking a list
>> slice. That's pretty fast, even for huge lists, and all efforts to skip
>> making a copy by using itertools.islice actually ended up slower. But
>> suppose it is the bottleneck. Then *sooner or later* arrays will win over
>> lists, simply because they're smaller.
>
> Maybe you have not noticed that, when I am discussing a huge sieve, I am
> simply pushing a sieve designed primarily for small sieve lengths to
> the absolute limit. This is most definitely a minority use case.
>
> In pushing the size of the sieve upwards, it is the slice operation that
> is the first thing that 'breaks'. This is because the slice can be
> almost as big as the primary array, so the OS gets driven into memory
> allocation problems for a sieve that is about half the length it would
> otherwise be. It still works, but the cost of the slice once this point
> is reached rises from about 20% to over 600% because of all the paging
> going on.

You keep mentioning that you want it to work with a large sieve. I would
much rather compute the same quantities with a small sieve if possible.
If you were using the Lehmer/Meissel algorithm you would be able to
compute the same quantity (i.e. pi(1e9)) using a much smaller sieve of
about 30k items instead of 1e9. That would fit *very* comfortably in
memory and you wouldn't even need to slice the list. Or, to put it
another way, you could compute pi(~1e18) using your current sieve
without slicing or paging. If you want to lift the limit on computing
pi(x), this is clearly the way to go.

> The unavailable list.count(value, limit) function would hence allow the
> sieve length to be up to twice as large before running into problems and
> would also cut the 20% slice cost I am seeing on smaller sieve lengths.
>
> So, all I was doing in asking for advice was to check whether there is
> an easy way of avoiding the slice copy, not because this is critical,
> but rather because it is a pity to limit the performance because Python
> forces a (strictly unnecessary) copy in order to perform a count within
> a part of a list.
>
> In other words, the lack of a list.count(value, limit) function makes
> Python less effective than it would otherwise be. I haven't looked at
> Python's C code base, but I still wonder if there is a good reason for NOT
> providing this?

If you feel that this is a good suggestion for an improvement to Python,
consider posting it on python-ideas. I wasn't aware that str.count()
already accepts start and end arguments, but I see that tuple.count(),
like list.count(), takes only the value to count.


Oscar
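
P.S. For what it's worth, if the flags could live in a bytearray rather
than in a list of booleans, the effect of the missing
list.count(value, limit) is already available: the bytes-like count()
method accepts optional start and end positions, so no slice copy is
made. A rough sketch under that assumption (the names below are made up
for illustration, and this is not how the sieve in the thread is
actually stored):

from itertools import islice

# Hypothetical sieve of 0/1 byte flags (the thread uses a list of bools).
sieve = bytearray([0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1])
hi = 8

# bytearray.count() takes start and end positions (and, on Python 3.3+,
# an int value to count), so this counts the flags in sieve[:hi]
# without materialising a copy of the prefix.
count_no_copy = sieve.count(1, 0, hi)

# The pure-list alternatives discussed in the thread, for comparison:
as_list = [bool(b) for b in sieve]
count_via_slice = as_list[:hi].count(True)   # copies the prefix
count_via_islice = sum(islice(as_list, hi))  # no copy, but reported slower above

assert count_no_copy == count_via_slice == count_via_islice == 4

Whether the bytearray actually wins for small sieves would need
measuring, but it at least sidesteps the paging problem in the huge case.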
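
And to make the Lehmer/Meissel suggestion concrete, here is a rough
sketch of Legendre's formula, the simplest member of that family:
pi(x) = phi(x, a) + a - 1 with a = pi(sqrt(x)), where phi(x, a) counts
the integers in [1, x] not divisible by any of the first a primes. The
sieve only ever goes up to sqrt(x), so pi(1e9) needs roughly 31623 flags
rather than 1e9. The function names are mine, math.isqrt needs Python
3.8+, and a serious implementation would truncate the phi recursion
(that is what Meissel and Lehmer add), so this naive version is only
practical for modest x:

from math import isqrt  # Python 3.8+

def small_primes(limit):
    """Primes <= limit by a plain sieve of Eratosthenes over a bytearray."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for i in range(2, isqrt(limit) + 1):
        if flags[i]:
            flags[i*i::i] = bytes(len(range(i*i, limit + 1, i)))
    return [i for i, is_prime in enumerate(flags) if is_prime]

def prime_pi(x):
    """pi(x) via Legendre's formula: phi(x, a) + a - 1, a = pi(sqrt(x))."""
    if x < 2:
        return 0
    primes = small_primes(isqrt(x))
    a = len(primes)

    def phi(n, k):
        # Integers in [1, n] with no prime factor among the first k primes.
        if k == 0 or n == 0:
            return n
        return phi(n, k - 1) - phi(n // primes[k - 1], k - 1)

    return phi(x, a) + a - 1

print(prime_pi(10**6))  # 78498, sieving only up to 1000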