On 23/04/2013 12:08, Oscar Benjamin wrote:
> On 23 April 2013 08:05, Blind Anagram <blindanag...@nowhere.org> wrote:
>> On 23/04/2013 00:01, Steven D'Aprano wrote:
>>> On Mon, 22 Apr 2013 15:15:19 +0100, Blind Anagram wrote:
>>>
>>>> But when using a sub-sequence, I do suffer a significant reduction in
>>>> speed for a count when compared with count on the full list. When the
>>>> list is small enough not to cause memory allocation issues this is
>>>> about 30% on 100,000,000 items. But when the list is 1,000,000,000
>>>> items, OS memory allocation becomes an issue and the cost on my
>>>> system rises to over 600%.
> [snip]
>>>
>>> Another solution might be to use arrays rather than lists. Since your
>>> sieve list is homogeneous, you could possibly use an array of 1 or 0
>>> bytes rather than a list of True or False bools. That would reduce the
>>> memory overhead by a factor of four, and similarly reduce the overhead
>>> of any copying:
>>
>> I did a lot of work comparing the overall performance of the sieve when
>> using either lists or arrays and I found that lists were a lot faster
>> for the majority use case when the sieve is not large.
>
> Okay, now I understand. I thought you were trying to optimise for
> large lists, but in fact you have already optimised for small lists.
> As a result you have made algorithmic choices that don't scale very
> well. Since you're still concerned about performance on small lists
> you don't want to rethink those choices. Instead you want a
> micro-optimisation that would compensate for them.
>
> Elsewhere you said:
>
>> I would not dream of doing this job by copying a slice in any other
>> language that I have used so I was simply asking for advice to discover
>> whether this copy could be avoided whilst staying with the simple sieve
>> design.
>
> So you already knew that there would be problems with this method, but
> you've chosen it anyway since it turned out to be fastest for small
> lists. You could always just do a different thing when the list is
> large:
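For concreteness, here is a minimal sketch (not code from the thread) of the
two representations under discussion: a list of bools, where counting within a
prefix means copying the slice first because list.count() takes no start/stop
arguments, and an array of 0/1 bytes along the lines Steven suggests. The
names `sieve` and `limit` and the size are placeholders, not the poster's
benchmark:

    import array

    limit = 10**6  # placeholder size, much smaller than the cases timed above

    # List-of-bools sieve: list.count() has no start/stop arguments, so
    # counting within a prefix means materialising a copy of that prefix.
    sieve = [True] * limit
    in_prefix = sieve[:limit // 2].count(True)   # allocates a half-size list

    # The same sieve as an array of 0/1 bytes: one byte per entry instead of
    # a pointer-sized list slot, so both the sieve and any slice copies are
    # much smaller.
    byte_sieve = array.array('B', [1]) * limit
    in_prefix_bytes = byte_sieve[:limit // 2].count(1)

Whether the smaller copies outweigh the slower per-element access for small
sieves is exactly the trade-off reported in the measurements quoted above.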
Your analysis of my rationale is sound, except that I only found out that I
had a problem with counting in a subset of a list when I actually tried this
for a huge sieve. It was only then that I discovered that there was no way of
setting the upper limit in list.count(x), and hence that I would have to
create a slice or find an alternative approach.

I then wondered why count for lists has no limits whereas count for other
objects (e.g. strings) does. I also wondered whether there was an easy way of
avoiding the slice - not that this is critical, but rather because it is nice
to have a sieve that still works for huge values, albeit inefficiently.

> def pi(self, n):
>     if n < 1000000:
>         return self.indicator[:n].sum()
>     else:
>         return sum(itertools.islice(self.indicator, n))

I have looked at itertools.islice as a complete replacement for count() and
found that, on average, it was a lot slower. But I have not tried hybrid
strategies as you suggest - I'll take a look at this.

My thanks to you and others for the advice that has been offered.

Brian
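As a rough sketch of the hybrid strategy being suggested (purely illustrative:
the name count_true_upto, the sieve argument and the crossover threshold are
assumptions, not anything measured in the thread), the idea is to keep the
fast slice-and-count path for small prefixes and fall back to an
iterator-based count, which avoids the slice copy, for huge ones:

    import itertools

    def count_true_upto(sieve, n, threshold=1000000):
        """Count the True entries in sieve[:n]."""
        if n < threshold:
            # Small prefix: the slice copy is cheap and list.count() runs in C.
            return sieve[:n].count(True)
        # Huge prefix: no temporary list, just iterate over the first n items;
        # summing booleans counts the True values.
        return sum(itertools.islice(sieve, n))

For example, count_true_upto([True, False, True, True], 3) returns 2.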