Hi Paul,

I agree that jumping straight from reading N rows from 1 partition to N rows from ALL partitions is pretty aggressive. An exponential growth strategy that doubles the partition count each round -- 1, 2, 4, 8, 16, ... -- is much more likely to avoid OOMs than the 1 -> ALL jump.
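Just to make the growth concrete, here's a tiny standalone sketch of that policy (TakeGrowthSketch and its hard-coded numbers are made up for illustration; this isn't the real RDD.take loop, just the scan-growth idea from your message):

object TakeGrowthSketch {
  def main(args: Array[String]): Unit = {
    val totalParts = 4096   // imagine thousands of partitions
    val needed = 10         // rows requested by take(10)
    var rowsCollected = 0   // stays 0 here to model "every partition is empty"
    var partsScanned = 0
    var numPartsToTry = 1   // grows 1, 2, 4, 8, 16, ... each round

    while (rowsCollected < needed && partsScanned < totalParts) {
      val upTo = math.min(partsScanned + numPartsToTry, totalParts)
      println(s"scanning partitions [$partsScanned, $upTo)")
      // a real implementation would read rows from these partitions
      // and update rowsCollected before deciding whether to continue
      partsScanned = upTo
      numPartsToTry *= 2
    }
  }
}

Even in the worst case where nothing is found, this runs about log2(totalParts) rounds instead of one job that reads every remaining partition at once.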
Andrew

On Fri, Aug 22, 2014 at 9:50 AM, pnepywoda <pnepyw...@palantir.com> wrote:

> On line 777
> https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771
> the logic for take() reads ALL partitions if the first one (or first k) are
> empty. This has actually led to OOMs when we had many partitions
> (thousands) and unfortunately the first one was empty.
>
> Wouldn't a better implementation strategy be
>
>     numPartsToTry = partsScanned * 2
>
> instead of
>
>     numPartsToTry = totalParts - 1
>
> (this doubling is similar to most memory allocation strategies)
>
> Thanks!
> - Paul