Hi Paul,

I agree that jumping straight from reading N rows from 1 partition to N rows from ALL partitions is pretty aggressive. An exponential growth strategy that doubles the partition count each round -- 1, 2, 4, 8, 16, ... -- is much more likely to avoid OOMs than the 1 -> ALL jump.
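Just to make the growth concrete, here's a tiny standalone sketch of that policy (TakeGrowthSketch and its hard-coded numbers are made up for illustration; this isn't the real RDD.take loop, just the scan-growth idea from your message):

object TakeGrowthSketch {
  def main(args: Array[String]): Unit = {
    val totalParts = 4096   // imagine thousands of partitions
    val needed = 10         // rows requested by take(10)
    var rowsCollected = 0   // stays 0 here to model "every partition is empty"
    var partsScanned = 0
    var numPartsToTry = 1   // grows 1, 2, 4, 8, 16, ... each round

    while (rowsCollected < needed && partsScanned < totalParts) {
      val upTo = math.min(partsScanned + numPartsToTry, totalParts)
      println(s"scanning partitions [$partsScanned, $upTo)")
      // a real implementation would read rows from these partitions
      // and update rowsCollected before deciding whether to continue
      partsScanned = upTo
      numPartsToTry *= 2
    }
  }
}

Even in the worst case where nothing is found, this runs about log2(totalParts) rounds instead of one job that reads every remaining partition at once.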
Andrew

On Fri, Aug 22, 2014 at 9:50 AM, pnepywoda <pnepyw...@palantir.com> wrote:

> On line 777
> https://github.com/apache/spark/commit/42571d30d0d518e69eecf468075e4c5a823a2ae8#diff-1d55e54678eff2076263f2fe36150c17R771
> the logic for take() reads ALL partitions if the first one (or first k) are
> empty. This has actually led to OOMs when we had many partitions
> (thousands) and unfortunately the first one was empty.
>
> Wouldn't a better implementation strategy be
>
>     numPartsToTry = partsScanned * 2
>
> instead of
>
>     numPartsToTry = totalParts - 1
>
> (this doubling is similar to most memory allocation strategies)
>
> Thanks!
> - Paul