I ran a 124M dataset on my laptop. With isEmpty it took 32 minutes;
without isEmpty it took 18 minutes, and all but 1.5 minutes of that were
spent writing to Elasticsearch, which is on the same laptop.

So excluding the time writing to Elasticsearch, which was nearly the same
in both cases, the core Spark code took roughly 1.5 minutes without
isEmpty versus roughly 15.5 minutes with it.
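FWIW, a minimal sketch of the shape of this pipeline (the data and the
final write are stand-ins for the real code, not the actual job): if the
lineage feeding the write is expensive and uncached, the isEmpty guard
evaluates part of it and the write then re-evaluates all of it, so
persisting before the check keeps the two from double-evaluating.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(
      new SparkConf().setAppName("isEmpty-guard").setMaster("local[*]"))

    // Stand-in for the real indexable RDD; the lineage here is cheap,
    // but imagine an expensive computation feeding it.
    val indexable = sc.parallelize(1 to 1000000, 8)
      .map(i => (i.toString, Map("field" -> i.toString)))

    // Persisting before the emptiness check lets the check and the
    // subsequent write share one evaluation of the lineage, not two.
    indexable.persist(StorageLevel.MEMORY_AND_DISK)

    if (!indexable.isEmpty()) {
      // the Elasticsearch save would go here; count() stands in for it
      println(s"would index ${indexable.count()} docs")
    }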
> ... the driver. This means evaluating at least 1 element of 1 partition.
> I can imagine pathological cases where that's slow, but do you have any
> more info? How slow is slow and what is slow?
>
> On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> I’m getting *huge* execution times on a moderate sized dataset during
>> the RDD.isEmpty. ...
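For reference, in Spark 1.5 RDD.isEmpty boils down to roughly the
following (a paraphrase of the RDD API, not the verbatim source):

    // isEmpty is essentially a take(1) over the RDD:
    def isEmpty(): Boolean =
      partitions.length == 0 || take(1).length == 0

    // take(1) runs a job over the first partition and, if that comes up
    // empty, retries over geometrically more partitions. So on a
    // non-empty RDD it normally evaluates just one partition's lineage,
    // and only degenerates to touching every partition when nearly all
    // of them are empty.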
On Wed, Dec 9, 2015 at 7:49 PM, Pat Ferrel wrote:
> The “Any” is required by the code it is being passed to, which is the
> Elasticsearch Spark index writing code. The values are actually
> RDD[(String, Map[String, String])].

(Is it frequently a big big map by any chance?)

> No shuffle that I know of.
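On the “Any” point, a small self-contained sketch of why the re-typing
comes up at all (writeToIndex is hypothetical, standing in for the
Elasticsearch writer): Scala's immutable Map is covariant in its value
type, but RDD is invariant, so an RDD[(String, Map[String, String])] is
not accepted where an RDD[(String, Map[String, Any])] is expected.

    import org.apache.spark.rdd.RDD

    // Hypothetical stand-in for the Elasticsearch index-writing call.
    def writeToIndex(docs: RDD[(String, Map[String, Any])]): Unit = ()

    def index(rdd: RDD[(String, Map[String, String])]): Unit = {
      // Map[String, String] <: Map[String, Any], but RDD is invariant,
      // so the RDD itself must be re-typed; a widening map works (an
      // asInstanceOf cast would also be safe here, thanks to erasure).
      writeToIndex(rdd.map { case (id, doc) => (id, doc: Map[String, Any]) })
    }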
On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel wrote:
I’m getting *huge* execution times on a moderate sized dataset during the
RDD.isEmpty. Everything in the calculation is fast except an RDD.isEmpty
calculation. I’m using Spark 1.5.1 and from researching I would expect this
calculation to be linearly proportional to the number of partitions as a
worst case.
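To put a number on “how slow is slow,” a throwaway micro-benchmark along
these lines (the sleep is a stand-in for real per-record work) separates
the cost of isEmpty itself from the cost of the lineage it forces: if
isEmpty on the uncached RDD takes about one partition's worth of time,
the check is behaving as documented and the expense is in the lineage.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("isEmpty-timing").setMaster("local[4]"))

    def time[A](label: String)(body: => A): A = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.2f s")
      result
    }

    // 100 partitions, one element each; the sleep fakes expensive work
    // in the lineage feeding the emptiness check.
    val expensive = sc.parallelize(1 to 100, 100).map { i =>
      Thread.sleep(50); i
    }

    time("isEmpty")(expensive.isEmpty()) // ~ one partition's work
    time("count")(expensive.count())     // every partition's work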