Re: RDD.isEmpty

2015-12-09 Thread Pat Ferrel
I ran a 124M dataset on my laptop. With isEmpty it took 32 minutes; without isEmpty it took 18 minutes. All but 1.5 minutes were in writing to Elasticsearch, which is on the same laptop. So excluding the time writing to Elasticsearch, which was nearly the same in both cases, the core Spark code took …
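The roughly doubled run time is consistent with an uncached lineage being evaluated twice: once by the isEmpty check and once by the real action (the Elasticsearch write). A minimal plain-Scala sketch of that pattern (no Spark; the expensive lineage is modeled as a counting function, and all names here are hypothetical):

```scala
// Plain-Scala sketch (no Spark): if the "lineage" is not cached, an
// emptiness check plus the real action evaluates it twice; computing it
// once and reusing the result evaluates it only once.
object RecomputeSketch {
  var evaluations = 0

  // Stands in for an expensive, lazily recomputed RDD lineage.
  def expensiveLineage(): Seq[Int] = {
    evaluations += 1
    Seq(1, 2, 3)
  }

  def run(): (Int, Int) = {
    // Uncached: the isEmpty-style check and the "write" each recompute.
    evaluations = 0
    if (expensiveLineage().nonEmpty) { val toWrite = expensiveLineage() }
    val uncachedEvals = evaluations // lineage evaluated twice

    // Cached: compute once, reuse for both the check and the write.
    evaluations = 0
    val cached = expensiveLineage()
    if (cached.nonEmpty) { val toWrite = cached }
    val cachedEvals = evaluations // lineage evaluated once

    (uncachedEvals, cachedEvals)
  }

  def main(args: Array[String]): Unit =
    println(run()) // (2,1)
}
```

In Spark terms the second pattern corresponds to calling `cache()`/`persist()` on the RDD before both the emptiness check and the write, so the lineage is materialized only once.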

Re: RDD.isEmpty

2015-12-09 Thread Sean Owen
On Wed, Dec 9, 2015 at 7:49 PM, Pat Ferrel wrote:
> The “Any” is required by the code it is being passed to, which is the Elasticsearch Spark index writing code. The values are actually RDD[(String, Map[String, String])]

(Is it frequently a big big map by any chance?)

> No shuffle that I know …

Re: RDD.isEmpty

2015-12-09 Thread Sean Owen
… the driver. This means evaluating at least 1 element of 1 partition. I can imagine pathological cases where that's slow, but do you have any more info? How slow is slow, and what is slow?

On Wed, Dec 9, 2015 at 4:41 PM, Pat Ferrel wrote:
> I’m getting *huge* execution times …
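Sean's description matches how Spark 1.5 implements isEmpty: roughly `partitions.length == 0 || take(1).length == 0`, so deciding emptiness means actually computing at least one element of one partition. A plain-Scala sketch of that check, with partitions modeled as lazy iterators (all names here are hypothetical, not Spark's API):

```scala
// Plain-Scala sketch of the check described above: emptiness is decided by
// trying to pull one element, so at least one element of one partition must
// be computed (and every partition is scanned when all turn out empty).
object IsEmptySketch {
  // Each "partition" is a thunk producing its lazily computed elements.
  def isEmpty[T](partitions: Seq[() => Iterator[T]]): Boolean =
    !partitions.exists(p => p().hasNext)

  def main(args: Array[String]): Unit = {
    println(isEmpty(Seq.empty[() => Iterator[Int]]))  // true: no partitions at all
    println(isEmpty(Seq(() => Iterator(1, 2, 3))))    // false: stops after one element
    println(isEmpty(Seq(() => Iterator[Int](), () => Iterator[Int]()))) // true
  }
}
```

The pathological case Sean mentions follows directly: if producing even that first element requires recomputing an expensive, uncached lineage, the "cheap" check pays the full cost of the computation.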

RDD.isEmpty

2015-12-09 Thread Pat Ferrel
I’m getting *huge* execution times on a moderate-sized dataset during the RDD.isEmpty. Everything in the calculation is fast except an RDD.isEmpty calculation. I’m using Spark 1.5.1, and from researching I would expect this calculation to be linearly proportional to the number of partitions as a …