Hi Sean:

Thanks for your feedback. Scala is much faster: the count completes in
~1 minute (vs. 17 minutes with pyspark). I would expect Scala to be 2-5x
faster, but this gap seems larger than that. Is that also your conclusion?
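
To put the numbers in this thread in perspective, here is a rough
back-of-the-envelope sketch in Python. It assumes 1 GB = 1024 MB and that
all 6 cores (3 executors x 2 cores) were busy in both runs; the helper name
scan_stats is made up for illustration and is not part of Spark.

```python
import math


def scan_stats(size_gb, rows, minutes, cores):
    """Return rough throughput figures for a full scan such as count()."""
    seconds = minutes * 60
    mb_per_s = size_gb * 1024 / seconds  # assumes 1 GB = 1024 MB
    rows_per_s = rows / seconds
    return {
        "mb_per_s": mb_per_s,
        "mb_per_s_per_core": mb_per_s / cores,
        "rows_per_s_per_core": rows_per_s / cores,
    }


# Figures quoted in this thread: 10 GB, 100M rows, 3 executors x 2 cores.
CORES = 3 * 2
pyspark_run = scan_stats(size_gb=10, rows=100_000_000, minutes=17, cores=CORES)
scala_run = scan_stats(size_gb=10, rows=100_000_000, minutes=1, cores=CORES)

# Ratio of the two runs (same data, same cores, so it is just 17 min / 1 min).
speedup = scala_run["mb_per_s"] / pyspark_run["mb_per_s"]

# 96 HDFS blocks read by 6 concurrent task slots: tasks run in several waves.
task_waves = math.ceil(96 / CORES)

print(f"pyspark: {pyspark_run['mb_per_s']:.1f} MB/s total, "
      f"{pyspark_run['mb_per_s_per_core']:.2f} MB/s per core")
print(f"scala:   {scala_run['mb_per_s']:.1f} MB/s total, "
      f"{scala_run['mb_per_s_per_core']:.2f} MB/s per core")
print(f"speedup: {speedup:.1f}x, task waves: {task_waves}")
```

With these inputs the pyspark run works out to roughly 10 MB/s across the
whole cluster against roughly 170 MB/s for the Scala run, i.e. about 17x,
and the 96 blocks are processed in 16 waves of 6 tasks.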

Thanks.


Best,

Guillaume Guy

* +1 919 - 972 - 8750*

On Fri, Feb 27, 2015 at 9:12 AM, Sean Owen <so...@cloudera.com> wrote:

> That's very slow, and there are a lot of possible explanations. The
> first one that comes to mind is: I assume your YARN and HDFS are on
> the same machines, but are you running executors on all HDFS nodes
> when you run this? If not, a lot of these reads could be remote.
>
> You have 6 executor slots, but your data exists in 96 blocks on HDFS.
> You could read with up to 96-way parallelism. You say you're CPU-bound
> though, but normally I'd wonder if this was simply a case of
> under-using parallelism.
>
> I also wonder if the bottleneck is something to do with pyspark in
> this case; might be good to just try it in the spark-shell to check.
>
> On Fri, Feb 27, 2015 at 2:00 PM, Guillaume Guy
> <guillaume.c....@gmail.com> wrote:
> > Dear Spark users:
> >
> > I want to see if anyone has an idea of the performance for a small
> > cluster.
> >
> > Reading from HDFS, what should be the performance of a count()
> > operation on a 10GB RDD with 100M rows using pyspark? I looked into
> > the CPU usage; all 6 cores are at 100%.
> >
> > Details:
> >
> > master yarn-client
> > num-executors 3
> > executor-cores 2
> > driver-memory 5g
> > executor-memory 2g
> > Distribution: Cloudera
> >
> > I also attached the screenshot.
> >
> > Right now, I'm at 17 minutes, which seems quite slow. Any idea what
> > decent performance would be with a similar configuration?
> >
> > If it's way off, I would appreciate any pointers as to ways to improve
> > performance.
> >
> > Thanks.
> >
> > Best,
> >
> > Guillaume
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
>
