Hi Sean:

Thanks for your feedback. Scala is much faster: the same count completes in ~1 minute in the spark-shell (vs. 17 min with pyspark). I would expect Scala to be 2-5x faster, but this gap seems to be larger than that. Is that also your conclusion?
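
For reference, here is a rough back-of-the-envelope sketch of the numbers in this thread. It is plain Python (no Spark needed). The constants come from the messages in this thread; the helper name `summarize`, the 1 GB = 1024 MB conversion, and the "one task per HDFS block" scheduling model are my own simplifying assumptions, not something measured on the cluster.

```python
import math

# Figures taken from this thread.
DATA_MB = 10 * 1024          # 10 GB RDD (assumes 1 GB = 1024 MB)
ROWS = 100_000_000           # 100M rows
HDFS_BLOCKS = 96             # blocks reported by Sean
SLOTS = 3 * 2                # num-executors * executor-cores
PYSPARK_SECONDS = 17 * 60    # observed pyspark count()
SCALA_SECONDS = 1 * 60       # "~1 minute" in Scala, approximate


def summarize(seconds):
    """Aggregate and per-core throughput for a run of the given duration."""
    return {
        "mb_per_s": DATA_MB / seconds,
        "rows_per_s_per_core": ROWS / seconds / SLOTS,
    }


# Observed gap between the two runs.
speedup = PYSPARK_SECONDS / SCALA_SECONDS

# Assumption: one task per HDFS block, so the 6 slots need
# several "waves" of tasks to get through all the blocks.
waves = math.ceil(HDFS_BLOCKS / SLOTS)

if __name__ == "__main__":
    print("pyspark:", summarize(PYSPARK_SECONDS))
    print("scala:  ", summarize(SCALA_SECONDS))
    print("speedup:", speedup, "task waves:", waves)
```

Under these assumptions the pyspark run reads only about 10 MB/s across the whole cluster, which seems low for six busy cores and is why the gap looks larger than the 2-5x I expected.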
Thanks.

Best,

Guillaume Guy
+1 919 - 972 - 8750

On Fri, Feb 27, 2015 at 9:12 AM, Sean Owen <so...@cloudera.com> wrote:

> That's very slow, and there are a lot of possible explanations. The
> first one that comes to mind is: I assume your YARN and HDFS are on
> the same machines, but are you running executors on all HDFS nodes
> when you run this? if not, a lot of these reads could be remote.
>
> You have 6 executor slots, but your data exists in 96 blocks on HDFS.
> You could read with up to 96-way parallelism. You say you're CPU-bound
> though, but normally I'd wonder if this was simply a case of
> under-using parallelism.
>
> I also wonder if the bottleneck is something to do with pyspark in
> this case; might be good to just try it in the spark-shell to check.
>
> On Fri, Feb 27, 2015 at 2:00 PM, Guillaume Guy
> <guillaume.c....@gmail.com> wrote:
> > Dear Spark users:
> >
> > I want to see if anyone has an idea of the performance for a small cluster.
> >
> > Reading from HDFS, what should be the performance of a count() operation on
> > an 10GB RDD with 100M rows using pyspark. I looked into the CPU usage, all 6
> > are at 100%.
> >
> > Details:
> >
> > master yarn-client
> > num-executors 3
> > executor-cores 2
> > driver-memory 5g
> > executor-memory 2g
> > Distribution: Cloudera
> >
> > I also attached the screenshot.
> >
> > Right now, I'm at 17 minutes which seems quite slow. Any idea how a decent
> > performance with similar configuration?
> >
> > If it's way off, I would appreciate any pointers as to ways to improve
> > performance.
> >
> > Thanks.
> >
> > Best,
> >
> > Guillaume
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org