That's very slow, and there are a lot of possible explanations. The first one that comes to mind: I assume your YARN and HDFS daemons are on the same machines, but are you running executors on all of the HDFS nodes when you run this? If not, a lot of these reads could be remote rather than node-local.
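To make that concrete, here is a rough pyspark sketch of asking YARN for one executor per HDFS datanode. It is not taken from your job: the master and memory settings just mirror the configuration you posted, and the executor count of 3 is a placeholder you'd replace with the actual number of datanodes in your cluster.

    from pyspark import SparkConf, SparkContext

    # Rough sketch: request one executor per HDFS datanode so that most block
    # reads stay node-local. Replace 3 with the number of datanodes you have.
    conf = (SparkConf()
            .setMaster("yarn-client")
            .set("spark.executor.instances", "3")
            .set("spark.executor.cores", "2")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)

YARN doesn't strictly guarantee one container per node, but requesting at least as many executors as you have datanodes makes locality much more likely.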
You have 6 executor slots, but your data lives in 96 blocks on HDFS, so you could read with up to 96-way parallelism. You say you're CPU-bound, but normally I'd wonder whether this is simply a case of under-using parallelism. I also wonder whether the bottleneck is something to do with pyspark in this case; it might be good to just try the same count in the spark-shell to check (there's a short pyspark sketch of what to look at below the quoted message).

On Fri, Feb 27, 2015 at 2:00 PM, Guillaume Guy <guillaume.c....@gmail.com> wrote:
> Dear Spark users:
>
> I want to see if anyone has an idea of the performance to expect from a small cluster.
>
> Reading from HDFS, what should be the performance of a count() operation on a 10GB RDD with 100M rows using pyspark? I looked at the CPU usage; all 6 cores are at 100%.
>
> Details:
>
> master yarn-client
> num-executors 3
> executor-cores 2
> driver-memory 5g
> executor-memory 2g
> Distribution: Cloudera
>
> I also attached the screenshot.
>
> Right now I'm at 17 minutes, which seems quite slow. Any idea what decent performance with a similar configuration would look like?
>
> If it's way off, I would appreciate any pointers as to ways to improve performance.
>
> Thanks.
>
> Best,
>
> Guillaume
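As a postscript, here is the quick pyspark check I'd run first. It's only a sketch: the HDFS path is a placeholder, and the numbers in the comments assume the 3-executor / 2-core setup and the 96 blocks mentioned above.

    # Placeholder path; substitute the real HDFS location of the 10GB input.
    rdd = sc.textFile("hdfs:///path/to/10gb/input")

    # textFile gives roughly one partition per HDFS block, so about 96 here.
    print(rdd.getNumPartitions())

    # 3 executors x 2 cores = 6 slots, so only 6 of those ~96 partitions
    # are being scanned at any one time.
    print(sc.defaultParallelism)

    # Time this, then time the equivalent count in the spark-shell; a large
    # gap would point at PySpark overhead rather than the cluster itself.
    print(rdd.count())

If the partition count and slot count look like the above, giving the job more executor cores (up to the number of blocks) is the first thing I'd try.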