Dear Spark users: I want to see if anyone has an idea of what performance to expect on a small cluster.
Reading from HDFS, what should the performance of a count() operation on a 10GB RDD with 100M rows be in pyspark? Looking at CPU usage, all 6 executor cores are at 100%.

Details:
- master: yarn-client
- num-executors: 3
- executor-cores: 2
- driver-memory: 5g
- executor-memory: 2g
- Distribution: Cloudera

I also attached a screenshot. Right now the count takes 17 minutes, which seems quite slow. Any idea what decent performance would look like with a similar configuration? If mine is way off, I would appreciate any pointers on ways to improve it.

Thanks.

Best,
Guillaume
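For reference, here is a minimal sketch of roughly how the job is being run; the HDFS path and script name are placeholders, not the actual ones:

    # Launched roughly as:
    #   spark-submit --master yarn-client --num-executors 3 --executor-cores 2 \
    #                --driver-memory 5g --executor-memory 2g count_test.py
    from pyspark import SparkContext

    sc = SparkContext(appName="count_test")

    # ~10 GB text file on HDFS, ~100M rows; the path is a placeholder.
    rdd = sc.textFile("hdfs:///path/to/10gb-dataset")

    # The operation being timed: a full scan and count of the RDD.
    print(rdd.count())

    sc.stop()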