Dear Spark users:

I want to see if anyone has an idea of what performance to expect from a small cluster.

Reading from HDFS, what sort of performance should I expect from a count() operation
on a 10GB RDD with 100M rows using pyspark? Looking at CPU usage, all 6 cores
are at 100%.
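
Roughly, the job boils down to something like this (the HDFS path below is a
placeholder, not the real one):

    import time
    from pyspark import SparkContext

    sc = SparkContext(appName="CountBenchmark")

    # Placeholder path; the real dataset is ~10GB / ~100M rows on HDFS
    rdd = sc.textFile("hdfs:///path/to/data")

    start = time.time()
    n = rdd.count()  # full scan of the input, nothing cached beforehand
    print("counted %d rows in %.1f s" % (n, time.time() - start))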

Details:

   - master yarn-client
   - num-executors 3
   - executor-cores 2
   - driver-memory 5g
   - executor-memory 2g
   - Distribution: Cloudera

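In code form, that configuration corresponds to roughly the following (with
the caveat that in yarn-client mode the driver JVM is already running, so
driver memory has to be passed to spark-submit as --driver-memory 5g rather
than set in the SparkConf):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-client")
            .set("spark.executor.instances", "3")  # same as --num-executors 3
            .set("spark.executor.cores", "2")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)
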
I've also attached a screenshot.

Right now the count takes about 17 minutes, which seems quite slow. Does anyone
know what decent performance would look like with a similar configuration?

If that's way off, I would appreciate any pointers on how to improve
performance.

Thanks.

Best,

Guillaume