That's very slow, and there are a lot of possible explanations. The
first one that comes to mind: I assume your YARN and HDFS daemons are
on the same machines, but are you actually running executors on all
of the HDFS nodes when you run this? If not, a lot of these reads
could be remote rather than node-local.
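
For example, a minimal sketch in pyspark (the instance count is
illustrative; spark.executor.instances is the conf equivalent of
--num-executors on YARN):

    # Sketch: ask YARN for roughly one executor per datanode so most
    # HDFS reads can be node-local. Adjust the count to your cluster.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-client")
            .set("spark.executor.instances", "6")  # ~ one per HDFS node
            .set("spark.executor.cores", "2"))
    sc = SparkContext(conf=conf)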

You have 6 executor slots (3 executors x 2 cores), but your data
exists in 96 blocks on HDFS, so you could read with up to 96-way
parallelism. You say you're CPU-bound, but normally I'd wonder
whether this is simply a case of under-using parallelism.
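
You can check the partition count from pyspark; a quick sketch (the
HDFS path here is hypothetical):

    # textFile() normally creates roughly one partition per HDFS
    # block, so expect ~96 partitions for this file.
    rdd = sc.textFile("hdfs:///path/to/10gb-file")
    print(rdd.getNumPartitions())
    # With 3 executors x 2 cores = 6 slots, those ~96 tasks queue
    # about 16 deep per slot instead of running at full width.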

I also wonder if the bottleneck is something to do with pyspark in
this case; it might be good to try the same count in spark-shell to
check.
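
Something like this in spark-shell (Scala, since that's what
spark-shell runs; the path again is hypothetical) would give you a
Python-free baseline to compare against:

    // Same count without PySpark's Python<->JVM overhead.
    val rdd = sc.textFile("hdfs:///path/to/10gb-file")
    rdd.count()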

On Fri, Feb 27, 2015 at 2:00 PM, Guillaume Guy
<guillaume.c....@gmail.com> wrote:
> Dear Spark users:
>
> I want to see if anyone has an idea of the performance for a small cluster.
>
> Reading from HDFS, what should the performance of a count() operation be on
> a 10GB RDD with 100M rows using pyspark? I looked at the CPU usage; all 6
> cores are at 100%.
>
> Details:
>
> master yarn-client
> num-executors 3
> executor-cores 2
> driver-memory 5g
> executor-memory 2g
> Distribution: Cloudera
>
> I also attached a screenshot.
>
> Right now, I'm at 17 minutes, which seems quite slow. Any idea what decent
> performance with a similar configuration would look like?
>
> If it's way off, I would appreciate any pointers as to ways to improve
> performance.
>
> Thanks.
>
> Best,
>
> Guillaume

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
