What is this dataset: a text file or a Parquet file? There is a serialization issue in Spark SQL that makes it very slow; see https://issues.apache.org/jira/browse/SPARK-6055. It will be fixed very soon.
Davies

On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy <guillaume.c....@gmail.com> wrote:
> Hi Sean:
>
> Thanks for your feedback. Scala is much faster: the count is performed in
> ~1 minute (vs. 17 minutes). I would expect Scala to be 2-5x faster, but
> this gap seems to be more than that. Is that also your conclusion?
>
> Thanks.
>
> Best,
>
> Guillaume Guy
> +1 919 - 972 - 8750
>
> On Fri, Feb 27, 2015 at 9:12 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> That's very slow, and there are a lot of possible explanations. The
>> first one that comes to mind is: I assume your YARN and HDFS are on
>> the same machines, but are you running executors on all HDFS nodes
>> when you run this? If not, a lot of these reads could be remote.
>>
>> You have 6 executor slots, but your data exists in 96 blocks on HDFS.
>> You could read with up to 96-way parallelism. You say you're CPU-bound,
>> though; otherwise I'd wonder if this was simply a case of under-using
>> parallelism.
>>
>> I also wonder if the bottleneck is something to do with PySpark in
>> this case; it might be good to just try it in the spark-shell to check.
>>
>> On Fri, Feb 27, 2015 at 2:00 PM, Guillaume Guy
>> <guillaume.c....@gmail.com> wrote:
>> > Dear Spark users:
>> >
>> > I want to see if anyone has an idea of the performance for a small
>> > cluster.
>> >
>> > Reading from HDFS, what should be the performance of a count()
>> > operation on a 10GB RDD with 100M rows using PySpark? I looked into
>> > the CPU usage; all 6 cores are at 100%.
>> >
>> > Details:
>> >
>> > master: yarn-client
>> > num-executors: 3
>> > executor-cores: 2
>> > driver-memory: 5g
>> > executor-memory: 2g
>> > Distribution: Cloudera
>> >
>> > I also attached the screenshot.
>> >
>> > Right now I'm at 17 minutes, which seems quite slow. Any idea what
>> > decent performance looks like with a similar configuration?
>> >
>> > If it's way off, I would appreciate any pointers as to ways to
>> > improve performance.
>> >
>> > Thanks.
>> >
>> > Best,
>> >
>> > Guillaume
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
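Sean's parallelism point in the thread above can be sketched with some simple arithmetic (a plain-Python illustration; the block counts and executor settings are taken from the thread, everything else is just arithmetic, not Spark behavior):

```python
import math

# From the thread: the 10GB file spans 96 HDFS blocks, so Spark can
# schedule up to 96 read tasks. The cluster config gives 3 executors
# with 2 cores each, i.e. only 6 concurrent task slots.
hdfs_blocks = 96
num_executors = 3
executor_cores = 2

task_slots = num_executors * executor_cores      # concurrent tasks
waves = math.ceil(hdfs_blocks / task_slots)      # sequential "waves" of tasks

print(task_slots, waves)  # → 6 16
```

So even with every core at 100%, the 96 tasks run in 16 sequential waves on 6 slots; adding executors (up to the number of blocks) shortens the job roughly in proportion, assuming the reads stay local.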