No, it should not be that slow. On my Mac, it took 1.4 minutes to do `rdd.count()` on a 4.3 GB text file (roughly 25 MB/s per CPU).

Could you turn on profiling in PySpark to see what is happening in the Python process? Set spark.python.profile = true.
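For example, a minimal sketch of enabling the profiler and re-running the count from PySpark (the HDFS path below is only a placeholder for your actual file):

    from pyspark import SparkConf, SparkContext

    # The profiler has to be enabled before the SparkContext is created.
    conf = SparkConf().set("spark.python.profile", "true")
    sc = SparkContext(conf=conf)

    # Placeholder path -- substitute the real location of your 10 GB file.
    rdd = sc.textFile("hdfs:///path/to/your/file")
    print(rdd.count())

    # Print the accumulated cProfile stats for the Python workers.
    sc.show_profiles()

With the profiler on, sc.show_profiles() should show where the Python workers are spending their time.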
On Fri, Feb 27, 2015 at 4:14 PM, Guillaume Guy <guillaume.c....@gmail.com> wrote:
> It is a simple text file.
>
> I'm not using SQL, just doing an rdd.count() on it. Does the bug affect it?
>
> On Friday, February 27, 2015, Davies Liu <dav...@databricks.com> wrote:
>>
>> What is this dataset? A text file or a Parquet file?
>>
>> There is an issue with serialization in Spark SQL which makes it very
>> slow; see https://issues.apache.org/jira/browse/SPARK-6055. It will be
>> fixed very soon.
>>
>> Davies
>>
>> On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy
>> <guillaume.c....@gmail.com> wrote:
>> > Hi Sean:
>> >
>> > Thanks for your feedback. Scala is much faster: the count completes in
>> > ~1 minute (vs. 17 minutes). I would expect Scala to be 2-5x faster, but
>> > this gap seems to be more than that. Is that also your conclusion?
>> >
>> > Thanks.
>> >
>> > Best,
>> >
>> > Guillaume Guy
>> > +1 919 - 972 - 8750
>> >
>> > On Fri, Feb 27, 2015 at 9:12 AM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> That's very slow, and there are a lot of possible explanations. The
>> >> first one that comes to mind: I assume your YARN and HDFS are on the
>> >> same machines, but are you running executors on all HDFS nodes when
>> >> you run this? If not, a lot of these reads could be remote.
>> >>
>> >> You have 6 executor slots, but your data exists in 96 blocks on HDFS,
>> >> so you could read with up to 96-way parallelism. You say you're
>> >> CPU-bound, though; normally I'd wonder if this was simply a case of
>> >> under-using parallelism.
>> >>
>> >> I also wonder if the bottleneck is something to do with pyspark in
>> >> this case; it might be good to just try it in the spark-shell to
>> >> check.
>> >>
>> >> On Fri, Feb 27, 2015 at 2:00 PM, Guillaume Guy
>> >> <guillaume.c....@gmail.com> wrote:
>> >> > Dear Spark users:
>> >> >
>> >> > I want to see if anyone has an idea of the performance to expect from
>> >> > a small cluster.
>> >> >
>> >> > Reading from HDFS, what should be the performance of a count()
>> >> > operation on a 10 GB RDD with 100M rows using pyspark? I looked at
>> >> > the CPU usage; all 6 cores are at 100%.
>> >> >
>> >> > Details:
>> >> >
>> >> > master yarn-client
>> >> > num-executors 3
>> >> > executor-cores 2
>> >> > driver-memory 5g
>> >> > executor-memory 2g
>> >> > Distribution: Cloudera
>> >> >
>> >> > I also attached a screenshot.
>> >> >
>> >> > Right now I'm at 17 minutes, which seems quite slow. Any idea what
>> >> > decent performance would look like with a similar configuration?
>> >> >
>> >> > If it's way off, I would appreciate any pointers as to ways to
>> >> > improve performance.
>> >> >
>> >> > Thanks.
>> >> >
>> >> > Best,
>> >> >
>> >> > Guillaume
>
> --
>
> Best,
>
> Guillaume Guy
> +1 919 - 972 - 8750
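As a rough sketch of Sean's point about parallelism, you can check how many input partitions the read actually produced and compare that with the number of executor slots (the path is again a placeholder, and the executor numbers are only examples):

    # Assumes the pyspark shell, where sc is already defined.
    rdd = sc.textFile("hdfs:///path/to/your/file")

    # This should roughly line up with the 96 HDFS blocks Sean mentioned.
    print(rdd.getNumPartitions())

    # With num-executors 3 and executor-cores 2, only 6 of those partitions
    # run at a time; asking YARN for more slots, e.g. --num-executors 6
    # --executor-cores 4 on spark-submit, lets more of them run in parallel.
    print(rdd.count())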