Re: Speed Benchmark

2015-03-04 Thread Guillaume Guy
Sorry for the confusion. All are running Hadoop services. Node 1 is the namenode whereas Nodes 2 and 3 are datanodes. Best, Guillaume Guy * +1 919 - 972 - 8750* On Sat, Feb 28, 2015 at 1:09 AM, Sean Owen wrote: > Is machine 1 the only one running an HDFS data node? You describe it as > one

Re: Speed Benchmark

2015-02-27 Thread Davies Liu
No. It should not be that slow. On my Mac, it took 1.4 minutes to do `rdd.count()` on a 4.3G text file (25M/s/CPU). Could you turn on profiling in pyspark to see what happens in the Python process? spark.python.profile = true On Fri, Feb 27, 2015 at 4:14 PM, Guillaume Guy wrote: > It is a simple
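A minimal sketch of how the profiling suggestion above might be applied, assuming PySpark 1.2 or later is installed. The helper name `count_with_profile`, the app name, and the path argument are hypothetical; only `spark.python.profile`, `rdd.count()` and `sc.show_profiles()` come from the message or from PySpark itself.

```python
# Setting suggested in the message; Spark config values are passed as strings.
PROFILE_CONF = {"spark.python.profile": "true"}


def count_with_profile(path):
    """Count the lines of a text file with the Python profiler enabled.

    `path` is a placeholder for the HDFS or local path of the dataset.
    """
    # Imported lazily so this module loads even where pyspark is absent.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("count-profile")  # hypothetical app name
    for key, value in PROFILE_CONF.items():
        conf.set(key, value)

    sc = SparkContext(conf=conf)
    try:
        n = sc.textFile(path).count()
        # Print the cProfile statistics collected in the Python workers.
        sc.show_profiles()
        return n
    finally:
        sc.stop()
```

The same setting can also be passed on the command line with `--conf spark.python.profile=true`.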

Re: Speed Benchmark

2015-02-27 Thread Guillaume Guy
It is a simple text file. I'm not using SQL, just doing an rdd.count() on it. Does the bug affect it? On Friday, February 27, 2015, Davies Liu wrote: > What is this dataset? text file or parquet file? > > There is an issue with serialization in Spark SQL, which will make it > very slow, see http

Re: Speed Benchmark

2015-02-27 Thread Sean Owen
Is machine 1 the only one running an HDFS data node? You describe it as one running Hadoop services. On Feb 27, 2015 9:44 PM, "Guillaume Guy" wrote: > Hi Jason: > > Thanks for your feedback. > > Beside the information above I mentioned, there are 3 machines in the > cluster. > > *1st one*: Driver

Re: Speed Benchmark

2015-02-27 Thread Davies Liu
What is this dataset? A text file or a Parquet file? There is an issue with serialization in Spark SQL that makes it very slow; see https://issues.apache.org/jira/browse/SPARK-6055. It will be fixed very soon. Davies On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy wrote: > Hi Sean: > > Thanks for

Re: Speed Benchmark

2015-02-27 Thread Guillaume Guy
Hi Sean: Thanks for your feedback. Scala is much faster. The count is performed in ~1 minute (vs 17 min). I would expect Scala to be 2-5X faster, but this gap seems to be more than that. Is that also your conclusion? Thanks. Best, Guillaume Guy * +1 919 - 972 - 8750* On Fri, Feb 27, 2015 at 9
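The gap described above can be checked with a little arithmetic. This is only a sketch using the two timings and the 2-5X expectation quoted in the message; the variable names are illustrative.

```python
# Timings reported in the message (minutes).
scala_minutes = 1.0    # count in Scala, roughly 1 minute
python_minutes = 17.0  # count in PySpark, 17 minutes

# Slowdown range the poster said he would expect for Python vs Scala.
expected_range = (2.0, 5.0)

observed = python_minutes / scala_minutes
within_expected = expected_range[0] <= observed <= expected_range[1]

print("observed slowdown: %.0fx, within expected 2-5x: %s"
      % (observed, within_expected))
```

The observed ratio is about 17x, several times above the upper end of the expected range, which is consistent with the poster's suspicion that something other than language overhead is involved.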

Re: Speed Benchmark

2015-02-27 Thread Guillaume Guy
Hi Jason: Thanks for your feedback. Besides the information I mentioned above, there are 3 machines in the cluster. *1st one*: Driver + has a bunch of Hadoop services. 32GB of RAM, 8 cores (2 used) *2nd + 3rd*: 16GB of RAM, 4 cores (2 used each) I hope this helps clarify. Thx. GG Best, Gui

Re: Speed Benchmark

2015-02-27 Thread Jason Bell
How many machines are on the cluster? And what is the configuration of those machines (Cores/RAM)? "Small cluster" is a very subjective statement. Guillaume Guy wrote: Dear Spark users: I want to see if anyone has an idea of the performance for a small cluster.

Re: Speed Benchmark

2015-02-27 Thread Sean Owen
That's very slow, and there are a lot of possible explanations. The first one that comes to mind is: I assume your YARN and HDFS are on the same machines, but are you running executors on all HDFS nodes when you run this? If not, a lot of these reads could be remote. You have 6 executor slots, but