Sorry for the confusion.
All are running Hadoop services. Node 1 is the namenode whereas Nodes 2 and
3 are datanodes.
Best,
Guillaume Guy
+1 919-972-8750
On Sat, Feb 28, 2015 at 1:09 AM, Sean Owen wrote:
> Is machine 1 the only one running an HDFS data node? You describe it as
> one
No. It should not be that slow. On my Mac, it took 1.4 minutes to run
`rdd.count()` on a 4.3G text file (about 25 MB/s per CPU).
Could you turn on profiling in PySpark to see what happens in the Python process?
spark.python.profile = true
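For reference, that setting can go in spark-defaults.conf or be passed at submit time (a config sketch, not specific to your job):

```
# in conf/spark-defaults.conf, or at launch:
#   spark-submit --conf spark.python.profile=true your_job.py
spark.python.profile  true
```

With it enabled, calling `sc.show_profiles()` after an action such as `rdd.count()` prints the accumulated cProfile stats from the Python workers.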
On Fri, Feb 27, 2015 at 4:14 PM, Guillaume Guy
wrote:
> It is a simple
It is a simple text file.
I'm not using Spark SQL, just doing an rdd.count() on it. Does the bug affect it?
On Friday, February 27, 2015, Davies Liu wrote:
> What is this dataset? text file or parquet file?
>
> There is an issue with serialization in Spark SQL, which will make it
> very slow, see http
Is machine 1 the only one running an HDFS data node? You describe it as one
running Hadoop services.
On Feb 27, 2015 9:44 PM, "Guillaume Guy" wrote:
> Hi Jason:
>
> Thanks for your feedback.
>
> Beside the information above I mentioned, there are 3 machines in the
> cluster.
>
> *1st one*: Driver
What is this dataset? text file or parquet file?
There is an issue with serialization in Spark SQL which makes it very
slow; see https://issues.apache.org/jira/browse/SPARK-6055. It will be
fixed very soon.
Davies
On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy
wrote:
> Hi Sean:
>
> Thanks for
Hi Sean:
Thanks for your feedback. Scala is much faster: the count completes in
~1 minute (vs. 17 min). I would expect Scala to be 2-5x faster, but this gap
seems larger than that. Is that your conclusion as well?
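A gap well beyond 2-5x usually points at PySpark's per-record serialization: every record crosses the JVM/Python boundary and is unpickled before Python code can count it, while the Scala count never leaves the JVM. A rough, standalone sketch of that per-record overhead (pure Python; illustrative only, not a Spark benchmark):

```python
import pickle
import timeit

# Simulate the per-record cost PySpark pays on top of the JVM:
# each text line is serialized in the JVM gateway and unpickled
# in the Python worker before Python ever sees it.
lines = ["line %d of the file" % i for i in range(100000)]
blobs = [pickle.dumps(line) for line in lines]

# Counting already-materialized lines is nearly free...
t_count = timeit.timeit(lambda: sum(1 for _ in lines), number=1)

# ...while unpickling every record first dominates the runtime.
t_unpickle = timeit.timeit(
    lambda: sum(1 for b in blobs if pickle.loads(b)), number=1
)

print("count only: %.4fs, unpickle+count: %.4fs" % (t_count, t_unpickle))
```

The second timing is consistently the larger one, which is why a count that is trivial in Scala can be many times slower once every line takes that round trip.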
Thanks.
Best,
Guillaume Guy
+1 919-972-8750
On Fri, Feb 27, 2015 at 9
Hi Jason:
Thanks for your feedback.
Beside the information above I mentioned, there are 3 machines in the
cluster.
*1st one*: Driver + a bunch of Hadoop services. 32GB of RAM, 8 cores (2
used)
*2nd + 3rd*: 16GB of RAM, 4 cores (2 used each)
I hope this helps clarify.
Thx.
GG
Best,
Gui
How many machines are in the cluster?
And what is the configuration of those machines (cores/RAM)?
"Small cluster" is a very subjective statement.
Guillaume Guy wrote:
Dear Spark users:
I want to see if anyone has an idea of the performance for a small
cluster.
That's very slow, and there are a lot of possible explanations. The
first one that comes to mind: I assume your YARN and HDFS are on
the same machines, but are you running executors on all HDFS nodes
when you run this? If not, a lot of these reads could be remote.
You have 6 executor slots, but
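As a sketch, executor placement and sizing for a cluster like the one described can be requested with standard spark-submit flags (the specific numbers are assumptions matching the 3-node layout above, and the script name is a placeholder):

```
# One executor per datanode so HDFS reads stay local; 2 cores each,
# as described above. Adjust memory to what YARN actually has free.
spark-submit \
  --master yarn \
  --num-executors 2 \
  --executor-cores 2 \
  --executor-memory 4g \
  your_job.py
```

With executors on both datanodes, most HDFS blocks can be read node-locally instead of over the network.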