Re: parallelize for a large Seq is extreamly slow.

2014-04-29 Thread Earthson
I think the real problem is "spark.akka.frameSize". It is too small for passing the data: every executor failed, and with no executors left, the task hangs. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp
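For reference, a minimal sketch of raising that limit on Spark 0.9.x (the era of this thread). The 128 MB value is illustrative, not from the thread; the property takes megabytes and defaulted to 10:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: raise the Akka frame size so large task results/statuses
// fit in a single message (value in MB; the 0.9.x default was 10).
val conf = new SparkConf()
  .setAppName("nlp-job") // hypothetical app name
  .set("spark.akka.frameSize", "128")
val sc = new SparkContext(conf)
```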

Re: parallelize for a large Seq is extreamly slow.

2014-04-27 Thread Earthson
It's my fault! I upload a wrong jar when I changed the number of partitions. and Now it just works fine:) The size of word_mapping is 2444185. So it will take very long time for large object serialization? I don't think two million is very large, because the cost at local for such size is typical

Re: parallelize for a large Seq is extreamly slow.

2014-04-27 Thread Matei Zaharia
How many values are in that sequence? I.e. what is its size? You can also profile your program while it’s running to see where it’s spending time. The easiest way is to get a single stack trace with jstack . Maybe some of the serialization methods for this data are super inefficient, or toSeq o
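Matei's jstack suggestion looks roughly like this in practice (the pid is whatever `jps` reports for your driver; `stack.txt` is a hypothetical output file):

```shell
# List running JVMs with their main-class names to find the driver's pid
jps -l

# Take a single stack trace of that JVM to see where time is being spent
jstack <pid> > stack.txt
```

Several traces a few seconds apart give a poor man's profile: whichever frames keep reappearing are where the time goes.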

Re: parallelize for a large Seq is extreamly slow.

2014-04-27 Thread Earthson
That doesn't work. I don't think it is just slow; it never ends (I killed it after 30+ hours). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4900.html Sent from the Apache Spark User List mailing list

Re: parallelize for a large Seq is extreamly slow.

2014-04-26 Thread Aaron Davidson
Could it be that the default number of partitions used by parallelize() is too small in this case? Try something like spark.parallelize(word_mapping.value.toSeq, 60). (Given your setup, it should already be 30, but perhaps that's not the case in YARN mode...) On Fri, Apr 25, 2014 at 11:38
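A sketch of the suggestion above, with the second argument to parallelize setting the partition count explicitly (60 is the value Aaron proposes; the output path is elided as in the thread):

```scala
// Splitting a large local Seq into more, smaller partitions means more,
// smaller tasks; a common rule of thumb is 2-4x the total number of cores.
val rdd = sc.parallelize(word_mapping.value.toSeq, 60)
rdd.saveAsTextFile("hdfs://...") // path truncated in the original message
```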

Re: parallelize for a large Seq is extreamly slow.

2014-04-25 Thread Earthson
parallelize is still so slow.

    package com.semi.nlp

    import org.apache.spark._
    import SparkContext._
    import scala.io.Source
    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerCla
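The registrator is cut off in the archive; it presumably continues along these lines. The specific classes registered here are assumptions, since the original list is not preserved:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical completion of the truncated registrator: register the
// concrete classes that actually cross the wire so Kryo can serialize
// them without falling back to writing full class names.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[String])
    kryo.register(classOf[scala.Tuple2[String, Int]])
  }
}
```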

Re: parallelize for a large Seq is extreamly slow.

2014-04-25 Thread Earthson
reduceByKey(_+_).countByKey instead of countByKey seems to be fast. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4870.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
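A sketch of why this helps, with a caveat: countByKey on the raw pairs ships a count for every record back to the driver, whereas reduceByKey(_ + _) combines values map-side first. Note the semantics differ after the change: the reduced RDD has each key exactly once, so countByKey then returns 1 per key; collectAsMap on the reduced RDD is what yields the summed counts.

```scala
// Map-side combine: far less data crosses the network than with a
// driver-side count over every raw record.
val counts = pairs.reduceByKey(_ + _)

// counts now holds one record per key, so this is cheap --
// but it yields 1 per key, not the sums:
val keysSeen = counts.countByKey()

// To get the summed counts on the driver instead:
val summed = counts.collectAsMap()
```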

Re: parallelize for a large Seq is extreamly slow.

2014-04-25 Thread Earthson
This error came just because I killed my app :( Is there something wrong? The reduceByKey operation is extremely slow (slower than with the default Serializer). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4869.html Sen

Re: parallelize for a large Seq is extreamly slow.

2014-04-25 Thread Earthson
I've tried to set a larger buffer, but reduceByKey seems to have failed. Need help :) 14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Shutting down all executors 14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Asking each executor to shut down 14/04/26 12:31:12 INFO schedule

Re: parallelize for a large Seq is extreamly slow.

2014-04-24 Thread Earthson
Kryo fails with the exception below:

    com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1
        com.esotericsoftware.kryo.io.Output.require(Output.java:138)
        com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
        com.esotericsof

Re: parallelize for a large Seq is extreamly slow.

2014-04-24 Thread Matei Zaharia
Try setting the serializer to org.apache.spark.serializer.KryoSerializer (see http://spark.apache.org/docs/0.9.1/tuning.html), it should be considerably faster. Matei On Apr 24, 2014, at 8:01 PM, Earthson Lu wrote: > spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/w
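Concretely, enabling Kryo on Spark 0.9.x per the tuning guide linked above might look like this. The registrator line reuses com.semi.nlp.MyRegistrator from the code earlier in the thread; the 64 MB buffer value is illustrative and addresses the "Buffer overflow" Kryo exception reported above (the 0.9.x default was 2 MB):

```scala
import org.apache.spark.SparkConf

// Sketch: switch the serializer to Kryo and raise its buffer.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.semi.nlp.MyRegistrator") // optional custom registrator
  .set("spark.kryoserializer.buffer.mb", "64") // raise if Kryo reports "Buffer overflow"
```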
