[Spark-1.4.0]jackson-databind conflict?

2015-06-12 Thread Earthson
I'm using Play 2.4 with play-json 2.4. It works fine with spark-1.3.1, but it fails after I upgrade Spark to spark-1.4.0 :( sc.parallelize(1 to 1).count [info] com.fasterxml.jackson.databind.JsonMappingException: Could not find creator property with name 'id' (in class org.apache.spark.rdd
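A minimal build.sbt sketch of one common workaround, assuming an sbt-based Play build: pin every Jackson module to a single version so Play and Spark agree. Sean Owen's reply below reports that the 2.4 line works, while the follow-up rebuilds Spark against 2.5 instead; the 2.4.4 version string here is illustrative.

    // build.sbt (sbt 0.13 style): force one Jackson version across all transitive dependencies.
    dependencyOverrides ++= Set(
      "com.fasterxml.jackson.core"   %  "jackson-core"         % "2.4.4",
      "com.fasterxml.jackson.core"   %  "jackson-databind"     % "2.4.4",
      "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.4.4"
    )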

[Yarn-Client]Can not access SparkUI

2015-10-25 Thread Earthson
We are using Spark 1.5.1 with `--master yarn`, and the YARN RM is running in HA mode. [The original message attached screenshots of the direct visit to the SparkUI, the click on the ApplicationMaster link, and the YARN RM log, but attachments are stripped by the list.]

How to add jar with SparkSQL HiveContext?

2014-06-16 Thread Earthson
I have a problem with the add jar command: hql("add jar /.../xxx.jar") Error: Exception in thread "main" java.lang.AssertionError: assertion failed: No plan for AddJar ... How can I do this with HiveContext? I can't find any API for it. Does SparkSQL with Hive support UDF/UDAF?
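A sketch of an alternative way to ship the jar, assuming the goal is simply to get its classes onto the executors; note this does not necessarily register the jar with Hive's session the way ADD JAR would, so it is only a partial substitute.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("add-jar-sketch"))
    // Distribute the jar to the executors for this job (path is illustrative).
    sc.addJar("/path/to/xxx.jar")
    // Or pass it at submit time instead:
    //   spark-submit --jars /path/to/xxx.jar --class MyApp my-app.jar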

How could I set the number of executors?

2014-06-20 Thread Earthson
"spark-submit" has an arguments "--num-executors" to set the number of executor, but how could I set it from anywhere else? We're using Shark, and want to change the number of executor. The number of executor seems to be same as workers by default? Shall we configure the executor number manually(

Re: How could I set the number of executors?

2014-06-20 Thread Earthson
--num-executors seems to be available only with YARN.

Re: Why spark-submit command hangs?

2014-07-22 Thread Earthson
I've just had the same problem. I'm using $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client $JOBJAR --class $JOBCLASS It's really strange, because the log shows that 14/07/22 16:16:58 INFO ui.SparkUI: Started SparkUI at http://k1227.mzhen.cn:4040 14/07/22 16:16:58 WARN util.N

Re: Why spark-submit command hangs?

2014-07-22 Thread Earthson
That's exactly my problem :)

[Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow.

2014-07-28 Thread Earthson
I'm using SparkSQL with Hive 0.13. Here is the SQL for inserting a partition with 2048 buckets: sqlsc.set("spark.sql.shuffle.partitions", "2048") hql("""|insert %s table mz_log |PARTITION (date='%s') |select * from tmp_mzlog

Re: [Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow.

2014-07-28 Thread Earthson
"spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to" takes too much time, what should I do? What is the correct configuration? blockManager timeout if I using a small number of reduce partition.

Re: [Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow.

2014-07-29 Thread Earthson
It's really strange that the CPU load is so high while both the disk and network IO load are so low. CLUSTER BY is just something similar to groupBy, so why does it need so much CPU?

Re: [Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow.

2014-07-29 Thread Earthson
Too much GC. The task runs much faster with more memory (heap space). The CPU load is still too high, and the network load is about 20+ MB/s (not high enough). So what is the correct way to solve this GC problem? Are there other ways besides using more memory?
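A config sketch for the "more memory" direction, assuming the job is submitted with a SparkConf you control; the values are illustrative, and raising parallelism (smaller per-task working sets) is the usual alternative to a larger heap.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")            // a larger heap reduces GC pressure
      .set("spark.sql.shuffle.partitions", "2048")   // the partition count from the original post
      .set("spark.default.parallelism", "2048")      // more, smaller tasks as an alternative to more heap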

[PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
I am using PySpark with an IPython notebook. data = sc.parallelize(range(1000), 10) #successful data.map(lambda x: x+1).collect() #Error data.count() Something similar: http://apache-spark-user-list.1001560.n3.nabble.com/Exception-on-simple-pyspark-script-td3415.html But it does not figure out

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
I'm running pyspark with Python 2.7.8 under virtualenv. The system Python version is 2.6.x.

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
Do I have to deploy Python to every machine to make "$PYSPARK_PYTHON" work correctly?

[SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-15 Thread Earthson
I don't know why JavaSchemaRDD.baseSchemaRDD is private[sql]. And I found that DataTypeConversions is protected[sql]. Finally I found this solution: jrdd.registerTempTable("transform_tmp") jrdd.sqlContext.sql("select * from transform_tmp") Could anyone tell me: is it
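A small sketch wrapping the register-a-temp-table workaround above as a helper for a Scala-facing API; this assumes the Spark 1.1/1.2-era org.apache.spark.sql.api.java.JavaSchemaRDD, and the temp table name is the one used in the post.

    import org.apache.spark.sql.SchemaRDD
    import org.apache.spark.sql.api.java.JavaSchemaRDD

    // Round-trip through a temporary table, since baseSchemaRDD is private[sql].
    def toSchemaRDD(jrdd: JavaSchemaRDD): SchemaRDD = {
      jrdd.registerTempTable("transform_tmp")
      jrdd.sqlContext.sql("SELECT * FROM transform_tmp")
    }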

Re: [SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-16 Thread Earthson
I'm trying to provide an API for Java users. I need to accept their JavaSchemaRDDs and convert them to SchemaRDDs for Scala users.

How to get applicationId for yarn mode(both yarn-client and yarn-cluster mode)

2014-11-21 Thread Earthson
Is there any way to get the yarn application_id inside the program?

Re: How to get applicationId for yarn mode(both yarn-client and yarn-cluster mode)

2014-11-21 Thread Earthson
Finally, I've found two ways: 1. search the output for something like "Submitted application application_1416319392519_0115"; 2. use a specific app name, and then query YARN for the ApplicationId by that name.
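A sketch of the second approach (look the application up in YARN by its app name), assuming the app name set on the SparkConf is unique; on later Spark versions sc.applicationId may make this unnecessary.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.yarn.client.api.YarnClient
    import scala.collection.JavaConverters._

    val yarn = YarnClient.createYarnClient()
    yarn.init(new Configuration())
    yarn.start()
    // Find our application by the unique name we gave it (name is hypothetical).
    val appId = yarn.getApplications.asScala
      .find(_.getName == "my-unique-spark-app")
      .map(_.getApplicationId)
    yarn.stop()
    println(appId)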

Re: what is the best way to implement mini batches?

2014-12-14 Thread Earthson
I think it could be done like this: 1. use mapPartitions to randomly drop some partitions; 2. drop some elements randomly (within the selected partitions); 3. calculate the gradient step for the selected elements. I don't think a fixed step is needed, but a fixed step could be done: 1. zipWithIndex; 2. create ShuffleRDD base
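A toy sketch of the "drop elements randomly each round" idea using the built-in sample(), which avoids a full shuffle; the scalar model, fraction, and step size are all illustrative and not the poster's code.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("mini-batch-sketch"))
    val data = sc.parallelize(1 to 1000000).map(_.toDouble).cache()

    var weight = 0.0
    val stepSize = 0.01
    for (iter <- 0 until 10) {
      // Approximate mini batch: each element is kept with probability `fraction`.
      val batch = data.sample(withReplacement = false, fraction = 0.01, seed = iter.toLong)
      val gradient = batch.map(x => weight - x).mean()   // gradient of 0.5 * (weight - x)^2
      weight -= stepSize * gradient
    }
    println(s"final weight: $weight")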

Re: parallelize for a large Seq is extremely slow.

2014-04-24 Thread Earthson
Kryo fails with the exception below: com.esotericsoftware.kryo.KryoException (com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1) com.esotericsoftware.kryo.io.Output.require(Output.java:138) com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446) com.esotericsof
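A config sketch for the buffer-overflow error above; in this Spark era the knob was spark.kryoserializer.buffer.mb (later renamed to spark.kryoserializer.buffer), and the 64 MB value is illustrative.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "64")   // per-output buffer; the default is only a few MB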

Re: parallelize for a large Seq is extremely slow.

2014-04-25 Thread Earthson
I've tried to set a larger buffer, but reduceByKey seems to have failed. Need help :) 14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Shutting down all executors 14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Asking each executor to shut down 14/04/26 12:31:12 INFO schedule

Re: parallelize for a large Seq is extremely slow.

2014-04-25 Thread Earthson
This error came just because I killed my app :( Is there something wrong? The reduceByKey operation is extremely slow (slower than with the default serializer).

Re: parallelize for a large Seq is extremely slow.

2014-04-25 Thread Earthson
reduceByKey(_+_).countByKey instead of countByKey seems to be fast.

Re: parallelize for a large Seq is extremely slow.

2014-04-25 Thread Earthson
parallelize is still so slow. package com.semi.nlp import org.apache.spark._ import SparkContext._ import scala.io.Source import com.esotericsoftware.kryo.Kryo import org.apache.spark.serializer.KryoRegistrator class MyRegistrator extends KryoRegistrator { override def registerCla
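A self-contained sketch of a Kryo registrator plus the two SparkConf settings that wire it in; the registered classes are placeholders, since the original class list is truncated above.

    package com.semi.nlp

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // Placeholder registrations; register the classes you actually ship through shuffles.
        kryo.register(classOf[Array[String]])
        kryo.register(classOf[scala.collection.mutable.ArrayBuffer[String]])
      }
    }

    object Bootstrap {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("kryo-sketch")
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .set("spark.kryo.registrator", "com.semi.nlp.MyRegistrator")
        val sc = new SparkContext(conf)
        // ... job body ...
        sc.stop()
      }
    }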

Re: parallelize for a large Seq is extremely slow.

2014-04-27 Thread Earthson
That doesn't work. I don't think it is just slow; it never ends (30+ hours, and then I killed it).

Re: parallelize for a large Seq is extremely slow.

2014-04-27 Thread Earthson
It's my fault! I uploaded the wrong jar when I changed the number of partitions, and now it just works fine :) The size of word_mapping is 2444185. So does serialization of a large object take a very long time? I don't think two million is very large, because the cost locally for such a size is typical

Why does Spark require this object to be serializable?

2014-04-28 Thread Earthson
The problem is that this object can't be Serializable: it holds an RDD field and a SparkContext. But Spark reports an error that it needs serialization. The order of my debug output is really strange. ~ Training Start! Round 0 Hehe? Hehe? started? failed? Round 1 Hehe? ~ Here is my code 69 impo

Re: Why does Spark require this object to be serializable?

2014-04-28 Thread Earthson
Or, what is the action that makes the RDD run? I don't want to save it as a file, and I've tried cache(), but it seems to be lazy too.

Re: Why does Spark require this object to be serializable?

2014-04-28 Thread Earthson
The RDD hold "this" in its closure? How to fix such a problem? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-require-this-object-to-be-serializerable-tp5009p5015.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Why does Spark require this object to be serializable?

2014-04-28 Thread Earthson
I've moved the SparkContext and RDD to parameters of train. And now it tells me that the SparkContext needs to be serialized! I think the problem is that the RDD is trying to stay lazy, and some broadcast objects need to be generated dynamically, so the closure has the SparkContext inside, and the task completely faile

Re: Why does Spark require this object to be serializable?

2014-04-28 Thread Earthson
The code is here: https://github.com/Earthson/sparklda/blob/master/src/main/scala/net/earthson/nlp/lda/lda.scala I've changed it from Broadcast to Serializable. Now it works :) But there are too many RDD caches. Is that the problem?

Re: Why does Spark require this object to be serializable?

2014-04-29 Thread Earthson
Finally, I'm using files to save the RDDs and then reload them. It works fine, because Gibbs sampling for LDA is really slow: it takes about 10 min to sample 10k wiki documents for 10 rounds (1 round/min).
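A sketch of that save-and-reload trick using object files, so the reloaded RDD starts with a fresh, short lineage; the element type and path are illustrative.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    case class Assignment(doc: Long, word: Int, topic: Int)

    // Write the current sampling state out, then read it back with no history attached.
    def saveAndReload(sc: SparkContext, state: RDD[Assignment], path: String): RDD[Assignment] = {
      state.saveAsObjectFile(path)       // e.g. "hdfs://ns1/tmp/lda_state_round_10"
      sc.objectFile[Assignment](path)    // the reloaded RDD has no dependence on the old lineage
    }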

Re: parallelize for a large Seq is extremely slow.

2014-04-29 Thread Earthson
I think the real problem is "spark.akka.frameSize". It is too small for passing the data: every executor failed, and with no executors left, the task hangs.
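A config sketch for that diagnosis; spark.akka.frameSize is given in MB in this Spark era, and 128 is an illustrative value.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.akka.frameSize", "128")   // MB; must cover the largest task closure / result being shipped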

cache does not work as expected for iteration?

2014-05-03 Thread Earthson
Code :) <https://github.com/Earthson/sparklda/blob/master/src/main/scala/net/earthson/nlp/lda/lda.scala#L99> (two screenshots were attached: <http://apache-spark-user-list.1001560.n3.nabble.com/file/n5292/sparklda_cache1.png> and <http://apache-spark-user-list.1001560.n3.nabble.com/file/n5292/sparklda_cache2.png>)

Re: cache does not work as expected for iteration?

2014-05-04 Thread Earthson
Thanks for the help, unpersist is exactly what I want :) I see that Spark will remove some cached data automatically when memory is full; it would be much more helpful if the eviction rule were something like LRU. It seems that persist and cache are lazy?

Cache issue for iteration with broadcast

2014-05-04 Thread Earthson
A new broadcast object is generated for every iteration step; it may eat up the memory and make persist fail. The broadcast objects should not be removed, because the RDD may be recomputed. And I am trying to prevent recomputing the RDD, which needs the old broadcasts to release some memory. I've tried to set "spar

Re: Cache issue for iteration with broadcast

2014-05-04 Thread Earthson
Code here: <https://github.com/Earthson/sparklda/blob/dev/src/main/scala/net/earthson/nlp/lda/lda.scala#L121> Finally, iteration still runs into recomputing...

Re: Cache issue for iteration with broadcast

2014-05-04 Thread Earthson
I tried using serialization instead of broadcast, and my program exited with an error (beyond physical memory limits). Can the large object not be released by GC because it is needed for recomputing? So what is the recommended way to solve this problem?

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
How should I do the iteration? Because persist is lazy and recomputing may be required, the whole path of the iteration will be kept, so a memory overflow cannot be escaped?

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
.set("spark.cleaner.ttl", "120") drops broadcast_0 which makes a Exception below. It is strange, because broadcast_0 is no need, and I have broadcast_3 instead, and recent RDD is persisted, there is no need for recomputing... what is the problem? need help. ~~~ 14/05/05 17:03:12 INFO stor

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
Using checkpoint. It removes dependences:)

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
RDD.checkpoint works fine. But spark.cleaner.ttl is really ugly for broadcast cleaning. Maybe broadcasts could be removed automatically when nothing depends on them.

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
Yes, I've tried. The problem is that a new broadcast object is generated at every step, until all of the memory is eaten up. I solved it by using RDD.checkpoint to remove the dependences on old broadcast objects, and cleaner.ttl to clean up those broadcast objects automatically. If there's a more elegant way to
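A sketch of that pattern: checkpoint each round so older broadcasts fall out of the lineage, then unpersist the previous RDD. The update logic and all names here are illustrative, not the sparklda code.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def iterate(sc: SparkContext, init: RDD[Array[Double]], rounds: Int): RDD[Array[Double]] = {
      sc.setCheckpointDir("hdfs://ns1/tmp/checkpoints")        // illustrative path
      var state = init.persist()
      for (_ <- 0 until rounds) {
        val model = sc.broadcast(state.map(_.sum).sum())       // per-round broadcast (toy "model")
        val next  = state.map(row => row.map(_ + model.value)).persist()
        next.checkpoint()                  // cut the lineage so older broadcasts are no longer referenced
        next.foreachPartition(_ => ())     // force evaluation so the checkpoint is actually written
        state.unpersist(blocking = true)
        state = next
      }
      state
    }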

Re: Incredibly slow iterative computation

2014-05-05 Thread Earthson
checkpoint seems to just add a checkpoint mark? You need an action after marking it. I have tried it with success :) newRdd = oldRdd.map(myFun).persist(myStorageLevel) newRdd.checkpoint() newRdd.foreach(_ => {}) // force evaluation newRdd.isCheckpointed // true here oldRdd.unpersist(true) If you have

[Suggestion]Strange behavior for broadcast cleaning with spark 0.9

2014-05-15 Thread Earthson
I'm using spark-0.9 with YARN. Q: Why can the spark.cleaner.ttl setting remove a broadcast that is still in use? I think the cleaner should not remove broadcasts that are still in the dependences of some RDDs. This makes the value of spark.cleaner.ttl tricky to set. POINT: the cleaner should not crash the

Re: problem about broadcast variable in iteration

2014-05-15 Thread Earthson
Is the RDD not cached? Because recomputing may be required, every broadcast object is included in the dependences of the RDDs; this may also cause a memory issue (when n and kv are too large, in your case).

Re: [Spark-1.4.0]jackson-databind conflict?

2015-06-14 Thread Earthson Lu
I've recompiled spark-1.4.0 with fasterxml 2.5.x; it works fine now :) -- Earthson Lu On June 12, 2015 at 23:24:32, Sean Owen (so...@cloudera.com) wrote: I see the same thing in an app that uses Jackson 2.5. Downgrading to 2.4 made it work. I meant to go back and figure out if there

Re: [Yarn-Client]Can not access SparkUI

2015-10-26 Thread Earthson Lu
1:45:36,600 INFO org.apache.commons.httpclient.HttpMethodDirector: Retrying request --  Earthson Lu On October 26, 2015 at 15:30:21, Deng Ching-Mallete (och...@apache.org) wrote: Hi Earthson, Unfortunately, attachments aren't allowed in the list so they seemed to have been removed from you

Re: what is the best way to implement mini batches?

2014-12-15 Thread Earthson Lu
Large batches, with parallelism inside each batch (it seems to be the way the SGD implementation in MLlib works?). -- Earthson Lu On December 16, 2014 at 04:02:22, Imran Rashid (im...@therashids.com) wrote: I'm a little confused by some of the responses. It seems like there are two different issues

parallelize for a large Seq is extremely slow.

2014-04-24 Thread Earthson Lu
spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping"). This line is too slow. There are about 2 million elements in word_mapping. *Is there a good style for writing a large collection to HDFS?* import org.apache.spark._ > import SparkContext._ > import scala.io
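A sketch of one answer to that question: write the driver-side collection straight to HDFS with the Hadoop FileSystem API instead of shipping it through parallelize (which, as noted later in this thread, also runs into the akka frameSize limit). The path and the (word, id) shape of word_mapping are assumptions.

    import java.io.PrintWriter
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    def writeMapping(mapping: Seq[(String, Int)], pathStr: String): Unit = {
      val path = new Path(pathStr)                        // e.g. "hdfs://ns1/nlp/word_mapping/part-00000"
      val fs   = path.getFileSystem(new Configuration())
      val out  = new PrintWriter(fs.create(path, true))   // overwrite if the file already exists
      try mapping.foreach { case (word, id) => out.println(s"$word\t$id") }
      finally out.close()
    }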