Re: run reduceByKey on huge data in spark

2015-06-30 Thread lisendong
Hello, I'm using Spark 1.4.2-SNAPSHOT, running in YARN mode :-) I wonder whether spark.shuffle.memoryFraction and spark.shuffle.manager take effect, and how to set these parameters... > On Jul 1, 2015, at 1:32 AM, Ted Yu wrote: > > Which Spark release are you using? > > Are you running in standalone mode? > > Ch
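Both properties exist in Spark 1.x; a minimal sketch of setting them programmatically (they can equally be passed with --conf to spark-submit), with illustrative values only:

```scala
import org.apache.spark.SparkConf

// must be set before the SparkContext is created
val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")        // "sort" (the default since 1.2) or "hash"
  .set("spark.shuffle.memoryFraction", "0.4")  // default 0.2; raise it if shuffles spill to disk too often
```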

Re: different result from implicit ALS with explicit ALS

2015-02-26 Thread lisendong
for these data points. What then? > > Also, would you care to bring this to the user@ list? It's kind of interesting. > > On Thu, Feb 26, 2015 at 2:02 PM, lisendong wrote: >> I set the score of the '0'-interaction user-item pairs to 0.0 >> the code is as follows: >&

different result from implicit ALS with explicit ALS

2015-02-26 Thread lisendong
I'm using ALS with Spark 1.0.0; the code should be: https://github.com/apache/spark/blob/branch-1.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala I think the following two methods should produce the same (or nearly the same) result: MatrixFactorizationModel model = ALS.train(ratings.r
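For reference, a sketch of the two calls being compared (MLlib 1.0.0 API; the hyperparameter values are placeholders):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def compareModels(ratings: RDD[Rating]): Unit = {
  val (rank, iterations, lambda, alpha) = (10, 10, 0.01, 1.0)
  // explicit feedback: each rating is a value the factorization tries to reconstruct
  val explicitModel = ALS.train(ratings, rank, iterations, lambda)
  // implicit feedback: each rating r is a confidence weight c = 1 + alpha * r,
  // so an entry of 0.0 means "low-confidence zero", not "observed rating 0" --
  // one reason the two models can differ even on identical input
  val implicitModel = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)
}
```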

how to clean shuffle write each iteration

2015-03-02 Thread lisendong
I'm using Spark ALS. I set the iteration number to 30, and in each iteration the tasks produce nearly 1 TB of shuffle write. To my surprise, this shuffle data is not cleaned until the whole job finishes, which means I need 30 TB of disk to store the shuffle data. I think after each iteration,

Re: how to clean shuffle write each iteration

2015-03-03 Thread lisendong
In ALS, I guess each iteration's RDDs are referenced by the next iteration's RDD, so none of the shuffle data will be deleted until the ALS job finishes… I guess checkpointing could solve my problem; do you know checkpoint? > On Mar 3, 2015, at 4:18 PM, nitin [via Apache Spark User List] > wrote: > > S
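A minimal sketch of that checkpointing idea, assuming an iterative job whose lineage (and shuffle files) would otherwise grow across iterations; the path and iteration body are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical path

    var current: RDD[(Int, Double)] =
      sc.parallelize(1 to 1000000).map(i => (i % 1000, i.toDouble))
    for (i <- 1 to 30) {
      current = current.reduceByKey(_ + _).cache()
      if (i % 5 == 0) {       // every few iterations...
        current.checkpoint()  // ...mark for checkpointing (runs on the next action)
        current.count()       // force it; earlier lineage and its shuffle files can then be cleaned
      }
    }
    sc.stop()
  }
}
```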

gc time too long when using mllib als

2015-03-03 Thread lisendong
Why is the GC time so long? I'm using ALS in MLlib, and the garbage collection time is too long (about 1/3 of the total time). I have tried some measures from the "Tuning Spark" guide and tried to set the new-generation memory size, but it still does not work... Tasks Task Index Task ID Stat
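A sketch of the kind of executor-side GC settings the "Tuning Spark" guide discusses; the flags are standard HotSpot options, but the values are illustrative and what actually helps depends on the heap and workload:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // -XX:NewRatio=3 keeps the young generation at 1/4 of the heap
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseParallelGC -XX:NewRatio=3 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  // leaving less memory for cached blocks lowers old-gen pressure (default 0.6 in 1.x)
  .set("spark.storage.memoryFraction", "0.4")
```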

spark.local.dir leads to "Job cancelled because SparkContext was shut down"

2015-03-03 Thread lisendong
As soon as I set "spark.local.dir" to multiple disks, the job fails with the errors below (if I set spark.local.dir to only one dir, the job succeeds...): Exception in thread "main" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down at org.
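For reference, spark.local.dir takes a comma-separated list, and every listed directory must exist and be writable on every node, otherwise executors fail at startup and the job is cancelled as above. A sketch with hypothetical paths:

```scala
import org.apache.spark.SparkConf

// one directory per physical disk spreads shuffle I/O; a single bad or
// missing path on any node is enough to bring the whole job down
val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")
```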

spark master shut down suddenly

2015-03-04 Thread lisendong
15/03/04 09:26:36 INFO ClientCnxn: Client session timed out, have not heard from server in 26679ms for sessionid 0x34bbf3313a8001b, closing socket connection and attempting reconnect 15/03/04 09:26:36 INFO ConnectionStateManager: State change: SUSPENDED 15/03/04 09:26:36 INFO ZooKeeperLeaderElectio

how to update als in mllib?

2015-03-04 Thread lisendong
I'm using Spark 1.0.0 with Cloudera, but I want to use the new ALS code, which supports more features, such as the RDD cache level (MEMORY_ONLY), checkpointing, and so on. What is the easiest way to use the new ALS code? I only need the MLlib ALS code, so maybe I don't need to update all the spark & mllib o
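One commonly suggested route is to copy the newer ALS.scala into your own project under a different package and build a job jar against the cluster's Spark; whether it compiles depends on which internal APIs the newer code touches. A hypothetical build.sbt for that setup:

```scala
// build.sbt (hypothetical) -- the cluster's Spark is "provided", so only
// your copied ALS source and the job code itself end up in the jar
name := "custom-als"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.0.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.0.0" % "provided"
)
```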

Re: spark master shut down suddenly

2015-03-04 Thread lisendong
I'm sorry, but how do I look at the Mesos logs? Where are they? > On Mar 4, 2015, at 6:06 PM, Akhil Das wrote: > > You can check the Mesos logs and see what's really happening. > > Thanks > Best Regards > > On Wed, Mar 4, 2015 at 3:10 PM, lisendong <mailto:lisend...@163.co

why does my YoungGen GC take such a long time?

2015-03-05 Thread lisendong
I found my tasks take a long time in YoungGen GC. I set the young gen size to about 1.5 G; I wonder why it takes so long? Not all the tasks take such a long time, only about 1% of them... 180.426: [GC [PSYoungGen: 9916105K->1676785K(14256640K)] 26201020K->18690057K(53403648K), 17.358150
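Note that the log shows a young gen capacity of 14256640K (~14 GB), so the 1.5 G setting may not have reached the executors. If the goal is to actually pin the young generation at ~1.5 G and see where the time goes, a sketch of the relevant HotSpot flags (the PSYoungGen tag in the log indicates the parallel collector):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-Xmn1536m " +          // fix the young generation at ~1.5 GB
    "-XX:+UseParallelGC " + // matches the PSYoungGen collector in the log
    "-XX:+PrintGCDetails -XX:+PrintTenuringDistribution") // survivor copying is a common cause of long young GCs
```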

what are the types of tasks when running ALS iterations

2015-03-08 Thread lisendong
You see, the core of ALS 1.0.0 is the following code; there should be flatMap and groupByKey tasks when running ALS iterations, right? But when I run the ALS iterations, there are ONLY flatMap tasks... Do you know why? private def updateFeatures( products: RDD[(Int, Array[Arr
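A small standalone reproduction of the effect: the map side of a groupByKey runs inside the same stage as the flatMap that precedes the shuffle, and the UI labels a stage's tasks by its last transformation before the stage boundary, so "groupByKey" never appears as a task name:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageNaming {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-naming"))
    val pairs = sc.parallelize(0 until 10000)
      .flatMap(i => Seq((i % 10, i), ((i + 1) % 10, i))) // last op before the shuffle
    val grouped = pairs.groupByKey() // shuffle boundary: the read side lands in the next stage
    grouped.count() // UI shows a "flatMap at ..." stage and a "count at ..." stage
    sc.stop()
  }
}
```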

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
; On Mar 31, 2015, at 12:11 AM, Xiangrui Meng wrote: > > setCheckpointInterval was added in the current master and branch-1.3. Please > help check whether it works. It will be included in the 1.3.1 and 1.4.0 > releases. -Xiangrui > > On Mon, Mar 30, 2015 at 7:27 AM, lisendong <mailto:l
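A minimal sketch of using the new setter on 1.3.1+ (sc and ratings: RDD[Rating] are assumed to exist, and the checkpoint directory is a hypothetical path; a checkpoint dir must be set or the interval has no effect):

```scala
import org.apache.spark.mllib.recommendation.ALS

sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")
val model = new ALS()
  .setRank(20)
  .setIterations(30)
  .setLambda(0.01)
  .setCheckpointInterval(5) // checkpoint every 5 iterations, cutting lineage and shuffle buildup
  .run(ratings)             // ratings: RDD[Rating]
```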

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
System.runFinalization() > while (weakRef.get != null) { > System.gc() > System.runFinalization() > Thread.sleep(200) > if (System.currentTimeMillis - startTime > 1) { > throw new Exception("automatically cleanup error") > } > } > }

Re: different result from implicit ALS with explicit ALS

2015-03-31 Thread lisendong
; checkpoint. Is it correct? > > Best, > Xiangrui > > On Tue, Mar 31, 2015 at 8:58 AM, lisendong <mailto:lisend...@163.com>> wrote: > guoqiang's method works very well… > > it only takes 1 TB of disk now. > > thank you very much! > > > >>

Re: there are about 50% all-zero vector in the als result

2015-04-02 Thread lisendong
ferring to the initialization, not the result, right? It's possible > that the resulting weight vectors are sparse although this looks surprising > to me. But it is not related to the initial state, right? > > On Thu, Apr 2, 2015 at 10:43 AM, lisendong <mailto:lisend...@163.com>&g

Re: there are about 50% all-zero vector in the als result

2015-04-02 Thread lisendong
Yes! Thank you very much :-) > On Apr 2, 2015, at 7:13 PM, Sean Owen wrote: > > Right, I asked because in your original message, you were looking at > the initialization to a random vector. But that is the initial state, > not the final state. > > On Thu, Apr 2, 2015 at 11:51 AM, lisendo

union each streaming window into a static rdd and use the static rdd periodically

2015-05-06 Thread lisendong
The pseudo code: object myApp { var myStaticRDD: RDD[Int] def main() { ... // init the streaming context, and get two DStreams (streamA and streamB) from two HDFS paths // complex transformation using the two DStreams val new_stream = streamA.transformWith(streamB, (a, b, t) => { a.join(
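A fleshed-out sketch of that pseudo code (input paths and key types are hypothetical); the main caveat is that the repeatedly unioned RDD's lineage grows without bound, so it needs periodic checkpointing:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object MyApp {
  @volatile var myStaticRDD: RDD[(String, Int)] = _

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("union-windows"), Seconds(60))
    ssc.sparkContext.setCheckpointDir("hdfs:///tmp/union-checkpoints")
    myStaticRDD = ssc.sparkContext.emptyRDD[(String, Int)]

    val streamA = ssc.textFileStream("hdfs:///input/a").map(line => (line, 1))
    val streamB = ssc.textFileStream("hdfs:///input/b").map(line => (line, 1))

    // join the two streams batch-by-batch
    val joined = streamA.transformWith(streamB,
      (a: RDD[(String, Int)], b: RDD[(String, Int)], t: Time) =>
        a.join(b).mapValues { case (x, y) => x + y })

    // fold each batch into the accumulated "static" RDD
    joined.foreachRDD { rdd =>
      myStaticRDD = myStaticRDD.union(rdd).cache()
      myStaticRDD.checkpoint() // cut the growing union lineage
      myStaticRDD.count()      // materialize so the checkpoint actually happens
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```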

how to load some of the files in a dir and monitor new files in that dir in spark streaming without missing any?

2015-05-11 Thread lisendong
I have one HDFS dir, which contains many files: /user/root/1.txt /user/root/2.txt /user/root/3.txt /user/root/4.txt and there is a daemon process which adds one file per minute to this dir (e.g., 5.txt, 6.txt, 7.txt...). I want to start a Spark Streaming job which loads 3.txt, 4.txt and then det
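The fileStream variant with newFilesOnly = false is the usual answer here (ssc is an assumed StreamingContext); note that on older releases only files whose modification time falls inside the stream's remember window are picked up, so very old files may still be skipped:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// newFilesOnly = false: also consider files already present in the dir,
// then keep watching for new arrivals (5.txt, 6.txt, ...)
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///user/root/",
    (path: Path) => !path.getName.startsWith("."), // skip hidden/in-progress files
    newFilesOnly = false
  ).map(_._2.toString)
```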

Re: how to monitor multi directories in spark streaming task

2015-05-13 Thread lisendong
But in fact the directories are not ready when my task starts. For example: /user/root/2015/05/11/data.txt /user/root/2015/05/12/data.txt /user/root/2015/05/13/data.txt like this, with one new directory per day. How do I create the new DStream for tomorrow's new directory (/user/root/20
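One possible workaround, if your Spark version resolves glob patterns in the monitored path (FileInputDStream re-expands the pattern on each batch), is to point a single stream at a glob covering the dated directories; the path layout here mirrors the example above:

```scala
// the glob is re-evaluated every batch, so tomorrow's directory is
// picked up once the daemon creates it -- no new DStream needed
val lines = ssc.textFileStream("hdfs:///user/root/2015/*/*")
```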

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread lisendong
reduce/LzoTextInputFormat.java> > the class. You can read more here: > https://github.com/twitter/hadoop-lzo#maven-repository > > Thanks > Best Regards > > On Thu, May 14, 2015 at 1:22 PM, lisendong &
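A sketch of reading LZO-compressed text with fileStream, assuming the hadoop-lzo jar and its native libraries are on the executor classpath (ssc and the input path are assumptions):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat // from twitter/hadoop-lzo

// decompresses .lzo files transparently; for the LZ4 codec in the thread
// title, plain TextInputFormat with Hadoop's Lz4Codec configured is the analogue
val lines = ssc.fileStream[LongWritable, Text, LzoTextInputFormat](
    "hdfs:///user/root/lzo-input" // hypothetical path
  ).map(_._2.toString)
```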