Re: distinct on huge dataset

2014-03-22 Thread Andrew Ash
FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and have it disabled now. The specific error behavior was that a join would consistently return one count of rows with spill enabled and another count with it disabled. Sent from my mobile phone On Mar 22, 2014 1:52 PM, "Kane" wr
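For reference, a minimal sketch of how shuffle spilling can be disabled in 0.9.x via SparkConf; the master URL and app name here are placeholders, and note the OOM risk discussed later in this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Disable shuffle spilling to sidestep the reported correctness issue;
    // this trades the spill bug for OOM risk on large shuffles.
    val conf = new SparkConf()
      .setMaster("spark://master:7077")      // placeholder master URL
      .setAppName("JoinJob")                 // placeholder app name
      .set("spark.shuffle.spill", "false")
    val sc = new SparkContext(conf)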

Re: Spark Job stuck

2014-03-22 Thread Usman Ghani
Were you able to figure out what this was? You can try setting spark.akka.askTimeout to a larger value. That might help. On Thu, Mar 20, 2014 at 10:24 PM, mohit.goyal wrote: > Hi, > > I have run the spark application to process input data of size ~14GB with > executor memory 10GB. The job got s
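A minimal sketch of the suggested change, assuming the 0.9.x convention that spark.akka.askTimeout is given in seconds; the app name and the value 100 are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Raise the Akka ask timeout (seconds) so slow responses are not
    // treated as failures; 100 is an arbitrary illustrative value.
    val conf = new SparkConf()
      .setAppName("LargeInputJob")           // placeholder app name
      .set("spark.akka.askTimeout", "100")
    val sc = new SparkContext(conf)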

Re: Largest input data set observed for Spark.

2014-03-22 Thread Usman Ghani
I am having similar issues with much smaller data sets. I am using the Spark EC2 scripts to launch clusters, but I almost always end up with straggling executors that take over a node's CPU and memory and end up never finishing. On Thu, Mar 20, 2014 at 1:54 PM, Soila Pertet Kavulya wrote: > Hi Reyn

Re: Shark Table for >22 columns

2014-03-22 Thread Ali Ghodsi
Subacini, the short answer is that we don't really support that yet, but the good news is that I can show you how to work around it. The good thing is that nowadays we internally convert the Tuples to Seqs, so we can leverage that. The bad thing is that before converting tuples t
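For context, the underlying limitation is Scala 2.10's 22-field cap on tuples (there is no Tuple23), which is what a Seq-based row sidesteps. A hedged sketch of that idea, assuming the spark-shell's predefined sc; the field values are invented for illustration:

    // Scala 2.10 tuples stop at Tuple22, so a table row with more
    // than 22 columns cannot be a tuple. A Seq has no such limit.
    val row: Seq[Any] = Seq("id-1", 42, 3.14 /*, ... 22+ more fields */)

    // An RDD of such rows can then be mapped into whatever shape
    // the table API expects, one Seq element per column.
    val rows = sc.parallelize(Seq(row))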

Re: worker keeps getting disassociated upon a failed job spark version 0.90

2014-03-22 Thread sam
I have this problem too. Eventually the job fails (on the UI) and hangs the terminal until I CTRL + C. (Logs below) Now the Spark docs explain that the heartbeat configuration can be tweaked to handle GC hangs. I'm wondering if this is symptomatic of pushing the cluster a little too hard (we w
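A sketch of the heartbeat knobs being referred to, using the setting names from the 0.9.x configuration docs; the values shown are the documented defaults, included only to indicate which knobs exist, not as tested tuning:

    import org.apache.spark.{SparkConf, SparkContext}

    // Loosen the Akka failure detector so long GC pauses are not
    // mistaken for dead workers (values are the documented defaults).
    val conf = new SparkConf()
      .setAppName("HeavyJob")                          // placeholder app name
      .set("spark.akka.heartbeat.pauses", "600")
      .set("spark.akka.failure-detector.threshold", "300.0")
      .set("spark.akka.heartbeat.interval", "1000")
    val sc = new SparkContext(conf)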

Re: distinct on huge dataset

2014-03-22 Thread Kane
But I was wrong - map also fails on the big file, and setting spark.shuffle.spill doesn't help. Map fails with the same error. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3039.html Sent from the Apache Spark User List mailing l

Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, that helped, at least it was able to advance a bit further. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3038.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: distinct on huge dataset

2014-03-22 Thread Aaron Davidson
This could be related to the hash collision bug in ExternalAppendOnlyMap in 0.9.0: https://spark-project.atlassian.net/browse/SPARK-1045 You might try setting spark.shuffle.spill to false and see if that runs any longer (turning off shuffle spill is dangerous, though, as it may cause Spark to OOM
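For context on why distinct hits this code path while map and count do not: distinct is built on top of a shuffle. Roughly, paraphrasing the 0.9.x RDD source (not an exact copy), it is equivalent to the following, shown here for an RDD of strings with a placeholder input path, assuming the spark-shell's predefined sc:

    import org.apache.spark.SparkContext._  // pair-RDD implicits in 0.9.x

    // distinct() keys every element, merges duplicates with reduceByKey
    // (the step that goes through ExternalAppendOnlyMap when spilling),
    // then drops the synthetic keys.
    val lines = sc.textFile("hdfs:///path/to/huge-file")  // placeholder path
    val deduped = lines.map(x => (x, null)).reduceByKey((a, _) => a).map(_._1)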

Re: Reply: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread Koert Kuipers
I have found that I am unable to build/test Spark with sbt and Java 6, but using Java 7 works (and it compiles with Java target version 1.6, so the binaries are usable from Java 6). On Sat, Mar 22, 2014 at 3:11 PM, Bharath Bhushan wrote: > Thanks for the reply. It turns out that my ubuntu-in-vagrant had

RE: Reply: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread Bharath Bhushan
Thanks for the reply. It turns out that my ubuntu-in-vagrant had only 512MB of RAM. Increasing it to 1024MB allowed the assembly to finish successfully. Peak usage was around 780MB. To: user@spark.apache.org From: vboylin1...@gmail.com Subject: Reply: unable to build spark - sbt/sbt: line 50: kille

Re: distinct on huge dataset

2014-03-22 Thread Kane
I mean everything works with the small file. With the huge file only count and map work; distinct doesn't. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3034.html Sent from the Apache Spark User List mailing list archive at Nab

Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, it works with the smaller file; it can count and map, but not distinct. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3033.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Configuring shuffle write directory

2014-03-22 Thread Tsai Li Ming
Hi, Each of my worker nodes has its own unique spark.local.dir. However, when I run spark-shell, the shuffle writes are always written to /tmp, despite the setting being applied when the worker node is started. Does specifying spark.local.dir for the driver program override the executors' setting? Is there
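A sketch of the behavior being asked about: if the driver's configuration propagates to executors (which would explain what is observed above), then setting the directory in the driver before the context is created is one way to control where shuffle files land. The app name and path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Set the scratch directory driver-side, before creating the context,
    // assuming this value is what the executors end up using.
    val conf = new SparkConf()
      .setAppName("ShuffleDirTest")                    // placeholder app name
      .set("spark.local.dir", "/data/spark-scratch")   // placeholder path
    val sc = new SparkContext(conf)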

Re: distinct on huge dataset

2014-03-22 Thread Mayur Rustagi
Does it work on a smaller file? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Sat, Mar 22, 2014 at 4:50 AM, Ryan Compton wrote: > Does it work without .distinct() ? > > Possibly related issue I ran into: > > https://ma

Reply: unable to build spark - sbt/sbt: line 50: killed

2014-03-22 Thread 林武康
A lot of memory is needed to build Spark; I think you should make Xmx larger, 2g for example. -Original Message- From: "Bharath Bhushan" Sent: 2014/3/22 12:50 To: "user@spark.apache.org" Subject: unable to build spark - sbt/sbt: line 50: killed I am getting the following error when trying to build spark.

Yet another question on saving RDD into files

2014-03-22 Thread Jaonary Rabarisoa
Dear all, As a Spark newbie, I need some help understanding how saving an RDD to files behaves. After reading the post on saving single files efficiently http://apache-spark-user-list.1001560.n3.nabble.com/How-to-save-as-a-single-file-efficiently-td3014.html I understand that each partition of the RDD
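The behavior in question, sketched: saveAsTextFile writes one part-NNNNN file per partition, so collapsing to a single partition first yields a single output file, at the cost of funneling all data through one task. Paths are placeholders, assuming the spark-shell's predefined sc:

    val data = sc.textFile("hdfs:///input")        // placeholder input path

    // One output file per partition: part-00000, part-00001, ...
    data.saveAsTextFile("hdfs:///out-many")

    // Coalesce to one partition first to get a single part file;
    // fine for small results, a bottleneck for large ones.
    data.coalesce(1).saveAsTextFile("hdfs:///out-single")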

Re: distinct on huge dataset

2014-03-22 Thread Ryan Compton
Does it work without .distinct() ? Possibly related issue I ran into: https://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAMgYSQ-3YNwD=veb1ct9jro_jetj40rj5ce_8exgsrhm7jb...@mail.gmail.com%3E On Sat, Mar 22, 2014 at 12:45 AM, Kane wrote: > It's 0.9.0 > > > > -- > View this messag

Re: distinct on huge dataset

2014-03-22 Thread Kane
It's 0.9.0 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3027.html Sent from the Apache Spark User List mailing list archive at Nabble.com.