Re: distinct on huge dataset

2014-04-17 Thread Mayur Rustagi
Preferably increase the ulimit on your machines. Spark needs to access a lot of small files, so it is hard to keep the number of open file handles under control.
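A minimal sketch of pairing that advice with the shuffle-file consolidation flag Ryan mentions below; the master URL and app name are placeholder assumptions, and the property must be set before the SparkContext is created:

    // Sketch only: consolidation lets map tasks reuse a pool of shuffle files
    // instead of writing one file per map/reduce pair, easing the ulimit pressure.
    import org.apache.spark.SparkContext

    System.setProperty("spark.shuffle.consolidate.files", "true")
    val sc = new SparkContext("spark://master:7077", "distinct-on-huge-dataset")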

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Btw, I've got System.setProperty("spark.shuffle.consolidate.files", "true") set, and I'm using ext3 (CentOS...)

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Does this continue in newer versions? (I'm on 0.8.0 now.) When I use .distinct() on moderately large datasets (224GB, 8.5B rows; I'm guessing about 500M are distinct) my jobs fail with:

14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.File
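For what it's worth, distinct() also takes an explicit partition count, which spreads the shuffle over more, smaller reduce tasks; a sketch, where the input path and the 2000 are assumed values:

    // Sketch: more partitions for the distinct() shuffle -> less data per task.
    val rows = sc.textFile("/path/to/input")   // hypothetical path
    val uniqueCount = rows.distinct(2000).count()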

Re: distinct on huge dataset

2014-03-24 Thread Aaron Davidson
Look up setting ulimit, though note the distinction between soft and hard limits, and that updating your hard limit may require changing /etc/security/limits.conf and restarting each worker.
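A sketch of the sort of limits.conf entries meant here; the user name and the 65536 values are assumptions, and the workers need a restart to pick them up:

    # /etc/security/limits.conf -- raise the open-file limit for the user running the workers
    sparkuser  soft  nofile  65536
    sparkuser  hard  nofile  65536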

Re: distinct on huge dataset

2014-03-24 Thread Kane
Got a bit further: I think the out-of-memory error was caused by setting spark.shuffle.spill to false. Now I have this error; is there an easy way to increase the open-file limit for Spark, cluster-wide?

java.io.FileNotFoundException: /tmp/spark-local-20140324074221-b8f1/01/temp_1ab674f9-4556-4239-9f21-688dfc9f17d2 (

Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
Ah, interesting. count() without distinct is streaming and, for instance, does not require that a single partition fit in memory. That said, the behavior may change if you increase the number of partitions in your input RDD by using RDD.repartition().
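A sketch of the repartition-before-distinct variant Aaron suggests; the input path and the partition count are assumed values:

    // Sketch: widen the input RDD before the distinct shuffle.
    val rows = sc.textFile("/path/to/input")   // hypothetical path
    val distinctCount = rows.repartition(2000).distinct().count()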

Re: distinct on huge dataset

2014-03-23 Thread Kane
Yes, there was an error in the data; after fixing it, count fails with an out-of-memory error.

Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
Andrew, this should be fixed in 0.9.1, assuming it is the same hash collision error we found there. Kane, is it possible your bigger data is corrupt, such that any operations on it fail?

Re: distinct on huge dataset

2014-03-22 Thread Andrew Ash
FWIW I've seen correctness errors with spark.shuffle.spill on 0.9.0 and have it disabled now. The specific error behavior was that a join would consistently return one count of rows with spill enabled and another count with it disabled.

Re: distinct on huge dataset

2014-03-22 Thread Kane
But I was wrong: map also fails on the big file, and setting spark.shuffle.spill doesn't help. Map fails with the same error.

Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, that helped; at least it was able to advance a bit further.

Re: distinct on huge dataset

2014-03-22 Thread Aaron Davidson
This could be related to the hash collision bug in ExternalAppendOnlyMap in 0.9.0: https://spark-project.atlassian.net/browse/SPARK-1045

You might try setting spark.shuffle.spill to false and see if that runs any longer (turning off shuffle spill is dangerous, though, as it may cause Spark to OOM
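A sketch of that spill-off experiment; as noted, it is a diagnostic rather than a fix, since aggregations must then fit in memory (the master URL and app name are placeholders, and the property has to be set before the SparkContext exists):

    // Sketch: bypass ExternalAppendOnlyMap by disabling shuffle spill.
    // Dangerous on big data: Spark may OOM instead of spilling to disk.
    import org.apache.spark.SparkContext

    System.setProperty("spark.shuffle.spill", "false")
    val sc = new SparkContext("spark://master:7077", "distinct-no-spill")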

Re: distinct on huge dataset

2014-03-22 Thread Kane
I mean everything works with the small file. With the huge file, only count and map work; distinct doesn't.

Re: distinct on huge dataset

2014-03-22 Thread Kane
Yes, it works with the smaller file; it can count and map, but not distinct.

Re: distinct on huge dataset

2014-03-22 Thread Mayur Rustagi
Does it work on a smaller file?

Re: distinct on huge dataset

2014-03-22 Thread Ryan Compton
Does it work without .distinct()? A possibly related issue I ran into: https://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAMgYSQ-3YNwD=veb1ct9jro_jetj40rj5ce_8exgsrhm7jb...@mail.gmail.com%3E

Re: distinct on huge dataset

2014-03-22 Thread Kane
It's 0.9.0

Re: distinct on huge dataset

2014-03-21 Thread Aaron Davidson
Which version of Spark are you running?

On Fri, Mar 21, 2014 at 10:45 PM, Kane wrote:
> I have a huge 2.5TB file. When I run:
> val t = sc.textFile("/user/hdfs/dump.csv")
> t.distinct.count
>
> It fails right away with a lot of:
>
> Loss was due to java.lang.ArrayIndexOutOfBoundsException
> jav
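For context, distinct is roughly a map into (value, null) pairs followed by a reduceByKey shuffle of every row, which is why it stresses the shuffle machinery far more than the streaming count does; a rough sketch of the equivalent, reusing the quoted path:

    // Sketch: approximately what t.distinct.count does under the hood.
    import org.apache.spark.SparkContext._   // pair-RDD implicits in 0.9-era Spark

    val t = sc.textFile("/user/hdfs/dump.csv")
    val n = t.map(x => (x, null)).reduceByKey((a, _) => a).map(_._1).count()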