Re: GroupByKey results in OOM - Any other alternative

2014-06-14 Thread Vivek YS
Thanks for the input. I will give foldByKey a shot. The way I am doing it: the data is partitioned hourly, so I compute distinct values hourly. Then I use unionRDD to merge them and compute distinct on the overall data. > Is there a way to know which (key, value) pair is resulting in the OOM ? > Is
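A minimal sketch of the hourly workflow described above, assuming one tab-separated (key, value) text file per hour; the paths, parsing, and function name are invented:

    import org.apache.spark.SparkContext

    def hourlyDistinct(sc: SparkContext, hours: Seq[String]) = {
      val hourly = hours.map { h =>
        sc.textFile("/data/" + h)                          // one (assumed) text file per hour
          .map { line => val Array(k, v) = line.split("\t"); (k, v) }
          .distinct()                                      // distinct within the hour
      }
      sc.union(hourly).distinct()                          // merge all hours, distinct overall
    }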

long GC pause during file.cache()

2014-06-14 Thread Wei Tan
Hi, I have a single-node (192G RAM) standalone Spark setup, with memory configuration like this in spark-env.sh: SPARK_WORKER_MEMORY=180g SPARK_MEM=180g. In spark-shell I have a program like this: val file = sc.textFile("/localpath") //file size is 40G file.cache() val output = file.map(line =>
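One knob that sometimes helps with long GC pauses on a large cached RDD is serialized caching; a rough sketch for the spark-shell session above (the final map/reduce is only a placeholder, since the original snippet is cut off):

    import org.apache.spark.storage.StorageLevel

    val file = sc.textFile("/localpath")                   // ~40G of text, per the post
    // MEMORY_ONLY_SER keeps partitions as serialized byte arrays: more CPU per access,
    // but far fewer long-lived objects on the heap, which usually shortens GC pauses.
    file.persist(StorageLevel.MEMORY_ONLY_SER)
    val output = file.map(_.length.toLong).reduce(_ + _)   // placeholder action to force the cache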

Re: Is shuffle "stable"?

2014-06-14 Thread Daniel Darabos
Thanks Matei! In the example all three items have the same key, so they go to the same partition: scala> sc.parallelize(Seq(0->3, 0->2, 0->1), 3).partitionBy(new HashPartitioner(3)).glom.collect Array(Array((0,3), (0,2), (0,1)), Array(), Array()) I guess the apparent stability is just due to

Re: spark master UI does not keep detailed application history

2014-06-14 Thread wxhsdp
Hi Zhen, I met the same problem on EC2: application details cannot be accessed, but I can read stdout and stderr. The problem has not been solved yet.

Re: Is shuffle "stable"?

2014-06-14 Thread Matei Zaharia
The order is not guaranteed actually, only which keys end up in each partition. Reducers may fetch data from map tasks in an arbitrary order, depending on which ones are available first. If you’d like a specific order, you should sort each partition. Here you might be getting it because each par
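A small sketch of the "sort each partition" suggestion, using the same data and the spark-shell sc from this thread, sorting by value within every partition while keeping the partitioning:

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(0 -> 3, 0 -> 2, 0 -> 1), 3)
      .partitionBy(new HashPartitioner(3))

    // Sort inside each partition so the result no longer depends on which map
    // outputs happened to be fetched first (sorting by value, just for illustration).
    val sorted = pairs.mapPartitions(it => it.toSeq.sortBy(_._2).iterator,
                                     preservesPartitioning = true)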

Failing to run standalone streaming app: IOException; classNotFoundException; and more

2014-06-14 Thread pns
Hi, I'm attempting to run the following simple standalone app on Mac OS and Spark 1.0 using sbt: val sparkConf = new SparkConf().setAppName("ProcessEvents").setMaster("local[*]").setSparkHome("/Users/me/Downloads/spark"); val ssc = new StreamingContext(sparkConf, Seconds(10)); val lines = ssc.textFileStr
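For comparison, a minimal standalone streaming skeleton for Spark 1.0; the watched directory is a stand-in, and the count()/print() output operation plus start()/awaitTermination() are assumptions about what the truncated snippet intends:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object ProcessEvents {
      def main(args: Array[String]) {
        val sparkConf = new SparkConf().setAppName("ProcessEvents").setMaster("local[*]")
        val ssc = new StreamingContext(sparkConf, Seconds(10))

        val lines = ssc.textFileStream("/tmp/events")   // watched directory is a stand-in
        lines.count().print()                           // a DStream needs an output operation
        ssc.start()                                     // nothing runs until start() is called
        ssc.awaitTermination()
      }
    }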

Re: SparkSQL registerAsTable - No TypeTag available Error

2014-06-14 Thread Michael Armbrust
Actually, are you defining Person as an inner class? You might be running into this: http://stackoverflow.com/questions/18866866/why-there-is-no-typetag-available-in-nested-instantiations-when-interpreted-by On Sat, Jun 14, 2014 at 1:51 PM, Michael Armbrust wrote: > Can you maybe attach the fu
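For reference, a sketch of the shape that usually compiles cleanly: the case class is declared at the top level, not inside main() or another class, so the compiler can find a TypeTag for it. This follows the Spark 1.0 SQL programming guide's example; the file path and object name are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Declared at the top level, not nested inside main() or another class,
    // so a TypeTag is available for it.
    case class Person(name: String, age: Int)

    object SqlExample {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("SqlExample").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD               // implicit RDD -> SchemaRDD in Spark 1.0

        val people = sc.textFile("people.txt")
          .map(_.split(","))
          .map(p => Person(p(0), p(1).trim.toInt))
        people.registerAsTable("people")

        sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
          .map(t => "Name: " + t(0)).collect().foreach(println)
      }
    }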

Re: SparkSQL registerAsTable - No TypeTag available Error

2014-06-14 Thread Michael Armbrust
Can you maybe attach the full Scala file? On Sat, Jun 14, 2014 at 5:03 AM, premdass wrote: > Hi, > > I am trying to run the Spark SQL example from the guide at > https://spark.apache.org/docs/latest/sql-programming-guide.html as a > standalone program. > > When I try to compile t

Re: guidance on simple unit testing with Spark

2014-06-14 Thread Gerard Maas
On Jun 14, 2014 4:05 AM, "Matei Zaharia" wrote: > You need to factor your program so that it’s not just a main(). This is > not a Spark-specific issue, it’s about how you’d unit test any program in > general. In this case, your main() creates a SparkContext, so you can’t > pass one from o
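A rough illustration of the factoring Matei describes: put the logic in a function that takes an RDD, so a test can build its own local SparkContext and pass data in. The object and function names are invented:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object WordLengths {
      // The logic takes an RDD rather than living inside main(), so a test
      // can call it with data built from any SparkContext it constructs.
      def averageLength(lines: RDD[String]): Double =
        lines.map(_.length.toDouble).reduce(_ + _) / lines.count()

      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("WordLengths"))
        println(averageLength(sc.textFile(args(0))))
        sc.stop()
      }
    }

    // In a test:
    //   val sc = new SparkContext("local", "test")
    //   assert(WordLengths.averageLength(sc.parallelize(Seq("ab", "abcd"))) == 3.0)
    //   sc.stop()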

Is shuffle "stable"?

2014-06-14 Thread Daniel Darabos
What I mean is, let's say I run this: sc.parallelize(Seq(0->3, 0->2, 0->1), 3).partitionBy(new HashPartitioner(3)).collect Will the result always be Array((0,3), (0,2), (0,1))? Or could I possibly get a different order? I'm pretty sure the shuffle files are taken in the order of the source partiti

DStream are not processed after upgrade to Spark 1.0

2014-06-14 Thread Chang Lim
Hi All, I have some streaming code in Java that works on 0.9.1. After upgrading to 1.0 (with fixes for minor API changes), the DStream does not seem to be executing. The tasks get killed within 1 second by the worker. Any idea what is causing this? The worker log file is not logging my debug statements. The f

Re: GroupByKey results in OOM - Any other alternative

2014-06-14 Thread Sean Owen
Grouping by key is always problematic since a key might have a huge number of values. You can do a little better than grouping *all* values and *then* finding distinct values by using foldByKey, putting values into a Set. At least you end up with only distinct values in memory. (You don't need two
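A minimal sketch of the foldByKey-into-a-Set approach described here, assuming an existing SparkContext sc and made-up sample data:

    import org.apache.spark.SparkContext._   // pair-RDD functions in the Spark 1.0 API

    val pairs = sc.parallelize(Seq("a" -> 1, "a" -> 1, "a" -> 2, "b" -> 3))

    // Fold each key's values into a Set, so only distinct values stay in memory
    // instead of every occurrence being grouped first.
    val distinctPerKey = pairs.mapValues(Set(_)).foldByKey(Set.empty[Int])(_ ++ _)
    distinctPerKey.mapValues(_.size).collect().foreach(println)   // (key, distinct count)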

GroupByKey results in OOM - Any other alternative

2014-06-14 Thread Vivek YS
Hi, for the last couple of days I have been trying hard to get around this problem. Please share any insights on solving it. Problem: there is a huge list of (key, value) pairs. I want to transform this to (key, distinct values) and then eventually to (key, distinct value count). On sma
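If only the distinct count per key is needed (not the values themselves), an approximate alternative not mentioned in the thread is countApproxDistinctByKey, available from Spark 1.0; a sketch assuming an existing SparkContext sc and made-up sample data:

    import org.apache.spark.SparkContext._   // pair-RDD functions in the Spark 1.0 API

    val pairs = sc.parallelize(Seq("a" -> 1, "a" -> 1, "a" -> 2, "b" -> 3))

    // HyperLogLog-based estimate of the number of distinct values per key;
    // the argument trades accuracy for memory, and the counts are approximate.
    val approxCounts = pairs.countApproxDistinctByKey(0.05)
    approxCounts.collect().foreach(println)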

Accumulable with huge accumulated value?

2014-06-14 Thread Nilesh Chakraborty
Hey all! I have an iterative problem. I'm trying to find something similar to Hadoop's MultipleOutputs [1] in Spark 1.0. I need to build up a couple of large dense vectors (which may contain billions of elements; 2 billion doubles => at least 16GB) by adding partial vector chunks to them. This can be
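One way to avoid a single huge accumulated value on the driver is to keep the partial chunks in an RDD and sum them with reduceByKey; this is an alternative to Accumulable rather than the poster's approach, and the (vectorId, chunkIndex) keying is invented:

    import org.apache.spark.SparkContext._   // pair-RDD functions in the Spark 1.0 API
    import org.apache.spark.rdd.RDD

    // partials: ((vectorId, chunkIndex), chunk), where each chunk is a slice of a
    // large dense vector. reduceByKey keeps the summed vectors distributed across
    // the cluster instead of accumulating ~16 GB of state on the driver.
    def sumChunks(partials: RDD[((Int, Int), Array[Double])]): RDD[((Int, Int), Array[Double])] =
      partials.reduceByKey { (a, b) =>
        val out = new Array[Double](a.length)
        var i = 0
        while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
        out
      }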

SparkSQL registerAsTable - No TypeTag available Error

2014-06-14 Thread premdass
Hi, I am trying to run the Spark SQL example from the guide at https://spark.apache.org/docs/latest/sql-programming-guide.html as a standalone program. When I try to compile the program, I am getting the error below: Done updating. Compiling 1 Scala source to C:\Work\Dev\scala\wo