Partition parquet data by ENUM column

2015-07-21 Thread ankits
Hi, I am using a custom build of Spark 1.4 with the parquet dependency upgraded to 1.7. I have Thrift data encoded with parquet that I want to partition by a column of type ENUM. The Spark programming guide says partition discovery is only supported for string and numeric columns, so it seems partition
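
One workaround sketch, under the assumption that the data can be rewritten with the Spark 1.4 DataFrame API: cast the ENUM column to a plain string before writing, so the partition directories carry a type that partition discovery understands. The path and the "status" column below are made-up names.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("/data/events")            // hypothetical input path

    df.withColumn("status_str", df("status").cast("string"))    // materialize the ENUM as a string
      .write
      .partitionBy("status_str")                                // partition discovery supports strings
      .parquet("/data/events_partitioned")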

How to avoid the delay associated with Hive Metastore when loading parquet?

2016-10-23 Thread ankits
Hi, I'm loading parquet files via Spark, and the first time a file is loaded there is a 5-10s delay related to the Hive Metastore, with metastore messages in the console. How can I avoid this delay and keep the metadata around? I want the data to be persisted even after kill
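
If Hive tables are not actually needed, one thing to try (a sketch for Spark 2.x, not a verified fix) is to build the SparkSession without Hive support so the embedded metastore is never initialized, read the parquet files directly, and cache the result so the data stays around for later queries. The path is a placeholder.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parquet-no-hive")          // note: no .enableHiveSupport(), so the in-memory catalog is used
      .getOrCreate()

    val df = spark.read.parquet("/path/to/parquet")
    df.cache()                             // keep the data around across queries
    df.count()                             // pay the load cost once, up front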

StackOverflowError with SchemaRDD

2015-01-28 Thread ankits
Hi, I am getting a stack overflow error when querying a SchemaRDD composed of parquet files. This is (part of) the stack trace: Caused by: java.lang.StackOverflowError at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$
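
One mitigation sketch, assuming the overflow comes from deep recursion while building or resolving the query plan: raise the JVM thread stack size. The 4m value is arbitrary, and the driver's own stack size usually has to be passed at launch (e.g. spark-submit --driver-java-options "-Xss4m") rather than set from a running SparkConf.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-Xss4m")   // larger thread stacks on the executors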

Serialized task result size exceeded

2015-01-30 Thread ankits
This is on Spark 1.2. I am loading ~6k parquet files, roughly 500 MB each, into a SchemaRDD and calling count() on it. After about 2705 tasks complete (there is one per file), the job crashes with this error: Total size of serialized results of 2705 tasks (1024.0 MB) is bigger than spark.driver.max
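
The usual knob, shown here with an illustrative value: raise spark.driver.maxResultSize (the default is 1g, and "0" disables the check), or restructure the job so less per-task result data is shipped back to the driver.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.maxResultSize", "2g")   // default is 1g; "0" means no limit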

Why are task results large in this case?

2015-02-04 Thread ankits
I am running a job, part of which adds some "null" values to the rows of a SchemaRDD. The job fails with "Total size of serialized results of 2692 tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)". This is the code: val in = sqc.parquetFile(...) .. val presentColProj: Sc

Limit # of parallel parquet decompresses

2015-03-12 Thread ankits
My jobs frequently run out of memory if the number of cores on an executor is too high, because each core launches a new parquet decompressor thread, which allocates memory off heap to decompress. Consequently, even with say 12 cores on an executor, depending on the memory, I can only use 2-3 to avoid OO
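
Two blunt ways to cap the number of concurrently decompressing tasks per executor, sketched with illustrative values (the right numbers depend on the off-heap memory budget): give each executor fewer cores, or make each task reserve more than one core via spark.task.cpus.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.cores", "4")   // hard cap on concurrent tasks per executor
      .set("spark.task.cpus", "3")        // or: each task claims 3 cores, so a 12-core executor
                                          // runs at most 4 tasks (and decompressors) at once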

Exceptions not caught?

2014-10-23 Thread ankits
Hi, I'm running a Spark job and encountering an exception related to Thrift. I wanted to know where it is being thrown, but the stack trace is completely useless. So I started adding try/catch blocks, to the point where my whole main method that does everything is surrounded with a try/catch. Even the

Re: Exceptions not caught?

2014-10-23 Thread ankits
I am simply catching all exceptions (like case e:Throwable => println("caught: "+e) ) Here is the stack trace: 2014-10-23 15:51:10,766 ERROR [] Exception in task 1.0 in stage 1.0 (TID 1) java.io.IOException: org.apache.thrift.protocol.TProtocolException: Required field 'X' is unset! Struct:Y(id:

Re: Exceptions not caught?

2014-10-23 Thread ankits
Also everything is running locally on my box, driver and workers.

Re: Exceptions not caught?

2014-10-23 Thread ankits
>Can you check your class Y and fix the above? I can, but this is about catching the exception should it be thrown by any class in the Spark job. Why is the exception not being caught?
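
What usually explains this, with a sketch of the workaround: the TProtocolException is thrown inside a task on an executor, not in main(), so a try/catch in the driver never sees the original throw. The driver only gets a SparkException once the task has failed all of its retries, and that can be caught around the action. Here `rdd` stands in for whatever RDD the failing job materializes.

    import org.apache.spark.SparkException

    try {
      rdd.count()                         // the action that triggers the failing tasks
    } catch {
      case e: SparkException =>
        println("action failed; root cause from the executor: " + e.getMessage)
    }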

Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread ankits
I want to set up Spark SQL to allow ad hoc querying over the last X days of processed data, where the data is processed through Spark. This would also have to cache data (in memory only), so the approach I was thinking of was to build a layer that persists the appropriate RDDs and stores them in me

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread ankits
Thanks for your response Michael. I'm still not clear on all the details - in particular, how do I take a temp table created from a SchemaRDD and allow it to be queried using the Thrift JDBC server? From the Hive guides, it looks like it only supports loading data from files, but I want to query t
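
One way to wire this up (a sketch, assuming a 1.x build that ships HiveThriftServer2.startWithContext): register the SchemaRDD as a temp table, cache it, and start the Thrift JDBC server inside the same process so it shares the context and therefore sees the temp table. The path and table name are placeholders.

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    val processed = hiveContext.parquetFile("/data/processed")
    processed.registerTempTable("recent_data")
    hiveContext.cacheTable("recent_data")              // in-memory columnar cache

    HiveThriftServer2.startWithContext(hiveContext)    // JDBC clients can now query recent_data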

Creating a SchemaRDD from RDD of thrift classes

2014-10-30 Thread ankits
I have one job with Spark that creates some RDDs of type X and persists them in memory. The type X is an auto-generated Thrift Java class (not a case class though). Now in another job, I want to convert the RDD to a SchemaRDD using sqlContext.applySchema(). Can I derive a schema from the thrift def
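
There is no automatic schema derivation for plain Thrift-generated classes (they are not case classes), so one hedged approach in the Spark 1.1/1.2 style is to describe the schema by hand and map each Thrift object to a Row. The fields and getters below (getId, getName) are hypothetical stand-ins for the real Thrift accessors, and thriftRDD is the persisted RDD[X] from the first job.

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)))

    val rowRDD = thriftRDD.map(x => Row(x.getId, x.getName))   // one Row per Thrift object
    val schemaRDD = sqlContext.applySchema(rowRDD, schema)
    schemaRDD.registerTempTable("x_records")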

How to trace/debug serialization?

2014-11-05 Thread ankits
In my Spark job, I have a loop something like this:

    bla.foreachRDD(rdd => {
      // init some vars
      rdd.foreachPartition(partition => {
        // init some vars
        partition.foreach(kv => {
          ...

I am seeing serialization errors (unread block data), because I think Spark is trying to serialize the wh
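
A common cause, with the usual fix sketched below: referring to a field or method of the enclosing class inside these closures drags the whole outer object into serialization. Copying what is needed into local vals keeps the closure small, and per-partition helpers that are not serializable belong inside foreachPartition. Running the driver with -Dsun.io.serialization.extendedDebugInfo=true also makes the JVM report which object graph was being serialized. `config` and `stream` are placeholder names.

    val localConfig = config                       // copy the field into a local val on the driver
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val formatter = new java.text.SimpleDateFormat("yyyy-MM-dd")   // non-serializable helper, created per partition
        partition.foreach { kv =>
          // use localConfig and formatter here, never this.config
        }
      }
    }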

Imbalanced shuffle read

2014-11-11 Thread ankits
I'm running a job that uses groupByKey(), so it generates a lot of shuffle data. Then it processes this and writes files to HDFS in a foreachPartition block. Looking at the foreachPartition stage details in the web console, all but one executor is idle (SUCCESS in 50-60ms), and one is RUNNING with a

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
I tried that, but that did not resolve the problem. All the executors for partitions except one have no shuffle reads and finish within 20-30 ms. One executor has a complete shuffle read of the previous stage. Any other ideas on debugging this?

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
Adding a call to rdd.repartition() after randomizing the keys has no effect either. code -

    // partitioning is done like partitionIdx = f(key) % numPartitions
    // we use random keys to get even partitioning
    val uniform = other_stream.transform(rdd => {
      rdd.map({ kv =>
        val

Re: Imbalanced shuffle read

2014-11-12 Thread ankits
I have made some progress - the partitioning is very uneven, and everything goes to one partition. I see that spark partitions by key, so I tried this:

    // partitioning is done like partitionIdx = f(key) % numPartitions
    // we use random keys to get even partitioning
    val uniform = other_st
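
A debugging sketch for this kind of skew: first count records per partition to confirm where everything is landing, then salt the key with a bounded random suffix so a single hot key spreads over many reducers (purely random keys defeat grouping, a bounded salt does not). `pairs` is a placeholder for the (key, value) RDD inside the transform.

    import scala.util.Random

    // 1) where is the data actually landing?
    val sizes = pairs.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size))).collect()
    sizes.sorted.foreach { case (idx, n) => println(s"partition $idx -> $n records") }

    // 2) spread each hot key over up to 32 sub-keys before grouping
    val salted = pairs.map { case (k, v) => ((k, Random.nextInt(32)), v) }
      .groupByKey()                                  // at most 32 smaller groups per hot key
      .map { case ((k, _), vs) => (k, vs) }          // drop the salt; merge the partial groups downstream if needed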

Remove added jar from spark context

2014-12-01 Thread ankits
Hi, Is there a way to REMOVE a jar (added via addJar) from a Spark context? We have a long-running context used by the spark jobserver, but after trying to update versions of classes already in the classpath via addJars, the context still runs the old versions. It would be helpful if I could remove

RDDs being cleaned too fast

2014-12-10 Thread ankits
I'm using Spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast. How can I inspect the size of an RDD in memory and get more information about why it was cleaned up? There should be more than enough memory available on the cluster to store them, and by default, the spark.cleaner.ttl is in
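
A sketch for the inspection part of the question: SparkContext.getRDDStorageInfo reports per-RDD memory and disk usage for everything currently cached (the Storage tab of the web UI shows the same numbers).

    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name} (id=${info.id}): " +
        s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
        s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
    }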

spark.cleaner questions

2015-01-13 Thread ankits
I am using Spark 1.1 with the Ooyala job server (which basically creates long-running Spark jobs as contexts to execute jobs in). These contexts have cached RDDs in memory (via RDD.persist()). I want to enable the spark.cleaner to clean up the /spark/work directories that are created for each app,

Re: saveAsTextFile

2015-01-15 Thread ankits
I have seen this happen when the RDD contains null values. Essentially, saveAsTextFile calls toString() on the elements of the RDD, so a call to null.toString will result in an NPE.
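
A sketch of the usual guard, assuming the nulls are expected in the data: drop or substitute them before saveAsTextFile calls toString() on every element. The output path is a placeholder.

    val safe = rdd.filter(_ != null)                              // drop null elements entirely, or:
    val blanked = rdd.map(x => if (x == null) "" else x.toString) // keep the row, blank the value
    safe.saveAsTextFile("/out/path")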