In Spark-SQL, is there support for distributed execution of native Hive UDAFs?

2015-04-23 Thread daniel.mescheder
Hi everyone, I was playing with the integration of Hive UDAFs in Spark-SQL and noticed that the terminatePartial and merge methods of custom UDAFs were not called. This made me curious, as those two methods are the ones responsible for distributing the UDAF execution in Hive. Looking at the code
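
For readers unfamiliar with the lifecycle the message refers to, the following is a minimal sketch of how terminatePartial and merge allow an aggregate to be computed per partition and then combined. It is plain Scala and uses no Hive or Spark classes; the names PartialMean, MeanEvaluator and PartialAggregationDemo are hypothetical and chosen only for illustration, and the sketch makes no claim about how Spark-SQL actually invokes Hive UDAFs.

```scala
// Plain-Scala stand-in for a UDAF evaluator lifecycle; no Hive classes are used.
// All names here are hypothetical and exist only for this illustration.
final case class PartialMean(sum: Double, count: Long)

final class MeanEvaluator {
  private var sum = 0.0
  private var count = 0L

  // Called once per input row on the worker that holds the row.
  def iterate(value: Double): Unit = { sum += value; count += 1 }

  // Emits this evaluator's intermediate state so it can be sent elsewhere.
  def terminatePartial(): PartialMean = PartialMean(sum, count)

  // Folds another evaluator's intermediate state into this one.
  def merge(other: PartialMean): Unit = { sum += other.sum; count += other.count }

  // Produces the final result; None when no rows were seen.
  def terminate(): Option[Double] = if (count == 0) None else Some(sum / count)
}

object PartialAggregationDemo {
  // Aggregates each partition independently, then merges the partial states.
  def distributedMean(partitions: Seq[Seq[Double]]): Option[Double] = {
    val partials = partitions.map { rows =>
      val local = new MeanEvaluator
      rows.foreach(local.iterate)
      local.terminatePartial()
    }
    val reducer = new MeanEvaluator
    partials.foreach(reducer.merge)
    reducer.terminate()
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1.0, 2.0), Seq(3.0), Seq.empty[Double])
    println(distributedMean(partitions))
  }
}
```

If terminatePartial and merge are never called, as the message reports, the whole aggregation would have to run through iterate and terminate on a single evaluator, which is what makes the observation relevant to distributed execution.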

Performance & Memory Issues When Creating Many Columns in GROUP BY (spark-sql)

2015-05-19 Thread daniel.mescheder
Dear List, We have run into serious problems trying to run a larger-than-average number of aggregations in a GROUP BY query. Symptoms of this problem are OutOfMemory exceptions and unreasonably long processing times due to GC. The problem occurs when the following two conditions are met: - The

Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2015-07-15 Thread daniel.mescheder
Hey everyone, Consider the following use of spark.sql.shuffle.partitions:

    case class Data(A:String = f"${(math.random*1e8).toLong}%09.0f", B: String = f"${(math.random*1e8).toLong}%09.0f")
    val dataFrame = (1 to 1000).map(_ => Data()).toDF
    dataFrame.registerTempTable("data")
    sqlContext.setConf( "