dataFrame.coalesce(1) or dataFrame.repartition(1) does not seem to work for me

2015-07-10 Thread kachau
Hi, I have a Hive insert into query which creates new Hive partitions. I have two Hive partitions named server and date. Now I execute insert into queries using the following code and try to save the result: DataFrame dframe = hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') sele

SparkR Error in sparkR.init(master="local") in RStudio

2015-07-10 Thread kachau
I have installed the SparkR package from the Spark distribution into the R library. I can call the following command and it seems to work properly: library(SparkR) However, when I try to get the Spark context using the following code: sc <- sparkR.init(master="local") It fails after some time with th

How do we control output part files created by Spark job?

2015-07-06 Thread kachau
Hi, I have a couple of Spark jobs which process thousands of files every day. File sizes may vary from MBs to GBs. After finishing a job I usually save using the following code: finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR dataFrame.write.format("orc").save("/path/in/hdfs") //storing as ORC

How to call hiveContext.sql() on all the Hive partitions in parallel?

2015-07-06 Thread kachau
Hi, I have to fire a few insert into queries which use Hive partitions. I have two Hive partitions named server and date. Now I execute insert into queries using hiveContext as shown below; the query works fine: hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') select from sou