Re: Reading Hive tables Parallel in Spark

2017-07-18 Thread Matteo Cossu
The context you use for calling SparkSQL can be used only in the driver. Moreover, collect() works because it brings the RDD into local memory, but it should be used only for debugging purposes (95% of the time); if all your data fits into a single machine's memory you shouldn't use Spark at all but some…
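A sketch of that driver-side pattern (the query and output path are placeholders, not from the thread):

    // collect() pulls the table names into driver memory; the loop then
    // issues one Spark job per table from the driver, where the session is valid.
    tablesRDD.collect().foreach { t =>
      spark.sql(s"SELECT * FROM $t").write.parquet(s"/tmp/$t")  // placeholder output path
    }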

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Fretz Nuson
I was getting a NullPointerException when trying to call SparkSQL from foreach. After debugging, I got to know the Spark session is not available in the executors and could not successfully pass it to them. What I am doing is tablesRDD.collect().foreach() and it works, but runs sequentially. On Mon, Jul 17, 2017 at 5:58…
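A minimal sketch of the pattern that triggers the NPE (the query is a placeholder):

    // Fails: this closure is serialized and run on the executors,
    // where no SparkSession exists, hence the NullPointerException.
    tablesRDD.foreach { t =>
      spark.sql(s"SELECT * FROM $t")  // spark is not usable here
    }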

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Fretz Nuson
I did threading but got many failed tasks, and they were not reprocessed. I am guessing the driver lost track of the threaded tasks. I had also tried Scala's Future and .par, with the same result as above. On Mon, Jul 17, 2017 at 5:56 PM, Pralabh Kumar wrote: > Run the spark context in multithreaded way . > > S…
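For reference, the Future variant mentioned here would look roughly like this (a sketch only; the table list and conversion body are placeholders):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // One Future per table; each submits an independent job from the
    // driver. Await surfaces any failure so it is not silently dropped.
    val jobs = tablenames.map { t =>
      Future {
        spark.sql(s"SELECT * FROM $t").write.parquet(s"/tmp/$t")  // placeholder path
      }
    }
    jobs.foreach(j => Await.result(j, Duration.Inf))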

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Rick Moritz
Put your jobs into a parallel collection using .par -- then you can submit them very easily to Spark, using .foreach. The jobs will then run using the FIFO scheduler in Spark. The advantages over the prior approaches are that you won't have to deal with threads, and that you can leave parallelism…
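A sketch of the .par approach (the table names and output path are placeholders):

    // .par converts the List into a parallel collection; foreach then
    // submits the per-table jobs concurrently from the driver, and
    // Spark's scheduler (FIFO by default) orders them.
    List("table1", "table2", "table3").par.foreach { t =>
      spark.sql(s"SELECT * FROM $t").write.parquet(s"/tmp/$t")  // placeholder path
    }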

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Simon Kitching
Have you tried simply making a list with your tables in it, then using SparkContext.makeRDD(Seq)? i.e.

    val tablenames = List("table1", "table2", "table3", ...)
    val tablesRDD = sc.makeRDD(tablenames, nParallelTasks)
    tablesRDD.foreach(...)

> On 17.07.2017 at 14:12, FN wrote:
>
> Hi
> I am curren…

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Pralabh Kumar
Run the Spark context in a multithreaded way. Something like this:

    val spark = SparkSession.builder()
      .appName("practice")
      .config("spark.scheduler.mode", "FAIR")
      .enableHiveSupport()
      .getOrCreate()
    val sc = spark.sparkContext
    val hc = spark.sqlContext
    val thread1 = new Thread { overrid…
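A minimal sketch of how the truncated thread definition could continue (the table names and output paths are placeholders):

    // Each thread submits its own Spark job from the driver; with
    // spark.scheduler.mode=FAIR the concurrent jobs share cluster resources.
    val thread1 = new Thread {
      override def run(): Unit =
        hc.sql("SELECT * FROM table1").write.parquet("/tmp/table1")  // placeholder table/path
    }
    val thread2 = new Thread {
      override def run(): Unit =
        hc.sql("SELECT * FROM table2").write.parquet("/tmp/table2")  // placeholder table/path
    }
    thread1.start(); thread2.start()
    thread1.join(); thread2.join()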

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Matteo Cossu
Hello, have you tried using threads instead of the loop?

On 17 July 2017 at 14:12, FN wrote:
> Hi
> I am currently trying to parallelize reading multiple tables from Hive. As
> part of an archival framework, I need to convert a few hundred tables which
> are in txt format to Parquet. For now I a…