Hi All,

I'm writing a generic PySpark program to process multiple datasets using Spark SQL, for example Traffic Data, Crime Data, and Weather Data. Each dataset will be in CSV format & its size may vary from *1 GB* to *10 GB*. Each dataset will be available at a different timeframe (weekly, monthly, quarterly).
My requirement is to process all the datasets in parallel by triggering the job only once. In the current implementation we are using the Spark CSV package for reading the CSV files & the Python threading mechanism to trigger multiple jobs:

----------------------
import threading

jobs = []
for dict_key, dict_val in config_dict.items():
    t = threading.Thread(target=task, args=(sqlContext, dict_val))
    jobs.append(t)
    t.start()

for x in jobs:
    x.join()
----------------------

And we defined a task method to process each dataset based on its config values dict:

----------------------
def task(sqlContext, data_source_dict):
    ...
----------------------

The task method:
1. creates a DataFrame from the CSV file,
2. then creates a temporary table from that DataFrame,
3. finally ingests the data into a Hive table.
(A simplified sketch is in the P.S. at the end of this mail.)

*Issue:*

If I process two datasets in parallel, one dataset goes through successfully, but for the other dataset I'm getting the error "*u'temp_table not found*" while ingesting data into the Hive table. It happens consistently, with either dataset A or dataset B, on this statement:

sqlContext.sql('INSERT INTO TABLE ' + hivetablename + ' SELECT * FROM ' + temp_table_name)

I tried the following:
1. Creating the DataFrame name & temporary table name dynamically, based on the dataset name.
2. Enabling Spark dynamic allocation (--conf spark.dynamicAllocation.enabled=true).
3. Setting spark.scheduler.mode to FAIR.

I would appreciate advice on:
1. Is anything wrong in the above implementation?
2. Is it a good idea to process those big datasets in parallel in one job?
3. Is there any other solution to process multiple datasets in parallel?

Thank you,
Amol Patil
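
P.S. For reference, simplified, the task method looks roughly like this. The config keys 'path', 'dataset_name', and 'hive_table' are placeholders for what we actually read from config_dict:

----------------------
def task(sqlContext, data_source_dict):
    # The temp table name is derived from the dataset name, so each
    # thread should be registering its own, uniquely named table.
    # ('dataset_name' is a placeholder config key.)
    temp_table_name = data_source_dict['dataset_name'] + '_temp'

    # 1. Create a DataFrame from the CSV file via the spark-csv package.
    df = (sqlContext.read
          .format('com.databricks.spark.csv')
          .options(header='true', inferSchema='true')
          .load(data_source_dict['path']))

    # 2. Register a temporary table from that DataFrame.
    df.registerTempTable(temp_table_name)

    # 3. Ingest the data into the target Hive table.
    sqlContext.sql('INSERT INTO TABLE ' + data_source_dict['hive_table'] +
                   ' SELECT * FROM ' + temp_table_name)
----------------------

All threads share the same sqlContext, which is passed in from the driver.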