Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread Amol Patil
@Ayan - I'm creating the temp table dynamically based on the dataset name. I will explore the df.saveAsTable option. On Mon, Apr 17, 2017 at 9:53 PM, Ryan wrote: > It shouldn't be a problem then. We've done a similar thing in Scala. I > don't have much experience with Python threads, but maybe the code related to reading/writing the temp table isn't thread-safe.
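
A minimal sketch of the df.saveAsTable option being discussed, with hypothetical database/table names (mydb, crime_data), writing each dataset's DataFrame straight to its own Hive table:

    # Hypothetical names; the point is skipping the intermediate temp table.
    def ingest(df, table_name):
        # saveAsTable creates the Hive table on first run; "append" adds
        # rows on each periodic (weekly/monthly/quarterly) load.
        df.write.mode("append").saveAsTable("mydb." + table_name)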

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread Ryan
It shouldn't be a problem then. We've done a similar thing in Scala. I don't have much experience with Python threads, but maybe the code related to reading/writing the temp table isn't thread-safe. On Mon, Apr 17, 2017 at 9:45 PM, Amol Patil wrote: > Thanks Ryan, > > Each dataset has a separate hive table.
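
For reference, a sketch of what the threaded Python driver might look like, assuming hypothetical paths and table names; a single SparkSession shared across threads is fine for submitting concurrent jobs:

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hypothetical dataset list: (source path, target Hive table).
    datasets = [
        ("/data/traffic.csv", "traffic_data"),
        ("/data/crime.csv", "crime_data"),
        ("/data/weather.csv", "weather_data"),
    ]

    def ingest(path, table):
        # Each thread submits its own Spark job against the shared session.
        df = spark.read.csv(path, header=True, inferSchema=True)
        df.write.mode("append").saveAsTable("mydb." + table)

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(ingest, p, t) for p, t in datasets]
        for f in futures:
            f.result()  # surface any per-dataset failure on the driver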

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread ayan guha
What happens if you do not use the temp table, but directly do df.saveAsTable with mode append? If I have to guess without looking at the code of your task function, I would think the name of the temp table is evaluated statically, so all threads are referring to the same table. In other words your app is n
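
To make that guess concrete, a hypothetical illustration of the failure mode and the fix (view and table names are made up, not from the original code):

    # BUG: if every thread registers its view under the same literal name,
    # the last createOrReplaceTempView wins and threads read each other's data.
    df.createOrReplaceTempView("staging")
    spark.sql("INSERT INTO TABLE mydb.target SELECT * FROM staging")

    # FIX: derive the view name from the dataset so each thread gets its own.
    view = "staging_" + table_name  # e.g. staging_crime_data
    df.createOrReplaceTempView(view)
    spark.sql("INSERT INTO TABLE mydb.{0} SELECT * FROM {1}".format(table_name, view))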

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread Amol Patil
Thanks Ryan, Each dataset has a separate hive table. All hive tables belong to the same hive database. The idea is to ingest data in parallel into the respective hive tables. If I run the code sequentially for each data source, it works fine but takes a lot of time. We are planning to process around 30-40
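
One knob worth mentioning when running that many jobs concurrently in a single application: Spark schedules concurrent jobs FIFO by default, so one large load can starve the rest. A sketch of enabling FAIR scheduling (the app name is hypothetical):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("multi-dataset-ingest")          # hypothetical name
             .config("spark.scheduler.mode", "FAIR")   # share executors across jobs
             .enableHiveSupport()
             .getOrCreate())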

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-16 Thread Ryan
I don't think you can do parallel inserts into a hive table without dynamic partitions; for hive locking please refer to https://cwiki.apache.org/confluence/display/Hive/Locking. Other than that, it should work. On Mon, Apr 17, 2017 at 6:52 AM, Amol Patil wrote: > Hi All, > > I'm writing a generic PySpark program to process multiple datasets using Spark SQL.
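
If the target were a single partitioned hive table, the dynamic-partition settings from those docs would need to be enabled before the parallel inserts; a sketch with a hypothetical table name:

    # Settings required for dynamic-partition inserts (per the Hive wiki).
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
    # insertInto matches columns by position; partition columns must
    # come last in the DataFrame's column order.
    df.write.mode("append").insertInto("mydb.events_partitioned")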

Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-16 Thread Amol Patil
Hi All, I'm writing a generic PySpark program to process multiple datasets using Spark SQL, for example Traffic Data, Crime Data, and Weather Data. Datasets will be in CSV format and size may vary from *1 GB* to *10 GB*. Each dataset will be available on a different timeframe (weekly, monthly, quarterly). My
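
A sketch of what the generic read step could look like, with the path passed in per dataset (the options shown are assumptions about the CSVs, not taken from the original program):

    def read_dataset(spark, path):
        # Assumes the CSVs carry a header row; inferSchema makes an extra
        # pass over the file, so for 1-10 GB inputs supplying an explicit
        # schema per dataset would be cheaper.
        return (spark.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(path))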