@Ayan - Creating the temp table dynamically based on the dataset name. I
will explore the df.saveAsTable option.
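For concreteness, a minimal sketch of what that dynamic naming might look
like. This is a sketch only; the staging prefix, table names, and helper
function are hypothetical, assuming Spark 2.x with Hive support:

from pyspark.sql import SparkSession

# One SparkSession shared by all ingestion tasks.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def ingest_via_temp_view(dataset_name, csv_path):
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    # Derive the view name from the dataset, so concurrent tasks
    # never register or read the same temp view.
    view_name = "stg_" + dataset_name   # e.g. stg_traffic, stg_crime
    df.createOrReplaceTempView(view_name)
    spark.sql("INSERT INTO TABLE {t} SELECT * FROM {v}".format(
        t=dataset_name, v=view_name))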
On Mon, Apr 17, 2017 at 9:53 PM, Ryan wrote:
It shouldn't be a problem then. We've done a similar thing in Scala. I
don't have much experience with Python threads, but maybe the code related
to reading/writing the temp table isn't thread safe.
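To illustrate the hazard: temp views are scoped to the SparkSession, not
to a thread, so if two threads ever use the same view name, the later call
silently replaces the earlier view. A toy example (not code from this
thread, reusing the `spark` session from the sketch above):

df_a = spark.createDataFrame([(1,)], ["id"])
df_b = spark.createDataFrame([(2,)], ["id"])
df_a.createOrReplaceTempView("staging")    # say, from thread 1
df_b.createOrReplaceTempView("staging")    # thread 2 replaces the same view
spark.sql("SELECT * FROM staging").show()  # both threads now see df_b's rows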
On Mon, Apr 17, 2017 at 9:45 PM, Amol Patil wrote:
What happens if you do not use the temp table, but directly do
df.saveAsTable with mode append? If I have to guess without looking at the
code of your task function, I would think the name of the temp table is
evaluated statically, so all threads are referring to the same table. In
other words your app is n
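A sketch of that direct-append suggestion (the database name is made up,
and `spark` is the session from the sketches above):

def ingest_direct(dataset_name, csv_path):
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    # Append straight into the per-dataset Hive table: no shared temp
    # view, so there is no view name for threads to collide on.
    df.write.mode("append").saveAsTable("mydb." + dataset_name)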
Thanks Ryan,

Each dataset has a separate hive table. All hive tables belong to the same
hive database.

The idea is to ingest data in parallel into the respective hive tables.

If I run the code sequentially for each data source it works fine, but it
will take a lot of time. We are planning to process around 30-40
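For what it's worth, the parallel driver could look roughly like this. It
is a sketch under the assumption of one SparkSession shared across driver
threads (Spark does allow submitting jobs concurrently from several
threads); the dataset names and paths below are made up:

from concurrent.futures import ThreadPoolExecutor

datasets = {                        # hypothetical dataset -> csv location
    "traffic": "/data/traffic/2017-04.csv",
    "crime":   "/data/crime/2017-04.csv",
    "weather": "/data/weather/2017-04.csv",
}

def ingest(name, path):
    df = spark.read.csv(path, header=True, inferSchema=True)
    df.write.mode("append").saveAsTable("mydb." + name)

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(ingest, n, p) for n, p in datasets.items()]
    for f in futures:
        f.result()                  # surface any per-dataset failure here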
I don't think you can do parallel inserts into a hive table without
dynamic partitioning; for hive locking, please refer to
https://cwiki.apache.org/confluence/display/Hive/Locking.

Other than that, it should work.
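If everything did have to land in a single table, a dynamic-partition
append might look like this (the settings, partition column, and table
name are my assumptions, not something stated in this thread; `df` and
`dataset_name` are as in the earlier sketches):

from pyspark.sql.functions import lit

spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# One shared table partitioned by source; each thread appends rows only
# for its own partition value.
(df.withColumn("source", lit(dataset_name))
   .write.mode("append")
   .partitionBy("source")
   .saveAsTable("mydb.ingest_all"))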
On Mon, Apr 17, 2017 at 6:52 AM, Amol Patil wrote:
Hi All,

I'm writing a generic pyspark program to process multiple datasets using
Spark SQL, for example Traffic Data, Crime Data, Weather Data. Datasets
will be in csv format and size may vary from *1 GB* to *10 GB*. Each
dataset will be available at a different timeframe (weekly, monthly,
quarterly).

My