Would it be an option to just write the results of each job into a separate table, and then run a UNION over all of them into a final target table at the end? Just thinking of an alternative!
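Roughly something like the sketch below, assuming Spark SQL with Hive support; the staging table names (results_job_1 etc.) and the final table name are just placeholders:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Each concurrent job writes to its own staging table, so no two
// writers ever touch the same metastore entry, e.g.:
// df.write.mode(SaveMode.Overwrite).saveAsTable(s"results_job_$jobId")

// Once all jobs have finished, union the staging tables into the
// final target table in a single, sequential step.
val jobTables = Seq("results_job_1", "results_job_2", "results_job_3")
val combined = jobTables.map(spark.table).reduce(_ union _)
combined.write.mode(SaveMode.Append).saveAsTable("results_final")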
Thanks,
Subhash

Sent from my iPhone

> On Apr 20, 2017, at 3:48 AM, Rick Moritz <rah...@gmail.com> wrote:
>
> Hi List,
>
> I'm wondering if the following behaviour should be considered a bug, or whether it "works as designed":
>
> I'm starting multiple concurrent (FIFO-scheduled) jobs in a single SparkContext, some of which write into the same tables.
> When these tables already exist, it appears as though both jobs [at least believe that they] successfully appended to the table (i.e., both jobs terminate successfully, but I haven't checked whether the data from both jobs was actually written, or whether one job overwrote the other's data, despite Mode.APPEND). If the table does not exist, both jobs will attempt to create it, but whichever job's turn is second (or later) will then fail with an AlreadyExistsException (org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: AlreadyExistsException).
>
> I think the issue here is that both jobs determine early on that they will need to create the table, but neither registers it with the metastore until it actually starts writing. The slower job then obviously fails to create the table and, instead of falling back to appending the data to the existing table, crashes out.
>
> I would consider this a bit of a bug, but I'd like to make sure that it isn't merely a case of me doing something stupid elsewhere, or indeed simply an inherent architectural limitation of working with the metastore, before going to Jira with this.
>
> Also, I'm aware that running the jobs strictly sequentially would work around the issue, but that would require reordering jobs before sending them off to Spark, or it would kill efficiency.
>
> Thanks for any feedback,
>
> Rick
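For reference, this is roughly the pattern I understand the quoted mail to describe (a sketch only, assuming the jobs are submitted as concurrent actions on a shared SparkSession; the DataFrames and the table name shared_table are made up):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

def writeJob(df: DataFrame): Future[Unit] = Future {
  // Both jobs take this path. If shared_table already exists, both
  // simply append; if it does not, each decides it must create it,
  // and the slower one fails with the AlreadyExistsException above
  // instead of falling back to an append.
  df.write.mode(SaveMode.Append).saveAsTable("shared_table")
}

val dfA: DataFrame = spark.range(100).toDF("id")
val dfB: DataFrame = spark.range(100, 200).toDF("id")
val jobs = Seq(writeJob(dfA), writeJob(dfB))
jobs.foreach(f => Await.result(f, Duration.Inf))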