---- Replied Message ----
| From | Tufan Rakshit<tufan...@gmail.com> |
| Date | 07/24/2022 18:59 |
| To | Shay Elbaz<shay.el...@gm.com> |
| Cc | kineret M<kiner...@gmail.com>, user<user@spark.apache.org> |
| Subject | Re: [EXTERNAL] Partial data with ADLS Gen2 |

Just use Delta.

Best
Tufan

Sent from my iPhone

On 24 Jul 2022, at 12:20, Shay Elbaz <shay.el...@gm.com> wrote:

This is a known issue. Apache Iceberg, Hudi, and Delta Lake are among the possible solutions. Alternatively, instead of writing the output directly to the "official" location, write it to a staging directory. Once the job is done, rename the staging directory to the official location.

From: kineret M <kiner...@gmail.com>
Sent: Sunday, July 24, 2022 1:06 PM
To: user@spark.apache.org <user@spark.apache.org>
Subject: [EXTERNAL] Partial data with ADLS Gen2

I have a Spark batch application writing to ADLS Gen2 (hierarchical namespace). When designing the application I assumed Spark would perform a global commit once the job completes, but what it actually does is commit per task: once a task finishes writing, its output is moved from the temp location to the target storage. So if the batch fails we have partial data, and on retry we get data duplication. Our scale is really huge, so rolling back (deleting the data) is not an option for us; the search would take too long. Is there any "built-in" solution, something we can use out of the box?

Thanks.
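For readers finding this thread later: the staging-directory approach Shay describes can be sketched in plain Python against the local filesystem. This is only an illustration of the commit pattern, not Spark code; `write_with_staging` and its file naming are hypothetical. The key idea is that all "task" output lands in a staging directory invisible to readers, and a single directory rename publishes everything at once (ADLS Gen2 with hierarchical namespace enabled supports atomic directory renames, which is what makes this safe there).

```python
import os
import shutil
import tempfile

def write_with_staging(rows, official_dir):
    """Write output to a staging directory, then publish it with one rename.

    Hypothetical helper for illustration only; a real Spark job would
    write part files via df.write to the staging path, and the driver
    would perform the rename after the job succeeds.
    """
    parent = os.path.dirname(official_dir) or "."
    staging_dir = tempfile.mkdtemp(prefix="_staging_", dir=parent)
    try:
        # Simulated "tasks" each write a part file into the staging area.
        for i, row in enumerate(rows):
            part = os.path.join(staging_dir, f"part-{i:05d}.txt")
            with open(part, "w") as f:
                f.write(row + "\n")
        # Global commit: one atomic rename makes all files visible at once.
        # If the job fails before this point, the official location is
        # never touched, so a retry cannot produce duplicates there.
        os.replace(staging_dir, official_dir)
    except BaseException:
        # On failure, discard the staging area instead of the real data.
        shutil.rmtree(staging_dir, ignore_errors=True)
        raise
```

A retry after a mid-job failure simply starts over with a fresh staging directory; cleanup is a cheap `rmtree` of the staging path rather than a search-and-delete over the official dataset.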