I tried a different tactic. I still append based on the query below, but I add
another deduping step afterwards, writing to a staging directory then
overwriting back. Luckily, the data is small enough for this to happen fast.
Cheers,
Ben
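A minimal PySpark sketch of the pattern Ben describes - append the new rows,
then dedupe the full dataset through a staging directory before overwriting
back. The paths and the new_rows DataFrame are illustrative, not taken from
the original job:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    final_path = "s3://bucket/table/"            # hypothetical final location
    staging_path = "s3://bucket/table_staging/"  # hypothetical staging location

    # 1. Append the candidate rows (new_rows is assumed to be the output of
    #    the query mentioned above) to the existing data.
    new_rows.write.mode("append").parquet(final_path)

    # 2. Re-read everything, drop duplicates on the full row, write to staging.
    deduped = spark.read.parquet(final_path).dropDuplicates()
    deduped.write.mode("overwrite").parquet(staging_path)

    # 3. Overwrite the final location from the staging copy.
    spark.read.parquet(staging_path).write.mode("overwrite").parquet(final_path)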
> On Jun 3, 2018, at 3:02 PM, Tayler Lawrence Jones wrote:
Sorry, actually my last message is not true for anti join; I was thinking of
semi join.
-TJ
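A small sketch of the distinction being corrected here, using hypothetical
DataFrames new_df and existing_df with a key column id:

    # left_semi keeps rows of new_df that DO have a match in existing_df;
    # left_anti keeps rows of new_df that do NOT have a match.
    matched = new_df.join(existing_df, "id", "left_semi")
    new_only = new_df.join(existing_df, "id", "left_anti")

A plain left join plus a null filter still reproduces left_anti even when
existing_df has duplicate keys, but it is not a drop-in for left_semi, since
duplicate keys on the right would multiply the matched rows - which is the
case the correction above is about.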
On Sun, Jun 3, 2018 at 14:57 Tayler Lawrence Jones
wrote:
A left join with a null filter is only the same as a left anti join if the
join keys can be guaranteed unique in the existing data. Since Hive tables
on S3 offer no uniqueness guarantees outside of your processing code, I
recommend using a left anti join over a left join + null filter.
-TJ
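A sketch of the two formulations being compared, assuming hypothetical
DataFrames new_df and existing_df joined on a key column id:

    from pyspark.sql import functions as F

    # Option 1: left anti join - keep only new rows with no match in existing.
    to_append = new_df.join(existing_df, "id", "left_anti")

    # Option 2: left join, then filter on a null right-side column.
    to_append_alt = (
        new_df.alias("n")
        .join(existing_df.alias("e"), F.col("n.id") == F.col("e.id"), "left")
        .where(F.col("e.id").isNull())
        .select("n.*")
    )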
On Sun, Jun 3, 2
I do not use anti join semantics, but you can use a left outer join and then
filter out nulls from the right side. Your data may have dups on the columns
individually, but it should not have dups on the composite key, i.e. all
columns put together.
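A sketch of this suggestion, assuming the composite key is simply every
column of a hypothetical new_df / existing_df pair with matching schemas:

    from pyspark.sql import functions as F

    key_cols = new_df.columns  # composite key = all columns put together

    # Tag the existing rows, left-outer join on every column, and keep only
    # the new rows with no match (the marker stays null for those).
    marked = existing_df.withColumn("_exists", F.lit(True))
    only_new = (
        new_df.join(marked, on=key_cols, how="left_outer")
              .where(F.col("_exists").isNull())
              .drop("_exists")
    )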
On Mon, 4 Jun 2018 at 6:42 am, Tayler Lawrence Jones
wrote:
The issue is not append vs overwrite - perhaps those responders do not know
anti join semantics. Further, overwrite on S3 is a bad pattern due to S3's
eventual consistency issues.
First, your SQL query is wrong as you don't close the parenthesis of the
CTE (the "with" part). In fact, it looks like y
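Ben's original query is not shown in this excerpt, but the syntax point is
just that the CTE's parenthesis must be closed before the main SELECT. A
generic illustration with made-up table and column names:

    spark.sql("""
        WITH new_rows AS (
            SELECT *
            FROM incoming
        )                      -- the CTE body has to be closed here
        SELECT n.*
        FROM new_rows n
        LEFT ANTI JOIN existing e ON n.id = e.id
    """)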
As Jay correctly suggested, if you're joining then overwrite (since that
removes the dups); otherwise only append.
I think, in this scenario, just change it to write.mode('overwrite'),
because you're already reading the old data, and your job would be done.
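Concretely, the suggested change is just the save mode on the final write;
here combined stands for the already-deduplicated result of reading the old
data and adding the new rows, and the path is illustrative:

    combined.write.mode("overwrite").parquet("s3://bucket/table/")  # was .mode("append")

One reason for a staging step like the one Ben describes at the top of the
thread is to avoid overwriting a location that the same job is still reading
from.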
On Sat, 2 Jun 2018, 10:27 PM Benjamin Kim wrote:
Hi Jay,
Thanks for your response. Are you saying to append the new data, then
remove the duplicates from the whole data set afterwards, overwriting the
existing data set with the new, appended data set? I will give that a try.
Cheers,
Ben
On Fri, Jun 1, 2018 at 11:49 PM Jay wrote:
> Ben
Structured streaming can provide idempotent and exactly-once writes to
parquet, but I don't know how it does it under the hood.
Without this, you need to load your entire dataset, then dedup, then write
back the entire dataset. This overhead can be minimized by partitioning the
output files.
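A minimal sketch of the structured streaming variant mentioned here; the
checkpoint location is what lets the parquet file sink avoid duplicating
output across retries. The paths, schema, and key column are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    input_schema = StructType([            # made-up schema for the example
        StructField("id", StringType()),
        StructField("value", StringType()),
    ])

    query = (
        spark.readStream
             .schema(input_schema)                 # file streams need a schema
             .parquet("s3://bucket/incoming/")     # hypothetical source dir
             .dropDuplicates(["id"])               # keeps dedup state across batches
             .writeStream
             .format("parquet")
             .option("path", "s3://bucket/output/")
             .option("checkpointLocation", "s3://bucket/checkpoints/dedup/")
             .outputMode("append")
             .start()
    )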
On Fri, Jun 1,
Benjamin,
The append will append the "new" data to the existing data without removing
the duplicates. You would need to overwrite the file every time if you need
unique values.
Thanks,
Jayadeep
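For reference, the two save modes being contrasted (df and the path are
illustrative):

    df.write.mode("append").parquet("s3://bucket/table/")     # adds files; existing rows and any dups remain
    df.write.mode("overwrite").parquet("s3://bucket/table/")  # replaces the dataset at that path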
On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim wrote:
> I have a situation where I am trying to add only new r