Hi Magnus,
Yes, I was thinking also about partitioning approach. And I think this is
the best solution in this type of scenario.
Also my scenario is relevant to your last paragraph, the dates which are
coming are very random. I can get updated from 2012 and from 2019.
Therefore, this strategy mi
Since parquet don't support updates you have to backfill your dataset. If
that is your regular scenario you should partition your parquet files so
backfilling becomes easier.
As the data is structured now you have to update everything just to upsert
quite a small amount of changed data. Look at yo
Unfortunately, dont have timestamps in those tables:( Only key on which I
can check existence of specific record.
But even with the timestamp how would you make the update.? When I say
update I mean to overwrite existing record.
For example you have following in table A
key| field1 | field2
1
Don't you have a date/timestamp to handle updates? So, you're talking about
CDC? If you've Datestamp you can check if that/those key(s) exists, if
exists then check if timestamp matches, if that matches, then ignore, if
that doesn't then update.
On Thu 30 May, 2019, 7:11 AM Genieliu, wrote:
> Is
Isn't step1 and step2 producing the copy of Table A?
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Why don't you simply copy whole of delta data (Table A) into a stage table
(temp table in your case) and insert depending on a *WHERE NOT EXISTS* check
on primary key/composite key which already exists in the table B?
That's faster and does the reconciliation job smoothly enough.
Others, any bett
Hey Aakash,
That will work for records which dont exist yet in the target table. What
about records which have to be updated ?
As I mentioned, I want to do an upsert. That means, I want to add not
existing records and update those which already exist.
Thanks
Tom
On Wed 29 May 2019 at 18:39, Aa
Hey Guys,
I am wondering what would be your approach to following scenario:
I have two tables - one (Table A) is relatively small (e.g 50GB) and second
one (Table B) much bigger (e.g. 3TB). Both are parquet tables.
I want to ADD all records from Table A to Table B which dont exist in
Table B ye