Since parquet doesn't support updates, you have to backfill your dataset. If
that is a regular scenario for you, you should partition your parquet files
so that backfilling becomes easier.

As the data is structured now, you have to rewrite everything just to upsert
quite a small amount of changed data. Look at your data and your use case,
and use partitioning (and bucketing, if you want to eliminate or reduce
shuffle joins) to store your data in a more optimal way.
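
For example (just a sketch; the table name, columns and bucket count are
made up), a partitioned and bucketed parquet table could be declared in
Spark SQL like this:

  create table table_b (
    key bigint,
    cols string,
    year int,
    month int
  )
  using parquet
  partitioned by (year, month)
  clustered by (key) into 200 buckets;

If Table A is bucketed the same way on the same key, joins between the two
can avoid shuffling Table B.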

Let's say your large table is a timeline of events stretching three years
back, but your updated data is only from the last week or month. If you
partitioned by year/month/week/day you could backfill just the partitions
that were updated. Adapt the pattern to your particular scenario and data
size.
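
A sketch of such a backfill (assuming Spark 2.3+ dynamic partition
overwrite; the table names, the example month and the cols placeholder are
made up, following the temp-table pattern from your queries below):

  set spark.sql.sources.partitionOverwriteMode=dynamic;

  -- rebuild only the touched partition (here 2019/05) in a temp table
  create table temp_upsert using parquet as
  select b.cols, b.year, b.month
  from table_b b left anti join table_a a on b.key = a.key
  where b.year = 2019 and b.month = 5
  union all
  select a.cols, a.year, a.month
  from table_a a
  where a.year = 2019 and a.month = 5;

  -- overwrite just that partition of the big table
  insert overwrite table table_b partition (year, month)
  select cols, year, month from temp_upsert;

Only the 2019/05 partition gets rewritten; the other partitions of the 3 TB
table are left untouched.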

If updates are random and there is no sure way to decide which partition an
update will land in, you could just break down your dataset by key % N,
where N is roughly assumed_total_size_of_dataset / suitable_partition_size.
There are a lot of partitioning schemes, but the point is that you have to
limit the amount of data you read from disk, filter and write back in order
to get better performance.
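
A rough sketch of that idea (the bucket count of 1000 and the table names
are assumptions; pmod and hash are Spark SQL built-ins):

  -- store Table B with a synthetic partition column derived from the key
  create table table_b_bucketed
  using parquet
  partitioned by (key_bucket)
  as select b.*, pmod(hash(b.key), 1000) as key_bucket
  from table_b b;

  -- find which synthetic partitions the incoming data actually touches
  select distinct pmod(hash(key), 1000) as key_bucket from table_a;

Then backfill only those key_bucket partitions, using the same temp-table
plus insert-overwrite pattern as in the time-based sketch above.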

regards,

Magnus

On Wed, May 29, 2019 at 7:20 PM Tomasz Krol <patric...@gmail.com> wrote:

> Hey Guys,
>
> I am wondering what would be your approach to following scenario:
>
> I have two tables - one (Table A) is relatively small (e.g. 50 GB) and
> the second one (Table B) is much bigger (e.g. 3 TB). Both are parquet tables.
>
> I want to ADD all records from Table A to Table B which don't exist in
> Table B yet. I use only one field (e.g. key) to check existence for a
> specific record.
>
> Then I want to UPDATE (with values from Table A) all records in Table B
> which also exist in Table A. To determine if a specific record exists I
> also use the same "key" field.
>
> To achieve the above I run the following SQL queries:
>
> 1. Find existing records and insert into temp table
>
> insert into temp_table select a.cols from Table A a left semi join Table B
> b on a.key = b.key
>
> 2. Find new records and insert them into temp table
>
> insert into temp_table select a.cols from Table A a left anti join Table B
> b on a.key = b.key
>
> 3. Find existing records in Table B which don't exist in Table A
>
> insert into temp_table select b.cols from Table B b left anti join Table A
> a on a.key = b.key
>
> In that way I build Table B updated with records from Table A.
> However, the problem here is step 3, because I am inserting almost 3
> TB of data, which obviously takes some time.
> I have tried different approaches, but no luck.
>
> I am wondering what your ideas are: how can we perform this scenario
> efficiently in Spark?
>
> Cheers
>
> Tom
> --
> Tomasz Krol
> patric...@gmail.com
>
