Hey Guys,

I am wondering what your approach would be to the following scenario:

I have two tables - one (Table A) is relatively small (e.g. 50 GB) and the
second one (Table B) much bigger (e.g. 3 TB). Both are Parquet tables.

I want to ADD all records from Table A to Table B which don't exist in
Table B yet. I use only one field (e.g. "key") to check existence of a
specific record.

Then I want to UPDATE (with values from Table A) all records in Table B
which also exist in Table A. To determine whether a specific record exists
I use the same "key" field.
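To make the intended end state concrete, here is a toy sketch in plain Python (not Spark - just made-up key/value rows) of the upsert I am after:

```python
# Toy illustration of the desired upsert: Table B keeps its rows,
# rows from Table A are added when the key is new and overwrite
# the matching B row when the key already exists.
table_a = {1: "a1", 2: "a2"}   # key -> row (hypothetical data)
table_b = {2: "b2", 3: "b3"}

result = dict(table_b)   # start from Table B
result.update(table_a)   # Table A inserts new keys, updates existing ones

print(result)            # {2: 'a2', 3: 'b3', 1: 'a1'}
```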

To achieve the above I run the following SQL queries:

1. Find existing records and insert into temp table

insert into temp_table select a.cols from table_a a left semi join table_b
b on a.key = b.key

2. Find new records and insert them into temp table

insert into temp_table select a.cols from table_a a left anti join table_b
b on a.key = b.key

3. Find existing records in Table B which don't exist in Table A

insert into temp_table select b.cols from table_b b left anti join table_a
a on a.key = b.key

In that way I build an updated Table B containing the records from Table A.
However, the problem here is step 3, because I am inserting almost 3 TB
of data, which obviously takes some time.
I have been trying different approaches, but no luck so far.
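For reference, here is a small plain-Python simulation of the three steps (toy data, with dict comprehensions standing in for the semi/anti joins); note that steps 1 and 2 together simply amount to selecting all of Table A, so step 3 is the only expensive part:

```python
# Toy simulation of the three insert-into-temp_table steps, where a
# left semi join is a key-intersection and a left anti join a key-difference.
table_a = {1: "a1", 2: "a2"}   # key -> row (hypothetical data)
table_b = {2: "b2", 3: "b3"}

# Step 1: rows of A whose key exists in B (left semi join)
step1 = {k: v for k, v in table_a.items() if k in table_b}
# Step 2: rows of A whose key does not exist in B (left anti join)
step2 = {k: v for k, v in table_a.items() if k not in table_b}
# Step 3: rows of B whose key does not exist in A (left anti join, other way)
step3 = {k: v for k, v in table_b.items() if k not in table_a}

temp_table = {**step1, **step2, **step3}

# Steps 1 + 2 together are exactly all of Table A.
assert {**step1, **step2} == table_a
print(temp_table)
```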

What are your ideas on how to perform this scenario efficiently in Spark?

Cheers

Tom
-- 
Tomasz Krol
patric...@gmail.com
