Hey guys,

I am wondering what your approach would be to the following scenario.

I have two Parquet tables: one (Table A) is relatively small (e.g. 50 GB) and the second one (Table B) is much bigger (e.g. 3 TB). I want to ADD to Table B all records from Table A that don't exist in Table B yet, using a single field (e.g. "key") to check whether a specific record exists. Then I want to UPDATE (with the values from Table A) all records in Table B that also exist in Table A, matched on the same "key" field.

To achieve the above I run the following SQL queries:

1. Find the existing records (the ones to update) and insert them into a temp table:

   insert into temp_table
   select a.cols from TableA a
   left semi join TableB b on a.key = b.key

2. Find the new records and insert them into the temp table:

   insert into temp_table
   select a.cols from TableA a
   left anti join TableB b on a.key = b.key

3. Find the records in Table B that don't exist in Table A:

   insert into temp_table
   select b.cols from TableB b
   left anti join TableA a on a.key = b.key

In that way I build Table B updated with the records from Table A. The problem is step 3: there I am inserting almost 3 TB of data, which obviously takes some time. I have tried different approaches, but with no luck. I am wondering what your ideas are for performing this scenario efficiently in Spark.

Cheers,

Tom

--
Tomasz Krol
patric...@gmail.com
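P.S. To make the intended semantics of the three steps concrete, here is a tiny pure-Python sketch (not a Spark solution; the dicts and the `merge_upsert` name are just illustrative, with the dict key playing the role of the "key" column):

```python
# Illustrates the three-step upsert: each table is modelled as a dict
# mapping key -> "rest of the row". The result is Table B with Table A
# records upserted into it.

def merge_upsert(table_a, table_b):
    temp_table = {}

    # Step 1: records of A that already exist in B (left semi join A vs B)
    temp_table.update({k: v for k, v in table_a.items() if k in table_b})

    # Step 2: records of A that are new to B (left anti join A vs B)
    temp_table.update({k: v for k, v in table_a.items() if k not in table_b})

    # Step 3: records of B with no match in A (left anti join B vs A)
    temp_table.update({k: v for k, v in table_b.items() if k not in table_a})

    return temp_table

table_a = {1: "a1-new", 4: "a4"}
table_b = {1: "b1-old", 2: "b2", 3: "b3"}
print(merge_upsert(table_a, table_b))
```

Note that steps 1 and 2 together are simply "all of Table A", so the whole operation is equivalent to overlaying A on top of B; only step 3 touches the bulk of the 3 TB, which is why it dominates the runtime.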
I have two tables - one (Table A) is relatively small (e.g 50GB) and second one (Table B) much bigger (e.g. 3TB). Both are parquet tables. I want to ADD all records from Table A to Table B which dont exist in Table B yet. I use only one field (e.g. key) to check existence for specific record. Then I want to UPDATE (by values from Table A) all records in Table B which also exist in Table A. To determine if specific record exist I use also the same "key" field. To achieve above I run following sql queries: 1. Find existing records and insert into temp table insert into temp_table select a.cols from Table A a left semi join Table B b on a.key = b.key 2. Find new records and insert them into temp table insert into temp_table select a.cols from Table A a left anti join Table B b on a.key = b.key 3. Find existing records in Table B which dont exist in Table A insert into temp_table select b.cols from Table B b left anti join Table A a a.key = b. key In that way I built Table B updated with records from Table A. However, the problem here is the step 3, because I am inserting almost 3 TB of data that takes obviously some time. I was trying different approaches but no luck. I am wondering whats your ideas how can we perform this scenario efficiently in Spark? Cheers Tom -- Tomasz Krol patric...@gmail.com