Hello Spark Developers,

I have three tables that I am reading from HBase and want to join and save
to an external Hive Parquet table. Currently the join is failing with a
container failed error.

1. Read table A from HBase with ~17 billion records.
2. Repartition on the primary key of table A.
3. Create a temp view from the table A DataFrame.
4. Read table B from HBase with ~4 billion records.
5. Repartition on the primary key of table B.
6. Create a temp view from the table B DataFrame.
7. Join the views of A and B to create DataFrame C.
8. Join DataFrame C with table D (the third HBase table).
9. coalesce(20) to reduce the number of files created from the already
repartitioned DataFrame.
10. Finally, store to the external Hive table, partitioned by skey (rough
sketch of the pipeline below).
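
For reference, here is a minimal Scala sketch of the pipeline above. The
HBase read is a placeholder (the format string, column mapping, and options
depend on the connector in use), and the table names, join key "pk", column
"extra_col", and paths are assumptions, not the actual schema:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("HBaseJoinToHive")
  .enableHiveSupport()
  .getOrCreate()

// Placeholder HBase read: swap in whichever connector you actually use;
// the format string and option names vary by connector and version.
def readHBaseTable(table: String): DataFrame =
  spark.read
    .format("org.apache.hadoop.hbase.spark") // assumption: hbase-spark connector
    .option("hbase.table", table)
    .option("hbase.columns.mapping",
      "pk STRING :key, extra_col STRING cf:c1") // hypothetical mapping
    .load()

// Steps 1-6: read A and B, repartition on the (assumed) primary key "pk",
// and register temp views.
val dfA = readHBaseTable("tableA").repartition(col("pk"))
val dfB = readHBaseTable("tableB").repartition(col("pk"))
dfA.createOrReplaceTempView("viewA")
dfB.createOrReplaceTempView("viewB")

// Step 7: join the two views into DataFrame C.
val dfC = spark.sql(
  "SELECT a.*, b.extra_col FROM viewA a JOIN viewB b ON a.pk = b.pk")

// Step 8: join C with the third table, D.
val dfD = readHBaseTable("tableD")
val dfJoined = dfC.join(dfD, Seq("pk"))

// Steps 9-10: coalesce to 20 partitions to limit output file count, then
// write to a skey-partitioned Parquet table; supplying a path makes the
// table external.
dfJoined
  .coalesce(20)
  .write
  .mode("overwrite")
  .format("parquet")
  .partitionBy("skey")
  .option("path", "hdfs:///warehouse/external/table_c") // assumed location
  .saveAsTable("mydb.table_c")

One thing to watch: coalesce is a narrow transformation, so placing it right
after the join can also shrink the parallelism of the join stage itself;
repartition(20) forces a separate shuffle and keeps the join at full
parallelism if that turns out to be the problem.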

If you have any suggestions or resources for optimizing this, please do
share them.

Thanks
Chetan
