Hi,
I am running into performance issue when joining data frames created from avro
files using spark-avro library.
The data frames are created from 120K avro files and the total size is around
1.5 TB.
The two data frames are very huge with billions of records.
The join for these two DataFrames runs forever.
This process runs on a yarn cluster with 300 executors with 4 executor cores
and 8GB memory.
Any insights on this join will help. I have posted the explain plan below.
I notice a CartesianProduct in the Physical Plan. I am wondering if this is
causing the performance issue.
Below is the logical plan and the physical plan. ( Due to the confidential
nature, I am unable to post any of the column names or the file names here )
== Optimized Logical Plan ==
Limit 21
Join Inner, [ Join Conditions ]
Join Inner, [ Join Conditions ]
Project [ List of columns ]
Relation [ List of columns ] AvroRelation[ fileName1 ] -- Another large file
InMemoryRelation [List of columsn ], true, 10000, StorageLevel(true, true,
false, true, 1), (Repartition 1, false), None
Project [ List of Columns ]
Relation[ List of Columns] AvroRelation[ filename2 ] -- This is a very large
file
== Physical Plan ==
Limit 21
Filter (filter conditions)
CartesianProduct
Filter (more filter conditions)
CartesianProduct
Project (selecting a few columns and applying a UDF to one column)
Scan AvroRelation[avro file][ columns in Avro File ]
InMemoryColumnarTableScan [List of columns ], true, 10000,
StorageLevel(true, true, false, true, 1), (Repartition 1, false), None)
Project [ List of Columns ]
Scan AvroRelation[Avro File][List of Columns]
Code Generation: true
Thanks,
Prasad.