1. I'd also consider how you're structuring the data before applying the
join. Doing the join naively could be expensive, so a bit of data
preparation may be necessary to improve join performance. Try to get a
baseline measurement first as well. Arrow would mainly help with the
serialization/deserialization side of things, not the shuffle itself
(see the first sketch after this list).
2. Try storing it back as Parquet; Arrow is an in-memory format rather
than an on-disk storage format, so Parquet is the natural choice for
persisting the final result (a second sketch follows below).
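
For point 1, here is a minimal PySpark sketch of what I mean by data
preparation plus a baseline. The table names, paths, join key, and
partition count are all placeholders I made up for illustration, and the
Arrow config only speeds up JVM <-> Python transfers (toPandas, pandas
UDFs); it does not change the cost of the shuffle itself:

from pyspark.sql import SparkSession

# Hypothetical setup; names, paths, and numbers are placeholders.
spark = (
    SparkSession.builder
    .appName("large-join-prep")
    # Arrow accelerates JVM <-> Python data transfer, not the shuffle.
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

orders = spark.read.parquet("/data/orders")   # hypothetical path
events = spark.read.parquet("/data/events")   # hypothetical path

# Baseline: run and time the naive join first so any tuning
# can be measured against it.
baseline = orders.join(events, on="customer_id")

# Preparation: pre-partition both sides on the join key so the
# join works with co-partitioned data instead of re-hashing
# everything at join time.
orders_prepped = orders.repartition(2048, "customer_id")
events_prepped = events.repartition(2048, "customer_id")

joined = orders_prepped.join(events_prepped, on="customer_id")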
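
For point 2, a sketch of writing the joined result back out as Parquet,
continuing from the `joined` DataFrame above. The output path and the
partition column are hypothetical; since Arrow is an in-memory format,
Parquet is what Spark writes to storage natively:

# Continuing from `joined` above; path and partition column are placeholders.
(
    joined
    .write
    .mode("overwrite")
    .partitionBy("event_date")   # hypothetical column used for partitioning
    .parquet("/data/joined_output")
)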
Hi Team,
I have two questions regarding Arrow and Spark integration:
1. I am joining two huge tables (1 PB each). Will there be a large
performance gain if I use the Arrow format before shuffling? Will the
serialization/deserialization cost improve significantly?
2. Can we store the final data in Arrow format?