Re: Apache Arrow support for Apache Spark

2020-02-17 Thread Chris Teoh
1. I'd also consider how you're structuring the data before applying the join. Naively doing the join could be expensive, so a bit of data preparation may be necessary to improve join performance. Try to get a baseline as well. Arrow would help improve it. 2. Try storing it back as Parquet bu
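
To illustrate the kind of data preparation point 1 may be referring to, here is a minimal sketch in plain Python (not Spark code) of co-partitioning both inputs on the join key, so that each pair of partitions can be joined independently. The function names `partition_by_key`, `join_partition`, and `copartitioned_join` are hypothetical and used only for illustration; this is one possible reading of "structuring the data", not necessarily what was meant in the message.

```python
from collections import defaultdict


def partition_by_key(rows, num_partitions):
    """Assign each (key, value) row to a bucket based on the hash of its join key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in rows:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets


def join_partition(left_rows, right_rows):
    """Inner hash join of two partitions that share the same key range."""
    index = defaultdict(list)
    for key, value in right_rows:
        index[key].append(value)
    # Emit one output row per matching (left, right) pair.
    return [(key, lv, rv) for key, lv in left_rows for rv in index.get(key, [])]


def copartitioned_join(left, right, num_partitions=4):
    """Partition both sides the same way, then join bucket by bucket."""
    left_parts = partition_by_key(left, num_partitions)
    right_parts = partition_by_key(right, num_partitions)
    result = []
    for lp, rp in zip(left_parts, right_parts):
        result.extend(join_partition(lp, rp))
    return result


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (3, "c")]
    right = [(1, "x"), (3, "y"), (3, "z"), (4, "w")]
    print(sorted(copartitioned_join(left, right)))
```

Because rows with the same key always land in the same bucket on both sides, no bucket needs to see data from any other bucket. In a distributed engine, the analogous preparation is what lets a join avoid reshuffling data that is already laid out by key.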

Apache Arrow support for Apache Spark

2020-02-16 Thread Subash Prabakar
Hi Team, I have two questions regarding Arrow and Spark integration: 1. I am joining two huge tables (1 PB each). Will there be a large performance gain if I use the Arrow format before shuffling? Will the serialization/deserialization cost improve significantly? 2. Can we store the final data in Ar