Hi all,

I understand the value of Arrow in our projects for interoperability as
well as faster data access. I have a couple of questions about how we can
use it for the following use cases, and whether this is a good way to use
it:

1. Will Spark execution be faster when I run joins on DataFrames with Arrow
compared to the normal Parquet format? Is the shuffle cost lower because of
reduced serialization and deserialization overhead? A minimal sketch of how
I am enabling Arrow today follows below.
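
For context, this is roughly how I am enabling Arrow on our side (a rough
sketch only; I am assuming PySpark 3.x, where the config key is
spark.sql.execution.arrow.pyspark.enabled, and the table paths and join key
are just placeholders):

    from pyspark.sql import SparkSession

    # Enable Arrow for JVM <-> Python transfers (toPandas, pandas UDFs).
    # In Spark 2.3/2.4 the equivalent key is spark.sql.execution.arrow.enabled.
    spark = (
        SparkSession.builder
        .appName("arrow-join-test")
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .getOrCreate()
    )

    # Placeholder paths for the two tables being joined.
    left = spark.read.parquet("/data/left")
    right = spark.read.parquet("/data/right")

    joined = left.join(right, on="id", how="inner")

    # The Arrow setting above speeds up this conversion step; my question is
    # whether it also reduces the shuffle cost of the join itself.
    pdf = joined.toPandas()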


2. If I have a use case of running aggregate queries on a very large table
(say 10 TB) containing a few dimensions and very few metrics, is it a good
idea to use Arrow as an intermediate caching layer for interactive,
low-latency queries? A sketch of what I have in mind follows below.
Note: Dremio provides this by default. Should I explore Dremio, or Impala
or Drill, for this use case?
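
To make the second question concrete, here is a rough sketch of the kind of
caching layer I have in mind, using pyarrow's dataset API (the path, column
names, and aggregation are placeholders, and this assumes the projected
columns fit in memory):

    import pyarrow.dataset as ds

    # Scan the big Parquet table once, keeping only the columns we query
    # interactively, and materialise them as an in-memory Arrow table.
    dataset = ds.dataset("/data/huge_table", format="parquet")
    cached = dataset.to_table(columns=["dim1", "dim2", "metric1"])

    # Low-latency aggregate queries would then run against the cached table,
    # e.g. a group-by sum over one dimension.
    result = cached.group_by("dim1").aggregate([("metric1", "sum")])
    print(result.to_pandas())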


Thanks,
Subash
