This is not strictly a Spark question, but I'll give it a shot:

I have an existing setup of Parquet files that are queried from both Impala
and Spark.

I intend to add some 30 relatively 'heavy' columns to the Parquet files. Each
column would store an array of structs. Each struct can have from 5 to 20
fields, and an array may hold a couple of thousand structs.
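For concreteness, here is a rough Spark sketch of what one of the new columns
would look like (the field and column names below are made up; the real
structs have 5 to 20 fields each):

    import org.apache.spark.sql.types._

    // Hypothetical struct; the real ones have 5 to 20 fields each.
    val itemStruct = StructType(Seq(
      StructField("ts", LongType),
      StructField("kind", StringType),
      StructField("value", DoubleType)
    ))

    // One of the ~30 new columns: an array that may hold a few thousand such structs.
    val newColumn = StructField("items", ArrayType(itemStruct))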

Theoretically, since Parquet is a columnar storage format, extending it with
columns should not affect the performance of *existing* queries (since they
do not touch these columns).
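To illustrate the premise, an existing query would look roughly like this
(the path and column names are hypothetical); my assumption is that the
FileScan's ReadSchema would list only the selected columns, so the new
array-of-struct column chunks are never read:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("prune-check").getOrCreate()

    // Path and column names are hypothetical.
    val df = spark.read.parquet("/warehouse/my_table")

    // Only existing columns are selected, so the reader should be able to
    // skip the new array<struct> column chunks entirely.
    val q = df.select("existing_a", "existing_b").where(df("existing_a") > 0)

    // The FileScan node's ReadSchema should show only the selected columns.
    q.explain()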

   - Is this premise correct?
   - What should I watch out for doing this move?
   - In general, what are the considerations when deciding on the "width"
   (i.e. number of columns) of a Parquet file?
