This is not strictly a Spark question, but I'll give it a shot:

I have an existing setup of Parquet files that are queried from both Impala
and Spark.

I intend to add some 30 relatively 'heavy' columns to the Parquet files. Each
column would store an array of structs. Each struct can have from 5 to 20
fields, and an array may hold a couple of thousand structs.
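For concreteness, here is a rough Spark sketch of what one of the new columns
would look like (the field and column names below are made up; the real
structs have 5 to 20 fields each):

    import org.apache.spark.sql.types._

    // Hypothetical struct; the real ones have 5 to 20 fields each.
    val itemStruct = StructType(Seq(
      StructField("ts", LongType),
      StructField("kind", StringType),
      StructField("value", DoubleType)
    ))

    // One of the ~30 new columns: an array that may hold a few thousand such structs.
    val newColumn = StructField("items", ArrayType(itemStruct))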

Theoretically, since Parquet is a columnar storage format, extending it with
columns should not affect the performance of *existing* queries (since they
do not touch these columns).
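To illustrate the premise, an existing query would look roughly like this
(the path and column names are hypothetical); my assumption is that the
FileScan's ReadSchema would list only the selected columns, so the new
array-of-struct column chunks are never read:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("prune-check").getOrCreate()

    // Path and column names are hypothetical.
    val df = spark.read.parquet("/warehouse/my_table")

    // Only existing columns are selected, so the reader should be able to
    // skip the new array<struct> column chunks entirely.
    val q = df.select("existing_a", "existing_b").where(df("existing_a") > 0)

    // The FileScan node's ReadSchema should show only the selected columns.
    q.explain()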

   - Is this premise correct?
   - What should I watch out for doing this move?
   - In general, what are the considerations when deciding on the "width"
   (i.e. number of columns) of a Parquet file?
