This is not strictly a Spark question, but I'll give it a shot: I have an existing set of Parquet files that are queried from both Impala and Spark.
I intend to add some 30 relatively 'heavy' columns to the Parquet files. Each column would store an array of structs, each struct can have from 5 to 20 fields, and an array may hold a couple of thousand structs. In theory, since Parquet is a columnar format, extending it with new columns should not affect the performance of *existing* queries (since they do not touch these columns).

- Is this premise correct?
- What should I watch out for when making this move?
- In general, what are the considerations when deciding on the "width" (i.e. the number of columns) of a Parquet file?
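For concreteness, here is a minimal sketch of the kind of schema I have in mind and of how I would check that an existing query still prunes the new columns. PySpark is assumed, and the column/field names (`events`, `ts`, `kind`, `value`) and the output path are purely hypothetical:

```python
# Minimal sketch: a "wide" Parquet schema with an array-of-structs column,
# plus a check that a query on the old columns does not read the new one.
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, ArrayType, StringType, LongType, DoubleType
)

spark = SparkSession.builder.appName("wide-parquet-sketch").getOrCreate()

# Existing "narrow" columns plus one of the proposed heavy columns:
# an array of structs, each struct carrying several fields.
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("events", ArrayType(StructType([   # hypothetical new column
        StructField("ts", LongType()),
        StructField("kind", StringType()),
        StructField("value", DoubleType()),
    ]))),
])

sample = [(1, "a", [(1234567890, "click", 0.5)])]
df = spark.createDataFrame(sample, schema)
df.write.mode("overwrite").parquet("/tmp/wide_parquet_sketch")

# A query touching only the pre-existing columns: the FileScan node in the
# physical plan should show a ReadSchema without the heavy 'events' column,
# i.e. column pruning keeps the new data out of the scan.
spark.read.parquet("/tmp/wide_parquet_sketch").select("id", "name").explain()
```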