To avoid confusion: the query I am referring to above is over some numeric element inside *a: struct (nullable = true)*.
On Mon, Jul 24, 2017 at 4:04 PM, Patrick <titlibat...@gmail.com> wrote:
> Hi,
>
> On reading a complex JSON document, Spark infers the schema as follows:
>
> root
>  |-- header: struct (nullable = true)
>  |    |-- deviceId: string (nullable = true)
>  |    |-- sessionId: string (nullable = true)
>  |-- payload: struct (nullable = true)
>  |    |-- deviceObjects: array (nullable = true)
>  |    |    |-- element: struct (containsNull = true)
>  |    |    |    |-- additionalPayload: array (nullable = true)
>  |    |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |    |-- data: struct (nullable = true)
>  |    |    |    |    |    |    |-- a: struct (nullable = true)
>  |    |    |    |    |    |    |    |-- address: string (nullable = true)
>
> When we save the above JSON to Parquet using Spark SQL, we get only two
> top-level columns, "header" and "payload", in the Parquet file.
>
> So now we want to do a mean calculation on element *a: struct (nullable
> = true)*.
>
> With reference to the Databricks blog on handling complex JSON:
> https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
>
> *"when using Parquet, all struct columns will receive the same treatment
> as top-level columns. Therefore, if you have filters on a nested field, you
> will get the same benefits as a top-level column."*
>
> Referring to the above statement, will Parquet treat *a: struct (nullable
> = true)* as a top-level struct column, so that a SQL query on the Dataset
> will be optimized?
>
> If not, do we need to impose the schema externally by exploding the
> complex type before writing to Parquet in order to get the top-level-column
> benefit? What can we do with Spark 2.1 to extract the best performance
> over such a nested structure as *a: struct (nullable = true)*?
>
> Thanks