To avoid confusion, the query I am referring to above is over some numeric
element inside *a: struct (nullable = true).*

On Mon, Jul 24, 2017 at 4:04 PM, Patrick <titlibat...@gmail.com> wrote:

> Hi,
>
> On reading a complex JSON, Spark infers schema as following:
>
> root
>  |-- header: struct (nullable = true)
>  |    |-- deviceId: string (nullable = true)
>  |    |-- sessionId: string (nullable = true)
>  |-- payload: struct (nullable = true)
>  |    |-- deviceObjects: array (nullable = true)
>  |    |    |-- element: struct (containsNull = true)
>  |    |    |    |-- additionalPayload: array (nullable = true)
>  |    |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |    |-- data: struct (nullable = true)
>  |    |    |    |    |    |    |-- *a: struct (nullable = true)*
>  |    |    |    |    |    |    |    |-- address: string (nullable = true)
>
> When we save the above JSON to Parquet using Spark SQL, we get only two
> top-level columns, "header" and "payload", in the Parquet file.
>
> So now we want to compute a mean over a numeric element inside *a: struct
> (nullable = true)*.
>
> With reference to the Databricks blog on handling complex JSON:
> https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
>
> *"when using Parquet, all struct columns will receive the same treatment
> as top-level columns. Therefore, if you have filters on a nested field, you
> will get the same benefits as a top-level column."*
>
> Referring to the above statement, will Parquet treat *a: struct (nullable
> = true)* as a top-level column, so that a SQL query on the Dataset will be
> optimized?
>
> If not, do we need to impose the schema externally by exploding the
> complex type before writing to Parquet in order to get the top-level
> column benefit? What can we do with Spark 2.1 to extract the best
> performance over such a nested structure like *a: struct (nullable =
> true)*?
>
> Thanks
>
>
