If you can reproduce the issue with Spark 2.0.2, I'd suggest opening a JIRA.
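Before filing, it may also help to narrow down whether the problem is in the Hive metastore integration or in the Parquet reader itself, by reading the files directly and comparing against the metastore path. A rough sketch (the path and table name below are placeholders based on your example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("debug-null-arrays")
  .enableHiveSupport()
  .getOrCreate()

// 1. Read the Parquet files directly, bypassing the Hive metastore.
val direct = spark.read.parquet("/path/to/tablename")
direct.printSchema()
direct.select("packageIds").show(1, truncate = false)

// 2. Read the same data through the metastore for comparison.
val viaMetastore = spark.sql("select packageIds from tablename limit 1")
viaMetastore.printSchema()
viaMetastore.show(1, truncate = false)

If the direct read returns the arrays but the metastore read returns nulls, the problem is likely in the metastore schema reconciliation rather than the Parquet reader. Toggling spark.sql.hive.convertMetastoreParquet may also help tell apart Spark's built-in Parquet reader from the Hive SerDe path.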
On Fri, Nov 4, 2016 at 5:11 PM, Sam Goodwin <sam.goodwi...@gmail.com> wrote:

> I have a table with a few columns, some of which are arrays. Since
> upgrading from Spark 1.6 to Spark 2.0.1, the array fields are always null
> when reading in a DataFrame.
>
> When writing the Parquet files, the schema of the column is specified as
>
> StructField("packageIds", ArrayType(StringType))
>
> The schema of the column in the Hive Metastore is
>
> packageIds array<string>
>
> The schema used in the writer exactly matches the schema in the Metastore
> in every way (order, casing, types, etc.).
>
> The query is a simple "select *":
>
> spark.sql("select * from tablename limit 1").collect() // null columns in Row
>
> How can I begin debugging this issue? Notable things I've already
> investigated:
>
> - The files were written using Spark 1.6.
> - The DataFrame works in Spark 1.5 and 1.6.
> - I've inspected the Parquet files using parquet-tools and can see the
>   data.
> - I also have another table written in exactly the same way, and it
>   doesn't have the issue.