Fwd: Dataset schema incompatibility bug when reading column partitioned data

Dávid Szakállas Thu, 11 Apr 2019 09:08:28 -0700

+dev for more visibility. Is this a known issue? Is there a plan for a fix?


Thanks,
David

> Begin forwarded message:
> 
> From: Dávid Szakállas <david.szakal...@gmail.com>
> Subject: Dataset schema incompatibility bug when reading column partitioned 
> data
> Date: 2019. March 29. 14:15:27 CET
> To: u...@spark.apache.org
> 
> We observed the following bug on Spark 2.4.0:
> 
> scala> import org.apache.spark.sql.types._
> 
> scala> spark.createDataset(Seq((1,2))).write.partitionBy("_1").parquet("foo.parquet")
> 
> scala> val schema = StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType)))
> 
> scala> spark.read.schema(schema).parquet("foo.parquet").as[(Int, Int)].show
> +---+---+
> | _2| _1|
> +---+---+
> |  2|  1|
> +---+---+
> 
> That is, when reading column-partitioned Parquet files, the explicitly 
> specified schema is not adhered to; instead, the partitioning columns are 
> appended to the end of the column list. This is quite a severe issue, as some 
> operations, such as union, fail if the columns are in a different order in 
> the two datasets. Thus we have to work around the issue with a select:
> 
> // ds is the Dataset read above; reselect its columns in schema order
> val columnNames = schema.fields.map(_.name)
> ds.select(columnNames.head, columnNames.tail: _*)
> 
> 
> Thanks, 
> David Szakallas
> Data Engineer | Whitepages, Inc.
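
A minimal self-contained sketch of the workaround quoted above, for anyone who wants to reproduce it outside the shell. The object name, the alignToSchema helper, the local master setting and the overwrite mode are assumptions added for illustration and are not part of the original report. It maps the schema's field names to columns instead of using head/tail, which is only a stylistic variation on the quoted select.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object ReorderPartitionedColumns {

  // Hypothetical helper: reselect the columns in the order the schema declares them.
  def alignToSchema(df: DataFrame, schema: StructType): DataFrame =
    df.select(schema.fieldNames.map(col): _*)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to reproduce the behaviour.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("reorder-partitioned-columns")
      .getOrCreate()
    import spark.implicits._

    val schema = StructType(Seq(
      StructField("_1", IntegerType),
      StructField("_2", IntegerType)))
    val path = "foo.parquet"

    // Write a dataset partitioned by its first column, as in the report.
    Seq((1, 2)).toDS().write.mode("overwrite").partitionBy("_1").parquet(path)

    // Read it back with the explicit schema, then restore the schema's
    // column order before converting to a typed Dataset.
    val raw = spark.read.schema(schema).parquet(path)
    val ds = alignToSchema(raw, schema).as[(Int, Int)]

    ds.show()
    spark.stop()
  }
}
```

Applying the same helper to both sides before a union keeps the column order identical regardless of where the reader places the partition columns.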
