[ https://issues.apache.org/jira/browse/SPARK-35386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345809#comment-17345809 ]

Hyukjin Kwon commented on SPARK-35386:
--------------------------------------

I think the logic is that, when users specify a schema, they know and are sure 
that the data has that specific schema, and so Spark should be able to read it 
as specified.

To do the assertion, you can manually check with a one-liner: 
{{assert(spark.read.parquet(...).schema == userSpecifiedSchema)}}
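
That equality check is strict: types and nullability must match exactly. If the 
goal is only to fail on missing columns, a looser check could look like the 
sketch below (a minimal sketch reusing the path and schema from the report 
below; {{spark}} is assumed to be an active session):
{code:python}
from pyspark.sql.types import DoubleType, StructField, StructType

# The schema passed to .schema(...); here, the one from the report below.
userSpecifiedSchema = StructType([StructField("col3", DoubleType(), False)])

# Read without a schema so Spark infers the file's actual schema, then
# assert that every requested column really exists in the file.
actual = spark.read.parquet("/tmp/data.snappy.parquet").schema
missing = set(userSpecifiedSchema.fieldNames()) - set(actual.fieldNames())
assert not missing, f"columns missing from file: {missing}"
{code}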

> parquet read with schema should fail on non-existing columns
> ------------------------------------------------------------
>
>                 Key: SPARK-35386
>                 URL: https://issues.apache.org/jira/browse/SPARK-35386
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, PySpark
>    Affects Versions: 3.0.1
>            Reporter: Rafal Wojdyla
>            Priority: Major
>
> When a read schema is specified, as a user I would prefer that Spark fail on 
> missing columns.
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import DoubleType, StructField, StructType
> spark: SparkSession = ...
> spark.read.parquet("/tmp/data.snappy.parquet")
> # inferred schema, includes 3 columns: col1, col2, new_col
> # DataFrame[col1: bigint, col2: bigint, new_col: bigint]
> # let's specify a custom read_schema, with a **non-nullable** col3 (which is
> # not present in the file):
> read_schema = StructType(fields=[StructField("col3", DoubleType(), False)])
> df = spark.read.schema(read_schema).parquet("/tmp/data.snappy.parquet")
> df.schema
> # we get a DataFrame with **nullable** col3:
> # StructType(List(StructField(col3,DoubleType,true)))
> df.count()
> # 0
> {code}
> Is this a feature or a bug? In this case there's just a single parquet file; 
> I have also tried {{option("mergeSchema", "true")}}, which doesn't help.
> A similar read pattern would fail in pandas (and likely dask), as illustrated 
> below.
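> For reference, a minimal illustration of that pandas behavior (same file 
> path; the exact exception type depends on the engine and version):
> {code:python}
> import pandas as pd
> # Requesting a column that is not in the file raises instead of silently
> # returning an empty, all-null column:
> pd.read_parquet("/tmp/data.snappy.parquet", columns=["col3"])
> {code}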


