Hello Bart,

Thank you for sharing these links; this was exactly what Tahsin and I were looking for. It looks like there has already been a lot of discussion about this, which is good to see.
In one of these pull requests, there is a comment about the number of real-world use cases for some kind of TimeType in Spark. We could add our need for compatibility with Parquet's TimeType as a use case for a new Spark TimeType. Would it be helpful to collect and document these TimeType use cases to gauge interest? We could add a new story or comment in the Spark JIRA, or a page on the Apache Confluence wiki, if that helps.

Rylan

________________________________
From: Bart Samwel <bart.sam...@databricks.com>
Sent: Wednesday, June 24, 2020 4:08 PM
To: Rylan Dmello <rdme...@mathworks.com>
Cc: dev@spark.apache.org <dev@spark.apache.org>; Tahsin Hassan <thas...@mathworks.com>
Subject: Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

The relevant earlier discussion is here: https://github.com/apache/spark/pull/25678#issuecomment-531585556

(FWIW, a recent PR tried adding this again: https://github.com/apache/spark/pull/28858.)

On Wed, Jun 24, 2020 at 10:01 PM Rylan Dmello <rdme...@mathworks.com> wrote:

Hello,

Tahsin and I are trying to use the Apache Parquet file format with Spark SQL, but we are running into errors when reading Parquet files that contain TimeType columns. We are wondering whether this is unsupported in Spark SQL due to an architectural limitation, or due to a lack of resources.

Context: When reading some Parquet files with Spark, we get an error message like the following:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 186.0 failed 4 times, most recent failure: Lost task 0.3 in stage 186.0 (TID 1970, 10.155.249.249, executor 1): java.io.IOException: Could not read or convert schema for file: dbfs:/test/randomdata/sample001.parquet
...
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIME_MICROS);
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:106)

This only seems to occur with Parquet files that have a column with the "TimeType" (or the deprecated "TIME_MILLIS"/"TIME_MICROS") types in the Parquet schema.

After digging into this a bit, we think that the error message is coming from "ParquetSchemaConverter.scala" here:
https://github.com/apache/spark/blob/11d3a744e20fe403dd76e18d57963b6090a7c581/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L151
https://github.com/apache/spark/blob/11d3a744e20fe403dd76e18d57963b6090a7c581/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L140

This seems to imply that the Spark SQL engine does not support reading Parquet files with TimeType columns. We are wondering if anyone on the mailing list could shed some more light on this: are there architectural/datatype limitations in Spark that result in this error, or is TimeType support for Parquet files something that hasn't been implemented yet due to a lack of resources/interest?

Thanks,
Rylan

--
Bart Samwel
bart.sam...@databricks.com
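P.S. In case it helps anyone reproduce this: below is a minimal sketch of one way to end up with such a file and the resulting error. It uses PyArrow purely for illustration (any writer that emits Parquet's TIME logical type should hit the same check), and the file path and column name are made up for the example.

# repro_time_column.py -- illustrative sketch only
import datetime

import pyarrow as pa
import pyarrow.parquet as pq
from pyspark.sql import SparkSession

# Arrow's time64("us") type is written to Parquet as the TIME logical type with
# microsecond precision, i.e. the INT64 (TIME_MICROS) annotation that Spark rejects.
table = pa.table({"t": pa.array([datetime.time(12, 30, 45)], type=pa.time64("us"))})
pq.write_table(table, "/tmp/time_column.parquet")

spark = SparkSession.builder.getOrCreate()
# Expected to fail during schema conversion, before any rows are read, with
# something like:
#   org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIME_MICROS)
spark.read.parquet("/tmp/time_column.parquet").printSchema()

The failure happens in ParquetToSparkSchemaConverter while converting the file schema, which matches the "Could not read or convert schema for file" wrapper in the stack trace above.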