Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

Bart Samwel Fri, 26 Jun 2020 02:40:40 -0700

I can't comment on that myself; I haven't been part of the community, so I
don't know what is customary for this kind of thing. W.r.t. "compatibility
with Parquet's TimeType", I'd like to argue that that isn't a use case by
itself. The use case is "what people do with it". All in all, TIME is just
clock hands, and the number of uses of that is kind of limited. Typical
things that people try to do involve physical timestamps (a physical point
in time, like Spark's timestamp type), logical timestamps (date + hands of
the clock, not associated with any time zone), or dates (logical, not
associated with any time zone). The one reason I can see to have TIME is to
make the type system orthogonal, i.e., to have a DATE type, a TIME type,
and a DATETIME type that is a DATE plus a TIME. But is it useful by itself?
Not that much. Maybe it's useful if you're building a scheduler, like cron?
If you have more actual use cases for this that aren't easily satisfied in
another way, then by all means share them!
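
To make the distinction between these kinds of values concrete, here is a
minimal sketch using Python's standard datetime module as a stand-in. None of
this is Spark API; the helper name time_from_micros is hypothetical, and the
only outside fact assumed is that Parquet's TIME_MICROS stores an INT64 count
of microseconds since midnight.

```python
from datetime import date, datetime, time, timedelta, timezone

MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000

# Physical timestamp: an absolute point in time, so it carries a UTC offset.
physical = datetime(2020, 6, 26, 2, 40, 40, tzinfo=timezone(timedelta(hours=-7)))

# Logical timestamp: a date plus the hands of the clock, with no time zone.
logical = datetime(2020, 6, 26, 2, 40, 40)

# Date and time-of-day on their own, also without any time zone.
d = date(2020, 6, 26)
t = time(2, 40, 40)

# Orthogonality of the type system: a DATETIME is a DATE plus a TIME.
combined = datetime.combine(d, t)


def time_from_micros(micros: int) -> time:
    """Hypothetical helper: turn microseconds since midnight into a time-of-day."""
    if not 0 <= micros < MICROS_PER_DAY:
        raise ValueError("expected microseconds within a single day")
    # datetime.min is midnight on day one, so adding the offset yields the clock reading.
    return (datetime.min + timedelta(microseconds=micros)).time()
```

In this sketch only the physical value identifies an instant; the logical
timestamp, the date, and the time are all readings that need a time zone (and,
for the time, a date as well) before they can be placed on a timeline.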


On Thu, Jun 25, 2020 at 10:41 PM Rylan Dmello <[email protected]> wrote:

> Hello Bart,
>
> Thank you for sharing these links, this was exactly what Tahsin and I were
> looking for. It looks like there has been a lot of discussion about this
> already, which is good to see.
>
> In one of these pull requests, there is a comment about the number of
> real-world use-cases for some kind of TimeType in Spark. We could add our
> use-case of compatibility with Parquet's TimeType as a use-case for a new
> Spark TimeType.
>
> Would it be helpful to collect/document these TimeType use-cases to gauge
> interest? We could add a new story or comment in the Spark JIRA or a page
> on the Apache Confluence if that helps.
>
> Rylan
> ------------------------------
> *From:* Bart Samwel <[email protected]>
> *Sent:* Wednesday, June 24, 2020 4:08 PM
> *To:* Rylan Dmello <[email protected]>
> *Cc:* [email protected] <[email protected]>; Tahsin Hassan <
> [email protected]>
> *Subject:* Re: [Spark SQL] Question about support for TimeType columns in
> Apache Parquet files
>
> The relevant earlier discussion is here:
> https://github.com/apache/spark/pull/25678#issuecomment-531585556.
>
> (FWIW, a recent PR tried adding this again:
> https://github.com/apache/spark/pull/28858.)
>
> On Wed, Jun 24, 2020 at 10:01 PM Rylan Dmello <[email protected]>
> wrote:
>
> Hello,
>
>
> Tahsin and I are trying to use the Apache Parquet file format with Spark
> SQL, but are running into errors when reading Parquet files that contain
> TimeType columns. We're wondering whether this is unsupported in Spark SQL
> due to an architectural limitation, or due to lack of resources?
>
>
> Context: When reading some Parquet files with Spark, we get an error
> message like the following:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 186.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 186.0 (TID 1970, 10.155.249.249, executor 1): java.io.IOException: Could
> not read or convert schema for file:
> dbfs:/test/randomdata/sample001.parquet
> ...
> Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type:
> INT64 (TIME_MICROS);
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:106)
>
>
> This only seems to occur with Parquet files that have a column with the
> "TimeType" (or the deprecated "TIME_MILLIS"/"TIME_MICROS") types in the
> Parquet file. After digging into this a bit, we think that the error
> message is coming from "ParquetSchemaConverter.scala" here: link
> <https://github.com/apache/spark/blob/11d3a744e20fe403dd76e18d57963b6090a7c581/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L151>.
>
> <https://github.com/apache/spark/blob/11d3a744e20fe403dd76e18d57963b6090a7c581/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L140>
>
>
> This seems to imply that the Spark SQL engine does not support reading
> Parquet files with TimeType columns.
>
> We are wondering if anyone on the mailing list could shed some more light
> on this: are there architectural/datatype limitations in Spark that are
> resulting in this error, or is TimeType support for Parquet files something
> that hasn't been implemented yet due to lack of resources/interest?
>
>
> Thanks,
> Rylan
>
>
>
> --
> Bart Samwel
> [email protected]
>
>
>

-- 
Bart Samwel
[email protected]
