I think I might have figured it out myself. Here's a pull request for you
guys to check out:

https://github.com/apache/spark/pull/3855

I successfully tested this code on my cluster.

On Tue, Dec 30, 2014 at 11:01 PM, Alessandro Baretta <alexbare...@gmail.com>
wrote:

> Here's a more meaningful exception:
>
> java.lang.ClassCastException: org.apache.spark.sql.catalyst.types.DateType$
> cannot be cast to org.apache.spark.sql.catalyst.types.PrimitiveType
>     at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:188)
>     at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:167)
>     at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:130)
>     at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>     at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>     at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$writeShard$1(ParquetTableOperations.scala:309)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
> This is easy to fix, even for a newbie like myself: it suffices to add the
> PrimitiveType trait to the DateType object. You can find this change here:
>
> https://github.com/alexbaretta/spark/compare/parquet-date-support
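>
> In miniature, the problem and the one-line fix look like this (a toy
> sketch with names borrowed from catalyst; this is not the actual Spark
> source):
>
>     abstract class DataType
>     trait PrimitiveType extends DataType
>
>     // Without "with PrimitiveType", the cast below throws the same
>     // ClassCastException seen in RowWriteSupport.writeValue.
>     case object DateType extends DataType with PrimitiveType
>
>     object CastDemo extends App {
>       val dt: DataType = DateType
>       // Succeeds once the trait is mixed in:
>       println(s"cast ok: ${dt.asInstanceOf[PrimitiveType]}")
>     }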
>
> However, even this does not work. Here's the next blocker:
>
> java.lang.RuntimeException: Unsupported datatype DateType, cannot write to
> consumer
>     at scala.sys.package$.error(package.scala:27)
>     at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:361)
>     at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:329)
>     at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:315)
>     at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>     at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>     at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$writeShard$1(ParquetTableOperations.scala:309)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
> Any input on how to address this issue would be welcome.
>
> Alex
>
> On Tue, Dec 30, 2014 at 5:21 PM, Alessandro Baretta <alexbare...@gmail.com>
> wrote:
>
>> Sorry! My bad. I had stale spark jars sitting on the slave nodes...
>>
>> Alex
>>
>> On Tue, Dec 30, 2014 at 4:39 PM, Alessandro Baretta <alexbare...@gmail.com>
>> wrote:
>>
>>> Gents,
>>>
>>> I tried #3820. It doesn't work. I'm still getting the following
>>> exceptions:
>>>
>>> Exception in thread "Thread-45" java.lang.RuntimeException: Unsupported
>>> datatype DateType
>>>     at scala.sys.package$.error(package.scala:27)
>>>     at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$fromDataType$2.apply(ParquetTypes.scala:343)
>>>     at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
>>>     at scala.Option.getOrElse(Option.scala:120)
>>>     at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
>>>     at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$4.apply(ParquetTypes.scala:363)
>>>     at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$4.apply(ParquetTypes.scala:362)
>>>
>>> I would be more than happy to fix this myself, but I would need some help
>>> wading through the code. Could anyone explain to me what exactly is needed
>>> to support a new data type in SparkSQL's Parquet storage engine?
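>>>
>>> From the stack trace, ParquetTypesConverter.fromDataType apparently
>>> dispatches on the Catalyst type and errors out on anything it has no
>>> case for, so my guess is that supporting a new type starts with adding
>>> a case to each such dispatch. Schematically, something like this (a toy
>>> sketch, not the actual source, assuming a date would be encoded as an
>>> INT32 day count):
>>>
>>>     sealed trait CatalystType
>>>     case object IntegerType extends CatalystType
>>>     case object DateType extends CatalystType
>>>
>>>     // Toy stand-ins for Parquet's primitive type names.
>>>     sealed trait ParquetPrimitive
>>>     case object Int32 extends ParquetPrimitive
>>>
>>>     // "Unsupported datatype" is the fall-through case of a dispatch
>>>     // like this one; a new type needs its own case here (and in the
>>>     // row writers).
>>>     def fromDataType(t: CatalystType): ParquetPrimitive = t match {
>>>       case IntegerType => Int32
>>>       case DateType    => Int32 // hypothetical: days since the epoch
>>>       case other       => sys.error(s"Unsupported datatype $other")
>>>     }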
>>>
>>> Thanks.
>>>
>>> Alex
>>>
>>> On Mon, Dec 29, 2014 at 10:20 PM, Wang, Daoyuan <daoyuan.w...@intel.com>
>>> wrote:
>>>
>>>> By adding a flag in SQLContext, I have modified #3822 to include
>>>> nanoseconds now. Since passing too many flags is ugly, I now need the
>>>> whole SQLContext, so that we can put more flags there.
>>>>
>>>> Thanks,
>>>>
>>>> Daoyuan
>>>>
>>>> From: Michael Armbrust [mailto:mich...@databricks.com]
>>>> Sent: Tuesday, December 30, 2014 10:43 AM
>>>> To: Alessandro Baretta
>>>> Cc: Wang, Daoyuan; dev@spark.apache.org
>>>> Subject: Re: Unsupported Catalyst types in Parquet
>>>>
>>>> Yeah, I saw those. The problem is that #3822 truncates timestamps that
>>>> include nanoseconds.
>>>>
>>>> On Mon, Dec 29, 2014 at 5:14 PM, Alessandro Baretta
>>>> <alexbare...@gmail.com> wrote:
>>>>
>>>> Michael,
>>>>
>>>> Actually, Adrian Wang already created pull requests for these issues:
>>>>
>>>> https://github.com/apache/spark/pull/3820
>>>> https://github.com/apache/spark/pull/3822
>>>>
>>>> What do you think?
>>>>
>>>> Alex
>>>>
>>>> On Mon, Dec 29, 2014 at 3:07 PM, Michael Armbrust
>>>> <mich...@databricks.com> wrote:
>>>>
>>>> I'd love to get both of these in. There is some trickiness that I talk
>>>> about on the JIRA for timestamps, since the SQL timestamp class can
>>>> support nanoseconds and I don't think parquet has a type for this. Other
>>>> systems (impala) seem to use INT96. It would be great to maybe ask on the
>>>> parquet mailing list what the plan is there, to make sure that whatever
>>>> we do is going to be compatible long term.
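>>>>
>>>> For concreteness: as far as I can tell, Impala packs a timestamp into
>>>> INT96 as a little-endian 8-byte count of nanoseconds within the day
>>>> followed by a little-endian 4-byte Julian day number, i.e. roughly the
>>>> sketch below, but we should confirm that on the parquet list before
>>>> depending on it.
>>>>
>>>>     import java.nio.{ByteBuffer, ByteOrder}
>>>>
>>>>     // Pack a timestamp into the 12-byte Impala-style INT96 layout:
>>>>     // 8 bytes of nanos-of-day, then 4 bytes of Julian day.
>>>>     def toInt96(julianDay: Int, nanosOfDay: Long): Array[Byte] = {
>>>>       val buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
>>>>       buf.putLong(nanosOfDay)
>>>>       buf.putInt(julianDay)
>>>>       buf.array()
>>>>     }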
>>>>
>>>> Michael
>>>>
>>>> On Mon, Dec 29, 2014 at 8:13 AM, Alessandro Baretta
>>>> <alexbare...@gmail.com> wrote:
>>>>
>>>> Daoyuan,
>>>>
>>>> Thanks for creating the jiras. I need these features by... last week, so
>>>> I'd be happy to take care of this myself, if only you or someone more
>>>> experienced than me in the SparkSQL codebase could provide some guidance.
>>>>
>>>> Alex
>>>>
>>>> On Dec 29, 2014 12:06 AM, "Wang, Daoyuan" <daoyuan.w...@intel.com> wrote:
>>>>
>>>> Hi Alex,
>>>>
>>>> I'll create JIRA SPARK-4985 for date type support in parquet, and
>>>> SPARK-4987 for timestamp type support. For decimal type, I think we only
>>>> support decimals that fit in a long.
>>>>
>>>> Thanks,
>>>> Daoyuan
>>>>
>>>> -----Original Message-----
>>>> From: Alessandro Baretta [mailto:alexbare...@gmail.com]
>>>> Sent: Saturday, December 27, 2014 2:47 PM
>>>> To: dev@spark.apache.org; Michael Armbrust
>>>> Subject: Unsupported Catalyst types in Parquet
>>>>
>>>> Michael,
>>>>
>>>> I'm having trouble storing my SchemaRDDs in Parquet format with SparkSQL,
>>>> due to my RDDs having DateType and DecimalType fields. What would it take
>>>> to add Parquet support for these Catalyst types? Are there any other
>>>> Catalyst types for which there is no Parquet support?
>>>>
>>>> Alex