I think I might have figured it out myself. Here's a pull request for you
guys to check out:

https://github.com/apache/spark/pull/3855

I successfully tested this code on my cluster.

On Tue, Dec 30, 2014 at 11:01 PM, Alessandro Baretta <alexbare...@gmail.com>
wrote:

> Here's a more meaningful exception:
>
> java.lang.ClassCastException: org.apache.spark.sql.catalyst.types.DateType$
> cannot be cast to org.apache.spark.sql.catalyst.types.PrimitiveType
>     at org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:188)
>     at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:167)
>     at org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:130)
>     at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>     at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>     at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$writeShard$1(ParquetTableOperations.scala:309)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
> This is easy to fix, even for a newbie like myself: it suffices to add the
> PrimitiveType trait to the DateType object. You can find this change here:
>
> https://github.com/alexbaretta/spark/compare/parquet-date-support
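>
> In miniature, the problem and the one-line fix look like this (a toy
> sketch with names borrowed from catalyst; this is not the actual Spark
> source):
>
>     abstract class DataType
>     trait PrimitiveType extends DataType
>
>     // Without "with PrimitiveType", the cast below throws the same
>     // ClassCastException seen in RowWriteSupport.writeValue.
>     case object DateType extends DataType with PrimitiveType
>
>     object CastDemo extends App {
>       val dt: DataType = DateType
>       // Succeeds once the trait is mixed in:
>       println(s"cast ok: ${dt.asInstanceOf[PrimitiveType]}")
>     }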
>
> However, even this does not work. Here's the next blocker:
>
> java.lang.RuntimeException: Unsupported datatype DateType, cannot write to
> consumer
>     at scala.sys.package$.error(package.scala:27)
>     at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:361)
>     at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:329)
>     at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:315)
>     at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
>     at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
>     at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$writeShard$1(ParquetTableOperations.scala:309)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
>     at org.apache.spark.sql.parquet.InsertIntoParquetTable$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:326)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
> Any input on how to address this issue would be welcome.
>
> Alex
>
> On Tue, Dec 30, 2014 at 5:21 PM, Alessandro Baretta <alexbare...@gmail.com>
> wrote:
>
>> Sorry! My bad. I had stale spark jars sitting on the slave nodes...
>>
>> Alex
>>
>> On Tue, Dec 30, 2014 at 4:39 PM, Alessandro Baretta <alexbare...@gmail.com>
>> wrote:
>>
>>> Gents,
>>>
>>> I tried #3820. It doesn't work. I'm still getting the following
>>> exceptions:
>>>
>>> Exception in thread "Thread-45" java.lang.RuntimeException: Unsupported
>>> datatype DateType
>>>     at scala.sys.package$.error(package.scala:27)
>>>     at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$fromDataType$2.apply(ParquetTypes.scala:343)
>>>     at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$fromDataType$2.apply(ParquetTypes.scala:292)
>>>     at scala.Option.getOrElse(Option.scala:120)
>>>     at org.apache.spark.sql.parquet.ParquetTypesConverter$.fromDataType(ParquetTypes.scala:291)
>>>     at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$4.apply(ParquetTypes.scala:363)
>>>     at org.apache.spark.sql.parquet.ParquetTypesConverter$anonfun$4.apply(ParquetTypes.scala:362)
>>>
>>> I would be more than happy to fix this myself, but I would need some help
>>> wading through the code. Could anyone explain to me what exactly is needed
>>> to support a new data type in SparkSQL's Parquet storage engine?
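>>>
>>> From the stack trace, ParquetTypesConverter.fromDataType apparently
>>> dispatches on the Catalyst type and errors out on anything it has no
>>> case for, so my guess is that supporting a new type starts with adding
>>> a case to each such dispatch. Schematically, something like this (a toy
>>> sketch, not the actual source, assuming a date would be encoded as an
>>> INT32 day count):
>>>
>>>     sealed trait CatalystType
>>>     case object IntegerType extends CatalystType
>>>     case object DateType extends CatalystType
>>>
>>>     // Toy stand-ins for Parquet's primitive type names.
>>>     sealed trait ParquetPrimitive
>>>     case object Int32 extends ParquetPrimitive
>>>
>>>     // "Unsupported datatype" is the fall-through case of a dispatch
>>>     // like this one; a new type needs its own case here (and in the
>>>     // row writers).
>>>     def fromDataType(t: CatalystType): ParquetPrimitive = t match {
>>>       case IntegerType => Int32
>>>       case DateType    => Int32 // hypothetical: days since the epoch
>>>       case other       => sys.error(s"Unsupported datatype $other")
>>>     }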
>>>
>>> Thanks.
>>>
>>> Alex
>>>
>>> On Mon, Dec 29, 2014 at 10:20 PM, Wang, Daoyuan <daoyuan.w...@intel.com>
>>> wrote:
>>>
>>>> By adding a flag in SQLContext, I have modified #3822 to include
>>>> nanoseconds now. Since passing too many flags is ugly, I now need the
>>>> whole SQLContext, so that we can put more flags there.
>>>>
>>>> Thanks,
>>>>
>>>> Daoyuan
>>>>
>>>> From: Michael Armbrust [mailto:mich...@databricks.com]
>>>> Sent: Tuesday, December 30, 2014 10:43 AM
>>>> To: Alessandro Baretta
>>>> Cc: Wang, Daoyuan; dev@spark.apache.org
>>>> Subject: Re: Unsupported Catalyst types in Parquet
>>>>
>>>> Yeah, I saw those. The problem is that #3822 truncates timestamps that
>>>> include nanoseconds.
>>>>
>>>> On Mon, Dec 29, 2014 at 5:14 PM, Alessandro Baretta
>>>> <alexbare...@gmail.com> wrote:
>>>>
>>>> Michael,
>>>>
>>>> Actually, Adrian Wang already created pull requests for these issues:
>>>>
>>>> https://github.com/apache/spark/pull/3820
>>>> https://github.com/apache/spark/pull/3822
>>>>
>>>> What do you think?
>>>>
>>>> Alex
>>>>
>>>> On Mon, Dec 29, 2014 at 3:07 PM, Michael Armbrust
>>>> <mich...@databricks.com> wrote:
>>>>
>>>> I'd love to get both of these in. There is some trickiness that I talk
>>>> about on the JIRA for timestamps, since the SQL timestamp class can
>>>> support nanoseconds and I don't think parquet has a type for this. Other
>>>> systems (impala) seem to use INT96. It would be great to maybe ask on the
>>>> parquet mailing list what the plan is there, to make sure that whatever
>>>> we do is going to be compatible long term.
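>>>>
>>>> For concreteness: as far as I can tell, Impala packs a timestamp into
>>>> INT96 as a little-endian 8-byte count of nanoseconds within the day
>>>> followed by a little-endian 4-byte Julian day number, i.e. roughly the
>>>> sketch below, but we should confirm that on the parquet list before
>>>> depending on it.
>>>>
>>>>     import java.nio.{ByteBuffer, ByteOrder}
>>>>
>>>>     // Pack a timestamp into the 12-byte Impala-style INT96 layout:
>>>>     // 8 bytes of nanos-of-day, then 4 bytes of Julian day.
>>>>     def toInt96(julianDay: Int, nanosOfDay: Long): Array[Byte] = {
>>>>       val buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
>>>>       buf.putLong(nanosOfDay)
>>>>       buf.putInt(julianDay)
>>>>       buf.array()
>>>>     }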
>>>>
>>>> Michael
>>>>
>>>> On Mon, Dec 29, 2014 at 8:13 AM, Alessandro Baretta
>>>> <alexbare...@gmail.com> wrote:
>>>>
>>>> Daoyuan,
>>>>
>>>> Thanks for creating the jiras. I need these features by... last week, so
>>>> I'd be happy to take care of this myself, if only you or someone more
>>>> experienced than me in the SparkSQL codebase could provide some guidance.
>>>>
>>>> Alex
>>>>
>>>> On Dec 29, 2014 12:06 AM, "Wang, Daoyuan" <daoyuan.w...@intel.com> wrote:
>>>>
>>>> Hi Alex,
>>>>
>>>> I'll create JIRA SPARK-4985 for date type support in parquet, and
>>>> SPARK-4987 for timestamp type support. For decimal type, I think we only
>>>> support decimals that fit in a long.
>>>>
>>>> Thanks,
>>>> Daoyuan
>>>>
>>>> -----Original Message-----
>>>> From: Alessandro Baretta [mailto:alexbare...@gmail.com]
>>>> Sent: Saturday, December 27, 2014 2:47 PM
>>>> To: dev@spark.apache.org; Michael Armbrust
>>>> Subject: Unsupported Catalyst types in Parquet
>>>>
>>>> Michael,
>>>>
>>>> I'm having trouble storing my SchemaRDDs in Parquet format with SparkSQL,
>>>> due to my RDDs having DateType and DecimalType fields. What would it take
>>>> to add Parquet support for these Catalyst types? Are there any other
>>>> Catalyst types for which there is no Parquet support?
>>>>
>>>> Alex