[ https://issues.apache.org/jira/browse/HIVE-11558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092521#comment-15092521 ]

Andrey Balmin commented on HIVE-11558:
--------------------------------------

I got the same stack trace on CDH 5.3; the problem is fixed in CDH 5.4.
It happens when a ColumnChunk contains only nulls. The Statistics object 
for such a ColumnChunk looks like this:

{code}
statistics = {
  max = null
  min = null
  null_count = 43927
  distinct_count = 0
}
{code}

Thus, accessing statistics.min.array() inside fromParquetStatistics() results 
in an NPE.
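
As a hedged illustration (not the exact upstream patch), the fix amounts to a guard like the one below; the isSetMin()/isSetMax() accessors on the thrift-generated Statistics struct are my assumption of the relevant API:

{code}
import parquet.format.Statistics;  // thrift-generated footer Statistics (old pre-org.apache package)

public final class NullOnlyChunkExample {

  // Sketch of the guard that avoids the NPE: only touch min/max when the writer
  // actually set them. For an all-null ColumnChunk only null_count is populated,
  // so an unguarded statistics.min.array() throws a NullPointerException.
  static void convert(Statistics statistics) {
    if (statistics.isSetMin() && statistics.isSetMax()) {
      byte[] min = statistics.min.array();
      byte[] max = statistics.max.array();
      // ... convert min/max into reader-side statistics as before ...
    }
    long nullCount = statistics.null_count;  // safe to read even when min/max are absent
  }
}
{code}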

> Hive generates Parquet files with broken footers, causes NullPointerException 
> in Spark / Drill / Parquet tools
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-11558
>                 URL: https://issues.apache.org/jira/browse/HIVE-11558
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats, StorageHandler
>    Affects Versions: 1.2.1
>         Environment: HDP 2.3
>            Reporter: Hari Sekhon
>            Priority: Critical
>
> When creating a Parquet table in Hive from a table in another format (in this 
> case JSON) using CTAS, the generated parquet files are created with broken 
> footers and cause NullPointerExceptions in both Parquet tools and Spark when 
> reading the files directly.
> Here is the error from parquet tools:
> {code}Could not read footer: java.lang.NullPointerException{code}
> Here is the error from Spark reading the parquet file back:
> {code}java.lang.NullPointerException
>         at 
> parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
>         at 
> parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
>         at 
> parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
>         at 
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
>         at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:298)
>         at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:297)
>         at 
> scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
>         at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>         at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>         at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>         at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>         at 
> scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
>         at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
>         at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>         at 
> scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>         at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
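> The same code path can be hit outside Spark by reading a footer directly with 
> parquet-mr; a minimal sketch (the file path below is hypothetical, and the old 
> pre-org.apache package names match the stack trace above):
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import parquet.hadoop.ParquetFileReader;
> import parquet.hadoop.metadata.ParquetMetadata;
>
> public class ReadFooterRepro {
>   public static void main(String[] args) throws Exception {
>     // Hypothetical path to one of the Hive-generated Parquet files
>     Path file = new Path("/apps/hive/warehouse/mytable/000000_0");
>     // readFooter() goes through ParquetMetadataConverter.fromParquetStatistics(),
>     // which is where the NullPointerException above is thrown.
>     ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), file);
>     System.out.println(footer.getBlocks().size() + " row groups");
>   }
> }
> {code}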
> What's interesting is that the table works fine in Hive when selecting from 
> it, even when doing a select * on the whole table and letting it run to the 
> end (it's a sample data set); it only causes problems for other tools.
> All fields are strings except for the first one, which is a timestamp, but 
> this is not that known timestamp issue: if I create another Parquet table via 
> CTAS with 3 fields (the timestamp and two string fields), those Hive-generated 
> Parquet files work fine in the other tools.
> The only thing I can see that appears to cause this is that the other fields 
> contain lots of NULLs, since those JSON fields may or may not be present.
> I've converted this exact same JSON data set to Parquet using both Apache 
> Drill and Apache Spark SQL, and the Parquet files produced by those straight 
> conversions read fine in Parquet tools, Drill, Spark, and Hive (using an 
> external Hive table definition layered over the generated Parquet files).
> This implies that it is Hive's generation of Parquet that is broken, since 
> both Drill and Spark can convert this data set from JSON to Parquet without 
> any issues reading the files back in any of the other mentioned tools.


