[ https://issues.apache.org/jira/browse/HIVE-11558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092521#comment-15092521 ]
Andrey Balmin commented on HIVE-11558:
--------------------------------------

I got the same stack trace on CDH 5.3. The problem is fixed in CDH 5.4.

The problem happens if a ColumnChunk contains only nulls. The Statistics object for that ColumnChunk looks like this:

{code}
statistics = {
  max = null
  min = null
  null_count = 43927
  distinct_count = 0
}
{code}

Thus, accessing statistics.min.array() inside fromParquetStatistics() results in an NPE.
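For illustration, here is a minimal sketch of the kind of guard that avoids this NPE, written against the pre-rename parquet-mr 1.x API visible in the stack trace below. The isSetMin()/isSetMax()/isSetNull_count() accessors are the standard Thrift-generated methods on parquet.format.Statistics; the wrapper class and method name are hypothetical, and this is not necessarily the actual upstream patch.

{code}
import parquet.column.statistics.Statistics;
import parquet.schema.PrimitiveType.PrimitiveTypeName;

public class SafeStatisticsConversion {

  // Sketch of a null-safe conversion from the Thrift footer statistics to
  // in-memory statistics. An all-null ColumnChunk carries a null_count but
  // no min/max, so the byte access must be guarded: without the isSet
  // checks, formatStats.min.array() dereferences a null ByteBuffer.
  public static Statistics fromParquetStatisticsSafe(
      parquet.format.Statistics formatStats, PrimitiveTypeName type) {
    Statistics stats = Statistics.getStatsBasedOnType(type);
    if (formatStats != null) {
      if (formatStats.isSetMin() && formatStats.isSetMax()) {
        stats.setMinMaxFromBytes(formatStats.min.array(),
                                 formatStats.max.array());
      }
      if (formatStats.isSetNull_count()) {
        stats.setNumNulls(formatStats.null_count);
      }
    }
    return stats;
  }
}
{code}

A reader built on a parquet-mr version without such a guard will hit the NPE on any footer containing a column chunk that is entirely null, which matches the behavior reported in the issue below.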
> Hive generates Parquet files with broken footers, causes NullPointerException in Spark / Drill / Parquet tools
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-11558
>                 URL: https://issues.apache.org/jira/browse/HIVE-11558
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats, StorageHandler
>    Affects Versions: 1.2.1
>         Environment: HDP 2.3
>            Reporter: Hari Sekhon
>            Priority: Critical
>
> When creating a Parquet table in Hive from a table in another format (in this case JSON) using CTAS, the generated Parquet files have broken footers and cause NullPointerExceptions in both Parquet tools and Spark when the files are read directly.
> Here is the error from Parquet tools:
> {code}Could not read footer: java.lang.NullPointerException{code}
> Here is the error from Spark reading the Parquet file back:
> {code}
> java.lang.NullPointerException
>     at parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
>     at parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
>     at parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
>     at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
>     at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:298)
>     at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:297)
>     at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
>     at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>     at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>     at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>     at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>     at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
>     at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
>     at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>     at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> What's interesting is that the table works fine in Hive when selecting from it, even when doing a select * over the whole table and letting it run to completion (it's a sample data set); it's only other tools that it causes problems for.
> All fields are strings except for the first one, which is a timestamp, but this is not the known timestamp issue: if I create another Parquet table via CTAS with three fields, the timestamp and two of the string fields, those Hive-generated Parquet files work fine in the other tools.
> The only apparent cause I can see is that the other fields contain lots of NULLs, since those JSON fields may or may not be present.
> I've converted this exact same JSON data set to Parquet using both Apache Drill and Apache Spark SQL, and the Parquet files each tool produces as a straight conversion read fine in Parquet tools, Drill, Spark, and Hive (using an external Hive table definition layered over the generated Parquet files).
> This implies that it is Hive's Parquet generation that is broken, since both Drill and Spark can convert the data set from JSON to Parquet without any issues reading the files back in any of the other mentioned tools.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)