[I] Use FixedSizeBinary instead of Binary for int96 conversion when convertInt96ToArrowTimestamp is false [parquet-java]
doki23 opened a new issue, #3088: URL: https://github.com/apache/parquet-java/issues/3088

### Describe the enhancement requested

```java
public TypeMapping convertINT96(PrimitiveTypeName primitiveTypeName) throws RuntimeException {
  if (convertInt96ToArrowTimestamp) {
    return field(new ArrowType.Timestamp(TimeUnit.NANOSECOND, null));
  } else {
    return field(new ArrowType.Binary());
  }
}
```

When converting a Parquet type to an Arrow type, if the original type is INT96 and the option convertInt96ToArrowTimestamp is set to false, the resulting Arrow type defaults to Binary. However, since an INT96 value is always exactly 12 bytes, it might be more appropriate to use FixedSizeBinary instead.

### Component(s)

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org
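For context on why FixedSizeBinary(12) is the natural fit: an Impala-style INT96 timestamp is a fixed 12-byte value, 8 little-endian bytes of nanoseconds-within-day followed by a 4-byte little-endian Julian day. A minimal sketch of decoding that layout (the class and method names here are made up for illustration; only the byte layout comes from the Parquet convention):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

public class Int96Sketch {
    // Julian day number of the Unix epoch (1970-01-01).
    private static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;

    // Decodes a 12-byte INT96 timestamp: bytes 0-7 are nanoseconds within the
    // day, bytes 8-11 are the Julian day, both little-endian.
    public static Instant decodeInt96(byte[] bytes) {
        if (bytes.length != 12) {
            throw new IllegalArgumentException("INT96 must be exactly 12 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        long julianDay = Integer.toUnsignedLong(buf.getInt());
        long epochSeconds = (julianDay - JULIAN_DAY_OF_EPOCH) * 86_400L
                + nanosOfDay / 1_000_000_000L;
        return Instant.ofEpochSecond(epochSeconds, nanosOfDay % 1_000_000_000L);
    }

    public static void main(String[] args) {
        // Julian day 2440589 with zero nanoseconds = 1970-01-02T00:00:00Z.
        byte[] b = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
                .putLong(0L).putInt(2_440_589).array();
        System.out.println(decodeInt96(b)); // 1970-01-02T00:00:00Z
    }
}
```

Because the width never varies, a FixedSizeBinary(12) Arrow vector can hold these values without per-value offsets, whereas variable-width Binary carries an offsets buffer that is pure overhead here.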
[I] Required field 'num_values' was not found in serialized data! [parquet-java]
wardlican opened a new issue, #3084: URL: https://github.com/apache/parquet-java/issues/3084

### Describe the bug, including details regarding any error messages, version, and platform.

When using Iceberg, we encountered a Parquet file we had written that could not be read back; reading it produced the error below. Judging from the exception, we suspect the file is damaged or was not written properly. We tried a variety of parsing tools, but none of them could parse it. The footer of the file is intact and the schema information can be obtained, but the page data cannot be parsed. The Parquet version is 1.13.1. Is there any tool that can restore damaged files?

```
org.apache.iceberg.exceptions.RuntimeIOException: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data!
Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
    at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:165)
    at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141)
    at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:130)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:93)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:130)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.agg_doAggregateWithKeys_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1501)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data!
Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
    at org.apache.iceberg.shaded.org.apache.parquet.format.Util.read(Util.java:366)
    at org.apache.iceberg.shaded.org.apache.parquet.format.Util.readPageHeader(Util.java:133)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1458)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1505)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1478)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1088)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:956)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:909)
    at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163)
    ... 23 more
```

### Component(s)

Thrift
[PR] GH-3083: Make DELTA_LENGTH_BYTE_ARRAY default encoding for binary [parquet-java]
raunaqmorarka opened a new pull request, #3085: URL: https://github.com/apache/parquet-java/pull/3085

### Rationale for this change

The current default for V1 pages is PLAIN encoding. This encoding mixes string lengths with string data. This is inefficient for skipping N values, as the encoding does not allow random access. It is also slow to decode, as the interleaving of lengths with data prevents efficient batched implementations and forces most implementations to copy the data into the usual representation of separate offsets and data for strings. DELTA_LENGTH_BYTE_ARRAY has none of these problems, as it separates offsets and data. The parquet-format spec also seems to recommend this: https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/Encodings.md?plain=1#L299

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?
Re: [PR] GH-3086: Allow for empty beans [parquet-java]
Fokko merged PR #3087: URL: https://github.com/apache/parquet-java/pull/3087
[I] `ParquetMetadata` JSON serialization is failing [parquet-java]
Fokko opened a new issue, #3086: URL: https://github.com/apache/parquet-java/issues/3086

### Describe the bug, including details regarding any error messages, version, and platform.

Discovered by plugging RC1 into Spark: https://github.com/apache/spark/pull/48970

Failing test: https://github.com/Fokko/spark/actions/runs/12027509812/job/33528737000

```
...
Cause: java.lang.RuntimeException: shaded.parquet.com.fasterxml.jackson.databind.exc.InvalidDefinitionException: No serializer found for class org.apache.parquet.schema.LogicalTypeAnnotation$StringLogicalTypeAnnotation and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationFeature.FAIL_ON_EMPTY_BEANS) (through reference chain: org.apache.parquet.hadoop.metadata.ParquetMetadata["fileMetaData"]->org.apache.parquet.hadoop.metadata.FileMetaData["schema"]->org.apache.parquet.schema.MessageType["fields"]->java.util.ArrayList[1]->org.apache.parquet.schema.PrimitiveType["logicalTypeAnnotation"])
    at org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:73)
    at org.apache.parquet.hadoop.metadata.ParquetMetadata.toPrettyJSON(ParquetMetadata.java:49)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1594)
    ...
```

I checked the affected classes, and they haven't changed for a long time, so I believe the cause is the upgrade to the later version of Jackson. Spark uses the same version of Jackson, so I fixed it by allowing serialization to `null`. Converting to JSON is used for debugging purposes: https://github.com/apache/parquet-java/blob/7644e27717e09d570d99f76cf5bb631122d374bf/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1594

### Component(s)

_No response_
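The error message itself names the Jackson knob involved. This sketch reproduces the failure mode at the Jackson level only; whether parquet-java's actual fix disables FAIL_ON_EMPTY_BEANS or serializes such beans to `null` is determined by PR #3087, not by this snippet (class names here are invented):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;

public class EmptyBeanSketch {
    // A bean exposing no properties, analogous to
    // LogicalTypeAnnotation$StringLogicalTypeAnnotation in the error above:
    // Jackson discovers nothing to serialize.
    public static class Empty {}

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // With the default FAIL_ON_EMPTY_BEANS enabled, writeValueAsString
        // throws InvalidDefinitionException for Empty. Disabling the feature
        // makes Jackson emit an empty JSON object instead.
        mapper.disable(SerializationFeature.FAIL_ON_EMPTY_BEANS);
        System.out.println(mapper.writeValueAsString(new Empty())); // {}
    }
}
```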
Re: [PR] GH-3083: Make DELTA_LENGTH_BYTE_ARRAY default encoding for binary [parquet-java]
raunaqmorarka commented on PR #3085: URL: https://github.com/apache/parquet-java/pull/3085#issuecomment-2506285935

> Hey @raunaqmorarka thanks for raising this. I think we want to [discuss on the devlist](https://lists.apache.org/list.html?d...@parquet.apache.org) first if we want to change behavior. Would you be interested to raise this?

I'm not sure how to start a discussion on the devlist; I don't have credentials to log in there. It would be nice to discuss on the GH issue https://github.com/apache/parquet-java/issues/3083 if that's possible.
Re: [PR] GH-3078: Use Hadoop FileSystem.openFile() to open files [parquet-java]
gszadovszky commented on code in PR #3079: URL: https://github.com/apache/parquet-java/pull/3079#discussion_r1861685918

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/wrapped/io/FutureIO.java:

```diff
@@ -70,6 +70,29 @@ public static <T> T awaitFuture(final Future<T> future, final long timeout, fina
     }
   }
 
+  /**
+   * Given a future, evaluate it.
+   *
+   * Any exception generated in the future is
+   * extracted and rethrown.
+   *
+   * @param future future to evaluate
+   * @param <T> type of the result.
+   * @return the result, if all went well.
+   * @throws InterruptedIOException future was interrupted
+   * @throws IOException if something went wrong
+   * @throws RuntimeException any nested RTE thrown
+   */
+  public static <T> T awaitFuture(final Future<T> future)
+      throws InterruptedIOException, IOException, RuntimeException {
+    try {
+      return future.get();
+    } catch (InterruptedException e) {
+      throw (InterruptedIOException) new InterruptedIOException(e.toString()).initCause(e);
```

Review Comment: nit: Why the cast?

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopInputFile.java:

```diff
@@ -70,9 +93,38 @@ public long getLength() {
     return stat.getLen();
   }
 
+  /**
+   * Open the file.
+   * Uses {@code FileSystem.openFile()} so that
+   * the existing FileStatus can be passed down: saves a HEAD request on cloud storage
+   * and is ignored everywhere else.
+   *
+   * @return the input stream.
+   *
+   * @throws InterruptedIOException future was interrupted
+   * @throws IOException if something went wrong
+   * @throws RuntimeException any nested RTE thrown
+   */
   @Override
   public SeekableInputStream newStream() throws IOException {
-    return HadoopStreams.wrap(fs.open(stat.getPath()));
+    FSDataInputStream stream;
+    try {
+      // this method is async so that implementations may do async HEAD
+      // requests. Not done in S3A/ABFS when a file status is passed down (as is done here)
+      final CompletableFuture<FSDataInputStream> future = fs.openFile(stat.getPath())
+          .withFileStatus(stat)
+          .opt(OPENFILE_READ_POLICY_KEY, PARQUET_READ_POLICY)
+          .build();
+      stream = awaitFuture(future);
+    } catch (RuntimeException e) {
+      // S3A < 3.3.5 would raise illegal path exception if the openFile path didn't
+      // equal the path in the FileStatus; Hive virtual FS could create this condition.
+      // As the path to open is derived from stat.getPath(), this condition seems
+      // near-impossible to create, but is handled here for due diligence.
+      stream = fs.open(stat.getPath());
```

Review Comment: Shouldn't we at least log the original exception?
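On the "why the cast?" nit: `Throwable.initCause()` is declared to return `Throwable`, not the concrete type it was invoked on, so without the downcast the expression could not be thrown as an `InterruptedIOException`. A self-contained illustration of the idiom (the wrapper class and method name are invented):

```java
import java.io.InterruptedIOException;

public class CastSketch {
    // initCause() returns Throwable, so the cast is what lets callers throw
    // the result with the narrower InterruptedIOException type while still
    // chaining the original InterruptedException as the cause.
    public static InterruptedIOException asInterruptedIOException(InterruptedException e) {
        return (InterruptedIOException) new InterruptedIOException(e.toString()).initCause(e);
    }

    public static void main(String[] args) {
        InterruptedException cause = new InterruptedException("interrupted while waiting");
        InterruptedIOException wrapped = asInterruptedIOException(cause);
        System.out.println(wrapped.getCause() == cause); // true
    }
}
```

An alternative that avoids the cast is constructing the exception first and calling `initCause` as a statement, at the cost of an extra local variable.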
Re: [PR] GH-3083: Make DELTA_LENGTH_BYTE_ARRAY default encoding for binary [parquet-java]
Fokko commented on PR #3085: URL: https://github.com/apache/parquet-java/pull/3085#issuecomment-2506204690

Hey @raunaqmorarka thanks for raising this. I think we want to [discuss on the devlist](https://lists.apache.org/list.html?d...@parquet.apache.org) first if we want to change behavior. Would you be interested to raise this?
[PR] GH-3086: Allow for empty beans [parquet-java]
Fokko opened a new pull request, #3087: URL: https://github.com/apache/parquet-java/pull/3087

### Rationale for this change

Please check the issue: https://github.com/apache/parquet-java/issues/3086

### What changes are included in this PR?

### Are these changes tested?

Yes, added a new test.

### Are there any user-facing changes?