[I] Use FixedSizeBinary instead of Binary for int96 conversion when convertInt96ToArrowTimestamp is false [parquet-java]

2024-11-28 Thread via GitHub


doki23 opened a new issue, #3088:
URL: https://github.com/apache/parquet-java/issues/3088

   ### Describe the enhancement requested
   
   ```java
   public TypeMapping convertINT96(PrimitiveTypeName primitiveTypeName) throws RuntimeException {
     if (convertInt96ToArrowTimestamp) {
       return field(new ArrowType.Timestamp(TimeUnit.NANOSECOND, null));
     } else {
       return field(new ArrowType.Binary());
     }
   }
   ```
   When converting a Parquet type to an Arrow type, if the original type is INT96 and the option convertInt96ToArrowTimestamp is set to false, the resulting Arrow type defaults to Binary. However, FixedSizeBinary would be more appropriate, since INT96 values are always exactly 12 bytes.
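For background on why a fixed width fits: INT96 was in practice only used for Impala/Hive-style timestamps, which pack eight little-endian bytes of nanoseconds-of-day followed by four little-endian bytes of Julian day into exactly 12 bytes. A minimal decoding sketch in Python (the layout described is the common convention, not code taken from parquet-java):

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_EPOCH_DAY = 2440588  # Julian day number of 1970-01-01

def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a conventional INT96 timestamp: 8-byte little-endian
    nanoseconds-of-day followed by a 4-byte little-endian Julian day."""
    if len(raw) != 12:
        raise ValueError("INT96 values are always exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_EPOCH_DAY
    return (datetime(1970, 1, 1, tzinfo=timezone.utc)
            + timedelta(days=days, microseconds=nanos_of_day // 1000))

# 1970-01-01T00:00:00Z: zero nanoseconds of day, Julian day 2440588
raw = struct.pack("<qi", 0, JULIAN_EPOCH_DAY)
print(decode_int96_timestamp(raw))  # 1970-01-01 00:00:00+00:00
```

Since every value has this 12-byte shape, a FixedSizeBinary Arrow type preserves that invariant, while plain Binary loses it.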
   
   ### Component(s)
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org
For additional commands, e-mail: issues-h...@parquet.apache.org



[I] Required field 'num_values' was not found in serialized data! [parquet-java]

2024-11-28 Thread via GitHub


wardlican opened a new issue, #3084:
URL: https://github.com/apache/parquet-java/issues/3084

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When using Iceberg, we encountered a Parquet file we had written that could not be read; reading it produced the error message below. Judging from the exception, we suspect the file is damaged or was not written properly and cannot be parsed. We have tried a variety of parsing tools, but none can parse it. The footer of the file is intact and the file's schema information can be obtained, but the page data cannot be parsed. The Parquet version is 1.13.1. Is there any tool that can restore damaged files?
   ```
   org.apache.iceberg.exceptions.RuntimeIOException: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
       at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:165)
       at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141)
       at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:130)
       at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:93)
       at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:130)
       at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown Source)
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.agg_doAggregateWithKeys_0$(Unknown Source)
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
       at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
       at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
       at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
       at org.apache.spark.scheduler.Task.run(Task.scala:131)
       at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1501)
       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:750)
   Caused by: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
       at org.apache.iceberg.shaded.org.apache.parquet.format.Util.read(Util.java:366)
       at org.apache.iceberg.shaded.org.apache.parquet.format.Util.readPageHeader(Util.java:133)
       at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1458)
       at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1505)
       at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1478)
       at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1088)
       at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:956)
       at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:909)
       at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163)
       ... 23 more
   ```
   
   ### Component(s)
   
   Thrift



[PR] GH-3083: Make DELTA_LENGTH_BYTE_ARRAY default encoding for binary [parquet-java]

2024-11-28 Thread via GitHub


raunaqmorarka opened a new pull request, #3085:
URL: https://github.com/apache/parquet-java/pull/3085

   
   
   
   ### Rationale for this change
   The current default encoding for V1 pages is PLAIN. This encoding interleaves string lengths with string data, which is inefficient for skipping N values because it allows no random access. It is also slow to decode: the interleaving of lengths with data prevents efficient batched implementations and forces most implementations to copy the data into the usual representation of separate offsets and data for strings.
   
   DELTA_LENGTH_BYTE_ARRAY has none of the above problems as it separates 
offsets and data. The parquet-format spec also seems to recommend this 
https://github.com/apache/parquet-format/blob/c70281359087dfaee8bd43bed9748675f4aabe11/Encodings.md?plain=1#L299
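The difference between the two layouts can be sketched in a few lines. This is a simplified illustration: plain int32 lengths stand in for the DELTA_BINARY_PACKED lengths the real encoding uses.

```python
import struct
from itertools import accumulate

def encode_plain(values):
    """PLAIN BYTE_ARRAY: each value is a 4-byte little-endian length
    immediately followed by its bytes (lengths interleaved with data)."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", len(v)) + v
    return bytes(out)

def plain_offset_of(buf, n):
    """Skipping to value n under PLAIN means walking every interleaved
    length header: no random access."""
    pos = 0
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4 + length
    return pos

def encode_lengths_then_data(values):
    """Simplified DELTA_LENGTH_BYTE_ARRAY layout: all lengths up front,
    then the concatenated value bytes."""
    return [len(v) for v in values], b"".join(values)

def make_offsets(lengths):
    """One prefix-sum pass turns the lengths block into offsets, after
    which any value is a direct slice of the data block."""
    return [0] + list(accumulate(lengths))

values = [b"parquet", b"arrow", b"delta"]
lengths, data = encode_lengths_then_data(values)
offsets = make_offsets(lengths)
print(data[offsets[2]:offsets[3]])  # b'delta'
```

The separate lengths-then-data layout is also the shape Arrow uses for variable-length binary columns, which is why it decodes into that representation without copying.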
   
   ### What changes are included in this PR?
   
   
   ### Are these changes tested?
   
   
   ### Are there any user-facing changes?
   
   
   
   
   





Re: [PR] GH-3086: Allow for empty beans [parquet-java]

2024-11-28 Thread via GitHub


Fokko merged PR #3087:
URL: https://github.com/apache/parquet-java/pull/3087





[I] `ParquetMetadata` JSON serialization is failing [parquet-java]

2024-11-28 Thread via GitHub


Fokko opened a new issue, #3086:
URL: https://github.com/apache/parquet-java/issues/3086

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Discovered by plugging RC1 into Spark:
https://github.com/apache/spark/pull/48970
   
   Failing test: 
https://github.com/Fokko/spark/actions/runs/12027509812/job/33528737000
   ```
   ...
   Cause: java.lang.RuntimeException: shaded.parquet.com.fasterxml.jackson.databind.exc.InvalidDefinitionException: No serializer found for class org.apache.parquet.schema.LogicalTypeAnnotation$StringLogicalTypeAnnotation and no properties discovered to create BeanSerializer
   (to avoid exception, disable SerializationFeature.FAIL_ON_EMPTY_BEANS)
   (through reference chain: org.apache.parquet.hadoop.metadata.ParquetMetadata["fileMetaData"]->org.apache.parquet.hadoop.metadata.FileMetaData["schema"]->org.apache.parquet.schema.MessageType["fields"]->java.util.ArrayList[1]->org.apache.parquet.schema.PrimitiveType["logicalTypeAnnotation"])
    at org.apache.parquet.hadoop.metadata.ParquetMetadata.toJSON(ParquetMetadata.java:73)
    at org.apache.parquet.hadoop.metadata.ParquetMetadata.toPrettyJSON(ParquetMetadata.java:49)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1594)
   ...
   ```
   
   I checked the affected classes, and they haven't changed in a long time, so I believe the cause is the upgrade to a newer version of Jackson. Spark uses the same version of Jackson, so I fixed it by allowing serialization to `null`.
   
   Converting to JSON is used for debugging purposes: 
   
   
https://github.com/apache/parquet-java/blob/7644e27717e09d570d99f76cf5bb631122d374bf/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1594
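The failure is Jackson-specific, but the shape of the problem, and of the fix, can be illustrated with Python's stdlib `json` module: an object with no serializable properties fails by default, and allowing serialization to fall back to null makes it succeed. The class name below is just a stand-in; this is not parquet-java code.

```python
import json

class StringLogicalTypeAnnotation:
    """A stand-in for a bean with no public properties: nothing for a
    reflection-based serializer to discover."""
    pass

meta = {"logicalTypeAnnotation": StringLogicalTypeAnnotation()}

# Without a fallback the serializer fails, analogous to Jackson with
# SerializationFeature.FAIL_ON_EMPTY_BEANS enabled.
try:
    json.dumps(meta)
except TypeError as e:
    print("failed:", e)

# The shape of the fix: serialize otherwise-unhandled values to null.
print(json.dumps(meta, default=lambda o: None))  # {"logicalTypeAnnotation": null}
```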
   
   ### Component(s)
   
   _No response_





Re: [PR] GH-3083: Make DELTA_LENGTH_BYTE_ARRAY default encoding for binary [parquet-java]

2024-11-28 Thread via GitHub


raunaqmorarka commented on PR #3085:
URL: https://github.com/apache/parquet-java/pull/3085#issuecomment-2506285935

   > Hey @raunaqmorarka, thanks for raising this. I think we want to [discuss on the devlist](https://lists.apache.org/list.html?d...@parquet.apache.org) first if we want to change behavior. Would you be interested in raising this?
   
   I'm not sure how to start a discussion on the devlist; I don't have credentials to log in there.
   It would be nice to discuss this on the GH issue https://github.com/apache/parquet-java/issues/3083, if that's possible.





Re: [PR] GH-3078: Use Hadoop FileSystem.openFile() to open files [parquet-java]

2024-11-28 Thread via GitHub


gszadovszky commented on code in PR #3079:
URL: https://github.com/apache/parquet-java/pull/3079#discussion_r1861685918


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/wrapped/io/FutureIO.java:
##
@@ -70,6 +70,29 @@ public static <T> T awaitFuture(final Future<T> future, final long timeout, fina
 }
   }
 
+  /**
+   * Given a future, evaluate it.
+   * <p>
+   * Any exception generated in the future is
+   * extracted and rethrown.
+   * </p>
+   * @param future future to evaluate
+   * @param <T> type of the result.
+   * @return the result, if all went well.
+   * @throws InterruptedIOException future was interrupted
+   * @throws IOException if something went wrong
+   * @throws RuntimeException any nested RTE thrown
+   */
+  public static <T> T awaitFuture(final Future<T> future)
+      throws InterruptedIOException, IOException, RuntimeException {
+    try {
+      return future.get();
+    } catch (InterruptedException e) {
+      throw (InterruptedIOException) new InterruptedIOException(e.toString()).initCause(e);

Review Comment:
   nit: Why the cast?
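For background on the cast in question: `Throwable.initCause` is declared to return `Throwable` rather than the receiver's type, so rethrowing the result where an `InterruptedIOException` is expected requires the downcast. A standalone sketch (not parquet-java code; the class and method names are illustrative):

```java
import java.io.InterruptedIOException;

public class InitCauseCast {
    // Throwable.initCause returns Throwable, so without the downcast the
    // result could not be thrown where an IOException subtype is declared.
    static InterruptedIOException wrap(InterruptedException e) {
        return (InterruptedIOException)
            new InterruptedIOException(e.toString()).initCause(e);
    }

    public static void main(String[] args) {
        InterruptedException cause = new InterruptedException("stop");
        InterruptedIOException wrapped = wrap(cause);
        System.out.println(wrapped.getCause() == cause); // prints "true"
    }
}
```

An alternative that avoids the cast is passing the cause to a constructor, but `InterruptedIOException` has no `(String, Throwable)` constructor, hence the `initCause` chaining.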



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopInputFile.java:
##
@@ -70,9 +93,38 @@ public long getLength() {
 return stat.getLen();
   }
 
+  /**
+   * Open the file.
+   * Uses {@code FileSystem.openFile()} so that
+   * the existing FileStatus can be passed down: this saves a HEAD request on cloud
+   * storage and is ignored everywhere else.
+   *
+   * @return the input stream.
+   *
+   * @throws InterruptedIOException future was interrupted
+   * @throws IOException if something went wrong
+   * @throws RuntimeException any nested RTE thrown
+   */
   @Override
   public SeekableInputStream newStream() throws IOException {
-    return HadoopStreams.wrap(fs.open(stat.getPath()));
+    FSDataInputStream stream;
+    try {
+      // this method is async so that implementations may do async HEAD
+      // requests. Not done in S3A/ABFS when a file status is passed down (as is done here)
+      final CompletableFuture<FSDataInputStream> future = fs.openFile(stat.getPath())
+          .withFileStatus(stat)
+          .opt(OPENFILE_READ_POLICY_KEY, PARQUET_READ_POLICY)
+          .build();
+      stream = awaitFuture(future);
+    } catch (RuntimeException e) {
+      // S3A < 3.3.5 would raise an illegal path exception if the openFile path didn't
+      // equal the path in the FileStatus; the Hive virtual FS could create this condition.
+      // As the path to open is derived from stat.getPath(), this condition seems
+      // near-impossible to create, but is handled here for due diligence.
+      stream = fs.open(stat.getPath());

Review Comment:
   Shouldn't we at least log the original exception?
   






Re: [PR] GH-3083: Make DELTA_LENGTH_BYTE_ARRAY default encoding for binary [parquet-java]

2024-11-28 Thread via GitHub


Fokko commented on PR #3085:
URL: https://github.com/apache/parquet-java/pull/3085#issuecomment-2506204690

   Hey @raunaqmorarka, thanks for raising this. I think we want to [discuss on the devlist](https://lists.apache.org/list.html?d...@parquet.apache.org) first if we want to change behavior. Would you be interested in raising this?





[PR] GH-3086: Allow for empty beans [parquet-java]

2024-11-28 Thread via GitHub


Fokko opened a new pull request, #3087:
URL: https://github.com/apache/parquet-java/pull/3087

   ### Rationale for this change
   
   Please check the issue: https://github.com/apache/parquet-java/issues/3086
   
   ### What changes are included in this PR?
   
   
   ### Are these changes tested?
   
   Yes, added a new test.
   
   ### Are there any user-facing changes?
   
   
   
   
   

