wgtmac commented on code in PR #3148:
URL: https://github.com/apache/parquet-java/pull/3148#discussion_r1945990921


##########
parquet-column/src/main/java/org/apache/parquet/column/page/Page.java:
##########
@@ -34,6 +34,9 @@ public abstract class Page {
     this.uncompressedSize = uncompressedSize;
   }
 
+  /**
+   * @return the compressed size of the page when the bytes are compressed, otherwise return 0

Review Comment:
   I would rather return the uncompressed size if the page is not compressed.



##########
parquet-column/src/main/java/org/apache/parquet/column/page/DataPageV2.java:
##########
@@ -163,6 +170,33 @@ public DataPageV2(
     this.isCompressed = isCompressed;
   }
 
+  public DataPageV2(
+      int rowCount,
+      int nullCount,
+      int valueCount,
+      BytesInput repetitionLevels,
+      BytesInput definitionLevels,
+      Encoding dataEncoding,
+      BytesInput data,
+      int compressedSize,
+      int uncompressedSize,
+      Statistics<?> statistics,
+      boolean isCompressed) {
+    super(compressedSize, uncompressedSize, valueCount);
+    if (!isCompressed && compressedSize != 0) {
+      throw new IllegalArgumentException("compressedSize must be 0 if page is not compressed");
+    }

Review Comment:
   IMO, we should assume `compressedSize == uncompressedSize` when the page is 
   uncompressed. `UNCOMPRESSED` is still a valid compression codec type; 
   otherwise, `RowGroup.total_compressed_size` may have problems.
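
A minimal sketch of the suggested check, assuming the concern is that per-page compressed sizes get summed into a total such as `RowGroup.total_compressed_size`. The class `PageSizeSketch` and both helper methods are hypothetical names for illustration and are not part of parquet-java:

```java
// Hypothetical illustration only; not the parquet-java implementation.
final class PageSizeSketch {

  // Validates the size pair under the rule proposed in the review:
  // an uncompressed page reports compressedSize == uncompressedSize.
  static int checkedCompressedSize(int compressedSize, int uncompressedSize, boolean isCompressed) {
    if (!isCompressed && compressedSize != uncompressedSize) {
      throw new IllegalArgumentException(
          "compressedSize must equal uncompressedSize if page is not compressed");
    }
    return compressedSize;
  }

  // Sums per-page compressed sizes. If uncompressed pages reported 0,
  // this total would undercount the bytes actually stored for those pages.
  static long totalCompressedSize(int[] compressedSizes) {
    long total = 0;
    for (int size : compressedSizes) {
      total += size;
    }
    return total;
  }

  public static void main(String[] args) {
    int compressedPage = checkedCompressedSize(40, 100, true);
    int uncompressedPage = checkedCompressedSize(100, 100, false);
    System.out.println(totalCompressedSize(new int[] {compressedPage, uncompressedPage}));
  }
}
```

With this rule, a mix of compressed and uncompressed pages still adds up to the number of bytes stored, which is the property the total appears to depend on.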



##########
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java:
##########
@@ -2123,6 +2123,9 @@ private PageHeader newDataPageV2Header(
       int dlByteLength) {
     DataPageHeaderV2 dataPageHeaderV2 = new DataPageHeaderV2(
         valueCount, nullCount, rowCount, getEncoding(dataEncoding), dlByteLength, rlByteLength);
+    if (compressedSize == 0) {
+      dataPageHeaderV2.setIs_compressed(false);
+    }

Review Comment:
   Agreed. Data page v2 was designed to adaptively fall back to uncompressed 
data when compression is not promising (though we don't implement it yet). 
Using an explicit parameter makes sense.
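
For illustration, the adaptive fallback could look roughly like the sketch below. It uses `java.util.zip.Deflater` as a stand-in codec, and the class `AdaptivePageSketch` with its `encode` method is a hypothetical name, not parquet-java code. The point is that the writer decides per page and carries the result in an explicit `isCompressed` flag instead of inferring it from a size of 0:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical illustration only; parquet-java does not implement this fallback yet.
final class AdaptivePageSketch {
  final byte[] bytes;          // bytes as they would be stored in the page
  final int uncompressedSize;  // size of the original data
  final boolean isCompressed;  // explicit flag, not derived from the sizes

  private AdaptivePageSketch(byte[] bytes, int uncompressedSize, boolean isCompressed) {
    this.bytes = bytes;
    this.uncompressedSize = uncompressedSize;
    this.isCompressed = isCompressed;
  }

  // For an uncompressed page this equals uncompressedSize.
  int compressedSize() {
    return bytes.length;
  }

  // Keeps the compressed bytes only when they are strictly smaller than the input.
  static AdaptivePageSketch encode(byte[] raw) {
    byte[] deflated = deflate(raw);
    if (deflated.length < raw.length) {
      return new AdaptivePageSketch(deflated, raw.length, true);
    }
    return new AdaptivePageSketch(raw, raw.length, false);
  }

  private static byte[] deflate(byte[] raw) {
    Deflater deflater = new Deflater();
    try {
      deflater.setInput(raw);
      deflater.finish();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[4096];
      while (!deflater.finished()) {
        int written = deflater.deflate(buffer);
        out.write(buffer, 0, written);
      }
      return out.toByteArray();
    } finally {
      deflater.end(); // release native resources
    }
  }
}
```

In this sketch a page of repeated bytes is stored compressed, while a very small page falls back to the raw bytes because the codec overhead outweighs any savings.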



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

