wgtmac commented on code in PR #3148: URL: https://github.com/apache/parquet-java/pull/3148#discussion_r1945990921
##########
parquet-column/src/main/java/org/apache/parquet/column/page/Page.java:
##########
@@ -34,6 +34,9 @@ public abstract class Page {
     this.uncompressedSize = uncompressedSize;
   }
 
+  /**
+   * @return the compressed size of the page when the bytes are compressed, otherwise return 0

Review Comment:
   I would rather return uncompressed size if it is not compressed.

##########
parquet-column/src/main/java/org/apache/parquet/column/page/DataPageV2.java:
##########
@@ -163,6 +170,33 @@ public DataPageV2(
     this.isCompressed = isCompressed;
   }
 
+  public DataPageV2(
+      int rowCount,
+      int nullCount,
+      int valueCount,
+      BytesInput repetitionLevels,
+      BytesInput definitionLevels,
+      Encoding dataEncoding,
+      BytesInput data,
+      int compressedSize,
+      int uncompressedSize,
+      Statistics<?> statistics,
+      boolean isCompressed) {
+    super(compressedSize, uncompressedSize, valueCount);
+    if (!isCompressed && compressedSize != 0) {
+      throw new IllegalArgumentException("compressedSize must be 0 if page is not compressed");
+    }

Review Comment:
   IMO, we should assume `compressedSize` == `uncompressedSize` when it is uncompressed. `UNCOMPRESSED` codec is still a valid compression codec type, otherwise, `RowGroup.total_compressed_size` may have problems.

##########
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java:
##########
@@ -2123,6 +2123,9 @@ private PageHeader newDataPageV2Header(
       int dlByteLength) {
     DataPageHeaderV2 dataPageHeaderV2 = new DataPageHeaderV2(
         valueCount, nullCount, rowCount, getEncoding(dataEncoding), dlByteLength, rlByteLength);
+    if (compressedSize == 0) {
+      dataPageHeaderV2.setIs_compressed(false);
+    }

Review Comment:
   Agreed. Data page v2 was designed to adaptively fall back to uncompressed data when compression is not promising (though we don't implement it yet). Using an explicit parameter makes sense.
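
To illustrate the convention suggested in the second review comment (treating `compressedSize` as equal to `uncompressedSize` for an uncompressed page), here is a minimal, self-contained sketch. Everything in it is hypothetical: `PageSizeSketch`, `PageSizes`, `uncompressed`, and `totalCompressedSize` are illustrative names, not parquet-java API, and the check is only one possible way such a rule could be enforced.

```java
import java.util.Arrays;
import java.util.List;

public class PageSizeSketch {

    /** Hypothetical value type holding the sizes recorded for one data page. */
    public static final class PageSizes {
        public final int compressedSize;
        public final int uncompressedSize;
        public final boolean isCompressed;

        public PageSizes(int compressedSize, int uncompressedSize, boolean isCompressed) {
            // Assumed rule from the review comment: an uncompressed page reports
            // its uncompressed size as its compressed size, never 0.
            if (!isCompressed && compressedSize != uncompressedSize) {
                throw new IllegalArgumentException(
                        "compressedSize must equal uncompressedSize if page is not compressed");
            }
            this.compressedSize = compressedSize;
            this.uncompressedSize = uncompressedSize;
            this.isCompressed = isCompressed;
        }
    }

    /** Hypothetical factory for a page stored without compression. */
    public static PageSizes uncompressed(int size) {
        return new PageSizes(size, size, false);
    }

    /** Sums per-page compressed sizes, as a row-group-level total would. */
    public static long totalCompressedSize(List<PageSizes> pages) {
        long total = 0;
        for (PageSizes page : pages) {
            total += page.compressedSize;
        }
        return total;
    }

    public static void main(String[] args) {
        List<PageSizes> pages = Arrays.asList(
                new PageSizes(40, 100, true), // compressed page: 100 bytes stored as 40
                uncompressed(60));            // uncompressed page: 60 bytes stored as 60
        System.out.println(totalCompressedSize(pages)); // prints 100
    }
}
```

In this sketch, recording 0 for the uncompressed page would make the total 40 even though 100 bytes are stored, which is the kind of problem the comment anticipates for `RowGroup.total_compressed_size`.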