Re: [PR] GH-3263: Add c-tor to ColumnReaderBase to allow dictionary reuse [parquet-java]

via GitHub Thu, 07 Aug 2025 03:40:17 -0700


pyckle commented on code in PR #3264:
URL: https://github.com/apache/parquet-java/pull/3264#discussion_r2259882935



##########
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderBase.java:
##########
@@ -445,30 +445,39 @@ void writeValue() {
    * @param converter     a converter that materializes the values in this 
column in the current record
    * @param writerVersion writer version string from the Parquet file being 
read
    */
-  ColumnReaderBase(
+  protected ColumnReaderBase(
       ColumnDescriptor path, PageReader pageReader, PrimitiveConverter 
converter, ParsedVersion writerVersion) {
+    this(path, readDict(path, pageReader), pageReader, converter, 
writerVersion);
+  }
+
+  protected ColumnReaderBase(
+        ColumnDescriptor path, Dictionary dictionary, PageReader 
dataPageReader, PrimitiveConverter converter, ParsedVersion writerVersion) {
     this.path = Objects.requireNonNull(path, "path cannot be null");
-    this.pageReader = Objects.requireNonNull(pageReader, "pageReader cannot be 
null");
+    this.pageReader = Objects.requireNonNull(dataPageReader, "pageReader 
cannot be null");
     this.converter = Objects.requireNonNull(converter, "converter cannot be 
null");
     this.writerVersion = writerVersion;
     this.maxDefinitionLevel = path.getMaxDefinitionLevel();
-    DictionaryPage dictionaryPage = pageReader.readDictionaryPage();

Review Comment:
   I see your point with the cleanliness. This prevents a dictionary with the 
wrong encoding from being passed in by mistake.
   Looking at `ColumnChunkPageReadStore.ColumnChunkPageReader`, I am no longer 
certain that caching the decoded Dictionary is a safe change, as it will 
increase memory pressure if `PageReaders` are stored in memory with no active 
reader. This is due to the fact that `ColumnChunkPageReader` stores the pages 
with compression, and decompresses when it needs to be read.
   As such, I think it's safer to merge in its current state.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] GH-3263: Add c-tor to ColumnReaderBase to allow dictionary reuse [parquet-java]

Reply via email to