Re: [PR] GH-3263: Add c-tor to ColumnReaderBase to allow dictionary reuse [parquet-java]

via GitHub Thu, 07 Aug 2025 00:30:29 -0700


pyckle commented on code in PR #3264:
URL: https://github.com/apache/parquet-java/pull/3264#discussion_r2259391067



##########
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderBase.java:
##########
@@ -445,30 +445,39 @@ void writeValue() {
    * @param converter     a converter that materializes the values in this 
column in the current record
    * @param writerVersion writer version string from the Parquet file being 
read
    */
-  ColumnReaderBase(
+  protected ColumnReaderBase(
       ColumnDescriptor path, PageReader pageReader, PrimitiveConverter 
converter, ParsedVersion writerVersion) {
+    this(path, readDict(path, pageReader), pageReader, converter, 
writerVersion);
+  }
+
+  protected ColumnReaderBase(
+        ColumnDescriptor path, Dictionary dictionary, PageReader 
dataPageReader, PrimitiveConverter converter, ParsedVersion writerVersion) {
     this.path = Objects.requireNonNull(path, "path cannot be null");
-    this.pageReader = Objects.requireNonNull(pageReader, "pageReader cannot be 
null");
+    this.pageReader = Objects.requireNonNull(dataPageReader, "pageReader 
cannot be null");
     this.converter = Objects.requireNonNull(converter, "converter cannot be 
null");
     this.writerVersion = writerVersion;
     this.maxDefinitionLevel = path.getMaxDefinitionLevel();
-    DictionaryPage dictionaryPage = pageReader.readDictionaryPage();

Review Comment:
   The PageReader returns a DictionaryPage which is then decoded in the 
constructor:
   `dictionary = dictionaryPage.getEncoding().initDictionary(path, 
dictionaryPage);`
   As `getEncoding()` returns an enum, there's no way to override the method of 
`initDictionary()` to return a cached instance.
   Adding a `getDictionary()` function to the `PageReader` interface would 
therefore be necessary. In order not to break any downstream `PageReader` 
implementations, a default method must be added, which seems unclean to me, but 
maybe it's OK.
   I implemented this so you can see what it looks like. If you prefer this 
approach, I can add the caching logic to the `ColumnChunkPageReadStore` too.
   Another option could be adding a function `decodeDictionary()` to 
`DictionaryPage`
   Let me know what you think is cleanest.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] GH-3263: Add c-tor to ColumnReaderBase to allow dictionary reuse [parquet-java]

Reply via email to