wgtmac commented on PR #3556:
URL: https://github.com/apache/parquet-java/pull/3556#issuecomment-4609512333
Thanks for looking into this! While the problem is real, introducing a boolean flag `dictionaryEarlyCheckEnabled` feels like a band-aid fix that pushes the burden to users. Most users won't know when to manually toggle this to prevent storage inflation. Instead of a new config, could we make the heuristic more adaptive? For example, we could delay the compression check until we've accumulated a certain amount of raw data (e.g., `1MB`), or evaluate it over the first `N` pages rather than just the first one. This would solve the issue out of the box without hurting usability. Thoughts?
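To make the suggestion concrete, here is a rough sketch of what an adaptive check could look like. This is not existing parquet-java code: the class `AdaptiveDictionaryCheck`, its `onPage` method, the `Decision` enum, and both thresholds are hypothetical names chosen for illustration. It assumes the writer can report raw and dictionary-encoded byte counts per page plus the current dictionary size, and it decides as soon as either threshold (accumulated raw bytes or page count) is reached.

```java
// Hypothetical sketch, not part of parquet-java.
// Defers the "is dictionary encoding worth it?" decision until enough
// data has been observed, instead of judging on the first page alone.
final class AdaptiveDictionaryCheck {

    enum Decision { UNDECIDED, KEEP_DICTIONARY, FALL_BACK }

    private final long minRawBytes; // assumed threshold, e.g. 1 MB
    private final int minPages;     // assumed threshold, e.g. the first N pages

    private long rawBytes;     // cumulative size of the values if written plain
    private long encodedBytes; // cumulative size of the dictionary-encoded pages
    private int pages;
    private Decision decision = Decision.UNDECIDED;

    AdaptiveDictionaryCheck(long minRawBytes, int minPages) {
        this.minRawBytes = minRawBytes;
        this.minPages = minPages;
    }

    /**
     * Records one finished page and returns the current decision.
     * Once a decision has been made it is final and later pages are ignored.
     */
    Decision onPage(long pageRawBytes, long pageEncodedBytes, long dictionaryBytes) {
        if (decision != Decision.UNDECIDED) {
            return decision;
        }
        rawBytes += pageRawBytes;
        encodedBytes += pageEncodedBytes;
        pages++;

        // Not enough evidence yet: neither threshold has been reached.
        if (rawBytes < minRawBytes && pages < minPages) {
            return Decision.UNDECIDED;
        }

        // Keep the dictionary only if the encoded pages plus the dictionary
        // itself are smaller than the raw data seen so far.
        decision = (encodedBytes + dictionaryBytes < rawBytes)
                ? Decision.KEEP_DICTIONARY
                : Decision.FALL_BACK;
        return decision;
    }
}
```

With something like `new AdaptiveDictionaryCheck(1024 * 1024, 4)`, a single poorly compressing first page would no longer trigger a fallback by itself; the decision would be based on the totals across the first four pages or the first 1 MB of raw data, whichever comes first. The thresholds could stay internal defaults, so users would not need a new flag.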
