JFinis opened a new pull request, #514:
URL: https://github.com/apache/parquet-format/pull/514

   This commit is a combination of the following PRs:
   
   * Introduce IEEE 754 total order 
https://github.com/apache/parquet-format/pull/221
   * Add nan_count to handle NaNs in statistics 
https://github.com/apache/parquet-format/pull/196
   
   Both these PRs try to solve the same problems; read the description of the 
respective PRs for explanation.
   
   This PR is the result of an extended discussion in which it was repeatedly 
brought up that combining the two approaches could be another possible 
solution to the problem. Please refer to the discussion on the mailing list 
and in the two PRs for details. The mailing list discussion can be found here:
   https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s
   
   The contents of this PR are basically a straightforward combination of the 
two approaches:
   * IEEE total order is introduced as a new order for floating point types
   * nan_count and nan_counts fields are added
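
To illustrate the first bullet, here is a minimal Python sketch of a sort key that orders 64-bit doubles according to IEEE 754 total order. The function name `total_order_key` is hypothetical and is not taken from either PR; it only shows the well-known bit trick of reinterpreting the double as a signed integer and flipping the non-sign bits of negative values.

```python
import struct


def total_order_key(x: float) -> int:
    """Return an integer whose natural order matches IEEE 754 total order.

    Hypothetical helper for illustration only.
    """
    # Reinterpret the 64 bits of the double as a signed 64-bit integer.
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # For values with the sign bit set, a larger magnitude must sort lower,
    # so flip the 63 non-sign bits.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits
```

With this key, negative NaNs sort before -inf, -0.0 sorts before +0.0, and positive NaNs sort after +inf, so every value, including NaN, has a well-defined position.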
   
   Legacy writers may not write the nan_count(s) fields, so readers have to 
handle them being absent. Legacy writers may also have included NaNs in the 
min/max bounds, so readers have to handle that as well.
   
   As there are no legacy writers writing IEEE total order, nan_count(s) are 
defined to be mandatory if this order is used, so readers can assume their 
presence when this order is used.
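
The reader-side consequences of the two paragraphs above can be sketched as follows. This is an illustration under stated assumptions, not code from the PR: `ColumnStats`, `may_contain_nan`, and `usable_bounds` are hypothetical names, treating a missing nan_count under total order as an error is an assumption, and so is the choice to discard legacy bounds that contain a NaN.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ColumnStats:
    """Hypothetical reader-side view of the statistics of one column chunk."""
    min_value: Optional[float]
    max_value: Optional[float]
    nan_count: Optional[int]  # None when the writer did not set the field
    total_order: bool         # True if the column uses IEEE 754 total order


def may_contain_nan(stats: ColumnStats) -> bool:
    """Decide whether NaNs may be present in the data."""
    if stats.nan_count is not None:
        return stats.nan_count > 0
    if stats.total_order:
        # Assumption: nan_count is mandatory here, so absence means a bad file.
        raise ValueError("nan_count is required with IEEE 754 total order")
    # Legacy writer: nothing is known, so be conservative.
    return True


def usable_bounds(stats: ColumnStats) -> Optional[Tuple[float, float]]:
    """Return (min, max) if the bounds can be trusted, else None."""
    lo, hi = stats.min_value, stats.max_value
    if lo is None or hi is None:
        return None
    # Assumption: legacy bounds that contain a NaN are not usable for pruning.
    if not stats.total_order and (math.isnan(lo) or math.isnan(hi)):
        return None
    return lo, hi
```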
   
   This commit removes `nan_pages` from the ColumnIndex, which the nan_counts 
PR mandated. They are no longer needed, because IEEE total order solves this: 
since this is a new order, there are no legacy writers or readers, so we are 
free to define that, for only-NaN pages using this order, writers write NaNs 
into the min & max bounds, and readers can assume that isNaN(min) signals an 
only-NaN page.
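
As a small illustration of the isNaN(min) convention, the sketch below picks the pages a reader would still need to scan when looking for non-NaN values. It assumes the column uses IEEE total order and that `page_mins` holds the per-page min values from the ColumnIndex; the function name is hypothetical.

```python
import math
from typing import List


def pages_with_non_nan_values(page_mins: List[float]) -> List[int]:
    """Return the indexes of pages that are not only-NaN pages.

    Assumes IEEE 754 total order, where a NaN min marks an only-NaN page.
    """
    return [i for i, page_min in enumerate(page_mins) if not math.isnan(page_min)]
```

For example, with page mins `[1.0, nan, -3.5]`, pages 0 and 2 are kept and page 1 is skipped.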
   
   Note: If this PR is adopted, I will update the commit message to state the 
problem directly instead of just pointing to the other PRs. Until then, I 
would rather link to them than repeat myself, in case this PR isn't chosen 
anyway.
   
   
   ### Rationale for this change
   
   
   ### What changes are included in this PR?
   
   
   ### Do these changes have PoC implementations?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


