(parquet-format) branch master updated: Fix errors and grammar in BloomFilter.md and PageIndex.md (#577)

alamb Thu, 04 Jun 2026 04:02:03 -0700

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git



The following commit(s) were added to refs/heads/master by this push:
     new c42c2cb  Fix errors and grammar in BloomFilter.md and PageIndex.md 
(#577)
c42c2cb is described below

commit c42c2cb4ecfccb38153375e24b702a82fd763cc0
Author: Ismaël Mejía <[email protected]>
AuthorDate: Thu Jun 4 13:01:51 2026 +0200

    Fix errors and grammar in BloomFilter.md and PageIndex.md (#577)
    
    BloomFilter.md:
    - Fix block_check pseudocode: setBit -> isSet (checking, not setting)
    - Fix struct name to match thrift (BloomFilterHeader)
    - Include missing bloom_filter_length field in ColumnMetaData snippet
    - Normalize "Bloom filter" casing (proper noun) throughout
    - "64 bits version" -> "the 64-bit version"
    - Normalize "bit set" -> "bitset" (consistent with thrift)
    - Add terminal periods in thrift-style doc comments
    
    PageIndex.md:
    - Fix "Blart Versenwald III" double-quote typo
    - Fix heading capitalization: "page index" -> "Page Index"
    - Fix article: "one data page per the retrieved column" -> "per retrieved 
column"
    - Add missing terminal period after parquet.thrift link
---
 BloomFilter.md | 26 +++++++++++++++++---------
 PageIndex.md   |  8 ++++----
 2 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/BloomFilter.md b/BloomFilter.md
index 98ec685..3b50e54 100644
--- a/BloomFilter.md
+++ b/BloomFilter.md
@@ -122,7 +122,7 @@ boolean block_check(block b, unsigned int32 x) {
   for i in [0..7] {
     for j in [0..31] {
       if (masked.getWord(i).isSet(j)) {
-        if (not b.getWord(i).setBit(j)) {
+        if (not b.getWord(i).isSet(j)) {
           return false
         }
       }
@@ -266,8 +266,8 @@ false positive rates:
 #### File Format
 
 Each multi-block Bloom filter is required to work for only one column chunk. 
The data of a multi-block
-bloom filter consists of the bloom filter header followed by the bloom filter 
bitset. The bloom filter
-header encodes the size of the bloom filter bit set in bytes that is used to 
read the bitset.
+Bloom filter consists of the Bloom filter header followed by the Bloom filter 
bitset. The Bloom filter
+header encodes the size of the Bloom filter bitset in bytes that is used to 
read the bitset.
 
 Here are the Bloom filter definitions in thrift:
 
@@ -282,7 +282,7 @@ union BloomFilterAlgorithm {
 }
 
 /** Hash strategy type annotation. xxHash is an extremely fast 
non-cryptographic hash
- * algorithm. It uses 64 bits version of xxHash. 
+ * algorithm. It uses the 64-bit version of xxHash.
  **/
 struct XxHash {}
 
@@ -307,14 +307,14 @@ union BloomFilterCompression {
   * Bloom filter header is stored at beginning of Bloom filter data of each 
column
   * and followed by its bitset.
   **/
-struct BloomFilterPageHeader {
-  /** The size of bitset in bytes **/
+struct BloomFilterHeader {
+  /** The size of bitset in bytes. **/
   1: required i32 numBytes;
   /** The algorithm for setting bits. **/
   2: required BloomFilterAlgorithm algorithm;
   /** The hash function used for Bloom filter. **/
   3: required BloomFilterHash hash;
-  /** The compression used in the Bloom filter **/
+  /** The compression used in the Bloom filter. **/
   4: required BloomFilterCompression compression;
 }
 
@@ -322,6 +322,14 @@ struct ColumnMetaData {
   ...
   /** Byte offset from beginning of file to Bloom filter data. **/
   14: optional i64 bloom_filter_offset;
+
+  /** Size of Bloom filter data including the serialized header, in bytes.
+   * Added in 2.10 so readers may not read this field from old files and
+   * it can be obtained after the BloomFilterHeader has been deserialized.
+   * Writers should write this field so readers can read the bloom filter
+   * in a single I/O.
+   */
+  15: optional i32 bloom_filter_length;
 }
 
 ```
@@ -339,8 +347,8 @@ information such as the presence of value. Therefore the 
Bloom filter of columns
 data should be encrypted with the column key, and the Bloom filter of other 
(not sensitive) columns
 do not need to be encrypted.
 
-Bloom filters have two serializable modules - the PageHeader thrift structure 
(with its internal
-fields, including the BloomFilterPageHeader `bloom_filter_page_header`), and 
the Bitset. The header
+Bloom filters have two serializable modules - the Bloom filter header (the 
BloomFilterHeader thrift
+structure and its internal fields), and the Bitset. The header
 structure is serialized by Thrift, and written to file output stream; it is 
followed by the
 serialized Bitset.
 
diff --git a/PageIndex.md b/PageIndex.md
index a371c42..58a8ecf 100644
--- a/PageIndex.md
+++ b/PageIndex.md
@@ -17,13 +17,13 @@
   - under the License.
   -->
 
-# Parquet page index: Layout to Support Page Skipping
+# Parquet Page Index: Layout to Support Page Skipping
 
 In Parquet, a *page index* is optional metadata for a
 ColumnChunk, containing statistics for DataPages that can be used
 to skip those pages when scanning in ordered and unordered columns.
 The page index is stored using the OffsetIndex and ColumnIndex structures,
-defined in [`parquet.thrift`](src/main/thrift/parquet.thrift)
+defined in [`parquet.thrift`](src/main/thrift/parquet.thrift).
 
 ## Problem Statement
 In previous versions of the format, Statistics are stored for ColumnChunks in
@@ -37,7 +37,7 @@ data from disk.
 1. Make both range scans and point lookups I/O efficient by allowing direct
    access to pages based on their min and max values. In particular:
     *  A single-row lookup in a row group based on the sort column of that row 
group
-  will only read one data page per the retrieved column.
+  will only read one data page per retrieved column.
     * Range scans on the sort column will only need to read the exact data 
       pages that contain relevant data.
     * Make other selective scans I/O efficient: if we have a very selective
@@ -81,7 +81,7 @@ Some observations:
 * We store lower and upper bounds for the values of each page. These may be the
   actual minimum and maximum values found on a page, but can also be (more
   compact) values that do not exist on a page. For example, instead of storing
-  ""Blart Versenwald III", a writer may set `min_values[i]="B"`,
+  `"Blart Versenwald III"`, a writer may set `min_values[i]="B"`,
   `max_values[i]="C"`. This allows writers to truncate large values and writers
   should use this to enforce some reasonable bound on the size of the index
   structures.

(parquet-format) branch master updated: Fix errors and grammar in BloomFilter.md and PageIndex.md (#577)

Reply via email to