(parquet-format) branch master updated: Fix typos, grammar, and consistency in Encryption, Contributing, and BinaryProtocolExtensions docs (#578)

gangwu Mon, 08 Jun 2026 08:15:36 -0700

This is an automated email from the ASF dual-hosted git repository.

wgtmac pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git



The following commit(s) were added to refs/heads/master by this push:
     new 6be6b91  Fix typos, grammar, and consistency in Encryption, Contributing, and BinaryProtocolExtensions docs (#578)
6be6b91 is described below

commit 6be6b914c28ddd2f06247a50b67549eb7bc0d170
Author: Ismaël Mejía <[email protected]>
AuthorDate: Mon Jun 8 17:14:48 2026 +0200

    Fix typos, grammar, and consistency in Encryption, Contributing, and BinaryProtocolExtensions docs (#578)
    
    Encryption.md:
    - Fix double-negative; align GCM invocation limit to NIST
    - "Data PageHeader" -> "Data Page Header" (spacing consistency)
    - Replace "allows to" with idiomatic English
    - Fix smart quotes to ASCII for magic-bytes literal
    - Remove double spaces; fix "the the FileMetaData"
    - "explictly" -> "explicitly"
    - Hyphenate compound adjectives ("2 byte short" -> "2-byte short")
    - Fix section heading numbering ("## 5 File Format" -> "## 5. File Format")
    - Fix mass noun article ("from a secret data" -> "from secret data")
    
    CONTRIBUTING.md:
    - Fix 7 typos: docuemnt, demostrate, interopability, libaries,
      highlighed, compatiblity, an prototype
    - Fix possessive: "features desirability" -> "a feature's desirability"
    - Fix agreement: "an external dependencies" -> "an external dependency"
    - Add commas after introductory clauses
    - Fix comma splice -> semicolon
    
    BinaryProtocolExtensions.md:
    - Fix "FileMetadata" -> "FileMetaData" (4 occurrences; match thrift struct)
    - Fix "Flatbuffers"/"flatbuffer" -> "FlatBuffers" (5 occurrences; official capitalization)
    - Fix "implementers which" -> "implementers who" (people)
    - Fix missing copula: "extension shared" -> "extension is shared"
---
 BinaryProtocolExtensions.md | 28 +++++++++++++++----------
 CONTRIBUTING.md             | 28 ++++++++++++-------------
 Encryption.md               | 50 +++++++++++++++++++++++----------------------
 3 files changed, 57 insertions(+), 49 deletions(-)

diff --git a/BinaryProtocolExtensions.md b/BinaryProtocolExtensions.md
index e23d332..49dba3b 100644
--- a/BinaryProtocolExtensions.md
+++ b/BinaryProtocolExtensions.md
@@ -26,11 +26,17 @@ The extension mechanism of the `binary` Thrift field-id 
`32767` has some desirab
 * The content of the extension is freeform and can be encoded in any format. 
This format is not restricted to Thrift.  
 * Extensions can be appended to existing Thrift serialized structs [without 
requiring Thrift libraries](#appending-extensions-to-thrift) for manipulation 
(or changes to the thrift IDL).
 
-Because only one field-id is reserved the extension bytes themselves require 
disambiguation; otherwise readers will not be able to decode extensions safely. 
This is left to implementers which MUST put enough unique state in their 
extension bytes for disambiguation. This can be relatively easily achieved by 
adding a [UUID](https://en.wikipedia.org/wiki/Universally\_unique\_identifier) 
at the start or end of the extension bytes. The extension does not specify a 
disambiguation mechanism to  [...]
+Because only one field-id is reserved the extension bytes themselves require
+disambiguation; otherwise readers will not be able to decode extensions safely.
+This is left to implementers who MUST put enough unique state in their 
extension
+bytes for disambiguation. This can be relatively easily achieved by adding a
+[UUID](https://en.wikipedia.org/wiki/Universally\_unique\_identifier) at the
+start or end of the extension bytes. The extension does not specify a
+disambiguation mechanism to allow more flexibility to implementers.
 
 Putting everything together in an example, if we would extend `FileMetaData` 
it would look like this on the wire.
 
-    N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift 
stop field)
+    N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift 
stop field)
     4 bytes   | 08 FF FF 01 (long form header for 32767: binary)
     1-5 bytes | ULEB128(M) encoded size of the extension
     M bytes   | extension bytes
@@ -50,14 +56,14 @@ To illustrate the applicability of the extension mechanism 
we provide examples o
 
 ### Footer
 
-A variant of `FileMetaData` encoded in Flatbuffers is introduced. This variant 
is more performant and can scale to very wide tables, something that current 
Thrift `FileMetaData` struggles with.
+A variant of `FileMetaData` encoded in FlatBuffers is introduced. This variant 
is more performant and can scale to very wide tables, something that current 
Thrift `FileMetaData` struggles with.
 
 In its private form the footer of a Parquet file will look like so:
 
-    N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift 
stop field)
+    N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift 
stop field)
     4 bytes   | 08 FF FF 01 (long form header for 32767: binary)
     1-5 bytes | ULEB128(K+28) encoded size of the extension
-    K bytes   | Flatbuffers representation (v0) of FileMetaData
+    K bytes   | FlatBuffers representation (v0) of FileMetaData
     4 bytes   | little-endian crc32(flatbuffer)
     4 bytes   | little-endian size(flatbuffer)
     4 bytes   | little-endian crc32(size(flatbuffer))
@@ -67,20 +73,20 @@ In its private form the footer of a Parquet file will look 
like so:
 
 some-UUID is some UUID picked for this extension and it is used throughout 
(possibly internal) experimentation. It is put at the end to allow detection of 
the extension when parsed in reverse. The little-endian sizes and crc32s are 
also to the end to facilitate efficient parsing the footer in reverse without 
requiring parsing the Thrift compact protocol that precedes it.
 
-At some point the experiments conclude and the extension shared publicly with 
the community. The extension is proposed for inclusion to the standard with a 
migration plan to replace the existing `FileMetaData`.
+At some point the experiments conclude and the extension is shared publicly 
with the community. The extension is proposed for inclusion to the standard 
with a migration plan to replace the existing `FileMetaData`.
 
-The community reviews the proposal and (potentially) proposes changes to the 
Flatbuffers IDL representation. In addition, because this extension is a 
*replacement* of an existing struct, it must:
+The community reviews the proposal and (potentially) proposes changes to the 
FlatBuffers IDL representation. In addition, because this extension is a 
*replacement* of an existing struct, it must:
 
 1. have some way of being extended in the future much like what it replaces. 
Because the extension mechanism only allows for a single extension, without 
this in place we cannot have footer extensions during the migration.  
 2. consider its intermediate form where both the **Thrift** `FileMetaData` and 
the **FlatBuffers** `FileMetaData` will be present.  
 3. consider its final form where the long form header for `32767: binary` may 
not be present.
 
-Once the design is ratified the new `FileMetaData` encoding is made final with 
the following migration plan. For the next N years writers will write both the 
Thrift and the flatbuffer `FileMetaData`. It will look much like its private 
form except the flatbuffer IDL may be different:
+Once the design is ratified the new `FileMetaData` encoding is made final with 
the following migration plan. For the next N years writers will write both the 
Thrift and the FlatBuffers `FileMetaData`. It will look much like its private 
form except the FlatBuffers IDL may be different:
 
-    N-1 bytes | Thrift compact protocol encoded FileMetadata (minus \0 thrift 
stop field)
+    N-1 bytes | Thrift compact protocol encoded FileMetaData (minus \0 thrift 
stop field)
     4 bytes   | 08 FF FF 01 (long form header for 32767: binary)
     1-5 bytes | ULEB128(K+28) encoded size of the extension
-    K bytes   | Flatbuffers representation (v1) of FileMetaData
+    K bytes   | FlatBuffers representation (v1) of FileMetaData
     4 bytes   | little-endian crc32(flatbuffer)
     4 bytes   | little-endian size(flatbuffer)
     4 bytes   | little-endian crc32(size(flatbuffer))
@@ -90,7 +96,7 @@ Once the design is ratified the new `FileMetaData` encoding 
is made final with t
 
 After the migration period, the end of the Parquet file may look like this:
 
-    K bytes   | Flatbuffers representation (v1) of FileMetaData
+    K bytes   | FlatBuffers representation (v1) of FileMetaData
     4 bytes   | little-endian crc32(flatbuffer)
     4 bytes   | little-endian size(flatbuffer)
     4 bytes   | little-endian crc32(size(flatbuffer))
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index d6049a8..f9fdf21 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -43,10 +43,10 @@ The general steps for adding features to the format are as 
follows:
 1. Design/scoping: The goal of this phase is to identify design goals of a
    feature and provide some demonstration that the feature meets those goals.
    This phase starts with a discussion of changes on the developer mailing list
-   ([email protected]). Depending on the scope and goals of the feature 
the
-   it can be useful to provide additional artifacts as part of a discussion. 
The
-   artifacts can include a design docuemnt, a draft pull request to make the
-   discussion concrete and/or an prototype implementation to demostrate the
+   ([email protected]). Depending on the scope and goals of the feature, 
it
+   can be useful to provide additional artifacts as part of a discussion. The
+   artifacts can include a design document, a draft pull request to make the
+   discussion concrete and/or a prototype implementation to demonstrate the
    viability of implementation. This step is complete when there is lazy
    consensus. Part of the consensus is whether it is sufficient to provide two
    working implementations as outlined in step 2, or if demonstration of the
@@ -58,7 +58,7 @@ The general steps for adding features to the format are as 
follows:
 2. Completeness: The goal of this phase is to ensure the feature is viable,
    there is no ambiguity in its specification by demonstrating compatibility
    between implementations. Once a change has lazy consensus, two
-   implementations of the feature demonstrating interopability must also be
+   implementations of the feature demonstrating interoperability must also be
    provided.  One implementation MUST be
    [`parquet-java`](http://github.com/apache/parquet-java).  It is preferred
    that the second implementation be
@@ -73,21 +73,21 @@ The general steps for adding features to the format are as 
follows:
    fit for inclusion (for example, they were submitted as a pull request 
against
    the target repository and committers gave positive reviews). Reports on the
    benefits from closed source implementations are welcome and can help lend
-   weight to features desirability but are not sufficient for acceptance of a
+   weight to a feature's desirability but are not sufficient for acceptance of 
a
    new feature.
 
 Unless otherwise discussed, it is expected the implementations will be 
developed
 from their respective main branch (i.e. backporting is not required), to
 demonstrate that the feature is mergeable to its implementation.
 
-3. Ratification: After the first two steps are complete a formal vote is held 
on
+3. Ratification: After the first two steps are complete, a formal vote is held 
on
    [email protected] to officially ratify the feature.  After the vote
-   passes the format change is merged into the `parquet-format` repository and
+   passes, the format change is merged into the `parquet-format` repository and
    it is expected the changes from step 2 will also be merged soon after
    (implementations should not be merged until the addition has been merged to
    `parquet-format`).
 
-#### General guidelines/preferences on additions.
+#### General guidelines/preferences on additions
 
 1. To the greatest extent possible changes should have an option for forward
    compatibility (old readers can still read files). The [compatibility and
@@ -95,13 +95,13 @@ demonstrate that the feature is mergeable to its 
implementation.
    provides more details on expectations for changes that break compatibility.
 
 2. New encodings should be fully specified in this repository and not
-   rely on an external dependencies for implementation (i.e. `parquet-format` 
is
+   rely on external dependencies for implementation (i.e. `parquet-format` is
    the source of truth for the encoding). If it does require an
    external dependency, then the external dependency must have its
    own specification separate from implementation.
 
 3. New compression mechanisms should have a pure Java implementation that can 
be
-   used as a dependency in `parquet-java`, exceptions may be
+   used as a dependency in `parquet-java`; exceptions may be
    discussed on the mailing list to see if a non-native Java
    implementation is acceptable.
 
@@ -154,7 +154,7 @@ recommendations for managing features:
 2. Forward compatible features/changes may be enabled and used by default in
    implementations once the parquet-format containing those changes has been
    formally released.  For features that may pose a significant performance
-   regression to older format readers, libaries should consider delaying 
default
+   regression to older format readers, libraries should consider delaying 
default
    enablement until 1 year after the release of the parquet-java implementation
    that contains the feature implementation.
 
@@ -162,7 +162,7 @@ recommendations for managing features:
    until 2 years after the parquet-java implementation containing the feature 
is
    released. It is recommended that changing the default value for a forward
    incompatible feature flag should be clearly advertised to consumers (e.g. 
via
-   a major version release if using Semantic Versioning, or highlighed in
+   a major version release if using Semantic Versioning, or highlighted in
    release notes).
 
 For forward compatible changes which have a high chance of performance
@@ -174,7 +174,7 @@ the same timelines as `parquet-java`. Parquet-java will 
wait to enable features
 by default until the most conservative timelines outlined above have been
 exceeded. This timeline is an attempt to balance ensuring
 new features make their way into the ecosystem and avoiding
-breaking compatiblity for readers that are slower to adopt new standards. We
+breaking compatibility for readers that are slower to adopt new standards. We
 encourage earlier adoption of new features when an organization using Parquet
 can guarantee that all readers of the parquet files they produce can read a new
 feature.
diff --git a/Encryption.md b/Encryption.md
index 180b9aa..d3c8c9f 100644
--- a/Encryption.md
+++ b/Encryption.md
@@ -79,7 +79,7 @@ in order to verify its integrity. New footer fields keep an
 information about the file encryption algorithm and the footer signing key.
 
 For encrypted columns, the following modules are always encrypted, with the 
same column key: 
-pages and  page headers (both dictionary and data), column indexes, offset 
indexes, bloom filter 
+pages and page headers (both dictionary and data), column indexes, offset 
indexes, bloom filter 
 headers and bitsets.  If the 
 column key is different from the footer encryption key, the column metadata is 
serialized 
 separately and encrypted with the column key. In this case, the column 
metadata is also 
@@ -101,7 +101,7 @@ other on a combination of GCM and CTR modes.
 AES GCM is an authenticated encryption. Besides the data confidentiality 
(encryption), it 
 supports two levels of integrity verification (authentication): of the data 
(default), 
 and of the data combined with an optional AAD (“additional authenticated 
data”). The 
-authentication allows to make sure the data has not been tampered with. An AAD 
+authentication makes it possible to verify that the data has not been tampered 
with. An AAD
 is a free text to be authenticated, together with the data. The user can, for 
example, pass the 
 file name with its version (or creation timestamp) as an AAD input, to verify 
that the 
 file has not been replaced with an older version. The details on how Parquet 
creates 
@@ -136,9 +136,10 @@ one IV is ever repeated, then the implementation may be 
vulnerable"*. *"Complian
 requirement is crucial to the security of GCM"*.
 
 The bulk of modules in a Parquet file are page headers and data pages. 
Therefore, one encryption 
-key shall not not be used for more than 2^31 (~2 billion) pages. In Parquet 
files encrypted with 
-multiple keys (footer and column keys), the constraint on the number of 
invocations is applied 
-to each key separately.
+key shall not be used for more than 2^32 total module encryptions, as per the 
NIST specification.
+Since each data page requires two module encryptions (header + data), this 
means in practice no
+more than 2^31 pages per key. In Parquet files encrypted with multiple keys 
(footer and column
+keys), the constraint on the number of invocations is applied to each key 
separately.
 
 When running in the context of a larger system, any particular Parquet writer 
implementation likely
 does not have sufficient context to enforce key invocation limits system-wide. 
Therefore,
@@ -161,8 +162,9 @@ tag used to verify the ciphertext and AAD integrity.
 
 
 #### 4.2.2 AES_GCM_CTR_V1
+
 In this Parquet algorithm, all modules except pages are encrypted with the GCM 
cipher, as described 
-above. The pages are encrypted by the CTR cipher without padding. This allows 
to encrypt/decrypt 
+above. The pages are encrypted by the CTR cipher without padding. This makes 
it possible to encrypt/decrypt
 the bulk of the data faster, while still verifying the metadata integrity and 
making 
 sure the file has not been replaced with a wrong version. However, tampering 
with the 
 page data might go unnoticed. The AES CTR cipher
@@ -208,7 +210,7 @@ it can't prevent replacement of one ciphertext with another 
(encrypted with the
 Parquet modular encryption leverages AADs to protect against swapping 
ciphertext modules (encrypted 
 with AES GCM) inside a file or between files. Parquet can also protect against 
swapping full 
 files - for example, replacement of a file with an old version, or replacement 
of one table 
-partition with another. AADs are built to reflects the identity of a file and 
of the modules 
+partition with another. AADs are built to reflect the identity of a file and 
of the modules 
 inside the file. 
 
 Parquet constructs a module AAD from two components: an optional AAD prefix - 
a string provided 
@@ -221,12 +223,12 @@ group 1. The module AAD is a direct concatenation of the 
prefix and suffix parts
 
 #### 4.4.1 AAD prefix
 File swapping can be prevented by an AAD prefix string, that uniquely 
identifies the file and 
-allows to differentiate it e.g. from older versions of the file or from other 
partition files in the same 
+makes it possible to differentiate it e.g. from older versions of the file or 
from other partition files in the same
 data set (table). This string is optionally passed by a writer upon file 
creation. If provided,
 the AAD prefix is stored in an `aad_prefix` field in the file, and is made 
available to the readers. 
 This field is not encrypted. If a user is concerned about keeping the file 
identity inside the file, 
 the writer code can explicitly request Parquet not to store the AAD prefix. 
Then the aad_prefix field 
-will be empty; AAD prefixes must be fully managed by the caller code and 
supplied explictly to Parquet 
+will be empty; AAD prefixes must be fully managed by the caller code and 
supplied explicitly to Parquet 
 readers for each file.
 
 The protection against swapping full files is optional. It is not enabled by 
default because 
@@ -246,15 +248,15 @@ of all partition files (prefixes) from 0 to N-1.
    
 #### 4.4.2 AAD suffix
 The suffix part of a module AAD protects against module swapping inside a 
file. It also protects against 
-module swapping between files  - in situations when an encryption key is 
re-used in multiple files and the 
+module swapping between files - in situations when an encryption key is 
re-used in multiple files and the 
 writer has not provided a unique AAD prefix for each file. 
 
 Unlike AAD prefix, a suffix is built internally by Parquet, by direct 
concatenation of the following parts: 
 1.     [All modules] internal file identifier - a random byte array generated 
for each file (implementation-defined length)
 2.     [All modules] module type (1 byte)
-3.     [All modules except footer] row group ordinal (2 byte short, little 
endian)
-4.     [All modules except footer] column ordinal (2 byte short, little endian)
-5.     [Data page and header only] page ordinal (2 byte short, little endian)
+3.     [All modules except footer] row group ordinal (2-byte short, 
little-endian)
+4.     [All modules except footer] column ordinal (2-byte short, little-endian)
+5.     [Data page and header only] page ordinal (2-byte short, little-endian)
 
 The following module types are defined:  
 
@@ -262,8 +264,8 @@ The following module types are defined:
    * ColumnMetaData (1)
    * Data Page (2)
    * Dictionary Page (3)
-   * Data PageHeader (4)
-   * Dictionary PageHeader (5)
+   * Data Page Header (4)
+   * Dictionary Page Header (5)
    * ColumnIndex (6)
    * OffsetIndex (7)
    * BloomFilter Header (8)
@@ -276,8 +278,8 @@ The following module types are defined:
 | ColumnMetaData       |       yes        |   yes (1)   |        yes        |  
    yes       |     no      |
 | Data Page            |       yes        |   yes (2)   |        yes        |  
    yes       |     yes     |
 | Dictionary Page      |       yes        |   yes (3)   |        yes        |  
    yes       |     no      |
-| Data PageHeader      |       yes        |   yes (4)   |        yes        |  
    yes       |     yes     |
-| Dictionary PageHeader|       yes        |   yes (5)   |        yes        |  
    yes       |     no      |
+| Data Page Header     |       yes        |   yes (4)   |        yes        |  
    yes       |     yes     |
+| Dictionary Page Header|      yes        |   yes (5)   |        yes        |  
    yes       |     no      |
 | ColumnIndex          |       yes        |   yes (6)   |        yes        |  
    yes       |     no      |
 | OffsetIndex          |       yes        |   yes (7)   |        yes        |  
    yes       |     no      |
 | BloomFilter Header   |       yes        |   yes (8)   |        yes        |  
    yes       |     no      |
@@ -285,7 +287,7 @@ The following module types are defined:
 
 
 
-## 5 File Format
+## 5. File Format
 
 ### 5.1 Encrypted module serialization
 All modules, except column pages, are encrypted with the GCM cipher. In the 
AES_GCM_V1 algorithm, 
@@ -392,7 +394,7 @@ struct ColumnChunk {
 
 ### 5.3 Protection of sensitive metadata
 The Parquet file footer, and its nested structures, contain sensitive 
information - ranging 
-from a secret data (column statistics) to other information that can be 
exploited by an 
+from secret data (column statistics) to other information that can be 
exploited by an 
 attacker (e.g. schema, num_values, key_value_metadata, encoding 
 and crypto_metadata). This information is automatically protected when the 
footer and 
 secret columns are encrypted with the same key. In other cases - when 
column(s) and the 
@@ -408,7 +410,7 @@ field in the `ColumnChunk`.
 struct ColumnChunk {
 ...
   
-  /** Column metadata for this chunk.. **/
+  /** Column metadata for this chunk **/
   3: optional ColumnMetaData meta_data
 ..
   /** Crypto metadata of encrypted columns **/
@@ -439,7 +441,7 @@ little endian integer, followed by a final magic string, 
"PARE". The same magic
 written at the beginning of the file (offset 0). Parquet readers start file 
parsing by 
 reading and checking the magic string. Therefore, the encrypted footer mode 
uses a new 
 magic string ("PARE") in order to instruct readers to look for a file crypto 
metadata 
-before the footer - and also to immediately inform legacy readers (expecting 
‘PAR1’ 
+before the footer - and also to immediately inform legacy readers (expecting 
"PAR1" 
 bytes) that they can’t parse this file.
 
 ```c
@@ -490,14 +492,14 @@ The plaintext footer is signed in order to prevent 
tampering with the
 structure with the 
 AES GCM algorithm - using a footer signing key, and an AAD constructed 
according to the instructions 
 of the section 4.4. Only the nonce and GCM tag are stored in the file – as a 
28-byte 
-fixed-length array, written right after  the footer itself. The ciphertext is 
not stored, 
+fixed-length array, written right after the footer itself. The ciphertext is 
not stored, 
 because it is not required for footer integrity verification by readers.
 
 | nonce (12 bytes) |  tag (16 bytes) |
 |------------------|-----------------|
 
 
-The plaintext footer mode sets the following fields in the the FileMetaData 
structure:
+The plaintext footer mode sets the following fields in the FileMetaData 
structure:
 
 ```c
 struct FileMetaData {
@@ -522,7 +524,7 @@ The 28-byte footer signature is written after the plaintext 
footer, followed by
 that contains the combined length of the footer and its signature. A final 
magic string, 
 "PAR1", is written at the end of the 
 file. The same magic string is written at the beginning of the file (offset 
0). The magic bytes 
-for plaintext footer mode are ‘PAR1’ to allow legacy readers to read 
projections of the file 
+for plaintext footer mode are "PAR1" to allow legacy readers to read 
projections of the file 
 that do not include encrypted columns.
 
  ![File Layout - Encrypted footer](doc/images/FileLayoutEncryptionPF.png)
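
Editorial note, not part of the commit: the AAD suffix layout from section 4.4.2, quoted in the Encryption.md hunk above, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation. The function names, the 8-byte file identifier used in the usage note, and the `FOOTER = 0` code are assumptions (the quoted hunk begins at ColumnMetaData (1), and the identifier length is implementation-defined).

```python
import struct

# Module type codes as listed in the quoted section 4.4.2 hunk.
# FOOTER = 0 is an assumption: the hunk shown starts at ColumnMetaData (1).
FOOTER = 0
COLUMN_META_DATA = 1
DATA_PAGE = 2
DICTIONARY_PAGE = 3
DATA_PAGE_HEADER = 4
DICTIONARY_PAGE_HEADER = 5


def build_aad_suffix(file_id, module_type, row_group=None, column=None, page=None):
    """Concatenate the AAD suffix parts in the order the numbered list gives.

    Hypothetical helper: ordinals are packed as 2-byte little-endian shorts,
    and the module type as a single byte.
    """
    parts = [file_id, struct.pack("<B", module_type)]
    if module_type != FOOTER:
        # Row group and column ordinals apply to all modules except the footer.
        parts.append(struct.pack("<h", row_group))
        parts.append(struct.pack("<h", column))
    if module_type in (DATA_PAGE, DATA_PAGE_HEADER):
        # The page ordinal applies to data pages and their headers only.
        parts.append(struct.pack("<h", page))
    return b"".join(parts)


def build_module_aad(aad_prefix, suffix):
    """The module AAD is a direct concatenation of the prefix and the suffix."""
    return aad_prefix + suffix
```

For example, with a hypothetical 8-byte file identifier, `build_aad_suffix(b"\xaa" * 8, DATA_PAGE, 1, 2, 3)` yields the identifier followed by the bytes `02 01 00 02 00 03 00`, while a dictionary page in the same column chunk omits the final two page-ordinal bytes.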
