Re: [PR] Document Parquet Features by Version [parquet-site]

via GitHub Fri, 05 Jun 2026 17:01:41 -0700


emkornfield commented on code in PR #186:
URL: https://github.com/apache/parquet-site/pull/186#discussion_r3365981105



##########
content/en/docs/File Format/versions.md:
##########
@@ -0,0 +1,267 @@
+---
+title: "Parquet format versions"
+linkTitle: "Format versions"
+weight: 9
+---
+
+This page describes how features are added to the [Parquet format
+specification](https://github.com/apache/parquet-format) and how they affect
+reader and writer compatibility. See the
+[Implementation status](../implementationstatus/) page for which implementations
+(arrow, parquet-java, arrow-rs, etc.) support each feature.
+
+*Note*: If you find out-of-date information, please open an issue or pull request.
+
+## Feature compatibility
+
+The Parquet format spec [classifies changes] by their effect on reader and
+writer compatibility. Changes differ in their *forward* compatibility — whether
+an older reader can read files that use a newer feature.
+
+**Forward compatible** features remain **readable by older readers**, with a
+possibly degraded experience: some metadata may be missing or performance may
+suffer, but the reader does not fail. Examples:
+
+* **Bloom filters**: a reader that ignores them skips the pruning metadata but
+  still reads the data correctly.
+* **Logical type annotations** such as `VARIANT`: an older reader reads the
+  underlying physical column (e.g. `BYTE_ARRAY`) as raw bytes without applying
+  the logical type.
+
+**Forward incompatible** features make the data **unreadable** to older software.
+Examples:
+
+* **New encodings** (e.g. the `DELTA_*` encodings, `BYTE_STREAM_SPLIT`,
+  `RLE_DICTIONARY`): a reader that does not implement them cannot decode the
+  column values.
+* **Data Page V2 headers**: a reader that only understands `DataPageHeader`
+  cannot parse `DataPageHeaderV2` pages.
+
+[classifies changes]: https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#compatibility-and-feature-enablement
+
+## `FileMetadata` version field
+
+Each Parquet file has a `version` field in the [`thrift FileMetadata`] that
+declares which features the file may use, and thus what a reader **must** support

Review Comment:
   I think my issue is the linear nature of the versioning here and what it 
imposes on readers.  In a perfect world, every reader would implement 
everything it needs as soon as it is released.  But this means a reader needs 
to move in lock-step with the major version of the header.  For example, let's 
say we release the following features:
   1. Backward incompatible feature that isn't strictly better for everyone 
(V3)
   2. Awesome new encoding that a lot more people care about (V4)
   
   By this spec, any writer using the new encoding would need to write V4.  
This gives readers two choices:
   1. Cheat and try to read the data anyway.  This makes the version less 
useful in general, and I think it is one of the reasons some writers always 
wrote "1": readers were capable of reading new encodings (and of pretty 
cleanly detecting when they couldn't), so people ignored the guidance.
   2. Implement both V3 and V4 before they get the benefit of V4.  This might 
mean much longer delays, given that a lot of Parquet implementations are 
volunteer driven.
   
   Is there a way to reconcile this?
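
To make the scenario above concrete, here is a minimal Python sketch that compares a linear version check with a per-feature check. Everything in it is hypothetical and for illustration only: the feature names, the `FileInfo` and `Reader` types, and the idea of a file listing the features it uses are assumptions, not part of the Parquet spec or of any implementation. It also assumes that a reader which skipped the V3 feature can claim support for version 2 at most.

```python
from dataclasses import dataclass

# Hypothetical names for the two features in the example above.
V3_FEATURE = "v3_incompatible_feature"
V4_ENCODING = "v4_new_encoding"


@dataclass(frozen=True)
class FileInfo:
    version: int              # linear version declared by the writer
    features_used: frozenset  # features the file actually uses (assumed to be known)


@dataclass(frozen=True)
class Reader:
    max_version: int          # highest linear version the reader fully implements
    supported: frozenset      # individual features the reader implements


def can_read_linear(reader: Reader, file: FileInfo) -> bool:
    """Linear rule: the reader must implement everything up to the file's version."""
    return file.version <= reader.max_version


def can_read_by_feature(reader: Reader, file: FileInfo) -> bool:
    """Feature rule: the reader only needs the features the file actually uses."""
    return file.features_used <= reader.supported


# A file that uses only the new encoding still has to declare version 4.
file = FileInfo(version=4, features_used=frozenset({V4_ENCODING}))

# A reader that implemented the V4 encoding but skipped the V3 feature.
reader = Reader(max_version=2, supported=frozenset({V4_ENCODING}))

print(can_read_linear(reader, file))      # False
print(can_read_by_feature(reader, file))  # True
```

Under the linear rule this reader has to either reject the file or ignore the version (choice 1 above), even though it can decode every page in it. The per-feature check is only one possible direction and is not something the PR proposes.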
   
   



