jorisvandenbossche commented on code in PR #186:
URL: https://github.com/apache/parquet-site/pull/186#discussion_r3380588765


##########
content/en/docs/File Format/versions.md:
##########
@@ -0,0 +1,266 @@
+---
+title: "Parquet format versions"
+linkTitle: "Features and Versions"
+weight: 9
+---
+
+This page describes how features are added to the [Parquet format
+specification](https://github.com/apache/parquet-format) and how they affect
+reader and writer compatibility. See the
+[Implementation status](../implementationstatus/) page for which implementations
+(arrow, parquet-java, arrow-rs, etc.) support each feature.
+
+*Note*: If you find out-of-date information, please open an issue or pull request.
+
+## Feature compatibility
+
+The Parquet format spec [classifies changes] by their effect on reader and
+writer compatibility. Changes differ in their *forward* compatibility — whether
+an older reader can read files that use a newer feature.
+
+**Forward compatible** features remain **readable by older readers**, with a
+possibly degraded experience: some metadata may be missing or performance may
+suffer, but the reader does not fail. Examples:
+
+* **Bloom filters**: a reader that ignores them skips the pruning metadata but
+  still reads the data correctly.
+* **Logical type annotations** such as `VARIANT`: an older reader reads the
+  underlying physical column (e.g. `BYTE_ARRAY`) as raw bytes without applying
+  the logical type.
+
+**Forward incompatible** features make the data **unreadable** to older software.
+Examples:
+
+* **New encodings** (e.g. the `DELTA_*` encodings, `BYTE_STREAM_SPLIT`,
+  `RLE_DICTIONARY`): a reader that does not implement them cannot decode the
+  column values.
+* **Data Page V2 headers**: a reader that only understands `DataPageHeader`
+  cannot parse `DataPageHeaderV2` pages.
+
+[classifies changes]: https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#compatibility-and-feature-enablement
+
+## `FileMetadata` version field
+
+Each Parquet file has a `version` field in the [`thrift FileMetadata`] that
+declares which features the file may use, and thus what a reader **must** support
+to read it.
+
+**Note**: Many writers set the version field to `1` even for files that use
+format 2.0 features, which has caused [confusion and interoperability
+issues][closing-out-2.0].

Review Comment:
   The first paragraph, which gives meaning to the `version` metadata field,
   reads as confusing or even misleading when taken together with the note,
   all the more so because the thrift definition itself says writers should
   always set this field to "1":
   
   
https://github.com/apache/parquet-format/blob/74001e41f5c5a1856b29be115f9c992cab16a4bf/src/main/thrift/parquet.thrift#L1365-L1374
   
   ```
   ...
   struct FileMetaData {
     /** Version of this file
       *
       * As of December 2025, there is no agreed upon consensus of what constitutes
       * version 2 of the file. For maximum compatibility with readers, writers should
       * always populate "1" for version. For maximum compatibility with writers,
       * readers should accept "1" and "2" interchangeably.  All other versions are
       * reserved for potential future use-cases.
       */
     1: required i32 version
   ...
   ```
   
   Alternatively, if the text here is kept as it is, should the comment in the
   thrift file be updated to match it?
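
For illustration, the rule in the quoted thrift comment could be sketched roughly as below. This is only a sketch: `WRITER_VERSION`, `ACCEPTED_VERSIONS`, `check_file_version`, and `can_read` are hypothetical names, not part of any Parquet implementation, and the feature sets are plain strings chosen for the example. The second function assumes that a reader decides readability from the features a file actually uses rather than from the version number.

```python
# Sketch of the version handling described in the quoted thrift comment.
# All names here are hypothetical; they do not come from a Parquet library.

WRITER_VERSION = 1          # writers: always populate 1
ACCEPTED_VERSIONS = {1, 2}  # readers: treat 1 and 2 interchangeably


def check_file_version(version: int) -> None:
    """Raise ValueError for versions the thrift comment marks as reserved."""
    if version not in ACCEPTED_VERSIONS:
        raise ValueError(f"unsupported FileMetaData.version: {version}")


def can_read(version: int, required_features: set, supported_features: set) -> bool:
    """Assumption: readability follows from the features used, not the version."""
    check_file_version(version)
    # The file is readable only if every feature it requires is supported.
    return required_features <= supported_features
```

Under this reading, a file with `version = 1` that uses `BYTE_STREAM_SPLIT` is still unreadable to a reader that lacks that encoding, which is why the first paragraph of the section seems at odds with the thrift comment.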


