Re: [PR] Fix errors and inconsistencies in Variant format documentation [parquet-format]

via GitHub Wed, 03 Jun 2026 08:59:34 -0700


alamb commented on code in PR #574:
URL: https://github.com/apache/parquet-format/pull/574#discussion_r3349998296



##########
VariantEncoding.md:
##########
@@ -87,6 +87,7 @@ The `version` is a 4-bit value that must always contain the 
value `1`.
 `sorted_strings` is a 1-bit value indicating whether dictionary strings are 
sorted and unique.
 `offset_size_minus_one` is a 2-bit value providing the number of bytes per 
dictionary size and offset field.
 The actual number of bytes, `offset_size`, is `offset_size_minus_one + 1`.
+Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by 
readers.

Review Comment:
   THis "must be set to 0 by writers" seems to me to change the spec. 
Previously it was not specified. I think we should leave it as unspecified  -- 
it would be nice to clarify it should be ignored by readers 
   
   ```suggestion
   Bit 5 (marked `R`) is reserved; it must be ignored by readers.
   ```



##########
VariantEncoding.md:
##########
@@ -206,14 +207,15 @@ value_header    |   |   |       |       |
 The actual number of bytes is computed as `field_offset_size_minus_one + 1` 
and `field_id_size_minus_one + 1`.
 `is_large` is a 1-bit value that indicates how many bytes are used to encode 
the number of elements.
 If `is_large` is `0`, 1 byte is used, and if `is_large` is `1`, 4 bytes are 
used.
+Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by 
readers.

Review Comment:
   same comment as above -- I don't think we should mandate setting to 0



##########
VariantEncoding.md:
##########
@@ -223,6 +225,7 @@ value_header    |           |   |       |
 The actual number of bytes is computed as `field_offset_size_minus_one + 1`.
 `is_large` is a 1-bit value that indicates how many bytes are used to encode 
the number of elements.
 If `is_large` is `0`, 1 byte is used, and if `is_large` is `1`, 4 bytes are 
used.
+Bits 5-3 (marked `RRR`) are reserved; they must be set to 0 by writers and 
ignored by readers.

Review Comment:
   same as above



##########
VariantEncoding.md:
##########
@@ -407,20 +414,20 @@ The Decimal type contains a scale, but no precision. The 
implied precision of a
 | NullType             | null                        | `0`     | UNKNOWN       
              | none                                                            
                                                    |
 | Boolean              | boolean (True)              | `1`     | BOOLEAN       
              | none                                                            
                                                    |
 | Boolean              | boolean (False)             | `2`     | BOOLEAN       
              | none                                                            
                                                    |
-| Exact Numeric        | int8                        | `3`     | INT(8, 
signed)              | 1 byte                                                   
                                                           |
-| Exact Numeric        | int16                       | `4`     | INT(16, 
signed)             | 2 byte little-endian                                      
                                                          |
-| Exact Numeric        | int32                       | `5`     | INT(32, 
signed)             | 4 byte little-endian                                      
                                                          |
-| Exact Numeric        | int64                       | `6`     | INT(64, 
signed)             | 8 byte little-endian                                      
                                                          |
+| Exact Numeric        | int8                        | `3`     | INT(8, true)  
              | 1-byte                                                          
                                                    |
+| Exact Numeric        | int16                       | `4`     | INT(16, true) 
              | 2-byte little-endian                                            
                                                    |
+| Exact Numeric        | int32                       | `5`     | INT(32, true) 
              | 4-byte little-endian                                            
                                                    |
+| Exact Numeric        | int64                       | `6`     | INT(64, true) 
              | 8-byte little-endian                                            
                                                    |
 | Double               | double                      | `7`     | DOUBLE        
              | IEEE little-endian                                              
                                                    |
-| Exact Numeric        | decimal4                    | `8`     | 
DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by 
little-endian unscaled value (see decimal table)                         |
-| Exact Numeric        | decimal8                    | `9`     | 
DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by 
little-endian unscaled value (see decimal table)                         |
-| Exact Numeric        | decimal16                   | `10`    | 
DECIMAL(precision, scale)   | 1 byte scale in range [0, 38], followed by 
little-endian unscaled value (see decimal table)                         |
-| Date                 | date                        | `11`    | DATE          
              | 4 byte little-endian                                            
                                                    |
+| Exact Numeric        | decimal4                    | `8`     | 
DECIMAL(precision, scale)   | 1-byte scale in range [0, 38], followed by 
little-endian unscaled value (see decimal table)                         |
+| Exact Numeric        | decimal8                    | `9`     | 
DECIMAL(precision, scale)   | 1-byte scale in range [0, 38], followed by 
little-endian unscaled value (see decimal table)                         |
+| Exact Numeric        | decimal16                   | `10`    | 
DECIMAL(precision, scale)   | 1-byte scale in range [0, 38], followed by 
little-endian unscaled value (see decimal table)                         |
+| Date                 | date                        | `11`    | DATE          
              | 4-byte little-endian                                            
                                                    |
 | Timestamp            | timestamp                   | `12`    | 
TIMESTAMP(isAdjustedToUTC=true, MICROS)     | 8-byte little-endian              
                                                                  |
 | TimestampNTZ         | timestamp without time zone | `13`    | 
TIMESTAMP(isAdjustedToUTC=false, MICROS)    | 8-byte little-endian              
                                                                  |
 | Float                | float                       | `14`    | FLOAT         
              | IEEE little-endian                                              
                                                    |
-| Binary               | binary                      | `15`    | BINARY        
              | 4 byte little-endian size, followed by bytes                    
                                                    |
-| String               | string                      | `16`    | STRING        
              | 4 byte little-endian size, followed by UTF-8 encoded bytes      
                                                    |
+| Binary               | binary                      | `15`    | BYTE_ARRAY    
              | 4-byte little-endian size, followed by bytes                    
                                                    |

Review Comment:
   Double checked against https://parquet.apache.org/docs/file-format/types/



##########
VariantShredding.md:
##########
@@ -294,7 +294,7 @@ def construct_variant(metadata: Metadata, value: Variant, 
typed_value: Any) -> V
             # this is a shredded object
             object_fields = {
                 name: construct_variant(metadata, field.value, 
field.typed_value)
-                for (name, field) in typed_value
+                for (name, field) in typed_value.items()

Review Comment:
   Why this change? Does the original not work? Did you test that this still 
works after the proposed change?



##########
VariantShredding.md:
##########
@@ -85,8 +85,8 @@ Shredded values must use the following Parquet types:
 | Variant Type                | Parquet Physical Type             | Parquet 
Logical Type     |
 
|-----------------------------|-----------------------------------|--------------------------|
 | boolean                     | BOOLEAN                           |            
              |
-| int8                        | INT32                             | INT(8, 
signed=true)      |
-| int16                       | INT32                             | INT(16, 
signed=true)     |
+| int8                        | INT32                             | INT(8, 
true)             |

Review Comment:
   I found the signed=true easier to understand to be honest, but this is 
technically accurate too



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix errors and inconsistencies in Variant format documentation [parquet-format]

Reply via email to