alamb commented on code in PR #574:
URL: https://github.com/apache/parquet-format/pull/574#discussion_r3349998296
##########
VariantEncoding.md:
##########
@@ -87,6 +87,7 @@ The `version` is a 4-bit value that must always contain the
value `1`.
`sorted_strings` is a 1-bit value indicating whether dictionary strings are
sorted and unique.
`offset_size_minus_one` is a 2-bit value providing the number of bytes per
dictionary size and offset field.
The actual number of bytes, `offset_size`, is `offset_size_minus_one + 1`.
+Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by
readers.
Review Comment:
THis "must be set to 0 by writers" seems to me to change the spec.
Previously it was not specified. I think we should leave it as unspecified --
it would be nice to clarify it should be ignored by readers
```suggestion
Bit 5 (marked `R`) is reserved; it must be ignored by readers.
```
##########
VariantEncoding.md:
##########
@@ -206,14 +207,15 @@ value_header | | | | |
The actual number of bytes is computed as `field_offset_size_minus_one + 1`
and `field_id_size_minus_one + 1`.
`is_large` is a 1-bit value that indicates how many bytes are used to encode
the number of elements.
If `is_large` is `0`, 1 byte is used, and if `is_large` is `1`, 4 bytes are
used.
+Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by
readers.
Review Comment:
same comment as above -- I don't think we should mandate setting to 0
##########
VariantEncoding.md:
##########
@@ -223,6 +225,7 @@ value_header | | | |
The actual number of bytes is computed as `field_offset_size_minus_one + 1`.
`is_large` is a 1-bit value that indicates how many bytes are used to encode
the number of elements.
If `is_large` is `0`, 1 byte is used, and if `is_large` is `1`, 4 bytes are
used.
+Bits 5-3 (marked `RRR`) are reserved; they must be set to 0 by writers and
ignored by readers.
Review Comment:
same as above
##########
VariantEncoding.md:
##########
@@ -407,20 +414,20 @@ The Decimal type contains a scale, but no precision. The
implied precision of a
| NullType | null | `0` | UNKNOWN
| none
|
| Boolean | boolean (True) | `1` | BOOLEAN
| none
|
| Boolean | boolean (False) | `2` | BOOLEAN
| none
|
-| Exact Numeric | int8 | `3` | INT(8,
signed) | 1 byte
|
-| Exact Numeric | int16 | `4` | INT(16,
signed) | 2 byte little-endian
|
-| Exact Numeric | int32 | `5` | INT(32,
signed) | 4 byte little-endian
|
-| Exact Numeric | int64 | `6` | INT(64,
signed) | 8 byte little-endian
|
+| Exact Numeric | int8 | `3` | INT(8, true)
| 1-byte
|
+| Exact Numeric | int16 | `4` | INT(16, true)
| 2-byte little-endian
|
+| Exact Numeric | int32 | `5` | INT(32, true)
| 4-byte little-endian
|
+| Exact Numeric | int64 | `6` | INT(64, true)
| 8-byte little-endian
|
| Double | double | `7` | DOUBLE
| IEEE little-endian
|
-| Exact Numeric | decimal4 | `8` |
DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by
little-endian unscaled value (see decimal table) |
-| Exact Numeric | decimal8 | `9` |
DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by
little-endian unscaled value (see decimal table) |
-| Exact Numeric | decimal16 | `10` |
DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by
little-endian unscaled value (see decimal table) |
-| Date | date | `11` | DATE
| 4 byte little-endian
|
+| Exact Numeric | decimal4 | `8` |
DECIMAL(precision, scale) | 1-byte scale in range [0, 38], followed by
little-endian unscaled value (see decimal table) |
+| Exact Numeric | decimal8 | `9` |
DECIMAL(precision, scale) | 1-byte scale in range [0, 38], followed by
little-endian unscaled value (see decimal table) |
+| Exact Numeric | decimal16 | `10` |
DECIMAL(precision, scale) | 1-byte scale in range [0, 38], followed by
little-endian unscaled value (see decimal table) |
+| Date | date | `11` | DATE
| 4-byte little-endian
|
| Timestamp | timestamp | `12` |
TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian
|
| TimestampNTZ | timestamp without time zone | `13` |
TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian
|
| Float | float | `14` | FLOAT
| IEEE little-endian
|
-| Binary | binary | `15` | BINARY
| 4 byte little-endian size, followed by bytes
|
-| String | string | `16` | STRING
| 4 byte little-endian size, followed by UTF-8 encoded bytes
|
+| Binary | binary | `15` | BYTE_ARRAY
| 4-byte little-endian size, followed by bytes
|
Review Comment:
Double checked against https://parquet.apache.org/docs/file-format/types/
##########
VariantShredding.md:
##########
@@ -294,7 +294,7 @@ def construct_variant(metadata: Metadata, value: Variant,
typed_value: Any) -> V
# this is a shredded object
object_fields = {
name: construct_variant(metadata, field.value,
field.typed_value)
- for (name, field) in typed_value
+ for (name, field) in typed_value.items()
Review Comment:
Why this change? Does the original not work? Did you test that this still
works after the proposed change?
##########
VariantShredding.md:
##########
@@ -85,8 +85,8 @@ Shredded values must use the following Parquet types:
| Variant Type | Parquet Physical Type | Parquet
Logical Type |
|-----------------------------|-----------------------------------|--------------------------|
| boolean | BOOLEAN |
|
-| int8 | INT32 | INT(8,
signed=true) |
-| int16 | INT32 | INT(16,
signed=true) |
+| int8 | INT32 | INT(8,
true) |
Review Comment:
I found the signed=true easier to understand to be honest, but this is
technically accurate too
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]