mikhail-melnik opened a new issue, #3573:
URL: https://github.com/apache/parquet-java/issues/3573

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I was experimenting with the Variant implementation and running its tests, 
and found that some of them fail. It turns out that `VariantUtil.getString()`, 
`VariantUtil.getDictKey()`, and `VariantUtil.getMetadataMap()` construct Java 
strings from raw bytes without specifying a charset. They therefore rely on the 
JVM platform default charset, which is not guaranteed to be UTF-8.
   
   For example, on JDK <= 17, setting `LC_ALL=C` (the invocation recommended in 
the project README) causes the JVM to use ASCII. Under this encoding, a 
multi-byte UTF-8 sequence such as `é` (encoded as `0xC3 0xA9`) is decoded as 
two separate characters instead of one, corrupting non-ASCII string values 
read from Variant columns.
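
The effect described above can be reproduced in isolation. The following is a minimal sketch, not code from parquet-java: the class name `CharsetDecodeDemo` and the helper `decode` are hypothetical, and only standard JDK calls are used. It decodes the two bytes `0xC3 0xA9` under several charsets and reports how many Java characters result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of parquet-java.
public class CharsetDecodeDemo {
    // UTF-8 encoding of "é" (U+00E9), as quoted in the issue.
    static final byte[] E_ACUTE = {(byte) 0xC3, (byte) 0xA9};

    // Decode with an explicit charset, so the result does not depend on the JVM default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.UTF_8,      // one character: é
            StandardCharsets.US_ASCII,   // two replacement characters (U+FFFD)
            StandardCharsets.ISO_8859_1  // two characters: Ã©
        };
        for (Charset cs : charsets) {
            System.out.println(cs + " -> " + decode(E_ACUTE, cs).length() + " char(s)");
        }
        // What `new String(bytes)` would use when no charset is given.
        System.out.println("platform default: " + Charset.defaultCharset());
    }
}
```

Running this with and without `LC_ALL=C` on the same JDK should show whether the last line changes, which is a quick way to tell if a given environment is affected.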
   
   The [Variant binary encoding 
spec](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md) 
explicitly requires UTF-8 for all string data. As far as I can see, the write 
path already uses `StandardCharsets.UTF_8` explicitly (in 
`VariantBuilder.appendString()` and `MetadataBuilder.getOrInsert()`), so the 
read path is both inconsistent with the write path and incorrect per the spec.
   
   **To reproduce**, run the existing `testParseUnicodeString` test with 
`LC_ALL=C`:
   
   ```bash
   LC_ALL=C ./mvnw test -pl parquet-variant -Dtest=TestVariantParseJson#testParseUnicodeString
   ```
   
   Expected: test passes. 
   Actual:
   ```
   org.junit.ComparisonFailure: expected:<[él è]ve> but was:<[Ãlè]ve>
   ```
   
   CI runs tests on JDK 11 and 17 but does not set `LC_ALL=C`, so I guess 
`file.encoding` stays UTF-8 by default. JDK 18+ is also unaffected because JEP 
400 makes UTF-8 the default regardless of locale.
   
   **Proposed Fix:** add `StandardCharsets.UTF_8` to all six call sites, 
consistent with `Binary.toStringUsingUTF8()`, `KeyMetadata`, and the rest of 
the codebase.
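
As a rough illustration of the shape of the change, here is a minimal sketch. The class `Utf8ReadSketch` and the method `readString` are hypothetical stand-ins for a read-path helper, not the actual `VariantUtil` code; the only assumption is that the affected call sites use a `new String(byte[], int, int)`-style constructor without a charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a read-path helper; not the actual VariantUtil code.
public class Utf8ReadSketch {
    static String readString(byte[] value, int start, int length) {
        // Before (depends on the platform default charset):
        //   return new String(value, start, length);
        // After (always UTF-8, matching the write path and the spec):
        return new String(value, start, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the literal independent of the source file encoding.
        String original = "\u00e9l\u00e8ve";
        byte[] buf = original.getBytes(StandardCharsets.UTF_8);
        // Round-trips correctly regardless of LC_ALL or file.encoding.
        System.out.println(readString(buf, 0, buf.length).equals(original));
    }
}
```

With the charset passed explicitly, the result no longer varies with the locale the JVM was started under.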
   
   ### Component(s)
   
   Core



