mikhail-melnik opened a new issue, #3573: URL: https://github.com/apache/parquet-java/issues/3573
### Describe the bug, including details regarding any error messages, version, and platform.

I was experimenting with the Variant implementation and found that the corresponding tests fail when run locally. It turns out that `VariantUtil.getString()`, `VariantUtil.getDictKey()`, and `VariantUtil.getMetadataMap()` construct Java strings from raw bytes without specifying a charset. They therefore rely on the JVM platform default charset, which is not guaranteed to be UTF-8.

For example, on JDK <= 17, setting `LC_ALL=C` (the invocation recommended in the project README) causes the JVM to use ASCII. Under this encoding, multi-byte UTF-8 sequences such as `é` (encoded as `0xC3 0xA9`) are decoded as two separate characters instead of one, corrupting non-ASCII string values read from Variant columns.

The [Variant binary encoding spec](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md) explicitly requires UTF-8 for all string data. As far as I can see, the write path already uses `StandardCharsets.UTF_8` explicitly (in `VariantBuilder.appendString()` and `MetadataBuilder.getOrInsert()`), so the read path is both inconsistent with the write path and incorrect.

**To reproduce**, run the existing `testParseUnicodeString` test with `LC_ALL=C`:

```bash
LC_ALL=C ./mvnw test -pl parquet-variant -Dtest=TestVariantParseJson#testParseUnicodeString
```

Expected: test passes.

Actual:

```
org.junit.ComparisonFailure: expected:<[él è]ve> but was:<[Ãlè]ve>
```

CI runs tests on JDK 11 and 17 but does not set `LC_ALL=C`, so I guess `file.encoding` stays UTF-8 by default. JDK 18+ is also unaffected because JEP 400 makes UTF-8 the default regardless of locale.

**Proposed Fix:** add `StandardCharsets.UTF_8` to all six call sites, consistent with `Binary.toStringUsingUTF8()`, `KeyMetadata`, and the rest of the codebase.

### Component(s)

Core
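The read-path mismatch described above can be illustrated with a minimal, standalone sketch. It does not reproduce the actual `VariantUtil` code. The class name `CharsetDecodeSketch` and the helper `decode` are hypothetical and exist only for illustration. The sketch assumes the affected call sites use the `new String(byte[], int, int)` form, which the issue implies but does not quote.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the parquet-java implementation.
final class CharsetDecodeSketch {

    // Decodes a slice of a buffer with an explicit charset,
    // which is the shape of the fix proposed in the issue.
    static String decode(byte[] buf, int pos, int len, Charset charset) {
        return new String(buf, pos, len, charset);
    }

    public static void main(String[] args) {
        // "eleve" with accented first and third letters, written with
        // escapes so the source file itself is encoding-independent.
        String original = "\u00e9l\u00e8ve";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8); // 7 bytes

        // Explicit UTF-8: round-trips regardless of locale or JDK version.
        String utf8 = decode(raw, 0, raw.length, StandardCharsets.UTF_8);

        // Explicit single-byte charsets: each byte becomes one char,
        // so the two accented letters each turn into two chars.
        String ascii = decode(raw, 0, raw.length, StandardCharsets.US_ASCII);
        String latin1 = decode(raw, 0, raw.length, StandardCharsets.ISO_8859_1);

        // No charset argument: the result depends on the platform default,
        // which is the behavior the issue reports on the read path.
        String platform = new String(raw, 0, raw.length);

        System.out.println("default charset   : " + Charset.defaultCharset());
        System.out.println("utf8 round-trips  : " + utf8.equals(original));
        System.out.println("utf8 length       : " + utf8.length());
        System.out.println("ascii length      : " + ascii.length());
        System.out.println("latin1 length     : " + latin1.length());
        System.out.println("platform matches  : " + platform.equals(original));
    }
}
```

Running this once normally and once with `LC_ALL=C` on JDK 17 or earlier should show whether the `platform matches` line changes in a given environment. The explicit UTF-8 line is expected to stay `true` in both runs.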
