On Tue, 20 Aug 2024 04:09:04 GMT, David Holmes <dhol...@openjdk.org> wrote:

>> This work has been split out from JDK-8328877: [JNI] The JNI Specification 
>> needs to address the limitations of integer UTF-8 String lengths
>> 
>> The modified UTF-8 format used by the VM can require up to six bytes to 
>> represent one unicode character, but six byte characters are stored as 
>> UTF-16 surrogate pairs. Hence the most bytes per character is 3, and so the 
>> maximum length is 3*`Integer.MAX_VALUE`.  Though with compact strings this 
>> reduces to 2*`Integer.MAX_VALUE`. The low-level UTF8/UNICODE API should 
>> therefore define UTF8 lengths as `size_t` to accommodate all possible 
>> representations. Higher-level API's can still use `int` if they know the 
>> strings (eg symbols) are sufficiently constrained in length.  See the 
>> comments in utf8.hpp that explain Strings, compact strings and the encoding.
>> 
>> As the existing JNI `GetStringUTFLength` still requires the current 
>> truncating behaviour of ` UNICODE::utf8_length` we add back 
>> `UNICODE::utf8_length_as_int` for it to use.
>> 
>> Note that some API's, like ` UNICODE::as_utf8(const T* base, size_t& 
>> length)` use `length` as an IN/OUT parameter: it is the incoming (int) 
>> length of the jbyte/jchar array, and the outgoing (size_t) length of the 
>> UTF8 sequence. This makes some of the call sites a little messy with casts.
>> 
>> Testing:
>>  - tiers 1-4
>>  - GHA
>
> David Holmes has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   more missing casts

src/hotspot/share/classfile/javaClasses.cpp line 350:

> 348:   // This check is too strict when the input string is not a valid UTF8.
> 349:   // For example, it may be created with arbitrary content via 
> jni_NewStringUTF.
> 350:   if (UTF8::is_legal_utf8((const unsigned char*)utf8_str, 
> strlen(utf8_str), false)) {

Most of the time we use `is_legal_utf8`, we have a char* and have to cast it to 
unsigned char*.  How about adding an inlined overload for is_legal_utf8(const 
char*, size_t) for conenience?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20560#discussion_r1731997319

Reply via email to