Integrated: 8338257: UTF8 lengths should be size_t not int

David Holmes Thu, 29 Aug 2024 13:43:36 -0700

On Tue, 13 Aug 2024 02:20:41 GMT, David Holmes <dhol...@openjdk.org> wrote:


> This work has been split out from JDK-8328877: [JNI] The JNI Specification 
> needs to address the limitations of integer UTF-8 String lengths
> 
> The modified UTF-8 format used by the VM can require up to six bytes to 
> represent one Unicode character, but six-byte characters are stored as UTF-16 
> surrogate pairs. Hence the most bytes per UTF-16 code unit is 3, and so the 
> maximum length is 3*`Integer.MAX_VALUE`, though with compact strings this 
> reduces to 2*`Integer.MAX_VALUE`. The low-level UTF8/UNICODE API should 
> therefore define UTF8 lengths as `size_t` to accommodate all possible 
> representations. Higher-level APIs can still use `int` if they know the 
> strings (e.g. symbols) are sufficiently constrained in length. See the 
> comments in utf8.hpp that explain Strings, compact strings and the encoding.
> 
> As the existing JNI `GetStringUTFLength` still requires the current 
> truncating behaviour of `UNICODE::utf8_length`, we add back 
> `UNICODE::utf8_length_as_int` for it to use.
> 
> Note that some APIs, like `UNICODE::as_utf8(const T* base, size_t& length)`, 
> use `length` as an IN/OUT parameter: it is the incoming (int) length of the 
> jbyte/jchar array, and the outgoing (size_t) length of the UTF8 sequence. 
> This makes some of the call sites a little messy with casts.
> 
> Testing:
>  - tiers 1-4
>  - GHA
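
As a rough illustration of the length arithmetic described above, here is a minimal C++ sketch. It is not the HotSpot code: the names `modified_utf8_size`, `modified_utf8_length`, `modified_utf8_length_clamped` and `modified_utf8_length_as_int` are hypothetical, and the clamped variant assumes that "truncating" means stopping at the last whole encoded character that still fits under a caller-supplied limit.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the HotSpot implementation): number of bytes
// needed to encode one UTF-16 code unit in modified UTF-8.
static size_t modified_utf8_size(uint16_t c) {
  if (c >= 0x0001 && c <= 0x007F) return 1;  // ASCII, excluding NUL
  if (c <= 0x07FF) return 2;                 // includes NUL, encoded as 0xC0 0x80
  return 3;                                  // everything else, including each surrogate
}

// Full encoded length in bytes. The result is a size_t because it can be
// up to 3x the number of code units, which may not fit in an int.
static size_t modified_utf8_length(const uint16_t* base, size_t length) {
  size_t result = 0;
  for (size_t i = 0; i < length; i++) {
    result += modified_utf8_size(base[i]);
  }
  return result;
}

// Assumed reading of "truncating": count whole encoded characters only
// while the running total stays within `limit`.
static size_t modified_utf8_length_clamped(const uint16_t* base, size_t length,
                                           size_t limit) {
  size_t result = 0;
  for (size_t i = 0; i < length; i++) {
    size_t sz = modified_utf8_size(base[i]);
    if (sz > limit - result) break;  // next character would exceed the limit
    result += sz;
  }
  return result;
}

// int-returning variant for callers that are stuck with an int result.
// The INT_MAX limit is an assumption made for this sketch.
static int modified_utf8_length_as_int(const uint16_t* base, size_t length) {
  return static_cast<int>(
      modified_utf8_length_clamped(base, length, static_cast<size_t>(INT_MAX)));
}
```

With these definitions a surrogate pair counts as 3 + 3 = 6 bytes and an embedded NUL counts as 2 bytes, so the `size_t` result can exceed `INT_MAX` for a long enough input while the `int` variant cannot.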

This pull request has now been integrated.

Changeset: a4962ace
Author:    David Holmes <dhol...@openjdk.org>
URL:       https://git.openjdk.org/jdk/commit/a4962ace4d3afb36e9d6822a4f02a1515fac40ed
Stats:     234 lines in 16 files changed: 112 ins; 5 del; 117 mod

8338257: UTF8 lengths should be size_t not int

Reviewed-by: stuefe, coleenp, dlong

-------------

PR: https://git.openjdk.org/jdk/pull/20560
