Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

David Holmes Thu, 29 Aug 2024 05:40:49 -0700

On Thu, 29 Aug 2024 08:38:41 GMT, Thomas Stuefe <stu...@openjdk.org> wrote:


> > > Many of these translations seem awkward, since they convert to size_t 
> > > only to then convert back to int.
> > 
> > Can you be more specific here please?
> 
> Certainly. For example, this construct:
> 
> ```
>     size_t utf8_len = static_cast<size_t>(length);
>     const char* base = UNICODE::as_utf8(position, utf8_len);
>     Symbol* sym = SymbolTable::new_symbol(base, checked_cast<int>(utf8_len));
> ```
> 
> We introduce `utf8_len` as a `size_t` synonym for len, but since it 
> originates from an `int`, its length must be <= INT_MIN. We also assume it is 
> >=0, but we don't check. We feed `utf8_len` into both `UNICODE::as_utf8` and 
> `SymbolTable::new_symbol`. The former takes a size_t, but since we rely on 
> `length` >= 0, we could just as well give it `length`. For 
> `SymbolTable::new_symbol`, we translate `utf8_len` back to `int`, with a 
> check. The check feels superfluous since `utf8_len` came from an int.
> 
> I assume this verbosity is for the benefit of the code reader, to make intent 
> clear. Otherwise, we could have just continued to use length, and just cast 
> it on the fly to unsigned or to size_t when calling `UNICODE::as_utf8`.

This is exactly the case I was referring to. The declaration here is:

 template<typename T> static char* as_utf8(const T* base, size_t& length);

whether length is an IN/OUT parameter that is the int array length going in 
(hence >= 0 and <= INT_MAX), and the size_t utf8 sequence length coming out. 
The out coming utf8 length can theoretically by > INT_MAX but if that were the 
case in this code (which expects to be dealing with names that can be symbols 
hence < 64K) then that would be a programming error which the checked_cast 
would catch. And of course new_symbol checks for < 64K.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20560#issuecomment-2317534674

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

Reply via email to