Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

2024-08-29 Thread David Holmes
On Thu, 29 Aug 2024 13:06:19 GMT, Thomas Stuefe wrote: >>> > > Many of these translations seem awkward, since they convert to size_t >>> > > only to then convert back to int. >>> > >>> > Can you be more specific here please? >>> >>> Certainly. For example, this construct: >>> >>> ``` >>>

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

2024-08-29 Thread Thomas Stuefe
On Thu, 29 Aug 2024 12:37:47 GMT, David Holmes wrote: > > > > Many of these translations seem awkward, since they convert to size_t > > > > only to then convert back to int. > > > > > > > > > Can you be more specific here please? > > > > > > Certainly. For example, this construct: > > ``` >

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

2024-08-29 Thread Thomas Stuefe
On Thu, 29 Aug 2024 12:30:26 GMT, David Holmes wrote: >> src/hotspot/share/classfile/javaClasses.hpp line 138: >> >>> 136: // Legacy variants that truncate the length if needed >>> 137: static intutf8_length_as_int(oop java_string); >>> 138: static intutf8_length_as_int(oop java_st

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

2024-08-29 Thread David Holmes
On Thu, 29 Aug 2024 08:38:41 GMT, Thomas Stuefe wrote: > > > Many of these translations seem awkward, since they convert to size_t > > > only to then convert back to int. > > > > Can you be more specific here please? > > Certainly. For example, this construct: > > ``` > size_t utf8_len =

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

2024-08-29 Thread David Holmes
On Thu, 29 Aug 2024 08:52:03 GMT, Thomas Stuefe wrote: >> David Holmes has updated the pull request with a new target base due to a >> merge or a rebase. The incremental webrev excludes the unrelated changes >> brought in by the merge/rebase. The pull request contains 13 additional >> commits

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

2024-08-29 Thread Thomas Stuefe
On Thu, 29 Aug 2024 02:45:53 GMT, David Holmes wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represent one

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

2024-08-29 Thread Thomas Stuefe
On Thu, 29 Aug 2024 07:59:47 GMT, David Holmes wrote: > > Many of these translations seem awkward, since they convert to size_t only > > to then convert back to int. > > Can you be more specific here please? Certainly. For example, this construct: size_t utf8_len = static_cast(length);

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

2024-08-29 Thread David Holmes
On Thu, 29 Aug 2024 05:53:25 GMT, Thomas Stuefe wrote: > Many of these translations seem awkward, since they convert to size_t only to > then convert back to int. Can you be more specific here please? I know the most awkward case is where we have an in/out parameter that brings in an int and s

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

2024-08-28 Thread Thomas Stuefe
On Thu, 29 Aug 2024 05:52:26 GMT, Thomas Stuefe wrote: > Many of these translations seem awkward, since they convert to size_t only to > then convert back to int. > > Proposal: I undestand you need to find a good point to tourniquet off the > int->size_t conversion to minimize the translations

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

2024-08-28 Thread Thomas Stuefe
On Thu, 29 Aug 2024 02:45:53 GMT, David Holmes wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represent one

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v7]

2024-08-28 Thread David Holmes
> This work has been split out from JDK-8328877: [JNI] The JNI Specification > needs to address the limitations of integer UTF-8 String lengths > > The modified UTF-8 format used by the VM can require up to six bytes to > represent one unicode character, but six byte characters are stored as UTF

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v6]

2024-08-27 Thread David Holmes
> This work has been split out from JDK-8328877: [JNI] The JNI Specification > needs to address the limitations of integer UTF-8 String lengths > > The modified UTF-8 format used by the VM can require up to six bytes to > represent one unicode character, but six byte characters are stored as UTF

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread David Holmes
On Wed, 28 Aug 2024 03:46:43 GMT, Dean Long wrote: >> Note that I do already document the assumptions here in the general comment >> in utf8.hpp: >> >> There is an additional assumption/expectation that our UTF8 API's are never >> dealing with >> invalid UTF8, and more generally that all UTF8

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread Dean Long
On Wed, 28 Aug 2024 01:24:43 GMT, David Holmes wrote: >>> If you try to accommodate arbitrary future use then every method in the VM >>> would need to enforce every single precondition and invariant it expects >>> "just in case" and that is not practical. >> >> I'm basically arguing for Functi

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread Dean Long
On Tue, 20 Aug 2024 04:09:04 GMT, David Holmes wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represent one

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread David Holmes
On Tue, 27 Aug 2024 23:54:08 GMT, Dean Long wrote: >> If you try to accommodate arbitrary future use then every method in the VM >> would need to enforce every single precondition and invariant it expects >> "just in case" and that is not practical. Code can and does take advantage >> of the e

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread Dean Long
On Tue, 27 Aug 2024 21:21:01 GMT, David Holmes wrote: > If you try to accommodate arbitrary future use then every method in the VM > would need to enforce every single precondition and invariant it expects > "just in case" and that is not practical. I'm basically arguing for Functional Testing

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread David Holmes
On Tue, 27 Aug 2024 16:51:21 GMT, Dean Long wrote: >> Why? I think that would have a large flow on effect. And this length does >> fit in an int. > > The worse case is len == SIZE_MAX and therefore num_chars == SIZE_MAX, which > won't fit in an int. If we say this will never happen because cur

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread David Holmes
On Tue, 27 Aug 2024 13:06:26 GMT, Thomas Stuefe wrote: >> IIUC for compact strings, with non-latin-1 each pair of bytes would require >> at most 3-bytes to encode so you'd need 2/3 of INT_MAX. With latin-1 it >> would be 1/2 INT_MAX. But yes I suppose in theory you might be able to get >> an o

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread Dean Long
On Tue, 27 Aug 2024 12:10:36 GMT, David Holmes wrote: >> src/hotspot/share/utilities/utf8.cpp line 127: >> >>> 125: prev = c; >>> 126: } >>> 127: return checked_cast(num_chars); >> >> Ideally, this function would return size_t. > > Why? I think that would have a large flow on effect. An

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread Thomas Stuefe
On Tue, 27 Aug 2024 12:20:04 GMT, David Holmes wrote: >> I think the Java string would only need to be INT_MAX/3 in length, if all >> the characters require surrogate encoding. > > IIUC for compact strings, with non-latin-1 each pair of bytes would require > at most 3-bytes to encode so you'd n

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread David Holmes
On Tue, 27 Aug 2024 07:51:38 GMT, Dean Long wrote: >> I'm trying to reason if on 32-bit we could even create a large enough string >> for this to be a problem? Once we have the giant string `as_utf8` will have >> to allocate an array that is just as large if not larger. So for overflow to >> b

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread David Holmes
On Tue, 27 Aug 2024 08:22:57 GMT, Dean Long wrote: >> David Holmes has updated the pull request incrementally with one additional >> commit since the last revision: >> >> more missing casts > > src/hotspot/share/utilities/utf8.cpp line 127: > >> 125: prev = c; >> 126: } >> 127: retur

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread Dean Long
On Tue, 20 Aug 2024 04:09:04 GMT, David Holmes wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represent one

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread Dean Long
On Tue, 27 Aug 2024 07:23:14 GMT, David Holmes wrote: >> src/hotspot/share/classfile/javaClasses.cpp line 633: >> >>> 631: } >>> 632: >>> 633: int java_lang_String::utf8_length_as_int(oop java_string, typeArrayOop >>> value) { >> >> Why not call java_lang_String::utf8_length() here instead of

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread Dean Long
On Tue, 27 Aug 2024 07:20:27 GMT, David Holmes wrote: >> src/hotspot/share/classfile/javaClasses.cpp line 588: >> >>> 586: size_t utf8_len = static_cast(length); >>> 587: const char* base = UNICODE::as_utf8(position, utf8_len); >>> 588: Symbol* sym = SymbolTable::new_symbol(base, >>

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread Dean Long
On Tue, 27 Aug 2024 07:09:33 GMT, David Holmes wrote: >> src/hotspot/share/classfile/javaClasses.cpp line 555: >> >>> 553: bool is_latin1 = java_lang_String::is_latin1(java_string); >>> 554: >>> 555: if (length == 0) return nullptr; >> >> Should this be checking for length <= 0? It l

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread Dean Long
On Tue, 27 Aug 2024 07:07:11 GMT, David Holmes wrote: >> src/hotspot/share/classfile/javaClasses.cpp line 307: >> >>> 305: { >>> 306: ResourceMark rm; >>> 307: size_t utf8_len = static_cast(length); >> >> I think there should be an assert that length is not negative, probably at >> t

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread David Holmes
On Tue, 20 Aug 2024 04:09:04 GMT, David Holmes wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represent one

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread David Holmes
On Tue, 27 Aug 2024 03:36:00 GMT, Dean Long wrote: >> David Holmes has updated the pull request incrementally with one additional >> commit since the last revision: >> >> more missing casts > > src/hotspot/share/classfile/javaClasses.cpp line 633: > >> 631: } >> 632: >> 633: int java_lang_S

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread David Holmes
On Tue, 27 Aug 2024 03:13:59 GMT, Dean Long wrote: >> David Holmes has updated the pull request incrementally with one additional >> commit since the last revision: >> >> more missing casts > > src/hotspot/share/classfile/javaClasses.cpp line 588: > >> 586: size_t utf8_len = static_cast(

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-27 Thread David Holmes
On Tue, 27 Aug 2024 01:09:09 GMT, Dean Long wrote: >> David Holmes has updated the pull request incrementally with one additional >> commit since the last revision: >> >> more missing casts > > src/hotspot/share/classfile/javaClasses.cpp line 307: > >> 305: { >> 306: ResourceMark rm; >

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-26 Thread Dean Long
On Tue, 20 Aug 2024 04:09:04 GMT, David Holmes wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represent one

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-26 Thread Dean Long
On Tue, 20 Aug 2024 04:09:04 GMT, David Holmes wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represent one

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-26 Thread Dean Long
On Tue, 20 Aug 2024 04:09:04 GMT, David Holmes wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represent one

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-26 Thread Dean Long
On Tue, 20 Aug 2024 04:09:04 GMT, David Holmes wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represent one

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-26 Thread Dean Long
On Tue, 20 Aug 2024 04:09:04 GMT, David Holmes wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represent one

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-26 Thread David Holmes
On Mon, 26 Aug 2024 23:37:15 GMT, Coleen Phillimore wrote: >> Why "> 0" ? > > Because length is an in which could be negative but you're passing it to > size_t. -Wsign-conversion might complain because you're changing signs. I > guess you know from context that it's a positive number, so ok.

Re: RFR: 8338257: UTF8 lengths should be size_t not int

2024-08-26 Thread David Holmes
On Thu, 15 Aug 2024 20:44:52 GMT, Coleen Phillimore wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represen

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-26 Thread Coleen Phillimore
On Mon, 19 Aug 2024 23:08:35 GMT, David Holmes wrote: >> src/hotspot/share/classfile/javaClasses.cpp line 639: >> >>> 637: if (length == 0) { >>> 638: return 0; >>> 639: } >> >> Maybe assert length > 0 here? > > Why "> 0" ? Because length is an in which could be negative but you're pas

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-26 Thread Coleen Phillimore
On Tue, 20 Aug 2024 04:09:04 GMT, David Holmes wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represent one

Re: RFR: 8338257: UTF8 lengths should be size_t not int

2024-08-20 Thread David Holmes
On Thu, 15 Aug 2024 20:44:52 GMT, Coleen Phillimore wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represen

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v5]

2024-08-19 Thread David Holmes
> This work has been split out from JDK-8328877: [JNI] The JNI Specification > needs to address the limitations of integer UTF-8 String lengths > > The modified UTF-8 format used by the VM can require up to six bytes to > represent one unicode character, but six byte characters are stored as UTF

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v4]

2024-08-19 Thread David Holmes
> This work has been split out from JDK-8328877: [JNI] The JNI Specification > needs to address the limitations of integer UTF-8 String lengths > > The modified UTF-8 format used by the VM can require up to six bytes to > represent one unicode character, but six byte characters are stored as UTF

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v3]

2024-08-19 Thread David Holmes
> This work has been split out from JDK-8328877: [JNI] The JNI Specification > needs to address the limitations of integer UTF-8 String lengths > > The modified UTF-8 format used by the VM can require up to six bytes to > represent one unicode character, but six byte characters are stored as UTF

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v2]

2024-08-19 Thread David Holmes
On Tue, 20 Aug 2024 03:16:29 GMT, Dean Long wrote: >> David Holmes has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Add missing cast for signed-to-unsigned converion. > > src/hotspot/share/classfile/javaClasses.cpp line 695: > >> 693:

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v2]

2024-08-19 Thread Dean Long
On Mon, 19 Aug 2024 23:19:04 GMT, David Holmes wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represent one

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v2]

2024-08-19 Thread David Holmes
On Thu, 15 Aug 2024 20:43:51 GMT, Coleen Phillimore wrote: > Did you try to compile this with -Wconversion (not -Wsign-conversion) without > -Werror? I hadn't but will do so ... > src/hotspot/share/classfile/javaClasses.cpp line 586: > >> 584: ResourceMark rm; >> 585: jbyte* position

Re: RFR: 8338257: UTF8 lengths should be size_t not int [v2]

2024-08-19 Thread David Holmes
> This work has been split out from JDK-8328877: [JNI] The JNI Specification > needs to address the limitations of integer UTF-8 String lengths > > The modified UTF-8 format used by the VM can require up to six bytes to > represent one unicode character, but six byte characters are stored as UTF

Re: RFR: 8338257: UTF8 lengths should be size_t not int

2024-08-19 Thread David Holmes
On Thu, 15 Aug 2024 20:44:52 GMT, Coleen Phillimore wrote: >> This work has been split out from JDK-8328877: [JNI] The JNI Specification >> needs to address the limitations of integer UTF-8 String lengths >> >> The modified UTF-8 format used by the VM can require up to six bytes to >> represen

Re: RFR: 8338257: UTF8 lengths should be size_t not int

2024-08-15 Thread Coleen Phillimore
On Tue, 13 Aug 2024 02:20:41 GMT, David Holmes wrote: > This work has been split out from JDK-8328877: [JNI] The JNI Specification > needs to address the limitations of integer UTF-8 String lengths > > The modified UTF-8 format used by the VM can require up to six bytes to > represent one unic

RFR: 8338257: UTF8 lengths should be size_t not int

2024-08-12 Thread David Holmes
This work has been split out from JDK-8328877: [JNI] The JNI Specification needs to address the limitations of integer UTF-8 String lengths The modified UTF-8 format used by the VM can require up to six bytes to represent one unicode character, but six byte characters are stored as UTF-16 surro