Re: RFR: 8364007: Add no-argument codePointCount method to CharSequence and String [v3]

Tatsunori Uchino Sat, 03 Jan 2026 22:09:59 -0800

On Sat, 26 Jul 2025 10:10:40 GMT, Tatsunori Uchino wrote:


>> Adds `codePointCount()` overloads to `String`, `Character`, 
>> `(Abstract)StringBuilder`, and `StringBuffer` to make it possible to 
>> conveniently retrieve the length of a string as code points without extra 
>> boundary checks.
>> 
>> 
>> if (superTremendouslyLongExpressionYieldingAString().codePointCount() > limit) {
>>     throw new Exception("exceeding length");
>> }
>> 
>> 
>> Is a CSR required for this change?
>
> Tatsunori Uchino has updated the pull request incrementally with four 
> additional commits since the last revision:
> 
>  - Update `@bug` in correct file
>  - Add default implementation on codePointCount in CharSequence
>  - Update `@bug` entries in test class doc comments
>  - Discard changes on code whose form is not `str.codePointCount(0, 
> str.length())`

I had Copilot (GPT-4.1) create a draft:

> ## Summary
> 
> Add a no-argument `codePointCount()` method to `CharSequence` and `String` to 
> count the number of Unicode code points in the entire sequence or string.
> 
> ## Problem
> 
> Currently, `String.codePointCount` and `Character.codePointCount` only 
> provide overloads that require an explicit index range. Developers often 
> expect an overload with no arguments that returns the code point count of the 
> entire string or sequence. Without this, developers resort to verbose or less 
> efficient workarounds, such as using `codePoints().count()` (which streams 
> every code point, adding unnecessary overhead) or calling `codePointCount(0, 
> str.length())` (which is more verbose, requires a temporary variable when the 
> receiver is a long expression, and performs an extra bounds check).
> 
> A common use case involves enforcing maximum character limits on user input, 
> particularly for fields stored in databases such as MySQL or PostgreSQL. Both 
> database systems can consider the declared length of `VARCHAR(n)` columns as 
> the number of Unicode code points, not just the number of `char` units or 
> bytes for character sets like UTF-8 or UTF8MB4. Correctly counting code 
> points is essential for supporting internationalized input, emoji, and 
> non-BMP characters. Similarly, the NIST SP 800-63B guideline specifies that 
> password length should be counted in Unicode code points.
> 
> ## Solution
> 
> Introduce a no-argument `codePointCount()` method in both the 
> `CharSequence` interface and the `String` class. The new method returns the 
> number of Unicode code points in the entire character sequence, equivalent to 
> invoking `codePointCount(0, length())`, but provides better readability and 
> avoids the redundant bounds check. The implementation in `CharSequence` is a 
> default method, while `String` provides an explicit override for potential 
> performance optimization.
> 
> ## Specification
> 
> Add to `java.lang.CharSequence` interface:
> ```java
> /**
>  * Returns the number of Unicode code points in this character sequence.
>  * Equivalent to {@code Character.codePointCount(this, 0, length())}.
>  *
>  * @return the number of Unicode code points in this sequence
>  * @since N
>  */
> default int codePointCount() {
>     return Character.codePointCount(this, 0, length());
> }
> ```
> 
> Add to `java.lang.String` class:
> ```java
> /**
>  * Returns the number of Unicode code points in this string.
>  * Equivalent to {@code codePointCount(0, length())}.
>  *
>  * @return the number of Unicode code points in this string
>  * @since N
>  */
> @Override
> public int codePointCount() {
>     return codePointCount(0, length());
> }
> ```
> 
> Here, `N` refers to the next Java platform version in which this change will 
> be available.
> 
> Informative Supplement:
> 
> - Implementation: [GitHub PR 26461](https://github.com/openjdk/jdk/pull/26461)
> - Example use cases:
>     ```java
>     // User names stored in a MySQL (or PostgreSQL) VARCHAR(20) column,
>     // whose length is counted in code points:
>     if (userName.codePointCount() > 20) {
>         IO.println("The user name is too long to store in VARCHAR(20) "
>                 + "in utf8mb4 MySQL/PostgreSQL!");
>     }
>     // Password policy: require at least 8 Unicode characters
>     // (code points) as per NIST SP 800-63B:
>     if (password.codePointCount() < 8) {
>         IO.println("Password is too short!");
>     }
>     ```
> 
> References:
> - [MySQL VARCHAR 
> documentation](https://dev.mysql.com/doc/refman/8.0/en/char.html)
> - [PostgreSQL Character 
> Types](https://www.postgresql.org/docs/current/datatype-character.html)
> - [NIST SP 800-63B 
> §5.1.1.2](https://pages.nist.gov/800-63-4/sp800-63b.html#passwordver)
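
For comparison, here is a minimal sketch of the two workarounds the Problem section mentions, runnable on current JDKs. The class name and the static `codePointCount(CharSequence)` helper are hypothetical stand-ins for the proposed no-argument method; they are not part of the PR.

```java
class CodePointCountWorkarounds {
    // Hypothetical helper: stands in for the proposed no-argument method
    // by delegating to the existing Character.codePointCount overload.
    static int codePointCount(CharSequence seq) {
        return Character.codePointCount(seq, 0, seq.length());
    }

    public static void main(String[] args) {
        // "a", then U+1F600 (one code point, two UTF-16 char units), then "b"
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());                           // 4 (char units)
        System.out.println(s.codePointCount(0, s.length()));      // 3 (explicit range)
        System.out.println(s.codePoints().count());               // 3 (streams every code point)
        System.out.println(codePointCount(new StringBuilder(s))); // 3 (any CharSequence)
    }
}
```

The explicit-range form needs the receiver twice, which is why a long expression has to be stored in a temporary variable first.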


One phrase in the draft needs to be fixed:

> for character sets like UTF-8 or UTF8MB4.

↓

"for character sets like UTF-8 (utf8mb4 in MySQL)."
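
To illustrate why the unit matters here, the following sketch compares the three lengths the draft distinguishes (`char` units, code points, and UTF-8 bytes) for one string. It assumes, as the draft does, that the column length is counted in code points; `LengthUnits` and `fitsVarchar` are hypothetical names used only for this example.

```java
import java.nio.charset.StandardCharsets;

class LengthUnits {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points, using the existing ranged overload.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Hypothetical check: assumes VARCHAR(n) counts code points.
    static boolean fitsVarchar(String s, int n) {
        return codePoints(s) <= n;
    }

    public static void main(String[] args) {
        // U+20BB7 (outside the BMP), U+91CE, U+5BB6
        String s = "\uD842\uDFB7\u91CE\u5BB6";

        System.out.println(s.length());        // 4 char units
        System.out.println(codePoints(s));     // 3 code points
        System.out.println(utf8Bytes(s));      // 10 bytes (4 + 3 + 3)
        System.out.println(fitsVarchar(s, 3)); // true
    }
}
```

A check based on `length()` would reject this string for a three-character column even though it contains only three code points.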

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26461#issuecomment-3707771006
