On Sat, 26 Jul 2025 10:10:40 GMT, Tatsunori Uchino <[email protected]> wrote:
>> Adds `codePointCount()` overloads to `String`, `Character`,
>> `(Abstract)StringBuilder`, and `StringBuffer` to make it possible to
>> conveniently retrieve the length of a string as code points without extra
>> boundary checks.
>>
>>
>> if (superTremendouslyLongExpressionYieldingAString().codePointCount() > limit) {
>>     throw new Exception("exceeding length");
>> }
>>
>>
>> Is a CSR required for this change?
>
> Tatsunori Uchino has updated the pull request incrementally with four
> additional commits since the last revision:
>
> - Update `@bug` in correct file
> - Add default implementation on codePointCount in CharSequence
> - Update `@bug` entries in test class doc comments
> - Discard changes on code whose form is not `str.codePointCount(0,
> str.length())`
I had Copilot (GPT-4.1) create a draft:
> ## Summary
>
> Add a no-argument `codePointCount()` method to `CharSequence` and `String` to
> count the number of Unicode code points in the entire sequence or string.
>
> ## Problem
>
> Currently, `String.codePointCount` and `CharSequence.codePointCount` only
> provide an overload that requires start and end indices. Developers often
> expect an overload with no arguments that returns the code point count of the
> entire string or sequence. Without this, developers resort to verbose or less
> efficient workarounds, such as using `codePoints().count()` (which yields
> every code point, adding unnecessary overhead) or calling `codePointCount(0,
> str.length())` (which is more verbose, requires a temporary variable, and
> performs an extra boundary check).
>
> A common use case involves enforcing maximum character limits on user input,
> particularly for fields stored in databases such as MySQL or PostgreSQL. Both
> database systems can consider the declared length of `VARCHAR(n)` columns as
> the number of Unicode code points, not just the number of `char` units or
> bytes for character sets like UTF-8 or UTF8MB4. Correctly counting code
> points is essential for supporting internationalized input, emoji, and
> non-BMP characters. For example, the NIST SP 800-63B guideline specifies that
> passwords should be checked in terms of the number of Unicode code points.
>
> ## Solution
>
> Introduce no-argument `codePointCount()` methods in both the
> `CharSequence` interface and the `String` class. The new method returns the
> number of Unicode code points in the entire character sequence, equivalent to
> invoking `codePointCount(0, length())`, but provides better readability and
> avoids unnecessary overhead. The implementation in `CharSequence` is a
> default method, while `String` provides an explicit override for potential
> performance optimization.
>
> ## Specification
>
> Add to `java.lang.CharSequence` interface:
> ```java
> /**
>  * Returns the number of Unicode code points in this character sequence.
>  * Equivalent to {@code codePointCount(0, length())}.
>  *
>  * @return the number of Unicode code points in this sequence
>  * @since N
>  */
> default int codePointCount() {
>     return codePointCount(0, length());
> }
> ```
>
> Add to `java.lang.String` class:
> ```java
> /**
>  * Returns the number of Unicode code points in this string.
>  * Equivalent to {@code codePointCount(0, length())}.
>  *
>  * @return the number of Unicode code points in this string
>  * @since N
>  */
> @Override
> public int codePointCount() {
>     return codePointCount(0, length());
> }
> ```
>
> Here, `N` refers to the next Java platform version in which this change will
> be available.
>
> Informative Supplement:
>
> - Implementation: [GitHub PR 26461](https://github.com/openjdk/jdk/pull/26461)
> - Example use cases:
> ```java
> // For user names stored in MySQL (or PostgreSQL) VARCHAR(20), which counts code points:
> if (userName.codePointCount() > 20) {
>     IO.println("The user name is too long to store in VARCHAR(20) in utf8mb4 MySQL/PostgreSQL!");
> }
> // Password policy: require at least 8 Unicode characters (code points) as per NIST SP 800-63B:
> if (password.codePointCount() < 8) {
>     IO.println("Password is too short!");
> }
> ```
>
> References:
> - [MySQL VARCHAR
> documentation](https://dev.mysql.com/doc/refman/8.0/en/char.html)
> - [PostgreSQL Character
> Types](https://www.postgresql.org/docs/current/datatype-character.html)
> - [NIST SP 800-63B
> §5.1.1.2](https://pages.nist.gov/800-63-4/sp800-63b.html#passwordver)
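
For comparison, the workarounds named in the Problem section can be sketched on a current JDK as follows. This is only an illustrative sketch: the class name `CodePointCountWorkarounds` and its method names are hypothetical and are not part of the PR; only `String.codePointCount(int, int)`, `CharSequence.codePoints()`, and `Character.codePointCount(CharSequence, int, int)` are existing APIs.

```java
public class CodePointCountWorkarounds {
    // Range overload that String already provides (hypothetical wrapper name).
    static int viaRange(String s) {
        return s.codePointCount(0, s.length());
    }

    // Stream-based count; it visits every code point just to count them.
    static long viaStream(CharSequence s) {
        return s.codePoints().count();
    }

    // Static helper on Character; accepts any CharSequence, not only String.
    static int viaCharacter(CharSequence s) {
        return Character.codePointCount(s, 0, s.length());
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (encoded as a surrogate pair), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());       // 4 (char units)
        System.out.println(viaRange(s));      // 3 (code points)
        System.out.println(viaStream(s));     // 3
        System.out.println(viaCharacter(s));  // 3
    }
}
```

All three workarounds agree with each other; they differ from `length()` only when the text contains surrogate pairs.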
<details>
<summary>Markdown Source</summary>
## Summary
Add a no-argument `codePointCount()` method to `CharSequence` and `String` to
count the number of Unicode code points in the entire sequence or string.
## Problem
Currently, `String.codePointCount` and `CharSequence.codePointCount` only
provide an overload that requires start and end indices. Developers often
expect an overload with no arguments that returns the code point count of the
entire string or sequence. Without this, developers resort to verbose or less
efficient workarounds, such as using `codePoints().count()` (which yields every
code point, adding unnecessary overhead) or calling `codePointCount(0,
str.length())` (which is more verbose, requires a temporary variable, and
performs an extra boundary check).
A common use case involves enforcing maximum character limits on user input,
particularly for fields stored in databases such as MySQL or PostgreSQL. Both
database systems can consider the declared length of `VARCHAR(n)` columns as
the number of Unicode code points, not just the number of `char` units or bytes
for character sets like UTF-8 or UTF8MB4. Correctly counting code points is
essential for supporting internationalized input, emoji, and non-BMP
characters. For example, the NIST SP 800-63B guideline specifies that passwords
should be checked in terms of the number of Unicode code points.
## Solution
Introduce no-argument `codePointCount()` methods in both the
`CharSequence` interface and the `String` class. The new method returns the
number of Unicode code points in the entire character sequence, equivalent to
invoking `codePointCount(0, length())`, but provides better readability and
avoids unnecessary overhead. The implementation in `CharSequence` is a default
method, while `String` provides an explicit override for potential performance
optimization.
## Specification
Add to `java.lang.CharSequence` interface:
/**
 * Returns the number of Unicode code points in this character sequence.
 * Equivalent to {@code codePointCount(0, length())}.
 *
 * @return the number of Unicode code points in this sequence
 * @since N
 */
default int codePointCount() {
    return codePointCount(0, length());
}
Add to `java.lang.String` class:
/**
 * Returns the number of Unicode code points in this string.
 * Equivalent to {@code codePointCount(0, length())}.
 *
 * @return the number of Unicode code points in this string
 * @since N
 */
@Override
public int codePointCount() {
    return codePointCount(0, length());
}
Here, `N` refers to the next Java platform version in which this change will be
available.
Informative Supplement:
- Implementation: [GitHub PR 26461](https://github.com/openjdk/jdk/pull/26461)
- Example use cases:
```java
// For user names stored in MySQL (or PostgreSQL) VARCHAR(20), which counts code points:
if (userName.codePointCount() > 20) {
    IO.println("The user name is too long to store in VARCHAR(20) in utf8mb4 MySQL/PostgreSQL!");
}
// Password policy: require at least 8 Unicode characters (code points) as per NIST SP 800-63B:
if (password.codePointCount() < 8) {
    IO.println("Password is too short!");
}
```
References:
- [MySQL VARCHAR
documentation](https://dev.mysql.com/doc/refman/8.0/en/char.html)
- [PostgreSQL Character
Types](https://www.postgresql.org/docs/current/datatype-character.html)
- [NIST SP 800-63B
§5.1.1.2](https://pages.nist.gov/800-63-4/sp800-63b.html#passwordver)
</details>
Needs to be fixed:
> for character sets like UTF-8 or UTF8MB4.
↓
"for character sets like UTF-8 (utf8mb4 in MySQL)."
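
To illustrate why the sentence distinguishes code points from `char` units and bytes, here is a small sketch that measures the same string in all three units using only existing JDK APIs. The class name `LengthUnits` and its method names are made up for this illustration and do not appear in the PR.

```java
import java.nio.charset.StandardCharsets;

public class LengthUnits {
    // Number of UTF-16 char units, i.e. what String.length() reports.
    static int charUnits(String s) {
        return s.length();
    }

    // Number of bytes when the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points, using the existing range overload.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+00E9 (2 bytes in UTF-8) followed by U+1F600 (4 bytes, surrogate pair in UTF-16)
        String s = "\u00E9\uD83D\uDE00";
        System.out.println(charUnits(s));   // 3
        System.out.println(utf8Bytes(s));   // 6
        System.out.println(codePoints(s));  // 2
    }
}
```

A limit declared in code points would treat this string as length 2, while checks based on `length()` or the encoded byte count would see 3 and 6 respectively.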
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26461#issuecomment-3707771006