Re: Request for Comments: Adding bulk-read method "CharSequence.getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin)"

Markus Karg Sat, 26 Oct 2024 09:54:00 -0700

Chen,


thank you for chiming in! :-)

 

I am answering in the draft PR's discussion thread, where you have since asked 
the same question, to avoid duplicating the discussion.

 

https://github.com/openjdk/jdk/pull/21730 

 

-Markus

 

 

From: Chen Liang [mailto:liangchenb...@gmail.com] 
Sent: Saturday, 26 October 2024 18:06
To: Markus Karg
Cc: core-libs-dev
Subject: Re: Request for Comments: Adding bulk-read method 
"CharSequence.getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin)"

 

Hi Markus,
Should we drop the srcBegin/srcEnd parameters, as they can be replaced by a 
subSequence(srcBegin, srcEnd) call?
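
To make the question concrete, here is a minimal sketch of the two shapes being compared. The class RangeVsSubSequence and both static helpers are hypothetical names chosen for illustration; neither is part of the JDK or of the proposal itself.

```java
/** Hypothetical helper contrasting a range-less bulk read with the ranged form. */
final class RangeVsSubSequence {

    /** Range-less variant: copies all of src into dst, starting at dstBegin. */
    static void getChars(CharSequence src, char[] dst, int dstBegin) {
        for (int i = 0; i < src.length(); i++) {
            dst[dstBegin + i] = src.charAt(i);
        }
    }

    /** Ranged variant, expressed through subSequence() plus the range-less one. */
    static void getChars(CharSequence src, int srcBegin, int srcEnd,
                         char[] dst, int dstBegin) {
        getChars(src.subSequence(srcBegin, srcEnd), dst, dstBegin);
    }
}
```

One point the sketch makes visible: subSequence() returns a CharSequence object of its own, and for String it delegates to substring(), so the range-less form may cost an extra allocation per call.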

 

On Fri, Oct 25, 2024, 12:34 PM Markus Karg <mar...@headcrashing.eu> wrote:

I hereby request comments on the proposal to generalize the signature of the 
existing method "String.getChars()" into a new default interface method 
"CharSequence.getChars()".

 

Problem

 

For performance reasons, many CharSequence implementations, in particular 
String, StringBuilder, StringBuffer and CharBuffer, provide a way to bulk-read 
a complete region of their character content into a provided char array.

Unfortunately, there is no _uniform_ way to perform this, and there is no 
guarantee that a given CharSequence implements bulk reading at all, custom 
ones in particular.

While String, StringBuilder and StringBuffer all share the same getChars() 
method signature for this purpose, CharBuffer's way to perform the very same is 
the get() method.

Other implementations have other method signatures, or do not have any solution 
to this problem at all.

In particular, there is no method in their common interface, CharSequence, to 
perform such a bulk-optimized read, as CharSequence only allows reading one 
character after the next in a sequential way, either by iterating over charAt() 
from 0 to length(), or by consuming the chars() stream.

 

As a result, code that wants to read from CharSequence in an 
implementation-agnostic, but still bulk-optimized way, needs to know _each_ 
possible implementation's specific method!

Effectively this results in code like this (real-world example taken from the 
implementation of Reader.of(CharSequence) in JDK 24):

 

    switch (cs) {
        case String s -> s.getChars(next, next + n, cbuf, off);
        case StringBuilder sb -> sb.getChars(next, next + n, cbuf, off);
        case StringBuffer sb -> sb.getChars(next, next + n, cbuf, off);
        case CharBuffer cb -> cb.get(next, cbuf, off, n);
        default -> {
            for (int i = 0; i < n; i++)
                cbuf[off + i] = cs.charAt(next + i);
        }
    }

 

The problem with this code is that it is bound and limited to exactly that 
given set of CharSequence implementations.

If a future CharSequence implementation is to be accessed in a bulk-optimized 
way, the switch has to be extended and recompiled _every time_.

If some custom CharSequence implementation is used that this code is not aware 
of, sequential read is applied, even if that implementation _does_ provide some 
bulk-read method!
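
The fallback problem can be reproduced with a small sketch. CountingSequence, its copyTo() method and SwitchCopy are hypothetical names invented for this illustration; the dispatch mirrors the switch above but is written with instanceof patterns so that it also compiles on older JDKs.

```java
import java.nio.CharBuffer;

/** Hypothetical custom sequence that has a bulk-read method of its own. */
final class CountingSequence implements CharSequence {
    private final char[] chars;
    int charAtCalls; // counts how often the per-character path was taken

    CountingSequence(String s) { this.chars = s.toCharArray(); }

    @Override public int length() { return chars.length; }
    @Override public char charAt(int index) { charAtCalls++; return chars[index]; }
    @Override public CharSequence subSequence(int start, int end) {
        return new CountingSequence(new String(chars, start, end - start));
    }
    @Override public String toString() { return new String(chars); }

    /** Bulk read that generic callers have no way to discover. */
    void copyTo(int from, char[] dst, int off, int n) {
        System.arraycopy(chars, from, dst, off, n);
    }
}

final class SwitchCopy {
    /** Copies n chars starting at index next of cs into cbuf at offset off. */
    static void read(CharSequence cs, int next, char[] cbuf, int off, int n) {
        if (cs instanceof String s) {
            s.getChars(next, next + n, cbuf, off);
        } else if (cs instanceof StringBuilder builder) {
            builder.getChars(next, next + n, cbuf, off);
        } else if (cs instanceof StringBuffer buffer) {
            buffer.getChars(next, next + n, cbuf, off);
        } else if (cs instanceof CharBuffer cb) {
            cb.get(next, cbuf, off, n);
        } else {
            // Unknown type: falls back to one charAt() call per character.
            for (int i = 0; i < n; i++) {
                cbuf[off + i] = cs.charAt(next + i);
            }
        }
    }
}
```

Reading five characters from a CountingSequence through SwitchCopy.read() leaves charAtCalls at 5, while copyTo() is never invoked.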

 

Solution

 

There are several possible alternative solutions:

* (A) CharSequence.getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) 
- As this signature is already supported by String, StringBuffer and 
StringBuilder, I hereby propose to add this signature to CharSequence and 
provide a default implementation that iterates over charAt(int) from srcBegin 
to srcEnd.

* (B) Alternatively the same default method could be implemented using the 
chars() Stream - I assume that might run slower, but correct me if I am wrong.

* (C) Alternatively we could go with the signature get(char[] dst, int offset, 
int length) - Only CharBuffer implements that already, so more changes are 
needed and more duplicate methods will exist in the end.

* (D) Alternatively we could come up with a totally different signature - That 
would be most fair to all existing implementations, but in the end it will 
imply the most changes and the most duplicate methods.

* (E) We could give up the idea and live with the situation as-is. - I assume 
only few people really prefer that outcome.
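
As a rough illustration of alternative (A), the sketch below models the idea on a stand-in interface, since CharSequence itself cannot be changed from user code. BulkCharSequence, Repeated and ArrayChars are hypothetical names, and the bounds checks are an assumption about what a default implementation might do, not the wording of any actual patch.

```java
import java.util.Arrays;
import java.util.Objects;

/** Hypothetical stand-in for CharSequence with the method from alternative (A). */
interface BulkCharSequence extends CharSequence {
    /** Default: per-character loop over charAt(), valid for any implementation. */
    default void getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) {
        Objects.checkFromToIndex(srcBegin, srcEnd, length());
        Objects.checkFromIndexSize(dstBegin, srcEnd - srcBegin, dst.length);
        for (int i = srcBegin; i < srcEnd; i++) {
            dst[dstBegin++] = charAt(i);
        }
    }
}

/** Hypothetical implementation that simply keeps the inherited default. */
final class Repeated implements BulkCharSequence {
    private final char c;
    private final int length;

    Repeated(char c, int length) { this.c = c; this.length = length; }

    @Override public int length() { return length; }
    @Override public char charAt(int index) {
        Objects.checkIndex(index, length);
        return c;
    }
    @Override public CharSequence subSequence(int start, int end) {
        Objects.checkFromToIndex(start, end, length);
        return new Repeated(c, end - start);
    }
    @Override public String toString() { return String.valueOf(c).repeat(length); }
}

/** Hypothetical array-backed implementation that overrides with a bulk copy. */
final class ArrayChars implements BulkCharSequence {
    private final char[] chars;

    ArrayChars(char[] chars) { this.chars = chars.clone(); }

    @Override public int length() { return chars.length; }
    @Override public char charAt(int index) { return chars[index]; }
    @Override public CharSequence subSequence(int start, int end) {
        Objects.checkFromToIndex(start, end, chars.length);
        return new ArrayChars(Arrays.copyOfRange(chars, start, end));
    }
    @Override public String toString() { return new String(chars); }

    @Override public void getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) {
        Objects.checkFromToIndex(srcBegin, srcEnd, chars.length);
        System.arraycopy(chars, srcBegin, dst, dstBegin, srcEnd - srcBegin);
    }
}
```

With the method on the interface, a caller needs only a single call such as cs.getChars(next, next + n, cbuf, off); whether that call runs the charAt() loop or a bulk copy is decided by the implementation rather than by a type switch in the caller.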

 

Please tell me if I missed a viable option!

 

As a side benefit of CharSequence.getChars(), its existence might trigger 
implementors to provide bulk-reading if not done yet, at least for those cases 
where it is actually feasible.

In the same way it might trigger callers of Reader to start making use of bulk 
reading, at least in those cases where it does make sense but application 
authors were reluctant to implement the switch-case shown above.

 

Hence, if nobody vetoes, I will file a Jira issue, PR and CSR for 
"CharSequence.getChars()" (alternative A) in the next few days.

 

-Markus
