> If a method doesn't intrinsically require a String, then I prefer CharSequence. It's probable that sooner or later something is going to demand a String, but that's not a good reason to be "that guy" :-)

I lean towards using CharSequence when that makes sense too (i.e. when it suggests we are working on code points, and when we want to support other implementations of CharSequence). The tdebatty/java-string-similarity library works only with Strings, I think. Others, like LingPipe, ICU4J, Lucene, Apache Commons Text, and Apache OpenNLP, use both CharSequence and String.

Analysing the use of CharSequence and String could be an interesting idea for a blog post, and could even raise some tickets to fix consistency in the API of [text] or some other component/project.

> Also, wouldn't some sort of low-space-overhead string storage be a good fit
> for text?

Sounds interesting. Normally when I have an idea like that for [text] (or for other projects/components), I first note it down somewhere (normally at http://kinoshita.eti.br/todo/), and then file an issue like TEXT-71, TEXT-77, TEXT-78, or TEXT-79 to start investigating it.

If you have some idea of how that could be implemented, or know about some projects for that, feel free to suggest it in a JIRA ticket, or start another thread here on the mailing list.

Cheers
Bruno

________________________________
From: Simon Spero <sesunc...@gmail.com>
To: Commons Developers List <dev@commons.apache.org>
Sent: Tuesday, 20 June 2017 1:39 AM
Subject: CharSequence vs. String (was Re: [GitHub] commons-text pull request #46: TEXT-85:Added CaseUtils class with camel case...)

On Jun 12, 2017 10:47 AM, "arunvinudss" <g...@git.apache.org> wrote:

> Github user arunvinudss commented on a diff in the pull request:
>
> I am a bit biased towards using String instead of CharSequence. Yes, CharSequence allows us to pass StringBuffers, StringBuilders and other types as input, potentially increasing the scope of the function, but considering the nature of the work we do in this particular method it may not necessarily be a good idea. My basic contention is that the minute we call toString() on a CharSequence to do any sort of manipulation it becomes a costly operation and we may lose performance.

True if the particular CharSequence is not in fact an instance of String. String::toString returns this.

The bigger problem is that too many methods use String as a parameter or return type, when CharSequence would serve just as well. This indeed requires the invocation of Object::toString.

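
A minimal Java sketch of the toString() point above. The class and method names (CharSequenceDemo, countUpperCase) are made up for illustration and are not part of [text]. It shows that String::toString returns the same instance, while StringBuilder::toString allocates a new String, and that a method written against CharSequence can avoid the conversion altogether:

```java
final class CharSequenceDemo {

    // Counts uppercase code points without ever calling toString(),
    // so nothing is copied when the argument is a StringBuilder, CharBuffer, etc.
    static int countUpperCase(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            if (Character.isUpperCase(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "Commons Text";
        StringBuilder sb = new StringBuilder(s);

        // String::toString returns this, so no allocation happens here.
        System.out.println(s.toString() == s);

        // StringBuilder::toString builds a fresh String on every call.
        System.out.println(sb.toString() == sb.toString());

        // The same method serves both argument types.
        System.out.println(countUpperCase(s));
        System.out.println(countUpperCase(sb));
    }
}
```

A method declared as countUpperCase(String) would force the StringBuilder caller to pay for that copy first.
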
For methods that use String as the return type, changing the result to CharSequence is source and binary incompatible, and properly so (since at some point the user may actually need a String).

A generic method with a type parameter bounded by CharSequence (T extends CharSequence) can sometimes be useful, and can be added in addition to methods taking String arguments, but can't replace them.

There are some places in javac that have special treatment for String (for example, the + operator), but jdk9 reduces that particular win by indyfying concat.

If a method doesn't intrinsically require a String, then I prefer CharSequence. It's probable that sooner or later something is going to demand a String, but that's not a good reason to be "that guy" :-)

Note: Strings can be an incredible waste of memory; 40 + ⌈length/4⌉ bytes (reduced to a mere 40 + ⌈length/8⌉ bytes in jdk9 when compact strings can be used). This is incredibly painful if you have a vast number of small "strings", which may not all need to be materialized simultaneously.

See e.g. [1] (~50MiB of UTF-8 chars becomes ~250MiB of Strings. And since there's no individual humongous object, they all get to make the journey from TLAB to Old Space the hard way. Note this predates jdk 9, but illustrates some of the win from compact strings).

Storing the character data in a shared byte array is a huge win. Someone should tell the jdk implementors to look at applications that do this. Like, um, javac :-)

Materializing these strings as possibly transient CharSequences is really convenient... until some method just has to have a String.

Also, wouldn't some sort of low-space-overhead string storage be a good fit for text?

Simon

[1] Spero, S. (2015). Time And Relative Dimensions In Semantics: Is OWL Bigger On The Inside? OWLED 2015.
Available at http://cgi.csc.liv.ac.uk/~valli/OWLED2015/OWLED_2015_paper_12.pdf
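
A rough sketch of the shared-byte-array idea from the message above, assuming Latin-1 (ISO-8859-1) input so that each char fits in one byte. PackedStrings and Slice are hypothetical names, not an existing [text] or JDK API, and this is only one possible layout:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PackedStrings {
    private byte[] data = new byte[64]; // shared storage for all entries
    private int used = 0;               // number of bytes filled so far

    // Copies the text into the shared array and returns a lightweight view.
    // Assumption: text is Latin-1 only; other chars would be encoded as '?'.
    CharSequence add(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        if (used + bytes.length > data.length) {
            int newSize = Math.max(data.length * 2, used + bytes.length);
            data = Arrays.copyOf(data, newSize);
        }
        System.arraycopy(bytes, 0, data, used, bytes.length);
        Slice slice = new Slice(used, bytes.length);
        used += bytes.length;
        return slice;
    }

    // Inner (non-static) class: always reads the pool's current array,
    // so views stay valid after the array has been grown.
    private final class Slice implements CharSequence {
        private final int offset;
        private final int length;

        Slice(int offset, int length) {
            this.offset = offset;
            this.length = length;
        }

        @Override
        public int length() {
            return length;
        }

        @Override
        public char charAt(int index) {
            if (index < 0 || index >= length) {
                throw new IndexOutOfBoundsException("index: " + index);
            }
            return (char) (data[offset + index] & 0xFF);
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            if (start < 0 || end > length || start > end) {
                throw new IndexOutOfBoundsException(start + ".." + end);
            }
            return new Slice(offset + start, end - start);
        }

        // Materializes a real String only when a caller insists on one.
        @Override
        public String toString() {
            return new String(data, offset, length, StandardCharsets.ISO_8859_1);
        }
    }
}
```

Each Slice is still an object, so a tighter variant could hand out int offsets instead of views. The sketch only illustrates why APIs that accept CharSequence let such views be used without materializing a String.
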