> On 22 May 2017, at 12:15, Bruno P. Kinoshita
> <brunodepau...@yahoo.com.br.INVALID> wrote:
>
> I'd be in favour of 2 or some variation of it. Provide a well documented
> naïve implementation, and use whatever is available at the JVM for handling
> upper/lower case.
>
> I would use it for very simple cases, where all I need is to capitalise each
> word, and where it would be OK to have possible mistakes in case you have to
> handle text that is not in English, or special cases like some new
> mathematical symbol (e.g. U+1D52B mathematical fraktur small N, which also
> uses surrogate to make it even more interesting).
>
> For cases where I have to take care of different languages (e.g. ch digraph
> for Czech) I would probably use ICU.
>
> For cases that depend on the country, context, or some other feature (e.g.
> names in Dutch with the van preposition) I would probably look at OpenNLP
> with a machine learning or rule based approach.
>
> The issue is that when all I need is the very simple approach now, I would
> have to write something like a for-loop or Java 8 stream and split the text,
> then call toUpperCase on each first char, then write tests for it, etc. I
> think for this case it would still be worth having our simple implementation
> in [text], with docs explaining what it is capable of, and what it is not.
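
For reference, the kind of naïve loop described above might look roughly like the sketch below. The class and method names (NaiveCapitalize, capitalizeWords) are made up for illustration and are not part of [text] or WordUtils. It walks the string by code point rather than by char, so a supplementary character such as U+1D52B is never split across its surrogate pair, but it applies no locale, digraph or context rules at all.

```java
final class NaiveCapitalize {

    // Hypothetical helper, not an existing Commons Text API.
    // Upper-cases the first code point of each whitespace-delimited word.
    // Deliberately simplistic: no locale rules, no digraphs, no context.
    static String capitalizeWords(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean startOfWord = true;
        int i = 0;
        while (i < text.length()) {
            // codePointAt reads a whole surrogate pair, so supplementary
            // characters are never cut in half.
            int cp = text.codePointAt(i);
            int width = Character.charCount(cp); // 1 or 2 chars consumed
            if (Character.isWhitespace(cp)) {
                startOfWord = true;
            } else if (startOfWord) {
                // Locale-insensitive, single code point mapping only.
                cp = Character.toUpperCase(cp);
                startOfWord = false;
            }
            out.appendCodePoint(cp);
            i += width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("hello wide world")); // Hello Wide World
    }
}
```

Note that U+1D52B has no upper-case mapping, so the sketch passes it through unchanged. Whether that counts as correct is exactly the sort of limitation the docs would have to spell out.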

I’m beginning to think a simple implementation is harmful: developers will
use it in places where they ought to be doing something more locale-specific.
Even in English, trivially capitalising each word is rarely correct. We
should avoid the topic entirely if we don't do it justice, and I agree that
OpenNLP seems a good home for this sort of work.

Consequently, I see little benefit in rewriting the WordUtils capitalisation
methods and plan to leave them alone. I’m also tempted to extend their
Javadocs to point out the deficiencies. I’ll instead focus on pulling out the
wrapping methods into something more object-oriented. Perhaps after that, we
could deprecate WordUtils for removal in 2.0?

> Cheers
> Bruno
>
> [] https://codepoints.net/U+1D52B?lang=en
> [] https://en.wikipedia.org/wiki/Ch_(digraph)#Czech
> [] https://en.wikipedia.org/wiki/Van_(Dutch)#Collation_and_capitalisation
> ________________________________
> From: Duncan Jones <dun...@wortharead.com>
> To: Commons Developers List <dev@commons.apache.org>
> Sent: Monday, 22 May 2017 12:06 AM
> Subject: [TEXT] How do we want to handle case conversions?
>
> Hi everyone,
>
> I’ve found some time to continue breaking WordUtils into separate classes
> (eschewing the “big collection of static methods” approach). However, as I
> read more about case handling in Unicode, I realise how simplistic the
> WordUtils methods are and how complex a full solution would need to be.
>
> Section 5.18 of the Unicode specification [1] describes these complexities.
> The main ones that bother me are:
>
> 1. Title case conversions vary widely between different locales and
> languages. I’m not clear whether any locale is satisfied by the current
> simplistic implementation in WordUtils.capitalize(str). Supporting this
> correctly would be a serious challenge.
>
> 2. All types of case conversion may vary depending upon context/locale.
> There are examples provided in [1] where the outcome is different in a
> Turkish locale, or depends on whether the letter in question is followed by
> another.
>
> Does anyone have a suggestion for how to move forward with this work? I see
> three options:
>
> 1] Admit defeat and avoid the case conversion mess entirely.
> 2] Mimic the existing functionality, but document the limitations.
> 3] Attempt to deliver a locale-dependent version, perhaps still with
> limitations (or for certain languages).
>
> I’m leaning towards 2, perhaps even calling the classes “SimpleX…”.
>
> Thanks,
>
> Duncan
>
> [1] http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
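
To make the locale and context dependence mentioned in the original message concrete, here is a small sketch using only the JDK's String.toUpperCase(Locale) and String.toLowerCase(Locale). The class name LocaleCaseDemo and its two helper methods are hypothetical and exist only for this illustration. It shows the Turkish dotted/dotless i and the Greek final sigma, where the lower-case form depends on whether another letter follows.

```java
import java.util.Locale;

final class LocaleCaseDemo {

    // Thin wrappers around the JDK calls, kept only so the demo reads clearly.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale dependence: in Turkish, i upper-cases to dotted capital I (U+0130)
        // and I lower-cases to dotless i (U+0131).
        System.out.println(upper("title", Locale.ROOT)); // TITLE
        System.out.println(upper("title", turkish));     // T\u0130TLE
        System.out.println(lower("TITLE", turkish));     // t\u0131tle

        // Context dependence: capital sigma (U+03A3) lower-cases to final sigma
        // (U+03C2) at the end of a word and to U+03C3 elsewhere.
        String greek = "\u039F\u0394\u039F\u03A3";
        System.out.println(lower(greek, Locale.ROOT));   // ends with U+03C2
    }
}
```

A per-character toUpperCase call, as in the existing WordUtils methods, cannot express either behaviour, because both need the locale or the neighbouring characters.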