[issue30717] Add unicode grapheme cluster break algorithm

Manish Sun, 05 Jan 2020 19:22:17 -0800

Manish <[email protected]> added the comment:

Hi,


Unicodey person here: I'm involved in Unicode itself and also maintain an
implementation of this particular spec[1].


So, firstly,

> "a⃑".center(width=5, fillchar=".")

If you're trying to do terminal width stuff, extended grapheme clusters *will
not* solve the problem for you. There is no algorithm specified in Unicode that
computes display width, because that is font-dependent. Extended grapheme
clusters are better than code points for this, however, and will never produce
*worse* results.
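
To make the code point problem concrete, here is a small sketch using only the
stdlib. The names (s, padded, visible) are just for illustration, and counting
non-mark characters is an assumption standing in for "columns a terminal would
probably use", not a real width algorithm:

```python
import unicodedata

# 'a' followed by U+20D1 COMBINING RIGHT HARPOON ABOVE: one user-perceived
# character, two code points.
s = "a\u20d1"
code_points = len(s)

# str.center() pads by code point count (and only takes positional arguments).
padded = s.center(5, ".")

# Rough stand-in for display columns: skip combining marks (categories
# Mn, Mc, Me). This is an approximation, not a width algorithm.
visible = sum(
    1 for ch in padded if not unicodedata.category(ch).startswith("M")
)

print(code_points, len(padded), visible)
```

The padded string has the requested 5 code points, but only 4 of them are
likely to take up a column, so the result renders one column short.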


It's fine to expose this, but it's worth adding caveats.

Also, yes, please do not expose a direct indexing function. Aside from almost 
all Unicode algorithms being streaming algorithms and thus inefficient to index 
directly, needing to directly index a grapheme cluster is almost always a sign 
that you are making a mistake.
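
As a sketch of what a streaming (non-indexing) API could look like: the
generator below is hypothetical, and it deliberately does *not* implement the
actual segmentation rules. It only glues combining marks onto the preceding
character, which is an assumption made to keep the example short:

```python
import unicodedata
from typing import Iterable, Iterator


def iter_clusters_approx(chars: Iterable[str]) -> Iterator[str]:
    """Yield approximate clusters: a character plus any following marks.

    NOT the real extended grapheme cluster algorithm; it ignores Hangul
    syllables, ZWJ emoji sequences, regional indicators, and so on.
    """
    current = ""
    for ch in chars:
        # General categories Mn, Mc, Me are combining marks.
        if current and unicodedata.category(ch).startswith("M"):
            current += ch
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current


for cluster in iter_clusters_approx("a\u20d1b"):
    print(repr(cluster))
```

It consumes its input one character at a time and keeps only the current
cluster in memory, so it works on any iterable of characters. Getting "the
n-th cluster" still means walking from the start, which is why direct indexing
is a poor fit.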

> Yes. I clearly don't want this PR to be interpreted as "we're needing ICU". 
> ICU provides much much more than what I'm willing to provide. My goal here is 
> just to "fix" how the python's standard library iterates over characters. 
> Nothing more, nothing less.

I think it would be a mistake to make the stdlib use this for most notions of
what a "character" is; as I said, this notion is also inaccurate. Having an
iterator library somewhere that you can use and compose is great, but changing
the internal workings of string operations would be a major change, and not an
entirely productive one.

There's only one language I can think of that uses extended grapheme clusters
as its default notion of "character": Swift. Swift is largely designed for UI
stuff, and it makes sense in this context. This is also baked in very deeply to
the language (e.g. their Character type is a thin wrapper around String, since
grapheme clusters can be arbitrarily large). You'd need a pretty major paradigm
shift for Python to make a similar change, and it doesn't make as much sense
for Python in the first place.

Starting off with a library published to PyPI makes sense to me.


 [1]: https://github.com/unicode-rs/unicode-segmentation/

----------
nosy: +Manishearth

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue30717>
_______________________________________