Re: .title() - annoying mistake

Karen Shaeffer via Python-list Mon, 22 Mar 2021 11:17:39 -0700

Hi Chris,
Thanks for your comment.

> Python doesn't work with UTF-8 encoded code points; it works with
> Unicode code points. Are you looking for something that checks whether
> something is a palindrome, or locates palindromes within it?
> 
> def is_palindrome(txt):
>    return txt == txt[::-1]
> 
> Easy.


Of course, it’s easy. It’s a Pythonic idiom! But it doesn’t work. And you know 
that. You even explained a few reasons why it doesn’t work below. There are 
many more strings for which it fails. Here are two:

idx = 6    A man, a plan, a canal: Panama   is_palindrome() = False
idx = 17    ab́cdeedcb́a   is_palindrome() = False
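
To make the failure concrete: the first string fails on case, spaces, and punctuation, and the second fails because reversing code points moves each combining accent onto the wrong base letter. Here is a minimal sketch of a more tolerant check, assuming that case, whitespace, and punctuation should be ignored and that a combining mark belongs to the base character before it. The names clusters() and is_loose_palindrome() are made up for illustration, and the grouping is only a rough stand-in for full grapheme cluster segmentation:

```python
import unicodedata


def clusters(txt):
    """Split txt into base characters, each with its trailing combining marks.

    Rough approximation only; real grapheme segmentation has more rules.
    """
    out = []
    for ch in unicodedata.normalize("NFC", txt):
        if unicodedata.combining(ch) and out:
            out[-1] += ch  # attach the mark to the preceding base character
        else:
            out.append(ch)
    return out


def is_loose_palindrome(txt):
    # Assumption: case, spaces, and punctuation do not count.
    units = [c for c in clusters(txt.casefold()) if c[0].isalnum()]
    return units == units[::-1]


print(is_loose_palindrome("A man, a plan, a canal: Panama"))
print(is_loose_palindrome("ab\u0301cdeedcb\u0301a"))
```

Under those assumptions both strings are accepted, but a different definition of "character" would give different answers, which is the point.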

The palindrome isn’t worth any more time. It isn’t even a good example.

In my experience processing unstructured, multilingual text, you encounter a 
wide array of variances in both the text and in the encoding details, including 
outright errors. You have to account for all of them, because 99.99% of that 
text is valuable to you.
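
As one small illustration of accounting for encoding errors, here is a sketch of lenient decoding. It assumes the input is usually UTF-8 and sometimes Windows-1252; the helper name decode_lenient() and the choice of fallback encodings are my own assumptions, not a general recipe:

```python
def decode_lenient(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding strictly; if all fail, decode as UTF-8 with U+FFFD."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep what we can and mark the damage.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


text, how = decode_lenient(b"caf\xe9")
print(text, how)
```

Returning the encoding that was actually used lets later stages decide how much to trust the text.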

The key idea: if you care about the details, working with unstructured 
multilingual text is complicated. There are no easy solutions.


> 
> Efficiently finding substring palindromes would be a bit harder, but
> that'd be true even if you restricted it to ASCII. The advantage of
> Python's way of doing it is that, if you have a method that would work
> with ASCII bytes, the exact same thing will work with a Unicode
> string.
> 
> There's another big wrinkle not touched here, and that's what to do
> with combining characters. Python makes it easy to normalize text as
> much as is possible, and an NFC normalization would help a lot, but
> it's not going to do everything. So you may want to first define a
> proper way to split a string into whatever you're defining a character
> to be, and that's a very difficult problem, regardless of programming
> language. For example, Arabic text changes in visual shape when
> letters are next to each other, and Greek has two different forms for
> the letter sigma (U+03C2 and U+03C3) - should those distinctions
> affect palindromminess? What about ligatures - is U+FB01 "ﬁ" a single
> character, or should it be matched by "if" on the other end?
> 
> What part of this is trivial in Go?

Go is a simpler language than Python. Both languages are capable of solving any 
text processing problem. I’m still learning Go, so I can’t really say more.

Personally, I like Python for text processing. You can usually get satisfactory 
results very quickly for most of the input space. And if you don’t care about 
all the gotchas, then you are good to go.

I have no more time for this. Thanks for your comment. I learned a little 
reading the long thread dealing with .title(). (chuckles ;)

Humbly,
Karen


-- 
https://mail.python.org/mailman/listinfo/python-list
