Re: .title() - annoying mistake

Chris Angelico Mon, 22 Mar 2021 12:01:36 -0700

On Tue, Mar 23, 2021 at 5:16 AM Karen Shaeffer via Python-list
<python-list@python.org> wrote:
>
> Hi Chris,
> Thanks for your comment.
>
> > Python doesn't work with UTF-8 encoded code points; it works with
> > Unicode code points. Are you looking for something that checks whether
> > something is a palindrome, or locates palindromes within it?
> >
> > def is_palindrome(txt):
> >    return txt == txt[::-1]
> >
> > Easy.
>
> Of course, it’s easy. It’s a Pythonic idiom! But it doesn’t work. And you know 
> that. You even explained a few reasons why it doesn’t work below. There are 
> many more instances of strings that do not work. Here are two:
>
> idx = 6    A man, a plan, a canal: Panama   is_palindrome() = False
> idx = 17    ab́cdeedcb́a   is_palindrome() = False
>
> The palindrome isn’t worth any more time. It isn’t even a good example.
>
> In my experience processing unstructured, multilingual text, you encounter a 
> wide array of variances in both the text and in the encoding details, 
> including outright errors. You have to account for all of them, because 
> 99.99% of that text is valuable to you.
>
> The key idea: If you care about the details, working with unstructured 
> multilingual text is complicated. There are no easy solutions.
>
>
> >
> > Efficiently finding substring palindromes would be a bit harder, but
> > that'd be true even if you restricted it to ASCII. The advantage of
> > Python's way of doing it is that, if you have a method that would work
> > with ASCII bytes, the exact same thing will work with a Unicode
> > string.
> >
> > There's another big wrinkle not touched here, and that's what to do
> > with combining characters. Python makes it easy to normalize text as
> > much as is possible, and an NFC normalization would help a lot, but
> > it's not going to do everything. So you may want to first define a
> > proper way to split a string into whatever you're defining a character
> > to be, and that's a very difficult problem, regardless of programming
> > language. For example, Arabic text changes in visual shape when
> > letters are next to each other, and Greek has two different forms for
> > the letter sigma (U+03C2 and U+03C3) - should those distinctions
> > affect palindromminess? What about ligatures - is U+FB01 "ﬁ" a single
> > character, or should it be matched by "if" on the other end?
> >
> > What part of this is trivial in Go?
>
> Go is simpler than Python. Both languages have the capabilities to solve any 
> text processing problem. I’m still learning Go, so I can’t really say more.
>
> Personally, I like Python for text processing. You can usually get 
> satisfactory results very quickly for most of the input space. And if you 
> don’t care about all the gotchas, then you are good to go.
>
> I have no more time for this. Thanks for your comment. I learned a little 
> reading the long thread dealing with .title(). (chuckles ;)
>


Hey, you're the one who brought up palindrome testing as a difficult
problem in Python :) Your post implied that it was easier in Go, and I
can't see that that's possible.
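
For what it's worth, here's a rough sketch of how the two failing
examples quoted above could be handled in Python. The names clusters()
and is_palindrome_loose() are made up for illustration, and grouping
combining marks via unicodedata.combining() is only an approximation of
real grapheme segmentation, so treat it as a starting point under those
assumptions rather than a complete answer.

```python
import unicodedata


def clusters(txt):
    """Group each base code point with the combining marks that follow it.

    Assumption: this is a crude stand-in for proper grapheme cluster
    segmentation, which the standard library does not provide.
    """
    out = []
    for ch in unicodedata.normalize("NFC", txt):
        if out and unicodedata.combining(ch):
            # Non-zero combining class: attach the mark to the previous unit.
            out[-1] += ch
        else:
            out.append(ch)
    return out


def is_palindrome_loose(txt):
    """Palindrome test that ignores case, spaces, and punctuation."""
    # Keep only units whose base code point is alphanumeric, then casefold.
    units = [c.casefold() for c in clusters(txt) if c[0].isalnum()]
    return units == units[::-1]


print(is_palindrome_loose("A man, a plan, a canal: Panama"))
print(is_palindrome_loose("ab\u0301cdeedcb\u0301a"))
```

Both quoted examples come out as palindromes with this, while a string
like "abc" still does not. It does nothing about ligatures or
script-specific shaping, which is where it gets genuinely hard.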

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
