On Wed, Jan 3, 2018 at 2:36 AM, Robin Becker <ro...@reportlab.com> wrote:
> On 02/01/2018 15:18, Chris Angelico wrote:
>>
>> On Wed, Jan 3, 2018 at 1:30 AM, Robin Becker <ro...@reportlab.com> wrote:
>>>
>>> I'm seeing some strange characters in web responses eg
>>>
>>> u'\u200e28\u200e/\u200e09\u200e/\u200e1962'
>>>
>>> for a date of birth. The code \u200e is LEFT-TO-RIGHT MARK according to
>>> unicodedata.name. I tried unicodedata.normalize, but it leaves those
>>> characters there. Is there any standard way to deal with these?
>>>
>>> I assume that some browser+settings combination is putting these in eg
>>> perhaps the language is normally right to left but numbers are not.
>>
>>
>> Unicode normalization is a different beast altogether. You could
>> probably just remove the LTR marks and run with the rest, though, as
>> they don't seem to be important in this string.
>>
>> ChrisA
>>
> I guess I'm really wondering whether the BIDI control characters have any
> semantic meaning. Most numbers seem to be LTR.
>
> If I saw u'\u200f12' it seems to imply that the characters should be
> displayed '21', but I don't know whether the number is 12 or 21.
>

In this particular situation, it's highly unlikely that they'll have
any influence, and even if they do, I don't think there's any way for
the *repeated* directionality markers to do anything. They look like
something added automatically for the sake of paranoia.

ChrisA
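
For reference, here is a minimal sketch of the "just remove the marks" approach discussed above. The names BIDI_CONTROLS and strip_bidi_controls are hypothetical, and the choice of which characters to strip (the marks plus the explicit embedding, override and isolate controls) is an assumption about what counts as safe to drop; trim the set if you only care about U+200E. It also checks that normalization really does leave the marks in place.

```python
import unicodedata

# Assumed set of bidi formatting characters to drop; adjust as needed.
BIDI_CONTROLS = {
    "\u061c",  # ARABIC LETTER MARK
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u202a",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c",  # POP DIRECTIONAL FORMATTING
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FIRST STRONG ISOLATE
    "\u2069",  # POP DIRECTIONAL ISOLATE
}


def strip_bidi_controls(text):
    """Return text with the characters in BIDI_CONTROLS removed."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


raw = "\u200e28\u200e/\u200e09\u200e/\u200e1962"

# Normalization does not touch the marks, as reported in the thread.
assert unicodedata.normalize("NFKC", raw) == raw

print(strip_bidi_controls(raw))  # 28/09/1962
```

Python strings hold characters in logical order, so stripping the marks leaves the digits in the order they were sent: strip_bidi_controls('\u200f12') gives '12', not '21'.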