Ben Hearn <benandrewhe...@gmail.com> writes:

> Hello all,
>
> I am having a bit of trouble with a string mismatch operation in my tool I am
> writing.
>
> I am comparing a database collection or url quoted paths to the paths on the
> users drive.
>
> These 2 paths look identical, one from the drive & the other from an xml url:
> a = '/Users/macbookpro/Music/tracks_new/_NS_2018/J.Staaf - ¡Móchate! _PromoMix_.wav'
> b = '/Users/macbookpro/Music/tracks_new/_NS_2018/J.Staaf - ¡Móchate! _PromoMix_.wav'
>
> But after realising it was failing on them I ran a difflib and these
> differences popped up.
>
> import difflib
> print('\n'.join(difflib.ndiff([a], [b])))
> - /Users/macbookpro/Music/tracks_new/_NS_2018/J.Staaf - ¡Móchate! _PromoMix_.wav
> ?                                                         ^^
> + /Users/macbookpro/Music/tracks_new/_NS_2018/J.Staaf - ¡Móchate! _PromoMix_.wav
> ?                                                         ^
>
> What am I missing when it comes to unquoting the string, or should I do
> some other fancy operation on the drive string?
>
In [8]: len(a)
Out[8]: 79

In [9]: len(b)
Out[9]: 78

The difference is in the ó. In (b) it is a single character, Unicode 0xF3,
LATIN SMALL LETTER O WITH ACUTE. In (a) it is composed of the letter o and
the accent "́" (Unicode 0x301). So you would have to do Unicode
normalisation before comparing.

For example:

In [16]: from unicodedata import normalize

In [17]: a == b
Out[17]: False

In [18]: normalize('NFC', a) == normalize('NFC', b)
Out[18]: True

-- 
Pieter van Oostrum
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]
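
A minimal sketch of how the normalised comparison might be wrapped up for the
URL-quoted case, assuming Python 3 and that the XML value is UTF-8
percent-encoded. The helper name same_path and the sample strings are
hypothetical and only illustrate the idea; they are not taken from the
original tool.

```python
import unicodedata
from urllib.parse import unquote


def same_path(drive_path, quoted_url_path, form="NFC"):
    """Compare a drive path with a URL-quoted path (hypothetical helper).

    The quoted path is unquoted first, then both sides are brought to the
    same Unicode normalisation form before the equality test.
    """
    unquoted = unquote(quoted_url_path)  # decodes %XX escapes as UTF-8
    return (unicodedata.normalize(form, drive_path)
            == unicodedata.normalize(form, unquoted))


# Illustrative strings, not the paths from the original post.
decomposed = "Mo\u0301chate"   # 'o' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "M\u00f3chate"   # single code point U+00F3
quoted = "M%C3%B3chate"        # UTF-8 percent-encoding of the precomposed form

print(len(decomposed), len(precomposed))   # lengths differ by one
print(decomposed == precomposed)           # raw comparison fails
print(same_path(decomposed, quoted))       # normalised comparison succeeds
```

Normalising both sides to the same form (NFC here, though NFD would work
equally well for an equality test) avoids having to know which side holds the
decomposed variant.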