On 7/12/19 7:17 AM, Bob van der Poel wrote:
I have some files which came off the net with, I'm assuming, unicode
characters in the names. I have a very short program which takes the
filename and puts it into an emacs buffer, and then lets me add information to
that new file (it's a poor man's DB).

Next, I can look up text in the file and open the saved filename.
Everything works great until I hit those darn unicode filenames.

Just to confuse me even more, the error seems to be coming from a bit of
tkinter code:
    if sresults.has_key(textAtCursor):
        bookname = os.path.expanduser(sresults[textAtCursor].strip())

which generates

   UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
     if sresults.has_key(textAtCursor):

I really don't understand the business about "both arguments". Not sure how
to proceed here. Hoping for a guideline!


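I'm guessing that the "both arguments" relates to expanduser(), because that is the first point at which the fileNM is identified to Python as anything more than a string of characters. That said, the warning itself quotes the has_key() line, so the two arguments may equally be the lookup key and one of the dictionary's own keys.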

[a fileNM will be a string of characters, but a string of characters is not necessarily a (legal) fileNM!]

I'd further suggest that, if you are using Python3 (cf Python2), your analysis may be the wrong way around. Python3 treats strings as Unicode. However, there is no requirement (and in the past there certainly was none) for the OpSys and its IOCS to encode fileNMs in Unicode.
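To make that concrete, here is a minimal sketch of what Python3 does with a fileNM whose bytes are not valid UTF-8. The name is hypothetical, and I am assuming a UTF-8 POSIX system, where os.fsdecode() and os.fsencode() use the "surrogateescape" error handler; the handler is applied explicitly here so the result does not depend on the machine running it.

```python
# Hypothetical fileNM bytes as an OpSys might hand them over:
# "café.txt" written by a Latin-1 system, so the é is the single byte 0xE9.
raw_name = b"caf\xe9.txt"

# Assumption: UTF-8 plus surrogateescape, as on a UTF-8 POSIX system.
as_text = raw_name.decode("utf-8", "surrogateescape")

# The undecodable byte is kept as a lone surrogate (U+DCE9), not lost.
print(ascii(as_text))

# Encoding with the same handler gives back the original bytes.
round_trip = as_text.encode("utf-8", "surrogateescape")
print(round_trip == raw_name)
```

The string is usable for opening the file, but it will not compare equal to a "real" Unicode spelling of the same name.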

The problem (for me) came from the many variations of ISO-8859-n (and, for example, MSFT's own code pages): there are no clues as to which of these was used in naming the file, and thus there are many possible 'translations' into Unicode.
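A small illustration of that ambiguity: the same single byte decodes to a different character under each of three ISO-8859 variants. The byte and the choice of codecs are only examples.

```python
# One byte, as it might appear inside a fileNM.
raw = b"\xe9"

# Three members of the ISO-8859 family (Python's codec names).
candidates = ["iso8859_1", "iso8859_5", "iso8859_7"]

# Every one of these decodes without error, so nothing flags a wrong guess.
decoded = {codec: raw.decode(codec) for codec in candidates}

for codec, char in decoded.items():
    # ascii() shows the code point without relying on the terminal's encoding.
    print(codec, ascii(char))
```

Latin-1 gives an e-acute, the Cyrillic variant gives a shcha, and the Greek variant gives an iota. All three are 'successful' decodes.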

You can start to address the issue by using Python's bytes (instead of strings); however, the cold reality above still intrudes.
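As a sketch of the bytes approach: in Python3, passing a bytes path to os.listdir() returns the names as bytes, exactly as stored, with no decoding attempted. The temporary directory and the file name below are made up for the demonstration, and the name is plain ASCII so the sketch runs on any file system.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one (hypothetical) file to list.
    with open(os.path.join(tmp, "example.txt"), "w"):
        pass

    # bytes path in, bytes names out: no guess at an encoding is made.
    names = os.listdir(os.fsencode(tmp))

print(names)
```

The names can be stored and reopened safely this way, but showing them to a human still requires choosing an encoding.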

Do you know the provenance of these files, eg they are in French and from an MS-Win machine? If so, you may be able to use decode() and encode(), but...
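If the provenance is known, the decode()/encode() route might look like the following sketch. The file name is hypothetical, and cp1252 (the usual Western European MS-Win code page) is my assumption for "French and from an MS-Win machine".

```python
# Hypothetical fileNM bytes: "résumé.txt" as written under cp1252.
raw_name = b"r\xe9sum\xe9.txt"

# Assumption: the provenance tells us the code page.
guessed_encoding = "cp1252"

try:
    name = raw_name.decode(guessed_encoding)
except UnicodeDecodeError:
    # cp1252 leaves a few byte values undefined; a failure here is a
    # hint that the guess was wrong.
    name = None

if name is not None:
    # Re-encode as UTF-8, eg for renaming the file on a UTF-8 system.
    utf8_name = name.encode("utf-8")
    print(ascii(name), utf8_name)
```

Note that most wrong guesses will NOT raise an error; they simply produce the wrong letters.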

Still looking for trouble? Knowing a fileNM was in Spanish/Portuguese, I was able to take the fileNM's individual Unicode characters/surrogates and subtract an applicable constant, so that accented letters fell 'back' into the correct Unicode range. (This is extremely risky, and could quite easily make matters worse/more confusing.)
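One way such a shift can look in Python3 (a sketch only, not necessarily the exact code used at the time): it assumes the name was decoded with the "surrogateescape" handler and that the stray bytes were Latin-1, in which case subtracting 0xDC00 from each escaped surrogate lands on the intended character. The function name and the sample fileNM are hypothetical.

```python
def shift_escaped_bytes(name):
    """Map surrogateescape'd bytes (U+DC80..U+DCFF) back to U+0080..U+00FF.

    Assumption: the undecodable bytes were Latin-1, where the byte value
    and the Unicode code point are the same number.
    """
    fixed = []
    for ch in name:
        code_point = ord(ch)
        if 0xDC80 <= code_point <= 0xDCFF:
            # Subtract the constant so the letter falls 'back' into range.
            fixed.append(chr(code_point - 0xDC00))
        else:
            fixed.append(ch)
    return "".join(fixed)


# Hypothetical Spanish fileNM with a Latin-1 n-tilde (byte 0xF1).
broken = b"ma\xf1ana.txt".decode("utf-8", "surrogateescape")
repaired = shift_escaped_bytes(broken)
print(ascii(broken), ascii(repaired))
```

If the bytes were NOT Latin-1, this quietly produces the wrong letters, which is the risk mentioned above.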

I warn you that pursuing this matter involves disappearing down into a very deep 'rabbit hole', but YMMV!

WebRefs:
https://docs.python.org/3/howto/unicode.html
https://www.dictionary.com/e/slang/rabbit-hole/
--
Regards =dn