On 7/12/19 7:17 AM, Bob van der Poel wrote:
I have some files which came off the net with, I'm assuming, unicode
characters in the names. I have a very short program which takes the
filename and puts it into an emacs buffer, and then lets me add information to
that new file (it's a poor man's DB).
Next, I can look up text in the file and open the saved filename.
Everything works great until I hit those darn unicode filenames.
Just to confuse me even more, the error seems to be coming from a bit of
tkinter code:
    if sresults.has_key(textAtCursor):
        bookname = os.path.expanduser(sresults[textAtCursor].strip())
which generates
    UnicodeWarning: Unicode equal comparison failed to convert both arguments
    to Unicode - interpreting them as being unequal
      if sresults.has_key(textAtCursor):
I really don't understand the business about "both arguments". Not sure how
to proceed here. Hoping for a guideline!
My guess is that the "both arguments" relates to expanduser(), because
this is the first time that the filename has been identified to Python as
anything more than a string of characters.
[A filename will be a string of characters, but a string of characters is
not necessarily a (legal) filename!]
I'd further suggest that, if you are using Python 3 (rather than 2), your
analysis may be the wrong way around: Python 3 treats strings as Unicode.
However, there is, and certainly in the past was, no requirement for the
operating system and its I/O control system (IOCS) to encode filenames in
Unicode.
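To illustrate why the Python version matters, here is a minimal sketch, assuming Python 3. The name sresults mirrors the original post, but the dictionary contents and text_at_cursor are made up. dict.has_key() exists only in Python 2, where comparing a Unicode string against a non-ASCII byte string is what produces this kind of UnicodeWarning. In Python 3 the same mix-up is silent by default: a bytes key and a str key simply never match.

```python
# Hypothetical data: keys stored as raw UTF-8 bytes, lookup text held as str.
sresults = {"café.pdf".encode("utf-8"): "~/books/café.pdf"}
text_at_cursor = "café.pdf"

# Python 3: bytes and str never compare equal, so this lookup misses quietly.
found_mixed = text_at_cursor in sresults

# Normalising every key to str first makes the lookup behave as expected.
normalised = {key.decode("utf-8"): value for key, value in sresults.items()}
found_normalised = text_at_cursor in normalised

print(found_mixed, found_normalised)
```

Whether the keys really are UTF-8 is exactly the open question in this thread, so the decode("utf-8") step is an assumption, not a diagnosis.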
The problem (for me) came from the many variations of ISO-8859-n (MSFT's,
for example): there are no clues as to which of these was used in naming
the file, and thus there are many possible 'translations' into Unicode.
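A small sketch of that ambiguity, using a made-up byte string rather than one of the real filenames. The helper candidate_names() is hypothetical. Every codec below decodes the bytes without complaint, so nothing in the data itself says which reading is the right one.

```python
def candidate_names(raw, codecs=("iso-8859-1", "iso-8859-5",
                                 "iso-8859-7", "iso-8859-15")):
    """Decode the same bytes under each codec; none of them raises here."""
    return {codec: raw.decode(codec) for codec in codecs}

raw_name = b"caf\xe9"  # made-up example of a name as stored on disk
for codec, text in candidate_names(raw_name).items():
    # ascii() keeps the output printable on any terminal encoding.
    print(f"{codec:12} -> {ascii(text)}")
```

The single byte 0xE9 comes out as a Latin, Cyrillic or Greek letter depending only on which codec was guessed.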
You can start to address the issue by using Python's bytes (instead of
strings); however, that cold reality still intrudes.
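As a rough sketch of the bytes approach, assuming Python 3: passing a bytes path to os.listdir() returns the entries as bytes, untouched by any decoding. The helper names raw_listing() and display_name() are hypothetical, and the display text is deliberately lossy.

```python
import os

def raw_listing(directory="."):
    """List entries as bytes, exactly as the filesystem reports them."""
    return os.listdir(os.fsencode(directory))

def display_name(raw_name):
    """Lossy, display-only text; keep raw_name for the actual open()."""
    return raw_name.decode("utf-8", errors="replace")

# The bytes names can be handed straight back to open() or os.rename().
print(raw_listing("."))
```

The idea is to keep the raw bytes as the 'real' name and treat any decoded text as a label for humans only.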
Do you know the provenance of these files, e.g. that they are in French and
from an MS-Win machine? If so, you may be able to use decode() and
encode(), but...
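If the provenance is known, the decode()/encode() step might look like the sketch below. The choice of cp1252 is purely an assumption standing in for "French, from an MS-Win machine", and transcode_name() and the sample bytes are made up. Decoding is left strict so that a wrong guess has a chance of failing loudly instead of quietly producing nonsense.

```python
def transcode_name(raw_name, source_encoding="cp1252"):
    """Decode bytes with the assumed encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if a byte is undefined in source_encoding.
    """
    text = raw_name.decode(source_encoding)  # strict by default
    return text, text.encode("utf-8")

text, utf8_bytes = transcode_name(b"r\xe9sum\xe9.txt")
print(ascii(text), utf8_bytes)
```

A strict decode only catches bytes the codec does not define; a wrong codec that happens to define every byte will still succeed, so the result needs a human eye.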
Still looking for trouble? Knowing that a filename was in
Spanish/Portuguese, I was able to take the filename's individual Unicode
characters/surrogates and subtract an applicable constant, so that the
accented letters fell 'back' into the correct Unicode range. (This is
extremely risky, and could quite easily make matters worse or more
confusing.)
I warn you that pursuing this matter involves disappearing down into a
very deep 'rabbit hole', but YMMV!
WebRefs:
https://docs.python.org/3/howto/unicode.html
https://www.dictionary.com/e/slang/rabbit-hole/
--
Regards =dn