Hi, everybody.

What is the best practice for dealing with filenames in Python 3? The problem is that os.walk(src_dir), os.listdir(src_dir), ... return filenames containing surrogate characters (PEP 383, surrogateescape) whenever a name cannot be decoded with the filesystem encoding. One cannot assume these are normal strings that can be print()'ed on a Unicode terminal or saved as a string in a database (MongoDB), because they raise UnicodeEncodeError on the surrogate characters. So, how should this situation be handled?
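To make the problem concrete, here is a small sketch that does not touch the filesystem. It assumes a hypothetical filename with one byte that is not valid UTF-8, and it assumes UTF-8 as the filesystem encoding (on a real system os.fsdecode() would pick the actual one):

```python
# Hypothetical raw filename: ASCII plus one byte (0xE9) that is not
# valid UTF-8 on its own. Not taken from a real directory listing.
raw_name = b"caf\xe9.txt"

# Decode the way PEP 383 describes. UTF-8 is an assumption for this example.
name = raw_name.decode("utf-8", errors="surrogateescape")
# The undecodable byte 0xE9 is now the lone surrogate U+DCE9 inside the str.

# Strict encoding, which is roughly what print() or a database driver
# ends up doing, refuses the lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with the same error handler restores the original bytes exactly.
round_trip = name.encode("utf-8", errors="surrogateescape")
```

So the string round-trips back to the original bytes without loss, but only as long as every consumer uses the surrogateescape handler.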
The first solution I found was to convert filenames to bytes and use those. But that's not nice: whenever I need to compare a filename with some string, I'll have to convert the string to bytes. Also, bytes objects are shown base64-encoded in the mongo shell and are thus hard to read, *e.g. "binary" : BinData(0,"c29tZSBiaW5hcnkgdGV4dA==")*. Finally, PEP 383 states that using bytes does not work on Windows (btw, why?).

Another option I found is to work with filenames as surrogate strings but re-decode them as 'latin-1' before printing or saving them to the database:

    filename.encode(fse, errors='surrogateescape').decode('latin-1')

I like this way more, since Latin symbols are clearly visible in the mongo shell. Yet I doubt it is the best solution. Ideally I would like to send surrogate strings to the database or to the terminal as is and let the db/terminal handle them. IOW, let the terminal print garbage where surrogate characters appear. Is this possible in Python?

So what do you think: is using unicode strings with explicit conversion to latin-1 a good option?

Also a related question: is it possible to detect surrogate symbols in strings? I found a suggestion to use re.compile('[\ud800-\uefff]+'). Yet all this stuff feels too hacky to me, so I would like some confirmation that this is the right way.

Thanks in advance, and sorry for touching this matter again. There have been too many discussions, and it is not evident what the current state of the art is.

--
Peter.
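
P.S. For reference, here is a minimal sketch of the detection idea, in case it helps the discussion. It assumes the names came from surrogateescape decoding, which, as far as I understand PEP 383, only ever produces U+DC80..U+DCFF (a narrower range than the regex quoted above). The helper names are made up for illustration, and the \xNN display format is just one possible choice:

```python
import re

# surrogateescape maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF,
# so this range covers everything that error handler can produce.
ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def has_escaped_bytes(name):
    """Return True if the str contains bytes smuggled in by surrogateescape."""
    return ESCAPED_BYTE.search(name) is not None


def printable(name):
    """Replace each escaped byte with a visible \\xNN marker.

    The result is safe to print or store, but it is for display only:
    it no longer round-trips to the original filename bytes.
    """
    return ESCAPED_BYTE.sub(
        lambda m: "\\x%02x" % (ord(m.group()) - 0xDC00), name
    )
```

Unlike the latin-1 conversion, this leaves correctly decoded non-ASCII characters untouched and only marks the bytes that could not be decoded.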