On 10/12/19 3:46 PM, Eko palypse wrote:
> Thank you very much for your answer.
>
>> You have to be able to match bytes, not strings.
> May I ask you to elaborate on this, sorry non-native English speaker.
> The buffer I receive is a byte-like buffer.
>
>> I don't think you'll be able to 100% reliably match bytes in this way.
>> You're asking it to make analysis of multiple bytes and to interpret
>> them according to which character they would represent if decoded from
>> UTF-8.
>>
>> My recommendation: Even if your buffer is multiple gigabytes, just
>> decode it anyway. Maybe you can decode your buffer in chunks, but
>> otherwise, just bite the bullet and do the decode. You may be
>> pleasantly surprised at how little you suffer as a result; Python is
>> quite decent at memory management, and even if you DO get pushed into
>> the swapper by this, it's still likely to be faster than trying to
>> code around all the possible problems that come from mismatching your
>> text search.
>>
>> ChrisA
> That's what I was afraid of.
> It would be nice if the "world" could commit itself to one standard,
> but I'm afraid that won't happen in my life anymore, I guess. :-(
>
> Thx
> Eren
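
For what it's worth, the "decode your buffer in chunks" idea quoted
above could look roughly like the sketch below. It assumes the buffer
really is UTF-8 and that it is a bytes or bytearray object; the names
decode_in_chunks and chunk_size are made up for illustration. The
point is that an incremental decoder from the standard codecs module
holds on to a multi-byte character that is cut by a chunk boundary
instead of raising an error.

```python
import codecs
import re


def decode_in_chunks(buffer, chunk_size=1 << 20, encoding="utf-8"):
    """Yield str pieces decoded from a bytes-like buffer, chunk by chunk.

    Hypothetical helper: the name, the default chunk size and the
    UTF-8 default are assumptions, not something from this thread.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    view = memoryview(buffer)  # slicing a memoryview does not copy the buffer
    for start in range(0, len(view), chunk_size):
        # Only the current chunk is copied into a bytes object.
        chunk = bytes(view[start:start + chunk_size])
        # An incomplete multi-byte sequence at the end of the chunk is
        # kept inside the decoder and completed by the next chunk.
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the buffer stops in the
    # middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Example: chunk_size=3 deliberately splits the two-byte character "ï".
data = "naïve café".encode("utf-8")
text = "".join(decode_in_chunks(data, chunk_size=3))
match = re.search(r"caf\w", text)  # \w matches "é" in a str pattern
```

Joining the pieces gives one str to search. If you search each piece
separately instead, a match that straddles a chunk boundary will be
missed, so you would need to carry some overlap between pieces.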

Current 'best practice', in my opinion, is to convert data to some form
of Unicode (UTF-8, UTF-16, or UCS-4) at input, if it isn't in one
already, and to process it in that domain. You do need to be prepared
to run into files that are encoded in some locally defined 8-bit code
page. In Python 3, strings are Unicode, and you don't need to worry
about which encoding is used internally; Python deals with that itself.

-- 
Richard Damon

-- 
https://mail.python.org/mailman/listinfo/python-list