First of all, many thanks to everyone for the active participation. @Chris Angelico, I think I understand what you illustrated with the byte example; it makes sense. Since it was developed for 8-bit encodings only, it cannot be used for multibyte encodings.
@Richard Damon and @MRAB, thank you very much for the information too, very much appreciated. I think I understand what you all mean, but I am not sure how to put it all together.

Maybe a little more information about what I want to do. I am using Notepad++ and Scintilla. With SCI_GETCHARACTERPOINTER, Scintilla passes me a read-only pointer to the current buffer. The problem is that the buffer can be in any possible encoding (cp1251, cp1252, utf8, ucs-2 ...), but Scintilla tells me which encoding is currently in use.

I wanted to build a regular expression tester with Python 3 that marks the text matched by a regular expression. After testing the approach of treating everything as a Python 3 str, I found that the positions of the matched text are not reported correctly. For example, say I want to find the word "Ärger", assumed to be encoded in utf8, with the regex \w+. If I decode the buffer, the match has a length of 5, whereas the word has a length of 6 within the document, so marking the match would be wrong, wouldn't it? I understand the reason for the difference: "Ä" is one character but takes two bytes in utf8.

If I use the built-in find dialog of Notepad++, which internally uses the boost::regex engine, I can use \w+ to find the word.

So that's where I'm stuck at the moment. How can I find and mark those matches correctly? By wrapping boost::regex with ICU support?

Thx
Eren
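
For illustration, here is a minimal sketch of one way the character positions could be mapped back to byte positions. It assumes that the positions used for marking are byte offsets into the buffer (as the "Ärger" example suggests) and that the encoding reported by Scintilla has already been translated into a Python codec name. The helper byte_spans is a hypothetical name, not part of Scintilla, Notepad++ or the re module.

```python
import re


def byte_spans(buffer: bytes, encoding: str, pattern: str):
    """Yield (byte_start, byte_end) for every regex match in an encoded buffer.

    Hypothetical helper. Assumes `encoding` is a BOM-less Python codec name
    such as "utf-8", "cp1252" or "utf-16-le"; a codec like "utf-16" would
    prepend a BOM on every encode() call and skew the offsets.
    """
    text = buffer.decode(encoding)
    char_pos = 0  # character index already accounted for
    byte_pos = 0  # byte offset corresponding to char_pos
    for match in re.finditer(pattern, text):
        # Re-encode only the gap since the previous match to learn how many
        # bytes it occupies, instead of re-encoding the whole prefix each time.
        gap = text[char_pos:match.start()]
        byte_start = byte_pos + len(gap.encode(encoding))
        byte_end = byte_start + len(match.group().encode(encoding))
        yield byte_start, byte_end
        char_pos, byte_pos = match.end(), byte_end


if __name__ == "__main__":
    data = "Ärger und Öl".encode("utf-8")
    print(list(byte_spans(data, "utf-8", r"\w+")))
```

With this approach the regex still runs on a str, so \w+ keeps its Unicode meaning, and only the reported spans are converted. Whether that is fast enough for large buffers would need to be measured.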