Ammar Askar <am...@ammaraskar.com> added the comment:

Had some time to look into this. To summarize the problem: some Unicode code points are a single character in a Python string but are rendered wider than one character cell, even in a monospace font [1].
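To make the mismatch concrete, here is a rough sketch that compares len() with an approximate column count. It relies on unicodedata.east_asian_width, which is discussed further down. The helper names display_width and caret_line are made up for illustration and are not anything in CPython, and treating 'W' and 'F' as two columns and everything else as one is an assumption that ignores combining and zero-width characters.

```python
import unicodedata


def display_width(text):
    """Approximate number of terminal columns needed to show text.

    Hypothetical helper for illustration only, not part of CPython.
    """
    width = 0
    for ch in text:
        # Assumption: 'W' (Wide) and 'F' (Fullwidth) take two columns,
        # every other category takes one.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def caret_line(line, start, end):
    """Build a caret underline for line[start:end], measured in columns."""
    padding = " " * display_width(line[:start])
    carets = "^" * display_width(line[start:end])
    return padding + carets


source = 'x = "该" + 1'
print(source)
print(caret_line(source, 4, 7))
```

Here the slice source[4:7] is three characters long but gets four carets, because the string literal contains one wide character.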
In the examples earlier in this issue, the Chinese character counts as one character in a Python string, yet it needs two carets:

>>> x = "该"
>>> print(x)
该
>>> len(x)
1
>>> print(x + '\n' + '^^')
该
^^

This is somewhat font dependent. In the case of the emoji, I know that Windows sometimes renders emojis as single-character-wide black-and-white glyphs and sometimes as colorful ones, depending on the program.

As Pablo alluded to, unicodedata.east_asian_width is probably the best solution we can implement. For these wide characters it provides:

>>> import unicodedata
>>> unicodedata.east_asian_width('💩')
'W'
>>> unicodedata.east_asian_width('该')
'W'

W corresponding to Wide. Whereas for regular width characters:

>>> unicodedata.east_asian_width('b')
'Na'
>>> unicodedata.east_asian_width('=')
'Na'

we get Na, which stands for Narrow (Neutral is a separate category, 'N'). This can be used to count the "displayed width" of the characters and hence the number of carets.

However, organization is going to be a bit tricky, since we're currently using _PyPegen_byte_offset_to_character_offset to get offsets to use for string slicing in the AST segment parsing code. We might have to make a separate function that gets the displayed width.

-------------

[1] Way more details on this issue here: https://denisbider.blogspot.com/2015/09/when-monospace-fonts-arent-unicode.html
and an example of a Python library that tries to deal with this issue here: https://github.com/jquast/wcwidth

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43950>
_______________________________________