Chris Jones <cjns1...@gmail.com> writes:

Hi Chris, thanks for your detailed reply.

> On Sat, May 30, 2009 at 04:55:19PM EDT, Arnaud Delobelle wrote:
>
>> Hi all,
>
> Disclaimer: I am not familiar with the curses python implementation and
> I'm neither an ncurses nor a "unicode" expert by a long shot.
>
> :-)
>
>> I am looking for advice on how to use unicode with curses. First I will
>> explain my understanding of how curses deals with keyboard input and how
>> it differs with what I would like.
>>
>> The curses module has a window.getch() function to capture keyboard
>> input. This function returns an integer which is more or less:
>>
>> * a byte if the key which was pressed is a printable character (e.g. a,
>>   F, &);
>>
>> * an integer > 255 if it is a special key, e.g. if you press KEY_UP it
>>   returns 259.
>
> The getch(3NCURSES) function returns an integer. Provided it's large
> enough to accommodate the highest possible value, the actual size in
> bytes of the integer should be irrelevant.

Sorry, I was somehow mixing up what happens in general and what happens
with utf-8 (probably because I have only done tests with utf-8), where
the number of bytes used to encode a character varies.

>> As far as I know, curses is totally unicode unaware,
>
> My impression is that rather than "unicode unaware", it is "unicode
> transparent" - or (nitpicking) "UTF8 transparent" - since I'm not sure
> other flavors of unicode are supported.
>
>> so if the key pressed is printable but not ASCII,
>
> .. nitpicking again, but ASCII is a 7-bit encoding: 0-127.
>
>> the getch() function will return one or more bytes depending on the
>> encoding in the terminal.
>
> I don't know about the python implementation, but my guess is that it
> should closely follow the underlying ncurses API - so the above is
> basically correct, although it's not a question of the number of bytes
> but rather the returned range of integers - if your locale is en.US then
> that should be 0-255.. if it is en_US.utf8 the range is considerably
> larger.

In my tests, my locale is en_GB.utf8 and the python getch() function
does return a number of bytes - see below.

>> E.g. given utf-8 encoding, if I press the key 'é' on my keyboard (which
>> is encoded as '\xc3\xa9' in utf-8), I will need two calls to getch() to
>> get this: the first one will return 0xC3 and the second one 0xA9.
>
> No. A single call to getch() will grab your "é" and return 0xc3a9,
> decimal 50089.

It is the case though that on my machine, if I press 'é' then call
getch(), it will return 0xC3. A further call to getch() will return
0xA9. This is why I was talking about getch() returning bytes: to me it
behaves as if it returns the encoded characters byte by byte.

>> Instead of getting a stream of bytes and special keycodes (with value >
>> 255) from getch(), what I want is a stream of *unicode characters* and
>> special keycodes.
>
> This is what getch(3NCURSES) does: it returns the integer value of one
> "unicode character".

It is not what happens in my tests. I have made a simple testing
script, see below.

> Likewise, I would assume that looping over the python equivalent of
> getch() will not return a stream of bytes but rather a "stream" of
> integers that map one to one to the "unicode characters" that were
> entered at the terminal.
>
> Note: I am only familiar with languages such as English, Spanish,
> French, etc. where only one terminal cell is used for each glyph. My
> understanding is that things get somewhat more complicated with
> languages that require so-called "wide characters" - two terminal cells
> per character, but that's a different issue.
>
>> So, still assuming utf-8 encoding in the terminal, if I type:
>>
>>     Té[KEY_UP]ça
>>
>> iterating calls to the getch() function will give me this sequence of
>> integers:
>>
>>     84, 195, 169, 259,    195, 167, 97
>>     T-  é-------  KEY_UP  ç-------  a-
>>
>> But what I want is to get this stream instead:
>>
>>     u'T', u'é', 259, u'ç', u'a'
>
> No, for the above, getch() will return:
>
>     84, 50089, 259, 50087, 97
>
> .. which is "functionally" equivalent to:
>
>     u'T', u'é', 259, u'ç', u'a'
>
> [..]
>
> So shouldn't this issue boil down to just a matter of casting the
> integers to the "u" data type?
>
> This short snippet may help clarify the above:
>
> -----------------------------------------------------------------------
> #include <locale.h>
> #include <ncurses.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
>
> int unichar;
>
> int main(int argc, char *argv[])
> {
>     setlocale(LC_ALL, "en_US.UTF-8");  /* make sure UTF8 */
>     initscr();                         /* start curses mode */
>     raw();
>     keypad(stdscr, TRUE);              /* pass special keys */
>     unichar = getch();                 /* read terminal */
>
>     mvprintw(24, 0, "Key pressed is = %4x ", unichar);
>
>     refresh();
>     getch();                           /* wait */
>     endwin();                          /* leave curses mode */
>     return 0;
> }
> -----------------------------------------------------------------------
>
> Hopefully you have access to a C compiler:
>
> $ gcc -lncurses uni00.c -o uni00

Thanks for this. When I test it on my machine (BTW it is MacOS
10.5.7), if I type an ASCII character (e.g. 'A'), I get its ASCII code
(0x41), but if I type a non-ASCII character (e.g. '§') I get back to
the prompt immediately. It must be because two values are queued for
getch(): the first call takes the first one and the second call, which
is meant to wait for a keypress, immediately gets the other. I should
try it on a Linux machine, but I don't have one handy at the moment.

I have made a little test script in Python which is similar but will
only stop when 'Esc' is pressed.

--------------------------------------------------
import curses

def getcodes(win):
    # Collect every value returned by getch() until ESC (27) is pressed.
    codes = []
    while True:
        c = win.getch()
        if c == 27:
            return codes
        codes.append(c)

# wrapper() restores the terminal before the list is printed.
print(curses.wrapper(getcodes))
--------------------------------------------------

If I try this in a Terminal and type 'souçi[ESC]', I get this:

[115, 111, 117, 195, 167, 105]
 s--, o--, u--, ç-------, i--

As you see, two calls to getch() were necessary after typing 'ç'.

BTW on the same terminal:

marigold:junk arno$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=

I will have to do tests with other encodings.

-- 
Arnaud
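
P.S. For illustration only, here is a minimal sketch of how the
byte-by-byte sequence reported above could be turned into the desired
stream of unicode characters and special keycodes. It assumes Python 3,
assumes that every value below 256 is one byte of encoded text and every
value above 255 is a special keycode, and assumes the terminal encoding
is known (utf-8 by default). The name decode_keys is hypothetical; it is
not part of the curses module.

```python
import codecs

def decode_keys(codes, encoding="utf-8"):
    """Yield unicode characters and special keycodes from getch()-style ints.

    Assumption: values 0-255 are bytes of text in `encoding`; values
    above 255 are special keycodes and are passed through unchanged.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for code in codes:
        if code < 0:
            # getch() returns -1 when no input is available (nodelay mode).
            continue
        if code > 255:
            # Flush any incomplete byte sequence before the special key.
            for ch in decoder.decode(b"", final=True):
                yield ch
            decoder.reset()
            yield code
        else:
            # Feed one byte; the decoder buffers it until a full
            # character is available, then returns that character.
            for ch in decoder.decode(bytes([code])):
                yield ch
    # Flush whatever is left once the input is exhausted.
    for ch in decoder.decode(b"", final=True):
        yield ch

# The sequence from the Té[KEY_UP]ça example:
print(list(decode_keys([84, 195, 169, 259, 195, 167, 97])))
# → ['T', 'é', 259, 'ç', 'a']
```

With errors="replace", a byte sequence cut short by a special key comes
out as U+FFFD rather than raising an exception; whether that is the
right behaviour is a design choice, not something curses prescribes.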