Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:

> Nevertheless, there are important abstractions that are written on top
> of the bytes layer, and in the Unix and Linux world, the most
> important abstraction is *text*. In the Unix world, text formats and
> text processing is much more common in user-space apps than binary
> processing.

That Linux text is not the same thing as Python's text. Conceptually,
Python text is a sequence of 32-bit integers. Linux text is a sequence
of 8-bit integers.

It is great that lots of computer-to-computer formats are encoded in
ASCII (~ UTF-8). However, nowhere in Linux is there a real abstraction
layer that processes Python-esque text.

Case in point:

    $ env | grep UTF
    LANG=en_US.UTF-8
    $ od -c <<<"Hyvää yötä"    # "Good night" in Finnish
    0000000   H   y   v 303 244 303 244       y 303 266   t 303 244  \n
    0000017

The "od" utility is asked to display its input as characters. The locale
info gives a hint that all text data is in UTF-8. Yet what comes out is
bytes.

How about:

    $ wc -c <<<"Hyvää yötä"
    15
    $ tr 'ä' 'a' <<<"Hyvää yötä"
    Hyvaaaa ya�taa

Grep is smarter:

    $ grep v...y <<<"Hyvää yötä"
    Hyvää yötä

which is why you should always prefix "grep" with LC_ALL=C in your
scripts (makes it far faster, too).


Marko

-- 
https://mail.python.org/mailman/listinfo/python-list
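
PS. For comparison, here is a rough sketch of the same contrast seen from the Python side. It assumes Python 3; the variable names are only illustrative, and bytes.translate merely approximates what tr did in the session above (tr apparently treated 'ä' as two separate bytes).

```python
# Sketch only (assumes Python 3); variable names are illustrative.
# Escapes are used so the literal is unambiguous: \u00e4 is "ä", \u00f6 is "ö".
text = "Hyv\u00e4\u00e4 y\u00f6t\u00e4\n"   # the here-string appends the newline
data = text.encode("utf-8")                 # the bytes od, wc and tr were given

print(len(text))   # 11 code points
print(len(data))   # 15 bytes, matching wc -c

# Byte-wise translation, roughly what tr did: the two bytes of "ä"
# (0xC3 0xA4) are each mapped to "a", which also mangles "ö" (0xC3 0xB6).
byte_table = bytes.maketrans("\u00e4".encode("utf-8"), b"aa")
bytewise = data.translate(byte_table).decode("utf-8", errors="replace")
print(bytewise, end="")

# Character-wise translation on str leaves "ö" alone.
charwise = text.translate(str.maketrans("\u00e4", "a"))
print(charwise, end="")
```

The byte-wise result reproduces the tr output, replacement character included, while the str version only touches the three "ä" characters.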