> For example, I got that EN DASH out of a web page which states <?xml
> version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I
> did go for that encoding. But if the browser can properly decode that
> character using that encoding, how come other applications can't?

Please do trust us that ISO-8859-1 does *NOT* support EN DASH.
There are two possible explanations for the behavior you observed:

a) even though the file was declared ISO-8859-1, the data in it
   actually didn't use that encoding. The browser somehow found
   out, and chose a different encoding from the declared one.
b) the web page contained the character reference &#8211; (or
   &#x2013;), or the entity reference &ndash;. XML lets you
   represent arbitrary Unicode characters this way, even in a
   file that is encoded in ASCII.

> I might need to go for python's htmllib to avoid this, not sure. But
> if I don't, if I only want to just copy and paste some web pages text
> contents into a tkinter Text widget, what should I do to successfully
> make every single character go all the way from the widget and out of
> tkinter into a python string variable? How did my browser know it
> should render an EN DASH instead of a circumflexed lowercase u?

Read the source of the web page to be certain.

> This is the webpage in case you are interested, 4th line of first
> paragraph, there is the EN DASH:
> http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-04-15.html

Ok, this says &#8211; in several places, as well as &#8220; and
&#8221;

HTH,
Martin
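
P.S. Both points above can be checked from the interpreter. What follows is only a minimal sketch, assuming Python 3 and its standard html module (not the htmllib mentioned in the question); the helper name latin1_can_encode and the sample string are made up for illustration and are not taken from the web page.

```python
import html

EN_DASH = "\u2013"  # U+2013 EN DASH


def latin1_can_encode(ch):
    """Return True if ch has a representation in ISO-8859-1 (hypothetical helper)."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# A made-up, ASCII-only snippet using a decimal character reference,
# a hexadecimal character reference, and a named entity reference.
snippet = "2007&#8211;2008, &#x2013;, &ndash;, &#8220;quoted&#8221;"

# html.unescape() resolves the references to real Unicode characters.
decoded = html.unescape(snippet)

if __name__ == "__main__":
    print(latin1_can_encode(EN_DASH))   # EN DASH is outside ISO-8859-1
    print(latin1_can_encode("\u00fb"))  # u with circumflex is inside it
    print(ascii(decoded))
```

The snippet itself is pure ASCII, so it is also valid ISO-8859-1; the EN DASH only appears once the references are resolved, which is what a browser does before rendering.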