Re: Unicode chr(150) en dash

Martin v. Löwis Thu, 17 Apr 2008 12:23:06 -0700

> For example, I got that EN DASH out of a web page which states <?xml
> version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I
> did go for that encoding. But if the browser can properly decode that
> character using that encoding, how come other applications can't?


Please do trust us that ISO-8859-1 does *NOT* support EN DASH.

There are two possible explanations for the behavior you observed:
a) even though the file was declared ISO-8859-1, the data in it
   actually didn't use that encoding. The browser somehow found out,
   and chose a different encoding from the declared one.
b) the web page contained the character reference &#x2013; (or &#8211;),
   or the entity reference &ndash;. XML allows arbitrary Unicode
   characters to be written this way, even in a file that is encoded
   in ASCII.
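
Both explanations can be checked from a Python prompt. What follows is
only a sketch, written in Python 3 syntax and not taken from the page in
question. The byte value 0x96 (decimal 150, as in the subject line) and
the choice of windows-1252 as the "different encoding" in a) are
assumptions made for illustration.

```python
import html
import unicodedata

raw = bytes([150])  # the single byte 0x96

# ISO-8859-1 maps every byte to the code point with the same number,
# so 0x96 becomes U+0096, a C1 control character, not a dash.
as_latin1 = raw.decode("iso-8859-1")
print(hex(ord(as_latin1)), unicodedata.category(as_latin1))

# Explanation a), ASSUMING the data is really windows-1252:
# there, byte 0x96 is EN DASH.
as_cp1252 = raw.decode("cp1252")
print(unicodedata.name(as_cp1252))

# Explanation b): character and entity references name the code point
# directly, independent of the encoding of the file.
for ref in ("&#x2013;", "&#8211;", "&ndash;"):
    print(ref, "->", unicodedata.name(html.unescape(ref)))
```

The same byte therefore yields a dash or a control character depending
only on which codec is applied, while the references always yield U+2013.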

> I might need to go for python's htmllib to avoid this, not sure. But
> if I don't, if I only want to just copy and paste some web page's text
> contents into a tkinter Text widget, what should I do to successfully
> make every single character go all the way from the widget and out of
> tkinter into a python string variable? How did my browser know it
> should render an EN DASH instead of a circumflexed lowercase u?

Read the source of the web page to be certain.

> This is the webpage in case you are interested, 4th line of first
> paragraph, there is the EN DASH:
> http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-04-15.html

OK, the source says &#8211; in several places, as well as &#8220; and
&#8221;, so explanation b) applies.
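
A check like this can be automated. The sketch below is in Python 3
syntax; the `source` string is a made-up fragment standing in for the
page (the real markup is not reproduced here), and `numeric_refs` is a
hypothetical helper name, not a library function.

```python
import html
import re
import unicodedata

# Hypothetical fragment standing in for the page source.
source = "palabras &#8220;entre comillas&#8221; &#8211; y m&aacute;s"

def numeric_refs(text):
    """Return (reference, character name) pairs for numeric references."""
    found = []
    for match in re.finditer(r"&#([xX][0-9a-fA-F]+|[0-9]+);", text):
        body = match.group(1)
        # A leading x/X marks a hexadecimal reference, otherwise decimal.
        code = int(body[1:], 16) if body[0] in "xX" else int(body)
        found.append((match.group(0), unicodedata.name(chr(code), "<unnamed>")))
    return found

for ref, name in numeric_refs(source):
    print(ref, name)

# Once the references are resolved, the text holds characters that
# ISO-8859-1 cannot represent, so encoding back to it fails.
text = html.unescape(source)
try:
    text.encode("iso-8859-1")
except UnicodeEncodeError as exc:
    print("not representable in ISO-8859-1:", exc.reason)
```

Keeping the resolved text as a Unicode string, and encoding it as UTF-8
when it has to be written out, avoids that error.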

HTH,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list
