On Oct 18, 12:14 am, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Arian Kuschki wrote:
> > Hi all
>
> > this has been bugging me for a long time and I do not seem to be able to
> > understand what to do. I always have problems when dealing with input text that
> > contains umlauts. Consider the following:
>
> > In [1]: import urllib
>
> > In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
>
> > In [3]: xml = f.read()
>
> > In [4]: f.close()
>
> > In [5]: print xml
> > ------> print(xml)
> > <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
> > tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0">
> > <forecast_information><city data="Munich, BY"/>
> > <postal_code data="Muenchen"/><latitude_e6 data=""/><longitude_e6 data=""/>
> > <forecast_date data="2009-10-17"/>
> > <current_date_time data="2009-10-17 14:20:00 +0000"/>
> > <unit_system data="SI"/></forecast_information>
> > <current_conditions><condition data="Meistens bew kt"/>
> > <temp_f data="43"/><temp_c data="6"/><humidity data="Feuchtigkeit: 87 %"/>
> > <icon data="/ig/images/weather/mostly_cloudy.gif"/>
> > <wind_condition data="Wind: W mit Windgeschwindigkeiten von 13 km/h"/>
> > </current_conditions>
> > <forecast_conditions><day_of_week data="Sa."/><low data="1"/><high data="7"/>
> > <icon data="/ig/images/weather/chance_of_rain.gif"/>
> > <condition data="Vereinzelt Regen"/></forecast_conditions>
> > <forecast_conditions><day_of_week data="So."/><low data="-1"/><high data="8"/>
> > <icon data="/ig/images/weather/chance_of_snow.gif"/>
> > <condition data="Vereinzelt Schnee"/></forecast_conditions>
> > <forecast_conditions><day_of_week data="Mo."/><low data="-4"/><high data="8"/>
> > <icon data="/ig/images/weather/mostly_sunny.gif"/>
> > <condition data="Teils sonnig"/></forecast_conditions>
> > <forecast_conditions><day_of_week data="Di."/><low data="0"/><high data="8"/>
> > <icon data="/ig/images/weather/sunny.gif"/>
> > <condition data="Klar"/></forecast_conditions></weather></xml_api_reply>
>
> > As you can see the umlauts in the XML are not displayed properly. When I want
> > to process this text (for example with xml.sax), I get error messages because
> > the parser can't read this.
>
> > I've tried to read up on this and there is a lot of information on the web, but
> > nothing seems to work for me. For example setting the coding to UTF-8 like this:
> > # -*- coding: utf-8 -*- or using the decode() string method.
>
> > I always have this kind of problem when input contains umlauts, not just in
> > this case. My locale (on Ubuntu) is en_GB.UTF-8.
>
> The string you received from the website is a bytestring and you're just
> printing it to your console, which is configured for UTF-8. However, the
> bytestring isn't valid UTF-8, so the console is replacing the invalid
> parts with the funny characters.
>
> You should decode the bytestring to Unicode and then re-encode it to
> UTF-8. I don't know what encoding the website is actually using; here
> I'm assuming ISO-8859-1:
>
> print xml.decode("iso-8859-1").encode("utf-8")
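
The same decode/re-encode step should also help with the xml.sax errors
mentioned in the question. The reply has no encoding declaration, so the
parser assumes UTF-8 and rejects the raw bytes. Below is a minimal sketch
written for Python 3. The sample bytes, the DataCollector class and the
ISO-8859-1 guess are assumptions made for illustration; they are not taken
from the real service.

```python
import xml.sax

# Hypothetical stand-in for the bytes returned by f.read().
# Assumption: the server sent ISO-8859-1, so o-umlaut is the single byte 0xF6.
raw = b'<?xml version="1.0"?><condition data="Meistens bew\xf6lkt"/>'


class DataCollector(xml.sax.ContentHandler):
    """Illustrative handler that collects every `data` attribute value."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.values = []

    def startElement(self, name, attrs):
        value = attrs.get("data")
        if value is not None:
            self.values.append(value)


# Feeding the raw bytes fails: with no declared encoding the parser
# assumes UTF-8, and 0xF6 is not valid UTF-8.
try:
    xml.sax.parseString(raw, DataCollector())
    raw_parse_failed = False
except xml.sax.SAXParseException:
    raw_parse_failed = True

# Decode with the assumed source encoding, then re-encode as UTF-8.
utf8_bytes = raw.decode("iso-8859-1").encode("utf-8")
handler = DataCollector()
xml.sax.parseString(utf8_bytes, handler)
```

After this, handler.values holds the attribute text as proper Unicode
strings, umlauts included.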

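In 2.6, str.decode returns unicode, so you can print the result directly.
In 3.1, f.read() returns bytes and bytes.decode returns str, so you can
print that directly as well.

So decode("cp1252") on its own is enough; there is no need to re-encode to
UTF-8 before printing.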
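
A small sketch of that, using made-up sample bytes instead of the live
request. The to_text helper and the cp1252 default are assumptions for
illustration; the real encoding of the site was never confirmed in this
thread.

```python
# Hypothetical stand-in for the bytes returned by f.read().
# Assumption: the data is cp1252 / ISO-8859-1, where o-umlaut is byte 0xF6.
raw = b'<condition data="Meistens bew\xf6lkt"/>'


def to_text(data, encoding="cp1252"):
    """Decode a bytestring to text; `encoding` is a guess, not detected."""
    return data.decode(encoding)


text = to_text(raw)
print(text)  # print encodes the text for the console itself

# The same bytes are not valid UTF-8, which is why a UTF-8 console
# showed replacement characters when the raw bytes were printed.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

For this sample ISO-8859-1 and cp1252 give the same result, since both map
byte 0xF6 to the same character.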
-- 
http://mail.python.org/mailman/listinfo/python-list
