Whoa, that was quick! Thanks for all the answers. I'll try to recapitulate.
>What does this show you in your interactive interpreter?
>
>>>> print "\xc3\xb6"
>ö
>
>For me, it's o-umlaut, ö. This is because the above bytes are the
>sequence for ö in utf-8.
>
>If this shows something else, you need to adjust your terminal settings.
For me it also prints the correct o-umlaut (ö), so that was not the problem.
All of the below result in xml that shows all umlauts correctly when printed:
xml.decode("cp1252")
xml.decode("cp1252").encode("utf-8")
xml.decode("iso-8859-1")
xml.decode("iso-8859-1").encode("utf-8")
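To make the difference between the four variants concrete, here is a minimal sketch. It assumes Python 3 syntax (the session in this thread is Python 2), and the sample word is made up for illustration rather than taken from the real response.

```python
# Hypothetical sample text; \u00f6 is the o-umlaut.
text = "bew\u00f6lkt"

# The same character is stored as different bytes in different encodings.
utf8_bytes = text.encode("utf-8")         # o-umlaut becomes two bytes: c3 b6
latin1_bytes = text.encode("iso-8859-1")  # o-umlaut becomes one byte: f6
cp1252_bytes = text.encode("cp1252")      # same single byte as iso-8859-1 here

# decode() alone turns bytes into a text (unicode) object.
decoded = latin1_bytes.decode("cp1252")

# decode() followed by encode("utf-8") gives bytes again, now in UTF-8.
reencoded = latin1_bytes.decode("iso-8859-1").encode("utf-8")

print(utf8_bytes, latin1_bytes, decoded, reencoded)
```

Printing looks fine in all four cases because the terminal renders either form, but only the re-encoded variant is still a byte string.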
But when I then want to parse the xml, it only works if I do both decode
and encode. If I only decode, I get the following error:
SAXParseException: :1:1: not well-formed (invalid token)
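A small sketch of the parsing side, under stated assumptions: Python 3 syntax, a made-up one-element document instead of the real response, and a hypothetical handler class `AttrCollector`. It shows that bytes in a legacy encoding without an encoding declaration are rejected by the parser, while the same data transcoded to UTF-8 bytes is accepted.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class AttrCollector(ContentHandler):
    """Hypothetical handler that records every 'data' attribute it sees."""

    def __init__(self):
        super().__init__()
        self.values = []

    def startElement(self, name, attrs):
        self.values.append(attrs.get("data"))


def parse_data_attrs(xml_bytes):
    collector = AttrCollector()
    xml.sax.parseString(xml_bytes, collector)
    return collector.values


# Assumed sample: ISO-8859-1 bytes, no encoding declaration.
raw = b'<condition data="Meistens bew\xf6lkt"/>'

# Without a declaration the parser assumes UTF-8, and \xf6 is not valid there.
try:
    parse_data_attrs(raw)
    raw_parsed = True
except xml.sax.SAXParseException:
    raw_parsed = False

# Transcode: bytes -> text -> UTF-8 bytes, then parse the bytes.
utf8_bytes = raw.decode("iso-8859-1").encode("utf-8")
values = parse_data_attrs(utf8_bytes)
print(raw_parsed, values)
```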
Do I understand correctly that, since the encoding was not specified in the
xml response, it should have been utf-8 by default? And that if it had indeed
been utf-8, I would not have had the encoding problem in the first place?
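To illustrate the question, here is a hedged sketch (Python 3 syntax, made-up sample documents, not the actual server response). As far as I understand it, a document that declares its encoding can be handed to the parser as raw bytes, and a document without a declaration is read as UTF-8.

```python
import xml.etree.ElementTree as ET

# Assumed sample 1: ISO-8859-1 bytes WITH a matching encoding declaration.
declared = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<condition data="bew\xf6lkt"/>'
)

# Assumed sample 2: UTF-8 bytes with NO declaration (the default case).
undeclared_utf8 = b'<condition data="bew\xc3\xb6lkt"/>'

# Both are passed to the parser as bytes; no manual decode() is involved.
from_declared = ET.fromstring(declared).get("data")
from_utf8 = ET.fromstring(undeclared_utf8).get("data")

print(from_declared == from_utf8)
```

In both cases the parser itself does the decoding, which is why a correctly declared (or genuinely UTF-8) response would not need any manual transcoding.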
Anyway, thanks everybody, this has helped me a lot.
Arian
On Sat 17, 20:17 +0200, Diez B. Roggisch wrote:
> StarWing wrote:
> >On Oct 18, 12:50 am, "Diez B. Roggisch" wrote:
> >>StarWing wrote:
> >>
> >>
> >>
> >>>On Oct 17, 9:54 pm, Arian Kuschki wrote:
> >>>>Hi all
> >>>>this has been bugging me for a long time and I do not seem to be able to
> >>>>understand what to do. I always have problems when dealing with input text
> >>>>that contains umlauts. Consider the following:
> >>>>In [1]: import urllib
> >>>>In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> >>>>In [3]: xml = f.read()
> >>>>In [4]: f.close()
> >>>>In [5]: print xml
> >>>>--> print(xml)
> >>>>tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0">
> >>>>y data="Munich, BY"/>
> >>>>data=""/>
> >>>>data="2009-10-17"/>
> >>>>data="SI"/>
> >>>>data="Meistens bew kt"/>
> >>>>umidity data="Feuchtigkeit: 87 %"/>
> >>>>data="/ig/images/weather/mostly_cloudy.gif"/>
> >>>>ent_conditions>
> >>>>data="1"/>
> >>>>data="/ig/images/weather/chance_of_rain.gif"/>
> >>>>data="So."/>
> >>>>data="/ig/images/weather/chance_of_snow.gif"/>
> >>>>data="Mo."/>
> >>>>data="Di."/>
> >>>>/>
> >>>>data="Klar"/>
> >>>>As you can see the umlauts in the XML are not displayed properly. When I
> >>>>want to process this text (for example with xml.sax), I get error messages
> >>>>because the parser can't read this.
> >>>>I've tried to read up on this and there is a lot of information on the
> >>>>web, but nothing seems to work for me. For example setting the coding to
> >>>>UTF-8 like this:
> >>>># -*- coding: utf-8 -*- or using the decode() string method.
> >>>>I always have this kind of problem when input contains umlauts, not just
> >>>>in this case. My locale (on Ubuntu) is en_GB.UTF-8.
> >>>>Cheers
> >>>>Arian
> >>>try this?
> >>># vim: set fencoding=utf-8:
> >>>import urllib
> >>>import xml.sax as sax, xml.sax.handler as handler
> >>>f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> >>>xml = f.read()
> >>>xml = xml.decode("cp1252")
> >>>f.close()
> >>>class my_handler(handler.ContentHandler):
> >>>    def startElement(self, name, attrs):
> >>>        print "begin:", name, attrs
> >>>    def endElement(self, name):
> >>>        print "end:", name
> >>>sax.parseString(xml, my_handler())
> >>This is wrong. XML is a *byte*-based format, which explicitly states
> >>encodings. So decoding a byte-string to a unicode-object and then
> >>passing it to a parser stops working the very moment you have data that
> >>
> >> - is outside your def