umlauts

2009-10-17 Thread Arian Kuschki
Hi all

this has been bugging me for a long time and I do not seem to be able to 
understand what to do. I always have problems when dealing with input text that 
contains umlauts. Consider the following:

In [1]: import urllib

In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")

In [3]: xml = f.read()

In [4]: f.close()

In [5]: print xml
--> print(xml)


As you can see the umlauts in the XML are not displayed properly. When I want 
to process this text (for example with xml.sax), I get error messages because 
the parser can't read this.

I've tried to read up on this and there is a lot of information on the web, but 
nothing seems to work for me. For example setting the coding to UTF like this: 
# -*- coding: utf-8 -*- or using the decode() string method.

I always have this kind of problem when input contains umlauts, not just in 
this case. My locale (on Ubuntu) is en_GB.UTF-8.

Cheers
Arian



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: umlauts

2009-10-17 Thread Arian Kuschki
Whoa, that was quick! Thanks for all the answers, I'll try to recapitulate

>What does this show you in your interactive interpreter?
>
>>>> print "\xc3\xb6"
>ö
>
>For me, it's o-umlaut, ö. This is because the above bytes are the
>sequence for ö in utf-8.
>
>If this shows something else, you need to adjust your terminal settings.

for me it also prints the correct o-umlaut (ö), so that was not the problem.
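
A quick way to see what those bytes mean, as a minimal sketch in Python 3 syntax (the thread itself uses Python 2; the helper names encoded_forms and is_valid_utf8 are made up for illustration). The same character ö is two bytes in UTF-8 but a single byte, 0xf6, in ISO-8859-1 and cp1252, and that single byte is not valid UTF-8 on its own.

```python
def encoded_forms(text, encodings=("utf-8", "iso-8859-1", "cp1252")):
    """Return the byte sequence of `text` in each of the given encodings."""
    return {enc: text.encode(enc) for enc in encodings}


def is_valid_utf8(raw_bytes):
    """Report whether `raw_bytes` can be decoded as UTF-8."""
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


forms = encoded_forms("\u00f6")  # "\u00f6" is o-umlaut
for name, raw in forms.items():
    print(name, repr(raw), "valid UTF-8:", is_valid_utf8(raw))
```

So a terminal that renders "\xc3\xb6" correctly only shows that UTF-8 output works; it says nothing about whether the downloaded data is UTF-8.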


All of the below result in xml that shows all umlauts correctly when printed:

xml.decode("cp1252")
xml.decode("cp1252").encode("utf-8")
xml.decode("iso-8859-1")
xml.decode("iso-8859-1").encode("utf-8")

But when I then want to parse the xml, it only works if I do both decode
and encode. If I only decode, I get the following error:
SAXParseException: :1:1: not well-formed (invalid token)
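
The decode-then-encode route can be sketched as follows. This is a minimal sketch in Python 3 syntax, not the code from the thread: the sample document, the AttrCollector handler and the two helper functions are assumptions made up for illustration, and the input is assumed to be ISO-8859-1 bytes without an XML declaration, as the response here appears to be.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class AttrCollector(ContentHandler):
    """Collect the value of every 'data' attribute seen while parsing."""

    def __init__(self):
        super().__init__()
        self.values = []

    def startElement(self, name, attrs):
        if "data" in attrs:
            self.values.append(attrs["data"])


def parse_latin1(raw_bytes):
    """Re-encode ISO-8859-1 bytes as UTF-8, then hand the bytes to SAX."""
    utf8_bytes = raw_bytes.decode("iso-8859-1").encode("utf-8")
    collector = AttrCollector()
    xml.sax.parseString(utf8_bytes, collector)
    return collector.values


def raw_parse_fails(raw_bytes):
    """Return True if parsing the bytes unchanged raises SAXParseException."""
    try:
        xml.sax.parseString(raw_bytes, AttrCollector())
    except xml.sax.SAXParseException:
        return True
    return False


# Hypothetical stand-in for the downloaded response: Latin-1, no declaration.
raw = '<condition data="Meistens bew\u00f6lkt"/>'.encode("iso-8859-1")

print(parse_latin1(raw))
print(raw_parse_fails(raw))
```

Without a declaration the parser assumes UTF-8, so the undeclared Latin-1 bytes are rejected while the re-encoded bytes go through.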

Do I understand right that since the encoding was not specified in the xml 
response, it should have been utf-8 by default? And that if it had indeed been 
utf-8 I would not have had the encoding problem in the first place?

Anyway, thanks everybody, this has helped me a lot.

Arian


On Sat 17, 20:17 +0200, Diez B. Roggisch wrote:

> StarWing wrote:
> >On Oct 18, 12:50 am, "Diez B. Roggisch" wrote:
> >>StarWing wrote:
> >>
> >>
> >>
> >>>On Oct 17, 9:54 pm, Arian Kuschki wrote:
> >>>>Hi all
> >>>>this has been bugging me for a long time and I do not seem to be able to
> >>>>understand what to do. I always have problems when dealing with input text that
> >>>>contains umlauts. Consider the following:
> >>>>In [1]: import urllib
> >>>>In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> >>>>In [3]: xml = f.read()
> >>>>In [4]: f.close()
> >>>>In [5]: print xml
> >>>>--> print(xml)
> >>>>[XML weather response; the markup was lost in the archive. Surviving
> >>>>attribute values include data="Munich, BY", data="2009-10-17",
> >>>>data="Meistens bew kt" (umlaut garbled) and data="Feuchtigkeit: 87 %".]
> >>>>As you can see the umlauts in the XML are not displayed properly. When I
> >>>>want to process this text (for example with xml.sax), I get error messages
> >>>>because the parser can't read this.
> >>>>I've tried to read up on this and there is a lot of information on the
> >>>>web, but nothing seems to work for me. For example setting the coding to
> >>>>UTF like this:
> >>>># -*- coding: utf-8 -*- or using the decode() string method.
> >>>>I always have this kind of problem when input contains umlauts, not just
> >>>>in this case. My locale (on Ubuntu) is en_GB.UTF-8.
> >>>>Cheers
> >>>>Arian
> >>>try this?
> >>># vim: set fencoding=utf-8:
> >>>import urllib
> >>>import xml.sax as sax, xml.sax.handler as handler
> >>>f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> >>>xml = f.read()
> >>>xml = xml.decode("cp1252")
> >>>f.close()
> >>>class my_handler(handler.ContentHandler):
> >>>    def startElement(self, name, attrs):
> >>>        print "begin:", name, attrs
> >>>    def endElement(self, name):
> >>>        print "end:", name
> >>>sax.parseString(xml, my_handler())
> >>This is wrong. XML is a *byte*-based format, which explicitly states
> >>encodings. So decoding a byte-string to a unicode-object and then
> >>passing it to a parser stops working the very moment you have data that
> >>
> >>  - is outside your def

Re: umlauts

2009-10-17 Thread Arian Kuschki
I just checked and I see the following in the headers:
Content-Type text/xml; charset=UTF-8

Where does it say ISO-8859-1?

On Sat 17, 20:57 +0200, I V wrote:

> On Sat, 17 Oct 2009 18:54:10 +0200, Diez B. Roggisch wrote:
> 
> > This is weird. I looked at the site in Firefox - and it was displayed
> > correctly, including umlauts. Bringing up the info-dialog claims the
> > page is UTF-8, the XML itself says so as well (implicit, through the
> > missing declaration of an encoding) - but it clearly is *not* utf-8.
> 
> The headers correctly identify it as ISO-8859-1, which overrides the 
> implicit specification of UTF-8. I'm not sure why Firefox is reporting it 
> as UTF-8 (it does that for me, too); I can see the umlauts, so it's 
> clearly processing it as ISO-8859-1.


Re: umlauts

2009-10-17 Thread Arian Kuschki
Hm yes, that is true. In Firefox on the other hand, the response header is
"Content-Type text/xml; charset=UTF-8"

On Sat 17, 13:16 -0700, Mark Tolonen wrote:

> 
> "Diez B. Roggisch"  wrote in message
> news:7jub5rf37div...@mid.uni-berlin.de...
> [snip]
> >This is weird. I looked at the site in Firefox - and it was
> >displayed correctly, including umlauts. Bringing up the
> >info-dialog claims the page is UTF-8, the XML itself says so as
> >well (implicit, through the missing declaration of an encoding) -
> >but it clearly is *not* utf-8.
> >
> >One would expect google to be better at this...
> >
> >Diez
> 
> According to the XML 1.0 specification:
> 
> "Although an XML processor is required to read only entities in the
> UTF-8 and UTF-16 encodings, it is recognized that other encodings
> are used around the world, and it may be desired for XML processors
> to read entities that use them. In the absence of external character
> encoding information (such as MIME headers), parsed entities which
> are stored in an encoding other than UTF-8 or UTF-16 must begin with
> a text declaration..."
> 
> So UTF-8 and UTF-16 are the defaults supported without an xml
> declaration in the absence of external encoding information.  But we
> have external character encoding information:
> 
> >>>f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> >>>f.headers.dict['content-type']
> 'text/xml; charset=ISO-8859-1'
> 
> So the page seems correct.
> 
> -Mark
> 
> 
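
Picking the charset out of the header value quoted above, rather than guessing it, might look like this. It is a minimal sketch in Python 3 syntax using the standard library's email.message.Message; the function name charset_from_content_type and the UTF-8 fallback are assumptions made for illustration, and no network request is made.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


# Header value as reported in the message above.
charset = charset_from_content_type("text/xml; charset=ISO-8859-1")

# Hypothetical fragment of the response body, as Latin-1 bytes.
body = b'data="Meistens bew\xf6lkt"'
text = body.decode(charset)

print(charset)
print(text)
```

In Python 3 the headers object on a urllib.request response is itself a Message subclass, so get_content_charset() can be called on it directly.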