On Thu, 01 Nov 2007 19:21:03 -0700, 7stud wrote:

> BeautifulSoup can convert an html entity representing an 'A' with
> umlaut, e.g.:
>
> &Auml;
>
> into an Ä without ever touching my keyboard.  How does BeautifulSoup
> do it?

It maps the HTML entity names to Unicode characters.  Take a look at the
`htmlentitydefs` module.

> from BeautifulSoup import BeautifulStoneSoup as bss
>
> s1 = "<h1>&Auml;</h1>"  #&_Auml;_
> #I added the comment after the line to show the
> #format of the html entity.  In case a browser
> #might render the comment into the actual character,
> #I added underscores to the html entity:
>
> soup = bss(s1)
> text = soup.contents[0].string  #gets the 'A' with umlaut out of the html
>
> new_s = bss(text, convertEntities=bss.HTML_ENTITIES)
> print repr(new_s)
> print new_s
>
> I see the same output for both print statements, and what I see is an
> 'A' with umlaut.  I expected that the first print statement would show
> the utf-8 encoding for the character.

Well, it does, and apparently your terminal, or wherever the output goes,
decodes that UTF-8 encoded 'Ä' and shows it.  If you expected the output
'\xc3\x84', then remember that you are asking the soup object for its
representation, not a string for its representation.  The object itself
decides what `repr(obj)` returns.  Soup objects represent themselves as
UTF-8 encoded strings.

Ciao,
	Marc 'BlackJack' Rintsch
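
To illustrate the entity-name mapping mentioned above, here is a minimal sketch. It is not BeautifulSoup's actual code. It assumes Python 3, where the `htmlentitydefs` module is available as `html.entities`. The helper name `replace_named_entities` and its regular expression are made up for this example. The last line shows the bytes you would see if you asked a plain string, rather than a soup object, for its UTF-8 encoding.

```python
import re
from html.entities import name2codepoint  # Python 3 name of htmlentitydefs


def replace_named_entities(text):
    """Replace named entities such as &Auml; with their Unicode characters.

    Hypothetical helper for illustration only; BeautifulSoup's own
    conversion handles more cases (numeric entities, XML entities, ...).
    """
    def substitute(match):
        name = match.group(1)
        if name in name2codepoint:
            # Look up the code point for the entity name and build the character.
            return chr(name2codepoint[name])
        # Leave unknown entity names untouched.
        return match.group(0)

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", substitute, text)


converted = replace_named_entities("<h1>&Auml;</h1>")
print(converted)                  # the heading with the real character in it
print(converted.encode("utf-8"))  # the UTF-8 bytes, with 'Ä' as \xc3\x84
```

The dictionary lookup is the whole trick: `name2codepoint["Auml"]` is 196, the code point of 'Ä', so no keyboard input is involved.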