On Feb 12, 8:33 pm, Karen Tracey <kmtra...@gmail.com> wrote:
> On Thu, Feb 12, 2009 at 12:18 PM, LaundroMat <laun...@gmail.com> wrote:
>
> > Hi -
> >
> > I'm scraping some information from a website, but I'm having some
> > trouble with special characters such as é. I'm using BeautifulSoup for
> > the scraping, and would like to be able to have Django print out my
> > strings correctly (on the template, in the shell, in the admin).
> >
> > The way I go about it is:
> >
> > >>> from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
> > >>> html = "André goes to town"
> > >>> soup = BeautifulSoup(html)
> > >>> soup
> > Andr&sbquo; goes to town
>
> OK, so you've got a problem here, but I'm not sure from what you've said
> that you realize it or recognize what it is exactly. (It always helps when
> people say what they expected and where it differs from what they got.)
> &sbquo; is single low-9 quotation mark, not Latin small letter e with acute.
>
> This implies BeautifulSoup has guessed wrong what the encoding for your html
> string is. It appears to me you are using a Windows command prompt that is
> using cp437, where é has the code point value x82, but BeautifulSoup is
> guessing the string is encoded using cp1252, where code point value x82 is
> assigned to the single low-9 quotation mark. So at this point your Latin
> small letter e with acute has been turned into an entirely different
> character by BeautifulSoup. That character happens to be U+201A, which is
> what you see in the rest of what you show.
>
> > >>> soup = BeautifulSoup(html,
> > convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
>
> So here all you are doing is asking BeautifulSoup to use the unicode value
> of the entity instead of the &sbquo;...but since it's still guessing wrong
> on the encoding, you still wind up with the wrong thing, only now it is a
> unicode character value which also leads to difficulties in printing it out
> in your Windows command prompt.
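The cp437/cp1252 mix-up described above can be reproduced without BeautifulSoup. This is only a minimal sketch, written for Python 3 (the session in this thread is Python 2, so the syntax differs), and the variable names are purely illustrative. It takes the byte a cp437 console produces for é and decodes it both ways:

```python
# Bytes as a cp437 console would hand them over; "\u00e9" is the letter e with acute.
raw = "Andr\u00e9 goes to town".encode("cp437")
print(hex(raw[4]))   # the byte standing for the accented letter

# Decoding with the wrong codec (cp1252) silently yields a different character.
wrong = raw.decode("cp1252")

# Decoding with the codec that actually produced the bytes recovers the text.
right = raw.decode("cp437")

# ascii() escapes non-ASCII characters, so this is safe on any console.
print(ascii(wrong))
print(ascii(right))
```

No exception is raised in the wrong case, which is why the problem only shows up later as an unexpected character.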
>
> > >>> soup
> > Traceback (most recent call last):
> >   File "<console>", line 1, in <module>
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\u201a' in
> > position 4: ordinal not in range(128)
>
> This thing returned by BeautifulSoup apparently doesn't return a bytestring
> repr, and Python's attempt to auto-convert it to str fails since it contains
> a character that has no mapping in ASCII.
>
> > >>> soup.contents
> > [u'Andr\u201a goes to town']
> > >>> soup.contents[0]
> > u'Andr\u201a goes to town'
> >
> > >>> from myapp.events.models import Event
> > >>> e = Event(title = soup.contents[0])
> > >>> e.save()
> > >>> e.name
> > u'Andr\u201a goes to town'
>
> These others are all ways of displaying the value of the unicode object in
> an ASCII-only format, to avoid that EncodeError above. I'm not sure if you
> are objecting to the fact that the reprs are printed using \u201a notation
> or if you are objecting to the fact that your small e with acute accent has
> been turned into single low-9 quotation mark?
>
> > But, as you see, the unicode does not get translated.
>
> Translated to what? You've got something that is not representable in
> ASCII and are trying to display it in a Windows (I think) command prompt,
> which is notoriously bad at handling unicode. But most of what you've run
> into here is an artifact of the fact you appear to be using a Windows
> command prompt, so I doubt it is actually relevant to your actual Django code.
> If I try something similar in a Linux command prompt with a utf-8
> encoding, I get:
>
> Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
> [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
> >>> html = "André goes to town"
> >>> soup = BeautifulSoup(html)
> >>> soup
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position
> 4: ordinal not in range(128)
> >>> print soup
> André goes to town
> >>> soup = BeautifulSoup(html,
>         convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
> >>> print soup
> André goes to town
> >>> soup.contents
> [u'Andr\xe9 goes to town']
>
> In this case BeautifulSoup doesn't mis-guess what the encoding is, so the
> accented character never gets converted to the oddball quote mark, though as
> you see you can still run into errors if you do something that tries to
> convert that unicode string to ASCII, since it contains a non-ASCII
> character.
>
> > What steps should I take in order to make sure my strings are saved (and
> > later displayed) correctly?
>
> First, ensure that BeautifulSoup will either guess the correct encoding for
> the strings you are feeding it, or provide the correct encoding yourself.
> See:
>
> http://www.crummy.com/software/BeautifulSoup/documentation.html#Beaut...
>
> Second, don't try to use a Windows command prompt that uses cp437 encoding
> to test things out; that just increases confusion. The Windows command to
> change the code page is chcp, and there is supposedly a code page 65001 that
> is for utf8 (note it only works if you do not use raster fonts in your
> command prompt, so you may have to change the font setting as well), but I
> have had little luck in using it.
> You might want to try using the IDLE GUI on Windows since it may deal
> better with unicode/utf-8, though I can't say I've tried that myself. I
> usually just use a Linux box for any shell testing that requires sane
> dealing with unicode.
>
> Karen
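On the first piece of advice (providing the correct encoding yourself), one option is to decode the scraped bytes explicitly before they reach the parser, so nothing has to be guessed. What follows is a rough sketch only, in Python 3 with just the standard library. `to_text` is a hypothetical helper, and treating the page as UTF-8 is an assumption for the sake of the example; in practice the encoding would come from the HTTP headers or the page's meta tag.

```python
def to_text(raw_bytes, declared_encoding):
    """Hypothetical helper: decode scraped bytes with an encoding we already know."""
    return raw_bytes.decode(declared_encoding)

# Stand-in for bytes fetched from a site (assumed here to be UTF-8).
page = "Andr\u00e9 goes to town".encode("utf-8")
title = to_text(page, "utf-8")

# Forcing the text into ASCII fails, just like the tracebacks in this thread.
try:
    title.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = str(exc)
print(error)

# For an ASCII-only console, escape the character instead of crashing.
printable = title.encode("ascii", "backslashreplace").decode("ascii")
print(printable)
```

The text itself stays intact as unicode for saving to the database; only the console display is escaped.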
I'll print this out and hang it on my wall. Thanks ever so much for this great, lengthy and very informative reply.