On Thu, Feb 12, 2009 at 12:18 PM, LaundroMat <laun...@gmail.com> wrote:
>
> Hi -
>
> I'm scraping some information from a website, but I'm having some
> trouble with special characters such as é. I'm using BeautifulSoup for
> the scraping, and would like to be able to have Django print out my
> strings correctly (on the template, in the shell, in the admin).
>
> The way I go about it is:
>
> >>> from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
> >>> html = "André goes to town"
> >>> soup = BeautifulSoup(html)
> >>> soup
> Andr‚ goes to town

OK, so you've got a problem here, but I'm not sure from what you've said that you realize it or recognize what it is exactly. (It always helps when people say what they expected and where it differs from what they got.)

‚ is single low-9 quotation mark, not Latin small letter e with acute. This implies BeautifulSoup has guessed wrong what the encoding for your html string is. It appears to me you are using a Windows command prompt that is using cp437, where é is encoded as the byte value x82, but BeautifulSoup is guessing the string is encoded using cp1252, where the byte value x82 is assigned to the single low-9 quotation mark. So at this point your Latin small letter e with acute has been turned into an entirely different character by BeautifulSoup. That character happens to be U+201A, which is what you see in the rest of what you show.

> >>> soup = BeautifulSoup(html,
> convertEntities=BeautifulStoneSoup.HTML_ENTITIES)

So here all you are doing is asking BeautifulSoup to use the unicode value of the entity instead of the ‚...but since it's still guessing wrong on the encoding, you still wind up with the wrong thing, only now it is a unicode character value, which also leads to difficulties in printing it out in your Windows command prompt.
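
The cp437 versus cp1252 mismatch described above can be checked without BeautifulSoup at all. The following is only a minimal sketch, written in Python 3 syntax (the session quoted in this thread is Python 2), and the byte string `raw` is an assumed stand-in for what a cp437 console would hand to the parser; it is not taken from the original session.

```python
# Sketch (Python 3): one byte, two code pages, two different characters.
# `raw` is a hypothetical stand-in for the bytes a cp437 console produces
# when "André goes to town" is typed at the prompt.
raw = b"Andr\x82 goes to town"

# Decoding with the code page the console actually used gives the intended text.
as_cp437 = raw.decode("cp437")

# Decoding with the code page the parser guessed gives a different character.
as_cp1252 = raw.decode("cp1252")

print(ascii(as_cp437))   # 'Andr\xe9 goes to town'   (e with acute, U+00E9)
print(ascii(as_cp1252))  # 'Andr\u201a goes to town' (single low-9 quote, U+201A)
```

The same byte x82 comes out as U+00E9 under cp437 and as U+201A under cp1252, which matches the \u201a that shows up in the rest of the quoted session.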
> >>> soup
> Traceback (most recent call last):
>   File "<console>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u201a' in
> position 4: ordinal not in range(128)

This thing returned by BeautifulSoup apparently doesn't return a bytestring repr, and Python's attempt to auto-convert it to str fails since it contains a character that has no mapping in ASCII.

> >>> soup.contents
> [u'Andr\u201a goes to town']
> >>> soup.contents[0]
> u'Andr\u201a goes to town'
>
> >>> from myapp.events.models import Event
> >>> e = Event(title = soup.contents[0])
> >>> e.save()
> >>> e.name
> u'Andr\u201a goes to town'
>

These others are all ways of displaying the value of the unicode object in an ASCII-only format, to avoid that UnicodeEncodeError above. I'm not sure if you are objecting to the fact that the reprs are printed using \u201a notation or if you are objecting to the fact that your small e with acute accent has been turned into a single low-9 quotation mark?

>
> But, as you see, the unicode does not get translated.

Translated to what? You've got something that is not representable in ASCII and are trying to display it in a Windows (I think) command prompt, which is notoriously bad at handling unicode.

But most of what you've run into here is an artifact of the fact you appear to be using a Windows command prompt, so I doubt it is actually relevant to your actual Django code. If I try something similar in a Linux command prompt with a utf-8 encoding, I get:

Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
>>> html = "André goes to town"
>>> soup = BeautifulSoup(html)
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 4: ordinal not in range(128)
>>> print soup
André goes to town
>>> soup = BeautifulSoup(html, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
>>> print soup
André goes to town
>>> soup.contents
[u'Andr\xe9 goes to town']

In this case BeautifulSoup doesn't mis-guess what the encoding is, so the accented character never gets converted to the oddball quote mark, though as you see you can still run into errors if you do something that tries to convert that unicode string to ASCII, since it contains a non-ASCII character.

> What steps
> should I take in order to make sure my strings are saved (and later
> displayed) correctly?
>

First, ensure that BeautifulSoup will either guess the correct encoding for the strings you are feeding it, or provide the correct encoding yourself. See:

http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode,%20Dammit

Second, don't try to use a Windows command prompt that uses cp437 encoding to test things out; that just increases confusion. The Windows command to change the code page is chcp. There is supposedly a code page 65001 that is for utf8 (note it only works if you do not use raster fonts in your command prompt, so you may have to change the font setting as well), but I have had little luck in using it. You might want to try using the IDLE GUI on Windows since it may deal better with unicode/utf-8, though I can't say I've tried that myself. I usually just use a Linux box for any shell testing that requires sane dealing with unicode.

Karen