On Feb 12, 8:33 pm, Karen Tracey <kmtra...@gmail.com> wrote:
> On Thu, Feb 12, 2009 at 12:18 PM, LaundroMat <laun...@gmail.com> wrote:
>
> > Hi -
>
> > I'm scraping some information from a website, but I'm having some
> > trouble with special characters such as é. I'm using BeautifulSoup for
> > the scraping, and would like to be able to have Django print out my
> > strings correctly (on the template, in the shell, in the admin).
>
> > The way I go about it is:
>
> > >>> from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
> > >>> html = "André goes to town"
> > >>> soup = BeautifulSoup(html)
> > >>> soup
> > Andr&sbquo; goes to town
>
> OK, so you've got a problem here, but I'm not sure from what you've said
> that you realize it or recognize what it is exactly. (It always helps when
> people say what they expected and where it differs from what they got.)
> &sbquo; is the single low-9 quotation mark, not Latin small letter e with
> acute.
>
> This implies BeautifulSoup has guessed wrong about the encoding of your html
> string.  It appears to me you are using a Windows command prompt that is
> using cp437, where é is the byte 0x82, but BeautifulSoup is guessing the
> string is encoded using cp1252, where the byte 0x82 is assigned to the
> single low-9 quotation mark.  So at this point your Latin small letter e
> with acute has been turned into an entirely different character by
> BeautifulSoup.  That character happens to be U+201A, which is what you see
> in the rest of what you show.
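
A minimal sketch of that mix-up, written for Python 3 rather than the Python 2 session quoted in this thread. The byte string is an assumption about what a cp437 console would hand to the parser; the variable names are illustrative only.

```python
# The same byte maps to different characters under different code pages.
raw = b"Andr\x82 goes to town"  # assumed bytes from a cp437 console

as_cp437 = raw.decode("cp437")    # the encoding the console actually used
as_cp1252 = raw.decode("cp1252")  # the encoding that was guessed instead

# ascii() shows non-ASCII characters as escapes, so this prints safely
# on any console.
print(ascii(as_cp437))   # 'Andr\xe9 goes to town'
print(ascii(as_cp1252))  # 'Andr\u201a goes to town'
```

Decoding with cp437 recovers the é (U+00E9); decoding with cp1252 yields U+201A, the quotation mark that shows up in the rest of the session.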
>
> > >>> soup = BeautifulSoup(html,
> > convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
>
> So here all you are doing is asking BeautifulSoup to use the unicode value
> of the entity instead of the &sbquo;...but since it's still guessing wrong
> on the encoding, you still wind up with the wrong thing, only now it is a
> unicode character value which also leads to difficulties in printing it out
> in your Windows command prompt.
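
To illustrate what entity conversion does on its own, here is a small Python 3 sketch that uses the standard library's html.unescape as a stand-in for BeautifulSoup's convertEntities option (an assumption for illustration; it is not the code path BeautifulSoup takes).

```python
import html

# The entity is already the wrong character; converting it only changes
# how that wrong character is represented.
text = "Andr&sbquo; goes to town"
converted = html.unescape(text)

print(ascii(converted))  # 'Andr\u201a goes to town'
```

The result holds U+201A, not U+00E9, so the conversion cannot undo the earlier wrong guess about the encoding.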
>
> > >>> soup
> > Traceback (most recent call last):
> >  File "<console>", line 1, in <module>
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\u201a' in
> > position 4: ordinal not in range(128)
>
> This thing returned by BeautifulSoup apparently doesn't return a bytestring
> repr, and Python's attempt to auto-convert it to str fails since it contains
> a character that has no mapping in ASCII.
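
The same failure can be reproduced with a plain string, sketched here in Python 3 (the thread itself uses Python 2, where the conversion to ASCII happens implicitly). The error handlers shown are standard codec options, offered only as ways to get something printable.

```python
title = "Andr\u201a goes to town"

# Strict ASCII encoding fails on the first character outside range(128).
try:
    title.encode("ascii")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc)

# Error handlers trade fidelity for something that always encodes.
escaped = title.encode("ascii", "backslashreplace")  # keeps the code point visible
replaced = title.encode("ascii", "replace")          # substitutes a question mark
print(escaped)
print(replaced)
```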
>
> > >>> soup.contents
> > [u'Andr\u201a goes to town']
> > >>> soup.contents[0]
> > u'Andr\u201a goes to town'
>
> > >>> from myapp.events.models import Event
> > >>> e = Event(title = soup.contents[0])
> > >>> e.save()
> > >>> e.title
> > u'Andr\u201a goes to town'
>
> These others are all ways of displaying the value of the unicode object in
> an ASCII-only format, to avoid that EncodeError above.  I'm not sure if you
> are objecting to the fact that the reprs are printed using \u201a notation
> or to the fact that your small e with acute accent has been turned into a
> single low-9 quotation mark.
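
The two concerns can be told apart by looking at the character itself rather than at how it is displayed. A short Python 3 sketch (variable names are illustrative):

```python
import unicodedata

wrong = "Andr\u201a goes to town"  # what ended up in the model
right = "Andr\u00e9 goes to town"  # what was intended

# Escaped display is only presentation; both strings get escapes.
print(ascii(wrong))
print(ascii(right))

# The character names show which value is actually stored.
wrong_name = unicodedata.name(wrong[4])
right_name = unicodedata.name(right[4])
print(wrong_name)
print(right_name)
```

If the escape shown is \xe9, only the display is ASCII-safe and the data is fine. If it is \u201a, the data itself is wrong.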
>
>
>
> > But, as you see, the unicode does not get translated.
>
> Translated to what? You've got something that is not representable in ASCII
> and are trying to display it in a Windows (I think) command prompt, which is
> notoriously bad at handling unicode.  But most of what you've run into here
> is an artifact of the Windows command prompt you appear to be using, so I
> doubt it is relevant to your actual Django code.  If I try something
> similar in a Linux command prompt with a utf-8 encoding, I get:
>
> Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
> [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
> >>> html = "André goes to town"
> >>> soup = BeautifulSoup(html)
> >>> soup
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position
> 4: ordinal not in range(128)
> >>> print soup
> André goes to town
> >>> soup = BeautifulSoup(html,
> convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
> >>> print soup
> André goes to town
> >>> soup.contents
> [u'Andr\xe9 goes to town']
>
> In this case BeautifulSoup doesn't mis-guess what the encoding is, so the
> accented character never gets converted to the oddball quote mark, though as
> you see you can still run into errors if you do something that tries to
> convert that unicode string to ASCII, since it contains a non-ASCII
> character.
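
One plausible reason a utf-8 terminal is easier to get right, sketched in Python 3: utf-8 bytes round-trip cleanly, and a strict utf-8 decoder rejects cp437 bytes outright instead of quietly producing a different character. This says nothing certain about BeautifulSoup's internal detection order; it only shows the property of the encodings.

```python
typed = "Andr\u00e9 goes to town"

utf8_bytes = typed.encode("utf-8")    # é becomes the two bytes C3 A9
cp437_bytes = typed.encode("cp437")   # é becomes the single byte 82

# utf-8 bytes decode back to exactly what was typed.
round_trip = utf8_bytes.decode("utf-8")

# cp437 bytes are not valid utf-8, so the mismatch is detectable.
try:
    cp437_bytes.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(utf8_bytes, cp437_bytes, rejected)
```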
>
> > What steps should I take in order to make sure my strings are saved (and
> > later displayed) correctly?
>
> First, ensure that BeautifulSoup will either guess the correct encoding for
> the strings you are feeding it, or provide the correct encoding yourself.
> See:
>
> http://www.crummy.com/software/BeautifulSoup/documentation.html#Beaut...
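
A rough Python 3 sketch of supplying the encoding yourself: decode the raw bytes before they reach the parser, so nothing has to be guessed. The helper decode_markup and its fallback choice are hypothetical, not part of BeautifulSoup. BeautifulSoup 3 also documents a fromEncoding argument for the same purpose, which is not exercised here.

```python
def decode_markup(raw, encoding, fallback="cp1252"):
    """Decode raw HTML bytes with a known encoding.

    Hypothetical helper: if the stated encoding is unknown or does not
    fit the bytes, fall back to a second codec and replace anything
    that still cannot be decoded.
    """
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return raw.decode(fallback, errors="replace")


# Bytes as a cp437 console would produce them (an assumption).
raw = b"Andr\x82 goes to town"
markup = decode_markup(raw, "cp437")
print(ascii(markup))  # 'Andr\xe9 goes to town'
```

The decoded text can then be handed to the parser as unicode, which leaves no room for a wrong guess.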
>
> Second, don't use a Windows command prompt with cp437 encoding to test
> things out; that just increases confusion.  The Windows command to change
> the code page is chcp, and there is supposedly a code page 65001 for utf-8
> (note it only works if you do not use raster fonts in your command prompt,
> so you may have to change the font setting as well), but I have had little
> luck using it.  You might want to try the IDLE GUI on Windows since it may
> deal better with unicode/utf-8, though I can't say I've tried that myself.
> I usually just use a Linux box for any shell testing that requires sane
> dealing with unicode.
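
Before trusting what a shell displays, it can help to check which encoding Python thinks the console uses. A small Python 3 sketch; console_encoding and printable are hypothetical helper names, and the "ascii" default is an assumption for streams that report no encoding.

```python
import sys


def console_encoding(stream=None):
    """Return the encoding the given stream reports, or 'ascii' if none."""
    stream = sys.stdout if stream is None else stream
    return getattr(stream, "encoding", None) or "ascii"


def printable(text, encoding):
    """Return text with characters the encoding lacks shown as escapes."""
    return text.encode(encoding, "backslashreplace").decode(encoding)


enc = console_encoding()
print("console encoding:", enc)
print(printable("Andr\u00e9 goes to town", enc))
```

On a utf-8 terminal this prints the é as typed; on a console whose encoding lacks the character it prints an escape instead of raising an error.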
>
> Karen

I'll print this out and hang it on my wall. Thanks ever so much for
this great, lengthy and very informative reply.