On Thu, Feb 12, 2009 at 12:18 PM, LaundroMat <laun...@gmail.com> wrote:

>
> Hi -
>
> I'm scraping some information from a website, but I'm having some
> trouble with special characters such as é. I'm using BeautifulSoup for
> the scraping, and would like to be able to have Django print out my
> strings correctly (on the template, in the shell, in the admin).
>
> The way I go about it is:
>
> >>> from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
> >>> html = "André goes to town"
> >>> soup = BeautifulSoup(html)
> >>> soup
> Andr&sbquo; goes to town


OK, so you've got a problem here, but I'm not sure from what you've said
that you realize it or recognize what it is exactly. (It always helps when
people say what they expected and where it differs from what they got.)
&sbquo; is the single low-9 quotation mark, not Latin small letter e with
acute.

This implies BeautifulSoup has guessed the wrong encoding for your html
string.  It appears to me you are using a Windows command prompt that is
using cp437, where é is encoded as the byte 0x82, but BeautifulSoup is
guessing the string is encoded using cp1252, where the byte 0x82 is
assigned to the single low-9 quotation mark.  So at this point your Latin
small letter e with acute has been turned into an entirely different
character by BeautifulSoup.  That character happens to be U+201A, which is
what you see in the rest of what you show.
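
The mismatch can be reproduced without BeautifulSoup at all. The following is
a minimal sketch, assuming Python 3 (the session in this thread is Python 2)
and assuming the console really did hand over the single byte 0x82 for é; the
variable names are only illustrative:

```python
# Assumption: this is the byte string a cp437 console produces for
# "André goes to town" (0x82 is é in cp437).
raw = b"Andr\x82 goes to town"

as_cp437 = raw.decode("cp437")    # what the console meant
as_cp1252 = raw.decode("cp1252")  # what a wrong guess of cp1252 yields

# ascii() shows the code points without needing a unicode-capable console.
print(ascii(as_cp437))   # 'Andr\xe9 goes to town'
print(ascii(as_cp1252))  # 'Andr\u201a goes to town'
```

The same byte decodes to U+00E9 under one code page and U+201A under the
other, which matches the \u201a seen in the session above.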


> >>> soup = BeautifulSoup(html,
> convertEntities=BeautifulStoneSoup.HTML_ENTITIES)


So here all you are doing is asking BeautifulSoup to use the unicode value
of the entity instead of the &sbquo; entity reference...but since it's still
guessing wrong on the encoding, you still wind up with the wrong thing, only
now it is a unicode character value, which also leads to difficulties in
printing it out in your Windows command prompt.


> >>> soup
> Traceback (most recent call last):
>  File "<console>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u201a' in
> position 4: ordinal not in range(128)


The object returned by BeautifulSoup apparently doesn't return a bytestring
from its repr, and Python's attempt to auto-convert it to str fails since it
contains a character that has no mapping in ASCII.
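
The failure itself does not depend on BeautifulSoup. As a rough sketch
(written for Python 3, with the implicit conversion mimicked by an explicit
.encode("ascii"); the variable names are illustrative only):

```python
text = u"Andr\u201a goes to town"

try:
    text.encode("ascii")          # ASCII has no mapping for U+201A
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    bad_position = exc.start      # index of the first unencodable character

# A codec that can represent the character encodes it without complaint.
utf8_bytes = text.encode("utf-8")
```

Here bad_position comes out as 4, the same "position 4" reported in the
traceback above.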


> >>> soup.contents
> [u'Andr\u201a goes to town']
> >>> soup.contents[0]
> u'Andr\u201a goes to town'
>
> >>> from myapp.events.models import Event
> >>> e = Event(title = soup.contents[0])
> >>> e.save()
> >>> e.name
> u'Andr\u201a goes to town'
>

These others are all ways of displaying the value of the unicode object in
an ASCII-only format, to avoid that UnicodeEncodeError above.  I'm not sure
if you are objecting to the fact that the reprs are printed using \u201a
notation or if you are objecting to the fact that your small e with acute
accent has been turned into a single low-9 quotation mark?


>
> But, as you see, the unicode does not get translated.


Translated to what? You've got something that is not representable in ASCII
and are trying to display it in a Windows (I think) command prompt, which is
notoriously bad at handling unicode.  But most of what you've run into here
is an artifact of the fact you appear to be using a Windows command prompt,
so I doubt it is relevant to your actual Django code.  If I try something
similar in a Linux command prompt with a utf-8 encoding, I get:

Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
>>> html = "André goes to town"
>>> soup = BeautifulSoup(html)
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position
4: ordinal not in range(128)
>>> print soup
André goes to town
>>> soup = BeautifulSoup(html,
convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
>>> print soup
André goes to town
>>> soup.contents
[u'Andr\xe9 goes to town']

In this case BeautifulSoup doesn't mis-guess what the encoding is, so the
accented character never gets converted to the oddball quote mark, though as
you see you can still run into errors if you do something that tries to
convert that unicode string to ASCII, since it contains a non-ASCII
character.


> What steps
> should I take in order to make sure my strings are saved (and later
> displayed) correctly?
>

First, ensure that BeautifulSoup will either guess the correct encoding for
the strings you are feeding it, or provide the correct encoding yourself.
See:

http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode,%20Dammit
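
One way to provide the encoding yourself is to decode the scraped bytes
before any parsing happens. This is only a sketch under stated assumptions:
it is written for Python 3, to_unicode is a hypothetical helper, and
ISO-8859-1 stands in for whatever encoding the site actually declares. (The
documentation linked above also describes a fromEncoding argument to the
BeautifulSoup constructor for the same purpose.)

```python
def to_unicode(raw_bytes, encoding):
    """Decode scraped bytes with a known encoding (hypothetical helper).

    Decoding is strict, so bytes that are invalid in the given encoding
    raise UnicodeDecodeError instead of silently turning into other
    characters.
    """
    return raw_bytes.decode(encoding)


# Assumption: the page is served as ISO-8859-1; use the encoding given in
# the site's Content-Type header or <meta> tag instead.
page_bytes = b"Andr\xe9 goes to town"
text = to_unicode(page_bytes, "iso-8859-1")

print(ascii(text))  # 'Andr\xe9 goes to town'
```

Once the text is a unicode object with the right characters in it, nothing
downstream has to guess.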

Second, don't try to use a Windows command prompt that uses cp437 encoding
to test things out; that just increases confusion.  The Windows command to
change the code page is chcp, and there is supposedly a code page 65001 that
is for utf-8 (note it only works if you do not use raster fonts in your
command prompt, so you may have to change the font setting as well), but I
have had little luck in using it.  You might want to try using the IDLE GUI
on Windows since it may deal better with unicode/utf-8, though I can't say
I've tried that myself.  I usually just use a Linux box for any shell testing
that requires sane dealing with unicode.
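
To see which encoding a given shell is using, and whether it can show a
particular string at all, something like the following may help. It is a
sketch for Python 3, and both function names are made up for illustration:

```python
import sys


def console_encoding():
    """Return the encoding Python reports for stdout, or None if unknown."""
    return getattr(sys.stdout, "encoding", None)


def can_display(text, encoding):
    """Return True if text can be encoded with the given console encoding."""
    if not encoding:
        return False
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        # LookupError covers encoding names Python does not know about.
        return False


print(console_encoding())
print(can_display(u"Andr\u00e9", "cp437"))  # True: cp437 includes é
print(can_display(u"Andr\u201a", "cp437"))  # False: cp437 has no U+201A
```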

Karen

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Django users" group.