Spanish Accents

2011-12-22 Thread Stan Iverson
Hi;
If I write a Python page to print to the web with Spanish accents, all is
well. However, if I read the data from a text file, it prints diamonds with
question marks wherever there are accented vowels. Please advise.
TIA,
Stan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Spanish Accents

2011-12-22 Thread Stan Iverson
On Thu, Dec 22, 2011 at 10:58 AM, Chris Angelico wrote:

> Firstly, are you using Python 2 or Python 3? Things will be slightly
> different, since the default 'str' object in Py3 is Unicode.
>

2

>
> I would guess that your page is being output as UTF-8; you may find
> that the solution is as easy as declaring the encoding of your text
> file when you read it in.
>

So I tried this:

file = open(p + "2.txt")
for line in file:
  print unicode(line, 'utf-8')

and got this error:

   142   print unicode(line, 'utf-8')
   143
   144 print '''http://13gems.com/Sign_Up.py"; method="post" target="_blank">
builtin unicode = , line = '\r\n'

/usr/lib64/python2.4/encodings/utf_8.py in decode(input=, errors='strict')
    14
    15 def decode(input, errors='strict'):
    16 return codecs.utf_16_decode(input, errors, True)
    17
    18 class StreamWriter(codecs.StreamWriter):
global codecs = , codecs.utf_16_decode = , input = , errors = 'strict', builtin True = True

UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 20: truncated data
  args = ('utf16', '\r\n', 20, 21, 'truncated data')
  encoding = 'utf16'
  end = 21
  object = '\r\n'
  reason = 'truncated data'
  start = 20

Tried it with utf-16 with same results.

TIA,

Stan
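
A minimal sketch of one way to narrow down a file's encoding before
decoding it: look at the raw bytes. In UTF-8 an accented Spanish vowel is
two bytes beginning with 0xC3, while in a single-byte encoding such as
ISO-8859-1 it is one byte such as 0xE1. The helper name high_bytes and the
sample strings are hypothetical, and the b"" literals and bytearray assume
Python 2.6 or later (the tracebacks in this thread come from Python 2.4).

```python
def high_bytes(raw):
    """Return the sorted, distinct byte values >= 0x80 found in raw."""
    return sorted(set(b for b in bytearray(raw) if b >= 0x80))


# "está aquí" encoded two different ways (hypothetical sample data).
latin1_sample = b"est\xe1 aqu\xed"          # ISO-8859-1: one byte per vowel
utf8_sample = b"est\xc3\xa1 aqu\xc3\xad"    # UTF-8: two bytes per vowel

print([hex(b) for b in high_bytes(latin1_sample)])  # ['0xe1', '0xed']
print([hex(b) for b in high_bytes(utf8_sample)])    # ['0xa1', '0xad', '0xc3']
```

If the only high bytes in the file look like the first list, the file is
probably not UTF-8.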


Re: Spanish Accents

2011-12-22 Thread Stan Iverson
On Thu, Dec 22, 2011 at 11:30 AM, Rami Chowdhury
wrote:

> Could you try using the 'open' function from the 'codecs' module?
>

I believe this is what you meant:

file = codecs.open(p + "2.txt", "r", "utf-8")
for line in file:
  print line

but got this error:

   141 file = codecs.open(p + "2.txt", "r", "utf-8")
   142 for line in file:
   143   print line
   144
line = '\r\n', file =

/usr/lib64/python2.4/codecs.py in next(self=)
   492
   493 """ Return the next decoded line from the input stream."""
   494 return self.reader.next()
   495
   496 def __iter__(self):
self = , self.reader = , self.reader.next = >

/usr/lib64/python2.4/codecs.py in next(self=)
   429
   430 """ Return the next decoded line from the input stream."""
   431 line = self.readline()
   432 if line:
   433 return line
line undefined, self = , self.readline = >

/usr/lib64/python2.4/codecs.py in readline(self=, size=None, keepends=True)
   344 # If size is given, we call read() only once
   345 while True:
   346 data = self.read(readsize, firstline=True)
   347 if data:
   348 # If we're at a "\r" read one extra character (which might
data undefined, self = , self.read = >, readsize = 72, firstline undefined, builtin True = True

/usr/lib64/python2.4/codecs.py in read(self=, size=72, chars=-1, firstline=True)
   291 data = self.bytebuffer + newdata
   292 try:
   293 newchars, decodedbytes = self.decode(data, self.errors)
   294 except UnicodeDecodeError, exc:
   295 if firstline:
newchars = u'', decodedbytes = 0, self = , self.decode = , data = '\xe1intentado para ellos bastante sabios para discernir lo obvio. Tales perso', self.errors = 'strict'

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid data
  args = ('utf8', '\xe1 intentado para ellos bastante sabios para discernir lo obvio. Tales perso', 0, 3, 'invalid data')
  encoding = 'utf8'
  end = 3
  object = '\xe1 intentado para ellos bastante sabios para discernir lo obvio. Tales perso'
  reason = 'invalid data'
  start = 0

which is the letter á (a with accent).
So I tried with utf-16 and got this error:

   141 file = codecs.open(p + "2.txt", "r", "utf-16")
   142 for line in file:
   143   print line
   144
line = '\r\n', file =

/usr/lib64/python2.4/codecs.py in next(self=)
   492
   493 """ Return the next decoded line from the input stream."""
   494 return self.reader.next()
   495
   496 def __iter__(self):
self = , self.reader = , self.reader.next = >

/usr/lib64/python2.4/codecs.py in next(self=)
   429
   430 """ Return the next decoded line from the input stream."""
   431 line = self.readline()
   432 if line:
   433 return line
line undefined, self = , self.readline = >

/usr/lib64/python2.4/codecs.py in readline(self=, size=None, keepends=True)
   344 # If size is given, we call read() only once
   345 while True:
   346 data = self.read(readsize, firstline=True)
   347 if data:
   348 # If we're at a "\r" read one extra character (which might
data undefined, self = , self.read = >, readsize = 72, firstline undefined, builtin True = True

/usr/lib64/python2.4/codecs.py in read(self=, size=72, chars=-1, firstline=True)
   291 data = self.bytebuffer + newdata
   292 try:
   293 newchars, decodedbytes = self.decode(data, self.errors)
   294 except UnicodeDecodeError, exc:
   295 if firstline:
newchars undefined, decodedbytes undefined, self = , self.decode = >, data = '\r\nNoticia: Este sitio web entre este portal est\xe1 i', self.errors = 'strict'

/usr/lib64/python2.4/encodings/utf_16.py in decode(self=, input='\r\nNoticia: Este sitio web entre este portal est\xe1 i', errors='strict')
    47 self.decode = codecs.utf_16_be_decode
    48 elif consumed>=2:
    49 raise UnicodeError,"UTF-16 stream does not start with BOM"
    50 return (object, consumed)
    51
builtin UnicodeError =

UnicodeError: UTF-16 stream does not start with BOM
  args = ('UTF-16 stream does not start with BOM',)
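
A small sketch of why the UTF-8 attempt above fails on the byte 0xE1. In
UTF-8, 0xE1 is the lead byte of a three-byte sequence, so a following space
and letter are not valid continuation bytes (which is consistent with the
"position 0-2" in the error). The same byte decodes cleanly as ISO-8859-1.
The variable names are hypothetical, the sample is only the first few bytes
quoted in the traceback, and the b"" and u"" literals assume Python 2.6 or
later, or Python 3.3 or later.

```python
raw = b"\xe1 intentado"   # first bytes of the line quoted in the traceback

# Strict UTF-8 decoding rejects a lone 0xE1 followed by ASCII.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every byte to the code point with the same number,
# so 0xE1 becomes U+00E1 (a with acute accent).
as_latin1 = raw.decode("iso-8859-1")

# Re-encoded as UTF-8, the accented letter becomes the two bytes C3 A1.
as_utf8 = as_latin1.encode("utf-8")

print(utf8_ok)          # False
print(repr(as_utf8))
```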


Re: Spanish Accents

2011-12-22 Thread Stan Iverson
On Thu, Dec 22, 2011 at 12:42 PM, Peter Otten <__pete...@web.de> wrote:

> The file is probably encoded in ISO-8859-1, ISO-8859-15, or cp1252 then:
>
> >>> print "\xe1".decode("iso-8859-1")
> á
> >>> print "\xe1".decode("iso-8859-15")
> á
> >>> print "\xe1".decode("cp1252")
> á
>
> Try codecs.open() with one of these encodings.
>

I'm baffled. I duplicated your print statements, but when I run this code
(with any of the 3 encodings):

file = codecs.open(p + "2.txt", "r", "cp1252")
#file = codecs.open(p + "2.txt", "r", "utf-8")
for line in file:
  print line

I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 48: ordinal not in range(128)
  args = ('ascii', u'Noticia: Este sitio web entre este portal est...r\xe1pidamente va a salir de aqu\xed.\r\n', 48, 49, 'ordinal not in range(128)')
  encoding = 'ascii'
  end = 49
  object = u'Noticia: Este sitio web entre este portal est...r\xe1pidamente va a salir de aqu\xed.\r\n'
  reason = 'ordinal not in range(128)'
  start = 48

Please advise. TIA,
Stan
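
For context on the three encodings tried in this message, here is a hedged
sketch of where they agree and where they differ. All three map 0xE1 to á,
which is why any of them decodes this file; they disagree on a few other
bytes, such as the euro sign. The names CANDIDATES and decode_all are
hypothetical, and the literals assume Python 2.6 or later, or Python 3.3 or
later.

```python
CANDIDATES = ("iso-8859-1", "iso-8859-15", "cp1252")


def decode_all(raw):
    """Decode raw with each candidate encoding; return {name: text}."""
    return dict((name, raw.decode(name)) for name in CANDIDATES)


accent = decode_all(b"\xe1")    # U+00E1 in all three encodings
byte_a4 = decode_all(b"\xa4")   # euro sign only in ISO-8859-15
byte_80 = decode_all(b"\x80")   # euro sign only in cp1252

for name in CANDIDATES:
    print(name, repr(accent[name]), repr(byte_a4[name]), repr(byte_80[name]))
```

For text that contains only accented vowels, the choice among the three
makes no visible difference.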


Re: Spanish Accents

2011-12-22 Thread Stan Iverson
On Thu, Dec 22, 2011 at 2:17 PM, Peter Otten <__pete...@web.de> wrote:

> You are now one step further, you have successfully* decoded the file.
> The remaining step is to encode the resulting unicode lines back into
> bytes.
> The encoding implicitly used by the print statement is sys.stdout.encoding
> which is either "ascii" or None in your case. Try to encode explicitly to
> UTF-8 with
>
> f = codecs.open(p + "2.txt", "r", "iso-8859-1")
> for line in f:
>     print line.encode("utf-8")
>

OOEEE! Thanks!
Stan
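
The decode-then-encode round trip that resolved the thread can be sketched
end to end as follows. This is an illustration under stated assumptions,
not the code from the thread: it uses io.open (Python 2.6 or later, and
Python 3) in place of codecs.open, a temporary file stands in for
p + "2.txt", and the sample text and variable names are hypothetical.

```python
import io
import os
import tempfile

# Hypothetical stand-in for the thread's text file, written as ISO-8859-1.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "wb") as out:
    out.write(b"est\xe1 aqu\xed.\n")

# Step 1: decode while reading by naming the file's encoding.
# Step 2: encode each unicode line explicitly as UTF-8 for output.
encoded_lines = []
with io.open(path, "r", encoding="iso-8859-1") as f:
    for line in f:
        encoded_lines.append(line.encode("utf-8"))
os.remove(path)

# Encoding to ASCII is what fails when output falls back to 'ascii'.
try:
    u"\xe1".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(encoded_lines))
print(ascii_ok)  # False
```

Encoding explicitly avoids depending on sys.stdout.encoding, which the
reply above notes is either "ascii" or None in this setup.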