UnicodeEncodeError when not running script from IDE

2013-02-12 Thread Magnus Pettersson
I am using Eclipse to write my Python scripts, and when I run them from inside
Eclipse they work fine without errors.

But in almost every script that handles some form of special characters, like
Swedish åäö and Chinese characters etc., I get Unicode errors when running the
script externally with python.exe or pythonw.exe (although the scripts run
completely fine from within Eclipse: standard PyDev projects, Python 2.7). I
have usually launched the script GUI from within Eclipse because of this error,
but now I want to get to the bottom of this so I don't have to open Eclipse
every time I want to run a script!

Here is the error I get now when running the script with python.exe:
UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in
position 32: character maps to <undefined>

What can I do to fix this?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: UnicodeEncodeError when not running script from IDE

2013-02-12 Thread Magnus Pettersson
Ahh, so it's the actual printing that makes it error out outside of Eclipse,
because it is printing to a different terminal. It's the default DOS terminal
in Windows that runs when I run the script with python.exe, and I guess it's
the same when I run with pythonw.exe, just that the terminal window is not
opened, only the PyQt GUI in this case.

I will try to fix it now that I know what it is :)

I never thought about the terminal. Last time I had the same problem I just
played around for hours with unicode encode and decode and all that
not-so-fun stuff :)

Andrew Berg: Thanks, your crystal ball seems to be right :P

On Tuesday, February 12, 2013 12:43:00 PM UTC+1, Steven D'Aprano wrote:
> Magnus Pettersson wrote:
>
> > I am using Eclipse to write my Python scripts, and when I run them from
> > inside Eclipse they work fine without errors.
> >
> > But in almost every script that handles some form of special characters,
> > like Swedish åäö and Chinese characters etc.,
>
> A comment: they are not "special" characters. They're merely not American.
>
> > I get Unicode errors when
> > running the script externally with python.exe or pythonw.exe (although the
> > scripts run completely fine from within Eclipse: standard PyDev projects,
> > Python 2.7). I have usually launched the script GUI from within Eclipse
> > because of this error, but now I want to get to the bottom of this so I
> > don't have to open Eclipse every time I want to run a script!
> >
> > Here is the error I get now when running the script with python.exe:
> > UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in
> > position 32: character maps to <undefined>
>
> Please show the *complete* traceback, including the line of code that causes
> the exception.
>
> > What can I do to fix this?
>
> My guess is that you are trying to print a character which your terminal
> cannot display. My terminal is set to use UTF-8, and so it can display it
> fine:
>
> py> c = u'\u898b'
> py> print(c)
> 見
>
> (or at least it would display fine if the font used had a glyph for that
> character). Why there are still terminals in the world that don't default
> to UTF-8 is beyond me.
>
> If I manually change the terminal's encoding to Western European ISO 8859-1,
> I get some mojibake:
>
> py> print(c)
> è¦
>
> I can't replicate the exception you give, so I assume it is specific to
> Windows.
>
> -- 
> Steven
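
The guess above can be checked without a Windows console. This is only a sketch: it assumes the console uses a legacy code page such as cp437 or cp1252 (the thread does not say which one), and the helper name `can_encode` is made up for illustration. It runs on Python 2.7 and 3.

```python
from __future__ import print_function


def can_encode(text, codec):
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


kanji = u"\u898b"  # the character named in the error message

# Legacy single-byte code pages have no slot for this character, which is
# what "character maps to <undefined>" complains about.
results = {codec: can_encode(kanji, codec)
           for codec in ("cp437", "cp1252", "utf-8")}
print(results)

# Mojibake: the UTF-8 bytes of the character, read back as ISO 8859-1.
utf8_bytes = kanji.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
```

The third mojibake character is a control code (U+008B), which is why only two visible characters show up in the terminal output quoted above.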



Re: UnicodeEncodeError when not running script from IDE

2013-02-12 Thread Magnus Pettersson
I have now tried removing the printing to the terminal and keeping only the
writing of a .txt file to disk (which is the script's purpose):

with open(filepath,"a") as f:
    for card in cardlist:
        f.write(card+"\n")

The file it writes to exists and I'm just appending to it. When I run the
script through Eclipse, all is fine. When I run it in the terminal I get this
error instead:

  File "K:\dev\python\webscraping\kanji_anki.py", line 69, in savefile
    f.write(card+"\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u898b' in position
32: ordinal not in range(128)

On Tuesday, February 12, 2013 12:01:19 PM UTC+1, Andrew Berg wrote:
> On 2013.02.12 04:43, Magnus Pettersson wrote:
> > I am using Eclipse to write my Python scripts, and when I run them from
> > inside Eclipse they work fine without errors.
> >
> > But in almost every script that handles some form of special characters,
> > like Swedish åäö and Chinese characters etc., I get Unicode errors when
> > running the script externally with python.exe or pythonw.exe (although the
> > scripts run completely fine from within Eclipse: standard PyDev projects,
> > Python 2.7). I have usually launched the script GUI from within Eclipse
> > because of this error, but now I want to get to the bottom of this so I
> > don't have to open Eclipse every time I want to run a script!
> >
> > Here is the error I get now when running the script with python.exe:
> > UnicodeEncodeError: 'charmap' codec can't encode character u'\u898b' in
> > position 32: character maps to <undefined>
> >
> > What can I do to fix this?
>
> Since you didn't say what code actually does this, I'll turn to my
> crystal ball. It says you are trying to print characters to a terminal
> that doesn't support them. If that is the case, you could try changing
> the code page (but only 3.3 supports cp65001, so that probably won't
> help) or use replacement characters when printing.
>
> -- 
> CPython 3.3.0 | Windows NT 6.2.9200.16461 / FreeBSD 9.1-RELEASE
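
The "replacement characters" suggestion could look roughly like the sketch below. The function name `printable` and the fallback to ASCII when the stream reports no encoding are assumptions made for illustration, not something from the thread. It runs on Python 2.7 and 3.

```python
from __future__ import print_function
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # sys.stdout may be missing or report no encoding (for example when
    # output is piped); falling back to ASCII is an assumption of this sketch.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Encode with the "replace" error handler, then decode again so the
    # caller still gets a text string back.
    return text.encode(encoding, "replace").decode(encoding)


print(printable(u"kanji: \u898b"))
```

This trades correctness of the output for not crashing: the kanji is lost on a terminal that cannot show it, but the script keeps running.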


Re: UnicodeEncodeError when not running script from IDE

2013-02-12 Thread Magnus Pettersson
> Are you sure you are writing the same data? That would mean that pydev
> changes the default encoding -- which is evil.
>
> A portable approach would be to use codecs.open() or io.open() instead of
> the built-in:
>
> import io
> with io.open(filepath, "a") as f:
>     ...
>
> io.open() uses the locale's preferred encoding by default, so it is safer
> to specify one with io.open(filepath, mode, encoding=whatever).


Interesting. PyDev must be doing something behind the scenes, because when I
changed open() to io.open() I now get an error inside Eclipse:

    f.write(card+"\n")
  File "C:\python27\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character u'\u53c8' in
position 32: character maps to <undefined>

If I specify the encoding explicitly:

with io.open(filepath, "a", encoding="UTF-8") as f:

then it works in Eclipse. But I seem to have encoding problems all over the
place: things work in Eclipse but don't work outside of Eclipse PyDev.
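
The cp1252 traceback fits io.open() taking its default encoding from the locale rather than always using UTF-8. A minimal sketch of writing and reading with an explicit encoding follows; the helper `append_lines` and the temporary file path are hypothetical, chosen only for the demonstration. It runs on Python 2.7 and 3.

```python
import io
import locale
import os
import tempfile

# What the locale reports as its preferred encoding; on a Western European
# Windows install this is typically cp1252.
default_encoding = locale.getpreferredencoding(False)


def append_lines(path, lines, encoding="utf-8"):
    """Append unicode lines to path, one per line, with an explicit encoding."""
    with io.open(path, "a", encoding=encoding) as f:
        for line in lines:
            f.write(line + u"\n")


# Round trip through a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "cards.txt")
append_lines(path, [u"\u898b", u"\u53c8"])
with io.open(path, "r", encoding="utf-8") as f:
    cards = f.read().splitlines()
```

Naming the encoding at both ends makes the result independent of where the script is launched from.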

Here is the flow of my data. I'm terrible at using unicode/encode/decode, so I
could use some help here:

kanji_anki_gui.py:

def on_addButton_clicked(self):
    #code
    # self.kanji.text() comes from a kanji letter written into a pyqt4 QLineEdit
    kanji = unicode(self.kanji.text())
    card = kanji_anki.scrapeKanji(kanji,tags)
    #more code

kanji_anki.py:

def scrapeKanji(kanji, tags="", onlymeaning=False):
    baseurl = unicode("http://www.romajidesu.com/kanji/")
    url = unicode(baseurl+kanji)
    #test to write out url to disk, works outside of eclipse now
    savefile([url])

    #getting webpage works fine in eclipse, prints "Oh no..." in terminal
    try:
        page = urllib2.urlopen(url)
    except:
        print "OH no website dont work"
        return None

    #Code that does some scraping and returns a string containing kanji letters
    return card

def savefile(cardlist,filepath="D:/iknow_kanji.txt"):
    with io.open(filepath, "a") as f:
        for card in cardlist:
            f.write(card+"\n")
    return True


Re: UnicodeEncodeError when not running script from IDE

2013-02-12 Thread Magnus Pettersson
> What encoding is this file?  Since you're appending to it, you really
> need to match the pre-existing encoding, or the next program to deal
> with it is in big trouble.  So using the io.open() without the encoding=
> keyword is probably a mistake.

The .txt file is in UTF-8.

I have got it to work now in the terminal, but I don't understand what I'm
doing, or why I didn't need all the unicode strings and encode mumbo jumbo in
Eclipse.

# Here kanji = u"私"
baseurl = u"http://www.romajidesu.com/kanji/"
url = baseurl+kanji
# This test works now. savefile uses:
#     with io.open(filepath, "a", encoding="UTF-8") as f:
savefile([url])
# This made the fetching of the website work. Why did I have to write
# url.encode("UTF-8") when url already is unicode? I feel I don't have a good
# understanding of this.
page = urllib2.urlopen(url.encode("UTF-8"))
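
A URL travels over the network as bytes, so a unicode URL has to be turned into bytes at some point. One common way to make that step explicit is to percent-encode the UTF-8 bytes of the non-ASCII part, sketched below. It assumes the site accepts UTF-8 paths (the working url.encode("UTF-8") above suggests it does); the function name `kanji_url` is made up. It runs on Python 2.7 and 3.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2


def kanji_url(kanji, baseurl=u"http://www.romajidesu.com/kanji/"):
    """Build an ASCII-only URL by percent-encoding the UTF-8 bytes of kanji."""
    return baseurl + quote(kanji.encode("utf-8"))


url = kanji_url(u"\u898b")
print(url)
```

The resulting string contains only ASCII characters, so it can be passed to urlopen() without any implicit encoding taking place.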





Re: UnicodeEncodeError when not running script from IDE

2013-02-12 Thread Magnus Pettersson

> You don't show the code that actually does the io.open(), nor the
> url.encode, so I'm not going to guess what you're actually doing.

Hmm, I'm not sure what you mean, but I posted all the relevant code in a
previous message, so maybe you missed that one :)
In short, I basically just have:

import io
with io.open(myfile, "a", encoding="UTF-8") as f:
    f.write(my_ustring_with_kanji)

The url.encode() is the built-in .encode() method called on my unicode string
variable named "url". That was the thing I wondered why I needed to use, which
you explained very well, thank you!

Just one more question, since all this is still a little fuzzy in my head.

When do I need to use .decode() in my code? Is it when I read lines from, for
example, a UTF-8 file? And why didn't I have to use .encode() on my unicode
string when running from within Eclipse PyDev? Someone wrote that it has a
default codec setting, so maybe that handles it for me there. That is kind of
dangerous, since my programs won't work outside of Eclipse: I didn't do any
encoding or use unicode strings before in my script, and it still worked.

--Magnus
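
On the .decode() question: it applies wherever bytes come into the program, for example a file opened in binary mode. The sketch below shows the two usual options, decoding by hand or letting io.open() do it. The temporary file and the variable names are for illustration only. It runs on Python 2.7 and 3.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "kanji.txt")  # throwaway demo file

# Output boundary: unicode text is encoded into bytes before writing.
with open(path, "wb") as f:
    f.write(u"\u898b\n".encode("utf-8"))

# Input boundary, done by hand: read raw bytes, then .decode() them.
with open(path, "rb") as f:
    raw = f.read()
decoded_by_hand = raw.decode("utf-8")

# Input boundary, done by the file object: io.open() decodes while reading.
with io.open(path, "r", encoding="utf-8") as f:
    decoded_by_io = f.read()
```

Inside the program everything stays unicode; bytes only appear at the edges where data is read or written.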


Re: UnicodeEncodeError when not running script from IDE

2013-02-12 Thread Magnus Pettersson
Thanks a lot Steven, you gave me a good AHA experience! :)

Now I understand why I had to use encoding when calling urllib2! So
basically Eclipse PyDev does this in the background for me, and its console
supports UTF-8, so that's why I never had to think about it before (and why
some scripts tend to fail with Unicode errors when run outside of the Eclipse
IDE).

cheers
Magnus

> Start here:
>
> "The Absolute Minimum Every Software Developer Absolutely, Positively Must
> Know About Unicode and Character Sets (No Excuses!)"
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> Basically, Unicode is an in-memory data format. Python knows about Unicode
> characters (to be technical: code points), but files on disk do not.
> Neither do network protocols, or terminals, or other simple devices. They
> only understand bytes.
>
> So when you have Unicode text, and you want to write it to a file on disk,
> or print it, or send it over the network to another machine, it has to be
> *encoded* into bytes, and then *decoded* back into Unicode when you read it
> from the file again. Sometimes the system will "helpfully" do that encoding
> and decoding automatically for you, which is fine when it works but when it
> doesn't it can be perplexing.
>
> There are many, many, many different *encoding schemes*. ASCII is one. UTF-8
> is another. And then there are about a bazillion legacy encodings which, if
> you are lucky, you will never need to care about. Only some encodings can
> deal with the entire range of Unicode characters, most can only deal with a
> (typically small) subset of possible characters. E.g. ASCII only knows
> about 128 characters out of the million-plus that Unicode deals with.
> Latin-1 can handle close to 256 different characters. If you have a say in
> the matter, always use UTF-8, since it can handle the full set of Unicode
> characters in the most efficient manner.
>
> -- 
> Steven
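
The encode/decode round trip described above can be tried directly. The sketch below uses a sample string and a helper name, `first_unencodable`, that are not from the thread. It runs on Python 2.7 and 3.

```python
# -*- coding: utf-8 -*-
text = u"\xe5\xe4\xf6 \u898b"  # Swedish åäö, a space, and the kanji 見

# Encodings that cover all of Unicode give different byte sequences for the
# same text, and each one decodes back to the original.
sizes = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(codec)
    assert data.decode(codec) == text
    sizes[codec] = len(data)


def first_unencodable(text, codec):
    """Return the first character codec cannot encode, or None."""
    try:
        text.encode(codec)
    except UnicodeEncodeError as exc:
        return text[exc.start]
    return None
```

For this string, ASCII already fails on the first Swedish letter, Latin-1 gets as far as the kanji, and UTF-8 encodes everything.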
