Re: convert \uXXXX to native character set?

2004-12-21 Thread Christian Ergh
Miki Tebeka wrote: Hello Joe, Is there any library to convert HTML page with \u encoded text to native character set, e.g. BIG5. Try: help("".decode) I use HTMLFilter.py, you can download it at http://www.shearersoftware.com/software/developers/htmlfilter/ Cheers Chris
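A minimal sketch of the conversion in present-day Python 3 (the thread itself used `"".decode` on Python 2 strings). The helper name `unescape_u` and the sample page text are hypothetical, and only literal `\uXXXX` escapes with exactly four hex digits are handled; characters written as surrogate pairs are not combined.

```python
import re

# Matches a literal backslash-u followed by exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unescape_u(text):
    """Replace each literal \\uXXXX escape in text with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

page = r"<p>\u4e2d\u6587</p>"        # hypothetical page content containing escapes
decoded = unescape_u(page)            # str holding the real characters
big5_bytes = decoded.encode("big5")   # the native character set asked about
```

A regular expression is used here instead of the `unicode_escape` codec so that any non-ASCII text already present in the page is left untouched.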

Re: A beginner's problem...

2004-12-16 Thread Christian Ergh
DogWalker wrote: "Marc 'BlackJack' Rintsch" <[EMAIL PROTECTED]> said: In <[EMAIL PROTECTED]>, Amir Dekel wrote: When I import a module I have written, and then I find bugs, it seems that I can't import it again after I fix it. It always shows the same problem. I try del module but it doesn't work
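The behaviour described can be reproduced with a small experiment. This is a sketch for Python 3, where the reload function lives in `importlib` (in Python 2 it was the builtin `reload`); the module name `demo_mod` and its contents are hypothetical. `del module` only removes the local name, while the module object stays cached in `sys.modules`, which is why a second `import` returns the old code.

```python
import importlib
import os
import sys
import tempfile

# Do not write .pyc files, so the demo cannot pick up stale bytecode.
sys.dont_write_bytecode = True

# Create a throwaway module on disk (the name "demo_mod" is hypothetical).
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "demo_mod.py")
with open(path, "w") as fh:
    fh.write("def answer():\n    return 'buggy'\n")

sys.path.insert(0, workdir)
importlib.invalidate_caches()
import demo_mod
before = demo_mod.answer()

# "Fix the bug" by editing the source file.
with open(path, "w") as fh:
    fh.write("def answer():\n    return 'fixed'\n")

# A second import statement reuses the cached module object ...
import demo_mod
still_cached = demo_mod.answer()

# ... while importlib.reload re-executes the source in that module.
demo_mod = importlib.reload(demo_mod)
after = demo_mod.answer()
```

Names bound earlier with `from demo_mod import answer` would still point at the old function after a reload, so restarting the interpreter remains the most reliable option.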

Re: Suggestion for "syntax error": ++i, --i

2004-12-13 Thread Christian Ergh
in Python. Petr "Christian Ergh" wrote... Hmm, I never liked the i++ syntax, because there is a value assignment behind it and it does not show - except the case you are already used to it. >>> i = 1 >>> i += 1 >>> i 2 I like this one better, because you see

Re: Suggestion for "syntax error": ++i, --i

2004-12-13 Thread Christian Ergh
Hmm, I never liked the i++ syntax, because there is a value assignment behind it and it does not show - except the case you are already used to it. >>> i = 1 >>> i += 1 >>> i 2 I like this one better, because you see the assignment at once, it is easy to read and intuitive usability is given - in m
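For context on the thread title: `++i` and `--i` are legal Python, which is why they produce no syntax error, but they are parsed as two unary operators and change nothing. A short sketch (variable names are illustrative only):

```python
i = 1
j = ++i        # parsed as +(+i): two unary plus operators, no increment
k = --i        # parsed as -(-i): double negation, value is unchanged
unchanged = i  # i has not been modified by either expression

i += 1         # the explicit augmented assignment shown in the post
```

After the last line `i` is 2, while `j`, `k`, and `unchanged` all hold the original value 1.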

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Forgot a part... You need the encoding list: encodings = ['utf-8', 'latin-1', 'ascii', 'cp1252'] Christian Ergh wrote: Dylan wrote: Here's what I'm trying to do: - scrape some html content from various sources The issue I'm
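One property of this list is worth checking before relying on it: `latin-1` maps every possible byte to a character, so decoding with it never fails and any encoding listed after it is never tried. The sketch below demonstrates that in Python 3; the helper `can_decode` and the `reordered` list are assumptions for illustration, not part of the original post.

```python
encodings = ['utf-8', 'latin-1', 'ascii', 'cp1252']

def can_decode(data, encoding):
    """Return True if data decodes under encoding without raising."""
    try:
        data.decode(encoding)
    except UnicodeError:
        return False
    return True

all_bytes = bytes(range(256))

latin1_ok = can_decode(all_bytes, 'latin-1')  # latin-1 accepts every byte value
cp1252_ok = can_decode(all_bytes, 'cp1252')   # cp1252 leaves a few bytes undefined

# One possible reordering: strictest codecs first, the catch-all last.
reordered = ['ascii', 'utf-8', 'cp1252', 'latin-1']
```

With the original order, a cp1252 page that is not valid UTF-8 would be reported as `latin-1`, and its curly quotes would come out as control characters.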

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Dylan wrote: Here's what I'm trying to do: - scrape some html content from various sources The issue I'm running into: - some of the sources have incorrectly encoded characters... for example, cp1252 curly quotes that were likely the result of the author copying and pasting content from Word Finally:

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
- snip -
def get_encoded(st, encodings):
    "Returns an encoding that doesn't fail"
    for encoding in encodings:
        try:
            st_encoded = st.decode(encoding)
            return st_encoded, encoding
        except UnicodeError:
            pass
- snip -
This works fine, but after this
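A Python 3 adaptation of the function above, as a sketch: in Python 3 only `bytes` objects have `.decode`, so the input is assumed to be raw bytes. The explicit `return None` and the sample data are additions for illustration.

```python
def get_encoded(data, encodings):
    """Return (text, encoding) for the first encoding that decodes data, else None."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeError:
            continue
    return None

# Hypothetical input: "café" stored as cp1252 bytes, which is not valid UTF-8.
raw = "caf\u00e9".encode("cp1252")
result = get_encoded(raw, ["utf-8", "cp1252"])
```

Here `result` is the decoded text paired with `"cp1252"`, because the UTF-8 attempt raises `UnicodeDecodeError` and the loop moves on.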

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Once more, indentation should be correct now, and the 128 is gone too. So, something like this? Chris
import urllib2
url = 'http://www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = '???'
xmlencoding = 'whatever
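On the question in the comment: the declared encoding usually comes from the HTTP `Content-Type` header, with a `<meta>` tag in the page as a fallback. The following is a sketch for Python 3 under that assumption; the function name `guess_page_encoding`, the regular expression, and the `utf-8` default are all hypothetical choices, and a declared encoding can still be wrong.

```python
import re
from email.message import Message

# Rough pattern for a charset declared inside a <meta> tag (an approximation).
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def guess_page_encoding(content_type, body, default="utf-8"):
    """Pick an encoding from the Content-Type header, then a <meta> tag, then a default."""
    msg = Message()
    msg["Content-Type"] = content_type or "application/octet-stream"
    charset = msg.get_content_charset()   # lower-cased charset parameter, or None
    if charset:
        return charset
    match = _META_CHARSET.search(body[:4096])  # only scan the start of the page
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

With `urllib.request` in Python 3, the header value would typically be read from the response as `response.headers.get("Content-Type")` and passed in together with the bytes from `response.read()`.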

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Peter Otten wrote: Steven Bethard wrote: Christian Ergh wrote: flag = true for char in data: if 127 < ord(char) < 128: flag = false if flag: try: data = data.encode('latin-1') except: pass A little OT, but (assuming I got your indentation right[1]) t
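Independently of the indentation question, the condition `127 < ord(char) < 128` in the quoted code can never be true, since no integer lies strictly between 127 and 128. A sketch of a check that does detect non-ASCII bytes in Python 3 follows; the helper name `has_non_ascii` is hypothetical.

```python
# The chained comparison from the quoted code is false for every byte value.
never_true = any(127 < code < 128 for code in range(256))

def has_non_ascii(data):
    """Return True if any byte in data is outside the 7-bit ASCII range."""
    return any(byte > 127 for byte in data)

plain = has_non_ascii(b"hello")
accented = has_non_ascii("h\u00e9llo".encode("latin-1"))
```

Iterating over a `bytes` object in Python 3 yields integers directly, so no `ord()` call is needed.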

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Martin v. Löwis wrote: Dylan wrote: Things I have tried include encode()/decode() This should work. If you somehow manage to guess the encoding, e.g. guess it as cp1252, then htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace") will give you a file that contains only ASCII charact
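The decode/encode chain suggested above can be tried on a small sample. This sketch uses Python 3, where the input must be `bytes`; the sample data is made up, and the encoding is assumed to have been guessed correctly as cp1252.

```python
raw = b"\x93Hello\x94"       # cp1252 curly quotes around a word
text = raw.decode("cp1252")  # gives U+201C and U+201D around "Hello"

# Characters that ASCII cannot represent become numeric character references.
ascii_html = text.encode("us-ascii", "xmlcharrefreplace")
```

The result is `b"&#8220;Hello&#8221;"`, which a browser renders with the intended curly quotes regardless of the page's declared encoding.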

Re: character encoding conversion

2004-12-12 Thread Christian Ergh
Martin v. Löwis wrote: Dylan wrote: Things I have tried include encode()/decode() This should work. If you somehow manage to guess the encoding, e.g. guess it as cp1252, then htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace") will give you a file that contains only ASCII charact