Miki Tebeka wrote:
Hello Joe,
Is there any library to convert an HTML page with \u encoded text to a
native character set, e.g. BIG5?
Try: help("".decode)
I use HTMLFilter.py; you can download it at
http://www.shearersoftware.com/software/developers/htmlfilter/
Cheers
Chris
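If the \u sequences are literal backslash escapes in the page text (an assumption; the sample string below is made up), the str codecs that help("".decode) points at can do the conversion directly:

raw = 'Title: \\u4e2d\\u6587'          # made-up sample with literal \u escapes
text = raw.decode('unicode_escape')     # turn \uXXXX into real unicode characters
print text.encode('big5')               # re-encode in the native character set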
DogWalker wrote:
"Marc 'BlackJack' Rintsch" <[EMAIL PROTECTED]> said:
In <[EMAIL PROTECTED]>, Amir Dekel wrote:
When I import a module I have written and then find bugs, it seems that
I can't import it again after I fix it. It always shows the same
problem. I tried del module but it doesn't work
in Python.
Petr
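The usual fix in Python 2 is the built-in reload(); a minimal sketch (mymodule is a placeholder name):

import mymodule              # first import; placeholder module name

# ... edit mymodule.py to fix the bug ...

reload(mymodule)             # re-executes the module; a second "import" is a no-op

Note that objects created before the reload keep pointing at the old code.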
"Christian Ergh" wrote...
Hmm, I never liked the i++ syntax, because there is a value assignment
behind it and it does not show - except in the case you are already used to it.
>>> i = 1
>>> i += 1
>>> i
2
I like this one better, because you see the assignment at once, it is
easy to read and intuitive usability is given - in m
Forgot a part... You need the encoding list:
encodings = [
    'ascii',
    'utf-8',
    'cp1252',
    'latin-1',    # keep latin-1 last: it never raises UnicodeError
    ]
Christian Ergh wrote:
Dylan wrote:
Here's what I'm trying to do:
- scrape some html content from various sources
The issue I'm running into:
- some of the sources have incorrectly encoded characters... for
example, cp1252 curly quotes that were likely the result of the author
copying and pasting content from Word
Finally:
- snip -
def get_encoded(st, encodings):
    "Return (decoded string, encoding) for the first encoding that doesn't fail."
    for encoding in encodings:
        try:
            st_encoded = st.decode(encoding)
            return st_encoded, encoding
        except UnicodeError:
            pass                  # try the next candidate encoding
-snip-
This works fine, but after this
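A usage sketch (not from the thread) combining get_encoded with the encodings list above (latin-1 kept last, since it never fails); the byte string is a made-up cp1252 sample:

encodings = ['ascii', 'utf-8', 'cp1252', 'latin-1']

raw = '\x93Hello\x94'                   # made-up sample: cp1252 curly quotes
text, used = get_encoded(raw, encodings)   # always succeeds: latin-1 never fails
print used                              # -> cp1252
print repr(text)                        # -> u'\u201cHello\u201d'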
Once more, indentation should be correct now, and the 128 is gone too. So,
something like this?
Chris
import urllib2
url = 'http://www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the page code, how do I get the encoding of the page?
pageencoding = '???'
xmlencoding = 'whatever
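One way to answer that question (a sketch, not from the thread; the URL is a placeholder): check the HTTP Content-Type header first, then fall back to the meta tag inside the page itself.

import re
import urllib2

f = urllib2.urlopen('http://www.someurl.com')      # placeholder URL
data = f.read()

# 1. the Content-Type header, e.g. "text/html; charset=iso-8859-1"
pageencoding = f.info().getparam('charset')

# 2. otherwise look for a <meta ... charset=...> declaration in the HTML
if not pageencoding:
    m = re.search(r'charset=["\']?([-\w]+)', data)
    if m:
        pageencoding = m.group(1)

print pageencoding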
Peter Otten wrote:
Steven Bethard wrote:
Christian Ergh wrote:
flag = true
for char in data:
    if 127 < ord(char) < 128:
        flag = false
if flag:
    try:
        data = data.encode('latin-1')
    except:
        pass
A little OT, but (assuming I got your indentation right[1]) t
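A side note on that test: 127 < ord(char) < 128 can never be true, since no integer lies strictly between 127 and 128, and true/false need capitalizing. A sketch of the presumed intent (my reading, not from the thread), flagging any non-ASCII byte:

flag = True                        # True while every byte is plain ASCII
for char in data:
    if ord(char) > 127:            # a byte outside the ASCII range
        flag = False
        break
if flag:
    data = data.decode('ascii')    # cannot fail for pure-ASCII data

If flag ends up False, the encoding has to be guessed instead, e.g. with get_encoded() above.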
Martin v. Löwis wrote:
Dylan wrote:
Things I have tried include encode()/decode()
This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then
htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
will give you a file that contains only ASCII characters.
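To make that concrete (the byte string below is a made-up cp1252 sample, not from the thread):

# 0x93 and 0x94 are the cp1252 left/right curly double quotes
htmlstring = '\x93Hello\x94'
print htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
# prints: &#8220;Hello&#8221;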