Max M wrote:
A smiple way to try out different encodings in a given order:
The loop is fine - although ('UTF-8', 'Latin-1', 'ASCII') is
somewhat redundant. The 'ASCII' case is never considered, since
Latin-1 effectively works as a catch-all encoding (as all byte
sequences can be considered Latin-1
Christian Ergh wrote:
Once more, indention should be correct now, and the 128 is gone too. So,
something like this?
Yes, something like this. The tricky part is of, course, then the
fragments which you didn't implement.
Also, it might be possible to do this in a for loop, e.g.
for encoding in (pag
Forgot a part... You need the encoding list:
encodings = [
'utf-8',
'latin-1',
'ascii',
'cp1252',
]
Christian Ergh wrote:
Dylan wrote:
Here's what I'm trying to do:
- scrape some html content from various sources
The issue I'm running to:
- some of the sources have incorrectly e
Dylan wrote:
Here's what I'm trying to do:
- scrape some html content from various sources
The issue I'm running to:
- some of the sources have incorrectly encoded characters... for
example, cp1252 curly quotes that were likely the result of the author
copying and pasting content from Word
Finally:
- snip -
def get_encoded(st, encodings):
"Returns an encoding that doesn't fail"
for encoding in encodings:
try:
st_encoded = st.decode(encoding)
return st_encoded, encoding
except UnicodeError:
pass
-snip-
This works fine, but after this
Christian Ergh wrote:
A smiple way to try out different encodings in a given order:
# -*- coding: latin-1 -*-
def get_encoded(st, encodings):
"Returns an encoding that doesn't fail"
for encoding in encodings:
try:
st_encoded = st.decode(encoding)
return st_en
Once more, indention should be correct now, and the 128 is gone too. So,
something like this?
Chris
import urllib2
url = 'www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = '???'
xmlencoding = 'whatever
Peter Otten wrote:
Steven Bethard wrote:
Christian Ergh wrote:
flag = true
for char in data:
if 127 < ord(char) < 128:
flag = false
if flag:
try:
data = data.encode('latin-1')
except:
pass
A little OT, but (assuming I got your indentation right[1]) this kind of
loop i
Martin v. Löwis wrote:
Dylan wrote:
Things I have tried include encode()/decode()
This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then
htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
will give you a file that contains only ASCII charact
Steven Bethard wrote:
> Christian Ergh wrote:
>> flag = true
>> for char in data:
>> if 127 < ord(char) < 128:
>> flag = false
>> if flag:
>> try:
>> data = data.encode('latin-1')
>> except:
>> pass
>
> A little OT, but (assuming I got your indentation right[1]
Christian Ergh wrote:
flag = true
for char in data:
if 127 < ord(char) < 128:
flag = false
if flag:
try:
data = data.encode('latin-1')
except:
pass
A little OT, but (assuming I got your indentation right[1]) this kind of
loop is exactly what the else clause of a
Christian Ergh wrote:
- it works with the characters i mentioned
It does.
- what encoding do you have in the end
US-ASCII
- and how exactly are you doing all this? All with somestring.decode()
or... Can you please give an example for these 7 steps?
I could, but I don't have the time - just try to
Martin v. Löwis wrote:
Dylan wrote:
Things I have tried include encode()/decode()
This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then
htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
will give you a file that contains only ASCII charact
Dylan wrote:
Things I have tried include encode()/decode()
This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then
htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
will give you a file that contains only ASCII characters, and
character refer
14 matches
Mail list logo