On 01/07/2015 08:38 AM, Jacob Kruger wrote:
Thanks.
Makes more sense now, and yes, using 2.7 here.
Unfortunately, while I could pass the binary values into blob fields well
enough using parameterised statements, the actual generation
of SQL script text files is a step they want to work
On 01/07/2015 08:32 AM, Jacob Kruger wrote:
Thanks.
Please don't top-post. Put your responses after each quoted part you're
responding to. And if there are parts you're not responding to, please
delete them.
Issue with knowing encoding could just be that am pretty sure at least
some of the
-
From: "Dave Angel"
To:
Sent: Wednesday, January 07, 2015 2:22 PM
Subject: Re: String character encoding when converting data from one
type/format to another
On 01/07/2015 06:04 AM, Jacob Kruger wrote:
I'm busy using something like pyodbc to pull data out of MS access .mdb
f
- Original Message -
From: "Peter Otten" <__pete...@web.de>
To:
Sent: Wednesday, January 07, 2015 2:11 PM
Subject: Re: String character encoding when converting data from one
type/format to another
Jacob Kruger wrote:
I'm busy using something like pyodbc to pull d
Wilco wants to welcome you...to the space janitor's closet..."
- Original Message -
From: "Ned Batchelder"
To:
Sent: Wednesday, January 07, 2015 2:02 PM
Subject: Re: String character encoding when converting data from one
type/format to another
On 1/7/15 6:0
On Wed, Jan 7, 2015 at 11:02 PM, Ned Batchelder wrote:
>> Any thoughts on a sort of generic method/means to handle any/all
>> characters that might be out of range when having pulled them out of
>> something like these MS access databases?
>
>
> The best thing is to know what encoding was used to
On 01/07/2015 06:04 AM, Jacob Kruger wrote:
I'm busy using something like pyodbc to pull data out of MS access .mdb files,
and then generate .sql script files to execute
against MySQL databases using MySQLdb module, but, issue is forms of
characters in string values that don't fit inside
Jacob Kruger wrote:
> I'm busy using something like pyodbc to pull data out of MS access .mdb
> files, and then generate .sql script files to execute against MySQL
> databases using MySQLdb module, but, issue is forms of characters in
> string values that don't fit inside the 0-127 range - current
On 1/7/15 6:04 AM, Jacob Kruger wrote:
I'm busy using something like pyodbc to pull data out of MS access .mdb
files, and then generate .sql script files to execute against MySQL
databases using MySQLdb module, but, issue is forms of characters in
string values that don't fit inside the 0-127 ran
I'm busy using something like pyodbc to pull data out of MS access .mdb files,
and then generate .sql script files to execute against MySQL databases using
MySQLdb module, but, issue is forms of characters in string values that don't
fit inside the 0-127 range - current one seems to be something
tml_list[-1] if charset_from_html_list
else ''
return charset_from_html if charset_from_html else charset_from_header
> Date: Sun, 9 Jun 2013 04:47:02 -0700
> Subject: Re: how to detect the character encoding in a web page ?
> From: redstone-c...@163.com
> To: python-l
On Monday, 24 December 2012 at 08:34:47 UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
Here is one thread that helped me understand my code:
http://stackoverflow.com/questions/17001407/how-to-detect-the-ch
On Monday, 24 December 2012 at 08:34:47 UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
Finally, I found that by using PyQt’s QTextStream, QTextCodec and chardet, we can
get a web page's code more securely
even f
On Thu, Jun 6, 2013 at 4:22 PM, Nobody wrote:
> On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:
>
>> The HTTP header is completely out of band. This is the best way to
>> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
>> parsing. Once you find a meta tag, you
On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:
> The HTTP header is completely out of band. This is the best way to
> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
> parsing. Once you find a meta tag, you stop parsing and go back to the
> top, decoding in th
On Thu, Jun 6, 2013 at 1:14 AM, iMath wrote:
> On Monday, 24 December 2012 at 08:34:47 UTC+8, iMath wrote:
>> how to detect the character encoding in a web page ?
>>
>> such as this page
>>
>>
>>
>> http://python.org/
>
> by the way ,we cannot get character enco
On Monday, 24 December 2012 at 08:34:47 UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
By the way, we cannot get the character encoding programmatically from the meta
data without knowing the character encod
On Monday, 24 December 2012 at 08:34:47 UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
I found that PyQt’s QTextStream can very accurately detect the character encoding
in a web page,
even for this bad page
ht
On Monday, 24 December 2012 at 08:34:47 UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
I found that PyQt’s QTextStream can very accurately detect the character encoding
in a web page,
even for this bad page
chard
In article ,
Roy Smith wrote:
>In article ,
> Alister wrote:
>
>> Indeed due to the poor quality of most websites it is not possible to be
>> 100% accurate for all sites.
>>
>> personally I would start by checking the doc type & then the meta data as
>> these should be quick & correct, I then us
On Monday, 24 December 2012 at 08:34:47 UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
Up to now, chardet may be the only way to let Python do it automatically.
--
http://mail.python.org/mailman/listinfo/python-list
On Monday, 24 December 2012 at 08:34:47 UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
First, install chardet:
import chardet
# fetch the page's HTML
html_1 = urllib2.urlopen(line,timeout=120).read()
#print html_1
mychar=
In article ,
Alister wrote:
> Indeed due to the poor quality of most websites it is not possible to be
> 100% accurate for all sites.
>
> personally I would start by checking the doc type & then the meta data as
> these should be quick & correct, I then use chardetect only if these
> fail t
On Mon, 24 Dec 2012 13:50:39 +, Steven D'Aprano wrote:
> On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:
>
>> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
>> wrote:
>>> $ wget -q -O - http://python.org/ | chardetect.py stdin: ISO-8859-2
>>> with confidence 0.803579722043 $
>>
>> And it
On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:
> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
> wrote:
>> $ wget -q -O - http://python.org/ | chardetect.py stdin: ISO-8859-2
>> with confidence 0.803579722043 $
>
> And it sucks, because it uses magic, and not reading the HTML tags. The
> RI
On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
wrote:
> $ wget -q -O - http://python.org/ | chardetect.py
> stdin: ISO-8859-2 with confidence 0.803579722043
> $
And it sucks, because it uses magic, and not reading the HTML tags.
The RIGHT thing to do for websites is detect the meta charset
definit
On 24.12.2012 at 04:03, iMath wrote:
> but how to let python do it for you ?
> such as these 2 pages
> http://python.org/
> http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
> how to detect the character encoding in these 2 pages by python ?
If you have the
On Monday, 24 December 2012 at 08:34:47 UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
but how to let python do it for you ?
such as these 2 pages
http://python.org/
http://msdn.microsoft.com/en-us/library
On Monday, 24 December 2012 at 08:34:47 UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
but how to let python do it for you ?
such as this page
http://python.org/
how to detect the character encoding in th
On 24/12/12 01:34:47, iMath wrote:
> how to detect the character encoding in a web page ?
That depends on the site: different sites indicate
their encoding differently.
> such as this page: http://python.org/
If you download that page and look at the HTML code, you'll find a li
On Mon, Dec 24, 2012 at 11:34 AM, iMath wrote:
> how to detect the character encoding in a web page ?
> such as this page
>
> http://python.org/
You read part-way into the page, where you find this:
That tells you that the character set is UTF-8.
ChrisA
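The lookup order discussed in this thread (HTTP header first, then a meta declaration found while reading the page as ASCII) could be sketched roughly like this in Python 3. The function name sniff_charset, the 4096-byte scan window and the regular expression are my own assumptions for illustration; this is not a full implementation of what browsers do, and badly formed pages will defeat it.

```python
import re
from email.message import Message

# Matches <meta charset="..."> and the older
# <meta http-equiv=... content="...; charset=..."> form. Deliberately simple.
_META_RE = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_charset(content_type, body, default="ascii"):
    """Return the declared charset: HTTP header first, then a <meta> tag."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # parses "text/html; charset=..."
        if charset:
            return charset
    # The declaration itself must be readable as ASCII, so scan the raw bytes.
    match = _META_RE.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

The header value would come from the HTTP response (for example its Content-Type field), and body is the undecoded page. If neither source declares anything, a statistical guesser such as chardet is the remaining fallback.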
how to detect the character encoding in a web page ?
such as this page
http://python.org/
In reply to the message from Chris Angelico:
Your __str__ method is not returning a string. It's returning a
Unicode object. Under Python 2 (which you're obviously using, since
you use print as a statement), strings are bytes. The best thing to do
would be to move to Python 3.3, in which the defaul
On Sat, Dec 8, 2012 at 1:14 AM, gialloporpora wrote:
print a
> UnicodeError
print a.__str__()
> OK
By the way, it's *much* more helpful to copy and paste the actual
error message and output, rather than retyping like that. Spending one
extra minute in the interactive interpreter before
On 12/07/2012 11:17 AM, gialloporpora wrote:
In reply to the message from gialloporpora:
This is the code in my test.py:
Sorry, I have wrongly pasted the code:
class msgmarker(object):
    def __init__(self, msgid, msgstr, index, encoding="utf-8"):
        self._encoding = encoding
        self
On Sat, Dec 8, 2012 at 1:14 AM, gialloporpora wrote:
> Dear all,
> I have a problem with character encoding.
> I have created my class and I have redefined the __str__ method for pretty
> printing. I have saved my file as test.py,
> I give these lines:
>
>>>
In reply to the message from gialloporpora:
This is the code in my test.py:
Sorry, I have wrongly pasted the code:
class msgmarker(object):
    def __init__(self, msgid, msgstr, index, encoding="utf-8"):
        self._encoding = encoding
        self.set(msgid, msgstr)
C. Benson Manica, 21.04.2010 19:19:
I have the following simple script running on 2.5.2 on a machine where
the default character encoding is "ascii":
#!/usr/bin/env python
#coding: utf-8
import xml.dom.minidom
import codecs
str=u""
doc=xml.dom.minidom.parseString(
C. Benson Manica wrote:
> On Apr 21, 2:25 pm, Peter Otten <__pete...@web.de> wrote:
>
>> Are you sure that your script has
>>
>> str = u"..."
>>
>> like in your post and not just
>>
>> str = "..."
>
> No :-)
>
> str=u" \"ó\"/>"
> doc=xml.dom.minidom.parseString( str.encode("utf-8") )
> xml=doc.
On Apr 21, 2:25 pm, Peter Otten <__pete...@web.de> wrote:
> Are you sure that your script has
>
> str = u"..."
>
> like in your post and not just
>
> str = "..."
No :-)
str=u""
doc=xml.dom.minidom.parseString( str.encode("utf-8") )
xml=doc.toxml( encoding="utf-8")
file=codecs.open( "foo.xml", "w
C. Benson Manica wrote:
> On Apr 21, 1:58 pm, Peter Otten <__pete...@web.de> wrote:
>> C. Benson Manica wrote:
>>> (snip)
>>
>> It seems that parseString() doesn't like unicode
>
> Yes, I noticed that, and I already tried...
>
>> -- let's try a byte string
>> then:
>>
>> >>> doc = xml.dom.minido
On Apr 21, 1:58 pm, Peter Otten <__pete...@web.de> wrote:
> C. Benson Manica wrote:
>> (snip)
>
> It seems that parseString() doesn't like unicode
Yes, I noticed that, and I already tried...
> -- let's try a byte string
> then:
>
> >>> doc = xml.dom.minidom.parseString(s.encode("utf-8"))
> >>> xm
C. Benson Manica wrote:
> I have the following simple script running on 2.5.2 on a machine where
> the default character encoding is "ascii":
>
> #!/usr/bin/env python
> #coding: utf-8
>
> import xml.dom.minidom
> import codecs
>
> str=u" \"ó\&
I have the following simple script running on 2.5.2 on a machine where
the default character encoding is "ascii":
#!/usr/bin/env python
#coding: utf-8
import xml.dom.minidom
import codecs
str=u""
doc=xml.dom.minidom.parseString( str )
xml=doc.toxml( encoding="utf-8"
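For comparison, here is a minimal Python 3 sketch of the same round trip, passing bytes to the parser and getting bytes back from toxml(). The XML literal in the original post did not survive in the archive, so the document text and the helper name roundtrip below are purely hypothetical stand-ins.

```python
import xml.dom.minidom


def roundtrip(text):
    """Parse XML text and serialise it again as UTF-8 bytes."""
    # Hand the parser encoded bytes so no implicit ASCII conversion is involved.
    doc = xml.dom.minidom.parseString(text.encode("utf-8"))
    # With an encoding argument, toxml() returns bytes, not text.
    return doc.toxml(encoding="utf-8")


# Hypothetical document with one non-ASCII attribute value.
xml_bytes = roundtrip('<root attr="\u00f3"/>')
```

Since the result is already encoded, it would be written to a file opened in binary mode ("wb"); sending it through a second encoding step, for instance a codecs.open() writer, is the kind of double handling that tends to produce encoding errors.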
In reply to the message from gialloporpora:
Dear all,
I have a strange problem that I am not able to solve myself.
OK, I have solved my problem; sorry for the post.
I had not noticed this function before:
sys.getfilesystemencoding()
which returns the console encoding. Sorry.
Sandro
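A small note for anyone landing here: sys.getfilesystemencoding() reports the encoding Python uses for file names, while the encoding used when printing to a console is normally what sys.stdout.encoding reports. The two can coincide on a given machine. A rough Python 3 sketch to compare them (the helper name encoding_report is made up for this example):

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python uses in different places."""
    return {
        # Used for file names passed to the operating system.
        "filesystem": sys.getfilesystemencoding(),
        # Used by print(); may be None when stdout is redirected or replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Locale's preferred encoding, normally the default for text files.
        "locale": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(name, "->", value)
```

Running it once from the interactive environment and once from the Windows console should show whether the two environments really differ.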
*gialloporpora:
Dear all,
I have a strange problem that I am not able to solve myself.
I have written a little Python script to download images from last.fm,
now, if I call it from the python environment it works, if I call it
from Windows console it doesn't work
If I open the prompt and run python I call
Dave Angel writes:
> But I wanted to comment on the (c) remark. If you're in the US,
> that's the wrong abbreviation for copyright. The only recognized
> abbreviation is (copr).
More reading on this:
http://en.wikipedia.org/wiki/Universal_Copyright_Convention
http://en.wikipedia.org/
Robert Dailey wrote:
Hello,
I'm loading a file via open() in Python 3.1 and I'm getting the
following error when I try to print the contents of the file that I
obtained through a call to read():
UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
position 1650: character maps t
On Thu, Aug 6, 2009 at 12:41 PM, Robert Dailey wrote:
> On Aug 6, 11:31 am, "Richard Brodie" wrote:
>> "Robert Dailey" wrote in message
>>
>> news:29ab0981-b95d-4435-91bd-a7a520419...@b15g2000yqd.googlegroups.com...
>>
>> > UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
>> >
On Aug 6, 2009, at 3:14 PM, Martin v. Löwis wrote:
As a side note, you should probably use something other than "file" for
the parameter name in GetFileContentsAsString() since file() is a Python
function.
Python 3.1.1a0 (py3k:74094, Jul 19 2009, 13:39:42)
[GCC 4.3.3] on linux2
Type "help
> As a side note, you should probably use something other than "file" for
> the parameter name in GetFileContentsAsString() since file() is a Python
> function.
Python 3.1.1a0 (py3k:74094, Jul 19 2009, 13:39:42)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more inform
On Thu, 06 Aug 2009 09:14:08 -0700, Robert Dailey wrote:
> I'm loading a file via open() in Python 3.1 and I'm getting the
> following error when I try to print the contents of the file that I
> obtained through a call to read():
>
> UnicodeEncodeError: 'charmap' codec can't encode character '\xa
"Robert Dailey" wrote in message
news:f64f9830-c416-41b1-a510-c1e486271...@g19g2000vbi.googlegroups.com...
> As you can see, I am trying to load the file with encoding 'cp1252'
> which, according to the python 3.1 docs, translates to windows-1252. I
> also tried 'latin_1', which translates to I
On Aug 6, 2009, at 12:41 PM, Robert Dailey wrote:
On Aug 6, 11:31 am, "Richard Brodie" wrote:
"Robert Dailey" wrote in message
news:29ab0981-b95d-4435-91bd-a7a520419...@b15g2000yqd.googlegroups.com
...
UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
position 1650: c
On Thu, 2009-08-06 at 09:14 -0700, Robert Dailey wrote:
> Hello,
>
> I'm loading a file via open() in Python 3.1 and I'm getting the
> following error when I try to print the contents of the file that I
> obtained through a call to read():
>
> UnicodeEncodeError: 'charmap' codec can't encode char
On Aug 6, 11:31 am, "Richard Brodie" wrote:
> "Robert Dailey" wrote in message
>
> news:29ab0981-b95d-4435-91bd-a7a520419...@b15g2000yqd.googlegroups.com...
>
> > UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
> > position 1650: character maps to <undefined>
>
> > The file is defined a
"Robert Dailey" wrote in message
news:29ab0981-b95d-4435-91bd-a7a520419...@b15g2000yqd.googlegroups.com...
> UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
> position 1650: character maps to <undefined>
>
> The file is defined as ASCII.
That's the problem: ASCII is a seven bit code.
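To make the seven-bit point concrete: '\xa9' is code point 169, which is above ASCII's 0-127 range, so a strict encode to ASCII has to fail. The sketch below (Python 3) shows the failure and a few ways around it. The helper name encode_for_output is invented for this example, and since the thread does not say which codec the poster's console actually used, ASCII stands in for it here.

```python
def encode_for_output(text, encoding="ascii", errors="backslashreplace"):
    """Encode text for a limited output codec, substituting what does not fit."""
    return text.encode(encoding, errors)


sample = "\xa9 2009"  # U+00A9 is code point 169, outside ASCII's 0-127

try:
    sample.encode("ascii")  # strict mode raises for anything above 127
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

escaped = encode_for_output(sample)                     # keeps a visible escape
replaced = encode_for_output(sample, errors="replace")  # substitutes "?"
as_utf8 = sample.encode("utf-8")                        # UTF-8 can represent it
```

Reading the file and printing it are separate steps: the file is decoded on read(), and the text is encoded again for the console on print(). An encode error such as the one reported points at the second step.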
On Aug 6, 2009, at 12:14 PM, Robert Dailey wrote:
Hello,
I'm loading a file via open() in Python 3.1 and I'm getting the
following error when I try to print the contents of the file that I
obtained through a call to read():
UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
Hello,
I'm loading a file via open() in Python 3.1 and I'm getting the
following error when I try to print the contents of the file that I
obtained through a call to read():
UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in
position 1650: character maps to <undefined>
The file is defined
shymon wrote:
>
>
>
> Diez B. Roggisch-2 wrote:
>>
>> shymon wrote:
>>
>>> I'm using SimpleXmlRpcServer class. Although I set encoding parameter in
>>> the constructor, I have to return all strings in default platform
>>> encoding
>>> (windows-1250/win32 or iso-8859-2/linux in my case). When
esult received by the client was the same as if I sent UTF-8 encoded
string.
--
View this message in context:
http://www.nabble.com/SimpleXmlRpcServer-and-character-encoding-tp19896427p19898136.html
Sent from the Python - python-list mailing list archive at Nabble.com.
shymon wrote:
>
>
> I'm using SimpleXmlRpcServer class. Although I set encoding parameter in
> the constructor, I have to return all strings in default platform encoding
> (windows-1250/win32 or iso-8859-2/linux in my case). When I send values
> in, for example, UTF-8, string received by client
lient is written in java using Apache XmlRpc library 2.0.
Is there any solution other than sending all string values in Base64
encoding?
--
View this message in context:
http://www.nabble.com/SimpleXmlRpcServer-and-character-encoding-tp19896427p19896427.html
Sent from the Python - python-list ma
mp wrote:
> I have html document titles with characters like >, , and
> ‡. How do I decode a string with these values in Python?
>
> Thanks
>
>
This is definitely the most FAQ. It comes up about once a week.
The stream-editing way is like this:
>>> import SE
>>> HTM_Decoder = SE.SE ('htm2is
Dennis Lee Bieber wrote:
> On 7 Nov 2006 11:34:32 -0800, "mp" <[EMAIL PROTECTED]> declaimed the
> following in comp.lang.python:
>
> > I have html document titles with characters like >, , and
> > ‡. How do I decode a string with these values in Python?
> >
>
> Wouldn't HTMLParser be suit
At Tuesday 7/11/2006 17:10, mp wrote:
I'd prefer a more generalized solution which takes care of all possible
ampersand characters. I assume that there is code already written which
does this.
Try the htmlentitydefs module
--
Gabriel Genellina
Softlab SRL
_
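For readers on Python 3: the htmlentitydefs module was renamed html.entities there, and since Python 3.4 the standard library also has html.unescape(), which handles named and numeric character references in one call. A minimal sketch (the wrapper name decode_entities and the sample strings are just for illustration):

```python
import html


def decode_entities(text):
    """Replace named and numeric HTML character references with characters."""
    return html.unescape(text)


title = decode_entities("Tom &amp; Jerry &gt; &#8225;")
```

This is the generalised solution asked for above: nothing has to be listed by hand, unlike chained string.replace() calls.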
I'd prefer a more generalized solution which takes care of all possible
ampersand characters. I assume that there is code already written which
does this.
Thanks
i80and wrote:
> I would suggest using string.replace. Simply replace ' ' with ' '
> for each time it occurs. It doesn't take too much
I would suggest using string.replace. Simply replace ' ' with ' '
for each time it occurs. It doesn't take too much code.
On Nov 7, 1:34 pm, "mp" <[EMAIL PROTECTED]> wrote:
> I have html document titles with characters like >, , and
> ‡. How do I decode a string with these values in Python?
>
I have html document titles with characters like >, , and
‡. How do I decode a string with these values in Python?
Thanks
Michal wrote:
> Hello,
> is there any way how to detect string encoding in Python?
>
> I need to process several files. Each of them could be encoded in
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and
> encode it to utf-8 (with string function encode).
Well, about how to
Perhaps this project's code or ideas could be of service:
http://freshmeat.net/projects/enca/
Jeff
Martin P. Hellwig wrote:
> I read or heard (can't remember the origin) that MS IE has a quite good
> implementation of guessing the language and character encoding of web
> pages when they're not or falsely specified.
Yes, I think that's right. In my experience MS Word does
Thanks everybody for the helpful advice.
Michal
Diez B. Roggisch wrote:
> So cp1250 doesn't have all codepoints defined - but the others have.
> Sure, this helps you to eliminate 1 of the three choices the OP wanted
> to choose between - but how many texts you have that have a 129 in them?
For the iso8859 ones, you should assume that the char
Martin P. Hellwig wrote:
> From what I can remember is that they used an algorithm to create some
> statistics of the specific page and compared that with statistic about
> all kinds of languages and encodings and just mapped the most likely.
More hearsay: I believe language-based heuristics ar
[Diez B. Roggisch]
>Michal wrote:
>> is there any way how to detect string encoding in Python?
>Recode might be of help here, it has such heuristics built in AFAIK.
If we are speaking about the same Recode ☺, there are some built in
tools that could help a human to discover a charset, but this
Mike Meyer wrote:
> "Diez B. Roggisch" <[EMAIL PROTECTED]> writes:
>
>>Michal wrote:
>>
>>>is there any way how to detect string encoding in Python?
>>>I need to process several files. Each of them could be encoded in
>>>different charset (iso-8859-2, cp1250, etc). I want to detect it,
>>>and enc
Martin> I read or heard (can't remember the origin) that MS IE has a
Martin> quite good implementation of guessing the language and character
Martin> encoding of web pages when they're not or falsely specified.
Gee, that's nice. Too bad the source isn't available... <0.5 wink>
Skip
Mike Meyer wrote:
> "Diez B. Roggisch" <[EMAIL PROTECTED]> writes:
>> Michal wrote:
>>> is there any way how to detect string encoding in Python?
>>> I need to process several files. Each of them could be encoded in
>>> different charset (iso-8859-2, cp1250, etc). I want to detect it,
>>> and enco
You may want to look at some Python Cookbook recipes, such as
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52257
"Auto-detect XML encoding" by Paul Prescod
While I was thinking of a nice intro, "Michal" wrote:
> Hello,
> is there any way how to detect string encoding in Python?
> I need to process several files. Each of them could be encoded in
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and
> encode it to utf-8 (with
"Diez B. Roggisch" <[EMAIL PROTECTED]> writes:
> Michal wrote:
>> is there any way how to detect string encoding in Python?
>> I need to process several files. Each of them could be encoded in
>> different charset (iso-8859-2, cp1250, etc). I want to detect it,
>> and encode it to utf-8 (with stri
Michal wrote:
> Hello,
> is there any way how to detect string encoding in Python?
>
> I need to process several files. Each of them could be encoded in
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and
> encode it to utf-8 (with string function encode).
You can only gues
Michal wrote:
> Hello,
> is there any way how to detect string encoding in Python?
>
> I need to process several files. Each of them could be encoded in
> different charset (iso-8859-2, cp1250, etc). I want to detect it, and
> encode it to utf-8 (with string function encode).
>
> Thank you for
Hello,
is there any way how to detect string encoding in Python?
I need to process several files. Each of them could be encoded in
different charset (iso-8859-2, cp1250, etc). I want to detect it, and
encode it to utf-8 (with string function encode).
Thank you for any answer
Regards
Michal
Max M wrote:
A simple way to try out different encodings in a given order:
The loop is fine - although ('UTF-8', 'Latin-1', 'ASCII') is
somewhat redundant. The 'ASCII' case is never considered, since
Latin-1 effectively works as a catch-all encoding (as all byte
sequences can be considered Latin-1
Christian Ergh wrote:
Once more, indention should be correct now, and the 128 is gone too. So,
something like this?
Yes, something like this. The tricky part is, of course, the
fragments which you didn't implement.
Also, it might be possible to do this in a for loop, e.g.
for encoding in (pag
Forgot a part... You need the encoding list:
encodings = [
'utf-8',
'latin-1',
'ascii',
'cp1252',
]
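One thing worth noting about the order of such a list: Latin-1 assigns a character to every byte value, so it never fails and nothing listed after it is ever tried. A rough Python 3 sketch with the catch-all moved to the end follows. The function name decode_first_match and the default order are my own choices for illustration, and a successful decode only means the bytes were valid in that encoding, not that the guess is right.

```python
def decode_first_match(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes data."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next candidate
    raise ValueError("no candidate encoding could decode the data")


# Strict candidates go first; latin-1 accepts every byte sequence, so it is last.
text, used = decode_first_match(b"\x93quoted\x94")
```

UTF-8 is a good first candidate because text in other encodings rarely happens to be valid UTF-8, while cp1252 leaves a few byte values undefined and so can still reject some input.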
Christian Ergh wrote:
Dylan wrote:
Here's what I'm trying to do:
- scrape some html content from various sources
The issue I'm running into:
- some of the sources have incorrectly e
Dylan wrote:
Here's what I'm trying to do:
- scrape some html content from various sources
The issue I'm running into:
- some of the sources have incorrectly encoded characters... for
example, cp1252 curly quotes that were likely the result of the author
copying and pasting content from Word
Finally:
- snip -
def get_encoded(st, encodings):
    "Returns an encoding that doesn't fail"
    for encoding in encodings:
        try:
            st_encoded = st.decode(encoding)
            return st_encoded, encoding
        except UnicodeError:
            pass
-snip-
This works fine, but after this
Christian Ergh wrote:
A simple way to try out different encodings in a given order:
# -*- coding: latin-1 -*-
def get_encoded(st, encodings):
    "Returns an encoding that doesn't fail"
    for encoding in encodings:
        try:
            st_encoded = st.decode(encoding)
            return st_en
Once more, indention should be correct now, and the 128 is gone too. So,
something like this?
Chris
import urllib2
url = 'www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = '???'
xmlencoding = 'whatever
Peter Otten wrote:
Steven Bethard wrote:
Christian Ergh wrote:
flag = true
for char in data:
    if 127 < ord(char) < 128:
        flag = false
if flag:
    try:
        data = data.encode('latin-1')
    except:
        pass
A little OT, but (assuming I got your indentation right[1]) this kind of
loop i
Martin v. Löwis wrote:
Dylan wrote:
Things I have tried include encode()/decode()
This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then
htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
will give you a file that contains only ASCII charact
Steven Bethard wrote:
> Christian Ergh wrote:
>> flag = true
>> for char in data:
>>     if 127 < ord(char) < 128:
>>         flag = false
>> if flag:
>>     try:
>>         data = data.encode('latin-1')
>>     except:
>>         pass
>
> A little OT, but (assuming I got your indentation right[1]
Christian Ergh wrote:
flag = true
for char in data:
    if 127 < ord(char) < 128:
        flag = false
if flag:
    try:
        data = data.encode('latin-1')
    except:
        pass
A little OT, but (assuming I got your indentation right[1]) this kind of
loop is exactly what the else clause of a
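A minimal sketch of a for loop with an else clause, which replaces the flag variable in loops of this shape. Note that the quoted test 127 < ord(char) < 128 can never be true for an integer; the version below assumes the intent was "any character above 127". The function name describe is invented for the example.

```python
def describe(text):
    """Report whether text is pure ASCII, using for/else instead of a flag."""
    for index, char in enumerate(text):
        if ord(char) > 127:
            result = "non-ASCII character at index %d" % index
            break  # leaving via break skips the else block
    else:
        # Runs only when the loop finished without hitting break.
        result = "pure ASCII"
    return result
```

The else block belongs to the for statement, not to the if inside it, so it also runs when the loop body never executes, for example on an empty string.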
Christian Ergh wrote:
- it works with the characters i mentioned
It does.
- what encoding do you have in the end
US-ASCII
- and how exactly are you doing all this? All with somestring.decode()
or... Can you please give an example for these 7 steps?
I could, but I don't have the time - just try to
Dylan wrote:
Things I have tried include encode()/decode()
This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then
htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
will give you a file that contains only ASCII characters, and
character refer
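The same decode-then-encode step in Python 3 terms, as a small sketch. The helper name to_ascii_html is made up, and cp1252 is only the guessed source encoding from the example above, so it is exposed as a parameter rather than fixed.

```python
def to_ascii_html(raw, source_encoding="cp1252"):
    """Decode raw bytes, then re-encode as ASCII with character references."""
    text = raw.decode(source_encoding)
    # Anything outside ASCII becomes a numeric reference such as &#8220;
    return text.encode("us-ascii", "xmlcharrefreplace")


# cp1252 curly quotes (bytes 0x93 and 0x94) around a word.
converted = to_ascii_html(b"\x93quoted\x94")
```

The result is plain ASCII bytes, so it is safe in any ASCII-compatible encoding, and a browser turns the references back into the original characters.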