Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Dave Angel
On 01/07/2015 08:38 AM, Jacob Kruger wrote: Thanks. Makes more sense now, and yes, using 2.7 here. Unfortunately, while could pass the binary values into blob fields well enough, using forms of parameterised statements, the actual generation of sql script text files is a step they want to work

Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Dave Angel
On 01/07/2015 08:32 AM, Jacob Kruger wrote: Thanks. Please don't top-post. Put your responses after each quoted part you're responding to. And if there are parts you're not responding to, please delete them. Issue with knowing encoding could just be that am pretty sure at least some of the

Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Jacob Kruger
- From: "Dave Angel" To: Sent: Wednesday, January 07, 2015 2:22 PM Subject: Re: String character encoding when converting data from one type/format to another On 01/07/2015 06:04 AM, Jacob Kruger wrote: I'm busy using something like pyodbc to pull data out of MS access .mdb f

Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Jacob Kruger
- Original Message - From: "Peter Otten" <__pete...@web.de> To: Sent: Wednesday, January 07, 2015 2:11 PM Subject: Re: String character encoding when converting data from one type/format to another Jacob Kruger wrote: I'm busy using something like pyodbc to pull d

Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Jacob Kruger
Wilco wants to welcome you...to the space janitor's closet..." - Original Message - From: "Ned Batchelder" To: Sent: Wednesday, January 07, 2015 2:02 PM Subject: Re: String character encoding when converting data from one type/format to another On 1/7/15 6:0

Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Chris Angelico
On Wed, Jan 7, 2015 at 11:02 PM, Ned Batchelder wrote: >> Any thoughts on a sort of generic method/means to handle any/all >> characters that might be out of range when having pulled them out of >> something like these MS access databases? > > > The best thing is to know what encoding was used to

Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Dave Angel
On 01/07/2015 06:04 AM, Jacob Kruger wrote: I'm busy using something like pyodbc to pull data out of MS access .mdb files, and then generate .sql script files to execute against MySQL databases using MySQLdb module, but, issue is forms of characters in string values that don't fit inside

Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Peter Otten
Jacob Kruger wrote: > I'm busy using something like pyodbc to pull data out of MS access .mdb > files, and then generate .sql script files to execute against MySQL > databases using MySQLdb module, but, issue is forms of characters in > string values that don't fit inside the 0-127 range - current

Re: String character encoding when converting data from one type/format to another

2015-01-07 Thread Ned Batchelder
On 1/7/15 6:04 AM, Jacob Kruger wrote: I'm busy using something like pyodbc to pull data out of MS access .mdb files, and then generate .sql script files to execute against MySQL databases using MySQLdb module, but, issue is forms of characters in string values that don't fit inside the 0-127 ran

String character encoding when converting data from one type/format to another

2015-01-07 Thread Jacob Kruger
I'm busy using something like pyodbc to pull data out of MS access .mdb files, and then generate .sql script files to execute against MySQL databases using MySQLdb module, but, issue is forms of characters in string values that don't fit inside the 0-127 range - current one seems to be something

RE: how to detect the character encoding in a web page ?

2013-06-09 Thread Carlos Nepomuceno
tml_list[-1] if charset_from_html_list else '' return charset_from_html if charset_from_html else charset_from_header > Date: Sun, 9 Jun 2013 04:47:02 -0700 > Subject: Re: how to detect the character encoding in a web page ? > From: redstone-c...@163.com > To: python-l

Re: how to detect the character encoding in a web page ?

2013-06-09 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ Here is one thread that can help me understand my code: http://stackoverflow.com/questions/17001407/how-to-detect-the-ch

Re: how to detect the character encoding in a web page ?

2013-06-09 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ Finally, I found that by using PyQt’s QTextStream, QTextCodec and chardet, we can get a web page code more securely even f

Re: how to detect the character encoding in a web page ?

2013-06-06 Thread Chris Angelico
On Thu, Jun 6, 2013 at 4:22 PM, Nobody wrote: > On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote: > >> The HTTP header is completely out of band. This is the best way to >> transmit encoding information. Otherwise, you assume 7-bit ASCII and start >> parsing. Once you find a meta tag, you

Re: how to detect the character encoding in a web page ?

2013-06-05 Thread Nobody
On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote: > The HTTP header is completely out of band. This is the best way to > transmit encoding information. Otherwise, you assume 7-bit ASCII and start > parsing. Once you find a meta tag, you stop parsing and go back to the > top, decoding in th
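The approach described in this message (prefer the out-of-band HTTP header, otherwise scan the start of the page as ASCII for a meta tag, then decode with what you found) might be sketched as follows. The helper name `sniff_charset`, the 1024-byte window, and the sample page are assumptions made for illustration; real pages can defeat a regex this simple.

```python
import re

# Hypothetical pattern: matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)

def sniff_charset(raw, header_charset=None, default="ascii"):
    """Pick a charset: HTTP header first, then a <meta> tag, then a default."""
    if header_charset:
        return header_charset.lower()
    # Only look at the first bytes; the window size is an arbitrary assumption.
    match = _META_CHARSET.search(raw[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default

# Sample page invented for this sketch.
page = b'<html><head><meta charset="utf-8"><title>caf\xc3\xa9</title></head></html>'
charset = sniff_charset(page)
text = page.decode(charset)
```

Statistical guessers such as chardet, mentioned elsewhere in this thread, are probably best kept as a last resort after both of these declared sources fail.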

Re: how to detect the character encoding in a web page ?

2013-06-05 Thread Chris Angelico
On Thu, Jun 6, 2013 at 1:14 AM, iMath wrote: > On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: >> how to detect the character encoding in a web page ? >> >> such as this page >> >> >> >> http://python.org/ > > by the way ,we cannot get character enco

Re: how to detect the character encoding in a web page ?

2013-06-05 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ By the way, we cannot get the character encoding programmatically from the meta data without knowing the character encod

Re: how to detect the character encoding in a web page ?

2013-06-05 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ I found that PyQt’s QTextStream can very accurately detect the character encoding in a web page, even for this bad page ht

Re: how to detect the character encoding in a web page ?

2013-06-05 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ I found that PyQt’s QTextStream can very accurately detect the character encoding in a web page, even for this bad page chard

Re: how to detect the character encoding in a web page ?

2013-01-14 Thread Albert van der Horst
In article , Roy Smith wrote: >In article , > Alister wrote: > >> Indeed due to the poor quality of most websites it is not possible to be >> 100% accurate for all sites. >> >> personally I would start by checking the doc type & then the meta data as >> these should be quick & correct, I then us

Re: how to detect the character encoding in a web page ?

2013-01-07 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ Up to now, maybe chardet is the only way to let Python do it automatically. -- http://mail.python.org/mailman/listinfo/python-list

Re: how to detect the character encoding in a web page ?

2012-12-28 Thread python培训
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ First set up chardet: import chardet # fetch the page's HTML html_1 = urllib2.urlopen(line,timeout=120).read() #print html_1 mychar=

Re: how to detect the character encoding in a web page ?

2012-12-24 Thread Roy Smith
In article , Alister wrote: > Indeed due to the poor quality of most websites it is not possible to be > 100% accurate for all sites. > > personally I would start by checking the doc type & then the meta data as > these should be quick & correct, I then use chardectect only if these > fail t

Re: how to detect the character encoding in a web page ?

2012-12-24 Thread Alister
On Mon, 24 Dec 2012 13:50:39 +, Steven D'Aprano wrote: > On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote: > >> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller >> wrote: >>> $ wget -q -O - http://python.org/ | chardetect.py stdin: ISO-8859-2 >>> with confidence 0.803579722043 $ >> >> And it

Re: how to detect the character encoding in a web page ?

2012-12-24 Thread Steven D'Aprano
On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote: > On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller > wrote: >> $ wget -q -O - http://python.org/ | chardetect.py stdin: ISO-8859-2 >> with confidence 0.803579722043 $ > > And it sucks, because it uses magic, and not reading the HTML tags. The > RI

Re: how to detect the character encoding in a web page ?

2012-12-24 Thread Kwpolska
On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller wrote: > $ wget -q -O - http://python.org/ | chardetect.py > stdin: ISO-8859-2 with confidence 0.803579722043 > $ And it sucks, because it uses magic, and not reading the HTML tags. The RIGHT thing to do for websites is detect the meta charset definit

Re: how to detect the character encoding in a web page ?

2012-12-24 Thread Kurt Mueller
On 24.12.2012 at 04:03, iMath wrote: > but how to let python do it for you ? > such as these 2 pages > http://python.org/ > http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx > how to detect the character encoding in these 2 pages by python ? If you have the

Re: how to detect the character encoding in a web page ?

2012-12-23 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ But how do you let Python do it for you? Such as these 2 pages http://python.org/ http://msdn.microsoft.com/en-us/library

Re: how to detect the character encoding in a web page ?

2012-12-23 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ But how do you let Python do it for you? Such as these 2 pages http://python.org/ http://msdn.microsoft.com/en-us/library

Re: how to detect the character encoding in a web page ?

2012-12-23 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ But how do you let Python do it for you? Such as this page http://python.org/ how to detect the character encoding in th

Re: how to detect the character encoding in a web page ?

2012-12-23 Thread Hans Mulder
On 24/12/12 01:34:47, iMath wrote: > how to detect the character encoding in a web page ? That depends on the site: different sites indicate their encoding differently. > such as this page: http://python.org/ If you download that page and look at the HTML code, you'll find a li

Re: how to detect the character encoding in a web page ?

2012-12-23 Thread Chris Angelico
On Mon, Dec 24, 2012 at 11:34 AM, iMath wrote: > how to detect the character encoding in a web page ? > such as this page > > http://python.org/ You read part-way into the page, where you find this: That tells you that the character set is UTF-8. ChrisA -- http://mail.python

how to detect the character encoding in a web page ?

2012-12-23 Thread iMath
how to detect the character encoding in a web page ? such as this page http://python.org/

Re: Problem with __str__ method and character encoding

2012-12-07 Thread gialloporpora
In reply to Chris Angelico's message: Your __str__ method is not returning a string. It's returning a Unicode object. Under Python 2 (which you're obviously using, since you use print as a statement), strings are bytes. The best thing to do would be to move to Python 3.3, in which the defaul

Re: Problem with __str__ method and character encoding

2012-12-07 Thread Chris Angelico
On Sat, Dec 8, 2012 at 1:14 AM, gialloporpora wrote: print a > UnicodeError print a.__str__() > OK By the way, it's *much* more helpful to copy and paste the actual error message and output, rather than retyping like that. Spending one extra minute in the interactive interpreter before

Re: Problem with __str__ method and character encoding

2012-12-07 Thread peter
On 12/07/2012 11:17 AM, gialloporpora wrote: In reply to gialloporpora's message: This is the code in my test.py: Sorry, I have wrongly pasted the code: class msgmarker(object): def __init__(self, msgid, msgstr, index, encoding="utf-8"): self._encoding =encoding self

Re: Problem with __str__ method and character encoding

2012-12-07 Thread Chris Angelico
On Sat, Dec 8, 2012 at 1:14 AM, gialloporpora wrote: > Dear all, > I have a problem with character encoding. > I have created my class and I have redefined the __str__ method for pretty > printing. I have saved my file as test.py, > I give these lines: > >>>

Re: Problem with __str__ method and character encoding

2012-12-07 Thread gialloporpora
In reply to gialloporpora's message: This is the code in my test.py: Sorry, I have wrongly pasted the code: class msgmarker(object): def __init__(self, msgid, msgstr, index, encoding="utf-8"): self._encoding =encoding self.set(msgid, msgstr)

Re: xml.dom.minidom character encoding

2010-04-21 Thread Stefan Behnel
C. Benson Manica, 21.04.2010 19:19: I have the following simple script running on 2.5.2 on a machine where the default character encoding is "ascii": #!/usr/bin/env python #coding: utf-8 import xml.dom.minidom import codecs str=u"" doc=xml.dom.minidom.parseString(

Re: xml.dom.minidom character encoding

2010-04-21 Thread Peter Otten
C. Benson Manica wrote: > On Apr 21, 2:25 pm, Peter Otten <__pete...@web.de> wrote: > >> Are you sure that your script has >> >> str = u"..." >> >> like in your post and not just >> >> str = "..." > > No :-) > > str=u" \"ó\"/>" > doc=xml.dom.minidom.parseString( str.encode("utf-8") ) > xml=doc.

Re: xml.dom.minidom character encoding

2010-04-21 Thread C. Benson Manica
On Apr 21, 2:25 pm, Peter Otten <__pete...@web.de> wrote: > Are you sure that your script has > > str = u"..." > > like in your post and not just > > str = "..." No :-) str=u"" doc=xml.dom.minidom.parseString( str.encode("utf-8") ) xml=doc.toxml( encoding="utf-8") file=codecs.open( "foo.xml", "w

Re: xml.dom.minidom character encoding

2010-04-21 Thread Peter Otten
C. Benson Manica wrote: > On Apr 21, 1:58 pm, Peter Otten <__pete...@web.de> wrote: >> C. Benson Manica wrote: >>> (snip) >> >> It seems that parseString() doesn't like unicode > > Yes, I noticed that, and I already tried... > >> -- let's try a byte string >> then: >> >> >>> doc = xml.dom.minido

Re: xml.dom.minidom character encoding

2010-04-21 Thread C. Benson Manica
On Apr 21, 1:58 pm, Peter Otten <__pete...@web.de> wrote: > C. Benson Manica wrote: >> (snip) > > It seems that parseString() doesn't like unicode Yes, I noticed that, and I already tried... > -- let's try a byte string > then: > > >>> doc = xml.dom.minidom.parseString(s.encode("utf-8")) > >>> xm

Re: xml.dom.minidom character encoding

2010-04-21 Thread Peter Otten
C. Benson Manica wrote: > I have the following simple script running on 2.5.2 on a machine where > the default character encoding is "ascii": > > #!/usr/bin/env python > #coding: utf-8 > > import xml.dom.minidom > import codecs > > str=u" \"ó\&

xml.dom.minidom character encoding

2010-04-21 Thread C. Benson Manica
I have the following simple script running on 2.5.2 on a machine where the default character encoding is "ascii": #!/usr/bin/env python #coding: utf-8 import xml.dom.minidom import codecs str=u"" doc=xml.dom.minidom.parseString( str ) xml=doc.toxml( encoding="utf-8"
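A minimal sketch of one way to keep bytes and text separate in a script like this, written for Python 3 rather than the 2.5.2 used in the post. The XML document here is hypothetical, since the original markup was lost in the archive, and the temporary output path is likewise an assumption.

```python
import os
import tempfile
import xml.dom.minidom

# Hypothetical document standing in for the one elided from the post.
source = '<root attr="\u00f3"/>'

# Hand the parser UTF-8 bytes so it does not depend on the default encoding.
doc = xml.dom.minidom.parseString(source.encode("utf-8"))

# With an explicit encoding, toxml() returns bytes that already carry an
# XML declaration naming that encoding.
xml_bytes = doc.toxml(encoding="utf-8")

# The result is already encoded, so write it in binary mode instead of
# passing it through a second encoding layer.
out_path = os.path.join(tempfile.mkdtemp(), "foo.xml")
with open(out_path, "wb") as handle:
    handle.write(xml_bytes)
```

Writing the bytes unchanged avoids encoding the same data twice, which is a common source of this kind of error.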

Re: Problem with character encoding in commandline

2009-10-15 Thread gialloporpora
In reply to gialloporpora's message: Dear all, I have a strange problem that I am not able to solve myself. OK, I have solved my problem, sorry for the post. At first I had not seen this function: sys.getfilesystemencoding(), which returns the console encoding, sorry. Sandro *gialloporpora:

Problem with character encoding in commandline

2009-10-15 Thread gialloporpora
Dear all, I have a strange problem that I am not able to solve myself. I have written a little Python script to download images from last.fm. If I call it from the Python environment it works, but if I call it from the Windows console it doesn't work. If I open the prompt and run python I call

Re: Character encoding & the copyright symbol

2009-08-13 Thread Ben Finney
Dave Angel writes: > But I wanted to comment on the (c) remark. If you're in the US, > that's the wrong abbreviation for copyright. The only recognized > abbreviation is (copr). More reading on this: http://en.wikipedia.org/wiki/Universal_Copyright_Convention> http://en.wikipedia.org/

Re: Character encoding & the copyright symbol

2009-08-06 Thread Dave Angel
Robert Dailey wrote: Hello, I'm loading a file via open() in Python 3.1 and I'm getting the following error when I try to print the contents of the file that I obtained through a call to read(): UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in position 1650: character maps t

Re: Character encoding & the copyright symbol

2009-08-06 Thread Benjamin Kaplan
On Thu, Aug 6, 2009 at 12:41 PM, Robert Dailey wrote: > On Aug 6, 11:31 am, "Richard Brodie" wrote: >> "Robert Dailey" wrote in message >> >> news:29ab0981-b95d-4435-91bd-a7a520419...@b15g2000yqd.googlegroups.com... >> >> > UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in >> >

Re: Character encoding & the copyright symbol

2009-08-06 Thread Philip Semanchuk
On Aug 6, 2009, at 3:14 PM, Martin v. Löwis wrote: As a side note, you should probably use something other than "file" for the parameter name in GetFileContentsAsString() since file() is a Python function. Python 3.1.1a0 (py3k:74094, Jul 19 2009, 13:39:42) [GCC 4.3.3] on linux2 Type "help

Re: Character encoding & the copyright symbol

2009-08-06 Thread Martin v. Löwis
> As a side note, you should probably use something other than "file" for > the parameter name in GetFileContentsAsString() since file() is a Python > function. Python 3.1.1a0 (py3k:74094, Jul 19 2009, 13:39:42) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more inform

Re: Character encoding & the copyright symbol

2009-08-06 Thread Nobody
On Thu, 06 Aug 2009 09:14:08 -0700, Robert Dailey wrote: > I'm loading a file via open() in Python 3.1 and I'm getting the > following error when I try to print the contents of the file that I > obtained through a call to read(): > > UnicodeEncodeError: 'charmap' codec can't encode character '\xa

Re: Character encoding & the copyright symbol

2009-08-06 Thread Richard Brodie
"Robert Dailey" wrote in message news:f64f9830-c416-41b1-a510-c1e486271...@g19g2000vbi.googlegroups.com... > As you can see, I am trying to load the file with encoding 'cp1252' > which, according to the python 3.1 docs, translates to windows-1252. I > also tried 'latin_1', which translates to I

Re: Character encoding & the copyright symbol

2009-08-06 Thread Philip Semanchuk
On Aug 6, 2009, at 12:41 PM, Robert Dailey wrote: On Aug 6, 11:31 am, "Richard Brodie" wrote: "Robert Dailey" wrote in message news:29ab0981-b95d-4435-91bd-a7a520419...@b15g2000yqd.googlegroups.com ... UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in position 1650: c

Re: Character encoding & the copyright symbol

2009-08-06 Thread Albert Hopkins
On Thu, 2009-08-06 at 09:14 -0700, Robert Dailey wrote: > Hello, > > I'm loading a file via open() in Python 3.1 and I'm getting the > following error when I try to print the contents of the file that I > obtained through a call to read(): > > UnicodeEncodeError: 'charmap' codec can't encode char

Re: Character encoding & the copyright symbol

2009-08-06 Thread Robert Dailey
On Aug 6, 11:31 am, "Richard Brodie" wrote: > "Robert Dailey" wrote in message > > news:29ab0981-b95d-4435-91bd-a7a520419...@b15g2000yqd.googlegroups.com... > > > UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in > > position 1650: character maps to > > > The file is defined a

Re: Character encoding & the copyright symbol

2009-08-06 Thread Richard Brodie
"Robert Dailey" wrote in message news:29ab0981-b95d-4435-91bd-a7a520419...@b15g2000yqd.googlegroups.com... > UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in > position 1650: character maps to > > The file is defined as ASCII. That's the problem: ASCII is a seven bit code.

Re: Character encoding & the copyright symbol

2009-08-06 Thread Philip Semanchuk
On Aug 6, 2009, at 12:14 PM, Robert Dailey wrote: Hello, I'm loading a file via open() in Python 3.1 and I'm getting the following error when I try to print the contents of the file that I obtained through a call to read(): UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in

Character encoding & the copyright symbol

2009-08-06 Thread Robert Dailey
Hello, I'm loading a file via open() in Python 3.1 and I'm getting the following error when I try to print the contents of the file that I obtained through a call to read(): UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in position 1650: character maps to The file is defined
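The error quoted here is an encode error, so it is raised when text is converted to bytes for output rather than when the file is read. A small sketch of that behaviour, using the `ascii` codec as a stand-in because the poster's actual console codec is not shown in this snippet; the sample string is also invented.

```python
# Sample text containing the copyright sign (U+00A9); invented for this sketch.
text = "Copyright \xa9 2009"

# Encoding to a codec that lacks the character raises UnicodeEncodeError.
try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

# Error handlers trade fidelity for robustness instead of raising.
escaped = text.encode("ascii", errors="backslashreplace")
replaced = text.encode("ascii", errors="replace")
```

Which handler is appropriate depends on whether the output is meant for people or for another program.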

Re: SimpleXmlRpcServer and character encoding

2008-10-09 Thread Diez B. Roggisch
shymon wrote: > > > > Diez B. Roggisch-2 wrote: >> >> shymon wrote: >> >>> I'm using SimpleXmlRpcServer class. Although I set encoding parameter in >>> the constructor, I have to return all strings in default platform >>> encoding >>> (windows-1250/win32 or iso-8859-2/linux in my case). When

Re: SimpleXmlRpcServer and character encoding

2008-10-09 Thread shymon
esult received by the client was the same as if I sent UTF-8 encoded string. -- View this message in context: http://www.nabble.com/SimpleXmlRpcServer-and-character-encoding-tp19896427p19898136.html Sent from the Python - python-list mailing list archive at Nabble.com.

Re: SimpleXmlRpcServer and character encoding

2008-10-09 Thread Diez B. Roggisch
shymon wrote: > > > I'm using SimpleXmlRpcServer class. Although I set encoding parameter in > the constructor, I have to return all strings in default platform encoding > (windows-1250/win32 or iso-8859-2/linux in my case). When I send values > in, for example, UTF-8, string received by client

SimpleXmlRpcServer and character encoding

2008-10-09 Thread shymon
lient is written in java using Apache XmlRpc library 2.0. Is there any solution other than sending all string values in Base64 encoding? -- View this message in context: http://www.nabble.com/SimpleXmlRpcServer-and-character-encoding-tp19896427p19896427.html Sent from the Python - python-list ma

Re: Character encoding

2006-11-08 Thread Frederic Rentsch
mp wrote: > I have html document titles with characters like >,  , and > ‡. How do I decode a string with these values in Python? > > Thanks > > This is definitely the most FAQ. It comes up about once a week. The stream-editing way is like this: >>> import SE >>> HTM_Decoder = SE.SE ('htm2is

Re: Character encoding

2006-11-08 Thread [EMAIL PROTECTED]
Dennis Lee Bieber wrote: > On 7 Nov 2006 11:34:32 -0800, "mp" <[EMAIL PROTECTED]> declaimed the > following in comp.lang.python: > > > I have html document titles with characters like >,  , and > > ‡. How do I sddecode a string with these values in Python? > > > > Wouldn't HTMLParser be suit

Re: Character encoding

2006-11-07 Thread Gabriel Genellina
At Tuesday 7/11/2006 17:10, mp wrote: I'd prefer a more generalized solution which takes care of all possible ampersand characters. I assume that there is code already written which does this. Try the htmlentitydefs module -- Gabriel Genellina Softlab SRL _
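For readers on Python 3, where `htmlentitydefs` became `html.entities`, a minimal sketch of the generalized decoding asked for here uses `html.unescape` (available since Python 3.4). The sample title is invented; the entities in the original post were rendered away by the archive.

```python
import html

# Invented sample; named and numeric character references are both handled.
title = "a &gt; b&nbsp;&Dagger; &#8225;"
decoded = html.unescape(title)
```

The result is a text string, so `&nbsp;` becomes the no-break space U+00A0 rather than an ordinary space.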

Re: Character encoding

2006-11-07 Thread mp
I'd prefer a more generalized solution which takes care of all possible ampersand characters. I assume that there is code already written which does this. Thanks i80and wrote: > I would suggest using string.replace. Simply replace ' ' with ' ' > for each time it occurs. It doesn't take too much

Re: Character encoding

2006-11-07 Thread i80and
I would suggest using string.replace. Simply replace ' ' with ' ' for each time it occurs. It doesn't take too much code. On Nov 7, 1:34 pm, "mp" <[EMAIL PROTECTED]> wrote: > I have html document titles with characters like >,  , and > ‡. How do I decode a string with these values in Python? >

Character encoding

2006-11-07 Thread mp
I have html document titles with characters like >,  , and ‡. How do I decode a string with these values in Python? Thanks

Re: Detect character encoding

2005-12-05 Thread The new guy
Michal wrote: > Hello, > is there any way how to detect string encoding in Python? > > I need to proccess several files. Each of them could be encoded in > different charset (iso-8859-2, cp1250, etc). I want to detect it, and > encode it to utf-8 (with string function encode). Well, about how to

Re: Detect character encoding

2005-12-05 Thread jepler
Perhaps this project's code or ideas could be of service: http://freshmeat.net/projects/enca/ Jeff

Re: Detect character encoding

2005-12-05 Thread Kent Johnson
Martin P. Hellwig wrote: > I read or heard (can't remember the origin) that MS IE has a quite good > implementation of guessing the language en character encoding of web > pages when there not or falsely specified. Yes, I think that's right. In my experience MS Word does

Re: Detect character encoding

2005-12-05 Thread Michal
Thanks everybody for the helpful advice. Michal

Re: Detect character encoding

2005-12-04 Thread Martin v. Löwis
Diez B. Roggisch wrote: > So cp1250 doesn't have all codepoints defined - but the others have. > Sure, this helps you to eliminate 1 of the three choices the OP wanted > to choose between - but how many texts you have that have a 129 in them? For the iso8859 ones, you should assume that the char

Re: Detect character encoding

2005-12-04 Thread Martin v. Löwis
Martin P. Hellwig wrote: > From what I can remember is that they used an algorithm to create some > statistics of the specific page and compared that with statistic about > all kinds of languages and encodings and just mapped the most likely. More hearsay: I believe language-based heuristics ar

Re: Detect character encoding

2005-12-04 Thread François Pinard
[Diez B. Roggisch] >Michal wrote: >> is there any way how to detect string encoding in Python? >Recode might be of help here, it has such heuristics built in AFAIK. If we are speaking about the same Recode ☺, there are some built in tools that could help a human to discover a charset, but this

Re: Detect character encoding

2005-12-04 Thread Diez B. Roggisch
Mike Meyer wrote: > "Diez B. Roggisch" <[EMAIL PROTECTED]> writes: > >>Michal wrote: >> >>>is there any way how to detect string encoding in Python? >>>I need to proccess several files. Each of them could be encoded in >>>different charset (iso-8859-2, cp1250, etc). I want to detect it, >>>and enc

Re: Detect character encoding

2005-12-04 Thread skip
Martin> I read or heard (can't remember the origin) that MS IE has a Martin> quite good implementation of guessing the language en character Martin> encoding of web pages when there not or falsely specified. Gee, that's nice. Too bad the source isn't available... <0.5 wink> Skip --

Re: Detect character encoding

2005-12-04 Thread Martin P. Hellwig
Mike Meyer wrote: > "Diez B. Roggisch" <[EMAIL PROTECTED]> writes: >> Michal wrote: >>> is there any way how to detect string encoding in Python? >>> I need to proccess several files. Each of them could be encoded in >>> different charset (iso-8859-2, cp1250, etc). I want to detect it, >>> and enco

Re: Detect character encoding

2005-12-04 Thread B Mahoney
You may want to look at some Python Cookbook recipes, such as http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52257 "Auto-detect XML encoding" by Paul Prescod

Re: Detect character encoding

2005-12-04 Thread Nemesis
While I was thinking of a nice intro, "Michal" wrote: > Hello, > is there any way how to detect string encoding in Python? > I need to proccess several files. Each of them could be encoded in > different charset (iso-8859-2, cp1250, etc). I want to detect it, and > encode it to utf-8 (with

Re: Detect character encoding

2005-12-04 Thread Mike Meyer
"Diez B. Roggisch" <[EMAIL PROTECTED]> writes: > Michal wrote: >> is there any way how to detect string encoding in Python? >> I need to proccess several files. Each of them could be encoded in >> different charset (iso-8859-2, cp1250, etc). I want to detect it, >> and encode it to utf-8 (with stri

Re: Detect character encoding

2005-12-04 Thread Diez B. Roggisch
Michal wrote: > Hello, > is there any way how to detect string encoding in Python? > > I need to proccess several files. Each of them could be encoded in > different charset (iso-8859-2, cp1250, etc). I want to detect it, and > encode it to utf-8 (with string function encode). You can only gues

Re: Detect character encoding

2005-12-04 Thread Scott David Daniels
Michal wrote: > Hello, > is there any way how to detect string encoding in Python? > > I need to proccess several files. Each of them could be encoded in > different charset (iso-8859-2, cp1250, etc). I want to detect it, and > encode it to utf-8 (with string function encode). > > Thank you for

Detect character encoding

2005-12-04 Thread Michal
Hello, is there any way how to detect string encoding in Python? I need to proccess several files. Each of them could be encoded in different charset (iso-8859-2, cp1250, etc). I want to detect it, and encode it to utf-8 (with string function encode). Thank you for any answer Regards Michal --

Re: character encoding conversion

2004-12-13 Thread "Martin v. Löwis"
Max M wrote: A simple way to try out different encodings in a given order: The loop is fine - although ('UTF-8', 'Latin-1', 'ASCII') is somewhat redundant. The 'ASCII' case is never considered, since Latin-1 effectively works as a catch-all encoding (as all byte sequences can be considered Latin-1
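The point about ordering can be checked with a short sketch in Python 3, where decoding applies to `bytes`. The function name `decode_first` and the sample byte strings are assumptions for illustration; the loop mirrors the try-each-encoding idea from the thread.

```python
def decode_first(raw, encodings):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits")

order = ("utf-8", "latin-1", "ascii")

# Valid UTF-8 is accepted by the first candidate.
utf8_result = decode_first(b"caf\xc3\xa9", order)

# Bytes that are not valid UTF-8 fall through to Latin-1, which accepts
# every possible byte, so the trailing "ascii" entry is never reached.
latin_result = decode_first(b"caf\xe9", order)
every_byte = decode_first(bytes(range(256)), ("latin-1", "ascii"))
```

A consequence is that a strict codec such as UTF-8 belongs before any catch-all, and anything listed after Latin-1 is dead weight.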

Re: character encoding conversion

2004-12-13 Thread "Martin v. Löwis"
Christian Ergh wrote: Once more, indention should be correct now, and the 128 is gone too. So, something like this? Yes, something like this. The tricky part is of, course, then the fragments which you didn't implement. Also, it might be possible to do this in a for loop, e.g. for encoding in (pag

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Forgot a part... You need the encoding list: encodings = [ 'utf-8', 'latin-1', 'ascii', 'cp1252', ] Christian Ergh wrote: Dylan wrote: Here's what I'm trying to do: - scrape some html content from various sources The issue I'm running to: - some of the sources have incorrectly e

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Dylan wrote: Here's what I'm trying to do: - scrape some html content from various sources The issue I'm running to: - some of the sources have incorrectly encoded characters... for example, cp1252 curly quotes that were likely the result of the author copying and pasting content from Word Finally:

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
- snip - def get_encoded(st, encodings): "Returns an encoding that doesn't fail" for encoding in encodings: try: st_encoded = st.decode(encoding) return st_encoded, encoding except UnicodeError: pass -snip- This works fine, but after this

Re: character encoding conversion

2004-12-13 Thread Max M
Christian Ergh wrote: A simple way to try out different encodings in a given order: # -*- coding: latin-1 -*- def get_encoded(st, encodings): "Returns an encoding that doesn't fail" for encoding in encodings: try: st_encoded = st.decode(encoding) return st_en

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Once more, indention should be correct now, and the 128 is gone too. So, something like this? Chris import urllib2 url = 'www.someurl.com' f = urllib2.urlopen(url) data = f.read() # if it is not in the pagecode, how do i get the encoding of the page? pageencoding = '???' xmlencoding = 'whatever

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Peter Otten wrote: Steven Bethard wrote: Christian Ergh wrote: flag = true for char in data: if 127 < ord(char) < 128: flag = false if flag: try: data = data.encode('latin-1') except: pass A little OT, but (assuming I got your indentation right[1]) this kind of loop i

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Martin v. Löwis wrote: Dylan wrote: Things I have tried include encode()/decode() This should work. If you somehow manage to guess the encoding, e.g. guess it as cp1252, then htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace") will give you a file that contains only ASCII charact

Re: character encoding conversion

2004-12-13 Thread Peter Otten
Steven Bethard wrote: > Christian Ergh wrote: >> flag = true >> for char in data: >> if 127 < ord(char) < 128: >> flag = false >> if flag: >> try: >> data = data.encode('latin-1') >> except: >> pass > > A little OT, but (assuming I got your indentation right[1]

Re: character encoding conversion

2004-12-13 Thread Steven Bethard
Christian Ergh wrote: flag = true for char in data: if 127 < ord(char) < 128: flag = false if flag: try: data = data.encode('latin-1') except: pass A little OT, but (assuming I got your indentation right[1]) this kind of loop is exactly what the else clause of a

Re: character encoding conversion

2004-12-12 Thread "Martin v. Löwis"
Christian Ergh wrote: - it works with the characters i mentioned It does. - what encoding do you have in the end US-ASCII - and how exactly are you doing all this? All with somestring.decode() or... Can you please give an example for these 7 steps? I could, but I don't have the time - just try to

Re: character encoding conversion

2004-12-12 Thread Christian Ergh
Martin v. Löwis wrote: Dylan wrote: Things I have tried include encode()/decode() This should work. If you somehow manage to guess the encoding, e.g. guess it as cp1252, then htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace") will give you a file that contains only ASCII charact

Re: character encoding conversion

2004-12-12 Thread "Martin v. Löwis"
Dylan wrote: Things I have tried include encode()/decode() This should work. If you somehow manage to guess the encoding, e.g. guess it as cp1252, then htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace") will give you a file that contains only ASCII characters, and character refer
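The expression in this reply carries over to Python 3 when the input is `bytes`. A minimal sketch, assuming the source really is cp1252 as in the example given; the sample bytes are invented and stand for curly quotes pasted from Word.

```python
# 0x93 and 0x94 are the cp1252 left and right double quotation marks.
raw = b"\x93quoted\x94 text"

# Decode with the guessed encoding, then re-encode as pure ASCII, turning
# anything outside ASCII into numeric character references.
ascii_html = raw.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
```

The output contains only ASCII bytes, so it can be served or stored without declaring any particular encoding.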
