Re: Character encoding & the copyright symbol

2009-08-13 Thread Ben Finney
Dave Angel writes: > But I wanted to comment on the (c) remark. If you're in the US, > that's the wrong abbreviation for copyright. The only recognized > abbreviation is (copr). More reading on this: http://en.wikipedia.org/wiki/Universal_Copyright_Convention> http://en.wikipedia.org/

Re: Character encoding & the copyright symbol

2009-08-06 Thread Dave Angel
Robert Dailey wrote: Hello, I'm loading a file via open() in Python 3.1 and I'm getting the following error when I try to print the contents of the file that I obtained through a call to read(): UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in position 1650: character maps t

Re: Character encoding & the copyright symbol

2009-08-06 Thread Benjamin Kaplan
On Thu, Aug 6, 2009 at 12:41 PM, Robert Dailey wrote: > On Aug 6, 11:31 am, "Richard Brodie" wrote: >> "Robert Dailey" wrote in message >> >> news:29ab0981-b95d-4435-91bd-a7a520419...@b15g2000yqd.googlegroups.com... >> >> > UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in >> >

Re: Character encoding & the copyright symbol

2009-08-06 Thread Philip Semanchuk
On Aug 6, 2009, at 3:14 PM, Martin v. Löwis wrote: As a side note, you should probably use something other than "file" for the parameter name in GetFileContentsAsString() since file() is a Python function. Python 3.1.1a0 (py3k:74094, Jul 19 2009, 13:39:42) [GCC 4.3.3] on linux2 Type "help

Re: Character encoding & the copyright symbol

2009-08-06 Thread Martin v. Löwis
> As a side note, you should probably use something other than "file" for > the parameter name in GetFileContentsAsString() since file() is a Python > function. Python 3.1.1a0 (py3k:74094, Jul 19 2009, 13:39:42) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more inform

Re: Character encoding & the copyright symbol

2009-08-06 Thread Nobody
On Thu, 06 Aug 2009 09:14:08 -0700, Robert Dailey wrote: > I'm loading a file via open() in Python 3.1 and I'm getting the > following error when I try to print the contents of the file that I > obtained through a call to read(): > > UnicodeEncodeError: 'charmap' codec can't encode character '\xa

Re: Character encoding & the copyright symbol

2009-08-06 Thread Richard Brodie
"Robert Dailey" wrote in message news:f64f9830-c416-41b1-a510-c1e486271...@g19g2000vbi.googlegroups.com... > As you can see, I am trying to load the file with encoding 'cp1252' > which, according to the python 3.1 docs, translates to windows-1252. I > also tried 'latin_1', which translates to I

Re: Character encoding & the copyright symbol

2009-08-06 Thread Philip Semanchuk
On Aug 6, 2009, at 12:41 PM, Robert Dailey wrote: On Aug 6, 11:31 am, "Richard Brodie" wrote: "Robert Dailey" wrote in message news:29ab0981-b95d-4435-91bd-a7a520419...@b15g2000yqd.googlegroups.com ... UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in position 1650: c

Re: Character encoding & the copyright symbol

2009-08-06 Thread Albert Hopkins
On Thu, 2009-08-06 at 09:14 -0700, Robert Dailey wrote: > Hello, > > I'm loading a file via open() in Python 3.1 and I'm getting the > following error when I try to print the contents of the file that I > obtained through a call to read(): > > UnicodeEncodeError: 'charmap' codec can't encode char

Re: Character encoding & the copyright symbol

2009-08-06 Thread Robert Dailey
On Aug 6, 11:31 am, "Richard Brodie" wrote: > "Robert Dailey" wrote in message > > news:29ab0981-b95d-4435-91bd-a7a520419...@b15g2000yqd.googlegroups.com... > > > UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in > > position 1650: character maps to <undefined> > > > The file is defined a

Re: Character encoding & the copyright symbol

2009-08-06 Thread Richard Brodie
"Robert Dailey" wrote in message news:29ab0981-b95d-4435-91bd-a7a520419...@b15g2000yqd.googlegroups.com... > UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in > position 1650: character maps to <undefined> > > The file is defined as ASCII. That's the problem: ASCII is a seven bit code.
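
A minimal Python 3 sketch of the point about seven-bit ASCII. The thread excerpt does not say which encoding the console used; cp437 is chosen here purely as an assumed example of a charmap-based code page that lacks the copyright sign, and try_encode is a hypothetical helper name.

```python
# U+00A9 COPYRIGHT SIGN lies outside the seven-bit ASCII range (0-127).
copyright_sign = "\xa9"


def try_encode(text, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)


# ASCII has no code for U+00A9, so encoding fails.
ascii_result = try_encode(copyright_sign, "ascii")

# cp437 is an assumed console code page; it has no copyright sign either,
# and it fails with the same 'charmap' wording quoted in the thread.
cp437_result = try_encode(copyright_sign, "cp437")

# cp1252 does define the character, at byte 0xA9.
cp1252_result = try_encode(copyright_sign, "cp1252")

print(ascii_result)
print(cp437_result)
print(cp1252_result)
```

If this reading is right, the failure comes from encoding the text for output, not from reading the file.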

Re: Character encoding & the copyright symbol

2009-08-06 Thread Philip Semanchuk
On Aug 6, 2009, at 12:14 PM, Robert Dailey wrote: Hello, I'm loading a file via open() in Python 3.1 and I'm getting the following error when I try to print the contents of the file that I obtained through a call to read(): UnicodeEncodeError: 'charmap' codec can't encode character '\xa9' in

Re: Character encoding

2006-11-08 Thread Frederic Rentsch
mp wrote: > I have html document titles with characters like &gt;, &nbsp;, and > ‡. How do I decode a string with these values in Python? > > Thanks > > This is definitely the most FAQ. It comes up about once a week. The stream-editing way is like this: >>> import SE >>> HTM_Decoder = SE.SE ('htm2is

Re: Character encoding

2006-11-08 Thread [EMAIL PROTECTED]
Dennis Lee Bieber wrote: > On 7 Nov 2006 11:34:32 -0800, "mp" <[EMAIL PROTECTED]> declaimed the > following in comp.lang.python: > > > I have html document titles with characters like &gt;, &nbsp;, and > > ‡. How do I decode a string with these values in Python? > > > > Wouldn't HTMLParser be suit

Re: Character encoding

2006-11-07 Thread Gabriel Genellina
At Tuesday 7/11/2006 17:10, mp wrote: I'd prefer a more generalized solution which takes care of all possible ampersand characters. I assume that there is code already written which does this. Try the htmlentitydefs module -- Gabriel Genellina Softlab SRL _
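
A short sketch of the generalized approach in current Python 3, where the htmlentitydefs module is named html.entities and html.unescape (Python 3.4+) decodes entity references in one call. The sample title string is made up for illustration.

```python
import html
from html.entities import name2codepoint

# Hypothetical document title containing named entity references.
title = "Tom &gt; Jerry&nbsp;Show"

# html.unescape replaces every entity reference it recognizes.
decoded = html.unescape(title)

# The lookup table behind the named entities maps names to code points.
nbsp = chr(name2codepoint["nbsp"])

print(repr(decoded))
print(repr(nbsp))
```

Note that &nbsp; decodes to U+00A0 (no-break space), not to an ordinary space, so a further replace is needed if plain spaces are wanted.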

Re: Character encoding

2006-11-07 Thread mp
I'd prefer a more generalized solution which takes care of all possible ampersand characters. I assume that there is code already written which does this. Thanks i80and wrote: > I would suggest using string.replace. Simply replace '&nbsp;' with ' ' > for each time it occurs. It doesn't take too much

Re: Character encoding

2006-11-07 Thread i80and
I would suggest using string.replace. Simply replace '&nbsp;' with ' ' for each time it occurs. It doesn't take too much code. On Nov 7, 1:34 pm, "mp" <[EMAIL PROTECTED]> wrote: > I have html document titles with characters like &gt;, &nbsp;, and > ‡. How do I decode a string with these values in Python? >

Re: character encoding conversion

2004-12-13 Thread "Martin v. Löwis"
Max M wrote: A simple way to try out different encodings in a given order: The loop is fine - although ('UTF-8', 'Latin-1', 'ASCII') is somewhat redundant. The 'ASCII' case is never considered, since Latin-1 effectively works as a catch-all encoding (as all byte sequences can be considered Latin-1
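
The catch-all behaviour can be sketched in Python 3 as follows. The thread's code targets Python 2 str objects; this version works on bytes instead, and decode_first_match is a hypothetical name, not the function from the thread.

```python
def decode_first_match(raw, encodings):
    """Try each encoding in order; return (text, encoding) for the first success."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


# cp1252 curly quotes: these bytes are not valid UTF-8.
cp1252_quotes = b"\x93quoted\x94"
text, used = decode_first_match(cp1252_quotes, ("utf-8", "latin-1", "ascii"))

# Latin-1 assigns a character to every byte value, so it never fails
# and any encoding listed after it is never reached.
all_bytes_as_latin1 = bytes(range(256)).decode("latin-1")

print(used)
print(len(all_bytes_as_latin1))
```

Because Latin-1 always succeeds, a successful decode here says nothing about whether Latin-1 was the right guess; the curly quotes above come out as control characters.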

Re: character encoding conversion

2004-12-13 Thread "Martin v. Löwis"
Christian Ergh wrote: Once more, indention should be correct now, and the 128 is gone too. So, something like this? Yes, something like this. The tricky part is, of course, the fragments which you didn't implement. Also, it might be possible to do this in a for loop, e.g. for encoding in (pag

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Forgot a part... You need the encoding list: encodings = [ 'utf-8', 'latin-1', 'ascii', 'cp1252', ] Christian Ergh wrote: Dylan wrote: Here's what I'm trying to do: - scrape some html content from various sources The issue I'm running into: - some of the sources have incorrectly e

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Dylan wrote: Here's what I'm trying to do: - scrape some html content from various sources The issue I'm running into: - some of the sources have incorrectly encoded characters... for example, cp1252 curly quotes that were likely the result of the author copying and pasting content from Word Finally:

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
- snip - def get_encoded(st, encodings): "Returns an encoding that doesn't fail" for encoding in encodings: try: st_encoded = st.decode(encoding) return st_encoded, encoding except UnicodeError: pass -snip- This works fine, but after this

Re: character encoding conversion

2004-12-13 Thread Max M
Christian Ergh wrote: A simple way to try out different encodings in a given order: # -*- coding: latin-1 -*- def get_encoded(st, encodings): "Returns an encoding that doesn't fail" for encoding in encodings: try: st_encoded = st.decode(encoding) return st_en

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Once more, indention should be correct now, and the 128 is gone too. So, something like this? Chris import urllib2 url = 'www.someurl.com' f = urllib2.urlopen(url) data = f.read() # if it is not in the pagecode, how do i get the encoding of the page? pageencoding = '???' xmlencoding = 'whatever

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Peter Otten wrote: Steven Bethard wrote: Christian Ergh wrote: flag = true for char in data: if 127 < ord(char) < 128: flag = false if flag: try: data = data.encode('latin-1') except: pass A little OT, but (assuming I got your indentation right[1]) this kind of loop i

Re: character encoding conversion

2004-12-13 Thread Christian Ergh
Martin v. Löwis wrote: Dylan wrote: Things I have tried include encode()/decode() This should work. If you somehow manage to guess the encoding, e.g. guess it as cp1252, then htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace") will give you a file that contains only ASCII charact

Re: character encoding conversion

2004-12-13 Thread Peter Otten
Steven Bethard wrote: > Christian Ergh wrote: >> flag = true >> for char in data: >> if 127 < ord(char) < 128: >> flag = false >> if flag: >> try: >> data = data.encode('latin-1') >> except: >> pass > > A little OT, but (assuming I got your indentation right[1]

Re: character encoding conversion

2004-12-13 Thread Steven Bethard
Christian Ergh wrote: flag = true for char in data: if 127 < ord(char) < 128: flag = false if flag: try: data = data.encode('latin-1') except: pass A little OT, but (assuming I got your indentation right[1]) this kind of loop is exactly what the else clause of a
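
A sketch of the for/else idiom mentioned here, written for Python 3. It assumes the intended test was "any character above 127"; the quoted condition 127 < ord(char) < 128 can never be true for an integer. encode_if_ascii is a hypothetical name.

```python
def encode_if_ascii(data):
    """Encode data as Latin-1 only if every character is in the ASCII range."""
    for char in data:
        if ord(char) > 127:
            # A non-ASCII character was found; leaving the loop with
            # break skips the else clause below.
            break
    else:
        # Runs only when the loop finished without hitting break,
        # which replaces the separate flag variable.
        return data.encode("latin-1")
    return None


print(encode_if_ascii("abc"))
print(encode_if_ascii("caf\xe9"))
```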

Re: character encoding conversion

2004-12-12 Thread "Martin v. Löwis"
Christian Ergh wrote: - it works with the characters i mentioned It does. - what encoding do you have in the end US-ASCII - and how exactly are you doing all this? All with somestring.decode() or... Can you please give an example for these 7 steps? I could, but I don't have the time - just try to

Re: character encoding conversion

2004-12-12 Thread Christian Ergh
Martin v. Löwis wrote: Dylan wrote: Things I have tried include encode()/decode() This should work. If you somehow manage to guess the encoding, e.g. guess it as cp1252, then htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace") will give you a file that contains only ASCII charact

Re: character encoding conversion

2004-12-12 Thread "Martin v. Löwis"
Dylan wrote: Things I have tried include encode()/decode() This should work. If you somehow manage to guess the encoding, e.g. guess it as cp1252, then htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace") will give you a file that contains only ASCII characters, and character refer
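
The decode/encode chain suggested here can be sketched in Python 3, where the input is a bytes object (the thread's htmlstring is a Python 2 str). The sample bytes are made up and assume the cp1252 guess is correct.

```python
# Hypothetical scraped fragment with cp1252 curly quotes, as pasted from Word.
raw = b"\x93Hello\x94"

# Decode using the guessed encoding, then re-encode as ASCII; characters
# outside ASCII become numeric character references.
ascii_html = raw.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")

print(ascii_html)
```

The result contains only ASCII bytes, so it stays valid whatever encoding the page later declares.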