I'm sorry, I have misinterpreted your question.

On Mon, Sep 26, 2016 at 12:59:04PM -0400, bruce wrote:
> I've got a page from a web fetch. I'm simply trying to go from utf-8 to
> ascii.

Why would you do that? It's 2016, not 1963, and ASCII is well and truly
obsolete. (ASCII was obsolete even in 1963: even then there were
characters in common use in American English that couldn't be written
in ASCII, like ¢.) Any modern program should be dealing with UTF-8.

Nevertheless, assuming you have a good reason: you are dealing with
data scraped from a webpage, so it is likely to include HTML escape
codes, as you have already learned. So you need to go from something
like this:

    &ndash;

*first* to the actual EN DASH character, and then to the - hyphen. And
remember that HTML supports all(?) of Unicode via character escapes, so
you shouldn't assume that this is the only Unicode character.

Assuming you scrape the data from the webpage as a byte string, you'll
have something like this:

    data = "hello world &ndash; goodbye"  # byte-string read from HTML page

    from HTMLParser import HTMLParser
    parser = HTMLParser()
    text = parser.unescape(data)
    print text

which should display:

    hello world – goodbye

including the en-dash. So now you have a Unicode string, which you can
manipulate any way you like:

    text = text.replace(u'–', u'--')  # remember to use Unicode strings here

See also the text.translate() method if you have to do lots of changes
in one go.

Lastly you can convert to an ASCII byte-string using the encode method.
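For anyone on Python 3, the steps so far, plus the encode() calls covered next, might look roughly like the sketch below. It assumes html.unescape() from the standard library in place of HTMLParser().unescape(); the sample bytes and the replacement table are made up for illustration and are not taken from the original page.

```python
import html

# Hypothetical sample input standing in for bytes fetched from a web page.
raw = b"hello world &ndash; goodbye"

# Decode the bytes to text, then turn HTML escapes into real characters.
text = html.unescape(raw.decode("utf-8"))

# Map several non-ASCII characters to ASCII stand-ins in one pass.
# The replacements chosen here are illustrative assumptions.
table = str.maketrans({
    "\u2013": "--",  # EN DASH
    "\u2014": "--",  # EM DASH
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})
ascii_ready = text.translate(table)

# Strict encoding succeeds only because every non-ASCII character was mapped.
strict = ascii_ready.encode("ascii")

# Error handlers applied to the untranslated text, for comparison.
ignored = text.encode("ascii", "ignore")              # drops the dash
replaced = text.encode("ascii", "replace")            # dash becomes ?
xmlref = text.encode("ascii", "xmlcharrefreplace")    # dash becomes &#8211;
```

With the sample input, strict comes out as b'hello world -- goodbye', while xmlref keeps the dash as the numeric escape &#8211;.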
By default, encode() raises an exception if there are any non-ASCII
characters in your text string:

    data = text.encode('ascii')

You can also skip non-ASCII characters, replace them with question
marks, or replace them with an escape code:

    data = text.encode('ascii', 'ignore')
    data = text.encode('ascii', 'replace')
    data = text.encode('ascii', 'xmlcharrefreplace')

which will finally give you something suitable for use in programs
written in the 1970s :-)


-- 
Steve