Thanks Martin for your comments. Indeed, the charset of the web document is set in the meta tag; it's iso-8859-1, so I'll decode it to unicode using something like:

html = html.decode('iso-8859-1')

html then contains the unicode version of the HTML document.
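Eventually I should probably detect the charset instead of hard-coding it. From your explanation quoted below, something along these lines should be close. This is a rough, untested sketch: it uses plain urllib instead of my cachedhttp fetcher, the helper name fetch_and_guess_charset is made up, and the meta-tag lookup is a very naive regex:

import re, urllib

def fetch_and_guess_charset(url, default='iso-8859-1'):
    f = urllib.urlopen(url)
    data = f.read()
    # 1) charset= attribute of the Content-Type header, if the server sends one
    charset = f.info().getparam('charset')
    # 2) otherwise a (naive) look for the meta http-equiv charset in the document
    if charset is None:
        m = re.search('charset=["\']?([A-Za-z0-9_.-]+)', data, re.IGNORECASE)
        if m:
            charset = m.group(1)
    return data, charset or default

data, charset = fetch_and_guess_charset('http://www.canalplus.fr/pid6.htm')
data = data.decode(charset)

For now the hard-coded 'iso-8859-1' does the job.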
As I've finally managed to make this work, I'll post here my comments on the few things I still don't understand; maybe you can explain why it works that way with more technical terms than I can provide myself. The whole thing is to regex-parse some HTML document and store the results inside an XML file that can be parsed again by Python minidom for further use.

############### CODE START ###############

import urllib, string, codecs, types
import sys, traceback, os.path, re, shutil
import cachedhttp
from xml.dom.minidom import parse, parseString

NODE_ELEMENT = 1
NODE_ATTRIBUTE = 2
NODE_TEXT = 3
NODE_CDATA_SECTION = 4

# Some xml helper functions (they need to be defined before the loop below uses them)

# GetNodeText returns a unicode object
def GetNodeText(node):
    dout = u''
    for tnode in node.childNodes:
        if tnode.nodeType == NODE_TEXT or tnode.nodeType == NODE_CDATA_SECTION:
            dout = dout + tnode.nodeValue
    return dout

# GetNodeValue returns a unicode object or None
def GetNodeValue(node, tag=None):
    if tag is None:
        return GetNodeText(node)
    nattr = node.attributes.getNamedItem(tag)
    if nattr is not None:
        return nattr.value
    for child in node.childNodes:
        if child.nodeName == tag:
            return GetNodeText(child)
    return None

httpFetcher = cachedhttp.CachedHTTP()

# Fetch the menu links page. httpFetcher is from the cachedhttp lib developed
# by someone else for another script; it returns a byte string read from the
# local cached file (once downloaded from the internet) with a simple
# f = open(file, 'r'); f.read()
data = httpFetcher.urlopen('http://www.canalplus.fr/pid6.htm')
data = data.decode('iso-8859-1')
# at that point I have my HTML document in unicode

# utf8bin.xml is a utf-8 encoded XML file; "bin" is because of the way I have
# to save it back to file, see at the bottom
dom = parse('utf8bin.xml')

# find the data we need from the HTML document
# title contains the text and so some special chars
x = re.compile('<li[^>]*>[^<]*<a href="http://www.canalplus.fr/(?P<url>[^"]+)"[^>]*>(?:<b>)?(?P<title>[^<]+)(?:</b>)?</a>[^<]*</li>',
               re.DOTALL | re.IGNORECASE | re.UNICODE)

for match in x.finditer(data):
    urlid = match.group('url')
    url = match.expand(r'http://www.canalplus.fr/\g<url>')
    title = match.expand(r'\g<title>')
    # everything here is still unicode objects

    # reuse the name to hold the existing xml node with the same title, if any
    match = None
    nodes = dom.getElementsByTagName('page')
    for node in nodes:
        if GetNodeValue(node, 'title') == title:
            print 'Found Match: ' + title + ' == ' + GetNodeValue(node, 'title')
            match = node
            break

    if match is None:
        # create page node and set attributes
        newnode = dom.createElement('page')
        att = dom.createAttribute('id')
        newnode.setAttributeNode(att)
        newnode.setAttribute('id', urlid)
        # create title child node and set CDATA section
        vnode = dom.createElement('title')
        newnode.appendChild(vnode)
        dnode = dom.createCDATASection(title)
        vnode.appendChild(dnode)
        # create value child node and set CDATA section
        vnode = dom.createElement('value')
        newnode.appendChild(vnode)
        dnode = dom.createCDATASection(url)
        vnode.appendChild(dnode)
        root = dom.documentElement
        root.appendChild(newnode)

# toxml(encoding="utf-8") returns an already-encoded byte string,
# so write it back in binary mode with no further encoding
f = open('utf8bin.xml', 'wb')
f.write(dom.toxml(encoding="utf-8"))
f.close()

# just to make sure we can still parse our xml file
print '\nParsing utf8bin.xml and Printing titles'
dom = parse('utf8bin.xml')
nodes = dom.getElementsByTagName('page')
for node in nodes:
    print GetNodeValue(node, 'title')

############### CODE END ###############

Now the comments: what I understood of all this is that once you're using unicode objects, you're safe!
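Just to illustrate what I mean by "safe", and the kind of implicit conversion that is not safe, here's a quick interpreter session with made-up strings:

>>> title = u'Cin\xe9ma'          # a unicode object, like the ones my regex gives me
>>> title.encode('utf-8')         # explicit encode: fine
'Cin\xc3\xa9ma'
>>> title + 'm\xe9t\xe9o'         # mixing with a non-ascii byte string: boom
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 1: ordinal not in range(128)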
At least, as long as you don't use statements or operators that implicitly try to convert the unicode object back to a byte string using your default encoding (ascii), which will most certainly result in codec errors.

Also, minidom seems to use unicode objects, which wasn't really documented in the Python 2.3 docs I read about it, so passing the unicode objects from my regex matches to minidom elements makes minidom behave nicely. If you start passing encoded byte strings to minidom elements, it may fail when you call toxml(). I know I managed to do that once or twice; I don't remember exactly what kind of byte strings I passed to the minidom element, but one thing's for sure, it made toxml() fail whatever encoding you specify.

So if you stick to unicode, minidom will encode all that unicode content to whatever encoding you've specified when calling dom.toxml(encoding="utf-8"), and then you just have to store that output as it is, without any further encoding. As a matter of fact, the following sequence will most certainly fail, I think because the codecs writer expects unicode and here gets an already-encoded byte string:

f = codecs.open('utf8codecs.xml', 'w', 'utf-8')
f.write(dom.toxml(encoding="utf-8"))
f.close()

Then again, maybe this will work (I just thought of it), since toxml() without an encoding argument seems to return a unicode object, which is what the codecs writer expects:

f = codecs.open('utf8codecs.xml', 'w', 'utf-8')
f.write(dom.toxml())
f.close()

I didn't understand at first that once you're using unicode objects, and as long as you've properly decoded your byte-string source, then unicode is unicode and you can forget about the encodings ("ascii", "iso-...", "utf-..."). The next important thing is to make sure you use functions and objects that support unicode all the way, like minidom seems to do. My original script has another function, FindDataNode, that does a more sophisticated loop over the dom object you provide, in order to check whether there's already a node with the same title; in there I use some .lower() methods and another Sanitize function that replaces a few chars. So I guess I'll have to make sure that none of those manipulations converts my unicode objects back to byte strings.
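To reassure myself about that last point, a quick look in the interpreter suggests the usual string methods give back unicode as long as they start from unicode (the Sanitize below is just a made-up stand-in for my real function):

>>> title = u'M\xe9t\xe9o'
>>> title.lower()
u'm\xe9t\xe9o'
>>> def Sanitize(text):
...     # stand-in: as long as the replacement strings are unicode
...     # (or plain ascii), the result stays unicode
...     return text.replace(u'\xe9', u'e')
...
>>> Sanitize(title.lower())
u'meteo'
>>> type(Sanitize(title.lower()))
<type 'unicode'>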
Thanks for reading. Let me know if you see really, really weird (bad?) things in my code, or if you have further comments to add on the unicode topic.

Marc

Martin v. Löwis wrote:
> webdev wrote:
>
>>1. when fetching a web page from the net, how am i supposed to know how
>>it's encoded.. And can i decode it to unicode and encode it back to a
>>byte string so i can use it in my code, with the charsets i want, like
>>utf-8.. ?
>
> It depends on the content type. If the HTTP header declares a charset=
> attribute for content-type, then use that (beware: some web servers
> report the content type incorrectly. To deal with that gracefully,
> you have to implement very complex algorithms, which are part of
> any recent web browser).
>
> If there is no charset= attribute, then
> - if the content type is text/html, look at a meta http-equiv tag
>   in the content. If that declares a charset, use that.
> - if the content type is xml (plain, or xhtml+xml), look at the
>   XML declaration. Alternatively, pass it to your XML parser.
>
>>2. in the same idea could anyone try to post the few lines that would
>>actually parse an xml file, with non ascii chars, with minidom
>>(parseString i guess).
>
> doc = xml.dom.minidom.parse("foo.xml")
>
>>Then convert a string grabbed from the net so parts of it can be
>>inserted in that dom object into new nodes or existing nodes.
>
> doc.documentElement.setAttribute("bar", text_from_net.decode("koi-8r"))
>
>>And finally write that dom object back to a file in a way it can be used
>>again later with the same script..
>
> open("/tmp/foo.txt","w").write(doc.toxml())
>
>>I've been trying to do that for a few days with no luck..
>>I can do each separate part of the job, not that i'm quite sure how i
>>decode/encode stuff in there, but as soon as i try to do everything at
>>the same time i get encoding errors thrown all the time..
>
> It would help if you would state what precise code you are using,
> and what precise error you are getting (for what precise input).
>
>>3. in order to help me understand what's going on when doing
>>encodes/decodes could you please tell me if in the following example, s
>>and backToBytes are actually the same thing ??
>>
>>s = "hello normal string"
>>u = unicode( s, "utf-8" )
>>backToBytes = u.encode( "utf-8" )
>>
>>i know they both are bytestrings but i doubt they have actually the same
>>content..
>
> They do have the same content. There is nothing to a byte string except
> for the bytes. If the byte string is meant to represent characters,
> they are the same "thing" only if the assumed encoding is the same.
> Since the assumed encoding is "utf-8" for both s and backToBytes,
> they are the same thing.
>
>>4. I've also tried to set the default encoding of python for my script
>>using the sys.setdefaultencoding('utf-8') but it keeps telling me that
>>this module does not have that method.. i'm left no choice but to edit
>>the site.py file manually to change "ascii" to "utf-8", but i won't be
>>able to do that on the client computers so..
>
> Don't do that. It's meant as a last resort for backwards compatibility,
> and shouldn't be used for new code.
>
> Regards,
> Martin

--
http://mail.python.org/mailman/listinfo/python-list