Hi, I'd like to verify some (x)html / html5 / xml documents from a server.
These documents use only a small number of different doctypes / DTDs, so I would like to build a small DTD cache, plus some code that avoids fetching the same DTDs from the net over and over.

What would be the best way to do this?

I guess that the fields of an ElementTree that I have to look at are

    docinfo.public_id
    docinfo.system_uri

There is also mention of a catalogue, but I don't know how to use a catalog, or how to find out what is in my catalogue and what isn't.

Below is a non-working skeleton (first shot):
---------------------------------------------
Would this be the right way?

### Functions marked with '???' are not implemented / are the ones
### where I don't know whether they exist already.
import os
import urllib
from lxml import etree

cache_dir = os.path.join(os.environ['HOME'], '.my_dtd_cache')

def get_from_cache(docinfo):
    """ the function which I'd like to implement most efficiently """
    fpi = docinfo.public_id
    uri = docinfo.system_uri
    dtd = ???get_from_dtd_cache(fpi, uri)
    if dtd is not None:
        return dtd
    # how can I check what is in my 'catalogue'
    if ???dtd_in_catalogue(??):
        return ???get_dtd_from_catalogue???
    dtd_filename = ???create_cache_filename(docinfo)
    # urlretrieve performs the download itself; no separate urlopen is needed
    (fname, _headers) = urllib.urlretrieve(uri, dtd_filename)
    return etree.DTD(fname)

def check_doc_cached(filename):
    """ function, which should report errors if a doc doesn't validate. """
    doc = etree.parse(filename)
    dtd = get_from_cache(doc.docinfo)
    rslt = dtd.validate(doc)
    if not rslt:
        print("validate error:")
        print(dtd.error_log.filter_from_errors()[0])
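
To make the cache part of the question more concrete, here is a minimal sketch of what the two unimplemented helpers (create_cache_filename and get_from_dtd_cache) might look like. It is only a guess at one workable shape, not an lxml facility: the names cache_filename and fetch_dtd_path, the SHA-1 based file naming and the injectable download parameter are all my own assumptions. It is written for Python 3 (urllib.request) and deliberately uses only the standard library; the returned path would then be handed to etree.DTD(path), as in the skeleton.

```python
import hashlib
import os
import urllib.request

# Hypothetical cache location, mirroring cache_dir in the skeleton.
CACHE_DIR = os.path.join(os.path.expanduser("~"), ".my_dtd_cache")


def cache_filename(public_id, system_uri, cache_dir=CACHE_DIR):
    """Map a doctype's identifiers to a stable file name inside the cache."""
    # Prefer the public identifier; fall back to the system URI.
    key = public_id or system_uri
    if not key:
        raise ValueError("document has neither a public ID nor a system URI")
    # Hashing avoids characters such as '/' and ':' in the file name.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest + ".dtd")


def fetch_dtd_path(public_id, system_uri, cache_dir=CACHE_DIR,
                   download=urllib.request.urlretrieve):
    """Return a local path for the DTD, downloading only on a cache miss."""
    path = cache_filename(public_id, system_uri, cache_dir)
    if not os.path.exists(path):
        if not system_uri:
            raise ValueError("cache miss and no system URI to download from")
        os.makedirs(cache_dir, exist_ok=True)
        # 'download' is injectable so the lookup can be exercised offline.
        download(system_uri, path)
    return path
```

With this, get_from_cache(docinfo) could reduce to etree.DTD(fetch_dtd_path(docinfo.public_id, docinfo.system_uri)). One caveat I am aware of: a DTD that pulls in further files by relative path (the XHTML 1.0 DTDs do this for their entity sets) would probably need those files stored next to it, which the hashed names above do not handle. lxml also documents custom resolvers (etree.Resolver, registered via parser.resolvers.add), which may turn out to be a better fit than a hand-rolled cache.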