On Dec 3, 4:19 pm, inhahe <inh...@gmail.com> wrote:
> or i guess you could go the middle-way and just use regex.
> people generally say don't use regex for html (regex can't do the
> nesting), but it's what i would do in this case.
> though i don't exactly understand the question, re the html file
> parsing script you say you have already, or how the date is 'modified
> from' the meta-data.
>
> On Wed, Dec 2, 2009 at 10:24 PM, Mark G <markgraha...@gmail.com> wrote:
> > Hi all,
> >
> > I am new to python and don't yet know the libraries well. What would
> > be the best way to approach this problem: I have a html file parsing
> > script - the file sits on my harddrive. I want to extract the date
> > modified from the meta-data. Should I read through lines of the file
> > doing a string.find to look for the character patterns of the
> > meta-tag, or should I use a DOM type library to retrieve the html
> > element I want? Which is best practice? which occupies least code?
> >
> > Regards, Mark
> >
> > --
> > http://mail.python.org/mailman/listinfo/python-list

I'm tempted to use regex too. I have done a bit of perl & bash, and that
is how I would do it with those. However, I thought there would be a
smarter way to do it with libraries. I have done some digging through
the libraries and think I can do it with xml.dom.minidom. Here is what I
want to try...

    import xml.dom.minidom

    # if html file already exists, inherit the created date
    # 'output' is the filename for the parsed file
    html_xml = xml.dom.minidom.parse(output)
    for node in html_xml.getElementsByTagName('meta'):  # visit every <meta /> node
        # debug: print(node.toxml())
        # getAttribute() returns '' when the attribute is missing
        if node.getAttribute("name") == "DC.Date.Modified":
            date_created = node.getAttribute("content")
            print(date_created)

I haven't quite got up to here in my debugging. I'll let you know if it
works...

RE: your questions.
1) I already have the script in bash - wanting to convert to Python :)
   I'm halfway through.
2) I want to extract the value of the tag
   <metadata name="DC.Date.Modified" value="2009-11-17">
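
One caveat with the minidom approach: xml.dom.minidom.parse() expects well-formed XML, so an ordinary HTML file with an unclosed <meta> tag will likely fail to parse. As a possible alternative, here is a minimal sketch using the standard library's HTMLParser (written for Python 3, where it lives in html.parser). The names MetaDateParser and find_meta_content and the sample HTML are made up for illustration, and the sketch assumes the tag is written as <meta name="..." content="...">.

```python
from html.parser import HTMLParser


class MetaDateParser(HTMLParser):
    """Remember the content of the first <meta> tag with a given name."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name
        self.content = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag != "meta" or self.content is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.wanted_name:
            self.content = attributes.get("content")


def find_meta_content(html_text, wanted_name):
    """Return the matching meta tag's content, or None if there is none."""
    parser = MetaDateParser(wanted_name)
    parser.feed(html_text)
    parser.close()
    return parser.content


# Hypothetical sample input; note the <meta> tag is not closed.
sample = """<html><head>
<meta name="DC.Date.Modified" content="2009-11-17">
<title>Example</title>
</head><body><p>Hello</p></body></html>"""

date_modified = find_meta_content(sample, "DC.Date.Modified")
print(date_modified)
```

To use it on a file from disk, read the file's text first and pass that string to find_meta_content(). It is a little more code than the minidom loop, but it does not require the HTML to be valid XML.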