KevinUT wrote:

> Hello Folks:
>
> I want to globally change the following:
>
>     <a href="http://www.mysite.org/?page=contacts"><font color="#269BD5">
>
> into:
>
>     <a href="pages/contacts.htm"><font color="#269BD5">
>
> You'll notice that the match would be http://www.mysite.org/?page= but
> I also need to add a ".htm" to the end of "contacts" so it becomes
> "contacts.htm". This part of the URL is variable, so how can I use a
> combination of Python and/or a regular expression to replace the match
> above and also add a ".htm" to the end of that variable part?
>
> Here are a few dummy URLs for example so you can see the pattern and
> the variable too.
>
>     <a href="http://www.mysite.org/?page=newsletter"><font color="#269BD5">
>
> change to:
>
>     <a href="pages/newsletter.htm"><font color="#269BD5">
>
>     <a href="http://www.mysite.org/?page=faq">
>
> change to:
>
>     <a href="pages/faq.htm">
>
> So, again the script needs to replace all the full absolute URL links
> with nothing and replace the PHP "?page=" with just the variable page
> name (i.e. contacts) plus the ".htm"
>
> Is there a combination of Python code and/or regex that can do this?
> Any help would be greatly appreciated!

Don't know if the following will in practice be more reliable than a
simple regex, but here goes:

    # Python 2, using BeautifulSoup 3
    import sys
    import urlparse
    from BeautifulSoup import BeautifulSoup as BS

    if __name__ == "__main__":
        # Read the HTML file named on the command line
        html = open(sys.argv[1]).read()
        bs = BS(html)
        for a in bs("a"):
            href = a["href"]
            url = urlparse.urlparse(href)
            # Only rewrite links that point at the old site
            if url.netloc == "www.mysite.org":
                qs = urlparse.parse_qs(url.query)
                # "?page=contacts" becomes "pages/contacts.htm"
                a["href"] = "pages/" + qs[u"page"][0] + ".htm"
        print
        print bs

Peter
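
For comparison, the "simple regex" route mentioned above could look roughly like the sketch below. It is written for Python 3 and uses only the standard library. It assumes every link is written exactly as in the examples from the question (double-quoted href, host www.mysite.org, a single "page" parameter). The names LINK_RE and rewrite_links are made up for illustration and do not come from the thread.

```python
import re

# Assumption: links look exactly like the examples in the question,
# i.e. href="http://www.mysite.org/?page=NAME" with double quotes
# and no further query parameters.
LINK_RE = re.compile(r'href="http://www\.mysite\.org/\?page=([^"&]+)"')


def rewrite_links(html):
    """Turn absolute ?page=NAME links into relative pages/NAME.htm links."""
    # \1 is the captured page name (e.g. "contacts")
    return LINK_RE.sub(r'href="pages/\1.htm"', html)


if __name__ == "__main__":
    sample = '<a href="http://www.mysite.org/?page=faq">'
    print(rewrite_links(sample))
```

Links that differ from that shape (single quotes, extra parameters, a different host) are left untouched by this pattern, which is the kind of case where the parser-based approach is likely to hold up better.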