Hello,

I'm trying to write a tool to scrape some of the Ribosomal Database Project II's (http://rdp.cme.msu.edu/) pages, specifically the Hierarchy Browser (http://rdp.cme.msu.edu/hierarchy/). The Hierarchy Browser is first accessed through a page with a form. There are four fields, each with several options to choose from (Strain, Source, Size, and Taxonomy), and then a submit button labeled "Browse". The HTML of the form is as follows (note that I am also including the JavaScript code, as it is called by the submit button):

--------excerpted HTML----------------
<script language="Javascript">
function resetHiddenVar(){
    var f_form = document.forms['hierarchyForm'];
    f_form.action= "HierarchyControllerServlet/start";
    return ;
}
</script>

<form name="hierarchyForm" method="POST" action="HierarchyControllerServlet/start/">
<input type='hidden' name='printParams' value='no' />
<h1>Hierarchy Browser - Start</h1><div class="cart" style="float: right">[ <a href="hb_help.jsp">help</a> ]</div>
<p> </p>
<div id="options">
<table summary="options area" cellpadding="0" cellspacing="0" border="0"><tr><td align="left" valign="middle">
<table border="0" cellspacing="0" cellpadding="0" summary="Options" align="left" class="borderup">
<tr>
<th align="right" valign="middle" class="bottom greenbg" nowrap="nowrap">Strain:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="type" name="strain" type="radio" value="type"> <label for="type">Type</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="nontype" name="strain" type="radio" value="nontype"> <label for="nontype">Non Type</label> </td>
<td class="bottom formtext" nowrap="nowrap"><input name="strain" type="radio" id="strainboth" value="both" checked> <label for="strainboth">Both</label> </td>
</tr>
<tr>
<th align="right" valign="middle" class="bottom greenbg">Source:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="environmental" name="source" type="radio" value="environ"> <label for="environmental">Uncultured </label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="isolates" name="source" type="radio" value="isolates"> <label for="isolates">Isolates</label></td>
<td class="bottom formtext" nowrap="nowrap"><input name="source" type="radio" id="sourceboth" value="both" checked > <label for="sourceboth">Both</label></td>
</tr>
<tr>
<th align="right" valign="middle" class="bottom greenbg">Size:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="greaterthan1200" name="size" type="radio" value="gt1200" checked> <label for="greaterthan1200"><u>></u>1200</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="lessthan1200" name="size" type="radio" value="lt1200"> <label for="lessthan1200"><1200</label></td>
<td class="bottom formtext" nowrap="nowrap"><input id="sizeboth" name="size" type="radio" value="both"> <label for="sizeboth">Both</label></td>
</tr>
<tr>
<th align="right" valign="middle" class="bottom greenbg">Taxonomy:</th>
<td class="bottom formtext" nowrap="nowrap"><input id="bergeys" name="taxonomy" type="radio" value="rdpHome" checked> <label for="bergeys">Bergey's</label></td>
<td colspan="2" class="bottom formtext" nowrap="nowrap"><input id="ncbi" name="taxonomy" type="radio" value="ncbiHome"> <label for="ncbi">NCBI</label></td>
</tr>
</table>
</td>
<td align="left" valign="middle">
<input name="browse" type="submit" id="browse" onclick="resetHiddenVar(); return true;" value="Browse">
</td></tr></table></p>
</div> <!-- end options -->
</form>
----------end excerpted HTML--------------

The options I would like to simulate are browsing by strain = type, source = both, size = gt1200, and taxonomy = bergeys. I see that the form method is POST, and I read through the urllib documentation and saw that the syntax for POSTing is urllib.urlopen(url, data).

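
For comparison, here is a minimal sketch of the request body a browser would build from the form excerpted above. A browser submits each control's value attribute, not its id, so the Bergey's radio button (id="bergeys") is sent as taxonomy=rdpHome, and the hidden printParams field is sent as well. Whether the servlet actually requires these exact fields is an assumption that has not been tested against the server. The names FORM_FIELDS and build_post_body are made up for illustration, and the sketch uses the Python 3 module name urllib.parse (in Python 2 the same function is urllib.urlencode).

```python
from urllib.parse import urlencode

# Values copied from the form's value="..." attributes (not the id="..." ones).
# Assumption: the servlet expects the same fields a browser would submit.
FORM_FIELDS = [
    ("printParams", "no"),     # hidden input present in the form
    ("strain", "type"),
    ("source", "both"),
    ("size", "gt1200"),
    ("taxonomy", "rdpHome"),   # Bergey's radio: id="bergeys", value="rdpHome"
    ("browse", "Browse"),      # the submit button's name/value pair
]


def build_post_body(fields):
    """Return an application/x-www-form-urlencoded body as bytes."""
    return urlencode(fields).encode("ascii")


if __name__ == "__main__":
    print(build_post_body(FORM_FIELDS).decode("ascii"))
```

Printing the body before sending it makes it easy to compare against what the browser actually posts.
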
Since the submit button calls HierarchyControllerServlet/start (see the Javascript), I figure that the url I should be contacting is http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start

Thus, I came up with the following test code:

--------Python test code---------------
#!/usr/bin/python

import urllib

options = [("strain", "type"), ("source", "both"),
           ("size", "gt1200"), ("taxonomy", "bergeys"),
           ("browse", "Browse")]

params = urllib.urlencode(options)

rdpbrowsepage = urllib.urlopen(
    "http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start",
    params)

pagehtml = rdpbrowsepage.read()

print pagehtml
---------end Python test code----------

However, the page that is returned is an error page that says the request could not be completed. The correct page should show various bacterial taxonomies, which are clickable to reveal greater detail of that particular taxon.

I'm a bit stumped, and admittedly, I am in over my head on the subject matter of networking and web-clients. Perhaps I should be using the httplib module for connecting to the RDP instead, but I am unsure what methods I need to use to do this. This is complicated by the fact that these are JSP generated pages and I'm unsure what exactly the server requires before giving up the desired page. For instance, there's a jsessionid that's given and I'm unsure if this is required to access pages, and if it is, how to place it in POST requests.

If anyone has suggestions, I would greatly appreciate them. If any more information is needed that I haven't provided, please let me know and I'll be happy to give what I am able.

Thanks very, very much in advance.

Chris
-- 
http://mail.python.org/mailman/listinfo/python-list
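
On the jsessionid question, one possible approach is sketched below. It assumes the server hands out the session id as a cookie when the form page is first fetched; servlet containers can also embed it in URLs as ;jsessionid=..., and this sketch does not handle that case. A cookie-aware opener stores whatever cookie the first GET sets and sends it back on the POST. The function names are hypothetical and nothing here has been run against the RDP server. It uses Python 3 module names (the Python 2 equivalents are cookielib and urllib2).

```python
import http.cookiejar
import urllib.request
from urllib.parse import urlencode

BASE = "http://rdp.cme.msu.edu/hierarchy/"


def make_opener():
    """Build an opener that stores cookies and replays them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_start_request(fields):
    """Build (but do not send) the POST request for the form's action URL."""
    body = urlencode(fields).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(BASE + "HierarchyControllerServlet/start", data=body)


def browse(fields):
    """Fetch the form page first, then POST the form with the same opener."""
    opener, _jar = make_opener()
    # Assumption: this GET is where the server sets its session cookie.
    opener.open(BASE).read()
    with opener.open(build_start_request(fields)) as response:
        return response.read()
```

Calling browse() with the list of (name, value) pairs for the form would perform both requests; inspecting the cookie jar after the first GET shows whether a session cookie was actually issued.
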