[Chris Lasher]
> I'm trying to write a tool to scrape through some of the Ribosomal
> Database Project II's (http://rdp.cme.msu.edu/) pages, specifically,
> through the Hierarchy Browser. (http://rdp.cme.msu.edu/hierarchy/)
I'm sure that urllib is the right tool to use. However, there may be one or two problems with the way you're using it.
> --------excerpted HTML----------------
<!-- snip -->
> <form name="hierarchyForm" method="POST"
> action="HierarchyControllerServlet/start/">
> <input type='hidden' name='printParams' value='no' />
This field is missing from the params you are passing to the HierarchyControllerServlet. Although the "printParams" field is not visible to you in a browser, the browser still submits its name/value pair as part of the form submission. So you should also include it in your code, as shown below.
> <input id="bergeys" name="taxonomy" type="radio" value="rdpHome" checked>
Also, you are using the wrong value for the taxonomy field. You are setting a value of "bergeys", which is the ID of the field, not its value. The correct value is "rdpHome".
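One way to avoid both mistakes is to let a parser list the name/value pairs the form would actually submit. The following is a minimal sketch, written for Python 3's html.parser rather than the Python 2 of the original script; the FORM_HTML string is just the excerpt quoted above with a closing tag added, and InputCollector and form_fields are hypothetical names chosen for illustration. It only looks at <input> elements and ignores <select> and <textarea>.

```python
from html.parser import HTMLParser

# The excerpted form from the page, with a closing </form> added.
FORM_HTML = """
<form name="hierarchyForm" method="POST"
      action="HierarchyControllerServlet/start/">
<input type='hidden' name='printParams' value='no' />
<input id="bergeys" name="taxonomy" type="radio" value="rdpHome" checked>
</form>
"""


class InputCollector(HTMLParser):
    """Collect (name, value) pairs for the <input> fields a browser would send."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if "name" not in attrs:
            return  # inputs without a name are never submitted
        kind = (attrs.get("type") or "").lower()
        # Radio buttons and checkboxes are only submitted when checked.
        if kind in ("radio", "checkbox") and "checked" not in attrs:
            return
        # The submitted value comes from "value", never from "id".
        self.fields.append((attrs["name"], attrs.get("value") or ""))


def form_fields(html):
    """Return the list of (name, value) pairs found in the given HTML."""
    collector = InputCollector()
    collector.feed(html)
    return collector.fields


if __name__ == "__main__":
    for name, value in form_fields(FORM_HTML):
        print(name, "=", value)
```

Run against the excerpt, this reports the hidden printParams field and shows that the taxonomy radio button submits "rdpHome", not its id "bergeys".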
> --------Python test code---------------
> #!/usr/bin/python
>
> import urllib
>
> options = [("strain", "type"), ("source", "both"),
>            ("size", "gt1200"), ("taxonomy", "bergeys"),
>            ("browse", "Browse")]
Try this:

options = [
    ("printParams", "no"),
    ("strain", "type"),
    ("source", "both"),
    ("size", "gt1200"),
    ("taxonomy", "rdpHome"),
    ("browse", "Browse"),
]
>
> params = urllib.urlencode(options)
>
> rdpbrowsepage = urllib.urlopen(
>     "http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start",
>     params)
>
> pagehtml = rdpbrowsepage.read()
>
> print pagehtml
> ---------end Python test code----------
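For readers on Python 3, where urlencode and urlopen have moved to urllib.parse and urllib.request, the corrected request could be sketched roughly as follows. The field values and the URL are taken from the script above. The function names build_request and fetch are hypothetical, and whether the server still answers at this address is an assumption, so the sketch only builds the request until fetch() is called explicitly.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# URL exactly as used in the original script (assumed to still be valid).
URL = "http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start"

# Corrected form fields: the hidden printParams field is included and
# taxonomy carries the field's value ("rdpHome"), not its id ("bergeys").
OPTIONS = [
    ("printParams", "no"),
    ("strain", "type"),
    ("source", "both"),
    ("size", "gt1200"),
    ("taxonomy", "rdpHome"),
    ("browse", "Browse"),
]


def build_request(url=URL, options=OPTIONS):
    """Build a POST request carrying the URL-encoded form fields."""
    # In Python 3 the request body must be bytes, not str.
    data = urlencode(options).encode("ascii")
    # Supplying data makes urllib send a POST instead of a GET.
    return Request(url, data=data)


def fetch(url=URL, options=OPTIONS):
    """Submit the form and return the raw response body (needs network access)."""
    with urlopen(build_request(url, options)) as response:
        return response.read()
```

Calling fetch() performs the actual submission; build_request() alone is enough to inspect what would be sent, for example via its data attribute.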
HTH,
--
alan kennedy
------------------------------------------------------
email alan: http://xhaus.com/contact/alan