[Chris Lasher]
>   I'm trying to write a tool to scrape through some of the Ribosomal
> Database Project II's (http://rdp.cme.msu.edu/) pages, specifically,
> through the Hierarchy Browser. (http://rdp.cme.msu.edu/hierarchy/)

I'm sure that urllib is the right tool to use. However, there may be one or two problems with the way you're using it.

> --------excerpted HTML----------------

<!-- snip -->

> <form name="hierarchyForm" method="POST"
> action="HierarchyControllerServlet/start/">
> <input type='hidden' name='printParams' value='no' />

This field is missing from the params you are passing to the HierarchyControllerServlet. Although the "printParams" field is not visible to you in a browser, the browser still submits its name/value pair with the form. So you should also include it in your code, as shown below.

> <input id="bergeys" name="taxonomy" type="radio" value="rdpHome" checked>

Also, you are using the wrong value for the taxonomy field. You are setting a value of "bergeys", which is the ID of the field, not its value. The correct value is "rdpHome".
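
If you want to check which fields a browser sends by default, here is a rough sketch that pulls them out of the form HTML. It only looks at hidden inputs and checked radio buttons/checkboxes, which covers the two lines you excerpted; the FormFieldCollector class is just a name I made up for illustration, and the imports are written so it should run on either Python 2 or Python 3.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class FormFieldCollector(HTMLParser):
    """Collect (name, value) pairs for hidden and checked <input> fields."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        if name is None:
            return
        kind = (attrs.get("type") or "text").lower()
        if kind == "hidden":
            # Hidden fields are always submitted, even though you never see them.
            self.fields.append((name, attrs.get("value") or ""))
        elif kind in ("radio", "checkbox") and "checked" in attrs:
            # The "value" attribute is what gets submitted, not the "id".
            self.fields.append((name, attrs.get("value") or "on"))


html = """
<form name="hierarchyForm" method="POST"
action="HierarchyControllerServlet/start/">
<input type='hidden' name='printParams' value='no' />
<input id="bergeys" name="taxonomy" type="radio" value="rdpHome" checked>
</form>
"""

collector = FormFieldCollector()
collector.feed(html)
collector.close()
print(collector.fields)
# [('printParams', 'no'), ('taxonomy', 'rdpHome')]
```

Text inputs, selects and the submit button are not handled here, so treat the result as a starting point rather than the full set of params.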

> --------Python test code---------------
> #!/usr/bin/python
>
> import urllib
>
> options = [("strain", "type"), ("source", "both"),
>            ("size", "gt1200"), ("taxonomy", "bergeys"),
>            ("browse", "Browse")]

Try this instead:

options = [ ("printParams", "no"), ("strain", "type"),
            ("source", "both"), ("size", "gt1200"),
            ("taxonomy", "rdpHome"), ("browse", "Browse"),]
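
To see exactly what would be posted, you can print the encoded params before sending anything. This is only a quick check, with the import written so it should work on Python 2 (urllib.urlencode) or Python 3 (urllib.parse.urlencode):

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

options = [("printParams", "no"), ("strain", "type"),
           ("source", "both"), ("size", "gt1200"),
           ("taxonomy", "rdpHome"), ("browse", "Browse")]

# A list of tuples keeps the fields in the order given.
params = urlencode(options)
print(params)
# printParams=no&strain=type&source=both&size=gt1200&taxonomy=rdpHome&browse=Browse
```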

>
> params = urllib.urlencode(options)
>
> rdpbrowsepage = urllib.urlopen(
>     "http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start",
>     params)
>
> pagehtml = rdpbrowsepage.read()
>
> print pagehtml
> ---------end Python test code----------
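
Since the form uses method="POST", it may also be worth confirming that your request really goes out as a POST. A small sketch, assuming the same URL as in your test code (note that the form's action attribute has a trailing slash, which I have not added here); build_request is a made-up helper name, and nothing is sent over the network:

```python
try:
    from urllib.parse import urlencode   # Python 3
    from urllib.request import Request
except ImportError:
    from urllib import urlencode         # Python 2
    from urllib2 import Request

# URL copied from the test code above; adjust if the servlet expects the
# trailing slash shown in the form's action attribute.
URL = "http://rdp.cme.msu.edu/hierarchy/HierarchyControllerServlet/start"


def build_request(options, url=URL):
    """Build (but do not send) a request carrying the form fields."""
    body = urlencode(options).encode("ascii")
    # Supplying data makes the request a POST instead of a GET.
    return Request(url, data=body)


options = [("printParams", "no"), ("strain", "type"),
           ("source", "both"), ("size", "gt1200"),
           ("taxonomy", "rdpHome"), ("browse", "Browse")]

request = build_request(options)
print(request.get_method())
# POST
```

Passing the request object to urlopen() would then actually submit it.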

HTH,

--
alan kennedy
------------------------------------------------------
email alan:              http://xhaus.com/contact/alan