Are there any libraries that can extract structured web content as 
represented visually in the browser?

I realize I could write regexes against the raw HTML, but I was wondering 
whether there is something that works with the browser-rendered 
representation instead, i.e., something a bit more human-readable. I'd like 
a simple syntax that describes the input pattern as it appears in the 
browser and maps it into a structured list/map/array.

For instance, from the UMichigan Online Library:

Poetry Here and Then
A sampling of the papers of Michigan poets from various collections housed 
at the Bentley Historical Library, featuring handwritten and typed 
manuscripts, letters and essays as well as photographs, sketches, 
certificates and other personal items.
Format: Image Collections
Access: public
Search within group: University of Michigan Collections
Sponsor: Digital Library Production Service
Statistics Detail: statistics detail 


[Next record, same format]  


That presentation is standardized and repeated for more than a hundred items.

I'd like to be able to easily turn it into something like:

'(:title "Poetry Here and Then" :description "A sampling..."
  :format "Image Collections" :access "public" ... etc.)
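
To make the target mapping concrete, here is a minimal sketch of what I have in mind. It assumes the rendered text of one record is already available as a plain string (for example, copied out of the browser), and that every record follows the layout above: title on the first line, description lines next, then "Label: value" lines. The names record-text, label->keyword and parse-record are made up for illustration, the sample text is abbreviated, and the sketch returns a map rather than the quoted list shown above.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample: the rendered text of one record (abbreviated).
(def record-text
  "Poetry Here and Then
A sampling of the papers of Michigan poets.
Format: Image Collections
Access: public
Sponsor: Digital Library Production Service")

(defn label->keyword
  "Turns a label such as \"Search within group\" into :search-within-group."
  [label]
  (-> label
      str/trim
      str/lower-case
      (str/replace #"\s+" "-")
      keyword))

(defn parse-record
  "Assumption: line 1 is the title, the lines up to the first
  \"Label: value\" line form the description, and the rest are fields."
  [text]
  (let [lines  (->> (str/split-lines text)
                    (map str/trim)
                    (remove str/blank?))
        ;; A field line looks like a capitalized label, a colon, then a value.
        field? #(re-matches #"[A-Z][A-Za-z ]*: .*" %)
        [title & more] lines
        [desc fields]  (split-with (complement field?) more)]
    (into {:title       title
           :description (str/join " " desc)}
          (for [line fields
                ;; Split on the first colon only, so values may contain colons.
                :let [[label value] (str/split line #":\s*" 2)]]
            [(label->keyword label) value]))))

(parse-record record-text)
```

For the sample above this yields a map with :title, :description, :format, :access and :sponsor keys. A description line that happens to look like "Label: value" would be misread as a field, which is exactly the sort of fragility I'd hope a library would handle for me.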


Again, I know I can do it with regex parsing of the HTML itself, but I was 
wondering if there were any libraries to make that process smoother.


Thanks,
Jonathan
