Hi All - Thanks for your help! I found boilerpipe last night (link below) and it looks pretty promising. It is apparently part of Apache Tika, a project I had never heard of until now that has a lot of interesting functionality!
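In case it helps anyone following along: as far as I can tell, tools in this space work by scoring blocks of text (how many words, how much of the block is link text) rather than by cleaning up tags. Here is a rough, stdlib-only Python sketch of that general idea. It is not boilerpipe's actual algorithm or API; the names `BlockCollector` and `extract_main_text`, the list of block tags, and the thresholds are all made up for illustration and would need tuning on real pages.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries; a simplification chosen for this sketch.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "section", "article"}
# Tags whose contents are never counted as text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: (text, total_words, link_words)
        self._words = []       # words of the block currently being built
        self._link_words = 0   # how many of those words sit inside <a> tags
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it has any text, and start a new one.
        if self._words:
            text = " ".join(self._words)
            self.blocks.append((text, len(self._words), self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and are not mostly link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = [
        text
        for text, total, linked in collector.blocks
        if total >= min_words and linked / total <= max_link_density
    ]
    return "\n".join(kept)
```

Calling `extract_main_text(page_source)` on a page's HTML string returns the surviving blocks joined by newlines. Short blocks (menus, captions) and link-heavy blocks (navigation, many ads) get dropped, which is the 'semantics' part that a pure HTML cleaner does not attempt.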
https://boilerpipe-web.appspot.com/

Thanks!

On Jun 5, 11:14 pm, Bruce Williams <williams.br...@gmail.com> wrote:
> I looked at HtmlCleaner and it pretty much cleans up the 'syntax' of the
> html but does nothing with the 'semantics' - ads, etc.
>
> Bruce Williams
> Concepts, like individuals, have their histories and are just as incapable of
> withstanding the ravages of time as are individuals. But in and through all this
> they retain a kind of homesickness for the scenes of their childhood.
> Soren Kierkegaard
>
> On Sun, Jun 5, 2011 at 8:04 PM, Andreas Kostler
> <andreas.koestler.le...@gmail.com> wrote:
> > There's a Java library called HtmlCleaner. You might wanna give that a shot.
> > Btw, I'm working on quite a similar project so if you like email me and we
> > can maybe join forces.
> > Andreas
> >
> > On 06/06/2011, at 11:01 AM, Base wrote:
> >> hi all,
> >>
> >> I am working on an app that will parse web pages to do some NLP and
> >> statistics. I am able to parse the HTML using several different tools
> >> (enlive, HTML parser, etc). However I would like to discard all the
> >> rest of the junk in the web page that is not pertinent (i.e. ads).
> >> Does anyone have any experience doing this? Any tips on how to do
> >> this - or even better, tools that you can recommend? I have been
> >> digging around on this for a while now and am stuck!
> >>
> >> Thanks!
> >>
> >> Base
> >>
> >> --
> >> You received this message because you are subscribed to the Google
> >> Groups "Clojure" group.
> >> To post to this group, send email to clojure@googlegroups.com
> >> Note that posts from new members are moderated - please be patient with
> >> your first post.
> >> To unsubscribe from this group, send email to
> >> clojure+unsubscr...@googlegroups.com
> >> For more options, visit this group at
> >> http://groups.google.com/group/clojure?hl=en