I looked at HtmlCleaner, and it pretty much cleans up the 'syntax' of the HTML but does nothing with the 'semantics' (ads, etc.).

Bruce Williams

Concepts, like individuals, have their histories and are just as incapable of withstanding the ravages of time as are individuals. But in and through all this they retain a kind of homesickness for the scenes of their childhood.
Soren Kierkegaard

On Sun, Jun 5, 2011 at 8:04 PM, Andreas Kostler <andreas.koestler.le...@gmail.com> wrote:
> There's a Java library called HtmlCleaner. You might wanna give that a shot.
> Btw, I'm working on quite a similar project so if you like email me and we
> can maybe join forces.
> Andreas
>
> On 06/06/2011, at 11:01 AM, Base wrote:
>
>> hi all,
>>
>> I am working on an app that will parse web pages to do some NLP and
>> statistics. I am able to parse the HTML using several different tool
>> ( enlive, HTML parser, etc). However I would like to discard all the
>> rest of the junk in the web page that is not pertinent (I.e. Ads).
>> Does anyone have any experience doing this? Any tips On how to do
>> this - or even better, tools that you can recommend? I have been
>> digging around on this for a while now and am stuck!
>>
>> Thanks!
>>
>> Base
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clojure@googlegroups.com
>> Note that posts from new members are moderated - please be patient with your
>> first post.
>> To unsubscribe from this group, send email to
>> clojure+unsubscr...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en