Hi All -

Thanks for your help! I found this last night and it looks pretty
promising. It is apparently part of Apache Tika (which I had never
heard of until now), which has a lot of interesting functionality!

https://boilerpipe-web.appspot.com/

Thanks!

On Jun 5, 11:14 pm, Bruce Williams <williams.br...@gmail.com> wrote:
> I looked at HtmlCleaner and it pretty much cleans up the 'syntax' of the
> HTML but does nothing with the 'semantics' - ads, etc.
>
> Bruce Williams
> Concepts, like individuals, have their histories and are just as incapable of
> withstanding the ravages of time as are individuals. But in and through all
> this they retain a kind of homesickness for the scenes of their childhood.
> Soren Kierkegaard
>
> On Sun, Jun 5, 2011 at 8:04 PM, Andreas Kostler
> <andreas.koestler.le...@gmail.com> wrote:
> > There's a Java library called HtmlCleaner. You might wanna give that a shot.
> > Btw, I'm working on quite a similar project, so if you like, email me and we
> > can maybe join forces.
> > Andreas
>
> > On 06/06/2011, at 11:01 AM, Base wrote:
>
> >> hi all,
>
> >> I am working on an app that will parse web pages to do some NLP and
> >> statistics. I am able to parse the HTML using several different tools
> >> (enlive, HTML parser, etc.). However, I would like to discard all the
> >> rest of the junk in the web page that is not pertinent (e.g. ads).
> >> Does anyone have any experience doing this? Any tips on how to do
> >> this or, even better, tools that you can recommend? I have been
> >> digging around on this for a while now and am stuck!
>
> >> Thanks!
>
> >> Base
>
> >> --
> >> You received this message because you are subscribed to the Google
> >> Groups "Clojure" group.
> >> To post to this group, send email to clojure@googlegroups.com
> >> Note that posts from new members are moderated - please be patient with 
> >> your first post.
> >> To unsubscribe from this group, send email to
> >> clojure+unsubscr...@googlegroups.com
> >> For more options, visit this group at
> >>http://groups.google.com/group/clojure?hl=en
>

