I looked at HtmlCleaner and it pretty much cleans up the 'syntax' of the
HTML but does nothing about the 'semantics' (ads, etc.).
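
To make the syntax/semantics distinction concrete, here is a rough sketch of
what a 'semantic' filter could look like. It is not anything HtmlCleaner
provides. It assumes a simple link-density heuristic: blocks that are short or
made up mostly of link text are treated as ads/navigation and dropped. The
names BlockCollector and main_text and both thresholds are made up for
illustration, and real pages would need tuning or a dedicated
boilerplate-removal tool.

```python
from html.parser import HTMLParser

# Hypothetical sketch: split a page into block-level chunks and record how
# much of each chunk's text sits inside <a> tags.
BLOCK_TAGS = {"p", "div", "li", "td"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars) tuples
        self._text = []         # raw text pieces of the current block
        self._link_chars = 0    # characters of the current block inside links
        self._in_link = 0       # nesting depth of <a>
        self._skip = 0          # nesting depth of <script>/<style>

    def flush(self):
        # Close the current block, collapsing runs of whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def main_text(html, min_chars=40, max_link_density=0.3):
    """Return the text blocks that look like content rather than ads/nav.

    Both thresholds are arbitrary assumptions, not tuned values.
    """
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text after the last block tag
    return [
        text
        for text, link_chars in parser.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
```

Calling main_text(page_source) returns a list of the surviving text blocks,
which could then be fed into the NLP step. A heuristic like this will
misclassify some content, so treat it as a starting point only.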

Bruce Williams
Concepts, like individuals, have their histories and are just as incapable of
withstanding the ravages of time as are individuals. But in and through all
this they retain a kind of homesickness for the scenes of their childhood.
Soren Kierkegaard



On Sun, Jun 5, 2011 at 8:04 PM, Andreas Kostler
<andreas.koestler.le...@gmail.com> wrote:
> There's a Java library called HtmlCleaner. You might wanna give that a shot.
> Btw, I'm working on quite a similar project, so if you like, email me and we
> can maybe join forces.
> Andreas
>
> On 06/06/2011, at 11:01 AM, Base wrote:
>
>> Hi all,
>>
>> I am working on an app that will parse web pages to do some NLP and
>> statistics.  I am able to parse the HTML using several different tools
>> (enlive, HTML parser, etc.).  However, I would like to discard all the
>> rest of the junk in the web page that is not pertinent (i.e. ads).
>> Does anyone have any experience doing this?  Any tips on how to do
>> this, or even better, tools that you can recommend?  I have been
>> digging around on this for a while now and am stuck!
>>
>> Thanks!
>>
>> Base
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clojure@googlegroups.com
>> Note that posts from new members are moderated - please be patient with your 
>> first post.
>> To unsubscribe from this group, send email to
>> clojure+unsubscr...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en

