Hi,
I have worked on a similar project before and found the following
link useful:
http://blog.prashanthellina.com/2009/07/27/extracting-relevant-text-from-html-pages/
Best regards
~ Mukul Joshi
Director & CEO,
SpotOn Software Pvt. Ltd.
_SpotOn : One stop spot for your mobile development
2011/6/6 Base :
> hi all,
>
> I am working on an app that will parse web pages to do some NLP and
> statistics. I am able to parse the HTML using several different tools
> (enlive, HTML parser, etc.). However, I would like to discard all the
> rest of the junk in the web page that is not pertinent
Hi All -
Thanks for your help! I found this last night and it looks pretty
promising. It is apparently part of Apache Tika (which I had never
heard of until now), which has a lot of interesting functionality!
https://boilerpipe-web.appspot.com/
Thanks!
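
For anyone wondering what "discarding the junk" can look like in code,
here is a minimal sketch of one simple heuristic: split the page into
text blocks, then keep only blocks that have enough words and are not
mostly link text. This is NOT the algorithm boilerpipe or Tika uses; the
names (BlockCollector, extract_main_text) and the thresholds (min_words,
max_link_density) are assumptions made up for illustration. It is written
in Python with only the standard library to stay dependency-free; the
same idea should carry over to a Clojure or Java parser.

```python
from html.parser import HTMLParser

# Tags that start a new text block; an illustrative, incomplete set.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "h4", "h5", "h6",
              "article", "section", "ul", "ol", "table", "tr", "br", "body"}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style", "noscript"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []       # finished (text, link_chars) pairs
        self._parts = []       # text pieces of the current block
        self._link_chars = 0   # characters of the current block inside <a>
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Collapse whitespace and close the current block, if it has text.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len(" ".join(data.split()))


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Return blocks that look like content rather than navigation or ads.

    Both thresholds are guesses and would need tuning on real pages.
    """
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()  # pick up any trailing text
    kept = []
    for text, link_chars in collector.blocks:
        link_density = link_chars / len(text)
        if len(text.split()) >= min_words and link_density <= max_link_density:
            kept.append(text)
    return kept
```

Calling extract_main_text(html_string) returns a list of the surviving
text blocks. A navigation bar made of links gets dropped for its high
link density, and short fragments such as footers get dropped for their
low word count. Real pages will need better block splitting and tuned
thresholds.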
On Jun 5, 11:14 pm, Bruce Williams wrote:
I looked at HtmlCleaner and it pretty much cleans up the 'syntax' of the
HTML but does nothing about the 'semantics' - ads, etc.
Bruce Williams
Concepts, like individuals, have their histories and are just as incapable of
withstanding the ravages of time as are individuals. But in and
through all this
Me too, starting in October. I still need to get up to speed with Clojure,
however.
On Sun, Jun 5, 2011 at 11:04 PM, Andreas Kostler <
andreas.koestler.le...@gmail.com> wrote:
> There's a Java library called HtmlCleaner. You might wanna give that a
> shot.
> Btw, I'm working on quite a similar pro
There's a Java library called HtmlCleaner. You might wanna give that a shot.
Btw, I'm working on quite a similar project, so if you like, email me and we can
maybe join forces.
Andreas
On 06/06/2011, at 11:01 AM, Base wrote:
> hi all,
>
> I am working on an app that will parse web pages to do so