Best regards,
Xuefeng
http://www.crackj2ee.com
-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 11:30 PM
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction
John Wang wrote:
> Hi Xuefeng:
>
> Can you please send me your h
Hi,
Would you please send me your parser too?
Thanks!
Malcolm
- Original Message
From: Liao Xuefeng <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, June 23, 2006 12:54:29 AM
Subject: RE: HTML text extraction
hi, all,
I wrote my own html parser because it just
MAIL PROTECTED]
Sent: Thursday, June 22, 2006 11:30 PM
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction
John Wang wrote:
> Hi Xuefeng:
>
> Can you please send me your htmlparser too?
Xuefeng, would it be possible to open source your parser?
Thanks
Michi
>
> thanks
John Wang wrote:
Hi Xuefeng:
Can you please send me your htmlparser too?
Xuefeng, would it be possible to open source your parser?
Thanks
Michi
thanks
-John
On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote:
Simon Courtenage wrote:
> I also use htmlparser, which is rather good. I'
Hi Xuefeng:
Can you please send me your htmlparser too?
thanks
-John
On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote:
Simon Courtenage wrote:
> I also use htmlparser, which is rather good. I've had to customize it,
> though, to parse strings containing
> html source rather than accept
, June 21, 2006 1:40 PM
To: java-user@lucene.apache.org
Subject: HTML text extraction
Can someone please suggest an HTML text extraction library? The Lucene
book recommends Tidy. It seems JTidy is not really being maintained.
Otis, what do you guys use at Simpy?
Thanks
Simon Courtenage wrote:
I also use htmlparser, which is rather good. I've had to customize it,
though, to parse strings containing
html source rather than accept urls of resources to fetch etc. Also it
crashes on meta tags that don't have
name attributes (something I discovered only a couple
n Wang <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, June 21, 2006 1:39:41 AM
Subject: HTML text extraction
Can someone please suggest an HTML text extraction library? The Lucene
book recommends Tidy. It seems JTidy is not really being maintained.
Otis, what do you guys
hi,
I wrote my own HTML parser to do html2text and it works well. I can send you
my code if it matches your requirements.
-Original Message-
From: John Wang [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 21, 2006 1:40 PM
To: java-user@lucene.apache.org
Subject: HTML text extraction
Can
if you just want something to extract the text from HTML, without trying
to extract structure (i.e., you don't care about title vs h1 vs bold vs meta
keywords) then the HTMLStripReader (or
HTMLStripWhitespaceTokenizerFactory) Yonik wrote for Solr might be
useful. It wasn't intended to deal with fu
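
For readers following the thread, the "text only, no structure" idea can be sketched in a few lines of plain Java. This is not Solr's HTMLStripReader and not any poster's parser; the class name TagStripSketch and the method stripTags are hypothetical, and the sketch assumes reasonably well-formed markup. It does not decode entities and does not skip comments or script/style bodies.

```java
public class TagStripSketch {

    /**
     * Hypothetical helper: drops everything between '<' and '>' and
     * collapses runs of whitespace. No entity decoding, no special
     * handling of comments, <script>, or <style>.
     */
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
                out.append(' ');   // treat a tag boundary as a word separator
            } else if (c == '>' && inTag) {
                inTag = false;     // end of the tag; resume copying text
            } else if (!inTag) {
                out.append(c);     // ordinary text outside any tag
            }
        }
        // Collapse whitespace introduced by tags and by the source layout.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<h1>Title</h1><p>Hello <b>world</b></p>"));
    }
}
```

Because every tag becomes a separator, inline markup in the middle of a word splits that word in two. Libraries such as those named in this thread handle cases like this far more carefully.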
o).
Simon
Daniel Noll wrote:
John Wang wrote:
Can someone please suggest an HTML text extraction library? The Lucene
book recommends Tidy. It seems JTidy is not really being maintained.
We use this library to do our HTML parsing work:
http://htmlparser.sourceforge.net/
It's fairly resili
pt (a simple
tweak), but "kept on trucking" with any size document.
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 21 June 2006 07:37
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction
John,
I also wrote about using NekoHTML, I think. I
John,
I also wrote about using NekoHTML, I think. I prefer that to JTidy. That also
tells you what Simpy.com uses.
Otis
- Original Message
From: John Wang <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, June 21, 2006 1:39:41 AM
Subject: HTML text extractio
John Wang wrote:
Can someone please suggest an HTML text extraction library? The Lucene
book recommends Tidy. It seems JTidy is not really being maintained.
We use this library to do our HTML parsing work:
http://htmlparser.sourceforge.net/
It's fairly resilient to bad code, incl
Can someone please suggest an HTML text extraction library? The Lucene
book recommends Tidy. It seems JTidy is not really being maintained.
Otis, what do you guys use at Simpy?
Thanks
-john