Best regards,
Xuefeng
http://www.crackj2ee.com
-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 11:30 PM
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction
John Wang wrote:
> Hi Xuefeng:
>
> Can you please send me your h
Hi,
Would you please send me your parser too?
Thanks!
Malcolm
- Original Message
From: Liao Xuefeng <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, June 23, 2006 12:54:29 AM
Subject: RE: HTML text extraction
hi, all,
I wrote my own html parser because it just
MAIL PROTECTED]
Sent: Thursday, June 22, 2006 11:30 PM
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction
John Wang wrote:
> Hi Xuefeng:
>
> Can you please send me your htmlparser too?
Xuefeng, would it be possible to open source your parser?
Thanks
Michi
>
> thanks
John Wang wrote:
Hi Xuefeng:
Can you please send me your htmlparser too?
Xuefeng, would it be possible to open source your parser?
Thanks
Michi
thanks
-John
On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote:
Simon Courtenage wrote:
> I also use htmlparser, which is rather good. I'
Hi Xuefeng:
Can you please send me your htmlparser too?
thanks
-John
On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote:
Simon Courtenage wrote:
> I also use htmlparser, which is rather good. I've had to customize it,
> though, to parse strings containing
> html source rather than accept
Please send it to me, thanks very much!
2006/6/21, Liao Xuefeng <[EMAIL PROTECTED]>:
hi,
I wrote my own HTML parser to do html2text and it works well. I can send you
my code if it matches your requirements.
-Original Message-
From: John Wang [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 21
Simon Courtenage wrote:
I also use htmlparser, which is rather good. I've had to customize it,
though, to parse strings containing
html source rather than accept urls of resources to fetch etc. Also it
crashes on meta tags that don't have
name attributes (something I discovered only a couple
Thanks everyone for your responses!
I will try them out.
-John
On 6/20/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
John,
I also wrote about using NekoHTML, I think. I prefer that to JTidy. That
also tells you what Simpy.com uses.
Otis
- Original Message
From: John Wang <[EMAI
hi,
I wrote my own HTML parser to do html2text and it works well. I can send you
my code if it matches your requirements.
-Original Message-
From: John Wang [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 21, 2006 1:40 PM
To: java-user@lucene.apache.org
Subject: HTML text extraction
Can someo
if you just want something to extract the text from HTML, without trying
to extract structure (ie: you don't care about title vs h1 vs bold vs meta
keywords) then the HTMLStripReader (or
HTMLStripWhitespaceTokenizerFactory) Yonik wrote for Solr might be
useful. It wasn't intended to deal with fu
I also use htmlparser, which is rather good. I've had to customize it,
though, to parse strings containing
html source rather than accept urls of resources to fetch etc. Also it
crashes on meta tags that don't have
name attributes (something I discovered only a couple of days ago).
Simon
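For reference, here is a minimal sketch of extracting text from a string of HTML source rather than from a URL. It assumes only the JDK's built-in Swing HTML parser, not htmlparser, NekoHTML, or JTidy, and the `HtmlText` class and `extract` method are hypothetical names chosen for illustration. The Swing parser targets roughly HTML 3.2, so treat this as a starting point rather than a replacement for the libraries discussed in this thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text nodes of an HTML string.
public class HtmlText {

    public static String extract(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The callback receives one handleText() call per run of character data;
        // tags and attributes arrive through other callbacks that are ignored here.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // separate adjacent text runs with a space
                }
                text.append(data);
            }
        };

        // Parse from an in-memory string; 'true' tells the parser to ignore
        // any charset declared in the document itself.
        try (Reader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString();
    }
}
```

Calling `HtmlText.extract(someHtmlString)` returns the collected text, which could then be handed to a Lucene analyzer for indexing.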
Dan
pt (a simple
tweak), but "kept on trucking" with any size document.
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 21 June 2006 07:37
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction
John,
I also wrote about using NekoHTML, I think. I
John,
I also wrote about using NekoHTML, I think. I prefer that to JTidy. That also
tells you what Simpy.com uses.
Otis
- Original Message
From: John Wang <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, June 21, 2006 1:39:41 AM
Subject: HTML text extraction
Can
John Wang wrote:
Can someone please suggest an HTML text extraction library? In the Lucene
book, it recommends Tidy. Seems jtidy is not really being maintained.
We use this library to do our HTML parsing work:
http://htmlparser.sourceforge.net/
It's fairly resilient to bad code, including thin
14 matches