Reply: RE: HTML text extraction

2006-06-29 Thread 田春峰
st regards, Xuefeng http://www.crackj2ee.com -Original Message- From: Michael Wechner [mailto:[EMAIL PROTECTED] Sent: Thursday, June 22, 2006 11:30 PM To: java-user@lucene.apache.org Subject: Re: HTML text extraction John Wang wrote: > Hi Xuefeng: > > Can you please send me your h

Re: HTML text extraction

2006-06-29 Thread MALCOLM CLARK
Hi, Would you please send me your parser too? Thanks! Malcolm - Original Message From: Liao Xuefeng <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, June 23, 2006 12:54:29 AM Subject: RE: HTML text extraction hi, all, I wrote my own html parser because it just

RE: HTML text extraction

2006-06-22 Thread Liao Xuefeng
MAIL PROTECTED] Sent: Thursday, June 22, 2006 11:30 PM To: java-user@lucene.apache.org Subject: Re: HTML text extraction John Wang wrote: > Hi Xuefeng: > > Can you please send me your htmlparser too? Xuefeng, would it be possible to open source your parser? Thanks Michi > > thanks &

Re: HTML text extraction

2006-06-22 Thread Michael Wechner
John Wang wrote: Hi Xuefeng: Can you please send me your htmlparser too? Xuefeng, would it be possible to open source your parser? Thanks Michi thanks -John On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote: Simon Courtenage wrote: > I also use htmlparser, which is rather good. I'

Re: HTML text extraction

2006-06-22 Thread John Wang
Hi Xuefeng: Can you please send me your htmlparser too? thanks -John On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote: Simon Courtenage wrote: > I also use htmlparser, which is rather good. I've had to customize it, > though, to parse strings containing > html source rather than accept

Re: HTML text extraction

2006-06-21 Thread 张瑾
, June 21, 2006 1:40 PM To: java-user@lucene.apache.org Subject: HTML text extraction Can someone please suggest a HTML text extraction library? In the Lucene book, it recommends Tidy. Seems jtidy is not really being maintained. Otis, what do you guys use at Simpy? Thanks

Re: HTML text extraction

2006-06-21 Thread Daniel Noll
Simon Courtenage wrote: I also use htmlparser, which is rather good. I've had to customize it, though, to parse strings containing html source rather than accept urls of resources to fetch etc. Also it crashes on meta tags that don't have name attributes (something I discovered only a couple

Re: HTML text extraction

2006-06-21 Thread John Wang
n Wang <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, June 21, 2006 1:39:41 AM Subject: HTML text extraction Can someone please suggest a HTML text extraction library? In the Lucene book, it recommends Tidy. Seems jtidy is not really being maintained. Otis, what do you guys

RE: HTML text extraction

2006-06-21 Thread Liao Xuefeng
hi, i wrote my own html parser to do html2text and it works well. i can send you my code if it meets your requirements. -Original Message- From: John Wang [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 21, 2006 1:40 PM To: java-user@lucene.apache.org Subject: HTML text extraction Can

Re: HTML text extraction

2006-06-21 Thread Chris Hostetter
if you just want something to extract the text from HTML, without trying to extract structure (i.e. you don't care about title vs h1 vs bold vs meta keywords) then the HTMLStripReader (or HTMLStripWhitespaceTokenizerFactory) Yonik wrote for Solr might be useful. It wasn't intended to deal with fu
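
For readers who want to see what "extract the text without extracting structure" amounts to, here is a minimal sketch in plain Java. It is not Solr's HTMLStripReader and not any of the parsers mentioned in this thread; the class name HtmlTextSketch and the method stripTags are hypothetical names made up for illustration. It assumes roughly well-formed markup and makes no attempt to decode entities, skip script or style contents, or handle a '>' inside comments or attribute values. It treats every tag as a word boundary, so text such as "<b>bold</b>." comes out as "bold ." (harmless for tokenized indexing, wrong for display).

```java
public class HtmlTextSketch {

    /**
     * Drops everything between '<' and '>' and collapses runs of
     * whitespace (and tag boundaries) into a single space.
     */
    public static String stripTags(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;        // true while between '<' and '>'
        boolean pendingSpace = false; // a separator is owed before the next character
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                }
            } else if (c == '<') {
                inTag = true;
                pendingSpace = true;  // assumption: a tag always separates words
            } else if (Character.isWhitespace(c)) {
                pendingSpace = true;
            } else {
                // Emit the owed separator, but never at the very start.
                if (pendingSpace && out.length() > 0) {
                    out.append(' ');
                }
                pendingSpace = false;
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(stripTags(html)); // prints "Title Some bold text."
    }
}
```

The returned string could then be passed to a Lucene analyzer like any other plain text. For real-world pages with broken markup, the libraries suggested elsewhere in this thread are likely a better fit than a hand-rolled scanner like this one.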

Re: HTML text extraction

2006-06-21 Thread Simon Courtenage
o). Simon Daniel Noll wrote: John Wang wrote: Can someone please suggest a HTML text extraction library? In the Lucene book, it recommends Tidy. Seems jtidy is not really being maintained. We use this library to do our HTML parsing work: http://htmlparser.sourceforge.net/ It's fairly resili

RE: HTML text extraction

2006-06-21 Thread Rob Staveley (Tom)
pt (a simple tweak), but "kept on trucking" with any size document. -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: 21 June 2006 07:37 To: java-user@lucene.apache.org Subject: Re: HTML text extraction John, I also wrote about using NekoHTML, I think. I

Re: HTML text extraction

2006-06-20 Thread Otis Gospodnetic
John, I also wrote about using NekoHTML, I think. I prefer that to JTidy. That also tells you what Simpy.com uses. Otis - Original Message From: John Wang <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, June 21, 2006 1:39:41 AM Subject: HTML text extractio

Re: HTML text extraction

2006-06-20 Thread Daniel Noll
John Wang wrote: Can someone please suggest a HTML text extraction library? In the Lucene book, it recommends Tidy. Seems jtidy is not really being maintained. We use this library to do our HTML parsing work: http://htmlparser.sourceforge.net/ It's fairly resilient to bad code, incl

HTML text extraction

2006-06-20 Thread John Wang
Can someone please suggest a HTML text extraction library? In the Lucene book, it recommends Tidy. Seems jtidy is not really being maintained. Otis, what do you guys use at Simpy? Thanks -john