subject:"RE\: HTML text extraction"

Reply: RE: HTML text extraction

2006-06-29 Thread 田春峰

st regards, Xuefeng http://www.crackj2ee.com -Original Message- From: Michael Wechner [mailto:[EMAIL PROTECTED] Sent: Thursday, June 22, 2006 11:30 PM To: java-user@lucene.apache.org Subject: Re: HTML text extraction John Wang wrote: > Hi Xuefeng: > > Can you please send me your h

Re: HTML text extraction

2006-06-29 Thread MALCOLM CLARK

Hi, Would you please send me your parser too? Thanks! Malcolm - Original Message From: Liao Xuefeng <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, June 23, 2006 12:54:29 AM Subject: RE: HTML text extraction hi, all, I wrote my own html parser because it just

RE: HTML text extraction

2006-06-22 Thread Liao Xuefeng

MAIL PROTECTED] Sent: Thursday, June 22, 2006 11:30 PM To: java-user@lucene.apache.org Subject: Re: HTML text extraction John Wang wrote: > Hi Xuefeng: > > Can you please send me your htmlparser too? Xuefeng, would it be possible to open source your parser? Thanks Michi > > thanks &

Re: HTML text extraction

2006-06-22 Thread Michael Wechner

John Wang wrote: Hi Xuefeng: Can you please send me your htmlparser too? Xuefeng, would it be possible to open source your parser? Thanks Michi thanks -John On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote: Simon Courtenage wrote: > I also use htmlparser, which is rather good. I'

Re: HTML text extraction

2006-06-22 Thread John Wang

Hi Xuefeng: Can you please send me your htmlparser too? thanks -John On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote: Simon Courtenage wrote: > I also use htmlparser, which is rather good. I've had to customize it, > though, to parse strings containing > html source rather than accept

Re: HTML text extraction

2006-06-21 Thread 张瑾

Please send it to me, thanks very much! 2006/6/21, Liao Xuefeng <[EMAIL PROTECTED]>: Hi, I wrote my own HTML parser to do html2text and it works well. I can send you my code if it meets your requirements. -Original Message- From: John Wang [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 21

Re: HTML text extraction

2006-06-21 Thread Daniel Noll

Simon Courtenage wrote: I also use htmlparser, which is rather good. I've had to customize it, though, to parse strings containing html source rather than accept urls of resources to fetch etc. Also it crashes on meta tags that don't have name attributes (something I discovered only a couple

Re: HTML text extraction

2006-06-21 Thread John Wang

Thanks everyone for your responses! I will try them out. -John On 6/20/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: John, I also wrote about using NekoHTML, I think. I prefer that to JTidy. That also tells you what Simpy.com uses. Otis - Original Message From: John Wang <[EMAI

RE: HTML text extraction

2006-06-21 Thread Liao Xuefeng

Hi, I wrote my own HTML parser to do html2text and it works well. I can send you my code if it meets your requirements. -Original Message- From: John Wang [mailto:[EMAIL PROTECTED] Sent: Wednesday, June 21, 2006 1:40 PM To: java-user@lucene.apache.org Subject: HTML text extraction Can someo

Re: HTML text extraction

2006-06-21 Thread Chris Hostetter

If you just want something to extract the text from HTML, without trying to extract structure (i.e., you don't care about title vs h1 vs bold vs meta keywords), then the HTMLStripReader (or HTMLStripWhitespaceTokenizerFactory) Yonik wrote for Solr might be useful. It wasn't intended to deal with fu
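For readers who want to see what "extract the text, ignore the structure" can look like, here is a minimal sketch. It is NOT Solr's HTMLStripReader; the class name `SimpleHtmlStripper` and its `strip` methods are made up for illustration. It simply drops everything between `<` and `>` and collapses whitespace. It assumes reasonably well-formed markup and does not decode entities, skip script/style contents, or handle a `>` inside comments or attribute values.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only; not part of Lucene or Solr.
final class SimpleHtmlStripper {

    // Reads HTML from any Reader and returns the text found outside of tags.
    static String strip(Reader in) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '<') {
                    inTag = true;               // start of a tag: stop copying
                } else if (c == '>' && inTag) {
                    inTag = false;              // end of the tag
                    out.append(' ');            // a tag acts as a word boundary
                } else if (!inTag) {
                    out.append((char) c);       // ordinary text: keep it
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse runs of whitespace left behind by removed tags.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    // Convenience overload for HTML that is already held in a String.
    static String strip(String html) {
        return strip(new StringReader(html));
    }
}
```

Calling `SimpleHtmlStripper.strip("<h1>Title</h1><p>Some <b>bold</b> text.</p>")` returns `Title Some bold text.`. Because every tag becomes a space, inline markup in the middle of a word would split that word; a real stripper needs to be smarter about that.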

Re: HTML text extraction

2006-06-21 Thread Simon Courtenage

I also use htmlparser, which is rather good. I've had to customize it, though, to parse strings containing html source rather than accept urls of resources to fetch etc. Also it crashes on meta tags that don't have name attributes (something I discovered only a couple of days ago). Simon Dan
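On the point about parsing a string of HTML source instead of fetching a URL: as a rough sketch, the HTML parser that ships with the JDK (in `javax.swing.text.html`) accepts any `Reader`, so a `StringReader` is enough. This is a different library from htmlparser and says nothing about how htmlparser itself behaves; the class name `StringHtmlText` and the method `extract` are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration only; uses the JDK's built-in parser.
final class StringHtmlText {

    // Parses HTML held in a String and returns the text chunks joined by spaces.
    static String extract(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

For example, `StringHtmlText.extract("<p>Hello <b>world</b></p>")` yields the words "Hello" and "world" with the markup removed. The other `ParserCallback` methods (such as `handleStartTag`) can be overridden if you later need structure as well.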

RE: HTML text extraction

2006-06-21 Thread Rob Staveley (Tom)

pt (a simple tweak), but "kept on trucking" with any size document. -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: 21 June 2006 07:37 To: java-user@lucene.apache.org Subject: Re: HTML text extraction John, I also wrote about using NekoHTML, I think. I

Re: HTML text extraction

2006-06-20 Thread Otis Gospodnetic

John, I also wrote about using NekoHTML, I think. I prefer that to JTidy. That also tells you what Simpy.com uses. Otis - Original Message From: John Wang <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, June 21, 2006 1:39:41 AM Subject: HTML text extraction Can

Re: HTML text extraction

2006-06-20 Thread Daniel Noll

John Wang wrote: Can someone please suggest a HTML text extraction library? In the Lucene book, it recommends Tidy. Seems jtidy is not really being maintained. We use this library to do our HTML parsing work: http://htmlparser.sourceforge.net/ It's fairly resilient to bad code, including thin

14 matches
