1. parser is the preprocessing of documents, lucene will not know anything about it. 2. I have only used NekoHtmlParser. Cobra is a java browser and it seems a little heavy. VietSpider is very heavy because it embed mozilla browser by SWT. MozillaParser is similar but embeding by itself(which need jni). if you care about speed, you can try Neko, Jericho, JTidy or Java HTML Parser if you care parser quality, you can try cobra or VietSpider. because the deals well with css javascript or related things.
But I think the parser will most be used when crawling. So you can use these parsers when crawling and save parsed result only. HtmlUnit is also a good tool for this purpose which support javascript and parsing web pages. 2011/3/11 shrinath.m <shrinat...@webyog.com> > Thank you Li Li. > > Two questions : > > 1. Is there anything *in* *Lucene* that I need to know of ? some contrib > module or anything as such ? > 2. You ran a search in java-source.net for me, thanks for that, but do you > mind telling me which is the easiest and fastest ?? > > On Fri, Mar 11, 2011 at 4:38 PM, Li Li [via Lucene] < > ml-node+2664327-2139887543-376...@n3.nabble.com> wrote: > > > http://java-source.net/open-source/html-parsers > > > > 2011/3/11 shrinath.m <[hidden email]< > http://user/SendEmail.jtp?type=node&node=2664327&i=0&by-user=t>> > > > > > > > I am trying to index content withing certain HTML tags, how do I index > it > > ? > > > Which is the best parser/tokenizer available to do this ? > > > > > > -- > > > View this message in context: > > > > > > http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2664316.html > < > http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2664316.html?by-user=t > > > > > Sent from the Lucene - Java Users mailing list archive at Nabble.com. > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: [hidden email]< > http://user/SendEmail.jtp?type=node&node=2664327&i=1&by-user=t> > > > For additional commands, e-mail: [hidden email]< > http://user/SendEmail.jtp?type=node&node=2664327&i=2&by-user=t> > > > > > > > > > > > > ------------------------------ > > If you reply to this email, your message will be added to the discussion > > below: > > > > > http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2664327.html > > To unsubscribe from Which is the +best +fast HTML parser/tokenizer that > I > > can use with Lucene for indexing HTML content today ?, click here< > http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=2664316&code=c2hyaW5hdGgubUB3ZWJ5b2cuY29tfDI2NjQzMTZ8LTIxMzY3ODQ0ODI= > >. > > > > > > > > -- > Regards > Shrinath.M > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2664331.html > Sent from the Lucene - Java Users mailing list archive at Nabble.com. >