Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

Li Li Fri, 11 Mar 2011 03:36:07 -0800

1. Parsing is a preprocessing step that runs before indexing; Lucene itself
knows nothing about it.
2. I have only used NekoHtmlParser. Cobra is a Java browser and seems a
little heavy. VietSpider is very heavy because it embeds the Mozilla browser
via SWT. MozillaParser is similar but embeds Mozilla itself (which needs JNI).
    If you care about speed, you can try Neko, Jericho, JTidy or Java HTML
Parser.
    If you care about parser quality, you can try Cobra or VietSpider, because
they deal well with CSS, JavaScript and related things.


    But I think the parser will mostly be used when crawling, so you can run
these parsers at crawl time and save only the parsed result.
    HtmlUnit is also a good tool for this purpose; it supports JavaScript
and parses web pages.
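
    To make the "parse first, index the result" idea concrete, here is a
minimal sketch. It does not use any of the parsers named above; as a stand-in
it uses the HTML parser that ships with the JDK (javax.swing.text.html), only
so that it runs without extra jars. The class name TagTextExtractor and its
extract method are made up for illustration. It collects the text inside one
chosen tag, and those plain strings are what you would then hand to Lucene as
field values.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: pulls out the text inside every occurrence of one tag.
// The JDK's built-in parser is an assumption made for this sketch; NekoHTML,
// Jericho, etc. would fill the same role with their own APIs.
final class TagTextExtractor {

    static List<String> extract(String html, HTML.Tag wanted) {
        List<String> results = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int[] depth = {0}; // > 0 while we are inside the wanted tag

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == wanted) {
                    depth[0]++;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth[0] > 0) {
                    // Join text chunks with a space; whitespace is normalized below.
                    current.append(' ').append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == wanted && depth[0] > 0 && --depth[0] == 0) {
                    results.add(current.toString().trim().replaceAll("\\s+", " "));
                    current.setLength(0);
                }
            }
        };

        try {
            // The last argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        String page = "<html><body><h2>Lucene</h2><p>skip me</p><h2>Neko</h2></body></html>";
        // Each returned string is plain text, ready to be stored or indexed.
        for (String text : extract(page, HTML.Tag.H2)) {
            System.out.println(text);
        }
    }
}
```

    If you do this at crawl time, you only need to keep the extracted
strings, and the indexing code never has to see HTML at all.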

2011/3/11 shrinath.m <shrinat...@webyog.com>

> Thank you Li Li.
>
> Two questions :
>
> 1. Is there anything *in* *Lucene* that I need to know of ? some contrib
> module or anything as such ?
> 2. You ran a search in java-source.net for me, thanks for that, but do you
> mind telling me which is the easiest and fastest ??
>
> On Fri, Mar 11, 2011 at 4:38 PM, Li Li [via Lucene] <
> ml-node+2664327-2139887543-376...@n3.nabble.com> wrote:
>
> > http://java-source.net/open-source/html-parsers
> >
> > 2011/3/11 shrinath.m <[hidden email]>
> >
> >
> > > I am trying to index content within certain HTML tags, how do I index
> > > it?
> > > Which is the best parser/tokenizer available to do this?
> > >
> > > --
> > > View this message in context:
> > >
> >
> http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2664316.html
> <
> http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2664316.html?by-user=t
> >
> > > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [hidden email]<
> http://user/SendEmail.jtp?type=node&node=2664327&i=1&by-user=t>
> > > For additional commands, e-mail: [hidden email]<
> http://user/SendEmail.jtp?type=node&node=2664327&i=2&by-user=t>
> > >
> > >
> >
> >
> > ------------------------------
> >  If you reply to this email, your message will be added to the discussion
> > below:
> >
> >
> http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2664327.html
> >  To unsubscribe from Which is the +best +fast HTML parser/tokenizer that
> I
> > can use with Lucene for indexing HTML content today ?, click here<
> http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=2664316&code=c2hyaW5hdGgubUB3ZWJ5b2cuY29tfDI2NjQzMTZ8LTIxMzY3ODQ0ODI=
> >.
> >
> >
>
>
>
> --
> Regards
> Shrinath.M
>
