: Document 1:
: "united states is United airlines operates in 50 states. United
: states government."
:
: Document 2:
: "united states is United airlines operates in 50 states. United
: some other word states"
:
: If you consider the tf-idf weight of individual terms "united" and
: "s
On Jul 29, 2005, at 4:40 PM, Chris Fraschetti wrote:
I've got an index which I rebuild each time and don't do any deletes
until the end, so doc ids shouldn't change... at index time, is there
a better way to discover the id of the document I just added than
docCount() ?
When building a new ind
I've got an index which I rebuild each time and don't do any deletes
until the end, so doc ids shouldn't change... at index time, is there
a better way to discover the id of the document I just added than
docCount() ?
--
___
Chris Fraschetti
e [EMAIL
I was wondering how Lucene's phrase query would work in case of n-gram
indexing. There are two scenarios for position increments while adding
the index for n-grams. For example consider tri-grams of "united states
of america".
Scenario 1:
Index position token
0 "united"
1
> Is there a faster way to access the total hits
> count??
The solution I outlined could be adapted to work
across multiple indexes - you'd just have to aggregate
the totals.
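
As a rough sketch of that aggregation step (plain Java, no Lucene calls; the
class name, method name and sample counts are made up for illustration, and it
assumes you already have a category-to-count map for each index):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CategoryTotals {

    // Sums per-category hit counts that were gathered separately per index.
    // Assumption: each map holds category -> hit count for exactly one index.
    static Map<String, Integer> aggregate(List<Map<String, Integer>> perIndexCounts) {
        // TreeMap only to get a stable, sorted order when printing.
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> counts : perIndexCounts) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                // Add this index's count to the running total for the category.
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical counts from two separate indexes.
        Map<String, Integer> indexA = new HashMap<>();
        indexA.put("news", 3);
        indexA.put("sports", 5);

        Map<String, Integer> indexB = new HashMap<>();
        indexB.put("sports", 2);
        indexB.put("tech", 7);

        List<Map<String, Integer>> all = new ArrayList<>();
        all.add(indexA);
        all.add(indexB);

        System.out.println(aggregate(all));
    }
}
```

How the per-index counts are produced is left out here; only the summing is
shown.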
If going from all category terms to matching doc ids
is slow you could do it the other way going from
matching doc ids to
Thanks Mark
I've looked at your posting and it's not the answer to my problem. In
testing one large index vs. several small indexes, I've found that for
high frequency terms, the small individual indexes perform better by a
factor of 2 to 3. I know this is contrary to what is recommended
b
Hi Novelli
Do you insist on using the HtmlParser in Nutch?
If alternatives are an option, maybe you can try the htmlparser
hosted on sf.net
http://htmlparser.sourceforge.net/
Regards
/Jack
On 7/29/05, Giovanni Novelli <[EMAIL PROTECTED]> wrote:
> Hello,
> I'm working on the development of a multi-agen
Replying to myself: the problem occurred because I closed the MultiSearcher
when I finalize:
SearchResults sr=null;
Query buscar =
MultiFieldQueryParser.parse(search_text,fields,required,analyzer);
Hits encontrados=searcher.search(buscar);
sr = new SearchResult
I have tried both HtmlParser v1.5 and NekoHTML. With the former, my
implementation doesn't work; for example, it extracts text from JavaScript. I
have followed the hint from
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/visitors/TextExtractingVisitor.html
The following is my NOT working implement
Hi Giovanni
We are using the Neko HTML parser. Some simple example code can be
found in the "Lucene in Action" book.
For more information:
http://www.manning.com/books/hatcher2
http://www.apache.org/~andyc/neko/doc/html/
Patrick
On 29/07/05, Giovanni Novelli <[EMAIL PROTECTED]> wrote:
> Hello,
Hello,
I'm working on the development of multi-agent software that
involves some information indexing, information retrieval and
information categorization tasks. I want to build the training set for
categorization using a set of HTML pages fetched from DMOZ RDF dumps.
I have tried the HtmlParse