RE: n-gram indexing

2005-07-29 Thread Chris Hostetter
: Document 1: "united states is United airlines operates in 50 states. United states government." : Document 2: "united states is United airlines operates in 50 states. United some other word states" : If you consider the tf-idf weight of individual terms "united" and "s

Re: indexed document id

2005-07-29 Thread Erik Hatcher
On Jul 29, 2005, at 4:40 PM, Chris Fraschetti wrote: I've got an index which I rebuild each time and don't do any deletes until the end, so doc ids shouldn't change... at index time, is there a better way to discover the id of the document I just added than docCount()? When building a new ind

indexed document id

2005-07-29 Thread Chris Fraschetti
I've got an index which I rebuild each time and don't do any deletes until the end, so doc ids shouldn't change... at index time, is there a better way to discover the id of the document I just added than docCount()?

RE: n-gram indexing

2005-07-29 Thread Rajesh Munavalli
I was wondering how Lucene's phrase query would work in the case of n-gram indexing. There are two scenarios for position increments while adding the index for n-grams. For example, consider tri-grams of "united states of america". Scenario 1: Index position token 0 "united" 1

RE: hit count within categories

2005-07-29 Thread mark harwood
> Is there a faster way to access the total hits > count?? The solution I outlined could be adapted to work across multiple indexes - you'd just have to aggregate the totals. If going from all category terms to matching doc ids is slow you could do it the other way going from matching doc ids to

RE: hit count within categories

2005-07-29 Thread Tim Johnson
Thanks Mark. I've looked at your posting and it's not the answer to my problem. In testing one large index vs. several small indexes, I've found that for high-frequency terms, the small individual indexes perform better by a factor of 2 to 3. I know this is contrary to what is recommended b

Re: Text extraction from HTML

2005-07-29 Thread Jack Tang
Hi Novelli, Do you insist on the HtmlParser in Nutch? If alternatives are acceptable, you could try the htmlparser hosted on sf.net: http://htmlparser.sourceforge.net/ Regards /Jack On 7/29/05, Giovanni Novelli <[EMAIL PROTECTED]> wrote: > Hello, > I'm working to the development of a multi-agen

Re: another problem with Multisearcher

2005-07-29 Thread Daniel Cortes
Replying to myself: the problem occurred because I was closing the MultiSearcher when finishing: SearchResults sr=null; Query buscar = MultiFieldQueryParser.parse(search_text,fields,required,analyzer); Hits encontrados=searcher.search(buscar); sr = new SearchResult

Re: Text extraction from HTML

2005-07-29 Thread Giovanni Novelli
I have tried both HtmlParser v1.5 and NekoHTML. With the former, my implementation doesn't work: for example, it extracts text from JavaScript. I followed the hint from http://htmlparser.sourceforge.net/javadoc/org/htmlparser/visitors/TextExtractingVisitor.html The following is my NOT working implement

Re: Text extraction from HTML

2005-07-29 Thread Patrick Kimber
Hi Giovanni, We are using the Neko HTML parser. Some simple example code can be found in the "Lucene in Action" book. For more information: http://www.manning.com/books/hatcher2 http://www.apache.org/~andyc/neko/doc/html/ Patrick On 29/07/05, Giovanni Novelli <[EMAIL PROTECTED]> wrote: > Hello,

Text extraction from HTML

2005-07-29 Thread Giovanni Novelli
Hello, I'm working on the development of multi-agent software that involves information indexing, information retrieval, and information categorization tasks. I want to build the training set for categorization using a set of HTML pages fetched from DMOZ RDF dumps. I have tried the HtmlParse