Re: Stemming terms in SpanQuery

2006-05-02 Thread Jason Calabrese
I think the best way to tokenize/stem is to use the analyzer directly, for example: TokenStream ts = analyzer.tokenStream(field, new StringReader(text)); Token token = null; while ((token = ts.next()) != null) { Term newTerm = new Term(field, token.termTe
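A fuller version of that loop, sketched against the Lucene 1.x-era analysis API (where TokenStream.next() returns null at the end of the stream and Token exposes termText()); the field name and analyzer choice are assumptions — use whatever you used at index time:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Term;

public class StemTerms {
    // Run the same analyzer used at index time over the query text, so the
    // terms fed to a SpanQuery are stemmed exactly as the index stores them.
    public static List<Term> analyzedTerms(Analyzer analyzer, String field,
                                           String text) throws IOException {
        List<Term> terms = new ArrayList<Term>();
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        Token token;
        while ((token = ts.next()) != null) { // null marks end of stream (1.x API)
            terms.add(new Term(field, token.termText()));
        }
        ts.close();
        return terms;
    }
}
```

Each resulting Term can then be wrapped in a SpanTermQuery so the span search matches the stemmed form actually present in the index.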

ArrayIndexOutOfBoundsException w/ ImageSearcher

2006-05-02 Thread Michael Dodson
Hi, I'm getting an ArrayIndexOutOfBoundsException when I try to create an instance of IndexSearcher with an FSDirectory. for IndexSearcher searcher = new IndexSearcher(directory); I get the following stack trace: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1

Re: max_score(multi_valued_field) function?

2006-05-02 Thread Chris Hostetter
yes - i guess this is more or less what i mean. an example are the two documents: 1 - with the titles: "http", "hypertext transfer protocol"; 2 - with the title: "http tunnel". when i use multi-valued fields and do a search on "http" the title score on the second document is hi

Re: max_score(multi_valued_field) function?

2006-05-02 Thread Günther Starnberger
hello, > i can think of two possibilities you might be referring to when you say > "noise" ... one is that the lengthNorm for docs with many variant > titles causes matches in those titles to not score as well as > documents with only one title -- this can be dealt with by overriding > the lengthNo
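The lengthNorm override being discussed might look like the following in the Lucene 1.9/2.0-era API, where Similarity.lengthNorm(String, int) can be overridden in a DefaultSimilarity subclass. The "title" field name is illustrative, and note the same Similarity must be set on both the IndexWriter (at index time, since norms are baked in) and the Searcher:

```java
import org.apache.lucene.search.DefaultSimilarity;

// Neutralize length normalization for the title field, so a document with
// many variant titles is not penalized relative to a single-title document.
public class FlatTitleSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTerms) {
        if ("title".equals(fieldName)) {  // hypothetical field name
            return 1.0f;                  // ignore field length for titles
        }
        return super.lengthNorm(fieldName, numTerms);
    }
}
```

Field.setOmitNorms(true), mentioned later in this digest, is the blunter alternative: it drops the norm factor for the field entirely.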

Re: max_score(multi_valued_field) function?

2006-05-02 Thread Chris Hostetter
: - If I index all of the possible titles in a multi-valued field this introduces some kind of noise and therefore also bad results. The reason is that Lucene concatenates all the values of multi-valued fields when searching them. While a single one of these fields may be a perfect match thi

max_score(multi_valued_field) function?

2006-05-02 Thread Günther Starnberger
Hello, I would like to use Lucene to index a set of articles, where several different titles may belong to one single article. Currently I use a field for the article as well as a multi-valued field for the titles. My problem is: - If I index only one of the titles I won't get matches when someo

RE: Occurence (freq) and ordering

2006-05-02 Thread Chris Hostetter
: By example, a doc contains 3 times the word "test", and 1 time the word "example", and the query was looking for both words; the score for the doc should be 4. : But whatever I do, the score is 1. 1) this is where Searcher.explain really comes in handy ... it will help you see what is going on.
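A minimal sketch of using Searcher.explain, against the Lucene 1.x-era Hits API (the method names here — search(Query) returning Hits, Hits.id(int), Searcher.explain(Query, int) — are that era's API; check the javadoc for your version):

```java
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ExplainScore {
    // Print Lucene's scoring breakdown (tf, idf, norms, coord) for each hit,
    // which shows exactly why a document scored the way it did.
    public static void explainHits(IndexSearcher searcher, Query query)
            throws Exception {
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            Explanation e = searcher.explain(query, hits.id(i));
            System.out.println(e.toString());
        }
    }
}
```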

RE: Does lucene works with NFS?

2006-05-02 Thread Bryzek.Michael
We ran into a problem when implementing a similar infrastructure using NFS. We were updating our indexes continuously throughout the day which caused disk space problems when using NFS. I no longer recall the specific details, but in our configuration, NFS did not appear to flush stale file hand

Does lucene works with NFS?

2006-05-02 Thread ningjun . wang
Hello: We are developing a WebSphere application using Lucene. Can we use the following architecture? 1. Store the index in an NFS file system which is mounted on all four UNIX machines. 2. The WebSphere application just performs searches (read-only access to the index on NFS). 3. One of the four machin

Re: creating indexReader object

2006-05-02 Thread trupti mulajkar
i have indexed files using IndexFiles, how can i add the field to the document using this. cheers, trupti mulajkar MSc Advanced Computer Science Quoting karl wettin <[EMAIL PROTECTED]>: > > On 2 May 2006, at 16:11, trupti mulajkar wrote: > > > > doc(i).get("contents"); > > > > i get an only NULL

RE: OutOfMemoryError while enumerating through reader.terms(fieldName)

2006-05-02 Thread Ramana Jelda
Thanks for your quick reply. I will go through it. Regards, Jelda > -Original Message- > From: mark harwood [mailto:[EMAIL PROTECTED] > Sent: Tuesday, May 02, 2006 5:03 PM > To: java-user@lucene.apache.org > Subject: RE: OutOfMemoryError while enumerating through > reader.terms(fieldName) >

RE: OutOfMemoryError while enumerating through reader.terms(fieldName)

2006-05-02 Thread mark harwood
"Category counts" should really be a FAQ entry. There is no one right solution to prescribe because it depends on the shape of your data. For previous discussions/code samples see here: http://www.mail-archive.com/java-user@lucene.apache.org/msg05123.html and here for more space-efficient repre

RE: OutOfMemoryError while enumerating through reader.terms(fieldName)

2006-05-02 Thread Ramana Jelda
I just got an idea for category counting instead of following this BitSet approach. I will maintain an array mapping docIds to category_ids as values, i.e. documents[docId] = category_id. For 1 million docs, with each docId = 4 bytes and category_id = 4 bytes, that takes around 8 MB. And then from user que
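The array-based idea can be illustrated with plain Java (the document count, category ids, and hit list below are made-up numbers for illustration; one int per doc is 4 bytes, so the mapping itself is about 4 MB for a million docs — the 8 MB estimate above also budgets 4 bytes for each docId):

```java
public class CategoryCountDemo {
    // Count hits per category, given a docId -> categoryId array and the
    // docIds matched by a query.
    public static int[] countCategories(int[] docToCategory, int[] hits,
                                        int numCategories) {
        int[] counts = new int[numCategories];
        for (int docId : hits) {
            counts[docToCategory[docId]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] docToCategory = new int[1_000_000]; // ~4 MB of ints
        docToCategory[42] = 3;                    // hypothetical assignments
        docToCategory[99] = 3;
        int[] counts = countCategories(docToCategory,
                                       new int[]{42, 99, 100}, 10);
        System.out.println(counts[3] + " hits in category 3");
    }
}
```

The trade-off versus per-category BitSets: this walks only the query's hit list instead of ANDing a full bitset per category, but it assumes each doc belongs to exactly one category.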

RE: Occurence (freq) and ordering

2006-05-02 Thread Philippe Deslauriers
Thanks for the Field.setOmitNorms(true) tip! Regarding the Similarity implementation I am trying to do, somehow it does not work. Here's what I understand: the Scorer implementation uses the methods defined in Similarity to compute the score (the formula expressed in "http://lucene.apache.org/java/docs

RE: OutOfMemoryError while enumerating through reader.terms(fieldName)

2006-05-02 Thread Ramana Jelda
I am trying to implement category counts similar to the CNET approach. At initialization time, I create all these BitSets and then AND them with the user query (with a BitSet obtained from a QueryFilter containing the user query). This way my application is performant.. Don't u
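The AND-and-count step can be shown with java.util.BitSet, which is also what Lucene's QueryFilter.bits() returned in this era (the doc ids here are made up; note the cached category set is cloned so it survives for the next query):

```java
import java.util.BitSet;

public class BitSetCategoryDemo {
    // AND a cached per-category BitSet with the query's BitSet and count
    // how many matching docs fall in that category.
    public static int categoryHitCount(BitSet categoryDocs, BitSet queryDocs) {
        BitSet and = (BitSet) categoryDocs.clone(); // don't clobber the cache
        and.and(queryDocs);
        return and.cardinality();
    }

    public static void main(String[] args) {
        BitSet category = new BitSet(); // docs belonging to one category
        category.set(2); category.set(5); category.set(9);
        BitSet query = new BitSet();    // docs matching the user query
        query.set(5); query.set(9); query.set(11);
        System.out.println(categoryHitCount(category, query)); // prints 2
    }
}
```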

RE: creating indexReader object

2006-05-02 Thread Frank Kunemann
Lucene's fields are case sensitive and I think "contents" is written in lower case by default. Cheers, Frank -Original Message- From: trupti mulajkar [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 02, 2006 4:11 PM To: java-user@lucene.apache.org Subject: Re: creating indexReader object

RE: OutOfMemoryError while enumerating through reader.terms(fieldName)

2006-05-02 Thread mark harwood
>>Any advice is really welcome. Don't cache all that data. You need a minimum of (numUniqueTerms*numDocs)/8 bytes to hold that info. Assuming 10,000 unique terms and 1 million docs you'd need over 1 Gig of RAM. I suppose the question is what are you trying to achieve and why can't you use the exis
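Mark's estimate worked through: one bit per (term, doc) pair gives numUniqueTerms * numDocs / 8 bytes, and for the numbers in the message that lands at 1.25 billion bytes, just over a gigabyte:

```java
public class BitSetMemoryEstimate {
    // One bit per (term, doc) pair: a BitSet per unique term, each numDocs
    // bits wide, costs numUniqueTerms * numDocs / 8 bytes in total.
    public static long bytesNeeded(long numUniqueTerms, long numDocs) {
        return numUniqueTerms * numDocs / 8;
    }

    public static void main(String[] args) {
        long bytes = bytesNeeded(10_000, 1_000_000);
        // 10,000 terms x 1,000,000 docs / 8 = 1,250,000,000 bytes (~1.16 GiB)
        System.out.println(bytes);
    }
}
```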

Re: creating indexReader object

2006-05-02 Thread karl wettin
On 2 May 2006, at 16:11, trupti mulajkar wrote: doc(i).get("Contents"); i get an only NULL any ideas ? Did you index the field with term vector when you added it to the document?

RE: creating indexReader object

2006-05-02 Thread Satuluri, Venu_Madhav
Try using luke to see how the document actually is in the index. http://www.getopt.org/luke/ -Venu -Original Message- From: trupti mulajkar [mailto:[EMAIL PROTECTED] Sent: Tuesday, May 02, 2006 7:41 PM To: java-user@lucene.apache.org Subject: Re: creating indexReader object thanx hann

Re: creating indexReader object

2006-05-02 Thread trupti mulajkar
thanks hannes, but i don't think i made my query clear enough. i have created the index reader object just the way you mentioned it, but after that when i try to create the vectors like term frequency and document frequency using doc(i).get("Contents"); i get only NULL any ideas ? cheer

Re: creating indexReader object

2006-05-02 Thread Hannes Carl Meyer
Hi, IndexReader has some static methods, e.g. IndexReader reader = IndexReader.open(new File("/index")); http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#open(java.lang.String) Hannes trupti mulajkar wrote: i am trying to create an object of index reader class

creating indexReader object

2006-05-02 Thread trupti mulajkar
i am trying to create an object of the index reader class that reads my index. i need this to further generate the document and term frequency vectors. however when i try to print the contents of the documents (doc.get("contents")) it shows null. any suggestions? if i can't read the contents then i c
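The likely cause being circled in the replies above: doc.get(name) only returns text for fields that were *stored*, and the stock IndexFiles demo adds "contents" unstored (check your version). A sketch of adding a stored, analyzed field with term vectors, using the Lucene 1.9/2.0-era Field constructor (the Field.Index.TOKENIZED name is from that era):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class StoredFieldExample {
    // doc.get("contents") returns null unless the field was added with
    // Field.Store.YES; term vectors must likewise be requested at index time.
    public static Document withStoredContents(String text) {
        Document doc = new Document();
        doc.add(new Field("contents", text,
                          Field.Store.YES,        // keep the original text
                          Field.Index.TOKENIZED,  // analyze it for search
                          Field.TermVector.YES)); // enable term-freq vectors
        return doc;
    }
}
```

Field names are also case sensitive, so "contents" and "Contents" are different fields, as noted elsewhere in this thread.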

RE: OutOfMemoryError while enumerating through reader.terms(fieldName)

2006-05-02 Thread Ramana Jelda
Hi, I just debugged it closely. Sorry, I am getting OutOfMemoryError not because of reader.terms() but because of invoking the QueryFilter.bits() method for each unique term. I will try to explain with pseudo code: while(term != null){ if(term.field().equals(name)){ String termText
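The pseudo code above, cleaned up into the Lucene 1.x-era TermEnum idiom without the per-term QueryFilter.bits() call that causes the blow-up (reader.terms(Term) seeks to the first term at or after the given one, so the loop must also stop when the field changes):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class EnumerateFieldTerms {
    // Walk every term of one field; crucially, do NOT build and cache a
    // BitSet per term here, which is what exhausts the heap.
    public static int countTerms(IndexReader reader, String field)
            throws IOException {
        int count = 0;
        TermEnum terms = reader.terms(new Term(field, "")); // seek to field start
        try {
            while (terms.term() != null
                    && terms.term().field().equals(field)) {
                count++;               // inspect terms.term().text() here
                if (!terms.next()) break;
            }
        } finally {
            terms.close();
        }
        return count;
    }
}
```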

Re: Kneobase: open source, enterprise search

2006-05-02 Thread Lukas Vlcek
I was quickly looking at its web page earlier today and it looks good so far! Good news! However, I have one question: does Kneobase contain any kind of web crawler functionality (like Nutch) or do I have to feed it with all sources *manually*? How much of the web data gathering can be automated?

OutOfMemoryError while enumerating through reader.terms(fieldName)

2006-05-02 Thread Ramana Jelda
Hi, I am getting OutOfMemoryError while enumerating through TermEnum after invoking reader.terms(fieldName). Just to provide you more information, I have almost 1 unique terms in field A. I can successfully enumerate around 5000 terms but later I am getting OutOfMemoryError. I set jvm max

Kneobase: open source, enterprise search

2006-05-02 Thread Mariano Barcia
Hi list, I'm glad to announce Colaborativa.net has released Kneobase, an open source "enterprise search" product based on Lucene. Kneobase can accept many data sources as searchable elements, and can provide search results in multiple formats, including SOAP, which might make it a