RE: document object

2011-03-10 Thread suman.holani
Hello Erick, hits.length is 1800. The version is Lucene 3.0.3. I need the entire result set, as I'll be fetching the records that satisfy the search conditions, validating them against the current counts, and scheduling the successful result set, selecting one of them by random scheduling.

Indexing of multilingual labels

2011-03-10 Thread Stephane Fellah
I am trying to index in Lucene a field that could hold labels of concepts in different languages. Most of the approaches I have seen so far are: - Use a single index, where each document has one field per language it uses, or - Use M indexes, M being the number of languages in

Re: IndexSearcher Single Instance Bottleneck?

2011-03-10 Thread Erick Erickson
No, Lucene itself shouldn't be doing this, the recommendation is for multiple threads to share a single searcher. I'd first look upstream, are your requests being processed serially? I.e. is there a single thread that's handling requests? Best Erick On Thu, Mar 10, 2011 at 4:25 PM, RobM wrote: >

Re: getting the number of updated documents

2011-03-10 Thread Koji Sekiguchi
Does IndexWriter (or somewhere else) have a method that returns the number of updated documents before commit? you have maxDoc(), which gives you the max doc ID + 1, but this might not be super accurate since there might have been merges going on in the background. I am not sure if this number yo

IndexSearcher Single Instance Bottleneck?

2011-03-10 Thread RobM
I currently have two types of searches on my website that are using the same index and same instance of index searcher. One of the searches usually only takes 50 - 100 milliseconds but the second usually takes 2 seconds. It seems as though when someone does the second search and another user does t

Re: reopen with optimize and FileNotFoundException

2011-03-10 Thread bart_212
I've tried different Lucene lock implementations, but I always get a FileNotFoundException (FNFE) during a huge index update while the IndexReader reopens the index. To prevent two IndexWriters from running at the same time, I synchronize this operation using an external locking mechanism based on an atomic filesystem operation such as creating a directory. So just before indexin
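
A minimal sketch of the kind of external lock described above, assuming the lock is a directory whose creation either succeeds or fails atomically. The class name DirLock and the method tryAcquire are hypothetical and are not taken from the poster's code or from Lucene; the sketch covers only mutual exclusion between writers and does not address the FileNotFoundException on reopen.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical directory-based lock; not part of Lucene. */
public final class DirLock implements AutoCloseable {
    private final Path lockDir;

    private DirLock(Path lockDir) {
        this.lockDir = lockDir;
    }

    /** Returns a held lock, or null if another holder already created the directory. */
    public static DirLock tryAcquire(Path lockDir) throws IOException {
        try {
            // createDirectory checks for existence and creates in one atomic step.
            Files.createDirectory(lockDir);
            return new DirLock(lockDir);
        } catch (FileAlreadyExistsException e) {
            return null; // someone else holds the lock
        }
    }

    /** Releases the lock by removing the directory. */
    @Override
    public void close() throws IOException {
        Files.delete(lockDir);
    }
}
```

A writer would call DirLock.tryAcquire(path) before opening its IndexWriter and close the lock after the writer is closed. A crashed process leaves the directory behind, so a real deployment would also need some stale-lock policy.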

Re: Search one index but use IDF from another?

2011-03-10 Thread Andrzej Bialecki
On 3/10/11 8:32 PM, Felipe Hummel wrote: Hi, I'm building a system where I want to show only results indexed in the past few days. Furthermore, I don't want to maintain a giant index with millions of documents if I only want to return results from a couple of days (thousands of documents). My sy

Re: getting the number of updated documents

2011-03-10 Thread Simon Willnauer
Hey Koji, 2011/3/10 Koji Sekiguchi: > Hello, > > Does IndexWriter (or somewhere else) have the method such that > it gets the number of updated documents before commit? You have maxDoc(), which gives you the max doc ID + 1, but this might not be super accurate since there might have been merges going on

Search one index but use IDF from another?

2011-03-10 Thread Felipe Hummel
Hi, I'm building a system where I want to show only results indexed in the past few days. Furthermore, I don't want to maintain a giant index with millions of documents if I only want to return results from a couple of days (thousands of documents). My system heavily relies that the occurrences of

Re: reopen with optimize and FileNotFoundException

2011-03-10 Thread Michael McCandless
On Wed, Mar 9, 2011 at 2:44 PM, bart_212 wrote: > Hi, > I have two web applications that use Lucene 2.3.2. Both share the same > index and can write or read. Writing is synchronized based on the file system to > allow only one IndexWriter to work at a time. There can be multiple > IndexReaders. In

RE: ManifoldCF in Action

2011-03-10 Thread karl.wright
Ah, I was not thinking of a Solr addon! I thought you were referring to some other crawler that I'd never heard of. So the answer to your question is that ManifoldCF differs from DIH in at least the following ways: - ManifoldCF can handle a wide range of repositories, not just database tables

Re: index enforcing query terms to appear within the same sentence

2011-03-10 Thread Michael Wiegand
Conceptually, I think I know what to do. Unfortunately, with the given interfaces of Lucene I have some difficulty. If I add the content of a document sentence by sentence, i.e. line by line, (using a multi-valued field), there are only two constructors possible: Field(String name, String val

Re: ManifoldCF in Action

2011-03-10 Thread Paul Libbrecht
Erm, google DIH SOLR or http://wiki.apache.org/solr/DataImportHandler paul On 10 March 2011 at 14:37, karl.wri...@nokia.com wrote: >>> > Karl, > > can you give, in one paragraph, the difference between ManifoldCF and DIH? > > thanks in advance > > paul > << > > I am unfamiliar

getting the number of updated documents

2011-03-10 Thread Koji Sekiguchi
Hello, Does IndexWriter (or somewhere else) have a method that returns the number of updated documents before commit? I have an optimized index and I'm using iw.updateDocument(Term,Document) with the index, and before commit, I'd like to know the number of updated documents from IndexWrite

Re: Detecting duplicates

2011-03-10 Thread mark harwood
This is possible using contrib's DuplicateFilter. Below is an example of your problem defined as an XML-based test which I just ran OK through my test writer/runner. Hopefully this is readable and demonstrates the use of FilteredQuery/DuplicateFilter. This is my test

Re: Detecting duplicates

2011-03-10 Thread Alexander Aristov
Did you check this? http://wiki.apache.org/solr/Deduplication Best Regards Alexander Aristov On 10 March 2011 18:35, Mark wrote: > My understanding is It can mark documents with the same signature > indicating that they are similar however there is no way at query time to > return only 1 "unique

Re: Detecting duplicates

2011-03-10 Thread Mark
My understanding is that it can mark documents with the same signature, indicating that they are similar; however, there is no way at query time to return only 1 "unique" document per signature. Am I missing something? Doc 1) This is my test Doc 2) This is my test Doc 3) Another test Doc 4) This is my

Re: Detecting duplicates

2011-03-10 Thread Grant Ingersoll
On Mar 5, 2011, at 8:35 PM, Mark wrote: > I'm familiar with Deduplication however I do not wish to remove my duplicates > and my needs are slightly different. I would like to mark the first document > with signature 'xyz' as unique but the next one as a duplicate. This way I > can filter out "
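
The marking scheme Mark describes (first document with a given signature is unique, later ones are duplicates) can be sketched at indexing time roughly as follows. This is only an illustration under assumptions: the class DuplicateMarker and the method markDuplicates are hypothetical, the signature is taken to be a plain string computed elsewhere, and nothing here uses the Solr Deduplication or DuplicateFilter APIs mentioned in this thread.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper; flags documents whose signature was already seen. */
public final class DuplicateMarker {

    /** Returns one flag per input signature: true if an equal signature appeared earlier. */
    public static List<Boolean> markDuplicates(List<String> signatures) {
        Set<String> seen = new HashSet<>();
        List<Boolean> flags = new ArrayList<>(signatures.size());
        for (String signature : signatures) {
            // Set.add returns false when the signature was already present,
            // so the negation is true exactly for the second and later copies.
            flags.add(!seen.add(signature));
        }
        return flags;
    }
}
```

With the first three example documents from the thread used as their own signatures ("This is my test", "This is my test", "Another test"), the second entry is flagged as a duplicate and the other two are not. The flag could then be stored as a field on each document so that duplicates can be filtered out at query time.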

Re: document object

2011-03-10 Thread Erick Erickson
If you're loading 100,000 documents, you can expect it to be slow. If you're loading 10 documents, it should be quite fast... So how big is hits.length? And what version of Lucene are you using? The Hits object has been deprecated for quite some time I believe. The problem here is that you're

Re: ManifoldCF in Action

2011-03-10 Thread karl.wright
>> Karl, can you give, in one paragraph, the difference between ManifoldCF and DIH? thanks in advance paul << I am unfamiliar with DIH as an acronym in either the content management or crawling infrastructure space. Can you clarify what you mean? Karl

Re: reopen with optimize and FileNotFoundException

2011-03-10 Thread Ian Lea
Usage sounds OK, but missing files on IndexReader.reopen definitely doesn't sound OK. 2.3.2 is ancient and there have been many improvements since then. I'd upgrade if possible. You could also try losing the optimizes. On recent releases you don't really need to use it. Not sure about 2.3.2 thou

Re: document object

2011-03-10 Thread Anshum
Depends on your data. I know that's a vague answer but that's the point. What you could do is use FieldCache if memory and data let you do so. Would it? -- Anshum Gupta http://ai-cafe.blogspot.com On Thu, Mar 10, 2011 at 3:12 PM, suman.holani wrote: > Hi Anshum, > > Thanks for prompt reply. > >

RE: document object

2011-03-10 Thread suman.holani
Hi Anshum, Thanks for the prompt reply. I am only storing in the index the fields that I want to fetch after the search. What I am not sure about is whether initializing a Document object through the searcher/reader class is heavy. Can we use something else in its place that does not need to load all doc ag

Re: document object

2011-03-10 Thread Anshum
Hi Suman, Do you need to load/use all fields that you have stored in the index? If that's not the case, I'd suggest you use public Document doc(int i, FieldSelector fieldSelector) http://lucene.apache

document object

2011-03-10 Thread suman.holani
Hi, I am facing a problem: the marked line in the loop runs very slowly, giving me a performance hit. for (int i = 0; i < hits.length; ++i) { int docId = hits[i].doc; Document d = searcher.doc(docId); //problem } How can I improve