Re: Can I do boosting based on term positions?

2007-08-02 Thread Paul Elschot
Cedric,

SpanFirstQuery could be a solution without payloads.
You may want to give it your own Similarity.sloppyFreq().
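For example, a minimal sketch (Lucene 2.x span API; the field name and the
100-position cutoff are illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Match "lucene" only when it occurs within the first 100 positions
// of the "body" field.
SpanFirstQuery query = new SpanFirstQuery(
        new SpanTermQuery(new Term("body", "lucene")), 100);

// Optionally weight tighter span matches higher through sloppyFreq().
Similarity similarity = new DefaultSimilarity() {
    public float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }
};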

Regards,
Paul Elschot

On Thursday 02 August 2007 04:07, Cedric Ho wrote:
> Thanks for the quick response =)
> 
> On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote:
> > Yes, it is easily doable through the "Payload" facility. During the
> > indexing process (mainly tokenization), you need to push this extra
> > information into each token. Then you can use BoostingTermQuery to include
> > the payload value in the score. You also need to implement a Similarity
> > for this (mainly the scorePayload method).
> 
> If I store, say, a custom boost factor as a payload, does that mean I
> will store one more byte per term per document in the index file? So
> the index file would be much larger?
> 
> >
> > Another way can be to extend SpanTermQuery, which already calculates the
> > position of the match. You just need to do something to use this position
> > value in the score calculation.
> 
> I see that SpanTermQuery takes a TermPositions from the IndexReader,
> and I can get the term position from there. However, I am not sure how
> to incorporate it into the score calculation. Would you mind giving a
> little more detail on this?
> 
> >
> > One possible advantage of the SpanTermQuery approach is that you can play
> > around without re-creating indices every time.
> >
> > Thanks,
> > Shailendra Sharma,
> > CTO, Ver se' Innovation Pvt. Ltd.
> > Bangalore, India
> >
> > On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi all,
> > >
> > > I was wondering if it is possible to do boosting by search terms'
> > > position in the document.
> > >
> > > For example: search terms appearing in the first 100 words, in the
> > > first 10% of words, or in the first two paragraphs would be given a
> > > higher score.
> > >
> > > Is it achievable using the new payload feature in Lucene 2.2?
> > > Or are there any easier ways to achieve this?
> > >
> > >
> > > Regards,
> > > Cedric
> > >
> > >
> > >
> >
> 
> Thanks,
> Cedric
> 
> 
> 
> 





RE: IndexReader deletes more than expected

2007-08-02 Thread Ridwan Habbal
> Subject: RE: IndexReader deletes more than expected
> Date: Wed, 1 Aug 2007 09:07:32 -0700
> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
>
> If I'm reading this correctly, there's something a little wonky here. In
> your example code, you close the IndexWriter and then, without creating a
> new IndexWriter, you call addDocument again. This shouldn't be possible
> (what version of Lucene are you using?)
Yes, you are correct: I close the indexWriter and then add more docs. What's
wrong with that? It worked out fine, and the docs I add appear to NEW INSTANCES
OF INDEX SEARCHERS after calling close on the indexWriter.
   As for creating a new IndexWriter, I tried to, but I suffered from a lock
exception, although I was closing the old IndexWriter instance before creating
the new one. I don't know why! Furthermore, this is useless for a multithreaded
app, because you can't know who is still writing to your index and who has
closed his IndexWriter. Even checking beforehand whether the index is locked
adds unnecessary overhead, which can be avoided since it works for me and I can
write with one single instance of IndexWriter.
> Assuming for the time being that you are creating the IndexWriter again,
> the other issue here is that you shouldn't be able to have a reader and a
> writer changing an index at the same time. There should be a lock failure.
> This should occur either in the Index
Well, I think I don't get the problems you expect because I use the Lucene
version that is shipped with the Compass distribution (www.compassframework.org).
In short, Compass is to Lucene what an ORM like Hibernate is to a DBMS like
Oracle. It really works fine, but I couldn't understand why Compass hides the
deleteDocuments(Term) method on the IndexWriter class
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexWriter.html#deleteDocuments(org.apache.lucene.index.Term).
This is why I used delete on a reader rather than on the same writer instance,
the only one I have. I couldn't manage my index in one particular situation
using Compass, because I had to store data not in the usual way (every row in
a table is a record). So I think I have to ask the Compass team about that.
Anyway, if you or the others have comments, please share them.
> Might you be creating your IndexWriters (which you don't show) with the
> create flag always set to true? That will wipe your index each time,
> ignoring the locks, and cause all sorts of weird results.
No, I don't create a new instance of IndexWriter; the only one I create is in
the Service constructor, so I create a new, clean (no docs) index only when the
program starts up:

public LuceneServiceSHImp(String indexDirectory) throws IOException {
    this.indexDirectory = indexDirectory;
    standardAnalyzer = new StandardAnalyzer();
    indexWriter = new IndexWriter(new java.io.File(indexDirectory),
            standardAnalyzer, true);
    indexWriter.close();
}
> -Original Message-
> From: Ridwan Habbal [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 01, 2007 8:48 AM
> To: java-user@lucene.apache.org
> Subject: IndexReader deletes more than expected
>
> Hi, I got unexpected behavior while testing Lucene. To address the problem
> briefly: using IndexWriter I add docs with a field named ID in consecutive
> order (1, 2, 3, 4, etc.), then close that index. I get a new IndexReader
> and call IndexReader.deleteDocuments(Term). The term is simply new
> Term("ID", "1"), and then I call close on the IndexReader. Things work out
> fine. But suppose I add docs using IndexWriter, close the writer, then
> create a new IndexReader to delete one of the docs already inserted,
> without closing that reader. While the IndexReader that performs the
> deletion is still open, I add more docs and then commit the IndexWriter,
> so when I search I get all the docs added in the two phases (before and
> after calling deleteDocuments() on the IndexReader), because I haven't
> closed the IndexReader (however, I closed the IndexWriter). I then close
> the IndexReader and query the index, and it has deleted all docs added
> between opening and closing it, in addition to the doc specified by the
> Term (in this test case ID=1). I know that I can avoid this by closing the
> IndexReader directly after deleting docs, but what about running it in a
> multithreaded app like a web application? Here is the code:
>
> IndexSearcher indexSearcher = new IndexSearcher(this.indexDirectory);
> Hits hitsB4InsertAndClose = null;
> hitsB4InsertAndClose = getAllAsHits(indexSearcher);
> int beforeInsertAndClose = hitsB4InsertAndClose.length();
>
> indexWriter.addDocument(getNewElement());
> indexWriter.addDocument(getNewElement());
> indexWriter.addDocument(getNewElement());
> indexWriter.close();
> IndexSearcher indexSearcherDel = new IndexSearcher(this.indexDirectory);


Getting only the Ids, not the whole documents.

2007-08-02 Thread makkhar

Hi all,

   Can I get just a list of document IDs given a search criterion? To
elaborate, here is my situation:

I store 2 contracts in the file system index, each with some parameterName
and value. Given a search criterion (paramValue='draft'), I need to get just
an ArrayList of Strings containing contract IDs. I don't need the Lucene
documents, just the IDs.

Can this be done ?

-thanks




RE: Getting only the Ids, not the whole documents.

2007-08-02 Thread Chhabra, Kapil
What is the structure of your index?
If you haven't already, add a new field to your index that stores the
contractId. For all other fields, set the "store" flag to false while
indexing.

You can now safely retrieve the value of this contractId field based on
your search results.
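
A minimal sketch of that field setup (Lucene 2.x field API; the names and
values are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document buildDoc(String contractId, String paramValue) {
    Document doc = new Document();
    // Stored, so it can be read back cheaply from search results:
    doc.add(new Field("contractId", contractId,
            Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Searched but not stored, which keeps document loading cheap:
    doc.add(new Field("paramValue", paramValue,
            Field.Store.NO, Field.Index.TOKENIZED));
    return doc;
}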

Regards,
kapilChhabra


-Original Message-
From: makkhar [mailto:[EMAIL PROTECTED] 
Sent: Thursday, August 02, 2007 2:26 PM
To: java-user@lucene.apache.org
Subject: Getting only the Ids, not the whole documents.


Hi all,

   Can I get just a list of document Ids given a search criteria ? To
elaborate here is my situation:

I store 2 contracts in the file system index each with some
parameterName and Value. Given a search criterion -
(paramValue='draft'). I
need to get just an ArrayList of Strings containing contract Ids. I dont
need the lucene documents, just the Ids.

Can this be done ?

-thanks




Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread Mohammad Norouzi
You should not store them in an array structure, since they will take up
memory; a BitSet is the best structure to store them.
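
A minimal sketch of that idea (Lucene 2.x HitCollector API; the class name is
illustrative):

import java.util.BitSet;

import org.apache.lucene.search.HitCollector;

// Records matching Lucene doc ids in a BitSet; no Documents are loaded.
public class BitSetCollector extends HitCollector {
    private final BitSet bits = new BitSet();

    public void collect(int doc, float score) {
        bits.set(doc);
    }

    public BitSet getBits() {
        return bits;
    }
}

Run it with searcher.search(query, collector) and read the matching doc ids
from getBits().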


On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote:
>
>
> Heres my index structure :
>
> Document -> contract ID   -id (index AND store)
>   -> paramName   -name (index AND store)
>   -> paramValue   -value (index BUT NOT store)
>
> When I get back 2 hits, each document contains ID and paramName, I
> have
> no interest in paramName (but I have to STORE it for some other reason),
> can
> I not just get a plain java String Array of the contract IDs that matched
> ?
> !
>
> -thanks for the prompt reply.
>
>
>
> Chhabra, Kapil wrote:
> >
> > What is the structure of your index?
> > If you havnt already, then add a new field to your index that stores the
> > contractId. For all other fields, set the "store" flag to false while
> > indexing.
> >
> > You can now safely retrieve the value of this contractId field based on
> > your search results.
> >
> > Regards,
> > kapilChhabra
> >
> >
> > -Original Message-
> > From: makkhar [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, August 02, 2007 2:26 PM
> > To: java-user@lucene.apache.org
> > Subject: Getting only the Ids, not the whole documents.
> >
> >
> > Hi all,
> >
> >Can I get just a list of document Ids given a search criteria ? To
> > elaborate here is my situation:
> >
> > I store 2 contracts in the file system index each with some
> > parameterName and Value. Given a search criterion -
> > (paramValue='draft'). I
> > need to get just an ArrayList of Strings containing contract Ids. I dont
> > need the lucene documents, just the Ids.
> >
> > Can this be done ?
> >
> > -thanks
> >
> > --
> > View this message in context:
> > http://www.nabble.com/Getting-only-the-Ids%2C-not-the-whole-documents.-t
> > f4204907.html#a11960750
> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >
> >
>
>
>


-- 
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/


Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread Mohammad Norouzi
Yes, it decreases the performance, but it's the only solution.
I've spent many weeks trying to find the best way to retrieve my own IDs, and
this was the last one I found.

Now I am storing the IDs in a BitSet structure and it's fast enough:

public void collect(int id, float score) {
    try { // resolve the stored application-level ID for each matching doc
        idBitSet.set(Integer.valueOf(searcher.doc(id).get("MyOwnID")));
    } catch (IOException e) { throw new RuntimeException(e); }
}

On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote:
>
>
>
> Hi,
>
>The solution you suggested will definitely work but will definitely
> slow
> down my search by an order of magnitude. The problem I am trying to solve
> is
> performance, thats why I need the collection of IDs and not the whole
> documents.
>
> - thanks for the prompt reply.
>
>
> is_maximum wrote:
> >
> > yes if you extend your class from HitCollector and override the
> collect()
> > mthod with following signature you can get IDs
> >
> > public void collect(int id, float score)
> >
> > On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote:
> >>
> >>
> >> Hi all,
> >>
> >>Can I get just a list of document Ids given a search criteria ? To
> >> elaborate here is my situation:
> >>
> >> I store 2 contracts in the file system index each with some
> >> parameterName and Value. Given a search criterion -
> (paramValue='draft').
> >> I
> >> need to get just an ArrayList of Strings containing contract Ids. I
> dont
> >> need the lucene documents, just the Ids.
> >>
> >> Can this be done ?
> >>
> >> -thanks
> >>
> >>
> >>
> >
> >
> > --
> > Regards,
> > Mohammad
> > --
> > see my blog: http://brainable.blogspot.com/
> > another in Persian: http://fekre-motefavet.blogspot.com/
> >
> >
>
>
>


-- 
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/


RE: Getting only the Ids, not the whole documents.

2007-08-02 Thread makkhar

Here's my index structure:

Document -> contract ID   - id    (index AND store)
         -> paramName     - name  (index AND store)
         -> paramValue    - value (index BUT NOT store)

When I get back 2 hits, each document contains ID and paramName. I have no
interest in paramName (but I have to STORE it for some other reason); can I
not just get a plain Java String array of the contract IDs that matched?!

-thanks for the prompt reply.



Chhabra, Kapil wrote:
> 
> What is the structure of your index?
> If you havnt already, then add a new field to your index that stores the
> contractId. For all other fields, set the "store" flag to false while
> indexing.
> 
> You can now safely retrieve the value of this contractId field based on
> your search results.
> 
> Regards,
> kapilChhabra
> 
> 
> -Original Message-
> From: makkhar [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, August 02, 2007 2:26 PM
> To: java-user@lucene.apache.org
> Subject: Getting only the Ids, not the whole documents.
> 
> 
> Hi all,
> 
>Can I get just a list of document Ids given a search criteria ? To
> elaborate here is my situation:
> 
> I store 2 contracts in the file system index each with some
> parameterName and Value. Given a search criterion -
> (paramValue='draft'). I
> need to get just an ArrayList of Strings containing contract Ids. I dont
> need the lucene documents, just the Ids.
> 
> Can this be done ?
> 
> -thanks
> 
> 
> 
> 




Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread makkhar


Hi,

   The solution you suggested will definitely work, but it will also slow
down my search by an order of magnitude. The problem I am trying to solve is
performance; that's why I need the collection of IDs and not the whole
documents.

- thanks for the prompt reply.


is_maximum wrote:
> 
> yes if you extend your class from HitCollector and override the collect()
> mthod with following signature you can get IDs
> 
> public void collect(int id, float score)
> 
> On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote:
>>
>>
>> Hi all,
>>
>>Can I get just a list of document Ids given a search criteria ? To
>> elaborate here is my situation:
>>
>> I store 2 contracts in the file system index each with some
>> parameterName and Value. Given a search criterion - (paramValue='draft').
>> I
>> need to get just an ArrayList of Strings containing contract Ids. I dont
>> need the lucene documents, just the Ids.
>>
>> Can this be done ?
>>
>> -thanks
>>
>>
>>
> 
> 
> -- 
> Regards,
> Mohammad
> --
> see my blog: http://brainable.blogspot.com/
> another in Persian: http://fekre-motefavet.blogspot.com/
> 
> 




Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread Mohammad Norouzi
Yes, if you extend your class from HitCollector and override the collect()
method with the following signature, you can get the IDs:

public void collect(int id, float score)

On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote:
>
>
> Hi all,
>
>Can I get just a list of document Ids given a search criteria ? To
> elaborate here is my situation:
>
> I store 2 contracts in the file system index each with some
> parameterName and Value. Given a search criterion - (paramValue='draft').
> I
> need to get just an ArrayList of Strings containing contract Ids. I dont
> need the lucene documents, just the Ids.
>
> Can this be done ?
>
> -thanks
>
>
>


-- 
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/


Do AND + OR Search in Lucene

2007-08-02 Thread Askar Zaidi
Hey Guys,

Quick question:

I do this in my code for searching:

queryParser.setDefaultOperator(QueryParser.Operator.AND);

Lucene's default operator is OR, so I change it to AND for my requirements.
Now I have a requirement to do OR as well: while doing AND, I'd like to
include results from OR too ... but ranked much lower than the AND results.

Is there a way to do this ?

thanks,
AZ


Re: Do AND + OR Search in Lucene

2007-08-02 Thread testn

You can create two queries from two query parsers, one with AND and the other
with OR. After you create both of them, call setBoost() to give them different
boost levels, and then join them together in a BooleanQuery, adding each with
BooleanClause.Occur.SHOULD. That should do the trick.
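
A minimal sketch (the field name, analyzer and boost value are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

Query andPlusOr(String userInput) throws Exception {
    QueryParser andParser = new QueryParser("content", new StandardAnalyzer());
    andParser.setDefaultOperator(QueryParser.Operator.AND);
    QueryParser orParser = new QueryParser("content", new StandardAnalyzer());
    orParser.setDefaultOperator(QueryParser.Operator.OR);

    Query andQuery = andParser.parse(userInput);
    Query orQuery = orParser.parse(userInput);
    orQuery.setBoost(0.1f);  // OR matches rank well below AND matches

    // SHOULD + SHOULD: docs matching only the OR clause still match,
    // while docs matching the AND clause score much higher.
    BooleanQuery combined = new BooleanQuery();
    combined.add(andQuery, BooleanClause.Occur.SHOULD);
    combined.add(orQuery, BooleanClause.Occur.SHOULD);
    return combined;
}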


askarzaidi wrote:
> 
> Hey Guys,
> 
> Quick question:
> 
> I do this in my code for searching:
> 
> queryParser.setDefaultOperator(QueryParser.Operator.AND);
> 
> Lucene is OR by default so I change it to AND for my requirements. Now, I
> have a requirement to do OR as well. I mean while doing AND I'd like to
> include results from OR too ... but they'll be much lower ranked than the
> AND results.
> 
> Is there a way to do this ?
> 
> thanks,
> AZ
> 
> 




RE: High CPU usage during index and search

2007-08-02 Thread testn

20,000 queries continuously? That sounds like a bit too much. Can you
elaborate on what you need to do? You probably won't need that many queries.
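
For reference, a minimal sketch of the RAMDir-to-FSDir pattern described below
(Lucene 2.2-era API; the path, analyzer and merge factor are illustrative):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

void flushBatchToDisk(RAMDirectory ramDir) throws Exception {
    // ramDir holds a batch of documents written by a now-closed
    // IndexWriter(ramDir, new SimpleAnalyzer(), true).
    IndexWriter fsWriter = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index"),
            new SimpleAnalyzer(), false);
    fsWriter.setMergeFactor(20);
    fsWriter.addIndexesNoOptimize(new Directory[] { ramDir });
    fsWriter.close();
}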



Chew Yee Chuang wrote:
> 
> Hi,
> 
> Thanks for the links provided; actually I went through those articles when
> developing the index and search functions for my application. I haven't
> tried a profiler yet, but I monitored the CPU usage and noticed that
> whenever indexing or searching is performed, the CPU usage rises to 100%.
> Below I will try to elaborate more on what my application is doing and how
> I index and search.
> 
> There are many concurrent processes running. First, the application writes
> the records it receives into a text file, with tabs separating the fields.
> The application points to a new file every 10 minutes and starts writing
> to it, so every file contains only 10 minutes of records, approximately
> 600,000 records per file. Then the indexing process checks whether there
> is a text file to be indexed; if there is, the thread wakes up and starts
> indexing.
> 
> The indexing process first adds documents to the RAMDir, then adds the
> RAMDir into the FSDir by calling addIndexesNoOptimize() when there are
> 100,000 documents (32 fields per doc) in the RAMDir. Only one
> IndexWriter(FSDir) is created, but a few IndexWriter(RAMDir)s are created
> during the whole process. Below are the configurations for the
> IndexWriters I mentioned:
> 
> IndexWriter (RAMDir)
> - SimpleAnalyzer
> - setMaxBufferedDocs(1)
> - Field.Store.YES
> - Field.Index.NO_NORMS
> 
> IndexWriter (FSDir)
> - SimpleAnalyzer
> - setMergeFactor(20)
> - addIndexesNoOptimize()
> 
> For the searching: there are many queries (20,000) that run continuously
> to generate the aggregate table for reporting purposes. All these queries
> run in a nested loop, and only one Searcher is created. I tried a searcher
> and a filter as well; the filter gives me better results, but both utilize
> lots of CPU resources.
> 
> Hope this info will help, and sorry for my bad English.
> 
> Thanks
> eChuang, Chew
> 
> -Original Message-
> From: karl wettin [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, July 31, 2007 5:54 PM
> To: java-user@lucene.apache.org
> Subject: Re: High CPU usage duing index and search
> 
> 
> 31 jul 2007 kl. 05.25 skrev Chew Yee Chuang:
>> But just notice that when Lucene performing search or index,
>> the CPU usage on my machine raise to 100%, because of this issue,  
>> some of my
>> others backend process will slow down eventually. Just want to know  
>> does
>> anyone face this problem before ? and is it any idea on how to  
>> overcome this
>> problem ?
> 
> Did you run a profiler to see what it is that consume all the resources?
> It is very hard to guess based on the information you supplied. Start  
> here:
> 
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
> 
> 
> -- 
> karl
> 
> 
> 
> 
> 
> 



Re: extracting non-english text from word, pdf, etc....??

2007-08-02 Thread testn

If you can already extract a token stream from those files, you can simply use
different analyzers to analyze the token stream appropriately. Check out the
Lucene contrib analyzers at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/



heybluez wrote:
> 
> I know how to do english text with POI and PDFBox and so on.  Now, I want
> to start indexing non-english language such as french and spanish.  Which
> extraction libs are available for me?
> 
> I want to do:
> 
> Excel
> Word
> PowerPoint
> PDF
> HTML
> RTF
> 
> Thanks!
> Michael
> 
> 
> 
> 




Re: LUCENE-843 Release

2007-08-02 Thread testn

Mike, as a committer, what do you think?

Thanks!


Peter Keegan wrote:
> 
> I've built a production index with this patch and done some query stress
> testing with no problems.
> I'd give it a thumbs up.
> 
> Peter
> 
> On 7/30/07, testn <[EMAIL PROTECTED]> wrote:
>>
>>
>> Hi guys,
>>
>> Do you think LUCENE-843 is stable enough? If so, do you think it's worth
>> to
>> release it with probably LUCENE 2.2.1? It would be nice so that people
>> can
>> take the advantage of it right away without risking other breaking
>> changes
>> in the HEAD branch or waiting until 2.3 release.
>>
>> Thanks,
>>
>>
> 
> 




Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread testn

Hi,

Why don't you consider using a FieldSelector? LoadFirstFieldSelector can load
only the first field in the document without loading all the fields. After
that, you can keep the whole document if you like. It should help improve
performance.
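
A minimal sketch (Lucene 2.x FieldSelector API; the field name and its
position depend on how the documents were built):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LoadFirstFieldSelector;
import org.apache.lucene.index.IndexReader;

String firstStoredField(IndexReader reader, int docId) throws Exception {
    // Loads only the first stored field of the document, skipping the rest.
    Document doc = reader.document(docId, new LoadFirstFieldSelector());
    return doc.get("contractId");  // assumes contractId was added first
}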



is_maximum wrote:
> 
> yes it decrease the performance but the only solution.
> I've spent many weeks to find best way to retrive my own IDs but find this
> way as last one
> 
> now I am storing the ids in a BitSet structure and it's fast enough
> 
> public void collect(...){
> idBitSet.set(Integer.valueOf(searcher.doc(id).get("MyOwnID")));
> 
> }
> 
> On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote:
>>
>>
>>
>> Hi,
>>
>>The solution you suggested will definitely work but will definitely
>> slow
>> down my search by an order of magnitude. The problem I am trying to solve
>> is
>> performance, thats why I need the collection of IDs and not the whole
>> documents.
>>
>> - thanks for the prompt reply.
>>
>>
>> is_maximum wrote:
>> >
>> > yes if you extend your class from HitCollector and override the
>> collect()
>> > mthod with following signature you can get IDs
>> >
>> > public void collect(int id, float score)
>> >
>> > On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote:
>> >>
>> >>
>> >> Hi all,
>> >>
>> >>Can I get just a list of document Ids given a search criteria ? To
>> >> elaborate here is my situation:
>> >>
>> >> I store 2 contracts in the file system index each with some
>> >> parameterName and Value. Given a search criterion -
>> (paramValue='draft').
>> >> I
>> >> need to get just an ArrayList of Strings containing contract Ids. I
>> dont
>> >> need the lucene documents, just the Ids.
>> >>
>> >> Can this be done ?
>> >>
>> >> -thanks
>> >>
>> >>
>> >>
>> >
>> >
>> > --
>> > Regards,
>> > Mohammad
>> > --
>> > see my blog: http://brainable.blogspot.com/
>> > another in Persian: http://fekre-motefavet.blogspot.com/
>> >
>> >
>>
>>
>>
> 
> 
> -- 
> Regards,
> Mohammad
> --
> see my blog: http://brainable.blogspot.com/
> another in Persian: http://fekre-motefavet.blogspot.com/
> 
> 




Re: Using Nutch APIs in Lucene

2007-08-02 Thread Grant Ingersoll
Just use Nutch.  If you look in the Crawl.java class in Nutch, you  
can pretty easily tear out the appropriate pieces.  Question is, do  
you really need all of that?  If so, why not just use Nutch?


-Grant

On Aug 2, 2007, at 2:32 AM, Srinivasarao Vundavalli wrote:

How can we use Nutch APIs in Lucene? For example, using FetchedSegments we
can get ParseText, from which we can get the content of the document. So can
we use these classes (FetchedSegments, ParseText) in Lucene? If so, how do
we use them?
Thank You


--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






Re: LUCENE-843 Release

2007-08-02 Thread Michael McCandless

Honestly I don't really think this is a good idea.

While LUCENE-843 has proven stable so far (knock on wood!), it is
still a major change and I do worry (less with time :) that maybe I
broke something subtle somewhere.

While a few brave people have tested the trunk in their production
worlds and seen good performance gains, that testing is still limited
compared to a real release.

A point release (2.4.1) really is not supposed to contain major
changes, just bug fixes, and so I don't think we should violate that
accepted practice.

I would rather see us finish up 2.3 and release it, and going forwards
do more frequent releases, instead of porting big changes back onto
point releases.

Mike

"testn" <[EMAIL PROTECTED]> wrote:
> 
> Mike, as a committer, what do you think?
> 
> Thanks!
> 
> 
> Peter Keegan wrote:
> > 
> > I've built a production index with this patch and done some query stress
> > testing with no problems.
> > I'd give it a thumbs up.
> > 
> > Peter
> > 
> > On 7/30/07, testn <[EMAIL PROTECTED]> wrote:
> >>
> >>
> >> Hi guys,
> >>
> >> Do you think LUCENE-843 is stable enough? If so, do you think it's worth
> >> to
> >> release it with probably LUCENE 2.2.1? It would be nice so that people
> >> can
> >> take the advantage of it right away without risking other breaking
> >> changes
> >> in the HEAD branch or waiting until 2.3 release.
> >>
> >> Thanks,
> >> --
> >> View this message in context:
> >> http://www.nabble.com/LUCENE-843-Release-tf4170191.html#a11863644
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>
> >>
> >>
> >>
> > 
> > 
> 



Re: Solr's NumberUtils doesn't work

2007-08-02 Thread testn

How did you encode your integer into a String? Did you use int2sortableStr?
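
For reference, the round trip should look roughly like this
(org.apache.solr.util.NumberUtils; the exact overloads may vary by Solr
version):

String encoded = NumberUtils.int2sortableStr(42);  // encode before indexing
// Decoding only works on strings produced by the matching encoder:
int decoded = NumberUtils.SortableStr2int(encoded, 0, encoded.length());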



is_maximum wrote:
> 
> Hi
> I am using NumberUtils to encode and decode numbers while indexing and
> searching. When I decode a number retrieved from the index, it throws an
> exception for some fields. The exception message is:
> 
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of
> range: 1
> at java.lang.String.charAt(Unknown Source)
> at org.apache.solr.util.NumberUtils.SortableStr2int(NumberUtils.java
> :125)
> at
> org.apache.solr.util.NumberUtils.SortableStr2int(NumberUtils.java:37)
> at com.payvand.lucene.util.ExtendedNumberUtils.decodeInteger(
> ExtendedNumberUtils.java:123)
> 
> 
> I don't know why this happens. I am wondering if it has something to do
> with character encoding. Have you had such a problem?
> 
> thanks
> 
> -- 
> Regards,
> Mohammad Norouzi
> --
> see my blog: http://brainable.blogspot.com/
> another in Persian: http://fekre-motefavet.blogspot.com/
> 
> 




Re: LUCENE-843 Release

2007-08-02 Thread testn

Thanks! Will look forward to 2.3 then.


Michael McCandless-2 wrote:
> 
> 
> Honestly I don't really think this is a good idea.
> 
> While LUCENE-843 has proven stable so far (knock on wood!), it is
> still a major change and I do worry (less with time :) that maybe I
> broke something subtle somewhere.
> 
> While a few brave people have tested the trunk in their production
> worlds and seen good performance gains, that testing is still limited
> compared to a real release.
> 
> A point release (2.4.1) really is not supposed to contain major
> changes, just bug fixes, and so I don't think we should violate that
> accepted practice.
> 
> I would rather see us finish up 2.3 and release it, and going forwards
> do more frequent releases, instead of porting big changes back onto
> point releases.
> 
> Mike
> 
> "testn" <[EMAIL PROTECTED]> wrote:
>> 
>> Mike, as a committer, what do you think?
>> 
>> Thanks!
>> 
>> 
>> Peter Keegan wrote:
>> > 
>> > I've built a production index with this patch and done some query
>> stress
>> > testing with no problems.
>> > I'd give it a thumbs up.
>> > 
>> > Peter
>> > 
>> > On 7/30/07, testn <[EMAIL PROTECTED]> wrote:
>> >>
>> >>
>> >> Hi guys,
>> >>
>> >> Do you think LUCENE-843 is stable enough? If so, do you think it's
>> worth
>> >> to
>> >> release it with probably LUCENE 2.2.1? It would be nice so that people
>> >> can
>> >> take the advantage of it right away without risking other breaking
>> >> changes
>> >> in the HEAD branch or waiting until 2.3 release.
>> >>
>> >> Thanks,
>> >>
>> >>
>> > 
>> > 
>> 
>> 
> 
> 
> 
> 




RE: IndexReader deletes more than expected

2007-08-02 Thread Ridwan Habbal
Yes, you are right, thanks for the great reply! I skimmed it too quickly today,
so I re-read it now and got the point you mean. I just tried Lucene 2.2.0 (I
was using 2.0.0) and I could add, delete and update docs smoothly! Based on the
tests I have done so far, similar to the tests I presented in my first email, I
don't have to worry about who added and who deleted, and I can get rid of the
synchronized Java methods that lead to such slow app performance.
 
I keep maintaining only one open instance of IndexWriter for the whole app. As
I stated before, I suffered from lock exceptions, so I use flush() instead of
close(). In contrast, I create a new IndexSearcher instance every time I
search; I dislike opening, closing and then reopening the index searcher over
and over. I don't use IndexReader directly anymore, since I use it indirectly
through IndexSearcher. I won't try IndexModifier, since you told me that
IndexWriter in 2.2.0 is much better.
Do you think I'm doing well with this way of using IndexWriter (one instance
for the whole app)?
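
(For reference, a minimal sketch of this single-writer pattern, assuming the
Lucene 2.2 API and reusing the names from the earlier code:)

// One long-lived writer for the whole app: mix adds and deletes, then flush.
indexWriter.deleteDocuments(new Term("ID", "1"));
indexWriter.addDocument(getNewElement());
indexWriter.flush();  // searchers opened after this will see the changes

// Per search, open a fresh searcher:
IndexSearcher searcher = new IndexSearcher(this.indexDirectory);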
 
One thing still remains pending, though I need the Compass guys for it: whether
they are using the new version of Lucene yet. I will check with them anyway. I
can't have two different versions of jars for the same classes in the same
package.

Final question: I still haven't looked at Solr in detail, but is it strongly
recommended to use it when I have webapps?
 
please write back! 
 
cya
 
Rid

> Date: Wed, 1 Aug 2007 13:14:04 -0400
> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: Re: IndexReader deletes more than expected
>
> On 8/1/07, Ridwan Habbal <[EMAIL PROTECTED]> wrote:
> > but what about running it on a multithreaded app like a web application?
> > There you are the code:
>
> If you are targeting a multithreaded webapp then I strongly suggest you
> look into using either Solr or the LuceneIndexAccessor code. You will want
> to use some form of reference counting to manage your Readers and Writers.
> Also, you can now use IndexWriter (Lucene 2.0 and greater I think) to
> delete. This allows for efficient mixing of deletes and adds by buffering
> the deletes, and then opening an IndexReader to commit them later. This is
> much more efficient than IndexModifier. - Mark

Re: extracting non-english text from word, pdf, etc....??

2007-08-02 Thread Michael J. Prichard
Yeah, I have seen those.  I guess the question is: what do you all use to
extract text from Word, Excel, PPT and PDF?  Can I use POI, PDFBox and
so on?  This is what I use now to extract English.


Thanks,
Michael

testn wrote:

If you can extract token stream from those files already, you can simply use
different analyzers to analyze those token stream appropriately. Check out
Lucen-contrib analyzers at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/



heybluez wrote:
  

I know how to do english text with POI and PDFBox and so on.  Now, I want
to start indexing non-english language such as french and spanish.  Which
extraction libs are available for me?

I want to do:

Excel
Word
PowerPoint
PDF
HTML
RTF

Thanks!
Michael



Re: extracting non-english text from word, pdf, etc....??

2007-08-02 Thread testn

Check out..
http://wiki.apache.org/lucene-java/LuceneFAQ#head-e7d23f91df094d7baeceb46b04d518dc426d7d2e



heybluez wrote:
> 
> Yea, I have seen those.  I guess the question is what do you all use to 
> extract text from Word, Excel, PPT and PDF?  Can I use POI, PDFBox and 
> so on?  This is what I use now to extract english.
> 
> Thanks,
> Michael
> 
> testn wrote:
>> If you can extract token stream from those files already, you can simply
>> use
>> different analyzers to analyze those token stream appropriately. Check
>> out
>> Lucen-contrib analyzers at
>> http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
>>
>>
>>
>> heybluez wrote:
>>   
>>> I know how to do english text with POI and PDFBox and so on.  Now, I
>>> want
>>> to start indexing non-english language such as french and spanish. 
>>> Which
>>> extraction libs are available for me?
>>>
>>> I want to do:
>>>
>>> Excel
>>> Word
>>> PowerPoint
>>> PDF
>>> HTML
>>> RTF
>>>
>>> Thanks!
>>> Michael
>>>
>>>
>>>
>>>
>>> 
>>
>>   
> 
> 
> 




Re: Do AND + OR Search in Lucene

2007-08-02 Thread Erick Erickson
Alternatively, construct a parenthesized query that
reflects what you want. If you do, make sure that OR is capitalized,
or make REAL SURE you understand the Lucene syntax and construct
your query with that syntax.
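
For example (terms and boost are illustrative; with the default OR operator,
the second clause only adds a small bonus for partial matches):

// parse() throws ParseException
Query q = new QueryParser("content", new StandardAnalyzer())
        .parse("(+java +lucene) (java lucene)^0.1");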

Erick

On 8/2/07, testn <[EMAIL PROTECTED]> wrote:
>
>
> You can create two queries from two query parser, one with AND and the
> other
> one with OR. After you create both of them, you call setBoost() to give
> different boost level and then join them together using BooleanQuery with
> option BooleanClause.Occur.SHOULD. That should do the trick.
>
>
> askarzaidi wrote:
> >
> > Hey Guys,
> >
> > Quick question:
> >
> > I do this in my code for searching:
> >
> > queryParser.setDefaultOperator(QueryParser.Operator.AND);
> >
> > Lucene is OR by default so I change it to AND for my requirements. Now,
> I
> > have a requirement to do OR as well. I mean while doing AND I'd like to
> > include results from OR too ... but they'll be much lower ranked than
> the
> > AND results.
> >
> > Is there a way to do this ?
> >
> > thanks,
> > AZ
> >
> >
>
>
>


Re: extracting non-english text from word, pdf, etc....??

2007-08-02 Thread Grant Ingersoll

Hey Michael,

Have you given it a try?  I would think they would work, but I haven't
actually done it.  Set up a small test that reads in a PDF in French
or Spanish and give it a try.  You might have to worry about
encodings or something, but the structure of the files should be the
same, i.e. they are valid Word, etc. documents.


-Grant

On Aug 2, 2007, at 8:59 AM, Michael J. Prichard wrote:

Yea, I have seen those.  I guess the question is what do you all  
use to extract text from Word, Excel, PPT and PDF?  Can I use POI,  
PDFBox and so on?  This is what I use now to extract english.


Thanks,
Michael

testn wrote:
If you can extract token stream from those files already, you can simply use
different analyzers to analyze those token stream appropriately. Check out
Lucene contrib analyzers at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/




heybluez wrote:

I know how to do english text with POI and PDFBox and so on.  Now, I want
to start indexing non-english language such as french and spanish.  Which
extraction libs are available for me?

I want to do:

Excel
Word
PowerPoint
PDF
HTML
RTF

Thanks!
Michael

 

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: extracting non-english text from word, pdf, etc....??

2007-08-02 Thread Ben Litchfield

In terms of PDF documents...

PDFBox should work just fine with any latin based languages; at this  
time certain PDFs that have CJK characters can pose some issues.  In  
general english/french/spanish should be fine.


Some PDFs use custom encodings that make it impossible to extract text  
and it comes out as gibberish.  As a simple test if Acrobat can  
extract the text then PDFBox should be able to as well.


Ben


Quoting Grant Ingersoll <[EMAIL PROTECTED]>:


Hey Michael,

Have you given it a try?  I would think they would work, but haven't
actually done it.   Setup a small test that reads in a PDF in French or
Spanish and give it a try.  You might have to worry about encodings or
something, but the structure of the files should be the same, i.e. they
are valid Word, etc. documents.

-Grant

On Aug 2, 2007, at 8:59 AM, Michael J. Prichard wrote:

Yea, I have seen those.  I guess the question is what do you all   
use to extract text from Word, Excel, PPT and PDF?  Can I use POI,   
PDFBox and so on?  This is what I use now to extract english.


Thanks,
Michael

testn wrote:
If you can extract the token stream from those files already, you can
simply use different analyzers to analyze those token streams
appropriately. Check out the Lucene contrib analyzers at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/



heybluez wrote:


I know how to do english text with POI and PDFBox and so on.  Now, I want
to start indexing non-english language such as french and spanish.  Which
extraction libs are available for me?

I want to do:

Excel
Word
PowerPoint
PDF
HTML
RTF

Thanks!
Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]











--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Zach Bailey

Hi,

It's been a couple of days now and I haven't heard anything on this 
topic, while there has been substantial list traffic otherwise.


Am I asking in the wrong place? Was I unclear?

I know there are people out there that have used/are using Lucene in a 
clustered environment. I am just looking for any sort of feedback 
(general or specific) about clustering lucene as well as filesystem 
compatibility (windows shares, NFS, etc.).


Thanks again,
-Zach

Zach Bailey wrote:

Hello all,

First a little background - we are developing a clustered application 
that will in part leverage Lucene to provide index and search 
capabilities. We have already spent time investigating various index 
storage implementations (database vs. filesystem) and we've decided for 
performance reasons to go with a filesystem index storage scheme.


That said, I have read back through the archives a bit and noticed that 
the support for index storage on NFS is still experimental (e.g. the 
latest bugfixes have not made it out to an official, stable release). I 
realize most of the issues related to using a shared file system revolve 
around locking, and I haven't seen much about the maturity of locking 
for other network filesystems.


I was wondering if anyone has tried any other networked filesystems or 
had any recommendations. We have clients who would be doing this on both 
Windows and Unix/Linux so any insight there would be appreciated as well 
- it can be assumed that across any cluster the operating system use 
would be homogeneous (i.e. all nodes are on windows and would use 
windows shares, or all nodes are on linux and would use xyz filesystem).


Thanks in advance,
-Zach Bailey



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustered Indexing on common network filesystem

2007-08-02 Thread testn

Why don't you check out Hadoop and Nutch? It should provide what you are
looking for.


Zach Bailey wrote:
> 
> Hi,
> 
> It's been a couple of days now and I haven't heard anything on this 
> topic, while there has been substantial list traffic otherwise.
> 
> Am I asking in the wrong place? Was I unclear?
> 
> I know there are people out there that have used/are using Lucene in a 
> clustered environment. I am just looking for any sort of feedback 
> (general or specific) about clustering lucene as well as filesystem 
> compatibility (windows shares, NFS, etc.).
> 
> Thanks again,
> -Zach
> 
> Zach Bailey wrote:
>> Hello all,
>> 
>> First a little background - we are developing a clustered application 
>> that will in part leverage Lucene to provide index and search 
>> capabilities. We have already spent time investigating various index 
>> storage implementations (database vs. filesystem) and we've decided for 
>> performance reasons to go with a filesystem index storage scheme.
>> 
>> That said, I have read back through the archives a bit and noticed that 
>> the support for index storage on NFS is still experimental (e.g. the 
>> latest bugfixes have not made it out to an official, stable release). I 
>> realize most of the issues related to using a shared file system revolve 
>> around locking, and I haven't seen much about the maturity of locking 
>> for other network filesystems.
>> 
>> I was wondering if anyone has tried any other networked filesystems or 
>> had any recommendations. We have clients who would be doing this on both 
>> Windows and Unix/Linux so any insight there would be appreciated as well 
>> - it can be assumed that across any cluster the operating system use 
>> would be homogeneous (i.e. all nodes are on windows and would use 
>> windows shares, or all nodes are on linux and would use xyz filesystem).
>> 
>> Thanks in advance,
>> -Zach Bailey
>> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Clustered-Indexing-on-common-network-filesystem-tf4194135.html#a11966423
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Zach Bailey

Thanks for your response --

Based on my understanding, hadoop and nutch are essentially the same 
thing, with nutch being derived from hadoop, and are primarily intended 
to be standalone applications.


We are not looking for a standalone application, rather we must use a 
framework to implement search inside our current content management 
application. Currently the application search functionality is designed 
and built around Lucene, so migrating frameworks at this point is not 
feasible.


We are currently re-working our back-end to support clustering (in 
tomcat) and we are looking for information on the migration of Lucene 
from a single node filesystem index (which is what we use now and hope 
to continue to use for clients with a single-node deployment) to a 
shared filesystem index on a mounted network share.


We prefer to use this strategy because it means we do not have to have 
two disparate methods of managing indexes for clients who run in a 
single-node, non-clustered environment versus clients who run in a 
multiple-node, clustered environment.


So, hopefully here are some easy questions someone could shed some light on:

Is this not a recommended method of managing indexes across multiple nodes?

At this point would people recommend storing an individual index on each 
node and propagating index updates via a JMS framework rather than 
attempting to handle it transparently with a single shared index?


Is the Lucene index code so intimately tied to filesystem semantics that 
using a shared/networked file system is infeasible at this point in time?


What would be the quickest time-to-implementation of these strategies 
(JMS vs. shared FS)? The most robust/least error-prone?


I really appreciate any insight or response anyone can provide, even if 
it is a short answer to any of the related topics, "i.e. we implemented 
clustered search using per-node indexing with JMS update propagation and 
it works great", or even something as simple as "don't use a shared 
filesystem at this point".


Cheers,
-Zach

testn wrote:

Why don't you check out Hadoop and Nutch? It should provide what you are
looking for.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Mark Miller

Some quick info:

NFS should work, but I think you'll want to be working off the trunk. 
Also, sharing an index over NFS is supposed to be slow. The standard so 
far, if you are not partitioning the index, is to use a unix/linux 
filesystem and hardlinks + rsync to efficiently share index changes 
across nodes (hard links for instant copy, rsync to only transfer 
changed index files, search the mailing list). If you look at solr you 
can see scripts that give an example of this. I don't think the scripts 
rely on solr. This kind of setup should be quick and simple to 
implement. Same with NFS. An RMI solution that allowed for index 
partitioning would probably be the longest to do.


-Mark



Zach Bailey wrote:

Thanks for your response --

Based on my understanding, hadoop and nutch are essentially the same 
thing, with nutch being derived from hadoop, and are primarily 
intended to be standalone applications.


We are not looking for a standalone application, rather we must use a 
framework to implement search inside our current content management 
application. Currently the application search functionality is 
designed and built around Lucene, so migrating frameworks at this 
point is not feasible.


We are currently re-working our back-end to support clustering (in 
tomcat) and we are looking for information on the migration of Lucene 
from a single node filesystem index (which is what we use now and hope 
to continue to use for clients with a single-node deployment) to a 
shared filesystem index on a mounted network share.


We prefer to use this strategy because it means we do not have to have 
two disparate methods of managing indexes for clients who run in a 
single-node, non-clustered environment versus clients who run in a 
multiple-node, clustered environment.


So, hopefully here are some easy questions someone could shed some 
light on:


Is this not a recommended method of managing indexes across multiple 
nodes?


At this point would people recommend storing an individual index on 
each node and propagating index updates via a JMS framework rather 
than attempting to handle it transparently with a single shared index?


Is the Lucene index code so intimately tied to filesystem semantics 
that using a shared/networked file system is infeasible at this point 
in time?


What would be the quickest time-to-implementation of these strategies 
(JMS vs. shared FS)? The most robust/least error-prone?


I really appreciate any insight or response anyone can provide, even 
if it is a short answer to any of the related topics, "i.e. we 
implemented clustered search using per-node indexing with JMS update 
propagation and it works great", or even something as simple as "don't 
use a shared filesystem at this point".


Cheers,
-Zach

testn wrote:

Why don't you check out Hadoop and Nutch? It should provide what you are
looking for.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



How do YOU detect corrupt indexes?

2007-08-02 Thread Joe R

Hello,

I've been asked to devise some way to discover and correct data in Lucene
indexes that have been "corrupted."  The word "corrupt", in this case, has a
few different meanings, some of which strike me as exceedingly difficult to
grok.  What concerns me are the cases where we don't know that an index has
been changed:  A bit error in a stored field, for instance, is a form of
corruption that we (ideally) should be able to identify, at the very least, and
hopefully correct.  This case seems particularly onerous, since
it isn't going to throw an exception of any sort, at any time.

We have a fairly good handle on how to remedy problems that throw exceptions,
so we should be alright with corruption where (say) an operator logs in and
overwrites a file.

I'm wondering how other Lucene users have tackled this problem in the past. 
Calculating checksums on the documents seems like one way to go about it:
compute a checksum on the document and, in a background thread, compare the
checksum to the data.  Unfortunately we're building a large, federated system
and it would take months to exhaustively check every document this way. 
Checksumming the files themselves might be too much: We're storing gigabytes of
data per index and there is some churn to the data; in other words, the
overhead for this method might be too high.
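
A minimal sketch of the per-document route, assuming the checksum is stored
alongside the document in a hypothetical "crc" field; a background thread
could then recompute the value from the stored fields and compare:

import java.util.zip.CRC32;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DocumentChecksum {
    // compute a CRC32 over the stored fields we want to guard and
    // store it in the document itself at indexing time
    public static void addChecksum(Document doc, String[] guardedFields) {
        CRC32 crc = new CRC32();
        for (int i = 0; i < guardedFields.length; i++) {
            String value = doc.get(guardedFields[i]);
            if (value != null) {
                crc.update(value.getBytes());
            }
        }
        doc.add(new Field("crc", Long.toHexString(crc.getValue()),
                Field.Store.YES, Field.Index.UN_TOKENIZED));
    }
}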

Thanks for any help you might have.


-Joseph Rose



   


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Rajesh parab
One more alternative, though I am not sure if anyone
is using it.

Apache Compass has added a plug-in to allow storing
Lucene index files inside the database. This should
work in clustered environment as all nodes will share
the same database instance.

I am not sure the impact it will have on performance.

Is anyone using DB for index storage? Any drawbacks of
this approach?

Regards,
Rajesh

--- Zach Bailey <[EMAIL PROTECTED]> wrote:

> Thanks for your response --
> 
> Based on my understanding, hadoop and nutch are
> essentially the same 
> thing, with nutch being derived from hadoop, and are
> primarily intended 
> to be standalone applications.
> 
> We are not looking for a standalone application,
> rather we must use a 
> framework to implement search inside our current
> content management 
> application. Currently the application search
> functionality is designed 
> and built around Lucene, so migrating frameworks at
> this point is not 
> feasible.
> 
> We are currently re-working our back-end to support
> clustering (in 
> tomcat) and we are looking for information on the
> migration of Lucene 
> from a single node filesystem index (which is what
> we use now and hope 
> to continue to use for clients with a single-node
> deployment) to a 
> shared filesystem index on a mounted network share.
> 
> We prefer to use this strategy because it means we
> do not have to have 
> two disparate methods of managing indexes for
> clients who run in a 
> single-node, non-clustered environment versus
> clients who run in a 
> multiple-node, clustered environment.
> 
> So, hopefully here are some easy questions someone
> could shed some light on:
> 
> Is this not a recommended method of managing indexes
> across multiple nodes?
> 
> At this point would people recommend storing an
> individual index on each 
> node and propagating index updates via a JMS
> framework rather than 
> attempting to handle it transparently with a single
> shared index?
> 
> Is the Lucene index code so intimately tied to
> filesystem semantics that 
> using a shared/networked file system is infeasible
> at this point in time?
> 
> What would be the quickest time-to-implementation of
> these strategies 
> (JMS vs. shared FS)? The most robust/least
> error-prone?
> 
> I really appreciate any insight or response anyone
> can provide, even if 
> it is a short answer to any of the related topics,
> "i.e. we implemented 
> clustered search using per-node indexing with JMS
> update propagation and 
> it works great", or even something as simple as
> "don't use a shared 
> filesystem at this point".
> 
> Cheers,
> -Zach
> 
> testn wrote:
> > Why don't you check out Hadoop and Nutch? It
> should provide what you are
> > looking for.
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 
> 



   


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Zach Bailey

Mark,

Thanks so much for your response.

Unfortunately, I am not sure the leader of the project would feel good 
about running code from trunk without an explicit endorsement from 
a majority of the developers or contributors for that particular code 
(do those people keep up with this list, anyway?). Is there any word on 
a possible timeframe for releasing the code required to work with NFS?


Thanks for your other insight about hardlinks and rsync. I will look 
into that; unfortunately it does not cover our userbase who may be 
clustering in a Windows Server environment. I still have not heard/seen 
any evidence (anecdotal or otherwise) about how well lucene might work 
sharing indexes over a mounted Windows share.


-Zach

Mark Miller wrote:

Some quick info:

NFS should work, but I think you'll want to be working off the trunk. 
Also, sharing an index over NFS is supposed to be slow. The standard so 
far, if you are not partitioning the index, is to use a unix/linux 
filesystem and hardlinks + rsync to efficiently share index changes 
across nodes (hard links for instant copy, rsync to only transfer 
changed index files, search the mailing list). If you look at solr you 
can see scripts that give an example of this. I don't think the scripts 
rely on solr. This kind of setup should be quick and simple to 
implement. Same with NFS. An RMI solution that allowed for index 
partitioning would probably be the longest to do.


-Mark



Zach Bailey wrote:

Thanks for your response --

Based on my understanding, hadoop and nutch are essentially the same 
thing, with nutch being derived from hadoop, and are primarily 
intended to be standalone applications.


We are not looking for a standalone application, rather we must use a 
framework to implement search inside our current content management 
application. Currently the application search functionality is 
designed and built around Lucene, so migrating frameworks at this 
point is not feasible.


We are currently re-working our back-end to support clustering (in 
tomcat) and we are looking for information on the migration of Lucene 
from a single node filesystem index (which is what we use now and hope 
to continue to use for clients with a single-node deployment) to a 
shared filesystem index on a mounted network share.


We prefer to use this strategy because it means we do not have to have 
two disparate methods of managing indexes for clients who run in a 
single-node, non-clustered environment versus clients who run in a 
multiple-node, clustered environment.


So, hopefully here are some easy questions someone could shed some 
light on:


Is this not a recommended method of managing indexes across multiple 
nodes?


At this point would people recommend storing an individual index on 
each node and propagating index updates via a JMS framework rather 
than attempting to handle it transparently with a single shared index?


Is the Lucene index code so intimately tied to filesystem semantics 
that using a shared/networked file system is infeasible at this point 
in time?


What would be the quickest time-to-implementation of these strategies 
(JMS vs. shared FS)? The most robust/least error-prone?


I really appreciate any insight or response anyone can provide, even 
if it is a short answer to any of the related topics, "i.e. we 
implemented clustered search using per-node indexing with JMS update 
propagation and it works great", or even something as simple as "don't 
use a shared filesystem at this point".


Cheers,
-Zach

testn wrote:

Why don't you check out Hadoop and Nutch? It should provide what you are
looking for.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Zach Bailey

Rajesh,

I forgot to mention this, but we did investigate this option as well and 
even prototyped it for an internal project. It ended up being too slow 
for us.


It was adding a lot of overhead even to small updates, IIRC, mainly due 
to the fact that the index was essentially stored as a filesystem in the 
database. As you can probably imagine, using a database as a filesystem 
is not very performant.


Rajesh parab wrote:

One more alternative, though I am not sure if anyone
is using it.

Apache Compass has added a plug-in to allow storing
Lucene index files inside the database. This should
work in clustered environment as all nodes will share
the same database instance.

I am not sure the impact it will have on performance.

Is anyone using DB for index storage? Any drawbacks of
this approach?

Regards,
Rajesh

--- Zach Bailey <[EMAIL PROTECTED]> wrote:


Thanks for your response --

Based on my understanding, hadoop and nutch are
essentially the same 
thing, with nutch being derived from hadoop, and are
primarily intended 
to be standalone applications.


We are not looking for a standalone application,
rather we must use a 
framework to implement search inside our current
content management 
application. Currently the application search
functionality is designed 
and built around Lucene, so migrating frameworks at
this point is not 
feasible.


We are currently re-working our back-end to support
clustering (in 
tomcat) and we are looking for information on the
migration of Lucene 
from a single node filesystem index (which is what
we use now and hope 
to continue to use for clients with a single-node
deployment) to a 
shared filesystem index on a mounted network share.


We prefer to use this strategy because it means we
do not have to have 
two disparate methods of managing indexes for
clients who run in a 
single-node, non-clustered environment versus
clients who run in a 
multiple-node, clustered environment.


So, hopefully here are some easy questions someone
could shed some light on:

Is this not a recommended method of managing indexes
across multiple nodes?

At this point would people recommend storing an
individual index on each 
node and propagating index updates via a JMS
framework rather than 
attempting to handle it transparently with a single

shared index?

Is the Lucene index code so intimately tied to
filesystem semantics that 
using a shared/networked file system is infeasible

at this point in time?

What would be the quickest time-to-implementation of
these strategies 
(JMS vs. shared FS)? The most robust/least

error-prone?

I really appreciate any insight or response anyone
can provide, even if 
it is a short answer to any of the related topics,
"i.e. we implemented 
clustered search using per-node indexing with JMS
update propagation and 
it works great", or even something as simple as
"don't use a shared 
filesystem at this point".


Cheers,
-Zach

testn wrote:

Why don't you check out Hadoop and Nutch? It

should provide what you are

looking for.



-

To unsubscribe, e-mail:
[EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]






   




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Michael McCandless

I have been meaning to write up a Wiki page on this general topic but
have not quite made time yet ...

Sharing an index with a shared filesystem will work, however there are
some caveats:

  * This is somewhat uncharted territory because it's fairly recent
fixes to Lucene that have enabled the things below to work, and,
it's not a heavily tested area.  Please share your experience so
we all can learn...

  * If the filesystem does not protect against deletion of open files
(notably NFS does not, however SMB/CIFS does) then you will need
to create a custom DeletionPolicy based on your app logic so
writer & readers "agree" on when it's safe to delete prior commit
points.

This can be something simple like "readers always refresh at least
once per hour so any commit point older than 1 hour may be safely
deleted" (see the sketch after this list).

  * Locking: if your app can ensure only one writer is active at a
time, you can disable locking in Lucene entirely.  Else, it's best
to use NativeFSLockFactory, if you can.

  * If you are using a filesystem that does not have coherent caching
on directory listing (NFS clients often do not), and, different
nodes can "become" the writer (vs a single dedicated writer node)
then there is one known open issue that you'll hit once you make
your own DeletionPolicy which I still have to port to trunk:

  http://issues.apache.org/jira/browse/LUCENE-948
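
A minimal sketch of such an age-based policy, assuming the 2.2-era
IndexDeletionPolicy/IndexCommitPoint API (the commit list is ordered oldest
to newest, and the newest commit must always survive):

import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.index.IndexCommitPoint;
import org.apache.lucene.index.IndexDeletionPolicy;
import org.apache.lucene.store.Directory;

public class AgeBasedDeletionPolicy implements IndexDeletionPolicy {
    private final Directory dir;
    private final long expirationMillis;

    public AgeBasedDeletionPolicy(Directory dir, long expirationMillis) {
        this.dir = dir;
        this.expirationMillis = expirationMillis;
    }

    public void onInit(List commits) {
        onCommit(commits);
    }

    public void onCommit(List commits) {
        // never delete the most recent commit point
        IndexCommitPoint newest = (IndexCommitPoint) commits.get(commits.size() - 1);
        long now = System.currentTimeMillis();
        for (Iterator it = commits.iterator(); it.hasNext();) {
            IndexCommitPoint commit = (IndexCommitPoint) it.next();
            if (commit == newest) {
                continue;
            }
            try {
                // age a commit by the timestamp of its segments file
                if (now - dir.fileModified(commit.getSegmentsFileName()) > expirationMillis) {
                    commit.delete();
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }
}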

But as Mark said, performance is likely quite poor and so you may want
to take an approach like Solr (or, use Solr) whereby a single writer
makes changes to the index.  Then these changes are efficiently
propagated to multiple hosts (hard link & rsync is one way but not the
only way), and these hosts then search their private copy via their
local filesystem.

Mike

"Zach Bailey" <[EMAIL PROTECTED]> wrote:
> Mark,
> 
> Thanks so much for your response.
> 
> Unfortunately, I am not sure the leader of the project would feel good 
> about running code from trunk without an explicit endorsement from 
> a majority of the developers or contributors for that particular code 
> (do those people keep up with this list, anyway?) Is there any word on 
> the possible timeframe the code required to work with NFS might be
> released?
> 
> Thanks for your other insight about hardlinks and rsync. I will look 
> into that; unfortunately it does not cover our userbase who may be 
> clustering in a Windows Server environment. I still have not heard/seen 
> any evidence (anecdotal or otherwise) about how well lucene might work 
> sharing indexes over a mounted Windows share.
> 
> -Zach
> 
> Mark Miller wrote:
> > Some quick info:
> > 
> > NFS should work, but I think you'll want to be working off the trunk. 
> > Also, sharing an index over NFS is supposed to be slow. The standard so 
> > far, if you are not partitioning the index, is to use a unix/linux 
> > filesystem and hardlinks + rsync to efficiently share index changes 
> > across nodes (hard links for instant copy, rsync to only transfer 
> > changed index files, search the mailing list). If you look at solr you 
> > can see scripts that give an example of this. I don't think the scripts 
> > rely on solr. This kind of setup should be quick and simple to 
> > implement. Same with NFS. An RMI solution that allowed for index 
> > partitioning would probably be the longest to do.
> > 
> > -Mark
> > 
> > 
> > 
> > Zach Bailey wrote:
> >> Thanks for your response --
> >>
> >> Based on my understanding, hadoop and nutch are essentially the same 
> >> thing, with nutch being derived from hadoop, and are primarily 
> >> intended to be standalone applications.
> >>
> >> We are not looking for a standalone application, rather we must use a 
> >> framework to implement search inside our current content management 
> >> application. Currently the application search functionality is 
> >> designed and built around Lucene, so migrating frameworks at this 
> >> point is not feasible.
> >>
> >> We are currently re-working our back-end to support clustering (in 
> >> tomcat) and we are looking for information on the migration of Lucene 
> >> from a single node filesystem index (which is what we use now and hope 
> >> to continue to use for clients with a single-node deployment) to a 
> >> shared filesystem index on a mounted network share.
> >>
> >> We prefer to use this strategy because it means we do not have to have 
> >> two disparate methods of managing indexes for clients who run in a 
> >> single-node, non-clustered environment versus clients who run in a 
> >> multiple-node, clustered environment.
> >>
> >> So, hopefully here are some easy questions someone could shed some 
> >> light on:
> >>
> >> Is this not a recommended method of managing indexes across multiple 
> >> nodes?
> >>
> >> At this point would people recommend storing an individual index on 
> >> each node and propagating index updates via a JMS framework rather 
> >> than attempting to ha

Re: Clustered Indexing on common network filesystem

2007-08-02 Thread Michael McCandless

"Zach Bailey" <[EMAIL PROTECTED]> wrote:

> Unfortunately, I am not sure the leader of the project would feel good 
> about running code from trunk without an explicit endorsement from 
> a majority of the developers or contributors for that particular code 
> (do those people keep up with this list, anyway?) Is there any word on 
> the possible timeframe the code required to work with NFS might be
> released?

This person does keep up with the list :)

On timeframe ... there are tentative discussions now on the dev
list on releasing 2.3 in a few months' time, but by no means is
this a hard schedule.  I'll make sure LUCENE-948 is included in 2.3.

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread Mark Miller
If you are just retrieving your custom id and you have more stored 
fields (and they are not tiny) you certainly do want to use a field 
selector. I would suggest SetBasedFieldSelector.


- Mark
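
A minimal sketch of the field-selector route (the stored-field name
"MyOwnID" follows the code quoted below and is illustrative):

import java.io.IOException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.SetBasedFieldSelector;
import org.apache.lucene.index.IndexReader;

public class IdLoader {
    // load only the "MyOwnID" stored field; every other stored field is skipped
    public static String loadId(IndexReader reader, int docId) throws IOException {
        Set toLoad = new HashSet();
        toLoad.add("MyOwnID");
        FieldSelector selector =
            new SetBasedFieldSelector(toLoad, Collections.EMPTY_SET);
        Document doc = reader.document(docId, selector);
        return doc.get("MyOwnID");
    }
}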

testn wrote:

Hi,

Why don't you consider using FieldSelector? LoadFirstFieldSelector has the
ability to help you load only the first field in the document without
loading all the fields. After that, you can keep the whole document if you
like. It should help improve performance.



is_maximum wrote:
  

yes, it decreases performance, but it's the only solution.
I've spent many weeks trying to find the best way to retrieve my own IDs,
and this is the one I settled on.

now I am storing the ids in a BitSet structure and it's fast enough:

public void collect(int id, float score) {
    try {
        // note: loading the document inside collect() is the slow part
        idBitSet.set(Integer.parseInt(searcher.doc(id).get("MyOwnID")));
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote:



Hi,

   The solution you suggested will definitely work, but it will slow down
my search by an order of magnitude. The problem I am trying to solve is
performance; that's why I need the collection of IDs and not the whole
documents.

- thanks for the prompt reply.


is_maximum wrote:
  

yes, if you extend your class from HitCollector and override the collect()
method with the following signature, you can get the IDs:

public void collect(int id, float score)

On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote:


Hi all,

   Can I get just a list of document Ids given a search criterion? To
elaborate, here is my situation:

I store 2 contracts in the file system index, each with some parameterName
and value. Given a search criterion (paramValue='draft'), I need to get
just an ArrayList of Strings containing contract Ids. I don't need the
Lucene documents, just the Ids.

Can this be done ?

-thanks

--
View this message in context:

  

http://www.nabble.com/Getting-only-the-Ids%2C-not-the-whole-documents.-tf4204907.html#a11960750
  

Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  

--
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/




--
View this message in context:
http://www.nabble.com/Getting-only-the-Ids%2C-not-the-whole-documents.-tf4204907.html#a11961159
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  

--
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/





  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Getting only the Ids, not the whole documents.

2007-08-02 Thread Daniel Noll
On Thursday 02 August 2007 19:28:48 Mohammad Norouzi wrote:
> you should not store them in an Array structure since they will take up the
> memory.
> the BitSet is the best structure to store them

You can't store strings in a BitSet.

What I would do is return a List but make a custom subclass of 
AbstractList which creates the strings on demand from the Hits 
object.  We use similar tricks to convert Hits into a List of another more 
complex object type and it works great.  You can cache the strings as they're 
retrieved if you're planning to use some strings much more than others.
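
A minimal sketch of such a lazy list, assuming the ids live in a stored
field (called "id" here for illustration):

import java.io.IOException;
import java.util.AbstractList;

import org.apache.lucene.search.Hits;

public class HitIdList extends AbstractList {
    private final Hits hits;
    private final String idField;

    public HitIdList(Hits hits, String idField) {
        this.hits = hits;
        this.idField = idField;
    }

    // each string is fetched from the index only when it is asked for
    public Object get(int index) {
        try {
            return hits.doc(index).get(idField);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public int size() {
        return hits.length();
    }
}

Wrapping Hits this way costs nothing up front; add a cache in get() if some
entries are read much more often than others.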

Daniel


-- 
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/   Fax: +61 2 9212 6902

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Can I do boosting based on term postions?

2007-08-02 Thread Shailendra Sharma
I am working on the SpanTermQuery implementation for you; give me today.
Sorry, I was out in meetings for 2 days.

Enjoy,
Shailendra

On 8/3/07, Cedric Ho <[EMAIL PROTECTED]> wrote:
>
> Hi Paul,
>
> Doesn't SpanFirstQuery only match terms with position less than a
> certain end position?
>
> I am rather looking for a query that would score a document higher when
> terms appear near the start but not totally discard those whose terms
> appear near the end.
>
> Regards,
> Cedric
>
> On 8/2/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > Cedric,
> >
> > SpanFirstQuery could be a solution without payloads.
> > You may want to give it your own Similarity.sloppyFreq() .
> >
> > Regards,
> > Paul Elschot
> >
> > On Thursday 02 August 2007 04:07, Cedric Ho wrote:
> > > Thanks for the quick response =)
> > >
> > > On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote:
> > > > Yes, it is easily doable through "Payload" facility. During indexing
> > process
> > > > (mainly tokenization), you need to push this extra information in
> each
> > > > token. And then you can use BoostingTermQuery for using Payload
> value to
> > > > include Payload in the score. You also need to implement Similarity
> for
> > this
> > > > (mainly scorePayload method).
> > >
> > > If I store, say a custom boost factor as Payload, does it means that I
> > > will store one more byte per term per document in the index file? So
> > > the index file would be much larger?
> > >
> > > >
> > > > Other way can be to extend SpanTermQuery, this already calculates
> the
> > > > position of match. You just need to do something to use this
> position
> > value
> > > > in the score calculation.
> > >
> > > I see that SpanTermQuery takes a TermPositions from the indexReader
> > > and I can get the term position from there. However I am not sure how
> > > to incorporate it into the score calculation. Would you mind give a
> > > little more detail on this?
> > >
> > > >
> > > > One possible advantage of SpanTermQuery approach is that you can
> play
> > > > around, without re-creating indices everytime.
> > > >
> > > > Thanks,
> > > > Shailendra Sharma,
> > > > CTO, Ver se' Innovation Pvt. Ltd.
> > > > Bangalore, India
> > > >
> > > > On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I was wondering if it is possible to do boosting by search terms'
> > > > > position in the document.
> > > > >
> > > > > for example:
> > > > > search terms appear in the first 100 words, or first 10% words, or
> in
> > > > > first two paragraphs would be given higher score.
> > > > >
> > > > > Is it achievable through using the new Payload function in lucene
> 2.2?
> > > > > Or are there any easier ways to achieve these ?
> > > > >
> > > > >
> > > > > Regards,
> > > > > Cedric
> > > > >
> > > > >
> -
> > > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > > >
> > > > >
> > > >
> > >
> > > Thanks,
> > > Cedric
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> > >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
> --
> [EMAIL PROTECTED]
>
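
A minimal sketch of both halves of the payload approach discussed above.
The indexing half pushes a one-byte boost into each token (the 100-token
cutoff and the byte values are illustrative); the scoring half assumes the
trunk-era BoostingTermQuery/Similarity.scorePayload() API mentioned in this
thread, whose exact signature may differ in your version:

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;
import org.apache.lucene.search.DefaultSimilarity;

// indexing half: store a 1-byte boost in each token's payload,
// larger for tokens near the start of the document
class PositionBoostFilter extends TokenFilter {
    private int pos = 0;

    public PositionBoostFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t != null) {
            byte boost = (byte) (pos < 100 ? 4 : 1);
            t.setPayload(new Payload(new byte[] { boost }));
            pos += t.getPositionIncrement();
        }
        return t;
    }
}

// scoring half: hand the stored byte back as the payload score
// (assumed trunk-era signature; adjust to your Similarity version)
class PositionBoostSimilarity extends DefaultSimilarity {
    public float scorePayload(String fieldName, byte[] payload,
                              int offset, int length) {
        return length > 0 ? payload[offset] : 1.0f;
    }
}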


Re: Can I do boosting based on term postions?

2007-08-02 Thread Cedric Ho
Hi Paul,

Doesn't SpanFirstQuery only match terms with position less than a
certain end position?

I am rather looking for a query that would score a document higher when
terms appear near the start but not totally discard those whose terms
appear near the end.

Regards,
Cedric

On 8/2/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
> Cedric,
>
> SpanFirstQuery could be a solution without payloads.
> You may want to give it your own Similarity.sloppyFreq() .
>
> Regards,
> Paul Elschot
>
> On Thursday 02 August 2007 04:07, Cedric Ho wrote:
> > Thanks for the quick response =)
> >
> > On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote:
> > > Yes, it is easily doable through "Payload" facility. During indexing
> process
> > > (mainly tokenization), you need to push this extra information in each
> > > token. And then you can use BoostingTermQuery for using Payload value to
> > > include Payload in the score. You also need to implement Similarity for
> this
> > > (mainly scorePayload method).
> >
> > If I store, say a custom boost factor as Payload, does it means that I
> > will store one more byte per term per document in the index file? So
> > the index file would be much larger?
> >
> > >
> > > Other way can be to extend SpanTermQuery, this already calculates the
> > > position of match. You just need to do something to use this position
> value
> > > in the score calculation.
> >
> > I see that SpanTermQuery takes a TermPositions from the indexReader
> > and I can get the term position from there. However I am not sure how
> > to incorporate it into the score calculation. Would you mind give a
> > little more detail on this?
> >
> > >
> > > One possible advantage of SpanTermQuery approach is that you can play
> > > around, without re-creating indices everytime.
> > >
> > > Thanks,
> > > Shailendra Sharma,
> > > CTO, Ver se' Innovation Pvt. Ltd.
> > > Bangalore, India
> > >
> > > On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I was wondering if it is possible to do boosting by search terms'
> > > > position in the document.
> > > >
> > > > for example:
> > > > search terms appear in the first 100 words, or first 10% words, or in
> > > > first two paragraphs would be given higher score.
> > > >
> > > > Is it achievable through using the new Payload function in lucene 2.2?
> > > > Or are there any easier ways to achieve these ?
> > > >
> > > >
> > > > Regards,
> > > > Cedric
> > > >
> > > > -
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > >
> > > >
> > >
> >
> > Thanks,
> > Cedric
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-- 
[EMAIL PROTECTED]


Performance improvements using writer.delete vs reader.delete

2007-08-02 Thread Andreas Knecht

Hi,

We're considering using the new IndexWriter.deleteDocuments call rather 
than the IndexReader.delete call.  Are there any performance 
improvements that this may provide, other than the benefit of not having 
to switch between readers/writers?


We've looked at LUCENE-565, but there's no clear view of performance 
enhancements over the old IndexReader call.


Cheers,
Andreas

--
ATLASSIAN
Our products help over 7,000 organisations in more than 88 countries to 
collaborate. http://www.atlassian.com/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Performance improvements using writer.delete vs reader.delete

2007-08-02 Thread Doron Cohen
Andreas Knecht wrote:

> We're considering using the new IndexWriter.deleteDocuments call rather
> than the IndexReader.delete call.  Are there any performance
> improvements that this may provide, other than the benefit of not having
> to switch between readers/writers?
>
> We've looked at LUCENE-565, but there's no clear view of performance
> enhancements over the old IndexReader call.

I think Yonik's comment in 565 holds here -
http://issues.apache.org/jira/browse/LUCENE-565#action_12432155
- if your application is already buffering deletes/updates and
then batching the deletes, you probably won't see a large
improvement. But if your application does not buffer
the deletes and does not batch them, then I believe
moving to IndexWriter.delete() (and update()) should
buy you performance improvement, because IndexWriter
would now buffer the deletes for you.
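
A minimal sketch of the buffered style (the "id" field and the literal
values are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class BufferedDeletes {
    // deletes and updates go through the one open writer; no more
    // closing the writer just to open a reader for deletes
    public static void replaceDoc(String path, Document newDoc) throws Exception {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
        writer.deleteDocuments(new Term("id", "42"));        // buffered delete
        writer.updateDocument(new Term("id", "43"), newDoc); // delete + add in one call
        writer.close();                                      // flushes buffered deletes
    }
}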


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How do YOU detect corrupt indexes?

2007-08-02 Thread Dmitry
Not sure how exactly to understand "corrupted" indexes: in the sense that one
could not read/use the indexes, or something else?


thanks
DT
www.ejinz.com
EjinZ Search Engine

- Original Message - 
From: "Doron Cohen" <[EMAIL PROTECTED]>

To: 
Sent: Friday, August 03, 2007 1:03 AM
Subject: Re: How do YOU detect corrupt indexes?



What is the anticipated cause of corruption? Malicious?
Hardware fault? This somewhat reminds of discussions in
the list about encrypting the index. See LUCENE-737
and a discussion pointed by it. One of the opinions
there was that encryption should be handled at a lower
level (OS/FS). Wouldn't that hold here as well?

Joe R wrote:



Hello,

I've been asked to devise some way to discover and correct data in Lucene
indexes that have been "corrupted."  The word "corrupt", in
this case, has a
few different meanings, some of which strike me as exceedingly
difficult to
grok.  What concerns me are the cases where we don't know that
an index has
been changed:  A bit error in a stored field, for instance, is a form of
corruption that we (ideally) should be able to identify, at the
very least, and
hopefully correct.  This case in particular seems particularly
onerous, since
this isn't going to throw an exception of any sort, any time.

We have a fairly good handle on how to remedy problems that
throw exceptions,
so we should be alright with corruption where (say) an operator
logs in and
overwrites a file.

I'm wondering how other Lucene users have tackled this problem
in the past.
Calculating checksums on the documents seems like one way to go about it:
compute a checksum on the document and, in a background thread,
compare the
checksum to the data.  Unfortunately we're building a large,
federated system
and it would take months to exhaustively check every document this way.
Checksumming the files themselves might be too much: We're
storing gigabytes of
data per index and there is some churn to the data; in other words, the
overhead for this method might be too high.

Thanks for any help you might have.


-Joseph Rose



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How do YOU detect corrupt indexes?

2007-08-02 Thread Doron Cohen
What is the anticipated cause of corruption? Malicious?
Hardware fault? This somewhat reminds of discussions in
the list about encrypting the index. See LUCENE-737
and a discussion pointed by it. One of the opinions
there was that encryption should be handled at a lower
level (OS/FS). Wouldn't that hold here as well?

Joe R wrote:

>
> Hello,
>
> I've been asked to devise some way to discover and correct data in Lucene
> indexes that have been "corrupted."  The word "corrupt", in
> this case, has a
> few different meanings, some of which strike me as exceedingly
> difficult to
> grok.  What concerns me are the cases where we don't know that
> an index has
> been changed:  A bit error in a stored field, for instance, is a form of
> corruption that we (ideally) should be able to identify, at the
> very least, and
> hopefully correct.  This case in particular seems particularly
> onerous, since
> this isn't going to throw an exception of any sort, any time.
>
> We have a fairly good handle on how to remedy problems that
> throw exceptions,
> so we should be alright with corruption where (say) an operator
> logs in and
> overwrites a file.
>
> I'm wondering how other Lucene users have tackled this problem
> in the past.
> Calculating checksums on the documents seems like one way to go about it:
> compute a checksum on the document and, in a background thread,
> compare the
> checksum to the data.  Unfortunately we're building a large,
> federated system
> and it would take months to exhaustively check every document this way.
> Checksumming the files themselves might be too much: We're
> storing gigabytes of
> data per index and there is some churn to the data; in other words, the
> overhead for this method might be too high.
>
> Thanks for any help you might have.
>
>
> -Joseph Rose


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How do YOU detect corrupt indexes?

2007-08-02 Thread Daniel Noll
On Friday 03 August 2007 16:03:22 Doron Cohen wrote:
> What is the anticipated cause of corruption? Malicious?
> Hardware fault? This somewhat reminds of discussions in
> the list about encrypting the index. See LUCENE-737
> and a discussion pointed by it. One of the opinions
> there was that encryption should be handled at a lower
> level (OS/FS). Wouldn't that hold here as well?

That's actually a good point.  These days we have filesystems like ZFS which 
check for corruption automatically.  This should remove a lot of the extra 
digesting work people would otherwise need to do to ensure consistency.

Daniel


-- 
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/   Fax: +61 2 9212 6902

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]