Re: Can I do boosting based on term positions?
Cedric,

SpanFirstQuery could be a solution without payloads. You may want to give it your own Similarity.sloppyFreq().

Regards,
Paul Elschot

On Thursday 02 August 2007 04:07, Cedric Ho wrote:
> Thanks for the quick response =)
>
> On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote:
> > Yes, it is easily doable through the "Payload" facility. During the
> > indexing process (mainly tokenization), you need to push this extra
> > information into each token. Then you can use BoostingTermQuery to
> > include the payload value in the score. You also need to implement
> > Similarity for this (mainly the scorePayload method).
>
> If I store, say, a custom boost factor as a payload, does that mean I
> will store one more byte per term per document in the index file? So
> the index file would be much larger?
>
> > Another way is to extend SpanTermQuery, which already calculates the
> > position of a match. You just need to use this position value in the
> > score calculation.
>
> I see that SpanTermQuery takes a TermPositions from the IndexReader,
> and I can get the term position from there. However, I am not sure how
> to incorporate it into the score calculation. Would you mind giving a
> little more detail on this?
>
> > One possible advantage of the SpanTermQuery approach is that you can
> > play around without re-creating indices every time.
> >
> > Thanks,
> > Shailendra Sharma,
> > CTO, Ver se' Innovation Pvt. Ltd.
> > Bangalore, India
> >
> > On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote:
> > > Hi all,
> > >
> > > I was wondering if it is possible to do boosting by the search
> > > terms' position in the document. For example: search terms that
> > > appear in the first 100 words, the first 10% of words, or the first
> > > two paragraphs would be given a higher score.
> > >
> > > Is it achievable using the new Payload function in Lucene 2.2?
> > > Or are there any easier ways to achieve this?
> > >
> > > Regards,
> > > Cedric
>
> Thanks,
> Cedric
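To make that concrete, a custom Similarity for the SpanFirstQuery route could look roughly like the sketch below (Lucene 2.x API). The decay formula is an illustrative assumption, and exactly what gets passed as the distance depends on the span scorer, so treat this as a starting point rather than Paul's exact recipe:

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch: reward tighter/earlier span matches by steepening sloppyFreq().
    // The default implementation returns 1/(distance+1); this variant decays
    // faster, so matches with a small distance dominate the score.
    public class EarlyMatchSimilarity extends DefaultSimilarity {
        public float sloppyFreq(int distance) {
            return 1.0f / ((float) distance * distance + 1.0f);
        }
    }

It would be hooked up with searcher.setSimilarity(new EarlyMatchSimilarity()) before searching a SpanFirstQuery such as new SpanFirstQuery(new SpanTermQuery(new Term("body", "lucene")), 100), which only matches spans ending within the first 100 positions.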
RE: IndexReader deletes more than expected
> If I'm reading this correctly, there's something a little wonky here. In
> your example code, you close the IndexWriter and then, without creating
> a new IndexWriter, you call addDocument again. This shouldn't be
> possible (what version of Lucene are you using?)

Yes, you are correct: I close the IndexWriter and then add more docs. What's wrong with that? It worked out fine, and the docs I add appear to NEW INSTANCES OF INDEX SEARCHERS after calling close on the IndexWriter. As for creating a new IndexWriter, I tried to, but I got a lock exception even though I was closing the IndexWriter instance before creating a new one. I don't know why! Furthermore, this is useless for a multithreaded app, because you can't know who is still writing to your index and who has closed his IndexWriter. Even checking whether the index is locked beforehand adds unnecessary overhead, which can be avoided, since it works for me and I can write with one single instance of IndexWriter.

> Assuming for the time being that you are creating the IndexWriter again,
> the other issue here is that you shouldn't be able to have a reader and
> a writer changing an index at the same time. There should be a lock
> failure. This should occur either in the Index

Well, I think I don't get the problems you expect because I use the Lucene version that is shipped with the Compass distribution (www.compassframework.org). In short, Compass is to Lucene what an ORM like Hibernate is to a DBMS like Oracle. It really works fine, but I couldn't understand why Compass hides the deleteDocuments(Term) method on the IndexWriter class (http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/IndexWriter.html#deleteDocuments(org.apache.lucene.index.Term)). This is why I used delete on a reader rather than on the same writer instance, the only one I have. I couldn't manage my index in one particular situation using Compass, because I had to store data not in the usual way (every row in a table being a record). So I think I have to ask the Compass team about that. Anyway, if you or the others have comments, please do share them.

> Might you be creating your IndexWriters (which you don't show) with the
> create flag always set to true? That will wipe your index each time,
> ignoring the locks and cause all sorts of weird results.

No, I don't create a new instance of IndexWriter. The only one I create is in the service constructor, so I create a new clean (empty) index only when the program starts up:

public LuceneServiceSHImp(String indexDirectory) throws IOException {
    this.indexDirectory = indexDirectory;
    standardAnalyzer = new StandardAnalyzer();
    indexWriter = new IndexWriter(new java.io.File(indexDirectory), standardAnalyzer, true);
    indexWriter.close();
}

> -Original Message-
> From: Ridwan Habbal [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, August 01, 2007 8:48 AM
> To: java-user@lucene.apache.org
> Subject: IndexReader deletes more than expected
>
> Hi, I got unexpected behavior while testing Lucene. To shortly address
> the problem: using IndexWriter, I add docs with a field named ID in
> consecutive order (1, 2, 3, 4, etc.), then close that index. I get a new
> IndexReader and call IndexReader.deleteDocuments(Term). The term is
> simply new Term("ID", "1"), and then I call close on the IndexReader.
> Things work out fine. But things go wrong if I add docs using the
> IndexWriter, close the writer, then create a new IndexReader to delete
> one of the docs already inserted, without closing that reader: while the
> IndexReader that performs the deletion is still not closed, I add more
> docs and commit the IndexWriter, and when I search I get all docs added
> in both phases (before and after calling deleteDocuments() on the
> IndexReader, because I haven't closed the IndexReader, although I have
> closed the IndexWriter). When I then close the IndexReader and query the
> index, it has deleted all docs added between opening and closing the
> reader, in addition to the doc specified in the Term (in this test case
> ID=1). I know I can avoid this by closing the IndexReader directly after
> deleting docs, but what about running this in a multithreaded app like a
> web application? Here is the code:
>
> IndexSearcher indexSearcher = new IndexSearcher(this.indexDirectory);
> Hits hitsB4InsertAndClose = null;
> hitsB4InsertAndClose = getAllAsHits(indexSearcher);
> int beforeInsertAndClose = hitsB4InsertAndClose.length();
> indexWriter.addDocument(getNewElement());
> indexWriter.addDocument(getNewElement());
> indexWriter.addDocument(getNewElement());
> indexWriter.close();
> IndexSearcher indexSearcherDel = new IndexSearcher(this.indexDirectory);
> indexSearcherDel.getIndexReader().deleteDocuments(new Term("ID",
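For anyone following the thread, the lock-safe sequencing being discussed looks roughly like this (Lucene 2.x API; variable names follow the code above and are otherwise placeholders). The point is that only one object should have the index open for modification at a time:

    // Sketch: keep reader-based deletes and writer-based adds strictly sequential.
    IndexReader reader = IndexReader.open(indexDirectory);
    reader.deleteDocuments(new Term("ID", "1"));
    reader.close();   // commits the deletes and releases the write lock

    // create=false so the existing index is opened rather than wiped
    IndexWriter writer = new IndexWriter(new java.io.File(indexDirectory),
                                         new StandardAnalyzer(), false);
    writer.addDocument(getNewElement());
    writer.close();   // a freshly opened IndexSearcher now sees adds and deletes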
Getting only the Ids, not the whole documents.
Hi all,

Can I get just a list of document IDs given a search criterion? To elaborate, here is my situation: I store 2 contracts in a file system index, each with some parameter name and value. Given a search criterion (paramValue='draft'), I need to get just an ArrayList of Strings containing the contract IDs. I don't need the Lucene documents, just the IDs.

Can this be done?

-thanks
RE: Getting only the Ids, not the whole documents.
What is the structure of your index? If you haven't already, add a new field to your index that stores the contractId. For all other fields, set the "store" flag to false while indexing. You can then safely retrieve the value of this contractId field from your search results.

Regards,
kapilChhabra
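A minimal sketch of that retrieval (Lucene 2.x Hits API, assuming an open IndexSearcher named searcher; the query and field names mirror the thread and are otherwise assumptions):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.TermQuery;

    // Sketch: pull only the stored contractId field out of each hit.
    Hits hits = searcher.search(new TermQuery(new Term("paramValue", "draft")));
    List ids = new ArrayList();
    for (int i = 0; i < hits.length(); i++) {
        ids.add(hits.doc(i).get("contractId"));  // null if the field is not stored
    }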
Re: Getting only the Ids, not the whole documents.
You should not store them in an array structure, since that will take up memory; a BitSet is the best structure to store them in.

--
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/
Re: Getting only the Ids, not the whole documents.
Yes, it decreases performance, but it's the only solution. I've spent many weeks looking for the best way to retrieve my own IDs and found this one last. Now I am storing the IDs in a BitSet structure, and it's fast enough:

public void collect(int id, float score) {
    idBitSet.set(Integer.valueOf(searcher.doc(id).get("MyOwnID")));
}

Regards,
Mohammad
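Assembled into a complete collector, the idea above looks roughly like this (Lucene 2.x HitCollector API; the field name MyOwnID comes from the snippet above). Note that calling searcher.doc() inside collect() loads the stored document for every single hit, which is exactly the per-hit cost being discussed:

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;

    // Sketch: gather application-level integer IDs into a BitSet during collection.
    public class IdBitSetCollector extends HitCollector {
        private final IndexSearcher searcher;
        private final BitSet idBitSet = new BitSet();

        public IdBitSetCollector(IndexSearcher searcher) {
            this.searcher = searcher;
        }

        public void collect(int doc, float score) {
            try {
                // Loads the stored fields for each hit: convenient but expensive.
                idBitSet.set(Integer.parseInt(searcher.doc(doc).get("MyOwnID")));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        public BitSet getIds() {
            return idBitSet;
        }
    }

It would be run as searcher.search(query, new IdBitSetCollector(searcher)).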
RE: Getting only the Ids, not the whole documents.
Here's my index structure:

Document -> contractID  - id    (index AND store)
         -> paramName   - name  (index AND store)
         -> paramValue  - value (index BUT NOT store)

When I get back 2 hits, each document contains the ID and paramName. I have no interest in paramName (but I have to STORE it for some other reason). Can I not just get a plain Java String array of the contract IDs that matched?

-thanks for the prompt reply.
Re: Getting only the Ids, not the whole documents.
Hi,

The solution you suggested will definitely work, but it will also definitely slow down my search by an order of magnitude. The problem I am trying to solve is performance; that's why I need the collection of IDs and not the whole documents.

- thanks for the prompt reply.
Re: Getting only the Ids, not the whole documents.
Yes. If you extend HitCollector and override the collect() method with the following signature, you can get the IDs:

public void collect(int id, float score)

Regards,
Mohammad
Do AND + OR Search in Lucene
Hey Guys,

Quick question: I do this in my code for searching:

queryParser.setDefaultOperator(QueryParser.Operator.AND);

Lucene is OR by default, so I change it to AND for my requirements. Now I have a requirement to do OR as well: while doing AND, I'd like to include results from OR too, but ranked much lower than the AND results.

Is there a way to do this?

thanks,
AZ
Re: Do AND + OR Search in Lucene
You can create two queries from two query parsers, one with AND and the other with OR. After you create both of them, call setBoost() to give them different boost levels, then join them together in a BooleanQuery with BooleanClause.Occur.SHOULD. That should do the trick.
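A sketch of that combination (Lucene 2.x API; the field name, analyzer, and boost value are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    // Sketch: AND matches rank first, OR-only matches trail with a low boost.
    public Query buildAndOrQuery(String userInput) throws ParseException {
        QueryParser andParser = new QueryParser("content", new StandardAnalyzer());
        andParser.setDefaultOperator(QueryParser.AND_OPERATOR);
        Query andQuery = andParser.parse(userInput);

        QueryParser orParser = new QueryParser("content", new StandardAnalyzer());
        orParser.setDefaultOperator(QueryParser.OR_OPERATOR);
        Query orQuery = orParser.parse(userInput);
        orQuery.setBoost(0.1f);  // demote OR-only matches; tune to taste

        BooleanQuery combined = new BooleanQuery();
        combined.add(andQuery, BooleanClause.Occur.SHOULD);
        combined.add(orQuery, BooleanClause.Occur.SHOULD);
        return combined;
    }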
RE: High CPU usage during index and search
20,000 queries continuously? That sounds like a bit too much. Can you elaborate on what you need to do? You probably won't need that many queries.

Chew Yee Chuang wrote:
>
> Hi,
>
> Thanks for the links provided. Actually, I'd gone through those articles
> when I was developing the index and search functions for my application.
> I haven't tried a profiler yet, but I monitor the CPU usage and notice
> that whenever indexing or searching is performed, the CPU usage rises to
> 100%. Below I will try to elaborate more on what my application is doing
> and how I index and search.
>
> There are many concurrent processes running. First, the application
> writes the records it receives into a text file, with tabs separating
> the fields. The application points to a new file every 10 minutes and
> starts writing to it, so every file contains only 10 minutes of records,
> approximately 600,000 records per file. The indexing process then checks
> whether there is a text file to be indexed; if there is, the thread
> wakes up and starts indexing.
>
> The indexing process first adds documents to a RAMDir, then adds the
> RAMDir into an FSDir by calling addIndexesNoOptimize() when there are
> 100,000 documents (32 fields per doc) in the RAMDir. Only one
> IndexWriter (FSDir) is created, but several IndexWriters (RAMDir) are
> created during the whole process. Below is the configuration for the
> IndexWriters I mentioned:
>
> IndexWriter (RAMDir)
> - SimpleAnalyzer
> - setMaxBufferedDocs(1)
> - Field.Store.YES
> - Field.Index.NO_NORMS
>
> IndexWriter (FSDir)
> - SimpleAnalyzer
> - setMergeFactor(20)
> - addIndexesNoOptimize()
>
> As for searching, there are many queries (20,000) run continuously to
> generate the aggregate table for reporting purposes. All these queries
> run in a nested loop, and only one Searcher is created. I tried a
> searcher and a filter as well; the filter gave me better results, but
> both utilize a lot of CPU resources.
>
> Hope this info helps, and sorry for my bad English.
>
> Thanks
> eChuang, Chew
>
> -Original Message-
> From: karl wettin [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, July 31, 2007 5:54 PM
> To: java-user@lucene.apache.org
> Subject: Re: High CPU usage during index and search
>
> 31 jul 2007 kl. 05.25 skrev Chew Yee Chuang:
> > But I just noticed that when Lucene performs a search or indexes, the
> > CPU usage on my machine rises to 100%. Because of this issue, some of
> > my other backend processes eventually slow down. Just want to know:
> > has anyone faced this problem before? And are there any ideas on how
> > to overcome it?
>
> Did you run a profiler to see what it is that consumes all the
> resources? It is very hard to guess based on the information you
> supplied. Start here:
>
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
>
> --
> karl
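For readers following along, the RAMDir-into-FSDir batching described above corresponds roughly to this sketch (Lucene 2.2-era API; the path, analyzer, and batch size are illustrative):

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    // Sketch: buffer a batch of documents in RAM, then merge into the disk index.
    Directory fsDir = FSDirectory.getDirectory("/path/to/index");
    IndexWriter fsWriter = new IndexWriter(fsDir, new SimpleAnalyzer(), false);
    fsWriter.setMergeFactor(20);

    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter(ramDir, new SimpleAnalyzer(), true);
    // ... ramWriter.addDocument(doc) for each record, ~100,000 per batch ...
    ramWriter.close();

    fsWriter.addIndexesNoOptimize(new Directory[] { ramDir });  // merge without optimizing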
Re: extracting non-english text from word, pdf, etc....??
If you can already extract a token stream from those files, you can simply use different analyzers to analyze that token stream appropriately. Check out the Lucene-contrib analyzers at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/

heybluez wrote:
>
> I know how to do English text with POI and PDFBox and so on. Now I want
> to start indexing non-English languages such as French and Spanish.
> Which extraction libs are available to me?
>
> I want to do:
>
> Excel
> Word
> PowerPoint
> PDF
> HTML
> RTF
>
> Thanks!
> Michael
Re: LUCENE-843 Release
Mike, as a committer, what do you think?

Thanks!

Peter Keegan wrote:
>
> I've built a production index with this patch and done some query stress
> testing with no problems. I'd give it a thumbs up.
>
> Peter
>
> On 7/30/07, testn <[EMAIL PROTECTED]> wrote:
> >
> > Hi guys,
> >
> > Do you think LUCENE-843 is stable enough? If so, do you think it's
> > worth releasing as LUCENE 2.2.1? It would be nice, so that people can
> > take advantage of it right away without risking other breaking changes
> > in the HEAD branch or waiting until the 2.3 release.
> >
> > Thanks,
Re: Getting only the Ids, not the whole documents.
Hi,

Why don't you consider using a FieldSelector? LoadFirstFieldSelector can help you load only the first field of a document, without loading all the other fields. After that, you can keep the whole document if you like. It should improve performance.
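A sketch of the FieldSelector idea (Lucene 2.x API, assuming an open IndexReader named reader and a hit's doc number docId). LoadFirstFieldSelector loads whichever stored field comes first, so it only helps if the ID field is the first one added to the document:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.LoadFirstFieldSelector;

    // Sketch: load a single field per matched doc instead of the whole document.
    FieldSelector firstFieldOnly = new LoadFirstFieldSelector();
    Document doc = reader.document(docId, firstFieldOnly);
    String contractId = doc.get("contractId");  // assumes contractId is the first stored field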
Re: Using Nutch APIs in Lucene
Just use Nutch. If you look at the Crawl.java class in Nutch, you can pretty easily tear out the appropriate pieces. The question is, do you really need all of that? If so, why not just use Nutch?

-Grant

On Aug 2, 2007, at 2:32 AM, Srinivasarao Vundavalli wrote:

> How can we use Nutch APIs in Lucene? For example, using FetchedSegments
> we can get ParseText, from which we can get the content of the document.
> So can we use these classes (FetchedSegments, ParseText) in Lucene?
> If so, how do we use them? Thank you

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: LUCENE-843 Release
Honestly, I don't really think this is a good idea.

While LUCENE-843 has proven stable so far (knock on wood!), it is still a major change, and I do worry (less with time :) that maybe I broke something subtle somewhere.

While a few brave people have tested the trunk in their production worlds and seen good performance gains, that testing is still limited compared to a real release.

A point release (2.2.1) really is not supposed to contain major changes, just bug fixes, so I don't think we should violate that accepted practice.

I would rather see us finish up 2.3 and release it, and going forward do more frequent releases, instead of porting big changes back onto point releases.

Mike
Re: Solr's NumberUtils doesn't work
How did you encode your integer into a String? Did you use int2sortableStr?

is_maximum wrote:
>
> Hi,
> I am using NumberUtils to encode and decode numbers while indexing and
> searching. When I go to decode a number retrieved from the index, it
> throws an exception for some fields. The exception message is:
>
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 1
>     at java.lang.String.charAt(Unknown Source)
>     at org.apache.solr.util.NumberUtils.SortableStr2int(NumberUtils.java:125)
>     at org.apache.solr.util.NumberUtils.SortableStr2int(NumberUtils.java:37)
>     at com.payvand.lucene.util.ExtendedNumberUtils.decodeInteger(ExtendedNumberUtils.java:123)
>
> I don't know why this happens. I am wondering if it has something to do
> with character encoding. Have you had such a problem?
>
> thanks
>
> --
> Regards,
> Mohammad Norouzi
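For comparison, a round trip through NumberUtils looks like this sketch (Solr 1.x utility API; exact signatures may vary by Solr version). A stored value that was not produced by int2sortableStr, or that was mangled by an analyzer at index time, ends up shorter than the decoder expects and fails with exactly this kind of StringIndexOutOfBoundsException:

    import org.apache.solr.util.NumberUtils;

    // Sketch: encode an int to its sortable string form and decode it back.
    String encoded = NumberUtils.int2sortableStr(42);
    int decoded = NumberUtils.SortableStr2int(encoded, 0, encoded.length());
    // decoded == 42. Indexing the encoded value through a tokenizing analyzer
    // can alter or truncate it, after which SortableStr2int reads past the end.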
Re: LUCENE-843 Release
Thanks! Will look forward to 2.3 then.
RE: IndexReader deletes more than expected
Yes, you are right, thanks for the great reply! I skimmed it too quickly earlier, so I re-read it now and got the point you mean. I just tried Lucene 2.2.0 (I was using 2.0.0), and I could add, delete, and update docs smoothly! Based on the tests I've done so far, similar to the tests I presented in my first email, I don't have to worry about who added and who deleted, and I can get rid of the synchronized Java methods that led to slow app performance. I keep only one open instance of IndexWriter for the whole app. As I stated before, I suffered from lock exceptions; now I use flush() instead of close(). In contrast, I create a new IndexSearcher instance every time I search; I dislike opening, closing, and then reopening the index searcher over and over. I don't use IndexReader directly anymore, since I only use it indirectly through IndexSearcher. I won't try IndexModifier, since you told me that IndexWriter in 2.2.0 is much better. Do you think I'm doing well using IndexWriter this way (one instance for the whole app)?

One thing is still pending, though I need the Compass guys for it: whether they ship the new version of Lucene or not yet. I will check with them anyway; I can't have two different versions of jars for the same classes in the same package. Final question: I still haven't looked at Solr in detail, but is it strongly recommended when I have webapps? Please write back!

cya,
Rid

> Date: Wed, 1 Aug 2007 13:14:04 -0400
> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: Re: IndexReader deletes more than expected
>
> On 8/1/07, Ridwan Habbal <[EMAIL PROTECTED]> wrote:
> > but what about running it in a multithreaded app like a web application?
>
> If you are targeting a multithreaded webapp, then I strongly suggest you
> look into using either Solr or the LuceneIndexAccessor code. You will
> want to use some form of reference counting to manage your Readers and
> Writers.
>
> Also, you can now use IndexWriter (Lucene 2.0 and greater, I think) to
> delete. This allows for efficient mixing of deletes and adds by
> buffering the deletes and then opening an IndexReader to commit them
> later. This is much more efficient than IndexModifier.
>
> - Mark
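The setup described above, sketched minimally (Lucene 2.2-era API; the class and method names are placeholders, synchronization and error handling are omitted, and the exact commit call, flush() versus close()-and-reopen, varies by version):

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;

    // Sketch: one long-lived IndexWriter for all writes,
    // one fresh IndexSearcher per search.
    public class LuceneService {
        private final String indexDirectory;
        private final IndexWriter writer;

        public LuceneService(String indexDirectory) throws IOException {
            this.indexDirectory = indexDirectory;
            // create=true wipes the index, so only appropriate at first startup
            this.writer = new IndexWriter(new File(indexDirectory),
                                          new StandardAnalyzer(), true);
        }

        public void update(Term idTerm, Document doc) throws IOException {
            writer.deleteDocuments(idTerm);  // buffered delete, no IndexReader needed
            writer.addDocument(doc);
            writer.flush();                  // make changes visible to new searchers
        }

        public IndexSearcher newSearcher() throws IOException {
            return new IndexSearcher(indexDirectory);
        }
    }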
Re: extracting non-english text from word, pdf, etc....??
Yea, I have seen those. I guess the question is: what do you all use to extract text from Word, Excel, PPT, and PDF? Can I use POI, PDFBox, and so on? This is what I use now to extract English.

Thanks,
Michael
Re: extracting non-english text from word, pdf, etc....??
Check out
http://wiki.apache.org/lucene-java/LuceneFAQ#head-e7d23f91df094d7baeceb46b04d518dc426d7d2e
Re: Do AND + OR Search in Lucene
Alternatively, construct a parenthesized query that reflects what you want. If you do, make sure that OR is capitalized, or make REAL SURE you understand the Lucene query syntax and construct your query within that syntax.

Erick
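For example, with QueryParser's default OR operator, a raw query string along these lines does the whole job in one parse (the terms and boost value are placeholders):

    (+lucene +search) (lucene search)^0.1

The first clause matches only documents containing both terms; the second lets either term match but contributes little to the score, so OR-only hits trail the AND hits.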
Re: extracting non-english text from word, pdf, etc....??
Hey Michael,

Have you given it a try? I would think they would work, but I haven't actually done it. Set up a small test that reads in a PDF in French or Spanish and give it a try. You might have to worry about encodings or something, but the structure of the files should be the same, i.e. they are still valid Word, etc. documents.

-Grant
Re: extracting non-english text from word, pdf, etc....??
In terms of PDF documents: PDFBox should work just fine with any Latin-based languages; at this time, certain PDFs that have CJK characters can pose some issues. In general, English/French/Spanish should be fine. Some PDFs use custom encodings that make it impossible to extract text, and it comes out as gibberish. As a simple test: if Acrobat can extract the text, then PDFBox should be able to as well.

Ben
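For anyone wiring that up, extraction with PDFBox is roughly the following sketch (pre-Apache PDFBox 0.7.x package names; the file path is a placeholder):

    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    // Sketch: pull the text out of a PDF; the result is a Java String, so
    // accented Latin characters survive if the PDF's encoding is extractable.
    PDDocument pdf = PDDocument.load("document.pdf");
    try {
        String text = new PDFTextStripper().getText(pdf);
        // feed `text` to the appropriate language Analyzer at index time
    } finally {
        pdf.close();
    }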
Re: Clustered Indexing on common network filesystem
Hi,

It's been a couple of days now and I haven't heard anything on this topic, while there has been substantial list traffic otherwise. Am I asking in the wrong place? Was I unclear? I know there are people out there that have used, or are using, Lucene in a clustered environment. I am just looking for any sort of feedback (general or specific) about clustering Lucene, as well as filesystem compatibility (Windows shares, NFS, etc.).

Thanks again,
-Zach

Zach Bailey wrote:
> Hello all,
>
> First, a little background: we are developing a clustered application
> that will in part leverage Lucene to provide index and search
> capabilities. We have already spent time investigating various index
> storage implementations (database vs. filesystem), and we've decided for
> performance reasons to go with a filesystem index storage scheme.
>
> That said, I have read back through the archives a bit and noticed that
> support for index storage on NFS is still experimental (e.g. the latest
> bugfixes have not made it out to an official, stable release). I realize
> most of the issues related to using a shared filesystem revolve around
> locking, and I haven't seen much about the maturity of locking on other
> network filesystems.
>
> I was wondering if anyone has tried any other networked filesystems or
> has any recommendations. We have clients who would be doing this on both
> Windows and Unix/Linux, so any insight there would be appreciated as
> well. It can be assumed that across any cluster the operating system
> would be homogeneous (i.e. all nodes are on Windows and would use
> Windows shares, or all nodes are on Linux and would use xyz filesystem).
>
> Thanks in advance,
> -Zach Bailey
Re: Clustered Indexing on common network filesystem
Why don't you check out Hadoop and Nutch? It should provide what you are looking for. Zach Bailey wrote: > > Hi, > > It's been a couple of days now and I haven't heard anything on this > topic, while there has been substantial list traffic otherwise. > > Am I asking in the wrong place? Was I unclear? > > I know there are people out there that have used/are using Lucene in a > clustered environment. I am just looking for any sort of feedback > (general or specific) about clustering lucene as well as filesystem > compatibility (windows shares, NFS, etc.). > > Thanks again, > -Zach > > Zach Bailey wrote: >> Hello all, >> >> First a little background - we are developing a clustered application >> that will in part leverage Lucene to provide index and search >> capabilities. We have already spent time investigating various index >> storage implementations (database vs. filesystem) and we've decided for >> performance reasons to go with a filesystem index storage scheme. >> >> That said, I have read back through the archives a bit and noticed that >> the support for index storage on NFS is still experimental (e.g. the >> latest bugfixes have not made it out to an official, stable release). I >> realize most of the issues related to using a shared file system revolve >> around locking, and I haven't seen much about the maturity of locking >> for other network filesystems. >> >> I was wondering if anyone has tried any other networked filesystems or >> had any recommendations. We have clients who would be doing this on both >> Windows and Unix/Linux so any insight there would be appreciated as well >> - it can be assumed that across any cluster the operating system use >> would be homogeneous (i.e. all nodes are on windows and would use >> windows shares, or all nodes are on linux and would use xyz filesystem). >> >> Thanks in advance, >> -Zach Bailey >> > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > -- View this message in context: http://www.nabble.com/Clustered-Indexing-on-common-network-filesystem-tf4194135.html#a11966423 Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustered Indexing on common network filesystem
Thanks for your response -- Based on my understanding, hadoop and nutch are essentially the same thing, with nutch being derived from hadoop, and are primarily intended to be standalone applications. We are not looking for a standalone application, rather we must use a framework to implement search inside our current content management application. Currently the application search functionality is designed and built around Lucene, so migrating frameworks at this point is not feasible. We are currently re-working our back-end to support clustering (in tomcat) and we are looking for information on the migration of Lucene from a single node filesystem index (which is what we use now and hope to continue to use for clients with a single-node deployment) to a shared filesystem index on a mounted network share. We prefer to use this strategy because it means we do not have to have two disparate methods of managing indexes for clients who run in a single-node, non-clustered environment versus clients who run in a multiple-node, clustered environment. So, hopefully here are some easy questions someone could shed some light on: Is this not a recommended method of managing indexes across multiple nodes? At this point would people recommend storing an individual index on each node and propagating index updates via a JMS framework rather than attempting to handle it transparently with a single shared index? Is the Lucene index code so intimately tied to filesystem semantics that using a shared/networked file system is infeasible at this point in time? What would be the quickest time-to-implementation of these strategies (JMS vs. shared FS)? The most robust/least error-prone? I really appreciate any insight or response anyone can provide, even if it is a short answer to any of the related topics, "i.e. we implemented clustered search using per-node indexing with JMS update propagation and it works great", or even something as simple as "don't use a shared filesystem at this point". Cheers, -Zach testn wrote: Why don't you check out Hadoop and Nutch? It should provide what you are looking for. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustered Indexing on common network filesystem
Some quick info: NFS should work, but I think you'll want to be working off the trunk. Also, sharing an index over NFS is supposed to be slow. The standard so far, if you are not partitioning the index, is to use a unix/linux filesystem and hardlinks + rsync to efficiently share index changes across nodes (hard links for instant copy, rsync to only transfer changed index files, search the mailing list). If you look at solr you can see scripts that give an example of this. I don't think the scripts rely on solr. This kind of setup should be quick and simple to implement. Same with NFS. An RMI solution that allowed for index partitioning would probably be the longest to do. -Mark Zach Bailey wrote: Thanks for your response -- Based on my understanding, hadoop and nutch are essentially the same thing, with nutch being derived from hadoop, and are primarily intended to be standalone applications. We are not looking for a standalone application, rather we must use a framework to implement search inside our current content management application. Currently the application search functionality is designed and built around Lucene, so migrating frameworks at this point is not feasible. We are currently re-working our back-end to support clustering (in tomcat) and we are looking for information on the migration of Lucene from a single node filesystem index (which is what we use now and hope to continue to use for clients with a single-node deployment) to a shared filesystem index on a mounted network share. We prefer to use this strategy because it means we do not have to have two disparate methods of managing indexes for clients who run in a single-node, non-clustered environment versus clients who run in a multiple-node, clustered environment. So, hopefully here are some easy questions someone could shed some light on: Is this not a recommended method of managing indexes across multiple nodes? At this point would people recommend storing an individual index on each node and propagating index updates via a JMS framework rather than attempting to handle it transparently with a single shared index? Is the Lucene index code so intimately tied to filesystem semantics that using a shared/networked file system is infeasible at this point in time? What would be the quickest time-to-implementation of these strategies (JMS vs. shared FS)? The most robust/least error-prone? I really appreciate any insight or response anyone can provide, even if it is a short answer to any of the related topics, "i.e. we implemented clustered search using per-node indexing with JMS update propagation and it works great", or even something as simple as "don't use a shared filesystem at this point". Cheers, -Zach testn wrote: Why don't you check out Hadoop and Nutch? It should provide what you are looking for. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
How do YOU detect corrupt indexes?
Hello, I've been asked to devise some way to discover and correct data in Lucene indexes that have been "corrupted." The word "corrupt", in this case, has a few different meanings, some of which strike me as exceedingly difficult to grok. What concerns me are the cases where we don't know that an index has been changed: A bit error in a stored field, for instance, is a form of corruption that we (ideally) should be able to identify, at the very least, and hopefully correct. This case seems particularly onerous, since this isn't going to throw an exception of any sort, any time. We have a fairly good handle on how to remedy problems that throw exceptions, so we should be alright with corruption where (say) an operator logs in and overwrites a file. I'm wondering how other Lucene users have tackled this problem in the past. Calculating checksums on the documents seems like one way to go about it: compute a checksum on the document and, in a background thread, compare the checksum to the data. Unfortunately we're building a large, federated system and it would take months to exhaustively check every document this way. Checksumming the files themselves might be too much: We're storing gigabytes of data per index and there is some churn to the data; in other words, the overhead for this method might be too high. Thanks for any help you might have. -Joseph Rose - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
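To make the per-document checksum idea concrete, here is a rough sketch. The field names and the choice of CRC32 are illustrative assumptions, not anything Lucene provides, and note this only guards stored fields, not the postings themselves:

    import java.io.UnsupportedEncodingException;
    import java.util.zip.CRC32;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DocChecksum {

        private static String crcOf(String text) throws UnsupportedEncodingException {
            CRC32 crc = new CRC32();
            crc.update(text.getBytes("UTF-8")); // fix the charset so index and verify agree
            return Long.toHexString(crc.getValue());
        }

        // Index time: store the body's checksum alongside the body itself.
        public static Document withChecksum(String body) throws UnsupportedEncodingException {
            Document doc = new Document();
            doc.add(new Field("body", body, Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("bodyCrc", crcOf(body), Field.Store.YES, Field.Index.NO));
            return doc;
        }

        // Verification time, e.g. from a background sweep: false flags a bit error.
        public static boolean isIntact(Document stored) throws UnsupportedEncodingException {
            return crcOf(stored.get("body")).equals(stored.get("bodyCrc"));
        }
    }

A background sweep can then page through documents calling isIntact(), spreading the cost over time, which fits the constraint that an exhaustive pass over a large federated system may take months.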
Re: Clustered Indexing on common network filesystem
One more alternative, though I am not sure if anyone is using it. Apache Compass has added a plug-in to allow storing Lucene index files inside the database. This should work in a clustered environment as all nodes will share the same database instance. I am not sure the impact it will have on performance. Is anyone using DB for index storage? Any drawbacks of this approach? Regards, Rajesh --- Zach Bailey <[EMAIL PROTECTED]> wrote: > Thanks for your response -- > > Based on my understanding, hadoop and nutch are > essentially the same > thing, with nutch being derived from hadoop, and are > primarily intended > to be standalone applications. > > We are not looking for a standalone application, > rather we must use a > framework to implement search inside our current > content management > application. Currently the application search > functionality is designed > and built around Lucene, so migrating frameworks at > this point is not > feasible. > > We are currently re-working our back-end to support > clustering (in > tomcat) and we are looking for information on the > migration of Lucene > from a single node filesystem index (which is what > we use now and hope > to continue to use for clients with a single-node > deployment) to a > shared filesystem index on a mounted network share. > > We prefer to use this strategy because it means we > do not have to have > two disparate methods of managing indexes for > clients who run in a > single-node, non-clustered environment versus > clients who run in a > multiple-node, clustered environment. > > So, hopefully here are some easy questions someone > could shed some light on: > > Is this not a recommended method of managing indexes > across multiple nodes? > > At this point would people recommend storing an > individual index on each > node and propagating index updates via a JMS > framework rather than > attempting to handle it transparently with a single > shared index? > > Is the Lucene index code so intimately tied to > filesystem semantics that > using a shared/networked file system is infeasible > at this point in time? > > What would be the quickest time-to-implementation of > these strategies > (JMS vs. shared FS)? The most robust/least > error-prone? > > I really appreciate any insight or response anyone > can provide, even if > it is a short answer to any of the related topics, > "i.e. we implemented > clustered search using per-node indexing with JMS > update propagation and > it works great", or even something as simple as > "don't use a shared > filesystem at this point". > > Cheers, > -Zach > > testn wrote: > > Why don't you check out Hadoop and Nutch? It > should provide what you are > > looking for. > > - > To unsubscribe, e-mail: > [EMAIL PROTECTED] > For additional commands, e-mail: > [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustered Indexing on common network filesystem
Mark, Thanks so much for your response. Unfortunately, I am not sure the leader of the project would feel good about running code from trunk, save with an explicit endorsement from a majority of the developers or contributors for that particular code (do those people keep up with this list, anyway?) Is there any word on the possible timeframe the code required to work with NFS might be released? Thanks for your other insight about hardlinks and rsync. I will look into that; unfortunately it does not cover our userbase who may be clustering in a Windows Server environment. I still have not heard/seen any evidence (anecdotal or otherwise) about how well Lucene might work sharing indexes over a mounted Windows share. -Zach Mark Miller wrote: Some quick info: NFS should work, but I think you'll want to be working off the trunk. Also, sharing an index over NFS is supposed to be slow. The standard so far, if you are not partitioning the index, is to use a unix/linux filesystem and hardlinks + rsync to efficiently share index changes across nodes (hard links for instant copy, rsync to only transfer changed index files, search the mailing list). If you look at solr you can see scripts that give an example of this. I don't think the scripts rely on solr. This kind of setup should be quick and simple to implement. Same with NFS. An RMI solution that allowed for index partitioning would probably be the longest to do. -Mark Zach Bailey wrote: Thanks for your response -- Based on my understanding, hadoop and nutch are essentially the same thing, with nutch being derived from hadoop, and are primarily intended to be standalone applications. We are not looking for a standalone application, rather we must use a framework to implement search inside our current content management application. Currently the application search functionality is designed and built around Lucene, so migrating frameworks at this point is not feasible. We are currently re-working our back-end to support clustering (in tomcat) and we are looking for information on the migration of Lucene from a single node filesystem index (which is what we use now and hope to continue to use for clients with a single-node deployment) to a shared filesystem index on a mounted network share. We prefer to use this strategy because it means we do not have to have two disparate methods of managing indexes for clients who run in a single-node, non-clustered environment versus clients who run in a multiple-node, clustered environment. So, hopefully here are some easy questions someone could shed some light on: Is this not a recommended method of managing indexes across multiple nodes? At this point would people recommend storing an individual index on each node and propagating index updates via a JMS framework rather than attempting to handle it transparently with a single shared index? Is the Lucene index code so intimately tied to filesystem semantics that using a shared/networked file system is infeasible at this point in time? What would be the quickest time-to-implementation of these strategies (JMS vs. shared FS)? The most robust/least error-prone? I really appreciate any insight or response anyone can provide, even if it is a short answer to any of the related topics, "i.e. we implemented clustered search using per-node indexing with JMS update propagation and it works great", or even something as simple as "don't use a shared filesystem at this point". Cheers, -Zach testn wrote: Why don't you check out Hadoop and Nutch?
It should provide what you are looking for. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustered Indexing on common network filesystem
Rajesh, I forgot to mention this, but we did investigate this option as well and even prototyped it for an internal project. It ended up being too slow for us. It was adding a lot of overhead even to small updates, IIRC, mainly due to the fact that the index was essentially stored as a filesystem in the database. As you can probably imagine, using a database as a filesystem is not very performant. Rajesh parab wrote: One more alternative, though I am not sure if anyone is using it. Apache Compass has added a plug-in to allow storing Lucene index files inside the database. This should work in a clustered environment as all nodes will share the same database instance. I am not sure the impact it will have on performance. Is anyone using DB for index storage? Any drawbacks of this approach? Regards, Rajesh --- Zach Bailey <[EMAIL PROTECTED]> wrote: Thanks for your response -- Based on my understanding, hadoop and nutch are essentially the same thing, with nutch being derived from hadoop, and are primarily intended to be standalone applications. We are not looking for a standalone application, rather we must use a framework to implement search inside our current content management application. Currently the application search functionality is designed and built around Lucene, so migrating frameworks at this point is not feasible. We are currently re-working our back-end to support clustering (in tomcat) and we are looking for information on the migration of Lucene from a single node filesystem index (which is what we use now and hope to continue to use for clients with a single-node deployment) to a shared filesystem index on a mounted network share. We prefer to use this strategy because it means we do not have to have two disparate methods of managing indexes for clients who run in a single-node, non-clustered environment versus clients who run in a multiple-node, clustered environment. So, hopefully here are some easy questions someone could shed some light on: Is this not a recommended method of managing indexes across multiple nodes? At this point would people recommend storing an individual index on each node and propagating index updates via a JMS framework rather than attempting to handle it transparently with a single shared index? Is the Lucene index code so intimately tied to filesystem semantics that using a shared/networked file system is infeasible at this point in time? What would be the quickest time-to-implementation of these strategies (JMS vs. shared FS)? The most robust/least error-prone? I really appreciate any insight or response anyone can provide, even if it is a short answer to any of the related topics, "i.e. we implemented clustered search using per-node indexing with JMS update propagation and it works great", or even something as simple as "don't use a shared filesystem at this point". Cheers, -Zach testn wrote: Why don't you check out Hadoop and Nutch? It should provide what you are looking for. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Clustered Indexing on common network filesystem
I have been meaning to write up a Wiki page on this general topic but have not quite made time yet ... Sharing an index with a shared filesystem will work; however, there are some caveats: * This is somewhat uncharted territory because it's fairly recent fixes to Lucene that have enabled the things below to work, and, it's not a heavily tested area. Please share your experience so we all can learn... * If the filesystem does not protect against deletion of open files (notably NFS does not, however SMB/CIFS does) then you will need to create a custom DeletionPolicy based on your app logic so writer & readers "agree" on when it's safe to delete prior commit points. This can be something simple like "readers always refresh at least once per hour so any commit point older than 1 hour may be safely deleted". * Locking: if your app can ensure only one writer is active at a time, you can disable locking in Lucene entirely. Else, it's best to use NativeFSLockFactory, if you can. * If you are using a filesystem that does not have coherent caching on directory listing (NFS clients often do not), and, different nodes can "become" the writer (vs a single dedicated writer node) then there is one known open issue that you'll hit once you make your own DeletionPolicy which I still have to port to trunk: http://issues.apache.org/jira/browse/LUCENE-948 But as Mark said, performance is likely quite poor and so you may want to take an approach like Solr (or, use Solr) whereby a single writer makes changes to the index. Then these changes are efficiently propagated to multiple hosts (hard link & rsync is one way but not the only way), and these hosts then search their private copy via their local filesystem. Mike "Zach Bailey" <[EMAIL PROTECTED]> wrote: > Mark, > > Thanks so much for your response. > > Unfortunately, I am not sure the leader of the project would feel good > about running code from trunk, save with an explicit endorsement from > a majority of the developers or contributors for that particular code > (do those people keep up with this list, anyway?) Is there any word on > the possible timeframe the code required to work with NFS might be > released? > > Thanks for your other insight about hardlinks and rsync. I will look > into that; unfortunately it does not cover our userbase who may be > clustering in a Windows Server environment. I still have not heard/seen > any evidence (anecdotal or otherwise) about how well Lucene might work > sharing indexes over a mounted Windows share. > > -Zach > > Mark Miller wrote: > > Some quick info: > > > > NFS should work, but I think you'll want to be working off the trunk. > > Also, sharing an index over NFS is supposed to be slow. The standard so > > far, if you are not partitioning the index, is to use a unix/linux > > filesystem and hardlinks + rsync to efficiently share index changes > > across nodes (hard links for instant copy, rsync to only transfer > > changed index files, search the mailing list). If you look at solr you > > can see scripts that give an example of this. I don't think the scripts > > rely on solr. This kind of setup should be quick and simple to > > implement. Same with NFS. An RMI solution that allowed for index > > partitioning would probably be the longest to do.
> > > > -Mark > > > > > > > > Zach Bailey wrote: > >> Thanks for your response -- > >> > >> Based on my understanding, hadoop and nutch are essentially the same > >> thing, with nutch being derived from hadoop, and are primarily > >> intended to be standalone applications. > >> > >> We are not looking for a standalone application, rather we must use a > >> framework to implement search inside our current content management > >> application. Currently the application search functionality is > >> designed and built around Lucene, so migrating frameworks at this > >> point is not feasible. > >> > >> We are currently re-working our back-end to support clustering (in > >> tomcat) and we are looking for information on the migration of Lucene > >> from a single node filesystem index (which is what we use now and hope > >> to continue to use for clients with a single-node deployment) to a > >> shared filesystem index on a mounted network share. > >> > >> We prefer to use this strategy because it means we do not have to have > >> two disparate methods of managing indexes for clients who run in a > >> single-node, non-clustered environment versus clients who run in a > >> multiple-node, clustered environment. > >> > >> So, hopefully here are some easy questions someone could shed some > >> light on: > >> > >> Is this not a recommended method of managing indexes across multiple > >> nodes? > >> > >> At this point would people recommend storing an individual index on > >> each node and propagating index updates via a JMS framework rather > >> than attempting to handle it transparently with a single shared index?
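As an illustration of the "readers refresh at least once per hour" policy Mike describes, here is a sketch against the IndexDeletionPolicy hook added in Lucene 2.2. The class and parameter names are mine, and keying off the segments file's modification time is an assumption that needs care on NFS, where attribute caching and clock skew can mislead:

    import java.io.IOException;
    import java.util.List;

    import org.apache.lucene.index.IndexCommitPoint;
    import org.apache.lucene.index.IndexDeletionPolicy;
    import org.apache.lucene.store.Directory;

    // Keep every commit point younger than expirationMillis, so readers that
    // refresh at least that often never have files deleted out from under them.
    public class ExpirationTimeDeletionPolicy implements IndexDeletionPolicy {
        private final Directory dir;
        private final long expirationMillis;

        public ExpirationTimeDeletionPolicy(Directory dir, long expirationMillis) {
            this.dir = dir;
            this.expirationMillis = expirationMillis;
        }

        public void onInit(List commits) throws IOException {
            onCommit(commits);
        }

        public void onCommit(List commits) throws IOException {
            IndexCommitPoint newest = (IndexCommitPoint) commits.get(commits.size() - 1);
            long newestTime = dir.fileModified(newest.getSegmentsFileName());
            // The newest commit must always survive; prune older ones once expired.
            for (int i = 0; i < commits.size() - 1; i++) {
                IndexCommitPoint commit = (IndexCommitPoint) commits.get(i);
                if (newestTime - dir.fileModified(commit.getSegmentsFileName()) > expirationMillis) {
                    commit.delete();
                }
            }
        }
    }

The writer is then opened with this policy installed; under the one-hour refresh contract above, expirationMillis would be 3600 * 1000L.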
Re: Clustered Indexing on common network filesystem
"Zach Bailey" <[EMAIL PROTECTED]> wrote: > Unfortunately, I am not sure the leader of the project would feel good > about running code from trunk, save without an explicit endorsement from > a majority of the developers or contributors for that particular code > (do those people keep up with this list, anyway?) Is there any word on > the possible timeframe the code required to work with NFS might be > released? This person does keep up with the list :) On timframe ... there are tentative discussions now on the dev list on releasing 2.3 in a few months time, but by no means is this a hard schedule. I'll make sure LUCENE-948 is included in 2.3. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Getting only the Ids, not the whole documents.
If you are just retrieving your custom id and you have more stored fields (and they are not tiny), you certainly do want to use a field selector. I would suggest SetBasedFieldSelector. - Mark testn wrote: Hi, Why don't you consider using FieldSelector? LoadFirstFieldSelector has the ability to help you load only the first field in the document without loading all the fields. After that, you can keep the whole document if you like. It should help improve performance. is_maximum wrote: yes, it decreases the performance but it's the only solution. I've spent many weeks trying to find the best way to retrieve my own IDs but found this way as the last one. Now I am storing the ids in a BitSet structure and it's fast enough public void collect(...){ idBitSet.set(Integer.valueOf(searcher.doc(id).get("MyOwnID"))); } On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote: Hi, The solution you suggested will definitely work but will definitely slow down my search by an order of magnitude. The problem I am trying to solve is performance, that's why I need the collection of IDs and not the whole documents. - thanks for the prompt reply. is_maximum wrote: yes if you extend your class from HitCollector and override the collect() method with the following signature you can get IDs public void collect(int id, float score) On 8/2/07, makkhar <[EMAIL PROTECTED]> wrote: Hi all, Can I get just a list of document Ids given a search criterion? To elaborate here is my situation: I store 2 contracts in the file system index each with some parameterName and Value. Given a search criterion - (paramValue='draft'). I need to get just an ArrayList of Strings containing contract Ids. I don't need the Lucene documents, just the Ids. Can this be done? -thanks -- View this message in context: http://www.nabble.com/Getting-only-the-Ids%2C-not-the-whole-documents.-tf4204907.html#a11960750 Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Regards, Mohammad -- see my blog: http://brainable.blogspot.com/ another in Persian: http://fekre-motefavet.blogspot.com/ -- View this message in context: http://www.nabble.com/Getting-only-the-Ids%2C-not-the-whole-documents.-tf4204907.html#a11961159 Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Regards, Mohammad -- see my blog: http://brainable.blogspot.com/ another in Persian: http://fekre-motefavet.blogspot.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
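To make Mark's suggestion concrete, here is a minimal sketch of loading a single stored field through SetBasedFieldSelector; the "MyOwnID" field name follows the earlier messages, and the class wrapper is just for illustration:

    import java.io.IOException;
    import java.util.Collections;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.SetBasedFieldSelector;
    import org.apache.lucene.index.IndexReader;

    public class IdOnlyLoader {
        // Load the "MyOwnID" stored field eagerly and nothing else, not even lazily.
        private static final FieldSelector ID_ONLY = new SetBasedFieldSelector(
                Collections.singleton("MyOwnID"), Collections.EMPTY_SET);

        public static String idOf(IndexReader reader, int docId) throws IOException {
            Document doc = reader.document(docId, ID_ONLY);
            return doc.get("MyOwnID");
        }
    }

Inside a HitCollector this replaces searcher.doc(id), which loads every stored field, with a read that skips everything except the id, which is where the time goes when documents carry large stored bodies.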
Re: Getting only the Ids, not the whole documents.
On Thursday 02 August 2007 19:28:48 Mohammad Norouzi wrote: > you should not store them in an Array structure since they will take up > memory. > the BitSet is the best structure to store them You can't store strings in a BitSet. What I would do is return a List but make a custom subclass of AbstractList which creates the strings on demand from the Hits object. We use similar tricks to convert Hits into a List of another more complex object type and it works great. You can cache the strings as they're retrieved if you're planning to use some strings much more than others. Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699 Web: http://nuix.com/ Fax: +61 2 9212 6902 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
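A bare-bones version of Daniel's on-demand list might look like the following sketch, written against the Hits API of this era; the "ID" field name is an assumption, and the caching he mentions is left out for brevity:

    import java.io.IOException;
    import java.util.AbstractList;

    import org.apache.lucene.search.Hits;

    // Exposes the stored "ID" field of each hit as a read-only List of Strings;
    // each get() loads exactly one document, on demand.
    public class HitIdList extends AbstractList {
        private final Hits hits;

        public HitIdList(Hits hits) {
            this.hits = hits;
        }

        public Object get(int index) {
            try {
                return hits.doc(index).get("ID");
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        public int size() {
            return hits.length();
        }
    }

Because get() touches the index lazily, callers pay only for the entries they actually visit; adding a small cache over get() covers the case where some ids are read much more often than others.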
Re: Can I do boosting based on term positions?
I am doing implementation of SpanTermQuery for you, give me today. Sorry, I was out for meetings for 2 days. Enjoy, Shailendra On 8/3/07, Cedric Ho <[EMAIL PROTECTED]> wrote: > > Hi Paul, > > Isn't SpanFirstQuery only match those with position less than a > certain end position? > > I am rather looking for a query that would score a document higher for > terms appear near the start but not totally discard those with terms > appear near the end. > > Regards, > Cedric > > On 8/2/07, Paul Elschot <[EMAIL PROTECTED]> wrote: > > Cedric, > > > > SpanFirstQuery could be a solution without payloads. > > You may want to give it your own Similarity.sloppyFreq() . > > > > Regards, > > Paul Elschot > > > > On Thursday 02 August 2007 04:07, Cedric Ho wrote: > > > Thanks for the quick response =) > > > > > > On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote: > > > > Yes, it is easily doable through "Payload" facility. During indexing > > process > > > > (mainly tokenization), you need to push this extra information in > each > > > > token. And then you can use BoostingTermQuery for using Payload > value to > > > > include Payload in the score. You also need to implement Similarity > for > > this > > > > (mainly scorePayload method). > > > > > > If I store, say a custom boost factor as Payload, does it means that I > > > will store one more byte per term per document in the index file? So > > > the index file would be much larger? > > > > > > > > > > > Other way can be to extend SpanTermQuery, this already calculates > the > > > > position of match. You just need to do something to use this > position > > value > > > > in the score calculation. > > > > > > I see that SpanTermQuery takes a TermPositions from the indexReader > > > and I can get the term position from there. However I am not sure how > > > to incorporate it into the score calculation. Would you mind give a > > > little more detail on this? > > > > > > > > > > > One possible advantage of SpanTermQuery approach is that you can > play > > > > around, without re-creating indices everytime. > > > > > > > > Thanks, > > > > Shailendra Sharma, > > > > CTO, Ver se' Innovation Pvt. Ltd. > > > > Bangalore, India > > > > > > > > On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote: > > > > > > > > > > Hi all, > > > > > > > > > > I was wondering if it is possible to do boosting by search terms' > > > > > position in the document. > > > > > > > > > > for example: > > > > > search terms appear in the first 100 words, or first 10% words, or > in > > > > > first two paragraphs would be given higher score. > > > > > > > > > > Is it achievable through using the new Payload function in lucene > 2.2? > > > > > Or are there any easier ways to achieve these ? > > > > > > > > > > > > > > > Regards, > > > > > Cedric > > > > > > > > > > > - > > > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > > > > > > > > Thanks, > > > Cedric > > > > > > - > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > -- > [EMAIL PROTECTED] >
Re: Can I do boosting based on term positions?
Hi Paul, Isn't SpanFirstQuery only match those with position less than a certain end position? I am rather looking for a query that would score a document higher for terms appear near the start but not totally discard those with terms appear near the end. Regards, Cedric On 8/2/07, Paul Elschot <[EMAIL PROTECTED]> wrote: > Cedric, > > SpanFirstQuery could be a solution without payloads. > You may want to give it your own Similarity.sloppyFreq() . > > Regards, > Paul Elschot > > On Thursday 02 August 2007 04:07, Cedric Ho wrote: > > Thanks for the quick response =) > > > > On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]> wrote: > > > Yes, it is easily doable through "Payload" facility. During indexing > process > > > (mainly tokenization), you need to push this extra information in each > > > token. And then you can use BoostingTermQuery for using Payload value to > > > include Payload in the score. You also need to implement Similarity for > this > > > (mainly scorePayload method). > > > > If I store, say a custom boost factor as Payload, does it means that I > > will store one more byte per term per document in the index file? So > > the index file would be much larger? > > > > > > > > Other way can be to extend SpanTermQuery, this already calculates the > > > position of match. You just need to do something to use this position > value > > > in the score calculation. > > > > I see that SpanTermQuery takes a TermPositions from the indexReader > > and I can get the term position from there. However I am not sure how > > to incorporate it into the score calculation. Would you mind give a > > little more detail on this? > > > > > > > > One possible advantage of SpanTermQuery approach is that you can play > > > around, without re-creating indices everytime. > > > > > > Thanks, > > > Shailendra Sharma, > > > CTO, Ver se' Innovation Pvt. Ltd. > > > Bangalore, India > > > > > > On 8/1/07, Cedric Ho <[EMAIL PROTECTED]> wrote: > > > > > > > > Hi all, > > > > > > > > I was wondering if it is possible to do boosting by search terms' > > > > position in the document. > > > > > > > > for example: > > > > search terms appear in the first 100 words, or first 10% words, or in > > > > first two paragraphs would be given higher score. > > > > > > > > Is it achievable through using the new Payload function in lucene 2.2? > > > > Or are there any easier ways to achieve these ? > > > > > > > > > > > > Regards, > > > > Cedric > > > > > > > > - > > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > > > Thanks, > > Cedric > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > -- [EMAIL PROTECTED]
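Paul's suggestion upthread (SpanFirstQuery plus your own Similarity.sloppyFreq()) can be sketched as below; the decay curve is an illustrative choice, not a recommendation:

    import org.apache.lucene.search.DefaultSimilarity;

    // Illustrative decay only: flatten the default 1/(distance+1) curve so spans
    // far from the document start are demoted gently instead of sharply.
    public class PositionDecaySimilarity extends DefaultSimilarity {
        public float sloppyFreq(int distance) {
            return (float) (1.0 / Math.sqrt(distance + 1.0));
        }
    }

Install it with searcher.setSimilarity(new PositionDecaySimilarity()). As for not totally discarding late matches, a SpanFirstQuery alone will still drop them, so one hedged workaround is to combine the SpanFirstQuery with a plain TermQuery for the same term in a BooleanQuery, both as optional (SHOULD) clauses: every matching document then scores on the TermQuery, and those whose terms fall inside the leading window get the extra boost.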
Performance improvements using writer.delete vs reader.delete
Hi, We're considering using the new IndexWriter.deleteDocuments call rather than the IndexReader.delete call. Are there any performance improvements that this may provide, other than the benefit of not having to switch between readers/writers? We've looked at LUCENE-565, but there's no clear view of performance enhancements over the old IndexReader call. Cheers, Andreas -- ATLASSIAN Our products help over 7,000 organisations in more than 88 countries to collaborate. http://www.atlassian.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Performance improvements using writer.delete vs reader.delete
Andreas Knecht wrote: > We're considering using the new IndexWriter.deleteDocuments call rather > than the IndexReader.delete call. Are there any performance > improvements that this may provide, other than the benefit of not having > to switch between readers/writers? > > We've looked at LUCENE-565, but there's no clear view of performance > enhancements over the old IndexReader call. I think Yonik's comment in 565 holds here - http://issues.apache.org/jira/browse/LUCENE-565#action_12432155 - if your application is already buffering deletes/updates and batching the deletes, you probably won't see a large improvement. But if your application does not buffer the deletes and does not batch them, then I believe moving to IndexWriter.delete() (and update()) should buy you a performance improvement, because IndexWriter would now buffer the deletes for you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
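As a sketch of the buffered single-writer pattern this implies (the "ID" field and the class name are illustrative assumptions, not anything from the original thread):

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class SingleWriterUpdater {
        private final IndexWriter writer;

        public SingleWriterUpdater(String path) throws IOException {
            // One long-lived writer; no reader/writer switching needed for deletes.
            writer = new IndexWriter(path, new StandardAnalyzer());
        }

        public void delete(String id) throws IOException {
            writer.deleteDocuments(new Term("ID", id)); // buffered, applied at flush
        }

        public void update(String id, Document doc) throws IOException {
            writer.updateDocument(new Term("ID", id), doc); // atomic delete-then-add
        }

        public void close() throws IOException {
            writer.close(); // flushes any buffered deletes and added documents
        }
    }

The point is that deleteDocuments() and updateDocument() only record the terms to delete; the writer applies them in batches when it flushes, which is where the win over issuing one IndexReader delete round-trip per document comes from.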
Re: How do YOU detect corrupt indexes?
I'm not sure exactly how to understand "corrupted" indexes: do you mean indexes that cannot be read or used, or something else? Thanks, DT www.ejinz.com EjinZ Search Engine - Original Message - From: "Doron Cohen" <[EMAIL PROTECTED]> To: Sent: Friday, August 03, 2007 1:03 AM Subject: Re: How do YOU detect corrupt indexes? What is the anticipated cause of corruption? Malicious? Hardware fault? This somewhat reminds me of discussions in the list about encrypting the index. See LUCENE-737 and a discussion pointed by it. One of the opinions there was that encryption should be handled at a lower level (OS/FS). Wouldn't that hold here as well? Joe R wrote: Hello, I've been asked to devise some way to discover and correct data in Lucene indexes that have been "corrupted." The word "corrupt", in this case, has a few different meanings, some of which strike me as exceedingly difficult to grok. What concerns me are the cases where we don't know that an index has been changed: A bit error in a stored field, for instance, is a form of corruption that we (ideally) should be able to identify, at the very least, and hopefully correct. This case seems particularly onerous, since this isn't going to throw an exception of any sort, any time. We have a fairly good handle on how to remedy problems that throw exceptions, so we should be alright with corruption where (say) an operator logs in and overwrites a file. I'm wondering how other Lucene users have tackled this problem in the past. Calculating checksums on the documents seems like one way to go about it: compute a checksum on the document and, in a background thread, compare the checksum to the data. Unfortunately we're building a large, federated system and it would take months to exhaustively check every document this way. Checksumming the files themselves might be too much: We're storing gigabytes of data per index and there is some churn to the data; in other words, the overhead for this method might be too high. Thanks for any help you might have. -Joseph Rose - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How do YOU detect corrupt indexes?
What is the anticipated cause of corruption? Malicious? Hardware fault? This somewhat reminds me of discussions in the list about encrypting the index. See LUCENE-737 and a discussion pointed by it. One of the opinions there was that encryption should be handled at a lower level (OS/FS). Wouldn't that hold here as well? Joe R wrote: > > Hello, > > I've been asked to devise some way to discover and correct data in Lucene > indexes that have been "corrupted." The word "corrupt", in > this case, has a > few different meanings, some of which strike me as exceedingly > difficult to > grok. What concerns me are the cases where we don't know that > an index has > been changed: A bit error in a stored field, for instance, is a form of > corruption that we (ideally) should be able to identify, at the > very least, and > hopefully correct. This case seems particularly > onerous, since > this isn't going to throw an exception of any sort, any time. > > We have a fairly good handle on how to remedy problems that > throw exceptions, > so we should be alright with corruption where (say) an operator > logs in and > overwrites a file. > > I'm wondering how other Lucene users have tackled this problem > in the past. > Calculating checksums on the documents seems like one way to go about it: > compute a checksum on the document and, in a background thread, > compare the > checksum to the data. Unfortunately we're building a large, > federated system > and it would take months to exhaustively check every document this way. > Checksumming the files themselves might be too much: We're > storing gigabytes of > data per index and there is some churn to the data; in other words, the > overhead for this method might be too high. > > Thanks for any help you might have. > > > -Joseph Rose - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How do YOU detect corrupt indexes?
On Friday 03 August 2007 16:03:22 Doron Cohen wrote: > What is the anticipated cause of corruption? Malicious? > Hardware fault? This somewhat reminds me of discussions in > the list about encrypting the index. See LUCENE-737 > and a discussion pointed by it. One of the opinions > there was that encryption should be handled at a lower > level (OS/FS). Wouldn't that hold here as well? That's actually a good point. These days we have filesystems like ZFS which check for corruption automatically. This should remove a lot of the extra digesting work people would otherwise need to do to ensure consistency. Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699 Web: http://nuix.com/ Fax: +61 2 9212 6902 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]