Re: Size + memory restrictions

2006-02-15 Thread Leon Chaddock

Hi Greg,
Thanks. We are actually running against 4 segments of 4 GB each, so about 20 
million docs. We can't merge the segments as there seems to be a problem with 
our Linux box with files over about 4 GB. Not sure why that is.


If I were to upgrade to 8 GB of RAM, does it seem likely this would double the 
number of docs we can handle, or would it provide an exponential increase?


Thanks

Leon
- Original Message - 
From: "Greg Gershman" <[EMAIL PROTECTED]>

To: 
Sent: Wednesday, February 15, 2006 12:41 AM
Subject: Re: Size + memory restrictions



You may consider incrementally adding documents to
your index; I'm not sure why there would be problems
adding to an existing index, but you can always add
additional documents.  You can optimize later to get
everything back into a single segment.

Querying is a different story; if you are using the
Sort API, you will need enough memory to store a full
sorting of your documents in memory.  If you're trying
to sort on a string or anything other than an int or
float, this could require a lot of memory.
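
(A rough sketch of the contrast being described, assuming the Sort API of
that era; the field names "price" and "title" are made up:)

import java.io.IOException;
import org.apache.lucene.search.*;

public class SortMemorySketch {
    // Sorting on an int field caches roughly one int per document.
    static Hits sortByInt(Searcher searcher, Query q) throws IOException {
        return searcher.search(q, new Sort(new SortField("price", SortField.INT)));
    }

    // Sorting on a string field caches every distinct term plus an ordinal per
    // document, which is what can exhaust the heap on a very large index.
    static Hits sortByString(Searcher searcher, Query q) throws IOException {
        return searcher.search(q, new Sort(new SortField("title", SortField.STRING)));
    }
}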

I've used indices much bigger than 5 mil. docs/3.5 gb
with less than 4GB of RAM and had no problems.

Greg


--- Leon Chaddock <[EMAIL PROTECTED]> wrote:


Hi,
we are having tremendous problems building a large
lucene index and querying
it.

The programmers are telling me that when the index
file reaches 3.5 gb or 5
million docs the index file can no longer grow any
larger.

To rectify this they have built index files in
multiple directories. Now
apparently my 4gb memory is not enough to query.

Does this seem right to people or does anyone have
any experience on largish
scale projects.

I am completely tearing my hair out here and dont
know what to do.

Thanks





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]










-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Relevance Feedback Lucene+Algorithms

2006-02-15 Thread Dave Kor
You might also want to look at the LucQE project
(http://sourceforge.net/projects/lucene-qe/), which implements a couple
of automated relevance feedback methods, including Rocchio's formula.
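
For reference, Rocchio's formula in its classical form (stated here from
memory; a, b and c are tuning weights, D_r and D_nr the sets of relevant and
non-relevant document vectors) re-weights the original query vector q0 as:

  q_new = a*q0 + (b/|D_r|) * SUM(d in D_r) d  -  (c/|D_nr|) * SUM(d in D_nr) d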

On 2/15/06, Koji Sekiguchi <[EMAIL PROTECTED]> wrote:
> Please check Grant Ingersoll's presentation at ApacheCon 2005.
> He put out great demo programs for the relevance feedback using Lucene.
>
> Thank you,
>
> Koji
>
> > -Original Message-
> > From: varun sood [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, February 15, 2006 3:36 PM
> > To: java-user@lucene.apache.org
> > Subject: Relevance Feedback Lucene+Algorithms
> >
> >
> > Hi,
> >   Can anyone share the experience of how to implement Relevance
> > Feedback in
> > Lucene?
> >
> > Can someone suggest me some algorithms and papers which can help me in
> > building an effective Relevance Feedback system?
> >
> > Thanks in advance.
> >
> > Dexter.
> >
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


--
Dave Kor, Research Assistant
Center for Information Mining and Extraction
School of Computing
National University of Singapore.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser behaviour ..

2006-02-15 Thread sergiu gordea

Chris Hostetter wrote:


: Exactly, this is my question: why does the QueryParser create a PhraseQuery 
: when it gets several tokens from the analyzer, 
: and not a BooleanQuery?

Because if it did that, there would be no way to write phrase queries :)
 


I'm not very sure about this ...


QueryParser only returns a BooleanQuery when *it* can tell you have
several clauses.  For each "chunk" of text that it thinks of as one
continuous piece of text (either because it doesn't contain whitespaces or
 

Wouldn't it be better to let the analyzer decide whether there is a continuous 
piece of text, and to build PhraseQueries only when the quote sign is found?


because it has quotes around it) it gives it to the analyzer, if the
analyzer says there are multiple Terms there then QueryParser makes a
PhraseQuery out of it.   or in a nutshell:
  1) if the Parser detects multiple terms, it makes a boolean query
  2) if the Analyzer detects multiple terms, it makes a phrase query
 

This is related to my comment above. From the user's point of view, I think it 
would make sense to build a phrase query only when quotes are found in the 
search string.

I think there are pro and con arguments for "unifying" the behaviour.
I would be happy if the QueryParser didn't create phrase queries unless I 
explicitly asked it to.


Does someone have a different opinion?


if you don't like this behavior, it can all be circumvented by overriding
getFieldQuery().  you don't even have to deal with the analyzer if you
don't want to.  just call super.getFieldQuery() and if you get back a
PhraseQuery take it apart and build TermQueries wrapped in a boolean
query.
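
A minimal sketch of the override Hoss describes, against the QueryParser of
that era (the exact getFieldQuery signature differs between Lucene versions,
so treat this as an illustration rather than drop-in code):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

public class NonPhraseQueryParser extends QueryParser {
    public NonPhraseQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    protected Query getFieldQuery(String field, String queryText) throws ParseException {
        Query q = super.getFieldQuery(field, queryText);
        if (!(q instanceof PhraseQuery)) {
            return q;  // a single term, or something the parser already split itself
        }
        // Take the phrase apart and OR its terms together instead.
        Term[] terms = ((PhraseQuery) q).getTerms();
        BooleanQuery bq = new BooleanQuery();
        for (int i = 0; i < terms.length; i++) {
            bq.add(new TermQuery(terms[i]), false, false);  // optional, not prohibited
        }
        return bq;
    }
}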
 

Well, there is always a workaround. It is obvious that searching for 
word1,word2,word3 was a silly mistake, but I needed an hour to find out why a 
PhraseQuery was created when no quotes existed in the query string.

So ... my opinion is that what I suggest would improve the usability of 
Lucene, and I hope the Lucene developers share my opinion.


Best,

Sergiu





-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


 




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Relevance Feedback Lucene+Algorithms

2006-02-15 Thread Grant Ingersoll

URL is http://www.cnlp.org/apachecon2005/

Koji Sekiguchi wrote:

Please check Grant Ingersoll's presentation at ApacheCon 2005.
He put out great demo programs for the relevance feedback using Lucene.

Thank you,

Koji

  

-Original Message-
From: varun sood [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 15, 2006 3:36 PM
To: java-user@lucene.apache.org
Subject: Relevance Feedback Lucene+Algorithms


Hi,
  Can anyone share the experience of how to implement Relevance 
Feedback in

Lucene?

Can someone suggest me some algorithms and papers which can help me in
building an effective Relevance Feedback system?

Thanks in advance.

Dexter.





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


  


--
--- 
Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: QueryParser behaviour ..

2006-02-15 Thread Yonik Seeley
> From the user's point of view I think it will make sense to
> build a phrase query only when the quotes are found in the search string.

You make an interesting point Sergiu.  Your proposal would increase
the expressive power of the QueryParser by allowing the construction
of either phrase queries or boolean queries when multiple tokens are
produced by analysis.

The main downside is that it's not backward compatible, and without
quotes (and hence phrase queries) many older queries will produce
worse results.  I also think that a majority of the time, when
multiple tokens are produced, you do want a phrase search (or at least
a sloppy one).

Of course, the backward compatible thing can be fixed via a flag on
the query parser that defaults to the old behavior.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Help with mass delete from large index

2006-02-15 Thread Chandramohan
> perform such a cull again, you might make several
> distinct indexes (one per 
> day, per week, per whatever) during that reindexing
> so the next time will be 
> much easier.

How would you search and consolidate the results
across multiple indexes?  Hits from each index will
have independent scoring.

CL

--- "Michael D. Curtin" <[EMAIL PROTECTED]> wrote:

> Now that it's already in 1 index, I'm afraid you
> can't just delete a few 
> files.  On the other hand, if it's only a one-time
> thing, reindexing with only 
> the docs you want shouldn't be too bad.  If you
> think you might ever need to 
> perform such a cull again, you might make several
> distinct indexes (one per 
> day, per week, per whatever) during that reindexing
> so the next time will be 
> much easier.
> 
> Good luck!
> 
> --MDC
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 
> 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Size + memory restrictions

2006-02-15 Thread Leon Chaddock

Looking into the memory problems further I read

"Every time you open an IndexSearcher/IndexReader resources are used which
take up memory.  for an application pointed at a static index, you only
ever need one IndexReader/IndexSearcher that can be shared among multiple
threads issuing queries.  if your index is being incrimentally updated,
you should never need more then two searcher/reader pairs open at a time"

We may have many different segments of our index, and it seems below we are 
using one
IndexSearcher per segment. Could this explain why we run out of memory when 
using more than 2/3 segments?

Anyone else have any comments on the below?

Many thanks

Leon
PS: At the moment I think it is set to only look at 2 segments

private Searcher getSearcher() throws IOException {
  if (mSearcher == null) {
    synchronized (Monitor) {
      Searcher[] srs = new IndexSearcher[SearchersDir.size()];
      int maxI = 2;
      // Searcher[] srs = new IndexSearcher[maxI];
      int i = 0;
      for (Iterator iter = SearchersDir.iterator(); iter.hasNext() && i < maxI; i++) {
        String dir = (String) iter.next();
        try {
          srs[i] = new IndexSearcher(IndexDir + dir);
        } catch (IOException e) {
          log.error(ClassTool.getClassNameOnly(e) + ": " + e.getMessage(), e);
        }
      }
      mSearcher = new MultiSearcher(srs);
      changeTime = System.currentTimeMillis();
    }
  }
  return mSearcher;
}
- Original Message - 
From: "Leon Chaddock" <[EMAIL PROTECTED]>

To: 
Sent: Wednesday, February 15, 2006 9:28 AM
Subject: Re: Size + memory restrictions



Hi Greg,
Thanks. We are actually running against 4 segments of 4gb so about 20 
million docs. We cant merge the segments as their seems to be problems 
with out linux box , with having files over about 4gb. Not sure why that 
is.


If I was to upgrade to 8gb of ram does it seem likely this will double the 
amount of docs we can handle, or would this provide an exponential 
increase?


Thanks

Leon
- Original Message - 
From: "Greg Gershman" <[EMAIL PROTECTED]>

To: 
Sent: Wednesday, February 15, 2006 12:41 AM
Subject: Re: Size + memory restrictions



You may consider incrementally adding documents to
your index; I'm not sure why there would be problems
adding to an existing index, but you can always add
additional documents.  You can optimize later to get
everything back into a single segment.

Querying is a different story; if you are using the
Sort API, you will need enough memory to store a full
sorting of your documents in memory.  If you're trying
to sort on a string or anything other than an int or
float, this could require a lot of memory.

I've used indices much bigger than 5 mil. docs/3.5 gb
with less than 4GB of RAM and had no problems.

Greg


--- Leon Chaddock <[EMAIL PROTECTED]> wrote:


Hi,
we are having tremendous problems building a large
lucene index and querying
it.

The programmers are telling me that when the index
file reaches 3.5 gb or 5
million docs the index file can no longer grow any
larger.

To rectify this they have built index files in
multiple directories. Now
apparently my 4gb memory is not enough to query.

Does this seem right to people or does anyone have
any experience on largish
scale projects.

I am completely tearing my hair out here and dont
know what to do.

Thanks





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]











-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]










-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Help with mass delete from large index

2006-02-15 Thread Michael D. Curtin

Chandramohan wrote:


perform such a cull again, you might make several
distinct indexes (one per 
day, per week, per whatever) during that reindexing
so the next time will be 
much easier.


How would you search and consolidate the results
across multiple indexes?  Hits from each index will
have independent scoring.


Frankly, I ignore the scores in my application.  The data itself isn't English 
prose, so the TF/IDF calculations are a stretch at best as a measure of 
relevance.  I presort the documents to be in "relevance" order (a popularity 
metric), then specify index ordering for the results.
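
(A sketch of that arrangement, assuming the documents were added in descending
popularity order; searcher and query are whatever the application already has:)

import java.io.IOException;
import org.apache.lucene.search.*;

public class IndexOrderSearch {
    // Index order doubles as "relevance" order, so no score comparison is needed
    // and results from differently sized sub-indexes stay comparable.
    static Hits mostPopularFirst(Searcher searcher, Query query) throws IOException {
        return searcher.search(query, Sort.INDEXORDER);
    }
}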


If that wouldn't work for your application, it seems to me that large-enough 
sub-sections *would* produce equivalent scores.  That is, if the sub-indexes 
were big enough, one could directly compare scores, so a simple merge would 
work.  If the total document corpus is small, then the need for sub-indexes 
isn't there anyhow.


--MDC

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Performance Issues

2006-02-15 Thread Urvashi Gadi

Hi All,

My system requires traversing Hits (the search results) and extracting some 
data from them. If the result set is very large, my system becomes very 
slow.

Is there a way to increase performance? Is there a way I can limit the 
number of most relevant documents returned?


Best regards,
Urvashi




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: index merging

2006-02-15 Thread Omar Didi
I have tried to use the isCurrent() method of IndexReader to figure out if an 
index is merging, but since I have to do this every time I need to add a 
document, performance got very slow.

Here is what I am doing: I create 4 indexes and I am running with 4 threads. I 
do a round robin on the indexes whenever I process a new document. Before 
adding a document I need to check if the index is merging; if that is the case, 
I send the document to an index that is not merging.

Is there a better way to index with multiple threads? Or what is the fastest way to 
check that an index is not merging?

thanks for any hints,

- Omar

-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Monday, February 06, 2006 10:03 AM
To: java-user@lucene.apache.org
Subject: Re: index merging


On 2/6/06, Vanlerberghe, Luc <[EMAIL PROTECTED]> wrote:
> Sorry to contradict you Yonik, but I'm pretty sure the commit lock is
> *not* locked during a merge, only while the "segments" file is being
> updated.

Oops, you're right.  Good thing too... if the commit lock was held
during merges, one couldn't even open up a new IndexReader.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Iterating hits

2006-02-15 Thread Daniel Cortes

Hi Lucene users, I have a strange error and I don't know what to do.
My logs say this:
java.lang.ArrayIndexOutOfBoundsException: 100 >= 100
   at java.util.Vector.elementAt(Vector.java:431)
   at org.apache.lucene.search.Hits.hitDoc(Hits.java:127)
   at org.apache.lucene.search.Hits.doc(Hits.java:89)

my code is this:
   PrefixQuery p = new PrefixQuery(new 
Term("TOOL_REF_ID", getINITIAL(tool)));

   Hits h = sr.search(p);
   for (int i = 0; i < h.length(); i++) {
     log.debug(h.doc(i).getField("TYPE") + " " + h.doc(i).getField("TOOL_REF_ID"));
     reader.delete(h.id(i));
   }

Why? How can I delete all the documents whose TOOL_REF_ID begins with, 
for example, "AK"?



Searching about it I find this :
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200306.mbox/[EMAIL 
PROTECTED]

thks for any reply.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Iterating hits

2006-02-15 Thread Yonik Seeley
Try using a different reader to delete the documents.
Hits can re-execute a query, and if the searcher you are using is
sharing the reader you are deleting with, it's like changing a list
you are iterating over (fewer hits will be found the next time the
query is executed).

-Yonik

On 2/15/06, Daniel Cortes <[EMAIL PROTECTED]> wrote:
> Hi lucene users I have a strange error and I don't know to do?
> My logs say this:
> java.lang.ArrayIndexOutOfBoundsException: 100 >= 100
> at java.util.Vector.elementAt(Vector.java:431)
> at org.apache.lucene.search.Hits.hitDoc(Hits.java:127)
> at org.apache.lucene.search.Hits.doc(Hits.java:89)
>
> my code is this:
>     PrefixQuery p = new PrefixQuery(new
> Term("TOOL_REF_ID", getINITIAL(tool)));
>     Hits h = sr.search(p);
>     for (int i = 0; i < h.length(); i++) {
>       log.debug(h.doc(i).getField("TYPE") + " "
>         + h.doc(i).getField("TOOL_REF_ID"));
>       reader.delete(h.id(i));
>     }
>
> Why? How can I delete all the documents whose TOOL_REF_ID
> begins with, for example, "AK"?
>
>
> Searching about it I find this :
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200306.mbox/[EMAIL 
> PROTECTED]
>
> thks for any reply.
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Size + memory restrictions

2006-02-15 Thread Chris Hostetter
: We may have many different segments of our index, and it seems below we are
: using one
: IndexSearcher per segment. Could this explain why we run out of memory when
: using more than 2/3 segments?
: Anyone else have any comments on the below?

Terminology is a big issue here .. when you use the word "segment" it
seems like you are talking about a segment of your data, which is a
self-contained index in its own right.  My point in the comment you
quoted was that for a given index, you don't need more than one active
IndexSearcher open at a time; any more than that can waste resources.

I don't know what kind of memory overhead there is in a MultiSearcher, but
besides that you should also be looking at the other issues in the message
you quoted from:   who/when is calling your getSearcher() method? ... is
it getting called more often than the underlying indexes change?  who is
closing the old searchers when you open new ones?

:
: Many thanks
:
: Leon
: ps. At the moment I think it is set to only look at 2 segements
:
: private Searcher getSearcher() throws IOException {
:   if (mSearcher == null) {
:synchronized (Monitor) {
: Searcher[] srs = new IndexSearcher[SearchersDir.size()];
: int maxI = 2;
:// Searcher[] srs = new IndexSearcher[maxI];
: int i = 0;
: for (Iterator iter = SearchersDir.iterator(); iter.hasNext() && i
: To: 
: Sent: Wednesday, February 15, 2006 9:28 AM
: Subject: Re: Size + memory restrictions
:
:
: > Hi Greg,
: > Thanks. We are actually running against 4 segments of 4gb so about 20
: > million docs. We cant merge the segments as their seems to be problems
: > with out linux box , with having files over about 4gb. Not sure why that
: > is.
: >
: > If I was to upgrade to 8gb of ram does it seem likely this will double the
: > amount of docs we can handle, or would this provide an exponential
: > increase?
: >
: > Thanks
: >
: > Leon
: > - Original Message -
: > From: "Greg Gershman" <[EMAIL PROTECTED]>
: > To: 
: > Sent: Wednesday, February 15, 2006 12:41 AM
: > Subject: Re: Size + memory restrictions
: >
: >
: >> You may consider incrementally adding documents to
: >> your index; I'm not sure why there would be problems
: >> adding to an existing index, but you can always add
: >> additional documents.  You can optimize later to get
: >> everything back into a single segment.
: >>
: >> Querying is a different story; if you are using the
: >> Sort API, you will need enough memory to store a full
: >> sorting of your documents in memory.  If you're trying
: >> to sort on a string or anything other than an int or
: >> float, this could require a lot of memory.
: >>
: >> I've used indices much bigger than 5 mil. docs/3.5 gb
: >> with less than 4GB of RAM and had no problems.
: >>
: >> Greg
: >>
: >>
: >> --- Leon Chaddock <[EMAIL PROTECTED]> wrote:
: >>
: >>> Hi,
: >>> we are having tremendous problems building a large
: >>> lucene index and querying
: >>> it.
: >>>
: >>> The programmers are telling me that when the index
: >>> file reaches 3.5 gb or 5
: >>> million docs the index file can no longer grow any
: >>> larger.
: >>>
: >>> To rectify this they have built index files in
: >>> multiple directories. Now
: >>> apparently my 4gb memory is not enough to query.
: >>>
: >>> Does this seem right to people or does anyone have
: >>> any experience on largish
: >>> scale projects.
: >>>
: >>> I am completely tearing my hair out here and dont
: >>> know what to do.
: >>>
: >>> Thanks
: >>>
: >>
: >>
: >> __
: >> Do You Yahoo!?
: >> Tired of spam?  Yahoo! Mail has the best spam protection around
: >> http://mail.yahoo.com
: >>
: >> -
: >> To unsubscribe, e-mail: [EMAIL PROTECTED]
: >> For additional commands, e-mail: [EMAIL PROTECTED]
: >>
: >>
: >>
: >>
: >>
: >> --
: >> Internal Virus Database is out-of-date.
: >> Checked by AVG Free Edition.
: >> Version: 7.1.375 / Virus Database: 267.15.0/248 - Release Date:
: >> 01/02/2006
: >>
: >>
: >
: >
: > -
: > To unsubscribe, e-mail: [EMAIL PROTECTED]
: > For additional commands, e-mail: [EMAIL PROTECTED]
: >
: >
: >
: >
: >
: > --
: > Internal Virus Database is out-of-date.
: > Checked by AVG Free Edition.
: > Version: 7.1.375 / Virus Database: 267.15.0/248 - Release Date: 01/02/2006
: >
: >
:
:
: -
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Size + memory restrictions

2006-02-15 Thread Leon Chaddock

Hi Chris,
Thanks; when I said "segment" I meant index file.
So if we have 10 separate index files, are you saying we should have one 
IndexSearcher for the index collectively, or one per index file?


Thanks

Leon


- Original Message - 
From: "Chris Hostetter" <[EMAIL PROTECTED]>

To: 
Sent: Wednesday, February 15, 2006 6:40 PM
Subject: Re: Size + memory restrictions


: We may have many different segments of our index, and it seems below we 
are

: using one
: IndexSearcher per segment. Could this explain why we run out of memory 
when

: using more than 2/3 segments?
: Anyone else have any comments on the below?

terminology is a big issue hwere .. when you use the word "segment" it
seems like you are talking about a segment of your data, which is in a
self contained index in it's own right.  My point in in the comment you
quoted was that for a given index, you don't need more then one active
IndexSearcher open at a time, any more then that can waste resources.

I don't know what kind of memory overhead there is in a MultiSearcher, but
besides that you should also be looking at the other issues in the message
you quoted from:   who/when is calling your getSearcher() method? ... is
it getting called more often then the underlying indexes change?  who is
closing the old searchers when you open new ones?

:
: Many thanks
:
: Leon
: ps. At the moment I think it is set to only look at 2 segements
:
: private Searcher getSearcher() throws IOException {
:   if (mSearcher == null) {
:synchronized (Monitor) {
: Searcher[] srs = new IndexSearcher[SearchersDir.size()];
: int maxI = 2;
:// Searcher[] srs = new IndexSearcher[maxI];
: int i = 0;
: for (Iterator iter = SearchersDir.iterator(); iter.hasNext() && 
i
: i++) {
:  String dir = (String) iter.next();
:  try {
:   srs[i] = new IndexSearcher(IndexDir+dir);
:  } catch (IOException e) {
:   log.error(ClassTool.getClassNameOnly(e) + ": " + e.getMessage(), 
e);

:  }
: }
: mSearcher = new MultiSearcher(srs);
: changeTime = System.currentTimeMillis();
:}
:   }
:   return mSearcher;
:  }
: - Original Message -
: From: "Leon Chaddock" <[EMAIL PROTECTED]>
: To: 
: Sent: Wednesday, February 15, 2006 9:28 AM
: Subject: Re: Size + memory restrictions
:
:
: > Hi Greg,
: > Thanks. We are actually running against 4 segments of 4gb so about 20
: > million docs. We cant merge the segments as their seems to be problems
: > with out linux box , with having files over about 4gb. Not sure why 
that

: > is.
: >
: > If I was to upgrade to 8gb of ram does it seem likely this will double 
the

: > amount of docs we can handle, or would this provide an exponential
: > increase?
: >
: > Thanks
: >
: > Leon
: > - Original Message -
: > From: "Greg Gershman" <[EMAIL PROTECTED]>
: > To: 
: > Sent: Wednesday, February 15, 2006 12:41 AM
: > Subject: Re: Size + memory restrictions
: >
: >
: >> You may consider incrementally adding documents to
: >> your index; I'm not sure why there would be problems
: >> adding to an existing index, but you can always add
: >> additional documents.  You can optimize later to get
: >> everything back into a single segment.
: >>
: >> Querying is a different story; if you are using the
: >> Sort API, you will need enough memory to store a full
: >> sorting of your documents in memory.  If you're trying
: >> to sort on a string or anything other than an int or
: >> float, this could require a lot of memory.
: >>
: >> I've used indices much bigger than 5 mil. docs/3.5 gb
: >> with less than 4GB of RAM and had no problems.
: >>
: >> Greg
: >>
: >>
: >> --- Leon Chaddock <[EMAIL PROTECTED]> wrote:
: >>
: >>> Hi,
: >>> we are having tremendous problems building a large
: >>> lucene index and querying
: >>> it.
: >>>
: >>> The programmers are telling me that when the index
: >>> file reaches 3.5 gb or 5
: >>> million docs the index file can no longer grow any
: >>> larger.
: >>>
: >>> To rectify this they have built index files in
: >>> multiple directories. Now
: >>> apparently my 4gb memory is not enough to query.
: >>>
: >>> Does this seem right to people or does anyone have
: >>> any experience on largish
: >>> scale projects.
: >>>
: >>> I am completely tearing my hair out here and dont
: >>> know what to do.
: >>>
: >>> Thanks
: >>>
: >>
: >>
: >> __
: >> Do You Yahoo!?
: >> Tired of spam?  Yahoo! Mail has the best spam protection around
: >> http://mail.yahoo.com
: >>
: >> -
: >> To unsubscribe, e-mail: [EMAIL PROTECTED]
: >> For additional commands, e-mail: [EMAIL PROTECTED]
: >>
: >>
: >>
: >>
: >>
: >> --
: >> Internal Virus Database is out-of-date.
: >> Checked by AVG Free Edition.
: >> Version: 7.1.375 / Virus Database: 267.15.0/248 - Release Date:
: >> 01/02/2006
: >>
: >>
: >
: >
: > -

Re: Size + memory restrictions

2006-02-15 Thread Otis Gospodnetic
Leon,

An index is typically a directory on disk with files (commonly called "index 
files") in it.
Each index can have 1 or more segments.
Each segment is made up of several index files.

If you are using the compound index format, then the situation is a bit 
different (fewer index files).

Otis
P.S.
You asked about Lucene in Action... :)

- Original Message 
From: Chris Hostetter <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, February 15, 2006 1:40:01 PM
Subject: Re: Size + memory restrictions

: We may have many different segments of our index, and it seems below we are
: using one
: IndexSearcher per segment. Could this explain why we run out of memory when
: using more than 2/3 segments?
: Anyone else have any comments on the below?

terminology is a big issue hwere .. when you use the word "segment" it
seems like you are talking about a segment of your data, which is in a
self contained index in it's own right.  My point in in the comment you
quoted was that for a given index, you don't need more then one active
IndexSearcher open at a time, any more then that can waste resources.

I don't know what kind of memory overhead there is in a MultiSearcher, but
besides that you should also be looking at the other issues in the message
you quoted from:   who/when is calling your getSearcher() method? ... is
it getting called more often then the underlying indexes change?  who is
closing the old searchers when you open new ones?

:
: Many thanks
:
: Leon
: ps. At the moment I think it is set to only look at 2 segements
:
: private Searcher getSearcher() throws IOException {
:   if (mSearcher == null) {
:synchronized (Monitor) {
: Searcher[] srs = new IndexSearcher[SearchersDir.size()];
: int maxI = 2;
:// Searcher[] srs = new IndexSearcher[maxI];
: int i = 0;
: for (Iterator iter = SearchersDir.iterator(); iter.hasNext() && i
: To: 
: Sent: Wednesday, February 15, 2006 9:28 AM
: Subject: Re: Size + memory restrictions
:
:
: > Hi Greg,
: > Thanks. We are actually running against 4 segments of 4gb so about 20
: > million docs. We cant merge the segments as their seems to be problems
: > with out linux box , with having files over about 4gb. Not sure why that
: > is.
: >
: > If I was to upgrade to 8gb of ram does it seem likely this will double the
: > amount of docs we can handle, or would this provide an exponential
: > increase?
: >
: > Thanks
: >
: > Leon
: > - Original Message -
: > From: "Greg Gershman" <[EMAIL PROTECTED]>
: > To: 
: > Sent: Wednesday, February 15, 2006 12:41 AM
: > Subject: Re: Size + memory restrictions
: >
: >
: >> You may consider incrementally adding documents to
: >> your index; I'm not sure why there would be problems
: >> adding to an existing index, but you can always add
: >> additional documents.  You can optimize later to get
: >> everything back into a single segment.
: >>
: >> Querying is a different story; if you are using the
: >> Sort API, you will need enough memory to store a full
: >> sorting of your documents in memory.  If you're trying
: >> to sort on a string or anything other than an int or
: >> float, this could require a lot of memory.
: >>
: >> I've used indices much bigger than 5 mil. docs/3.5 gb
: >> with less than 4GB of RAM and had no problems.
: >>
: >> Greg
: >>
: >>
: >> --- Leon Chaddock <[EMAIL PROTECTED]> wrote:
: >>
: >>> Hi,
: >>> we are having tremendous problems building a large
: >>> lucene index and querying
: >>> it.
: >>>
: >>> The programmers are telling me that when the index
: >>> file reaches 3.5 gb or 5
: >>> million docs the index file can no longer grow any
: >>> larger.
: >>>
: >>> To rectify this they have built index files in
: >>> multiple directories. Now
: >>> apparently my 4gb memory is not enough to query.
: >>>
: >>> Does this seem right to people or does anyone have
: >>> any experience on largish
: >>> scale projects.
: >>>
: >>> I am completely tearing my hair out here and dont
: >>> know what to do.
: >>>
: >>> Thanks
: >>>
: >>
: >>
: >> __
: >> Do You Yahoo!?
: >> Tired of spam?  Yahoo! Mail has the best spam protection around
: >> http://mail.yahoo.com
: >>
: >> -
: >> To unsubscribe, e-mail: [EMAIL PROTECTED]
: >> For additional commands, e-mail: [EMAIL PROTECTED]
: >>
: >>
: >>
: >>
: >>
: >> --
: >> Internal Virus Database is out-of-date.
: >> Checked by AVG Free Edition.
: >> Version: 7.1.375 / Virus Database: 267.15.0/248 - Release Date:
: >> 01/02/2006
: >>
: >>
: >
: >
: > -
: > To unsubscribe, e-mail: [EMAIL PROTECTED]
: > For additional commands, e-mail: [EMAIL PROTECTED]
: >
: >
: >
: >
: >
: > --
: > Internal Virus Database is out-of-date.
: > Checked by AVG Free Edition.
: > Version: 7.1.375 / Virus Database: 267.15.0/248 - Release Date: 01/02/2006
: >
: >
:
:
: --

Re: Relevance Feedback Lucene+Algorithms

2006-02-15 Thread varun sood
Hi,
Thanks for replying.

I read your ppt. It is good, but neither the code nor the basic relevance feedback
is explained there. Actually, I am not familiar with JSP, JUnit, Maven, etc., so I
guess it will take me a lot of time to discover how things work in the demo program,
because I have to learn all those technologies first.
Is there any documentation or some brief notes on how Relevance Feedback has
been or could be done? I am looking at a manual Relevance Feedback system.


Thanks,
Dexter


On 2/15/06, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> URL is http://www.cnlp.org/apachecon2005/
>
> Koji Sekiguchi wrote:
> > Please check Grant Ingersoll's presentation at ApacheCon 2005.
> > He put out great demo programs for the relevance feedback using Lucene.
> >
> > Thank you,
> >
> > Koji
> >
> >
> >> -Original Message-
> >> From: varun sood [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, February 15, 2006 3:36 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Relevance Feedback Lucene+Algorithms
> >>
> >>
> >> Hi,
> >>   Can anyone share the experience of how to implement Relevance
> >> Feedback in
> >> Lucene?
> >>
> >> Can someone suggest me some algorithms and papers which can help me in
> >> building an effective Relevance Feedback system?
> >>
> >> Thanks in advance.
> >>
> >> Dexter.
> >>
> >>
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
>
> --
> ---
> Grant Ingersoll
> Sr. Software Engineer
> Center for Natural Language Processing
> Syracuse University
> School of Information Studies
> 335 Hinds Hall
> Syracuse, NY 13244
>
> http://www.cnlp.org
> Voice:  315-443-5484
> Fax: 315-443-6886
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: index merging

2006-02-15 Thread Daniel Noll

Omar Didi wrote:

I have tried to use the isCurrent() method IndexReader to figure out
if  an index is merging. but since I have to do this evrytime I need
to add a document, the performance got s slow.

here is what I am doing, I create 4 indexs and I am running with 4
threads. I do a round robbin on the indexes when ever I process a new
document. before adding a document I need to check if the index is
merging, if it's the case then send this document to an index that is
not merging.

is there a better to index with multi threads? or what is the fastet
way to check that a index is not merging?


I've done this before by having a single work queue of documents which 
need adding.  Each of the four indexing threads refer to that queue and 
can pull documents off that queue.


The concurrency utility classes in java.util.concurrent may help with 
this approach.
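
A minimal sketch of that arrangement (assumes Java 5, a single IndexWriter
shared by all workers, and a made-up index path; the poison-pill document is
just a way to tell workers to stop):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class QueueIndexer {
    static final Document POISON = new Document();  // sentinel that tells a worker to stop

    public static void main(String[] args) throws Exception {
        final BlockingQueue<Document> queue = new LinkedBlockingQueue<Document>(1000);
        final IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);

        Runnable worker = new Runnable() {
            public void run() {
                try {
                    for (Document doc = queue.take(); doc != POISON; doc = queue.take()) {
                        writer.addDocument(doc);  // one writer shared by all threads
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };

        int threads = 4;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(worker);
            workers[i].start();
        }

        // ... the producer calls queue.put(doc) for each document to index ...

        for (int i = 0; i < threads; i++) {
            queue.put(POISON);       // one poison pill per worker
        }
        for (int i = 0; i < threads; i++) {
            workers[i].join();
        }
        writer.optimize();
        writer.close();
    }
}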


Daniel

--
Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Relevance Feedback Lucene+Algorithms

2006-02-15 Thread Grant Ingersoll
In the example code, take a look at the SearchServlet.java code and the 
performFeedback and getTopTerms() methods, which demonstrate the use of 
the term vectors.  It is fairly well commented.  You don't need maven, 
JSP or JUnit for this.  On the indexing side, look at the TVHTMLDocument 
for how to construct the term vectors.
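
(A rough sketch of the term-vector round trip, not Grant's demo code, assuming
the Field.Text/TermFreqVector API of that era; the field name is made up:
store the vector at index time, then read it back to pick expansion terms.)

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorSketch {
    // Indexing side: the boolean argument asks Lucene to store a term vector.
    static Document makeDoc(String text) {
        Document doc = new Document();
        doc.add(Field.Text("contents", text, true));
        return doc;
    }

    // Feedback side: read the vector of a document the user marked relevant,
    // then rank its terms (by frequency, or tf*idf) to expand the query.
    static void dumpVector(IndexReader reader, int docId) throws IOException {
        TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
        // tfv is null if the field was indexed without term vectors
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " " + freqs[i]);
        }
    }
}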


As for how to do Rel. Feedback, you can search the mailing list archive, 
there have been many discussions in the past that will offer insights 
into RF in Lucene.  I also like the book "Modern Information Retrieval" 
by Baeza-Yates et al. as a text for the theory behind RF.  You may also 
find the MoreLikeThis implementation (again, search this mailing list 
and look in the Lucene contrib section) satisfies your needs.


Hope this helps,
Grant

varun sood wrote:

Hi
Thanks for replying.

I read your ppt. It is good. But the code or the basic relevance feedback is
not explained there. Actually I am not familiar with JSP, JUnit, Maven, etc.
I guess It will take me lot of time to actually discover how the things work
in demo program because I have to learn all those technologies first.
Is there any documentation or some brief notes on how Relevance Feedback has
been or could be done? I am looking on manual Relevance Feedback system.


Thanks,
Dexter


On 2/15/06, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
  

URL is http://www.cnlp.org/apachecon2005/

Koji Sekiguchi wrote:


Please check Grant Ingersoll's presentation at ApacheCon 2005.
He put out great demo programs for the relevance feedback using Lucene.

Thank you,

Koji


  

-Original Message-
From: varun sood [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 15, 2006 3:36 PM
To: java-user@lucene.apache.org
Subject: Relevance Feedback Lucene+Algorithms


Hi,
  Can anyone share the experience of how to implement Relevance
Feedback in
Lucene?

Can someone suggest me some algorithms and papers which can help me in
building an effective Relevance Feedback system?

Thanks in advance.

Dexter.




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



  

--
---
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org
Voice:  315-443-5484
Fax: 315-443-6886


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





  


--
--- 
Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Hardware Requirements for a large index?

2006-02-15 Thread Chun Wei Ho
Hi,

I am in the process of deciding specs for a crawling machine and a
searching machine (two machines), which will support merging/indexing
and searching operations on a single Lucene index that may scale to
several million pages (at which point it would be about 2-10 GB,
assuming linear growth with pages).

What is the range of hardware that I should be looking at? Could
anyone share their deployment/hardware specs for a large index size?
I'm looking for RAM and CPU considerations.

Also, what is the preferred platform? Java has a max memory allocation
of 4GB on Solaris and 2GB on Linux -> does it make sense to get more
RAM than this?

Thanks!

CW

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



How to index numeric fields

2006-02-15 Thread Shivani Sawhney
Hi,

What is the best way to index numeric decimal fields, like experience, when
I want to use a range search on this field?

 

Thanks in advance.

 

Regards,

Shivani

 

 



Re: How to index numeric fields

2006-02-15 Thread Otis Gospodnetic
Here are a few bits:
http://www.lucenebook.com/search?query=indexing+numbers
The Wiki and the FAQ also have some information about indexing numbers/dates.
Basically, you want them small (ints, for faster sorting, if you need sorting), and 
you don't want them too fine-grained if you'll be expanding them into a Boolean OR query.
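
A minimal sketch of the usual trick, with a made-up field name and precision:
pad the numbers so that lexicographic term order matches numeric order, which
is what RangeQuery compares by.

import java.text.DecimalFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeQuery;

public class ExperienceField {
    private static final DecimalFormat FMT = new DecimalFormat("0000");

    // Store experience as tenths of a year, zero-padded to four digits: 7.5 -> "0075".
    static String pad(double years) {
        return FMT.format(Math.round(years * 10));
    }

    static void addTo(Document doc, double years) {
        doc.add(Field.Keyword("experience", pad(years)));
    }

    // Inclusive range search for 5.0 to 10.0 years.
    static Query between5And10() {
        return new RangeQuery(new Term("experience", pad(5.0)),
                              new Term("experience", pad(10.0)), true);
    }
}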

Otis


- Original Message 
From: Shivani Sawhney <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, February 15, 2006 11:36:37 PM
Subject: How to index numeric fields

Hi,

What is the best way to index numeric decimal fields, like experience, when
I want to use a range search on this field?

 

Thanks in advance.

 

Regards,

Shivani

 

 





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: ArrayIndexOutOfBoundsException while closing the index writer

2006-02-15 Thread Otis Gospodnetic
Who knows what else the app is doing.
However, I can quickly suggest that you add a finally block and close your 
writer in there if writer != null.
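
Something along these lines might do it (a sketch only, meant to slot into the
poster's existing indexer class; indexFile is the poster's own helper and the
create flag is whatever the application already computes):

void indexOne(File indexDirFile, File resumeFile, boolean create) throws IOException {
    IndexWriter writer = null;
    try {
        writer = new IndexWriter(indexDirFile, new StandardAnalyzer(), create);
        indexFile(writer, resumeFile);   // existing indexing logic
    } finally {
        if (writer != null) {
            writer.close();              // always releases the write lock, even on failure
        }
    }
}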

Otis

- Original Message 
From: Shivani Sawhney <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, February 15, 2006 11:31:12 PM
Subject: ArrayIndexOutOfBoundsException while closing the index writer

Hi,

I have used Lucene in my application and am just indexing and searching on
some documents. The code that indexes the documents was working fine till
yesterday and suddenly stopped working.

I get an error when I am trying to close the index writer. The code is as
follows:

 

 

.

IndexWriter indexwriter = new IndexWriter(indexDirFile, new
StandardAnalyzer(), flag);

indexFile(indexwriter, resumeFile);

indexwriter.close(); //causing errors

} catch (IOException e)

{   

e.printStackTrace();

throw new Error(e);

}

.

 

And the error log is as follows:

 

2006-02-15 18:47:48,748 WARN  [org.apache.struts.action.RequestProcessor]
Unhandled Exception thrown: class java.lang.ArrayIndexOutOfBoundsException

2006-02-15 18:47:48,748 ERROR [org.jboss.web.localhost.Engine]
StandardWrapperValve[action]: Servlet.service() for servlet action threw
exception

java.lang.ArrayIndexOutOfBoundsException: 105 >= 25

at java.util.Vector.elementAt(Vector.java:432)

at
org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:135)

at
org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:103)

at
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:237)

at
org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:169)

at
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:97)

at
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:425)

at
org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:373)

at
org.apache.lucene.index.IndexWriter.close(IndexWriter.java:193)

at rd.admin.NewIndexer.indexTextFile(NewIndexer.java:108)

at rd.admin.AddResume.indexOneRow(AddResume.java:38)

at
rd.admin.LuceneGateway.buildMapAndIndex(LuceneGateway.java:46)

at rd.admin.LuceneGateway.indexResume(LuceneGateway.java:30)

at
rd.admin.UploadResumeAgainstRequisition.npExecute(UploadResumeAgainstRequisi
tion.java:106)

at np.core.BaseNPAction.execute(BaseNPAction.java:116)

at
org.apache.struts.action.RequestProcessor.processActionPerform(RequestProces
sor.java:421)

at
org.apache.struts.action.RequestProcessor.process(RequestProcessor.java:226)

at
org.apache.struts.action.ActionServlet.process(ActionServlet.java:1164)

at
org.apache.struts.action.ActionServlet.doPost(ActionServlet.java:415)

at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)

at javax.servlet.http.HttpServlet.service(HttpServlet.java:810)

at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
FilterChain.java:237)

at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
ain.java:157)

at
org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.ja
va:75)

at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
FilterChain.java:186)

at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
ain.java:157)

at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
va:214)

at
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContex
t.java:104)

at
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)

at
org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContext
Valve.java:198)

at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
va:152)

at
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContex
t.java:104)

at
org.jboss.web.tomcat.security.CustomPrincipalValve.invoke(CustomPrincipalVal
ve.java:66)

at
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContex
t.java:102)

at
org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssoci
ationValve.java:153)

at
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContex
t.java:102)

at
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase
.java:540)

at
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContex
t.java:102)

at
org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.

ArrayIndexOutOfBoundsException while closing the index writer

2006-02-15 Thread Shivani Sawhney
Hi,

I have used Lucene in my application and am just indexing and searching on
some documents. The code that indexes the documents was working fine till
yesterday and suddenly stopped working.

I get an error when I am trying to close the index writer. The code is as
follows:

 

 

.

IndexWriter indexwriter = new IndexWriter(indexDirFile, new
StandardAnalyzer(), flag);

indexFile(indexwriter, resumeFile);

indexwriter.close(); //causing errors

} catch (IOException e)

{   

e.printStackTrace();

throw new Error(e);

}

.

 

And the error log is as follows:

 

2006-02-15 18:47:48,748 WARN  [org.apache.struts.action.RequestProcessor]
Unhandled Exception thrown: class java.lang.ArrayIndexOutOfBoundsException

2006-02-15 18:47:48,748 ERROR [org.jboss.web.localhost.Engine]
StandardWrapperValve[action]: Servlet.service() for servlet action threw
exception

java.lang.ArrayIndexOutOfBoundsException: 105 >= 25

at java.util.Vector.elementAt(Vector.java:432)

at
org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:135)

at
org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:103)

at
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:237)

at
org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:169)

at
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:97)

at
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:425)

at
org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:373)

at
org.apache.lucene.index.IndexWriter.close(IndexWriter.java:193)

at rd.admin.NewIndexer.indexTextFile(NewIndexer.java:108)

at rd.admin.AddResume.indexOneRow(AddResume.java:38)

at
rd.admin.LuceneGateway.buildMapAndIndex(LuceneGateway.java:46)

at rd.admin.LuceneGateway.indexResume(LuceneGateway.java:30)

at
rd.admin.UploadResumeAgainstRequisition.npExecute(UploadResumeAgainstRequisi
tion.java:106)

at np.core.BaseNPAction.execute(BaseNPAction.java:116)

at
org.apache.struts.action.RequestProcessor.processActionPerform(RequestProces
sor.java:421)

at
org.apache.struts.action.RequestProcessor.process(RequestProcessor.java:226)

at
org.apache.struts.action.ActionServlet.process(ActionServlet.java:1164)

at
org.apache.struts.action.ActionServlet.doPost(ActionServlet.java:415)

at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)

at javax.servlet.http.HttpServlet.service(HttpServlet.java:810)

at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
FilterChain.java:237)

at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
ain.java:157)

at
org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.ja
va:75)

at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
FilterChain.java:186)

at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
ain.java:157)

at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
va:214)

at
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContex
t.java:104)

at
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)

at
org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContext
Valve.java:198)

at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
va:152)

at
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContex
t.java:104)

at
org.jboss.web.tomcat.security.CustomPrincipalValve.invoke(CustomPrincipalVal
ve.java:66)

at
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContex
t.java:102)

at
org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssoci
ationValve.java:153)

at
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContex
t.java:102)

at
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase
.java:540)

at
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContex
t.java:102)

at
org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:
54)

at
org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContex
t.java:102)

at
org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)

at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137
)

at
org.apache.catalina.core.StandardValveContext.invokeNext(St

RE: ArrayIndexOutOfBoundsException while closing the index writer

2006-02-15 Thread Shivani Sawhney
Hi Otis,

Thanks for such a quick reply. I tried using finally, but it didn't help.

I guess if I explain the integration of Lucene with my app in a little detail,
then you can probably help me better.

I allow users to upload documents, which are then indexed, and to search on
them. Now I am getting this error when I am trying to index a document,
specifically while closing the index writer.
If we look closely at the error log, it's giving an error at 

org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:135)

i.e., when Lucene tries to get something by the field name: 

return (FieldInfo) byName.get(fieldName);

Now what beats me is that indexing on the fields has already been done by the
time we want to close the index writer, so how come I don't get an error
while indexing? What goes wrong when I am trying to close the index writer?


Please see if you can help me with this.

Thanks in advance.

The code used for indexing is as follows:

public void indexFile(IndexWriter indexwriter, File resumeFile)
{
Document document = new Document();   
try
{
File afile[] = indexDirFile.listFiles();
boolean flag = false;

if (afile.length <= 0)
flag = true;

indexwriter = new IndexWriter(indexDirFile, new
StandardAnalyzer(), flag);

try
{
document.add(Field.Text(IndexerColumns.contents, new
FileReader(resumeFile)));
} catch (FileNotFoundException e)
{
e.printStackTrace();
throw new MyRuntimeException(e.getMessage(), e);
}

document.add(Field.Keyword( IndexerColumns.id,
String.valueOf(mapLuceneParams.get(IndexerColumns.id)) ));

for (int i = 0; i < this.columnInfos.length; i++)
{
ColumnInfo columnInfo = columnInfos[i];
String value =
String.valueOf(mapLuceneParams.get(columnInfo.columnName));

if (value != null)
{
value = value.trim();
if (value.length() != 0)
{
if (columnInfo.istokenized)
{   
document.add(Field.Text(columnInfo.columnName,
value));
} else
{
document.add(Field.Keyword(columnInfo.columnName,
value));
}
}
}
}
document.add(Field.Keyword(IndexerColumns.filePath,
String.valueOf(mapLuceneParams.get(IndexerColumns.filePath))));

try
{
indexwriter.addDocument(document); 
} catch (IOException e)
{
e.printStackTrace();
throw new MyRuntimeException(e.getMessage(), e);
} 

  indexwriter.close();
} catch (IOException e)
{   
e.printStackTrace();
throw new Error(e);
}finally
{
if(indexwriter != null)
{
indexwriter.close();
}
}



Regards,
Shivani




-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: 16 February, 2006 10:16 AM
To: java-user@lucene.apache.org
Subject: Re: ArrayIndexOutOfBoundsException while closing the index writer

Who knows what else the app is doing.
However, I can quickly suggest that you add a finally block and close your
writer in there if writer != null.

Otis

- Original Message 
From: Shivani Sawhney <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, February 15, 2006 11:31:12 PM
Subject: ArrayIndexOutOfBoundsException while closing the index writer

Hi,

I have used Lucene in my application and am just indexing and searching on
some documents. The code that indexes the documents was working fine till
yesterday and suddenly stopped working.

I get an error when I am trying to close the index writer. The code is as
follows:

 

 

.

IndexWriter indexwriter = new IndexWriter(indexDirFile, new
StandardAnalyzer(), flag);

indexFile(indexwriter, resumeFile);

indexwriter.close(); //causing errors

} catch (IOException e)

{   

e.printStackTrace();

throw new Error(e);

}

.

 

And the error log is as follows:

 

2006-02-15 18:47:48,748 WARN  [org.apache.struts.action.RequestProcessor]
Unhandled Exception thrown: class java.lang.ArrayIndexOutOfBoundsException

2006-02-15 18:47:48,748 ERROR [org.jboss.web.localhost.Engine]
StandardWrapperValve[action]: Servlet.service() for servlet action threw
exception

java.lang.ArrayIndexOutOfBoundsException: 105 >= 25

at java.util.Vector.elementAt(Vector.java:432)

at
org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:135)

at
org.apache.lucene.inde