addIndexes causing IndexOutOfBoundsException

2006-02-07 Thread Chris Fraschetti
I have an index of roughly 2 million docs making up almost 200GB, and I
can't seem to merge any additional indexes into it. Below is the error
I consistently get, always with "Index: 85, Size: 13".

I couldn't find much in previous mailing list posts or on ol' faithful
Google.

Any help or ideas?

java.lang.IndexOutOfBoundsException: Index: 85, Size: 13
at java.util.ArrayList.RangeCheck(ArrayList.java:546)
at java.util.ArrayList.get(ArrayList.java:321)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:66)
at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:237)
at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:185)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:92)
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:399)
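
For context, here is a minimal sketch of the kind of call that hits this,
using the Lucene 1.4-era API (class name and paths are placeholders, not my
exact code):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch only: merge an existing on-disk index into the large one.
// The stack trace above comes out of the optimize() that addIndexes()
// performs internally.
public class MergeSketch {
  public static void main(String[] args) throws IOException {
    // false = open the existing (200GB) index rather than create a new one
    IndexWriter writer = new IndexWriter("/path/to/big-index",
                                         new StandardAnalyzer(), false);
    Directory small = FSDirectory.getDirectory("/path/to/small-index", false);
    writer.addIndexes(new Directory[] { small });
    writer.close();
  }
}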



Re: Lucene - PDFBox

2005-05-25 Thread Chris Fraschetti
"case2") >=0)
> >   field = new Field("caseid", "2", false, true, false);
> >   else if (file.getPath().indexOf("case3") >=0)
> >   field = new Field("caseid", "3", false, true, false);
> >   else
> >   field = new Field("caseid", "0", false, true, false);
> >
> >   doc.add(field);
> >
> >   writer.addDocument(doc);
> > }
> > // at least on windows, some temporary files raise this exception
> > // with an "access denied" message
> > // checking if the file can be read doesn't help
> > catch (FileNotFoundException fnfe) {
> >   ;
> > }
> >   }
> > }
> >   }
> > }
> >
> >
> > Here is the SearchFiles class with some minor modifications...
> >
> > import java.io.IOException;
> > import java.io.BufferedReader;
> > import java.io.InputStreamReader;
> > import java.util.StringTokenizer;
> >
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > import org.apache.lucene.document.Document;
> > import org.apache.lucene.search.Searcher;
> > import org.apache.lucene.search.IndexSearcher;
> > import org.apache.lucene.search.Query;
> > import org.apache.lucene.search.BooleanQuery;
> > import org.apache.lucene.search.PhraseQuery;
> > import org.apache.lucene.search.Hits;
> > import org.apache.lucene.index.Term;
> > import org.apache.lucene.queryParser.QueryParser;
> > import org.apache.lucene.queryParser.ParseException;
> >
> > class SearchFiles {
> >
> >   private static Query getCaseQuery(String line, Analyzer analyzer)
> >   throws ParseException {
> >   BooleanQuery bq = new BooleanQuery();
> >   StringTokenizer st = new StringTokenizer(line);
> >   Query query = QueryParser.parse(line, "contents", analyzer);
> >   String caseId = null;
> >   while (st.hasMoreTokens()) {
> >   caseId = st.nextToken();
> >   System.out.println("build case query for " + caseId);
> >
> >   query = QueryParser.parse(caseId, "caseid", analyzer);
> >   bq.add(query, false, false);
> >   }
> >
> >   return bq;
> >   }
> >   public static void main(String[] args) {
> > try {
> >   Searcher searcher = new IndexSearcher("index");
> >   Analyzer analyzer = new StandardAnalyzer();
> >
> >   BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
> >   while (true) {
> > System.out.print("Query: ");
> > String line = in.readLine();
> > System.out.print("Cases: ");
> > String caseLine = in.readLine();
> > Query caseQuery = getCaseQuery(caseLine, analyzer);
> >
> > // readLine() returns null at end of input, and length() is never -1
> > if (line == null || line.length() == 0)
> >   break;
> >
> >
> > Query query = QueryParser.parse(line, "contents", analyzer);
> > // PhraseQuery query = new PhraseQuery();
> > // query.add(new Term("contents",line));
> > System.out.println("Searching for: " + query.toString("contents"));
> > /*
> > BooleanQuery wholeQuery = new BooleanQuery();
> > wholeQuery.add(caseQuery, true, false);
> > wholeQuery.add(query, true, false);
> > Hits hits = searcher.search(wholeQuery);
> > */
> > Hits hits = searcher.search(query);
> >     System.out.println(hits.length() + " total matching documents");
> >
> > final int HITS_PER_PAGE = 10;
> > for (int start = 0; start < hits.length(); start += HITS_PER_PAGE) {
> >   int end = Math.min(hits.length(), start + HITS_PER_PAGE);
> >   for (int i = start; i < end; i++) {
> > Document doc = hits.doc(i);
> > String path = doc.get("path");
> > if (path != null) {
> >   System.out.println(i + ". " + path);
> > } else {
> >   String url = doc.get("url");
> >   if (url != null) {
> > System.out.println(i + ". " + url);
> > System.out.println("   - " + doc.get("title"));
> >   } else {
> > System.out.println(i + ". " + "No path nor URL for this 
> > document");
> >   }
> > }
> >   }
> >
> >   if (hits.length() > end) {
> > System.out.print("more (y/n) ? ");
> > line = in.readLine();
> > if (line.length() == 0 || line.charAt(0) == 'n')
> >   break;
> >   }
> > }
> >   }
> >   searcher.close();
> >
> > } catch (Exception e) {
> >   System.out.println(" caught a " + e.getClass() +
> >  "\n with message: " + e.getMessage());
> > }
> >   }
> > }
> >
> 
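
For what it's worth, the piece that actually ties the case filter to the
content query is the commented-out wholeQuery block in main(); uncommented
(same 1.4-era BooleanQuery.add(query, required, prohibited) signature) it
would read:

// Sketch only: require both the case filter and the content query to match.
BooleanQuery wholeQuery = new BooleanQuery();
wholeQuery.add(caseQuery, true, false);   // must match one of the case ids
wholeQuery.add(query, true, false);       // must match the content query
Hits hits = searcher.search(wholeQuery);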





URL Stemmer

2005-07-27 Thread Chris Fraschetti
Writing simple code to trim down a URL is trivial, but actually
trimming it down to its most meaningful state is very hard. In some
cases the URL parameters actually define the page; in others they are
useless babble. I'd like to use a hash of a page's URL as well as a
hash of the content data to help me eliminate duplicates... are there
any good methods that are commonly used for URL stemming?
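
To make the question concrete, something like the sketch below is what I
mean by "trim down": lowercase the host, drop default ports and fragments,
then hash. The class name, the choice to keep the query string, and MD5 are
all arbitrary placeholders; deciding which parameters matter is exactly the
hard part I'm asking about.

import java.net.URI;
import java.security.MessageDigest;

// Sketch only: conservative URL normalization plus a content-independent hash.
public class UrlKey {

  public static String normalize(String raw) throws Exception {
    URI uri = new URI(raw.trim());
    String scheme = uri.getScheme() == null ? "http" : uri.getScheme().toLowerCase();
    String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase();
    int port = uri.getPort();
    if ((scheme.equals("http") && port == 80)
        || (scheme.equals("https") && port == 443)) {
      port = -1;                                     // drop default ports
    }
    String path = (uri.getPath() == null || uri.getPath().length() == 0)
        ? "/" : uri.getPath();
    String query = uri.getQuery();                   // kept as-is; fragment is dropped
    return scheme + "://" + host + (port == -1 ? "" : ":" + port) + path
        + (query == null ? "" : "?" + query);
  }

  public static String md5Hex(String s) throws Exception {
    byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
    StringBuffer hex = new StringBuffer();
    for (int i = 0; i < d.length; i++) {
      String h = Integer.toHexString(d[i] & 0xff);
      if (h.length() == 1) hex.append('0');
      hex.append(h);
    }
    return hex.toString();
  }
}

A pair like (md5Hex(normalize(url)), md5Hex(content)) would then be the
duplicate key, but that still dodges the real question of which parameters
matter.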




Index merge and java heap space

2005-07-28 Thread Chris Fraschetti
I've read of people combining smaller indexes to help distribute
indexing and such, but I've been unable to find any descriptions of
large index merges. I've seen a post or two regarding a merge taking a
fair amount of heap space (I've also observed this), but I wanted to
poll you folks to see how merging two indexes actually performs on a
large set of data. Does there come a point where the indexes are simply
too large for the JVM to handle, or does index size (of either one) not
have that great an effect on merge memory usage? Any success stories
out there with using merge on a large scale?

Thanks in advance
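
For reference, this is roughly the pattern I mean, with the 1.4-era tuning
fields shown (class name and paths are placeholders; whether these settings
actually bound the heap used by addIndexes on very large indexes is part of
the question):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch only: merge one on-disk index into another with the public
// IndexWriter tuning fields set explicitly.
public class BigMergeSketch {
  public static void main(String[] args) throws IOException {
    IndexWriter writer = new IndexWriter("/path/to/target-index",
                                         new StandardAnalyzer(), false);
    writer.mergeFactor = 10;     // how many segments get merged at once
    writer.minMergeDocs = 100;   // docs buffered in RAM before a segment is flushed
    Directory source = FSDirectory.getDirectory("/path/to/source-index", false);
    writer.addIndexes(new Directory[] { source });
    writer.close();
  }
}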




indexed document id

2005-07-29 Thread Chris Fraschetti
I've got an index which I rebuild each time and don't do any deletes
until the end, so doc ids shouldn't change... at index time, is there
a better way to discover the id of the document I just added than
docCount()?



Re: indexed document id

2005-08-01 Thread Chris Fraschetti
If I'm using multiple threads to add documents to the index, can it be
assumed that they will be added to the index in the order they are
presented to the IndexWriter, and thus that keeping my local doc id
count would still hold true?
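
Concretely, this is the kind of thing I have in mind: serialize the
addDocument calls myself and keep a counter (sketch only; the class name is
made up, and whether external synchronization is even needed is part of the
question):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Sketch only: keep addDocument() and the counter in one synchronized step
// so the counter matches the doc id, assuming nothing is deleted until the end.
class CountingIndexer {
  private final IndexWriter writer;
  private int nextDocId = 0;

  CountingIndexer(IndexWriter writer) {
    this.writer = writer;
  }

  // Returns the id the added document should have received.
  synchronized int addDocument(Document doc) throws IOException {
    writer.addDocument(doc);
    return nextDocId++;
  }
}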

-Chris Fraschetti

On 7/29/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> 
> On Jul 29, 2005, at 4:40 PM, Chris Fraschetti wrote:
> > I've got an index which I rebuild each time and don't do any deletes
> > until the end, so doc ids shouldn't change... at index time, is there
> > a better way to discover the id of the document i just added than
> > docCount() ?
> 
> When building a new index by strictly adding documents, you could
> keep a zero-based counter which would reflect document id at that
> time.  They are simply in ascending order.
> 
> Erik
> 
> 





search caching

2005-08-03 Thread Chris Fraschetti
I've got an application that performs millions of searches against a
Lucene index. Can someone give me a bit of insight into the memory
consumption of these searches? Is there a cap on how many are kept
around? Is there any way I can disable caching for this type of
search?
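
For clarity, picture a loop roughly like this (sketch only; the class name,
field name, and path are placeholders): one shared IndexSearcher, a fresh
Query and Hits per search, and nothing held onto between searches. The
question is what, if anything, Lucene keeps cached across these calls.

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

// Sketch only: one long-lived searcher shared by every search in the loop.
public class SearchLoopSketch {
  public static void main(String[] args) throws IOException, ParseException {
    Searcher searcher = new IndexSearcher("/path/to/index");   // opened once
    Analyzer analyzer = new StandardAnalyzer();
    for (int i = 0; i < args.length; i++) {
      Query query = QueryParser.parse(args[i], "contents", analyzer);
      Hits hits = searcher.search(query);
      System.out.println(args[i] + ": " + hits.length() + " hits");
      // hits goes out of scope here; this code keeps nothing between searches
    }
    searcher.close();
  }
}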




max number of documents

2005-08-10 Thread Chris Fraschetti
Maybe this is a stupid question, maybe not...

Hits.id() returns an int, which would lead me to assume the obvious
limitation on the size of the index (size meaning number of docs).
Assuming I reach this limit, can I expect Lucene to throw some sort of
exception?

What is the best practice for this? Watch my count and switch to a new
index when the time comes, then search across both indexes?
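
For the second option, I assume the search-across-both part would look
roughly like this (sketch only; class name and paths are placeholders):

import java.io.IOException;

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;

// Sketch only: once the first index fills up, search both parts together.
public class SplitIndexSearch {
  public static Hits search(Query query) throws IOException {
    Searchable[] parts = new Searchable[] {
        new IndexSearcher("/path/to/index-part1"),
        new IndexSearcher("/path/to/index-part2")
    };
    Searcher searcher = new MultiSearcher(parts);   // searches all parts, merges hits
    return searcher.search(query);
  }
}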





post-normalization score filter

2005-08-13 Thread Chris Fraschetti
I've seen a few posts which reference using a HitCollector to help
filter the returned documents' scores, but are there any
implementations out there that perform this filtering of results by
score after the scores have already been normalized, short of walking
the hits array? If I were to walk the hits until I reach my threshold,
just checking hits.score(), what am I looking at performance-wise? Are
the scores all calculated at that point and I'm simply accessing them,
or are they normalized on the fly?

Obviously I can trim my results a bit: for queries that match a million
documents, only allow the first few thousand as results. But even in
the case where I might get two thousand results, the lower-end results
have such minimal scores in comparison to the earlier ones that it
seems unintuitive for them to be accessible.

thanks as always!
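
To be concrete, I'm picturing nothing fancier than this (sketch only; the
class name, field name, and threshold are made up):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

// Sketch only: walk the hits in rank order and stop at a score cutoff.
// Hits returns results in descending score order, so the loop can break early.
public class ScoreCutoff {
  public static void printAboveThreshold(Searcher searcher, Query query,
                                         float threshold) throws IOException {
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
      if (hits.score(i) < threshold) {
        break;                      // everything after this scores lower still
      }
      Document doc = hits.doc(i);
      System.out.println(hits.score(i) + "  " + doc.get("path"));
    }
  }
}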
