addIndexes causing IndexOutOfBoundsException
I have an index of roughly 2 million docs making up almost 200GB, and I can't seem to merge any additional indexes into it. Here is the error I continuously get, always with "Index: 85, Size: 13". I couldn't find much in the previous mailing list posts nor on ol' faithful Google. Any help or ideas?

java.lang.IndexOutOfBoundsException: Index: 85, Size: 13
        at java.util.ArrayList.RangeCheck(ArrayList.java:546)
        at java.util.ArrayList.get(ArrayList.java:321)
        at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
        at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:66)
        at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:237)
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:185)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:92)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:487)
        at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
        at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:399)
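Here is roughly how the merge is being invoked (a minimal sketch only; the paths and analyzer below are placeholders, not the real ones), using the addIndexes(Directory[]) path that shows up in the trace, which optimizes before merging:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeSketch {
    public static void main(String[] args) throws Exception {
        // open the existing 200GB index for appending (create == false)
        IndexWriter writer = new IndexWriter("/data/big_index",
                                             new StandardAnalyzer(), false);
        // the smaller index that fails to merge in (placeholder path)
        Directory[] toMerge = { FSDirectory.getDirectory("/data/new_index", false) };
        // addIndexes() optimizes the target first, then merges the new segments;
        // the exception above is thrown somewhere inside this call
        writer.addIndexes(toMerge);
        writer.close();
    }
}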
Re: Lucene - PDFBox
"case2") >=0) > > field = new Field("caseid", "2", false, true, false); > > else if (file.getPath().indexOf("case3") >=0) > > field = new Field("caseid", "3", false, true, false); > > else > > field = new Field("caseid", "0", false, true, false); > > > > doc.add(field); > > > > writer.addDocument(doc); > > } > > // at least on windows, some temporary files raise this exception > > with an "access denied" message > > // checking if the file can be read doesn't help > > catch (FileNotFoundException fnfe) { > > ; > > } > > } > > } > > } > > } > > > > > > Here is the SearchFiles class with some minor modifications... > > > > import java.io.IOException; > > import java.io.BufferedReader; > > import java.io.InputStreamReader; > > import java.util.StringTokenizer; > > > > import org.apache.lucene.analysis.Analyzer; > > import org.apache.lucene.analysis.standard.StandardAnalyzer; > > import org.apache.lucene.document.Document; > > import org.apache.lucene.search.Searcher; > > import org.apache.lucene.search.IndexSearcher; > > import org.apache.lucene.search.Query; > > import org.apache.lucene.search.BooleanQuery; > > import org.apache.lucene.search.PhraseQuery; > > import org.apache.lucene.search.Hits; > > import org.apache.lucene.index.Term; > > import org.apache.lucene.queryParser.QueryParser; > > import org.apache.lucene.queryParser.ParseException; > > > > class SearchFiles { > > > > private static Query getCaseQuery(String line, Analyzer analyzer) > > throws ParseException { > > BooleanQuery bq = new BooleanQuery(); > > StringTokenizer st = new StringTokenizer(line); > > Query query = QueryParser.parse(line, "contents", analyzer); > > String caseId = null; > > while (st.hasMoreTokens()) { > > caseId = st.nextToken(); > > System.out.println("build case query for " + caseId); > > > > query = QueryParser.parse(caseId, "caseid", analyzer); > > bq.add(query, false, false); > > } > > > > return bq; > > } > > public static void main(String[] args) { > > try { > > Searcher searcher = new IndexSearcher("index"); > > Analyzer analyzer = new StandardAnalyzer(); > > > > BufferedReader in = new BufferedReader(new > > InputStreamReader(System.in)); > > while (true) { > > System.out.print("Query: "); > > String line = in.readLine(); > > System.out.print("Cases: "); > > String caseLine = in.readLine(); > > Query caseQuery = getCaseQuery(caseLine, analyzer); > > > > if (line.length() == -1) > > break; > > > > > > Query query = QueryParser.parse(line, "contents", analyzer); > > // PhraseQuery query = new PhraseQuery(); > > // query.add(new Term("contents",line)); > > System.out.println("Searching for: " + query.toString("contents")); > > /* > > BooleanQuery wholeQuery = new BooleanQuery(); > > wholeQuery.add(caseQuery, true, false); > > wholeQuery.add(query, true, false); > > Hits hits = searcher.search(wholeQuery); > > */ > > Hits hits = searcher.search(query); > > System.out.println(hits.length() + " total matching documents"); > > > > final int HITS_PER_PAGE = 10; > > for (int start = 0; start < hits.length(); start += HITS_PER_PAGE) { > > int end = Math.min(hits.length(), start + HITS_PER_PAGE); > > for (int i = start; i < end; i++) { > > Document doc = hits.doc(i); > > String path = doc.get("path"); > > if (path != null) { > > System.out.println(i + ". " + path); > > } else { > > String url = doc.get("url"); > > if (url != null) { > > System.out.println(i + ". " + url); > > System.out.println(" - " + doc.get("title")); > > } else { > > System.out.println(i + ". 
" + "No path nor URL for this > > document"); > > } > > } > > } > > > > if (hits.length() > end) { > > System.out.print("more (y/n) ? "); > > line = in.readLine(); > > if (line.length() == 0 || line.charAt(0) == 'n') > > break; > > } > > } > > } > > searcher.close(); > > > > } catch (Exception e) { > > System.out.println(" caught a " + e.getClass() + > > "\n with message: " + e.getMessage()); > > } > > } > > } > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > -- ___ Chris Fraschetti e [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
URL Stemmer
Writing simple code to trim down a URL is trivial, but actually trimming it down to its most meaningful state is very hard. In some cases the URL parameters actually define the page; in others they are useless babble. I'd like to use a hash of a page's URL as well as a hash of the content data to help me eliminate duplicates... are there any good methods that are commonly used for URL stemming?
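To make "trim down" concrete, here is a minimal sketch of the kind of normalization I mean before hashing; the rules (lowercase the host, drop the default port and the fragment, strip a couple of made-up parameter names) are only assumptions about what counts as useless babble for a given site:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.StringTokenizer;

public class UrlStemmer {

    // parameters assumed (hypothetically) to never define the page
    private static final String[] IGNORED_PARAMS = { "sessionid", "ref" };

    public static String stem(String spec) throws MalformedURLException {
        URL url = new URL(spec);
        String host = url.getHost().toLowerCase();
        int port = url.getPort();
        // drop the port if it is the default one for the protocol
        String portPart = (port == -1 || port == url.getDefaultPort()) ? "" : ":" + port;
        String path = url.getPath();
        if (path.length() == 0) path = "/";

        // keep only the query parameters that are not in the ignore list
        StringBuffer query = new StringBuffer();
        if (url.getQuery() != null) {
            StringTokenizer st = new StringTokenizer(url.getQuery(), "&");
            while (st.hasMoreTokens()) {
                String pair = st.nextToken();
                int eq = pair.indexOf('=');
                String name = (eq >= 0) ? pair.substring(0, eq) : pair;
                if (!isIgnored(name)) {
                    if (query.length() > 0) query.append('&');
                    query.append(pair);
                }
            }
        }
        String queryPart = (query.length() > 0) ? "?" + query : "";
        // the fragment (everything after '#') is dropped entirely
        return url.getProtocol().toLowerCase() + "://" + host + portPart + path + queryPart;
    }

    private static boolean isIgnored(String name) {
        for (int i = 0; i < IGNORED_PARAMS.length; i++) {
            if (IGNORED_PARAMS[i].equalsIgnoreCase(name)) return true;
        }
        return false;
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(stem("HTTP://Example.com:80/a/b.html?ref=rss&id=42#top"));
        // prints: http://example.com/a/b.html?id=42
    }
}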
Index merge and java heap space
I've read of people combining smaller indexes to help distribute indexing and such, but I've been unable to find any descriptions of large index merges. I've seen a post or two regarding a merge taking a sizable amount of heap space (I've also observed this), but I wanted to poll you folks to see how merging two indexes actually performs on a large set of data. Does there come a point where the indexes are simply too large for the JVM to handle, or does index size (of either of them) not have that great an effect on merge memory usage? Any success stories out there with using merge on a large scale? Thanks in advance.
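For what it's worth, the kind of merge I have in mind looks roughly like this (a sketch; the paths are placeholders, and the mergeFactor line just shows the knob left at its default):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LargeMerge {
    public static void main(String[] args) throws Exception {
        // destination index, created fresh here
        IndexWriter writer = new IndexWriter("/data/merged_index",
                                             new StandardAnalyzer(), true);
        // mergeFactor controls how many segments get merged at once (default 10)
        writer.mergeFactor = 10;

        // the two large source indexes (placeholder paths)
        Directory[] sources = {
            FSDirectory.getDirectory("/data/index_a", false),
            FSDirectory.getDirectory("/data/index_b", false)
        };
        writer.addIndexes(sources);
        writer.close();
    }
}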
indexed document id
I've got an index which I rebuild each time, and I don't do any deletes until the end, so doc ids shouldn't change... at index time, is there a better way to discover the id of the document I just added than docCount()?
Re: indexed document id
If I'm using multiple threads to add documents to the index, can it be assumed that they will be added to the index in the order they are presented to the IndexWriter, and thus that keeping my local doc id count would hold true?

-Chris Fraschetti

On 7/29/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:
>
> On Jul 29, 2005, at 4:40 PM, Chris Fraschetti wrote:
> > I've got an index which I rebuild each time and don't do any deletes
> > until the end, so doc ids shouldn't change... at index time, is there
> > a better way to discover the id of the document I just added than
> > docCount() ?
>
> When building a new index by strictly adding documents, you could
> keep a zero-based counter which would reflect document id at that
> time.  They are simply in ascending order.
>
>     Erik
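To make the question concrete, the bookkeeping I have in mind is sketched below (a hypothetical wrapper, not existing Lucene API); it only holds if the counter is bumped in the same synchronized block as the addDocument() call, which is exactly the ordering guarantee I'm asking about:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Hypothetical wrapper: serializes adds so the returned number really is
// the doc id assigned to the document (zero-based, strictly ascending),
// even when several indexing threads share the same writer.
public class CountingWriter {
    private final IndexWriter writer;
    private int nextId = 0;

    public CountingWriter(IndexWriter writer) {
        this.writer = writer;
    }

    public synchronized int addDocument(Document doc) throws IOException {
        writer.addDocument(doc);   // no deletes anywhere, so ids never shift
        return nextId++;           // id of the document just added
    }
}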
search caching
I've got an application that performs millions of searches against a Lucene index. Can someone give me a bit of insight as to the memory consumption of these searches? Is there a cap on how many are kept around? Is there any way I can disable caching for this type of search?
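For context, the search loop looks roughly like the sketch below (index path and query strings are placeholders); a single IndexSearcher is shared across all of the searches rather than opened per query:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchLoop {
    public static void main(String[] args) throws Exception {
        // one searcher (and therefore one IndexReader) shared by every search
        IndexSearcher searcher = new IndexSearcher("/data/index");
        StandardAnalyzer analyzer = new StandardAnalyzer();

        String[] terms = { "foo", "bar", "baz" };   // placeholder queries
        for (int i = 0; i < 1000000; i++) {
            Query q = QueryParser.parse(terms[i % terms.length], "contents", analyzer);
            Hits hits = searcher.search(q);
            // only the first page of hits is ever touched here
            int shown = Math.min(10, hits.length());
            for (int j = 0; j < shown; j++) {
                hits.doc(j);
            }
        }
        searcher.close();
    }
}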
max number of documents
Maybe this is a stupid question, maybe not... hits.id() returns an int, which would lead me to assume the obvious limitation on the size of the index (size meaning number of docs). Assuming I reach this limit, can I expect Lucene to throw some sort of exception? What is the best practice here: watch my count and switch to a new index when the time comes, then search across both indexes?
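For the "search across both indexes" part, what I'm picturing is something like this sketch (paths are placeholders), where MultiSearcher presents the two underlying indexes as a single Searchable:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class SplitIndexSearch {
    public static void main(String[] args) throws Exception {
        // the "old" index that hit the cutover point, and the "new" one
        Searchable[] searchables = {
            new IndexSearcher("/data/index_1"),
            new IndexSearcher("/data/index_2")
        };
        MultiSearcher searcher = new MultiSearcher(searchables);

        Query q = QueryParser.parse("foo", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(q);
        System.out.println(hits.length() + " total matching documents");
        searcher.close();
    }
}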
post-normalization score filter
I've seen a few posts which reference using a HitCollector to help filter the returned documents' scores, but are there any implementations out there that perform this filtering of results by score after the scores have already been normalized, short of walking the hits array?

If I were to walk the hits until I reach my threshold, just checking hits.score()... what am I looking at performance-wise? Are the scores all calculated at this point and I'm simply accessing them, or are they normalized on the fly?

Obviously I can trim my results a bit; for searches that score a million documents, only allow the first few thousand as results. But even in the case where I might get two thousand results, the lower-end results have such minimal scores in comparison to the earlier ones that it seems unintuitive for them to be accessible.

Thanks as always!
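What I mean by walking the hits until I reach my threshold is roughly the loop below; just a sketch, with the 0.25f cutoff picked arbitrarily and taken relative to whatever normalization Hits applies to the top score:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ScoreCutoff {
    // arbitrary cutoff relative to the top hit's (normalized) score
    private static final float THRESHOLD = 0.25f;

    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/data/index");
        Query q = QueryParser.parse("foo", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(q);

        // hits come back sorted by score, so stop at the first one below the cutoff
        for (int i = 0; i < hits.length(); i++) {
            if (hits.score(i) < THRESHOLD)
                break;
            Document doc = hits.doc(i);
            System.out.println(hits.score(i) + "  " + doc.get("path"));
        }
        searcher.close();
    }
}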