RE: best practice: 1.4 billion documents
> Of course I will distribute my index over many machines: storing
> everything on one computer is just crazy, 1.4B docs is going to be an
> index of almost 2T (in my case).

billion = giga in English; billion = tera in non-English.
2T docs = 2,000,000,000,000 docs... ;)

AFAIK 2^32 - 1 docs is still the maximum for a single Lucene instance.
Searching sets of documents
Hi,

I want to search for sets of documents. For instance, I index some folders with documents in them, and now I do not want to find certain documents but certain folders.

Sample:

folder A
  doc 1, contains X, Y
  doc 2, contains Y, Z

folder B
  doc 3, contains X, Y
  doc 4, contains A, Z

Now I want to find all folders which match "A AND Y" -> folder B.

How can this be done?

Thank you
RE: Searching sets of documents
The docs are already indexed.

> From: ??? [mailto:[EMAIL PROTECTED]]
> Sent: Monday, 13 October 2008 02:28
> Subject: Re: Searching sets of documents
>
> All folders which match "A AND Y" - do you search for file names?
> If yes, A or Y in "A AND Y" is a String too, so you can do it by
> constructing a Lucene Document for each folder, with the names of the
> files under the folder as the search data.
Re: Searching sets of documents
The folder name and the document name are stored for each document.

> Date: Tue, 14 Oct 2008 14:11:09 +0530
> From: "Ganesh" <[EMAIL PROTECTED]>
> Subject: Re: Searching sets of documents
>
> You should have stored the folder name or the full path of the file as
> part of the Lucene document; otherwise it is difficult to retrieve.
>
> Regards
> Ganesh
Re: Searching sets of documents
The problem is the logical combination of documents in folders, not of terms in documents. See the original post.

> Date: Tue, 14 Oct 2008 16:29:15 +0530
> From: "Ganesh" <[EMAIL PROTECTED]>
> Subject: Re: Searching sets of documents
>
> What is your problem?
>
> If the folder names are already stored then they can be retrieved from
> the search. Use DuplicateFilter on the field "foldername" to get the
> unique list of folders.
> Hope this helps.
>
> Regards
> Ganesh
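A minimal sketch of the DuplicateFilter suggestion, assuming each Lucene document carries an indexed "foldername" field and the contrib lucene-queries jar (which provides DuplicateFilter) is on the classpath; the index path and query are placeholders. Note that this only collapses matching documents to one hit per folder - it does not combine terms across documents in the same folder, which is the poster's actual problem:

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// Collapse hits to one document per distinct "foldername" value.
IndexSearcher searcher = new IndexSearcher("/path/to/index");
Query query = new TermQuery(new Term("content", "y"));
Filter oneDocPerFolder = new DuplicateFilter("foldername");
TopDocs top = searcher.search(query, oneDocPerFolder, 100);
for (int i = 0; i < top.scoreDocs.length; i++) {
  Document d = searcher.doc(top.scoreDocs[i].doc);
  System.out.println(d.get("foldername"));
}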
Re-combining already indexed documents
Hi,

I have already indexed documents. I want to recombine them into new documents. Is this possible without the original documents - only with the index?

Example: doc1, doc2, doc3 are indexed. I want a new doc4 which is indexed as if I had concatenated doc1, doc2, doc3 into doc4 and then indexed doc4.

Thank you
RE: Re-combining already indexed documents
> The fastest way to reconstruct the token stream would be to use the
> TermFreqVector, but if you didn't store it at index time you would have
> to traverse the inverted index using TermEnum and TermPositions in
> order to pick up the term values and positions. This can be a rather
> time-consuming process if you have a large index.

OK, then I'd better reindex from source. Thank you.
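To make the quoted slow path concrete, here is a rough, untested sketch of walking the inverted index to recover (term, position) pairs for one document; the field name "content" and the docId are placeholders:

import java.io.IOException;
import org.apache.lucene.index.*;

// Collect (term, position) pairs for a single document by scanning every
// term of the "content" field and probing its postings for that doc.
public static void dumpTerms(IndexReader reader, int targetDoc) throws IOException {
  TermEnum terms = reader.terms(new Term("content", ""));
  TermPositions postings = reader.termPositions();
  try {
    do {
      Term t = terms.term();
      if (t == null || !"content".equals(t.field())) break; // left the field
      postings.seek(t);
      if (postings.skipTo(targetDoc) && postings.doc() == targetDoc) {
        for (int i = 0; i < postings.freq(); i++) {
          int pos = postings.nextPosition();
          System.out.println(t.text() + " @ position " + pos);
        }
      }
    } while (terms.next());
  } finally {
    postings.close();
    terms.close();
  }
}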
Multiple indexes vs single index
Hi,

We have an application which manages the data of multiple customers. A customer can only search its own data, never the data of other customers. So what is more efficient with respect to performance and resources: one big single index filtered by an index field (customer ID), or multiple smaller indexes, one per customer?

I think there will be 10 million docs max. for all customers together.

Thank you
RE: Multiple indexes vs single index
Hi,

> You get one answer if each document is 1K, another if it's 1G. If you
> have 2 users or 10,000 users. If you require 100 queries/sec response
> time or 1 query can take 10 seconds. If you require an update to the
> index every second or month...

Each doc has up to 10 A4 pages of text. There will be about 100 customers/clients/companies (not users; every customer will have about 10 users). I would expect 1 query/s, not more. No updates to the index.

> You have two problems with maintaining one index/user.
> 1> Trying to maintain N indexes is much harder than one, especially
>    when you factor in backups, etc.

This is the biggest problem I see.

> 2> There is a cost to opening an index. If you look at the Wiki you'll
>    see that the recommendation is that you open an index, and run a few
>    warmup queries to fill caches etc. before, for instance, measuring
>    performance. So if you maintain an index/user, how do you expect to
>    handle this issue?

I would open the index on demand and close it after a period of inactivity.
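If the single-index route is chosen instead, a common pattern is one warm, shared searcher plus a cached per-customer filter, which sidesteps the open/warm-up cost entirely. A sketch with invented names (searcher, userQuery and the field name are assumptions):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

// One warm searcher for everybody; each query is restricted to the
// caller's own documents by a cached per-customer filter.
Filter customerFilter = new CachingWrapperFilter(
    new QueryFilter(new TermQuery(new Term("customerId", "4711"))));
Hits hits = searcher.search(userQuery, customerFilter);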
search(Query query, HitCollector results)
Hi,

in what order does search(Query query, HitCollector results) return the results? By relevance?

Thank you.
RE: search(Query query, HitCollector results)
> The HitCollector used will determine how things are ordered. In 2.4,
> the TopDocCollector will order by relevancy and the
> TopFieldDocCollector can order by relevancy, index order, or by field.
> Lucene delivers the hit ids to the HitCollector and it can order as it
> pleases.

So HitCollector#collect(int doc, float score) is not called in a particular (default) order, and it must order the docs by score itself if one needs the hits sorted by relevance?
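Right - collect() is invoked in increasing document-id order, not by score. A minimal sketch of keeping the top N by relevance with a plain java.util.PriorityQueue (searcher and query are assumed to exist; Lucene's own TopDocCollector does this job more efficiently):

import java.util.*;
import org.apache.lucene.search.*;

final int numHits = 10;
// least-relevant hit sits on top of the queue, so it can be evicted cheaply
final PriorityQueue<ScoreDoc> queue = new PriorityQueue<ScoreDoc>(numHits,
    new Comparator<ScoreDoc>() {
      public int compare(ScoreDoc a, ScoreDoc b) {
        return Float.compare(a.score, b.score);
      }
    });

searcher.search(query, new HitCollector() {
  public void collect(int doc, float score) {
    if (queue.size() < numHits) {
      queue.add(new ScoreDoc(doc, score));
    } else if (score > queue.peek().score) {
      queue.poll();                        // drop the current worst hit
      queue.add(new ScoreDoc(doc, score));
    }
  }
});
// queue now holds the top numHits hits; poll() returns them worst-first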
FieldSelector
Hi,

which fields does IndexSearcher#doc(int i) load? Only those with Field.Store.YES? I'm asking because I do not need to load the tokens - should I use a FieldSelector, or are these fields not loaded anyway?

Thank you
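Only stored fields come back from doc(int) at all; indexed-only tokens are never loaded with the document. A FieldSelector merely narrows which stored fields get read, which pays off when some stored fields are large. A sketch with invented field names, using MapFieldSelector (available in recent 2.x releases):

import org.apache.lucene.document.*;

// Read just the "id" field; other stored fields are skipped entirely.
FieldSelector onlyId = new MapFieldSelector(new String[] { "id" });
Document doc = searcher.getIndexReader().document(docNumber, onlyId);
String id = doc.get("id");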
TopDocCollector
Looking into the TopDocCollector code, I have some questions:

* How can a hit have a score of <= 0?
* What happens if the first hit has the highest score of all hits? It seems that topDocs would then contain only this doc!?

public void collect(int doc, float score) {
  if (score > 0.0f) {
    totalHits++;
    if (hq.size() < numHits || score >= minScore) {
      hq.insert(new ScoreDoc(doc, score));
      minScore = ((ScoreDoc) hq.top()).score; // maintain minScore
    }
  }
}
RE: TopDocCollector
> That works fine, because hq.size() is still less than numHits. So no
> matter what, the first numHits hits will be added to the queue.
>
> > public void collect(int doc, float score) {
> >   if (score > 0.0f) {
> >     ...
> >     if (hq.size() < numHits || score >= minScore) {

Oh damn... it's an || not an &&... Sorry for the question ;)
RE: TopDocCollector
> > * How can a hit have a score of <= 0?
>
> A function query, or a negative boost, would do it.

Ah, OK.

> Solr has always allowed all scores through w/o screening out <= 0

Why?
Merging database index with fulltext index
Hi,

what is the best approach to merge a database index with a Lucene fulltext index? Both databases store a unique ID per doc; this is the join criterion.

Requirements:

* both resultsets may be very big (100,000 docs and more)
* the merged resultset must be sorted by database index and/or relevance
* optionally page the merged resultset; a page has a size of 1,000 docs max.

Example:

select a, b from dbtable where c = 'foo' and content='bar' order by relevance, a desc, d

I would split this into:

database: select ID, a, b from dbtable where c = 'foo' order by a desc, d
lucene:   content:bar (sort: relevance)
merge:    loop over the Lucene resultset and add the db record to a new list if the ID matches.

If the resultset must be paged:

database: select ID from dbtable where c = 'foo' order by a desc, d
lucene:   content:bar (sort: relevance)
merge:    loop over the Lucene resultset and add the db record to a new list if the ID matches.
page 1:   select a, b from dbtable where ID IN (list of the IDs of page 1)
page 2:   select a, b from dbtable where ID IN (list of the IDs of page 2)
...

Is there a better way?

Thank you.
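A sketch of the merge step itself under this plan - a plain hash join keyed on the ID, walking the Lucene hits in relevance order. DbRow and the variable names (dbResult, hits) are invented placeholders, and the secondary database sort order is ignored here:

import java.util.*;
import org.apache.lucene.search.Hits;

// 1) key the database rows by their unique ID
Map<String, DbRow> rowsById = new HashMap<String, DbRow>();
for (DbRow row : dbResult) {
  rowsById.put(row.id, row);
}

// 2) walk the Lucene hits (already sorted by relevance) and keep the
//    intersection; output order is relevance, a page is just a sublist
List<DbRow> merged = new ArrayList<DbRow>();
for (int i = 0; i < hits.length(); i++) {
  DbRow row = rowsById.get(hits.doc(i).get("ID"));
  if (row != null) {
    merged.add(row);
  }
}
List<DbRow> page1 = merged.subList(0, Math.min(1000, merged.size()));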
RE: Merging database index with fulltext index
> I feel this may not be a good example.

It was a very simple example. The real database query is very complex and joins several tables. It would be an absolute nightmare to copy all these tables into Lucene and keep both in sync.
RE: Merging database index with fulltext index
> Contrariwise, look for anything by Marcelo Ochoa on the user list about
> embedding Lucene in Oracle (which I confess I haven't looked into at
> all, but seems interesting).

I know about this Lucene-Oracle text cartridge. But my solution has to work with any of the big databases (MS, IBM, Oracle).
RE: Merging database index with fulltext index
> Actually you can use DBSight (disclaimer: I work on it) to collect the
> data and keep it in sync.

Hm... it fulltext-indexes a database? Does it support document content outside the database (custom crawler)? What query syntax does it support?
RE: Merging database index with fulltext index
> Yes. DBSight helps to flatten database objects into Lucene's documents.

OK, thanks for the advice. But back to my original question: when I have to merge both resultsets, what is the best approach to do this?
relevance vs. score
Hi,

when I say "sorted by relevance" or "sorted by score" - are relevance and score synonyms for each other, or what is the difference in relation to sorting?

Thank you
RE: relevance vs. score
> I think for "ordinary" Lucene queries, "score" and "relevance" mean the
> same thing.
>
> But if you do e.g. function queries, or you "mix in" recency into your
> scoring, etc., then "score" could be anything you computed, a value
> from a field, etc.

Hm, how is relevance defined then?
RE: relevance vs. score
> It's the similarity scoring formula. E.g. see here:
>
> http://lucene.apache.org/java/2_4_0/scoring.html
>
> and here:
>
> http://lucene.apache.org/java/2_4_0/api/core/org/apache/lucene/search/Similarity.html

OK, thank you.
RE: Unable to improve performance
> Are you opening your IndexReader with readOnly=true? If not, you're
> likely hitting contention on the "isDeleted" method.

How can I open it "readonly"?
RE: Unable to improve performance
> > How can I open it "readonly"?
>
> See the javadocs for IndexReader.

I did already, for 2.3 - I cannot find any readOnly option.
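For reference, the readOnly flag was only added in Lucene 2.4 (IndexReader.open(Directory, boolean)); 2.3 has no equivalent, so this only helps after upgrading. A minimal sketch (the index path is a placeholder):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// Lucene 2.4+: a read-only reader skips the synchronization around
// isDeleted(), removing the contention mentioned above.
IndexReader reader = IndexReader.open(FSDirectory.getDirectory("/path/to/index"), true);
IndexSearcher searcher = new IndexSearcher(reader);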
Design questions
Hi,

I have to index (tokenized) documents which may have very many pages, up to 10,000. I also have to know on which pages the search phrase occurs. I have to update some stored index fields for my document; the content is never changed. Thus I think I have to add one Lucene document with the index fields and one Lucene document per page.

Mapping
===
MyDocument
- ID
- Field 1-N
- Page 1-N

Lucene
- Lucene document with ID, page number 0 and Field 1-N (stored fields)
- Lucene document 1 with ID, page number 1 and tokenized content of page 1
...
- Lucene document N with ID, page number N and tokenized content of page N

Delete of MyDocument -> IndexWriter#deleteDocuments(Term: ID=foo)
Update of stored index fields -> IndexWriter#updateDocument(Term: ID=foo, page number = 0)

Search with index fields and content:

Step 1: Search on the stored index fields -> list of IDs
Step 2: Search on the ID field (list from above OR'ed together) and content -> list of IDs and page numbers

Does this work? What drawbacks does this approach have? Is there another way to achieve what I want?

Thank you.

P.S. There are millions of documents with a page range from 1 to 10,000.
RE: Design questions
Hi,

> You could even store all of the page offsets in your meta-data document
> in a special field if you wanted, then lazy-load that field rather than
> dynamically counting.

How can I lazy-load a field?

> You'd have to be careful that your offsets corresponded to the data
> *after* it was analyzed, but that shouldn't be too hard.

The TermPosition is the position after analyzing?

> You'd have to read this field before deleting the doc and make sure it
> was stored with the replacement.

I don't understand...

> And, since I'm getting random ideas anyway, here's another. The
> PositionIncrementGap is the "distance" (measured in terms) between two
> tokens. Let's claim that you have no page with more than 10,000 (or
> whatever) tokens. Just bump the position increment gap to the next
> 10,000 at the start of each page. So, the first term on the first page
> has an offset of 0, the first term on the second page has an offset of
> 10,000, and the first term on the third page has an offset of 20,000.
>
> Now, with the SpanNearQuery trick from above, your term position modulo
> 10,000 is also your page. This would also NOT match across pages.
> Hmmm, I kind of like that idea.

But then I have to know how many tokens each page has!?

Thank you!
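An untested sketch of that idea, assuming each page is added as a separate instance of the same field, no page reaches 10,000 tokens, and the Lucene 2.3-era TokenStream.next() API. The wrapper counts each page's tokens and returns whatever gap lands the next page on the next multiple of 10,000:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.*;

// Each page starts at a position that is a multiple of PAGE_STRIDE, so
// hitPosition / PAGE_STRIDE later recovers the page number.
public class PageBoundaryAnalyzer extends Analyzer {
  private static final int PAGE_STRIDE = 10000;
  private final Analyzer delegate;
  private int tokensInPage; // tokens seen since the last page boundary

  public PageBoundaryAnalyzer(Analyzer delegate) {
    this.delegate = delegate;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new TokenFilter(delegate.tokenStream(fieldName, reader)) {
      public Token next() throws IOException {
        Token t = input.next();
        if (t != null) tokensInPage++;
        return t;
      }
    };
  }

  // Called between two instances of the same field, i.e. between pages.
  public int getPositionIncrementGap(String fieldName) {
    int gap = PAGE_STRIDE - (tokensInPage % PAGE_STRIDE);
    tokensInPage = 0;
    return gap;
  }
}

A search hit's position divided by 10,000 then gives the page number.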
RE: Design questions
OK, thank you! I will try this out.
IndexWriter minMergeDocs
Hi,

http://wiki.apache.org/lucene-java/PainlessIndexing says that I should use setMinMergeDocs. But I cannot find this method in Lucene 2.2. What is wrong here?

Thank you.
When to use which Analyzer
Hi,

I have some doubts about Analyzer usage. I read that one should always use the same analyzer for searching and indexing. Why? How does the analyzer affect the search process? What is analyzed here again?

I have tried this out. I used a SimpleAnalyzer for indexing with Field.Store.YES and Field.Index.UN_TOKENIZED. I can see with Luke that the field values are stored unchanged, 1:1. OK.

Now I did a search with Luke, and depending on the analyzer used, the query returns results or not. I can see that when I use the SimpleAnalyzer again, the values of my search are all converted to lowercase and numbers are removed. This leads to wrong results, because my values are stored with Field.Index.UN_TOKENIZED.

Why is my query changed this way? I think it has to do with query parsing, which uses an analyzer. Right? Can I create a query directly, without parsing? Or in other words: how can I search on fields stored with Field.Index.UN_TOKENIZED? Why do I need an analyzer for searching?

Thank you.
Index merging and optimizing
Hi,

are there any ready-to-use tools out there which I can use for merging and optimizing? I have seen that Luke can optimize, but not merge? Or do I have to write my own utility?

Thank you
RE: IndexWriter minMergeDocs
> I think that method was renamed somewhere along the way to
> setMaxBufferedDocs.
>
> However, in 2.3 (to be released in a few weeks), it's better to use
> setRAMBufferSizeMB instead.
>
> For more ideas on speeding up indexing, look here:
>
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

OK, thank you!
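For completeness, a minimal sketch of the 2.3-style configuration (dir, analyzer and the values are arbitrary examples):

import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter(dir, analyzer, true);
writer.setRAMBufferSizeMB(48.0);   // 2.3+: flush when this much RAM is used
// writer.setMaxBufferedDocs(1000); // older alternative: flush by doc count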
Max size of index (FSDirectory)
Hi,

is there any maximum size for an index? Are there any recommendations for a useful maximum size?

I want to index in parallel, so I have to create multiple indexes. Shall I merge them together, or shall I leave them as they are and use (Parallel)MultiSearcher?

Thank you.
RE: Index merging and optimizing
> See org.apache.lucene.misc.IndexMergeTool

Thank you. But this uses a hardcoded analyzer and deprecated API calls. How does the analyzer used affect the merge process? Is everything reindexed with this new analyzer again? Does this make sense? What if the source indexes were created with other analyzers?
RE: When to use which Analyzer
> > How can I search on fields stored with Field.Index.UN_TOKENIZED?
>
> Use TermQuery.
>
> > Why do I need an analyzer for searching?
>
> Consider a full-text field that will be tokenized, removing special
> characters and lowercasing, and then a user querying for an uppercase
> word. The main thing is that queries need to jive with how things got
> indexed, Analyzer in the mix or not.

OK. So I have to use TermQuery when I want to search on UN_TOKENIZED fields. And which Query class for TOKENIZED fields? And then combine both into what kind of Query?

Thank you.
RE: When to use which Analyzer
> The caution to use the same analyzer at index and query time is, in my
> experience, simply good advice to follow until you are familiar enough
> with how Lucene uses analyzers to keep from getting really, really,
> really confused. Once you understand when analyzers are used and how
> they affect the token stream, you can use different ones when analyzing
> as opposed to searching. But I've rarely found that I *wanted* to do
> that.

OK, thank you.
RE: Max size of index (FSDirectory )
> OG: again, it depends. If the index you'd get by merging is of
> manageable size, then merge your indices.

OK, this is what I thought. A single index should be faster than multiple indexes with a MultiSearcher, right? But what about the ParallelMultiSearcher? As I understand the docs, it searches the indexes in parallel (one thread per index?) and then returns the merged search results?

> Otherwise, use the remote flavour of PMS and spread your indices over
> multiple search servers.

What is PMS?

Thank you.
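(PMS presumably stands for the ParallelMultiSearcher mentioned just above.) A minimal sketch of searching several part-indexes as one; the index paths are placeholders:

import org.apache.lucene.search.*;

// ParallelMultiSearcher queries each underlying index in its own thread
// and merges the results into a single relevance-ranked list.
Searchable[] shards = {
    new IndexSearcher("/indexes/part1"),
    new IndexSearcher("/indexes/part2"),
};
Searcher searcher = new ParallelMultiSearcher(shards);
Hits hits = searcher.search(query);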
RE: When to use which Analyzer
> You can answer an awful lot of this much faster than waiting for
> someone to reply by getting a copy of Luke and looking at the parse
> results using various analyzers.

Ah cool, you mean the "explain structure" button.

> Try KeywordAnalyzer for your query.
>
> Combine queries programmatically with BooleanQuery.

Yes, I see, thank you!
RE: Index merging and optimizing
> I admit I've never used IndexMergeTool; I've always used
> IndexWriter.addIndexes and then executed IndexWriter.optimize().
>
> And I've seen no problems. That call takes no analyzer.

So you take the first index and add the remaining indexes via addIndexes? What happens if the indexes were created with different analyzers?
RE: Index merging and optimizing
> Then why would you want to combine them?
>
> I really think you need to explain what you're trying to accomplish
> rather than obsess on the details.

I have to create indexes in parallel because the amount of data is very high. Then I want to merge them into bigger indexes and move them to the search server. (See the thread "Max size of index (FSDirectory)" too.)

My question now is: what will happen if one merges indexes which were created with different analyzers (the customer can configure the analyzer depending on the data which is indexed)? I think this will produce unpredictable results when searching. If so, I have to document that only indexes created with the same analyzer are allowed to be merged.

Thank you
RE: Index merging and optimizing
> But it also seems that the parallel/not parallel decision is something
> you control on the back end, so I'm not sure the user is involved in
> the merge question at all. In other words, you could easily split the
> indexing task up amongst several machines and/or processes and combine
> all the results after all the sub-indexes were built, thus making your
> question basically irrelevant.

I'm just writing the tool. The customer (IT staff, not end users) uses it. Now I have to find out the best strategy which allows fast indexing and searching.

> But you still haven't explained what the user is getting from all this
> flexibility.

I'm not asking about flexibility, only about indexing performance. I know that I have to index in parallel (the initial data is about 150 million pages, more than 1 TByte). Now I'm thinking about what I have to take into consideration when building the merge tool. When I saw the code of IndexMergeTool I was astonished about the hardcoded use of SimpleAnalyzer, and wondered why one uses an analyzer at all. Hence my questions.

> I have a hard time understanding the use-case you're trying to support.
> If you're trying to build a generic front-end to allow parameterized
> Lucene index building, have you looked at SOLR, which uses XML
> configuration files to drive the indexing and searching process? (which
> I haven't used, but I'm tracking the user's group list.)

No, the frontend just provides a textbox where the user can type in text like "foo*". The query is then executed against some un_tokenized fields (fixed parameters for the current user, like company etc.) and against this single tokenized field. I want to filter on these fixed fields too, to reduce the number of hits for the fulltext query. I think this is the right approach to achieve better search performance. Isn't it?

Thank you!
RE: How?
> Firstly, I submit a query like "select * from [tablename]". In this
> table there are around 30 columns and 40,000 rows of data. And I use
> the StandardAnalyzer to generate the index.

Why don't you use a database index?
IndexWriter#addIndexes
Hi,

looking into the code of IndexMergeTool I saw this:

IndexWriter writer = new IndexWriter(mergedIndex, new SimpleAnalyzer(), true);

Then the indexes are added to this new index. My question is: how does the Analyzer of this IndexWriter instance affect the merge process? It seems that it doesn't matter, right?

Thank you.

Complete source of IndexMergeTool:

public static void main(String[] args) throws IOException {
  if (args.length < 3) {
    System.err.println("Usage: IndexMergeTool <mergedIndex> <index1> <index2> [index3] ...");
    System.exit(1);
  }
  File mergedIndex = new File(args[0]);
  IndexWriter writer = new IndexWriter(mergedIndex, new SimpleAnalyzer(), true);

  Directory[] indexes = new Directory[args.length - 1];
  for (int i = 1; i < args.length; i++)
    indexes[i - 1] = FSDirectory.getDirectory(args[i], false);

  System.out.println("Merging...");
  writer.addIndexes(indexes);

  System.out.println("Optimizing...");
  writer.optimize();
  writer.close();

  System.out.println("Done.");
}
RE: How?
> I can use a clustered index on the table, but you can create only one
> clustered index per table. In this table lots of data needs to be
> searched, so I chose Lucene to do that.

Why do you need a clustered index in the database? A non-clustered one would do the job as well.
RE: How?
> A non-clustered or clustered index would resolve the problem, but can
> Lucene not do the same thing?

Well, I bet the database solution is the best, as long as you do not search in big text fields or need special fulltext features like fuzzy search etc. Synchronizing a Lucene index with such a big database is pure overkill, as long as the database does the job.
RE: IndexWriter#addIndexes
> Exactly! Indices are simply merged on disk; their content is not
> re-analyzed.

Thank you!
Compass
Hi,

Compass (http://www.opensymphony.com/compass/content/lucene.html) promises many nice things, in my opinion. Does anybody have production experience with it?

Especially with the Jdbc Directory and updates?

Thank you.
Creating search query
Hi,

I have an index with some fields which are indexed and un_tokenized (keywords) and one field which is indexed and tokenized (content).

Now I want to create a Query object:

TermQuery k1 = new TermQuery(new Term("foo", "some foo"));
TermQuery k2 = new TermQuery(new Term("bar", "some bar"));
// same analyzer as is used for indexing
QueryParser p = new QueryParser("content", new SomeAnalyzer());
Query c = p.parse("text we are looking for");

BooleanQuery q = new BooleanQuery();
q.add(k1, Occur.MUST);
q.add(k2, Occur.MUST);
q.add(c, Occur.MUST);

Is this the best way?

Thank you
RE: Compass
Thank you.

> From: Lukas Vlcek [mailto:[EMAIL PROTECTED]]
> Subject: Re: Compass
>
> Hi,
>
> I am using Compass with Spring and JPA. It works pretty nicely. I don't
> store the index in a database; I use a traditional file-system-based
> Lucene index. Updates work very well, but you have to be careful about
> proper mapping of your objects into the search engine (especially
> parent-child mappings).
>
> Regards,
> Lukas
RE: Creating search query
Yes, sorry, that's the case. Thank you!

> From: Erick Erickson [mailto:[EMAIL PROTECTED]]
> Subject: Re: Creating search query
>
> That should work fine, assuming that foo and bar are the untokenized
> fields and content is the tokenized content.
>
> Erick
RE: Design questions
> From: Erick Erickson [mailto:[EMAIL PROTECTED]]
> Subject: Re: Design questions
>
> But you could also vary this scheme by simply storing in your document
> the offsets for the beginning of each page.

Well, this is the best option for my app, I think, but... how do I find out these offsets?

I'm adding the content field with:

IndexWriter#add(new Field("content", myContentReader));

I have no clue how to find out the offsets in this reader. Must be something with an analyzer and a TokenStream?

Thank you
RE: Design questions
OK, I will give this a try.

Now I have the problem that I do not know how to get the offsets (or positions? What is the difference?) back from the searched document... There is IndexReader#termPositions(Term t) - but this returns the positions for the whole index, not for a single document.

> From: Erick Erickson [mailto:[EMAIL PROTECTED]]
> Subject: Re: Design questions
>
> I think you'll have to implement your own Analyzer and count. That is,
> every call to next() that returns a token will have to also increment
> some counter by 1.
>
> To use this, you must have some way of knowing when a page ends, and at
> that point you call your instance of your custom analyzer to see what
> the count is. Or your analyzer maintains the list and you can call for
> it after you've added all the pages.
>
> Analyzer.getPositionIncrementGap is called every time you call
> document.add("field", ...).
>
> So, you have something like this:
>
> while (more pages for doc) {
>   String pagedata = getPageText();
>   doc.add("text", pagedata);
> }
>
> Under the covers, your custom analyzer adds the current offset (which
> you've kept track of) to, say, an ArrayList. And after the last page is
> added, you get this ArrayList and add it to your document.
>
> Or, you could just do things twice. That is, send your text through a
> TokenStream, then call next() and count. Then send it all through
> doc.add().
>
> There are probably cleverer ways, but that should do for a start.
>
> Best
> Erick
RE: Design questions
> Or, you could just do things twice. That is, send your text through a
> TokenStream, then call next() and count. Then send it all through
> doc.add().

Hm. This means reading the content twice, no matter whether I use my own analyzer or override/wrap the main analyzer. Is there any hook where I can grab the last token when I call Document#add?

Thank you.
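One untested way around analyzing twice, sketched below: run the analyzer yourself once per page, buffer the tokens (which yields the count), and hand the buffered stream to the Field constructor that takes a TokenStream (available from Lucene 2.3 on). Assumes a surrounding method that throws IOException; analyzer, pageText and doc are placeholders:

import java.io.StringReader;
import java.util.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;

// Tokenize the page once, remembering the token count for the page
// offset bookkeeping discussed above.
List<Token> tokens = new ArrayList<Token>();
TokenStream ts = analyzer.tokenStream("content", new StringReader(pageText));
for (Token t = ts.next(); t != null; t = ts.next()) {
  tokens.add(t);
}
int tokensOnThisPage = tokens.size(); // record this per page

// Feed the already-analyzed tokens to the document; the text is not
// run through the analyzer a second time.
final Iterator<Token> it = tokens.iterator();
doc.add(new Field("content", new TokenStream() {
  public Token next() {
    return it.hasNext() ? it.next() : null;
  }
}));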
TermVector
Hi,

how do I get the TermVector of a document which I have retrieved from an IndexSearcher via IndexSearcher#search(Query q)?

Luke can do it, but I do not know how...

Thank you.
RE: TermVector
Sorry, that was a bit of nonsense ;)

I store a document with a content field like this:

Document#add(new Field("content", someReader, TermVector.WITH_OFFSETS));

Later I search this document with an IndexSearcher and want the term positions from this single document. There is IndexReader#termPositions(Term t) - but this returns the positions for the whole index, not for a single document.
RE: TermVector
> Also, search the archives for Term Vector, as you will find discussion
> of it there.

Ah, I see - I need to cast it to TermPositionVector. OK.

> You may also, eventually, be interested in the new TermVectorMapper
> capabilities in 2.3, which should help speed up the processing of term
> vectors by providing a callback mechanism to allow you to load them
> into data structures that make sense for your application.

Hm. What I need are the start offsets of a specific term. I use TermPositionVector#getOffsets(TermPositionVector.indexOf("foo")). Can TermVectorMapper speed this up?

And how can I find the offsets of something like "foo bar"? I think this will get tokenized into two terms, and thus I have no chance to find it, right?

Thank you.
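Putting the cast together, a short sketch that reads the stored offsets of one term from one document (reader, docId and the field name are placeholders; getOffsets returns null if offsets were not indexed):

import org.apache.lucene.index.*;

TermFreqVector tfv = reader.getTermFreqVector(docId, "content");
if (tfv instanceof TermPositionVector) {
  TermPositionVector tpv = (TermPositionVector) tfv;
  int idx = tpv.indexOf("foo");
  if (idx != -1) {
    TermVectorOffsetInfo[] offsets = tpv.getOffsets(idx);
    for (int i = 0; i < offsets.length; i++) {
      System.out.println("foo at chars " + offsets[i].getStartOffset()
          + "-" + offsets[i].getEndOffset());
    }
  }
}

For a phrase like "foo bar" one would look up both terms and keep offset pairs whose positions are adjacent - essentially what the contrib highlighter does.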
RE: TermVector
> > And how can I find the offsets of something like "foo bar"? I think
> > this will get tokenized into two terms, and thus I have no chance to
> > find it, right?
>
> I wouldn't say no chance... TermVectorMapper would be good for this, as
> you can watch the terms as they are being loaded. Just keep track of
> your last term, see if it is "foo", and then check when you hit "bar".
>
> What kind of special term are you looking for? There may be other ways
> of solving your problem...

Well, I do not only want to find documents with a certain phrase, but also the positions of these phrases (e.g. "foo bar") in the document. It must be possible, I think, because a highlighter can do it?
Which analyzer
Hi,

I have a huge number of documents which contain mainly numbers and dates (German format dd.MM.), like this:

Tgr. gilt ab 01.01.99 01.01.99 01.01.99 01.01.99 01.01.99 01.01.99 01.01.99 01.01.99 01.01.99 01.01.99 01.01.99 01.01.99 46X0 01 0480101080512070010
Gefahrenklass01 01 01 01 01 01 01 01 01 01 01 01 46X0 01 0490101080512070010
Bezahlte Std.152,25 152,25 152,25 152,25 152,25 152,25 152,25 152,25 152,25 152,25 152,25 152,25 46X0 01 0500101080512070010
Woech.Arbzeit 35,00 35,00 35,00 35,00 35,00 35,00 35,00 35,00 35,00 35,00 35,00 35,00 46X0 01 0510101080512070010
Monatl.Arbzt.152,25 152,25 152,25 152,25 152,25 152,25 152,25 152,25 152,25 152,25 152,25 152,25

Which analyzer should I use when someone searches for a certain number or date?

Thank you.
RE: Which analyzer
Hello,

let's say the document contains 01.02.1999 and 152,45. Then I want to search for:

01.02.1999 AND 152,45
01.02.1999
152,45
1999
152

Thank you.

> From: Erick Erickson [mailto:[EMAIL PROTECTED]]
> Subject: Re: Which analyzer
>
> *How* do you want to search them? If it's simply exact matches, then
> WhitespaceAnalyzer should work fine.
>
> But if you want to, for example, look at date ranges or number ranges,
> you'll have to be more clever.
>
> What do you want to accomplish?
>
> Best
> Erick
IndexWriter: setRAMBufferSizeMB
Hi,

if I understand this property correctly, every time the RAM buffer is full it gets automatically written to disk - something like a commit in a database. Thus if my application dies, all docs in the buffer are lost. Right?

If so, is there any event/callback etc. which informs my application that such a commit happened?

Thank you.
RE: Which analyzer
OK, I will try it. Thank you.

> From: Erick Erickson [mailto:[EMAIL PROTECTED]]
> Subject: Re: Which analyzer
>
> WhitespaceAnalyzer should do the trick. Give it a try...
>
> My point was that RangeQuerys wouldn't work very well, but since you're
> not trying to do that, WhitespaceAnalyzer should handle your case.
>
> Erick
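For the record, a tiny sketch of the suggestion; WhitespaceAnalyzer splits on whitespace only, so tokens like "01.02.1999" and "152,45" survive intact at both index and query time (the field name is invented, and parse() throws ParseException):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// Same analyzer on both sides, so query tokens match indexed tokens 1:1.
QueryParser parser = new QueryParser("content", new WhitespaceAnalyzer());
Query q = parser.parse("01.02.1999 AND 152,45");

Note that the bare searches for "1999" or "152" from the list above would not match the whole token "01.02.1999"; that would need extra tokenization or wildcard queries.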
RE: IndexWriter: setRAMBufferSizeMB
OK, so there is nothing in 2.3 besides IndexWriter.close to ensure that the docs are written to disk and that the index will survive an application/machine death?

> From: Michael McCandless [mailto:[EMAIL PROTECTED]]
> Subject: Re: IndexWriter: setRAMBufferSizeMB
>
> Well ... every time the RAM buffer is full, a new segment is flushed to
> the Directory, but that is not necessarily a "commit" in that an
> IndexReader would see the new segment, nor that the segment would
> survive if the machine suddenly crashed.
>
> You shouldn't rely on when specifically IndexWriter makes its changes
> visible to readers. The best way to be sure is to close the writer.
>
> There is work underway now, in this issue:
>
> https://issues.apache.org/jira/browse/LUCENE-1044
>
> that will add an explicit "commit" call, which you would use to 1) make
> the changes visible to readers, and 2) sync the index such that if the
> machine crashed (after commit returns) then your changes as of the
> commit will survive. But it's not committed yet ... it will be in 2.4.
>
> One way for a reader to check if a new commit has happened is to call
> the isCurrent method. Maybe that helps?
>
> Mike
RE: IndexWriter: setRAMBufferSizeMB
Thank you. So I will call flush() in 2.3 (and may lose data when the machine dies) and commit() in 2.4+ (where a sync() will save the data).

> From: Michael McCandless [mailto:[EMAIL PROTECTED]]
> Subject: Re: IndexWriter: setRAMBufferSizeMB
>
> It's complicated.
>
> In 2.3, you can use IW.flush to write docs to disk. But that method
> will be deprecated in 2.4 and replaced with commit. Or, you can close.
>
> If the application (JVM) dies or is killed, the index will be fine but
> won't have any un-flushed buffered docs.
>
> If the machine dies (OS crashes, power cord pulled) then there is a
> real risk that the index will become corrupt. This is because Lucene
> has never explicitly sync()'d the files to ensure they are on stable
> storage. LUCENE-1044 fixes that (adds syncs).
>
> Mike
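In code, the version split described above amounts to (a sketch; writer is an open IndexWriter):

writer.flush();    // Lucene 2.3: push buffered docs to disk (no fsync)
// writer.commit(); // Lucene 2.4+ (LUCENE-1044): flush, publish, and sync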
RE: Luke's document hitlist display
OK, understood. Maybe add a little hint to the legend, like "only for stored fields".

> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
> Subject: Re: Luke's document hitlist display
>
> > When I add a field like this:
> >
> > new Field("CONTENT", contentReader, TermVector.WITH_OFFSETS)
> >
> > Luke shows only "--". Why? Shouldn't it be "IT-Vo-"?
>
> It should, but this information is not available ... Luke populates
> this screen using Document.getFields(). If a field is unstored then
> it's not returned in this list, so it's not possible to get its flags.
Luke's document hitlist display
Hi,

using Luke 0.7.1: the document hitlist has a column header ITSVop0LBC. When I add a field like this:

new Field("CONTENT", contentReader, TermVector.WITH_OFFSETS)

Luke shows only "--". Why? Shouldn't it be "IT-Vo-"?

Thank you
RE: TermPositionVector
This would be really nice!

> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
> Subject: Re: TermPositionVector
>
> > And: is there a trick to show this info in Luke?
>
> Not yet :) Funny thing, I've been thinking about adding this to Luke,
> but ran out of time before the last release. Perhaps I'll include it in
> a minor update.
RE: TermPositionVector
TermA TermB TermA has position 0 and offset 0 TermB has position 1 and offset 6 Right? > -Original Message- > From: Grant Ingersoll [mailto:[EMAIL PROTECTED] > Sent: Dienstag, 12. Februar 2008 15:16 > To: java-user@lucene.apache.org > Subject: Re: TermPositionVector > > Position is just relative to other tokens > (Token.getPositionIncrement()), offsets are character offsets > (Token.startOffset(), Token.endOffset()) > > -Grant > > On Feb 12, 2008, at 8:31 AM, <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > could somebody please explain what the difference between > positions > > and > > offsets is? > > > > And: Is there a trick to show theses infos in luke? > > > > Thank you. > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > -- > Grant Ingersoll > http://lucene.grantingersoll.com > http://www.lucenebootcamp.com > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
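A minimal sketch that prints both values for the "TermA TermB" example above (2.3-era token API; WhitespaceAnalyzer is just an assumption):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;

public class PositionsVsOffsets {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new WhitespaceAnalyzer()
            .tokenStream("f", new StringReader("TermA TermB"));
        int position = -1; // positions count tokens, offsets count characters
        Token t;
        while ((t = ts.next()) != null) {
            position += t.getPositionIncrement();
            System.out.println(t.termText() + ": position=" + position
                + ", offsets=" + t.startOffset() + "-" + t.endOffset());
        }
        // prints: TermA: position=0, offsets=0-5
        //         TermB: position=1, offsets=6-11
    }
}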
TermPositionVector
Hi, could somebody please explain what the difference between positions and offsets is? And: Is there a trick to show this info in Luke? Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
design: merging resultset from RDBMS with lucene search results
Hi, I have the following scenario: an RDBMS which contains the metadata for documents (ID, customer number, doctype etc.). Now I want to add fulltext search support. So I will index the documents' content in Lucene and add the document's ID as a stored field in Lucene. Now somebody wants to search like this: customer number 1234 AND content "foo bar". So I go to Lucene, search for content:"foo bar" and get back a hitlist containing the document IDs. Now - how to merge these IDs with the resultset of the RDBMS search for customer number 1234? 1) select ... from ... where customer=1234 and ID in (). or 2) select ... from ... where customer=1234 and then join both resultsets in the application. or 3) no idea :) What is best practice here? Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
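A hedged sketch of option 1, assuming a small hitlist, string IDs, and a made-up documents(id, customer) table (for an empty or very large hitlist you would fall back to option 2 and join in the application):

import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class MergeSketch {
    static PreparedStatement merge(Connection con, IndexSearcher searcher,
                                   Query contentQuery) throws Exception {
        Hits hits = searcher.search(contentQuery);
        StringBuffer in = new StringBuffer();
        for (int i = 0; i < hits.length(); i++) {
            if (i > 0) in.append(',');
            in.append('?');
        }
        PreparedStatement ps = con.prepareStatement(
            "select * from documents where customer = ? and id in (" + in + ")");
        ps.setInt(1, 1234);
        for (int i = 0; i < hits.length(); i++) {
            ps.setString(i + 2, hits.doc(i).get("ID")); // "ID" = the stored Lucene field
        }
        return ps;
    }
}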
RE: design: merging resultset from RDBMS with lucene search results
The metadata is quite often altered and there are millions of documents. Also, document access is secured by complex SQL statements which Lucene might not support. So this is not an option, I think. > -Original Message- > From: John Byrne [mailto:[EMAIL PROTECTED] > Sent: Mittwoch, 13. Februar 2008 18:44 > To: java-user@lucene.apache.org > Subject: Re: design: merging resultset from RDBMS with lucene > search results > > Hi, > > You might consider avoiding this problem altogether, by simply adding > the meta data to your Lucene index. Lucene can handle untokenized > fields, which is ideal for meta data. It might not be as quick as the > RDB, but you could perhaps optimize by only searching in the RDB when > you only need to search meta data, and using Lucene when you > need both. > > Regards, > JB > > [EMAIL PROTECTED] wrote: > > Hi, > > > > I have the following scenario: > > > > an RDBMS which contains the metadata for documents (ID, > customer number, > > doctype etc.). > > Now I want to add fulltext search support. > > > > So I will index the documents' content in Lucene and add the > document's ID as > > a stored field in Lucene. > > > > Now somebody wants to search like this: customer number > 1234 AND content > > "foo bar". > > > > So I go to Lucene, search for content:"foo bar" and get > back a hitlist > > containing the document IDs. > > > > Now - how to merge these IDs with the resultset of the > RDBMS search for > > customer number 1234? > > > > 1) select ... from ... where customer=1234 and ID in > (). > > > > or > > > > 2) select ... from ... where customer=1234 and then join > both resultsets in > > the application. > > > > or > > > > 3) no idea :) > > > > What is best practice here? > > > > Thank you. > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Design questions
> Rather than index one doc per page, you could index a special > token between pages. Say you index $ as the special > token. I have decided to use this version, but... What token can I use? It must be a token which never gets removed by an analyzer or altered in a way that makes it no longer unique in the resulting token stream. Is something like $0123456789$ the way to go? Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Design questions
> Why not just use ? Because nearly every analyzer removes it (SimpleAnalyzer, German, Russian, French...). Just tested it with Luke in the search dialog. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Design questions
> Document doc = new Document() > for (int i = 0; i < pages.length; i++) { > doc.add(new Field("text", pages[i], Field.Store.NO, > Field.Index.TOKENIZED)); > doc.add(new Field("text", "$$", Field.Store.NO, > Field.Index.UN_TOKENIZED)); > } UN_TOKENIZED. Nice idea! I will check this out. > 2) if your goal is just to be able to make sure you can query > for phrases > without crossing page boundaries, it's a lot simpler just to use a > really big positionIncrementGap with your analyzer (and add > each page as a > separate Field instance). boundary tokens like these are really only > necessary if you want more complex queries (like "find X and Y on > the same page but not in the same sentence") Hm. This is what Erik already recommended. I would have to store the field with TermVector.WITH_POSITIONS, right? But I do not know the maximum number of terms per page, and I do not know the maximum number of pages. I have already had documents with more than 50,000 pages (A4) and documents with 1 page but 100 MB of data. How many terms can 100 MB have? Hm... Since positions are stored as ints, I could have a maximum of 40,000 terms per page (50,000 pages * 40,000 terms -> nearly Integer.MAX_VALUE). - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
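For reference, a compilable version of the quoted sketch (the pages array is hypothetical). The point is that UN_TOKENIZED bypasses the analyzer entirely, so the boundary marker can never be stripped or altered:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PageBoundaries {
    static Document buildDoc(String[] pages) {
        Document doc = new Document();
        for (int i = 0; i < pages.length; i++) {
            doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED));
            // boundary marker between pages; added UN_TOKENIZED, i.e. indexed as-is
            doc.add(new Field("text", "$$", Field.Store.NO, Field.Index.UN_TOKENIZED));
        }
        return doc;
    }
}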
RE: Design questions
> > Document doc = new Document() > > for (int i = 0; i < pages.length; i++) { > > doc.add(new Field("text", pages[i], Field.Store.NO, > > Field.Index.TOKENIZED)); > > doc.add(new Field("text", "$$", Field.Store.NO, > > Field.Index.UN_TOKENIZED)); > > } > > UN_TOKENIZED. Nice idea! > I will check this out. Hm... when I try this, something strange happens with my offsets. When I use doc.add(new Field("text", pages[i] + "012345678901234567890123456789012345678901234567890123456789", Field.Store.NO, Field.Index.TOKENIZED)) everything is fine. Offsets are as I expect. But when I use doc.add(new Field("text", pages[i], Field.Store.NO, Field.Index.TOKENIZED)) doc.add(new Field("text", "012345678901234567890123456789012345678901234567890123456789", Field.Store.NO, Field.Index.UN_TOKENIZED)) the offsets of my terms are too high. What is the difference? Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Design questions
Well, it seems that this may be a solution for me too. But I'm afraid that someone one day will change this string. And then my app will not work anymore... > -Original Message- > From: Adrian Smith [mailto:[EMAIL PROTECTED] > Sent: Freitag, 15. Februar 2008 13:02 > To: java-user@lucene.apache.org > Subject: Re: Design questions > > Hi, > > I have a similar sitaution. I also considered using $. But > for the sake of > not running into (potential) problems with Tokenisers, I just > defined a > string in a config file which for sure is never going to > occur in a document > and will never be searched for, e.g. > > dfgjkjrkruigduhfkdgjrugr > > Cheers, Adrian > -- > Java Software Developer > http://www.databasesandlife.com/ > > > > On 15/02/2008, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > > > > > I haven't really been following this thread that closely, but... > > > > : Why not just use ? Check to insure that it makes > > > > : it through whatever analyzer you choose though. For instance, > > : LetterTokenizer will remove it... > > > > > > 1) i'm 99% sure you can do something like this... > > > > Document doc = new Document() > > for (int i = 0; i < pages.length; i++) { > > doc.add(new Field("text", pages[i], Field.Store.NO, > > Field.Index.TOKENIZED)); > > doc.add(new Field("text", "$$", Field.Store.NO, > > Field.Index.UN_TOKENIZED)); > > } > > > > ...and you'll get your magic token regardless of whether it > would normally > > make it through your analyzer. In fact: you want it to be > something your > > analyzer could never produce, even if it appears in the > orriginal text, so > > you don't get false boundaries (ie: if you use an Analzeer > that lowercases > > everything, then "A" makes a perfectly fine boundary token. > > > > 2) if your goal is just to be able to make sure you can > query for phrases > > without crossing page boundaries, it's a lot simpler just to use are > > really big positionIncimentGap with your analyzer (and add > each page as a > > seperate Field instance). boundary tokens like these are > relaly only > > neccessary if you want more complex queries (like "find X and Y on > > the same page but not in the same sentence") > > > > > > > > > > -Hoss > > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Design questions
> You need to watch both the positionincrementgap > (which, as I remember, gets added for each new field of the > same name you add to the document). Make it 0 rather than > whatever it is currently. You may have to create a new analyzer > by subclassing your favorite analyzer and overriding the > getPositionIncrementGap (?) Well, I'm using GermanAnalyzer, and it does not override getPositionIncrementGap from Analyzer. And in Analyzer, getPositionIncrementGap returns 0. > Also, I'm not sure whether the term increment (see > get/setPositionIncrement) > needs to be taken into account. See the SynonymAnalyzer in > Lucene in Action. I cannot find the source of SynonymAnalyzer. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
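For illustration, a minimal sketch of the subclassing approach quoted above, assuming GermanAnalyzer and an arbitrary gap value; the gap just has to exceed any realistic page length in tokens:

import org.apache.lucene.analysis.de.GermanAnalyzer;

public class PageGapAnalyzer extends GermanAnalyzer {
    // the gap is inserted between two Field instances of the same name;
    // the default implementation in Analyzer returns 0
    public int getPositionIncrementGap(String fieldName) {
        return 10000; // hypothetical: larger than any expected page
    }
}

With such a gap, a phrase query (with a small slop) can no longer match across two page fields.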
Searching multiple indexes
Hi, I have some questions about searching multiple indexes. 1. IndexSearcher with a MultiReader will search the indexes sequentially? 2. ParallelMultiSearcher searches in parallel. How is this done? One thread per index? When will it return? When the slowest search is finished? 3. When I have to search indexes created with different analyzers (maybe a French and a German analyzer), do I have to search them separately on my own? Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Searching multiple indexes
No ideas? :( > -Original Message- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] > Sent: Samstag, 16. Februar 2008 15:42 > To: java-user@lucene.apache.org > Subject: Searching multiple indexes > > Hi, > > I have some questions about searching multiple indexes. > > 1. IndexSearcher with a MultiReader will search the indexes > sequentially? > > 2. ParallelMultiSearcher searches in parallel. How is this > done? One thread > per index? When will it return? When the slowest search is finished? > > 3. When I have to search indexes created with different > analyzers (maybe a > French and a German analyzer), do I have to search them > separately on my own? > > Thank you. > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
How to construct a MultiReader?
Hi, how can I construct a MultiReader? There is only a constructor with an IndexReader array. But IndexReader is abstract, and all other IndexReader implementations also need an IndexReader as a constructor param. Now I'm a bit confused... I want to construct a MultiReader which reads multiple FSDirectories. Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: How to construct a MultiReader?
Thank you. > -Original Message- > From: Shai Erera [mailto:[EMAIL PROTECTED] > Sent: Donnerstag, 21. Februar 2008 14:11 > To: java-user@lucene.apache.org > Subject: Re: How to construct a MultiReader? > > Hi > > You can use the IndexReader.open() static method to open a reader over > directories, file-systems etc. > Does that help? > > Shai > > On Thu, Feb 21, 2008 at 3:04 PM, <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > how can I construct a MultiReader? > > > > There is only a constructor with an IndexReader array. But > IndexReader is > > abstract and all other IndexReader implementations also need an > > IndexReader > > as constructor param. > > > > Now I'm a bit confused... > > > > I want to construct a MultiReader which reads multiple > FSDirectories. > > > > Thank you. > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > -- > Regards, > > Shai Erera > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
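Putting the answer together, a sketch with hypothetical index paths:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        IndexReader[] readers = new IndexReader[] {
            IndexReader.open("/indexes/a"), // open() also accepts a Directory,
            IndexReader.open("/indexes/b")  // e.g. an FSDirectory instance
        };
        IndexSearcher searcher = new IndexSearcher(new MultiReader(readers));
    }
}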
RE: Rebuilding Document from index?
You can use Luke to rebuild the document. It will show you the terms of the analyzed document, not the original content. And this is what you want, if I understood you correctly. > -Original Message- > From: Itamar Syn-Hershko [mailto:[EMAIL PROTECTED] > Sent: Freitag, 22. Februar 2008 14:02 > To: java-user@lucene.apache.org > Subject: Rebuilding Document from index? > > Hi, > > Is it possible to re-create a document from an index, if its > not stored? > What I'm looking for is a way to have a text document with > the text AFTER it > was analyzed, so I can see how my analyzer handles certain > cases. So that > means I don't care if I will not get the original document. I > want to see > the document as the index knows it. > > Thanks in advance, > > Itamar. > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Suffix search
Hi, using WildcardQuery directly, it is possible to search for suffixes like "*foo". The QueryParser, however, throws an exception saying that this is not allowed in a WildcardQuery. Hm, now I'm confused ;) How can I configure the QueryParser to allow a wildcard as the first character? Thank you - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
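A sketch of both routes (field name and analyzer are assumptions):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class SuffixSearch {
    public static void main(String[] args) throws Exception {
        // programmatic: nothing stops a leading wildcard
        Query direct = new WildcardQuery(new Term("content", "*foo"));

        // QueryParser refuses leading wildcards unless explicitly allowed
        QueryParser qp = new QueryParser("content", new WhitespaceAnalyzer());
        qp.setAllowLeadingWildcard(true);
        Query parsed = qp.parse("*foo");
    }
}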
RE: Suffix search
> 1) See setAllowLeadingWildcard in QP. Oh damned... late in the evening ;) Hm, just tested it: Searching for "format" works. Searching for "form*" works. Searching for "*ormat" does NOT work. Confused again ;) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Suffix search
> That will let you do it, be warned however there is most definitely a > significant performance degradation associated with doing this. Yes of course. Like in a relational database with a leading wildcard. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Changing wildcard characters
Hi, is it possible to change the wildcard characters which are used by QueryParser? Or do I have to replace them myself in the query string? Thank you - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Transactions in Lucene
> Then, you can call close() to commit the changes to the index, or > abort() to rollback the index to the starting state (when the writer > was opened). As I understand the docs, the index will get rolled back to the state it was in when the writer was opened. How can I achieve a rollback which only goes back to the state of the last flush (2.3) / commit (2.4/3.0)? Until now I call flush to commit, but I do not know how to roll back... Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Transactions in Lucene
> In 2.4, commit() sets the rollback point. So abort() will > roll index > back to the last time you called commit() (or to when the writer was > opened if you haven't called commit). > > In 2.3, your only choice is to close & re-open the writer to reset > the rollback point. OK, thank you. When is the 2.4 release planned? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
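A sketch of the 2.4 semantics described above; writer and docs are assumed to exist, and the method names follow this pre-release discussion (the released API may name the rollback call differently):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

class TransactionSketch {
    static void demo(IndexWriter writer, Document docA, Document docB) throws Exception {
        writer.addDocument(docA);
        writer.commit();          // durable rollback point
        writer.addDocument(docB);
        writer.abort();           // discards docB; the index is back at the commit() state
    }
}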
RE: Transactions in Lucene
> > When is the 2.4 release planned? > > Not really sure at this point ... Hm. Digging into IndexWriter#init, it seems that this is a really expensive operation, and thus my self-made "commit" is too. Isn't it? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Transactions in Lucene
> I don't think creating an IndexWriter is very expensive at all. Ah, OK. I tested it. Creating an IndexWriter on an index with 10,000 docs (about 15 MB) takes about 200 ms. That is a very cheap operation for me ;) I only saw the many calls in init() which read files and so on, and therefore I thought it could be expensive. Thank you! - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: How do i get a text summary
> If you want something from an index it has to be IN the > index. So, store a > summary field in each document and make sure that field is part of the > query. And how could one automatically create such a summary? Taking the first 2 lines of a document does not always make much sense. How does Google do this? Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
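One common approach, not mentioned in this thread, is query-dependent snippets: the contrib Highlighter picks the best fragment of the text around the matched terms, which is roughly what Google-style result summaries do. A sketch, assuming the raw text is available at search time:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class SnippetSketch {
    static String summarize(Query query, String text) throws Exception {
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        // returns the highest-scoring fragment of the text for this query
        return highlighter.getBestFragment(new StandardAnalyzer(), "content", text);
    }
}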
RE: NO_NORM and TOKENIZED
Hm, what exactly does NO_NORM mean? Thank you - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Swapping between indexes
> Since Lucene buffers in memory, you will always have the risk of > losing recently added documents that haven't been flushed yet. > Committing on every document would be too slow to be practical. Well, it is not sooo slow... I have indexed 10,000 docs, resulting in a 14 MB index. The index has 2 stored fields and the tokenized content field. With a commit after every add: 30 min. With a commit after every 100 adds: 23 min. Only one commit: 20 min. (including time to get the document from the archive) I use Lucene 2.3, so a commit is a combination of closing and creating the writer. 2.4/3.0 has a commit method which may be faster. Before this test I thought it would be much slower than 30 min... So one has to decide if correctness is more important than performance. I use a batch size of 100, first committing Lucene, then committing the database which holds the status of each document, i.e. whether it is already indexed or not. If the DB commit fails it is no problem, because my app does not care about documents being indexed multiple times. But until now neither the Lucene nor the DB commit has ever failed... - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
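A hedged sketch of the batching described above (2.3-style, where close/reopen acts as the commit; the path, analyzer, doc list and the DB call are all assumptions):

import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class BatchedIndexer {
    static void index(List docs, String path, Analyzer analyzer) throws Exception {
        IndexWriter writer = new IndexWriter(path, analyzer, false);
        int inBatch = 0;
        for (Iterator it = docs.iterator(); it.hasNext();) {
            writer.addDocument((Document) it.next());
            if (++inBatch == 100) {
                writer.close();                           // the 2.3 "commit"
                writer = new IndexWriter(path, analyzer, false);
                // markBatchIndexedInDb();                // hypothetical DB update
                inBatch = 0;
            }
        }
        writer.close();
    }
}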
RE: Swapping between indexes
> > With a commit after every add: 30 min. > > With a commit after every 100 adds: 23 min. > > Only one commit: 20 min. > > All of these times look pretty slow... perhaps lucene is not the > bottleneck here? That is why I wrote: "(including time to get the document from the archive)" It is not the absolute times that are important, but the differences. They only occur due to the different batch sizes. I think it is a real-world scenario because one always has to read the docs from somewhere and often has to store the index state somewhere else. A test with docs created in memory and no state in a database would of course have completely different results. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Swapping between indexes
> With a commit after every add: (286 sec / 10,000 docs) 28.6 ms. > With a commit after every 100 adds: (12 sec / 10,000 docs) 1.2 ms. > Only one commit: (8 sec / 10,000 docs) 0.8 ms. Of course. If creating a document takes that little time, then a commit, which may take, let's say, 10-500 ms, will slow down indexing heavily. So it really depends on the use case and on how long it takes to index a single document, including retrieval of the document from its source. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: MultiSearcher to overcome the Integer.MAX_VALUE limit
Does this mean that I cannot search indexes with more than 2 billion docs at all with a single IndexSearcher? > -Original Message- > From: Mark Miller [mailto:[EMAIL PROTECTED] > Sent: Samstag, 8. März 2008 18:57 > To: java-user@lucene.apache.org > Subject: Re: MultiSearcher to overcome the Integer.MAX_VALUE limit > > Random text can often be pretty slow when done per word. > > I think you will have to modify the MultiSearcher a bit. The > MultiSearcher takes a global id space and converts to and from an > individual Searcher id space. The MultiSearcher's id space is > limited to > an int as well, but I think if you change it to a float/double, you > should be all set. > > - Mark > > Toke Eskildsen wrote: > > On Fri, 2008-03-07 at 00:03 +0100, Ray wrote: > > > >> I am currently running a small random text indexer with > 400 docs/second. > >> It will reach 2 billion in around 45 days. > >> > > > > If you are just doing it to test large indexes (in terms of document > > count), then you need to look into your index-generation > code. I tried > > making an ultra-simple index builder, where each document contains a > > unique id and one of nine fixed strings. The index-building > speed on my > > desktop computer is 40.000 documents/second (tested with 100 million > > documents). > > > > I would suspect that your random text generator is where all the > > time-intensive processing occurs. Either that or you're > flushing after > > each document addition (which lowers my execution speed to about 100 > > documents/second). > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: MultiSearcher to overcome the Integer.MAX_VALUE limit
> Right... but trust me, you really wouldn't want to. You need > distributed search at that level anyway. Hm, 2 billion small docs are not that much. Why do I need distributed search, and what exactly do you mean by distributed search? Multiple IndexSearchers? Multiple processes? Multiple machines? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Biggest index
Hi, I have some questions about the index size on a single machine: What is your biggest index you use in production? Do you use MultiReader/Searcher? What hardware do you need to serve it? What kind of application is it? Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Biggest index
Yes of course, the answers to your questions are important too. But no answer at all until now :( For me I can say (not production yet): 2 ID fields and one content field per doc. Search on the content field only. Simple searches like "content:foo" or "content:foo*". 1.5 GB of index per 1 million docs. About 50 million docs now. Max. 10 million docs per year increase. So I will have a 75 GB index soon. Can searching this index be handled by a single machine? Thank you. > -Original Message- > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] > Sent: Dienstag, 11. März 2008 20:07 > To: java-user@lucene.apache.org > Subject: Re: Biggest index > > Questions like these are always hard to answer well. > Actually, no, they are easy, right Erik: "It depends" ;) > > Just kidding...partially. Anyhow, you should ask a few more > questions then: > > - what is the response latency? (average, median, Nth percentile...) > - are stored fields involved, if so how many and how big are they? > - what kind of queries are involved (some are costlier than others) > - what is the search rate? > ... > > > Otis > > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > - Original Message > From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Monday, March 10, 2008 5:06:04 PM > Subject: Biggest index > > Hi, > > I have some questions about the index size on a single machine: > > What is your biggest index you use in production? > Do you use MultiReader/Searcher? > What hardware do you need to serve it? > What kind of application is it? > > Thank you. > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Does Lucene Supports Billions of data
> Even if they're in multiple indexes, the doc IDs being ints > will still prevent > it going past 2Gi unless you wrap your own framework around it. Hm. Does this mean that a MultiReader has the int-limit too? I thought that this limit applies to a single index only... - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Indexing questions
Hi, I have some questions about indexing: 1. Is it possible to open indexes with MultiReader+IndexSearcher and add documents to these indexes simultaneously? 2. Is it possible to open indexes with MultiReader+IndexSearcher and optimize these indexes simultaneously? 3. Is it possible to open indexes with MultiReader+IndexSearcher and merge these indexes simultaneously? Thank you. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]