Re: ANN: UweSays Query Operator
that's nice!

Tommaso

2012/11/19 Uwe Schindler:
> Lol!
>
> Many thanks for this support!
>
> Uwes
>
> Otis Gospodnetic schrieb:
>> Hi,
>>
>> Quick announcement for Uwe & Friends.
>>
>> UweSays is now a super-duper-special query operator over on
>> http://search-lucene.com/. Now whenever you want to know what Uwe says
>> about something, just start the query with UweSays.
>>
>> Example: http://search-lucene.com/?q=UweSays+mmap
>>
>> It's not case sensitive, so you can lay off the shift key.
>> There are some other similar Easter eggs in there if you want to hunt.
>>
>> Otis
>> --
>> Performance Monitoring - http://sematext.com/spm/index.html
>> Search Analytics - http://sematext.com/search-analytics/index.html
>
> --
> Uwe Schindler
> H.-H.-Meier-Allee 63, 28213 Bremen
> http://www.thetaphi.de
Re: TokenStreamComponents in Lucene 4.0
On 19.11.2012 17:44, Carsten Schnober wrote:

Hi,

> However, after switching to Lucene 4 and TokenStreamComponents, I'm
> getting a strange behaviour: only the first document in the collection
> is tokenized properly. The others do appear in the index, but
> un-tokenized, although I have tried not to change anything in the logic.
> The Analyzer now has this createComponents() method calling the custom
> TokenStreamComponents class with my custom Tokenizer:

After some debugging, it turns out that the Analyzer method
createComponents() is called only once, for the first document. This seems
to be the problem: the other documents are just not analyzed.

Here's the loop that creates the fields and supposedly calls the analyzer.
Does anyone have a hint why this happens only for the first document? The
loop itself runs once for every document:

---
List<de.ids_mannheim.korap.main.Document> documents;
Version lucene_version = Version.LUCENE_40;
Analyzer analyzer = new KoraAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(lucene_version, analyzer);
IndexWriter writer = new IndexWriter(dir, config);
[...]

for (de.ids_mannheim.korap.main.Document doc : documents) {
  Document luceneDocument = new Document();

  /* Store document name/ID */
  Field idField = new StringField(titleFieldName, doc.getDocid(),
      Field.Store.YES);

  /* Store tokens */
  String layerFile = layer.getFile();
  Field textFieldAnalyzed = new TextField(textFieldName, layerFile,
      Field.Store.YES);

  luceneDocument.add(textFieldAnalyzed);
  luceneDocument.add(idField);

  try {
    writer.addDocument(luceneDocument);
  } catch (IOException e) {
    jlog.error("Error adding document " + doc.getDocid() + ":\n"
        + e.getLocalizedMessage());
  }
}
[...]
writer.close();
---

The class de.ids_mannheim.korap.main.Document defines our own document
objects from which the relevant information can be read, as shown in the
loop. The list 'documents' is filled in an intermediately called method.

Best,
Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
RE: TokenStreamComponents in Lucene 4.0
Hi,

all the components of your TokenStream in Lucene 4.0 are *required* to be
reusable, see the documentation:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/Analyzer.html

All your components must implement reset() according to the TokenStream
contract:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html

The createComponents() method of Analyzers is only called *once* for each
thread, and the TokenStream is *reused* for later documents. The Analyzer
will call the final method Tokenizer#setReader() to notify the Tokenizer of
a new Reader (this method will update the protected "input" field in the
Tokenizer base class) and then it will reset() the whole tokenization
chain. The custom TokenStream components must "initialize" themselves with
the new settings in the reset() method.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
> Sent: Tuesday, November 20, 2012 10:15 AM
> To: java-user@lucene.apache.org
> Subject: Re: TokenStreamComponents in Lucene 4.0
>
> [full question quoted above, trimmed]
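To make the reuse contract concrete, here is a minimal sketch of a Lucene
4.0 Analyzer/Tokenizer pair that follows it. This is not the KoraAnalyzer
from the question - SketchAnalyzer and SketchTokenizer are made-up names -
the point is that all per-document state is rebuilt in reset():

---
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class SketchAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Called only once per thread; the returned components are reused
        // for every subsequent document via setReader() + reset().
        return new TokenStreamComponents(new SketchTokenizer(reader));
    }
}

final class SketchTokenizer extends Tokenizer {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    SketchTokenizer(Reader input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        int length = 0;
        int c;
        // Naive whitespace tokenization, reading from the protected
        // 'input' field that the Analyzer swaps per document.
        while ((c = input.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (length > 0) break; // end of the current token
                continue;              // skip leading whitespace
            }
            termAtt.append((char) c);
            length++;
        }
        return length > 0;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // Reinitialize ALL per-document state here: the Analyzer calls
        // reset() after setReader() for every new document.
    }
}
---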
Re: Using Lucene 2.3 indices with Lucene 4.0
You can upgrade the indexes with org.apache.lucene.index.IndexUpgrader.
You'll need to do it in steps, from 2.x to 3.x to 4.x, but it should work
fine as far as I know.

--
Ian.

On Tue, Nov 20, 2012 at 10:16 AM, Ramprakash Ramamoorthy
<youngestachie...@gmail.com> wrote:
> I understand lucene 2.x indexes are not compatible with the latest version
> of lucene 4.0. However we have all our indexes indexed with lucene 2.3.
>
> Now that we are planning to migrate to Lucene 4.0, is there any work
> around/hack I can do, so that I can still read the 2.3 indices? Or is
> forgoing the older indices the only option?
>
> P.S.: I'm afraid re-indexing is not feasible.
>
> --
> With Thanks and Regards,
> Ramprakash Ramamoorthy,
> Chennai,
> India.
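For reference, a sketch of the two-step upgrade Ian describes, using
IndexUpgrader's documented command-line entry point. The JAR names and
index path are placeholders, and it's wise to back up the index first,
since segments are rewritten in place:

---
# Step 1: rewrite the 2.3-format index in the 3.x format, using a 3.6 lucene-core JAR
java -cp lucene-core-3.6.2.jar org.apache.lucene.index.IndexUpgrader /path/to/index

# Step 2: rewrite the result in the 4.0 format, using the 4.0 lucene-core JAR
java -cp lucene-core-4.0.0.jar org.apache.lucene.index.IndexUpgrader /path/to/index
---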
Re: Using Lucene 2.3 indices with Lucene 4.0
However, the behavior of some analyzers changed. So even if the old index
is readable with 4.0 after the upgrade, it doesn't mean everything still
works as before.

On Tue, Nov 20, 2012 at 12:20 PM, Ian Lea wrote:
> You can upgrade the indexes with org.apache.lucene.index.IndexUpgrader.
> You'll need to do it in steps, from 2.x to 3.x to 4.x, but it should work
> fine as far as I know.
> [rest of quoted thread trimmed]
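One hedged illustration of why: analyzers take a Version constant that
requests emulation of older behavior, but Lucene 4.0 only ships constants
back to 3.0, so 2.3-era tokenization cannot even be requested this way:

---
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

final class VersionedAnalyzers {
    // Passing an older Version asks the analyzer to emulate historical
    // behavior where it supports that; it is not a guarantee of
    // identical output, and there is no LUCENE_23 constant in 4.0.
    static Analyzer oldestEmulatable() {
        return new StandardAnalyzer(Version.LUCENE_30);
    }

    static Analyzer current() {
        return new StandardAnalyzer(Version.LUCENE_40);
    }
}
---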
Re: Using Lucene 2.3 indices with Lucene 4.0
Sure - read all the release notes, migration guides, everything, test and
test again.

--
Ian.

On Tue, Nov 20, 2012 at 10:24 AM, Danil ŢORIN wrote:
> However, the behavior of some analyzers changed.
>
> So even if the old index is readable with 4.0 after the upgrade, it
> doesn't mean everything still works as before.
> [rest of quoted thread trimmed]
Re: Using Lucene 2.3 indices with Lucene 4.0
On Tue, Nov 20, 2012 at 3:54 PM, Danil ŢORIN wrote:
> However, the behavior of some analyzers changed.
>
> So even if the old index is readable with 4.0 after the upgrade, it
> doesn't mean everything still works as before.

Thank you Torin. I am using the standard analyzer only, and both systems
use Unicode 4.0, so I don't smell any problems here.

> On Tue, Nov 20, 2012 at 12:20 PM, Ian Lea wrote:
>> You can upgrade the indexes with org.apache.lucene.index.IndexUpgrader.
>> You'll need to do it in steps, from 2.x to 3.x to 4.x, but it should
>> work fine as far as I know.

Thank you Ian, this gives me a head start.

> [rest of quoted thread trimmed]

--
With Thanks and Regards,
Ramprakash Ramamoorthy,
Engineer Trainee,
Zoho Corporation.
+91 9626975420
Re: TokenStreamComponents in Lucene 4.0
On 20.11.2012 10:22, Uwe Schindler wrote:

Hi,

> The createComponents() method of Analyzers is only called *once* for each
> thread, and the TokenStream is *reused* for later documents. The Analyzer
> will call the final method Tokenizer#setReader() to notify the Tokenizer
> of a new Reader (this method will update the protected "input" field in
> the Tokenizer base class) and then it will reset() the whole tokenization
> chain. The custom TokenStream components must "initialize" themselves
> with the new settings in the reset() method.

Thanks, Uwe!
I think what changed in comparison to Lucene 3.6 is that reset() is called
upon initialization too, instead of only after processing the first
document, right? Apart from the fact that it used not to be obligatory to
make all components reusable, I suppose.

Best,
Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
Re: TokenStreamComponents in Lucene 4.0
On Tue, Nov 20, 2012 at 6:26 AM, Carsten Schnober wrote:
>
> Thanks, Uwe!
> I think what changed in comparison to Lucene 3.6 is that reset() is
> called upon initialization too, instead of only after processing the
> first document, right?

There is no such change: this step was always mandatory!
Re: Grouping on multiple shards possible in lucene?
On Tue, Nov 20, 2012 at 1:49 AM, Ravikumar Govindarajan wrote:
> Thanks Mike. Actually, I think I can eliminate sort-by-time, if I am able
> to iterate postings in reverse doc-id order. Is this possible in lucene?

Alas, that is not easy to do in Lucene: the posting lists are encoded in
forward docID order.

But I think it should be possible, with some fun codec & merge policy &
MultiReader magic, to have docIDs assigned in "reverse chronological
order" ...

> Also, for a TopN query sorted by doc-id will the query terminate early?

Actually, it won't! But it really should ... you could make a Collector
that throws an exception once the N docs have been collected?

Mike McCandless
http://blog.mikemccandless.com
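For the archives, a sketch of the Collector idea Mike describes, against
the Lucene 4.0 Collector API. The class and exception names here are
invented for illustration:

---
import java.util.Arrays;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Unchecked exception used purely as a control-flow signal.
class EarlyTerminationException extends RuntimeException {}

public class TopNDocIdCollector extends Collector {
    private final int[] docs;
    private int count;
    private int docBase;

    public TopNDocIdCollector(int n) {
        docs = new int[n];
    }

    @Override
    public void setScorer(Scorer scorer) {
        // Scores are not needed when collecting in docID order.
    }

    @Override
    public void setNextReader(AtomicReaderContext context) {
        docBase = context.docBase; // map segment-relative to global docIDs
    }

    @Override
    public void collect(int doc) {
        docs[count++] = docBase + doc;
        if (count == docs.length) {
            throw new EarlyTerminationException(); // aborts the whole search
        }
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return false; // rely on in-order collection within each segment
    }

    public int[] docIds() {
        return Arrays.copyOf(docs, count);
    }
}
---

The caller wraps searcher.search(query, collector) in a try/catch for
EarlyTerminationException and then reads collector.docIds().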
Re: Using Lucene 2.3 indices with Lucene 4.0
Ironically, most of the changes are in unicode handling and the standard
analyzer ;)

On Tue, Nov 20, 2012 at 12:31 PM, Ramprakash Ramamoorthy wrote:
> Thank you Torin. I am using the standard analyzer only, and both systems
> use Unicode 4.0, so I don't smell any problems here.
> [rest of quoted thread trimmed]
Re: Grouping on multiple shards possible in lucene?
On 11/20/2012 6:49 AM, Michael McCandless wrote:
> On Tue, Nov 20, 2012 at 1:49 AM, Ravikumar Govindarajan wrote:
>> Also, for a TopN query sorted by doc-id will the query terminate early?
>
> Actually, it won't! But it really should ... you could make a Collector
> that throws an exception once the N docs have been collected?

I've never much liked this exception-throwing for early termination - IMO
Lucene should really expose an Iterator-style API for pulling matches so
that callers can choose when to terminate.

I've been writing an XQuery service that uses Lucene as its data storage
and retrieval engine. XQuery is entirely designed to be lazily evaluated -
everything is iterators from top to bottom, and the entire language is
designed to be streamed so that all expressions can be terminated early.
For this case I really needed early termination to be controlled *by the
caller*, since the conditions for early termination are unknowable.

So I wrote the attached class, which provides that by extending
IndexSearcher. Of course it would be nice if someone up to speed w/Lucene 4
would like to provide something similar built in to Lucene...

-Mike

---
package lux.search;

import java.io.IOException;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;
import org.apache.lucene.store.Directory;

public class LuxSearcher extends IndexSearcher {

    public LuxSearcher(Directory dir) throws IOException {
        super(dir);
    }

    public LuxSearcher(IndexSearcher searcher) {
        super(searcher.getIndexReader());
    }

    /**
     * @param query the Lucene query
     * @return the unordered results of the query as a Lucene
     * DocIdSetIterator. Unordered means the order is not predictable
     * and may change with subsequent calls.
     * @throws IOException
     */
    public DocIdSetIterator search(Query query) throws IOException {
        return new DocIterator(query, false);
    }

    /**
     * @param query the Lucene query
     * @return the results of the query as a Lucene DocIdSetIterator
     * in docID order
     * @throws IOException
     */
    public DocIdSetIterator searchOrdered(Query query) throws IOException {
        return new DocIterator(query, true);
    }

    class DocIterator extends DocIdSetIterator {

        private final Weight weight;
        private final boolean ordered;
        private int nextReader;
        private int docID;
        private int docBase; // add to docID, which is relative to each sub-reader
        private Scorer scorer;

        /**
         * @param query the lucene query whose results will be iterated
         * @param ordered whether the docs must be scored in order
         * @throws IOException
         */
        DocIterator(Query query, boolean ordered) throws IOException {
            weight = createNormalizedWeight(query);
            this.ordered = ordered;
            nextReader = 0;
            docID = -1;
            advanceScorer();
        }

        private void advanceScorer() throws IOException {
            while (nextReader < subReaders.length) {
                docBase = docStarts[nextReader];
                scorer = weight.scorer(subReaders[nextReader++], ordered, true);
                if (scorer != null) {
                    return;
                }
            }
            scorer = null;
        }

        @Override
        public int docID() {
            return docID;
        }

        @Override
        public int nextDoc() throws IOException {
            while (scorer != null) {
                docID = scorer.nextDoc();
                if (docID != NO_MORE_DOCS) {
                    return docID + docBase;
                }
                advanceScorer();
            }
            return NO_MORE_DOCS;
        }

        @Override
        public int advance(int target) throws IOException {
            while (scorer != null) {
                docID = scorer.advance(target - docBase);
                if (docID != NO_MORE_DOCS) {
                    return docID + docBase;
                }
                advanceScorer();
            }
            return NO_MORE_DOCS;
        }
    }
}

/* This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this file,
 * You can obtain one at http://mozilla.org/MPL/2.0/. */
---
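For context, a hypothetical caller of the class above would look something
like this ('wanted' stands in for whatever caller-side stop condition
applies):

---
// Iterate matches in docID order and stop as soon as the caller is satisfied.
static int firstWanted(Directory dir, Query query) throws IOException {
    LuxSearcher searcher = new LuxSearcher(dir);
    DocIdSetIterator it = searcher.searchOrdered(query);
    for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
        if (wanted(doc)) {
            return doc; // the caller, not Lucene, decides when to terminate
        }
    }
    return -1;
}
---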
Re: Using Lucene 2.3 indices with Lucene 4.0
On Tue, Nov 20, 2012 at 5:42 PM, Danil ŢORIN wrote:
> Ironically, most of the changes are in unicode handling and the standard
> analyzer ;)

Ouch! It hurts then ;)

> [rest of quoted thread trimmed]

--
With Thanks and Regards,
Ramprakash Ramamoorthy,
Engineer Trainee,
Zoho Corporation.
+91 9626975420
Re: Grouping on multiple shards possible in lucene?
> But, I think it should be possible, with some fun codec & merge policy &
> MultiReader magic, to have docIDs assigned in "reverse chronological
> order"

Can you explain it a bit more? I was thinking perhaps we could store
absolute doc-ids instead of deltas to allow reverse traversal, but that
could waste a lot of storage.

The default merge policy will merge adjacent segments, no? Is it going to
disturb the ordering?

--
Ravi

On Tue, Nov 20, 2012 at 5:19 PM, Michael McCandless wrote:
> Alas, that is not easy to do in Lucene: the posting lists are encoded in
> forward docID order.
> [rest of quoted thread trimmed]
Re: Grouping on multiple shards possible in lucene?
Hi Ravi,

I've been dealing with reverse indexing lately, so let me share with you a
bit of my experience thus far.

First, you need to define what reverse indexing means for you. If it means
that docs that were indexed in the order d1, d2, d3 should be traversed
during search in the order d3, d2, d1 - then that's one thing. However, if
it means that the traversal needs to occur by e.g. the documents'
timestamp, as a means to process documents from latest to oldest, then
that's a totally different thing, and way more complicated.

You will need to think about an IndexReader which reverses the order of
the segments that it reads, so that segments are processed from latest to
oldest (see the sketch after this message). Also, you might need to merge
the segments in reverse order too (i.e. if segments s1, s4, s5 are merged,
merge them as s5, s4, s1).

If you are interested in timestamp-based sorting, it gets complicated.
Documents flow in from multiple producers (e.g. a parallel crawler, or
different processes which feed documents to the index etc.) and are
usually processed by multiple consumers (indexing threads). That makes
sorting the index based on a timestamp difficult.

Lucene used to have IndexSorter (before 4.0), which could sort an index by
a field. That was an offline process, and if that's what you're after --
you should do just that and forget about the rest. If however you're
interested in an on-line process, where documents are fed in some order
and searched in the exact true order (latest to oldest), that's a more
complicated solution -- I'm still working on it :).

HTH,
Shai

On Tue, Nov 20, 2012 at 5:37 PM, Ravikumar Govindarajan wrote:
> Can you explain it a bit more? I was thinking perhaps we could store
> absolute doc-ids instead of deltas to allow reverse traversal, but that
> could waste a lot of storage.
> [rest of quoted thread trimmed]
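A hedged sketch of the segment-reversal idea on Lucene 4.0's public API
(an illustration, not Shai's actual code): wrap the leaves of a
DirectoryReader in a MultiReader in reverse order, so that a docID-order
search visits the newest segment first. Note that this renumbers docIDs:

---
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;

public final class ReverseSegmentOrder {
    public static IndexReader wrap(DirectoryReader in) throws IOException {
        List<AtomicReaderContext> leaves = in.getContext().leaves();
        IndexReader[] reversed = new IndexReader[leaves.size()];
        for (int i = 0; i < leaves.size(); i++) {
            // Newer segments usually come last; put them first instead.
            reversed[i] = leaves.get(leaves.size() - 1 - i).reader();
        }
        // 'false': closing the MultiReader must not close the original
        // reader's segments out from under it.
        return new MultiReader(reversed, false);
    }
}
---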
Re: Line feed on windows
This doesn't sound like a Lucene issue. It's up to you to read a file and
pass it as a string to Lucene. Maybe you're trying to read the file one
line at a time, in which case it is up to you to supply line delimiters
when combining the lines into a single string. Try reading the full file
into a single string, line delimiters and all. Be careful about encoding,
though.

-- Jack Krupansky

-----Original Message-----
From: Mansour Al Akeel
Sent: Tuesday, November 20, 2012 1:19 PM
To: java-user
Subject: Line feed on windows

Hello all,
We are indexing and storing file contents in a Lucene index. These files
contain the line feed "\n" as the end-of-line character. Lucene is storing
the content as is, however when we read it back, the "\n" is removed and
we end up with text that is concatenated wherever there's no space. I can
re-read the files from the filesystem to avoid this, but I'd like to see
if there are other alternatives.
Thank you.
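A minimal sketch of that advice (path and charset are up to the caller;
the point is that reading whole buffers preserves the "\n" characters that
line-oriented readers strip):

---
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public final class FileText {
    // Reads the complete file into one String, line feeds included.
    public static String readFully(String path, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        Reader reader = new InputStreamReader(new FileInputStream(path), charset);
        try {
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } finally {
            reader.close();
        }
        return sb.toString();
    }
}
---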
Performance of IndexSearcher.explain(Query)
I have a feature I wanted to implement which required a quick way to check
whether an individual document matched a query or not.
IndexSearcher.explain() seemed to be a good fit for this.

The query I tested was just a BooleanQuery with two TermQuery clauses
inside it, both with MUST. I ran an empty query to match all documents and
then ran the new code against each document. Within 40,743 documents,
1,072 documents matched the query.

I got times of around 15.5s doing this. After noticing that
ConstantScoreQuery now works with Query in addition to Filter, I started
using it as well, which further reduced this time to 13.6s.

There is a comment like this on the explain method, though:

    "Computing an explanation is as expensive as executing
    the query over the entire index."

So I wanted to test this. To do this, I made a collector which did nothing
but look for the single item being matched.

Times for searching the whole index using this collector came to around
30.9s, which is more than twice as slow as using explain() (times didn't
vary at all if I used ConstantScoreQuery here, which I assume is something
to do with using a custom collector which is ignoring the scorer).

So I was wondering, is this comment just out of date? It seems that by
using explain(), I get the same information I get by querying the whole
index, *plus* information about the score which the custom collector
wasn't recording, all in less than half the time it took to query the
whole index.

TX
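For reference, the membership check described above amounts to something
like this sketch ('searcher', 'query' and 'docId' come from the
surrounding application code):

---
import java.io.IOException;

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

final class MatchCheck {
    // True if the document with the given top-level docID matches the query.
    static boolean matches(IndexSearcher searcher, Query query, int docId) throws IOException {
        Explanation explanation = searcher.explain(query, docId);
        return explanation.isMatch();
    }
}
---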
Re: Performance of IndexSearcher.explain(Query)
On Tue, Nov 20, 2012 at 6:18 PM, Trejkaz wrote:
> So I was wondering, is this comment just out of date? It seems that by
> using explain(), I get the same information I get by querying the whole
> index, *plus* information about the score which the custom collector
> wasn't recording, all in less than half the time it took to query the
> whole index.
> [rest of quoted message trimmed]

Explain is not performant... but the comment is fair, I think? It's more
of a worst case; it depends on the query.

Explain is going to rewrite the query, create the weight and so on, just
to advance() the scorer to that single doc. So if this is e.g. a wildcard
query, then it could definitely be almost as slow as searching the whole
index, since the rewrite involves scanning through the term dictionary or
whatever.
Re: Using Lucene 2.3 indices with Lucene 4.0
On Wed, Nov 21, 2012 at 12:33 AM, Ramprakash Ramamoorthy wrote:
> On Tue, Nov 20, 2012 at 5:42 PM, Danil ŢORIN wrote:
>> Ironically, most of the changes are in unicode handling and the
>> standard analyzer ;)
>
> Ouch! It hurts then ;)

What we did going from 2 -> 3 (and in some cases where passing the right
Version into a constructor didn't actually give the same behaviour as the
old version... I'm looking at you, StandardTokenizer) was to archive
copies of the classes from older versions of Lucene and layer our own
backwards-compatible API on top of them. You just have to come up with a
way to identify how something was indexed and support that forever (e.g.
give all the Tokenizer and TokenFilter implementations unique names and
never change the names - see the sketch after this message).

The only time this really hurts is when Lucene changes the API on
something like TokenFilter and you have 20 or so implementations of it
which you now have to update.

It's a good example of how backwards compatibility slows down development
time. The amount of work you have to do each time upstream changes
something is more or less directly proportional to how long your
application has been supported for. If I were making the decisions, I
wouldn't support anything across major versions, and you would just get an
export/import tool for each version so you could bring the data across if
you really wanted it.

TX
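A hypothetical sketch of that naming scheme (class and method names
invented for illustration): record a stable analyzer name with each index,
and keep a registry that maps each name to its archived implementation
forever:

---
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;

final class AnalyzerRegistry {
    private final Map<String, Analyzer> byName = new HashMap<String, Analyzer>();

    // A name must never change once an index has been built with it.
    void register(String stableName, Analyzer analyzer) {
        byName.put(stableName, analyzer);
    }

    // Look up the exact historical implementation recorded at index time.
    Analyzer forIndex(String stableName) {
        Analyzer analyzer = byName.get(stableName);
        if (analyzer == null) {
            throw new IllegalArgumentException("Unknown analyzer: " + stableName);
        }
        return analyzer;
    }
}
---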
Re: Performance of IndexSearcher.explain(Query)
On Wed, Nov 21, 2012 at 10:40 AM, Robert Muir wrote:
> Explain is not performant... but the comment is fair, I think? It's more
> of a worst case; it depends on the query.
>
> Explain is going to rewrite the query, create the weight and so on, just
> to advance() the scorer to that single doc. So if this is e.g. a wildcard
> query, then it could definitely be almost as slow as searching the whole
> index, since the rewrite involves scanning through the term dictionary or
> whatever.

Hmm, yep. That does seem to be it. For complicated queries (or at least
queries which are slow to create a weight for) it's about the same speed
no matter which way I do it. For the more normal queries I was trying,
explain() seems to speed things up a fair bit. For simple one-term queries
it might be a bit quicker still.

It's at least never slower than doing the full query though, so I can
still use it. I'll just be putting a similar (though perhaps more
specific) warning about performance on the method.

TX
Re: Grouping on multiple shards possible in lucene?
Hi Shai,

I would only want to sort based on doc additions. E.g. for docs added as
d1, d2, d3, the true sort order means d3, d2, d1. A doc-timestamp-based
solution is much more involved, like you said.

It's nice to know that you are already working on it and that there will
be a solution in the near future. In the meantime, I will live with good
old sorting.

--
Ravi

On Wed, Nov 21, 2012 at 1:59 AM, Shai Erera wrote:
> Hi Ravi,
>
> I've been dealing with reverse indexing lately, so let me share with you
> a bit of my experience thus far.
> [rest of quoted thread trimmed]