Re: ANN: UweSays Query Operator

2012-11-20 Thread Tommaso Teofili
that's nice!

Tommaso


2012/11/19 Uwe Schindler 

> Lol!
>
> Many thanks for this support!
>
> Uwe
>
>
>
> Otis Gospodnetic  schrieb:
>
> >Hi,
> >
> >Quick announcement for Uwe & Friends.
> >
> >UweSays is now a super-duper-special query operator over on
> >http://search-lucene.com/ .  Now whenever you want to know what Uwe
> >says
> >about something just start the query with UweSays.
> >
> >Example:
> >  http://search-lucene.com/?q=UweSays+mmap
> >
> >It's not case sensitive, so you can lay off the shift key.
> >There are some other similar Easter eggs in there if you want to hunt.
> >
> >Otis
> >--
> >Performance Monitoring - http://sematext.com/spm/index.html
> >Search Analytics - http://sematext.com/search-analytics/index.html
>
> --
> Uwe Schindler
> H.-H.-Meier-Allee 63, 28213 Bremen
> http://www.thetaphi.de


Re: TokenStreamComponents in Lucene 4.0

2012-11-20 Thread Carsten Schnober
On 19.11.2012 17:44, Carsten Schnober wrote:

Hi,

> However, after switching to Lucene 4 and TokenStreamComponents, I'm
> getting a strange behaviour: only the first document in the collection
> is tokenized properly. The others do appear in the index, but
> un-tokenized, although I have tried not to change anything in the logic.
> The Analyzer now has this createComponents() method calling the custom
> TokenStreamComponents class with my custom Tokenizer:

After some debugging, it turns out that the Analyzer method
createComponents() is called only once, for the first document. This
seems to be the problem: the other documents are just not analyzed.
Does anyone have a hint why this happens only for the first document,
even though the loop itself runs once for every document? Here's the
loop that creates the fields and supposedly calls the analyzer:

---

List<de.ids_mannheim.korap.main.Document> documents;
Version lucene_version = Version.LUCENE_40;
Analyzer analyzer = new KoraAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(lucene_version, analyzer);
IndexWriter writer = new IndexWriter(dir, config);
[...]

for (de.ids_mannheim.korap.main.Document doc : documents) {
  luceneDocument = new Document();

  /* Store document name/ID */
  Field idField = new StringField(titleFieldName, doc.getDocid(),
Field.Store.YES);

  /* Store tokens */
  String layerFile = layer.getFile();
  Field textFieldAnalyzed = new TextField(textFieldName, layerFile,
Field.Store.YES);

  luceneDocument.add(textFieldAnalyzed);
  luceneDocument.add(idField);

  try {
writer.addDocument(luceneDocument);
  } catch (IOException e) {
jlog.error("Error adding document
"+doc.getDocid()+":\n"+e.getLocalizedMessage());
  }
}
[...]
writer.close();
---

The class de.ids_mannheim.korap.main.Document defines our own document
objects from which the relevant information can be read, as shown in the
loop. The list 'documents' is filled in an intermediately called method.
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: TokenStreamComponents in Lucene 4.0

2012-11-20 Thread Uwe Schindler
Hi,

all the components of your TokenStream in Lucene 4.0 are *required* to be
reusable; see the documentation:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/Analyzer.html

All your components must implement reset() according to the TokenStream
contract:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html

The createComponents() method of Analyzer is only called *once* for each 
thread, and the TokenStream is *reused* for later documents. The Analyzer will 
call the final method Tokenizer#setReader() to notify the Tokenizer of a new 
Reader (this method updates the protected "input" field in the Tokenizer 
base class) and will then reset() the whole tokenization chain. The custom 
TokenStream components must "initialize" themselves with the new settings in 
the reset() method.
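A minimal sketch of what this reuse contract looks like in code (the class and field names here are illustrative, not taken from the original mail):

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;

public final class ReusableAnalyzer extends Analyzer {

  private static final class MyTokenizer extends Tokenizer {
    private boolean done; // illustrative per-document state

    MyTokenizer(Reader input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      // ... read from the protected "input" field and fill attributes ...
      return false; // illustrative stub: produces no tokens
    }

    @Override
    public void reset() throws IOException {
      super.reset();
      // Re-initialize ALL per-document state here. By the time reset() is
      // called, the Analyzer has already called setReader(), so "input"
      // points at the next document.
      done = false;
    }
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Called only once per thread; the components are cached and reused.
    return new TokenStreamComponents(new MyTokenizer(reader));
  }
}
```

The key point is that reset() must restore *all* per-document state, because the same Tokenizer instance will see every document processed by that thread.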

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
> Sent: Tuesday, November 20, 2012 10:15 AM
> To: java-user@lucene.apache.org
> Subject: Re: TokenStreamComponents in Lucene 4.0
> [...]





Re: Using Lucene 2.3 indices with Lucene 4.0

2012-11-20 Thread Ian Lea
You can upgrade the indexes with org.apache.lucene.index.IndexUpgrader.
You'll need to do it in steps, from 2.x to 3.x to 4.x, but it should work
fine as far as I know.
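A hedged sketch of the two-step upgrade (the index path is illustrative, and each step has to be compiled and run against the matching lucene-core jar on the classpath):

```java
import java.io.File;
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class UpgradeIndex {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/path/to/index")); // illustrative path
    // Step 1 (run against Lucene 3.6.x): rewrites 2.x segments into the 3.x format.
    //   new IndexUpgrader(dir, Version.LUCENE_36).upgrade();
    // Step 2 (run against Lucene 4.0): rewrites the 3.x index into the 4.0 format.
    new IndexUpgrader(dir, Version.LUCENE_40).upgrade();
    dir.close();
  }
}
```

IndexUpgrader also has a main() method, so each step can alternatively be run from the command line against the appropriate jar, without writing any code.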


--
Ian.



On Tue, Nov 20, 2012 at 10:16 AM, Ramprakash Ramamoorthy <
youngestachie...@gmail.com> wrote:

> I understand lucene 2.x indexes are not compatible with the latest version
> of lucene 4.0. However we have all our indexes indexed with lucene 2.3.
>
> Now that we are planning to migrate to Lucene 4.0, is there any work
> around/hack I can do, so that I can still read the 2.3 indices? Or is
> forgoing the older indices the only option?
>
> P.S : Am afraid, Re-indexing is not feasible.
>
> --
> With Thanks and Regards,
> Ramprakash Ramamoorthy,
> Chennai,
> India.
>


Re: Using Lucene 2.3 indices with Lucene 4.0

2012-11-20 Thread Danil ŢORIN
However, the behavior of some analyzers changed.

So even if the old index is readable with 4.0 after the upgrade, that doesn't
mean everything still works as before.

On Tue, Nov 20, 2012 at 12:20 PM, Ian Lea  wrote:

> You can upgrade the indexes with org.apache.lucene.index.IndexUpgrader.
>  You'll need to do it in steps, from 2.x to 3.x to 4.x, but should work
> fine as far as I know.
>
>
> --
> Ian.
>
>
>
> [...]


Re: Using Lucene 2.3 indices with Lucene 4.0

2012-11-20 Thread Ian Lea
Sure - read all the release notes, migration guides, everything, test and
test again.


--
Ian.



On Tue, Nov 20, 2012 at 10:24 AM, Danil ŢORIN  wrote:

> However behavior of some analyzers changed.
>
> So even after upgrade the old index is readable with 4.0, it doesn't mean
> everything still works as before.
>
> [...]


Re: Using Lucene 2.3 indices with Lucene 4.0

2012-11-20 Thread Ramprakash Ramamoorthy
On Tue, Nov 20, 2012 at 3:54 PM, Danil ŢORIN  wrote:

> However behavior of some analyzers changed.
>
> So even after upgrade the old index is readable with 4.0, it doesn't mean
> everything still works as before.
>

Thank you Torin. I am using only the standard analyzer, and both systems
use Unicode 4.0, so I don't see any problems here.

>
> On Tue, Nov 20, 2012 at 12:20 PM, Ian Lea  wrote:
>
> > You can upgrade the indexes with org.apache.lucene.index.IndexUpgrader.
> >  You'll need to do it in steps, from 2.x to 3.x to 4.x, but should work
> > fine as far as I know.
> >
> >
> > --
> > Ian.
> >
>
Thank you Ian, this gives me a head start.

> [...]



-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Engineer Trainee,
Zoho Corporation.
+91 9626975420


Re: TokenStreamComponents in Lucene 4.0

2012-11-20 Thread Carsten Schnober
On 20.11.2012 10:22, Uwe Schindler wrote:

Hi,

> The createComponents() method of Analyzers is only called *once* for each 
> thread and the Tokenstream is *reused* for later documents. The Analyzer will 
> call the final method Tokenizer#setReader() to notify the Tokenizer of a new 
> Reader (this method will update the protected "input" field in the Tokenizer 
> base class) and then it will reset() the whole tokenization chain. The custom 
> TokenStream components must "initialize" themselves with the new settings on 
> the reset() method.

Thanks, Uwe!
I think what changed in comparison to Lucene 3.6 is that reset() is
also called upon initialization, instead of only after processing the
first document, right? Apart from the fact that it used not to be
obligatory to make all components reusable, I suppose.
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform




Re: TokenStreamComponents in Lucene 4.0

2012-11-20 Thread Robert Muir
On Tue, Nov 20, 2012 at 6:26 AM, Carsten Schnober
wrote:

>
> Thanks, Uwe!
> I think what changed in comparison to Lucene 3.6 is that reset() is
> called upon initialization, too, instead of after processing the first
> document only, right?


 There is no such change: this step was always mandatory!


Re: Grouping on multiple shards possible in lucene?

2012-11-20 Thread Michael McCandless
On Tue, Nov 20, 2012 at 1:49 AM, Ravikumar Govindarajan
 wrote:
> Thanks Mike. Actually, I think I can eliminate sort-by-time, if I am able
> to iterate postings in reverse doc-id order. Is this possible in lucene?

Alas that is not easy to do in Lucene: the posting lists are encoded
in forward docID order.

But, I think it should be possible with some fun codec & merge policy
& MultiReader magic, to have docIDs assigned in "reverse chronological
order" ...

> Also, for a TopN query sorted by doc-id will the query terminate early?

Actually, it won't!  But it really should ... you could make a
Collector that throws an exception once the N docs have been
collected?
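A sketch of such an early-terminating collector against the Lucene 4.0 Collector API (class and method names are illustrative):

```java
import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Unchecked exception used purely as a control-flow signal to stop the search.
final class CollectedEnoughException extends RuntimeException {}

final class FirstNCollector extends Collector {
  private final int[] docs;   // global docIDs of the first N hits
  private int count;
  private int docBase;

  FirstNCollector(int n) { docs = new int[n]; }

  @Override public void setScorer(Scorer scorer) throws IOException {
    // scores are not needed for plain docID collection
  }

  @Override public void setNextReader(AtomicReaderContext context) throws IOException {
    docBase = context.docBase; // remember this segment's docID offset
  }

  @Override public boolean acceptsDocsOutOfOrder() {
    return false; // we want the first N hits in docID order
  }

  @Override public void collect(int doc) throws IOException {
    docs[count++] = docBase + doc;
    if (count == docs.length) {
      throw new CollectedEnoughException(); // abort the search early
    }
  }

  int[] hits() { return docs; }
}
```

The caller would wrap searcher.search(query, collector) in a try/catch for CollectedEnoughException and then read the collected hits.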

Mike McCandless

http://blog.mikemccandless.com




Re: Using Lucene 2.3 indices with Lucene 4.0

2012-11-20 Thread Danil ŢORIN
Ironically, most of the changes are in Unicode handling and the standard
analyzer ;)

On Tue, Nov 20, 2012 at 12:31 PM, Ramprakash Ramamoorthy <
youngestachie...@gmail.com> wrote:

> On Tue, Nov 20, 2012 at 3:54 PM, Danil ŢORIN  wrote:
>
> > However behavior of some analyzers changed.
> >
> > So even after upgrade the old index is readable with 4.0, it doesn't mean
> > everything still works as before.
> >
>
> Thank you Torin, I am using the standard analyzer only and both the systems
> use Unicode 4.0 and I don't smell any problems here.
>
> [...]


Re: Grouping on multiple shards possible in lucene?

2012-11-20 Thread Michael Sokolov

On 11/20/2012 6:49 AM, Michael McCandless wrote:

On Tue, Nov 20, 2012 at 1:49 AM, Ravikumar Govindarajan
 wrote:


Also, for a TopN query sorted by doc-id will the query terminate early?

Actually, it won't!  But it really should ... you could make a
Collector that throws an exception once the N docs have been
collected?

I've never much liked this exception-throwing for early termination - 
IMO Lucene should really expose an Iterator-style API for pulling 
matches so that callers can choose when to terminate.  I've been writing 
an XQuery service that uses Lucene as its data storage and retrieval 
engine.  XQuery is entirely designed to be lazily evaluated - everything 
is iterators from top to bottom, and the entire language is designed to 
be streamed so that all expressions can be terminated early.  For this 
case I really needed early termination to be controlled *by the caller*, 
since the conditions for early termination are unknowable.  So I wrote 
the attached class, which provides that by extending IndexSearcher.


Of course it would be nice if someone up to speed w/Lucene 4 would like 
to provide something similar built in to Lucene...


-Mike


package lux.search;

import java.io.IOException;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;
import org.apache.lucene.store.Directory;

public class LuxSearcher extends IndexSearcher {

  public LuxSearcher (Directory dir) throws IOException {
super (dir);
  }
  
  public LuxSearcher (IndexSearcher searcher) {
  super (searcher.getIndexReader());
  }


  /**
   * @param query the Lucene query
   * @return the unordered results of the query as a Lucene DocIdSetIterator.  
Unordered means the order
   * is not predictable and may change with subsequent calls. 
   * @throws IOException
   */
  public DocIdSetIterator search (Query query) throws IOException {
  return new DocIterator (query, false);
  }

  /**
   * @param query the Lucene query
   * @return the results of the query as a Lucene DocIdSetIterator in docID 
order
   * @throws IOException
   */
  public DocIdSetIterator searchOrdered (Query query) throws IOException {
  return new DocIterator (query, true);
  }
  
  class DocIterator extends DocIdSetIterator {
  
  private final Weight weight;
  private final boolean ordered;
  private int nextReader;
  private int docID;
  private int docBase; // add to docID which is relative to each sub-reader
  private Scorer scorer;
  
  /**
   * @param query the lucene query whose results will be iterated
   * @param ordered whether the docs must be scored in order
   * @throws IOException
   */
  DocIterator (Query query, boolean ordered) throws IOException {
  weight = createNormalizedWeight(query);
  this.ordered = ordered;
  nextReader = 0;
  docID = -1;
  advanceScorer();
  }

  private void advanceScorer () throws IOException {
  while (nextReader < subReaders.length) {
  docBase = docStarts[nextReader];
  scorer = weight.scorer(subReaders[nextReader++], ordered, true);
  if (scorer != null) {
  return;
  }
  }
  scorer = null;
  }
  
@Override
public int docID() {
return docID;
}

@Override
public int nextDoc() throws IOException {
while (scorer != null) {
docID = scorer.nextDoc();
if (docID != NO_MORE_DOCS) {
return docID + docBase;
}
advanceScorer();
}
return NO_MORE_DOCS;
}

@Override
public int advance(int target) throws IOException {
while (scorer != null) {
docID = scorer.advance(target - docBase);
if (docID != NO_MORE_DOCS) {
return docID + docBase;
}
advanceScorer();
}
return NO_MORE_DOCS;
}
  
  }
  

}

/* This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this file,
 * You can obtain one at http://mozilla.org/MPL/2.0/. */



Re: Using Lucene 2.3 indices with Lucene 4.0

2012-11-20 Thread Ramprakash Ramamoorthy
On Tue, Nov 20, 2012 at 5:42 PM, Danil ŢORIN  wrote:

> Ironically most of the changes are in unicode handling and standard
> analyzer ;)
>

Ouch! It hurts then ;)

> [...]



-- 
With Thanks and Regards,
Ramprakash Ramamoorthy,
Engineer Trainee,
Zoho Corporation.
+91 9626975420


Re: Grouping on multiple shards possible in lucene?

2012-11-20 Thread Ravikumar Govindarajan
But, I think it should be possible with some fun codec & merge policy
& MultiReader magic, to have docIDs assigned in "reverse chronological
order"

Can you explain it a bit more? I was thinking perhaps we could store absolute
doc-ids instead of deltas to allow reverse traversal, but that could waste a
lot of storage.

The default merge policy will merge adjacent segments, no? Is that going to
disturb the ordering?

--
Ravi

On Tue, Nov 20, 2012 at 5:19 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> [...]


Re: Grouping on multiple shards possible in lucene?

2012-11-20 Thread Shai Erera
Hi Ravi,

I've been dealing with reverse indexing lately, so let me share with you a
bit of my experience thus far.

First, you need to define what does reverse indexing mean for you. If it
means that docs that were indexed in the following order: d1, d2, d3 should
be traversed during search in that order: d3, d2, d1 - then that's one
thing.
However, if it means that the traversal needs to occur by e.g. the
documents' timestamp, as a means to process documents from latest to
oldest, then that's a totally different thing, and way more complicated.

You will need to think about an IndexReader which reverses the order of the
segments that it reads, so that segments are processed from latest to
oldest. Also, you might need to merge the segments in reverse order too
(i.e. if segments s1, s4, s5 are merged, merge them as s5, s4, s1).

If you are interested in timestamp-based sorting, it gets complicated.
Documents flow in from multiple producers (e.g. a parallel crawler, or
different processes which feed documents to the index, etc.) and are
usually processed by multiple consumers (indexing threads). That makes
sorting the index by timestamp difficult.

Lucene used to have IndexSorter (before 4.0) which could sort an index by a
field. That was an offline process and if that's what you're after -- you
should do just that and forget about the rest. If however you're interested
in an on-line process, where documents are fed in some order and searched
in the exact true order (latest to oldest), that's a more complicated
solution -- I'm still working on it :).

HTH

Shai

On Tue, Nov 20, 2012 at 5:37 PM, Ravikumar Govindarajan <
ravikumar.govindara...@gmail.com> wrote:

> But, I think it should be possible with some fun codec & merge policy
> & MultiReader magic, to have docIDs assigned in "reverse chronological
> order"
>
> Can you explain it a bit more? I was thinking perhaps we store absolute
> doc-ids instead of delta to do reverse traversal. But this could waste a
> lot of storage
>
> The default merge policy will merge adjacent segments no? Is it going to
> disturb the ordering?
>
> --
> Ravi
>
> [...]


Re: Line feed on windows

2012-11-20 Thread Jack Krupansky
This doesn't sound like a Lucene issue. It's up to you to read a file and 
pass it as a string to Lucene. Maybe you're trying to read the file one line 
at a time, in which case it is up to you to supply line delimiters when 
combining the lines into a single string. Try reading the full file into a 
single string, line delimiters and all. Be careful about encoding though.
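A minimal sketch of that suggestion (class and method names are illustrative): read the raw bytes of the whole file and decode them with an explicit charset, so the "\n" delimiters survive intact, whereas line-at-a-time reading via readLine() silently strips them.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileSlurper {
  // Reads the entire file into one string, keeping all line delimiters.
  public static String readWholeFile(String path) throws IOException {
    byte[] bytes = Files.readAllBytes(Paths.get(path));
    // Be explicit about the encoding instead of relying on the platform default.
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```

Indexing the resulting string (and storing it) keeps the newlines in the stored field, so nothing is lost on the way into Lucene.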


-- Jack Krupansky

-----Original Message-----
From: Mansour Al Akeel

Sent: Tuesday, November 20, 2012 1:19 PM
To: java-user
Subject: Line feed on windows

Hello all,
We are indexing and storing file contents in a Lucene index. These
files contain the line feed "\n" as the end-of-line character. Lucene
stores the content as is; however, when we read it back, the "\n" is
removed and we end up with text that runs together wherever there was
no space.

I can re-read the files from the filesystem to avoid this, but I would
like to see whether there are other alternatives.

Thank you.




Performance of IndexSearcher.explain(Query)

2012-11-20 Thread Trejkaz
I have a feature I wanted to implement which required a quick way to
check whether an individual document matched a query or not.

IndexSearcher.explain seemed to be a good fit for this.
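In Lucene 4.0 that presumably boils down to something like the following (the helper name is illustrative):

```java
import java.io.IOException;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

final class MatchCheck {
  // Returns true if the given (global) docID matches the query,
  // using the Explanation's match flag rather than its score.
  static boolean matches(IndexSearcher searcher, Query query, int docID) throws IOException {
    Explanation explanation = searcher.explain(query, docID);
    return explanation.isMatch();
  }
}
```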

The query I tested was just a BooleanQuery with two TermQuery inside
it, both with MUST. I ran an empty query to match all documents and
then ran the new code against each document. Out of 40,743 documents,
1,072 matched the query.

I got times of around 15.5s doing this. After noticing that
ConstantScoreQuery now works with Query in addition to Filter, I
started using it as well, which further reduced this time to 13.6s.

There is a comment like this on the explain method, though:

"Computing an explanation is as expensive as executing
 the query over the entire index."

So I wanted to test this. To do this, I made a collector which did
nothing but look for the single item being matched.

Times for searching the whole index using this collector came to
around 30.9s, which is more than twice as slow as using explain (times
didn't vary at all if I used ConstantScoreQuery here, which I assume
is something to do with using a custom collector which is ignoring the
scorer.)

So I was wondering, is this comment just out of date? It seems that by
using explain(), I get the same information I get by querying the
whole index, *plus* information about the score which the custom
collector wasn't recording, all in less than half the time it took to
query the whole index.

TX




Re: Performance of IndexSearcher.explain(Query)

2012-11-20 Thread Robert Muir
On Tue, Nov 20, 2012 at 6:18 PM, Trejkaz  wrote:

> [...]
>
Explain is not performant... but the comment is fair, I think? It's more of a
worst case; it depends on the query.
Explain is going to rewrite the query, create the weight and so on, just to
advance() the scorer to that single doc.
So if this is e.g. a wildcard query, then it could definitely be almost as
slow as searching the whole index, since the rewrite involves scanning
through the term dictionary or whatever.
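
[For reference, the per-document match test being timed in this thread can be
as short as the sketch below, written against the Lucene 4.x API; this is an
illustration, not code from the thread, and it is untested.]

```java
import java.io.IOException;

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

class SingleDocMatcher {
    /** True if docId matches query, according to explain(). */
    static boolean matches(IndexSearcher searcher, Query query, int docId)
            throws IOException {
        // Each call rewrites the query and builds a Weight, then
        // advances a Scorer to this one document, which is why a
        // rewrite-heavy query (e.g. a wildcard) can cost nearly as
        // much as a full search.
        Explanation expl = searcher.explain(query, docId);
        return expl.isMatch();
    }
}
```

[As Trejkaz found, wrapping the query in a ConstantScoreQuery can shave some
of the scoring overhead off repeated calls.]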


Re: Using Lucene 2.3 indices with Lucene 4.0

2012-11-20 Thread Trejkaz
On Wed, Nov 21, 2012 at 12:33 AM, Ramprakash Ramamoorthy
 wrote:
> On Tue, Nov 20, 2012 at 5:42 PM, Danil ŢORIN  wrote:
>
>> Ironically most of the changes are in unicode handling and standard
>> analyzer ;)
>>
>
> Ouch! It hurts then ;)

What we did going from 2 -> 3 (and in some cases where passing the
right Version into a constructor didn't actually give the same
behaviour as the old version... I'm looking at you, StandardTokenizer)
was to archive copies of the classes from older versions of Lucene and
layer our own backwards-compatible API on top of them. You just have
to come up with a way to identify how something was indexed and
support that forever (e.g. give all the Tokenizer and TokenFilter
implementations unique names and never change the names.)

The only time this really hurts is when Lucene change the API on
something like TokenFilter and you have 20 or so implementations of it
which you now have to update.

It's a good example of how backwards compatibility slows down
development time. The amount of work you have to do each time upstream
changes something is more or less directly proportional to how long
your application has been supported for. If I were making the
decisions, I wouldn't support anything across major versions and you
would just get an export/import tool for each version so you could
bring the data across if you really wanted it.

TX
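
[The "unique names, never changed" scheme described above boils down to a
small registry keyed by permanent name; a minimal sketch, where all the names
and the generic chain type are hypothetical rather than Lucene API:]

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the "permanent names" idea: every Tokenizer/TokenFilter
 * chain is registered under a name that never changes, the name is
 * recorded per field at index time, and old chains are kept (copied
 * from the old Lucene version if necessary) so existing indexes stay
 * readable forever.
 */
class AnalysisRegistry<T> {
    private final Map<String, T> chains = new HashMap<String, T>();

    /** Register a chain under a permanent, never-reused name. */
    void register(String permanentName, T chain) {
        if (chains.putIfAbsent(permanentName, chain) != null) {
            throw new IllegalStateException("name already taken: " + permanentName);
        }
    }

    /** Look up exactly the chain a field was indexed with. */
    T forName(String permanentName) {
        T chain = chains.get(permanentName);
        if (chain == null) {
            throw new IllegalArgumentException("unknown chain: " + permanentName);
        }
        return chain;
    }
}
```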




Re: Performance of IndexSearcher.explain(Query)

2012-11-20 Thread Trejkaz
On Wed, Nov 21, 2012 at 10:40 AM, Robert Muir  wrote:
> Explain is not performant... but the comment is fair I think? Its more of a
> worst-case, depends on the query.
> Explain is going to rewrite the query/create the weight and so on just to
> advance() the scorer to that single doc
> So if this is e.g. a wildcard query then it could definitely be almost as
> slow as searching the whole index since the rewrite involves scanning
> through the term dictionary or whatever.

Hmm, yep. That does seem to be it. For complicated queries (or at
least queries which are slow to create a weight for) it's about the
same speed no matter which way I do it. For the more normal queries I
was trying, explain() seems to speed things up a fair bit. For simple
one-term queries it might be a bit quicker still.

It's at least never slower than doing the full query though, so I can
still use it. I'll just be putting a similar (though perhaps more
specific) warning about performance on the method.

TX




Re: Grouping on multiple shards possible in lucene?

2012-11-20 Thread Ravikumar Govindarajan
Hi Shai,

I would only want to sort based on doc additions. Ex: docs added as d1, d2,
d3 should come back as d3, d2, d1. A doc-timestamp-based solution is much
more involved, like you said.

It's nice to know that you are already working on it and there will be a
solution in the near future.

In the meantime, I will live with good old sorting.

--
Ravi

On Wed, Nov 21, 2012 at 1:59 AM, Shai Erera  wrote:

> Hi Ravi,
>
> I've been dealing with reverse indexing lately, so let me share with you a
> bit of my experience thus far.
>
> First, you need to define what does reverse indexing mean for you. If it
> means that docs that were indexed in the following order: d1, d2, d3 should
> be traversed during search in that order: d3, d2, d1 - then that's one
> thing.
> However, if it means that the traversal needs to occur by e.g. the
> documents' timestamp, as a means to process documents from latest to
> oldest, then that's a totally different thing, and way more complicated.
>
> You will need to think about an IndexReader which reverses the order of the
> segments that it reads, so that segments are processed from latest to
> oldest. Also, you might need to merge the segments in reverse order too
> (i.e. if segments s1, s4, s5 are merged, merge them as s5, s4, s1).
>
> If you are interested in timestamp based sorting, it gets complicated.
> Documents flow in from multiple producers (e.g. a parallel crawler, or
> different processes which feed documents to the index, etc.) and are
> usually processed by multiple consumers (indexing threads). That makes
> sorting the index based on a timestamp difficult.
>
> Lucene used to have IndexSorter (before 4.0) which could sort an index by a
> field. That was an offline process and if that's what you're after -- you
> should do just that and forget about the rest. If however you're interested
> in an on-line process, where documents are fed in some order and searched
> in the exact true order (latest to oldest), that's a more complicated
> solution -- I'm still working on it :).
>
> HTH
>
> Shai
>
> On Tue, Nov 20, 2012 at 5:37 PM, Ravikumar Govindarajan <
> ravikumar.govindara...@gmail.com> wrote:
>
> > But, I think it should be possible with some fun codec & merge policy
> > & MultiReader magic, to have docIDs assigned in "reverse chronological
> > order"
> >
> > Can you explain it a bit more? I was thinking perhaps we could store
> > absolute doc-ids instead of deltas to allow reverse traversal. But that
> > could waste a lot of storage.
> >
> > The default merge policy will merge adjacent segments, no? Is it going
> > to disturb the ordering?
> >
> > --
> > Ravi
> >
> > On Tue, Nov 20, 2012 at 5:19 PM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> > > On Tue, Nov 20, 2012 at 1:49 AM, Ravikumar Govindarajan
> > >  wrote:
> > > > Thanks Mike. Actually, I think I can eliminate sort-by-time, if I am
> > able
> > > > to iterate postings in reverse doc-id order. Is this possible in
> > lucene?
> > >
> > > Alas that is not easy to do in Lucene: the posting lists are encoded
> > > in forward docID order.
> > >
> > > But, I think it should be possible with some fun codec & merge policy
> > > & MultiReader magic, to have docIDs assigned in "reverse chronological
> > > order" ...
> > >
> > > > Also, for a TopN query sorted by doc-id will the query terminate
> early?
> > >
> > > Actually, it won't!  But it really should ... you could make a
> > > Collector that throws an exception once the N docs have been
> > > collected?
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> >
>
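
[Mike's suggestion in the quoted thread above, a Collector that throws an
exception once N docs have been collected, might look roughly like the sketch
below against the Lucene 4.x Collector API; it is an illustration, untested.]

```java
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

/** Collects up to 'limit' global docIDs, then aborts the search by
 *  throwing an unchecked exception that the caller must catch. */
class EarlyTerminatingCollector extends Collector {
    static class DoneException extends RuntimeException {}

    private final int[] hits;
    private int count;
    private int docBase;

    EarlyTerminatingCollector(int limit) {
        this.hits = new int[limit];
    }

    @Override public void setScorer(Scorer scorer) {
        // Scores are not needed for early termination.
    }

    @Override public void setNextReader(AtomicReaderContext context) {
        docBase = context.docBase; // map segment-local docs to global IDs
    }

    @Override public boolean acceptsDocsOutOfOrder() {
        return false; // keep docIDs in index order
    }

    @Override public void collect(int doc) {
        hits[count++] = docBase + doc;
        if (count == hits.length) {
            throw new DoneException(); // stop the whole search
        }
    }
}
```

[The caller wraps the search in a try/catch, e.g.
`try { searcher.search(query, new EarlyTerminatingCollector(10)); }
catch (EarlyTerminatingCollector.DoneException done) { /* enough hits */ }`.
Note this aborts the entire search, unlike CollectionTerminatedException,
which in 4.x only terminates collection of the current segment.]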