Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Shai Erera
Hi, LogMP *always* picks adjacent segments together. Therefore, if you have segments S1, S2, S3, S4 where the date-wise sort order is S4>S3>S2>S1, then LogMP will pick either S1-S4, S2-S4, S2-S3 and so on. But always adjacent segments and in a row (i.e. it doesn't skip segments). I guess what both
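
To make the "adjacent and in a row" constraint concrete, here is a minimal sketch in plain Java. It is not Lucene's LogMergePolicy code and does not model segment sizes or merge factors; the class and method names are hypothetical. It only enumerates which groups of S1..S4 count as contiguous runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
public class AdjacentMergeCandidates {

    // Returns every contiguous run of at least two segments, in index order.
    static List<List<String>> contiguousRuns(List<String> segments) {
        List<List<String>> runs = new ArrayList<>();
        for (int start = 0; start < segments.size(); start++) {
            // end is exclusive, so start + 2 is the shortest run (two segments).
            for (int end = start + 2; end <= segments.size(); end++) {
                runs.add(new ArrayList<>(segments.subList(start, end)));
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // Prints each candidate run, one per line.
        for (List<String> run : contiguousRuns(Arrays.asList("S1", "S2", "S3", "S4"))) {
            System.out.println(run);
        }
    }
}
```

A group such as S1 plus S3 never appears, because it would skip S2. Which of these runs a real merge policy actually selects depends on its own size and merge-factor rules, which this sketch leaves out.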

Adding custom weights to individual terms

2014-02-12 Thread Rune Stilling
Hi list, I’m trying to figure out how customizable scoring and weighting is in the Lucene API. I read about the APIs but still can’t figure out if the following is possible. I would like to do normal document text indexing, but I would like to control the weight added to tokens myself, also I

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Ravikumar Govindarajan
@Mike, I had suggested the same approach in one of my previous mails, whereby each segment records min/max timestamps in seg-info diagnostics and uses them for merging adjacent segments. "Then, I define a TimeMergePolicy extends LogMergePolicy and define the segment-size=Long.MAX_VALUE - SEG_LEAST_

Lucene Algorithm for retrieving docs

2014-02-12 Thread Harshvardhan Ojha
Hi All, I have a question regarding retrieval of documents by Lucene. I know Lucene uses many files on disk to keep documents, each comprising fields, and uses many IR algorithms and an inverted index to match documents. My question is: 1. How does Lucene store these documents inside the file system

RE: Length of the filed does not affect the doc score accurately for chinese analyzer(SmartChineseAnalyzer)

2014-02-12 Thread andy
Hi Uwe, thanks a lot, I will try with that. Uwe Schindler wrote > Hi andy, > > unfortunately, that is not easy to show with one simple code. You have to > change the Similarity used. > > Before starting to do this, you should be sure that this affects your > users. The example you gave is sh

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Michael McCandless
Right, I think you'll need to use either of the LogXMergePolicy (or subclass LogMergePolicy and make your own): they always pick adjacent segments to merge. SortingMP lets you pass in the MP to wrap, so just pass in a LogXMP, and then sort by timestamp? Mike McCandless http://blog.mikemccandles

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Shai Erera
Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks adjacent segments and SortingMP ensures the merged segment is also sorted. Shai On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan < ravikumar.govindara...@gmail.com> wrote: > Yes exactly as you have described. > > Ex: Cons
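
The combination suggested here can be imitated outside Lucene to see why it helps. The sketch below is plain Java, not SortingMergePolicy's implementation: it assumes each segment is already sorted by descending timestamp and shows that a k-way merge of such segments yields one segment in the same order. The class name, method name, and sample timestamps are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not part of Lucene.
public class SortedSegmentMerge {

    // Merges segments that are each sorted by descending timestamp
    // into a single list that is also sorted by descending timestamp.
    static List<Long> mergeDescending(List<long[]> segments) {
        // Each queue entry is {segmentIndex, offset}; the head of the queue
        // is the entry whose current timestamp is the largest.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> Long.compare(segments.get(b[0])[b[1]], segments.get(a[0])[a[1]]));
        for (int i = 0; i < segments.size(); i++) {
            if (segments.get(i).length > 0) {
                heads.add(new int[] {i, 0});
            }
        }
        List<Long> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            long[] segment = segments.get(head[0]);
            merged.add(segment[head[1]]);
            // Advance within the same segment if it has more documents.
            if (head[1] + 1 < segment.length) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }
}
```

For example, merging an older segment {30, 20, 10} with a newer adjacent segment {60, 50, 40} gives 60, 50, 40, 30, 20, 10. When only adjacent segments are merged, the merged segment's time range stays contiguous, which is the property the thread is after.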

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Ravikumar Govindarajan
Yes, exactly as you have described. Ex: Consider that Segment[S1,S2,S3 & S4] are in reverse-chronological order and go for a merge. While SortingMergePolicy will correctly solve the merge part, it does not, however, play any role in picking segments to merge, right? SMP internally delegates to TieredMer

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Michael McCandless
OK, I see (early termination). That's a challenge, because you really want the docs sorted backwards from how they were added right? And, e.g., merged and then searched in "reverse segment order"? I think you should be able to do this w/ SortingMergePolicy? And then use a custom collector that

Re: Actual min and max-value of NumericField during codec flush

2014-02-12 Thread Ravikumar Govindarajan
Mike, All our queries need to be sorted by timestamp field, in descending order of time. [latest-first] Each segment is sorted in itself. But TieredMergePolicy picks arbitrary segments and merges them [even with SortingMergePolicy etc...]. I am trying to avoid this and see if an approximate globa

Re: Getting term ords during collect

2014-02-12 Thread Michael McCandless
It sounds like you are just indexing a TextField and then calling getDocTermOrds? This then requires a slow "uninvert" step... Hmm, how are you adding this field to your documents? Instead, you should use SortedSetDocValuesField, which will store the doc values directly in the index, and loading

RE: Length of the filed does not affect the doc score accurately for chinese analyzer(SmartChineseAnalyzer)

2014-02-12 Thread Uwe Schindler
Hi andy, unfortunately, that is not easy to show with one simple code. You have to change the Similarity used. Before starting to do this, you should be sure that this affects your users. The example you gave is showing very short documents. Lucene is optimized to handle larger documents, for

RE: Length of the filed does not affect the doc score accurately for chinese analyzer(SmartChineseAnalyzer)

2014-02-12 Thread andy
Thanks Uwe, could you please give me a more detailed example of how to change the Lucene behavior? Uwe Schindler wrote > Hi Erick, > > a statement like " Adding &debug=all to the query will show you if this is > the case" will not help a Lucene user, as it is only available in the Solr > server.

RE: Length of the filed does not affect the doc score accurately for chinese analyzer(SmartChineseAnalyzer)

2014-02-12 Thread Uwe Schindler
Hi Erick, a statement like " Adding &debug=all to the query will show you if this is the case" will not help a Lucene user, as it is only available in the Solr server. But Andy uses Lucene directly. In his case he should use IndexSearcher's explain functionalities to retrieve a structured outpu

Re: Length of the filed does not affect the doc score accurately for chinese analyzer(SmartChineseAnalyzer)

2014-02-12 Thread andy
Thanks for your reply Erick, this is the case. But how can I keep the precision of the fields' length? -- View this message in context: http://lucene.472066.n3.nabble.com/Length-of-the-filed-does-not-affect-the-doc-score-accurately-for-chinese-analyzer-SmartChineseAnalyz-tp4111390p4116832.html