Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-11 Thread Chitra R
Hi Shai,

i) I assume that when opening a SortedSetDocValuesReaderState we compute
the ordinals (which are later used to calculate facet counts) for the doc
values field, and that this is what makes creating the state instance
somewhat costly. Am I right, or is there another reason behind that?



 ii) During indexing we store facet ordinals in each doc, and I think
this is useful on the search side so facet counts are calculated only for
the matching docs. Does it carry any other benefits?


 iii) Is SortedSetDocValuesReaderState thread-safe, i.e. can multiple
threads use it concurrently?


Kindly post your suggestions.


Thanks,

Chitra


On Thu, Nov 10, 2016 at 4:34 PM, Shai Erera  wrote:

> Hi
>
> The reason IMO is historic - ES and Solr had faceting solutions before
> Lucene had it. There were discussions in the past about using the Lucene
> faceting module in Solr (can't tell for ES) but, sadly, I can't say I see
> it happening at this point.
>
> Regarding your other question, IMO the Lucene faceting engine, in terms of
> performance and customizability, is on par with Solr/ES. However, it lacks
> distributed faceting support and aggregations. Since many people use
> Solr/ES and not Lucene directly, the Solr/ES faceting module continues to
> advance separately from the Lucene one.
>
> Enhancing Lucene facets with aggregations and even distributed faceting
> capabilities is mostly a matter of time and priorities. If you're
> interested in it, I'd be willing to collaborate with you on that as much as
> I can!
>
> And I'd still hope that this work finds its way into Solr/ES, as I think
> it's silly to have so many faceting implementations when they all rely on
> the same low-level data structure - Lucene!
>
> Shai
>
>
> On Thu, Nov 10, 2016 at 12:32 PM Kumaran Ramasubramanian <
> kums@gmail.com>
> wrote:
>
> > Hi All,
> > We all know that Lucene supports faceting via the taxonomy approach
> > (separate index and hierarchical facets) and via
> > SortedSetDocValuesFacetField (flat facets and no sidecar index).
> >
> >   Then why did Solr and Elasticsearch go for their own implementations?
> > (That is, Solr uses block join & Elasticsearch uses aggregations.) Are
> > there any limitations in Lucene's implementation?
> >
> >
> > --
> > Kumaran R
> >
>


Re: Getting list of committed documents

2016-11-11 Thread Michael McCandless
Hi lukes,

First, IW never "auto commits".  The maxBufferedDocs/RAMBufferSizeMB
settings control when IW moves recently indexed documents from RAM to
disk, but that move, which writes new segment files, does not commit
them.  It just writes them to disk; they are not visible to an external
reader (unless you open a near-real-time reader from IW) until you
explicitly call IW.commit.

Second, every IW operation returns a long sequence number, and so does
IW.commit, such that all sequence numbers <= the sequence number
returned from IW.commit "made it" into the index, and all other ops
did not make it.

You should be able to use this information to e.g. tell the channel
(e.g. a kafka queue) which offset your Lucene app has "durably"
consumed.
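
For illustration, a minimal sketch of that approach (assuming Lucene 6.2+,
where IndexWriter operations return sequence numbers; the class and field
names below are made up, not a Lucene API):

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

/** Tracks which channel offsets (e.g. Kafka offsets) have been durably indexed. */
class DurableOffsetTracker {
    // channel offset -> sequence number returned by the IndexWriter operation
    private final ConcurrentSkipListMap<Long, Long> seqNoByOffset = new ConcurrentSkipListMap<>();
    private final IndexWriter writer;

    DurableOffsetTracker(IndexWriter writer) {
        this.writer = writer;
    }

    /** Called by the indexing threads. */
    void index(long channelOffset, Document doc) throws IOException {
        long seqNo = writer.addDocument(doc);    // sequence number of this operation
        seqNoByOffset.put(channelOffset, seqNo);
    }

    /** Called by the periodic commit thread; returns the highest durable offset, or -1. */
    long commitAndGetDurableOffset() throws IOException {
        long commitSeqNo = writer.commit();      // ops with seqNo <= commitSeqNo "made it"
        long durableOffset = -1;
        if (commitSeqNo < 0) {                   // nothing new was committed
            return durableOffset;
        }
        for (Map.Entry<Long, Long> e : seqNoByOffset.entrySet()) {
            if (e.getValue() <= commitSeqNo) {
                durableOffset = Math.max(durableOffset, e.getKey());
            }
        }
        return durableOffset;
    }
}

The commit thread can then acknowledge every channel offset up to the returned
value and prune the map.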

Mike McCandless

http://blog.mikemccandless.com


On Wed, Nov 9, 2016 at 2:40 PM, lukes  wrote:
> Hi all,
>
>   I need some feedback on getting hold of the documents that got committed
> during a commit call on IndexWriter. There are multiple threads that keep
> adding documents to the IndexWriter in parallel, and there's another thread
> which wakes up every n minutes and does the commit. Below are the points I
> need help on:
>
> 1) How do I disable the auto flush / commit? On IndexWriterConfig I called
> setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH) and also set
> setRAMBufferSizeMB to a high number, around 128 MB. Is this correct, or is
> there any other knob I need to play around with?
>
> 2) How do I find out which documents got committed during commit (so certain
> actions can be done, like removing them from the channel, etc.)? I tried
> extending IndexWriter and overriding doAfterFlush, but I don't see any way to
> get hold of the documents that made it into this commit.
>
> Any help is really appreciated.
>
> Regards.
>
>




Too long token is not handled properly?

2016-11-11 Thread Alexey Makeev
Hello,

I'm using lucene 6.2.0 and expecting the following test to pass:

import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.standard.StandardTokenizer;

import java.io.IOException;
import java.io.StringReader;

public class TestStandardTokenizer extends BaseTokenStreamTestCase
{
    public void testLongToken() throws IOException
    {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        final int maxTokenLength = tokenizer.getMaxTokenLength();

        // "a" repeated maxTokenLength + 5 times, followed by " abc"
        final String longToken = new String(new char[maxTokenLength + 5]).replace("\0", "a") + " abc";

        tokenizer.setReader(new StringReader(longToken));

        assertTokenStreamContents(tokenizer, new String[]{"abc"});
        // actual contents: "a" repeated 255 times, the remaining "a"s, "abc"
    }
}

It seems like StandardTokenizer treats a completely filled buffer as a
successfully extracted token (1), and also emits the tail of the too-long
token as a separate token (2). Maybe (1) is disputable (I think it is a bug),
but I believe (2) is a bug.

Best regards,
Alexey Makeev
makeev...@mail.ru

How to exclude empty fields?

2016-11-11 Thread voidmind
Hi,

I have indexed content about Promotions with effectiveDate and endDate
fields for when the promotions start and end.

I want to query for expired promotions, so I have this criterion, which
works fine:

+Promotion.endDate:[210100 TO <variable containing yesterday's date>]

The issue I have is that some promotions are permanent so they don't have
an endDate set.

I tried doing:

( +Promotion.endDate:[210100 TO <variable containing yesterday's date>]
|| -Promotion.endDate:* )

But it doesn't seem to work: the promotions with no endDate still show up in
my results (apparently empty endDate fields are not indexed).

How would I exclude content that doesn't have an endDate set?

Thanks,
Alexandre Leduc


Re: Too long token is not handled properly?

2016-11-11 Thread Steve Rowe
Hi Alexey,

The behavior you mention is an intentional change from the behavior in Lucene 
4.9.0 and earlier, when tokens longer than maxTokenLength were silently ignored: 
see LUCENE-5897 [1] and LUCENE-5400 [2].

The new behavior is as follows: token matching rules are no longer allowed to 
match against input char sequences longer than maxTokenLength.  If a rule 
would match a sequence longer than maxTokenLength, but also matches at 
maxTokenLength chars or fewer, has the highest priority among all rules 
matching at that length, and no other rule matches more chars, then a token 
will be emitted for that rule at the matching length.  The rule-matching 
iteration then simply continues from that point as normal.  If the same rule 
matches against the remainder of the sequence that the first rule would have 
matched had maxTokenLength been longer, then another token of the matched 
length will be emitted, and so on.

Note that this can result in effectively splitting the sequence at 
maxTokenLength intervals as you noted.

You can fix the problem by setting maxTokenLength higher - this has the side 
effect of growing the buffer and not causing unwanted token splitting.  If this 
results in tokens larger than you would like, you can remove them with 
LengthFilter.
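
For example, a minimal sketch of that setup (the limits 4096 and 64 below are
arbitrary examples, not recommendations):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setMaxTokenLength(4096);                           // larger buffer, no mid-run splits
        TokenStream filtered = new LengthFilter(tokenizer, 1, 64);   // drop tokens longer than 64 chars
        return new TokenStreamComponents(tokenizer, filtered);
    }
};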

FYI there is discussion on LUCENE-5897 about separating buffer size from 
maxTokenLength, starting here:
 - ultimately I decided that few people would benefit from the increased 
configuration complexity.

[1] https://issues.apache.org/jira/browse/LUCENE-5897
[2] https://issues.apache.org/jira/browse/LUCENE-5400

--
Steve
www.lucidworks.com

> On Nov 11, 2016, at 6:23 AM, Alexey Makeev  wrote:
> 
> Hello,
> 
> I'm using lucene 6.2.0 and expecting the following test to pass:
> 
> import org.apache.lucene.analysis.BaseTokenStreamTestCase;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
> 
> import java.io.IOException;
> import java.io.StringReader;
> 
> public class TestStandardTokenizer extends BaseTokenStreamTestCase
> {
>     public void testLongToken() throws IOException
>     {
>         final StandardTokenizer tokenizer = new StandardTokenizer();
>         final int maxTokenLength = tokenizer.getMaxTokenLength();
> 
>         // "a" repeated maxTokenLength + 5 times, followed by " abc"
>         final String longToken = new String(new char[maxTokenLength + 5]).replace("\0", "a") + " abc";
> 
>         tokenizer.setReader(new StringReader(longToken));
> 
>         assertTokenStreamContents(tokenizer, new String[]{"abc"});
>         // actual contents: "a" repeated 255 times, the remaining "a"s, "abc"
>     }
> }
> 
> It seems like StandardTokenizer treats a completely filled buffer as a
> successfully extracted token (1), and also emits the tail of the too-long
> token as a separate token (2). Maybe (1) is disputable (I think it is a
> bug), but I believe (2) is a bug.
> 
> Best regards,
> Alexey Makeev
> makeev...@mail.ru





Re: How to exclude empty fields?

2016-11-11 Thread Ahmet Arslan
Hi,

Match all docs query minus Promotion.endDate:[* TO *]
+*:* -Promotion.endDate:[* TO *]
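
For reference, a programmatic sketch of the same query (assuming
Promotion.endDate is an indexed string field):

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermRangeQuery;

// Match every doc, then exclude any doc that has some value in Promotion.endDate.
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST);
builder.add(new TermRangeQuery("Promotion.endDate", null, null, true, true),  // [* TO *]
            BooleanClause.Occur.MUST_NOT);
Query permanentPromotions = builder.build();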

Ahmet


On Friday, November 11, 2016 5:59 PM, voidmind  wrote:
Hi,

I have indexed content about Promotions with effectiveDate and endDate
fields for when the promotions start and end.

I want to query for expired promotions, so I have this criterion, which
works fine:

+Promotion.endDate:[210100 TO <variable containing yesterday's date>]

The issue I have is that some promotions are permanent so they don't have
an endDate set.

I tried doing:

( +Promotion.endDate:[210100 TO <variable containing yesterday's date>]
|| -Promotion.endDate:* )

But it doesn't seem to work: the promotions with no endDate still show up in
my results (apparently empty endDate fields are not indexed).

How would I exclude content that doesn't have an endDate set?

Thanks,
Alexandre Leduc




Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-11 Thread Michael McCandless
On Fri, Nov 11, 2016 at 5:21 AM, Chitra R  wrote:

> i) I assume that when opening a SortedSetDocValuesReaderState we compute
> the ordinals (which are later used to calculate facet counts) for the doc
> values field, and that this is what makes creating the state instance
> somewhat costly. Am I right, or is there another reason behind that?

That's correct.  It adds some latency to an NRT refresh, and some heap
used to hold the ordinal mappings.

>  ii) During indexing we store facet ordinals in each doc, and I think
> this is useful on the search side so facet counts are calculated only for
> the matching docs. Does it carry any other benefits?

Well, compared to the taxonomy facets, SSDV facets don't require a
separate index.

But they add latency/heap usage, and they cannot do hierarchical
facets yet (though this could be fixed if someone just built it).

>  iii) Is SortedSetDocValuesReaderState thread-safe, i.e. can multiple
> threads use it concurrently?

Yes.
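
For context, a minimal end-to-end sketch of SSDV faceting, tying the three
points together (it assumes an already-open IndexWriter named writer and
follows the pattern of Lucene's facet demo code):

import org.apache.lucene.document.Document;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.sortedset.DefaultSortedSetDocValuesReaderState;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetField;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesReaderState;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;

// Indexing: add the flat facet value and run the doc through FacetsConfig.build(),
// which encodes the facet ordinals into the doc values field.
FacetsConfig config = new FacetsConfig();
Document doc = new Document();
doc.add(new SortedSetDocValuesFacetField("Author", "Lisa"));
writer.addDocument(config.build(doc));
writer.commit();

// Searching: build the (relatively costly) reader state once per reader and
// reuse it across searches; rebuild it only after reopening the reader.
DirectoryReader reader = DirectoryReader.open(writer);
SortedSetDocValuesReaderState state = new DefaultSortedSetDocValuesReaderState(reader);

IndexSearcher searcher = new IndexSearcher(reader);
FacetsCollector fc = new FacetsCollector();
searcher.search(new MatchAllDocsQuery(), fc);              // counts only the matching docs
Facets facets = new SortedSetDocValuesFacetCounts(state, fc);
FacetResult authors = facets.getTopChildren(10, "Author");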

Mike McCandless

http://blog.mikemccandless.com
