Re: Apache Lucene 4.x word counting

2014-03-28 Thread Jose Carlos Canova
There is a small mismatch between your problem formulation and Lucene: Lucene doesn't count words. You count terms, which are produced by an Analyzer that you define during the index-writing phase. That analyzer will tokenize (which does not necessarily mean splitting on the white space between the words) a sequence of strings
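
To illustrate the difference between counting words and counting terms, here is a minimal plain-Java sketch. It does not use Lucene's API. The class `TermCountSketch`, the methods `analyze` and `count`, and the lookup-table "stemmer" are all hypothetical stand-ins for what an Analyzer with a stemming filter (for example Lucene's `PorterStemFilter`) would do. It uses the example sentence from the question in this thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class TermCountSketch {

    // Split on anything that is not a letter or digit, lower-case each piece,
    // then apply the caller-supplied normalizer (a stand-in for a real stemmer).
    public static List<String> analyze(String text, UnaryOperator<String> normalizer) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) {
                terms.add(normalizer.apply(raw.toLowerCase(Locale.ROOT)));
            }
        }
        return terms;
    }

    // Count occurrences of a single term or of a multi-term phrase such as "give up".
    // The query goes through the same analysis as the text, so both sides compare terms.
    public static int count(String text, String query, UnaryOperator<String> normalizer) {
        List<String> terms = analyze(text, normalizer);
        List<String> wanted = analyze(query, normalizer);
        if (wanted.isEmpty()) {
            return 0;
        }
        int hits = 0;
        for (int i = 0; i + wanted.size() <= terms.size(); i++) {
            if (terms.subList(i, i + wanted.size()).equals(wanted)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumption: a tiny lookup table plays the role of a stemmer.
        Map<String, String> toyStems = new HashMap<>();
        toyStems.put("loved", "love");
        UnaryOperator<String> stem = t -> toyStems.getOrDefault(t, t);

        String text = "I loved cats, but now I really love dogs";
        System.out.println(count(text, "love", stem));    // prints 2
        System.out.println(count(text, "give up", stem)); // prints 0
    }
}
```

With an identity normalizer (`t -> t`) the count for "love" drops to 1, because "loved" and "love" are then different terms. That is the point of the reply above: the result depends on the analysis chain, not on the raw words.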

Re: Lucene 4.7 intermittently not applying query filter

2014-03-28 Thread Jamie
Steve, thanks for the contact. I believe UAX29URLEmailTokenizer tokenizes email addresses as follows: john@mycompany.com.au john.doe mycompany.com.au john doe mycompany com au com.au. We have an overridden query parser that swaps out anyaddress: with to, from, cc, bcc, etc. Inside the overri

Apache Lucene 4.x word counting

2014-03-28 Thread Hollow Quincy
Hello, I would like to use Apache Lucene 4.x to count words in a string. For example, in "I loved cats, but now I really love dogs", counting the word "love" should give 2. I would also like to count how many times "give up" occurs in the string. I spent a lot of time to r

Re: Lucene 4.7 intermittently not applying query filter

2014-03-28 Thread Steve Rowe
Hi Jamie, What does EmailFilter do? Why is the expanded form "required for the UAX29URLEmailTokenizer"? Seems like an exact match would work on the email address alone, without the expanded components? Do you have an example of a query that reproducibly matches more documents than it shoul

RE: Lucene 4.7 intermittently not applying query filter

2014-03-28 Thread Uwe Schindler
Hi Jamie, is your query Filter also implemented by your team? If so, maybe you are not correctly implementing the random-access getDocIdSet() or bits(), or you are not correctly handling the acceptDocs parameter in your own DocIdSet / Filter implementation, leading to random failures. Uwe
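
The acceptDocs point can be sketched in plain Java, without Lucene's classes. The class `AcceptDocsSketch` and the method `applyAcceptDocs` are hypothetical, and `java.util.BitSet` stands in for Lucene's bit sets. The assumption illustrated here is that the set a filter returns should contain only documents that both match the filter and are allowed by acceptDocs, and that a null acceptDocs means every document is allowed. In Lucene 4.x itself, `BitsFilteredDocIdSet.wrap(set, acceptDocs)` is the usual helper for this.

```java
import java.util.BitSet;

public class AcceptDocsSketch {

    // matches:    documents the filter itself selects in one index segment.
    // acceptDocs: documents still allowed (for example, not deleted);
    //             null is treated as "all documents allowed".
    public static BitSet applyAcceptDocs(BitSet matches, BitSet acceptDocs) {
        // Work on a copy so that a cached, shared 'matches' set is never mutated.
        BitSet result = (BitSet) matches.clone();
        if (acceptDocs != null) {
            result.and(acceptDocs);
        }
        return result;
    }
}
```

Copying before intersecting is a deliberate choice in this sketch: if the same cached bit set were modified in place while several searches ran concurrently, results could differ from one search to the next.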

Re: Lucene 4.7 intermittently not applying query filter

2014-03-28 Thread Jamie
I beg your pardon. It's our EmailFilter class that emits the tokens. We do it this way because users like to search using individual components of an email address, e.g. joe or mycompany.com.au. I think we may have a synchronization issue at play. I will perform some further testing and will get
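
The EmailFilter source is not shown in this thread, so the following is only a guess at the kind of expansion described, written in plain Java rather than as a Lucene TokenFilter. The class `EmailComponentsSketch` and the method `expand` are hypothetical, and the exact set of components the real filter emits (for instance suffixes such as com.au) may differ.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class EmailComponentsSketch {

    // Expand one email address into the full address plus searchable components:
    // local part, domain, and the dot-separated pieces of each.
    public static Set<String> expand(String address) {
        // LinkedHashSet keeps emission order and drops duplicate components.
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add(address);

        int at = address.indexOf('@');
        if (at <= 0 || at == address.length() - 1) {
            return tokens; // no usable local part or domain: emit the input only
        }
        String local = address.substring(0, at);
        String domain = address.substring(at + 1);
        tokens.add(local);
        tokens.add(domain);

        for (String part : local.split("\\.")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        for (String part : domain.split("\\.")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }
}
```

For example, `expand("john.doe@mycompany.com.au")` yields the full address, john.doe, mycompany.com.au, john, doe, mycompany, com and au, so a search for any single component can match the indexed address.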

Re: Lucene 4.7 intermittently not applying query filter

2014-03-28 Thread Steve Rowe
Jamie, UAX29URLEmailTokenizer does not emit email components as tokens; “john@mycompany.com.au” will be tokenized as “john@mycompany.com.au”, nothing more. That’s why I asked what EmailFilter does. If the filter really is ignored by Lucene, that would be a bug in Lucene. I think some

Lucene 4.7 intermittently not applying query filter

2014-03-28 Thread Jamie
Greetings, we have a problem whereby Lucene 4.7 occasionally does not apply a filter query during searching. The problem is intermittent: one in thirty or so searches will return what appears to be an unfiltered result set. There are no exceptions or errors occurring, just incorrect results.