Re: PostingsHighlighter to highlight the first Match in the document

2013-07-18 Thread Michael McCandless
But for this one document, where you get only the first sentence back
from PH without "android" in it, does "android" in fact occur in that
field for that document?

Ie, it could be that document was returned because another field (e.g.
title) matched, but the body field you are highlighting on did not
have the term?

Yes, PH supports any analyzer.
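
For reference, here's a rough sketch of driving PH (Lucene 4.x; the "body"
field name is illustrative):

PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, 10);
// One snippet per hit; "body" must have been indexed with
// IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS for PH to work.
String[] snippets = highlighter.highlight("body", query, searcher, topDocs);
// A snippet with no highlighted term means PH fell back to the default
// passage (the first sentences) because the query terms were not in that field.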

Mike

Mike McCandless

http://blog.mikemccandless.com


On Thu, Jul 18, 2013 at 2:57 AM, VIGNESH S  wrote:
> Hi Mike,
>
> I am getting the Search Hits.
>
> Will PostingsHighlighter support all analyzers?
>
>
> Thanks and Regards
> Vignesh Srinivasan
>
>
>
> On Wed, Jul 17, 2013 at 11:06 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hmm it sounds like you are getting the "default passage" (first N
>> sentences), which happens when the document did not have any matched
>> terms from the query.  Are you sure your content matches Android?  Can
>> you post a full test case showing the issue?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Jul 17, 2013 at 10:12 AM, VIGNESH S 
>> wrote:
>> > Hi Mike,
>> >
>> > I tried TestPostingsHighlighter.java, giving it my own content.
>> >
>> > In that test, if I search for "Android", it always returns the first
>> > sentence as the highlighted text, whether or not the sentence contains
>> > the searched keyword.
>> >
>> >
>> >
>> >
>> > On Wed, Jul 17, 2013 at 3:48 PM, VIGNESH S 
>> wrote:
>> >
>> >>
>> >> Hi,
>> >>
>> >> I need to highlight the first sentence that matches the search keyword
>> >> in a document using PostingsHighlighter.
>> >>
>> >> How can I do this?
>> >>
>> >> Any help or suggestions are welcome.
>> >> --
>> >> Thanks and Regards
>> >> Vignesh Srinivasan
>> >>
>> >>
>> >
>> >
>> > --
>> > Thanks and Regards
>> > Vignesh Srinivasan
>> > 9739135640
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> --
> Thanks and Regards
> Vignesh Srinivasan
> 9739135640

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Indexing into SolrCloud

2013-07-18 Thread Beale, Jim (US-KOP)
Hey folks,

I've been migrating an application that indexes about 15M documents from 
straight-up Lucene into SolrCloud.  We've set up 5 Solr instances with a 
3-node ZooKeeper ensemble, using HAProxy for load balancing. The documents are 
processed on a quad-core machine with 6 threads and indexed into SolrCloud 
through HAProxy using ConcurrentUpdateSolrServer in order to batch the updates. 
The indexing box is heavily loaded.

I've been accepting the default HttpClient, with 50K buffered docs and 2 
threads, i.e.,

int solrMaxBufferedDocs = 50000; // 50K buffered docs
int solrThreadCount = 2;
solrServer = new ConcurrentUpdateSolrServer(solrHttpIPAddress, 
solrMaxBufferedDocs, solrThreadCount);

autoCommit is configured in the solrconfig as follows:

 <autoCommit>
   <maxTime>60</maxTime>
   <maxDocs>50</maxDocs>
   <openSearcher>false</openSearcher>
 </autoCommit>

I'm getting the following errors on the client and server sides respectively:

Client side:

2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-4] INFO  
SystemDefaultHttpClient - I/O exception (java.net.SocketException) caught when 
processing request: Software caused connection abort: socket write error
2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-4] INFO  
SystemDefaultHttpClient - Retrying request
2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-5] INFO  
SystemDefaultHttpClient - I/O exception (java.net.SocketException) caught when 
processing request: Software caused connection abort: socket write error
2013-07-16 19:02:47,002 [concurrentUpdateScheduler-1-thread-5] INFO  
SystemDefaultHttpClient - Retrying request

Server side:

7988753 [qtp1956653918-23] ERROR org.apache.solr.core.SolrCore - 
java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException] early 
EOF
at 
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at 
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)

When I disabled autoCommit on the server side, I didn't see any errors there, 
but I still get the issue client-side after about 2 million documents, which 
takes about 45 minutes.

Has anyone seen this issue before?  I couldn't find anything useful in the 
usual places.

I suppose I could set up Wireshark to see what is happening, but I'm hoping 
that someone has a better suggestion.

Thanks in advance for any help!


Best regards,
Jim Beale

hibu.com
2201 Renaissance Boulevard, King of Prussia, PA, 19406
Office: 610-879-3864
Mobile: 610-220-3067

The information contained in this email message, including any attachments, is 
intended solely for use by the individual or entity named above and may be 
confidential. If the reader of this message is not the intended recipient, you 
are hereby notified that you must not read, use, disclose, distribute or copy 
any part of this communication. If you have received this communication in 
error, please immediately notify me by email and destroy the original message, 
including any attachments. Thank you.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Indexing into SolrCloud

2013-07-18 Thread Jack Krupansky
Sorry, but you need to resend this message to the Solr user list - this is 
the Lucene user list.


-- Jack Krupansky

-Original Message- 
From: Beale, Jim (US-KOP)

Sent: Thursday, July 18, 2013 12:34 PM
To: java-user@lucene.apache.org
Subject: Indexing into SolrCloud




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Partial word match using n-grams

2013-07-18 Thread Becker, Thomas
One of our main use-cases for search is to find objects based on partial name 
matches.  I've implemented this using n-grams and it works pretty well.  
However, we're currently using trigrams, and that causes an interesting problem 
when searching for things like "abc ab", since we first split on whitespace and 
then construct PhraseQuerys containing each trigram yielded by the "word".  
Obviously we cannot get a trigram out of "ab".  So our choices would seem to be 
either to discard this part of the search term, which seems unwise, or to reduce 
the minimum n-gram size.  But I'm slightly concerned about the resulting bloat 
in both the number of Terms stored in the index and the number contained in 
queries.  Is this something I should be concerned about?  It just "feels" like 
a query for the word "abcdef" shouldn't require a PhraseQuery of 15 terms 
(assuming n-gram sizes 1 to 3).  Is this the best way to do partial word 
matches?  Thanks in advance.
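
For context, here's a rough sketch of the indexing side (Lucene 4.3-era APIs;
this is an illustration of the approach, not the actual code):

Analyzer trigramAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Split on whitespace and lowercase, then emit 3-grams per "word".
        WhitespaceTokenizer source = new WhitespaceTokenizer(Version.LUCENE_43, reader);
        TokenStream ts = new LowerCaseFilter(Version.LUCENE_43, source);
        ts = new NGramTokenFilter(ts, 3, 3); // minGram = maxGram = 3
        return new TokenStreamComponents(source, ts);
    }
};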

-Tommy




Re: Another question on sorting documents

2013-07-18 Thread Adrien Grand
Hi,

On Thu, Jul 18, 2013 at 7:15 AM, Sriram Sankar  wrote:
> The approach we have discussed in an earlier thread uses:
>
> writer.addIndexes(new SortingAtomicReader(...));
>
> I want to confirm (this is not absolutely clear to me yet) that the above
> call will not create multiple segments - i.e., the output will be optimized.

All the provided readers will be merged into a single segment but if
your index already has segments, it will have an additional one.

> We are also trying another approach - sorting the documents in Hadoop - so
> that we can repeatedly call writer.addDocument(...) providing documents in
> the correct order.
>
> How can we make sure that the final output contains documents in a single
> segment  and in the order in which they were added?

You can ensure that documents stay in the order in which they have
been added by using LogByteSizeMergePolicy or LogDocMergePolicy. However,
don't use TieredMergePolicy, which will happily merge non-adjacent
segments.

If this is an offline operation, you can just use LogByteSizeMergePolicy,
add documents in order, and run forceMerge(1) when finished.
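
A minimal sketch of that offline path, assuming the documents arrive
pre-sorted (variable names are illustrative):

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43, analyzer);
iwc.setMergePolicy(new LogByteSizeMergePolicy()); // log policies only merge adjacent segments
IndexWriter writer = new IndexWriter(dir, iwc);
for (Document doc : preSortedDocs) { // preSortedDocs: hypothetical, sorted upstream (e.g. in Hadoop)
    writer.addDocument(doc);         // doc IDs follow insertion order
}
writer.forceMerge(1); // collapse to a single segment; order is preserved
writer.close();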

--
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



ShingleFilter

2013-07-18 Thread Malgorzata Urbanska
Hello,

For some time I have been trying to apply ShingleFilter. I have a string:
"The users get program in the User RPC API in Apache Rave"

and I would like to get:

[the users get]  [users get program]  [get program in] [program in
the] [in the user] [the user rpc] [user rpc api] [rpc api in] [api in
apache] [in apache rave][apache rave 0.11]

however I'm getting :

[the users get] [users] [users get program] [get] [get program in]
[program] [program in the] [in the user] [the user rpc] [user] [user
rpc api] [rpc] [rpc api in] [api] [api in apache] [in apache rave]
[apache] [apache rave 0.11] [rave]

part of my code:

protected TokenStreamComponents createComponents(String fieldName, Reader reader) {

    StandardTokenizer source = new StandardTokenizer(Version.LUCENE_43, reader);

    TokenStream tokenStream = new StandardFilter(Version.LUCENE_43, source);

    tokenStream = new LowerCaseFilter(Version.LUCENE_43, tokenStream);

    tokenStream = new ShingleFilter(tokenStream, 3, 3);

    tokenStream = new StopFilter(Version.LUCENE_43, tokenStream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

    return new TokenStreamComponents(source, tokenStream);
}

Could somebody please explain why I'm getting single-word shingles
when I set the min size to 3?
Thanks,
--
gosia

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: ShingleFilter

2013-07-18 Thread Allison, Timothy B.
Need to set outputUnigrams = false with something like:

  StandardTokenizer source = new StandardTokenizer(Version.LUCENE_43, reader);
  TokenStream tokenStream = new StandardFilter(Version.LUCENE_43, source);
  tokenStream = new LowerCaseFilter(Version.LUCENE_43, tokenStream);

  TokenFilter sf = new ShingleFilter(tokenStream, 3, 3);
  ((ShingleFilter) sf).setOutputUnigrams(false);

  sf = new StopFilter(Version.LUCENE_43, sf, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

  return new Analyzer.TokenStreamComponents(source, sf);


Not sure the stopFilter will do you any good if you're extracting only trigrams.
-Original Message-
From: murba...@rams.colostate.edu [mailto:murba...@rams.colostate.edu] On 
Behalf Of Malgorzata Urbanska
Sent: Thursday, July 18, 2013 6:02 PM
To: java-user@lucene.apache.org
Subject: ShingleFilter


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Partial word match using n-grams

2013-07-18 Thread Allison, Timothy B.
Tommy,
  I'm sure that I don't fully understand your use case and your data.  Some 
thoughts:

1) I assume that fuzzy term search (edit distance <= 2) isn't meeting your 
needs or else you wouldn't have gone the ngram route.  If fuzzy term search + 
phrase/proximity search would meet your needs, see if ComplexPhraseQueryParser 
would work (although it looks like you're already building your own queries).

2) Would it make sense to modify NGramFilter so that it outputs a bigram for a 
two-letter term and a unigram for a one-letter term?  Might be messy... and "ab" 
in this scenario would never match "abc".

3) Would it make sense to pad your terms behind the scenes with "##"?  This 
would add bloat, but not nearly as much as variable gram sizes with 1 <= n <= 3:

ab -> ##ab## yields trigrams ##a, #ab, ab#, b##

4) How partial and what types of partial do you need?  This is related to 1).  
If minimum edit distance is sufficient, use it, especially with the blazing 
fast automaton (thank you, Robert Muir). If you have a smallish dataset, you 
might consider allowing leading wildcards so that you could easily find all 
words containing, for example, abc with *abc*.  If your dataset is larger, you 
might consider something like ReversedWildcardFilterFactory (Solr) to speed up 
this type of matching.

I look forward to other opinions from the list.

-Original Message-
From: Becker, Thomas [mailto:thomas.bec...@netapp.com] 
Sent: Thursday, July 18, 2013 3:55 PM
To: java-user@lucene.apache.org
Subject: Partial word match using n-grams




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ShingleFilter

2013-07-18 Thread Malgorzata Urbanska
Thanks!




-- 
Malgorzata Urbanska (Gosia)
Graduate Assistant
Colorado State University

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Partial word match using n-grams

2013-07-18 Thread Becker, Thomas
Thanks for the reply, Tim.  I really should have been clearer.  Let's say I have 
an object named "quota_tommy_1234".  I'd like to match that object with any 
3-character (or longer) substring of that name.  So, for example:

quo
tom 
234
quota
etc.

Further, at search time I'm splitting input on whitespace before tokenizing 
into PhraseQueries and then ANDing them together.  So using the example above I 
also want the following queries to match:

quo tom
quo 234 
quota to <- this is the problem because there are no trigrams of "to"

That said, in response to your points:

1) Not sure FuzzyQuery is what I need; I'm not trying to match via 
misspellings, which is the main function of FuzzyQuery, is it not?

2) The original names are all going to be > 3 characters, so there are no 1 or 
2 letter terms at indexing time.  So generating the bigram "to" at search time 
will never match anything, unless I switch to bigrams at indexing time also, 
which is what I'm asking about.

3)  Again the names are all > 3 characters so I don't need to pad at indexing 
time.

4) Hopefully my explanation above clarifies.

I should point out that I'm a Lucene novice and am not at all sure that what 
I'm doing is optimal.  But I have been impressed with how easy it is to get 
something working very quickly!
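
For completeness, here's a rough sketch of the query construction I described
(Lucene 4.x; the "name" field and variable names are illustrative):

BooleanQuery query = new BooleanQuery();
for (String word : userInput.toLowerCase().split("\\s+")) {
    PhraseQuery phrase = new PhraseQuery();
    for (int i = 0; i + 3 <= word.length(); i++) {
        // consecutive trigrams of this "word", in order
        phrase.add(new Term("name", word.substring(i, i + 3)));
    }
    query.add(phrase, BooleanClause.Occur.MUST);
}
// For "quota to", the inner loop yields no trigrams for "to", leaving an
// empty PhraseQuery - exactly the problem described above.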


From: Allison, Timothy B. [talli...@mitre.org]
Sent: Thursday, July 18, 2013 7:49 PM
To: java-user@lucene.apache.org
Subject: RE: Partial word match using n-grams



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Searching for words beginning with "or"

2013-07-18 Thread ABlaise
Hi everyone,

I am new to this forum; I have done some research on my question but I
can't seem to find an answer.
I am using Lucene for a project, and I know for sure that somewhere in my
Lucene index I have this document with these elements:
Document
stored,indexed,tokenized,omitNorms
stored,indexed,tokenized,omitNorms
stored,indexed,tokenized,omitNorms>.

I am looking for it, but this query doesn't work: "(+areaType:(City OR
Neighborhood OR County) +areaName:portland*) AND *(city:or* OR state:or*)*",
and I have tried tons of alternatives (o*, o*r, ...). Lucene seems to
mistake 'or' for the OR operator. What should I do to be more precise?
For extra detail: this String goes through a QueryParser with
a StandardAnalyzer before being searched for in the index.

Any help would be welcomed !
Thanks in advance,

Adrien



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-for-words-begining-with-or-tp4079018.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Searching for words beginning with "or"

2013-07-18 Thread Doug Turnbull
This seems relevant, though admittedly I haven't tried it:

http://stackoverflow.com/questions/10337908/how-to-properly-escape-or-and-and-in-lucene-query

Sent from my Windows Phone

From: ABlaise
Sent: 7/18/2013 9:52 PM
To: java-user@lucene.apache.org
Subject: Searching for words beginning with "or"

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching for words beginning with "or"

2013-07-18 Thread Jack Krupansky
Break your query down into simpler pieces for testing. What pieces seem to 
have what problems? Be specific about the symptom, and how you "know" that 
something is wrong.


You wrote:
stored,indexed,tokenized,omitNorms>.

But... the standard analyzer would have lowercased that term. Did it, or are 
you using some other analyzer?


-- Jack Krupansky

-Original Message- 
From: ABlaise

Sent: Thursday, July 18, 2013 9:19 PM
To: java-user@lucene.apache.org
Subject: Searching for words beginning with "or"




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching for words beginning with "or"

2013-07-18 Thread ABlaise
When I make my query, everything goes well until I add the last part:
(city:or* OR state:or*).
I tried the first solution that was given to me, but escaping as \OR and \AND
doesn't seem to be the solution. The query is actually well built; the parser
has no problem with OR or \OR, since the parsed query looks like this:
+(+(areaType:city areaType:neighborhood areaType:county)
+areaName:portland*) +(city:or* state:or*).
That seems to me a valid query. It's just that it can't seem to find the
'OR' terms *in* the index... it's like they don't exist. And I know this
because if I remove the last dysfunctional part of the query, it finds (among
others) the right document, with the state written in it... It's like it
can't 'see' the 'or' in the index...

As for the upper/lower case, I am using a StandardAnalyzer to index and to
search, and I feed it the states in upper case and it doesn't seem to
change them. Still, I tried to put them in lower case, but it didn't change
anything...

Thanks in advance for your future answers and for the help you already
provided me with.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-for-words-begining-with-or-tp4079018p4079035.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching for words begining with "or"

2013-07-18 Thread Jack Krupansky
Just so you know, the presence of a wildcard in a term means that the term 
will not be analyzed. So state:OR* should fail, since "OR" will not be in 
the index - it would have been indexed as "or" (lowercase). Hmmm... why does 
"or" seem familiar...?


Ah yeah... right!... The standard analyzer includes the standard stop 
filter, which defaults to using this set of stopwords:


final List<String> stopWords = Arrays.asList(
 "a", "an", "and", "are", "as", "at", "be", "but", "by",
 "for", "if", "in", "into", "is", "it",
 "no", "not", "of", "on", "or", "such",
 "that", "the", "their", "then", "there", "these",
 "they", "this", "to", "was", "will", "with"
);

And... "or" is on that list! So, the standard analyzer is removing "or" from 
the index! That's why the query can't find it.


Unless you really want these stop words removed, construct your own analyzer 
that does not do stop word removal.
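
One minimal variant (a sketch, Lucene 4.x) is to hand StandardAnalyzer an 
empty stop set:

// StandardAnalyzer minus stop-word removal: "or" now survives into the index.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43, CharArraySet.EMPTY_SET);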


-- Jack Krupansky

-Original Message- 
From: ABlaise

Sent: Friday, July 19, 2013 12:07 AM
To: java-user@lucene.apache.org
Subject: Re: Searching for words beginning with "or"




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Partial word match using n-grams

2013-07-18 Thread Shai Erera
There are several options:

As Allison suggested, pad your words with ##, so that "quota tom" becomes
"##quota## ##tom##" at indexing time, and the query "quota to" becomes
either "##quota ##to", or if you want to optimize, only pad query terms < 3
characters, so it becomes "quota ##to". That should guarantee you will find
matches even if the user enters one character. Note that it will add more
terms to the index, but I suspect not much. E.g. for English, assuming all
words begin w/ letters you will add 26 ##[a-z] and 676 #[a-z][a-z] terms,
which isn't much. Overall, even for numbers and other languages, I don't
think it will bloat your index and this technique should have good
performance.

You can optimize that further depending on whether you need to match "ta" with
"quota". If not, you don't need to pad with ## at the end of words, only at
the beginning.

If you're worried about index bloat, you can convert queries like "quota
to" to the query "quota to*", i.e. MultiPhraseQuery. You do that only when
words are less than 3 characters. But I think the padding is the better
solution both from coding and performance perspectives.
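
A tiny sketch of the padding idea (helper names are just for illustration):

static String padIndexWord(String w) { return "##" + w + "##"; }                // indexing time
static String padQueryWord(String w) { return w.length() < 3 ? "##" + w : w; } // query time
// "quota to" -> "quota ##to"; the trigrams of "##to" are ##t and #to, which
// also occur in the indexed "##tommy##".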

Shai
