RE: ShingleFilter

2013-07-18 Thread Allison, Timothy B.
Need to set outputUnigrams = false with something like: StandardTokenizer source = new StandardTokenizer(Version.LUCENE_43, reader); TokenStream tokenStream = new StandardFilter(Version.LUCENE_43, source); tokenStream = new LowerCaseFilter(Version.LUCENE_43, tokenStream);
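
To illustrate what turning off unigram output changes, here is a minimal plain-Java sketch. It does not use Lucene and is not how ShingleFilter is implemented; the class and method names (ShingleSketch, shingles) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; this is not Lucene's ShingleFilter.
class ShingleSketch {

    // Builds word n-grams ("shingles") of exactly `size` tokens.
    // When outputUnigrams is true, each single token is emitted as well.
    static List<String> shingles(List<String> tokens, int size, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + size <= tokens.size()) {
                out.add(String.join(" ", tokens.subList(i, i + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("please", "divide", "this", "sentence");
        System.out.println(shingles(tokens, 2, true));   // unigrams and bigrams
        System.out.println(shingles(tokens, 2, false));  // bigrams only
    }
}
```

With the flag off, only the two-word shingles remain, which is the effect the reply is after.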

RE: Partial word match using n-grams

2013-07-18 Thread Allison, Timothy B.
Tommy, I'm sure that I don't fully understand your use case and your data. Some thoughts: 1) I assume that fuzzy term search (edit distance <= 2) isn't meeting your needs or else you wouldn't have gone the ngram route. If fuzzy term search + phrase/proximity search would meet your needs, se

RE: Partial word match using n-grams

2013-07-19 Thread Allison, Timothy B.
at what I'm doing is optimal. But I have been impressed with how easy it is to get something working very quickly! From: Allison, Timothy B. [talli...@mitre.org] Sent: Thursday, July 18, 2013 7:49 PM To: java-user@lucene.apache.org Subject: RE: Partia

RE: Searching for words begining with "or"

2013-07-19 Thread Allison, Timothy B.
If Jack's recommendation for keeping stopwords will work in your use case, this constructor should do the trick: Analyzer analyzer = new StandardAnalyzer(VERSION, CharArraySet.EMPTY_SET) From: Jack Krupansky [j...@basetechnology.com] Sent: Friday, July 19

RE: PhraseQuery Search

2013-08-05 Thread Allison, Timothy B.
Try: http://lucene.apache.org/core/4_4_0/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html -Original Message- From: raghavendra.k@barclays.com [mailto:raghavendra.k@barclays.com] Sent: Friday, August 02, 2013 3:17 PM To: java-user@lucene.apach

RE: Lucene Text Similarity

2013-09-04 Thread Allison, Timothy B.
I agree with Ivan and Koji. You also might want to look into MoreLikeThis, which should take care of finding the highest tf*idf terms for you to use in your query -- http://lucene.apache.org/core/4_4_0/queries/org/apache/lucene/queries/mlt/MoreLikeThis.html Best, Tim _

RE: Lucene Text Similarity

2013-09-04 Thread Allison, Timothy B.
il to find out what the best term to choose. Thanks. 2013/9/4 Allison, Timothy B. : > I agree with Ivan and Koji. You also might want to look into MoreLikeThis, > which should take care of finding the highest tf*idf terms for you to use in > your query -- > http://lucene.apac

FuzzyQuery with short words

2013-09-11 Thread Allison, Timothy B.
All, Apologies if I missed this in the documentation, but should: FuzzyQuery q = new FuzzyQuery(new Term("field", "ab"), 2) retrieve a document that contains: abcd and vice versa. Same question for: xy~1 and a document that contains "x". Will submit test case if this is not a known issue or
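
As a sanity check on the distances in the question, here is a small plain-Java sketch of classic Levenshtein edit distance. It is independent of Lucene (whose FuzzyQuery may also count transpositions, which does not matter for these two pairs), and the class name EditDistanceSketch is hypothetical.

```java
// Plain-Java Levenshtein distance; a sketch for checking the examples,
// not Lucene's automaton-based implementation.
class EditDistanceSketch {

    // Returns the minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("ab", "abcd"));
        System.out.println(levenshtein("xy", "x"));
    }
}
```

Both pairs fall within the edit distances named in the question ("ab" to "abcd" takes two insertions, "xy" to "x" takes one deletion), so the question of whether they should match is a fair one.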

RE: FuzzyQuery with short words

2013-09-12 Thread Allison, Timothy B.
f "ab" or edit distance 1 of "x" then may cause your example "abcd" to rank below the top 50, and be pruned. Mike McCandless http://blog.mikemccandless.com On Wed, Sep 11, 2013 at 9:42 PM, Allison, Timothy B. wrote: > All, > Apologies if I mis

RE: variable string search

2013-09-13 Thread Allison, Timothy B.
Brian, It looks like "variable" is variable; and you'll probably want to use some combination of PhraseQuery, FuzzyQuery and maybe BooleanQuery. I've made my best guess at what the underlying types of Queries would be that would meet your use cases below. "free text" : Doc1, Doc2 :: Phrase

RE: Multiphrase Query in Lucene 4.3

2013-09-27 Thread Allison, Timothy B.
1) An alternate method to your original question would be to do something like this (I haven't compiled or tested this!): Query q = new PrefixQuery(new Term("field", "app")); q = q.rewrite(indexReader); Set<Term> terms = new HashSet<Term>(); q.extractTerms(terms); Term[] arr = terms.toArray(new Term[terms.

RE: docFreq of a Boolean query (LUCENE 4.3)

2013-12-17 Thread Allison, Timothy B.
TotalHitCountCollector? Others on the list may have a more efficient method, but that'd be straightforward. -Original Message- From: Peyman Faratin [mailto:peymanfara...@gmail.com] Sent: Monday, December 16, 2013 10:05 PM To: java-user@lucene.apache.org Subject: docFreq of a Boolean que

RE: Sample Data to Test Lucene

2014-01-16 Thread Allison, Timothy B.
To confirm, Lucene does not perform OCR. (If you are looking for open source java ocr packages, you might take a look here for some ideas: https://issues.apache.org/jira/i#browse/TIKA-93). Are you trying to find a corpus of noisy OCR'd text to use as input into Lucene? If so, this looks pote

RE: Highlighting text, do I seriously have to reimplement this from scratch?

2014-02-04 Thread Allison, Timothy B.
This will be of no immediate help, but in the next iteration of LUCENE-5317, which I'll post in a few weeks (if I can find the time), I'll have an option to pull concordance windows from character offsets which can be stored at index time (so you wouldn't have to re-analyze). The current versio

RE: Wildcard searches

2014-02-06 Thread Allison, Timothy B.
Ditto Jack on ComplexPhraseQueryParser. See also: https://issues.apache.org/jira/i#browse/LUCENE-5205 -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Wednesday, February 05, 2014 6:59 PM To: java-user@lucene.apache.org Subject: Re: Wildcard searches Take a

RE: Wildcard searches

2014-02-06 Thread Allison, Timothy B.
--Original Message----- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, February 06, 2014 8:02 AM To: java-user@lucene.apache.org Subject: RE: Wildcard searches Ditto Jack on ComplexPhraseQueryParser. See also: https://issues.apache.org/jira/i#browse/LUCENE-5205 -Origin

RE: Wildcard searches

2014-02-06 Thread Allison, Timothy B.
ect me to any useful links for ComplexPhraseQueryParser that you may be aware of? I am looking for some examples. Thanks! Regards, Raghu -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Thursday, February 06, 2014 8:02 AM To: java-user@lucene.apache.org S

RE: QueryParser

2014-03-21 Thread Allison, Timothy B.
What analyzer are you using? smartcn? From: kalaik [kalaiselva...@zohocorp.com] Sent: Friday, March 21, 2014 5:10 AM To: java-user@lucene.apache.org Subject: QueryParser Dear Team, we are using lucene in our product , it well searching fo

RE: QueryParser

2014-03-24 Thread Allison, Timothy B.
To expand on Herb's comment, in Lucene, the StandardAnalyzer will break CJK into characters: 1 : 轻 2 : 歌 3 : 曼 4 : 舞 5 : 庆 6 : 元 7 : 旦 If you initialize the classic QueryParser with StandardAnalyzer, the parser will use that Analyzer to break this string into individual characters as above.

RE: Strange behavior of ShingleFilter in Lucene 4.6

2014-04-02 Thread Allison, Timothy B.
I agree entirely with Robert about not doubling up on the filter, wrapper. To stop unigrams, consider setOutputUnigrams(false). -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Wednesday, April 02, 2014 2:50 PM To: java-user Subject: Re: Strange behavior of ShingleFi

RE: Proximity Search for SENTENCE and PARAGRAPH

2014-04-07 Thread Allison, Timothy B.
One simple hack which may or may not meet your objectives: 1) index each paragraph as if it were a document (this would then not allow Boolean across paragraphs, which could be a problem) 2) set the position increment gap to, say, 100 and then index each sentence within the paragraph as another

RE: Question about multi-valued fields

2014-05-20 Thread Allison, Timothy B.
Chris, Good to see you over here. There's probably an easier way... I ran into this with geo queries, and the answer there is to test every value in the multi field for the document that is a hit. For the text search question, though, you could use analysis and then run a SpanQuery against y

RE: SpanQuery not working as expected

2014-06-06 Thread Allison, Timothy B.
Hi Darin, Have you thought about using multivalued fields? If you set the positionIncrementGap to something kind of big (well > 1, say :) ), and you know that your data is always authorfirst, authorlast, you could just search for "darin fulford". The positionincrementgap will prevent matchin
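
A rough plain-Java sketch of why a large position increment gap keeps a phrase from matching across values of a multivalued field. This only simulates token positions; it does not use Lucene, the exact position arithmetic is an assumption, and the names (PositionGapSketch, index, matchesPhrase) and the sample author values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulates token positions in a multivalued field; illustration only.
class PositionGapSketch {

    // Assigns increasing positions to the tokens of each value and
    // skips `gap` extra positions between consecutive values.
    static Map<String, List<Integer>> index(List<String> values, int gap) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int pos = 0;
        for (String value : values) {
            for (String token : value.toLowerCase().split("\\s+")) {
                positions.computeIfAbsent(token, k -> new ArrayList<>()).add(pos);
                pos++;
            }
            pos += gap;
        }
        return positions;
    }

    // True if `second` occurs at the position immediately after `first`,
    // i.e. an exact two-word phrase match.
    static boolean matchesPhrase(Map<String, List<Integer>> positions, String first, String second) {
        List<Integer> secondPositions = positions.getOrDefault(second, Collections.<Integer>emptyList());
        for (int p : positions.getOrDefault(first, Collections.<Integer>emptyList())) {
            if (secondPositions.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> authors = Arrays.asList("darin fulford", "john smith");
        Map<String, List<Integer>> gapped = index(authors, 100);
        System.out.println(matchesPhrase(gapped, "darin", "fulford")); // within one value
        System.out.println(matchesPhrase(gapped, "fulford", "john"));  // across two values
    }
}
```

With a gap of 0 the cross-value pair "fulford john" would look like an adjacent phrase; with a gap of 100 it does not.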

RE: SpanQuery not working as expected

2014-06-09 Thread Allison, Timothy B.
this reference where a specific author contains 'john' for the first name and 'smith' in the last name. I guess I'm curious if what I was doing with the SpanQuery should have worked, whether I misunderstood something, or if this is a bug. Darin.

RE: Exact Phrase Search returning in correct results

2014-06-11 Thread Allison, Timothy B.
StandardAnalyzer with that configuration drops stop words at both index and search time. So, in effect, you really are just searching for "becomes". If your use case requires you to be able to search stop words consider adding CharArraySet.EMPTY_SET to the StandardAnalyzer's initializer. -

RE: Index Not Finding Results some times

2014-06-16 Thread Allison, Timothy B.
The problem is that you are using an analyzer at index time but then not at search time. StandardAnalyzer will convert "Name1" to "name1" at index time. At search time, because you aren't using a query parser (which would by default lowercase your terms) you are literally searching for "Name1"
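
The mismatch described above can be shown with a toy index in plain Java. This is only a sketch: a set of lowercased strings stands in for the index, and the names (CaseMismatchSketch, indexTerms, rawTermLookup, analyzedLookup) are hypothetical, not Lucene APIs.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model of "analyzed at index time, not analyzed at search time".
class CaseMismatchSketch {

    // Index-time analysis: split on whitespace and lowercase each token
    // (standing in for the lowercasing that StandardAnalyzer applies).
    static Set<String> indexTerms(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.split("\\s+")) {
            terms.add(token.toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    // Search without analysis: the term is looked up exactly as typed.
    static boolean rawTermLookup(Set<String> index, String term) {
        return index.contains(term);
    }

    // Search with the same analysis that was applied at index time.
    static boolean analyzedLookup(Set<String> index, String term) {
        return index.contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> index = indexTerms("Name1 Name2");
        System.out.println(rawTermLookup(index, "Name1"));
        System.out.println(analyzedLookup(index, "Name1"));
    }
}
```

The raw lookup for "Name1" misses because only "name1" was stored; applying the same lowercasing at search time finds it.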

RE: Finding words not followed by other words

2014-07-15 Thread Allison, Timothy B.
And if you're looking for a parser, take a look at LUCENE-5205. ["george washington" carver]!~5,5 Find "George Washington" but not if carver appears 5 words before or 5 words after. -Original Message- From: Michael Ryan [mailto:mr...@moreover.com] Sent: Monday, July 14, 2014 9:58 PM To

RE: How to use 'PhraseQuery' with Fuzzy?!

2014-09-23 Thread Allison, Timothy B.
If you're looking for a parser, take a look at ComplexPhraseQueryParser or LUCENE-5205. From: Uwe Schindler [u...@thetaphi.de] Sent: Tuesday, September 23, 2014 6:32 AM To: java-user@lucene.apache.org Subject: RE: How to use 'PhraseQuery' with Fuzzy?! Hi,

RE: multiterm numbers regexp search

2014-12-15 Thread Allison, Timothy B.
If you can't change the analyzer, you can programmatically build a MultiPhraseQuery (you'd have to fill in the alternatives ... not a great option) or a SpanNearQuery composed of span-wrapped RegexpQueries (rewrites are taken care of for you). You might also want to look into using the ComplexP

RE: Proximity query

2015-02-12 Thread Allison, Timothy B.
Might also look at concordance code on LUCENE-5317 and here: https://github.com/tballison/lucene-addons/tree/master/lucene-5317 Let me know if you have any questions. -Original Message- From: Maisnam Ns [mailto:maisnam...@gmail.com] Sent: Thursday, February 12, 2015 11:57 AM To: java-us

RE: Lucene Field Boost

2015-04-30 Thread Allison, Timothy B.
Depending on your version of Lucene, perhaps: http://lucene.apache.org/core/4_10_4/core/org/apache/lucene/document/Field.html#setBoost(float) -Original Message- From: Muhammad Ismail [mailto:it.is.ism...@gmail.com] Sent: Thursday, April 30, 2015 3:22 AM To: java-user@lucene.apache.org Sub

RE: ignore a match in a query

2015-07-24 Thread Allison, Timothy B.
Agree on span query. Might try SpanNotQuery("record", "type", 0, 1)... Find "record" but not if "type" comes one word after "record". If you use LUCENE-5205's SpanQueryParser: "record type"!~0,1 -Original Message- From: Trejkaz [mailto:trej...@trypticon.org] Sent: Thursday, July 23, 2
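
To make the pre/post semantics concrete, here is a plain-Java sketch that applies the same "find X unless Y is nearby" rule to a token list. It is a simplified model for single-term spans, not Lucene's SpanNotQuery, and the names (SpanNotSketch, spanNot) and sample sentence are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified single-term model of a "span not" check; illustration only.
class SpanNotSketch {

    // Returns the positions of `include` for which no `exclude` token occurs
    // within `pre` positions before or `post` positions after the hit.
    static List<Integer> spanNot(List<String> tokens, String include, String exclude, int pre, int post) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(include)) {
                continue;
            }
            int from = Math.max(0, i - pre);
            int to = Math.min(tokens.size() - 1, i + post);
            boolean blocked = false;
            for (int j = from; j <= to; j++) {
                if (j != i && tokens.get(j).equals(exclude)) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "record", "type", "is", "set", "on", "each", "record");
        // "record" at position 1 is followed by "type", so only position 7 survives.
        System.out.println(spanNot(tokens, "record", "type", 0, 1));
    }
}
```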

extracting charoffsets from SpanWeight's getSpans() in 5.3.1?

2015-11-02 Thread Allison, Timothy B.
All, I'm trying to find all spans in a given String via stored offsets in Lucene 5.3.1. I wanted to use the Highlighter with a NullFragmenter, but that is highlighting only the matching terms, not the full Spans (related to LUCENE-6796?). My current code iterates through the spans, stores

RE: extracting charoffsets from SpanWeight's getSpans() in 5.3.1?

2015-11-03 Thread Allison, Timothy B.
ollectLeaf() is the position, rather than an index of any kind, which I think is going to mess things up for you. But other than that, you've got the right idea. :-) Alan Woodward www.flax.co.uk On 3 Nov 2015, at 00:26, Allison, Timothy B. wrote: > All, > > I'm trying to fi

RE: Wild card search not working

2015-11-30 Thread Allison, Timothy B.
If you want to find the matching terms, you have to do something like this: Query rewritten = spanTerm.rewrite(indexReader); Weight w = rewritten.createWeight(isearcher, false); Set<Term> terms = new HashSet<>(); w.extractTerms(terms); for (Ter

RE: Wild card search not working

2015-11-30 Thread Allison, Timothy B.
I'm getting this (with a single document that contains the word 'quartz'): Term freq(indexReader.totalTermFreq(term))=0 Term freq(indexReader.getSumTotalTermFreq("Doc"))=1 totalHits = 1 termStatics=0 Is this what you're getting? So...the search is working, but the term counts aren't returning wh

RE: Highlighting deprecation?

2015-12-02 Thread Allison, Timothy B.
Y, to add to Scott's advice, make sure to use the NullFragmenter and make sure to setExpandMultiTermQuery to true on your scorer QueryScorer scorer = new QueryScorer(query, field); scorer.setExpandMultiTermQuery(true); If you need to highlight entire phrases, see Koji Sekiguchi

RE: TermRangeQuery with Proximity

2015-12-08 Thread Allison, Timothy B.
And, if you're looking for a parser, take a look at LUCENE-5205's parser, available as a standalone on github [0]. The syntax for the query mentioned in archived link would be: "microsoft [belgium TO spain]" [0] https://github.com/tballison/lucene-addons -Original Message- From: Uwe Sc

different handling of multiterm within a SpanNot Query in 5.3.1 vs 5.4.0?

2015-12-14 Thread Allison, Timothy B.
Great to see 5.4.0 is out. I tried to update my fork of LUCENE-5205, and found that multiterms within a SpanNotQuery don't seem to be processed correctly. [fever bieb*]!~2,5 Find "fever" but not if a multiterm hit on bieb* appears within 2 words before or 5 words after. In 5.3.1, this worked

migrating to 6.0 -- how to apply filter to getSpans

2016-04-12 Thread Allison, Timothy B.
On the living github version of LUCENE-5317, I'm trying to migrate to 6.0, and most is fairly clear. However, how do I modify the following code to return spans only from documents that match the -Filter- Query. For each LeafReaderContext, I used to get a DocIdSet, call the iterator on that, a

RE: migrating to 6.0 -- how to apply filter to getSpans

2016-05-23 Thread Allison, Timothy B.
ator.empty())) { continue; } boolean cont = visitLeafReader(ctx, spans, filterItr, visitor); ... } -Original Message----- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, April 12, 2016 10:07 AM To: java-user@lucene.apache.org Subject: migrat

RE: analyzers-common VS analyzers-icu

2016-06-01 Thread Allison, Timothy B.
That package has an ICU tokenizer and the ICUFoldingFilter. The ICUFoldingFilter does advanced (well, Unicode compliant) case folding/lowercasing/normalization and is critical for non-ascii languages. You can use that in place of the AsciiFoldingFilter and the LowerCaseFilter, and it should

RE: SpanQuery - How to wrap a NOT subquery

2016-06-20 Thread Allison, Timothy B.
Bouncing over to user’s list. As you’ve found, spans are different from regular queries. MUST_NOT at the BooleanQuery level means that the term must not appear anywhere in the document; whereas spans focus on terms near each other. Have you tried SpanNotQuery? This would allow you at least to

RE: New type of proximity/fuzzy search

2016-08-31 Thread Allison, Timothy B.
Unfortunately, that does require a new type of query. As you probably know, you can do the "at least" (minimum number should match) with regular BooleanQueries, but you can't yet do the "at least" with SpanQuery. You might want to look at modifying the SpanOrQuery to get this functionality. I

RE: New type of proximity/fuzzy search

2016-08-31 Thread Allison, Timothy B.
Doh, sorry, Uwe, didn't see your response first. Scratch SpanOr, take a look at SpanNear. This would be a great capability to have! -Original Message- From: Allison, Timothy B. Sent: Wednesday, August 31, 2016 3:30 PM To: java-user@lucene.apache.org Subject: RE: New type of prox

RE: New type of proximity/fuzzy search

2016-09-01 Thread Allison, Timothy B.
https://issues.apache.org/jira/browse/LUCENE-7434 -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Wednesday, August 31, 2016 3:41 PM To: java-user@lucene.apache.org Subject: RE: New type of proximity/fuzzy search Doh, sorry, Uwe, didn't see your res

RE: Cooccurrence matrices

2016-09-19 Thread Allison, Timothy B.
Take a look at LUCENE-5317 [1] and LUCENE-5318 [2]. They're available on my github site [3], and I've pushed them to maven central [4]. LUCENE-5318 is crazily useful as a term/phrase recommender system. I haven't documented either very well yet. I'll try to add documentation to my github site

RE: How to get the terms matching a WildCardQuery in Lucene 6.2?

2016-10-24 Thread Allison, Timothy B.
Make sure to setRewriteMethod on the MultiTermQuery to: MultiTermQuery.SCORING_BOOLEAN_REWRITE or CONSTANT_SCORE_BOOLEAN_REWRITE Then something like this should work: q = q.rewrite(reader); Set<Term> terms = new HashSet<>(); Weight weight = q.createWeight(searcher, false);

RE: How to get the terms matching a WildCardQuery in Lucene 6.2?

2016-10-25 Thread Allison, Timothy B.
for (int i = start; i < end; i++) { Document doc = searcher.doc(hits[i].doc); String path = doc.get("path"); System.out.println((i + 1) + ". " + path); query.rewrite(reader); }

RE: query parser of SpanNearQuery

2016-12-05 Thread Allison, Timothy B.
Not part of Lucene, but take a look at LUCENE-5205 [1], which I actively maintain on github [2]. And, you can integrate via maven [3] See the jira issue for an overview of the query syntax, and let me know if you have any questions. [1] https://issues.apache.org/jira/browse/LUCENE-5205 [2] h

RE: calculate term co-occurrence matrix

2017-03-20 Thread Allison, Timothy B.
I have code as part of LUCENE-5318 that counts terms that cooccur within a window of where your query terms appear. This makes a really useful query term recommender, and the math is dirt simple. INPUT Doc1: quick brown fox jumps over the lazy dog Doc2: quick green fox leaps over the lazy dog
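
A minimal plain-Java sketch of window-based co-occurrence counting, using the two sample documents above. It shows only raw counts around a target term; whatever statistics LUCENE-5318 applies on top are not reproduced here, and the names (CooccurrenceSketch, countAround) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Raw co-occurrence counts within a token window; illustration only.
class CooccurrenceSketch {

    // Counts every token that appears within `window` positions
    // (before or after) of each occurrence of `target`.
    static Map<String, Integer> countAround(List<String> docs, String target, int window) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : docs) {
            String[] tokens = doc.split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                if (!tokens[i].equals(target)) {
                    continue;
                }
                int from = Math.max(0, i - window);
                int to = Math.min(tokens.length - 1, i + window);
                for (int j = from; j <= to; j++) {
                    if (j != i) {
                        counts.merge(tokens[j], 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "quick brown fox jumps over the lazy dog",
                "quick green fox leaps over the lazy dog");
        System.out.println(countAround(docs, "fox", 2));
    }
}
```

For the target "fox" with a window of 2, "quick" and "over" are each counted twice (once per document), while "brown", "green", "jumps", and "leaps" are each counted once.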

RE: Correction: SpanNearQuery Class issue through spans object (Not through Searcher.search() method)

2017-06-20 Thread Allison, Timothy B.
As an example of Mikhail's suggestion: https://github.com/tballison/lucene-addons/blob/master/lucene-5317/src/main/java/org/apache/lucene/search/concordance/charoffsets/SpansCrawler.java If you are trying to build a concordance, see ConcordanceSearcher in that package. See examples on how to ru

RE: Extending Analyzer at runtime

2017-06-23 Thread Allison, Timothy B.
I plagiarized Solr's org.apache.solr.analysis.TokenizerChain to read the configuration from a json file: https://github.com/tballison/lucene-addons/blob/6.x/gramreaper/src/main/java/org/tallison/gramreaper/ingest/schema/MyTokenizerChain.java I wouldn't recommend using anything in gramreaper just

RE: Extending Analyzer at runtime

2017-06-23 Thread Allison, Timothy B.
need to write your own one. Uwe - Uwe Schindler Achterdiek 19, D-28357 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Allison, Timothy B. [mailto:talli...@mitre.org] > Sent: Friday, June 23, 2017 3:55 PM > To: java-user@lucene.ap

ICUFoldingFilter loading in IDE, but not jar ?!

2017-08-15 Thread Allison, Timothy B.
In Intellij, when I run unit tests in my app that uses Lucene (6.6.0) and the ICUFoldingFilterFactory, I see 96 filter factories available via TokenFilterFactory.availableTokenFilters(). When I run the same code from a jar built with the maven shade plugin, and I confirm that the jar actually

RE: ICUFoldingFilter loading in IDE, but not jar ?!

2017-08-15 Thread Allison, Timothy B.
never mind...overwriting service file... -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Tuesday, August 15, 2017 10:36 PM To: java-user@lucene.apache.org Subject: ICUFoldingFilter loading in IDE, but not jar ?! In Intellij, when I run unit tests in my

FW: PointValues ordering

2018-02-26 Thread Allison, Timothy B.
Prob better question for user list. From: Dominik Safaric [mailto:dominiksafa...@gmail.com] Sent: Monday, February 26, 2018 1:20 PM To: d...@lucene.apache.org Subject: PointValues ordering Given a multi-valued and non-indexed point value field, how does Lucene internally store this kind of field