RE: StandardTokenizer generation from JFlex grammar

2012-10-04 Thread Steven A Rowe
Hi Phani, Assuming you're using Lucene 3.6.X, see: and

RE: Using stop words with snowball analyzer and shingle filter

2012-09-19 Thread Steven A Rowe
Hi Martin, SnowballAnalyzer was deprecated in Lucene 3.0.3 and will be removed in Lucene 5.0. Looks like you're using Lucene 3.X; here's an (untested) Analyzer based on Lucene 3.6 EnglishAnalyzer (except substituting SnowballFilter for PorterStemmer; disabling stopword holes' position increm

RE: ReferenceManager.maybeRefreshBlocking() should not be declared throwing InterruptedException

2012-07-21 Thread Steven A Rowe
Hi Vitaly, Info here should help you set up snapshot dependencies: http://wiki.apache.org/lucene-java/NightlyBuilds Steve -Original Message- From: Vitaly Funstein [mailto:vfunst...@gmail.com] Sent: Saturday, July 21, 2012 9:22 PM To: java-user@lucene.apache.org Subject: Re: ReferenceMa

RE: RAMDirectory and expungeDeletes()/optimize()

2012-07-11 Thread Steven A Rowe
Nabble silently drops content from email sent through their interface on a regular basis. I've told them about it multiple times. My suggestion: find another way to post to this mailing list. -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesda

RE: how to remove the dash

2012-06-25 Thread Steven A Rowe
I added the following to both TestStandardAnalyzer and TestClassicAnalyzer in branches/lucene_solr_3_6/, and it passed in both cases: public void testWhitespaceHyphenWhitespace() throws Exception { BaseTokenStreamTestCase.assertAnalyzesTo (a, "drinks - water", new String[]{"drinks", "

RE: [MAVEN] Heads up: build changes

2012-05-09 Thread Steven A Rowe
nux localhost 2.6.39 #4 SMP Sun Aug 21 13:53:29 PDT 2011 x86_64 > Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz GenuineIntel GNU/Linux > > > On 08/05/12 11:24, Steven A Rowe wrote: >> Hi Greg, >> >> I don't see that problem - 'ant generate-maven-artifacts' j

RE: [MAVEN] Heads up: build changes

2012-05-08 Thread Steven A Rowe
ava:809) [copy] at org.apache.tools.ant.Main.startAnt(Main.java:217) [copy] at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) [copy] at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) On 08/05/12 10:31, Steven A Rowe wrote: > If you use the Luce

[MAVEN] Heads up: build changes

2012-05-08 Thread Steven A Rowe
If you use the Lucene/Solr Maven POMs to drive the build, I committed a major change last night (see https://issues.apache.org/jira/browse/LUCENE-3948 for more details): * 'ant get-maven-poms' no longer places pom.xml files under the lucene/ and solr/ directories. Instead, they are placed in a

RE: Highlighter and Shingles...

2012-04-20 Thread Steven A Rowe
Hi Dawn, Can you give an example of a "partial match"? Steve -Original Message- From: Dawn Zoë Raison [mailto:d...@digitorial.co.uk] Sent: Friday, April 20, 2012 7:59 AM To: java-user@lucene.apache.org Subject: Highlighter and Shingles... Hi, Are there any notes on making the highligh

RE: Two questions on RussianAnalyzer

2012-04-19 Thread Steven A Rowe
Hi Vladimir, > The most uncomfortable in new behaviour to me is that in past I used > to search by subdomain like bbb.com: and have displayed results > with www.bbb.com:, aaa.bbb.com: and so on. Now I have 0 > results. About domain names, see my response to a similar question today on

RE: Partial word match

2012-04-09 Thread Steven A Rowe
Hi Hanu, Depending on the nature of the partial word match you're looking for - do you want to only match partial words that match at the beginning of the word? - you should look either at NGramTokenFilter or EdgeNGramTokenFilter:
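The distinction Steve draws can be illustrated outside Lucene (this is not the NGramTokenFilter/EdgeNGramTokenFilter implementation, just a minimal sketch of the idea): full n-grams are taken at every position in the word, while edge n-grams grow only from the start of the word, so they only support prefix-style partial matching.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the difference described above: full n-grams slide over the
// whole token; edge n-grams only grow from the beginning of the token.
class NGrams {
    static List<String> ngrams(String term, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            out.add(term.substring(i, i + n));
        }
        return out;
    }

    static List<String> edgeNgrams(String term, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int n = min; n <= max && n <= term.length(); n++) {
            out.add(term.substring(0, n));
        }
        return out;
    }
}
```

For "search", full bigrams produce every adjacent pair, while edge n-grams (min=2, max=4) produce only the prefixes — which is why EdgeNGramTokenFilter is the one to pick when matches should anchor at the start of the word.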

RE: HTML tags and Lucene highlighting

2012-04-05 Thread Steven A Rowe
tags (in the field configured to use HTMLStripCharFilter, anyway). So HTMLStripCharFilter should do what you want. Steve From: okayndc [mailto:bodymo...@gmail.com] Sent: Thursday, April 05, 2012 3:36 PM To: Steven A Rowe Cc: java-user@lucene.apache.org Subject: Re: HTML tags and Lucene highlig

RE: HTML tags and Lucene highlighting

2012-04-05 Thread Steven A Rowe
Hi okayndc, What *do* you want? Steve -Original Message- From: okayndc [mailto:bodymo...@gmail.com] Sent: Thursday, April 05, 2012 1:34 PM To: java-user@lucene.apache.org Subject: HTML tags and Lucene highlighting Hello, I currently use Lucene version 3.0...probably need to upgrade to

RE: Lucene tokenization

2012-03-27 Thread Steven A Rowe
Hi Nilesh, Which version of Lucene are you using? StandardTokenizer behavior changed in v3.1. Steve -Original Message- From: Nilesh Vijaywargiay [mailto:nilesh.vi...@gmail.com] Sent: Tuesday, March 27, 2012 2:04 PM To: java-user@lucene.apache.org Subject: Lucene tokenization I have a

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
On 3/26/2012 at 12:21 PM, Ilya Zavorin wrote: > I am not seeing anything suspicious. Here's what I see in the HEX: > > "n.e" from "pain.electricity": 6E-2E-0D-0A-0D-0A-65 > (n-.-CR-LF-CR-LF-e) "e.H" from "sentence.He": 65-2E-0D-0A-48 I agree, standard DOS/Windows line endings. > I am pretty sure

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
hey are ASCII. I need to handle foreign text so I assume all files that I index are UTF8. I am using the standard analyzer for English text and other contributed analyzers for respective foreign texts Thanks, Ilya -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Mon

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
Hi Ilya, What analyzers are you using at index-time and query-time? My guess is that you're using an analyzer that includes punctuation in the tokens it emits, in which case your index will have things like "sentence." and "sentence?" in it, so querying for "sentence" will not match. Luke can

RE: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Steven A Rowe
You want the lucene-queryparser jar. From trunk MIGRATE.txt: * LUCENE-3283: Lucene's core o.a.l.queryParser QueryParsers have been consolidated into module/queryparser, where other QueryParsers from the codebase will also be placed. The following classes were moved: - o.a.l.queryParser.Cha

RE: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Steven A Rowe
IndexReader.openIfChanged in Lucene 4.0? On Mon, Mar 5, 2012 at 11:07 AM, Steven A Rowe wrote: > The second item in the top section in trunk CHANGES.txt (back compat policy > changes): Could you guys put this on the web site (or a link to it)? Or try to get it to SEO more prominently? > > *

RE: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Steven A Rowe
The second item in the top section in trunk CHANGES.txt (back compat policy changes): * LUCENE-2858, LUCENE-3733: IndexReader was refactored into abstract AtomicReader, CompositeReader, and DirectoryReader. To open Directory- based indexes use DirectoryReader.open(), the corresponding method

RE: Customizing indexing of large files

2012-02-27 Thread Steven A Rowe
PatternReplaceCharFilter would probably work, or maybe a custom CharFilter? *CharFilter has the advantage of preserving original text offsets, for highlighting. Steve > -Original Message- > From: Glen Newton [mailto:glen.new...@gmail.com] > Sent: Monday, February 27, 2012 12:57 PM > To

RE: StandardAnalyzer and Email Addresses

2012-02-26 Thread Steven A Rowe
There is no Analyzer implementation because no one ever made one :). Copy-pasting StandardAnalyzer and substituting UAX29URLEmailTokenizer wherever StandardTokenizer appears should do the trick. Because people often want to be able to search against *both* whole email addresses and URLs *and*

RE: Can I just add ShingleFilter to my nalayzer used for indexing and searching

2012-02-21 Thread Steven A Rowe
Hi Paul, Lucene QueryParser splits on whitespace and then sends individual words one-by-one to be analyzed. All analysis components that do their work based on more than one word, including ShingleFilter and SynonymFilter, are borked by this. (There is a JIRA issue open for the QueryParser pr

RE: Maven repository for lucene trunk

2012-02-14 Thread Steven A Rowe
Hi Sudarshan, I think this wiki page has the info you want: Steve > -Original Message- > From: sudarsh...@gmail.com [mailto:sudarsh...@gmail.com] On Behalf Of > Sudarshan Gaikaiwari > Sent: Tuesday, February 14, 2012 10:01 PM

RE: Access next token in a stream

2012-02-09 Thread Steven A Rowe
-Original Message- > From: Damerian [mailto:dameria...@gmail.com] > Sent: Thursday, February 09, 2012 5:00 PM > To: java-user@lucene.apache.org > Subject: Re: Access next token in a stream > > On 9/2/2012 10:51 PM, Steven A Rowe wrote: > > Damerian, > > &

RE: Access next token in a stream

2012-02-09 Thread Steven A Rowe
> Sent: Thursday, February 09, 2012 4:15 PM > To: java-user@lucene.apache.org > Subject: Re: Access next token in a stream > > On 9/2/2012 8:54 PM, Steven A Rowe wrote: > > Hi Damerian, > > > > One way to handle your scenario is to hold on to the previous tok

RE: Access next token in a stream

2012-02-09 Thread Steven A Rowe
Hi Damerian, One way to handle your scenario is to hold on to the previous token, and only emit a token after you reach at least the second token (or at end-of-stream). Your incrementToken() method could look something like: 1. Get current attributes: input.incrementToken() 2. If previous toke
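The buffering pattern described in those steps can be sketched outside Lucene's TokenFilter API with a plain iterator wrapper (the class and method names here are illustrative, not Lucene's; real code would subclass TokenFilter and work with incrementToken() and attributes):

```java
import java.util.Iterator;

// Hypothetical sketch of "hold on to the previous token": a token is only
// emitted once its successor has been seen (or at end-of-stream), so the
// filter always knows both the current token and the one after it.
class PairwiseFilter {
    private final Iterator<String> input;
    private String previous; // token held back until we see its successor

    PairwiseFilter(Iterator<String> input) {
        this.input = input;
    }

    /** Returns the next output token, or null at end-of-stream. */
    String next() {
        while (input.hasNext()) {
            String current = input.next();
            if (previous == null) { // first token: hold it, keep reading
                previous = current;
                continue;
            }
            // Both the held token and its successor are now known; emit the
            // held one (a real filter could inspect `current` before deciding).
            String emit = previous;
            previous = current;
            return emit;
        }
        // End of stream: flush the last held token, if any.
        String emit = previous;
        previous = null;
        return emit;
    }
}
```

The key property is that emission lags input by one token, which is exactly what gives a filter access to "the next token" without consuming it from the caller's point of view.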

RE: Analysers for newspaper pages...

2011-11-28 Thread Steven A Rowe
Hi Dawn, I assume that when you refer to "the impact of stop words," you're concerned about query-time performance? You should consider the possibility that performance without removing stop words is good enough that you won't have to take any steps to address the issue. That said, there are

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-19 Thread Steven A Rowe
Hi Paul, What version of Lucene are you using? The JFlex spec you quote below looks pre-v3.1? Steve > -Original Message- > From: Paul Taylor [mailto:paul_t...@fastmail.fm] > Sent: Wednesday, October 19, 2011 6:50 AM > To: Steven A Rowe; java-user@lucene.apache.org &g

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-19 Thread Steven A Rowe
Hi Paul, On 10/19/2011 at 5:26 AM, Paul Taylor wrote: > On 18/10/2011 15:25, Steven A Rowe wrote: > > On 10/18/2011 at 4:57 AM, Paul Taylor wrote: > > > On 18/10/2011 06:19, Steven A Rowe wrote: > > > > Another option is to create a char filter that substitutes

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-18 Thread Steven A Rowe
Hi Paul, On 10/18/2011 at 4:57 AM, Paul Taylor wrote: > On 18/10/2011 06:19, Steven A Rowe wrote: > > Another option is to create a char filter that substitutes > > PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, > > etc., > > Yes that is how I firs

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-17 Thread Steven A Rowe
Hi Paul, You could add a rule to the StandardTokenizer JFlex grammar to handle this case, bypassing its other rules. Another option is to create a char filter that substitutes PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, etc., but only when the entire input consists excl

RE: setting MaxFieldLength in indexwriter

2011-09-28 Thread Steven A Rowe
Hi Peyman, The API docs give a hint : = Nested Class Summary ... static class IndexWriter.MaxFieldLength Deprecated. use LimitTokenCountAnalyzer instead. =

RE: Enabling indexing of hyphenated terms sans the hyphen

2011-09-19 Thread Steven A Rowe
Hi sbs, Solr's WordDelimiterFilterFactory does what you want. You can see a description of its function here: . WordDelimiterFilter, the filter class implementing the above factory's functionality, is

RE: 4.0-SNAPSHOT in maven repo via Jenkins?

2011-07-25 Thread Steven A Rowe
Hi Eric, On 7/24/2011 at 3:07 AM, Eric Charles wrote: > Jenkins jobs builds lucene trunk with 'mvn --batch-mode > --non-recursive -Pboot

RE: Some question about Lucene

2011-07-10 Thread Steven A Rowe
This slide show is a few years old, but I think it might be a good introduction for you to the differences between the projects: http://www.slideshare.net/dnaber/apache-lucene-searching-the-web-and-everything-else-jazoon07/ Steve -Original Message- From: Ing. Yusniel Hidalgo Delgado [ma

RE: how are built the packages in the maven repository?

2011-07-06 Thread Steven A Rowe
Ant is the official Lucene/Solr build system. Snapshot and release artifacts are produced with Ant. While Maven is capable of producing artifacts, the artifacts produced in this way may not be the same as the official Ant artifacts. For this reason: no, the artifacts should not be built with

RE: Lucene Simple Project

2011-06-18 Thread Steven A Rowe
Hi Hamada, Do you know about the Lucene demo?: http://lucene.apache.org/java/3_2_0/demo.html Steve > -Original Message- > From: hamadazahera [mailto:hamadazah...@gmail.com] > Sent: Saturday, June 18, 2011 9:30 AM > To: java-user@lucene.apache.org > Subject: Lucene Simple Project > > He

RE: Bug fix to contrib/.../IndexSplitter

2011-06-09 Thread Steven A Rowe
Hi Ivan, You do have rights to submit fixes to Lucene - everyone does! Here's how: http://wiki.apache.org/lucene-java/HowToContribute Please create a patch, create an issue in JIRA, and then attach the patch to the JIRA issue. When you do this, you are asked to state that you grant license to

RE: FastVectorHighlighter StringIndexOutofBounds bug

2011-05-22 Thread Steven A Rowe
Hi WeiWei, Thanks for the report. Can you provide a self-contained unit test that triggers the bug? Thanks, Steve > -Original Message- > From: Weiwei Wang [mailto:ww.wang...@gmail.com] > Sent: Monday, May 23, 2011 1:25 AM > To: java-user@lucene.apache.org > Subject: FastVectorHighlight

RE: Query Parser, Unary Operators and Multi-Field Query

2011-05-20 Thread Steven A Rowe
Hi Renaud, On 5/20/2011 at 1:58 PM, Renaud Delbru wrote: > As said in > , > "if one or more of the terms in a term list has an explicit term operator > (+ or - or relational operator) the rest of the terms will be treated as

RE: Query Parser, Unary Operators and Multi-Field Query

2011-05-20 Thread Steven A Rowe
Hi Renaud, That's normal behavior, since you have AND as default operator. This is equivalent to placing a "+" in front of every element of your query. In fact, if you removed the other two "+"s, you would get the same behavior. I think you'll get what you want by just switching the default

RE: Lucene 3.3 in Eclipse

2011-05-15 Thread Steven A Rowe
lto:zhoucheng2...@gmail.com] > Sent: Sunday, May 15, 2011 10:48 AM > To: java-user@lucene.apache.org > Cc: Steven A Rowe > Subject: RE: Lucene 3.3 in Eclipse > > Steve, thanks for correction. You are right. The version is 3.0.3 > released last Oct. > > I did place an ant

RE: Lucene 3.3 in Eclipse

2011-05-15 Thread Steven A Rowe
Hi Cheng, Lucene 3.3 does not exist - do you mean branches/branch_3x ? FYI, as of Lucene 3.1, there is an Ant target you can use to setup an Eclipse project for Lucene/Solr - run this from the top level directory of a full source tree (including dev-tools/ directory) checked out from Subversio

RE: Can I omit ShingleFilter's filler tokens

2011-05-12 Thread Steven A Rowe
Message- > From: Robert Muir [mailto:rcm...@gmail.com] > Sent: Thursday, May 12, 2011 1:15 PM > To: java-user@lucene.apache.org > Subject: Re: Can I omit ShingleFilter's filler tokens > > On Thu, May 12, 2011 at 1:03 PM, Steven A Rowe wrote: > > A thought: one way to do #

RE: Can I omit ShingleFilter's filler tokens

2011-05-12 Thread Steven A Rowe
A thought: one way to do #1 without modifying ShingleFilter: if there were a StopFilter variant that accepted regular expressions instead of a stopword list, you could configure it with a regex like /_ .*|.* _| _ / (assuming a full match is required, i.e. implicit beginning and end anchors), and
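With java.util.regex full-match semantics, the pattern suggested in that message behaves as follows on bigram shingles (the underscore is ShingleFilter's default filler token; the class here is just a demo harness, not the proposed StopFilter variant itself):

```java
import java.util.regex.Pattern;

// The full-match pattern from the message: flags any shingle that begins
// or ends with the filler token "_", so such shingles can be dropped.
class FillerMatch {
    static final Pattern FILLER = Pattern.compile("_ .*|.* _| _ ");

    static boolean isFillerShingle(String shingle) {
        return FILLER.matcher(shingle).matches();
    }
}
```

So for text "one two three four five" with stopword "three", the bigrams "two _" and "_ four" would be flagged and removed, leaving only the shingles built from real words.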

RE: Can I omit ShingleFilter's filler tokens

2011-05-11 Thread Steven A Rowe
> > On Wed, May 11, 2011 at 8:27 AM, Steven A Rowe wrote: > > Hi Bill, > > > > I can think of two possible interpretations of "removing filler > tokens": > > > > 1. Don't create shingles across stopwords, e.g. for text "one two three >

RE: Can I omit ShingleFilter's filler tokens

2011-05-11 Thread Steven A Rowe
Hi Bill, I can think of two possible interpretations of "removing filler tokens": 1. Don't create shingles across stopwords, e.g. for text "one two three four five" and stopword "three", bigrams only, you'd get ("one two", "four five"), instead of the current ("one two", "two _", "_ four", "fou

RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Thanks Dawid. – Steve From: dawid.we...@gmail.com [mailto:dawid.we...@gmail.com] On Behalf Of Dawid Weiss Sent: Friday, April 29, 2011 4:45 PM To: java-user@lucene.apache.org Cc: Steven A Rowe Subject: Lucene 3.0.3 with debug information This is the e-mail you're looking for, Steven (it w

RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Hi Paul, On 4/29/2011 at 4:14 PM, Paul Taylor wrote: > On 29/04/2011 16:03, Steven A Rowe wrote: > > What did you find about Luke that's buggy? Bug reports are very > > useful; please contribute in this way. > > Please see previous post, in summary mistake on my par

RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Hi Paul, What did you find about Luke that's buggy? Bug reports are very useful; please contribute in this way. The official Lucene 3.0.3 distribution jars were compiled using the -g cmdline argument to javac - by default, though, only line number and source file information is generated. If

RE: lucene 3.0.3 | QueryParser | MultiFieldQueryParser

2011-04-27 Thread Steven A Rowe
Ranjit, The problem is definitely the analyzer you are passing to QueryParser or MultiFieldQueryParser, and not the parser itself. The following tests succeed using KeywordAnalyzer, which is a pass-through analyzer (the output is the same as the input): public void testSharpQP() throws Excep

RE: lucene 3.0.3 | QueryParser | MultiFieldQueryParser

2011-04-26 Thread Steven A Rowe
Hi Ranjit, I suspect the problem is not QueryParser, since the definition includes the '#' character (from ): | <#_TERM_START_CHAR: ( ~[ " ", "\t", "\n",

RE: lucene 3.0.3 | searching problem with *.docx file

2011-04-12 Thread Steven A Rowe
Hi Ranjit, Do you know about Luke? It will let you see what's in your index, and much more: http://code.google.com/p/luke/ Steve > -Original Message- > From: Ranjit Kumar [mailto:ranjit.ku...@otssolutions.com] > Sent: Tuesday, April 12, 2011 9:05 AM > To: java-user-h...@lucene.apa

RE: word + ngram tokenization

2011-04-05 Thread Steven A Rowe
Hi Shambhu, ShingleFilter will construct word n-grams: http://lucene.apache.org/java/3_1_0/api/contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html Steve > -Original Message- > From: sham singh [mailto:shamsing...@gmail.com] > Sent: Tuesday, April 05, 2011 5:53 PM > T

RE: Lucene 3.1

2011-04-05 Thread Steven A Rowe
Hi Tanuj, Can you be more specific? What file did you download? (Lucene 3.1 has three downloadable packages: -src.tar.gz, .tar.gz, and .zip.) What did you expect to find that is not there? (Some examples would help.) Steve > -Original Message- > From: Tanuj Jain [mailto:tanujjain.

RE: lucene-snowball 3.1.0 packages are missing?

2011-04-03 Thread Steven A Rowe
Hi Alex, From Lucene contrib CHANGES.html : 3. LUCENE-2226: Moved contrib/snowball functionality into contrib/analyzers. Be sure to remove any old obsolete l

RE: WhitespaceAnalyzer in Lucene nightly build ?

2011-03-04 Thread Steven A Rowe
Hi Patrick, The Jenkins (formerly Hudson) nightly Ant builds do not produce the jar containing WhitespaceAnalyzer. This is not intentional - I just created an issue to track fixing the problem: . The nightly Maven JARs Uwe pointed you to are

RE: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Steven A Rowe
> [x] ASF Mirrors (linked in our release announcements or via the Lucene > website) > > [x] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) > > [x] I/we build them from source via an SVN/Git checkout.

RE: lucene-based log searcher?

2011-01-13 Thread Steven A Rowe
Hi Paul, I saw this yesterday, but haven't tried it myself: http://karussell.wordpress.com/2010/10/27/feeding-solr-with-its-own-logs/ The author has a project called "Sogger" - Solr + Logger? - that can read various forms of logs. Steve > -Original Message- > From: Paul Libbrecht [mai

RE: Re: Scale up design

2010-12-22 Thread Steven A Rowe
On 12/22/2010 at 2:38 AM, Ganesh wrote: > Any other tips targeting 64 bit? If memory usage is an issue, you might consider using HotSpot's "compressed oops" option:

RE: tokensFromAnalysis

2010-12-02 Thread Steven A Rowe
Lewis, Simon asked about the version of Lucene you're using because this section of the API has seen regular change. If you don't tell us which version, we can't help, because we don't know what you're coding against. Steve > -Original Message- > From: McGibbney, Lewis John [mailto:le

RE: Analyzer

2010-11-29 Thread Steven A Rowe
Hi Manjula, It's not terribly clear what you're doing here - I got lost in your description of your (two? or maybe four?) classes. Sometimes things are easier to understand if you provide more concrete detail. I suspect that you could benefit from reading the book Lucene in Action, 2nd editio

RE: IndexWriters and write locks

2010-11-10 Thread Steven A Rowe
NFS[1] != NTFS[2] [1] NFS: [2] NTFS: > -Original Message- > From: Pulkit Singhal [mailto:pulkitsing...@gmail.com] > Sent: Wednesday, November 10, 2010 2:55 PM > To: java-user@lucene.apach

RE: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Steven A Rowe
t; > A good suggestion. But I'm using Lucene 3.0.2 and the constructor for a > StandardAnalyzer has Version_30 as its highest value. Do you know when 3.1 > is due? > > -Original Message- > From: Steven A Rowe [mailto:sar...@syr.edu] > Sent: 24 Oct 2010 21 31 > T

RE: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Steven A Rowe
Hi Martin, StandardTokenizer and -Analyzer have been changed, as of version 3.1 (the next release), to support the Unicode segmentation rules in UAX#29. My (untested) guess is that your hyphenated word will be kept as a single token if you set the version to 3.1 or higher in the construc

RE: Issue with sentence specific search

2010-10-07 Thread Steven A Rowe
Hi Sirish, StandardTokenizer does not produce a token from '#', as you suspected. Something that fits the "word" definition, but which won't ever be encountered in your documents, is what you should use for the delimiter - something like a1b2c3c2b1a . Sentence boundary handling is clunky in L
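The sentinel-delimiter idea amounts to joining sentences with an unsearchable "word" before indexing, so that span queries can later be kept from matching across sentence boundaries. A minimal sketch (the sentinel string is the example from the message; the class name is illustrative):

```java
import java.util.List;

// Join sentences with a sentinel token before indexing. StandardTokenizer
// will keep "a1b2c3c2b1a" as a single token, and it will never occur in
// real text, so a SpanNotQuery can exclude spans that contain it.
class SentenceJoiner {
    static final String SENTINEL = "a1b2c3c2b1a";

    static String joinWithSentinel(List<String> sentences) {
        return String.join(" " + SENTINEL + " ", sentences);
    }
}
```

At query time, a span query that excludes the sentinel term then matches only within a single sentence.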

RE: Issue with sentence specific search

2010-10-06 Thread Steven A Rowe
Hi Sirish, Have you looked at SpanQuery's yet?: http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/spans/package-summary.html See also this Lucid Imagination blog post by Mark Miller: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/ One common technique, instead

RE: Issue with sentence specific search

2010-10-06 Thread Steven A Rowe
Hi Sirish, I think I understand "within sentence" phrase search - you want the entire phrase to be within a single sentence. But can you give an example of "non sentence specific phrase search"? It's not clear to me how useful such capability would be. Steve > -Original Message- > F

RE: Updating documents with fields that aren't stored

2010-10-04 Thread Steven A Rowe
3_0_2/api/all/org/apache/lucene/index/IndexW > riter.html#getReader() > > > > > > - Original Message > From: Steven A Rowe > To: "java-user@lucene.apache.org" > Sent: Mon, October 4, 2010 1:05:36 PM > Subject: RE: Updating document

RE: Updating documents with fields that aren't stored

2010-10-04 Thread Steven A Rowe
This is not a defect: . > -Original Message- > From: Justin [mailto:cry...@yahoo.com] > Sent: Monday, October 04, 2010 2:03 PM > To: java-user@lucene.apache.org > Subject: Updating doc

RE: Hierarchical Fields

2010-09-15 Thread Steven A Rowe
> $ pop > I want: doc4 then doc5 (because the path to doc4 is smaller than doc5) > > So to do this I need: > 1 - change field boost > 2 - set priority of path, and to do that: I create N field (one field > to node in the path) or have some Lucene feature (but I don't kn

RE: Hierarchical Fields

2010-09-15 Thread Steven A Rowe
Hi Iam, Can you say why you don't like the proposed solution? Also, the example of the scoring you're looking for doesn't appear to be hierarchical in nature - can you give illustrate the relationship between the tokens in [token1, token2, token3]? Also, why do you want token1 to contribute m

RE: Search results include results with excluded terms

2010-08-16 Thread Steven A Rowe
Oops, setLowercaseExpandedTerms() is an instance method, not static. I wrote: > QueryParser has a static method setLowercaseExpandedTerms() that you can call > to turn on automatic pre-expansion query term downcasing: > >

RE: Search results include results with excluded terms

2010-08-16 Thread Steven A Rowe
Hi Christoph, There could be several things going on, but it's difficult to tell without more information. Since excluded terms require a non-empty set from which to remove documents at the same boolean clause level, you could try something like "title:(*:* -Datei*) avl", or "-title:Datei* a

RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
> > you want what Lucene already does, but that's clearly not true > > Hmmm, let's pretend that "contents" field in my example wasn't analyzed at > index > time. The unstemmed form of terms will be indexed. But if I query with a > stemmed > form or use QueryParser with the SnowballAnalyzer, I'm

RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
Hi Justin, > > an example > > PerFieldAnalyzerWrapper analyzers = > new PerFieldAnalyzerWrapper(new KeywordAnalyzer()); > // myfield defaults to KeywordAnalyzer > analyzers.addAnalyzer("content", new SnowballAnalyzer(luceneVersion, > "English")); > // analyzers affects the indexed field valu

RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
Hi Justin, > Unfortunately the suffix requires a wildcard as well in our case. There > are a limited number of prefixes though (10ish), so perhaps we could > combine them all into one query. We'd still need some sort of > InverseWildcardQuery implementation. > > > use another analyzer so you don'

RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
Hi Justin, > [...] "*:* AND -myfield:foo*". > > If my document contains "myfield:foobar" and "myfield:dog", the document > would be thrown out because of the first field. I want to keep the > document because the second field does not match. I'm assuming that you mistakenly used the same field n

RE: ShingleFilter failing with more terms than index phrase

2010-07-13 Thread Steven A Rowe
Hi Ethan, You'll probably get better answers about Solr specific stuff on the solr-u...@a.l.o list. Check out PositionFilterFactory - it may address your issue: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory Steve > -Original Message- > From: Et

RE: URL Tokenization

2010-06-24 Thread Steven A Rowe
The > hudson link for nightly builds on the apache-lucene site seems to be > broke. Or may be I have a different problem. > > I'd appreciate any help. > > Thanks, > Sudha > > > > On Wed, Jun 23, 2010 at 12:21 PM, Steven A Rowe wrote: > > > Hi Sud

RE: URL Tokenization

2010-06-23 Thread Steven A Rowe
Hi Sudha, There is such a tokenizer, named NewStandardTokenizer, in the most recent patch on the following JIRA issue: https://issues.apache.org/jira/browse/LUCENE-2167 It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and e-mails too, in accordance with the relevant IETF R

RE: search hits not returned until I stop and restart application

2010-06-21 Thread Steven A Rowe
Andy, I think batching commits either by time or number of documents is common. Do you know about NRT (Near Realtime Search)?: . Using IndexWriter.getReader(), you can avoid commits altogether, as well as reducing update->search latency.

RE: search hits not returned until I stop and restart application

2010-06-21 Thread Steven A Rowe
Andy, it sounds like you're doing the right thing. Maybe you aren't using the IndexReader instance returned by reopen(), but instead are continuing to use the instance on which you called reopen()? It's tough to figure this kind of thing out without looking at the code. For example, what do yo

RE: search hits not returned until I stop and restart application

2010-06-21 Thread Steven A Rowe
Hi Andy, From the API docs for IndexWriter : [D]ocuments are added with addDocument and removed with deleteDocuments(Term) or deleteDocuments(Query). A document can be updated with updat

RE: Reverse Searching

2010-05-17 Thread Steven A Rowe
have to run the queries against that single document. > But my dilemma is, I might have upto 100,000 queries to run against it. > Do you think this route will give me results in reasonable amount of > time, i.e. in a few seconds? > > thanks > -siraj > > On 5/17/2010 5:21 PM,

RE: Reverse Searching

2010-05-17 Thread Steven A Rowe
Hi Siraj, Lucene's MemoryIndex can be used to serve this purpose. From >: [T]his class targets fulltext search of huge numbers of queries over comparatively small transient r

RE: PrefixQuery and special characters

2010-04-14 Thread Steven A Rowe
Hi Franz, The likely problem is that you're using an index-time analyzer that strips out the parentheses. StandardAnalyzer, for example, does this; WhitespaceAnalyzer does not. Remember that hits are the result of matches between index-analyzed terms and query-analyzed terms. Except in the c

RE: Lucene query with long strings

2010-03-23 Thread Steven A Rowe
Hi Aaron, Your "false positives" comments point to a mismatch between what you're currently asking Lucene for (any document matching any one of the terms in the query) and what you want (only fully "correct" matches). You need to identify the terms of the query that MUST match and tell Lucene

RE: Increase number of available positions?

2010-03-17 Thread Steven A Rowe
Hi Rene, On 03/17/2010 at 11:17 AM, Rene Hackl-Sommer wrote: > > > > > > > t293 > t4979 > > > > L_2 > > > > > > > > > t293 > t4979 > > > > L_3 > > > > > > Shouldn't this query only leave documents, where t293 and t4979 are in > the same L_2, but not within the same L_3? I'

RE: Increase number of available positions?

2010-03-15 Thread Steven A Rowe
Hi Rene, Have you seen SpanNotQuery?: For a document that looks like: T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17 T18 ... ... You could genera

RE: Increase number of available positions?

2010-03-15 Thread Steven A Rowe
Hi Rene, Why can't you use a different field for each of the Level_X's, i.e. MyLevel1Field, MyLevel2Field, MyLevel3Field? On 03/15/2010 at 9:59 AM, Rene Hackl-Sommer wrote: > > > Search in MyField: Terms T1 and T2 on Level_2 and T3, > > > T4, and T5 on Level_3, which should both be in the > > >

RE: Searching Subversion comments:

2010-03-08 Thread Steven A Rowe
On 03/08/2010 at 4:37 PM, Robert Muir wrote: > > Also, in the open source realm: > > > > 3. ViewVC (formerly ViewCVS) has a facility to query revision > history, including commit messages.  Apache's instance, which serves > Lucene's repository, doesn't expose this functionality, though > > >

RE: Searching Subversion comments:

2010-03-08 Thread Steven A Rowe
Hi Erick, On 03/08/2010 at 3:48 PM, Erick Erickson wrote: > Is there any convenient way to, say, find all the files associated with > patch ? I realize one can (hopefully) get this information from > JIRA, but... This is a subset of the problem of searching Subversion > comments. I know of tw

RE: Reverse Search

2010-03-01 Thread Steven A Rowe
Hi Mark, On 03/01/2010 at 3:35 PM, Mark Ferguson wrote: > I will be processing short bits of text (Tweets for example), and > need to search them to see if they contain certain terms. You might consider, instead of performing reverse search, just querying all of your locations against one document at a

RE: Match span of capitalized words

2010-02-05 Thread Steven A Rowe
Hi Max, On 02/05/2010 at 10:18 AM, Grant Ingersoll wrote: > On Feb 3, 2010, at 8:57 PM, Max Lynch wrote: > > Hi, I would like to do a search for "Microsoft Windows" as a span, but > > not match if words before or after "Microsoft Windows" are upper cased. > > > > For example, I want this to match

RE: Unexpected Query Results

2010-02-04 Thread Steven A Rowe
On 02/04/2010 at 3:24 PM, Chris Hostetter wrote: > : Since phrase query terms aren't analyzed, you're getting exact > : matches > > quoted phrase passed to the QueryParser are analyzed -- but they are > analyzed as complete strings, so Analyzers that treat whitespace > special may produce differne

RE: Analyzer for stripping non alpha-numeric characters?

2010-02-04 Thread Steven A Rowe
Hi Jason, Solr's PatternReplaceFilter(ts, "\\P{Alnum}+$", "", false) should work, chained after an appropriate tokenizer. Steve On 02/04/2010 at 12:18 PM, Jason Rutherglen wrote: > Is there an analyzer that easily strips non alpha-numeric from the end > of a token? > >
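The regex itself can be exercised with plain java.util.regex outside Solr — PatternReplaceFilter applies it to each token, but this sketch just shows what `\P{Alnum}+$` removes (the helper class is illustrative, not a Solr API):

```java
// Strip the trailing run of non-alphanumeric characters from a token,
// as the PatternReplaceFilter configuration above would do per token.
// \P{Alnum} is the negation of the POSIX Alnum class, anchored at the end.
class TrailingPunct {
    static String strip(String token) {
        return token.replaceAll("\\P{Alnum}+$", "");
    }
}
```

Tokens with no trailing punctuation pass through unchanged, which is why the filter is safe to chain after any tokenizer.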

RE: Unexpected Query Results

2010-02-04 Thread Steven A Rowe
Hi Jamie, Since phrase query terms aren't analyzed, you're getting exact matches for terms "было" and "время", but when you search for them individually, they are analyzed, and it is the analyzed query terms that fail to match against the indexed terms. Sounds to me like your index-time and qu

RE: combine query score with external score

2010-01-28 Thread Steven A Rowe
Hi Dennis, You should check out payloads (arbitrary per-index-term byte[] arrays), which can be used to encode values which are then incorporated into documents' scores, by overriding Similarity.scorePayload():
