RE: Access next token in a stream

2012-02-09 Thread Steven A Rowe
Hi Damerian, One way to handle your scenario is to hold on to the previous token, and only emit a token after you reach at least the second token (or at end-of-stream). Your incrementToken() method could look something like: 1. Get current attributes: input.incrementToken() 2. If previous toke
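The buffering pattern Steve describes — hold each token until you have seen its successor (or end-of-stream) — can be sketched outside Lucene's TokenStream API as a plain one-token lookahead buffer. This is an illustrative stand-in, not the actual `incrementToken()` implementation; the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch of the "hold the previous token" pattern: a token is
// only emitted once its successor has been read (or the stream has ended),
// so each emission can consult the token that follows it.
class LookaheadBuffer {
    public static List<String> emitWithLookahead(Iterator<String> input) {
        List<String> out = new ArrayList<>();
        String previous = null;
        while (input.hasNext()) {
            String current = input.next();
            if (previous != null) {
                // at this point 'current' is the lookahead for 'previous'
                out.add(previous);
            }
            previous = current;
        }
        if (previous != null) {
            out.add(previous); // flush the final buffered token at end-of-stream
        }
        return out;
    }
}
```

In a real filter the same state machine lives inside `incrementToken()`, with the buffered token held via `captureState()`/`restoreState()` rather than a list.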

RE: Access next token in a stream

2012-02-09 Thread Steven A Rowe
> Sent: Thursday, February 09, 2012 4:15 PM > To: java-user@lucene.apache.org > Subject: Re: Access next token in a stream > > On 9/2/2012 8:54 PM, Steven A Rowe wrote: > > Hi Damerian, > > > > One way to handle your scenario is to hold on to the previous tok

RE: Access next token in a stream

2012-02-09 Thread Steven A Rowe
-Original Message- > From: Damerian [mailto:dameria...@gmail.com] > Sent: Thursday, February 09, 2012 5:00 PM > To: java-user@lucene.apache.org > Subject: Re: Access next token in a stream > > On 9/2/2012 10:51 PM, Steven A Rowe wrote: > > Damerian, > > &

RE: Maven repository for lucene trunk

2012-02-14 Thread Steven A Rowe
Hi Sudarshan, I think this wiki page has the info you want: Steve > -Original Message- > From: sudarsh...@gmail.com [mailto:sudarsh...@gmail.com] On Behalf Of > Sudarshan Gaikaiwari > Sent: Tuesday, February 14, 2012 10:01 PM

RE: Can I just add ShingleFilter to my analyzer used for indexing and searching

2012-02-21 Thread Steven A Rowe
Hi Paul, Lucene QueryParser splits on whitespace and then sends individual words one-by-one to be analyzed. All analysis components that do their work based on more than one word, including ShingleFilter and SynonymFilter, are borked by this. (There is a JIRA issue open for the QueryParser pr

RE: StandardAnalyzer and Email Addresses

2012-02-26 Thread Steven A Rowe
There is no Analyzer implementation because no one ever made one :). Copy-pasting StandardAnalyzer and substituting UAX29URLEmailTokenizer wherever StandardTokenizer appears should do the trick. Because people often want to be able to search against *both* whole email addresses and URLs *and*
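The point of swapping in UAX29URLEmailTokenizer is that e-mail addresses and URLs survive tokenization as single tokens. As a rough pure-Java illustration of that behavior only — the real tokenizer is generated from a full JFlex grammar, not a regex — a toy tokenizer might look like this (class name and regex are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy tokenizer showing e-mail addresses kept whole as single tokens.
// The alternation tries the e-mail pattern first, then falls back to words.
class EmailAwareTokenizer {
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+|\\w+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

A StandardTokenizer-style word split would instead break the address at '@' and '.', which is why searching for the whole address fails against such an index.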

RE: Customizing indexing of large files

2012-02-27 Thread Steven A Rowe
PatternReplaceCharFilter would probably work, or maybe a custom CharFilter? *CharFilter has the advantage of preserving original text offsets, for highlighting. Steve > -Original Message- > From: Glen Newton [mailto:glen.new...@gmail.com] > Sent: Monday, February 27, 2012 12:57 PM > To

RE: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Steven A Rowe
The second item in the top section in trunk CHANGES.txt (back compat policy changes): * LUCENE-2858, LUCENE-3733: IndexReader was refactored into abstract AtomicReader, CompositeReader, and DirectoryReader. To open Directory- based indexes use DirectoryReader.open(), the corresponding method

RE: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Steven A Rowe
IndexReader.openIfChanged in Lucene 4.0? On Mon, Mar 5, 2012 at 11:07 AM, Steven A Rowe wrote: > The second item in the top section in trunk CHANGES.txt (back compat policy > changes): Could you guys put this on the web site (or a link to it)? Or try to get it to SEO more prominently? > > *

RE: What replaces IndexReader.openIfChanged in Lucene 4.0?

2012-03-05 Thread Steven A Rowe
You want the lucene-queryparser jar. From trunk MIGRATE.txt: * LUCENE-3283: Lucene's core o.a.l.queryParser QueryParsers have been consolidated into module/queryparser, where other QueryParsers from the codebase will also be placed. The following classes were moved: - o.a.l.queryParser.Cha

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
Hi Ilya, What analyzers are you using at index-time and query-time? My guess is that you're using an analyzer that includes punctuation in the tokens it emits, in which case your index will have things like "sentence." and "sentence?" in it, so querying for "sentence" will not match. Luke can

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
hey are ASCII. I need to handle foreign text so I assume all files that I index are UTF8. I am using the standard analyzer for English text and other contributed analyzers for respective foreign texts Thanks, Ilya -Original Message- From: Steven A Rowe [mailto:sar...@syr.edu] Sent: Mon

RE: can't find common words -- using Lucene 3.4.0

2012-03-26 Thread Steven A Rowe
On 3/26/2012 at 12:21 PM, Ilya Zavorin wrote: > I am not seeing anything suspicious. Here's what I see in the HEX: > > "n.e" from "pain.electricity": 6E-2E-0D-0A-0D-0A-65 > (n-.-CR-LF-CR-LF-e) "e.H" from "sentence.He": 65-2E-0D-0A-48 I agree, standard DOS/Windows line endings. > I am pretty sure

RE: Lucene tokenization

2012-03-27 Thread Steven A Rowe
Hi Nilesh, Which version of Lucene are you using? StandardTokenizer behavior changed in v3.1. Steve -Original Message- From: Nilesh Vijaywargiay [mailto:nilesh.vi...@gmail.com] Sent: Tuesday, March 27, 2012 2:04 PM To: java-user@lucene.apache.org Subject: Lucene tokenization I have a

RE: HTML tags and Lucene highlighting

2012-04-05 Thread Steven A Rowe
Hi okayndc, What *do* you want? Steve -Original Message- From: okayndc [mailto:bodymo...@gmail.com] Sent: Thursday, April 05, 2012 1:34 PM To: java-user@lucene.apache.org Subject: HTML tags and Lucene highlighting Hello, I currently use Lucene version 3.0...probably need to upgrade to

RE: HTML tags and Lucene highlighting

2012-04-05 Thread Steven A Rowe
tags (in the field configured to use HTMLStripCharFilter, anyway). So HTMLStripCharFilter should do what you want. Steve From: okayndc [mailto:bodymo...@gmail.com] Sent: Thursday, April 05, 2012 3:36 PM To: Steven A Rowe Cc: java-user@lucene.apache.org Subject: Re: HTML tags and Lucene highlig

RE: Partial word match

2012-04-09 Thread Steven A Rowe
Hi Hanu, Depending on the nature of the partial word match you're looking for - do you want to only match partial words that match at the beginning of the word? - you should look either at NGramTokenFilter or EdgeNGramTokenFilter:
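The "match at the beginning of the word" case is what edge n-grams provide: index every prefix of each word between a minimum and maximum length. A plain-Java sketch of the generation step (hypothetical helper; the real Lucene class is EdgeNGramTokenFilter):

```java
import java.util.ArrayList;
import java.util.List;

// Edge n-grams: the prefixes of a word from minGram to maxGram characters.
// Indexing these lets a query like "sea" match the indexed word "search".
class EdgeNGrams {
    public static List<String> edgeNGrams(String word, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int max = Math.min(maxGram, word.length());
        for (int len = minGram; len <= max; len++) {
            grams.add(word.substring(0, len));
        }
        return grams;
    }
}
```

NGramTokenFilter is the analogous class for arbitrary-position substrings, at the cost of a much larger index.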

RE: Two questions on RussianAnalyzer

2012-04-19 Thread Steven A Rowe
Hi Vladimir, > The most uncomfortable in new behaviour to me is that in past I used > to search by subdomain like bbb.com: and have displayed results > with www.bbb.com:, aaa.bbb.com: and so on. Now I have 0 > results. About domain names, see my response to a similar question today on

RE: Highlighter and Shingles...

2012-04-20 Thread Steven A Rowe
Hi Dawn, Can you give an example of a "partial match"? Steve -Original Message- From: Dawn Zoë Raison [mailto:d...@digitorial.co.uk] Sent: Friday, April 20, 2012 7:59 AM To: java-user@lucene.apache.org Subject: Highlighter and Shingles... Hi, Are there any notes on making the highligh

[MAVEN] Heads up: build changes

2012-05-08 Thread Steven A Rowe
If you use the Lucene/Solr Maven POMs to drive the build, I committed a major change last night (see https://issues.apache.org/jira/browse/LUCENE-3948 for more details): * 'ant get-maven-poms' no longer places pom.xml files under the lucene/ and solr/ directories. Instead, they are placed in a

RE: [MAVEN] Heads up: build changes

2012-05-08 Thread Steven A Rowe
ava:809) [copy] at org.apache.tools.ant.Main.startAnt(Main.java:217) [copy] at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) [copy] at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) On 08/05/12 10:31, Steven A Rowe wrote: > If you use the Luce

RE: [MAVEN] Heads up: build changes

2012-05-09 Thread Steven A Rowe
nux localhost 2.6.39 #4 SMP Sun Aug 21 13:53:29 PDT 2011 x86_64 > Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz GenuineIntel GNU/Linux > > > On 08/05/12 11:24, Steven A Rowe wrote: >> Hi Greg, >> >> I don't see that problem - 'ant generate-maven-artifacts' j

RE: how to remove the dash

2012-06-25 Thread Steven A Rowe
I added the following to both TestStandardAnalyzer and TestClassicAnalyzer in branches/lucene_solr_3_6/, and it passed in both cases: public void testWhitespaceHyphenWhitespace() throws Exception { BaseTokenStreamTestCase.assertAnalyzesTo (a, "drinks - water", new String[]{"drinks", "

RE: RAMDirectory and expungeDeletes()/optimize()

2012-07-11 Thread Steven A Rowe
Nabble silently drops content from email sent through their interface on a regular basis. I've told them about it multiple times. My suggestion: find another way to post to this mailing list. -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesda

RE: ReferenceManager.maybeRefreshBlocking() should not be declared throwing InterruptedException

2012-07-21 Thread Steven A Rowe
Hi Vitaly, Info here should help you set up snapshot dependencies: http://wiki.apache.org/lucene-java/NightlyBuilds Steve -Original Message- From: Vitaly Funstein [mailto:vfunst...@gmail.com] Sent: Saturday, July 21, 2012 9:22 PM To: java-user@lucene.apache.org Subject: Re: ReferenceMa

RE: Using stop words with snowball analyzer and shingle filter

2012-09-19 Thread Steven A Rowe
Hi Martin, SnowballAnalyzer was deprecated in Lucene 3.0.3 and will be removed in Lucene 5.0. Looks like you're using Lucene 3.X; here's an (untested) Analyzer, based on Lucene 3.6 EnglishAnalyzer, (except substituting SnowballFilter for PorterStemmer; disabling stopword holes' position increm

RE: StandardTokenizer generation from JFlex grammar

2012-10-04 Thread Steven A Rowe
Hi Phani, Assuming you're using Lucene 3.6.X, see: and

RE: Reverse Searching

2010-05-17 Thread Steven A Rowe
Hi Siraj, Lucene's MemoryIndex can be used to serve this purpose. From its API documentation: [T]his class targets fulltext search of huge numbers of queries over comparatively small transient r

RE: Reverse Searching

2010-05-17 Thread Steven A Rowe
have to run the queries against that single document. > But my dilemma is, I might have upto 100,000 queries to run against it. > Do you think this route will give me results in reasonable amount of > time, i.e. in a few seconds? > > thanks > -siraj > > On 5/17/2010 5:21 PM,

RE: search hits not returned until I stop and restart application

2010-06-21 Thread Steven A Rowe
Hi Andy, From the API docs for IndexWriter : [D]ocuments are added with addDocument and removed with deleteDocuments(Term) or deleteDocuments(Query). A document can be updated with updat

RE: search hits not returned until I stop and restart application

2010-06-21 Thread Steven A Rowe
Andy, it sounds like you're doing the right thing. Maybe you aren't using the IndexReader instance returned by reopen(), but instead are continuing to use the instance on which you called reopen()? It's tough to figure this kind of thing out without looking at the code. For example, what do yo

RE: search hits not returned until I stop and restart application

2010-06-21 Thread Steven A Rowe
Andy, I think batching commits either by time or number of documents is common. Do you know about NRT (Near Realtime Search)?: . Using IndexWriter.getReader(), you can avoid commits altogether, as well as reducing update->search latency.

RE: URL Tokenization

2010-06-23 Thread Steven A Rowe
Hi Sudha, There is such a tokenizer, named NewStandardTokenizer, in the most recent patch on the following JIRA issue: https://issues.apache.org/jira/browse/LUCENE-2167 It keeps (HTTP(S), FTP, and FILE) URLs together as single tokens, and e-mails too, in accordance with the relevant IETF R

RE: URL Tokenization

2010-06-24 Thread Steven A Rowe
The > hudson link for nightly builds on the apache-lucene site seems to be > broke. Or may be I have a different problem. > > I'd appreciate any help. > > Thanks, > Sudha > > > > On Wed, Jun 23, 2010 at 12:21 PM, Steven A Rowe wrote: > > > Hi Sud

RE: ShingleFilter failing with more terms than index phrase

2010-07-13 Thread Steven A Rowe
Hi Ethan, You'll probably get better answers about Solr specific stuff on the solr-u...@a.l.o list. Check out PositionFilterFactory - it may address your issue: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory Steve > -Original Message- > From: Et

RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
Hi Justin, > [...] "*:* AND -myfield:foo*". > > If my document contains "myfield:foobar" and "myfield:dog", the document > would be thrown out because of the first field. I want to keep the > document because the second field does not match. I'm assuming that you mistakenly used the same field n

RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
Hi Justin, > Unfortunately the suffix requires a wildcard as well in our case. There > are a limited number of prefixes though (10ish), so perhaps we could > combine them all into one query. We'd still need some sort of > InverseWildcardQuery implementation. > > > use another analyzer so you don'

RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
Hi Justin, > > an example > > PerFieldAnalyzerWrapper analyzers = > new PerFieldAnalyzerWrapper(new KeywordAnalyzer()); > // myfield defaults to KeywordAnalyzer > analyzers.addAnalyzer("content", new SnowballAnalyzer(luceneVersion, > "English")); > // analyzers affects the indexed field valu

RE: InverseWildcardQuery

2010-07-30 Thread Steven A Rowe
> > you want what Lucene already does, but that's clearly not true > > Hmmm, let's pretend that "contents" field in my example wasn't analyzed at > index > time. The unstemmed form of terms will be indexed. But if I query with a > stemmed > form or use QueryParser with the SnowballAnalyzer, I'm

RE: Search results include results with excluded terms

2010-08-16 Thread Steven A Rowe
Hi Christoph, There could be several things going on, but it's difficult to tell without more information. Since excluded terms require a non-empty set from which to remove documents at the same boolean clause level, you could try something like "title:(*:* -Datei*) avl", or "-title:Datei* a

RE: Search results include results with excluded terms

2010-08-16 Thread Steven A Rowe
Oops, setLowercaseExpandedTerms() is an instance method, not static. I wrote: > QueryParser has a static method setLowercaseExpandedTerms() that you can call > to turn on automatic pre-expansion query term downcasing: > >

RE: Hierarchical Fields

2010-09-15 Thread Steven A Rowe
Hi Iam, Can you say why you don't like the proposed solution? Also, the example of the scoring you're looking for doesn't appear to be hierarchical in nature - can you give illustrate the relationship between the tokens in [token1, token2, token3]? Also, why do you want token1 to contribute m

RE: Hierarchical Fields

2010-09-15 Thread Steven A Rowe
> $ pop > I want: doc4 then doc5 (because the path to doc4 is smaller then doc5) > > So to do this I need: > 1 - change field boost > 2 - set priority of path, and to do that: I create N field (one field > to node in the path) or have some Lucene feature (but I don't kn

RE: Updating documents with fields that aren't stored

2010-10-04 Thread Steven A Rowe
This is not a defect: . > -Original Message- > From: Justin [mailto:cry...@yahoo.com] > Sent: Monday, October 04, 2010 2:03 PM > To: java-user@lucene.apache.org > Subject: Updating doc

RE: Updating documents with fields that aren't stored

2010-10-04 Thread Steven A Rowe
3_0_2/api/all/org/apache/lucene/index/IndexW > riter.html#getReader() > > > > > > - Original Message > From: Steven A Rowe > To: "java-user@lucene.apache.org" > Sent: Mon, October 4, 2010 1:05:36 PM > Subject: RE: Updating document

RE: Issue with sentence specific search

2010-10-06 Thread Steven A Rowe
Hi Sirish, I think I understand "within sentence" phrase search - you want the entire phrase to be within a single sentence. But can you give an example of "non sentence specific phrase search"? It's not clear to me how useful such capability would be. Steve > -Original Message- > F

RE: Issue with sentence specific search

2010-10-06 Thread Steven A Rowe
Hi Sirish, Have you looked at SpanQuery's yet?: http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/spans/package-summary.html See also this Lucid Imagination blog post by Mark Miller: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/ One common technique, instead

RE: Issue with sentence specific search

2010-10-07 Thread Steven A Rowe
Hi Sirish, StandardTokenizer does not produce a token from '#', as you suspected. Something that fits the "word" definition, but which won't ever be encountered in your documents, is what you should use for the delimiter - something like a1b2c3c2b1a . Sentence boundary handling is clunky in L
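The delimiter-token technique above — joining sentences with a "word" that will never occur in real text, so position-based phrase and span queries cannot match across a sentence boundary — can be sketched as follows (class and method names are hypothetical; the delimiter is the one suggested in the message):

```java
import java.util.List;
import java.util.stream.Collectors;

// Insert a synthetic token between sentences before indexing. Because the
// delimiter occupies a position, a phrase query can't span two sentences.
class SentenceDelimiter {
    public static final String DELIMITER = "a1b2c3c2b1a";

    public static String joinSentences(List<String> sentences) {
        return sentences.stream()
                        .collect(Collectors.joining(" " + DELIMITER + " "));
    }
}
```

A variant of the same idea bumps the position increment between sentences instead of indexing a literal token, then relies on SpanQuery slop to stay inside one sentence.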

RE: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Steven A Rowe
Hi Martin, StandardTokenizer and -Analyzer have been changed, as of future version 3.1 (the next release) to support the Unicode segmentation rules in UAX#29. My (untested) guess is that your hyphenated word will be kept as a single token if you set the version to 3.1 or higher in the construc

RE: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Steven A Rowe
> A good suggestion. But I'm using Lucene 3.0.2 and the constructor for a > StandardAnalyzer has Version_30 as its highest value. Do you know when 3.1 > is due? > > -Original Message- > From: Steven A Rowe [mailto:sar...@syr.edu] > Sent: 24 Oct 2010 21:31 > T

RE: IndexWriters and write locks

2010-11-10 Thread Steven A Rowe
NFS[1] != NTFS[2] [1] NFS: [2] NTFS: > -Original Message- > From: Pulkit Singhal [mailto:pulkitsing...@gmail.com] > Sent: Wednesday, November 10, 2010 2:55 PM > To: java-user@lucene.apach

RE: Analyzer

2010-11-29 Thread Steven A Rowe
Hi Manjula, It's not terribly clear what you're doing here - I got lost in your description of your (two? or maybe four?) classes. Sometimes things are easier to understand if you provide more concrete detail. I suspect that you could benefit from reading the book Lucene in Action, 2nd editio

RE: tokensFromAnalysis

2010-12-02 Thread Steven A Rowe
Lewis, Simon asked about the version of Lucene you're using because this section of the API has seen regular change. If you don't tell us which version, we can't help, because we don't know what you're coding against. Steve > -Original Message- > From: McGibbney, Lewis John [mailto:le

RE: Re: Scale up design

2010-12-22 Thread Steven A Rowe
On 12/22/2010 at 2:38 AM, Ganesh wrote: > Any other tips targeting 64 bit? If memory usage is an issue, you might consider using HotSpot's "compressed oops" option:
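The compressed-oops flag is passed on the JVM command line; for heaps under roughly 32 GB it shrinks object references from 64 to 32 bits (the flag name is real; the heap size and jar name below are just illustrative):

```shell
# 64-bit HotSpot: use 32-bit compressed object pointers (on by default in
# later JVM releases; shown explicitly here for illustration)
java -XX:+UseCompressedOops -Xmx8g -jar my-search-app.jar
```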

RE: lucene-based log searcher?

2011-01-13 Thread Steven A Rowe
Hi Paul, I saw this yesterday, but haven't tried it myself: http://karussell.wordpress.com/2010/10/27/feeding-solr-with-its-own-logs/ The author has a project called "Sogger" - Solr + Logger? - that can read various forms of logs. Steve > -Original Message- > From: Paul Libbrecht [mai

RE: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?

2011-01-18 Thread Steven A Rowe
> [x] ASF Mirrors (linked in our release announcements or via the Lucene > website) > > [x] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) > > [x] I/we build them from source via an SVN/Git checkout.

RE: WhitespaceAnalyzer in Lucene nightly build ?

2011-03-04 Thread Steven A Rowe
Hi Patrick, The Jenkins (formerly Hudson) nightly Ant builds do not produce the jar containing WhitespaceAnalyzer. This is not intentional - I just created an issue to track fixing the problem: . The nightly Maven JARs Uwe pointed you to are

RE: lucene-snowball 3.1.0 packages are missing?

2011-04-03 Thread Steven A Rowe
Hi Alex, From Lucene contrib CHANGES.html : 3. LUCENE-2226: Moved contrib/snowball functionality into contrib/analyzers. Be sure to remove any old obsolete l

RE: Lucene 3.1

2011-04-05 Thread Steven A Rowe
Hi Tanuj, Can you be more specific? What file did you download? (Lucene 3.1 has three downloadable packages: -src.tar.gz, .tar.gz, and .zip.) What did you expect to find that is not there? (Some examples would help.) Steve > -Original Message- > From: Tanuj Jain [mailto:tanujjain.

RE: word + ngram tokenization

2011-04-05 Thread Steven A Rowe
Hi Shambhu, ShingleFilter will construct word n-grams: http://lucene.apache.org/java/3_1_0/api/contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html Steve > -Original Message- > From: sham singh [mailto:shamsing...@gmail.com] > Sent: Tuesday, April 05, 2011 5:53 PM > T
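What ShingleFilter emits for adjacent tokens can be sketched in plain Java — here just the bigram case, without the unigram output and filler-token handling the real filter also does (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Word bigram "shingles": each pair of adjacent tokens joined into one term,
// the core of what ShingleFilter adds to the token stream.
class WordShingles {
    public static List<String> bigrams(List<String> tokens) {
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            shingles.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return shingles;
    }
}
```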

RE: lucene 3.0.3 | searching problem with *.docx file

2011-04-12 Thread Steven A Rowe
Hi Ranjit, Do you know about Luke? It will let you see what's in your index, and much more: http://code.google.com/p/luke/ Steve > -Original Message- > From: Ranjit Kumar [mailto:ranjit.ku...@otssolutions.com] > Sent: Tuesday, April 12, 2011 9:05 AM > To: java-user-h...@lucene.apa

RE: lucene 3.0.3 | QueryParser | MultiFieldQueryParser

2011-04-26 Thread Steven A Rowe
Hi Ranjit, I suspect the problem is not QueryParser, since the definition includes the '#' character (from ): | <#_TERM_START_CHAR: ( ~[ " ", "\t", "\n",

RE: lucene 3.0.3 | QueryParser | MultiFieldQueryParser

2011-04-27 Thread Steven A Rowe
Ranjit, The problem is definitely the analyzer you are passing to QueryParser or MultiFieldQueryParser, and not the parser itself. The following tests succeed using KeywordAnalyzer, which is a pass-through analyzer (the output is the same as the input): public void testSharpQP() throws Excep

RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Hi Paul, What did you find about Luke that's buggy? Bug reports are very useful; please contribute in this way. The official Lucene 3.0.3 distribution jars were compiled using the -g cmdline argument to javac - by default, though, only line number and source file information is generated. If

RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Hi Paul, On 4/29/2011 at 4:14 PM, Paul Taylor wrote: > On 29/04/2011 16:03, Steven A Rowe wrote: > > What did you find about Luke that's buggy? Bug reports are very > > useful; please contribute in this way. > > Please see previous post, in summary mistake on my par

RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Thanks Dawid. – Steve From: dawid.we...@gmail.com [mailto:dawid.we...@gmail.com] On Behalf Of Dawid Weiss Sent: Friday, April 29, 2011 4:45 PM To: java-user@lucene.apache.org Cc: Steven A Rowe Subject: Lucene 3.0.3 with debug information This is the e-mail you're looking for, Steven (it w

RE: Can I omit ShingleFilter's filler tokens

2011-05-11 Thread Steven A Rowe
Hi Bill, I can think of two possible interpretations of "removing filler tokens": 1. Don't create shingles across stopwords, e.g. for text "one two three four five" and stopword "three", bigrams only, you'd get ("one two", "four five"), instead of the current ("one two", "two _", "_ four", "fou

RE: Can I omit ShingleFilter's filler tokens

2011-05-11 Thread Steven A Rowe
> > On Wed, May 11, 2011 at 8:27 AM, Steven A Rowe wrote: > > Hi Bill, > > > > I can think of two possible interpretations of "removing filler > tokens": > > > > 1. Don't create shingles across stopwords, e.g. for text "one two three >

RE: Can I omit ShingleFilter's filler tokens

2011-05-12 Thread Steven A Rowe
A thought: one way to do #1 without modifying ShingleFilter: if there were a StopFilter variant that accepted regular expressions instead of a stopword list, you could configure it with a regex like /_ .*|.* _| _ / (assuming a full match is required, i.e. implicit beginning and end anchors), and
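The proposed regex can be exercised on plain strings to see which shingles it would drop — any bigram whose text fully matches `_ .*|.* _| _ `, i.e. a filler token at the start, at the end, or alone. This is a sketch of the idea only; no such regex-based StopFilter variant exists in Lucene as described, and the class name here is hypothetical:

```java
import java.util.List;
import java.util.stream.Collectors;

// Drop shingles containing the "_" filler token, using the full-match regex
// from the message: filler-first, filler-last, or filler-only shingles.
class FillerShingleFilter {
    private static final String FILLER_REGEX = "_ .*|.* _| _ ";

    public static List<String> dropFillerShingles(List<String> shingles) {
        // String.matches() anchors implicitly at both ends, as assumed above
        return shingles.stream()
                       .filter(s -> !s.matches(FILLER_REGEX))
                       .collect(Collectors.toList());
    }
}
```

Applied to the bigrams from the earlier "one two three four five" / stopword "three" example, this keeps ("one two", "four five") and drops ("two _", "_ four").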

RE: Can I omit ShingleFilter's filler tokens

2011-05-12 Thread Steven A Rowe
Message- > From: Robert Muir [mailto:rcm...@gmail.com] > Sent: Thursday, May 12, 2011 1:15 PM > To: java-user@lucene.apache.org > Subject: Re: Can I omit ShingleFilter's filler tokens > > On Thu, May 12, 2011 at 1:03 PM, Steven A Rowe wrote: > > A thought: one way to do #

RE: Lucene 3.3 in Eclipse

2011-05-15 Thread Steven A Rowe
Hi Cheng, Lucene 3.3 does not exist - do you mean branches/branch_3x ? FYI, as of Lucene 3.1, there is an Ant target you can use to setup an Eclipse project for Lucene/Solr - run this from the top level directory of a full source tree (including dev-tools/ directory) checked out from Subversio

RE: Lucene 3.3 in Eclipse

2011-05-15 Thread Steven A Rowe
lto:zhoucheng2...@gmail.com] > Sent: Sunday, May 15, 2011 10:48 AM > To: java-user@lucene.apache.org > Cc: Steven A Rowe > Subject: RE: Lucene 3.3 in Eclipse > > Steve, thanks for correction. You are right. The version is 3.0.3 > released last Oct. > > I did place an ant

RE: Query Parser, Unary Operators and Multi-Field Query

2011-05-20 Thread Steven A Rowe
Hi Renaud, That's normal behavior, since you have AND as default operator. This is equivalent to placing a "+" in front of every element of your query. In fact, if you removed the other two "+"s, you would get the same behavior. I think you'll get what you want by just switching the default

RE: Query Parser, Unary Operators and Multi-Field Query

2011-05-20 Thread Steven A Rowe
Hi Renaud, On 5/20/2011 at 1:58 PM, Renaud Delbru wrote: > As said in > , > "if one or more of the terms in a term list has an explicit term operator > (+ or - or relational operator) the rest of the terms will be treated as

RE: FastVectorHighlighter StringIndexOutofBounds bug

2011-05-22 Thread Steven A Rowe
Hi WeiWei, Thanks for the report. Can you provide a self-contained unit test that triggers the bug? Thanks, Steve > -Original Message- > From: Weiwei Wang [mailto:ww.wang...@gmail.com] > Sent: Monday, May 23, 2011 1:25 AM > To: java-user@lucene.apache.org > Subject: FastVectorHighlight

RE: Bug fix to contrib/.../IndexSplitter

2011-06-09 Thread Steven A Rowe
Hi Ivan, You do have rights to submit fixes to Lucene - everyone does! Here's how: http://wiki.apache.org/lucene-java/HowToContribute Please create a patch, create an issue in JIRA, and then attach the patch to the JIRA issue. When you do this, you are asked to state that you grant license to

RE: Lucene Simple Project

2011-06-18 Thread Steven A Rowe
Hi Hamada, Do you know about the Lucene demo?: http://lucene.apache.org/java/3_2_0/demo.html Steve > -Original Message- > From: hamadazahera [mailto:hamadazah...@gmail.com] > Sent: Saturday, June 18, 2011 9:30 AM > To: java-user@lucene.apache.org > Subject: Lucene Simple Project > > He

RE: how are built the packages in the maven repository?

2011-07-06 Thread Steven A Rowe
Ant is the official Lucene/Solr build system. Snapshot and release artifacts are produced with Ant. While Maven is capable of producing artifacts, the artifacts produced in this way may not be the same as the official Ant artifacts. For this reason: no, the artifacts should not be built with

RE: Some question about Lucene

2011-07-10 Thread Steven A Rowe
This slide show is a few years old, but I think it might be a good introduction for you to the differences between the projects: http://www.slideshare.net/dnaber/apache-lucene-searching-the-web-and-everything-else-jazoon07/ Steve -Original Message- From: Ing. Yusniel Hidalgo Delgado [ma

RE: 4.0-SNAPSHOT in maven repo via Jenkins?

2011-07-25 Thread Steven A Rowe
Hi Eric, On 7/24/2011 at 3:07 AM, Eric Charles wrote: > Jenkins jobs builds lucene trunk with 'mvn --batch-mode > --non-recursive -Pboot

RE: Enabling indexing of hyphenated terms sans the hyphen

2011-09-19 Thread Steven A Rowe
Hi sbs, Solr's WordDelimiterFilterFactory does what you want. You can see a description of its function here: . WordDelimiterFilter, the filter class implementing the above factory's functionality, is

RE: setting MaxFieldLength in indexwriter

2011-09-28 Thread Steven A Rowe
Hi Peyman, The API docs give a hint : = Nested Class Summary ... static class IndexWriter.MaxFieldLength Deprecated. use LimitTokenCountAnalyzer instead. =

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-17 Thread Steven A Rowe
Hi Paul, You could add a rule to the StandardTokenizer JFlex grammar to handle this case, bypassing its other rules. Another option is to create a char filter that substitutes PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, etc., but only when the entire input consists excl
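The second option — substituting a synthetic word for each punctuation mark, but only when the entire input is punctuation — can be sketched like this (the `PUNCT_*` token names and the class are hypothetical, matching the naming proposed in the message):

```java
// Sketch of the proposed char-filter behavior: if the whole input consists
// of punctuation, replace each mark with a synthetic word token so a
// downstream tokenizer (which normally discards punctuation) still emits
// something searchable; mixed input is left untouched.
class PunctuationSubstituter {
    public static String substitute(String input) {
        if (!input.matches("\\p{Punct}+")) {
            return input; // not punctuation-only: pass through unchanged
        }
        return input.replace("!", " PUNCT_EXCLAMATION ")
                    .replace(".", " PUNCT_PERIOD ")
                    .replace("?", " PUNCT_QUESTION ")
                    .trim();
    }
}
```

A real implementation would subclass CharFilter so that original-text offsets are preserved for highlighting.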

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-18 Thread Steven A Rowe
Hi Paul, On 10/18/2011 at 4:57 AM, Paul Taylor wrote: > On 18/10/2011 06:19, Steven A Rowe wrote: > > Another option is to create a char filter that substitutes > > PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods, > > etc., > > Yes that is how I firs

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-19 Thread Steven A Rowe
Hi Paul, On 10/19/2011 at 5:26 AM, Paul Taylor wrote: > On 18/10/2011 15:25, Steven A Rowe wrote: > > On 10/18/2011 at 4:57 AM, Paul Taylor wrote: > > > On 18/10/2011 06:19, Steven A Rowe wrote: > > > > Another option is to create a char filter that substitutes

RE: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-19 Thread Steven A Rowe
Hi Paul, What version of Lucene are you using? The JFlex spec you quote below looks pre-v3.1? Steve > -Original Message- > From: Paul Taylor [mailto:paul_t...@fastmail.fm] > Sent: Wednesday, October 19, 2011 6:50 AM > To: Steven A Rowe; java-user@lucene.apache.org &g

RE: Analysers for newspaper pages...

2011-11-28 Thread Steven A Rowe
Hi Dawn, I assume that when you refer to "the impact of stop words," you're concerned about query-time performance? You should consider the possibility that performance without removing stop words is good enough that you won't have to take any steps to address the issue. That said, there are

RE: Storing special characters in Lucene

2008-08-21 Thread Steven A Rowe
Hola Juan, On 08/21/2008 at 1:16 PM, Juan Pablo Morales wrote: > I have an index in Spanish and I use Snowball to stem and > analyze and it works perfectly. However, I am running into > trouble storing (not indexing, only storing) words that > have special characters. > > That is, I store the spe

RE: Lucene sample code and api documentation

2008-08-28 Thread Steven A Rowe
Hi Sithu, On 08/27/2008 at 3:13 PM, Sudarsan, Sithu D. wrote: > 2. Where do we look for sample codes? Or detailed tutorials? Lots of good stuff here: and particularly here (books, articles, presentations, oh my!):

RE: Confused with NGRAM results

2008-08-28 Thread Steven A Rowe
Hi gaz77, Here's a good place to start: Steve On 08/28/2008 at 10:52 AM, gaz77 wrote: > > Hi, > > I'd appreciate if someone could explain the results I'm getting. > > I've written a simple custom analyzer that applies the > NGramToken

RE: boost freshness instead of sorting

2008-08-28 Thread Steven A Rowe
Hi Yannis, On 08/28/2008 at 12:12 PM, Yannis Pavlidis wrote: > I am trying to boost the freshness of some of our documents > in the index using the most efficient way (i.e. if 2 news > stories have the same score based on the content then I want > to promote the one that was created last) > [...]

RE: boost freshness instead of sorting

2008-08-28 Thread Steven A Rowe
* 1/sqrt(4) = idf > > I am using the Snowball English analyzer which I believe does > the right job (I also tried the same example with bbb instead of 1) > > Any clarifications / suggestions would be appreciated. > > Thanks, > > Yannis. > > -Original Message- >

RE: Beginner: Specific indexing

2008-09-09 Thread Steven A Rowe
Hi Raymond, Check out SinkTokenizer/TeeTokenFilter: Look at the unit tests for usage hints:

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+exact matching

2008-09-09 Thread Steven A Rowe
Hi mck, On 09/09/2008 at 12:58 PM, Mck wrote: > > *ShortVersion* > > is there a way to make the ShingleFilter perform exact matching via > > inserting ^ $ begin/end markers? > > Reading through the mailing list i see how exact matching can > be done, a la STFW to myself... > > So the ShortVersi

RE: Re: Replacing FAST functionality at sesam.no - ShingleFilter+exactmatching

2008-09-09 Thread Steven A Rowe
On 09/09/2008 at 4:38 PM, Mck wrote: > > > Looks to me like MultiPhraseQuery is getting in the way. Shingles > > that begin at the same word are given the same position by > > ShingleFilter, and Solr's FieldQParserPlugin creates a > > MultiPhraseQuery when it encounters tokens in a query with the

RE: RE: Re: Replacing FAST functionality at sesam.no -ShingleFilter+exactmatching

2008-09-10 Thread Steven A Rowe
Hi mck, On 09/10/2008 at 3:55 AM, Mck wrote: > > probably better to change the one instance of .setPositionIncrement(0) > > to .setPositionIncrement(1) - that way, MultiPhraseQuery will not be > > invoked, and the standard disjunction thing should happen. > > Tried this. > As you say i end up wit

RE: Understanding/controlling role of Weight in IndexSearcher

2008-09-10 Thread Steven A Rowe
Hi Micah, On 09/09/2008 at 11:57 PM, Micah Jaffe wrote: > I'm [...] curious how weights are calculated. [...] > thoughts? pointers? best practices? http://lucene.apache.org/java/docs/scoring.html - To unsubscribe, e-mail: [EMA

RE: Re: Replacing FAST functionality at sesam.no-ShingleFilter+exactmatching

2008-09-10 Thread Steven A Rowe
On 09/10/2008 at 12:02 PM, Mck wrote: > > > But this does not return the hits i want. > > > > Have you tried submitting the query without quotes? (That's where the > > PhraseQuery likely comes from.) > > Yes. It does not work. It returns just the unigrams, again the same > behaviour as mentioned

RE: Problems when changing stoplist file

2008-09-11 Thread Steven A Rowe
Hi Marie, On 09/11/2008 at 4:03 AM, Marie-Christine Plogmann wrote: > I am currently using the demo class IndexFiles to index some > corpus. I have replaced the Standard by a GermanAnalyzer. > Here, indexing works fine. > But if i specify a different stopword list that should be > used, the tokeni

RE: StandardTokenizer and Korean grouping with alphanum

2008-09-22 Thread Steven A Rowe
Hi Daniel, On 09/22/2008 at 12:49 AM, Daniel Noll wrote: > I have a question about Korean tokenisation. Currently there > is a rule in StandardTokenizerImpl.jflex which looks like this: > > ALPHANUM = ({LETTER}|{DIGIT}|{KOREAN})+ LUCENE-1126
