Re: Confused with NGRAM results

2008-08-28 Thread Otis Gospodnetic
This actually sounds bugish to me, but you removed the text from your original email, so I don't know what context this was in. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: gaz77 <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Se

Re: Replicating Lucene Index with out SOLR

2008-08-28 Thread Jason Rutherglen
Hello, I have been emailing Otis regarding some of the replication issues and it is good to get them into the Lucene forums to obtain feedback and try to agree on what is most advantageous. Solr replication uses what I call segment replication. Ocean can do segment replication but usually simply

Re: phrases and slop

2008-08-28 Thread Mark Miller
Andy Goodell wrote: I thought I understood phrases and slop until one of my coworkers brought by the following example For a document that contains "quick brown fox" "quick brown fox"~0 "quick fox brown"~2 "fox quick brown"~3 all match. I would have expected "fox quick brown" to require a 4 i

Re: Replicating Lucene Index with out SOLR

2008-08-28 Thread rahul_k123
Do I need to stop indexing when I rsync the snapshot to the slave? Otis Gospodnetic wrote: > > Yes, I think you pinpointed what I see over and over with Solr. The two > desires pull in opposite directions. I think Jason Rutherglen is very > keen to start talking about Lucene clusters and ind

phrases and slop

2008-08-28 Thread Andy Goodell
I thought I understood phrases and slop until one of my coworkers brought by the following example For a document that contains "quick brown fox" "quick brown fox"~0 "quick fox brown"~2 "fox quick brown"~3 all match. I would have expected "fox quick brown" to require a 4 instead of a 3, two to
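
One way to reason about the numbers in this question is sketched below. This is plain Java, not Lucene code, and it rests on an assumption: that the slop a match needs equals the spread between the largest and smallest "shift" (document position minus query position) over the phrase terms, with every term occurring exactly once in the document. The class and method names (`SlopSketch`, `requiredSlop`) are hypothetical. Under that assumption the three queries from the post come out at 0, 2 and 3, which agrees with the matches reported.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: estimates the minimum slop a phrase query would need,
// assuming each query term occurs exactly once in the document.
final class SlopSketch {

    // Returns the estimated slop, or -1 if a query term is missing from the document.
    static int requiredSlop(List<String> docTerms, List<String> queryTerms) {
        if (queryTerms.isEmpty()) {
            return 0;
        }
        int minShift = Integer.MAX_VALUE;
        int maxShift = Integer.MIN_VALUE;
        for (int queryPos = 0; queryPos < queryTerms.size(); queryPos++) {
            int docPos = docTerms.indexOf(queryTerms.get(queryPos));
            if (docPos < 0) {
                return -1; // term not in the document, so no slop can match
            }
            // How far this term sits from where the query expects it.
            int shift = docPos - queryPos;
            minShift = Math.min(minShift, shift);
            maxShift = Math.max(maxShift, shift);
        }
        return maxShift - minShift;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("quick", "brown", "fox");
        System.out.println(requiredSlop(doc, Arrays.asList("quick", "brown", "fox"))); // 0
        System.out.println(requiredSlop(doc, Arrays.asList("quick", "fox", "brown"))); // 2
        System.out.println(requiredSlop(doc, Arrays.asList("fox", "quick", "brown"))); // 3
    }
}
```

Read this way, "fox quick brown" needs 3 rather than 4 because the measure is the spread of the shifts, not a count of individual term swaps.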

Re: Confused with NGRAM results

2008-08-28 Thread gaz77
Thanks for the pointer. I've gone into this in some depth, using the AnalyzerUtils class from the lucene in action book. It seems that the NGramTokenFilter is only processing part of the string that goes in. It stops tokenising the words part way through. That's why the documents weren't found i

Re: Clarity: Is there a Query boosting 50-50 over 1000-1 ?

2008-08-28 Thread Grant Ingersoll
Can you share your query generation code? Your description doesn't make sense to me and I wonder how you are creating and running the searches. Can you run the explain() method on your documents? Also, FWIW, it sounds like you are prematurely optimizing. For every query you ever do in y

Re: Case Sensitivity

2008-08-28 Thread Michael McCandless
Andrzej Bialecki wrote: Michael McCandless wrote: In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this issue: https://issues.apache.org/jira/browse/LUCENE-1366 This has consequences when searching - so if we expose it the javadoc has to be really good at explaining what'

Re: Case Sensitivity

2008-08-28 Thread Michael McCandless
Yonik Seeley wrote: On Thu, Aug 28, 2008 at 1:44 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this issue: I wasn't originally going to add a Field.Index at all for omitNorms, but Doug suggested it. The problem with this ty

Re: Case Sensitivity

2008-08-28 Thread Andrzej Bialecki
Michael McCandless wrote: In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this issue: https://issues.apache.org/jira/browse/LUCENE-1366 This has consequences when searching - so if we expose it the javadoc has to be really good at explaining what's going on :) -- Best re

Re: Case Sensitivity

2008-08-28 Thread Yonik Seeley
On Thu, Aug 28, 2008 at 1:44 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this issue: I wasn't originally going to add a Field.Index at all for omitNorms, but Doug suggested it. The problem with this type-safe way of doing thin

Re: Case Sensitivity

2008-08-28 Thread Michael McCandless
In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this issue: https://issues.apache.org/jira/browse/LUCENE-1366 Mike Otis Gospodnetic wrote: Yes. And I think I have used this "trick" a couple of years ago, but have since forgotten about it. :) Otis -- Sematext -- http:

Re: Case Sensitivity

2008-08-28 Thread Otis Gospodnetic
Yes. And I think I have used this "trick" a couple of years ago, but have since forgotten about it. :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Andrzej Bialecki <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Thursday,

Re: boost freshness instead of sorting

2008-08-28 Thread Andrzej Bialecki
Steven A Rowe wrote: Hi Yannis, Hmm, hadn't thought about norms - you could just turn them off, right?:

Re: Case Sensitivity

2008-08-28 Thread Andrzej Bialecki
Otis Gospodnetic wrote: So in other words, it *is* possible to have the field both tokenized and its norms omitted? Yes. Probably this is an unintended side-effect of adding setOmitNorms, but I think it's useful and IMHO we should keep it. -- Best regards, Andrzej Bialecki <>< ___. __

RE: Clarity: Is there a Query boosting 50-50 over 1000-1 ?

2008-08-28 Thread Shi Hui Liu
Hi Grant, Thank you for your help. My query is A AND B. The problem is if I use BooleanQuery, I got score 120 from TermQuery(A) and 0.5 from TermQuery(B) for the first article; for second article, I got score 27 from TermQuery(A) and 36 from TermQuery(B). From my point of view, I think the seco

RE: boost freshness instead of sorting

2008-08-28 Thread Steven A Rowe
Hi Yannis, Hmm, hadn't thought about norms - you could just turn them off, right?: with

Re: Replicating Lucene Index with out SOLR

2008-08-28 Thread Otis Gospodnetic
Yes, I think you pinpointed what I see over and over with Solr. The two desires pull in opposite directions. I think Jason Rutherglen is very keen to start talking about Lucene clusters and index replication in such clusters without using the classic master/slave approach. Jason, want to star

Re: Case Sensitivity

2008-08-28 Thread Otis Gospodnetic
So in other words, it *is* possible to have the field both tokenized and its norms omitted? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Karl Wettin <[EMAIL PROTECTED]> > To: java-user@lucene.apache.org > Sent: Thursday, August 28, 200

RE: boost freshness instead of sorting

2008-08-28 Thread Yannis Pavlidis
Hey Steve, Thanks for the quick response. Apologies, my email was not very clear. I actually did what you and Andrzej propose. So in my test (with field boost and doc boost = 1) doc 0 has days: "1" and field weight = tf * idf * fieldNorm = sqrt(1) * idf * 1/sqrt(1) = idf doc 1 has days
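
The arithmetic in this message can be checked with a small plain-Java sketch. It assumes the formulas quoted above, tf = sqrt(freq) and fieldNorm = 1/sqrt(number of terms in the field) with field and document boost equal to 1, and it treats idf as an input because its value is not given in the post. The class and method names are hypothetical, and the sketch ignores the lossy encoding Lucene applies to stored norms.

```java
// Sketch only: reproduces field weight = tf * idf * fieldNorm with boosts of 1.
final class FieldWeightSketch {

    // Term-frequency factor as quoted in the post: sqrt(freq).
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Length norm as quoted in the post: 1 / sqrt(number of terms in the field).
    static double fieldNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    static double fieldWeight(int freq, double idf, int numTerms) {
        return tf(freq) * idf * fieldNorm(numTerms);
    }

    public static void main(String[] args) {
        double idf = 2.0; // placeholder value, not taken from the thread
        // One occurrence in a one-term field: the weight collapses to idf.
        System.out.println(fieldWeight(1, idf, 1));
        // Four occurrences in a four-term field: sqrt(4) and 1/sqrt(4) cancel out.
        System.out.println(fieldWeight(4, idf, 4));
    }
}
```

With these formulas, repeating a term n times in a field that holds only that term leaves the weight at idf, since sqrt(n) and 1/sqrt(n) cancel.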

RE: boost freshness instead of sorting

2008-08-28 Thread Steven A Rowe
Hi Yannis, On 08/28/2008 at 12:12 PM, Yannis Pavlidis wrote: > I am trying to boost the freshness of some of our documents > in the index using the most efficient way (i.e. if 2 news > stories have the same score based on the content then I want > to promote the one that was created last) > [...]

boost freshness instead of sorting

2008-08-28 Thread Yannis Pavlidis
Hi, I am trying to boost the freshness of some of our documents in the index using the most efficient way (i.e. if 2 news stories have the same score based on the content then I want to promote the one that was created last) I have tried several techniques which do not seem to be performing t

RE: Confused with NGRAM results

2008-08-28 Thread Steven A Rowe
Hi gaz77, Here's a good place to start: Steve On 08/28/2008 at 10:52 AM, gaz77 wrote: > > Hi, > > I'd appreciate if someone could explain the results I'm getting. > > I've written a simple custom analyzer that applies the > NGramToken

RE: Lucene sample code and api documentation

2008-08-28 Thread Steven A Rowe
Hi Sithu, On 08/27/2008 at 3:13 PM, Sudarsan, Sithu D. wrote: > 2. Where do we look for sample codes? Or detailed tutorials? Lots of good stuff here: and particularly here (books, articles, presentations, oh my!):

Confused with NGRAM results

2008-08-28 Thread gaz77
Hi, I'd appreciate if someone could explain the results I'm getting. I've written a simple custom analyzer that applies the NGramTokenFilter to the token stream during indexing. It's never applied during searching. The purpose of this is to match sub-words. Without the ngram filter, if I search
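
For readers unfamiliar with the sub-word matching idea described here, the following plain-Java sketch shows what character n-grams of a single token look like. It is an illustration only, not the NGramTokenFilter implementation: the class name `NGramSketch`, the method `ngrams`, and the gram sizes are assumptions chosen for the example, and the order in which Lucene emits grams may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: enumerates character n-grams of one token for gram sizes
// between minGram and maxGram (inclusive).
final class NGramSketch {

    static List<String> ngrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int size = minGram; size <= maxGram; size++) {
            // Slide a window of the current size across the token.
            for (int start = 0; start + size <= token.length(); start++) {
                grams.add(token.substring(start, start + size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // A search for the sub-word "cen" can only match if "cen" was indexed as a gram.
        System.out.println(ngrams("lucene", 2, 3));
    }
}
```

Because the filter is applied only at index time in the setup described, a query term matches only when it is exactly one of the indexed grams, so it has to fall within the configured gram sizes.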

Re: Analyzer at Query time

2008-08-28 Thread Yonik Seeley
On Thu, Aug 28, 2008 at 10:32 AM, Dino Korah <[EMAIL PROTECTED]> wrote: > If I am to completely avoid the query parser and use the BooleanQuery along > with TermQuery, RangeQuery, PrefixQuery, PhraseQuery, etc, do the search > words still get to the Analyzer, before actually doing the real search

Re: Analyzer at Query time

2008-08-28 Thread Mark Miller
Dino Korah wrote: Hi All, If I am to completely avoid the query parser and use the BooleanQuery along with TermQuery, RangeQuery, PrefixQuery, PhraseQuery, etc, do the search words still get to the Analyzer, before actually doing the real search? Many thanks, Dino Answer: no The Q

Analyzer at Query time

2008-08-28 Thread Dino Korah
Hi All, If I am to completely avoid the query parser and use the BooleanQuery along with TermQuery, RangeQuery, PrefixQuery, PhraseQuery, etc, do the search words still get to the Analyzer, before actually doing the real search? Many thanks, Dino

Re: Replicating Lucene Index with out SOLR

2008-08-28 Thread Shalin Shekhar Mangar
Slightly off-topic. Robert -- you may want to look at SOLR-561 -- Solr replication by Solr (for windows also) which is under development. https://issues.apache.org/jira/browse/SOLR-561 On Thu, Aug 28, 2008 at 7:39 PM, Robert Stewart <[EMAIL PROTECTED] > wrote: > We don't use Solr, since we run o

RE: Replicating Lucene Index with out SOLR

2008-08-28 Thread Robert Stewart
We don't use Solr, since we run on Windows ;(, but we did implement very similar snapshot replication. We have 2 master index servers building indexes, partitioned by document. Every 1 minute, we stop index writer, create a local snapshot (on the master server), in directory named MMDDHHM

RE: Lucene sample code and api documentation

2008-08-28 Thread Sudarsan, Sithu D.
Thanks Otis, I should be ordering the book soon :-) Yes old mails are there. Maybe these should be added to the FAQ. Thanks and regards Sithu -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 27, 2008 9:50 PM To: java-user@lucene.apache.org

Re: Clarity: Is there a Query boosting 50-50 over 1000-1 ?

2008-08-28 Thread Grant Ingersoll
On Aug 27, 2008, at 7:34 PM, Shi Hui Liu wrote: Hi, I think I should clarify my question a little bit. I'm using BooleanQuery to combine TermQuery(A) and TermQuery(B). But I'm not satisfied with its scoring algorithm. Are there other queries that can boost up the documents with 50 of A and 50

Re: Replicating Lucene Index with out SOLR

2008-08-28 Thread Bill Au
The snapinstaller script invokes the commit command to trigger Solr to do a commit, which opens a new index reader and then auto-warms the caches. You will need to replace that with your own code to do the same for your Lucene index. On Thu, Aug 28, 2008 at 1:47 AM, rahul_k123 <[EMAIL PROTECTED]> w

Re: Replicating Lucene Index with out SOLR

2008-08-28 Thread Bill Au
Solr uses Doug's rsync method to do replication. The scripts are pretty much standalone and do not require Solr. They should work on any Lucene index. Bill On Wed, Aug 27, 2008 at 11:52 PM, Kent Fitch <[EMAIL PROTECTED]> wrote: > Check out this recipe for using rsync by Doug Cutting: > http://

Re: when to refresh IndexSearcher and IndexWriter

2008-08-28 Thread Michael McCandless
Ganesh - yahoo wrote: Hello all, My index will get updated very frequently. 1) When should I optimize IndexWriter? I have planned to optimize every day. Is that fine? Probably you should test in your app, to see if optimization is even necessary and if so, at what frequency. Opti

Re: lucene 3.0 feature list?

2008-08-28 Thread Grant Ingersoll
We haven't even begun working on 3.0 other than the planning to say it will be on JDK 1.5. There may be a few tickets in JIRA that are marked as 3.0, though, but that doesn't even mean they will make it. And, the API will not necessarily be 2.4 compatible. That is not in our back compat.

when to refresh IndexSearcher and IndexWriter

2008-08-28 Thread Ganesh - yahoo
Hello all, My index will get updated very frequently. 1) When should I optimize IndexWriter? I have planned to optimize every day. Is that fine? 2) When should I re-open IndexReader and IndexSearcher? I have planned to do it every 10 minutes. 3) IndexSearcher could be used acr

Re: Replicating Lucene Index with out SOLR

2008-08-28 Thread mark harwood
>> You don't need to copy the whole index every time >> if you do incremental indexing/updates and don't optimize the index But at 5 minute intervals for replication does this not quickly lead to a very fragmented index? It seems there is a fundamental conflict when building replication system

Re: Can TermDocs.skipTo() go backwards

2008-08-28 Thread Michael McCandless
Antony Bowesman wrote: Michael McCandless wrote: Ahh right, my short term memory failed me ;) I now remember this thread. Excused :) I expect you have real work to occupy your mind! Well, understanding how people are pushing Lucene *is* the real work ;) This is exactly how Lucene grow

Re: Case Sensitivity

2008-08-28 Thread Karl Wettin
28 aug 2008 kl. 11.46 skrev Andrzej Bialecki: Karl Wettin wrote: 28 aug 2008 kl. 10.58 skrev Dino Korah: Document doc = new Document(); Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED); f.setOmitNorms(true); Would that be equivalent to Document doc = new Document

Re: Case Sensitivity

2008-08-28 Thread Andrzej Bialecki
Karl Wettin wrote: 28 aug 2008 kl. 10.58 skrev Dino Korah: Document doc = new Document(); Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED); f.setOmitNorms(true); Would that be equivalent to Document doc = new Document(); Field f = new Field("body", bodyText, Field

Re: Case Sensitivity

2008-08-28 Thread Karl Wettin
28 aug 2008 kl. 10.58 skrev Dino Korah: Document doc = new Document(); Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED); f.setOmitNorms(true); Would that be equivalent to Document doc = new Document(); Field f = new Field("body", bodyText, Field.Store.NO, Field.I

FW: Case Sensitivity

2008-08-28 Thread Dino Korah
Looks like my question went unnoticed among the more important Jira discussion. :( On the same topic, what would be the effect of the following code? Document doc = new Document(); Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED); f.setOmitNorms(true); Would that be eq