Re: App supplied docID in lucene possible?

2012-11-05 Thread Ravikumar Govindarajan
Looks far more complex than I had assumed!!! An invariant of "non-decreasing docid per flush", if pushed to the app, can save lucene from handling the complex sparse data logic, no? Lucene can hold its existing logic without major changes, detect any out-of-order doc before every flush and emit an
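
The "non-decreasing docid per flush" invariant from the message above can be illustrated with a small check. This is only a sketch in plain Java: the class `DocIdOrderSketch` and the method `firstOutOfOrder` are hypothetical names, not part of Lucene, and what should happen once an out-of-order doc is found is left open because the preview is cut off at that point.

```java
final class DocIdOrderSketch {
    /**
     * Returns the index of the first docID that is smaller than its
     * predecessor, or -1 if the buffered ids are non-decreasing.
     * Equal neighbours are allowed, matching "non-decreasing".
     */
    static int firstOutOfOrder(int[] bufferedDocIds) {
        for (int i = 1; i < bufferedDocIds.length; i++) {
            if (bufferedDocIds[i] < bufferedDocIds[i - 1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical buffers of app-supplied ids awaiting a flush.
        System.out.println(firstOutOfOrder(new int[] {1, 2, 2, 5})); // in order
        System.out.println(firstOutOfOrder(new int[] {1, 3, 2, 4})); // id 2 follows id 3
    }
}
```

The check is a single linear pass over the buffered ids, so running it before every flush would cost little compared with the flush itself.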

RE: Highlighting html pages

2012-11-05 Thread Scott Smith
Since no one answered this, I decided I'd answer it myself (in case anyone else wanted the answer). First, there are two types of filters you can use in an Analyzer -- Character filters and token filters. Character filters get applied before tokenization and token filters get applied after tok
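
The ordering described above (character filters run on the raw text, then the tokenizer, then token filters) can be sketched without any Lucene classes. Everything below is a plain-Java stand-in: `stripTags`, `tokenize` and `lowercase` are hypothetical helpers that only mimic the three stages, and the regex is far cruder than a real HTML-stripping filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class AnalysisOrderSketch {
    // Stage 1, stand-in for a character filter: rewrites the raw text
    // before any tokens exist. Assumption: a naive regex that removes
    // anything between '<' and '>', so it would also eat angle brackets
    // that are not tags.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Stage 2, stand-in for a tokenizer: splits the filtered text on whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stage 3, stand-in for a token filter: sees one token at a time.
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    static List<String> analyze(String html) {
        return lowercase(tokenize(stripTags(html)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("<p>Hello <b>World</b></p>")); // prints [hello, world]
    }
}
```

Because the tags are gone before tokenization, the later stages never see markup; that is the practical difference between the two filter types.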

Near Real Time for multiple applications

2012-11-05 Thread Scott Smith
I've been reading about NRT thinking it might be good to integrate it into my code. However, I have a question. Suppose that the index writer and the index reader run in totally different JVMs (i.e., they are different applications and only communicate via the disk). Am I correct in thinking

Re: Lucene API

2012-11-05 Thread Vitaly Funstein
Term in my view is definitely not any more of a char buffer than a plain String. It's a unique permutation of a particular field name and its text value. If you look at its public API, the only way to mutate a Term instance is by obtaining a reference to underlying BytesRef which is in itself mutab

Re: Lucene API

2012-11-05 Thread Igal @ getRailo.org
it's CharTermAttribute in particular but since there are many such particular examples -- at some point it becomes Lucene in general. perhaps the problem is on my end that I'm not familiar enough with DSL-style, but learning DSL concepts is not a prerequisite for Lucene. as for the Term being

Re: Lucene API

2012-11-05 Thread Vitaly Funstein
Are you critiquing CharTermAttribute in particular, or Lucene in general? It appears CharTermAttribute is a DSL-style builder API, just like its superinterface Appendable - does that not appear intentional and self-explanatory? Further, I believe Term instances are meant to be immutable hence no dire

Lucene API

2012-11-05 Thread Igal @ getRailo.org
I don't mean to sound critical, but is there a reason that the API is not simpler? for example, if I want to read/modify a CharTermAttribute's value, I need to use toString() to get the value, which is very unintuitive, and either copyBuffer() or setEmpty() and append(). is there a reason no

FuzzyQuery minimumSimilarity

2012-11-05 Thread Damian Birchler
Hi there. Lucene calculates the string similarity between two strings s1 and s2 according to the formula Similarity = Levenshtein-Distance(s1,s2)/min(Length(s1),Length(s2)). I would have thought Lucene would divide by the length of the longer string. In particular, the above formula could - in m
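
To make the two candidate denominators concrete, here is a minimal sketch that computes the ratio exactly as written in the message, next to the variant that divides by the longer string. It is plain Java and does not call Lucene; `FuzzyRatioSketch` and its methods are hypothetical names, and the code makes no claim about what FuzzyQuery actually does internally.

```java
final class FuzzyRatioSketch {
    /** Levenshtein distance: inserts, deletes and substitutions each cost 1. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Ratio as quoted in the message: distance / min(length). Assumes non-empty inputs.
    static double ratioByShorter(String s1, String s2) {
        return (double) levenshtein(s1, s2) / Math.min(s1.length(), s2.length());
    }

    // Alternative the poster expected: distance / max(length). Assumes non-empty inputs.
    static double ratioByLonger(String s1, String s2) {
        return (double) levenshtein(s1, s2) / Math.max(s1.length(), s2.length());
    }

    public static void main(String[] args) {
        System.out.println(ratioByShorter("ab", "xyzw")); // prints 2.0
        System.out.println(ratioByLonger("ab", "xyzw"));  // prints 1.0
    }
}
```

For "ab" and "xyzw" the distance is 4, so dividing by the shorter length gives 2.0 while dividing by the longer length stays at 1.0. The distance can never exceed the longer length, which is why only the second ratio is bounded by 1.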

Re: App supplied docID in lucene possible?

2012-11-05 Thread Michael McCandless
On Mon, Nov 5, 2012 at 4:37 AM, Ravikumar Govindarajan wrote: > Thanks Mike, > > Joins could be slower than docID based approach, no? Yes: slower at search time but faster at update time (generally not a good tradeoff... but it seems like in your case slow updates are the problem). > It would be

Re: Highlighting html pages

2012-11-05 Thread Michael Sokolov
HTMLStripCharFilter runs first, before any tokenizer, strips all the tags, and leaves all your text intact. If you have angle brackets in the text (ie not tags), they will be left as is. All your other analysis code should work just the same as if the text came from a plain text file. Which

Re: Overriding DefaultSimilarity to not consider tf/idf and friends

2012-11-05 Thread Erick Erickson
First I'd see if omitting term frequencies, positions, and norms did what you need; these are all things you can disable OOB... Best Erick On Mon, Nov 5, 2012 at 5:26 AM, Damian Birchler wrote: > Hi everyone > > We are using Lucene to search for possible duplicates in an address

Overriding DefaultSimilarity to not consider tf/idf and friends

2012-11-05 Thread Damian Birchler
Hi everyone We are using Lucene to search for possible duplicates in an address database. We create an index with a document for each person in the database. Each document has a field with one term for the first name, a field with one term for the last name and so on. I think in this setting it

RE: "read past EOF" when merge

2012-11-05 Thread Markus Jelsma
https://issues.apache.org/jira/browse/SOLR-4032 -Original message- > From:Mark Miller > Sent: Sat 03-Nov-2012 14:20 > To: java-user@lucene.apache.org > Subject: Re: "read past EOF" when merge > > Can you file a JIRA Markus? This is probably related to the new code that > uses Direct

Re: App supplied docID in lucene possible?

2012-11-05 Thread Ravikumar Govindarajan
Thanks Mike, Joins could be slower than docID based approach, no? It would be great if lucene can incorporate an external docID after weighing the pros & cons. Many like us will be willing to trade off search latency to some extent, in return for the low-hanging fruit --- Ravi On Fri, Nov 2, 2