Re: get wordno, lineno, pageno for term/phrase

2010-08-07 Thread Babak Farhang
How about making each line a separate document? You'd worry about scaling it later (e.g. the 32-bit limitation in the number of docs in an index). On Fri, Aug 6, 2010 at 11:37 AM, arun r wrote: > I am trying to create a custom analyzer that will check for pagebreak > and linebreak and add the pa…
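The line-per-document suggestion above can be pictured as a pre-indexing split step. A minimal standalone sketch (plain Java, no Lucene; the `LineDoc` record and the form-feed page-break convention are illustrative assumptions, standing in for fields you would put on a Lucene `Document`):

```java
import java.util.ArrayList;
import java.util.List;

public class LineSplitter {
    // One "document" per line, carrying its own page and line numbers as
    // plain fields. In Lucene these would become stored/indexed fields on
    // a Document; the names here are illustrative only.
    public record LineDoc(int pageNo, int lineNo, String text) {}

    public static List<LineDoc> split(String input) {
        List<LineDoc> docs = new ArrayList<>();
        int page = 1, line = 1;
        for (String raw : input.split("\n", -1)) {
            if (raw.equals("\f")) {   // assume a lone form feed marks a page break
                page++;
                line = 1;
                continue;
            }
            docs.add(new LineDoc(page, line++, raw));
        }
        return docs;
    }

    public static void main(String[] args) {
        for (LineDoc d : split("first line\nsecond line\n\f\nthird line")) {
            System.out.println("page=" + d.pageNo() + " line=" + d.lineNo() + " " + d.text());
        }
    }
}
```

With each line its own document, word/line/page positions fall out of ordinary fields, at the cost of many more documents, which is exactly the scaling concern raised in the reply.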

Re: get wordno, lineno, pageno for term/phrase

2010-08-09 Thread Babak Farhang
…is spread > across two pages, then the span search does not capture it. Is there a > work around for this ? > > On Sat, Aug 7, 2010 at 8:00 PM, Babak Farhang wrote: >> How about making each line a separate document? You'd worry about >> scaling it later (e.g. the 32-bit limitation…

Re: Scores between words. Boosting?

2009-03-16 Thread Babak Farhang
Since you're configuring/writing your own analyzer, why not generate a token stream that emits bi-grams? Sure, you're expanding the number of terms in the index, so there's some overhead there. On the plus side, however, your bi-grams, as you've described them, are ordered--which reduces the potential…
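The bi-gram idea above is just a transformation over the term stream. Lucene ships it as `ShingleFilter`; this standalone sketch shows only the transformation itself, on a plain token list rather than a real `TokenStream`:

```java
import java.util.ArrayList;
import java.util.List;

public class Bigrams {
    // Emit word bi-grams ("new york", "york city") from an ordered token
    // list. Because each emitted term encodes adjacency and order, a match
    // on a bi-gram is stronger evidence than two independent unigram hits,
    // at the cost of a larger term dictionary.
    public static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams(List.of("new", "york", "city")));
        // [new york, york city]
    }
}
```

In a real analyzer you would emit these alongside (or instead of) the unigrams, which is what drives the index-size overhead mentioned above.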

Re: Scores between words. Boosting?

2009-03-16 Thread Babak Farhang
…the docs w/ just cat. You might be able to do something with a > PrefixQuery on the n-grams or a separate field that doesn't do bigrams. > > Still, that feels like a stretch for some reason. > > -Grant > > > On Mar 16, 2009, at 3:39 PM, Babak Farhang wrote: > >> …

Re: Lucene Index Encryption

2009-05-10 Thread Babak Farhang
Seems to me this discussion is not necessarily limited to *encryption*: if you can implement encryption, you can also implement compression--which is perhaps interesting for archiving purposes (at access time, faster than unzipping an entire archived Directory and loading it, for example). >> Lucene…
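The compress-on-write idea can be shown at the byte level with the JDK's `java.util.zip` streams. This is only the stream-wrapping concept, not a Lucene integration: as the rest of this thread points out, Lucene's writer uses seek/overwrite in places, which is exactly what a naive one-pass stream wrapper like this cannot support.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class CompressedSink {
    // Compress on write: bytes pass through a Deflater before hitting storage.
    public static byte[] compress(byte[] plain) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
            out.write(plain);
        }
        return bytes.toByteArray();
    }

    // Decompress on read: the symmetric wrapper on the input side.
    public static byte[] decompress(byte[] packed) throws IOException {
        return new InflaterInputStream(new ByteArrayInputStream(packed)).readAllBytes();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "some postings data, repeated repeated repeated".getBytes();
        System.out.println(new String(decompress(compress(data))));
    }
}
```

A real Lucene hook would wrap `Directory`/`IndexOutput` instead of raw streams, and would have to handle the seek/overwrite pattern discussed below.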

Re: Lucene Index Encryption

2009-05-11 Thread Babak Farhang
On Mon, May 11, 2009 at 12:19 AM, Andrzej Bialecki wrote: > > Unfortunately, current Lucene IndexWriter implementation uses seek / > overwrite when writing term info dictionary. This is described in more > detail here: > > https://issues.apache.org/jira/browse/LUCENE-532 > Thanks for the enlightenment…

Re: relevance function for scores

2009-05-25 Thread Babak Farhang
How about determining the cutoff by measuring the percentage difference between successive scores: if the score drops by a threshold amount then you've hit the cutoff. In the example you mention, you might want to try something like c/1000, where 1 < c < 25 is a constant (experiment to find a sweet spot)…

Re: relevance function for scores

2009-05-25 Thread Babak Farhang
Woops, got that backwards. It should read: > if (score[n] / score[n-1]) < c / (boost_factor) On Mon, May 25, 2009 at 4:10 PM, Babak Farhang wrote: > How about determining the cutoff by measuring the percentage > difference between successive scores: if the score drops by a > threshold…
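The corrected rule above, score[n] / score[n-1] < c / boost_factor, is easy to sketch. Here `threshold` stands in for the c / boost_factor constant from the thread (a tuning knob, not a prescribed value):

```java
public class ScoreCutoff {
    // Return the index of the first hit whose score drops too sharply
    // relative to its predecessor: score[n] / score[n-1] < threshold.
    // Returns scores.length if no sharp drop is found.
    // Assumes scores are sorted in descending order.
    public static int findCutoff(double[] scores, double threshold) {
        for (int n = 1; n < scores.length; n++) {
            if (scores[n] / scores[n - 1] < threshold) {
                return n;
            }
        }
        return scores.length;
    }

    public static void main(String[] args) {
        double[] scores = {12.0, 11.4, 11.1, 2.0, 1.9};
        // 2.0 / 11.1 is far below 0.5, so the cutoff lands at index 3.
        System.out.println(findCutoff(scores, 0.5)); // 3
    }
}
```

Everything before the returned index is "relevant enough"; the ratio test makes the cutoff scale-free, so it works the same whether raw scores are around 12 or around 0.12.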

Redundant fields in Token class?

2009-11-13 Thread Babak Farhang
I'm writing a TokenFilter and am confused about why class Token has both an *endOffset* and a *termLength* field. It would appear that the following invariant should always hold for a Token instance: termLength() == endOffset() - startOffset() If so, then 1) Why 2 fields, instead of 1? 2) W…
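As the replies in this thread note, the invariant can legitimately break: a filter may replace the term text while keeping the original source offsets, so highlighters can still mark the raw input. A minimal illustration with a stand-in record (not Lucene's actual Token class):

```java
public class TokenInvariant {
    // Stand-in for Lucene's Token: term text plus source offsets.
    // Offsets point into the ORIGINAL input; term text may be rewritten.
    public record Token(String term, int startOffset, int endOffset) {
        int termLength() { return term.length(); }
    }

    public static void main(String[] args) {
        // A hypothetical stemming filter turns "running" (offsets 0..7 in
        // the source text) into "run", but keeps the original offsets so
        // that highlighting still covers the full source word. Now:
        Token stemmed = new Token("run", 0, 7);
        System.out.println(stemmed.termLength());                        // 3
        System.out.println(stemmed.endOffset() - stemmed.startOffset()); // 7
    }
}
```

So the two fields answer different questions: termLength describes the (possibly rewritten) term text, while the offsets describe where in the original input it came from.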

Re: Redundant fields in Token class?

2009-11-13 Thread Babak Farhang
…even >> follow a contract like end-start=length. >> >> - >> Uwe Schindler >> H.-H.-Meier-Allee 63, D-28213 Bremen >> http://www.thetaphi.de >> eMail: u...@thetaphi.de >> >> > -Original Message- >> > From: Babak Farhang [m…

Re: Redundant fields in Token class?

2009-11-13 Thread Babak Farhang
… > it has to break input tokens into subtokens and correct offsets... sounds > like you are on the right track though. > > On Fri, Nov 13, 2009 at 10:30 PM, Babak Farhang wrote: > >> Thanks for your explanations. I think I have a basic understanding now. >> >> W…

synonym-group filter

2009-11-14 Thread Babak Farhang
SynonymTokenFilter, if I understand correctly, maps a given token to a set of tokens representing its synonyms. If used in the filter chain of a query analyzer, it causes a "query expansion". (Correct terminology?) If used in the filter chain of an indexing analyzer, it causes "index expansion". I was wondering…
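The query-expansion vs. index-expansion distinction above is about *where* one and the same token transformation is applied. A standalone sketch of the transformation (the synonym map is made up for illustration, and this is not Lucene's SynonymTokenFilter API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SynonymExpander {
    // Map each token to itself plus its synonym group. Applied at query
    // time this is "query expansion" (bigger queries, lean index);
    // applied at index time it is "index expansion" (bigger index,
    // cheap queries). The map below is illustrative, not a real resource.
    static final Map<String, List<String>> SYNONYMS =
        Map.of("fast", List.of("quick", "rapid"));

    public static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t);
            out.addAll(SYNONYMS.getOrDefault(t, List.of()));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of("fast", "car")));
        // [fast, quick, rapid, car]
    }
}
```

In Lucene the injected synonyms would typically share the original token's position (position increment 0) so that phrase queries keep working; this list-based sketch omits positions entirely.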

Switching from Store.YES to Store.NO

2010-01-05 Thread Babak Farhang
Hi, A review of the requirements of the project I'm working on has led us to conclude that going forward we don't need Lucene to store certain field values--just index them. Owing to the large size of the data, we can't really afford to reindex everything. (Going forward, we plan to treat these fields…

Re: Switching from Store.YES to Store.NO

2010-01-05 Thread Babak Farhang
…specify Store.NO. > > I don't think this (what happens when certain schema changes happen > mid-indexing) is well documented, in general. > > Mike > > On Tue, Jan 5, 2010 at 12:01 PM, Babak Farhang wrote: >> Hi, >> >> A review of the requirements of the project…

Re: Switching from Store.YES to Store.NO

2010-01-05 Thread Babak Farhang
…don't think this (what happens when certain schema changes happen >> mid-indexing) is well documented, in general. >> >> Mike >> >> On Tue, Jan 5, 2010 at 12:01 PM, Babak Farhang wrote: >> >>> >>> Hi, >>> >>> A review of the requirements…

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Babak Farhang
>> I wonder if renaming that to maxSegSizeMergeMB would make it more obvious >> what this does? How about using the *able* moniker to make it clear we're referring to the size of the to-be-merged segment, not the resultant merged segment? I.e. naming it something like "maxMergeableSegSizeMB"…

incremental document field update

2010-01-14 Thread Babak Farhang
Hi, I've been thinking about how to update a single field of a document without touching its other fields. This is an old problem and I was considering a solution along the lines of Andrzej Bialecki's post to the dev list back in '07: http://markmail.org/message/tbkgmnilhvrt6bii > I have the following…
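One way to picture the single-field-update problem discussed in this thread (this is neither LUCENE-1879 nor Andrzej's proposal, just the general overlay bookkeeping): keep updated values in a mutable side structure, consult it before the immutable base, and periodically fold it back in.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldOverlay {
    // Immutable "base" field values, standing in for what the index
    // stores, plus a mutable overlay of per-document updates. Readers
    // check the overlay first; a background merge would fold the overlay
    // back into a rewritten base. Purely illustrative bookkeeping.
    private final Map<Integer, String> base;
    private final Map<Integer, String> overlay = new HashMap<>();

    public FieldOverlay(Map<Integer, String> base) {
        this.base = base;
    }

    public void update(int docId, String value) {
        overlay.put(docId, value);
    }

    public String get(int docId) {
        return overlay.getOrDefault(docId, base.get(docId));
    }

    public static void main(String[] args) {
        FieldOverlay f = new FieldOverlay(Map.of(0, "draft", 1, "draft"));
        f.update(1, "published");          // touch one field of one doc
        System.out.println(f.get(0) + " " + f.get(1)); // draft published
    }
}
```

The hard parts the thread goes on to discuss, keeping overlay doc-ids aligned with the base across segment merges and handling new terms in the postings, are exactly what this toy map glosses over.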

Re: incremental document field update

2010-01-14 Thread Babak Farhang
> Reading that trail, I wish the original poster gave up on his idea ( Err, that should have read: "Reading that trail, I wish the original poster hadn't given up on his idea" On Thu, Jan 14, 2010 at 2:23 AM, Babak Farhang wrote: > Hi, > > I've been thinking…

Re: incremental document field update

2010-01-17 Thread Babak Farhang
-Babak On Thu, Jan 14, 2010 at 3:39 AM, Michael McCandless wrote: > Parallel incremental indexing > (http://issues.apache.org/jira/browse/LUCENE-1879) is one way to solve > this. > > Mike > > On Thu, Jan 14, 2010 at 4:27 AM, Babak Farhang wrote: >>> Reading that trail…

Re: incremental document field update

2010-01-17 Thread Babak Farhang
…17, 2010 at 3:06 AM, Michael McCandless wrote: > On Sun, Jan 17, 2010 at 4:33 AM, Babak Farhang wrote: >> Thanks Mike! This is pretty cool. >> >> So LUCENE-1879 takes care of aligning (syncing) doc-ids across >> parallel index / segment merges. Missing is the machinery for…

Re: incremental document field update

2010-01-17 Thread Babak Farhang
…the N updates would likely approach O(N**2). So as ever, there are tradeoffs. -Babak On Sun, Jan 17, 2010 at 6:39 AM, Michael McCandless wrote: > On Sun, Jan 17, 2010 at 7:45 AM, Babak Farhang wrote: >>> So the idea is, I can change the field for only a few docs in a >>> …

Re: incremental document field update

2010-01-18 Thread Babak Farhang
…fields. I imagine we also need a parallel dictionary for these mapped postings lists in order to deal with new terms encountered during the update. Not sure how this would work. Can you elaborate? And how would we deal with updated stored fields? -Babak On Mon, Jan 18, 2010 at 4:42 AM, Michael McCandless…

Re: incremental document field update

2010-01-19 Thread Babak Farhang
…and .tvx files for per-document data at search time, and index-time mapped doc-ids for the posting lists. -Babak On Tue, Jan 19, 2010 at 3:48 AM, Michael McCandless wrote: > On Tue, Jan 19, 2010 at 1:32 AM, Babak Farhang wrote: >>> This is about multiple sessions with the writer. Ie…

Re: incremental document field update

2010-01-21 Thread Babak Farhang
…possibility of a bad read. Make N large enough (max 256), and that should close the window, I think. Anyway, just want to thank you Mike for sharing your thoughts and ideas. Time to try some of them. Cheers, -Babak On Wed, Jan 20, 2010 at 3:41 AM, Michael McCandless wrote: > On Tue, Jan 19, …