How about making each line a separate document? You'd worry about
scaling it later (e.g. the 32-bit limitation in the number of docs in
an index)..
On Fri, Aug 6, 2010 at 11:37 AM, arun r wrote:
> I am trying to create a custom analyzer that will check for pagebreak
> and linebreak and add the pa[...] If a phrase is spread
> across two pages, then the span search does not capture it. Is there a
> workaround for this?
>
> On Sat, Aug 7, 2010 at 8:00 PM, Babak Farhang wrote:
>> How about making each line a separate document? You'd worry about
>> scaling it later (e.g. the 32-bit limitation in the number of docs in
>> an index)..
Since you're configuring/writing your own analyzer, why not generate a
token stream that emits bi-grams? Sure, you're expanding the number of
terms in the index, so there's some overhead there. On the plus side,
however, your bi-grams, as you've described them, are ordered--which
reduces the poten[...]
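One way to get there, assuming contrib's ShingleFilter is available (a
sketch; the whitespace tokenizer is just for illustration):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;

// Emits word bi-grams ("shingles") in addition to the original unigrams.
public class BigramAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    ShingleFilter bigrams =
        new ShingleFilter(new WhitespaceTokenizer(reader), 2);
    // Keep single terms too, so ordinary one-term queries still match.
    bigrams.setOutputUnigrams(true);
    return bigrams;
  }
}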
> [...] the docs w/ just cat. You might be able to do something with a
> PrefixQuery on the n-grams or a separate field that doesn't do bigrams.
>
> Still, that feels like a stretch for some reason.
>
> -Grant
>
>
> On Mar 16, 2009, at 3:39 PM, Babak Farhang wrote:
>
>> [...]
Seems to me this discussion is not necessarily limited to
*encryption*: if you can implement encryption, you can also implement
compression--which is perhaps interesting for archiving purposes
(faster, at access time, than unzipping an entire archived Directory
and loading it, for example).
>> Luce[...]
On Mon, May 11, 2009 at 12:19 AM, Andrzej Bialecki wrote:
>
> Unfortunately, current Lucene IndexWriter implementation uses seek /
> overwrite when writing term info dictionary. This is described in more
> detail here:
>
> https://issues.apache.org/jira/browse/LUCENE-532
>
Thanks for the enlight[...]
How about determining the cutoff by measuring the percentage
difference between successive scores: if the score drops by a
threshold amount then you've hit the cutoff. In the example you
mention, you might want to try something like c/1000, where 1 < c < 25
is a constant (experiment to find a sweet spot).
Whoops. Got that backwards.. should read:
> if (score[n] / score[n-1]) < c / (boost_factor)
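In code, the corrected rule might look like this (just a sketch; c and
the boost factor are the knobs from this thread, to be tuned by
experiment):

import org.apache.lucene.search.ScoreDoc;

public class ScoreCutoff {
  /** Returns the index of the first hit past the drop, or hits.length. */
  public static int cutoff(ScoreDoc[] hits, float c, float boostFactor) {
    for (int n = 1; n < hits.length; n++) {
      // A sharp relative drop between successive scores marks the cutoff.
      if (hits[n].score / hits[n - 1].score < c / boostFactor) {
        return n;
      }
    }
    return hits.length; // no cliff found; keep everything
  }
}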
On Mon, May 25, 2009 at 4:10 PM, Babak Farhang wrote:
> How about determining the cutoff by measuring the percentage
> difference between successive scores: if the score drops by a
> threshold amount then you've hit the cutoff. [...]
I'm writing a TokenFilter and am confused about why the Token class
has both an *endOffset* and a *termLength* field. It would appear that
the following invariant should always hold for a Token instance:
termLength() == endOffset() - startOffset()
If so, then
1) Why 2 fields, instead of 1?
2) W[...]
>> [...] Tokens need not even
>> follow a contract like end-start=length.
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Babak Farhang [mailto:...]
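To make that point concrete (my own toy example, not from the thread):
a token injected by, say, a synonym filter covers a span of the
original text that has nothing to do with the length of its own term
text.

import org.apache.lucene.analysis.Token;

public class TokenInvariantDemo {
  public static void main(String[] args) {
    // A synonym token injected over the input text "car" (offsets 0-3):
    Token tok = new Token("automobile", 0, 3);
    System.out.println(tok.termLength());                     // 10, length of the injected term
    System.out.println(tok.endOffset() - tok.startOffset());  // 3, span of the source text
  }
}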
>
> it has to break input tokens into subtokens and correct offsets... sounds
> like you are on the right track though.
>
> On Fri, Nov 13, 2009 at 10:30 PM, Babak Farhang wrote:
>
>> Thanks for your explanations. I think I have a basic understanding now.
>>
>> W[...]
SynonymTokenFilter, if I understand correctly, maps a given token to a
set of tokens representing its synonyms. If used in the filter chain
of a query analyzer, it causes "query expansion". (Correct
terminology?) If used in the filter chain of the indexing analyzer, it
causes "index expansion".
I was wonde[...]
Hi,
A review of the requirements of the project I'm working on has led us
to conclude that going forward we don't need Lucene to store certain
field values--just index them. Owing to the large size of the data, we
can't really afford to reindex everything. (Going forward, we plan to
treat these fields [...]
[...] specify Store.NO.
>
> I don't think this (what happens when certain schema changes happen
> mid-indexing) is well documented, in general.
>
> Mike
>
> On Tue, Jan 5, 2010 at 12:01 PM, Babak Farhang wrote:
>> Hi,
>>
>> A review of the requirements of the project I'm working on has led us [...]
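Concretely, the change going forward might look like this (a sketch;
the field name is made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class StoreNoExample {
  static Document makeDoc(String bodyText) {
    Document doc = new Document();
    // Indexed (searchable) as before, but no stored value from here on.
    // Documents indexed earlier keep whatever stored values they have.
    doc.add(new Field("body", bodyText, Field.Store.NO, Field.Index.ANALYZED));
    return doc;
  }
}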
>> I don't think this (what happens when certain schema changes happen
>> mid-indexing) is well documented, in general.
>>
>> Mike
>>
>> On Tue, Jan 5, 2010 at 12:01 PM, Babak Farhang wrote:
>>
>>>
>>> Hi,
>>>
>>> A review of the requirements of the project I'm working on has led us [...]
>> I wonder if renaming that to maxSegSizeMergeMB would make it more obvious
>> what this does?
How about using the *able* moniker to make it clear we're referring to
the size of the to-be-merged segment, not the resultant merged
segment? I.e. naming it something like "maxMergeableSegSizeMB" ..
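For reference, the setting under discussion is used like this (a usage
sketch against the 3.1-era API; the 512 MB figure is arbitrary):

import org.apache.lucene.index.LogByteSizeMergePolicy;

public class MergePolicyConfig {
  static LogByteSizeMergePolicy configure() {
    LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
    // Caps the size of segments *selected* for merging; it does not
    // bound the size of the merged result.
    mp.setMaxMergeMB(512.0);
    return mp;
  }
}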
Hi,
I've been thinking about how to update a single field of a document
without touching its other fields. This is an old problem and I was
considering a solution along the lines of Andrzej Bialecki's post to
the dev list back in '07:
http://markmail.org/message/tbkgmnilhvrt6bii
I have the fo[...]
> Reading that trail, I wish the original poster gave up on his idea ([...]
Err, that should have read..
"Reading that trail, I wish the original poster hadn't given up on his idea"
On Thu, Jan 14, 2010 at 2:23 AM, Babak Farhang wrote:
> Hi,
>
> I've been thinking about how to update a single field of a document [...]
-Babak
On Thu, Jan 14, 2010 at 3:39 AM, Michael McCandless wrote:
> Parallel incremental indexing
> (http://issues.apache.org/jira/browse/LUCENE-1879) is one way to solve
> this.
>
> Mike
>
> On Thu, Jan 14, 2010 at 4:27 AM, Babak Farhang wrote:
>>> Reading that trail, I wish the original poster [...]
On Sun, Jan 17, 2010 at 3:06 AM, Michael McCandless wrote:
> On Sun, Jan 17, 2010 at 4:33 AM, Babak Farhang wrote:
>> Thanks Mike! This is pretty cool..
>>
>> So LUCENE-1879 takes care of aligning (syncing) doc-ids across
>> parallel index / segment merges. Missing is the machinery for [...]
[...] the N updates would likely approach O(N**2).
So as ever, there are tradeoffs.
-Babak
On Sun, Jan 17, 2010 at 6:39 AM, Michael McCandless wrote:
> On Sun, Jan 17, 2010 at 7:45 AM, Babak Farhang wrote:
>>> So the idea is, I can change the field for only a few docs in a
>>> [...]
[...] fields. I imagine
we also need a parallel dictionary for these mapped postings lists in order
to deal with new terms encountered during the update. Not sure how this
would work. Can you elaborate?
And how would we deal with updated stored fields?
-Babak
On Mon, Jan 18, 2010 at 4:42 AM, Michael McCandless wrote:
[...] and .tvx files for per-document
data at search time, and index-time mapped doc-ids for the posting
lists.
-Babak
On Tue, Jan 19, 2010 at 3:48 AM, Michael McCandless wrote:
> On Tue, Jan 19, 2010 at 1:32 AM, Babak Farhang wrote:
>>> This is about multiple sessions with the writer. Ie [...]
[...] possibility of
a bad read. Make N large enough (max 256), and that should close the
window, I think.
Anyway, just want to thank you, Mike, for sharing your thoughts and
ideas. Time to try some of them..
Cheers,
-Babak
On Wed, Jan 20, 2010 at 3:41 AM, Michael McCandless wrote:
> On Tue, Jan 19, [...]