Re: Combining score from two or more hits
Chris Hostetter wrote: if you are using a HitCollector, then any re-evaluation is going to happen in your code using whatever mechanism you want -- once your collect method is called on a docid, Lucene is done with that docid and no longer cares about it ... it's only whatever storage you may be maintaining of high scoring docs that needs to know that you've decided the score has changed. your big problem is going to be that you basically need to maintain a list of *every* doc collected, if you don't know what the score of any of them are until you've processed all the rest ... since docs are collected in increasing order of docid, you might be able to make some optimizations based on how big of a gap you've got between the doc you are currently collecting and the last doc you've collected if you know that you're always going to add docs that "relate" to each other in sequential bundles -- but this would be some very custom code depending on your use case.

I only ever need to return a couple of ID fields per doc hit, so I load them with FieldCache when I start a new searcher. These IDs refer to unique objects elsewhere, but there can be one or more instances of the same ID in the index because of the way I've structured Documents: a Document = an attachment in the other system, attached to the other system's object, which can have 1...n attachments.

My problem is that I need to return only unique external IDs, with some kind of combined score, up to the maxHits requested by the client. Getting the unique IDs is no problem, but as you say I either have to store all hits and then sort them by score at the end once I know all the unique docs, or do some clever stuff with some type of PriorityQueue that allows me to re-jig scores that already exist in the sorted queue.

One idea your comments raise is the relationship of docids to the group of Documents added for the higher-level object. All the Documents for the external object are added with a single writer at index time. Assuming that the Documents for a single external ID will either all exist or none, will the doc ids always remain sequential for that external ID, or will they 'reorganise' themselves?

Thanks
Antony
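For illustration, a rough sketch of the "collect everything, combine per external ID, sort at the end" approach being discussed might look like the following (Lucene 2.1-era API). The field name "externalId" and the score-summing policy are assumptions for the example, not necessarily what Antony's index uses:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

// Accumulates one combined score per external ID; sorting happens once collection is done.
public class ExternalIdCollector extends HitCollector {
    private final String[] extIds;               // docid -> external ID, loaded via FieldCache
    private final Map combined = new HashMap();  // external ID -> Float combined score

    public ExternalIdCollector(IndexReader reader) throws IOException {
        // one-off load when a new searcher is opened, as described above
        extIds = FieldCache.DEFAULT.getStrings(reader, "externalId");
    }

    public void collect(int doc, float score) {
        String id = extIds[doc];
        Float old = (Float) combined.get(id);
        combined.put(id, new Float(old == null ? score : old.floatValue() + score));
    }

    /** Call after the search: sorts all unique IDs by combined score and truncates to maxHits. */
    public List topIds(int maxHits) {
        List entries = new ArrayList(combined.entrySet());
        Collections.sort(entries, new Comparator() {
            public int compare(Object a, Object b) {
                return ((Float) ((Map.Entry) b).getValue())
                        .compareTo((Float) ((Map.Entry) a).getValue());
            }
        });
        return entries.subList(0, Math.min(maxHits, entries.size()));
    }
}

Usage would be searcher.search(query, new ExternalIdCollector(reader)) followed by topIds(maxHits); swap the addition in collect() for Math.max() or whatever combination rule fits the ranking you want.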
RE: Reverse search
Well, I thought to use the PerFieldAnalyzerWrapper, which has as its default the SnowballAnalyzer with English stopwords, and use SnowballAnalyzers with language-specific stopwords for the fields that will be in different languages. But I see that in your MemoryIndexTest you commented out the use of SnowballAnalyzer -- is that because it's too slow? In that case, I think I could use the StandardAnalyzer... what do you think?

Mélanie

-Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Friday, March 23, 2007 12:46 PM To: java-user@lucene.apache.org Subject: Re: Reverse search

On 23 mar 2007, at 03.07, Melanie Langlois wrote:

> Thanks Karl, the performance graph is really amazing! I have to say that I would not have thought this way around would be faster, but it sounds nice if I can use this; it makes everything easier to manage. I'm just wondering what you considered when you built your graph -- only the time to run the queries? Because I should add the time for creating the index any time a new document comes in (or a subset of documents if several come in at the same time), and the indexing of these documents. The documents should not be big, around 2KB. Did you measure this part?

Adding a document to a MemoryIndex or InstantiatedIndex takes more or less the same time it would take to add it to an empty RAMDirectory. How many clock ticks are spent really depends on what analysers you use.

-- karl

> Mélanie
>
> -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Friday, March 23, 2007 10:35 AM To: java-user@lucene.apache.org Subject: Re: Reverse search
>
> On 23 mar 2007, at 02.12, Melanie Langlois wrote:
>
>> I want to manage user subscriptions to specific documents. So I would like to store the subscription (query) in the Lucene directory, and whenever I receive a new document, I will search for all the matching subscriptions to send the document to all subscribers. For instance, if a user subscribes to all documents with text containing (WORD1 AND WORD2) OR WORD3, how can I match the incoming document against the stored subscriptions? I was thinking of having two subfields for each field of the subscription: the AND conditions and the OR conditions.
>>
>> - OR: I will tokenize the document field content, insert OR between each of the tokens, and run the query against the OR condition of the subscription.
>>
>> - It's for the AND that I will have an issue, because the incoming text may contain more words than the sequence I want to search.
>>
>> For instance, if I subscribe to documents containing lucene and java, and the incoming document content is "lucene is a great API which has been developed in java", once I remove stopwords my query would look like lucene AND great AND API AND developed AND java.
>>
>> As the query is composed of more words than the stored subscription, I will fail to retrieve the subscription. But if I use only OR'ed words, the results will not be accurate, as I could match a subscription for just java, for instance.
>
> I wrote such a thing way back, where I used the new document as the query and the user subscriptions as the index. Similar to what you describe, I had an AND, OR and NOT field. This really limited the type of queries users could store. It does however work, particularly well on systems with /huge/ amounts of subscriptions (many millions).
>
> Today I would have used something else.
> If you insert one document at a time to your index, take a look at MemoryIndex in contrib. If you insert documents in batches larger than one document at a time, take a look at LUCENE-550 in the Jira. Add new documents to such an index and place the subscribed queries on it. Depending on the queries, the speed should be some 20-100 times faster than using a RAMDirectory. One million queries should take some 20 seconds to assemble and place on a 25-document index on my laptop. See <http://issues.apache.org/jira/secure/attachment/12353601/12353601_HitCollectionBench.jpg> for performance of LUCENE-550.
>
> -- karl
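A rough sketch of karl's MemoryIndex suggestion, for one incoming document at a time; the field name "contents", the use of StandardAnalyzer and the subscriptions map are placeholders, not part of the thread:

import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

public class SubscriptionMatcher {
    /** subscriptions maps a subscriber id (String) to an already-parsed Query. */
    public void match(String incomingText, Map subscriptions) {
        MemoryIndex index = new MemoryIndex();                  // holds just this one document
        index.addField("contents", incomingText, new StandardAnalyzer());
        for (Iterator it = subscriptions.entrySet().iterator(); it.hasNext();) {
            Map.Entry e = (Map.Entry) it.next();
            float score = index.search((Query) e.getValue());   // 0.0 means no match
            if (score > 0.0f) {
                deliver((String) e.getKey());                    // hypothetical delivery hook
            }
        }
    }

    private void deliver(String subscriberId) { /* send the document to the subscriber */ }
}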
Re: How can I index Phrases in Lucene?
This may be of interest: http://issues.apache.org/jira/browse/LUCENE-474

Cheers
Mark

- Original Message From: Ryan McKinley <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 23 March, 2007 3:25:02 AM Subject: Re: How can I index Phrases in Lucene?

Is there any way to find frequent phrases without knowing what you are looking for? I could index "A B C D E" as "A B C", "B C D", "C D E" etc., but that seems kind of clunky, particularly if the phrase length is large. Is there any position offset magic that will surface frequent phrases automatically?

thanks
ryan

On 3/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> Well, you don't index phrases, it's done for you. You should try something like the following.
>
> Create a SpanNearQuery with your terms. Specify an appropriate slop (probably 0, assuming you want them all next to each other).
>
> Now call getSpans() and count ... You may have to do something with overlapping spans, but you'll need to experiment a bit to understand it.
>
> Erick
>
> On 3/22/07, Maryam <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > I know how to index terms in Lucene; now I want to see how I can index phrases like "information retrieval" in Lucene and calculate the number of times that phrase has appeared in the document. Is there any way to do it in Lucene?
> >
> > Thanks
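As a rough illustration of Erick's SpanNearQuery suggestion above, something along these lines counts per-document occurrences of a known phrase; the field name "contents" is assumed, and overlapping spans are not handled:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

// Counts how often the phrase "information retrieval" occurs in each document.
public class PhraseCounter {
    public static void count(IndexReader reader) throws IOException {
        SpanQuery[] words = new SpanQuery[] {
            new SpanTermQuery(new Term("contents", "information")),
            new SpanTermQuery(new Term("contents", "retrieval"))
        };
        SpanNearQuery phrase = new SpanNearQuery(words, 0, true); // slop 0, in order
        Spans spans = phrase.getSpans(reader);
        int doc = -1, freq = 0;
        while (spans.next()) {
            if (spans.doc() != doc) {                            // moved to a new document
                if (doc != -1) System.out.println("doc " + doc + ": " + freq);
                doc = spans.doc();
                freq = 0;
            }
            freq++;                                              // one span = one phrase occurrence
        }
        if (doc != -1) System.out.println("doc " + doc + ": " + freq);
    }
}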
Re: MergeFactor and MaxBufferedDocs value should ...?
"SK R" <[EMAIL PROTECTED]> wrote: > If I set MergeFactor = 100 and MaxBufferedDocs=250 , then first 100 > segments will be merged in RAMDir when 100 docs arrived. At the end of > 350th > doc added to writer , RAMDir have 2 merged segment files + 50 seperate > segment files not merged together and these are flushed to FSDir. > > If wrong, please correct me. > > My doubt is whether we should set MergeFactor & MaxBufferedDocs in > proportional ratio (i.e) MaxBufferedDocs = n*MergeFactor where n = 1,2 > ... > to reduce indexing time and get greater performance or no need to worry > about it's relation? Actually, maxBufferedDocs is how many docs are held in RAM before flushing to a single segment. So with 250, after adding the 250th doc the writer will write the first segment; after adding the 500th doc, it writes the second segment, etc. Then, mergeFactor says how many segments can be written before a merge takes place. A mergeFactor of 10 means after writing 10 such segments from above, they will be merged into a single segment with 2500 docs. After another 2500 docs you'll have 2 such segments. Then once you've added your 25000'th doc, all of the 2500 doc segments will be merged into a single 25000 segment doc, etc. To maximize indexing performance you really want maxBufferedDocs to be as large as you can handle (the bigger you make it, the more RAM is required by the writer). I believe (not certain) larger values of mergeFactor will also improve performance since it defers merging as long as possible. However, the larger you make this, the more segments are allowed to exist in your index, and at some point you will hit file handle limits with your searchers. I don't think these two parameters need to be proportional to one another. I don't think that will affect performance. Another performance boost is to turn off compound file, but, this has a severe cost of requiring far more file handles during searching. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Reverse search
23 mar 2007 kl. 09.57 skrev Melanie Langlois: Well, I though to use the PerFieldAnalyzerWrapper which contains as basic the snowballAnalyzer with English stopwords and use snowballAnalyzer with language specific keywords for the fields which will be in different languages. But I'm seeing that in your MemoryIndexTest you commented the use of SnowballAnalyzer, is it because it's too slow. In this case, I think I could use the StandardAnalyzer... what do you think? I think that creating an index with a couple of documents takes a fraction of the time it will take to place a million queries on that index. There is no real need to optimize something that takes milliseconds when you in the same process do something that takes half a minute. -- karl Mélanie -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Friday, March 23, 2007 12:46 PM To: java-user@lucene.apache.org Subject: Re: Reverse search 23 mar 2007 kl. 03.07 skrev Melanie Langlois: Thanks Karl, the performances graph is really amazing! I have to say that it would not have think this way around would be faster, but sounds nice if I can use this, make everything easier to manage. I'm just wondering what did you consider when you build your graph, only the time to run the queries? Because, I should add the time for creating the index anytime a new document comes in (or a subset of documents if several comes in same time), and the indexing of these documents. The documents should not be big, around 2KB. Did you measure this part ? Adding a document to a MemoryIndex or InstantiatedIndex takes more or less the same time it would take to add it to an empty RAMDirectory. How many clock ticks is spent really depends on what analysers you use. -- karl Mélanie -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Friday, March 23, 2007 10:35 AM To: java-user@lucene.apache.org Subject: Re: Reverse search 23 mar 2007 kl. 02.12 skrev Melanie Langlois: I want to manage user subscriptions to specific documents. So I would like to store the subscription (query) into the lucene directory, and whenever I receive a new document, I will search all the matching subscriptions to send the documents to all subcribers. For instance if a user subscribes to all documents with text containing (WORD1 and WORD2) or WORD3, how can I match the incoming document based on stored subscriptions? I was thinking to have two subfields for each field of the subscription: the AND conditions and the OR conditions. -OR. I will tokenized the document field content and insert OR between each of them, and run the query against OR condition of subscription -It's for the AND that I will have an issue, because if the incoming text may contains more words than the sequence I want to search. For instance, if I subscribe for documents contents lucene and java for instance , if the incoming document contents is lucene is a great API which has been developed in java, once I removed stopwords my query would look like lucene and great and API and developed and java. As query is composed of more words than the stored subscription I will fail to retrieve the subscription. But if I put only or words, the results will not be accurate, as I can obtain subscription only for java for instance. I wrote such a thing way back, where I used the new document as the query and the user subscriptions as the index. Similar to what you describe, I had an AND, OR and NOT field. This really limited the type of queries users could store. 
It does however work, particularly well on systems with /huge/ amounts of subscriptions (many millions). Today I would have used something else. If you insert one document at a time to your index, take a look at MemoryIndex in contrib. If you insert documents in batches larger than one document at a time, take a look at LUCENE-550 in the Jira. Add new documents to such an index and place the subscribed queries on it. Depending on the queries, the speed should be some 20-100 times faster than using a RAMDirectory. One million queries should take some 20 seconds to assemble and place on a 25-document index on my laptop. See <http://issues.apache.org/jira/secure/attachment/12353601/12353601_HitCollectionBench.jpg> for performance of LUCENE-550.

-- karl
Re: MergeFactor and MaxBufferedDocs value should ...?
Please clarify the following.

1. When will the segments in RAMDirectory be moved (flushed) into FSDirectory?
2. Segment creation by maxBufferedDocs happens in RAMDir. Where does the merge by MergeFactor happen -- in RAMDir or in FSDir?

Thanks in advance
RSK

On 3/23/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

"SK R" <[EMAIL PROTECTED]> wrote:
> If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the first 100 segments will be merged in RAMDir when 100 docs have arrived. At the end of the 350th doc added to the writer, RAMDir will have 2 merged segment files + 50 separate segment files not merged together, and these are flushed to FSDir.
>
> If wrong, please correct me.
>
> My doubt is whether we should set MergeFactor & MaxBufferedDocs in a proportional ratio (i.e. MaxBufferedDocs = n*MergeFactor where n = 1, 2, ...) to reduce indexing time and get greater performance, or whether there is no need to worry about their relation?

Actually, maxBufferedDocs is how many docs are held in RAM before flushing to a single segment. So with 250, after adding the 250th doc the writer will write the first segment; after adding the 500th doc, it writes the second segment, etc.

Then, mergeFactor says how many segments can be written before a merge takes place. A mergeFactor of 10 means that after writing 10 such segments from above, they will be merged into a single segment with 2500 docs. After another 2500 docs you'll have 2 such segments. Then once you've added your 25000th doc, all of the 2500-doc segments will be merged into a single 25000-doc segment, etc.

To maximize indexing performance you really want maxBufferedDocs to be as large as you can handle (the bigger you make it, the more RAM is required by the writer).

I believe (not certain) larger values of mergeFactor will also improve performance, since it defers merging as long as possible. However, the larger you make this, the more segments are allowed to exist in your index, and at some point you will hit file handle limits with your searchers.

I don't think these two parameters need to be proportional to one another. I don't think that will affect performance.

Another performance boost is to turn off compound file, but this has a severe cost of requiring far more file handles during searching.

Mike
Re: MergeFactor and MaxBufferedDocs value should ...?
I haven't used it yet, but I've seen several references to IndexWriter.ramSizeInBytes() and using it to control when the writer flushes the RAM. This seems like a more deterministic way of making things efficient than trying various combinations of maxBufferedDocs , MergeFactor, etc, all of which are guesses at best. I'd be really curious if it works for you... Erick On 3/23/07, SK R <[EMAIL PROTECTED]> wrote: Please clarify the following. 1.When will be the segments in RAMDirectory moved (flushed) in to FSDirectory? 2.Segments creation by maxBufferedDocs occur in RAMDir. Where merge by MergeFactor happen? whether in RAMDir or FSDir? Thanks in Advance RSK On 3/23/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > "SK R" <[EMAIL PROTECTED]> wrote: > > If I set MergeFactor = 100 and MaxBufferedDocs=250 , then first 100 > > segments will be merged in RAMDir when 100 docs arrived. At the end of > > 350th > > doc added to writer , RAMDir have 2 merged segment files + 50 seperate > > segment files not merged together and these are flushed to FSDir. > > > > If wrong, please correct me. > > > > My doubt is whether we should set MergeFactor & MaxBufferedDocs in > > proportional ratio (i.e) MaxBufferedDocs = n*MergeFactor where n = 1,2 > > ... > > to reduce indexing time and get greater performance or no need to worry > > about it's relation? > > Actually, maxBufferedDocs is how many docs are held in RAM before > flushing to a single segment. So with 250, after adding the 250th doc > the writer will write the first segment; after adding the 500th doc, > it writes the second segment, etc. > > Then, mergeFactor says how many segments can be written before a merge > takes place. A mergeFactor of 10 means after writing 10 such > segments from above, they will be merged into a single segment with > 2500 docs. After another 2500 docs you'll have 2 such segments. Then > once you've added your 25000'th doc, all of the 2500 doc segments will > be merged into a single 25000 segment doc, etc. > > To maximize indexing performance you really want maxBufferedDocs to be > as large as you can handle (the bigger you make it, the more RAM is > required by the writer). > > I believe (not certain) larger values of mergeFactor will also improve > performance since it defers merging as long as possible. However, the > larger you make this, the more segments are allowed to exist in your > index, and at some point you will hit file handle limits with your > searchers. > > I don't think these two parameters need to be proportional to one > another. I don't think that will affect performance. > > Another performance boost is to turn off compound file, but, this has > a severe cost of requiring far more file handles during searching. > > Mike > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
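Erick's ramSizeInBytes() idea would look roughly like this in practice. This is only a sketch: the 32 MB threshold and the path are arbitrary, and closing and reopening the writer is used here as a version-safe way to force the buffered docs to disk, since older IndexWriter versions expose no public flush method:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Flush by measured RAM usage: keep maxBufferedDocs high enough that it never
// triggers on its own, and force a flush whenever buffered RAM crosses a threshold.
public class RamFlushIndexer {
    private static final long RAM_LIMIT = 32 * 1024 * 1024;   // arbitrary 32 MB

    public static void index(String path, Document[] docs) throws Exception {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        writer.setMaxBufferedDocs(1000);                       // safety net only
        for (int i = 0; i < docs.length; i++) {
            writer.addDocument(docs[i]);
            if (writer.ramSizeInBytes() > RAM_LIMIT) {
                writer.close();                                // pushes the buffered docs to disk
                writer = new IndexWriter(path, new StandardAnalyzer(), false);
                writer.setMaxBufferedDocs(1000);
            }
        }
        writer.close();
    }
}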
Re: Reverse search
Bear in mind that the million queries you run on the MemoryIndex can be shortlisted if you place those queries in a RAMIndex and use the source document's terms to "query the queries". The list of unique terms for your document is readily available in the MemoryIndex's TermEnum. You can take this list and find "likely related queries" to execute from your Query index. Note that for phrase queries or other forms of query with multiple mandatory terms you should only index one of the terms (preferably the rarest) to ensure that your query is not needlessly executed. For example - using this approach I need only run the phrase query for "XYZ limited" whenever I encounter a document with the rare term "XYZ" in it, rather than the much more commonplace "limited". Cheers Mark - Original Message From: karl wettin <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 23 March, 2007 12:54:36 PM Subject: Re: Reverse search 23 mar 2007 kl. 09.57 skrev Melanie Langlois: > Well, I though to use the PerFieldAnalyzerWrapper which contains as > basic the snowballAnalyzer with English stopwords and use > snowballAnalyzer with language specific keywords for the fields > which will be in different languages. But I'm seeing that in your > MemoryIndexTest you commented the use of SnowballAnalyzer, is it > because it's too slow. In this case, I think I could use the > StandardAnalyzer... what do you think? I think that creating an index with a couple of documents takes a fraction of the time it will take to place a million queries on that index. There is no real need to optimize something that takes milliseconds when you in the same process do something that takes half a minute. -- karl > > Mélanie > > -Original Message- > From: karl wettin [mailto:[EMAIL PROTECTED] > Sent: Friday, March 23, 2007 12:46 PM > To: java-user@lucene.apache.org > Subject: Re: Reverse search > > > 23 mar 2007 kl. 03.07 skrev Melanie Langlois: > >> Thanks Karl, the performances graph is really amazing! >> I have to say that it would not have think this way around would be >> faster, but sounds nice if I can use this, make everything easier >> to manage. I'm just wondering what did you consider when you build >> your graph, only the time to run the queries? Because, I should add >> the time for creating the index anytime a new document comes in (or >> a subset of documents if several comes in same time), and the >> indexing of these documents. The documents should not be big, >> around 2KB. Did you measure this part ? > > Adding a document to a MemoryIndex or InstantiatedIndex takes more or > less the same time it would take to add it to an empty RAMDirectory. > How many clock ticks is spent really depends on what analysers you > use. > > -- > karl > >> >> Mélanie >> >> -Original Message- >> From: karl wettin [mailto:[EMAIL PROTECTED] >> Sent: Friday, March 23, 2007 10:35 AM >> To: java-user@lucene.apache.org >> Subject: Re: Reverse search >> >> >> 23 mar 2007 kl. 02.12 skrev Melanie Langlois: >> >>> I want to manage user subscriptions to specific documents. So I >>> would like to store the subscription (query) into the lucene >>> directory, and whenever I receive a new document, I will search all >>> the matching subscriptions to send the documents to all subcribers. >>> For instance if a user subscribes to all documents with text >>> containing (WORD1 and WORD2) or WORD3, how can I match the incoming >>> document based on stored subscriptions? 
I was thinking to have two >>> subfields for each field of the subscription: the AND conditions >>> and the OR conditions. >>> >>> -OR. I will tokenized the document field content and insert OR >>> between each of them, and run the query against OR condition of >>> subscription >>> >>> -It's for the AND that I will have an issue, because if the >>> incoming text may contains more words than the sequence I want to >>> search. >>> >>> For instance, if I subscribe for documents contents lucene and java >>> for instance , if the incoming document contents is lucene is a >>> great API which has been developed in java, once I removed >>> stopwords my query would look like lucene and great and API and >>> developed and java. >>> >>> As query is composed of more words than the stored subscription I >>> will fail to retrieve the subscription. But if I put only or words, >>> the results will not be accurate, as I can obtain subscription only >>> for java for instance. >>> >> >> I wrote such a thing way back, where I used the new document as the >> query and the user subscriptions as the index. Similar to what you >> describe, I had an AND, OR and NOT field. This really limited the >> type of queries users could store. It does however work, particullary >> well on systems with /huge/ amounts of subscriptions (many millions). >> >> Today I would have used something else. If you insert one document at >> the time to your index, take a look at Memo
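A rough sketch of Mark's "query the queries" shortlisting above. It assumes the stored queries live in their own (RAM-based) index, with each query's rarest mandatory term in a "keyterm" field and the subscription id in a stored "id" field; all of those names are placeholders, not an API the thread defines:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;

public class QueryShortlister {
    /** Returns the ids of stored queries worth running against this document. */
    public List candidateIds(MemoryIndex docIndex, Searcher querySearcher) throws IOException {
        BooleanQuery.setMaxClauseCount(10000);   // a document can easily exceed the default 1024 clauses
        BooleanQuery probe = new BooleanQuery();
        IndexReader docReader = docIndex.createSearcher().getIndexReader();
        TermEnum terms = docReader.terms();
        while (terms.next()) {
            // any stored query keyed on a term the document actually contains is a candidate
            probe.add(new TermQuery(new Term("keyterm", terms.term().text())),
                      BooleanClause.Occur.SHOULD);
        }
        terms.close();
        Hits hits = querySearcher.search(probe);
        List ids = new ArrayList();
        for (int i = 0; i < hits.length(); i++) {
            ids.add(hits.doc(i).get("id"));      // only these queries need to be executed
        }
        return ids;
    }
}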
Re: MergeFactor and MaxBufferedDocs value should ...?
"SK R" <[EMAIL PROTECTED]> wrote: > 1.When will be the segments in RAMDirectory moved (flushed) in to > FSDirectory? This is maxBufferedDocs. Right now, every added doc creates its own segment in the RAMDir. After maxBufferedDocs, all of these single documents are merged and flushed to a single segment in FSDir. This is actually not really a very efficient way for IndexWriter to use RAM. I'm working on improving this / speeding it up under this Jira issue: http://issues.apache.org/jira/browse/LUCENE-843 But it will be some time before this is stable & released! > 2.Segments creation by maxBufferedDocs occur in RAMDir. Actually, no. The segments created due to maxBufferedDocs are in FSDir. > Where merge by MergeFactor happen? whether in RAMDir or FSDir? This is always in FSDir. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: MergeFactor and MaxBufferedDocs value should ...?
"Erick Erickson" <[EMAIL PROTECTED]> wrote: > I haven't used it yet, but I've seen several references to > IndexWriter.ramSizeInBytes() and using it to control when the writer > flushes the RAM. This seems like a more deterministic way of > making things efficient than trying various combinations of > maxBufferedDocs , MergeFactor, etc, all of which are guesses > at best. I agree this is the most efficient way to flush. The one caveat is this Jira issue: http://issues.apache.org/jira/browse/LUCENE-845 which can cause over-merging if you make maxBufferedDocs too large. I think the rule of thumb to avoid this issue is 1) set maxBufferedDocs to be no more than 10X the "typical" number of docs you will flush, and then 2) flush by RAM usage. So for example if when you flush by RAM you typically flush "around" 200-300 docs, then setting maxBufferedDocs to eg 1000 is good since it's far above 200-300 (so it won't trigger a flush when you didn't want it to) but it's also well below 10X your range of docs (so it won't tickle the above bug). Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Lazy Field Loading in IndexSearcher
Hi,

I am trying to make use of the new lazy field loading in Lucene 2.1. I store the original bytes of a document, say a PDF file for example, in a special untokenized field in the index. Though there are enough facilities in the IndexReader class for lazy field loading, the search API in IndexSearcher does not seem to contain such facilities. Hence, the Documents I get from Hits.doc() would not benefit from the mentioned feature. Am I missing an important point, or is this a desired feature to go on the todo list?

--Jafarim
index word files ( doc )
Hello, I am planning to index Word 2003 files. I read I have to use Jakarta Apache POI, but I also read on the POI site that their work with doc's is in an early stage. Is POI advisable? Or are there better alternatives? Please give some advice. Regards, Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: index word files ( doc )
Hi,

My experience has not been very satisfactory. It breaks very easily on many files.

On 3/23/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Hello, I am planning to index Word 2003 files. I read I have to use Jakarta Apache POI, but I also read on the POI site that their work with doc's is in an early stage. Is POI advisable? Or are there better alternatives? Please give some advice.

Regards,
Erik
Re: Lazy Field Loading in IndexSearcher
please read the answer i gave you the last time you asked this question... http://www.nabble.com/Re%3A-Lazy-field-loading-in-p9604064.html : Hi : I am seeking for making use of the latest lazy field loading in lucene 2.1. : I store the orignal bytes of a document, say a PDF file for example, in a : special untokenized field in the index. Though there is enough facilities in : IndexReader class for lazy field loading, the search API in IndexSearcher : does not contain such facilities (seemingly). Hence, the Documents I get : from the Hits.doc() would not benefit from the mentioned feature. : Am I missing an important point or this is a desired feature to go on the : todo list? : --Jafarim -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lazy Field Loading in IndexSearcher
Sorry if the question is trivial but why not a Hits.doc(int,FieldSelector) method? On 3/23/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: please read the answer i gave you the last time you asked this question... http://www.nabble.com/Re%3A-Lazy-field-loading-in-p9604064.html : Hi : I am seeking for making use of the latest lazy field loading in lucene 2.1. : I store the orignal bytes of a document, say a PDF file for example, in a : special untokenized field in the index. Though there is enough facilities in : IndexReader class for lazy field loading, the search API in IndexSearcher : does not contain such facilities (seemingly). Hence, the Documents I get : from the Hits.doc() would not benefit from the mentioned feature. : Am I missing an important point or this is a desired feature to go on the : todo list? : --Jafarim -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lazy Field Loading in IndexSearcher
: Sorry if the question is trivial, but why not a Hits.doc(int, FieldSelector)
: method?

As I said before...

>> Lazy loading stored fields is really about performance tweaking ... if
>> you are that concerned about performance, you shouldn't be using Hits at
>> all.

...there is a lot of info in the archives about why Hits is not what you should be using if you are trying to tweak for speed.

-Hoss
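For anyone following the thread, the Hits-free version Hoss is pointing at looks roughly like this in Lucene 2.1; the field names "title" and "pdfBytes" are made up for the example:

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.SetBasedFieldSelector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// Collect doc ids with TopDocs, then load documents through IndexReader with a
// FieldSelector so the large stored field (e.g. PDF bytes) stays lazy.
public class LazySearch {
    public static void run(IndexSearcher searcher, Query query) throws Exception {
        Set eager = new HashSet();
        eager.add("title");                               // loaded immediately
        Set lazy = Collections.singleton("pdfBytes");     // loaded only when asked for
        FieldSelector selector = new SetBasedFieldSelector(eager, lazy);

        TopDocs top = searcher.search(query, null, 10);
        IndexReader reader = searcher.getIndexReader();
        for (int i = 0; i < top.scoreDocs.length; i++) {
            Document doc = reader.document(top.scoreDocs[i].doc, selector);
            System.out.println(doc.get("title"));
            // doc.getFieldable("pdfBytes") would pull the bytes only on demand
        }
    }
}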
Search Design Question
Hello All,

We allow our users to search through our index with a simple text field. The search uses "content" as its default field. This allows them to search quickly through content, but when they type "to:blah AND from:foo AND content:boogie" it knows to parse it, etc.

What I want to do is expand it so that when they type a phrase in the text field it will search all fields at once, yet still be smart enough to recognize a Lucene query. For example, say we have these fields:

to
from
content
subject

When I type "michael contract negotiation" it should look through all these fields and return hits. Then it should be able to recognize more advanced searches like:

to:michael AND content:foo

and not go through all fields. Am I making sense? Is this a good way to provide search? How would I do this?

Thanks,
Michael
Re: Search Design Question
I don't believe there's anything built into Lucene that helps you out here, because you're really saying "do special things for my problem space in these situations". So about the only thing you can do that I know of is to construct the query yourself by making a series of additions to a BooleanQuery based on your particular problem space. Which you have to do anyway, because you want to turn a simple michael query into

to:michael from:michael content:michael subject:michael (note the default OR)

So you'll have to do something like:

if (term has colon) {
    just add a boolean clause for that term in that field
} else {
    add all four clauses
}

However, be warned: this is not a trivial task if you want to support arbitrary grouping, implement precedence, etc. Search the list for "bad query bug" for an explanation, or go to the FAQ and look at something like "why don't I get what I expect from queries". In fact, you really need to look at and understand that FAQ entry before you let Lucene loose on queries with AND, OR and NOT in them. It'll be well worth your time.

One final note: it may be much easier for you to throw all the fields into a single uber-field and search that rather than implement all four separate clauses, but it's a trade-off between simplicity and size.

Best
Erick

On 3/23/07, Michael J. Prichard <[EMAIL PROTECTED]> wrote:

Hello All,

We allow our users to search through our index with a simple text field. The search uses "content" as its default field. This allows them to search quickly through content, but when they type "to:blah AND from:foo AND content:boogie" it knows to parse it, etc.

What I want to do is expand it so that when they type a phrase in the text field it will search all fields at once, yet still be smart enough to recognize a Lucene query. For example, say we have these fields: to, from, content, subject.

When I type "michael contract negotiation" it should look through all these fields and return hits. Then it should be able to recognize more advanced searches like to:michael AND content:foo and not go through all fields.

Am I making sense? Is this a good way to provide search? How would I do this?

Thanks,
Michael
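A rough sketch of the per-field expansion Erick describes; the colon test is deliberately naive, and the field list and default field are assumptions taken from the question:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class SmartQueryBuilder {
    private static final String[] FIELDS = {"to", "from", "content", "subject"};

    public static Query build(String input) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        if (input.indexOf(':') >= 0) {
            // user typed an explicit field, so hand the whole thing to the query parser
            return new QueryParser("content", analyzer).parse(input);
        }
        BooleanQuery combined = new BooleanQuery();
        for (int i = 0; i < FIELDS.length; i++) {
            Query perField = new QueryParser(FIELDS[i], analyzer).parse(input);
            combined.add(perField, BooleanClause.Occur.SHOULD);   // default OR across fields
        }
        return combined;
    }
}

The uber-field alternative Erick mentions would replace all of this with a single QueryParser over the combined field.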
RE: index word files ( doc )
Thank you,

Are there other solutions?

From: jafarim [mailto:[EMAIL PROTECTED]
Sent: Fri 23-3-2007 18:55
To: java-user@lucene.apache.org
Subject: Re: index word files ( doc )

Hi, my experience has not been very satisfactory. It breaks very easily on many files.

On 3/23/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> I am planning to index Word 2003 files. I read I have to use Jakarta Apache POI, but I also read on the POI site that their work with doc's is in an early stage.
>
> Is POI advisable? Or are there better alternatives? Please give some advice.
>
> Regards,
>
> Erik
Re: index word files ( doc )
I think the code from Lucene in Action has examples that use POI and the Textmining.org API. Check manning.com/hatcher2 for the code.

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

- Original Message From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, March 23, 2007 5:03:32 PM Subject: RE: index word files ( doc )

Thank you,

Are there other solutions?

From: jafarim [mailto:[EMAIL PROTECTED]
Sent: Fri 23-3-2007 18:55
To: java-user@lucene.apache.org
Subject: Re: index word files ( doc )

Hi, my experience has not been very satisfactory. It breaks very easily on many files.

On 3/23/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> I am planning to index Word 2003 files. I read I have to use Jakarta Apache POI, but I also read on the POI site that their work with doc's is in an early stage.
>
> Is POI advisable? Or are there better alternatives? Please give some advice.
>
> Regards,
>
> Erik
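For what it's worth, raw text extraction with POI's HWPF scratchpad component looks roughly like the sketch below. Whether HWPFDocument copes with a given .doc file is exactly the open question in this thread, and class availability depends on your POI version, so treat this as an illustration only:

import java.io.FileInputStream;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.poi.hwpf.HWPFDocument;

// Extract the text of a Word .doc with POI's HWPF and wrap it in a Lucene Document.
public class WordIndexer {
    public static Document toDocument(String path) throws IOException {
        FileInputStream in = new FileInputStream(path);
        try {
            HWPFDocument word = new HWPFDocument(in);
            String text = word.getRange().text();   // plain text of the whole document
            Document doc = new Document();
            doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
            doc.add(new Field("path", path, Field.Store.YES, Field.Index.UN_TOKENIZED));
            return doc;
        } finally {
            in.close();
        }
    }
}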
Re: Search Design Question
: One final note, it may be much easier for you to throw all the
: fields into a single uber-field and search that rather than implement
: all four separate clauses, but it's a trade-off between simplicity and
: size.

This would be a very simple way to get the behavior you describe straight from the Lucene QueryParser ... I would certainly recommend that approach.

-Hoss
Re: index word files ( doc )
There is www.textmining.org, but the site is no longer accessible. Check Nutch, which has a Word parser - it seems to be the original textmining.org Word6+POI parser. Pre-Word6 and "fast-saved" files will not work; I've not found a solution for those.

Antony

[EMAIL PROTECTED] wrote:
> Thank you,
> Are there other solutions?
Re: index word files ( doc )
Antony Bowesman wrote:
>> Are there other solutions?

There's also antiword [1], which can convert your .doc to plain text or PS; not sure how good it is.

--
Sami Siren

[1] http://www.winfield.demon.nl/