Re: Return the sentence number in the indexed files

2008-07-20 Thread starz10de

Thanks, Grant, for the answer.

I already index each sentence as a separate document, and it works fine: I
indexed more than 93,000 sentences (documents) in approximately 11 minutes.
I thought the other option might be more efficient.
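
For reference, the sentence-per-document approach can be sketched roughly as follows. This is only an illustration in plain Java with no Lucene calls: SentenceRecord and SentenceSplitter are hypothetical names, and it assumes each sentence sits on its own line with blank lines in between. Each record would then become one Lucene document, so a hit can report both the path and the sentence number.

```java
import java.util.ArrayList;
import java.util.List;

/** One sentence of a source file (hypothetical helper type, not a Lucene class). */
final class SentenceRecord {
    final String path;        // file the sentence came from
    final int sentenceNumber; // 1-based position of the sentence in that file
    final String text;

    SentenceRecord(String path, int sentenceNumber, String text) {
        this.path = path;
        this.sentenceNumber = sentenceNumber;
        this.text = text;
    }
}

final class SentenceSplitter {
    /**
     * Splits file content into numbered sentences.
     * Assumption: one sentence per line, separated by blank lines.
     */
    static List<SentenceRecord> split(String path, String content) {
        List<SentenceRecord> records = new ArrayList<>();
        int number = 0;
        for (String line : content.split("\\R")) {
            String sentence = line.trim();
            if (sentence.isEmpty()) {
                continue; // skip the blank separator lines
            }
            number++;
            records.add(new SentenceRecord(path, number, sentence));
        }
        return records;
    }
}
```

At search time, the sentence number would be read back from the matching document in the same way as the path.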

Farag 

Grant Ingersoll-6 wrote:
> 
> 
> On Jul 19, 2008, at 6:00 AM, starz10de wrote:
> 
>>
>> Hi All,
>>
>> I have text files that each contain several sentences, with a space between
>> each sentence.
>> When searching the index, I get the path for the documents that match the
>> query:
>>
>> String path = doc.get("path");
>>
>>
>> Is it possible to get the number of the sentence that matches the query
>> inside the matched documents?
> 
> Not without some extra work.  This kind of thing requires post (or  
> pre) processing.  You can use SpanQuery to know where in a document  
> you matched, and then do the sentence calculations.  Another option is  
> to index each sentence as a separate document and then post process to  
> combine.
> 
> If you search the archives on this list and java-dev you'll see  
> several discussions on the topic.   See:
> http://lucene.markmail.org/message/we25gm32p6qot32c?q=sentence+detection
> and
> http://lucene.markmail.org/message/uq6ffx3oqsulgxys?q=sentence
> 
> HTH,
> Grant
> 
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
> 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Boolean expression for no terms OR matching a wildcard

2008-07-20 Thread Ronald Rudy
A query solution is preferable, but I can programmatically filter my
results after the fact. It just seems like something that the Lucene
team should consider adding. It would probably only have value for
wildcard queries, but it would have some value nonetheless.
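
A minimal sketch of that after-the-fact filter, in plain Java. It assumes the phone values of each hit have already been read into a list; PhonePrefixFilter is a hypothetical helper, not part of Lucene.

```java
import java.util.List;

final class PhonePrefixFilter {
    /**
     * Accepts a record when it has no phone numbers at all, or when every
     * phone number starts with the given prefix (e.g. "800").
     */
    static boolean accept(List<String> phones, String prefix) {
        for (String phone : phones) {
            if (!phone.startsWith(prefix)) {
                return false; // one number outside the area code rejects the record
            }
        }
        return true; // also reached for an empty list: "no phone numbers at all"
    }
}
```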


-Ron


On Jul 18, 2008, at 6:24 PM, eks dev wrote:

You could write an Analyzer that detects your condition "ALL match
something", if that is possible at all,

e.g. "800123456 80034543534 80023423423" -> 800

Then you put it in an ALL_MATCH field and match this condition against
it. If this prefix needs to be variable, you could extract all
matching prefixes to this field and make your query work like
"ALL_MATCH:800" and not care about the rest :) Then you would not need
field1 at all for these queries.
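
One way to read the "extract all matching prefixes" idea, sketched in plain Java: compute the longest prefix shared by every value and emit each of its leading substrings as a term for the ALL_MATCH field. AllMatchPrefixes is a hypothetical helper, and hooking it into an Analyzer or into document construction is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class AllMatchPrefixes {
    /** Longest prefix shared by every value; empty when there are no values. */
    static String longestCommonPrefix(List<String> values) {
        if (values.isEmpty()) {
            return "";
        }
        String prefix = values.get(0);
        for (String value : values) {
            int max = Math.min(prefix.length(), value.length());
            int i = 0;
            while (i < max && prefix.charAt(i) == value.charAt(i)) {
                i++;
            }
            prefix = prefix.substring(0, i); // shrink to the part still shared
        }
        return prefix;
    }

    /** Every non-empty prefix that ALL values share, shortest first. */
    static List<String> allMatchTerms(List<String> values) {
        String common = longestCommonPrefix(values);
        List<String> terms = new ArrayList<>();
        for (int len = 1; len <= common.length(); len++) {
            terms.add(common.substring(0, len));
        }
        return terms;
    }
}
```

With these terms indexed, an exact term query on the ALL_MATCH field would stand in for the wildcard. Records with no phone numbers produce no terms, so they would still need separate handling.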


Were you looking for something like this, or do you need a "query solution"?



----- Original Message -----

From: Chris Hostetter <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, 19 July, 2008 12:00:39 AM
Subject: Re: Boolean expression for no terms OR matching a wildcard

: Maybe this is easier ... suppose what I'm indexing is a phone number, and
: there are multiple phone numbers for what I'm indexing under the same field
: (phone) and I want the wildcard query to match only records that have either
: no phone numbers at all OR where ALL phone numbers are in a specific area code
: (e.g. 800* would match all in the 800 area code).

I can't think of any way to accomplish the second part of your query.
Specifically, given the following records...

 Doc1: field1:AAA, field1:Aaa, field1:Bb, field1:C, field2:X, field3:Y
 Doc2: field1:AAA, field1:Aaa, field1:Aa, field2:Z

...I can't think of any type of query like field1:A* which would match
Doc2 but not Doc1 (because there are other field1 values that do not start
with 'A').



-Hoss





Re: Does it make sense to cache IndexReader?

2008-07-20 Thread Mark Miller
With very small indexes and no sort fields (e.g. you just use relevance), 
loading an IndexReader does not take very long. I think it does always 
make sense to cache it and reuse it though - unless the index has 
changed, there is no reason to pay the price of opening a new 
IndexReader. As your index grows, and especially if you sort on fields, 
the price of 'warming' a new IndexReader can get quite high.
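
A rough sketch of that reuse-unless-changed idea, in plain Java. ReaderCache and its two callbacks are hypothetical: in a Lucene application the opener would presumably wrap IndexReader.open(), and the version callback would be whatever change check you trust for your index. Neither is shown here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.ToLongFunction;

/** Caches one reader per index path and reopens only when the index changed. */
final class ReaderCache<R> {
    private final Function<String, R> opener;       // opens a reader for a path
    private final ToLongFunction<String> versionOf; // current version of the index at a path
    private final Map<String, R> readers = new HashMap<>();
    private final Map<String, Long> versions = new HashMap<>();

    ReaderCache(Function<String, R> opener, ToLongFunction<String> versionOf) {
        this.opener = opener;
        this.versionOf = versionOf;
    }

    synchronized R get(String path) {
        long current = versionOf.applyAsLong(path);
        Long cached = versions.get(path);
        if (cached == null || cached != current) {
            // Not cached yet, or the index changed: pay the opening cost once.
            // A real implementation would also close the stale reader here.
            readers.put(path, opener.apply(path));
            versions.put(path, current);
        }
        return readers.get(path);
    }
}
```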


However, if memory is an issue, you might choose not to cache, so that your 
IndexReaders for every index are not all guaranteed to be in memory at the 
same time. This is not a great solution to your memory problem, though: if 
all indices are ever searched at the same time, you will need to be able to 
accommodate that many IndexReaders in RAM anyway.


- Mark

Mohsen Saboorian wrote:

Hi,
I have a set of indices in different languages (very small indices: on
average each index directory has 10,000 documents, with an overall size
of less than 2 MB). I want to know whether it is a good idea to cache an
IndexReader (once opened) somewhere and reuse it later. My application is
single-threaded, but I have memory concerns, since the indexing and search
are done entirely on client machines.


Does it make sense to have all IndexReaders cached (for example, if opening
an index reader takes some time)? Does Lucene load some part of the index
into memory as soon as IndexReader.open() is called?

Mohsen.
  






Re: How to avoid duplicate records in lucene

2008-07-20 Thread Mark Miller

Sebastin wrote:

Hi All,

Is there any way to avoid duplicate records in Lucene 2.3.1?
  
I don't believe that there is a very high performance way to do this. 
You are basically going to have to query the index for an id before 
adding a new doc. The best way I can think of off the top of my head is 
to batch - first check that ids in the batch are unique, then check all 
ids in the batch against the IndexReader, then add the ones that are not 
dupes. Of course all of your docs would have to be added through this 
single choke point so that you knew other threads had not added that id 
after the first thread had looked but before it added the doc.
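
The batching steps above could look roughly like this in plain Java. DedupBatcher is a hypothetical class: the existsInIndex check stands in for an id lookup against the IndexReader and addToIndex for the actual document add, neither of which is shown. The synchronized method plays the role of the single choke point.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Predicate;

final class DedupBatcher {
    private final Predicate<String> existsInIndex; // assumed id lookup against the index
    private final Consumer<String> addToIndex;     // assumed "add doc with this id"

    DedupBatcher(Predicate<String> existsInIndex, Consumer<String> addToIndex) {
        this.existsInIndex = existsInIndex;
        this.addToIndex = addToIndex;
    }

    /** Adds only ids that are new; returns the ids that were actually added. */
    synchronized List<String> addBatch(List<String> batchIds) {
        // Step 1: make ids unique within the batch, keeping first occurrences.
        Set<String> unique = new LinkedHashSet<>(batchIds);
        // Step 2: check the remaining ids against the index.
        // Step 3: add the ones that are not dupes, still under the same lock.
        List<String> added = new ArrayList<>();
        for (String id : unique) {
            if (!existsInIndex.test(id)) {
                addToIndex.accept(id);
                added.add(id);
            }
        }
        return added;
    }
}
```

The lookup and the add happen inside one lock, so another thread cannot add the same id between the check and the add, as long as every add goes through this one object.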


I think Mark H has you covered if getting the dupes out afterwards is okay.

- Mark
