Re: Return the sentence number in the indexed files
Thanks, Grant, for the answer.

As for indexing each sentence as a separate document: I already did this and it works fine. I indexed more than 93,000 sentences (documents) in approximately 11 minutes. I thought the other option might be more efficient.

Farag

Grant Ingersoll-6 wrote:
>
> On Jul 19, 2008, at 6:00 AM, starz10de wrote:
>
>> Hi All,
>>
>> I have text files that contain several sentences; there is a space
>> between each sentence.
>> When searching the index, I get the path for the documents that
>> match the query:
>>
>> String path = doc.get("path");
>>
>> Is it possible to get the number of the sentence that matches the query
>> inside the matched documents?
>
> Not without some extra work. This kind of thing requires post (or
> pre) processing. You can use SpanQuery to know where in a document
> you matched, and then do the sentence calculations. Another option is
> to index each sentence as a separate document and then post process to
> combine.
>
> If you search the archives on this list and java-dev you'll see
> several discussions on the topic. See:
> http://lucene.markmail.org/message/we25gm32p6qot32c?q=sentence+detection
> and
> http://lucene.markmail.org/message/uq6ffx3oqsulgxys?q=sentence
>
> HTH,
> Grant
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ

--
View this message in context: http://www.nabble.com/Return-the-sentence-number-in-the-indexed-files-tp18543061p18553514.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
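The "sentence calculations" step Grant mentions could look roughly like the sketch below. It assumes the match has already been mapped to a character offset in the original text (that mapping is not shown), and it assumes a sentence ends at '.', '!' or '?' followed by whitespace. `SentenceLocator` and `sentenceNumberAt` are hypothetical names for illustration, not part of Lucene.

```java
// Hypothetical helper, not Lucene API: maps a character offset to a sentence number.
final class SentenceLocator {
    private SentenceLocator() {}

    /**
     * Returns the 1-based number of the sentence containing charOffset,
     * or -1 if the offset lies outside the text.
     * Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
     */
    static int sentenceNumberAt(String text, int charOffset) {
        if (charOffset < 0 || charOffset >= text.length()) {
            return -1;
        }
        int sentence = 1;
        // Count the sentence terminators that occur before the match offset.
        for (int i = 0; i < charOffset; i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == '!' || c == '?';
            if (terminator && i + 1 < text.length()
                    && Character.isWhitespace(text.charAt(i + 1))) {
                sentence++;
            }
        }
        return sentence;
    }
}
```

With one document per sentence, as Farag did, none of this is needed: the sentence number can simply be stored as a field at index time.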
Re: Boolean expression for no terms OR matching a wildcard
A query solution is preferable, but I can programmatically filter my results after the fact. It just seems like something that the Lucene team should consider adding. I think it would only have value for wildcard queries, but nonetheless it would have some value.

-Ron

On Jul 18, 2008, at 6:24 PM, eks dev wrote:

Write an Analyzer that detects your condition "ALL match something", if that is possible at all, e.g.

"800123456 80034543534 80023423423" -> 800

Then you put it in an ALL_MATCH field and match this condition against it. If this prefix needs to be variable, you could extract all matching prefixes to this field and make your query work like "ALL_MATCH:800" and not care about the rest :)

Then you would not need field1 at all for these queries.

Were you looking for something like this, or do you need a "query solution"?

----- Original Message -----
From: Chris Hostetter <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, 19 July, 2008 12:00:39 AM
Subject: Re: Boolean expression for no terms OR matching a wildcard

: Maybe this is easier ... suppose what I'm indexing is a phone number, and
: there are multiple phone numbers for what I'm indexing under the same field
: (phone) and I want the wildcard query to match only records that have either
: no phone numbers at all OR where ALL phone numbers are in a specific area code
: (e.g. 800* would match all in the 800 area code).

I can't think of any way to accomplish the second part of your query. Specifically, given the following records...

Doc1: field1:AAA, field1:Aaa, field1:Bb, field1:C, field2:X, field3:Y
Doc2: field1:AAA, field1:Aaa, field1:Aa, field2:Z

...I can't think of any type of query like field1:A* which would match Doc2 but not Doc1 (because Doc1 has other field1 values that do not start with 'A').

-Hoss
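Both workarounds from this thread can be sketched in plain Java: Ron's after-the-fact filter, and the value eks dev suggests computing at index time for an ALL_MATCH field. This is only an illustration under assumptions. `AllMatch`, `noneOrAllMatch` and `commonPrefix` are hypothetical names, and the code works on plain string lists rather than on Lucene documents.

```java
import java.util.List;

// Hypothetical helpers, not Lucene API.
final class AllMatch {
    private AllMatch() {}

    /** Post-filter: true when there are no values at all, or every value starts with prefix. */
    static boolean noneOrAllMatch(List<String> values, String prefix) {
        for (String value : values) {
            if (!value.startsWith(prefix)) {
                return false;
            }
        }
        return true; // also reached for an empty list ("no phone numbers at all")
    }

    /** Index-time helper: longest prefix shared by all values, or "" if there is none. */
    static String commonPrefix(List<String> values) {
        if (values.isEmpty()) {
            return "";
        }
        String prefix = values.get(0);
        for (String value : values) {
            int limit = Math.min(prefix.length(), value.length());
            int i = 0;
            while (i < limit && prefix.charAt(i) == value.charAt(i)) {
                i++;
            }
            prefix = prefix.substring(0, i);
        }
        return prefix;
    }
}
```

For the three phone numbers in eks dev's example, `commonPrefix` yields "800". For Hoss's records, `noneOrAllMatch` with prefix "A" accepts the field1 values of Doc2 and rejects those of Doc1.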
Re: Does it make sense to cache IndexReader?
With very small indexes and no sort fields (e.g. you just use relevance), loading an IndexReader does not take very long. I think it always makes sense to cache it and reuse it, though: unless the index has changed, there is no reason to pay the price of opening a new IndexReader. As your index grows, and especially if you sort on fields, the price of 'warming' a new IndexReader can get quite high.

However, if memory is an issue, you might choose not to cache, so that the IndexReaders for all of your indexes are not guaranteed to be in memory at the same time. This is not a great solution to your memory problem, though: if all indices are ever searched at the same time, you will need to be able to accommodate that many IndexReaders in RAM anyway.

- Mark

Mohsen Saboorian wrote:

Hi,

I have a set of indices in different languages (very small indices: on average each index directory has 10,000 documents, with an overall size of less than 2 MB). I want to know whether it is a good idea to cache an IndexReader (once opened) somewhere and reuse it later. My application is single-threaded, but I have memory concerns, since the indexing and search are done entirely on client machines.

Does it make sense to have all IndexReaders cached (for example, if opening an index reader takes some time)? Does Lucene load some part of the index into memory as soon as a call to IndexReader.open() is made?

Mohsen.
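The caching policy Mark describes (reuse the reader unless the index has changed) might be sketched as below. This is a generic illustration, not Lucene code: `ReaderCache`, the `opener` function and the `versionOf` function are hypothetical stand-ins for however the application opens a reader and detects that an index has changed. It is not synchronized, which matches Mohsen's single-threaded case only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.ToLongFunction;

// Hypothetical cache: R stands in for the reader type, keyed by index path.
final class ReaderCache<R> {
    private static final class Entry<R> {
        final R reader;
        final long version;

        Entry(R reader, long version) {
            this.reader = reader;
            this.version = version;
        }
    }

    private final Map<String, Entry<R>> entries = new HashMap<>();
    private final Function<String, R> opener;         // assumed: opens a reader for a path
    private final ToLongFunction<String> versionOf;   // assumed: changes when the index changes

    ReaderCache(Function<String, R> opener, ToLongFunction<String> versionOf) {
        this.opener = opener;
        this.versionOf = versionOf;
    }

    /** Returns the cached reader, opening a new one only if the index version changed. */
    R get(String indexPath) {
        long current = versionOf.applyAsLong(indexPath);
        Entry<R> cached = entries.get(indexPath);
        if (cached == null || cached.version != current) {
            // A real implementation would also close the stale reader here.
            cached = new Entry<>(opener.apply(indexPath), current);
            entries.put(indexPath, cached);
        }
        return cached.reader;
    }

    /** Drops a reader to free memory, e.g. for an index that is rarely searched. */
    void evict(String indexPath) {
        entries.remove(indexPath);
    }
}
```

The `evict` method reflects the memory trade-off in the reply: dropping rarely used readers saves RAM, at the cost of reopening them later.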
Re: How to avoid duplicate records in lucene
Sebastin wrote:

Hi All,

Is there any possibility to avoid duplicate records in Lucene 2.3.1?

I don't believe that there is a very high-performance way to do this. You are basically going to have to query the index for an id before adding a new doc. The best way I can think of off the top of my head is to batch: first check that the ids in the batch are unique, then check all ids in the batch against the IndexReader, then add the ones that are not dupes. Of course, all of your docs would have to be added through this single choke point, so that you know no other thread has added that id after the first thread looked but before it added the doc.

I think Mark H has you covered if getting the dupes out afterwards is okay.

- Mark
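The batching steps Mark outlines can be sketched as follows. This is a minimal illustration under assumptions: `DedupingAdder` is a hypothetical name, and an in-memory `Set` stands in for the lookup against the IndexReader and for the actual add to the index. The `synchronized` method plays the role of the single choke point.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical choke point for adds; the Set is a stand-in for the real index.
final class DedupingAdder {
    private final Set<String> indexedIds = new HashSet<>();

    /**
     * Adds the ids that are not duplicates and returns them in batch order.
     * synchronized, so no other thread can add an id between the check and the add.
     */
    synchronized List<String> addBatch(List<String> batchIds) {
        // Step 1: make the ids within the batch unique, keeping first-seen order.
        Set<String> uniqueInBatch = new LinkedHashSet<>(batchIds);

        // Step 2: check each id against what is already indexed.
        List<String> toAdd = new ArrayList<>();
        for (String id : uniqueInBatch) {
            if (!indexedIds.contains(id)) {
                toAdd.add(id);
            }
        }

        // Step 3: add only the ids that are not dupes.
        indexedIds.addAll(toAdd);
        return toAdd;
    }
}
```

In a real application, step 2 would query the index for each id and step 3 would add the corresponding documents; both are assumed here rather than shown.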