Re: Combining score from two or more hits
Chris Hostetter wrote: if you are using a HitCollector, then any re-evaluation is going to happen in your code using whatever mechanism you want -- once your collect method is called on a docid, Lucene is done with that docid and no longer cares about it ... it's only whatever storage you may be maintaining of high scoring docs that needs to know that you've decided the score has changed. your big problem is going to be that you basically need to maintain a list of *every* doc collected, if you don't know what the score of any of them are until you've processed all the rest ... since docs are collected in increasing order of docid, you might be able to make some optimizations based on how big of a gap you've got between the doc you are currently collecting and the last doc you've collected if you know that you're always going to add docs that "relate" to each other in sequential bundles -- but this would be some very custom code depending on your use case.

I only ever need to return a couple of ID fields per doc hit, so I load them with FieldCache when I start a new searcher. These IDs refer to unique objects elsewhere, but there can be one or more instances of the same ID in the index because of the way I've structured Documents: a Document = an attachment in the other system, attached to the other system's object, which can have 1...n attachments.

My problem is that I need to return only unique external IDs, with some kind of combined score, up to the maxHits requested by the client. Getting the unique IDs is no problem, but as you say I either have to store all hits and then sort them by score at the end once I know all the unique docs, or do some clever stuff with some type of PriorityQueue that allows me to re-jig scores that already exist in the sorted queue.

One idea your comments raise is the relationship of docids to the group of Documents added for the higher-level object. All the Documents for the external object are added with a single writer at index time. Assuming that the Documents for a single external ID will either all exist or none, will the doc ids always remain sequential for that external ID, or will they 'reorganise' themselves?

Thanks
Antony
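For illustration, a rough sketch of the "collect everything, combine per external ID, sort at the end" approach being discussed might look like the following (Lucene 2.1-era API). The field name "externalId" and the score-summing policy are assumptions for the example, not necessarily what Antony's index uses:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

// Accumulates one combined score per external ID; sorting happens once collection is done.
public class ExternalIdCollector extends HitCollector {
    private final String[] extIds;               // docid -> external ID, loaded via FieldCache
    private final Map combined = new HashMap();  // external ID -> Float combined score

    public ExternalIdCollector(IndexReader reader) throws IOException {
        // one-off load when a new searcher is opened, as described above
        extIds = FieldCache.DEFAULT.getStrings(reader, "externalId");
    }

    public void collect(int doc, float score) {
        String id = extIds[doc];
        Float old = (Float) combined.get(id);
        combined.put(id, new Float(old == null ? score : old.floatValue() + score));
    }

    /** Call after the search: sorts all unique IDs by combined score and truncates to maxHits. */
    public List topIds(int maxHits) {
        List entries = new ArrayList(combined.entrySet());
        Collections.sort(entries, new Comparator() {
            public int compare(Object a, Object b) {
                return ((Float) ((Map.Entry) b).getValue())
                        .compareTo((Float) ((Map.Entry) a).getValue());
            }
        });
        return entries.subList(0, Math.min(maxHits, entries.size()));
    }
}

Usage would be searcher.search(query, new ExternalIdCollector(reader)) followed by topIds(maxHits); swap the addition in collect() for Math.max() or whatever combination rule fits the ranking you want.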
RE: Reverse search
Well, I thought to use the PerFieldAnalyzerWrapper, which has as its default the SnowballAnalyzer with English stopwords, and use SnowballAnalyzers with language-specific stopwords for the fields that will be in different languages. But I see that in your MemoryIndexTest you commented out the use of SnowballAnalyzer -- is that because it's too slow? In that case, I think I could use the StandardAnalyzer... what do you think?

Mélanie

-Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Friday, March 23, 2007 12:46 PM To: java-user@lucene.apache.org Subject: Re: Reverse search

On 23 mar 2007, at 03.07, Melanie Langlois wrote:

> Thanks Karl, the performance graph is really amazing! I have to say that I would not have thought this way around would be faster, but it sounds nice if I can use this; it makes everything easier to manage. I'm just wondering what you considered when you built your graph -- only the time to run the queries? Because I should add the time for creating the index any time a new document comes in (or a subset of documents if several come in at the same time), and the indexing of these documents. The documents should not be big, around 2KB. Did you measure this part?

Adding a document to a MemoryIndex or InstantiatedIndex takes more or less the same time it would take to add it to an empty RAMDirectory. How many clock ticks are spent really depends on what analysers you use.

-- karl

> Mélanie
>
> -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Friday, March 23, 2007 10:35 AM To: java-user@lucene.apache.org Subject: Re: Reverse search
>
> On 23 mar 2007, at 02.12, Melanie Langlois wrote:
>
>> I want to manage user subscriptions to specific documents. So I would like to store the subscription (query) in the Lucene directory, and whenever I receive a new document, I will search for all the matching subscriptions to send the document to all subscribers. For instance, if a user subscribes to all documents with text containing (WORD1 AND WORD2) OR WORD3, how can I match the incoming document against the stored subscriptions? I was thinking of having two subfields for each field of the subscription: the AND conditions and the OR conditions.
>>
>> - OR: I will tokenize the document field content, insert OR between each of the tokens, and run the query against the OR condition of the subscription.
>>
>> - It's for the AND that I will have an issue, because the incoming text may contain more words than the sequence I want to search.
>>
>> For instance, if I subscribe to documents containing lucene and java, and the incoming document content is "lucene is a great API which has been developed in java", once I remove stopwords my query would look like lucene AND great AND API AND developed AND java.
>>
>> As the query is composed of more words than the stored subscription, I will fail to retrieve the subscription. But if I use only OR'ed words, the results will not be accurate, as I could match a subscription for just java, for instance.
>
> I wrote such a thing way back, where I used the new document as the query and the user subscriptions as the index. Similar to what you describe, I had an AND, OR and NOT field. This really limited the type of queries users could store. It does however work, particularly well on systems with /huge/ amounts of subscriptions (many millions).
>
> Today I would have used something else.
> If you insert one document at a time to your index, take a look at MemoryIndex in contrib. If you insert documents in batches larger than one document at a time, take a look at LUCENE-550 in the Jira. Add new documents to such an index and place the subscribed queries on it. Depending on the queries, the speed should be some 20-100 times faster than using a RAMDirectory. One million queries should take some 20 seconds to assemble and place on a 25-document index on my laptop. See <http://issues.apache.org/jira/secure/attachment/12353601/12353601_HitCollectionBench.jpg> for performance of LUCENE-550.
>
> -- karl
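A rough sketch of karl's MemoryIndex suggestion, for one incoming document at a time; the field name "contents", the use of StandardAnalyzer and the subscriptions map are placeholders, not part of the thread:

import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

public class SubscriptionMatcher {
    /** subscriptions maps a subscriber id (String) to an already-parsed Query. */
    public void match(String incomingText, Map subscriptions) {
        MemoryIndex index = new MemoryIndex();                  // holds just this one document
        index.addField("contents", incomingText, new StandardAnalyzer());
        for (Iterator it = subscriptions.entrySet().iterator(); it.hasNext();) {
            Map.Entry e = (Map.Entry) it.next();
            float score = index.search((Query) e.getValue());   // 0.0 means no match
            if (score > 0.0f) {
                deliver((String) e.getKey());                    // hypothetical delivery hook
            }
        }
    }

    private void deliver(String subscriberId) { /* send the document to the subscriber */ }
}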
Re: How can I index Phrases in Lucene?
This may be of interest: http://issues.apache.org/jira/browse/LUCENE-474

Cheers
Mark

- Original Message From: Ryan McKinley <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 23 March, 2007 3:25:02 AM Subject: Re: How can I index Phrases in Lucene?

Is there any way to find frequent phrases without knowing what you are looking for? I could index "A B C D E" as "A B C", "B C D", "C D E" etc., but that seems kind of clunky, particularly if the phrase length is large. Is there any position offset magic that will surface frequent phrases automatically?

thanks
ryan

On 3/22/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> Well, you don't index phrases, it's done for you. You should try something like the following.
>
> Create a SpanNearQuery with your terms. Specify an appropriate slop (probably 0, assuming you want them all next to each other).
>
> Now call getSpans() and count ... You may have to do something with overlapping spans, but you'll need to experiment a bit to understand it.
>
> Erick
>
> On 3/22/07, Maryam <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > I know how to index terms in Lucene; now I want to see how I can index phrases like "information retrieval" in Lucene and calculate the number of times that phrase has appeared in the document. Is there any way to do it in Lucene?
> >
> > Thanks
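As a rough illustration of Erick's SpanNearQuery suggestion above, something along these lines counts per-document occurrences of a known phrase; the field name "contents" is assumed, and overlapping spans are not handled:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

// Counts how often the phrase "information retrieval" occurs in each document.
public class PhraseCounter {
    public static void count(IndexReader reader) throws IOException {
        SpanQuery[] words = new SpanQuery[] {
            new SpanTermQuery(new Term("contents", "information")),
            new SpanTermQuery(new Term("contents", "retrieval"))
        };
        SpanNearQuery phrase = new SpanNearQuery(words, 0, true); // slop 0, in order
        Spans spans = phrase.getSpans(reader);
        int doc = -1, freq = 0;
        while (spans.next()) {
            if (spans.doc() != doc) {                            // moved to a new document
                if (doc != -1) System.out.println("doc " + doc + ": " + freq);
                doc = spans.doc();
                freq = 0;
            }
            freq++;                                              // one span = one phrase occurrence
        }
        if (doc != -1) System.out.println("doc " + doc + ": " + freq);
    }
}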
Re: MergeFactor and MaxBufferedDocs value should ...?
"SK R" <[EMAIL PROTECTED]> wrote: > If I set MergeFactor = 100 and MaxBufferedDocs=250 , then first 100 > segments will be merged in RAMDir when 100 docs arrived. At the end of > 350th > doc added to writer , RAMDir have 2 merged segment files + 50 seperate > segment files not merged together and these are flushed to FSDir. > > If wrong, please correct me. > > My doubt is whether we should set MergeFactor & MaxBufferedDocs in > proportional ratio (i.e) MaxBufferedDocs = n*MergeFactor where n = 1,2 > ... > to reduce indexing time and get greater performance or no need to worry > about it's relation? Actually, maxBufferedDocs is how many docs are held in RAM before flushing to a single segment. So with 250, after adding the 250th doc the writer will write the first segment; after adding the 500th doc, it writes the second segment, etc. Then, mergeFactor says how many segments can be written before a merge takes place. A mergeFactor of 10 means after writing 10 such segments from above, they will be merged into a single segment with 2500 docs. After another 2500 docs you'll have 2 such segments. Then once you've added your 25000'th doc, all of the 2500 doc segments will be merged into a single 25000 segment doc, etc. To maximize indexing performance you really want maxBufferedDocs to be as large as you can handle (the bigger you make it, the more RAM is required by the writer). I believe (not certain) larger values of mergeFactor will also improve performance since it defers merging as long as possible. However, the larger you make this, the more segments are allowed to exist in your index, and at some point you will hit file handle limits with your searchers. I don't think these two parameters need to be proportional to one another. I don't think that will affect performance. Another performance boost is to turn off compound file, but, this has a severe cost of requiring far more file handles during searching. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Reverse search
23 mar 2007 kl. 09.57 skrev Melanie Langlois: Well, I though to use the PerFieldAnalyzerWrapper which contains as basic the snowballAnalyzer with English stopwords and use snowballAnalyzer with language specific keywords for the fields which will be in different languages. But I'm seeing that in your MemoryIndexTest you commented the use of SnowballAnalyzer, is it because it's too slow. In this case, I think I could use the StandardAnalyzer... what do you think? I think that creating an index with a couple of documents takes a fraction of the time it will take to place a million queries on that index. There is no real need to optimize something that takes milliseconds when you in the same process do something that takes half a minute. -- karl Mélanie -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Friday, March 23, 2007 12:46 PM To: java-user@lucene.apache.org Subject: Re: Reverse search 23 mar 2007 kl. 03.07 skrev Melanie Langlois: Thanks Karl, the performances graph is really amazing! I have to say that it would not have think this way around would be faster, but sounds nice if I can use this, make everything easier to manage. I'm just wondering what did you consider when you build your graph, only the time to run the queries? Because, I should add the time for creating the index anytime a new document comes in (or a subset of documents if several comes in same time), and the indexing of these documents. The documents should not be big, around 2KB. Did you measure this part ? Adding a document to a MemoryIndex or InstantiatedIndex takes more or less the same time it would take to add it to an empty RAMDirectory. How many clock ticks is spent really depends on what analysers you use. -- karl Mélanie -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Friday, March 23, 2007 10:35 AM To: java-user@lucene.apache.org Subject: Re: Reverse search 23 mar 2007 kl. 02.12 skrev Melanie Langlois: I want to manage user subscriptions to specific documents. So I would like to store the subscription (query) into the lucene directory, and whenever I receive a new document, I will search all the matching subscriptions to send the documents to all subcribers. For instance if a user subscribes to all documents with text containing (WORD1 and WORD2) or WORD3, how can I match the incoming document based on stored subscriptions? I was thinking to have two subfields for each field of the subscription: the AND conditions and the OR conditions. -OR. I will tokenized the document field content and insert OR between each of them, and run the query against OR condition of subscription -It's for the AND that I will have an issue, because if the incoming text may contains more words than the sequence I want to search. For instance, if I subscribe for documents contents lucene and java for instance , if the incoming document contents is lucene is a great API which has been developed in java, once I removed stopwords my query would look like lucene and great and API and developed and java. As query is composed of more words than the stored subscription I will fail to retrieve the subscription. But if I put only or words, the results will not be accurate, as I can obtain subscription only for java for instance. I wrote such a thing way back, where I used the new document as the query and the user subscriptions as the index. Similar to what you describe, I had an AND, OR and NOT field. This really limited the type of queries users could store. 
It does however work, particularly well on systems with /huge/ amounts of subscriptions (many millions). Today I would have used something else. If you insert one document at a time to your index, take a look at MemoryIndex in contrib. If you insert documents in batches larger than one document at a time, take a look at LUCENE-550 in the Jira. Add new documents to such an index and place the subscribed queries on it. Depending on the queries, the speed should be some 20-100 times faster than using a RAMDirectory. One million queries should take some 20 seconds to assemble and place on a 25-document index on my laptop. See <http://issues.apache.org/jira/secure/attachment/12353601/12353601_HitCollectionBench.jpg> for performance of LUCENE-550.

-- karl
Re: MergeFactor and MaxBufferedDocs value should ...?
Please clarify the following.

1. When will the segments in RAMDirectory be moved (flushed) into FSDirectory?
2. Segment creation by maxBufferedDocs happens in RAMDir. Where does the merge by MergeFactor happen -- in RAMDir or in FSDir?

Thanks in advance
RSK

On 3/23/07, Michael McCandless <[EMAIL PROTECTED]> wrote:

"SK R" <[EMAIL PROTECTED]> wrote:
> If I set MergeFactor = 100 and MaxBufferedDocs = 250, then the first 100 segments will be merged in RAMDir when 100 docs have arrived. At the end of the 350th doc added to the writer, RAMDir will have 2 merged segment files + 50 separate segment files not merged together, and these are flushed to FSDir.
>
> If wrong, please correct me.
>
> My doubt is whether we should set MergeFactor & MaxBufferedDocs in a proportional ratio (i.e. MaxBufferedDocs = n*MergeFactor where n = 1, 2, ...) to reduce indexing time and get greater performance, or whether there is no need to worry about their relation?

Actually, maxBufferedDocs is how many docs are held in RAM before flushing to a single segment. So with 250, after adding the 250th doc the writer will write the first segment; after adding the 500th doc, it writes the second segment, etc.

Then, mergeFactor says how many segments can be written before a merge takes place. A mergeFactor of 10 means that after writing 10 such segments from above, they will be merged into a single segment with 2500 docs. After another 2500 docs you'll have 2 such segments. Then once you've added your 25000th doc, all of the 2500-doc segments will be merged into a single 25000-doc segment, etc.

To maximize indexing performance you really want maxBufferedDocs to be as large as you can handle (the bigger you make it, the more RAM is required by the writer).

I believe (not certain) larger values of mergeFactor will also improve performance, since it defers merging as long as possible. However, the larger you make this, the more segments are allowed to exist in your index, and at some point you will hit file handle limits with your searchers.

I don't think these two parameters need to be proportional to one another. I don't think that will affect performance.

Another performance boost is to turn off compound file, but this has a severe cost of requiring far more file handles during searching.

Mike
Re: MergeFactor and MaxBufferedDocs value should ...?
I haven't used it yet, but I've seen several references to IndexWriter.ramSizeInBytes() and using it to control when the writer flushes the RAM. This seems like a more deterministic way of making things efficient than trying various combinations of maxBufferedDocs , MergeFactor, etc, all of which are guesses at best. I'd be really curious if it works for you... Erick On 3/23/07, SK R <[EMAIL PROTECTED]> wrote: Please clarify the following. 1.When will be the segments in RAMDirectory moved (flushed) in to FSDirectory? 2.Segments creation by maxBufferedDocs occur in RAMDir. Where merge by MergeFactor happen? whether in RAMDir or FSDir? Thanks in Advance RSK On 3/23/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > "SK R" <[EMAIL PROTECTED]> wrote: > > If I set MergeFactor = 100 and MaxBufferedDocs=250 , then first 100 > > segments will be merged in RAMDir when 100 docs arrived. At the end of > > 350th > > doc added to writer , RAMDir have 2 merged segment files + 50 seperate > > segment files not merged together and these are flushed to FSDir. > > > > If wrong, please correct me. > > > > My doubt is whether we should set MergeFactor & MaxBufferedDocs in > > proportional ratio (i.e) MaxBufferedDocs = n*MergeFactor where n = 1,2 > > ... > > to reduce indexing time and get greater performance or no need to worry > > about it's relation? > > Actually, maxBufferedDocs is how many docs are held in RAM before > flushing to a single segment. So with 250, after adding the 250th doc > the writer will write the first segment; after adding the 500th doc, > it writes the second segment, etc. > > Then, mergeFactor says how many segments can be written before a merge > takes place. A mergeFactor of 10 means after writing 10 such > segments from above, they will be merged into a single segment with > 2500 docs. After another 2500 docs you'll have 2 such segments. Then > once you've added your 25000'th doc, all of the 2500 doc segments will > be merged into a single 25000 segment doc, etc. > > To maximize indexing performance you really want maxBufferedDocs to be > as large as you can handle (the bigger you make it, the more RAM is > required by the writer). > > I believe (not certain) larger values of mergeFactor will also improve > performance since it defers merging as long as possible. However, the > larger you make this, the more segments are allowed to exist in your > index, and at some point you will hit file handle limits with your > searchers. > > I don't think these two parameters need to be proportional to one > another. I don't think that will affect performance. > > Another performance boost is to turn off compound file, but, this has > a severe cost of requiring far more file handles during searching. > > Mike > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
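Erick's ramSizeInBytes() idea would look roughly like this in practice. This is only a sketch: the 32 MB threshold and the path are arbitrary, and closing and reopening the writer is used here as a version-safe way to force the buffered docs to disk, since older IndexWriter versions expose no public flush method:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Flush by measured RAM usage: keep maxBufferedDocs high enough that it never
// triggers on its own, and force a flush whenever buffered RAM crosses a threshold.
public class RamFlushIndexer {
    private static final long RAM_LIMIT = 32 * 1024 * 1024;   // arbitrary 32 MB

    public static void index(String path, Document[] docs) throws Exception {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        writer.setMaxBufferedDocs(1000);                       // safety net only
        for (int i = 0; i < docs.length; i++) {
            writer.addDocument(docs[i]);
            if (writer.ramSizeInBytes() > RAM_LIMIT) {
                writer.close();                                // pushes the buffered docs to disk
                writer = new IndexWriter(path, new StandardAnalyzer(), false);
                writer.setMaxBufferedDocs(1000);
            }
        }
        writer.close();
    }
}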
Re: Reverse search
Bear in mind that the million queries you run on the MemoryIndex can be shortlisted if you place those queries in a RAMIndex and use the source document's terms to "query the queries". The list of unique terms for your document is readily available in the MemoryIndex's TermEnum. You can take this list and find "likely related queries" to execute from your Query index. Note that for phrase queries or other forms of query with multiple mandatory terms you should only index one of the terms (preferably the rarest) to ensure that your query is not needlessly executed. For example - using this approach I need only run the phrase query for "XYZ limited" whenever I encounter a document with the rare term "XYZ" in it, rather than the much more commonplace "limited". Cheers Mark - Original Message From: karl wettin <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 23 March, 2007 12:54:36 PM Subject: Re: Reverse search 23 mar 2007 kl. 09.57 skrev Melanie Langlois: > Well, I though to use the PerFieldAnalyzerWrapper which contains as > basic the snowballAnalyzer with English stopwords and use > snowballAnalyzer with language specific keywords for the fields > which will be in different languages. But I'm seeing that in your > MemoryIndexTest you commented the use of SnowballAnalyzer, is it > because it's too slow. In this case, I think I could use the > StandardAnalyzer... what do you think? I think that creating an index with a couple of documents takes a fraction of the time it will take to place a million queries on that index. There is no real need to optimize something that takes milliseconds when you in the same process do something that takes half a minute. -- karl > > Mélanie > > -Original Message- > From: karl wettin [mailto:[EMAIL PROTECTED] > Sent: Friday, March 23, 2007 12:46 PM > To: java-user@lucene.apache.org > Subject: Re: Reverse search > > > 23 mar 2007 kl. 03.07 skrev Melanie Langlois: > >> Thanks Karl, the performances graph is really amazing! >> I have to say that it would not have think this way around would be >> faster, but sounds nice if I can use this, make everything easier >> to manage. I'm just wondering what did you consider when you build >> your graph, only the time to run the queries? Because, I should add >> the time for creating the index anytime a new document comes in (or >> a subset of documents if several comes in same time), and the >> indexing of these documents. The documents should not be big, >> around 2KB. Did you measure this part ? > > Adding a document to a MemoryIndex or InstantiatedIndex takes more or > less the same time it would take to add it to an empty RAMDirectory. > How many clock ticks is spent really depends on what analysers you > use. > > -- > karl > >> >> Mélanie >> >> -Original Message- >> From: karl wettin [mailto:[EMAIL PROTECTED] >> Sent: Friday, March 23, 2007 10:35 AM >> To: java-user@lucene.apache.org >> Subject: Re: Reverse search >> >> >> 23 mar 2007 kl. 02.12 skrev Melanie Langlois: >> >>> I want to manage user subscriptions to specific documents. So I >>> would like to store the subscription (query) into the lucene >>> directory, and whenever I receive a new document, I will search all >>> the matching subscriptions to send the documents to all subcribers. >>> For instance if a user subscribes to all documents with text >>> containing (WORD1 and WORD2) or WORD3, how can I match the incoming >>> document based on stored subscriptions? 
I was thinking to have two >>> subfields for each field of the subscription: the AND conditions >>> and the OR conditions. >>> >>> -OR. I will tokenized the document field content and insert OR >>> between each of them, and run the query against OR condition of >>> subscription >>> >>> -It's for the AND that I will have an issue, because if the >>> incoming text may contains more words than the sequence I want to >>> search. >>> >>> For instance, if I subscribe for documents contents lucene and java >>> for instance , if the incoming document contents is lucene is a >>> great API which has been developed in java, once I removed >>> stopwords my query would look like lucene and great and API and >>> developed and java. >>> >>> As query is composed of more words than the stored subscription I >>> will fail to retrieve the subscription. But if I put only or words, >>> the results will not be accurate, as I can obtain subscription only >>> for java for instance. >>> >> >> I wrote such a thing way back, where I used the new document as the >> query and the user subscriptions as the index. Similar to what you >> describe, I had an AND, OR and NOT field. This really limited the >> type of queries users could store. It does however work, particullary >> well on systems with /huge/ amounts of subscriptions (many millions). >> >> Today I would have used something else. If you insert one document at >> the time to your index, take a look at Memo
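A rough sketch of Mark's "query the queries" shortlisting above. It assumes the stored queries live in their own (RAM-based) index, with each query's rarest mandatory term in a "keyterm" field and the subscription id in a stored "id" field; all of those names are placeholders, not an API the thread defines:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;

public class QueryShortlister {
    /** Returns the ids of stored queries worth running against this document. */
    public List candidateIds(MemoryIndex docIndex, Searcher querySearcher) throws IOException {
        BooleanQuery.setMaxClauseCount(10000);   // a document can easily exceed the default 1024 clauses
        BooleanQuery probe = new BooleanQuery();
        IndexReader docReader = docIndex.createSearcher().getIndexReader();
        TermEnum terms = docReader.terms();
        while (terms.next()) {
            // any stored query keyed on a term the document actually contains is a candidate
            probe.add(new TermQuery(new Term("keyterm", terms.term().text())),
                      BooleanClause.Occur.SHOULD);
        }
        terms.close();
        Hits hits = querySearcher.search(probe);
        List ids = new ArrayList();
        for (int i = 0; i < hits.length(); i++) {
            ids.add(hits.doc(i).get("id"));      // only these queries need to be executed
        }
        return ids;
    }
}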
Re: MergeFactor and MaxBufferedDocs value should ...?
"SK R" <[EMAIL PROTECTED]> wrote: > 1.When will be the segments in RAMDirectory moved (flushed) in to > FSDirectory? This is maxBufferedDocs. Right now, every added doc creates its own segment in the RAMDir. After maxBufferedDocs, all of these single documents are merged and flushed to a single segment in FSDir. This is actually not really a very efficient way for IndexWriter to use RAM. I'm working on improving this / speeding it up under this Jira issue: http://issues.apache.org/jira/browse/LUCENE-843 But it will be some time before this is stable & released! > 2.Segments creation by maxBufferedDocs occur in RAMDir. Actually, no. The segments created due to maxBufferedDocs are in FSDir. > Where merge by MergeFactor happen? whether in RAMDir or FSDir? This is always in FSDir. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: MergeFactor and MaxBufferedDocs value should ...?
"Erick Erickson" <[EMAIL PROTECTED]> wrote: > I haven't used it yet, but I've seen several references to > IndexWriter.ramSizeInBytes() and using it to control when the writer > flushes the RAM. This seems like a more deterministic way of > making things efficient than trying various combinations of > maxBufferedDocs , MergeFactor, etc, all of which are guesses > at best. I agree this is the most efficient way to flush. The one caveat is this Jira issue: http://issues.apache.org/jira/browse/LUCENE-845 which can cause over-merging if you make maxBufferedDocs too large. I think the rule of thumb to avoid this issue is 1) set maxBufferedDocs to be no more than 10X the "typical" number of docs you will flush, and then 2) flush by RAM usage. So for example if when you flush by RAM you typically flush "around" 200-300 docs, then setting maxBufferedDocs to eg 1000 is good since it's far above 200-300 (so it won't trigger a flush when you didn't want it to) but it's also well below 10X your range of docs (so it won't tickle the above bug). Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Lazy Field Loading in IndexSearcher
Hi,

I am trying to make use of the new lazy field loading in Lucene 2.1. I store the original bytes of a document, say a PDF file for example, in a special untokenized field in the index. Though there are enough facilities in the IndexReader class for lazy field loading, the search API in IndexSearcher does not seem to contain such facilities. Hence, the Documents I get from Hits.doc() would not benefit from the mentioned feature. Am I missing an important point, or is this a desired feature to go on the todo list?

--Jafarim
index word files ( doc )
Hello, I am planning to index Word 2003 files. I read I have to use Jakarta Apache POI, but I also read on the POI site that their work with doc's is in an early stage. Is POI advisable? Or are there better alternatives? Please give some advice. Regards, Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: index word files ( doc )
Hi,

My experience has not been very satisfactory. It breaks very easily on many files.

On 3/23/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Hello, I am planning to index Word 2003 files. I read I have to use Jakarta Apache POI, but I also read on the POI site that their work with doc's is in an early stage. Is POI advisable? Or are there better alternatives? Please give some advice.

Regards,
Erik
Re: Lazy Field Loading in IndexSearcher
please read the answer i gave you the last time you asked this question... http://www.nabble.com/Re%3A-Lazy-field-loading-in-p9604064.html : Hi : I am seeking for making use of the latest lazy field loading in lucene 2.1. : I store the orignal bytes of a document, say a PDF file for example, in a : special untokenized field in the index. Though there is enough facilities in : IndexReader class for lazy field loading, the search API in IndexSearcher : does not contain such facilities (seemingly). Hence, the Documents I get : from the Hits.doc() would not benefit from the mentioned feature. : Am I missing an important point or this is a desired feature to go on the : todo list? : --Jafarim -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lazy Field Loading in IndexSearcher
Sorry if the question is trivial but why not a Hits.doc(int,FieldSelector) method? On 3/23/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: please read the answer i gave you the last time you asked this question... http://www.nabble.com/Re%3A-Lazy-field-loading-in-p9604064.html : Hi : I am seeking for making use of the latest lazy field loading in lucene 2.1. : I store the orignal bytes of a document, say a PDF file for example, in a : special untokenized field in the index. Though there is enough facilities in : IndexReader class for lazy field loading, the search API in IndexSearcher : does not contain such facilities (seemingly). Hence, the Documents I get : from the Hits.doc() would not benefit from the mentioned feature. : Am I missing an important point or this is a desired feature to go on the : todo list? : --Jafarim -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lazy Field Loading in IndexSearcher
: Sorry if the question is trivial, but why not a Hits.doc(int, FieldSelector)
: method?

As I said before...

>> Lazy loading stored fields is really about performance tweaking ... if
>> you are that concerned about performance, you shouldn't be using Hits at
>> all.

...there is a lot of info in the archives about why Hits is not what you should be using if you are trying to tweak for speed.

-Hoss
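For anyone following the thread, the Hits-free version Hoss is pointing at looks roughly like this in Lucene 2.1; the field names "title" and "pdfBytes" are made up for the example:

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.SetBasedFieldSelector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// Collect doc ids with TopDocs, then load documents through IndexReader with a
// FieldSelector so the large stored field (e.g. PDF bytes) stays lazy.
public class LazySearch {
    public static void run(IndexSearcher searcher, Query query) throws Exception {
        Set eager = new HashSet();
        eager.add("title");                               // loaded immediately
        Set lazy = Collections.singleton("pdfBytes");     // loaded only when asked for
        FieldSelector selector = new SetBasedFieldSelector(eager, lazy);

        TopDocs top = searcher.search(query, null, 10);
        IndexReader reader = searcher.getIndexReader();
        for (int i = 0; i < top.scoreDocs.length; i++) {
            Document doc = reader.document(top.scoreDocs[i].doc, selector);
            System.out.println(doc.get("title"));
            // doc.getFieldable("pdfBytes") would pull the bytes only on demand
        }
    }
}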
Search Design Question
Hello All,

We allow our users to search through our index with a simple text field. The search uses "content" as its default field. This allows them to search quickly through content, but when they type "to:blah AND from:foo AND content:boogie" it knows to parse it, etc.

What I want to do is expand it so that when they type a phrase in the text field it will search all fields at once, yet still be smart enough to recognize a Lucene query. For example, say we have these fields:

to
from
content
subject

When I type "michael contract negotiation" it should look through all these fields and return hits. Then it should be able to recognize more advanced searches like:

to:michael AND content:foo

and not go through all fields. Am I making sense? Is this a good way to provide search? How would I do this?

Thanks,
Michael
Re: Search Design Question
I don't believe there's anything built into Lucene that helps you out here, because you're really saying "do special things for my problem space in these situations". So about the only thing you can do that I know of is to construct the query yourself by making a series of additions to a BooleanQuery based on your particular problem space. Which you have to do anyway, because you want to turn a simple michael query into

to:michael from:michael content:michael subject:michael (note the default OR)

So you'll have to do something like:

if (term has colon) {
    just add a boolean clause for that term in that field
} else {
    add all four clauses
}

However, be warned: this is not a trivial task if you want to support arbitrary grouping, implement precedence, etc. Search the list for "bad query bug" for an explanation, or go to the FAQ and look at something like "why don't I get what I expect from queries". In fact, you really need to look at and understand that FAQ entry before you let Lucene loose on queries with AND, OR and NOT in them. It'll be well worth your time.

One final note: it may be much easier for you to throw all the fields into a single uber-field and search that rather than implement all four separate clauses, but it's a trade-off between simplicity and size.

Best
Erick

On 3/23/07, Michael J. Prichard <[EMAIL PROTECTED]> wrote:

Hello All,

We allow our users to search through our index with a simple text field. The search uses "content" as its default field. This allows them to search quickly through content, but when they type "to:blah AND from:foo AND content:boogie" it knows to parse it, etc.

What I want to do is expand it so that when they type a phrase in the text field it will search all fields at once, yet still be smart enough to recognize a Lucene query. For example, say we have these fields: to, from, content, subject.

When I type "michael contract negotiation" it should look through all these fields and return hits. Then it should be able to recognize more advanced searches like to:michael AND content:foo and not go through all fields.

Am I making sense? Is this a good way to provide search? How would I do this?

Thanks,
Michael
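A rough sketch of the per-field expansion Erick describes; the colon test is deliberately naive, and the field list and default field are assumptions taken from the question:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class SmartQueryBuilder {
    private static final String[] FIELDS = {"to", "from", "content", "subject"};

    public static Query build(String input) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        if (input.indexOf(':') >= 0) {
            // user typed an explicit field, so hand the whole thing to the query parser
            return new QueryParser("content", analyzer).parse(input);
        }
        BooleanQuery combined = new BooleanQuery();
        for (int i = 0; i < FIELDS.length; i++) {
            Query perField = new QueryParser(FIELDS[i], analyzer).parse(input);
            combined.add(perField, BooleanClause.Occur.SHOULD);   // default OR across fields
        }
        return combined;
    }
}

The uber-field alternative Erick mentions would replace all of this with a single QueryParser over the combined field.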
RE: index word files ( doc )
Thank you,

Are there other solutions?

From: jafarim [mailto:[EMAIL PROTECTED]
Sent: Fri 23-3-2007 18:55
To: java-user@lucene.apache.org
Subject: Re: index word files ( doc )

Hi, my experience has not been very satisfactory. It breaks very easily on many files.

On 3/23/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> I am planning to index Word 2003 files. I read I have to use Jakarta Apache POI, but I also read on the POI site that their work with doc's is in an early stage.
>
> Is POI advisable? Or are there better alternatives? Please give some advice.
>
> Regards,
>
> Erik
Re: index word files ( doc )
I think the code from Lucene in Action has examples that use POI and the Textmining.org API. Check manning.com/hatcher2 for the code.

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

- Original Message From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, March 23, 2007 5:03:32 PM Subject: RE: index word files ( doc )

Thank you,

Are there other solutions?

From: jafarim [mailto:[EMAIL PROTECTED]
Sent: Fri 23-3-2007 18:55
To: java-user@lucene.apache.org
Subject: Re: index word files ( doc )

Hi, my experience has not been very satisfactory. It breaks very easily on many files.

On 3/23/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> I am planning to index Word 2003 files. I read I have to use Jakarta Apache POI, but I also read on the POI site that their work with doc's is in an early stage.
>
> Is POI advisable? Or are there better alternatives? Please give some advice.
>
> Regards,
>
> Erik
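For what it's worth, raw text extraction with POI's HWPF scratchpad component looks roughly like the sketch below. Whether HWPFDocument copes with a given .doc file is exactly the open question in this thread, and class availability depends on your POI version, so treat this as an illustration only:

import java.io.FileInputStream;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.poi.hwpf.HWPFDocument;

// Extract the text of a Word .doc with POI's HWPF and wrap it in a Lucene Document.
public class WordIndexer {
    public static Document toDocument(String path) throws IOException {
        FileInputStream in = new FileInputStream(path);
        try {
            HWPFDocument word = new HWPFDocument(in);
            String text = word.getRange().text();   // plain text of the whole document
            Document doc = new Document();
            doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
            doc.add(new Field("path", path, Field.Store.YES, Field.Index.UN_TOKENIZED));
            return doc;
        } finally {
            in.close();
        }
    }
}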
Re: Search Design Question
: One final note, it may be much easier for you to throw all the
: fields into a single uber-field and search that rather than implement
: all four separate clauses, but it's a trade-off between simplicity and
: size.

This would be a very simple way to get the behavior you describe straight from the Lucene QueryParser ... I would certainly recommend that approach.

-Hoss
Re: index word files ( doc )
There is www.textmining.org, but the site is no longer accessible. Check Nutch, which has a Word parser - it seems to be the original textmining.org Word6+POI parser. Pre-Word6 and "fast-saved" files will not work; I've not found a solution for those.

Antony

[EMAIL PROTECTED] wrote:
> Thank you,
> Are there other solutions?
Re: index word files ( doc )
Antony Bowesman wrote:
>> Are there other solutions?

There's also antiword [1], which can convert your .doc to plain text or PS; not sure how good it is.

--
Sami Siren

[1] http://www.winfield.demon.nl/