Re: Using Hits as document space for new search
Thanks a lot. I've discovered Solr, column classification and other interesting things. ;)

Best,
Stas

hossman wrote:
> : For example, in my case it's a car search form.
> : First of all I tell the system that I want to search for BMW. The system
> : returns a set of results.
> : While I view the results, the system shows additional criteria for making
> : the search result more exact, and shows the size of the result set after
> : adding each criterion (this count is smaller than the current result set
> : size, because the new result is just a subset of the current result list).
>
> This is generally known as "faceted searching". If you search the list
> archives for "facet" or in some cases "category counts" you'll find
> numerous discussions on how to tackle problems like this.
>
> In general: you don't want to try this using something like the Hits
> class; its internal behavior is very inefficient for something like this.
> Building Filters (and caching them) tends to be the way to go (and you
> can always build a Filter out of a query).
>
> -Hoss

View this message in context: http://www.nabble.com/Using-Hits-as-document-space-for-new-search-tp19511672p19624222.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
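Hoss's suggestion of cached filters can be pictured with a small self-contained sketch. It does not use Lucene's Filter API: it assumes each cached filter is simply a `java.util.BitSet` of matching document ids, and the class name `FacetCountSketch`, the document ids and the criterion names are all made up for illustration. The count shown next to a criterion is then the number of bits shared by the current result set and that criterion's cached bit set.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of faceted counting: one cached bit set per criterion,
// intersected with the bit set of the current result list.
// This is NOT Lucene's Filter API; every name here is hypothetical.
class FacetCountSketch {

    // Builds a bit set with the given document ids switched on.
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    // For each cached criterion, counts how many current results also match it.
    static Map<String, Integer> facetCounts(BitSet currentResults,
                                            Map<String, BitSet> cachedFilters) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : cachedFilters.entrySet()) {
            BitSet overlap = (BitSet) currentResults.clone(); // leave the inputs untouched
            overlap.and(entry.getValue());
            counts.put(entry.getKey(), overlap.cardinality());
        }
        return counts;
    }

    public static void main(String[] args) {
        BitSet bmw = bits(0, 1, 2, 5, 8); // made-up ids of documents matching "BMW"
        Map<String, BitSet> cached = new LinkedHashMap<>();
        cached.put("color:black", bits(1, 2, 3, 8));
        cached.put("fuel:diesel", bits(2, 4, 5));
        System.out.println(facetCounts(bmw, cached));
    }
}
```

Because the per-criterion bit sets are computed once and reused, each extra count costs only one intersection, which is the point of caching the filters rather than re-running the search through Hits.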
Re: Multi Field search without Multifieldqueryparser
Here is what I'm trying to do.

Say I have a Lucene document:

name: abc ^10
organization: xyz ^3

^10 and ^3 are boosts in the document.

Now if I query name: abc ^5 AND organization: xyz, this will work.

But if I query (default_field): abc^5 AND xyz, this won't work.

What I want is that a text can be associated with more than one field, i.e.

(field1,field2,field3):value
name,(default_field),title: abc^10
organization,(default_field),institute: xyz^3

Then both of my queries will work.

Is it possible to do so in Lucene without changing the source? If not, can anyone please explain the indexing and searching mechanism of Lucene, so that I can start working on it?

The solution given on java-user won't work for me, as I do not want to add all the contents of the document to a single field and then search that field: this would increase the index size, and I have to index more than 10 million documents. Also, MultiFieldQueryParser will make query execution inefficient, as there will be thousands of fields.

If I start storing just a single field as (default_field): "name abc organization xyz", then it is possible that some other documents that are not relevant might get selected. Also, I want to boost individual fields in a document.

Anshul
Re: Multi Field search without Multifieldqueryparser
So, the piece I'm missing is how you know which field goes with which terms. In other words, how do you know that xyz goes against organization and abc against name? Your wording implies that you don't know this beforehand, yet you are somehow suggesting that Lucene should be able to do it. Correct me if I'm wrong.

-Grant

On Sep 23, 2008, at 6:51 AM, Anshul jain wrote:
> Here is what I'm trying to do:
> [...]

--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Query attached words
Yes, you can query *method. But you have to turn on leading wildcards (I don't have the setting at the tips of my fingers, but I know it's been an option for some time now).

But your solution doesn't scale well. If you had a.b.c.d.e.f.g.h you'd have to store many combinations in order to do what you want, which quickly becomes really, really ugly.

But you could store the tokens a . b . c . d . e . f . g . h by using the appropriate analyzer (or perhaps rolling your own). Then you could use either PhraseQuerys or SpanQuerys to do what you want.

Best
Erick

On Mon, Sep 22, 2008 at 5:40 PM, Jean-Claude Antonio <[EMAIL PROTECTED]> wrote:
> Hello,
>
> If I had a file with the following content:
> ...
> object.method();
> ...
> I would like to be able to query for
> object
> method
> object.method
>
> My guess is that I should store not only "object.method", but also "object"
> and "method" as I cannot query *method.
> Any other suggestion?
>
> Kind regards,
>
> JClaude
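Erick's token idea can be illustrated with a toy sketch in plain Java. It is not a Lucene Analyzer, PhraseQuery or SpanQuery: the class `DotTokenSketch` and both of its methods are hypothetical, and the "phrase match" is only a check that the query tokens appear as one consecutive run in the document tokens.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy illustration: keep "." as its own token so that "object",
// "method" and "object.method" can all be matched from one token stream.
// Hypothetical names throughout; this is not Lucene code.
class DotTokenSketch {

    // Splits text into identifier tokens and single "." tokens.
    // Every other character only acts as a separator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c) || c == '_') {
                word.append(c);
            } else {
                if (word.length() > 0) {
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (c == '.') {
                    tokens.add(".");
                }
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    // True if the phrase tokens occur as one consecutive run in the document tokens.
    static boolean containsPhrase(List<String> doc, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(doc, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("object.method();");
        System.out.println(doc);
        System.out.println(containsPhrase(doc, tokenize("method")));
        System.out.println(containsPhrase(doc, tokenize("object.method")));
    }
}
```

With this tokenization a single stored token stream answers all three queries from the original question, so no extra combinations need to be indexed.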
Re: Multi Field search without Multifieldqueryparser
Yes, you are partly correct.

What I need is that Lucene should support two types of queries for the following document:

name: abc^10
organization: xyz^3

structured query:
name: abc and organization: xyz

unstructured query:
default_field: abc ^5 and xyz

But I do not want to create one more field (default_field) that contains all the values concatenated. Also, even if I collect all the field names during indexing and use them with MultiFieldQueryParser, the query will become very inefficient, as there can be thousands of fields. I think this should clarify my point.

On Tue, Sep 23, 2008 at 1:58 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> So, the piece I'm missing is how do you know what field for which terms.
> [...]

--
Anshul Jain
Re: Exception while doing sorting
That still seems excessive. Are you measuring your first sort? Lucene builds up caches during the first few *sorts* that happen, so that's a possibility.

But if that isn't the case, I think you need to slap a profiler on the problem and see where you're spending your time. I'd also be careful about what you measure when you measure your query. For instance, I've been fooled by measuring the total time to get an assembled response, and it turned out that the time was spent fetching the documents, NOT searching/sorting.

Try measuring various operations. In particular, comment out anything having to do with assembling the response. Perhaps just substitute making a list of the doc IDs and time *that*. Slowly build back up to your current app, and I suspect that one of the steps will cause your time to increase dramatically.

How many documents are you assembling to respond? If you're assembling 40,000 hits, then 10-20 seconds may not be unreasonable.

Best
Erick

On Tue, Sep 23, 2008 at 12:51 AM, Ganesh - yahoo <[EMAIL PROTECTED]> wrote:
> System Specification:
> Processor speed: 2Ghz
> Ram: 3 GB
>
> IndexDB size 5 GB.
> Total documents indexed: 5.8 million.
>
> To collect hits, I have replaced the Hits object with TopFieldDocs. This has
> improved search performance. Sorting is faster on a date / long field, but
> it is very slow on a string field. In a standalone application it took
> 10 - 20 secs to display the results sorted on a string field. [I am not
> opening an IndexSearcher every time.]
>
> Regards
> Ganesh
>
> ----- Original Message -----
> From: "Erick Erickson" <[EMAIL PROTECTED]>
> To:
> Sent: Monday, September 22, 2008 6:29 PM
> Subject: Re: Exception while doing sorting
>
>> Sure, your tomcat instance is assigning some amount of memory
>> to the JVM that your searcher is running in. Of course, now you're
>> going to ask me how to increase that number... I have no idea, but
>> I've seen this question multiple times in the mail archive,
>> so a search there or in the tomcat docs should let you know.
>>
>> But 12 seconds is still a long time to wait for a search to complete.
>> Can you tell us more about your search?
>>
>> For instance, are you opening a searcher for each request? That's bad.
>> Are you sorting? That can take a long time, but again the first one
>> will have a performance penalty as things are cached.
>>
>> There are a number of tips here:
>> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
>>
>> Best
>> Erick
>>
>> On Mon, Sep 22, 2008 at 7:45 AM, Ganesh - yahoo <[EMAIL PROTECTED]> wrote:
>>
>>> My index crossed 5 GB and 5 million documents are indexed.
>>> My query, which includes searching and sorting, returns 4 hits.
>>>
>>> If I do the search from a standalone application, the results are
>>> returned in 12 seconds. If I perform the same search from a web
>>> application running inside Tomcat, an out of memory exception occurs.
>>>
>>> Could anyone clarify this?
>>>
>>> Regards
>>> Ganesh
>>>
>>> ----- Original Message -----
>>> From: "Ganesh - yahoo" <[EMAIL PROTECTED]>
>>> To:
>>> Sent: Friday, September 19, 2008 10:56 AM
>>> Subject: Re: Exception while doing sorting
>>>
>>>> Ok. If I distribute the indexes, would sorting be faster?
>>>> On the Lucene user mailing list, most emails suggest using a single
>>>> index. Wouldn't searching across the indexes be slower?
>>>>
>>>>> Lucene uses FieldCache for sorting on non-tokenized field and tries
>>>>> to maintain fields from all your 4 millions documents, even if you
>>>>> need to sort only 4000 docs.
>>>>
>>>> I don't know why Lucene keeps all terms in FieldCache for sorting. It
>>>> is supposed to sort only the hits. Please clarify?
>>>>
>>>> Regards
>>>> Ganesh
>>>>
>>>> ----- Original Message -----
>>>> From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
>>>> To:
>>>> Sent: Thursday, September 18, 2008 12:17 PM
>>>> Subject: Re: Exception while doing sorting
>>>>
>>>>> If your index is increasing in size so fast, you should start
>>>>> thinking about sharding your index (breaking it into multiple
>>>>> smaller indices that each fits on its server) and searching across
>>>>> them (aka distributed search).
>>>>>
>>>>> Yes, Lucene can handle millions of records if run on adequate
>>>>> hardware and if used correctly.
>>>>>
>>>>> Otis
>>>>> --
>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: Ganesh - yahoo <[EMAIL PROTECTED]>
>>>>>> To: java-user@lucene.apache.org
>>>>>> Sent: Thursday, September 18, 2008 12:53:19 AM
>>>>>> Subject: Re: Exception while doing sorting
>>>>>>
>>>>>> My index is growing by 1 million records per day. How much memory
>>>>>> do I need to add?
>>>>>>
>>>>>> What kind of sorting algorithm is being used in Lucene? Is it
>>>>>> efficient enough to handle millions of records?
>>>>>>
>>>>>> Whether we could do sorting using our own algori
Re: Multi Field search without Multifieldqueryparser
Are you sure you want to be boosting the document fields at index time?

From Hossman <<>>

But Lucene isn't magic, it's an engine that you have to make do what you want.

You say "But i do not want to create one more field(default_field) that will contain all the values concatenated in it".

Is this for theoretical reasons, or do you have evidence that this is unacceptable? You haven't told us how much data you're indexing, so we have no way to reassure (or warn) you about trying this.

I suggest you try the "bag of words" solution (this should not take you more than a few hours) and see whether it's unacceptable before rejecting it.

Best
Erick

On Tue, Sep 23, 2008 at 6:51 AM, Anshul jain <[EMAIL PROTECTED]> wrote:
> Here is what I'm trying to do:
> [...]
Re: Multi Field search without Multifieldqueryparser
On Tue, Sep 23, 2008 at 5:28 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> So, the piece I'm missing is how do you know what field for which terms.
> In other words how do you know xyz goes against organization and abc
> against name. Your wording implies that you don't know this before hand,

I guess this would be the case. Free-flowing text search leads to this issue.

> yet you are somehow suggesting that Lucene should be able to do it.
> Correct me if I'm wrong.

I am not sure that Lucene can do this directly. However, the indexed terms in Lucene can certainly be used to learn the field of a particular word/token.

One way would be to traverse the Lucene index to generate a Learning System, which is later used to learn the field name for a particular word. I suggest traversing the termDocs and extracting the words and field information, which can be stored in a separate DB/index (the Learning System). This system can then be queried first to determine the field type of a word. The additional time that the Learning System requires should be compensated for by the smaller index size.

Thanks
Umesh

> -Grant
>
> On Sep 23, 2008, at 6:51 AM, Anshul jain wrote:
>
>> Here is what I'm trying to do:
>> [...]
RE: Multi Field search without Multifieldqueryparser
Just an idea... a long-winded one. I'm not sure either! Pardon me if I am pointing you in the wrong direction.

If you add Lucene docs like the ones below into your main index:

- Doc 1 -
Field1: rainy today
Field2: rainy yesterday
Field3: weather forecast for tomorrow

- Doc 2 -
Field1: rainy tomorrow
Field2: rainy today
Field3: weather forecast for today

... etc.

And if you create something like an inverted index like the one below:

- Doc 1 -
Field: Field1
Value: rainy today

- Doc 2 -
Field: Field2
Value: rainy yesterday

- Doc 3 -
Field: Field3
Value: weather forecast for tomorrow

- Doc 4 -
Field: Field1
Value: rainy tomorrow

- Doc 5 -
Field: Field2
Value: rainy today

- Doc 6 -
Field: Field3
Value: weather forecast for today

And if you run a query on the inverted index to find the field that is most likely to match the text you are about to search for in the main index, I have a feeling that this might work.

-----Original Message-----
From: Umesh Prasad [mailto:[EMAIL PROTECTED]
Sent: 23 September 2008 13:58
To: java-user@lucene.apache.org
Subject: Re: Multi Field search without Multifieldqueryparser

> I am not sure if Lucene will be able to directly able to do it.
> [...]
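The field-lookup idea from Umesh and the message above can be sketched in plain Java. This is a toy under stated assumptions, not Lucene code: the class `FieldGuessSketch` and its methods are hypothetical, terms are split on whitespace only, and "most likely field" simply means the field in which the term was seen most often, with ties broken alphabetically by field name so the result is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Toy "learning system": remembers in which fields each term occurs,
// then guesses a field for an unfielded query term.
// Hypothetical names; a real version would be built by walking the index.
class FieldGuessSketch {

    // term -> (field -> number of occurrences of the term in that field)
    private final Map<String, Map<String, Integer>> termToFields = new HashMap<>();

    // Records every whitespace-separated term of one field value.
    void add(String field, String value) {
        for (String term : value.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            termToFields.computeIfAbsent(term, t -> new HashMap<>())
                        .merge(field, 1, Integer::sum);
        }
    }

    // Returns the field where the term was seen most often, or null if unseen.
    // Ties go to the alphabetically smaller field name.
    String guessField(String term) {
        Map<String, Integer> fields = termToFields.get(term.toLowerCase());
        if (fields == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : fields.entrySet()) {
            int count = e.getValue();
            boolean better = count > bestCount
                    || (count == bestCount && best != null && e.getKey().compareTo(best) < 0);
            if (better) {
                best = e.getKey();
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        FieldGuessSketch guess = new FieldGuessSketch();
        guess.add("Field1", "rainy today");
        guess.add("Field2", "rainy yesterday");
        guess.add("Field3", "weather forecast for tomorrow");
        System.out.println(guess.guessField("weather"));
        System.out.println(guess.guessField("yesterday"));
    }
}
```

A term such as "rainy" that occurs equally often in several fields shows the limit of the approach: the lookup can only pick one of them arbitrarily, which is why Grant keeps asking how the target field is supposed to be known.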
Re: Multi Field search without Multifieldqueryparser
On Sep 23, 2008, at 8:35 AM, Anshul jain wrote:

> yes you are partly correct
>
> what I need is that lucene should support two type of queries for the
> following document:
> name: abc^10
> organization: xyz^3
>
> structured query:
> name: abc and organization: xyz
>
> unstructured query:
> default_field: abc ^5 and xyz

And what field(s) should "xyz" be searched against? Again, I ask: how do you know what fields "xyz" should go against, and why does abc go against the default_field? You've said it shouldn't go against all fields (because there are thousands of them), and you've said it shouldn't go against a catch-all field, but otherwise I still have no clue what your criteria are for which fields xyz should search. Are you saying that you want it to intelligently know that when "xyz" comes in, it should search the organization field?

Other than seconding Umesh's or Dino's suggestions of using machine learning, heuristics or some type of templating system, I'm not sure what else to offer. You might look at Solr's Dismax Query Parser, which allows you to specify the field structure of queries in a multi-field way, but again, I doubt that is wholly what you are looking for.

> But i do not want to create one more field(default_field) that will
> contain all the values concatenated in it. Also, even if i get all the
> fields during indexing and use it for multi field query parser, then
> the query will become very inefficient as there can be thousands of
> fields. I think it should clarify my point.
>
> On Tue, Sep 23, 2008 at 1:58 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>> So, the piece I'm missing is how do you know what field for which terms.
>> [...]

--
Grant Ingersoll
http://www.lucidimagination.com
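As a rough picture of what "specifying the field structure in a multi-field way" could look like, the sketch below expands each unfielded term into a disjunction over a short, known list of fields with per-field boosts and prints the result in Lucene query syntax. It is an assumption-laden toy: it is not Solr's Dismax implementation and does not reproduce Dismax scoring, the class `FieldExpansionSketch` is hypothetical, terms are not escaped, and it only makes sense if the list of candidate fields is small rather than the thousands mentioned in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy expansion of unfielded terms across a short list of boosted fields.
// Hypothetical helper; not Solr's Dismax parser and not MultiFieldQueryParser.
class FieldExpansionSketch {

    // Turns [abc, xyz] into "(f1:abc^b1 OR f2:abc^b2) AND (f1:xyz^b1 OR f2:xyz^b2)".
    static String expand(List<String> terms, Map<String, Integer> fieldBoosts) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            List<String> perField = new ArrayList<>();
            for (Map.Entry<String, Integer> e : fieldBoosts.entrySet()) {
                perField.add(e.getKey() + ":" + term + "^" + e.getValue());
            }
            clauses.add("(" + String.join(" OR ", perField) + ")");
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        Map<String, Integer> boosts = new LinkedHashMap<>(); // keeps field order stable
        boosts.put("name", 10);
        boosts.put("organization", 3);
        System.out.println(expand(Arrays.asList("abc", "xyz"), boosts));
    }
}
```

The query grows with the number of terms times the number of fields, which is the inefficiency Anshul is worried about when the field list is very long.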
Re: Query attached words
Thanks Erick, you are right about the various combinations.

Cheers,

Erick Erickson wrote:
> Yes you can query *method. But you have to turn leading wildcards
> [...]
Re: Query attached words
We have a similar requirement here at our work. To get around it we create two indexes: one in which punctuation is relevant, and one in which all punctuation is treated as a place to break tokens. We then search against both indexes and merge the results. It seems that such a technique might help you here as well. (Though upon rereading, it seems that perhaps you want SOME punctuation to be relevant but not the rest; the technique itself could still be applied with those rules instead.)

- Matt

Jean-Claude Antonio wrote:
> Thanks Erick, you are right about the various combinations.
> [...]
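Matt's two-index technique can be pictured with a small in-memory sketch. It is only an illustration under assumptions: the class `TwoIndexSketch` is hypothetical, both "indexes" are plain maps from token to document ids, the punctuation-aware index keeps "." inside tokens while the other breaks on every non-word character, and "merging" is a simple union without any scoring.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy version of "index twice, search both, merge the results".
// Hypothetical names; real code would use two Lucene indexes or two fields.
class TwoIndexSketch {

    private final Map<String, Set<Integer>> punctuated = new HashMap<>(); // "." stays in tokens
    private final Map<String, Set<Integer>> split = new HashMap<>();      // all punctuation breaks tokens

    // Adds one document to both indexes using the two tokenizations.
    void add(int docId, String text) {
        String lower = text.toLowerCase();
        for (String token : lower.split("[^a-z0-9_.]+")) {
            post(punctuated, token, docId);
        }
        for (String token : lower.split("[^a-z0-9_]+")) {
            post(split, token, docId);
        }
    }

    private static void post(Map<String, Set<Integer>> index, String token, int docId) {
        if (!token.isEmpty()) {
            index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Looks the query up in both indexes and returns the union of the hits.
    Set<Integer> search(String query) {
        String q = query.toLowerCase();
        Set<Integer> merged = new TreeSet<>();
        merged.addAll(punctuated.getOrDefault(q, Collections.emptySet()));
        merged.addAll(split.getOrDefault(q, Collections.emptySet()));
        return merged;
    }

    public static void main(String[] args) {
        TwoIndexSketch index = new TwoIndexSketch();
        index.add(1, "call object.method(); here");
        index.add(2, "another method");
        System.out.println(index.search("method"));
        System.out.println(index.search("object.method"));
    }
}
```

The cost of the approach is that every document is tokenized and stored twice, which trades index size for simpler queries.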
Re: Multi Field search without Multifieldqueryparser
unstructured query: default_field: abc ^5 and xyz seems to have created a confusion, what I meant was while initializing the parser I have "default_field" as the default text field. So, the query should be: QueryParser parser = new QueryParser("default_field",analyzer); query = parser.parse("abc^5 and xyz"); so query will be: default_field:abc^5 and default_field:xyz^3 I am sorry for mentioning it wrong earlier. To answer Ericks question: I'll be indexing around 10-20 million documents of average size of 4 KB, but the number of documents could be mor. Now let me again clearly explain my problem: say i have a set of lucene documents as: Document 1: name: Anshul ^10 organization: EPFL ^5 sex: Male Document 2: name: Rakesh ^10 organization: IIT-B ^5 sex: Male Docuemt 3: name: erin brochowich^10 organization: ABC law firm sex: Female Document 4: title: lord of the rings ^10 directors: John ^2 actors: Kate Document 5: title: godfather ^10 directors: Kate ^2 actors: alpachino Docmuent 1, 2 and 3 belongs to a same class so there boosting parameters will be same. Similar is the case with document 4 and 5. If I give a query like: name: "Erin Brochowich" and Oranization: "ABC law firm". this query will work perfectly. but if the query is QueryParser parser = new QueryParser("default_field",analyzer); query = parser.parse("Erin Brochowich and ABC law firm"); it would not work. what i want is that default_field should be connected to the all the text somehow, but it should not take extra space for storing its own text. I think it should be clear enough now. Thank you for your responses. 
Regards,
Anshul

On Tue, Sep 23, 2008 at 4:55 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> On Sep 23, 2008, at 8:35 AM, Anshul jain wrote:
>
>> yes you are partly correct
>>
>> what I need is that lucene should support two type of queries for the
>> following document:
>> name: abc^10
>> organization: xyz^3
>>
>> structured query:
>> name: abc and organization: xyz
>>
>> unstructured query:
>> default_field: abc ^5 and xyz
>
> And what field(s) should "xyz" be searched against? Again, I ask, how do
> you know what fields "xyz" should go against and why does abc go against the
> default_field? You've said it shouldn't go against all fields (b/c there
> are thousands of them), and you've said it shouldn't go against a catch-all
> field, but otherwise I still have no clue your criteria for what fields xyz
> should search. Are you saying that you want it to intelligently know that
> when "xyz" comes in that it should search the organization field?
>
> Other than seconding Umesh's or Dino's suggestions of using machine learning
> or heuristics or using some type of templating system, I'm not sure what
> else to offer. You might look at Solr's Dismax Query Parser, which allows
> you to specify the field structure of queries in a multi-field way, but
> again, I doubt that is wholly what you are looking for.
>
>> But i do not want to create one more field(default_field) that will
>> contain all the values concatenated in it. Also, even if i get all the
>> fields during indexing and use it for multi field query parser, then
>> the query will become very inefficient as there can be thousands of
>> fields. I think it should clarify my point.
>>
>> On Tue, Sep 23, 2008 at 1:58 PM, Grant Ingersoll <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> So, the piece I'm missing is how do you know what field for which terms. In
>>> other words how do you know xyz goes against organization and abc against
>>> name. Your wording implies that you don't know this before hand, yet you
>>> are somehow suggesting that Lucene should be able to do it. Correct me if
>>> I'm wrong.
>>>
>>> -Grant
>>>
>>> On Sep 23, 2008, at 6:51 AM, Anshul jain wrote:
>>>
>>>> Here is what I'm trying to do:
>>>>
>>>> say a lucene document:
>>>> name: abc ^10
>>>> organization: xyz ^3
>>>>
>>>> ^10 and ^3 are boosts in the document.
>>>>
>>>> now if I query name: abc ^5 AND organization: xyz
>>>> this will work.
>>>> but if I query (default_field): abc^5 AND xyz
>>>> this won't work.
>>>>
>>>> Now what I want is that a text can be associated with more than one
>>>> field. i.e. (field1,field2,field3):value
>>>>
>>>> name,(default_field),title: abc^10
>>>> organization,(default_field),institute: xyz^3
>>>>
>>>> then both of my queries will work.
>>>>
>>>> Is it possible to do so in lucene without changing the source? If no
>>>> then can anyone please explain the indexing and searching mechanism
>>>> for lucene, so that I can start working on it.
>>>>
>>>> The solution given by the java-users won't work for me as I do not
>>>> want to add all the contents of the document in a single field and
>>>> then search for that field, as this would increase the index size and
>>>> I've to index more than 10 million documents. Also
>>>> multifieldqueryparser will make it query execution inefficient, as
>>>> there will be thousands of fields.
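On the boosts in the example documents: if every document of a class carries the same boost on a field, a query against that one field sees every score scaled by the same factor, so the ranking should not change. The sketch below illustrates this with a deliberately simplified model (final score = raw score times boost). It is not Lucene's scoring formula, and the scores and the names UniformBoostDemo and rank are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: score = rawScore * boost. Not Lucene's real scoring. */
final class UniformBoostDemo {

    /** Returns document ids ordered by descending boosted score. */
    static List<String> rank(Map<String, Double> rawScores, Map<String, Double> boosts) {
        List<String> ids = new ArrayList<>(rawScores.keySet());
        // Documents without an entry in boosts get the neutral boost 1.0.
        Comparator<String> byBoostedScore = Comparator.comparingDouble(
                id -> rawScores.get(id) * boosts.getOrDefault(id, 1.0));
        ids.sort(byBoostedScore.reversed());
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical raw scores for one query against one field.
        Map<String, Double> raw = new LinkedHashMap<>();
        raw.put("doc1", 0.8);
        raw.put("doc2", 0.5);
        raw.put("doc3", 0.2);

        Map<String, Double> noBoost = new HashMap<>();

        Map<String, Double> sameBoostEverywhere = new HashMap<>();
        for (String id : raw.keySet()) {
            sameBoostEverywhere.put(id, 10.0);
        }

        Map<String, Double> onlyDoc3Boosted = new HashMap<>();
        onlyDoc3Boosted.put("doc3", 10.0);

        System.out.println(rank(raw, noBoost));             // baseline order
        System.out.println(rank(raw, sameBoostEverywhere)); // same order as baseline
        System.out.println(rank(raw, onlyDoc3Boosted));     // doc3 moves to the front
    }
}
```

Under this simplified model a boost only changes the ranking when it differs between documents for the same field.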
Re: Multi Field search without Multifieldqueryparser
But the "default_field" for your query parser is just that, the default *if nothing else is specified*. So the following would work just fine:

QueryParser parser = new QueryParser("default_field", analyzer);
query = parser.parse("name:Erin AND name:Brochowich AND organization:ABC AND organization:law AND organization:firm");

None of the terms would go against default_field, since an explicit field is given for each. You'd have to break up the incoming queries and add the field to each term, but that's not hard.

Or even

query = parser.parse("name:\"Erin Brochowich\"~3 AND organization:\"ABC law firm\"~3");

for phrase queries with slop.

I *still* think you're misunderstanding index-time boosting. It is INDEPENDENT of query-time boosting. Index-time boosting has the effect of raising the importance of a particular field IN THAT DOCUMENT relative to that field IN OTHER DOCUMENTS. Boosting all the terms for a given field for ALL documents is essentially doing nothing.

I very strongly recommend you get a copy of Luke and experiment with how queries are parsed. That tool can take any given query, send it through the parser and show exactly what it looks like after parsing. I think that would allow you to get much better answers much more quickly. Just google lucene luke and you should be fine.

Finally, the number of documents you're talking about will produce a pretty small index by Lucene standards. There's no reason to avoid the "bag of words" solution, if that solves your problem, just because you fear bloating your index.

Best
Erick

On Tue, Sep 23, 2008 at 11:54 AM, Anshul jain <[EMAIL PROTECTED]> wrote:

> unstructured query:
> default_field: abc ^5 and xyz
>
> seems to have created some confusion. What I meant was that while
> initializing the parser I have "default_field" as the default text field.
> So the query should be:
>
> QueryParser parser = new QueryParser("default_field", analyzer);
> query = parser.parse("abc^5 AND xyz");
>
> so the query will be: default_field:abc^5 AND default_field:xyz^3
>
> I am sorry for stating it wrongly earlier.
>
> To answer Erick's question: I'll be indexing around 10-20 million
> documents with an average size of 4 KB, but the number of documents could
> be more.
>
> Now let me explain my problem clearly again.
>
> Say I have a set of Lucene documents as follows:
>
> Document 1:
> name: Anshul ^10
> organization: EPFL ^5
> sex: Male
>
> Document 2:
> name: Rakesh ^10
> organization: IIT-B ^5
> sex: Male
>
> Document 3:
> name: erin brochowich ^10
> organization: ABC law firm
> sex: Female
>
> Document 4:
> title: lord of the rings ^10
> directors: John ^2
> actors: Kate
>
> Document 5:
> title: godfather ^10
> directors: Kate ^2
> actors: alpachino
>
> Documents 1, 2 and 3 belong to the same class, so their boosting
> parameters will be the same. The same holds for documents 4 and 5.
>
> If I give a query like:
>
> name: "Erin Brochowich" and organization: "ABC law firm"
>
> this query will work perfectly.
>
> But if the query is
> QueryParser parser = new QueryParser("default_field", analyzer);
> query = parser.parse("Erin Brochowich and ABC law firm");
> it would not work.
>
> What I want is that default_field should somehow be connected to all the
> text, but it should not take extra space for storing its own text.
>
> I think it should be clear enough now.
>
> Thank you for your responses.
> Regards,
> Anshul
>
> On Tue, Sep 23, 2008 at 4:55 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >
> > On Sep 23, 2008, at 8:35 AM, Anshul jain wrote:
> >
> >> yes you are partly correct
> >>
> >> what I need is that lucene should support two type of queries for the
> >> following document:
> >> name: abc^10
> >> organization: xyz^3
> >>
> >> structured query:
> >> name: abc and organization: xyz
> >>
> >> unstructured query:
> >> default_field: abc ^5 and xyz
> >
> > And what field(s) should "xyz" be searched against? Again, I ask, how do
> > you know what fields "xyz" should go against and why does abc go against the
> > default_field? You've said it shouldn't go against all fields (b/c there
> > are thousands of them), and you've said it shouldn't go against a catch-all
> > field, but otherwise I still have no clue your criteria for what fields xyz
> > should search. Are you saying that you want it to intelligently know that
> > when "xyz" comes in that it should search the organization field?
> >
> > Other than seconding Umesh's or Dino's suggestions of using machine learning
> > or heuristics or using some type of templating system, I'm not sure what
> > else to offer. You might look at Solr's Dismax Query Parser, which allows
> > you to specify the field structure of queries in a multi-field way, but
> > again, I doubt that is wholly what you are looking for.
> >
> >> But i do not want to create one more field(default_field) that will
> >> contain all the values concatenated in it. Also, even if i get all the
> >> fields during indexing and use it for
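The "break up the incoming queries and add the field to each term" step could be sketched roughly as follows. This assumes the application already knows which piece of user text belongs to which field (the open question in this thread), and it only builds query strings; it does not call Lucene. FieldedQueryStrings, perTerm and asPhrases are hypothetical names, and special query-syntax characters in the input are not escaped here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds fielded query strings; a sketch that does not use any Lucene classes. */
final class FieldedQueryStrings {

    /** name -> "Erin Brochowich" becomes name:Erin AND name:Brochowich. */
    static String perTerm(Map<String, String> textByField) {
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, String> entry : textByField.entrySet()) {
            for (String term : entry.getValue().trim().split("\\s+")) {
                if (!term.isEmpty()) {
                    clauses.add(entry.getKey() + ":" + term);
                }
            }
        }
        return String.join(" AND ", clauses);
    }

    /** name -> "Erin Brochowich" becomes name:"Erin Brochowich"~slop. */
    static String asPhrases(Map<String, String> textByField, int slop) {
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, String> entry : textByField.entrySet()) {
            clauses.add(entry.getKey() + ":\"" + entry.getValue().trim() + "\"~" + slop);
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> input = new LinkedHashMap<>();
        input.put("name", "Erin Brochowich");
        input.put("organization", "ABC law firm");

        System.out.println(perTerm(input));
        System.out.println(asPhrases(input, 3));
    }
}
```

The resulting strings would then be handed to parser.parse(...); since every clause names its field, the parser's default field is never consulted.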
Re: Query attached words
Thanks Matt,

I will go for Erick's suggestion, as the combinations can get messy: for a.b.c I would need to store a, b, c, a.b, b.c and a.b.c.

Cheers

Matthew Hall wrote:

We have a similar requirement here at our work. To get around it we create two indexes: one in which punctuation is relevant, and one in which all punctuation is treated as a place to break tokens. We then search against both indexes and merge the results. It seems that such a technique might be able to help you here as well.

(Though upon rereading, it seems that perhaps you want SOME punctuation to be relevant and other punctuation not. The technique itself could still be applied, with those rules used instead.)

- Matt

Jean-Claude Antonio wrote:

Thanks Erick, you are right about the various combinations.

Cheers,

Erick Erickson wrote:

Yes, you can query *method. But you have to turn on leading wildcards (I don't have the setting right on the tips of my fingers, but I know it's been an option for some time now).

But your solution doesn't scale well. If you had a.b.c.d.e.f.g.h you'd have to store many combinations in order to do what you want, quickly becoming really, really ugly.

But you could store the tokens a . b . c . d . e . f . g . h by using the appropriate analyzer (or perhaps rolling your own). Then you could use either PhraseQuerys or SpanQuerys to do what you want.

Best
Erick

On Mon, Sep 22, 2008 at 5:40 PM, Jean-Claude Antonio <[EMAIL PROTECTED]> wrote:

Hello,

If I had a file with the following content:
...
object.method();
...
I would like to be able to query for
object
method
object.method

My guess is that I should store not only "object.method", but also "object" and "method", as I cannot query *method.

Any other suggestion?

Kind regards,
JClaude
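To see how quickly the store-every-combination approach grows, here is a small sketch that enumerates the contiguous dotted runs of an identifier. For n components there are n(n+1)/2 of them: 6 for a.b.c and 36 for a.b.c.d.e.f.g.h. DottedCombinations and contiguousRuns are illustrative names only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Enumerates every contiguous run of dot-separated components. */
final class DottedCombinations {

    static List<String> contiguousRuns(String dotted) {
        String[] parts = dotted.split("\\.");
        List<String> runs = new ArrayList<>();
        // Shortest runs first, then progressively longer ones.
        for (int len = 1; len <= parts.length; len++) {
            for (int start = 0; start + len <= parts.length; start++) {
                runs.add(String.join(".", Arrays.copyOfRange(parts, start, start + len)));
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(contiguousRuns("a.b.c"));
        System.out.println(contiguousRuns("a.b.c.d.e.f.g.h").size());
    }
}
```

Indexing the components as separate tokens avoids storing any of these runs; the longer forms are matched at query time instead.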
Rsync causing search timeouts on master
Hi,

I am using snappuller to sync my slave with the master. I am not using the rsync daemon; I am running rsync over a remote shell.

When I serve requests from the master while snappuller is running (after optimization the total index is around 4 GB, and it transfers the whole index), performance is very bad, actually causing timeouts.

Any ideas why this happens? Any suggestions will help.

Thanks.

--
View this message in context: http://www.nabble.com/Rsync-causing-search-timeouts-on-master-tp19641103p19641103.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Rsync causing search timeouts on master
Hi,

Wrong list. :) I answered your question on solr-user.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: rahul_k123 <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Tuesday, September 23, 2008 11:00:02 PM
> Subject: Rsync causing search timeouts on master
>
> Hi,
>
> I am using snappuller to sync my slave with master, i am not using rsync
> daemon, i am doing Rsync using remote shell.
>
> When i am serving requests from the master when the snappuller is running
> (after optimization, total index is arnd 4 gb it doing the transfer of whole
> index), the performance is very bad actually causing timeouts.
>
> Any ideas why this happens .
>
> Any suggestions will help.
>
> Thanks.
> --
> View this message in context:
> http://www.nabble.com/Rsync-causing-search-timeouts-on-master-tp19641103p19641103.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.