Speed of grouped queries
Thanks in advance for any insight on this one.

I have a fairly large query to run, and it takes roughly 20-40 seconds to complete the way I have it. Here is the best example I can give.

I have a set of roughly 25K documents indexed.

I have queries that get the documents matching a particular actor. Then I have a movie query that takes all of the documents found for each actor query and combines them together to say: here are all documents that are relevant for this movie. Then, and here is the time hog, I have a genre query that says: take all movies, get their results, and combine them together into one genre result set.

The problem is that at indexing time I have no way to say that a document is about a particular genre, actor, or movie. If I instead try, for the genre query, to get all documents and then filter with the movie and actor queries, I run into heap space memory issues.

The query for collecting a specific actor takes around 200-300 milliseconds, and the movie query, which actually runs each actor query, takes roughly 500-700 milliseconds. But for a genre, where you may have 50-100 movies, it takes 500 milliseconds * (number of movies).

Any ideas on how I could run these queries differently? For a given actor query there are about 5-7 boolean query clauses, just to give some insight.

I currently create one HitSetCollector (I rolled my own BitSet-based collector) and run all the searches with it. It falls over when it does that genre search. I wish there were an easier way to aggregate the documents from all of those searches. After the search is done I cache the results, but the initial hit is bad.

Any help would be much appreciated.
Sdeck
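A minimal sketch of such a BitSet-backed collector, written against the Lucene 2.0-era HitCollector API (illustrative code, not the poster's actual class):

    import java.util.BitSet;
    import org.apache.lucene.search.HitCollector;

    /** Collects matching doc ids into a BitSet; setting the same bit twice
        is a no-op, so duplicates across multiple searches collapse for free. */
    public class HitSetCollector extends HitCollector {
        private final BitSet bits;

        public HitSetCollector(int maxDoc) {        // maxDoc = searcher.maxDoc()
            this.bits = new BitSet(maxDoc);
        }

        public void collect(int doc, float score) {
            bits.set(doc);
        }

        public BitSet getBits() {
            return bits;
        }
    }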
Re: Speed of grouped queries
Sure. Yes, this is a metaphor for what I am actually doing, but movies are a great example.

So, you go out to each of the news sites and pull in their entertainment articles. These could be generic news, generic entertainment, whatever, so right off the bat you have no way of saying that a given article is about a specific actor, movie, or genre. That is what the search part is supposed to do for you, and that is where I run into trouble.

For a given actor, I have queries like this (there are roughly 10-15 criteria matches per actor):

    +title:<actor name> +content:"<actor name>"

For a movie, it is just a BooleanQuery of all of the actors in the movie, plus +title:<movie title>. So again, those are fairly fast. But for a genre query you need to loop through each movie and combine results, something like this (pseudo code):

    collector = new HitCollector()
    for each movie in genre:
        search(movieQuery, collector)

The collector handles duplicates by using a BitSet in its collect() method. A movie query takes about 0.3-0.5 seconds, but if you loop 40-50 times to combine each one, that is what takes so long. I can't combine all of the movie queries into one query, because I get a memory error from the number of clauses (setting the max clause count higher did not help).

Does this help refine the problem? Thanks for your help!
Scott

Steven Rowe wrote:
> Hi Sdeck,
>
> I'm having trouble visualizing both what your documents and your queries
> look like. Can you please provide more concrete information?
> Sometimes, actual code helps.
>
> For example, how do actors, movies and genres relate to your documents?
> Do you have some external source(s) of information (i.e. external to
> your Lucene index) that relate actors to movies? And movies to genres?
>
> If actors, movies and genres are supposed to be a metaphor for what
> you're "really" representing, then you'll have to extend your metaphor a
> little bit to make sense (for "me" anyway) of what you're trying to "do".
>
> Steve
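Spelled out with the sketch collector above, that pseudo code might look like this (assuming Lucene 2.0-era APIs; names are illustrative):

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // One collector shared across all movie searches in the genre;
    // the BitSet deduplicates documents that match more than one movie.
    public static BitSet collectGenre(IndexSearcher searcher, Query[] movieQueries)
            throws IOException {
        HitSetCollector collector = new HitSetCollector(searcher.maxDoc());
        for (int i = 0; i < movieQueries.length; i++) {
            searcher.search(movieQueries[i], collector);
        }
        return collector.getBits();
    }

Each iteration still pays the full cost of one movie query, which is exactly the 0.3-0.5 s * 40-50 iterations problem described above.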
Re: Speed of grouped queries
Yes, indeed. I have tried each of those, hence my frustration.

The max clause change did not seem to work; I ran out of heap memory for some reason, even though I have my max heap set to -Xmx1024m, which should be enough. I tried the query filter too (that one also caused a heap memory error).

Yeah, I have thought about pre-indexing some fields: basically running the queries I have now and inserting the actor, movie, and genre ids. However, what if my queries need to be tweaked, or change completely? I would then have all of these stored documents with possibly incorrect matches. That index would have to be recreated any time I update the current index, plus any time I change my queries, which may be what I have to do eventually. I was just trying to fix it from the query side first and, if that failed, go the secondary (almost inverted) index route.

I guess: any ideas why I would run out of heap memory by combining all of those boolean queries together and then running the query? What is happening in the background that would make that occur? Is it storing something in memory, like all of the common terms, to cause that?

Sdeck

Steven Rowe wrote:
> Hi Scott,
>
> Have you tried increasing the memory available to the JVM? Sun's JVM
> takes an option "-Xmx" to change the maximum amount of heap space to use
> (defaults to 64MB). For Java 1.5, see
> <http://java.sun.com/j2se/1.5.0/docs/tooldocs/windows/java.html#Xms> for
> Windows or
> <http://java.sun.com/j2se/1.5.0/docs/tooldocs/solaris/java.html#Xms> for
> Solaris and Linux.
>
> You may have to increase the maximum # of allowed clauses too (sounds
> like you're already aware of this one):
> <http://lucene.apache.org/java/docs/api/org/apache/lucene/search/BooleanQuery.html#setMaxClauseCount(int)>
>
> If this doesn't help, you may want to look into QueryFilter
> <http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/QueryFilter.html>.
> You might try using a ChainedFilter (from the Lucene Sandbox - note the
> latest release of this class is not located in lucene-core-2.0.0.jar,
> but rather in lucene-misc-2.0.0.jar)
> <http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/misc/ChainedFilter.html>
> to connect movie QueryFilters for a genre.
>
> To improve performance (beyond the first query execution), you could
> wrap the individual QueryFilters in CachingWrapperFilters
> <http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/CachingWrapperFilter.html>.
>
> For something completely different, since you seem to be interested in
> online query performance, you could run all possible queries offline,
> and use the results to construct a derived index, in which documents
> contain "actor", "movie" and "genre" fields. This derived index would
> be plenty fast, I expect. And if running all possible genre queries is
> too resource-intensive, then you could compromise and construct your
> derived index with just an "actor" field, or both an "actor" and a
> "movie" field.
>
> In any case, it sounds like the # of documents in your index is fairly
> small -- have you tried using RAMDirectory
> <http://lucene.apache.org/java/docs/api/org/apache/lucene/store/RAMDirectory.html>?
>
> Hope it helps,
> Steve
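Concretely, Steven's filter suggestion might look like this (a sketch against the Lucene 2.0 API; requires lucene-misc-2.0.0.jar for ChainedFilter, and the movieQueries array is illustrative):

    import org.apache.lucene.misc.ChainedFilter;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;

    // OR together cached per-movie filters to get one genre filter.
    // CachingWrapperFilter caches each movie's BitSet per IndexReader,
    // so only the first search against a reader pays the full cost.
    public static Filter genreFilter(Query[] movieQueries) {
        Filter[] movieFilters = new Filter[movieQueries.length];
        for (int i = 0; i < movieQueries.length; i++) {
            movieFilters[i] = new CachingWrapperFilter(new QueryFilter(movieQueries[i]));
        }
        return new ChainedFilter(movieFilters, ChainedFilter.OR);
    }

    // Usage: hits = searcher.search(new MatchAllDocsQuery(), genreFilter(queries));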
Re: Speed of grouped queries
Mucho thanks. I will look into these.

For more info: I have roughly 33K documents now, and about 350,000 terms. When I do my queries I use the StandardAnalyzer with a whole slew of stop words, so I am not sure whether that might still be messing me up.

In the end, I may have to go with the prebuilt search indexes, which is no fun. I may just have to step through the Lucene code to see if it is creating large arrays somewhere that it doesn't need to, or could just cache. Not sure. Will let you know more as I work on it tonight.

Sdeck

Steven Rowe wrote:
> Hi Scott,
>
> Doug Cutting gives a formula for Lucene memory usage for queries here
> (from 2001):
>
> <http://mail-archives.apache.org/mod_mbox/lucene-java-user/200111.mbox/[EMAIL PROTECTED]>
>
> And some more info here about the term dictionary (from 2003):
>
> <http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200305.mbox/[EMAIL PROTECTED]>
>
> You might want to look at this thread, which has some discussion about
> omitting norms and the term dictionary (from 2005):
>
> <http://www.nabble.com/Memory-Usage-tf523535.html>
Re: Speed of grouped queries
Sorry, one more bit of info. In the index, the contents of the articles are both stored and indexed. These are just the guts of each article, around 1-3K of character data apiece. The current index, as it stands with 33K documents, is about 109 megs.

Again, it seems like I am just missing something somewhere. Thanks for being the sounding board.
Scott
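At 109 megs, the index would also fit comfortably inside the 1 GB heap, which makes Steven's earlier RAMDirectory suggestion cheap to try (a sketch; the index path is illustrative):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.RAMDirectory;

    // Copies the whole on-disk index into memory at construction time;
    // every subsequent search then reads from RAM instead of disk.
    RAMDirectory ramDir = new RAMDirectory("/path/to/index");
    IndexSearcher searcher = new IndexSearcher(ramDir);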
Re: Speed of grouped queries
I guess I never saw this request. Here is my answer.

Carrot would give me things like this:

    Genre - Horror
        Scary Movie (40)
            Luke Perry (10)

which is not what I am going for. Basically, think of someone clicking on the "Horror" tab and seeing all of the articles for every movie/actor within that genre. I page the results, of course, and just use a HitCollector to combine all hits. The total number of articles returned for a genre is around 3000-4000, out of a maximum of about 42K articles right now.

The problem is that I have specific queries for getting the news for an actor, and a specific query for getting results for a movie, and the genre level needs to aggregate each of those queries. So an actor may have a maximum of 20 boolean clauses, a movie may have 5 boolean clauses + (40 actors * 20 clauses), so a genre would have something like 50 movies * (5 + 40 * 20) clauses, on the order of 40,000 boolean clauses. Somewhere along the line that throws an out of memory error.

I am using the QueryParser, but the queries themselves are fairly simple. An example for an actor would be:

    (+title:luke +title:perry) (+content:"Luke Perry")

plus some other items in there to make sure the article really is about Luke Perry (just using him as an example).

So, anyone else have any ideas why I may be running out of memory on these? Or any other ideas on how to combine everything together quickly, besides computing it offline first and then using it online? My only issue with the offline route is that as soon as I change the query logic, the index-building process has to change, and then the website has to change sometime after that. It could be days to see it in production, versus hours if all I have to do is change the query code in the web application. (See the sketch after the quoted message below for a sense of the clause arithmetic.)

Scott

Find Me wrote:
> Is there any possibility to use Carrot clustering for genre? Could you
> please give examples for the final complex query as well as the
> individual simple queries? You can also state the aim of the query. Are
> you trying to get a clustered list of movies (based on genre) for a
> particular actor?
>
> --Rajesh Munavalli
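For a sense of scale on the clause arithmetic above, here is a sketch against the Lucene 2.0 BooleanQuery API (the term values are placeholders; note that by default BooleanQuery refuses more than 1024 clauses with a TooManyClauses error, and at search time each TermQuery clause gets its own scorer with its own term-docs buffers, which is one plausible source of the heap pressure):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    BooleanQuery.setMaxClauseCount(50000);   // default is 1024; raising it just
                                             // trades TooManyClauses for memory

    BooleanQuery genre = new BooleanQuery();
    for (int m = 0; m < 50; m++) {           // ~50 movies per genre
        BooleanQuery movie = new BooleanQuery();
        // Real code adds ~5 title clauses plus ~40 actors * ~20 clauses each:
        movie.add(new TermQuery(new Term("title", "placeholder" + m)),
                  BooleanClause.Occur.SHOULD);
        genre.add(movie, BooleanClause.Occur.SHOULD);
    }
    // Roughly 50 * (5 + 40 * 20) = ~40,000 leaf clauses in one query tree.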
Re: Speed of grouped queries
So, to reply to myself with what I learned (since I like reading this forum for what people have done with Lucene; kudos to CNET, by the way). In tracking down some of the speed, I was able to manage some improvements. Again, the movie examples are just metaphors for what I am really working on.

1) I created a very simple in-memory LRU cache for all of the movies' BitSet collectors. Each movie boolean-queries its actors together and then runs one overall query. This runs in about 20-30 ms, and the actor queries themselves run in 20-30 ms. So I now have a simple, separate cache for both movies and actors, storing their BitSet collectors.

2) Using the cached BitSet collectors, I then merge all movies within a genre and store that combined BitSet collector in its own cache. This one really only needs to hold 5-10 elements. So far, memory is handling things pretty well.

The final item, one which sometimes affects things and other times does not, is the highlighter pass after the BitSets have been collected. This, I found, threw off my timings many times: the highlighter runs quickly on some documents but slowly on others. I will probably need to dig into this, but since I have put in my paging system, and the results themselves are cached, speed is now all synched up.

I do have a preliminary process that goes through each actor, movie, and genre, gets a count of the number of results their searches would return, and stores that in a file. That way I can show the user how many results are available in the interface without actually having to query for them.

Hopefully in a month or so I will be able to give a link to the public website I am working on. Fun stuff.

Scott
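A minimal version of the LRU cache in (1), using only the JDK (a sketch; the cache size and key scheme are illustrative):

    import java.util.BitSet;
    import java.util.LinkedHashMap;
    import java.util.Map;

    /** LRU cache from a movie/actor key to its collected BitSet. */
    public class BitSetCache extends LinkedHashMap<String, BitSet> {
        private final int maxEntries;

        public BitSetCache(int maxEntries) {
            super(16, 0.75f, true);              // true = access order, i.e. LRU
            this.maxEntries = maxEntries;
        }

        protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
            return size() > maxEntries;          // evict the least recently used
        }
    }

The genre merge in (2) then just ORs the cached movie BitSets together:

    BitSet genre = new BitSet();
    for (BitSet movieBits : movieBitSetsInGenre) {   // illustrative collection
        genre.or(movieBits);
    }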
similarity and delete duplicates
Hey everyone. I have been trying to get a certain kind of duplicate deletion working, but I need a little help. Here is my problem.

After a web crawl I have many documents, and different sites can have documents with similar titles. I want to remove all of those duplicates except for one. So, I could have a list of titles like this:

1) George the Monkey won the bowl
2) The bowl was won by George the Monkey
3) Bowl won by George the Monkey

The way I do things now, I generate a query like this:

    +title:George +title:Monkey +title:Bowl +title:won +title:the

and then do a search, which pulls back documents. My first, bad way of deleting the dupes was to check for scores greater than some number and delete those documents. However, as my index/crawler (Nutch) kept generating indexes and I kept merging them, the scores kept coming out weirdly different.

So, I found this forum thread on overriding Similarity:
http://www.nabble.com/Overriding-Similarity-tf2128934.html#a5875307
and wanted to know if that is a good way of finding these duplicate title matches, or if someone has a better idea on how to find them. Note that the titles are not going to be exact, just fairly similar.

Thanks for your help,
Scott
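One score-independent alternative: compare candidate titles directly with a term-overlap (Jaccard) measure, which will not drift as indexes are merged (a sketch; the 0.8 threshold is an illustrative value to tune):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Jaccard similarity over lowercased title tokens:
    // |intersection| / |union|; 1.0 means identical token sets.
    public static double titleSimilarity(String a, String b) {
        Set<String> ta = new HashSet<String>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> tb = new HashSet<String>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> union = new HashSet<String>(ta);
        union.addAll(tb);
        ta.retainAll(tb);                        // ta becomes the intersection
        return union.isEmpty() ? 0.0 : (double) ta.size() / union.size();
    }

For the example titles, "George the Monkey won the bowl" vs. "Bowl won by George the Monkey" scores 5/6 = 0.83, so a cutoff around 0.8 would flag them as duplicates while staying insensitive to word order and to index merges.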
Find related question
Hello,

I run Nutch and get a whole slew of articles, and when I display search results there may be 5-6 articles that have different titles but mostly the same body text; I want to group them all under one result. These are usually AP articles that all the newspapers repurpose.

When using the MoreLikeThis functionality, the articles that come back may or may not actually be similar. When I run the query, the scores range from .1 to .4 for the first two hits, and it usually returns around 50 results, with the last score coming in fairly close to 0. Usually the first hit is the exact same article as the one I am trying to find related articles for.

I know that the score value has no real meaning on its own, because it depends on the query and other factors and is then normalized. So, should I be taking (hit score / first hit's score) as a percentage, to decide which other articles might be similar after that first hit? Basically, try to normalize the similarity? Am I off my rocker?

Or, is there possibly a way to use Carrot2 to find related articles for a given document?

Thanks,
Scott
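The divide-by-the-top-hit idea, sketched against the contrib MoreLikeThis class (org.apache.lucene.search.similar.MoreLikeThis, from the contrib queries jar; the 0.3 cutoff and the index path are illustrative assumptions):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.similar.MoreLikeThis;

    public static void findRelated(String indexPath, int docId) throws IOException {
        IndexReader reader = IndexReader.open(indexPath);
        IndexSearcher searcher = new IndexSearcher(reader);

        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "title", "content" });

        Query query = mlt.like(docId);           // query built from the article itself
        Hits hits = searcher.search(query);
        if (hits.length() == 0) return;

        float topScore = hits.score(0);          // hit 0 is usually the article itself
        for (int i = 1; i < hits.length(); i++) {
            // Keep hits scoring at least 30% of the top hit.
            if (hits.score(i) / topScore >= 0.3f) {
                System.out.println("related: " + hits.doc(i).get("title"));
            }
        }
        searcher.close();
        reader.close();
    }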
Related Article question
Hello all,

I have been trying out MoreLikeThis and many other similarity types of queries, but I still run into problems with content not being matched up. Let me give an example, plus some questions that hopefully someone can answer to help me refine my work.

Example: Document A may have the title "Oden and Durant Are Being Recruited", and Document B the title "Trailblazers Look at Oden and Durant". Both documents talk about the recruitment of Oden and Durant, just in fairly different ways; one may emphasize Oden over Durant, or vice versa.

The way MoreLikeThis and the similarity queries seem to work is that they take terms and see how many of them match up across documents. So if "Durant" is in doc A 10 times and in doc B 10 times, the similarity will be higher. Here is my problem though: I run these MoreLikeThis and other similarity queries, and many of those article pairs do not get matched, because a lot of their terms are not the same even though they are talking about the same topic.

Here is what I wonder:
1) Should I somehow give more boost to a full name, or other names, or titles, to help matching? Or does that hinder things?
2) How does shorter content compare against longer content? I may get only 5-6 sentences in one document but a full page in another, yet they are still talking about the same thing.
3) How would storing term vectors help, versus not storing them?

To explain the setup: I have one main index. I run a search of the web and collect more documents, and before adding them to the main index I run a MoreLikeThis query for each new document against the main index. That way I can keep a separate record of which articles are related to each other, for faster lookups. I also run MoreLikeThis against the new index, just to see which recently crawled articles are similar to each other.

It would seem that document frequency and raw term counts do not really work in these scenarios. I am not sure I am explaining the problem as well as I could, but I would love some kind of reference on how to do related-article searching so I can refine my results. Right now I would say about 60-70% of articles get correctly mapped to related articles, and about 10-20% get incorrectly mapped as related (similar terms, but perhaps not enough content, and the article is not actually about any of the others).

Any help would be appreciated.
Thanks,
Scott
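On questions (1) and (3): both boosts and term vectors are set at indexing time. A sketch of what that could look like with the Lucene 2.0 Field API (the 2.0f boost is an illustrative guess to tune, not a recommendation):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public static Document makeArticleDoc(String titleText, String bodyText) {
        Document doc = new Document();

        // Boost the title so that name/title matches count more than body matches.
        Field title = new Field("title", titleText, Field.Store.YES, Field.Index.TOKENIZED);
        title.setBoost(2.0f);                    // illustrative value
        doc.add(title);

        // Store term vectors so MoreLikeThis can read a document's terms
        // directly instead of re-analyzing the stored content.
        doc.add(new Field("content", bodyText, Field.Store.YES,
                          Field.Index.TOKENIZED, Field.TermVector.YES));
        return doc;
    }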