Speed of grouped queries
Thanks in advance for any insight on this one.

I have a fairly large query to run, and it takes roughly 20-40 seconds to complete the way I have it. Here is the best example I can give.

I have a set of roughly 25K documents indexed.

I have queries that get the documents matching a particular actor. Then I have a movie query that takes all of the documents found for each actor query and combines them together to say: here are all documents that are relevant for this movie. Then, and here is the time hog, I have a genre query that says: take all movies, get their results, and combine them together into one genre result set.

The problem is that at indexing time I have no way to say that a document is about a particular genre, actor, or movie. If I instead try, for the genre query, to get all documents and then filter with the movie and actor queries, I run into heap space memory issues.

The query for collecting a specific actor takes around 200-300 milliseconds, and the movie query, which actually runs each actor query, takes roughly 500-700 milliseconds. But for a genre, where you may have 50-100 movies, it takes 500 milliseconds * (number of movies).

Any ideas on how I could run these queries differently? For a given actor query there are about 5-7 boolean query clauses, just to give some insight.

I currently create one HitSetCollector (I rolled my own BitSet-based collector) and run all the searches with it. It falls over when it does that genre search. I wish there were an easier way to aggregate the documents from all of those searches. After the search is done I cache the results, but the initial hit is bad.

Any help would be much appreciated.
Sdeck
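A minimal sketch of such a BitSet-backed collector, written against the Lucene 2.0-era HitCollector API (illustrative code, not the poster's actual class):

    import java.util.BitSet;
    import org.apache.lucene.search.HitCollector;

    /** Collects matching doc ids into a BitSet; setting the same bit twice
        is a no-op, so duplicates across multiple searches collapse for free. */
    public class HitSetCollector extends HitCollector {
        private final BitSet bits;

        public HitSetCollector(int maxDoc) {        // maxDoc = searcher.maxDoc()
            this.bits = new BitSet(maxDoc);
        }

        public void collect(int doc, float score) {
            bits.set(doc);
        }

        public BitSet getBits() {
            return bits;
        }
    }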
Re: Speed of grouped queries
Sure. Yes, this is a metaphor for what I am actually doing, but movies are a great example.

So, you go out to each of the news sites and pull in their entertainment articles. These could be generic news, generic entertainment, whatever, so right off the bat you have no way of saying that a given article is about a specific actor, movie, or genre. That is what the search part is supposed to do for you, and that is where I run into trouble.

For a given actor, I have queries like this (there are roughly 10-15 criteria matches per actor):

    +title:<actor name> +content:"<actor name>"

For a movie, it is just a BooleanQuery of all of the actors in the movie, plus +title:<movie title>. So again, those are fairly fast. But for a genre query you need to loop through each movie and combine results, something like this (pseudo code):

    collector = new HitCollector()
    for each movie in genre:
        search(movieQuery, collector)

The collector handles duplicates by using a BitSet in its collect() method. A movie query takes about 0.3-0.5 seconds, but if you loop 40-50 times to combine each one, that is what takes so long. I can't combine all of the movie queries into one query, because I get a memory error from the number of clauses (setting the max clause count higher did not help).

Does this help refine the problem? Thanks for your help!
Scott

Steven Rowe wrote:
> Hi Sdeck,
>
> I'm having trouble visualizing both what your documents and your queries
> look like. Can you please provide more concrete information?
> Sometimes, actual code helps.
>
> For example, how do actors, movies and genres relate to your documents?
> Do you have some external source(s) of information (i.e. external to
> your Lucene index) that relate actors to movies? And movies to genres?
>
> If actors, movies and genres are supposed to be a metaphor for what
> you're "really" representing, then you'll have to extend your metaphor a
> little bit to make sense (for "me" anyway) of what you're trying to "do".
>
> Steve
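Spelled out with the sketch collector above, that pseudo code might look like this (assuming Lucene 2.0-era APIs; names are illustrative):

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // One collector shared across all movie searches in the genre;
    // the BitSet deduplicates documents that match more than one movie.
    public static BitSet collectGenre(IndexSearcher searcher, Query[] movieQueries)
            throws IOException {
        HitSetCollector collector = new HitSetCollector(searcher.maxDoc());
        for (int i = 0; i < movieQueries.length; i++) {
            searcher.search(movieQueries[i], collector);
        }
        return collector.getBits();
    }

Each iteration still pays the full cost of one movie query, which is exactly the 0.3-0.5 s * 40-50 iterations problem described above.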
Re: Speed of grouped queries
Yes, indeed. I have tried each of those, hence my frustration.

The max clause change did not seem to work; I ran out of heap memory for some reason, even though I have my max heap set to -Xmx1024m, which should be enough. I tried the query filter too (that one also caused a heap memory error).

Yeah, I have thought about pre-indexing some fields: basically running the queries I have now and inserting the actor, movie, and genre ids. However, what if my queries need to be tweaked, or change completely? I would then have all of these stored documents with possibly incorrect matches. That index would have to be recreated any time I update the current index, plus any time I change my queries, which may be what I have to do eventually. I was just trying to fix it from the query side first and, if that failed, go the secondary (almost inverted) index route.

I guess: any ideas why I would run out of heap memory by combining all of those boolean queries together and then running the query? What is happening in the background that would make that occur? Is it storing something in memory, like all of the common terms, to cause that?

Sdeck

Steven Rowe wrote:
> Hi Scott,
>
> Have you tried increasing the memory available to the JVM? Sun's JVM
> takes an option "-Xmx" to change the maximum amount of heap space to use
> (defaults to 64MB). For Java 1.5, see
> <http://java.sun.com/j2se/1.5.0/docs/tooldocs/windows/java.html#Xms> for
> Windows or
> <http://java.sun.com/j2se/1.5.0/docs/tooldocs/solaris/java.html#Xms> for
> Solaris and Linux.
>
> You may have to increase the maximum # of allowed clauses too (sounds
> like you're already aware of this one):
> <http://lucene.apache.org/java/docs/api/org/apache/lucene/search/BooleanQuery.html#setMaxClauseCount(int)>
>
> If this doesn't help, you may want to look into QueryFilter
> <http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/QueryFilter.html>.
> You might try using a ChainedFilter (from the Lucene Sandbox - note the
> latest release of this class is not located in lucene-core-2.0.0.jar,
> but rather in lucene-misc-2.0.0.jar)
> <http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/misc/ChainedFilter.html>
> to connect movie QueryFilters for a genre.
>
> To improve performance (beyond the first query execution), you could
> wrap the individual QueryFilters in CachingWrapperFilters
> <http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/CachingWrapperFilter.html>.
>
> For something completely different, since you seem to be interested in
> online query performance, you could run all possible queries offline,
> and use the results to construct a derived index, in which documents
> contain "actor", "movie" and "genre" fields. This derived index would
> be plenty fast, I expect. And if running all possible genre queries is
> too resource-intensive, then you could compromise and construct your
> derived index with just an "actor" field, or both an "actor" and a
> "movie" field.
>
> In any case, it sounds like the # of documents in your index is fairly
> small -- have you tried using RAMDirectory
> <http://lucene.apache.org/java/docs/api/org/apache/lucene/store/RAMDirectory.html>?
>
> Hope it helps,
> Steve
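Concretely, Steven's filter suggestion might look like this (a sketch against the Lucene 2.0 API; requires lucene-misc-2.0.0.jar for ChainedFilter, and the movieQueries array is illustrative):

    import org.apache.lucene.misc.ChainedFilter;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;

    // OR together cached per-movie filters to get one genre filter.
    // CachingWrapperFilter caches each movie's BitSet per IndexReader,
    // so only the first search against a reader pays the full cost.
    public static Filter genreFilter(Query[] movieQueries) {
        Filter[] movieFilters = new Filter[movieQueries.length];
        for (int i = 0; i < movieQueries.length; i++) {
            movieFilters[i] = new CachingWrapperFilter(new QueryFilter(movieQueries[i]));
        }
        return new ChainedFilter(movieFilters, ChainedFilter.OR);
    }

    // Usage: hits = searcher.search(new MatchAllDocsQuery(), genreFilter(queries));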
Re: Speed of grouped queries
Mucho thanks. I will look into these.

For more info: I have roughly 33K documents now, and about 350,000 terms. When I do my queries I use the StandardAnalyzer with a whole slew of stop words, so I am not sure whether that might still be messing me up.

In the end, I may have to go with the prebuilt search indexes, which is no fun. I may just have to step through the Lucene code to see if it is creating large arrays somewhere that it doesn't need to, or could just cache. Not sure. Will let you know more as I work on it tonight.

Sdeck

Steven Rowe wrote:
> Hi Scott,
>
> Doug Cutting gives a formula for Lucene memory usage for queries here
> (from 2001):
>
> <http://mail-archives.apache.org/mod_mbox/lucene-java-user/200111.mbox/[EMAIL PROTECTED]>
>
> And some more info here about the term dictionary (from 2003):
>
> <http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200305.mbox/[EMAIL PROTECTED]>
>
> You might want to look at this thread, which has some discussion about
> omitting norms and the term dictionary (from 2005):
>
> <http://www.nabble.com/Memory-Usage-tf523535.html>
Re: Speed of grouped queries
Sorry, one more bit of info. In the index, the contents of the articles are both stored and indexed. These are just the guts of each article, around 1-3K of character data apiece. The current index, as it stands with 33K documents, is about 109 megs.

Again, it seems like I am just missing something somewhere. Thanks for being the sounding board.
Scott
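At 109 megs, the index would also fit comfortably inside the 1 GB heap, which makes Steven's earlier RAMDirectory suggestion cheap to try (a sketch; the index path is illustrative):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.RAMDirectory;

    // Copies the whole on-disk index into memory at construction time;
    // every subsequent search then reads from RAM instead of disk.
    RAMDirectory ramDir = new RAMDirectory("/path/to/index");
    IndexSearcher searcher = new IndexSearcher(ramDir);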
Re: Speed of grouped queries
I guess I never saw this request. Here is my answer.

Carrot would give me things like this:

    Genre - Horror
        Scary Movie (40)
            Luke Perry (10)

which is not what I am going for. Basically, think of someone clicking on the "Horror" tab and seeing all of the articles for every movie/actor within that genre. I page the results, of course, and just use a HitCollector to combine all hits. The total number of articles returned for a genre is around 3000-4000, out of a maximum of about 42K articles right now.

The problem is that I have specific queries for getting the news for an actor, and a specific query for getting results for a movie, and the genre level needs to aggregate each of those queries. So an actor may have a maximum of 20 boolean clauses, a movie may have 5 boolean clauses + (40 actors * 20 clauses), so a genre would have something like 50 movies * (5 + 40 * 20) clauses, on the order of 40,000 boolean clauses. Somewhere along the line that throws an out of memory error.

I am using the QueryParser, but the queries themselves are fairly simple. An example for an actor would be:

    (+title:luke +title:perry) (+content:"Luke Perry")

plus some other items in there to make sure the article really is about Luke Perry (just using him as an example).

So, anyone else have any ideas why I may be running out of memory on these? Or any other ideas on how to combine everything together quickly, besides computing it offline first and then using it online? My only issue with the offline route is that as soon as I change the query logic, the index-building process has to change, and then the website has to change sometime after that. It could be days to see it in production, versus hours if all I have to do is change the query code in the web application. (See the sketch after the quoted message below for a sense of the clause arithmetic.)

Scott

Find Me wrote:
> Is there any possibility to use Carrot clustering for genre? Could you
> please give examples for the final complex query as well as the
> individual simple queries? You can also state the aim of the query. Are
> you trying to get a clustered list of movies (based on genre) for a
> particular actor?
>
> --Rajesh Munavalli
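For a sense of scale on the clause arithmetic above, here is a sketch against the Lucene 2.0 BooleanQuery API (the term values are placeholders; note that by default BooleanQuery refuses more than 1024 clauses with a TooManyClauses error, and at search time each TermQuery clause gets its own scorer with its own term-docs buffers, which is one plausible source of the heap pressure):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    BooleanQuery.setMaxClauseCount(50000);   // default is 1024; raising it just
                                             // trades TooManyClauses for memory

    BooleanQuery genre = new BooleanQuery();
    for (int m = 0; m < 50; m++) {           // ~50 movies per genre
        BooleanQuery movie = new BooleanQuery();
        // Real code adds ~5 title clauses plus ~40 actors * ~20 clauses each:
        movie.add(new TermQuery(new Term("title", "placeholder" + m)),
                  BooleanClause.Occur.SHOULD);
        genre.add(movie, BooleanClause.Occur.SHOULD);
    }
    // Roughly 50 * (5 + 40 * 20) = ~40,000 leaf clauses in one query tree.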
Re: Speed of grouped queries
So, to reply to myself with what I learned (since I like reading this forum for what people have done with Lucene; kudos to CNET, by the way). In tracking down some of the speed, I was able to manage some improvements. Again, the movie examples are just metaphors for what I am really working on.

1) I created a very simple in-memory LRU cache for all of the movies' BitSet collectors. Each movie boolean-queries its actors together and then runs one overall query. This runs in about 20-30 ms, and the actor queries themselves run in 20-30 ms. So I now have a simple, separate cache for both movies and actors, storing their BitSet collectors.

2) Using the cached BitSet collectors, I then merge all movies within a genre and store that combined BitSet collector in its own cache. This one really only needs to hold 5-10 elements. So far, memory is handling things pretty well.

The final item, one which sometimes affects things and other times does not, is the highlighter pass after the BitSets have been collected. This, I found, threw off my timings many times: the highlighter runs quickly on some documents but slowly on others. I will probably need to dig into this, but since I have put in my paging system, and the results themselves are cached, speed is now all synched up.

I do have a preliminary process that goes through each actor, movie, and genre, gets a count of the number of results their searches would return, and stores that in a file. That way I can show the user how many results are available in the interface without actually having to query for them.

Hopefully in a month or so I will be able to give a link to the public website I am working on. Fun stuff.

Scott
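A minimal version of the LRU cache in (1), using only the JDK (a sketch; the cache size and key scheme are illustrative):

    import java.util.BitSet;
    import java.util.LinkedHashMap;
    import java.util.Map;

    /** LRU cache from a movie/actor key to its collected BitSet. */
    public class BitSetCache extends LinkedHashMap<String, BitSet> {
        private final int maxEntries;

        public BitSetCache(int maxEntries) {
            super(16, 0.75f, true);              // true = access order, i.e. LRU
            this.maxEntries = maxEntries;
        }

        protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
            return size() > maxEntries;          // evict the least recently used
        }
    }

The genre merge in (2) then just ORs the cached movie BitSets together:

    BitSet genre = new BitSet();
    for (BitSet movieBits : movieBitSetsInGenre) {   // illustrative collection
        genre.or(movieBits);
    }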
similarity and delete duplicates
Hey everyone. I have been trying to get a certain kind of duplicate deletion working, but I need a little help. Here is my problem.

After a web crawl I have many documents, and different sites can have documents with similar titles. I want to remove all of those duplicates except for one. So, I could have a list of titles like this:

1) George the Monkey won the bowl
2) The bowl was won by George the Monkey
3) Bowl won by George the Monkey

The way I do things now, I generate a query like this:

    +title:George +title:Monkey +title:Bowl +title:won +title:the

and then do a search, which pulls back documents. My first, bad way of deleting the dupes was to check for scores greater than some number and delete those documents. However, as my index/crawler (Nutch) kept generating indexes and I kept merging them, the scores kept coming out weirdly different.

So, I found this forum thread on overriding Similarity:
http://www.nabble.com/Overriding-Similarity-tf2128934.html#a5875307
and wanted to know if that is a good way of finding these duplicate title matches, or if someone has a better idea on how to find them. Note that the titles are not going to be exact, just fairly similar.

Thanks for your help,
Scott
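One score-independent alternative: compare candidate titles directly with a term-overlap (Jaccard) measure, which will not drift as indexes are merged (a sketch; the 0.8 threshold is an illustrative value to tune):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Jaccard similarity over lowercased title tokens:
    // |intersection| / |union|; 1.0 means identical token sets.
    public static double titleSimilarity(String a, String b) {
        Set<String> ta = new HashSet<String>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> tb = new HashSet<String>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> union = new HashSet<String>(ta);
        union.addAll(tb);
        ta.retainAll(tb);                        // ta becomes the intersection
        return union.isEmpty() ? 0.0 : (double) ta.size() / union.size();
    }

For the example titles, "George the Monkey won the bowl" vs. "Bowl won by George the Monkey" scores 5/6 = 0.83, so a cutoff around 0.8 would flag them as duplicates while staying insensitive to word order and to index merges.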
Find related question
Hello,

I run Nutch and get a whole slew of articles, and when I display search results there may be 5-6 articles that have different titles but mostly the same body text; I want to group them all under one result. These are usually AP articles that all the newspapers repurpose.

When using the MoreLikeThis functionality, the articles that come back may or may not actually be similar. When I run the query, the scores range from .1 to .4 for the first two hits, and it usually returns around 50 results, with the last score coming in fairly close to 0. Usually the first hit is the exact same article as the one I am trying to find related articles for.

I know that the score value has no real meaning on its own, because it depends on the query and other factors and is then normalized. So, should I be taking (hit score / first hit's score) as a percentage, to decide which other articles might be similar after that first hit? Basically, try to normalize the similarity? Am I off my rocker?

Or, is there possibly a way to use Carrot2 to find related articles for a given document?

Thanks,
Scott
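The divide-by-the-top-hit idea, sketched against the contrib MoreLikeThis class (org.apache.lucene.search.similar.MoreLikeThis, from the contrib queries jar; the 0.3 cutoff and the index path are illustrative assumptions):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.similar.MoreLikeThis;

    public static void findRelated(String indexPath, int docId) throws IOException {
        IndexReader reader = IndexReader.open(indexPath);
        IndexSearcher searcher = new IndexSearcher(reader);

        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "title", "content" });

        Query query = mlt.like(docId);           // query built from the article itself
        Hits hits = searcher.search(query);
        if (hits.length() == 0) return;

        float topScore = hits.score(0);          // hit 0 is usually the article itself
        for (int i = 1; i < hits.length(); i++) {
            // Keep hits scoring at least 30% of the top hit.
            if (hits.score(i) / topScore >= 0.3f) {
                System.out.println("related: " + hits.doc(i).get("title"));
            }
        }
        searcher.close();
        reader.close();
    }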
Related Article question
Hello all,

I have been trying out MoreLikeThis and many other similarity types of queries, but I still run into problems with content not being matched up. Let me give an example, plus some questions that hopefully someone can answer to help me refine my work.

Example: Document A may have the title "Oden and Durant Are Being Recruited", and Document B the title "Trailblazers Look at Oden and Durant". Both documents talk about the recruitment of Oden and Durant, just in fairly different ways; one may emphasize Oden over Durant, or vice versa.

The way MoreLikeThis and the similarity queries seem to work is that they take terms and see how many of them match up across documents. So if "Durant" is in doc A 10 times and in doc B 10 times, the similarity will be higher. Here is my problem though: I run these MoreLikeThis and other similarity queries, and many of those article pairs do not get matched, because a lot of their terms are not the same even though they are talking about the same topic.

Here is what I wonder:
1) Should I somehow give more boost to a full name, or other names, or titles, to help matching? Or does that hinder things?
2) How does shorter content compare against longer content? I may get only 5-6 sentences in one document but a full page in another, yet they are still talking about the same thing.
3) How would storing term vectors help, versus not storing them?

To explain the setup: I have one main index. I run a search of the web and collect more documents, and before adding them to the main index I run a MoreLikeThis query for each new document against the main index. That way I can keep a separate record of which articles are related to each other, for faster lookups. I also run MoreLikeThis against the new index, just to see which recently crawled articles are similar to each other.

It would seem that document frequency and raw term counts do not really work in these scenarios. I am not sure I am explaining the problem as well as I could, but I would love some kind of reference on how to do related-article searching so I can refine my results. Right now I would say about 60-70% of articles get correctly mapped to related articles, and about 10-20% get incorrectly mapped as related (similar terms, but perhaps not enough content, and the article is not actually about any of the others).

Any help would be appreciated.
Thanks,
Scott
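On questions (1) and (3): both boosts and term vectors are set at indexing time. A sketch of what that could look like with the Lucene 2.0 Field API (the 2.0f boost is an illustrative guess to tune, not a recommendation):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public static Document makeArticleDoc(String titleText, String bodyText) {
        Document doc = new Document();

        // Boost the title so that name/title matches count more than body matches.
        Field title = new Field("title", titleText, Field.Store.YES, Field.Index.TOKENIZED);
        title.setBoost(2.0f);                    // illustrative value
        doc.add(title);

        // Store term vectors so MoreLikeThis can read a document's terms
        // directly instead of re-analyzing the stored content.
        doc.add(new Field("content", bodyText, Field.Store.YES,
                          Field.Index.TOKENIZED, Field.TermVector.YES));
        return doc;
    }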