I believe that a mixed 1+3 approach should mimic what Verity does quite well.
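
A mixed approach of this kind (a cheap approximate prefilter over exact terms, followed by full evaluation of only the surviving profiles) could be sketched roughly as below. This is only a toy illustration in plain Python, not Lucene code and not a description of how Verity works. The `Profile` class, the function names, the whitespace tokenizer and the simplified query model (required terms, excluded terms, prefix wildcards) are all assumptions made for the example, and `fully_matches` merely stands in for running the real query against a single-document index.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    # Hypothetical toy query model: every `must` term is required, no
    # `must_not` term may occur, and each prefix must match some token.
    name: str
    must: frozenset = frozenset()
    must_not: frozenset = frozenset()
    prefixes: frozenset = frozenset()


def build_prefilter(profiles):
    """Index each profile under its exact terms only (operators and
    wildcards are dropped, so this index is approximate)."""
    postings = defaultdict(set)   # term -> names of profiles using it
    unfiltered = set()            # profiles with no exact term to index
    for p in profiles:
        if not p.must:
            unfiltered.add(p.name)
        for term in p.must:
            postings[term].add(p.name)
    return postings, unfiltered


def candidates(tokens, by_name, postings, unfiltered, threshold=1.0):
    """Stage 1: keep profiles whose share of exact terms found in the
    document reaches `threshold` (a fraction between 0 and 1)."""
    hits = defaultdict(int)
    for term in set(tokens):
        for name in postings.get(term, ()):
            hits[name] += 1
    keep = {n for n, c in hits.items()
            if c / len(by_name[n].must) >= threshold}
    # Profiles that could not be prefiltered must always be evaluated.
    return keep | unfiltered


def fully_matches(profile, tokens):
    """Stage 2: full evaluation of one candidate (toy stand-in for the
    fully featured query)."""
    toks = set(tokens)
    return (profile.must <= toks
            and not (profile.must_not & toks)
            and all(any(t.startswith(px) for t in toks)
                    for px in profile.prefixes))


def match_document(text, by_name, postings, unfiltered, threshold=1.0):
    """Return the names of all profiles that the document satisfies."""
    tokens = text.lower().split()   # assumed tokenizer: lowercase + split
    cand = candidates(tokens, by_name, postings, unfiltered, threshold)
    return sorted(n for n in cand if fully_matches(by_name[n], tokens))


# Example profiles (made up for illustration).
profiles = [
    Profile("oil", must=frozenset({"oil", "price"}),
            must_not=frozenset({"olive"})),
    Profile("lucene", must=frozenset({"lucene"}),
            prefixes=frozenset({"index"})),
    Profile("wild", prefixes=frozenset({"quer"})),
]
by_name = {p.name: p for p in profiles}
postings, unfiltered = build_prefilter(profiles)
print(match_document("Lucene indexing of stored queries",
                     by_name, postings, unfiltered))
```

The prefilter index is built once and reused for every incoming document, so the expensive full evaluation only runs on the few profiles that survive stage 1. Note that a wildcard-only profile has no exact term to index, so in this sketch it is always passed on to stage 2; a large share of such profiles would weaken the prefilter.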
In fact, what I would do is index the "profile net" queries in a dedicated
index, using exact terms only (i.e. removing Boolean operators and wildcards).
This gives you an approximate profile index that you can use to select the
most relevant profiles for a given document (possibly using some kind of
threshold for profile cutoff). Once you have reduced the problem space to the
candidate profiles, you can apply the MemoryIndex model using the fully
featured queries. I am pretty confident that this approach will reduce the
number of candidates far below the 10K threshold you mentioned before,
therefore removing any performance issue.

For some details of how Verity works, you can have a look at their
"Evaluation of content of a data set using multiple and/or complex queries"
patent (http://www.patentstorm.us/patents/5778364-fulltext.html). The final
sentence of the abstract says: "First, one or more candidate queries that may
be satisfied by the data set are identified by approximately evaluating each
query. Second, each of the candidate queries is fully evaluated to determine
whether the candidate query is satisfied by the data set."

FYI, Autonomy used to implement features like user profile search and agent
search (i.e. searching for user profiles that match a given query) by indexing
profiles and agents as documents in dedicated IDOL instances (but I don't know
whether they changed their model in the latest versions of IDOL).

Hope this helps.

Regards,
Vieri

On 17/01/2008, Marcus Falk <[EMAIL PROTECTED]> wrote:
> Yes, a profilenet is what Mark describes.
>
> In our Verity profilenet we have ~50,000 profiles (queries); the performance
> is fine, around 20-25 documents/second.
>
> From what we can tell the matches are accurate. Unfortunately I don't have
> any idea how Verity does this under the hood, so I don't know if there is any
> approximation involved.
> We do, however, get information about each query
> that hits, such as score and words (with their position within the document).
>
> We need this kind of functionality since we are monitoring the incoming
> documents for our customers.
>
> --- Answer to Mark's first mail ---
> MoreLikeThis:
> I get a feeling that it would be very hard to do this using this kind of
> query; how do I index the queries with operators such as NOT, NEAR and
> wildcards?
>
> Taxonomy/Classification:
> I'm totally lost here ;) Does anyone know what to look for in this case?
>
> MemoryIndex:
> We have run benchmarks using this technique and it wasn't enough; if I
> recall, we could run about 10,000 profiles with good performance. And as you
> say, it doesn't scale well at all.
>
> Regards,
> Marcus
>
> -----Original message-----
> From: Mark Miller [mailto:[EMAIL PROTECTED]
> Sent: 17 January 2008 13:58
> To: java-user@lucene.apache.org
> Subject: Re: Inverted search / Search on profilenet
>
> Verity, Autonomy, whatever, has what they call a reverse query system
> called profilenet. A profile is just a query (or I guess more than one
> query?) and you can set up a bunch of them. Then you supply the document
> and you will get the matching queries as well as a score. They say it's
> the opposite of doing a search with a query and getting back docs.
> Instead you do a search with a doc and get back these queries. They
> claim it can be used for things like taxonomy/classification, among other
> things. I don't know how true this is to a real reverse query system, as
> that would seem to be kind of slow -- my guess is it's a bit of an
> approximation.
>
> - Mark
>
> Endre Stølsvik wrote:
> >
> > May I ask: What IS a profilenet? I ask since this obviously is
> > something that you two hit it off on right away, while I haven't heard
> > of it!
> >
> > Thanks,
> > Endre.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]