My gut feeling is that there isn't really a good way to intermingle the results, since they come from different sources, index different kinds of data, etc. The irreducible problem is that a hit in one index is not comparable to a hit in another, either from a Lucene scoring perspective or from the user's viewpoint.
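To make the scoring side of that concrete, here is a minimal Python sketch (not Lucene code). It assumes the classic TF-IDF style inverse document frequency, 1 + ln(numDocs / (docFreq + 1)), and two made-up indexes; the index sizes and document frequencies are hypothetical numbers chosen only for illustration.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Classic TF-IDF style inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical per-index statistics for the same query term.
index_a = {"num_docs": 1_000_000, "doc_freq": 50_000}  # term is common here
index_b = {"num_docs": 2_000, "doc_freq": 3}           # term is rare here

idf_a = idf(index_a["doc_freq"], index_a["num_docs"])
idf_b = idf(index_b["doc_freq"], index_b["num_docs"])

# The same term gets a much larger weight in the small index, so a raw
# score from index B is inflated relative to one from index A.
print(f"IDF in index A: {idf_a:.3f}")
print(f"IDF in index B: {idf_b:.3f}")
```

Because each index weights the term from its own statistics, sorting the combined hits by raw score would favor the index where the term happens to be rare, regardless of how relevant the documents are.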
What people will do is fiddle with the ranking process (omitting norms, etc.), but that's pretty clumsy all told.

Another approach is to look at it from a presentation perspective. How can the various types of documents be presented to a user helpfully? Does it make sense to present, say, tabs across the top for each data source? Present the top 3 items from each source and let the user follow links for more results from that data source? Facet on the data sources?

If you go this route, you can put it all in a single index and just facet, group (4.x), filter, etc. (Oops, got confused. Grouping and faceting are Solr constructs last I knew, but you may want to check them out.)

In some ways this is more a UI design issue. You have different data sets, what serves the user best? I guess my challenge for your product manager folks is to define what a good user experience is. Provide concrete use cases etc. Because I'll also guess that you can make Lucene do what's necessary to support the use cases if they're something other than "mystically do the right thing, and I get to define the right thing dynamically" <G>...

I remember when "indexes" became acceptable, and I still think it's wrong...

Best
Erick

On Thu, Jun 2, 2011 at 7:49 PM, Clint Gilbert
<clint_gilb...@hms.harvard.edu> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Thank you very much for your reply. Yeah, our indexes (indices?)
> contain different types and amounts of data. :( The data being indexed
> is all the same format - RDF - but it describes different numbers and
> kinds of things.
>
> What is your gut feeling on whether or not it's a good idea for us to
> roll our own? Katta is a contender, but we already have a fairly
> complex system, and adding anything Hadoop-related feels like it might
> push us over a tipping point into the realm of unwieldy overcomplexity.
> But, this is a hard problem after all, so some amount of complexity is
> inevitable.
>
> On 06/02/2011 07:05 PM, Erick Erickson wrote:
>> As you've found out, raw scores certainly aren't comparable across
>> different indexes #unless# the documents are fairly distributed.
>> You're talking large indexes here, so if the documents are balanced
>> across all your indexes, the results should be pretty comparable.
>> This pre-supposes that the indexes share a common schema and that the
>> distributions of terms are "close enough to identical" to be truly
>> comparable. And it supposes that your indexes are similar in
>> character. It wouldn't work if one of your indexes had, say,
>> meta-data from videos and another had scholarly journal articles.
>>
>> Otherwise, there's work going on in Solr that might help, although I
>> don't know when that'll be available.
>>
>> Other than that, I don't know what to suggest. It's not an easy
>> problem or Solr/Lucene would already have solved it.. siiiggggh.
>>
>> Best
>> Erick
>>
>> On Thu, Jun 2, 2011 at 3:51 PM, Clint Gilbert
>> <clint_gilb...@hms.harvard.edu> wrote:
>> Hi everyone,
>>
>> I searched the list archives, but couldn't find a question that closely
>> matches mine.
>>
>> The project I'm working on is designed to allow searching a distributed
>> collection of data repositories. Currently, we index each repository to
>> build a central Lucene index. This works ok, but for practical (the
>> central index is getting very large) and architectural (decentralization
>> is a design goal) reasons, we'd like to distribute the index.
>>
>> In the past, we had a basic federation system in place: when a user
>> submitted a query, the query was broadcast to each data repository,
>> which had its own independent Lucene index. Results from each repo were
>> aggregated in reverse order.
>>
>> The problem was, of course, that since each index was constructed
>> independently of all the others, and documents are distributed in the
>> repos unevenly, it was impossible to rank the results from all the
>> indices in a meaningful way. We basically punted and interleaved
>> results, which gave a bad user experience, hence the temporary
>> switch to a central index.
>>
>> So, what options exist for searching distributed collections of Lucene
>> indices and ranking results meaningfully?
>>
>> Katta seems promising, but I don't know enough about it yet. It also
>> seems to want to open its own ports for RPC. I'd prefer something that
>> could tunnel over HTTP to minimize firewall drama. (We will have 10s
>> and then 100s of data repos running in separate locations.)
>>
>> We're also considering a home-grown scheme involving normalizing the
>> denominators of all the index components in all our indices, based on
>> the sums of counts obtained from all the indices. This feels like
>> re-inventing the wheel, and it's not clear to me yet that the low-level
>> manipulation of indices that we'd need to do is even possible.
>>
>> Any suggestions for distributing indices while ranking results well are
>> very welcome!
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.10 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk3oIZUACgkQ5IyIbnMUeTuemwCeMfolvNVEjve9fIEJHy3N3TV/
> 0VIAn2Xf+ypB5PRS45ekmiXEDhmvDdhZ
> =jtE9
> -----END PGP SIGNATURE-----

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
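
For what it's worth, the home-grown scheme mentioned in the original question (normalizing against the sums of counts obtained from all the indices) might look roughly like the Python sketch below. This is only an illustration under stated assumptions, not Lucene or Katta code: it assumes each repository can report its document count, its per-term document frequencies, and raw term frequencies for its hits, and it uses a simplified TF-IDF (sqrt(tf) * idf) rather than Lucene's full scoring formula. All names (`shard_stats`, `shard_results`, `global_idf`, `rescore`, `merge`) and all numbers are hypothetical.

```python
import math


def global_idf(term, shard_stats):
    """IDF computed from counts summed over every repository's index."""
    total_docs = sum(s["num_docs"] for s in shard_stats)
    total_df = sum(s["doc_freq"].get(term, 0) for s in shard_stats)
    return 1.0 + math.log(total_docs / (total_df + 1))


def rescore(hit, query_terms, shard_stats):
    """Simplified TF-IDF using global statistics (not Lucene's full formula)."""
    return sum(
        math.sqrt(hit["tf"].get(term, 0)) * global_idf(term, shard_stats)
        for term in query_terms
    )


def merge(shard_results, query_terms, shard_stats):
    """Rescore hits from every repository and rank them in one list."""
    merged = []
    for shard_name, hits in shard_results.items():
        for hit in hits:
            score = rescore(hit, query_terms, shard_stats)
            merged.append((score, shard_name, hit["id"]))
    return sorted(merged, reverse=True)  # highest score first


# Hypothetical statistics reported by two repositories.
shard_stats = [
    {"num_docs": 1_000_000, "doc_freq": {"protein": 50_000}},
    {"num_docs": 2_000, "doc_freq": {"protein": 3}},
]
# Hypothetical hits, carrying raw term frequencies instead of local scores.
shard_results = {
    "repo_a": [{"id": "a1", "tf": {"protein": 4}}],
    "repo_b": [{"id": "b1", "tf": {"protein": 1}}],
}

ranked = merge(shard_results, ["protein"], shard_stats)
for score, shard_name, doc_id in ranked:
    print(f"{score:.3f}  {shard_name}  {doc_id}")
```

The point of the sketch is that every hit is weighted with the same term statistics, so the small repository no longer gets an inflated weight just because the term is rare there. It does nothing for the other half of the problem raised in the reply: if the repositories hold different kinds of data, a common score still may not match what the user considers comparable.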