All, I have a question about join support across multiple document types in Solr/Lucene. Let me lay out the use case.
Suppose I have 3 tables:

* Table A has 3 columns: id, a1, a2.
* Table B has 4 columns: id, b1, b2, and aid, which is a foreign key referencing A.id.
* Table C has 4 columns: id, c1, c2, and aid, which is a foreign key referencing A.id.

I want to be able to perform the following searches:

* Search for rows in A by specifying just values for columns in A. For example:

    select * from A where A.a1 = 'value'

* Search for rows in A by specifying just values for columns in B or C or both. For example:

    select A.*, B.* from A, B where B.b1 = 'value' and B.aid = A.id

    select A.*, C.* from A, C where C.c1 = 'value' and C.aid = A.id

    select A.*, B.*, C.* from A, B, C where B.b1 = 'value' and B.aid = A.id and C.c1 = 'value' and C.aid = A.id

Suppose that I want to store the data from A, B, and C in Solr/Lucene. How would I perform these searches in a Solr/Lucene environment? It seems that there are two possible approaches:

1.) Denormalize all data into one document. That is, my query in data-config.xml for doing a full-import would be:

    select A.id, A.a1, A.a2, B.b1, B.b2, C.c1, C.c2 from A inner join B on B.aid = A.id inner join C on C.aid = A.id

I believe this means that each row in A contributes (number of matching rows in B) * (number of matching rows in C) documents to my Lucene index, so the index grows multiplicatively with the number of child rows per parent. This will result in a large amount of redundant data in my index.

2.) Store the data from each table in a separate document type, say, docA, docB, docC. This would require me to perform three separate searches and to join the results on the A.id, B.aid, and C.aid columns.

I am (dimly?) aware that the Solr/Lucene community is working on various solutions to this problem. For example, I've read Mike McCandless' description of the BlockJoinQuery<http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html>.
This approach does not seem to solve our problem since (unless I am mistaken) the query requires at least one predicate to be specified for the parent entity (A in my example). We, on the other hand, want to be able to perform searches where only predicates for the child entities (B and C in my example) are specified. To give a concrete example, Table A might be a Claim table and Table B might be a Contact table, and we want to search for Claims based on Contact info, for example: search for all claims where the lastName of a Contact matches 'DeRose'.

Is my analysis correct? That is, is BlockJoinQuery only unidirectional from parent to child?

On the other hand, Lucene "query time joining" discussed here<http://www.searchworkings.org/blog/-/blogs/query-time-joining-in-lucene> seems to address our problem. The following paragraph seems to imply that queries can be specified in terms of data contained in the child documents:

    You could also change the example and give all articles that match with a certain comment query. In this example the multipleValuesPerDocument is set to false and the fromField (the id field) only contains one value per document. However, the example would still work if multipleValuesPerDocument variable were set to true, but it would then work in a less efficient manner.

That is, Lucene "query time joining" appears to be bidirectional.

Of course, this raises the question: how efficient are these queries? The reason we thought about moving these queries from our RDBMS to Solr/Lucene is that executing equivalent queries in the RDBMS sometimes produced pathological worst-case behavior (queries taking tens of minutes). Are we going to encounter the same problems in Solr/Lucene?

So, any comments on the correctness of my analysis and any pointers to applicable resources that discuss this problem are appreciated.
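To make the trade-off between approaches 1 and 2 concrete, here is a minimal sketch that uses SQLite in place of the real RDBMS and does not touch Solr/Lucene at all. The sample rows and the function names (`build_demo_db`, `denormalized_row_count`, `a_ids_for_b_value`) are invented for illustration. It counts the rows that the approach 1 full-import join would produce, and it imitates approach 2 with a two-step lookup: first find the matching child rows, then fetch the parents by the collected aid values.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory copy of the A/B/C schema with invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE A (id INTEGER PRIMARY KEY, a1 TEXT, a2 TEXT);
        CREATE TABLE B (id INTEGER PRIMARY KEY, b1 TEXT, b2 TEXT,
                        aid INTEGER REFERENCES A(id));
        CREATE TABLE C (id INTEGER PRIMARY KEY, c1 TEXT, c2 TEXT,
                        aid INTEGER REFERENCES A(id));
        """
    )
    conn.executemany(
        "INSERT INTO A VALUES (?, ?, ?)",
        [(1, "x", "y"), (2, "x", "z"), (3, "w", "z")],
    )
    conn.executemany(
        "INSERT INTO B VALUES (?, ?, ?, ?)",
        [(1, "DeRose", "p", 1), (2, "Smith", "q", 1), (3, "DeRose", "r", 2)],
    )
    conn.executemany(
        "INSERT INTO C VALUES (?, ?, ?, ?)",
        [(1, "open", "m", 1), (2, "closed", "n", 1),
         (3, "open", "o", 1), (4, "open", "k", 2)],
    )
    return conn


def denormalized_row_count(conn):
    """Count the rows (one index document each) produced by the approach 1 join."""
    query = (
        "SELECT COUNT(*) FROM A "
        "INNER JOIN B ON B.aid = A.id "
        "INNER JOIN C ON C.aid = A.id"
    )
    return conn.execute(query).fetchone()[0]


def a_ids_for_b_value(conn, b1_value):
    """Approach 2 as a two-step join: search B first, then look up A by aid."""
    aids = {row[0] for row in conn.execute(
        "SELECT aid FROM B WHERE b1 = ?", (b1_value,))}
    if not aids:
        return []
    placeholders = ",".join("?" * len(aids))
    rows = conn.execute(
        f"SELECT id FROM A WHERE id IN ({placeholders}) ORDER BY id",
        sorted(aids),
    )
    return [row[0] for row in rows]
```

With these sample rows the denormalizing join yields 7 rows (2 * 3 for A.id 1, 1 * 1 for A.id 2, none for A.id 3, which has no children and is dropped by the inner joins), compared with 10 documents if A, B, and C are indexed separately. The two-step lookup for b1 = 'DeRose' returns A ids 1 and 2 without any predicate on A.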

F
_________________________________________
Frank DeRose
Guidewire Software | Senior Software Engineer
Cell: 510 -589-0752
fder...@guidewire.com<mailto:fder...@guidewire.com> | www.guidewire.com<http://www.guidewire.com/>
Deliver insurance your way with flexible core systems from Guidewire.