All,

I have a question about join support across multiple document types in 
Solr/Lucene. Let me lay out the use case.

Suppose I have 3 tables:


* Table A has 3 columns: id, a1, and a2.

* Table B has 4 columns: id, b1, b2, and aid, which is a foreign key 
  referencing A.id.

* Table C has 4 columns: id, c1, c2, and aid, which is a foreign key 
  referencing A.id.

I want to be able to perform the following searches:


* Search for rows in A by specifying values only for columns in A. For 
  example:

select * from A where A.a1 = 'value'


* Search for rows in A by specifying values only for columns in B or C 
  or both. For example:

select A.*, B.* from A, B where B.b1 = 'value' and B.aid = A.id
select A.*, C.* from A, C where C.c1 = 'value' and C.aid = A.id
select A.*, B.*, C.* from A, B, C where B.b1 = 'value' and B.aid = A.id and 
C.c1 = 'value' and C.aid = A.id
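
To make the use case concrete, here is a minimal sketch that builds the three tables and runs the searches above. It assumes SQLite through Python's sqlite3 module purely for illustration; the sample rows and the helper names build_sample_db and ids_of_a are made up and are not part of our actual schema.

```python
import sqlite3


def build_sample_db():
    """Create tables A, B, and C in memory with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE A (id INTEGER PRIMARY KEY, a1 TEXT, a2 TEXT);
        CREATE TABLE B (id INTEGER PRIMARY KEY, b1 TEXT, b2 TEXT,
                        aid INTEGER REFERENCES A(id));
        CREATE TABLE C (id INTEGER PRIMARY KEY, c1 TEXT, c2 TEXT,
                        aid INTEGER REFERENCES A(id));
        INSERT INTO A VALUES (1, 'value', 'x');
        INSERT INTO A VALUES (2, 'other', 'y');
        INSERT INTO B VALUES (10, 'value', 'p', 1);
        INSERT INTO B VALUES (11, 'value', 'q', 2);
        INSERT INTO B VALUES (12, 'nope', 'r', 2);
        INSERT INTO C VALUES (20, 'value', 's', 1);
        INSERT INTO C VALUES (21, 'nope', 't', 2);
    """)
    return conn


def ids_of_a(conn, sql):
    """Run a query whose first column is A.id; return the distinct ids, sorted."""
    return sorted({row[0] for row in conn.execute(sql)})


conn = build_sample_db()

# Search by columns of A only.
by_a = ids_of_a(conn, "select * from A where A.a1 = 'value'")

# Search by columns of B only, of C only, and of both B and C.
by_b = ids_of_a(
    conn, "select A.*, B.* from A, B where B.b1 = 'value' and B.aid = A.id")
by_c = ids_of_a(
    conn, "select A.*, C.* from A, C where C.c1 = 'value' and C.aid = A.id")
by_b_and_c = ids_of_a(
    conn,
    "select A.*, B.*, C.* from A, B, C where B.b1 = 'value' and B.aid = A.id "
    "and C.c1 = 'value' and C.aid = A.id")
```

In every case what I am ultimately after is the set of matching A rows; the B and C columns are only used to constrain the search.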

Suppose that I want to store the data from A, B, and C in Solr/Lucene. How 
would I perform these searches in a Solr/Lucene environment?

It seems that there are two possible approaches:


1.) Denormalize all data into one document. That is, my query in 
data-config.xml for doing a full-import would be:

select A.id, A.a1, A.a2, B.b1, B.b2, C.c1, C.c2 from A 
inner join B on B.aid = A.id 
inner join C on C.aid = A.id

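I believe this means that each row in A contributes one document for every 
combination of a B row and a C row that reference it, so the number of 
documents in my Lucene index will be:

sum over rows a in A of count(B where aid = a.id) * count(C where aid = a.id)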

This will result in a large amount of redundant data in my index.


2.) Store the data from each table in a separate document type, say, docA, 
docB, docC. This would require me to perform three separate searches and to 
join the results based on the A.id, B.aid, and C.aid columns.
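
To compare the two approaches on something small, here is a rough sketch in plain Python. All of the data is made up, and the functions denormalized_count, search, and join_on_a are hypothetical stand-ins (search in particular just filters a list; it is not a Solr/Lucene call). It assumes every aid refers to an existing A row.

```python
from collections import Counter

# Made-up contents of the three tables; every value is illustrative only.
rows_a = [{"id": 1, "a1": "x"}, {"id": 2, "a1": "y"}, {"id": 3, "a1": "z"}]
rows_b = [
    {"id": 10, "aid": 1, "b1": "value"},
    {"id": 11, "aid": 1, "b1": "value"},
    {"id": 12, "aid": 2, "b1": "value"},
]
rows_c = [
    {"id": 20, "aid": 1, "c1": "value"},
    {"id": 21, "aid": 1, "c1": "other"},
    {"id": 22, "aid": 3, "c1": "value"},
]


def denormalized_count(rows_b, rows_c):
    """Approach 1: rows produced by A inner join B inner join C on aid = A.id."""
    b_per_a = Counter(row["aid"] for row in rows_b)
    c_per_a = Counter(row["aid"] for row in rows_c)
    # Each A id contributes (its B rows) * (its C rows); a missing id counts as 0.
    return sum(n_b * c_per_a[aid] for aid, n_b in b_per_a.items())


def search(rows, field, value):
    """Stand-in for a search against one document type."""
    return [row for row in rows if row[field] == value]


def join_on_a(a_hits, b_hits, c_hits):
    """Approach 2: keep the A hits referenced by at least one B hit and one C hit."""
    wanted = {hit["aid"] for hit in b_hits} & {hit["aid"] for hit in c_hits}
    return [row for row in a_hits if row["id"] in wanted]


total_docs = denormalized_count(rows_b, rows_c)
matches = join_on_a(
    rows_a,  # no predicate on A, so every A row is a candidate
    search(rows_b, "b1", "value"),
    search(rows_c, "c1", "value"),
)
```

The join in approach 2 is cheap here, but in practice the B and C result sets would have to be fetched in full from the index before they can be intersected, which is what worries me about doing it on the client side.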

I am (dimly?) aware that the Solr/Lucene community is working on various 
solutions to this problem. For example, I've read Mike McCandless' description 
of 
BlockJoinQuery<http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html>. 
This approach does not seem to solve our problem since (unless I am mistaken) 
the query requires at least one predicate to be specified for the parent entity 
(A in my example). We, on the other hand, want to be able to perform searches 
where only predicates for the child entities (B and C in my example) are 
specified. To give a concrete example, Table A might be a Claim table and Table 
B might be a Contact table, and we want to search for Claims based on Contact 
info, for example: search for all claims where the lastName of a Contact 
matches 'DeRose'. Is my analysis correct? That is, is BlockJoinQuery only 
unidirectional, from parent to child?

On the other hand, Lucene "query time joining" discussed 
here<http://www.searchworkings.org/blog/-/blogs/query-time-joining-in-lucene> 
seems to address our problem. The following paragraph seems to imply that 
queries can be specified in terms of data contained in the child documents:

You could also change the example and give all articles that match with a 
certain comment query. In this example the multipleValuesPerDocument is set to 
false and the fromField (the id field) only contains one value per document. 
However, the example would still work if multipleValuesPerDocument variable 
were set to true, but it would then work in a less efficient manner.

That is, Lucene "query time joining" appears to be bidirectional. Of course, 
this raises the question: how efficient are these queries? We thought about 
moving these queries from our RDBMS to Solr/Lucene because executing the 
equivalent queries in the RDBMS sometimes produced pathological worst-case 
behavior (queries taking tens of minutes). Are we going to encounter the same 
problems in Solr/Lucene?
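
To show the kind of request I have in mind, here is a sketch that only builds the URL for a child-only search using Solr's {!join from=... to=...} query parser. Several things are assumptions on my part: the select URL of a local Solr instance, a doctype field that distinguishes docA, docB, and docC (so that a B or C id cannot be mistaken for an A id), and the helper name child_only_search. Values are not escaped, so this is illustration only.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust for a real deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def child_only_search(b1=None, c1=None):
    """Build a request for docA documents constrained only by docB/docC fields."""
    # doctype is a hypothetical field holding "A", "B", or "C" for each document.
    params = [("q", "*:*"), ("fq", "doctype:A")]
    if b1 is not None:
        # Match docB on b1, then map each hit's aid onto the id of a docA.
        params.append(
            ("fq", '{!join from=aid to=id}doctype:B AND b1:"%s"' % b1))
    if c1 is not None:
        # Same join for docC; separate filter queries are intersected.
        params.append(
            ("fq", '{!join from=aid to=id}doctype:C AND c1:"%s"' % c1))
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = child_only_search(b1="value", c1="value")
```

Whether a request like this performs acceptably on large child sets is exactly what I would like to find out.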

So, any comments on the correctness of my analysis and any pointers to 
applicable resources that discuss this problem are appreciated.

F

_________________________________________
Frank DeRose
Guidewire Software | Senior Software Engineer
Cell: 510 -589-0752
fder...@guidewire.com<mailto:fder...@guidewire.com> | 
www.guidewire.com<http://www.guidewire.com/>
Deliver insurance your way with flexible core systems from Guidewire.

