[jira] [Commented] (SOLR-12298) Index Full nested document Hierarchy For Queries (umbrella issue)

David Smiley (JIRA) Wed, 09 May 2018 12:44:39 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469375#comment-16469375
 ]


David Smiley commented on SOLR-12298:
-------------------------------------

Quoting [~hossman] here inline (hoping for his input):
{quote}Are you suggesting we model child documents as objects 
(SolrInputDocuments i guess?) in a special field?
{quote}
Yes.  Not as a special field, although _anonymous_ children (those with no 
particular label, i.e. no named relationship) could use the 
_childDocuments_ key, as that's consistent with the existing use of this label.
  
{quote}... what if i put child documents in multiple fields? would that signify 
the different types of child?
{quote}
Yes indeed.  This is largely the point of this approach, since the current 
anonymous relationship loses the semantics of the relationship.

 
{quote}how would solr model that in the (lucene) Documents when giving them to 
the IndexWriter?
{quote}
In this issue, Moshe has proposed a labeled path field, e.g. "post.comment".  
This path would be added in a URP, or perhaps it would be done by 
{{AddUpdateCommand.flatten/recUnwrap}} right when the URP chain is done.
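
To make the labeled-path idea concrete, here is a rough sketch in Python of what such a flattening step might do.  It is not Solr code: plain dicts stand in for SolrInputDocuments, and the field name {{_path_}} and the function {{flatten}} are made up for illustration.  The real field name, and whether this happens in a URP or in AddUpdateCommand, is exactly what is still open here.

```python
# Hypothetical sketch: plain dicts stand in for SolrInputDocuments.
PATH_FIELD = "_path_"  # assumed name, chosen only for this example


def is_child(value):
    """In this sketch a field value counts as a child doc if it is a dict."""
    return isinstance(value, dict)


def flatten(doc, path=""):
    """Return the block for `doc`: descendants first, `doc` itself last.

    Each child gets PATH_FIELD set to the dot-joined chain of field
    names (relationship labels) leading to it, e.g. "post.comment".
    """
    block = []
    flat = {}
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        if values and all(is_child(v) for v in values):
            child_path = f"{path}.{name}" if path else name
            for child in values:
                block.extend(flatten(child, child_path))
        else:
            flat[name] = value  # ordinary field, kept as-is
    if path:
        flat[PATH_FIELD] = path  # the root carries no path label
    block.append(flat)
    return block


doc = {"id": "1", "post": {"id": "2", "comment": [{"id": "3"}, {"id": "4"}]}}
block = flatten(doc)  # ids come out as 3, 4, 2, 1 (parent last)
```

Both comments end up labeled "post.comment" and the post itself "post", so the relationship name survives the flattening.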
{quote}How would solr know how to order the children from multiple 
fields/lists when creating the block?
{quote}
Ah, I think that's a non-issue as they are indexed in the order given 
(notwithstanding the hierarchy flattening with parent last).  If you meant how 
might the order be reconstituted later at retrieval time, then we can rely on 
the docID order since they are kept in order and never broken up.  
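
As a rough illustration of relying on docID order, the Python sketch below rebuilds a nested structure from a block listed in docID order.  It assumes that children precede their parent and that each child carries a dot-separated path label in a made-up field called {{_path_}}.  The names {{rebuild}} and {{_path_}} are hypothetical, and dicts stand in for real documents.

```python
# Hypothetical sketch: rebuild nesting from a block given in docID order.
PATH_FIELD = "_path_"  # assumed name, chosen only for this example


def rebuild(block):
    """Return the root doc with children re-attached under their labels."""
    pending = []  # (path, doc) pairs not yet claimed by a parent
    for flat in block:
        doc = {k: v for k, v in flat.items() if k != PATH_FIELD}
        path = flat.get(PATH_FIELD, "")
        prefix = path + "." if path else ""
        unclaimed = []
        for child_path, child in pending:
            label = child_path[len(prefix):]
            if child_path.startswith(prefix) and label and "." not in label:
                # Direct child: appending preserves the docID order.
                doc.setdefault(label, []).append(child)
            else:
                unclaimed.append((child_path, child))
        pending = unclaimed
        pending.append((path, doc))
    return pending[-1][1]  # the last doc of the block is the root


block = [
    {"id": "3", PATH_FIELD: "post.comment"},
    {"id": "4", PATH_FIELD: "post.comment"},
    {"id": "2", PATH_FIELD: "post"},
    {"id": "1"},
]
root = rebuild(block)
```

Note that every relationship comes back as a list, so a path label alone does not tell a single child apart from a one-element list.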
{quote}Wouldn't the "type of child" information be better living in the child 
documents themselves? (particularly since that "type" information needs to be in 
the child documents anyway so that the filter query for a BJQ can be specified.)
{quote}
_Ultimately_ it does, in the generated Lucene Document.  
{quote}It also seems like it would require code that wants to know what 
children exist in a document to do a lot of work to find that out (need to 
iterate every field in the SolrInputDocument and do reflection to see if they 
are child-documents or not)
{quote}
I looked at this; it's AddSchemaFieldsUpdateProcessorFactory and 
AddUpdateCommand.flatten/recUnwrap.  I'm not concerned about the former as it's 
for schema-guessing; only the latter.  Perhaps this is no big deal; it's only 
the number of distinct field names in the average document?  Also if the schema 
contained special "ChildDoc" fields or some-such, then the schema could guide 
these code paths to know which field names to look up in the incoming document.
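
The two ways of finding child documents could look roughly like this in Python.  This is a sketch only: dicts stand in for SolrInputDocuments, and {{CHILD_DOC_FIELDS}} stands in for whatever field names a schema might declare with the hypothetical "ChildDoc" type mentioned above.  Both function names are made up.

```python
# Hypothetical sketch: two ways to find which fields hold child documents.
# CHILD_DOC_FIELDS is an assumption standing in for schema-declared names.
CHILD_DOC_FIELDS = {"comments", "attachments"}


def child_fields_by_scan(doc):
    """Inspect every field's values; work grows with distinct field names."""
    found = []
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        if values and all(isinstance(v, dict) for v in values):
            found.append(name)
    return found


def child_fields_by_schema(doc, child_doc_fields=CHILD_DOC_FIELDS):
    """Look up only the names the schema says may hold child documents."""
    return [name for name in sorted(child_doc_fields) if name in doc]


doc = {"id": "1", "title": "t", "tags": ["a", "b"], "comments": [{"id": "2"}]}
scanned = child_fields_by_scan(doc)
guided = child_fields_by_schema(doc)
```

For this document both approaches find only "comments"; the schema-guided one never looks at the values of the other fields.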
{quote}Another concern off the top of my head is that a lot of existing code 
(including any custom update processors people might have) would assume those 
child documents are multivalued field values and would probably break – hence 
a new method on SolrInputDocument seems wiser (code that doesn't know about it may 
not do what you want, but at least it won't break it)
{quote}
Fixable on a case by case basis.  If this is worse than I imagine it is, then 
what URP would be the worst offender?

In summary, the current approach doesn't retain the semantic information of 
relationships, and I believe removing SolrInputDocument._childDocuments will 
result in something _simpler_.  It also allows a cleaner separation between the 
format-specific input (JSON vs XML vs ...) and logic that should be ignorant of 
that.

The next-best alternative I can think of that doesn't disturb 
SolrInputDocument._childDocuments would be if hypothetically SolrInputDocument 
had overloaded addChildDocument to accept a relationship string.  And the impl 
would add the child document along with mutating it to have the fields Moshe 
has spoken of.  But this seems trappy to me since some methods would do this 
and the existing ones wouldn't, and so the format loader would need to be 
careful to always use one or the other.
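
A rough Python sketch of that alternative, to show where the trap lies.  None of this is SolrJ code: the class {{InputDoc}}, the method {{add_child_document}}, and the field name {{_relationship_}} are all made up, and only a single label is stamped on the child rather than the full set of fields.

```python
# Hypothetical sketch of the alternative; none of this is SolrJ code.
REL_FIELD = "_relationship_"  # assumed field name for the label


class InputDoc:
    def __init__(self, **fields):
        self.fields = dict(fields)
        self.child_documents = []  # stands in for _childDocuments

    def add_child_document(self, child, relationship=None):
        """Add a child; when a relationship is given, stamp it on the child."""
        if relationship is not None:
            child.fields[REL_FIELD] = relationship  # mutates the child
        self.child_documents.append(child)


post = InputDoc(id="1")
post.add_child_document(InputDoc(id="2"), "comment")  # labeled child
post.add_child_document(InputDoc(id="3"))             # anonymous, no label
```

The second call silently yields a child with no label, which is the inconsistency a format loader would have to guard against.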

> Index Full nested document Hierarchy For Queries (umbrella issue)
> -----------------------------------------------------------------
>
>                 Key: SOLR-12298
>                 URL: https://issues.apache.org/jira/browse/SOLR-12298
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: mosh
>            Priority: Major
>
> Solr ought to have the ability to index deeply nested objects, while storing 
> the original document hierarchy.
>  Currently the client has to index the child document's full path and level 
> to manually reconstruct the original document structure, since the children 
> are flattened and returned in the reserved "__childDocuments__" key.
> Ideally you could index a nested document, having Solr transparently add the 
> required fields while providing a document transformer to rebuild the 
> original document's hierarchy.
>  
> This issue is an umbrella issue for the particular tasks that will make it 
> all happen – either subtasks or issue linking.


