[
https://issues.apache.org/jira/browse/SOLR-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264612#comment-14264612
]
Steve Rowe commented on SOLR-6666:
----------------------------------
+1 for Erick's patch.
Separately, I've long thought that iterating through each dynamic copy field
looking for matches was non-optimal - I think this could be converted into one
or more regular expressions, like maybe one for prefix globs and one for suffix
globs.
{quote}
We have three other test failures, it's always best to run 'ant test' before
putting up a patch. That said, I think the one I'm seeing (there are three, but
they're all the same problem) is the following:
Steve Rowe I'm particularly interested in what you think here.
The trunk code returns this fragment in TestCopyFieldCollectionResource.java
{code:javascript}{ "source":"src_sub_no_ast_i", "sourceDynamicBase":"*_i",
"dest":"title"}{code}
,
whereas the patched code returns:
{code:javascript}{ "source":"src_sub_no_ast_i", "dest":"title"}{code}
,
The schema.xml file (if I've got the right one) has this line:
{code:xml}<copyField source="src_sub_no_ast_i" dest="title"/>{code}
{quote}
Yeah the {{sourceDynamicBase}} is determined at schema parsing time and stored
in a {{DynamicCopy}} instance. While the optimization included in the patch
removes this capability -- {{DynamicCopy}} is not used in {{copyFieldsMap}} --
the idea of a single source "base" is wrongheaded for at least two reasons: a)
copyField source globs can match multiple dynamicField-s (as well as
non-dynamic fields for that matter); and b) it's not always possible to
identify the source field(s) at schema parsing time - see e.g. SOLR-4650.
Anyway, there is no real loss here in dropping support for
{{sourceDynamicBase}}.
bq. Like I said, the original code hurt my head, I suspect it was just wrong.
Steve, do you have any comments here? Or am I mis-interpreting things?
The code isn't changed much, really? Can you say why you think it was (is?)
wrong? The code is trying to handle lots of cases - see [the table on
SOLR-4567|https://issues.apache.org/jira/browse/SOLR-4567?focusedCommentId=13600847&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13600847].
That said, I think the code for handling dynamic copyField sources is more
complicated than it needs to be: except for cases 14-16 in the above-linked
table (copyField-s with a no-glob-dynamic-field-reference source and a dynamic
field destination), it's not necessary to verify at schema parsing time that
any (dynamic) field(s) are matched by copy field glob sources - see SOLR-4650
for a concrete example. I think {{sourceDynamicBase}} should go away entirely.
> Dynamic copy fields are considering all dynamic fields, causing a significant
> performance impact on indexing documents
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-6666
> URL: https://issues.apache.org/jira/browse/SOLR-6666
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis, update
> Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500
> specific CopyFields for dynamic fields, but without wildcards (the fields are
> dynamic, the copy directive is not)
> Reporter: Liram Vardi
> Assignee: Erick Erickson
> Attachments: SOLR-6666.patch, SOLR-6666.patch, SOLR-6666.patch
>
>
> Result:
> After applying a fix for this issue, tests which we conducted show more than
> 40 percent improvement on our insertion performance.
> Explanation:
> Using JVM profiler, we found a CPU "bottleneck" during Solr indexing process.
> This bottleneck can be found at org.apache.solr.schema.IndexSchema, in the
> following method, "getCopyFieldsList()":
> {code:title=getCopyFieldsList() |borderStyle=solid}
> final List<CopyField> result = new ArrayList<>();
> for (DynamicCopy dynamicCopy : dynamicCopyFields) {
> if (dynamicCopy.matches(sourceField)) {
> result.add(new CopyField(getField(sourceField),
> dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars));
> }
> }
> List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField);
> if (null != fixedCopyFields) {
> result.addAll(fixedCopyFields);
> }
> {code}
> This function tries to find for an input source field all its copyFields (All
> its destinations which Solr need to move this field).
> As you can probably note, the first part of the procedure is the procedure
> most “expensive” step (takes O( n ) time while N is the size of the
> "dynamicCopyFields" group).
> The next part is just a simple "hash" extraction, which takes O(1) time.
> Our schema contains over then 500 copyFields but only 70 of then are
> "indexed" fields.
> We also have one dynamic field with a wildcard ( * ), which "catches" the
> rest of the document fields.
> As you can conclude, we have more than 400 copyFields that are based on this
> dynamicField but all, except one, are fixed (i.e. does not contain any
> wildcard).
> From some reason, the copyFields registration procedure defines those 400
> fields as "DynamicCopyField " and then store them in the “dynamicCopyFields”
> array,
> This step makes getCopyFieldsList() very expensive (in CPU terms) without any
> justification: All of those 400 copyFields are not glob and therefore do not
> need any complex pattern matching to the input field. They all can be store
> at the "fixedCopyFields".
> Only copyFields with asterisks need this "special" treatment and they are
> (especially on our case) pretty rare.
> Therefore, we created a patch which fix this problem by changing the
> registerCopyField() procedure.
> Test which we conducted show that there is no change in the Indexing results.
> Moreover, the fix still successfully passes the class unit tests (i.e.
> IndexSchemaTest.java).
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]