[
https://issues.apache.org/jira/browse/SOLR-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14245754#comment-14245754
]
Erick Erickson commented on SOLR-6666:
--------------------------------------
The problem here is that if you add this copyField directive to schema.xml
<copyField source="fail_dynamic" dest="dynamic_*"/>
the schema won't load with the patch. It fails with a message
about the source field needing an asterisk if the destination
has one. Other tests have this pattern and fail BTW, see:
TestFieldCollectionResource, TestManagedStopFilterFactory
and TestManagedSynonymFilterFactory.
The fail_dynamic field "fulfills" this requirement since it is actually a
match for "*_dynamic"
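To make the matching concrete, here is a minimal sketch of how a field name can be tested against a dynamic-field style pattern. It assumes patterns carry a single leading or trailing asterisk; the class and method names are hypothetical and are not Solr's actual API.

```java
// Hypothetical illustration only; this is not Solr's DynamicField code.
final class DynamicPatternSketch {

    // Assumes the pattern has at most one asterisk, at the start or at the end.
    static boolean matches(String pattern, String fieldName) {
        if (pattern.startsWith("*")) {
            // "*_dynamic" matches any name ending in "_dynamic"
            return fieldName.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            // "dynamic_*" matches any name starting with "dynamic_"
            return fieldName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        // No asterisk: only an exact name matches
        return pattern.equals(fieldName);
    }

    public static void main(String[] args) {
        System.out.println(matches("*_dynamic", "fail_dynamic")); // prints true
        System.out.println(matches("dynamic_*", "fail_dynamic")); // prints false
    }
}
```

So a source named fail_dynamic has no asterisk of its own, yet it is still an instance of the "*_dynamic" pattern.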
So are you saying that if you have
<field name="one"... />
<field name="two".../>
and a copyField of
<copyField source="one" dest="two" /> that bogus logic happens because
it matches a dynamic field?
Or are your source fields "explicit", but only really instantiated by matching a
dynamic field so there's no corresponding <field> definition?
If it's the former, then it seems that doing a test way up top similar to this:
    // Source and destination are explicit
    if (destSchemaField != null && sourceSchemaField != null) {
      List<CopyField> copyFieldList = copyFieldsMap.get(source);
      if (copyFieldList == null) {
        copyFieldList = new ArrayList<>();
        copyFieldsMap.put(source, copyFieldList);
      }
      copyFieldList.add(new CopyField(sourceSchemaField, destSchemaField, maxChars));
      incrementCopyFieldTargetCount(destSchemaField);
      return;
    }
(and maybe taking this out from the end of the method?) would catch your case.
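As a self-contained illustration of that early-exit idea, here is a rough sketch. Everything in it is a stand-in: the class, the string-based field model, and the rule that anything not explicit on both sides goes to the dynamic list are simplifying assumptions, not the real IndexSchema.registerCopyField() logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the schema's copyField bookkeeping.
final class CopyFieldRegistrySketch {
    private final Set<String> explicitFields = new HashSet<>();
    private final Map<String, List<String>> fixedCopies = new HashMap<>();
    private final List<String[]> dynamicCopies = new ArrayList<>();

    void addExplicitField(String name) {
        explicitFields.add(name);
    }

    void registerCopyField(String source, String dest) {
        // Early exit: source and destination are both explicit fields
        if (explicitFields.contains(source) && explicitFields.contains(dest)) {
            fixedCopies.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
            return;
        }
        // Simplification: every other case is treated as a dynamic copy
        dynamicCopies.add(new String[] {source, dest});
    }

    int fixedCount(String source) {
        List<String> dests = fixedCopies.get(source);
        return dests == null ? 0 : dests.size();
    }

    int dynamicCount() {
        return dynamicCopies.size();
    }

    public static void main(String[] args) {
        CopyFieldRegistrySketch registry = new CopyFieldRegistrySketch();
        registry.addExplicitField("one");
        registry.addExplicitField("two");
        registry.registerCopyField("one", "two");
        registry.registerCopyField("fail_dynamic", "dynamic_*");
        System.out.println(registry.fixedCount("one")); // prints 1
        System.out.println(registry.dynamicCount());    // prints 1
    }
}
```

Note that the sketch does not answer the open question: it simply routes the fail_dynamic directive to the dynamic list without validating it.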
It's certainly an open question whether this is the way it "should" be, of
course. I don't quite know if there are shortcuts we could take that would
satisfy both situations, i.e. shortcut non-asterisk source fields in copyField
directives that happen to be instantiations of dynamic fields while still
respecting all the ways a field could get into the "explicit" field
("fail_dynamic" above).
It's also possible that the test that blows up above is too restrictive; I'm
not prepared to say one way or another. But I can't commit this without getting
a resolution to that question.
Under any circumstances, it seems that beefing up the IndexSchemaTest would be
a good thing; on a quick look, its tests aren't all that comprehensive.
> Dynamic copy fields are considering all dynamic fields, causing a significant
> performance impact on indexing documents
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-6666
> URL: https://issues.apache.org/jira/browse/SOLR-6666
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis, update
> Environment: Linux, Solr 4.8, Schema with 70 fields and more than 500
> specific CopyFields for dynamic fields, but without wildcards (the fields are
> dynamic, the copy directive is not)
> Reporter: Liram Vardi
> Assignee: Erick Erickson
> Attachments: SOLR-6666.patch
>
>
> Result:
> After applying a fix for this issue, tests which we conducted show more than
> 40 percent improvement on our insertion performance.
> Explanation:
> Using a JVM profiler, we found a CPU "bottleneck" in the Solr indexing process.
> This bottleneck is in org.apache.solr.schema.IndexSchema, in the
> following method, "getCopyFieldsList()":
> {code:title=getCopyFieldsList() |borderStyle=solid}
> final List<CopyField> result = new ArrayList<>();
> for (DynamicCopy dynamicCopy : dynamicCopyFields) {
> if (dynamicCopy.matches(sourceField)) {
> result.add(new CopyField(getField(sourceField),
> dynamicCopy.getTargetField(sourceField), dynamicCopy.maxChars));
> }
> }
> List<CopyField> fixedCopyFields = copyFieldsMap.get(sourceField);
> if (null != fixedCopyFields) {
> result.addAll(fixedCopyFields);
> }
> {code}
> This function tries to find, for an input source field, all of its copyFields
> (all the destinations to which Solr needs to copy this field).
> As you can see, the first part of the procedure is its most
> "expensive" step (it takes O(n) time, where n is the size of the
> "dynamicCopyFields" group).
> The next part is just a simple "hash" extraction, which takes O(1) time.
> Our schema contains more than 500 copyFields but only 70 of them are
> "indexed" fields.
> We also have one dynamic field with a wildcard ( * ), which "catches" the
> rest of the document fields.
> As you can conclude, we have more than 400 copyFields that are based on this
> dynamicField, but all except one are fixed (i.e. do not contain any
> wildcard).
> For some reason, the copyFields registration procedure defines those 400
> fields as "DynamicCopyField" and then stores them in the "dynamicCopyFields"
> array.
> This step makes getCopyFieldsList() very expensive (in CPU terms) without any
> justification: none of those 400 copyFields is a glob, so they do not
> need any complex pattern matching against the input field. They can all be stored
> in the "fixedCopyFields".
> Only copyFields with asterisks need this "special" treatment, and they are
> (especially in our case) pretty rare.
> Therefore, we created a patch which fixes this problem by changing the
> registerCopyField() procedure.
> Tests which we conducted show that there is no change in the indexing results.
> Moreover, the fix still successfully passes the class unit tests (i.e.
> IndexSchemaTest.java).
>
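The lookup cost described in the quoted issue can be illustrated with a small sketch. It assumes 400 non-wildcard copy directives and compares scanning them as if they were dynamic patterns against a single map lookup. All names are hypothetical and the pattern matcher is a simplification, not Solr's DynamicCopy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of O(n) pattern scanning versus O(1) map lookup.
final class LookupCostSketch {
    static int comparisons = 0; // counts pattern tests performed by the scan

    // Simplified matcher: one optional leading or trailing asterisk
    static boolean matches(String pattern, String field) {
        comparisons++;
        if (pattern.startsWith("*")) {
            return field.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            return field.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(field);
    }

    // Every registered directive is tested, whether or not it is a glob
    static List<String> lookupByScan(List<String[]> dynamicCopies, String source) {
        List<String> result = new ArrayList<>();
        for (String[] copy : dynamicCopies) {
            if (matches(copy[0], source)) {
                result.add(copy[1]);
            }
        }
        return result;
    }

    // Non-glob directives can be found with a single hash lookup
    static List<String> lookupByMap(Map<String, List<String>> fixedCopies, String source) {
        return fixedCopies.getOrDefault(source, Collections.emptyList());
    }

    public static void main(String[] args) {
        List<String[]> asDynamic = new ArrayList<>();
        Map<String, List<String>> asFixed = new HashMap<>();
        for (int i = 0; i < 400; i++) {
            asDynamic.add(new String[] {"src_" + i, "dst_" + i});
            asFixed.computeIfAbsent("src_" + i, k -> new ArrayList<>()).add("dst_" + i);
        }
        System.out.println(lookupByScan(asDynamic, "src_7")); // prints [dst_7]
        System.out.println(comparisons);                      // prints 400
        System.out.println(lookupByMap(asFixed, "src_7"));    // prints [dst_7]
    }
}
```

Both lookups return the same destination; only the amount of work per indexed field differs.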