gerlowskija commented on code in PR #2395:
URL: https://github.com/apache/solr/pull/2395#discussion_r1589215019


##########
solr/core/src/java/org/apache/solr/update/processor/NumFieldsMonitor.java:
##########
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.update.processor;
+
+import org.apache.solr.core.AbstractSolrEventListener;
+import org.apache.solr.core.SolrCore;
+import org.apache.solr.search.SolrIndexSearcher;
+import org.apache.solr.util.RefCounted;
+
+public class NumFieldsMonitor extends AbstractSolrEventListener {

Review Comment:
   > I suppose you didn't like my proposal that would eliminate the need for 
this event listener
   
   Yes, in short.  I agree it's a matter of opinion, but I prefer the 
particular complexity and benefits that NumFieldsMonitor offers to that of the 
direct-SIS or weak-reference approach.  See my reply [on the factory 
above](https://github.com/apache/solr/pull/2395/files#diff-6da5de1a5fc66295d066909ad976a15ad31e8411ed71d7909e8ed0d14f3cac78R103)
 for a bit more detail/explanation.



##########
solr/core/src/test/org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessorIntegrationTest.java:
##########
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.update.processor;
+
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.UUID;
+import java.util.concurrent.TimeUnit;
+import org.apache.solr.client.solrj.request.CollectionAdminRequest;
+import org.apache.solr.cloud.SolrCloudTestCase;
+import org.apache.solr.common.SolrInputDocument;
+import org.hamcrest.Matchers;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+public class NumFieldLimitingUpdateRequestProcessorIntegrationTest extends 
SolrCloudTestCase {
+
+  private static String SINGLE_SHARD_COLL_NAME = "singleShardColl";

Review Comment:
   I think this is copy/paste from a test that used both single and multi-shard 
collections.  I've updated the name to be more generic.



##########
solr/core/src/test-files/solr/configsets/cloud-minimal-field-limiting/conf/solrconfig.xml:
##########
@@ -0,0 +1,73 @@
+<?xml version="1.0" ?>
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+<!-- Minimal solrconfig.xml with /select, /admin and /update only -->
+
+<config>
+
+  <dataDir>${solr.data.dir:}</dataDir>
+
+  <directoryFactory name="DirectoryFactory"
+                    
class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
+  <schemaFactory class="ClassicIndexSchemaFactory"/>
+
+  <luceneMatchVersion>${tests.luceneMatchVersion:LATEST}</luceneMatchVersion>
+
+  <updateHandler class="solr.DirectUpdateHandler2">
+    <commitWithin>
+      <softCommit>${solr.commitwithin.softcommit:true}</softCommit>
+    </commitWithin>
+    <updateLog class="${solr.ulog:solr.UpdateLog}"></updateLog>
+  </updateHandler>
+
+  <requestHandler name="/select" class="solr.SearchHandler">
+    <lst name="defaults">
+      <str name="echoParams">explicit</str>
+      <str name="indent">true</str>
+      <str name="df">text</str>
+    </lst>
+  </requestHandler>
+
+  <updateProcessor class="solr.NumFieldLimitingUpdateRequestProcessorFactory" 
name="max-fields">
+    <int name="maxFields">${solr.test.maxFields:1234}</int>
+  </updateProcessor>

Review Comment:
   Tbh I wasn't aware that you could declare processors (that require config) 
directly *in* a processor chain.  I can't find an example of that.  Looking 
through several other solrconfig.xml files for URPs that require 
configuration...they all look similar to what I've done here afaict.
   
   (See the [_default configset Solr 
ships](https://github.com/apache/solr/blob/main/solr/server/solr/configsets/_default/conf/solrconfig.xml#L900-L944)
 and the [default configset in the 'core' modules 
tests](https://github.com/apache/solr/blob/main/solr/core/src/test-files/solr/configsets/_default/conf/solrconfig.xml#L61-L81)...both
 look similar to the example here)
   
   Maybe I'm misunderstanding you or missing something though - can you point 
to an example of how you'd prefer to see this declared, and a reason it's 
preferable? 



##########
solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc:
##########
@@ -337,6 +337,12 @@ Documents processed prior to the offender are indexed by 
Solr; documents followi
 +
 Alternatively, the processor offers a "permissive" mode 
(`permissiveMode=true`) which skips the offending document and logs a warning, 
but doesn't abort the remainder of the batch or return an error to users.
 
+{solr-javadocs}/core/org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessorFactory.html[NumFieldLimitingUpdateRequestProcessorFactory]::
 Fails update requests once a core has exceeded a configurable "maximum" number 
of fields.
++
+Solr performance can degrade and even become unstable if cores accumulate too 
many (e.g. more than 500) fields.  The "NumFieldLimiting" URP is offered as a 
safeguard that helps users notice potentially-dangerous schema design and/or 
mis-use of dynamic fields, before these performance and stability problems 
would manifest.
+Note that the field count an index reports can be influenced by deleted (but 
not yet purged) documents, and may vary from replica to replica.
+In order to avoid these sort of discrepancies between replicas, use of this 
URP should almost always precede DistributedUpdateProcessor in when running in 
SolrCloud mode.

Review Comment:
   Well, the URP certainly can't go _after_ DistributedUpdateProcessor for the 
reason laid out in the docs - you don't want some subset of your replicas 
rejecting certain docs because their segments have them disagree on the current 
field count.
   
   Putting it _before_ DUP also has downsides, as you pointed out.  The field 
count check is going to happen on that somewhat arbitrary "first node", instead 
of a the more authoritative "leader".  And that adds wonkiness - if replicas 
disagree on field-count, this means you could get into situations where a given 
update could either succeed or fail depending on whether it gets lucky or not 
with its "first node".  (Hopefully this isn't a big deal though, since the URP 
isn't trying to be an exact limit - but it's still ugly and potentially 
confusing.)
   
   This was what earlier versions of this PR were trying to avoid by gating the 
field-count check so that it only happened for docs that were being processed 
by their shard leader.



##########
solr/core/src/java/org/apache/solr/update/processor/NumFieldsMonitor.java:
##########
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.update.processor;
+
+import org.apache.solr.core.AbstractSolrEventListener;
+import org.apache.solr.core.SolrCore;
+import org.apache.solr.search.SolrIndexSearcher;
+import org.apache.solr.util.RefCounted;
+
+public class NumFieldsMonitor extends AbstractSolrEventListener {
+
+  private int numFields;
+
+  public NumFieldsMonitor(SolrCore core) {
+    super(core);
+    this.numFields = -1;
+  }
+
+  @Override
+  public void postCommit() {

Review Comment:
   Yeah; will remove. 



##########
solr/core/src/java/org/apache/solr/update/processor/NumFieldsMonitor.java:
##########
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.update.processor;
+
+import org.apache.solr.core.AbstractSolrEventListener;
+import org.apache.solr.core.SolrCore;
+import org.apache.solr.search.SolrIndexSearcher;
+import org.apache.solr.util.RefCounted;
+
+public class NumFieldsMonitor extends AbstractSolrEventListener {
+
+  private int numFields;
+
+  public NumFieldsMonitor(SolrCore core) {
+    super(core);
+    this.numFields = -1;
+  }
+
+  @Override
+  public void postCommit() {
+    RefCounted<SolrIndexSearcher> indexSearcher = null;
+    try {
+      indexSearcher = getCore().getSearcher();

Review Comment:
   Ooh, nice, I wasn't aware of this.
   
   Your comment on L34 above makes this one moot a bit, but I'm happy to have 
this drawn to my attention for the future!



##########
solr/core/src/java/org/apache/solr/update/processor/NumFieldLimitingUpdateRequestProcessorFactory.java:
##########
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.solr.update.processor;
+
+import org.apache.solr.common.util.NamedList;
+import org.apache.solr.core.SolrCore;
+import org.apache.solr.request.SolrQueryRequest;
+import org.apache.solr.response.SolrQueryResponse;
+import org.apache.solr.search.SolrIndexSearcher;
+import org.apache.solr.util.plugin.SolrCoreAware;
+
+/**
+ * This factory generates an UpdateRequestProcessor which fails update 
requests once a core has
+ * exceeded a configurable maximum number of fields. Meant as a safeguard to 
help users notice
+ * potentially-dangerous schema design before performance and stability 
problems start to occur.
+ *
+ * <p>The URP uses the core's {@link SolrIndexSearcher} to judge the current 
number of fields.
+ * Accordingly, it undercounts the number of fields in the core - missing all 
fields added since the
+ * previous searcher was opened. As such, the URP's request-blocking is "best 
effort" - it cannot be
+ * relied on as a precise limit on the number of fields.
+ *
+ * <p>Additionally, the field-counting includes all documents present in the 
index, including any
+ * deleted docs that haven't yet been purged via segment merging. Note that 
this can differ
+ * significantly from the number of fields defined in managed-schema.xml - 
especially when dynamic
+ * fields are enabled. The only way to reduce this field count is to delete 
documents and wait until
+ * the deleted documents have been removed by segment merges. Users may of 
course speed up this
+ * process by tweaking Solr's segment-merging, triggering an "optimize" 
operation, etc.
+ *
+ * <p>{@link NumFieldLimitingUpdateRequestProcessorFactory} accepts two 
configuration parameters:
+ *
+ * <ul>
+ *   <li><code>maxFields</code> - (required) The maximum number of fields 
before update requests
+ *       should be aborted. Once this limit has been exceeded, additional 
update requests will fail
+ *       until fields have been removed or the "maxFields" is increased.
+ *   <li><code>warnOnly</code> - (optional) If <code>true</code> then the URP 
logs verbose warnings
+ *       about the limit being exceeded but doesn't abort update requests. 
Defaults to <code>false
+ *       </code> if not specified
+ * </ul>
+ *
+ * @since 9.6.0
+ */
+public class NumFieldLimitingUpdateRequestProcessorFactory extends 
UpdateRequestProcessorFactory
+    implements SolrCoreAware {
+
+  private static final String MAXIMUM_FIELDS_PARAM = "maxFields";
+  private static final String WARN_ONLY_PARAM = "warnOnly";
+
+  private NumFieldsMonitor numFieldsMonitor;
+  private int maximumFields;
+  private boolean warnOnly;
+
+  @Override
+  public void inform(final SolrCore core) {
+    // register a commit callback for monitoring the number of fields in the 
schema
+    numFieldsMonitor = new NumFieldsMonitor(core);
+    core.getUpdateHandler().registerCommitCallback(numFieldsMonitor);
+    core.registerNewSearcherListener(numFieldsMonitor);
+  }
+
+  @Override
+  public void init(NamedList<?> args) {
+    warnOnly = args.indexOf(WARN_ONLY_PARAM, 0) > 0 ? 
args.getBooleanArg(WARN_ONLY_PARAM) : false;
+
+    if (args.indexOf(MAXIMUM_FIELDS_PARAM, 0) < 0) {
+      throw new IllegalArgumentException(
+          "The "
+              + MAXIMUM_FIELDS_PARAM
+              + " parameter is required for "
+              + getClass().getName()
+              + ", but no value was provided.");
+    }
+    final Object rawMaxFields = args.get(MAXIMUM_FIELDS_PARAM);
+    if (rawMaxFields == null || !(rawMaxFields instanceof Integer)) {
+      throw new IllegalArgumentException(
+          MAXIMUM_FIELDS_PARAM + " must be configured as a non-null <int>");
+    }
+    maximumFields = (Integer) rawMaxFields;
+    if (maximumFields <= 0) {
+      throw new IllegalArgumentException(MAXIMUM_FIELDS_PARAM + " must be a 
positive integer");
+    }
+  }
+
+  @Override
+  public UpdateRequestProcessor getInstance(
+      SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor 
next) {
+
+    // TODO Should we skip to the next URP if running in SolrCloud and *not* a 
leader?
+    return new NumFieldLimitingUpdateRequestProcessor(
+        req, next, maximumFields, numFieldsMonitor.getCurrentNumFields(), 
warnOnly);

Review Comment:
   Personally, I find the NumFieldsMonitor approach preferable to using a SIS 
here directly:
   
   - it allows the URP and factory to be abstracted away from exactly how the 
currentNumFields is calculated.
   - it allows us to make as few (potentially expensive) calls to 
`SIS.getFieldInfos()` as possible.  (I know we could also avoid calls by doing 
things like hanging on to a searcher reference and checking whether the 
searcher has changed, but IMO this is at least as complex as the 
NumFieldsMonitor approach.)
   - as epugh pointed out in an earlier comment, NumFieldsMonitor seems like it 
could be applied in other places.
   
   If there are other votes I'm happy to go with whatever the majority 
considers simpler.  But otherwise I'll stick with NumFieldsMonitor approach.
   
   > Furthermore the URP could be renamed to simply "BlockNewDocsUrp"
   
   I don't love "NumFieldLimitingURP" as a name, but I like that the name 
itself suggests why a user would want to use it.  "Want to cap the fields in 
your index, use this thing..."
   
   BlockNewDocs captures what the URP does, but doesn't capture the "why".  
(And you could also interpret it as distinguishing between "new" and "existing" 
docs, which isn't the case.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to