[
https://issues.apache.org/jira/browse/SOLR-9918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15804336#comment-15804336
]
Tim Owen commented on SOLR-9918:
--------------------------------
OK, I see what you mean. I can explain our use-case, if that helps to show
why we developed this processor and when it might prove useful.
We have a Kafka queue of messages, which are a mixture of Create, Update and
Delete operations, and these are consumed and fed into two different storage
systems - Solr and an RDBMS. We want the behaviour to be consistent, so that the
two systems are in sync, and the way the Database storage app works is that
Create operations are implemented as effectively {{INSERT IF NOT EXISTS ...}}
and Update operations are the typical SQL {{UPDATE .. WHERE id = ..}} that
quietly do nothing if there is no row for {{id}}. So we want the Solr storage
to behave in the same way.
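As a rough sketch of those database semantics (this is not our actual app
code: SQLite's {{INSERT OR IGNORE}} stands in for the insert-if-not-exists
behaviour, and the table and column names are made up for illustration):

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, title TEXT)")


def create(doc_id, title):
    # SQLite's INSERT OR IGNORE stands in for "INSERT IF NOT EXISTS":
    # a second Create for the same id leaves the existing row untouched.
    conn.execute(
        "INSERT OR IGNORE INTO docs (id, title) VALUES (?, ?)",
        (doc_id, title),
    )


def update(doc_id, title):
    # A plain UPDATE .. WHERE id = .. matches zero rows when the id is
    # absent, so it changes nothing and raises no error.
    conn.execute("UPDATE docs SET title = ? WHERE id = ?", (title, doc_id))


create("a", "first")
create("a", "duplicate")  # skipped: row "a" already exists
update("a", "changed")    # applied to the existing row
update("zzz", "ghost")    # no such row: quietly ignored

rows = conn.execute("SELECT id, title FROM docs ORDER BY id").fetchall()
print(rows)
```

Only the first Create and the Update for an existing {{id}} have any effect;
the duplicate Create and the Update for a missing {{id}} are dropped without
an error.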
There can occasionally be duplicate messages that Create the same {{id}} due to
the hundreds of instances of the app that adds messages to Kafka, and small
race conditions that mean two or more of them will do some duplicate work. We
chose to accept this situation and de-dupe downstream by having both storage
apps behave as above.
Another scenario is that, since we have the Kafka queue as a buffer, if there
are any problems downstream we can always stop the storage apps, restore last
night's backup, rewind the Kafka consumer offset (to slightly earlier than the
backup point) and then replay. In this situation we don't want a lot of index churn
for the overlap Create messages.
With updates, the apps which add Update messages only have best-effort
knowledge of which document/row {{id}}s are relevant to the field/column being
changed by the update message. So we quite commonly have messages that are
optimistic updates, for a document that doesn't in fact exist (now). The
database storage handles this quietly, so we wanted the same behaviour in Solr.
Initially, what happened in Solr was that we'd get newly-created documents
containing only the fields changed in the AtomicUpdate. We added a required
field to stop that happening, which works but is noisy, as we get a Solr
exception each time (and batch updates become messy because we have to split
and retry).
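To make the desired behaviour concrete, here is a minimal in-memory model of
the two rules. It is only a sketch, not the code in the patch: the function
names are hypothetical, a plain dict stands in for the index, and only the
{{set}} modifier of atomic updates is handled.

```python
def is_atomic_update(doc):
    # Atomic updates carry a modifier map such as {"set": value} on at
    # least one non-id field; a plain insert has only literal values.
    return any(isinstance(v, dict) for k, v in doc.items() if k != "id")


def apply_batch(index, docs, skip_insert_if_exists=True,
                skip_update_if_missing=True):
    """Apply docs to index (a dict of id -> stored doc).

    Returns the ids that were quietly skipped, in order.
    """
    skipped = []
    for doc in docs:
        doc_id = doc["id"]
        exists = doc_id in index
        if is_atomic_update(doc):
            if not exists and skip_update_if_missing:
                skipped.append(doc_id)  # like UPDATE .. WHERE id = ..
                continue
            # Without the skip, a missing doc becomes a partial document.
            stored = index.setdefault(doc_id, {"id": doc_id})
            for field, value in doc.items():
                if isinstance(value, dict) and "set" in value:
                    stored[field] = value["set"]
        else:
            if exists and skip_insert_if_exists:
                skipped.append(doc_id)  # like INSERT IF NOT EXISTS
                continue
            index[doc_id] = dict(doc)
    return skipped


index = {}
batch = [
    {"id": "1", "title": "first"},
    {"id": "1", "title": "duplicate"},         # duplicate Create
    {"id": "2", "title": {"set": "ghost"}},    # optimistic Update
    {"id": "1", "title": {"set": "changed"}},  # Update to existing doc
]
skipped = apply_batch(index, batch)
print(skipped, index)
```

With both flags turned off, the same model reproduces what we saw originally:
an update to a missing {{id}} creates a document holding only the changed
fields, and a duplicate insert overwrites the existing document.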
I looked at {{DocBasedVersionConstraintsProcessor}} but we don't have
explicitly-managed versioning for our documents in Solr. Then I looked at
{{SignatureUpdateProcessor}} but that does churn the index and overwrites
documents, which we didn't want. I also considered {{TolerantUpdateProcessor}},
but that doesn't really solve the issue for inserts; it would just make some
update batches less noisy.
I'd say this processor is useful in situations where you have documents that
don't have any concept of multiple versions that can be assigned by the app,
and don't have any kind of fuzziness about similar documents, i.e. each
document has a strong identity, akin to a database unique key.
> An UpdateRequestProcessor to skip duplicate inserts and ignore updates to
> missing docs
> --------------------------------------------------------------------------------------
>
> Key: SOLR-9918
> URL: https://issues.apache.org/jira/browse/SOLR-9918
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: update
> Reporter: Tim Owen
> Attachments: SOLR-9918.patch, SOLR-9918.patch
>
>
> This is an UpdateRequestProcessor and Factory that we have been using in
> production, to handle 2 common cases that were awkward to achieve using the
> existing update pipeline and current processor classes:
> * When inserting document(s), if some already exist then quietly skip the new
> document inserts - do not churn the index by replacing the existing documents
> and do not throw a noisy exception that breaks the batch of inserts. By
> analogy with SQL, {{insert if not exists}}. In our use-case, multiple
> application instances can (rarely) process the same input so it's easier for
> us to de-dupe these at Solr insert time than to funnel them into a global
> ordered queue first.
> * When applying AtomicUpdate documents, if a document being updated does not
> exist, quietly do nothing - do not create a new partially-populated document
> and do not throw a noisy exception about missing required fields. By analogy
> with SQL, {{update where id = ..}}. Our use-case relies on this because we
> apply updates optimistically and have best-effort knowledge about what
> documents will exist, so it's easiest to skip the updates (in the same way a
> Database would).
> I would have kept this in our own package hierarchy but it relies on some
> package-scoped methods, and seems like it could be useful to others if they
> choose to configure it. Some bits of the code were borrowed from
> {{DocBasedVersionConstraintsProcessorFactory}}.
> Attached patch has unit tests to confirm the behaviour.
> This class can be used by configuring solrconfig.xml like so..
> {noformat}
> <updateRequestProcessorChain name="skipexisting">
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="org.apache.solr.update.processor.SkipExistingDocumentsProcessorFactory">
>     <bool name="skipInsertIfExists">true</bool>
>     <bool name="skipUpdateIfMissing">false</bool> <!-- We will override this per-request -->
>   </processor>
>   <processor class="solr.DistributedUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> {noformat}
> and initParams defaults of
> {noformat}
> <str name="update.chain">skipexisting</str>
> {noformat}