[ https://issues.apache.org/jira/browse/SOLR-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743318#comment-15743318 ]

David Smiley commented on SOLR-9824:
------------------------------------

I reviewed your patch.
* Nice catch to remove that synchronized on {{HttpClientUtil.interceptors}}!
I like the use of CopyOnWriteArrayList there.
* I like that you managed to remove the proposed longLastingThreads flag for
this behavior.  _Not_ having the flag also keeps things simpler, I think.
* CUSC.blockUntilFinished:
** I like the escalating timeout here, starting at 10ms and going up to 250ms.
** I see that it _may_ need to add a runner conditionally (non-empty queue,
empty runners)... but if that happens, I don't think we should invoke
interruptRunnerThreadsPolling?  i.e. put that into an else branch (see the
sketch after this list).  Reason being: if we add a runner, we shouldn't go
interrupting it immediately after.
** there's a race due to inPoll being just a volatile variable, so it might
be false and we might not interrupt when we actually wanted to, or vice
versa... but I suppose it's not a big issue since the queue is polled with
timeouts that don't take forever.  Adding comments to this effect would be
good.
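
To make that concrete, here's the rough shape I have in mind for that part of
blockUntilFinished (names follow the patch; just an illustration, not a
drop-in change):
{code:java}
synchronized (runners) {
  if (!queue.isEmpty() && runners.isEmpty()) {
    addRunner();  // bring up a runner to drain the remaining updates
  } else {
    // Only poke runners that already existed; a runner we just added
    // shouldn't be interrupted the moment it starts polling.
    interruptRunnerThreadsPolling();
  }
}
{code}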

The obvious complexity of ConcurrentUpdateSolrClient worries me.  One small bit
of evidence: the synchronized keyword is used on 3 different objects (this,
queue, runners).  I was perhaps going to add more review comments, but it takes
a while to digest what's going on and why we check for the things we do when we
do.  I perhaps naively like to think it can be improved while retaining the
features & performance it has now.  It's hard to review changes to CUSC as it
is.  This isn't a slight against you or this patch.  Maybe we could think of
ways to improve it?  I know any changes to CUSC are risky, but if a refactor
leads to more maintainable code then that is in and of itself less risky than
continuing to add more complexity.  Perhaps one step might be refactoring out a
specialized Queue + Executor which has the feature of scaling down to 0 runners
when nothing's on the queue.  I dunno; just an idea.
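
For instance (a sketch of the idea only, with hypothetical names; a plain
ThreadPoolExecutor already gets close to this):
{code:java}
import java.util.concurrent.*;

public class ScaleToZeroSketch {
  public static void main(String[] args) {
    // Updates live in their own queue, as in CUSC today.
    BlockingQueue<Object> updateQueue = new LinkedBlockingQueue<>();
    int maxRunners = 4;  // hypothetical; plays the role of CUSC's threadCount

    // corePoolSize = 0 lets idle runner threads die after keepAliveTime so
    // the pool shrinks to zero; the SynchronousQueue hands each submitted
    // runner task straight to a new or idle thread, up to maxRunners.
    ThreadPoolExecutor runnerPool = new ThreadPoolExecutor(
        0, maxRunners, 250, TimeUnit.MILLISECONDS, new SynchronousQueue<>());

    // Each runner drains the update queue until it's empty and then returns,
    // so the pool naturally falls back to zero runners between batches.
    runnerPool.execute(() -> {
      Object update;
      while ((update = updateQueue.poll()) != null) {
        // ... send the update ...
      }
    });
    runnerPool.shutdown();
  }
}
{code}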

> Documents indexed in bulk are replicated using too many HTTP requests
> ---------------------------------------------------------------------
>
>                 Key: SOLR-9824
>                 URL: https://issues.apache.org/jira/browse/SOLR-9824
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 6.3
>            Reporter: David Smiley
>            Assignee: Mark Miller
>         Attachments: SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch, 
> SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch
>
>
> This takes a while to explain; bear with me. While working on bulk indexing 
> small documents, I looked at the logs of my SolrCloud nodes.  I noticed that 
> shards would see an /update log message every ~6ms which is *way* too much.  
> These are requests from one shard (that isn't a leader/replica for these docs 
> but the recipient from my client) to the target shard leader (no additional 
> replicas).  One might ask why I'm not sending docs to the right shard in the 
> first place; I have a reason but it's beside the point -- there's a real 
> Solr perf problem here and this probably applies equally to 
> replicationFactor>1 situations too.  I could turn off the logs but that would 
> hide useful stuff, and it's disconcerting to me that so many short-lived HTTP 
> requests are happening, somehow at the behest of DistributedUpdateProcessor. 
>  After lots of analysis and debugging and hair pulling, I finally figured it 
> out.  
> In SOLR-7333, [~tpot] introduced an optimization called 
> {{UpdateRequest.isLastDocInBatch()}} in which ConcurrentUpdateSolrClient will 
> poll the internal queue with a '0' timeout, so that it can close the 
> connection without it hanging around any longer than needed.  This part makes 
> sense to me.  Currently the only spot that has the smarts to set this flag is 
> {{JavaBinUpdateRequestCodec.unmarshal.readOuterMostDocIterator()}} at the 
> last document.  So if a shard received docs in a javabin stream (but not 
> other formats) one would expect the _last_ document to have this flag.  
> There's even a test.  Docs without this flag get the default poll time; for 
> javabin it's 25ms.  Okay.
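> In code terms, the relevant bit of the runner's poll loop is roughly the 
> following (a paraphrase of the idea, not the exact source):
> {code:java}
> // A '0' timeout when this was the last doc in the batch lets the runner
> // return immediately and close the connection; otherwise it lingers up
> // to pollQueueTime (25ms for javabin) hoping to batch up more docs.
> int timeout = req.isLastDocInBatch() ? 0 : pollQueueTime;
> UpdateRequest next = queue.poll(timeout, TimeUnit.MILLISECONDS);
> {code}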
> I _suspect_ that if someone used CloudSolrClient or HttpSolrClient to send 
> javabin data in a batch, the intended efficiencies of SOLR-7333 would apply.  
> I didn't try. In my case, I'm using ConcurrentUpdateSolrClient (and BTW 
> DistributedUpdateProcessor uses CUSC too).  CUSC uses the RequestWriter 
> (defaulting to javabin) to send each document separately, without any leading 
> or trailing marker.  For the XML format, by comparison, there is a leading 
> and trailing marker ({{<stream>}} ... {{</stream>}}).  Since there's no outer 
> container for the javabin unmarshalling to detect the last document, it marks 
> _every_ document as {{req.lastDocInBatch()}}!  Ouch!
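> To illustrate the mismatch with a sketch (a hypothetical simplification; the 
> real logic lives in {{JavaBinUpdateRequestCodec}} and CUSC's runner):
> {code:java}
> // The codec only flags a doc as last-in-batch when the outer-most doc
> // iterator it is reading from runs out.  But CUSC's runner writes each
> // queued doc as its own stand-alone javabin entity:
> while ((update = queue.poll(pollQueueTime, TimeUnit.MILLISECONDS)) != null) {
>   requestWriter.write(update, out);  // one doc per entity...
> }
> // ...so on the receiving side the iterator is exhausted after every
> // single document, and every document gets flagged as the last one.
> {code}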


