[jira] [Commented] (SOLR-9824) Documents indexed in bulk are replicated using too many HTTP requests

Mark Miller (JIRA) Wed, 28 Dec 2016 04:44:19 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15782818#comment-15782818
 ]


Mark Miller commented on SOLR-9824:
-----------------------------------

bq. put that into an else branch. 

I'll do that.

bq. there's a race due to inPoll just being a volatile variable and so it might 
be false and we might not interrupt when we actually wanted to, or vice 
versa... but I suppose it may not be a big issue since the queue is poll'ed 
with timeouts that don't take forever. Adding comments to this effect would be 
good.

Yeah, I don't think it's an issue. Distributed updates do use a very large 
timeout, but our use of blockUntilFinished will loop and interrupt again. We 
should not technically need this right now, but I like that it makes it safe 
for future code additions. For standard use it's really just a best effort to 
cut off any wait. I've done extensive testing with various update rates and 
update thread counts and have not seen an issue yet.
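
To make the "best effort" point concrete, here is a minimal sketch of the pattern being discussed. It is not the code from the patch: the class InPollSketch and its methods are hypothetical names, and only the inPoll flag and the loop-and-interrupt-again idea come from the discussion above. A runner sets a volatile flag around a blocking poll, and a caller interrupts only while the flag is set. The check and the interrupt are not atomic, so a single attempt can miss, which is why the caller retries.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names are illustrative and not taken from ConcurrentUpdateSolrClient.
final class InPollSketch {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private volatile boolean inPoll = false; // true only while the runner may be blocked in poll()
    private volatile Thread runner;

    /** Runner side: wait up to timeoutMs for a document, flagging the wait as interruptible. */
    String pollOnce(long timeoutMs) {
        runner = Thread.currentThread();
        inPoll = true;
        try {
            return queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            return null; // the wait was cut short on purpose
        } finally {
            inPoll = false;
        }
    }

    /** Caller side: best effort only. The flag may change between the check and the interrupt. */
    void interruptIfPolling() {
        Thread t = runner;
        if (inPoll && t != null) {
            t.interrupt();
        }
    }

    /** Loops and interrupts again until the runner exits; returns the elapsed milliseconds. */
    static long timeToCutOff(long pollTimeoutMs) {
        InPollSketch s = new InPollSketch();
        Thread t = new Thread(() -> s.pollOnce(pollTimeoutMs));
        long start = System.nanoTime();
        t.start();
        try {
            while (t.isAlive()) {
                s.interruptIfPolling(); // may miss if the runner has not reached poll() yet
                t.join(10);             // so wait briefly and try again
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        System.out.println("cut off after ~" + timeToCutOff(10_000) + " ms");
    }
}
```

Even with a 10 second poll timeout, the retry loop should end the wait almost immediately, because a missed interrupt only costs one more pass through the loop.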

bq. CUSC

Yonik did almost a rewrite of it not too long ago to fix some bugs, and I don't 
have much appetite to rework it. There are tons of subtle things that can go 
wrong. It's complex, but I think the way it was written, it kind of is what it 
is. I think if we want a simpler model, we should probably create a new class 
with a different streaming design.

I think the queue synchronization is really simple, and the runners as well. 
That is fairly simple multithreaded code. I think the complication is in other 
parts of the design.

This class is a bit advanced for sure though. You have to be willing to spend 
some time to have confidence changing it.


> Documents indexed in bulk are replicated using too many HTTP requests
> ---------------------------------------------------------------------
>
>                 Key: SOLR-9824
>                 URL: https://issues.apache.org/jira/browse/SOLR-9824
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 6.3
>            Reporter: David Smiley
>            Assignee: Mark Miller
>         Attachments: SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch, 
> SOLR-9824.patch, SOLR-9824.patch, SOLR-9824.patch
>
>
> This takes a while to explain; bear with me. While working on bulk indexing 
> small documents, I looked at the logs of my SolrCloud nodes.  I noticed that 
> shards would see an /update log message every ~6ms which is *way* too much.  
> These are requests from one shard (that isn't a leader/replica for these docs 
> but the recipient from my client) to the target shard leader (no additional 
> replicas).  One might ask why I'm not sending docs to the right shard in the 
> first place; I have a reason but it's beside the point -- there's a real 
> Solr perf problem here and this probably applies equally to 
> replicationFactor>1 situations too.  I could turn off the logs but that would 
> hide useful stuff, and it's disconcerting to me that so many short-lived HTTP 
> requests are happening, somehow at the behest of DistributedUpdateProcessor. 
>  After lots of analysis and debugging and hair pulling, I finally figured it 
> out.  
> SOLR-7333 ([~tpot]) introduced an optimization called 
> {{UpdateRequest.isLastDocInBatch()}} in which ConcurrentUpdateSolrClient will 
> poll with a '0' timeout to the internal queue, so that it can close the 
> connection without it hanging around any longer than needed.  This part makes 
> sense to me.  Currently the only spot that has the smarts to set this flag is 
> {{JavaBinUpdateRequestCodec.unmarshal.readOuterMostDocIterator()}} at the 
> last document.  So if a shard received docs in a javabin stream (but not 
> other formats) one would expect the _last_ document to have this flag.  
> There's even a test.  Docs without this flag get the default poll time; for 
> javabin it's 25ms.  Okay.
> I _suspect_ that if someone used CloudSolrClient or HttpSolrClient to send 
> javabin data in a batch, the intended efficiencies of SOLR-7333 would apply.  
> I didn't try. In my case, I'm using ConcurrentUpdateSolrClient (and BTW 
> DistributedUpdateProcessor uses CUSC too).  CUSC uses the RequestWriter 
> (defaulting to javabin) to send each document separately without any leading 
> marker or trailing marker.  For the XML format by comparison, there is a 
> leading and trailing marker (<stream> ... </stream>).  Since there's no outer 
> container for the javabin unmarshalling to detect the last document, it marks 
> _every_ document as {{req.lastDocInBatch()}}!  Ouch!
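
The effect described above can be sketched with a small, deterministic model. This is not Solr code: PollTimeoutSketch, countRequests and simulate are hypothetical names, and the model assumes a single runner that keeps one request open until a poll on the queue times out. The 25 ms default poll time and the '0' timeout for flagged documents come from the description; the 6 ms arrival gap is only an illustrative value in the spirit of the "~6ms" log interval.

```java
// Hypothetical model of one runner draining a queue; not taken from ConcurrentUpdateSolrClient.
final class PollTimeoutSketch {
    static final long DEFAULT_POLL_MS = 25; // default poll time for javabin, per the description

    /**
     * Counts the requests a single runner would open for documents that arrive at the
     * given times (milliseconds, ascending). After sending a document the runner polls
     * for the next one: with a 0 timeout if the document was flagged as last in batch,
     * otherwise with the default timeout. A timed-out poll closes the request.
     */
    static int countRequests(long[] arrivalMs, boolean[] lastDocInBatch) {
        int requests = 0;
        int i = 0;
        while (i < arrivalMs.length) {
            requests++;              // open a new request for the next waiting document
            long now = arrivalMs[i];
            while (true) {
                long timeout = lastDocInBatch[i] ? 0 : DEFAULT_POLL_MS;
                i++;
                if (i >= arrivalMs.length || arrivalMs[i] > now + timeout) {
                    break;           // nothing arrived in time: close the request
                }
                now = Math.max(now, arrivalMs[i]);
            }
        }
        return requests;
    }

    /** Documents arrive every gapMs; either every document is flagged or only the final one. */
    static int simulate(int docs, long gapMs, boolean flagEveryDoc) {
        long[] arrivals = new long[docs];
        boolean[] flags = new boolean[docs];
        for (int d = 0; d < docs; d++) {
            arrivals[d] = d * gapMs;
            flags[d] = flagEveryDoc || d == docs - 1;
        }
        return countRequests(arrivals, flags);
    }

    public static void main(String[] args) {
        System.out.println("only final doc flagged: " + simulate(100, 6, false) + " request(s)");
        System.out.println("every doc flagged:      " + simulate(100, 6, true) + " request(s)");
    }
}
```

Under these assumptions, flagging only the final document keeps all 100 documents on one request, while flagging every document produces one request per document whenever the next document is not already waiting in the queue.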



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)