[ 
https://issues.apache.org/jira/browse/HBASE-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489936#comment-15489936
 ] 

Guanghao Zhang commented on HBASE-16165:
----------------------------------------

We observed an OOM in our production cluster. Table A has 500+ regions in the 
source cluster but only 1 region in the slave cluster. An MR job then wrote a 
lot of data to the source cluster. That data was replicated to the slave 
cluster, where all of it was written to a single regionserver, which then 
crashed with an OOM. One fix is to decrease RpcServer.callQueueSize only once 
the responder has actually written out the response. Another fix is to nullify 
the param early. I have uploaded a small fix for this that sets the param to 
null when sending the response.
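
To make the two fixes concrete, here is a minimal sketch of the accounting 
idea. It is not the attached patch and not the actual RpcServer code: the 
class CallQueueAccountingSketch, the nested Call class, and the methods 
tryEnqueue, setResponse and responseWritten are hypothetical names chosen for 
illustration. It assumes a call is charged by the size of its request param. 
The param is dropped as soon as the response is set, and the queue budget is 
released only after the responder reports that the response was written.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative sketch only; names and structure are assumptions, not HBase's API. */
final class CallQueueAccountingSketch {

    /** Hypothetical stand-in for a queued RPC call. */
    static final class Call {
        byte[] param;     // request payload, typically the largest part of the call
        byte[] response;  // set by the handler, written out later by the responder
        final long size;  // bytes charged against the queue budget

        Call(byte[] param) {
            this.param = param;
            this.size = param.length;
        }
    }

    private final long maxQueueSizeInBytes;
    private final AtomicLong callQueueSizeInBytes = new AtomicLong();

    CallQueueAccountingSketch(long maxQueueSizeInBytes) {
        this.maxQueueSizeInBytes = maxQueueSizeInBytes;
    }

    /** Charges the call against the budget; returns false if it must be rejected. */
    boolean tryEnqueue(Call call) {
        long newSize = callQueueSizeInBytes.addAndGet(call.size);
        if (newSize > maxQueueSizeInBytes) {
            callQueueSizeInBytes.addAndGet(-call.size); // roll back the charge
            return false;
        }
        return true;
    }

    /** Handler is done: drop the request payload early, but keep the charge. */
    void setResponse(Call call, byte[] response) {
        call.response = response;
        call.param = null; // lets the GC reclaim the request before the write completes
    }

    /** Responder has really written the response: only now release the budget. */
    void responseWritten(Call call) {
        call.response = null;
        callQueueSizeInBytes.addAndGet(-call.size);
    }

    long queuedBytes() {
        return callQueueSizeInBytes.get();
    }
}
```

With this ordering, a call whose response is still waiting on a slow client 
keeps counting against the limit, so new calls are rejected instead of piling 
up in the heap.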

> Decrease RpcServer.callQueueSize before writeResponse causes OOM
> ----------------------------------------------------------------
>
>                 Key: HBASE-16165
>                 URL: https://issues.apache.org/jira/browse/HBASE-16165
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Priority: Minor
>         Attachments: HBASE-16165.patch
>
>
> In RpcServer, we use {{callQueueSizeInBytes}} to avoid queuing too many calls, 
> which would cause an OOM. But in {{CallRunner.run}}, we decrease it before 
> sending the response back. And even after calling {{sendResponseIfReady}}, the 
> call object could stay in our heap for a long time if we cannot write out the 
> response (that's why we need a Responder thread...). This makes it possible 
> for the actual size of all call objects in the heap to be larger than 
> {{maxQueueSizeInBytes}}, which causes an OOM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
