[
https://issues.apache.org/jira/browse/HBASE-16165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489936#comment-15489936
]
Guanghao Zhang commented on HBASE-16165:
----------------------------------------
We observed an OOM case in our production cluster. Table A has 500+ regions in
the source cluster but only 1 region in the slave cluster. An MR job then wrote
a lot of data to the source cluster. The data was replicated to the slave
cluster, where all of it was written to a single regionserver, which then
crashed with an OOM. One fix is to decrease RpcServer.callQueueSize only when
the responder has really written out the response. Another fix is to nullify
the param early. I uploaded a small fix for this that sets the param to null
when the response is sent.
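
To make the two fixes concrete, here is a minimal sketch. It is not the
attached patch and does not use the real HBase classes: Call, setResponse and
onResponseWritten are hypothetical stand-ins, and only the callQueueSize and
param names come from the discussion above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-ins for RpcServer internals; not the actual HBase code.
public class CallAccountingSketch {
    // Plays the role of RpcServer.callQueueSize: bytes charged for in-flight calls.
    static final AtomicLong callQueueSize = new AtomicLong();

    static class Call {
        byte[] param;      // request payload, usually the large part of a call
        byte[] response;   // set once the handler has finished
        final long size;   // bytes charged against callQueueSize

        Call(byte[] param) {
            this.param = param;
            this.size = param.length;
            callQueueSize.addAndGet(size);
        }

        // "Nullify the param early": drop the request payload as soon as the
        // response exists, so a pending response no longer pins it in the heap.
        void setResponse(byte[] response) {
            this.response = response;
            this.param = null;
        }

        // "Decrease callQueueSize when the response is really written": the
        // responder would call this only after the bytes have left the server.
        void onResponseWritten() {
            callQueueSize.addAndGet(-size);
        }
    }

    public static void main(String[] args) {
        Call call = new Call(new byte[1024]);
        call.setResponse(new byte[16]);
        System.out.println("param released: " + (call.param == null));
        System.out.println("still accounted: " + callQueueSize.get());
        call.onResponseWritten();
        System.out.println("after write: " + callQueueSize.get());
    }
}
```

Either change alone narrows the gap. The first keeps the accounting honest
while the response is pending; the second shrinks what a pending call retains.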
> Decrease RpcServer.callQueueSize before writeResponse causes OOM
> ----------------------------------------------------------------
>
> Key: HBASE-16165
> URL: https://issues.apache.org/jira/browse/HBASE-16165
> Project: HBase
> Issue Type: Bug
> Reporter: Duo Zhang
> Priority: Minor
> Attachments: HBASE-16165.patch
>
>
> In RpcServer, we use {{callQueueSizeInBytes}} to avoid queuing too many calls,
> which would cause OOM. But in {{CallRunner.run}}, we decrease it before sending
> the response back. And even after calling {{sendResponseIfReady}}, the call
> object could stay in our heap for a long time if we cannot write out the
> response (that's why we need a Responder thread...). This makes it possible
> that the actual size of all call objects in the heap is larger than
> {{maxQueueSizeInBytes}}, which causes OOM.
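
The accounting gap described in the issue can be reproduced with a toy model.
This is only a sketch under stated assumptions: the admission check, the
4096-byte limit and the pendingResponses list are hypothetical, and the real
RpcServer logic is more involved. Only the order of operations (decrease the
counter first, write the response later) follows the description.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the accounting gap; not the actual RpcServer implementation.
public class QueueSizeGapSketch {
    static final long MAX_QUEUE_SIZE_IN_BYTES = 4096;  // hypothetical limit
    static long callQueueSizeInBytes = 0;              // what the server accounts for
    // Calls whose responses have not been written yet; they still occupy heap.
    static final List<byte[]> pendingResponses = new ArrayList<>();

    // Admit a call only if the accounted size stays within the limit.
    static boolean tryAccept(int size) {
        if (callQueueSizeInBytes + size > MAX_QUEUE_SIZE_IN_BYTES) {
            return false;
        }
        callQueueSizeInBytes += size;
        return true;
    }

    // Mirrors the problematic order: the counter is decreased first, and the
    // call then waits (still referenced) until its response can be written.
    static void runCall(byte[] call) {
        callQueueSizeInBytes -= call.length;
        pendingResponses.add(call);
    }

    static long retainedBytes() {
        long total = 0;
        for (byte[] call : pendingResponses) {
            total += call.length;
        }
        return total;
    }

    public static void main(String[] args) {
        // A slow client never drains responses, yet every call is admitted.
        for (int i = 0; i < 10; i++) {
            byte[] call = new byte[1024];
            if (tryAccept(call.length)) {
                runCall(call);
            }
        }
        System.out.println("accounted bytes: " + callQueueSizeInBytes);
        System.out.println("retained bytes:  " + retainedBytes());
    }
}
```

In this model the accounted size returns to zero after every call, so the
limit never rejects anything, while the bytes actually retained grow past it.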
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)