Thanks. I left a comment on that ticket.

----
Saad
On Sat, Mar 10, 2018 at 11:57 PM, Anoop John <anoop.hb...@gmail.com> wrote:
> Hi Saad
>
> In your initial mail you mentioned that there are lots of checkAndPut ops
> but on different rows. The failure in obtaining locks (a write lock, as it
> is checkAndPut) means there is contention on the same row key. If that is
> the case, ya, that is the 1st step before the BC reads and it makes sense.
>
> On the question of why we do not cache the compacted file content: yes, it
> is this way even if cache-on-write is true. This is because sometimes the
> compacted result file can be very large (when it is a major compaction) and
> writing it would exhaust the BC. It might also contain some data which is
> very old. There is a recently raised JIRA which discusses this. Pls see
> HBASE-20045.
>
> -Anoop-
>
> On Sun, Mar 11, 2018 at 7:57 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> > Although now that I think about this a bit more, all the failures we saw
> > were failures to obtain a row lock, and in the thread stack traces we
> > always saw it somewhere inside getRowLockInternal and similar. I never
> > saw any contention on the bucket cache lock that I could see.
> >
> > Cheers.
> >
> > ----
> > Saad
> >
> >
> > On Sat, Mar 10, 2018 at 8:04 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >
> >> Also, for now we have mitigated this problem by using the new setting in
> >> HBase 1.4.0 that prevents one slow region server from blocking all
> >> client requests. Of course it causes some timeouts, but our overall
> >> ecosystem contains Kafka queues for retries, so we can live with that.
> >> From what I can see, it looks like this setting also has the good effect
> >> of preventing clients from hammering a region server that is slow
> >> because its IPC queues are backed up, allowing it to recover faster.
> >>
> >> Does that make sense?
> >>
> >> Cheers.
> >>
> >> ----
> >> Saad
> >>
> >>
> >> On Sat, Mar 10, 2018 at 7:04 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>
> >>> So if I understand correctly, we would mitigate the problem by not
> >>> evicting blocks for archived files immediately? Wouldn't this
> >>> potentially lead to problems later if the LRU algorithm chooses to
> >>> evict blocks for active files and leave blocks for archived files in
> >>> there?
> >>>
> >>> I would definitely love to test this!!! Unfortunately we are running on
> >>> EMR and the details of how to patch HBase under EMR are not clear to
> >>> me :-(
> >>>
> >>> What we would really love would be a setting for actually caching
> >>> blocks for a newly compacted file immediately. I have seen in the code
> >>> that even if we have the cache-on-write setting set to true, it will
> >>> refuse to cache blocks for a file that is a newly compacted one. In our
> >>> case we have sized the bucket cache to be big enough to hold all our
> >>> data, and we really want to avoid having to go to S3 until the last
> >>> possible moment. A config setting to test this would be great.
> >>>
> >>> But thanks everyone for your feedback. Any more would also be welcome
> >>> on the idea of letting a user cache all newly compacted files.
> >>>
> >>> ----
> >>> Saad
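For reference, a minimal sketch of wiring up the HBase 1.4.0 client-side protection mentioned above (HBASE-16388). The config key is the one that JIRA introduced; the threshold value and class name here are only illustrative placeholders, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class PerServerThresholdSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // HBASE-16388: cap the number of concurrent requests a single client
            // keeps outstanding against any one region server, so one slow server
            // cannot tie up every client thread. 2500 is just a placeholder value.
            conf.setInt("hbase.client.perserver.requests.threshold", 2500);
            try (Connection connection = ConnectionFactory.createConnection(conf)) {
                // Use the connection as usual; requests beyond the threshold to a
                // backed-up server fail fast and can be retried from the Kafka queues.
            }
        }
    }

The idea is exactly the fail-fast-and-retry behaviour described above: the client gives up quickly on a backed-up server instead of queueing more work on it.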
> >>>
> >>> On Wed, Mar 7, 2018 at 12:00 AM, Anoop John <anoop.hb...@gmail.com> wrote:
> >>>
> >>>> >> a) it was indeed one of the regions that was being compacted, major
> >>>> >> compaction in one case, minor compaction in another; the issue
> >>>> >> started just after compaction completed, blowing away bucket cached
> >>>> >> blocks for the older HFiles
> >>>>
> >>>> About this part: ya, after the compaction there is a step where the
> >>>> compacted-away HFile's blocks get removed from the cache. This op takes
> >>>> a write lock for this region (in the bucket cache layer). Every read op
> >>>> that is part of a checkAndPut will try to read from the BC, and that in
> >>>> turn needs a read lock for this region. So there is a chance that the
> >>>> read locks starve because of so many frequent write locks; each block
> >>>> evict will take the write lock one after the other. Would it be possible
> >>>> for you to patch this evict and test once? We can avoid the immediate
> >>>> evict from the BC after compaction. I can help you with a patch if you
> >>>> wish.
> >>>>
> >>>> Anoop
> >>>>
> >>>>
> >>>> On Mon, Mar 5, 2018 at 11:07 AM, ramkrishna vasudevan
> >>>> <ramkrishna.s.vasude...@gmail.com> wrote:
> >>>> > Hi Saad
> >>>> >
> >>>> > Your argument here
> >>>> >>> The theory is that since prefetch is an async operation, a lot of
> >>>> >>> the reads in the checkAndPut for the region in question start
> >>>> >>> reading from S3 which is slow. So the write lock obtained for the
> >>>> >>> checkAndPut is held for a longer duration than normal. This has
> >>>> >>> cascading upstream effects. Does that sound plausible?
> >>>> >
> >>>> > seems very much plausible. So before the prefetch even happens, say
> >>>> > for 'block 1', and you have already issued N checkAndPut calls for
> >>>> > the rows in that 'block 1', all those checkAndPut calls will have to
> >>>> > read that block from S3 to perform the get() and then apply the
> >>>> > mutation.
> >>>> >
> >>>> > This may happen for multiple threads at the same time because we are
> >>>> > not sure when the prefetch would have actually been completed. I
> >>>> > don't know what the general read characteristics are when a read
> >>>> > happens from S3, but you could try to see how things work when a read
> >>>> > happens from S3 and then, after the prefetch completes, ensure the
> >>>> > same checkAndPut() is done (from cache this time) to really know the
> >>>> > difference S3 makes there.
> >>>> >
> >>>> > Regards
> >>>> > Ram
> >>>> >
> >>>> > On Fri, Mar 2, 2018 at 2:57 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>> >
> >>>> >> So after much investigation I can confirm:
> >>>> >>
> >>>> >> a) it was indeed one of the regions that was being compacted, major
> >>>> >> compaction in one case, minor compaction in another; the issue
> >>>> >> started just after compaction completed, blowing away bucket cached
> >>>> >> blocks for the older HFiles
> >>>> >> b) in another case there was no compaction, just a newly opened
> >>>> >> region in a region server that hadn't finished prefetching its pages
> >>>> >> from S3
> >>>> >>
> >>>> >> We have prefetch on open set to true.
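For context, this is roughly how the column-family-level caching knobs that come up in this thread get set: a small, frequently read family kept in the block cache with prefetch-on-open and cache-data-on-write, and a large family with the block cache disabled. The table and family names are made up for illustration, and as discussed above, even with cache-data-on-write enabled the blocks of a freshly compacted file are not cached:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class SchemaCachingSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // Hypothetical small, frequently read column family: keep it in the
                // bucket cache, warm it asynchronously when a region opens, and
                // cache newly written (flushed) blocks as well.
                HColumnDescriptor small = new HColumnDescriptor("s");
                small.setBlockCacheEnabled(true);
                small.setPrefetchBlocksOnOpen(true);  // PREFETCH_BLOCKS_ON_OPEN => true
                small.setCacheDataOnWrite(true);      // compaction output is still skipped

                // Hypothetical large column family that is never read in the
                // primary cluster: leave it out of the block cache entirely.
                HColumnDescriptor large = new HColumnDescriptor("l");
                large.setBlockCacheEnabled(false);

                HTableDescriptor table = new HTableDescriptor(TableName.valueOf("my_table"));
                table.addFamily(small);
                table.addFamily(large);
                admin.createTable(table);
            }
        }
    }

Because the prefetch triggered by PREFETCH_BLOCKS_ON_OPEN runs asynchronously after a region opens, reads that arrive before it finishes still go to the underlying store (S3 here), which is the window the rest of this thread is about.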
> >>>> >> Our load is heavy on checkAndPut. The theory is that since prefetch
> >>>> >> is an async operation, a lot of the reads in the checkAndPut for the
> >>>> >> region in question start reading from S3, which is slow. So the
> >>>> >> write lock obtained for the checkAndPut is held for a longer
> >>>> >> duration than normal. This has cascading upstream effects. Does that
> >>>> >> sound plausible?
> >>>> >>
> >>>> >> The part I still don't understand is that all the locks held are for
> >>>> >> the same region but are all for different rows. So once the prefetch
> >>>> >> is completed, shouldn't the problem clear up quickly? Or does the
> >>>> >> slow region slow down anyone trying to do a checkAndPut on any row
> >>>> >> in the same region even after the prefetch has completed? That is,
> >>>> >> do the long-held row locks prevent others from getting a row lock on
> >>>> >> a different row in the same region?
> >>>> >>
> >>>> >> In any case, we are trying to use
> >>>> >> https://issues.apache.org/jira/browse/HBASE-16388 support in HBase
> >>>> >> 1.4.0 to both insulate the app a bit from this situation and,
> >>>> >> hopefully, reduce pressure on the region server in question,
> >>>> >> allowing it to recover faster. I haven't quite tested that yet; any
> >>>> >> advice in the meantime would be appreciated.
> >>>> >>
> >>>> >> Cheers.
> >>>> >>
> >>>> >> ----
> >>>> >> Saad
> >>>> >>
> >>>> >>
> >>>> >> On Thu, Mar 1, 2018 at 9:21 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>> >>
> >>>> >> > Actually it happened again while some minor compactions were
> >>>> >> > running, so I don't think it is related to our major compaction
> >>>> >> > tool, which isn't even running right now. I will try to capture a
> >>>> >> > debug dump of threads and everything while the event is ongoing.
> >>>> >> > It seems to last at least half an hour or so and sometimes longer.
> >>>> >> >
> >>>> >> > ----
> >>>> >> > Saad
> >>>> >> >
> >>>> >> >
> >>>> >> > On Thu, Mar 1, 2018 at 7:54 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>> >> >
> >>>> >> >> Unfortunately I lost the stack trace overnight. But it does seem
> >>>> >> >> related to compaction, because now that the compaction tool is
> >>>> >> >> done, I don't see the issue anymore. I will run our incremental
> >>>> >> >> major compaction tool again and see if I can reproduce the issue.
> >>>> >> >>
> >>>> >> >> On the plus side, the system stayed stable and eventually
> >>>> >> >> recovered, although it did suffer all those timeouts.
> >>>> >> >>
> >>>> >> >> ----
> >>>> >> >> Saad
> >>>> >> >>
> >>>> >> >>
> >>>> >> >> On Wed, Feb 28, 2018 at 10:18 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>> >> >>
> >>>> >> >>> I'll paste a thread dump later, writing this from my phone :-)
> >>>> >> >>>
> >>>> >> >>> So the same issue has happened at different times for different
> >>>> >> >>> regions, but I couldn't see that the region in question was the
> >>>> >> >>> one being compacted, either this time or earlier. Although I
> >>>> >> >>> might have missed an earlier correlation in the logs where the
> >>>> >> >>> issue started just after the compaction completed.
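For context, the checkAndPut workload discussed throughout this thread looks roughly like this on the client side (HBase 1.x API; the table, family, qualifier, and value names are invented for the example):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CheckAndPutSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("my_table"))) {
                byte[] row = Bytes.toBytes("some-row-key");
                byte[] cf = Bytes.toBytes("s");
                byte[] qual = Bytes.toBytes("state");

                Put put = new Put(row);
                put.addColumn(cf, qual, Bytes.toBytes("new-value"));

                // The region server takes the row's write lock, then reads the
                // current cell to evaluate the check. If that read misses the
                // bucket cache (e.g. right after compaction evicted the old
                // file's blocks, or before the open-time prefetch finishes), it
                // goes to S3 while the row lock is held.
                boolean applied = table.checkAndPut(
                        row, cf, qual, Bytes.toBytes("expected-value"), put);
                System.out.println("mutation applied: " + applied);
            }
        }
    }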
> >>>> >> >>>
> >>>> >> >>> Usually a compaction for this table's regions takes around 5-10
> >>>> >> >>> minutes; much less for its smaller column family, which is block
> >>>> >> >>> cache enabled, around a minute or less, and 5-10 minutes for the
> >>>> >> >>> much larger one, for which we have the block cache disabled in
> >>>> >> >>> the schema because we don't ever read it in the primary cluster.
> >>>> >> >>> So the only impact on reads would be from that smaller column
> >>>> >> >>> family, which takes less than a minute to compact.
> >>>> >> >>>
> >>>> >> >>> But the issue, once started, doesn't seem to recover for a long
> >>>> >> >>> time, long past when any compaction on the region itself could
> >>>> >> >>> impact anything. The compaction tool, which is our own code, has
> >>>> >> >>> long since moved on to other regions.
> >>>> >> >>>
> >>>> >> >>> Cheers.
> >>>> >> >>>
> >>>> >> >>> ----
> >>>> >> >>> Saad
> >>>> >> >>>
> >>>> >> >>>
> >>>> >> >>> On Wed, Feb 28, 2018 at 9:39 PM Ted Yu <yuzhih...@gmail.com> wrote:
> >>>> >> >>>
> >>>> >> >>>> bq. timing out trying to obtain write locks on rows in that region.
> >>>> >> >>>>
> >>>> >> >>>> Can you confirm that the region under contention was the one
> >>>> >> >>>> being major compacted?
> >>>> >> >>>>
> >>>> >> >>>> Can you pastebin a thread dump so that we can have a better idea
> >>>> >> >>>> of the scenario?
> >>>> >> >>>>
> >>>> >> >>>> For the region being compacted, how long would the compaction
> >>>> >> >>>> take (just want to see if there was a correlation between this
> >>>> >> >>>> duration and the timeout)?
> >>>> >> >>>>
> >>>> >> >>>> Cheers
> >>>> >> >>>>
> >>>> >> >>>> On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> >>>> >> >>>>
> >>>> >> >>>> > Hi,
> >>>> >> >>>> >
> >>>> >> >>>> > We are running Amazon EMR based HBase 1.4.0. We are currently
> >>>> >> >>>> > seeing a situation where sometimes a particular region gets
> >>>> >> >>>> > into a state where a lot of write requests to any row in that
> >>>> >> >>>> > region time out saying they failed to obtain a lock on a row
> >>>> >> >>>> > in the region, and eventually they experience an IPC timeout.
> >>>> >> >>>> > This causes the IPC queue to blow up in size as requests get
> >>>> >> >>>> > backed up, and that region server experiences a much higher
> >>>> >> >>>> > than normal timeout rate for all requests, not just those
> >>>> >> >>>> > timing out for failing to obtain the row lock.
> >>>> >> >>>> >
> >>>> >> >>>> > The strange thing is that the rows are always different but
> >>>> >> >>>> > the region is always the same. So the question is, is there a
> >>>> >> >>>> > region component to how long a row write lock would be held?
> >>>> >> >>>> > I looked at the debug dump, and the RowLocks section shows a
> >>>> >> >>>> > long list of write row locks held; all of them are from the
> >>>> >> >>>> > same region but different rows.
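For reference, the client-side timeout knobs involved in the IPC timeouts described above can be tuned roughly like this. The config keys are standard HBase client settings; the values are placeholders only, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ClientTimeoutSketch {
        public static Configuration build() {
            Configuration conf = HBaseConfiguration.create();
            // Per-RPC timeout: how long a single call to a region server may
            // take before the client gives up on that attempt.
            conf.setInt("hbase.rpc.timeout", 30000);               // 30s, placeholder
            // Overall timeout across retries for a single client operation such
            // as a checkAndPut; keeps requests from piling up behind a slow region.
            conf.setInt("hbase.client.operation.timeout", 60000);  // 60s, placeholder
            return conf;
        }
    }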
> >>>> >> >>>> >
> >>>> >> >>>> > Will trying to obtain a write row lock experience delays if
> >>>> >> >>>> > no one else holds a lock on the same row but the region
> >>>> >> >>>> > itself is experiencing read delays? We do have an incremental
> >>>> >> >>>> > compaction tool running that major compacts one region per
> >>>> >> >>>> > region server at a time, so that will drive pages out of the
> >>>> >> >>>> > bucket cache. But for most regions the impact is transitory,
> >>>> >> >>>> > until the bucket cache gets populated with pages from the new
> >>>> >> >>>> > HFile. For this one region, though, we start timing out
> >>>> >> >>>> > trying to obtain write locks on rows in that region.
> >>>> >> >>>> >
> >>>> >> >>>> > Any insight anyone can provide would be most welcome.
> >>>> >> >>>> >
> >>>> >> >>>> > Cheers.
> >>>> >> >>>> >
> >>>> >> >>>> > ----
> >>>> >> >>>> > Saad
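For context, a bare-bones sketch of what an incremental major compaction tool like the one mentioned above might look like, i.e. issuing a major compaction for one region per region server at a time. This is only an illustration of the idea under HBase 1.x APIs, not the actual tool; a real version would also track completion and remember which regions it has already compacted before moving on:

    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.ServerName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class IncrementalMajorCompactSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // One pass: pick the first online region on each region server
                // and ask for a major compaction of just that region.
                for (ServerName server : admin.getClusterStatus().getServers()) {
                    List<HRegionInfo> regions = admin.getOnlineRegions(server);
                    if (!regions.isEmpty()) {
                        HRegionInfo region = regions.get(0);
                        // Asynchronous request; the region server compacts in the
                        // background, evicting the old HFile's bucket cache blocks
                        // when it finishes.
                        admin.majorCompactRegion(region.getRegionName());
                    }
                }
            }
        }
    }

Compacting only one region per server at a time keeps the bucket cache churn localized, which is the design intent described in this thread.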