> out of curiosity, did you see the message below in the RS log?
>
> LOG.warn("Snapshot called again without clearing previous. " +
>     "Doing nothing. Another ongoing flush or did we fail last attempt?");
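>
> For context, that warning is raised from the memstore snapshot path when a
> previous snapshot was never cleared, in which case the flush does nothing.
> Roughly (a paraphrase of MemStore#snapshot() in 0.98, not the exact source):
>
>   void snapshot() {
>     this.lock.writeLock().lock();
>     try {
>       // A non-empty snapshot means the last flush failed or never
>       // finished cleanup, so warn and skip taking a new snapshot.
>       if (!this.snapshot.isEmpty()) {
>         LOG.warn("Snapshot called again without clearing previous. " +
>             "Doing nothing. Another ongoing flush or did we fail last attempt?");
>       } else {
>         // ... move the active memstore into the snapshot ...
>       }
>     } finally {
>       this.lock.writeLock().unlock();
>     }
>   }
>
> If that showed up, it would explain a flush request that never completes.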
Nope
>
> thanks.
>
> On Tue, Sep 9, 2014 at 2:15 AM, Brian Jeltema <[email protected]> wrote:
>
>> I’ve resolved these problems by restarting the region server that owned
>> the region in question.
>> I don’t know what the underlying issue was, but at this point it’s not
>> worth pursuing.
>>
>> Thanks for responding.
>>
>> Brian
>>
>> On Sep 8, 2014, at 11:06 AM, Brian Jeltema <[email protected]> wrote:
>>
>>> I realized today that the region server logs for the region being
>>> updated (startKey=\x00DDD@) contain the following:
>>>
>>> 2014-09-08 06:25:50,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 11302
>>> 2014-09-08 06:26:00,222 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 21682
>>> 2014-09-08 06:26:10,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 5724
>>> 2014-09-08 06:26:20,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 11962
>>> 2014-09-08 06:26:30,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 7693
>>> 2014-09-08 06:26:40,224 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 5578
>>> 2014-09-08 06:26:50,223 INFO [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65. after a delay of 12420
>>>
>>> A log entry like these has been generated every 10 seconds, starting
>>> about 4 days ago. I presume these problems are related.
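>>>
>>> For what it's worth, in 0.98 the periodic flusher chore wakes every 10
>>> seconds and re-requests a flush for any region whose oldest memstore
>>> edit is older than the flush interval. Roughly (a paraphrase of
>>> HRegionServer.PeriodicMemstoreFlusher, not the exact source):
>>>
>>>   // MIN_DELAY_TIME = 3000, RANGE_OF_DELAY = 20000 (ms)
>>>   if (region.shouldFlush()) {
>>>     // the random delay spreads flushes out so regions don't all flush at once
>>>     long randomDelay = rand.nextInt(RANGE_OF_DELAY) + MIN_DELAY_TIME;
>>>     LOG.info(getName() + " requesting flush for region "
>>>         + region.getRegionNameAsString() + " after a delay of " + randomDelay);
>>>     requester.requestDelayedFlush(region, randomDelay);
>>>   }
>>>
>>> That fits the 5578-21682 ms delays above, and an entry every 10 seconds
>>> suggests the requested flush never actually completes.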
>>>
>>> On Sep 8, 2014, at 7:10 AM, Brian Jeltema <[email protected]> wrote:
>>>
>>>>
>>>>> When the number of attempts is greater than the value of
>>>>> hbase.client.start.log.errors.counter (default 9), AsyncProcess
>>>>> produces log lines like the ones cited below.
>>>>> The interval following 'retrying after' is the backoff time.
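>>>>>
>>>>> The backoff is computed roughly as hbase.client.pause times a
>>>>> multiplier from HConstants.RETRY_BACKOFF, plus jitter. A sketch of the
>>>>> 0.98-era logic (conf is the client Configuration, tries the zero-based
>>>>> retry count):
>>>>>
>>>>>   // HConstants.RETRY_BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200}
>>>>>   long pause = conf.getLong(HConstants.HBASE_CLIENT_PAUSE,
>>>>>       HConstants.DEFAULT_HBASE_CLIENT_PAUSE);  // 100 ms by default
>>>>>   int index = Math.min(tries, HConstants.RETRY_BACKOFF.length - 1);
>>>>>   long backoff = pause * HConstants.RETRY_BACKOFF[index];
>>>>>
>>>>> With the default 100 ms pause that comes to ~10 s around attempt 10 and
>>>>> ~20 s from attempt 12 on, matching the intervals in your log. If the
>>>>> retry lines are too noisy, the threshold can be raised on the client,
>>>>> e.g. conf.setInt("hbase.client.start.log.errors.counter", 20);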
>>>>>
>>>>> Which release of HBase are you using?
>>>>>
>>>>
>>>> HBase Version 0.98.0.2.1.1.0-385-hadoop2
>>>>
>>>> The MR job is reading from an HBase snapshot, if that’s relevant.
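>>>> The snapshot input is wired up the usual way for 0.98, along these
>>>> lines (the snapshot name, mapper class, and restore dir below are
>>>> placeholders, not the real names):
>>>>
>>>>   TableMapReduceUtil.initTableSnapshotMapperJob(
>>>>       "host_snapshot",                     // snapshot name (placeholder)
>>>>       new Scan(),                          // scan applied over the snapshot
>>>>       HostMapper.class,                    // our TableMapper (placeholder)
>>>>       ImmutableBytesWritable.class,        // mapper output key
>>>>       Result.class,                        // mapper output value
>>>>       job,
>>>>       true,                                // ship dependency jars
>>>>       new Path("/tmp/snapshot_restore"));  // temp restore dir (placeholder)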
>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>> On Sun, Sep 7, 2014 at 8:50 AM, Brian Jeltema <[email protected]> wrote:
>>>>>
>>>>>> I have a map/reduce job that is consistently failing with timeouts.
>>>>>> The failing mapper log files contain a series of records similar to
>>>>>> those below. When I look at the HBase and HDFS logs (on foo.net in
>>>>>> this case) I don’t see anything obvious at these timestamps. The
>>>>>> mapper task times out at/near attempt=25/35. Can anyone shed light
>>>>>> on what these log entries mean?
>>>>>>
>>>>>> Thanks - Brian
>>>>>>
>>>>>>
>>>>>> 2014-09-07 09:36:51,421 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=10/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 10029 ms, replay 1062 ops
>>>>>> 2014-09-07 09:37:01,642 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=11/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 10023 ms, replay 1062 ops
>>>>>> 2014-09-07 09:37:12,064 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=12/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20182 ms, replay 1062 ops
>>>>>> 2014-09-07 09:37:32,708 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=13/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20140 ms, replay 1062 ops
>>>>>> 2014-09-07 09:37:52,940 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=14/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20041 ms, replay 1062 ops
>>>>>> 2014-09-07 09:38:13,324 INFO [htable-pool1-t1] org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary, attempt=15/35 failed 1062 ops, last exception: null on foo.net,60020,1406043467187, tracking started null, retrying after 20041 ms, replay 1062 ops
>>>>>>
>>>>>>
>>>>
>>>
>>
>>