Here're the config parameters related to controlling snapshot timeout:

  <property>
    <name>hbase.snapshot.master.timeoutMillis</name>
    <!-- Change from default of 60s to 600s to allow for slow flushing of
tables -->
    <value>600000</value>
    <description>
      This is the time HBase master waits for the snapshot operation to
complete.
      Do not confuse this hbase.snapshot.master.timeout.millis, which
although
      sounding similar, serves a very different purpose.
      Note: This property has a completely different meaning before hbase
version
      0.94.11 and should not enabled on a cluster using snapshots and
running
      a version before 0.94.11.
    </description>
  </property>
  <property>
    <name>hbase.snapshot.master.timeout.millis</name>
    <!-- Change from default of 60s to 600s to allow for slow flushing of
tables -->
    <value>600000</value>
    <description>
      This is the timeout the master indicates the client to wait when it
takes
      the snapshot. The client actually waits longer than this due to
exponential
      backoff. See HBaseAdmin.snapshot for the exact algorithm.
    </description>
  </property>
  <property>
    <name>hbase.snapshot.region.timeout</name>
    <!-- Change from default of 60s to 600s to allow for slow flushing of
tables -->
    <value>600000</value>
    <description>
      This is the time the regionserver waits to complete all of its
activities
      for a snapshot operation.
    </description>
  </property>

On Tue, Sep 23, 2014 at 6:05 AM, Ian Brooks <i.bro...@sensewhere.com> wrote:

> Hi,
>
> I'm running snapshots on 7 tables every hour, the last run managed 15
> rounds of snapshots before starting to fail.
>
> I did notice the following while getting the success count
>
> 2014-09-10 11:01:55,453 DEBUG
> [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
> snapshot.FlushSnapshotSubprocedure: Closing region operation on
> rawScans,250000000000,1408114528441.ddb14a3d8f90cc18b6dfd4f28df5c534.
> 2014-09-10 11:01:55,453 DEBUG
> [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
> snapshot.FlushSnapshotSubprocedure: Starting region operation on
> rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9.
> 2014-09-10 11:01:55,453 DEBUG
> [rs(##############,16020,1410126483043)-snapshot-pool151-thread-1]
> snapshot.FlushSnapshotSubprocedure: Flush Snapshotting region
> rawScans,960000000000,1408114528443.0b3010a2712ff1b6acec0705f5d171d9.
> started...
> 2014-09-10 11:01:55,454 WARN  [member:
> '##############,16020,1410126483043' subprocedure-pool1-thread-1]
> snapshot.RegionServerSnapshotManager: Rethrowing ForeignException from
> SnapshotSubprocedurePool
> java.util.concurrent.ExecutionException:
> org.apache.hadoop.hbase.errorhandling.TimeoutException via
> timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException:
> Timeout elapsed! Source:Timeout caused Foreign Exception
> Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:188)
>         at
> org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:295)
>         at
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
>         at
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
>         at
> org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
>         at
> org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
> timer-java.util.Timer@45c34a8:org.apache.hadoop.hbase.errorhandling.TimeoutException:
> Timeout elapsed! Source:Timeout caused Foreign Exception
> Start:1410343254867, End:1410343314867, diff:60000, max:60000 ms
>         at
> org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
>         at
> org.apache.hadoop.hbase.regionserver.HRegion.addRegionToSnapshot(HRegion.java:2816)
>         at
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:83)
>         at
> org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:65)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         ... 4 more
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
> elapsed! Source:Timeout caused Foreign Exception Start:1410343254867,
> End:1410343314867, diff:60000, max:60000 ms
>         at
> org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:67)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
>
> Is there a configuration option that I need to increase to allow the
> snapshots to run for more than 60 seconds on a given region? That said, I'm
> guessing the above shouldn't cause the SubProcedure pool to fill up
> completely.
>
> -Ian
>
> On Tuesday 23 September 2014 04:33:39 Ted Yu wrote:
> > Subprocedure pool is full.
> >
> > How many snapshot requests did you submit before seeing the following ?
> >
> > Cheers
> >
> > On Sep 23, 2014, at 2:28 AM, Ian Brooks <i.bro...@sensewhere.com> wrote:
> >
> > > Hi,
> > >
> > > I'm seeing an issue on our hbase cluster which is preventing snapshots
> from working. So far the only way i can get it working again is to restart
> all the regionservers which is not ideal.
> > >
> > > I'm running hbase 0.98.3 on hadoop 2.4.0, the stacktraces are
> > >
> > > From host given the snapshot command via hbase shell
> > >
> > > 2014-09-23 10:00:16,116 INFO  [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> > > 2014-09-23 10:00:16,116 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Looking for new procedures under
> znode:'/hbase/online-snapshot/acquired'
> > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Found procedure znode:
> /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: start proc data length is 59
> > > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Found data for
> znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread]
> snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot
> offlineScans-20140923-100009 from table offlineScans
> > > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread]
> procedure.ProcedureMember: Submitting new
> Subprocedure:offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,120 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> procedure.Subprocedure: Starting subprocedure
> 'offlineScans-20140923-100009' with timeout 60000ms
> > > 2014-09-23 10:00:16,120 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> errorhandling.TimeoutExceptionInjector: Scheduling process timer to run in:
> 60000 ms
> > > 2014-09-23 10:00:16,121 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009'
> starting 'acquire' stage
> > > 2014-09-23 10:00:16,121 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009' locally
> acquired
> > > 2014-09-23 10:00:16,121 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> procedure.ZKProcedureMemberRpcs: Member: '##########,16020,1410126483043'
> joining acquired barrier for procedure (offlineScans-20140923-100009) in zk
> > > 2014-09-23 10:00:16,123 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> procedure.ZKProcedureMemberRpcs: Watch for global barrier
> reached:/hbase/online-snapshot/reached/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,124 DEBUG [member:
> '############,16020,1410126483043' subprocedure-pool1-thread-314]
> procedure.Subprocedure: Subprocedure 'offlineScans-20140923-100009'
> coordinator notified of 'acquire', waiting on 'reached' or 'abort' from
> coordinator
> > > 2014-09-23 10:00:16,133 INFO  [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Received created
> event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode
> /hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> procedure.ProcedureMember: Request received to abort procedure
> offlineScans-20140923-100009
> > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable
> via
> ###############,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > >        at
> org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
> > >        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
> > >        at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > >        at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > > Caused by:
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > >        at
> org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
> > >        at
> org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
> > >        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
> > >        ... 2 more
> > >
> > >
> > >
> > > From the regionserver it was trying to contact
> > >
> > >
> > > 2014-09-23 10:00:16,124 INFO  [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Received procedure start children changed
> event: /hbase/online-snapshot/acquired
> > > 2014-09-23 10:00:16,124 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Looking for new procedures under
> znode:'/hbase/online-snapshot/acquired'
> > > 2014-09-23 10:00:16,125 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Found procedure znode:
> /hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: start proc data length is 59
> > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Found data for
> znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot
> offlineScans-20140923-100009 from table offlineScans
> > > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread]
> procedure.ProcedureMember: Submitting new
> Subprocedure:offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,128 ERROR [regionserver16020-EventThread]
> procedure.Subprocedure: Subprocedure pool is full!
> > > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread]
> errorhandling.ForeignExceptionDispatcher: offlineScans-20140923-100009
> accepting received exception
> > > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Aborting procedure
> (offlineScans-20140923-100009) in zk
> > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Finished creating abort
> znode:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,134 ERROR [regionserver16020-EventThread]
> procedure.ProcedureMember: Failed to start subprocedure
> 'offlineScans-20140923-100009'
> > > 2014-09-23 10:00:16,134 INFO  [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Received created
> event:/hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread]
> procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode
> /hbase/online-snapshot/abort/offlineScans-20140923-100009
> > > 2014-09-23 10:00:16,145 DEBUG [regionserver16020-EventThread]
> procedure.ProcedureMember: Request received to abort procedure
> offlineScans-20140923-100009
> > > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable
> via
> ###########,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > >        at
> org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
> > >        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
> > >        at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
> > >        at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> > > Caused by:
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
> > >        at
> org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
> > >        at
> org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55)
> > >        at
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107)
> > >        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348)
> > >        ... 2 more
> > >
> > > Any ideas on how I can resolve this?
> > >
> > > -Ian
> > >
>

Reply via email to