If it's a Hector thing you may have better luck on the Hector user group. http://groups.google.com/group/hector-users
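
On the "null client" message: in the trace below, the InterruptedException is thrown inside ArrayBlockingQueue.poll while waitForConnection is waiting for a free client, and borrowClient then fails one millisecond later on the same thread. My guess, and it is only a guess since I have not checked the Hector source, is that the interrupted wait hands back null and borrowClient turns that into the HectorException. Here is a rough stand-alone sketch of that pattern using only the JDK. ToyPool and its methods are made-up names for illustration, not Hector's code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class NullClientSketch {

    // Hypothetical stand-in for a client pool; NOT Hector's actual implementation.
    static final class ToyPool {
        // Empty queue: every borrower has to wait for a client to be returned.
        private final BlockingQueue<String> available = new ArrayBlockingQueue<>(1);

        // Waits for a free client; yields null if the wait is interrupted or times out.
        String waitForConnection(long timeoutMillis) {
            try {
                return available.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
                return null;
            }
        }

        // Turns a null result from the wait into an exception for the caller.
        String borrowClient(long timeoutMillis) {
            String client = waitForConnection(timeoutMillis);
            if (client == null) {
                throw new IllegalStateException("pool returned a null client after acquisition");
            }
            return client;
        }
    }

    // Starts a borrower thread, interrupts it, and returns the failure message it saw.
    static String borrowWhileInterrupted() {
        ToyPool pool = new ToyPool();
        AtomicReference<String> failure = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            try {
                pool.borrowClient(10_000);
            } catch (IllegalStateException e) {
                failure.set(e.getMessage());
            }
        });
        worker.start();
        worker.interrupt(); // something outside the pool interrupts the borrower
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return failure.get();
    }

    public static void main(String[] args) {
        System.out.println(borrowWhileInterrupted());
    }
}
```

If that is what is happening, the thing to chase is whatever interrupts the pool-2 worker threads (a Future.cancel(true) or an executor shutdownNow in the calling code would do it), rather than the host retry settings.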

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 10/03/2012, at 8:33 AM, Daning Wang wrote:

> Thanks Maciej. we have default value for retryDownedHostsDelayInSeconds. I
> think it is not about how long it checks the downed host, I suspect the
> HostRetryService is down. Below is the very first exception, what does this
> message mean - " HConnectionManager returned a null client after aquisition
> - are we shutting down?"
>
> 2012-03-08 16:37:15,103 [pool-2-thread-34288] Cassandra client acquisition interrupted
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
>     at java.util.concurrent.ArrayBlockingQueue.poll(Unknown Source)
>     at me.prettyprint.cassandra.connection.ConcurrentHClientPool.waitForConnection(ConcurrentHClientPool.java:117)
>     at me.prettyprint.cassandra.connection.ConcurrentHClientPool.borrowClient(ConcurrentHClientPool.java:77)
>     at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:226)
>     at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
>     at me.prettyprint.cassandra.model.CqlQuery.execute(CqlQuery.java:93)
>     at com.netseer.cassandra.cache.dao.CacheReader.getRows(CacheReader.java:267)
>     at com.netseer.cassandra.cache.dao.CacheReader.getCache0(CacheReader.java:55)
>     at com.netseer.cassandra.cache.dao.CacheDao.getCaches(CacheDao.java:85)
>     at com.netseer.cassandra.cache.dao.CacheDao.getCache(CacheDao.java:71)
>     at com.netseer.cassandra.cache.dao.CacheDao.getCache(CacheDao.java:149)
>     at com.netseer.cassandra.cache.service.CacheServiceImpl.getCache(CacheServiceImpl.java:55)
>     at com.netseer.cassandra.cache.service.CacheServiceImpl.getCache(CacheServiceImpl.java:28)
>     at com.netseer.dsat.cache.CassandraDSATCacheImpl.get(CassandraDSATCacheImpl.java:62)
>     at com.netseer.dsat.cache.CassandraDSATCacheImpl.getTimedValue(CassandraDSATCacheImpl.java:144)
>     at com.netseer.dsat.serving.GenericCacheManager$4.call(GenericCacheManager.java:427)
>     at com.netseer.dsat.serving.GenericCacheManager$4.call(GenericCacheManager.java:1)
>     at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>     at java.util.concurrent.FutureTask.run(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
> 2012-03-08 16:37:15,104 [pool-2-thread-34288] Failed getting remote cache for key=Key String = 'http://www.my-banners.com', long key = 5630311119483252185, keyType = 'PATH'
> me.prettyprint.hector.api.exceptions.HectorException: HConnectionManager returned a null client after aquisition - are we shutting down?
>     at me.prettyprint.cassandra.connection.ConcurrentHClientPool.borrowClient(ConcurrentHClientPool.java:83)
>     at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:226)
>     at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
>     at me.prettyprint.cassandra.model.CqlQuery.execute(CqlQuery.java:93)
>
> On Mon, Mar 5, 2012 at 10:56 PM, Maciej Miklas <mac.mik...@googlemail.com> wrote:
> Have you tried to change:
> me.prettyprint.cassandra.service.CassandraHostConfigurator#retryDownedHostsDelayInSeconds ?
>
> Hector will ping down hosts every xx seconds and recover connection.
>
> Regards,
> Maciej
>
> On Mon, Mar 5, 2012 at 8:13 PM, Daning Wang <dan...@netseer.com> wrote:
> I just got this error ": All host pools marked down. Retry burden pushed out
> to client." in a few clients recently, client could not recover, we have to
> restart client application. we are using 0.8.0.3 hector.
>
> At that time we did compaction for a CF, it takes several hours, server was
> busy. But I think client should recover after server load was down.
>
> Any bug reported about this? I did search but could not find one.
>
> Thanks,
>
> Daning