Thanks Agatea for digging in. Seems like you have hit a bug. Would you mind opening a jira and adding your findings to it.
Thanks, Ashutosh On Thu, Aug 29, 2013 at 11:22 AM, agateaaa <agate...@gmail.com> wrote: > Sorry hit send too soon ... > > Hi All: > > Put some debugging code in TUGIContainingTransport.getTransport() and I > tracked it down to > > @Override > public TUGIContainingTransport getTransport(TTransport trans) { > > // UGI information is not available at connection setup time, it will be > set later > // via set_ugi() rpc. > transMap.putIfAbsent(trans, new TUGIContainingTransport(trans)); > > //return transMap.get(trans); //<-change > TUGIContainingTransport retTrans = transMap.get(trans); > > if ( retTrans == null ) { > LOGGER.error (" cannot find transport that was in map !!") > } else { > LOGGER.debug (" cannot find transport that was in map !!") > return retTrans; > } > } > > When we run this in our test environment, see that we run into the problem > just after GC runs, > and "cannot find transport that was in the map!!" message gets logged. > > Could the GC be collecting entries from transMap, just before the we get it > > Tried a minor change which seems to work > > public TUGIContainingTransport getTransport(TTransport trans) { > > TUGIContainingTransport retTrans = transMap.get(trans); > > if ( retTrans == null ) { > // UGI information is not available at connection setup time, it will be > set later > // via set_ugi() rpc. > transMap.putIfAbsent(trans, retTrans); > } > return retTrans; > } > > > My questions for hive and thrift experts > > 1.) Do we need to use a ConcurrentMap > ConcurrentMap<TTransport, TUGIContainingTransport> transMap = new > MapMaker().weakKeys().weakValues().makeMap(); > It does use == to compare keys (which might be the problem), also in this > case we cant rely on the trans to be always there in the transMap, even > after a put, so in that case change above > probably makes sense > > > 2.) Is it better idea to use WeakHashMap with WeakReference instead ? (was > looking at org.apache.thrift.transport.TSaslServerTransport, esp change > made by THRIFT-1468) > > e.g. > private static Map<TTransport, WeakReference<TUGIContainingTransport>> > transMap3 = Collections.synchronizedMap(new WeakHashMap<TTransport, > WeakReference<TUGIContainingTransport>>()); > > getTransport() would be something like > > public TUGIContainingTransport getTransport(TTransport trans) { > WeakReference<TUGIContainingTransport> ret = transMap.get(trans); > if (ret == null || ret.get() == null) { > ret = new WeakReference<TUGIContainingTransport>(new > TUGIContainingTransport(trans)); > transMap3.put(trans, ret); // No need for putIfAbsent(). > // Concurrent calls to getTransport() will pass in different TTransports. > } > return ret.get(); > } > > > I did try 1.) above in our test environment and it does seem to resolve the > problem, though i am not sure if I am introducing any other problem > > > Can someone help ? > > > Thanks > Agatea > > > > > > > > > > > > > > On Thu, Aug 29, 2013 at 10:57 AM, agateaaa <agate...@gmail.com> wrote: > > > Hi All: > > > > Put some debugging code in TUGIContainingTransport.getTransport() and I > > tracked it down to > > > > @Override > > public TUGIContainingTransport getTransport(TTransport trans) { > > > > // UGI information is not available at connection setup time, it will be > > set later > > // via set_ugi() rpc. > > transMap.putIfAbsent(trans, new TUGIContainingTransport(trans)); > > > > //return transMap.get(trans); <-change > > TUGIContainingTransport retTrans = transMap.get(trans); > > > > if ( retTrans == null ) { > > > > > > > > } > > > > > > > > > > > > On Wed, Jul 31, 2013 at 9:48 AM, agateaaa <agate...@gmail.com> wrote: > > > >> Thanks Nitin > >> > >> There arent too many connections in close_wait state only 1 or two when > >> we run into this. Most likely its because of dropped connection. > >> > >> I could not find any read or write timeouts we can set for the thrift > >> server which will tell thrift to hold on to the client connection. > >> See this https://issues.apache.org/jira/browse/HIVE-2006 but doesnt > >> seem to have been implemented yet. We do have set a client connection > >> timeout but cannot find > >> an equivalent setting for the server. > >> > >> We have a suspicion that this happens when we run two client processes > >> which modify two distinct partitions of the same hive table. We put in a > >> workaround so that the two hive client processes never run together and > so > >> far things look ok but we will keep monitoring. > >> > >> Could it be because hive metastore server is not thread safe, would > >> running two alter table statements on two distinct partitions of the > same > >> table using two client connections cause problems like these, where hive > >> metastore server closes or drops a wrong client connection and leaves > the > >> other hanging? > >> > >> Agateaaa > >> > >> > >> > >> > >> On Tue, Jul 30, 2013 at 12:49 AM, Nitin Pawar <nitinpawar...@gmail.com > >wrote: > >> > >>> The mentioned flow is called when you have unsecure mode of thrift > >>> metastore client-server connection. So one way to avoid this is have a > >>> secure way. > >>> > >>> <code> > >>> public boolean process(final TProtocol in, final TProtocol out) > >>> throwsTException { > >>> setIpAddress(in); > >>> ... > >>> ... > >>> ... > >>> @Override > >>> protected void setIpAddress(final TProtocol in) { > >>> TUGIContainingTransport ugiTrans = > >>> (TUGIContainingTransport)in.getTransport(); > >>> Socket socket = ugiTrans.getSocket(); > >>> if (socket != null) { > >>> setIpAddress(socket); > >>> > >>> </code> > >>> > >>> > >>> From the above code snippet, it looks like the null pointer exception > is > >>> not handled if the getSocket returns null. > >>> > >>> can you check whats the ulimit setting on the server? If its set to > >>> default > >>> can you set it to unlimited and restart hcat server. (This is just a > wild > >>> guess). > >>> > >>> also the getSocket method suggests "If the underlying TTransport is an > >>> instance of TSocket, it returns the Socket object which it contains. > >>> Otherwise it returns null." > >>> > >>> so someone from thirft gurus need to tell us whats happening. I have no > >>> knowledge of this depth > >>> > >>> may be Ashutosh or Thejas will be able to help on this. > >>> > >>> > >>> > >>> > >>> From the netstat close_wait, it looks like the hive metastore server > has > >>> not closed the connection (do not know why yet), may be the hive dev > guys > >>> can help.Are there too many connections in close_wait state? > >>> > >>> > >>> > >>> On Tue, Jul 30, 2013 at 5:52 AM, agateaaa <agate...@gmail.com> wrote: > >>> > >>> > Looking at the hive metastore server logs see errors like these: > >>> > > >>> > 2013-07-26 06:34:52,853 ERROR server.TThreadPoolServer > >>> > (TThreadPoolServer.java:run(182)) - Error occurred during processing > of > >>> > message. > >>> > java.lang.NullPointerException > >>> > at > >>> > > >>> > > >>> > org.apache.hadoop.hive.metastore.TUGIBasedProcessor.setIpAddress(TUGIBasedProcessor.java:183) > >>> > at > >>> > > >>> > > >>> > org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:79) > >>> > at > >>> > > >>> > > >>> > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:176) > >>> > at > >>> > > >>> > > >>> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > >>> > at > >>> > > >>> > > >>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > >>> > at java.lang.Thread.run(Thread.java:662) > >>> > > >>> > approx same time as we see timeout or connection reset errors. > >>> > > >>> > Dont know if this is the cause or the side affect of he connection > >>> > timeout/connection reset errors. Does anybody have any pointers or > >>> > suggestions ? > >>> > > >>> > Thanks > >>> > > >>> > > >>> > On Mon, Jul 29, 2013 at 11:29 AM, agateaaa <agate...@gmail.com> > wrote: > >>> > > >>> > > Thanks Nitin! > >>> > > > >>> > > We have simiar setup (identical hcatalog and hive server versions) > >>> on a > >>> > > another production environment and dont see any errors (its been > >>> running > >>> > ok > >>> > > for a few months) > >>> > > > >>> > > Unfortunately we wont be able to move to hcat 0.5 and hive 0.11 or > >>> hive > >>> > > 0.10 soon. > >>> > > > >>> > > I did see that the last time we ran into this problem doing a > >>> netstat-ntp > >>> > > | grep ":10000" see that server was holding on to one socket > >>> connection > >>> > in > >>> > > CLOSE_WAIT state for a long time > >>> > > (hive metastore server is running on port 10000). Dont know if > thats > >>> > > relevant here or not > >>> > > > >>> > > Can you suggest any hive configuration settings we can tweak or > >>> > networking > >>> > > tools/tips, we can use to narrow this down ? > >>> > > > >>> > > Thanks > >>> > > Agateaaa > >>> > > > >>> > > > >>> > > > >>> > > > >>> > > On Mon, Jul 29, 2013 at 11:02 AM, Nitin Pawar < > >>> nitinpawar...@gmail.com > >>> > >wrote: > >>> > > > >>> > >> Is there any chance you can do a update on test environment with > >>> > hcat-0.5 > >>> > >> and hive-0(11 or 10) and see if you can reproduce the issue? > >>> > >> > >>> > >> We used to see this error when there was load on hcat server or > some > >>> > >> network issue connecting to the server(second one was rare > >>> occurrence) > >>> > >> > >>> > >> > >>> > >> On Mon, Jul 29, 2013 at 11:13 PM, agateaaa <agate...@gmail.com> > >>> wrote: > >>> > >> > >>> > >>> Hi All: > >>> > >>> > >>> > >>> We are running into frequent problem using HCatalog 0.4.1 (HIve > >>> > Metastore > >>> > >>> Server 0.9) where we get connection reset or connection timeout > >>> errors. > >>> > >>> > >>> > >>> The hive metastore server has been allocated enough (12G) memory. > >>> > >>> > >>> > >>> This is a critical problem for us and would appreciate if anyone > >>> has > >>> > any > >>> > >>> pointers. > >>> > >>> > >>> > >>> We did add a retry logic in our client, which seems to help, but > I > >>> am > >>> > >>> just > >>> > >>> wondering how can we narrow down to the root cause > >>> > >>> of this problem. Could this be a hiccup in networking which > causes > >>> the > >>> > >>> hive > >>> > >>> server to get into a unresponsive state ? > >>> > >>> > >>> > >>> Thanks > >>> > >>> > >>> > >>> Agateaaa > >>> > >>> > >>> > >>> > >>> > >>> Example Connection reset error: > >>> > >>> ======================= > >>> > >>> > >>> > >>> org.apache.thrift.transport.TTransportException: > >>> > >>> java.net.SocketException: > >>> > >>> Connection reset > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) > >>> > >>> at > >>> org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204) > >>> > >>> at > >>> > org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092) > >>> > >>> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102) > >>> > >>> at > org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeAlterTableAddParts(DDLSemanticAnalyzer.java:1817) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.analyzeInternal(DDLSemanticAnalyzer.java:297) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243) > >>> > >>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431) > >>> > >>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336) > >>> > >>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909) > >>> > >>> at > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258) > >>> > >>> at > >>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215) > >>> > >>> at > >>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406) > >>> > >>> at > >>> > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341) > >>> > >>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642) > >>> > >>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557) > >>> > >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > >>> > >>> at java.lang.reflect.Method.invoke(Method.java:597) > >>> > >>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156) > >>> > >>> Caused by: java.net.SocketException: Connection reset > >>> > >>> at java.net.SocketInputStream.read(SocketInputStream.java:168) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) > >>> > >>> ... 30 more > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> Example Connection timeout error: > >>> > >>> ========================== > >>> > >>> > >>> > >>> org.apache.thrift.transport.TTransportException: > >>> > >>> java.net.SocketTimeoutException: Read timed out > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129) > >>> > >>> at > >>> org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204) > >>> > >>> at > >>> > org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_set_ugi(ThriftHiveMetastore.java:2136) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.set_ugi(ThriftHiveMetastore.java:2122) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.openStore(HiveMetaStoreClient.java:286) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:197) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:157) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2092) > >>> > >>> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2102) > >>> > >>> at > org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:888) > >>> > >>> at > org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:830) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:954) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:7524) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:243) > >>> > >>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:431) > >>> > >>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336) > >>> > >>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:909) > >>> > >>> at > >>> > >>> > >>> > > >>> > org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258) > >>> > >>> at > >>> org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:215) > >>> > >>> at > >>> > org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406) > >>> > >>> at > >>> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:341) > >>> > >>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:642) > >>> > >>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:557) > >>> > >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > >>> > >>> at java.lang.reflect.Method.invoke(Method.java:597) > >>> > >>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156) > >>> > >>> Caused by: java.net.SocketTimeoutException: Read timed out > >>> > >>> at java.net.SocketInputStream.socketRead0(Native Method) > >>> > >>> at java.net.SocketInputStream.read(SocketInputStream.java:129) > >>> > >>> at > >>> > >>> > >>> > >>> > >>> > > >>> > org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) > >>> > >>> ... 31 more > >>> > >>> > >>> > >> > >>> > >> > >>> > >> > >>> > >> -- > >>> > >> Nitin Pawar > >>> > >> > >>> > > > >>> > > > >>> > > >>> > >>> > >>> > >>> -- > >>> Nitin Pawar > >>> > >> > >> > > >