[ https://issues.apache.org/jira/browse/HIVE-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315109#comment-14315109 ]
Manish Malhotra commented on HIVE-9469:
---------------------------------------

Continuing on the same thread. Below is the work I did on load testing and on fixing some of the issues. It would be great if somebody could review it and point out anything I'm missing. I still see the SocketTimeoutException from time to time, but the ETL jobs are not failing.

I am currently running a load test with the following commands, issued through the Hive Client APIs:

a. Create Partition - 10 threads
b. ListPartition - 30 threads
c. Show tables - 100 threads

Load on the server was around 1200 requests per minute, and for this test Thrift Server + MySQL look good.

The tuning and findings are:

Thrift Server

1. JVM tuning (JVM profiling showed that, with default settings, Full GCs were happening too frequently):
   Young Generation GC Algo: Parallel
   Old Generation GC Algo: CMS
   Max_Heap: 11 GB
   SurvivorRatio: 6
   Graph before optimization: attaching as separate files.
   Graph after optimization: attaching as separate files.

2. Database connection pooling: Thrift Server uses the DataNucleus framework for DB operations, and DataNucleus uses DBCP as the connection pool. The default DBCP config is maxConnections = 10. I changed it to 30, as that was the basic bottleneck to serving more requests.

Database

1. innodb_buffer = 8 GB, and tmp_table_space, max_heap_space = 256 MB.

The other problem I unearthed involved one Hive table that had more than 1 million rows. In PROD, ListPartition was being called on this table, and each call took a long time to get a response from the DB because there are too many rows in the PARTITION table. Each such call blocked a thread in the Thrift Server and kept holding one of the DB connections. Eventually the server reached a state where all the DB connections were in use, and new requests could not get a DB connection and started timing out. This problem was ultimately what made our Hive queries fail and restart.
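To make the exhaustion scenario above concrete, here is a minimal sketch of how slow requests that hold connections can starve new requests of a fixed-size pool. It is not Hive, DataNucleus, or DBCP code: a `java.util.concurrent.Semaphore` stands in for the connection pool, the class and method names are hypothetical, and the 100 ms wait is an arbitrary value chosen for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class PoolExhaustionSketch {

    /**
     * Hypothetical helper: starts `slowRequests` workers that each take one
     * permit (a stand-in for a DB connection) and hold it, then reports
     * whether a new request can get a permit within `waitMillis`.
     */
    public static boolean newRequestGetsConnection(int poolSize, int slowRequests, long waitMillis)
            throws InterruptedException {
        Semaphore pool = new Semaphore(poolSize);            // stand-in for the connection pool
        CountDownLatch started = new CountDownLatch(slowRequests);
        CountDownLatch release = new CountDownLatch(1);
        Thread[] workers = new Thread[slowRequests];

        for (int i = 0; i < slowRequests; i++) {
            workers[i] = new Thread(() -> {
                boolean held = pool.tryAcquire();            // slow query takes a connection if one is free
                started.countDown();
                try {
                    release.await();                         // and keeps it while the query is running
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    if (held) {
                        pool.release();
                    }
                }
            });
            workers[i].start();
        }

        started.await();                                     // every slow request now holds its connection
        boolean ok = pool.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
        if (ok) {
            pool.release();
        }

        release.countDown();                                 // let the slow requests finish
        for (Thread t : workers) {
            t.join();
        }
        return ok;
    }

    public static void main(String[] args) throws InterruptedException {
        // Pool of 30 with 10 slow requests in flight: connections are still free.
        System.out.println(newRequestGetsConnection(30, 10, 100));
        // Pool of 30 with 30 slow requests in flight: the new request times out.
        System.out.println(newRequestGetsConnection(30, 30, 100));
    }
}
```

The point of the sketch is that raising the pool size only moves the threshold: once the number of long-running calls reaches the pool size, every further request waits until it times out, whatever the pool size is.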
After this problem was solved, the throughput of the Thrift Server increased a lot, and the failure rate of Hive jobs dropped a lot.

So please let me know whether these changes, and the fix for the ListPartition problem on the big table, look good, or whether there are other things we should take care of.

Regards,
Manish

---------------------------------------------------------
Following are the details of the PROD Infrastructure
---------------------------------------------------------

Load = 500 req/min.

Exception:
"org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out"

We are using MySQL as the metastore database, which is accessed through the Thrift server. The flow is like this:

Oozie --> Hive Action --> ELB (AWS) --> Hive Thrift (2 servers) --> MySQL (Master) --> MySQL (Slave)

Software versions:
Hive version: 0.10.0
Hadoop: 1.2.1

I found one related JIRA: https://issues.apache.org/jira/browse/HCATALOG-541
That JIRA reports an OOM error in the Hive Thrift Server, but in my case I didn't see any OOM error.
Regards,
Manish

Full Exception Stack: (The exception comes when the server is loaded and new requests are timing out)

    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:412)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:399)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:736)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:74)
    at $Proxy7.getDatabase(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1110)
    at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1099)
    at org.apache.hadoop.hive.ql.exec.DDLTask.showTables(DDLTask.java:2206)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:334)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1336)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1122)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:935)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:347)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:706)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:150)
    at java.net.SocketInputStream.read(SocketInputStream.java:121)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
    ... 34 more
2015-01-20 22:44:12,978 ERROR exec.Task (SessionState.java:printError(401)) - FAILED: Error in metadata: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
    at org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1114)
    at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1099)
    at org.apache.hadoop.hive.ql.exec.DDLTask.showTables(DDLTask.java:2206)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:334)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1336)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1122)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:935)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)

> Hive Thrift Server throws Socket Timeout Exception: Read time out
> -----------------------------------------------------------------
>
>                 Key: HIVE-9469
>                 URL: https://issues.apache.org/jira/browse/HIVE-9469
>             Project: Hive
>          Issue Type: Bug
>          Components: Metastore
>    Affects Versions: 0.10.0
>         Environment: 4 core cpu, 15gb memory. 2 thrift server behind load balancer
>            Reporter: Manish Malhotra
>
> Hi All,
> Please review the following problem, I also posted same in the hive-user group, but didnt got any response yet.
> This is happening quite frequently in our environment.
> So, it would be great if somebody can see and advise.
> I'm using Hive Thrift Server in Production which at peak handles around 500 req/min.
> After certain point the Hive Thrift Server is going into the no response mode and throws
> Following exception
> "org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out"
> As the metastore we are using MySQL, that is being used by Thrift server.
> The design / architecture is like this:
> Oozie --> Hive Action --> ELB (AWS) --> Hive Thrift (2 servers) --> MySQL (Master) --> MySQL (Slave).
> Software versions:
> Hive version : 0.10.0
> Hadoop: 1.2.1
> Looks like when the load is beyond some threshold for certain operations it is having problem in responding.
> As the hive jobs sometimes fails because of this issue, we also have a auto-restart check to see if the Thrift server is not responding, it stops / kills and restart the service.
> Other tuning done:
> Thrift Server:
> Given 11gb heap, and configured CMS GC algo.
> MySQL:
> Tuned innodb_buffer, tmp_table and max_heap parameters.
> So, can somebody please help to understand, what could be the root cause for this or somebody faced the similar issue.
> I found one related JIRA: https://issues.apache.org/jira/browse/HCATALOG-541
> But this JIRA shows that Hive Thrift Server shows OOM error, but in my case I didnt see any OOM error in my case.
> Regards,
> Manish