[ https://issues.apache.org/jira/browse/KUDU-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296783#comment-17296783 ]
ASF subversion and git services commented on KUDU-3213:
-------------------------------------------------------

Commit c167c1dc39d7089c4b1216bc62e423f3e2638479 in kudu's branch refs/heads/master from Andrew Wong
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=c167c1d ]

[java] KUDU-3213: try at different server on TABLET_NOT_RUNNING

Prior to this patch, if a tablet server were quiescing for a prolonged period, scan requests could time out, complaining that the tablet server is quiescing, but without ever retrying the scan at another tablet server. This is because tablet servers will return TABLET_NOT_RUNNING to clients when attempting a scan while quiescing. The behavior in the C++ client is that the location is then blacklisted and the request is retried elsewhere. The behavior in the Java client, though, is that the same location is retried until failure.

This patch addresses this by treating TABLET_NOT_RUNNING errors in the Java client as we would for TABLET_NOT_FOUND, which is actually quite similar to the handling for TABLET_NOT_RUNNING in the C++ client: the location is invalidated for further attempts, and the request is retried elsewhere.

Why not just have quiescing tablet servers return TABLET_NOT_FOUND, then? TABLET_NOT_FOUND errors in the C++ client actually have some behavior not present in the Java client: a tablet whose location is invalidated with TABLET_NOT_FOUND in the C++ client will be required to be looked up again, requiring a round trip to the master. This behavior doesn't exist in the Java client, so I thought it easiest to piggyback on TABLET_NOT_FOUND handling for now.
Change-Id: I38ac84a52676ff361fa1ba996665b338d1bbfba1
Reviewed-on: http://gerrit.cloudera.org:8080/17124
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <aser...@cloudera.com>

> Java client should attempt a different tablet server when retrying during tserver quiescing
> -------------------------------------------------------------------------------------------
>
> Key: KUDU-3213
> URL: https://issues.apache.org/jira/browse/KUDU-3213
> Project: Kudu
> Issue Type: Bug
> Components: java
> Reporter: Andrew Wong
> Priority: Major
>
> One of our clusters ran into the following error message when leaving a tablet server quiesced for an extended period of time:
> {code:java}
> ERROR Runner: Pipeline exception occurred: org.apache.spark.SparkException: > Job aborted due to stage failure: Task 1 in stage 6.0 failed 4 times, most > recent failure: Lost task 1.3 in stage 6.0 (TID 1922, > tserver-018.edh.company.com, executor 58): > org.apache.kudu.client.NonRecoverableException: cannot complete before > timeout: ScanRequest(scannerId=null, tablet=9e17b554f85f4a7f855771d8b5c913f5, > attempt=24, KuduRpc(method=Scan, tablet=9e17b554f85f4a7f855771d8b5c913f5, > attempt=24, DeadlineTracker(timeout=30000, elapsed=27988), Traces: [0ms] > refreshing cache from master, [1ms] Sub RPC GetTableLocations: sending RPC to > server master-name-003.edh.company.com:7051, [12ms] Sub RPC > GetTableLocations: received response from server > master-name-003.edh.company.com:7051: OK, [22ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [116ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [117ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [126ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [129ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [129ms] received response > from server
e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [146ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [149ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [149ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [166ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [168ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [168ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [206ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [209ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [209ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [266ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [268ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [268ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [306ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [308ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [308ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [545ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [548ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [548ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [865ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [868ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), 
[868ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [1266ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [1269ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [1269ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [2626ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [2628ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [2628ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [4746ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [4749ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [4749ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [8206ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [8209ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [8209ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [8626ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [8629ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [8629ms] received response > from server e1a4405443d845249b5ed15c8e882211: Service unavailable: Tablet > server is quiescing (error 0), [11746ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [11750ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [11750ms] received > response from server e1a4405443d845249b5ed15c8e882211: Service unavailable: > Tablet server is quiescing (error 0), [14006ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [14008ms] delaying RPC due 
to: Service > unavailable: Tablet server is quiescing (error 0), [14008ms] received > response from server e1a4405443d845249b5ed15c8e882211: Service unavailable: > Tablet server is quiescing (error 0), [16366ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [16369ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [16369ms] received > response from server e1a4405443d845249b5ed15c8e882211: Service unavailable: > Tablet server is quiescing (error 0), [19786ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [19788ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [19788ms] received > response from server e1a4405443d845249b5ed15c8e882211: Service unavailable: > Tablet server is quiescing (error 0), [20106ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [20108ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [20108ms] received > response from server e1a4405443d845249b5ed15c8e882211: Service unavailable: > Tablet server is quiescing (error 0), [22366ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [22369ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [22369ms] received > response from server e1a4405443d845249b5ed15c8e882211: Service unavailable: > Tablet server is quiescing (error 0), [23946ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [23948ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [23949ms] received > response from server e1a4405443d845249b5ed15c8e882211: Service unavailable: > Tablet server is quiescing (error 0), [27266ms] sending RPC to server > e1a4405443d845249b5ed15c8e882211, [27269ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0), [27269ms] received > response from server e1a4405443d845249b5ed15c8e882211: Service unavailable: > Tablet server is quiescing (error 0), [27986ms] 
sending RPC to server > e1a4405443d845249b5ed15c8e882211, [27988ms] delaying RPC due to: Service > unavailable: Tablet server is quiescing (error 0))) at > org.apache.kudu.client.KuduException.transformException(KuduException.java:110) > at > org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:402) > at org.apache.kudu.client.KuduScanner.nextRows(KuduScanner.java:57) at > org.apache.kudu.spark.kudu.RowIterator.hasNext(KuduRDD.scala:153) at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at >
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage11.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409){code}
> It seems like the handling of the {{TABLET_NOT_RUNNING}} code differs between the [Java|https://github.com/apache/kudu/blob/0dc3e9e0a7306ce9b618158b8a20af9b10f4a482/java/kudu-client/src/main/java/org/apache/kudu/client/RpcProxy.java#L347] and [C++|https://github.com/apache/kudu/blob/36a21d6ee4de8828417d7884cf87cd5e2ba15a21/src/kudu/client/scanner-internal.cc#L162] clients. In the C++ client, the tserver is blacklisted and another tserver is attempted. In the Java client, the request appears to simply be retried at the same server, as it would be for a {{SERVICE_UNAVAILABLE}} error.
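
For illustration only, the difference described above can be sketched as a toy retry loop. This is not the Kudu client code: the class QuiescingRetrySketch, the Response enum, the scan() helper, and the invalidateOnNotRunning flag are all hypothetical names made up for this sketch. Backoff, deadlines, and re-resolving locations from the master are left out. With the flag off, a TABLET_NOT_RUNNING reply is retried at the same location until the attempts run out, as in the trace above. With the flag on, the location is invalidated the same way as for TABLET_NOT_FOUND and the next replica is tried.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch; none of these names come from the Kudu Java client. */
class QuiescingRetrySketch {
  /** Simplified stand-ins for the server responses discussed in this issue. */
  enum Response { OK, SERVICE_UNAVAILABLE, TABLET_NOT_RUNNING, TABLET_NOT_FOUND }

  /**
   * Tries a scan against the replicas of one tablet.
   * Returns the server that served the scan, or null if no attempt succeeded.
   */
  static String scan(List<String> replicas,
                     Function<String, Response> sendScan,
                     int maxAttempts,
                     boolean invalidateOnNotRunning,
                     List<String> trace) {
    Set<String> invalidated = new HashSet<>();
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      // Pick the first replica whose location has not been invalidated.
      String server = null;
      for (String candidate : replicas) {
        if (!invalidated.contains(candidate)) {
          server = candidate;
          break;
        }
      }
      if (server == null) {
        return null; // no usable location left in this simplified model
      }
      Response response = sendScan.apply(server);
      trace.add("attempt " + attempt + " -> " + server + ": " + response);
      switch (response) {
        case OK:
          return server;
        case TABLET_NOT_FOUND:
          invalidated.add(server); // location is skipped on later attempts
          break;
        case TABLET_NOT_RUNNING:
          if (invalidateOnNotRunning) {
            invalidated.add(server); // treated like TABLET_NOT_FOUND
          }
          break; // otherwise the same location is picked again
        case SERVICE_UNAVAILABLE:
          break; // same location is retried (a real client would back off)
      }
    }
    return null;
  }

  public static void main(String[] args) {
    List<String> replicas = Arrays.asList("ts-quiescing", "ts-healthy");
    // Toy cluster: the quiescing server rejects scans, the other one serves them.
    Function<String, Response> cluster = s ->
        s.equals("ts-quiescing") ? Response.TABLET_NOT_RUNNING : Response.OK;

    List<String> before = new ArrayList<>();
    List<String> after = new ArrayList<>();
    System.out.println("retry same server:  " + scan(replicas, cluster, 5, false, before) + " " + before);
    System.out.println("invalidate + retry: " + scan(replicas, cluster, 5, true, after) + " " + after);
  }
}
```

In this toy model the first call exhausts all five attempts on the quiescing server, while the second succeeds on its second attempt at the healthy replica.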