It may be that you have lots of tombstones in this table, which would make reads slow and cause timeouts during bulk reads.
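One way to test the tombstone hypothesis from the client side is to enable tracing on a single read and look for tombstone activity in the trace events. Here is a minimal sketch, assuming Java driver 4.x; the "chaos" contact point comes from the thread below, while the datacenter name, key column, and sample key are placeholders:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.QueryTrace;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;
import java.net.InetSocketAddress;

public class TombstoneTraceCheck {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("chaos", 9042)) // host/port from the thread
                .withLocalDatacenter("datacenter1")                    // placeholder DC name
                .build()) {
            // "id" and the sample key are hypothetical; substitute a real partition key.
            SimpleStatement stmt = SimpleStatement
                    .newInstance("SELECT * FROM doc.doc WHERE id = ? LIMIT 100", "some-id")
                    .setTracing(true);
            ResultSet rs = session.execute(stmt);
            // Fetches the server-side trace (blocking call).
            QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
            trace.getEvents().stream()
                    .map(event -> event.getActivity())
                    .filter(activity -> activity != null
                            && activity.toLowerCase().contains("tombstone"))
                    .forEach(System.out::println);
        }
    }
}
```

On the server side, `nodetool tablestats doc.doc` reports average tombstones per slice for the table, and if tombstone-heavy scans are the cause, lowering the connector's `spark.cassandra.input.fetch.sizeInRows` may help individual pages complete within the timeout.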
On Fri, Feb 4, 2022, 03:23 Joe Obernberger <joseph.obernber...@gmail.com> wrote:

> So it turns out that number after PT is increments of 60 seconds. I
> changed the timeout to 960000, and now I get PT16M (960000/60000). Since
> I'm still getting timeouts, something else must be wrong.
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 306 in stage 0.0 failed 4 times, most recent
> failure: Lost task 306.3 in stage 0.0 (TID 1180) (172.16.100.39 executor
> 0): com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed
> out after PT16M
>   at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
>   at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
>   at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
>   at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
>   at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.base/java.lang.Thread.run(Thread.java:829)
>
> Driver stacktrace:
>   at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
>   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
>   at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
>   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
>   at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
>   at scala.Option.foreach(Option.scala:407)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query
> timed out after PT16M
>   at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
>   at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
>   at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
>   at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
>   at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>
> -Joe
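A note on the PT prefix: it is ISO-8601 duration syntax, which is exactly what java.time.Duration prints, and the M after the T means minutes rather than a count of 60-second increments. PT16M is the configured 960000 ms, and the connector's 120000 ms default renders as the PT2M seen originally. A quick self-contained check:

```java
import java.time.Duration;

public class DurationCheck {
    public static void main(String[] args) {
        // "PT16M" parses as 16 minutes, i.e. the configured 960000 ms.
        Duration parsed = Duration.parse("PT16M");
        System.out.println(parsed.toMillis());                         // 960000
        System.out.println(parsed.equals(Duration.ofMillis(960_000))); // true
        // The 120000 ms default prints as the "PT2M" from the first report.
        System.out.println(Duration.ofMillis(120_000));                // PT2M
    }
}
```

So the driver is honoring the new timeout; the continued failures suggest the replicas are taking longer than 16 minutes to answer those token-range scans, not that the setting was misread.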
> On 2/3/2022 3:30 PM, Joe Obernberger wrote:
>
> > I did find this:
> > https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
> > And "spark.cassandra.read.timeoutMS" is set to 120000.
> >
> > Running a test now, and I think that is it. Thank you Scott.
> >
> > -Joe
>
> On 2/3/2022 3:19 PM, Joe Obernberger wrote:
>
> > Thank you Scott!
> > I am using the spark cassandra connector. Code:
> >
> > SparkSession spark = SparkSession
> >     .builder()
> >     .appName("SparkCassandraApp")
> >     .config("spark.cassandra.connection.host", "chaos")
> >     .config("spark.cassandra.connection.port", "9042")
> >     .master("spark://aether.querymasters.com:8181")
> >     .getOrCreate();
> >
> > Would I set PT2M in there? Like .config("pt2m","300")?
> > I'm not familiar with jshell, so I'm not sure where you're getting that
> > duration from.
> >
> > Right now, I'm just doing a count:
> >
> > Dataset<Row> dataset = spark.read().format("org.apache.spark.sql.cassandra")
> >     .options(new HashMap<String, String>() {
> >         {
> >             put("keyspace", "doc");
> >             put("table", "doc");
> >         }
> >     }).load();
> >
> > dataset.count();
> >
> > Thank you!
> >
> > -Joe
>
> On 2/3/2022 3:01 PM, C. Scott Andreas wrote:
>
> > Hi Joe, it looks like "PT2M" may refer to a timeout value that could be
> > set by your Spark job's initialization of the client. I don't see a string
> > matching this in the Cassandra codebase itself, but I do see that this is
> > parseable as a Duration.
> >
> > ```
> > jshell> java.time.Duration.parse("PT2M").getSeconds()
> > $7 ==> 120
> > ```
> >
> > The server-side log you see is likely an indicator of the timeout from the
> > server's perspective. You might consider checking logs from the replicas
> > for dropped reads, query aborts due to scanning more tombstones than the
> > configured max, or other conditions indicating overload/inability to serve
> > a response.
> >
> > If you're running a Spark job, I'd recommend using the DataStax Spark
> > Cassandra Connector, which distributes your query to executors addressing
> > slices of the token range that will land on replica sets, avoiding the
> > scatter-gather behavior that can occur if using the Java driver alone.
> >
> > Cheers,
> >
> > – Scott
> >
> > On Feb 3, 2022, at 11:42 AM, Joe Obernberger
> > <joseph.obernber...@gmail.com> wrote:
> >
> > > Hi all - using Cassandra 4.0.1 and a spark job running against a large
> > > table (~8 billion rows), I'm getting this error on the client side:
> > > Query timed out after PT2M
> > >
> > > On the server side I see a lot of messages like:
> > > DEBUG [Native-Transport-Requests-39] 2022-02-03 14:39:56,647
> > > ReadCallback.java:119 - Timed out; received 0 of 1 responses
> > >
> > > The same code works on another table in the same Cassandra cluster that
> > > is about 300 million rows and completes in about 2 minutes. The cluster
> > > is 13 nodes.
> > >
> > > I can't find what PT2M means. Perhaps the table needs a repair? Other
> > > ideas?
> > > Thank you!
> > >
> > > -Joe
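For reference, the timeout Joe asked about is not set via a "pt2m" key; per the connector's reference.md linked above, it is spark.cassandra.read.timeoutMS (default 120000 ms, hence PT2M). A minimal sketch of the session setup quoted above with the timeout raised to the 960000 ms value tried in the thread:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CassandraCountWithTimeout {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkCassandraApp")
                .config("spark.cassandra.connection.host", "chaos")
                .config("spark.cassandra.connection.port", "9042")
                // Driver-side read timeout in milliseconds; 960000 = PT16M.
                .config("spark.cassandra.read.timeoutMS", "960000")
                .master("spark://aether.querymasters.com:8181")
                .getOrCreate();

        // Same count as in the thread, using option() instead of an options map.
        Dataset<Row> dataset = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "doc")
                .option("table", "doc")
                .load();

        System.out.println(dataset.count());
    }
}
```

Raising the timeout only buys headroom; if tombstones or overloaded replicas are the underlying cause, the scan will still be slow until that is addressed.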