It may be the case that you have lots of tombstones in this table, which is
making reads slow and causing timeouts during bulk reads.
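
One quick way to check is nodetool's table stats; a rough sketch (keyspace
and table names assumed from the rest of the thread):

```
# Hypothetical check: high "tombstones per slice" numbers, relative to the
# tombstone_warn_threshold / tombstone_failure_threshold settings in
# cassandra.yaml, would support the tombstone theory.
nodetool tablestats doc.doc | grep -i tombstone
```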

On Fri, Feb 4, 2022, 03:23 Joe Obernberger <joseph.obernber...@gmail.com>
wrote:

> So it turns out the value after PT is an ISO-8601 duration, where the number
> before the M is minutes.  I changed the timeout to 960000 ms, and now I get
> PT16M (960000 / 60000 = 16 minutes).  Since I'm still getting timeouts,
> something else must be wrong.
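>
> Scott's jshell trick confirms the parse (960 seconds = 16 minutes):
>
> ```
> jshell> java.time.Duration.parse("PT16M").getSeconds()
> $1 ==> 960
> ```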
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 306 in stage 0.0 failed 4 times, most recent
> failure: Lost task 306.3 in stage 0.0 (TID 1180) (172.16.100.39 executor
> 0): com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed
> out after PT16M
>         at
> com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
>         at
> com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
>         at
> com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
>         at
> com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
>         at
> com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.base/java.lang.Thread.run(Thread.java:829)
>
> Driver stacktrace:
>         at
> org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
>         at
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
>         at
> org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
>         at
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>         at
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>         at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>         at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
>         at
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
>         at
> org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
>         at scala.Option.foreach(Option.scala:407)
>         at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
>         at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
>         at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
>         at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query
> timed out after PT16M
>         at
> com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
>         at
> com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
>         at
> com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
>         at
> com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
>         at
> com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>
> -Joe
> On 2/3/2022 3:30 PM, Joe Obernberger wrote:
>
> I did find this:
>
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
>
> And "spark.cassandra.read.timeoutMS" is set to 120000.
>
> Running a test now, and I think that is it.  Thank you Scott.
>
> -Joe
> On 2/3/2022 3:19 PM, Joe Obernberger wrote:
>
> Thank you Scott!
> I am using the Spark Cassandra Connector.  Code:
>
> SparkSession spark = SparkSession
>                 .builder()
>                 .appName("SparkCassandraApp")
>                 .config("spark.cassandra.connection.host", "chaos")
>                 .config("spark.cassandra.connection.port", "9042")
>                 .master("spark://aether.querymasters.com:8181")
>                 .getOrCreate();
>
> Would I set PT2M in there?  Like .config("pt2m","300") ?
> I'm not familiar with jshell, so I'm not sure where you're getting that
> duration from.
>
> Right now, I'm just doing a count:
> Dataset<Row> dataset =
> spark.read().format("org.apache.spark.sql.cassandra")
>                 .options(new HashMap<String, String>() {
>                     {
>                         put("keyspace", "doc");
>                         put("table", "doc");
>                     }
>                 }).load();
>
> dataset.count();
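> // note: count() here triggers a full scan of the ~8-billion-row table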
>
>
> Thank you!
>
> -Joe
> On 2/3/2022 3:01 PM, C. Scott Andreas wrote:
>
> Hi Joe, it looks like "PT2M" may refer to a timeout value that could be
> set by your Spark job's initialization of the client. I don't see a string
> matching this in the Cassandra codebase itself, but I do see that this is
> parseable as a Duration.
>
> ```
> jshell> java.time.Duration.parse("PT2M").getSeconds()
> $7 ==> 120
> ```
>
> The server-side log you see is likely an indicator of the timeout from the
> server's perspective. You might consider checking logs from the replicas
> for dropped reads, query aborts due to scanning more tombstones than the
> configured max, or other conditions indicating overload/inability to serve
> a response.
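>
> For example (hypothetical log path and patterns - adjust for your install):
>
> ```
> # Scan each replica's log for dropped reads and tombstone aborts:
> grep -iE "TombstoneOverwhelming|timed out|dropped" /var/log/cassandra/system.log
> ```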
>
> If you're running a Spark job, I'd recommend using the DataStax Spark
> Cassandra Connector, which distributes your query across executors addressing
> slices of the token range that land on replica sets, avoiding the
> scatter-gather behavior that can occur when using the Java driver alone.
>
> Cheers,
>
> – Scott
>
>
> On Feb 3, 2022, at 11:42 AM, Joe Obernberger
> <joseph.obernber...@gmail.com> wrote:
>
>
> Hi all - I'm using Cassandra 4.0.1 with a Spark job running against a large
> table (~8 billion rows), and I'm getting this error on the client side:
> Query timed out after PT2M
>
> On the server side I see a lot of messages like:
> DEBUG [Native-Transport-Requests-39] 2022-02-03 14:39:56,647
> ReadCallback.java:119 - Timed out; received 0 of 1 responses
>
> The same code works on another table in the same Cassandra cluster that
> has about 300 million rows and completes in about 2 minutes.  The cluster
> is 13 nodes.
>
> I can't find what PT2M means.  Perhaps the table needs a repair? Other
> ideas?
> Thank you!
>
> -Joe
>
