I've tried several different GC settings, but I'm still getting timeouts.
Using OpenJDK 11 with:
-XX:+UseG1GC
-XX:+ParallelRefProcEnabled
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70
-XX:ParallelGCThreads=24
-XX:ConcGCThreads=24
Machine has 40 cores. Xmx is set to 32G.
13 node cluster.
Any ideas on what else to try?
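One hedged observation on the flags above: G1's ConcGCThreads normally defaults to roughly a quarter of ParallelGCThreads, so pinning both to 24 leaves the concurrent marking threads competing with application threads for the 40 cores. Something closer to the default ratio might be worth a try, e.g.:
-XX:ParallelGCThreads=24
-XX:ConcGCThreads=6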
-Joe
On 2/4/2022 10:39 AM, Joe Obernberger wrote:
Still no go. Oddly, I can use Trino and do a count OK, but with Spark
I get the timeouts. I don't believe tombstones are an issue:
nodetool cfstats doc.doc
Total number of tables: 82
----------------
Keyspace : doc
Read Count: 1514288521
Read Latency: 0.5080819034089475 ms
Write Count: 12716563031
Write Latency: 0.1462260620347646 ms
Pending Flushes: 0
Table: doc
SSTable count: 72
Old SSTable count: 0
Space used (live): 74097778114
Space used (total): 74097778114
Space used by snapshots (total): 0
Off heap memory used (total): 287187173
SSTable Compression Ratio: 0.38644718028460934
Number of partitions (estimate): 94111032
Memtable cell count: 175084
Memtable data size: 36945327
Memtable off heap memory used: 0
Memtable switch count: 677
Local read count: 16237350
Local read latency: 0.639 ms
Local write count: 314822497
Local write latency: 0.061 ms
Pending flushes: 0
Percent repaired: 0.0
Bytes repaired: 0.000KiB
Bytes unrepaired: 164.168GiB
Bytes pending repair: 0.000KiB
Bloom filter false positives: 154552
Bloom filter false ratio: 0.01059
Bloom filter space used: 152765592
Bloom filter off heap memory used: 152765016
Index summary off heap memory used: 48349869
Compression metadata off heap memory used: 86072288
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 943127
Compacted partition mean bytes: 1609
Average live cells per slice (last five minutes): 1108.6270918991
Maximum live cells per slice (last five minutes): 1109
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
Dropped Mutations: 0
Other things to check?
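One more check that might be worth running, as a hedged suggestion: per-table latency and partition-size percentiles, e.g.
nodetool tablehistograms doc doc
which would break the 0.639 ms average read latency above into percentiles and show whether a long tail is hiding behind it.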
-Joe
On 2/3/2022 9:30 PM, manish khandelwal wrote:
It may be the case that you have lots of tombstones in this table, which is
making reads slow and causing timeouts during bulk reads.
On Fri, Feb 4, 2022, 03:23 Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
So it turns out the value after PT is an ISO-8601 duration. I changed
the timeout to 960000 ms, and now I get PT16M, i.e. 16 minutes
(960000/60000). Since I'm still getting timeouts, something else
must be wrong.
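For confirmation, java.time renders the new value the same way; this mirrors the jshell check from Scott's note below:
```
jshell> java.time.Duration.ofMillis(960000)
$1 ==> PT16M
```
Here is the failure with the 16-minute timeout in place: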
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 306 in stage 0.0 failed 4 times, most recent failure: Lost task 306.3 in stage 0.0 (TID 1180) (172.16.100.39 executor 0): com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT16M
    at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
    at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
    at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
    at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
    at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT16M
    at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
    at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
    at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
    at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
    at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
-Joe
On 2/3/2022 3:30 PM, Joe Obernberger wrote:
I did find this:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md
And "spark.cassandra.read.timeoutMS" defaults to 120000, which is the
2 minutes behind the PT2M in the error. Running a test now, and I think
that is it. Thank you Scott.
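For reference, a minimal sketch of raising that timeout at session build time: it reuses the builder from the message quoted below, and assumes the property from the reference doc is picked up via .config() (960000 ms would surface as PT16M):
```java
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
        .builder()
        .appName("SparkCassandraApp")
        .config("spark.cassandra.connection.host", "chaos")
        .config("spark.cassandra.connection.port", "9042")
        // Connector read timeout in milliseconds; the default is 120000 (PT2M).
        .config("spark.cassandra.read.timeoutMS", "960000")
        .master("spark://aether.querymasters.com:8181")
        .getOrCreate();
```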
-Joe
On 2/3/2022 3:19 PM, Joe Obernberger wrote:
Thank you Scott!
I am using the Spark Cassandra Connector. Code:
SparkSession spark = SparkSession
        .builder()
        .appName("SparkCassandraApp")
        .config("spark.cassandra.connection.host", "chaos")
        .config("spark.cassandra.connection.port", "9042")
        .master("spark://aether.querymasters.com:8181")
        .getOrCreate();
Would I set PT2M in there? Like .config("pt2m","300") ?
I'm not familiar with jshell, so I'm not sure where you're
getting that duration from.
Right now, I'm just doing a count:
Dataset<Row> dataset = spark.read().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "doc");
                put("table", "doc");
            }
        }).load();
dataset.count();
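As an aside, continuing from the snippet above, a hedged way to confirm what timeout the session is actually carrying, using Spark's standard runtime-config API (the 120000 fallback is the connector's documented default):
```java
// Read the connector timeout back from the session config;
// returns "120000" (the documented default) if the key was never set.
String timeoutMs = spark.conf().get("spark.cassandra.read.timeoutMS", "120000");
System.out.println("effective read timeout (ms): " + timeoutMs);
```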
Thank you!
-Joe
On 2/3/2022 3:01 PM, C. Scott Andreas wrote:
Hi Joe, it looks like "PT2M" may refer to a timeout value that
could be set by your Spark job's initialization of the client.
I don't see a string matching this in the Cassandra codebase
itself, but I do see that this is parseable as a Duration.
```
jshell> java.time.Duration.parse("PT2M").getSeconds()
$7 ==> 120
```
The server-side log you see is likely an indicator of the
timeout from the server's perspective. You might consider
checking logs from the replicas for dropped reads, query
aborts due to scanning more tombstones than the configured
max, or other conditions indicating overload or an inability
to serve a response.
If you're running a Spark job, I'd recommend using the
DataStax Spark Cassandra Connector, which distributes your
query across executors addressing slices of the token range
that land on replica sets, avoiding the scatter-gather
behavior that can occur when using the Java driver alone.
Cheers,
– Scott
On Feb 3, 2022, at 11:42 AM, Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Hi all - using Cassandra 4.0.1 and a Spark job running against a large
table (~8 billion rows), I'm getting this error on the client side:
Query timed out after PT2M
On the server side I see a lot of messages like:
DEBUG [Native-Transport-Requests-39] 2022-02-03 14:39:56,647 ReadCallback.java:119 - Timed out; received 0 of 1 responses
The same code works on another table in the same Cassandra cluster that
has about 300 million rows, and it completes in about 2 minutes. The
cluster is 13 nodes.
I can't find what PT2M means. Perhaps the table needs a
repair? Other
ideas?
Thank you!
-Joe