Re: Query timed out after PT2M

2022-02-04 Thread Joe Obernberger
Still no go.  Oddly, I can use Trino and do a count OK, but with Spark I get the timeouts.  I don't believe tombstones are an issue:


nodetool cfstats doc.doc
Total number of tables: 82

Keyspace : doc
    Read Count: 1514288521
    Read Latency: 0.5080819034089475 ms
    Write Count: 12716563031
    Write Latency: 0.1462260620347646 ms
    Pending Flushes: 0
    Table: doc
    SSTable count: 72
    Old SSTable count: 0
    Space used (live): 74097778114
    Space used (total): 74097778114
    Space used by snapshots (total): 0
    Off heap memory used (total): 287187173
    SSTable Compression Ratio: 0.38644718028460934
    Number of partitions (estimate): 94111032
    Memtable cell count: 175084
    Memtable data size: 36945327
    Memtable off heap memory used: 0
    Memtable switch count: 677
    Local read count: 16237350
    Local read latency: 0.639 ms
    Local write count: 314822497
    Local write latency: 0.061 ms
    Pending flushes: 0
    Percent repaired: 0.0
    Bytes repaired: 0.000KiB
    Bytes unrepaired: 164.168GiB
    Bytes pending repair: 0.000KiB
    Bloom filter false positives: 154552
    Bloom filter false ratio: 0.01059
    Bloom filter space used: 152765592
    Bloom filter off heap memory used: 152765016
    Index summary off heap memory used: 48349869
    Compression metadata off heap memory used: 86072288
    Compacted partition minimum bytes: 104
    Compacted partition maximum bytes: 943127
    Compacted partition mean bytes: 1609
    Average live cells per slice (last five minutes): 1108.6270918991
    Maximum live cells per slice (last five minutes): 1109
    Average tombstones per slice (last five minutes): 1.0
    Maximum tombstones per slice (last five minutes): 1
    Dropped Mutations: 0

Other things to check?
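
For reference, the Spark-side count that times out is presumably something along these lines: a minimal sketch using the spark-cassandra-connector DataFrame API, with a hypothetical contact point and the keyspace/table names taken from the cfstats output above.

import org.apache.spark.sql.SparkSession

object CountDoc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("count-doc")
      // hypothetical contact point; replace with an actual Cassandra node
      .config("spark.cassandra.connection.host", "cassandra-node-1")
      .getOrCreate()

    // Full-table scan of doc.doc through the connector's DataFrame source
    val docs = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "doc", "table" -> "doc"))
      .load()

    // This count is the stage that fails with DriverTimeoutException
    println(docs.count())
    spark.stop()
  }
}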

-Joe

On 2/3/2022 9:30 PM, manish khandelwal wrote:
It may be the case that you have lots of tombstones in this table, which is making reads slow and causing timeouts during bulk reads.


On Fri, Feb 4, 2022, 03:23 Joe Obernberger wrote:


So it turns out that the number after PT is in increments of 60 seconds.
I changed the timeout to 960, and now I get PT16M (960/60).  Since I'm
still getting timeouts, something else must be wrong.
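
For context, the value after PT is an ISO-8601 duration string (what java.time.Duration prints), so PT2M is 2 minutes and PT16M is 16 minutes; a quick, purely illustrative check in Scala:

import java.time.Duration

object PtCheck extends App {
  // "PT2M" / "PT16M" are ISO-8601 duration strings; the driver prints its
  // configured request timeout in this form when a query times out
  println(Duration.parse("PT2M").getSeconds)   // 120
  println(Duration.parse("PT16M").getSeconds)  // 960
}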

Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Task 306 in stage 0.0 failed 4
times, most recent failure: Lost task 306.3 in stage 0.0 (TID
1180) (172.16.100.39 executor 0):
com.datastax.oss.driver.api.core.DriverTimeoutException: Query
timed out after PT16M
    at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:206)
    at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
    at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
    at com.datastax.oss.driver.shaded.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
    at com.datastax.oss.driver.shaded.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
    at
 

Re: Query timed out after PT2M

2022-02-04 Thread Joe Obernberger

I've tried several different GC settings - but still getting timeouts.
Using openJDK 11 with:
-XX:+UseG1GC
-XX:+ParallelRefProcEnabled
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70
-XX:ParallelGCThreads=24
-XX:ConcGCThreads=24

Machine has 40 cores; Xmx is set to 32G.
13-node cluster.

Any ideas on what else to try?
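
For reference, with Cassandra 4.x these settings typically live in the options files under conf/ (the G1 flags in jvm11-server.options when running on JDK 11, the heap size in jvm-server.options); a sketch of the relevant lines, assuming that layout:

# conf/jvm-server.options (heap; Xms matching Xmx is the usual recommendation)
-Xms32G
-Xmx32G

# conf/jvm11-server.options (JDK 11 GC settings)
-XX:+UseG1GC
-XX:+ParallelRefProcEnabled
-XX:G1RSetUpdatingPauseTimePercent=5
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=70
-XX:ParallelGCThreads=24
-XX:ConcGCThreads=24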

-Joe
