Hi,
We're seeing pretty regular RPC timeout errors on what appear to be
simple queries. We're running a three-node cluster under light load:
we're averaging 30-40 writes/sec and about 8 reads/sec according to
OpsCenter. The failures don't seem to be related to any changes in
load. A single query repeated from cqlsh (about once a second or so)
will fail roughly one time in ten. I do see an increase in the average
read latency around the time of a failure, though it's unclear whether
that comes from the single failed request or whether other requests
are affected as well. This seems to happen most often on a number of
similarly structured tables. One of them is:
CREATE TABLE psr (
  inst_id bigint,
  prosp_id bigint,
  inter_id bigint,
  avail text,
  comments text,
  email text,
  first_name text,
  last_name text,
  m_id text,
  m_num text,
  phone text,
  info blob,
  status text,
  time timestamp,
  PRIMARY KEY ((inst_id, prosp_id), inter_id)
) WITH CLUSTERING ORDER BY (inter_id DESC) AND
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};
I'm executing the query:
SELECT inter_id FROM "psr"
  WHERE inst_id = 1 AND prosp_id = 127788649174986752
    AND inter_id < 30273563814527279
  LIMIT 10000;
Normally this query returns 32 rows. A total of 413 rows match the
partition key.
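In case it's useful, the repro is essentially a loop like the
following (sketched here with the DataStax Python driver; the contact
point and keyspace name are placeholders for ours):

from time import sleep
from cassandra import OperationTimedOut, ReadTimeout
from cassandra.cluster import Cluster

# Placeholder contact point and keyspace name.
cluster = Cluster(['10.128.32.141'])
session = cluster.connect('our_keyspace')

query = ('SELECT inter_id FROM "psr" '
         'WHERE inst_id = 1 AND prosp_id = 127788649174986752 '
         'AND inter_id < 30273563814527279 LIMIT 10000')

failures = 0
for i in range(100):
    try:
        rows = list(session.execute(query))
        print(i, len(rows), 'rows')
    except (ReadTimeout, OperationTimedOut) as exc:
        failures += 1
        print(i, 'timed out:', exc)
    sleep(1)  # roughly the once-a-second cadence mentioned above

print('timeouts:', failures, 'out of 100 queries')

The loop does nothing beyond issuing the same SELECT about once a
second and counting the timeouts.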
Here is a trace for a successful run:
 activity | timestamp | source | source_elapsed
------------------------------------------------
 execute_cql3_query | 21:52:20,831 | 10.128.32.141 | 0
 Message received from /10.128.32.141 | 21:52:20,826 | 10.128.32.140 | 69
 Executing single-partition query on psr | 21:52:20,827 | 10.128.32.140 | 502
 Acquiring sstable references | 21:52:20,827 | 10.128.32.140 | 517
 Merging memtable tombstones | 21:52:20,827 | 10.128.32.140 | 576
 Key cache hit for sstable 54 | 21:52:20,827 | 10.128.32.140 | 685
 Seeking to partition indexed section in data file | 21:52:20,827 | 10.128.32.140 | 697
 Skipped 1/2 non-slice-intersecting sstables, included 0 due to tombstones | 21:52:20,827 | 10.128.32.140 | 751
 Merging data from memtables and 1 sstables | 21:52:20,827 | 10.128.32.140 | 773
 Read 32 live and 0 tombstoned cells | 21:52:20,828 | 10.128.32.140 | 2055
 Enqueuing response to /10.128.32.141 | 21:52:20,829 | 10.128.32.140 | 2172
 Sending message to /10.128.32.141 | 21:52:20,829 | 10.128.32.140 | 2341
 Parsing SELECT ... | 21:52:20,831 | 10.128.32.141 | 105
 Preparing statement | 21:52:20,831 | 10.128.32.141 | 200
 Sending message to /10.128.32.140 | 21:52:20,831 | 10.128.32.141 | 492
 Message received from /10.128.32.140 | 21:52:20,836 | 10.128.32.141 | 5361
 Processing response from /10.128.32.140 | 21:52:20,836 | 10.128.32.141 | 5534
 Request complete | 21:52:20,837 | 10.128.32.141 | 6013
And here is an unsuccessful run:
 activity | timestamp | source | source_elapsed
------------------------------------------------
 execute_cql3_query | 21:56:19,792 | 10.128.32.141 | 0
 Parsing SELECT ... | 21:56:19,792 | 10.128.32.141 | 69
 Preparing statement | 21:56:19,792 | 10.128.32.141 | 160
 Sending message to /10.128.32.137 | 21:56:19,792 | 10.128.32.141 | 509
 Message received from /10.128.32.141 | 21:56:19,793 | 10.128.32.137 | 57
 Executing single-partition query on psr | 21:56:19,794 | 10.128.32.137 | 412
 Acquiring sstable references | 21:56:19,794 | 10.128.32.137 | 444
 Merging memtable tombstones | 21:56:19,794 | 10.128.32.137 | 486
 Key cache hit for sstable 59 | 21:56:19,794 | 10.128.32.137 | 555
 Seeking to partition indexed section in data file | 21:56:19,794 | 10.128.32.137 | 569
 Key cache hit for sstable 3 | 21:56:19,794 | 10.128.32.137 | 609
 Seeking to partition indexed section in data file | 21:56:19,794 | 10.128.32.137 | 621
 Timed out; received 0 of 1 responses | 21:56:24,792 | 10.128.32.141 | 5000766
 Request complete | 21:56:24,792 | 10.128.32.141 | 5000888
The output of nodetool cfstats for this table is:
SSTable count: 3
Space used (live), bytes: 606268
Space used (total), bytes: 612372
SSTable Compression Ratio: 0.4011882905126778
Number of keys (estimate): 768
Memtable cell count: 156
Memtable data size, bytes: 60451
Memtable switch count: 34
Local read count: 1439
Local read latency: NaN ms
Local write count: 618
Local write latency: NaN ms
Pending tasks: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used, bytes: 1840
Compacted partition minimum bytes: 447
Compacted partition maximum bytes: 20501
Compacted partition mean bytes: 3478
Average live cells per slice (last five minutes): 0.0
Average tombstones per slice (last five minutes): 0.0
Is there a problem with the table structure or query that would be
causing these failures? Should we expect timeouts as a regular
occurrence in operation and be prepared to retry every query as needed?
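If retrying is indeed the expected approach, is wrapping every read in
something like the sketch below the intended pattern? (The retry count
and back-off are arbitrary placeholders, again using the Python
driver.)

from time import sleep
from cassandra import OperationTimedOut, ReadTimeout

def execute_with_retry(session, statement, retries=3, backoff_s=0.1):
    # Retry reads that fail with a timeout; the policy is a placeholder.
    for attempt in range(retries):
        try:
            return session.execute(statement)
        except (ReadTimeout, OperationTimedOut):
            if attempt == retries - 1:
                raise
            sleep(backoff_s * (attempt + 1))  # simple linear back-off

That papers over the failures rather than explaining them, though, so
I'd rather understand the cause first.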
I'd really appreciate any input you could offer on how to improve this.
These timeouts are causing some rather troublesome errors in our
application.
Thank you,
Chap