Hi everyone, I am seeing occasional read timeouts during multi-row queries, but I'm having difficulty reproducing them or understanding what the problem is.
First, some background: our team wrote a custom MapReduce InputFormat that looks pretty similar to the DataStax InputFormat, except that it allows queries that touch multiple CQL tables with the same PRIMARY KEY format (it then assembles the results from the different tables for a given primary key before handing them back to the user in the RecordReader).

During a large batch job on a cluster, and during some integration tests, we see errors like the following:

com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency ONE (1 responses were required but only 0 replica responded)

Our queries look like this:

SELECT token(eid_component), eid_component, lg, family, qualifier, version, value
FROM "kiji_it0"."t_foo"
WHERE lg=? AND family=? AND qualifier=?
  AND token(eid_component) >= ? AND token(eid_component) <= ?
ALLOW FILTERING;

Our tables look like the following:

CREATE TABLE "kiji_it0"."t_foo" (
    eid_component varchar,
    lg varchar,
    family blob,
    qualifier blob,
    version bigint,
    value blob,
    PRIMARY KEY ((eid_component), lg, family, qualifier, version)
) WITH CLUSTERING ORDER BY (lg ASC, family ASC, qualifier ASC, version DESC);

with an additional secondary index on the "lg" column (the lg column is *extremely* low cardinality).

(FWIW, I realize that ALLOW FILTERING is potentially a Very Bad Idea, but we are building a framework on top of Cassandra and MapReduce that lets our users occasionally make queries like this. We don't really mind taking a performance hit, since these are batch jobs. We are considering eventually supporting some automatic denormalization, but have not done so yet.)

If I remove the WHERE clauses from the query above, the errors go away. I think I understand the problem: some rows hold huge amounts of data that the query has to scan over, and occasionally those scans take long enough to hit the timeout.

I have a couple of questions:

1. What parameters in my code or in the Cassandra cluster do I need to adjust to get rid of these timeouts? Our table layout is designed so that its real-time performance should be pretty good, so I don't mind if the batch queries are a little slow. Do I need to change the read_request_timeout_in_ms parameter? Or something else?

2. I have tried to write a test that reproduces this problem, but so far I have been unable to. Any suggestions on how to do this? I created a table similar to the one described above and filled some rows with a huge amount of data, to increase the amount of space the query would have to skip over. I also tried reducing read_request_timeout_in_ms from 5000 ms to 50 ms, and still no dice. (See the sketch of my attempt below.)

Let me know if anyone has any thoughts or suggestions. At a minimum, I'd like to be able to reproduce these read timeout errors in some integration tests.

Thanks!

Best regards,
Clint
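P.S. In case it's useful, my (so far unsuccessful) reproduction attempt looks roughly like the sketch below. It uses the Java driver directly and assumes the kiji_it0.t_foo table above already exists; the entity/family names, row counts, and 64 KB cell size are just placeholders I picked for the test, not our real data.

import java.nio.ByteBuffer;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class WideRowTimeoutRepro {
  public static void main(String[] args) {
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    Session session = cluster.connect("kiji_it0");

    // Pile a lot of data under a few partition keys so that the filtered
    // token-range scan has plenty of non-matching cells to skip over.
    PreparedStatement insert = session.prepare(
        "INSERT INTO t_foo (eid_component, lg, family, qualifier, version, value) "
            + "VALUES (?, ?, ?, ?, ?, ?)");
    ByteBuffer value = ByteBuffer.wrap(new byte[64 * 1024]);  // 64 KB per cell
    for (int entity = 0; entity < 10; entity++) {
      for (int cell = 0; cell < 50000; cell++) {
        session.execute(insert.bind(
            "entity-" + entity,
            "layout",
            ByteBuffer.wrap(("family-" + cell).getBytes()),
            ByteBuffer.wrap("qualifier".getBytes()),
            (long) cell,
            value.duplicate()));
      }
    }

    // Same shape of query as the one that times out in the MapReduce job:
    // filter on a family/qualifier that matches almost nothing, over the
    // full token range (Murmur3 tokens are bigints).
    PreparedStatement scan = session.prepare(
        "SELECT token(eid_component), eid_component, lg, family, qualifier, version, value "
            + "FROM t_foo WHERE lg=? AND family=? AND qualifier=? "
            + "AND token(eid_component) >= ? AND token(eid_component) <= ? ALLOW FILTERING");
    session.execute(scan.bind(
        "layout",
        ByteBuffer.wrap("no-such-family".getBytes()),
        ByteBuffer.wrap("no-such-qualifier".getBytes()),
        Long.MIN_VALUE,
        Long.MAX_VALUE));

    cluster.close();
  }
}

P.P.S. On question 1: read_request_timeout_in_ms (and the separate range_request_timeout_in_ms) live in cassandra.yaml, so they are server-side; I'm not sure which of the two applies to these token-range scans. The only client-side timeout I can find in the Java driver is the socket read timeout. If part of the answer is raising something on the client, I assume it would be set roughly like this (just a guess on my part; the 60-second value is arbitrary):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SocketOptions;

// Hypothetical: raise the driver's client-side socket read timeout to 60 s.
Cluster cluster = Cluster.builder()
    .addContactPoint("127.0.0.1")
    .withSocketOptions(new SocketOptions().setReadTimeoutMillis(60000))
    .build();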