Your response is full of information. After reading it I think I may have designed something wrong in my system, so I will try to present what hardware I have and what I am trying to achieve.
*Hardware:*
I have 9 machines; every machine has 10 HDDs for data (not SSDs) and 64 GB of RAM.

*Requirements*
The Cassandra storage is designed for audit data, so the only operation is INSERT. Each event has the following properties: customer, UUID, event type (there are 4 types), date-time and some other properties. The event itself is stored as a protobuf in a blob. There are two types of customers generating events: customers with a small daily volume (up to 100 events) and customers with a large daily volume (up to 100 thousand events). From the customer id alone I cannot tell which type a given customer is.

There are two types of queries I need to run:
1) Select all events for a customer in a date range. The range is small - up to a few days. This is the "audit" query.
2) Select all event UUIDs for one day - this is for the reconciliation process; we need to check that every event was stored in Cassandra.

*Key-spaces*
Each day I write into two keyspaces:
1) One for storing data for the audit query. The table definition was presented in my previous mails.
2) One for reconciliation only - it is a one-day keyspace, and after reconciliation I can safely delete it (a rough sketch of such a table is in the PS at the very bottom of this mail, below the quoted thread).

*Data replication*
We have set the following replication settings:
REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC_A' : 5, 'DC_B' : 3};
which means that every machine in DC_A has all of the data, and each machine in DC_B has 3/4 of the data.

*Disk usage*
When I checked disk usage, not all disks have the same utilization and used space.

*Questions*
1) Is there a way to utilize the HDDs better?
2) Should I write to multiple keyspaces to get better HDD utilization?
3) Are my replication settings correct, or are they too big? (The PS at the bottom of this mail shows what lowering them could look like.)
4) I can easily reduce the write volume just by removing the reconciliation keyspace, but I will still have to find a way to query all UUIDs for one day.

Below the quoted thread I also wrote out, as CQL, how I understand your size tiered compaction / compression suggestion - please correct me if I got it wrong.

I hope I presented enough information; if something is missing, just write. Thanks again for the help.

On Tue, Jan 13, 2015 at 5:35 PM, Eric Stevens <migh...@gmail.com> wrote:

> If you have fallen far behind on compaction, this is a hard situation to recover from. It means that you're writing data faster than your cluster can absorb it. The right path forward depends on a lot of factors, but in general you either need more servers or bigger servers, or else you need to write less data.
>
> Safely adding servers is actually hard in this situation, lots of aggressive compaction produces a result where bootstrapping new nodes (growing your cluster) causes a lot of over-streaming, meaning data that is getting compacted may be streamed multiple times, in the old SSTable, and again in the new post-compaction SSTable, and maybe again in another post-compaction SSTable. For a healthy cluster, it's a trivial amount of overstreaming. For an unhealthy cluster like this, you might not actually ever complete streaming and be able to successfully join the cluster before your target server's disks are full.
>
> If you can afford the space and don't already have things set up this way, disable compression and switch to size tiered compaction (you'll need to keep at least 50% of your disk space free to be safe in size tiered). Also nodetool setcompactionthroughput will let you open the flood gates to try to catch up on compaction quickly (at the cost of read and write performance into the cluster).
>
> If you still can't catch up on compaction, you have a very serious problem.
> You need to either reduce your write volume, or grow your cluster unsafely (disable bootstrapping new nodes) to reduce write pressure on your existing nodes. Either way you should get caught up on compaction before you can safely add new nodes again.
>
> If you grow unsafely, you are effectively electing to discard data. Some of it may be recoverable with a nodetool repair after you're caught up on compaction, but you will almost certainly lose some records.
>
> On Tue, Jan 13, 2015 at 2:22 AM, Ja Sam <ptrstp...@gmail.com> wrote:
>
>> Ad 4) For sure I have a big problem, because pending tasks: 3094.
>>
>> The question is what I should change/monitor. I can present my whole solution design if it helps.
>>
>> On Mon, Jan 12, 2015 at 8:32 PM, Ja Sam <ptrstp...@gmail.com> wrote:
>>
>>> To address your remarks:
>>>
>>> 1) About the 30 sec GC: I know that after some time my cluster had this problem; we added the "magic" flag, but the result will only be visible in ~2 weeks (as I showed in the screenshot on StackOverflow). If you have any idea how to fix/diagnose this problem, I will be very grateful.
>>>
>>> 2) It is probably true, but I don't think I can change it. Our data centers are in different places and the network between them is not perfect. But as far as we have observed, network partitions happen rarely - at most once a week, for about an hour.
>>>
>>> 3) We are trying to do regular (incremental) repairs, but usually they do not finish. Even local repairs have problems finishing.
>>>
>>> 4) I will check it as soon as possible and post it here. If you have any suggestion what else I should check, you are welcome :)
>>>
>>> On Mon, Jan 12, 2015 at 7:28 PM, Eric Stevens <migh...@gmail.com> wrote:
>>>
>>>> If you're getting 30 second GC's, this all by itself could and probably does explain the problem.
>>>>
>>>> If you're writing exclusively to A, and there are frequent partitions between A and B, then A is potentially working a lot harder than B, because it needs to keep track of hinted handoffs to replay to B whenever connectivity is restored. It's also acting as coordinator for writes which need to end up in B eventually. This in turn may be a significant contributing factor to your GC pressure in A.
>>>>
>>>> I'd also grow suspicious of the integrity of B as a reliable backup of A unless you're running repair on a regular basis.
>>>>
>>>> Also, if you have thousands of SSTables, then you're probably falling behind on compaction, check nodetool compactionstats - you should typically have < 5 outstanding tasks (preferably 0-1). If you're not behind on compaction, your sstable_size_in_mb might be a bad value for your use case.
>>>>
>>>> On Mon, Jan 12, 2015 at 7:35 AM, Ja Sam <ptrstp...@gmail.com> wrote:
>>>>
>>>>> *Environment*
>>>>>
>>>>> - Cassandra 2.1.0
>>>>> - 5 nodes in one DC (DC_A), 4 nodes in a second DC (DC_B)
>>>>> - 2500 writes per second; I write only to DC_A with local_quorum
>>>>> - minimal reads (usually none, sometimes a few)
>>>>>
>>>>> *Problem*
>>>>>
>>>>> After a few weeks of running I cannot read any data from my cluster, because I get ReadTimeoutExceptions like the following:
>>>>>
>>>>> ERROR [Thrift:15] 2015-01-07 14:16:21,124 CustomTThreadPoolServer.java:219 - Error occurred during processing of message.
>>>>> com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.
>>>>>
>>>>> To be precise, this is not the only problem in my cluster. The second one was described here: Cassandra GC takes 30 seconds and hangs node <http://stackoverflow.com/questions/27843538/cassandra-gc-takes-30-seconds-and-hangs-node> and I will try to use the fix from CASSANDRA-6541 <http://issues.apache.org/jira/browse/CASSANDRA-6541> as leshkin suggested.
>>>>>
>>>>> *Diagnosis*
>>>>>
>>>>> I tried to use some of the tools presented at http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/ by Jon Haddad and got some strange results.
>>>>>
>>>>> I ran the same query in DC_A and DC_B with tracing enabled. The query is simple:
>>>>>
>>>>> SELECT * FROM X.customer_events WHERE customer='1234567' AND utc_day=16447 AND bucket IN (1,2,3,4,5,6,7,8,9,10);
>>>>>
>>>>> where the table is defined as follows:
>>>>>
>>>>> CREATE TABLE drev_maelstrom.customer_events (customer text, utc_day int, bucket int, event_time bigint, event_id blob, event_type int, event blob,
>>>>> PRIMARY KEY ((customer, utc_day, bucket), event_time, event_id, event_type)[...]
>>>>>
>>>>> Results of the query:
>>>>>
>>>>> 1) In DC_B the query finished in less than 0.22 seconds. In DC_A it took more than 2.5 seconds (~10 times longer). -> the problem is that bucket can be in the range from -128 to 256.
>>>>>
>>>>> 2) In DC_B it checked ~1000 SSTables, with lines like:
>>>>>
>>>>> Bloom filter allows skipping sstable 50372 [SharedPool-Worker-7] | 2015-01-12 13:51:49.467001 | 192.168.71.198 | 4782
>>>>>
>>>>> whereas in DC_A it is:
>>>>>
>>>>> Bloom filter allows skipping sstable 118886 [SharedPool-Worker-5] | 2015-01-12 14:01:39.520001 | 192.168.61.199 | 25527
>>>>>
>>>>> 3) The total number of records in both DCs was the same.
>>>>>
>>>>> *Question*
>>>>>
>>>>> The question is quite simple: how can I speed up DC_A? It is my primary DC; DC_B is mostly for backup, and there are a lot of network partitions between A and B.
>>>>>
>>>>> Maybe I should check something more, but I just don't have an idea what it should be.
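
PS (re the *Key-spaces* section and question 4): this is a rough sketch, in CQL, of how the daily reconciliation data could be kept as an ordinary table instead of a separate one-day keyspace. The table name events_by_day and the reuse of the utc_day/bucket convention are only my assumptions for illustration; only the keyspace name drev_maelstrom comes from the table definition quoted above.

CREATE TABLE drev_maelstrom.events_by_day (
    utc_day  int,     -- same day-number convention as customer_events (assumption)
    bucket   int,     -- spreads one day across many partitions/nodes (assumption)
    event_id blob,    -- the event UUID
    PRIMARY KEY ((utc_day, bucket), event_id)
);

-- The reconciliation read for one day is then one query per bucket, e.g.:
SELECT event_id FROM drev_maelstrom.events_by_day WHERE utc_day = 16447 AND bucket = 1;

Old days could then be expired with a table-level TTL or by deleting the (utc_day, bucket) partitions, rather than dropping a whole keyspace.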
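
PS (re question 3): here is the current replication written out as a full statement, and what lowering it could look like. The numbers 3 and 2 below are only an example of a smaller setting, not something I am sure is right; the keyspace name is again just illustrative.

CREATE KEYSPACE drev_maelstrom
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC_A' : 5, 'DC_B' : 3};

-- Example of a reduced replication factor:
ALTER KEYSPACE drev_maelstrom
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC_A' : 3, 'DC_B' : 2};

As far as I understand, after lowering the replication factor a nodetool cleanup would be needed on each node to actually reclaim disk space for the ranges it no longer owns.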
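
PS (re Eric's suggestion in the quoted mail above): this is how I understand the "disable compression and switch to size tiered compaction" change in CQL for Cassandra 2.1 - please correct me if the syntax or options are wrong for this version.

ALTER TABLE drev_maelstrom.customer_events
  WITH compaction = {'class' : 'SizeTieredCompactionStrategy'}
  AND compression = {'sstable_compression' : ''};   -- empty value disables compression in 2.1, as far as I know

And, if I read the docs correctly, nodetool setcompactionthroughput 0 removes the compaction throughput limit entirely.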