Re: clarification on 100k tombstone limit in indexes

Ian Rose Tue, 12 Aug 2014 07:38:49 -0700

Makes sense - thanks again!


On Tue, Aug 12, 2014 at 9:45 AM, DuyHai Doan <doanduy...@gmail.com> wrote:

> Hello Ian
>
> "So that way each index entry *will* have quite a few entries and the
> index as a whole won't grow too big.  Is my thinking correct here?" --> In
> this case yes. Do not forget that for each date value, there will be 1
> corresponding index value + 10 updates. If you have an approximate count
> for "a few entries", a quick maths should give you an idea about how
> "large" the index partition is
>
> "I had considered an approach like this but my concern is that for any
> given minute *all* of the updates will be handled by a single node,
> right?" --> If you time resolution is a minute, yes it will be a problem.
> And depending on the insert rate, it can become a quickly a bottle neck
> during this minute.
>
>  The manual index approach suffers a lot from bottleneck issue for heavy
> workload, that's the main reason they implement a distributed secondary
> index. There is no free lunch though. What you gain in term of control and
> tuning with the manual index, you loose on the load distribution side.
>
>
>
>
> On Mon, Aug 11, 2014 at 11:17 PM, Ian Rose <ianr...@fullstory.com> wrote:
>
>> Hi DuyHai,
>>
>> Thanks for the detailed response!  A few responses below:
>>
>> "On a side node, your usage of secondary index is not the best one.
>> Indeed, indexing the update date will lead to a situation where for one
>> date, you'll mostly have one or a few matching items (assuming that the
>> update date resolution is small enough and update rate is not intense)."
>> --> I should have mentioned this original (slipped my mind) but to deal
>> specifically with this problem I had planned to use a timestamp with a
>> resolution of 1 minute (like your minute_bucket).  So that way each index
>> entry *will* have quite a few entries and the index as a whole won't
>> grow too big.  Is my thinking correct here?
>>
>> "You better off create a manuel reverse-index to track modification
>> date, something like this"  --> I had considered an approach like this but
>> my concern is that for any given minute *all* of the updates will be
>> handled by a single node, right?  For example, if the minute_bucket is 2739
>> then for that one minute, every single item update will flow to the node at
>> HASH(2739).  Assuming I am thinking about that right, that seemed like a
>> potential scaling bottleneck, which scared me off that approach.
>>
>> Cheers,
>> Ian
>>
>>
>>
>>
>> On Sun, Aug 10, 2014 at 5:20 PM, DuyHai Doan <doanduy...@gmail.com>
>> wrote:
>>
>>> Hello Ian
>>>
>>> "It sounds like this 100k limit is, indeed, a "global" limit as opposed
>>> to a per-row limit" -->The threshold applies to each "REQUEST", not
>>> partition or globally.
>>>
>>> The threshold does not apply to a partition (physical row) simply
>>> because in one request you can fetch data from many partitions (multi get
>>> slice). There was a JIRA about this here:
>>> https://issues.apache.org/jira/browse/CASSANDRA-6865
>>>
>>> "Are these tombstones ever "GCed" out of the index?" --> Yes they are,
>>> during compactions of the index column family.
>>>
>>> "How frequently?" --> That's the real pain. Indeed you do not have any
>>> control on the tuning of secondary index CF compaction. As far as I know,
>>> the compaction settings (strategy, min/max thresholds...) inherits from the
>>> one of the base table
>>>
>>> Now, by looking very fast into your data model, it seems that you have a
>>> skinny partition patter. Since you mentioned that the date is updated only
>>> 10 times max, you should not run into the tombstonne threshold issue.
>>>
>>> On a side node, your usage of secondary index is not the best one.
>>> Indeed, indexing the update date will lead to a situation where for one
>>> date, you'll mostly have one or a few matching items (assuming that the
>>> update date resolution is small enough and update rate is not intense). It
>>> is the high-cardinality scenario to be avoided (
>>> http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html).
>>> Plus, the query on the index (find all items where last_updated < [now - 30
>>> minutes]) makes things worse since it is not an exact match but inequality.
>>>
>>>  You better off create a manuel reverse-index to track modification
>>> date, something like this:
>>>
>>> CREATE TABLE last_updated_item (
>>>     minute_bucket int, // format YYYYMMDDHHmm
>>>     last_update_date timestamp,
>>>     item_id ascii,
>>>     PRIMARY KEY(minute_bucket, last_update_date)
>>> );
>>>
>>>  The last_update_date column is quite self-explanatory. The
>>> minute_bucket is trickier. The idea is to split ranges on 30 minutes into
>>> buckets. 00:00 to 00:30 is bucket 1, 00:30 to 01:00 is bucket 2 and so on.
>>> For a whole day, you'd have 48 buckets. We need to put data into buckets to
>>> avoid ultra wide rows since you mentioned that there are 10 items (so 10
>>> updates) / sec. Of course, 30 mins is just an exemple, you can tune it down
>>> to a window of 5 minutes or 1 minute, depending on the insertion rate.
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Aug 10, 2014 at 10:02 PM, Ian Rose <ianr...@fullstory.com>
>>> wrote:
>>>
>>>> Hi Mark -
>>>>
>>>> Thanks for the clarification but as I'm not too familiar with the nuts
>>>> & bolts of Cassandra I'm not sure how to apply that info to my current
>>>> situation.  It sounds like this 100k limit is, indeed, a "global" limit as
>>>> opposed to a per-row limit.  Are these tombstones ever "GCed" out of the
>>>> index?  How frequently?  If not, then it seems like *any* index is at risk
>>>> of reaching this tipping point; it's just that indexes on frequently
>>>> updated columns will reach this pointer faster the indexes on rarely
>>>> updated columns.
>>>>
>>>> Basically I'm trying to get some kind of sense for what "frequently
>>>> updated
>>>> <http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html>"
>>>> means quantitatively.  As written, the docs make it sound dangerous to
>>>> create an index on a column that is *ever* deleted or updated since
>>>> there is no sense of how frequent is "too frequent".
>>>>
>>>> Cheers,
>>>> Ian
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Aug 10, 2014 at 3:02 PM, Mark Reddy <mark.re...@boxever.com>
>>>> wrote:
>>>>
>>>>> Hi Ian,
>>>>>
>>>>> The issues here, which relates to normal and index column families, is
>>>>> scanning over a large number of tombstones can cause Cassandra to fall 
>>>>> over
>>>>> due to increased GC pressure. This pressure is caused because tombstones
>>>>> will create DeletedColumn objects which consume heap. Also
>>>>> these DeletedColumn objects will have to be serialized and sent back to 
>>>>> the
>>>>> coordinator, thus increasing your response times. Take for example a row
>>>>> that does deletes and you query it with a limit of 100. In a worst case
>>>>> scenario you could end up reading say 50k tombstones to reach the 100
>>>>> 'live' column limit, all of which has to be put on heap and then sent over
>>>>> the wire to the coordinator. This would be considered a Cassandra
>>>>> anti-pattern.[1]
>>>>>
>>>>> With that in mind there was a debug warning added to 1.2 to inform the
>>>>> user when they were querying a row with 1000 tombstones [2]. Then in 2.0
>>>>> the action was taken to drop requests reaching 100k tombstones[3] rather
>>>>> than just printing out a warning. This is a safety measure, as it is not
>>>>> advised to perform such a query and is a result of most people 'doing it
>>>>> wrong'.
>>>>>
>>>>> For those people who understand the risk of scanning over large
>>>>> numbers of tombstones there is a configuration option in the 
>>>>> cassandra.yaml
>>>>> to increase this threshold, tombstone_failure_threshold.[4]
>>>>>
>>>>>
>>>>> Mark
>>>>>
>>>>> [1]
>>>>> http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
>>>>> [2] https://issues.apache.org/jira/browse/CASSANDRA-6042
>>>>> [3] https://issues.apache.org/jira/browse/CASSANDRA-6117
>>>>> [4]
>>>>> https://github.com/jbellis/cassandra/blob/4ac18ae805d28d8f4cb44b42e2244bfa6d2875e1/conf/cassandra.yaml#L407-L417
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Aug 10, 2014 at 7:19 PM, Ian Rose <ianr...@fullstory.com>
>>>>> wrote:
>>>>>
>>>>>> Hi -
>>>>>>
>>>>>> On this page (
>>>>>> http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html),
>>>>>> the docs state:
>>>>>>
>>>>>> Do not use an index [...] On a frequently updated or deleted column
>>>>>>
>>>>>>
>>>>>> and
>>>>>>
>>>>>>
>>>>>>> *Problems using an index on a frequently updated or deleted column*¶
>>>>>>> <http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html?scroll=concept_ds_sgh_yzz_zj__upDatIndx>
>>>>>>
>>>>>> Cassandra stores tombstones in the index until the tombstone limit
>>>>>>> reaches 100K cells. After exceeding the tombstone limit, the query that
>>>>>>> uses the indexed value will fail.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'm afraid I don't really understand this limit from its (brief)
>>>>>> description.  I also saw this recent thread
>>>>>> <http://mail-archives.apache.org/mod_mbox/cassandra-user/201403.mbox/%3CCABNXB2Bf4aeoDVpMNOxJ_e7aDez2EuZswMJx=jWfb8=oyo4...@mail.gmail.com%3E>
>>>>>>  but
>>>>>> I'm afraid it didn't help me much...
>>>>>>
>>>>>>
>>>>>> *SHORT VERSION*
>>>>>>
>>>>>> If I have tens or hundreds of thousands of rows in a keyspace, where
>>>>>> every row has an indexed column that is updated O(10) times during the
>>>>>> lifetime of each row, is that going to cause problems for me?  If that 
>>>>>> 100k
>>>>>> limit is *per row* then I should be fine but if that 100k limit is *per
>>>>>> keyspace* then I'd definitely exceed it quickly.
>>>>>>
>>>>>>
>>>>>> *FULL EXPLANATION*
>>>>>>
>>>>>> In our system, items are created at a rate of ~10/sec.  Each item is
>>>>>> updated ~10 times over the next few minutes (although in rare cases the
>>>>>> number of updates, and the duration, might be several times as long).  
>>>>>> Once
>>>>>> the last update is received for an item, we select it from Cassandra,
>>>>>> process the data, then delete the entire row.
>>>>>>
>>>>>> The tricky bit is that sometimes (maybe 30-40% of the time) we don't
>>>>>> actually know when the last update has been received so we use a timeout:
>>>>>> if an item hasn't been updated for 30 minutes, then we assume it is done
>>>>>> and should process it as before (select, then delete).  So I am trying to
>>>>>> design a schema that will allow for efficient queries of the form "find 
>>>>>> me
>>>>>> all items that have not been updated in the past 30 minutes."  We plan to
>>>>>> call this query once a minute.
>>>>>>
>>>>>> Here is my tentative schema:
>>>>>>
>>>>>> CREATE TABLE items (
>>>>>>   item_id ascii,
>>>>>>   last_updated timestamp,
>>>>>>   item_data list<blob>,
>>>>>>   PRIMARY KEY (item_id)
>>>>>> )
>>>>>> plus an index on last_updated.
>>>>>>
>>>>>> So updates to an existing item would just be "lookup by item_id,
>>>>>> append new data to item_data, and set last_updated to now".  And queries 
>>>>>> to
>>>>>> find items that have timed out would use the index on last_updated: "find
>>>>>> all items where last_updated < [now - 30 minutes]".
>>>>>>
>>>>>> Assuming, that is, that the aforementioned 100k tombstone limit won't
>>>>>> bring this index crashing to a halt...
>>>>>>
>>>>>> Any clarification on this limit and/or suggestions on a better way to
>>>>>> model/implement this system would be greatly appreciated!
>>>>>>
>>>>>> Cheers,
>>>>>> Ian
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: clarification on 100k tombstone limit in indexes

Reply via email to