Makes sense - thanks again!
On Tue, Aug 12, 2014 at 9:45 AM, DuyHai Doan <doanduy...@gmail.com> wrote: > Hello Ian > > "So that way each index entry *will* have quite a few entries and the > index as a whole won't grow too big. Is my thinking correct here?" --> In > this case yes. Do not forget that for each date value, there will be 1 > corresponding index value + 10 updates. If you have an approximate count > for "a few entries", a quick maths should give you an idea about how > "large" the index partition is > > "I had considered an approach like this but my concern is that for any > given minute *all* of the updates will be handled by a single node, > right?" --> If you time resolution is a minute, yes it will be a problem. > And depending on the insert rate, it can become a quickly a bottle neck > during this minute. > > The manual index approach suffers a lot from bottleneck issue for heavy > workload, that's the main reason they implement a distributed secondary > index. There is no free lunch though. What you gain in term of control and > tuning with the manual index, you loose on the load distribution side. > > > > > On Mon, Aug 11, 2014 at 11:17 PM, Ian Rose <ianr...@fullstory.com> wrote: > >> Hi DuyHai, >> >> Thanks for the detailed response! A few responses below: >> >> "On a side node, your usage of secondary index is not the best one. >> Indeed, indexing the update date will lead to a situation where for one >> date, you'll mostly have one or a few matching items (assuming that the >> update date resolution is small enough and update rate is not intense)." >> --> I should have mentioned this original (slipped my mind) but to deal >> specifically with this problem I had planned to use a timestamp with a >> resolution of 1 minute (like your minute_bucket). So that way each index >> entry *will* have quite a few entries and the index as a whole won't >> grow too big. Is my thinking correct here? >> >> "You better off create a manuel reverse-index to track modification >> date, something like this" --> I had considered an approach like this but >> my concern is that for any given minute *all* of the updates will be >> handled by a single node, right? For example, if the minute_bucket is 2739 >> then for that one minute, every single item update will flow to the node at >> HASH(2739). Assuming I am thinking about that right, that seemed like a >> potential scaling bottleneck, which scared me off that approach. >> >> Cheers, >> Ian >> >> >> >> >> On Sun, Aug 10, 2014 at 5:20 PM, DuyHai Doan <doanduy...@gmail.com> >> wrote: >> >>> Hello Ian >>> >>> "It sounds like this 100k limit is, indeed, a "global" limit as opposed >>> to a per-row limit" -->The threshold applies to each "REQUEST", not >>> partition or globally. >>> >>> The threshold does not apply to a partition (physical row) simply >>> because in one request you can fetch data from many partitions (multi get >>> slice). There was a JIRA about this here: >>> https://issues.apache.org/jira/browse/CASSANDRA-6865 >>> >>> "Are these tombstones ever "GCed" out of the index?" --> Yes they are, >>> during compactions of the index column family. >>> >>> "How frequently?" --> That's the real pain. Indeed you do not have any >>> control on the tuning of secondary index CF compaction. As far as I know, >>> the compaction settings (strategy, min/max thresholds...) inherits from the >>> one of the base table >>> >>> Now, by looking very fast into your data model, it seems that you have a >>> skinny partition patter. Since you mentioned that the date is updated only >>> 10 times max, you should not run into the tombstonne threshold issue. >>> >>> On a side node, your usage of secondary index is not the best one. >>> Indeed, indexing the update date will lead to a situation where for one >>> date, you'll mostly have one or a few matching items (assuming that the >>> update date resolution is small enough and update rate is not intense). It >>> is the high-cardinality scenario to be avoided ( >>> http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html). >>> Plus, the query on the index (find all items where last_updated < [now - 30 >>> minutes]) makes things worse since it is not an exact match but inequality. >>> >>> You better off create a manuel reverse-index to track modification >>> date, something like this: >>> >>> CREATE TABLE last_updated_item ( >>> minute_bucket int, // format YYYYMMDDHHmm >>> last_update_date timestamp, >>> item_id ascii, >>> PRIMARY KEY(minute_bucket, last_update_date) >>> ); >>> >>> The last_update_date column is quite self-explanatory. The >>> minute_bucket is trickier. The idea is to split ranges on 30 minutes into >>> buckets. 00:00 to 00:30 is bucket 1, 00:30 to 01:00 is bucket 2 and so on. >>> For a whole day, you'd have 48 buckets. We need to put data into buckets to >>> avoid ultra wide rows since you mentioned that there are 10 items (so 10 >>> updates) / sec. Of course, 30 mins is just an exemple, you can tune it down >>> to a window of 5 minutes or 1 minute, depending on the insertion rate. >>> >>> >>> >>> >>> >>> On Sun, Aug 10, 2014 at 10:02 PM, Ian Rose <ianr...@fullstory.com> >>> wrote: >>> >>>> Hi Mark - >>>> >>>> Thanks for the clarification but as I'm not too familiar with the nuts >>>> & bolts of Cassandra I'm not sure how to apply that info to my current >>>> situation. It sounds like this 100k limit is, indeed, a "global" limit as >>>> opposed to a per-row limit. Are these tombstones ever "GCed" out of the >>>> index? How frequently? If not, then it seems like *any* index is at risk >>>> of reaching this tipping point; it's just that indexes on frequently >>>> updated columns will reach this pointer faster the indexes on rarely >>>> updated columns. >>>> >>>> Basically I'm trying to get some kind of sense for what "frequently >>>> updated >>>> <http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html>" >>>> means quantitatively. As written, the docs make it sound dangerous to >>>> create an index on a column that is *ever* deleted or updated since >>>> there is no sense of how frequent is "too frequent". >>>> >>>> Cheers, >>>> Ian >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Sun, Aug 10, 2014 at 3:02 PM, Mark Reddy <mark.re...@boxever.com> >>>> wrote: >>>> >>>>> Hi Ian, >>>>> >>>>> The issues here, which relates to normal and index column families, is >>>>> scanning over a large number of tombstones can cause Cassandra to fall >>>>> over >>>>> due to increased GC pressure. This pressure is caused because tombstones >>>>> will create DeletedColumn objects which consume heap. Also >>>>> these DeletedColumn objects will have to be serialized and sent back to >>>>> the >>>>> coordinator, thus increasing your response times. Take for example a row >>>>> that does deletes and you query it with a limit of 100. In a worst case >>>>> scenario you could end up reading say 50k tombstones to reach the 100 >>>>> 'live' column limit, all of which has to be put on heap and then sent over >>>>> the wire to the coordinator. This would be considered a Cassandra >>>>> anti-pattern.[1] >>>>> >>>>> With that in mind there was a debug warning added to 1.2 to inform the >>>>> user when they were querying a row with 1000 tombstones [2]. Then in 2.0 >>>>> the action was taken to drop requests reaching 100k tombstones[3] rather >>>>> than just printing out a warning. This is a safety measure, as it is not >>>>> advised to perform such a query and is a result of most people 'doing it >>>>> wrong'. >>>>> >>>>> For those people who understand the risk of scanning over large >>>>> numbers of tombstones there is a configuration option in the >>>>> cassandra.yaml >>>>> to increase this threshold, tombstone_failure_threshold.[4] >>>>> >>>>> >>>>> Mark >>>>> >>>>> [1] >>>>> http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets >>>>> [2] https://issues.apache.org/jira/browse/CASSANDRA-6042 >>>>> [3] https://issues.apache.org/jira/browse/CASSANDRA-6117 >>>>> [4] >>>>> https://github.com/jbellis/cassandra/blob/4ac18ae805d28d8f4cb44b42e2244bfa6d2875e1/conf/cassandra.yaml#L407-L417 >>>>> >>>>> >>>>> >>>>> On Sun, Aug 10, 2014 at 7:19 PM, Ian Rose <ianr...@fullstory.com> >>>>> wrote: >>>>> >>>>>> Hi - >>>>>> >>>>>> On this page ( >>>>>> http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html), >>>>>> the docs state: >>>>>> >>>>>> Do not use an index [...] On a frequently updated or deleted column >>>>>> >>>>>> >>>>>> and >>>>>> >>>>>> >>>>>>> *Problems using an index on a frequently updated or deleted column*ΒΆ >>>>>>> <http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html?scroll=concept_ds_sgh_yzz_zj__upDatIndx> >>>>>> >>>>>> Cassandra stores tombstones in the index until the tombstone limit >>>>>>> reaches 100K cells. After exceeding the tombstone limit, the query that >>>>>>> uses the indexed value will fail. >>>>>> >>>>>> >>>>>> >>>>>> I'm afraid I don't really understand this limit from its (brief) >>>>>> description. I also saw this recent thread >>>>>> <http://mail-archives.apache.org/mod_mbox/cassandra-user/201403.mbox/%3CCABNXB2Bf4aeoDVpMNOxJ_e7aDez2EuZswMJx=jWfb8=oyo4...@mail.gmail.com%3E> >>>>>> but >>>>>> I'm afraid it didn't help me much... >>>>>> >>>>>> >>>>>> *SHORT VERSION* >>>>>> >>>>>> If I have tens or hundreds of thousands of rows in a keyspace, where >>>>>> every row has an indexed column that is updated O(10) times during the >>>>>> lifetime of each row, is that going to cause problems for me? If that >>>>>> 100k >>>>>> limit is *per row* then I should be fine but if that 100k limit is *per >>>>>> keyspace* then I'd definitely exceed it quickly. >>>>>> >>>>>> >>>>>> *FULL EXPLANATION* >>>>>> >>>>>> In our system, items are created at a rate of ~10/sec. Each item is >>>>>> updated ~10 times over the next few minutes (although in rare cases the >>>>>> number of updates, and the duration, might be several times as long). >>>>>> Once >>>>>> the last update is received for an item, we select it from Cassandra, >>>>>> process the data, then delete the entire row. >>>>>> >>>>>> The tricky bit is that sometimes (maybe 30-40% of the time) we don't >>>>>> actually know when the last update has been received so we use a timeout: >>>>>> if an item hasn't been updated for 30 minutes, then we assume it is done >>>>>> and should process it as before (select, then delete). So I am trying to >>>>>> design a schema that will allow for efficient queries of the form "find >>>>>> me >>>>>> all items that have not been updated in the past 30 minutes." We plan to >>>>>> call this query once a minute. >>>>>> >>>>>> Here is my tentative schema: >>>>>> >>>>>> CREATE TABLE items ( >>>>>> item_id ascii, >>>>>> last_updated timestamp, >>>>>> item_data list<blob>, >>>>>> PRIMARY KEY (item_id) >>>>>> ) >>>>>> plus an index on last_updated. >>>>>> >>>>>> So updates to an existing item would just be "lookup by item_id, >>>>>> append new data to item_data, and set last_updated to now". And queries >>>>>> to >>>>>> find items that have timed out would use the index on last_updated: "find >>>>>> all items where last_updated < [now - 30 minutes]". >>>>>> >>>>>> Assuming, that is, that the aforementioned 100k tombstone limit won't >>>>>> bring this index crashing to a halt... >>>>>> >>>>>> Any clarification on this limit and/or suggestions on a better way to >>>>>> model/implement this system would be greatly appreciated! >>>>>> >>>>>> Cheers, >>>>>> Ian >>>>>> >>>>>> >>>>> >>>> >>> >> >