As documented at http://cassandra.apache.org/doc/cql3/CQL.html#collections, lists have 3 operations that require a read before a write (and should thus be avoided in performance-sensitive code), namely setting and deleting by index, and removing by value. Outside of that, collections involve no read before write.
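To make that distinction concrete, here is a minimal CQL sketch. The `playlists` table and its values are hypothetical and used only for illustration; the statement forms are the list operations described in the CQL documentation linked above.

```cql
-- Hypothetical table, for illustration only.
CREATE TABLE playlists (id INT PRIMARY KEY, songs LIST<TEXT>);

-- The three list operations that require a read before the write:
UPDATE playlists SET songs[1] = 'new title' WHERE id = 1;         -- set by index
DELETE songs[1] FROM playlists WHERE id = 1;                      -- delete by index
UPDATE playlists SET songs = songs - ['old title'] WHERE id = 1;  -- remove by value

-- Appending and prepending do not read before writing:
UPDATE playlists SET songs = songs + ['another title'] WHERE id = 1;
UPDATE playlists SET songs = ['first title'] + songs WHERE id = 1;
```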
But, as you said, if you do overwrite a collection, the previous collection is removed (using a range tombstone) while the new one is added. This should have almost no impact on the insertion itself, however (the tombstone is in the same internal mutation as the update itself; it's not 2 operations). But yes, if you often overwrite collections in the same partition, this might have some impact on reads due to CASSANDRA-5677, and we'll look at fixing that.

So in theory collections should have no special impact on writes, at least nothing that is by design. If you do observe differently and have a way to reproduce, feel free to open a JIRA issue. But I'm afraid we'll need more than "two guys on stackoverflow claim they've seen write performance degradation due to collections" to get going.

--
Sylvain

On Fri, Jun 28, 2013 at 7:30 AM, Theo Hultberg <t...@iconara.net> wrote:

> the thing I was doing was definitely triggering the range tombstone issue,
> this is what I was doing:
>
>   UPDATE clocks SET clock = ? WHERE shard = ?
>
> in this table:
>
>   CREATE TABLE clocks (shard INT PRIMARY KEY, clock MAP<TEXT, TIMESTAMP>)
>
> however, from the stack overflow posts it sounds like they aren't
> necessarily overwriting their collections. I've tried to replicate their
> problem with these two statements
>
>   INSERT INTO clocks (shard, clock) VALUES (?, ?)
>   UPDATE clocks SET clock = clock + ? WHERE shard = ?
>
> the first one should create range tombstones because it overwrites the
> map on every insert, and the second should not because it adds to the map.
> neither of those seems to have any performance issues, at least not on
> inserts.
>
> and it's the slowdown on inserts that confuses me, both the stack overflow
> questioners say that they saw a drop in insert performance. I never saw
> that in my application, I just got slow reads (and Fabien's explanation
> makes complete sense for that).
> I don't understand how insert performance
> could be affected at all, and I know that for non-counter columns cassandra
> doesn't read before it writes, but is it the same for collections too? they
> are a bit special, but how special are they?
>
> T#
>
>
> On Fri, Jun 28, 2013 at 7:04 AM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> Can you provide details of the mutation statements you are running? The
>> Stack Overflow posts don't seem to include them.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 27/06/2013, at 5:58 AM, Theo Hultberg <t...@iconara.net> wrote:
>>
>> do I understand it correctly if I think that collection modifications are
>> done by reading the collection, writing a range tombstone that would cover
>> the collection and then re-writing the whole collection again? or is it
>> just the modified parts of the collection that are covered by the range
>> tombstones, but you still get massive amounts of them and it's just their
>> number that is the problem?
>>
>> would this explain the slowdown of writes too? I guess it would if
>> cassandra needed to read the collection before it wrote the new values,
>> otherwise I don't understand how this affects writes, but that only says
>> how much I know about how this works.
>>
>> T#
>>
>>
>> On Wed, Jun 26, 2013 at 10:48 AM, Fabien Rousseau <fab...@yakaz.com> wrote:
>>
>>> Hi,
>>>
>>> I'm pretty sure that it's related to this ticket:
>>> https://issues.apache.org/jira/browse/CASSANDRA-5677
>>>
>>> I'd be happy if someone tests this patch.
>>> It should apply easily on 1.2.5 & 1.2.6
>>>
>>> After applying the patch, the current implementation is still used by
>>> default. To use the new one, add the following line to your cassandra.yaml:
>>> interval_tree_provider: IntervalTreeAvlProvider
>>>
>>> (Note that implementations should be interchangeable, because they share
>>> the same serializers and deserializers)
>>>
>>> Also, please note that this patch has been neither reviewed nor intensively
>>> tested... So, it may not be "production ready"
>>>
>>> Fabien
>>>
>>>
>>> 2013/6/26 Theo Hultberg <t...@iconara.net>
>>>
>>>> Hi,
>>>>
>>>> I've seen a couple of people on Stack Overflow having problems with
>>>> performance when they have maps that they continuously update, and in
>>>> hindsight I think I might have run into the same problem myself (but I
>>>> didn't suspect it as the reason and designed differently and by accident
>>>> didn't use maps anymore).
>>>>
>>>> Is there any reason that maps (or lists or sets) in particular would
>>>> become a performance issue when they're heavily modified? As I've
>>>> understood them they're not special, and shouldn't be any different
>>>> performance-wise than overwriting regular columns. Is there something
>>>> different going on that I'm missing?
>>>>
>>>> Here are the Stack Overflow questions:
>>>>
>>>> http://stackoverflow.com/questions/17282837/cassandra-insert-perfomance-issue-into-a-table-with-a-map-type/17290981
>>>>
>>>> http://stackoverflow.com/questions/17082963/bad-performance-when-writing-log-data-to-cassandra-with-timeuuid-as-a-column-nam/17123236
>>>>
>>>> yours,
>>>> Theo
>>>>
>>>
>>> --
>>> Fabien Rousseau
>>> <aur...@yakaz.com> www.yakaz.com