We are processing events in Spark and store the resulting entries
(containing a map) in Cassandra. The results can be new (no entry for this
key in Cassandra) or an Update (there is already an entry with this key in
Cassandra). We use the spark-cassandra-connector to store the data in
Cassandra.

The connector will always do an insert of the data and will rely on the
upsert capabilities of cassandra. So every time an event is updated the
complete map is replaced with all the problems of tombstones.
Seems like we have to implement our own persist logic in which we check if
an element already exists and if yes update the map manually. that would
require a read before write which would be nasty. Another option would be
not to use a collection but (clustering) columns. Do you have another idea
of doing this?

(the conclusion of this whole thing for me would be: use upsert, but do
specific updates on collections as an upsert might replace the whole
collection and generate thumbstones)

2016-05-25 17:37 GMT+02:00 Tyler Hobbs <ty...@datastax.com>:

> If you replace an entire collection, whether it's a map, set, or list, a
> range tombstone will be inserted followed by the new collection.  If you
> only update a single element, no tombstones are generated.
>
> On Wed, May 25, 2016 at 9:48 AM, Matthias Niehoff <
> matthias.nieh...@codecentric.de> wrote:
>
>> Hi,
>>
>> we have a table with a Map Field. We do not delete anything in this
>> table, but to updates on the values including the Map Field (most of the
>> time a new value for an existing key, Rarely adding new keys). We now
>> encounter a huge amount of thumbstones for this Table.
>>
>> We used sstable2json to take a look into the sstables:
>>
>>
>> {"key": "Betty_StoreCatalogLines:7",
>>
>>  "cells": [["276-1-6MPQ0RI-276110031802001001:","",1463820040628001],
>>
>>            ["276-1-6MPQ0RI-276110031802001001:last_modified","2016-05-21 
>> 08:40Z",1463820040628001],
>>
>>            
>> ["276-1-6MPQ0RI-276110031802001001:last_modified_by_source:_","276-1-6MPQ0RI-276110031802001001:last_modified_by_source:!",1463040069753999,"t",1463040069],
>>
>>            
>> ["276-1-6MPQ0RI-276110031802001001:last_modified_by_source:_","276-1-6MPQ0RI-276110031802001001:last_modified_by_source:!",1463120708590002,"t",1463120708],
>>
>>            
>> ["276-1-6MPQ0RI-276110031802001001:last_modified_by_source:_","276-1-6MPQ0RI-276110031802001001:last_modified_by_source:!",1463145700735007,"t",1463145700],
>>
>>            
>> ["276-1-6MPQ0RI-276110031802001001:last_modified_by_source:_","276-1-6MPQ0RI-276110031802001001:last_modified_by_source:!",1463157430862000,"t",1463157430],
>>
>>            
>> [„276-1-6MPQ0RI-276110031802001001:last_modified_by_source:_“,“276-1-6MPQ0RI-276110031802001001:last_modified_by_source:!“,1463164595291002,"t",1463164595],
>>
>> . . .
>>
>>   
>> ["276-1-6MPQ0RI-276110031802001001:last_modified_by_source:_","276-1-6MPQ0RI-276110031802001001:last_modified_by_source:!",1463820040628000,"t",1463820040],
>>
>>            
>> ["276-1-6MPQ0RI-276110031802001001:last_modified_by_source:62657474795f73746f72655f636174616c6f675f6c696e6573","00000154d265c6b0",1463820040628001],
>>
>>            
>> [„276-1-6MPQ0RI-276110031802001001:payload“,"{\"payload\":{\"Article 
>> Id\":\"276110031802001001\",\"Row Id\":\"1-6MPQ0RI\",\"Article 
>> #\":\"31802001001\",\"Quote Item Id\":\"1-6MPWPVC\",\"Country 
>> Code\":\"276\"}}",1463820040628001]
>>
>>
>>
>> Looking at the SStables it seem like every update of a value in a Map
>> breaks down to a delete and insert in the corresponding SSTable (see all
>> the thumbstone flags „t“ in the extract of sstable2json above).
>>
>> We are using Cassandra 2.2.5.
>>
>> Can you confirm this behavior?
>>
>> Thanks!
>> --
>> Matthias Niehoff | IT-Consultant | Agile Software Factory  | Consulting
>> codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Deutschland
>> tel: +49 (0) 721.9595-681 | fax: +49 (0) 721.9595-666 | mobil: +49 (0)
>> 172.1702676
>> www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
>> www.more4fi.de
>>
>> Sitz der Gesellschaft: Solingen | HRB 25917| Amtsgericht Wuppertal
>> Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
>> Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen
>> Schütz
>>
>> Diese E-Mail einschließlich evtl. beigefügter Dateien enthält
>> vertrauliche und/oder rechtlich geschützte Informationen. Wenn Sie nicht
>> der richtige Adressat sind oder diese E-Mail irrtümlich erhalten haben,
>> informieren Sie bitte sofort den Absender und löschen Sie diese E-Mail und
>> evtl. beigefügter Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder
>> Öffnen evtl. beigefügter Dateien sowie die unbefugte Weitergabe dieser
>> E-Mail ist nicht gestattet
>>
>
>
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>



-- 
Matthias Niehoff | IT-Consultant | Agile Software Factory  | Consulting
codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Deutschland
tel: +49 (0) 721.9595-681 | fax: +49 (0) 721.9595-666 | mobil: +49 (0)
172.1702676
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
www.more4fi.de

Sitz der Gesellschaft: Solingen | HRB 25917| Amtsgericht Wuppertal
Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz

Diese E-Mail einschließlich evtl. beigefügter Dateien enthält vertrauliche
und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige
Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie
bitte sofort den Absender und löschen Sie diese E-Mail und evtl.
beigefügter Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder Öffnen
evtl. beigefügter Dateien sowie die unbefugte Weitergabe dieser E-Mail ist
nicht gestattet

Reply via email to