Re: Versioning in cassandra

Laing, Michael Wed, 04 Sep 2013 07:33:02 -0700

Dawood,

In general that will work. However it does mean that you 1) read the old
version 2) update the new version and 3) write the archive version.


Step 2 is a problem: what if someone else has updated the old version after
step 1? and there are 3 atomic operations required, at least.

However, these considerations may be mitigated using Cassandra 2 light
transactions; and it is not a problem if you have only one updater.

But another problem may be performance. You must test. The solution I
proposed does not require a read before write and does an atomic append,
even if multiple maps are being updated. It also defers deletions via ttl's
and a separate, manageable queue for 'cleanup' of large maps.

I think the most important word in my reply is: 'test'.

Cheers,

Michael


On Wed, Sep 4, 2013 at 9:05 AM, dawood abdullah
<muhammed.daw...@gmail.com>wrote:

> Michael,
>
> Your approach solves the problem, thanks for the solution. I was thinking
> of another approach as well where in I would create another column family
> say file_archive, so whenever an update is made to the File table, I will
> create a new version in the File and move the old version to the new
> file_archive table. Please let me know if the second approach is fine.
>
> Regards,
> Dawood
>
>
> On Wed, Sep 4, 2013 at 2:47 AM, Laing, Michael 
> <michael.la...@nytimes.com>wrote:
>
>> I use the technique described in my previous message to handle millions
>> of messages and their versions.
>>
>> Actually, I use timeuuid's instead of timestamps, as they have more
>> 'uniqueness'. Also I index my maps by a timeuuid that is the complement
>> (based on a future date) of a current timeuuid. Since maps are kept sorted
>> by key, this means I can just pop off the first one to get the most recent.
>>
>> The downside of this approach is that you get more stuff returned to you
>> from Cassandra than you need. To mitigate that I queue a job to examine and
>> correct the situation if, upon doing a read, the number of versions for a
>> particular key is higher than some threshold, e.g. 50. There are many ways
>> to approach this problem.
>>
>> Our actual implementation proceeds to another level, as we also have
>> replicas of versions. This happens because we process important
>> transactions in parallel and can expect up to 9 replicas of each version.
>> We journal them all and use them for reporting latencies in our processing
>> pipelines as well as for replay when we need to recover application state.
>>
>> Regards,
>>
>> Michael
>>
>>
>> On Tue, Sep 3, 2013 at 3:15 PM, Laing, Michael <michael.la...@nytimes.com
>> > wrote:
>>
>>> try the following. -ml
>>>
>>> -- put this in <file> and run using 'cqlsh -f <file>
>>>
>>> DROP KEYSPACE latest;
>>>
>>> CREATE KEYSPACE latest WITH replication = {
>>>     'class': 'SimpleStrategy',
>>>     'replication_factor' : 1
>>> };
>>>
>>> USE latest;
>>>
>>> CREATE TABLE file (
>>>     parentid text, -- row_key, same for each version
>>>     id text, -- column_key, same for each version
>>>     contenttype map<timestamp, text>, -- differs by version, version is
>>> the key to the map
>>>     PRIMARY KEY (parentid, id)
>>> );
>>>
>>> update file set contenttype = contenttype + {'2011-03-04':'pdf1'} where
>>> parentid = 'd1' and id = 'f1';
>>> update file set contenttype = contenttype + {'2011-03-05':'pdf2'} where
>>> parentid = 'd1' and id = 'f1';
>>> update file set contenttype = contenttype + {'2011-03-04':'pdf3'} where
>>> parentid = 'd1' and id = 'f2';
>>> update file set contenttype = contenttype + {'2011-03-05':'pdf4'} where
>>> parentid = 'd1' and id = 'f2';
>>>
>>> select * from file where parentid = 'd1';
>>>
>>> -- returns:
>>>
>>> -- parentid | id | contenttype
>>>
>>> ------------+----+--------------------------------------------------------------------------
>>> --       d1 | f1 | {'2011-03-04 00:00:00-0500': 'pdf1', '2011-03-05
>>> 00:00:00-0500': 'pdf2'}
>>> --       d1 | f2 | {'2011-03-04 00:00:00-0500': 'pdf3', '2011-03-05
>>> 00:00:00-0500': 'pdf4'}
>>>
>>> -- use an app to pop off the latest version from the map
>>>
>>> -- map other varying fields using the same technique as used for
>>> contenttype
>>>
>>>
>>>
>>> On Tue, Sep 3, 2013 at 2:31 PM, Vivek Mishra <mishra.v...@gmail.com>wrote:
>>>
>>>> create table file(id text , parentid text,contenttype text,version
>>>> timestamp, descr text, name text, PRIMARY KEY(id,version) ) WITH CLUSTERING
>>>> ORDER BY (version DESC);
>>>>
>>>> insert into file (id, parentid, version, contenttype, descr, name)
>>>> values ('f2', 'd1', '2011-03-06', 'pdf', 'f2 file', 'file1');
>>>> insert into file (id, parentid, version, contenttype, descr, name)
>>>> values ('f2', 'd1', '2011-03-05', 'pdf', 'f2 file', 'file1');
>>>> insert into file (id, parentid, version, contenttype, descr, name)
>>>> values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
>>>> insert into file (id, parentid, version, contenttype, descr, name)
>>>> values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1');
>>>> create index on file(parentid);
>>>>
>>>>
>>>> select * from file where id='f1' and parentid='d1' limit 1;
>>>>
>>>> select * from file where parentid='d1' limit 1;
>>>>
>>>>
>>>> Will it work for you?
>>>>
>>>> -Vivek
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Sep 3, 2013 at 11:29 PM, Vivek Mishra <mishra.v...@gmail.com>wrote:
>>>>
>>>>> My bad. I did miss out to read "latest version" part.
>>>>>
>>>>> -Vivek
>>>>>
>>>>>
>>>>> On Tue, Sep 3, 2013 at 11:20 PM, dawood abdullah <
>>>>> muhammed.daw...@gmail.com> wrote:
>>>>>
>>>>>> I have tried with both the options creating secondary index and also
>>>>>> tried adding parentid to primary key, but I am getting all the files with
>>>>>> parentid 'yyy', what I want is the latest version of file with the
>>>>>> combination of parentid, fileid. Say below are the records inserted in 
>>>>>> the
>>>>>> file table:
>>>>>>
>>>>>> insert into file (id, parentid, version, contenttype, description,
>>>>>> name) values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1');
>>>>>> insert into file (id, parentid, version, contenttype, description,
>>>>>> name) values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
>>>>>> insert into file (id, parentid, version, contenttype, description,
>>>>>> name) values ('f2', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1');
>>>>>> insert into file (id, parentid, version, contenttype, description,
>>>>>> name) values ('f2', 'd1', '2011-03-06', 'pdf', 'f1 file', 'file1');
>>>>>>
>>>>>> I want to write a query which returns me second and last record and
>>>>>> not the first and third record, because for the first and third record
>>>>>> there exists a latest version, for the combination of id and parentid.
>>>>>>
>>>>>> I am confused If at all this is achievable, please suggest.
>>>>>>
>>>>>> Dawood
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 3, 2013 at 10:58 PM, Vivek Mishra 
>>>>>> <mishra.v...@gmail.com>wrote:
>>>>>>
>>>>>>> create secondary index over parentid.
>>>>>>> OR
>>>>>>> make it part of clustering key
>>>>>>>
>>>>>>> -Vivek
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 3, 2013 at 10:42 PM, dawood abdullah <
>>>>>>> muhammed.daw...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Jan,
>>>>>>>>
>>>>>>>> The solution you gave works spot on, but there is one more
>>>>>>>> requirement I forgot to mention. Following is my table structure
>>>>>>>>
>>>>>>>> CREATE TABLE file (
>>>>>>>>   id text,
>>>>>>>>   contenttype text,
>>>>>>>>   createdby text,
>>>>>>>>   createdtime timestamp,
>>>>>>>>   description text,
>>>>>>>>   name text,
>>>>>>>>   parentid text,
>>>>>>>>   version timestamp,
>>>>>>>>   PRIMARY KEY (id, version)
>>>>>>>>
>>>>>>>> ) WITH CLUSTERING ORDER BY (version DESC);
>>>>>>>>
>>>>>>>>
>>>>>>>> The query (select * from file where id = 'xxx' limit 1;) provided
>>>>>>>> solves the problem of finding the latest version file. But I have one 
>>>>>>>> more
>>>>>>>> requirement of finding all the latest version files having parentid say
>>>>>>>> 'yyy'.
>>>>>>>>
>>>>>>>> Please suggest how can this query be achieved.
>>>>>>>>
>>>>>>>> Dawood
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 3, 2013 at 12:43 AM, dawood abdullah <
>>>>>>>> muhammed.daw...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> In my case version can be timestamp as well. What do you suggest
>>>>>>>>> version number to be, do you see any problems if I keep version as 
>>>>>>>>> counter
>>>>>>>>> / timestamp ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 3, 2013 at 12:22 AM, Jan Algermissen <
>>>>>>>>> jan.algermis...@nordsc.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 02.09.2013, at 20:44, dawood abdullah <
>>>>>>>>>> muhammed.daw...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> > Requirement is like I have a column family say File
>>>>>>>>>> >
>>>>>>>>>> > create table file(id text primary key, fname text, version int,
>>>>>>>>>> mimetype text, content text);
>>>>>>>>>> >
>>>>>>>>>> > Say, I have few records inserted, when I modify an existing
>>>>>>>>>> record (content is modified) a new version needs to be created. As I 
>>>>>>>>>> need
>>>>>>>>>> to have provision to revert to back any old version whenever 
>>>>>>>>>> required.
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>> So, can version be a timestamp? Or does it need to be an integer?
>>>>>>>>>>
>>>>>>>>>> In the former case, make use of C*'s ordering like so:
>>>>>>>>>>
>>>>>>>>>> CREATE TABLE file (
>>>>>>>>>>    file_id text,
>>>>>>>>>>    version timestamp,
>>>>>>>>>>    fname text,
>>>>>>>>>>    ....
>>>>>>>>>>    PRIMARY KEY (file_id,version)
>>>>>>>>>> ) WITH CLUSTERING ORDER BY (version DESC);
>>>>>>>>>>
>>>>>>>>>> Get the latest file version with
>>>>>>>>>>
>>>>>>>>>> select * from file where file_id = 'xxx' limit 1;
>>>>>>>>>>
>>>>>>>>>> If it has to be an integer, use counter columns.
>>>>>>>>>>
>>>>>>>>>> Jan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> > Regards,
>>>>>>>>>> > Dawood
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen <
>>>>>>>>>> jan.algermis...@nordsc.com> wrote:
>>>>>>>>>> > Hi Dawood,
>>>>>>>>>> >
>>>>>>>>>> > On 02.09.2013, at 16:36, dawood abdullah <
>>>>>>>>>> muhammed.daw...@gmail.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> > > Hi
>>>>>>>>>> > > I have a requirement of versioning to be done in Cassandra.
>>>>>>>>>> > >
>>>>>>>>>> > > Following is my column family definition
>>>>>>>>>> > >
>>>>>>>>>> > > create table file_details(id text primary key, fname text,
>>>>>>>>>> version int, mimetype text);
>>>>>>>>>> > >
>>>>>>>>>> > > I have a secondary index created on fname column.
>>>>>>>>>> > >
>>>>>>>>>> > > Whenever I do an insert for the same 'fname', the version
>>>>>>>>>> should be incremented. And when I retrieve a row with fname it should
>>>>>>>>>> return me the latest version row.
>>>>>>>>>> > >
>>>>>>>>>> > > Is there a better way to do in Cassandra? Please suggest what
>>>>>>>>>> approach needs to be taken.
>>>>>>>>>> >
>>>>>>>>>> > Can you explain more about your use case?
>>>>>>>>>> >
>>>>>>>>>> > If the version need not be a small number, but could be a
>>>>>>>>>> timestamp, you could make use of C*'s ordering feature , have the 
>>>>>>>>>> database
>>>>>>>>>> set the new version as a timestamp and retrieve the latest one with a
>>>>>>>>>> simple LIMIT 1 query. (I'll explain more when this is an option for 
>>>>>>>>>> you).
>>>>>>>>>> >
>>>>>>>>>> > Jan
>>>>>>>>>> >
>>>>>>>>>> > P.S. Me being a REST/HTTP head, an alarm rings when I see
>>>>>>>>>> 'version' next to 'mimetype' :-) What exactly are you versioning 
>>>>>>>>>> here?
>>>>>>>>>> Maybe we can even change the situation from a functional POV?
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > >
>>>>>>>>>> > > Regards,
>>>>>>>>>> > >
>>>>>>>>>> > > Dawood
>>>>>>>>>> > >
>>>>>>>>>> > >
>>>>>>>>>> > >
>>>>>>>>>> > >
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Versioning in cassandra

Reply via email to