Dawood, In general that will work. However it does mean that you 1) read the old version 2) update the new version and 3) write the archive version.
Step 2 is a problem: what if someone else has updated the old version after step 1? and there are 3 atomic operations required, at least. However, these considerations may be mitigated using Cassandra 2 light transactions; and it is not a problem if you have only one updater. But another problem may be performance. You must test. The solution I proposed does not require a read before write and does an atomic append, even if multiple maps are being updated. It also defers deletions via ttl's and a separate, manageable queue for 'cleanup' of large maps. I think the most important word in my reply is: 'test'. Cheers, Michael On Wed, Sep 4, 2013 at 9:05 AM, dawood abdullah <muhammed.daw...@gmail.com>wrote: > Michael, > > Your approach solves the problem, thanks for the solution. I was thinking > of another approach as well where in I would create another column family > say file_archive, so whenever an update is made to the File table, I will > create a new version in the File and move the old version to the new > file_archive table. Please let me know if the second approach is fine. > > Regards, > Dawood > > > On Wed, Sep 4, 2013 at 2:47 AM, Laing, Michael > <michael.la...@nytimes.com>wrote: > >> I use the technique described in my previous message to handle millions >> of messages and their versions. >> >> Actually, I use timeuuid's instead of timestamps, as they have more >> 'uniqueness'. Also I index my maps by a timeuuid that is the complement >> (based on a future date) of a current timeuuid. Since maps are kept sorted >> by key, this means I can just pop off the first one to get the most recent. >> >> The downside of this approach is that you get more stuff returned to you >> from Cassandra than you need. To mitigate that I queue a job to examine and >> correct the situation if, upon doing a read, the number of versions for a >> particular key is higher than some threshold, e.g. 50. There are many ways >> to approach this problem. >> >> Our actual implementation proceeds to another level, as we also have >> replicas of versions. This happens because we process important >> transactions in parallel and can expect up to 9 replicas of each version. >> We journal them all and use them for reporting latencies in our processing >> pipelines as well as for replay when we need to recover application state. >> >> Regards, >> >> Michael >> >> >> On Tue, Sep 3, 2013 at 3:15 PM, Laing, Michael <michael.la...@nytimes.com >> > wrote: >> >>> try the following. -ml >>> >>> -- put this in <file> and run using 'cqlsh -f <file> >>> >>> DROP KEYSPACE latest; >>> >>> CREATE KEYSPACE latest WITH replication = { >>> 'class': 'SimpleStrategy', >>> 'replication_factor' : 1 >>> }; >>> >>> USE latest; >>> >>> CREATE TABLE file ( >>> parentid text, -- row_key, same for each version >>> id text, -- column_key, same for each version >>> contenttype map<timestamp, text>, -- differs by version, version is >>> the key to the map >>> PRIMARY KEY (parentid, id) >>> ); >>> >>> update file set contenttype = contenttype + {'2011-03-04':'pdf1'} where >>> parentid = 'd1' and id = 'f1'; >>> update file set contenttype = contenttype + {'2011-03-05':'pdf2'} where >>> parentid = 'd1' and id = 'f1'; >>> update file set contenttype = contenttype + {'2011-03-04':'pdf3'} where >>> parentid = 'd1' and id = 'f2'; >>> update file set contenttype = contenttype + {'2011-03-05':'pdf4'} where >>> parentid = 'd1' and id = 'f2'; >>> >>> select * from file where parentid = 'd1'; >>> >>> -- returns: >>> >>> -- parentid | id | contenttype >>> >>> ------------+----+-------------------------------------------------------------------------- >>> -- d1 | f1 | {'2011-03-04 00:00:00-0500': 'pdf1', '2011-03-05 >>> 00:00:00-0500': 'pdf2'} >>> -- d1 | f2 | {'2011-03-04 00:00:00-0500': 'pdf3', '2011-03-05 >>> 00:00:00-0500': 'pdf4'} >>> >>> -- use an app to pop off the latest version from the map >>> >>> -- map other varying fields using the same technique as used for >>> contenttype >>> >>> >>> >>> On Tue, Sep 3, 2013 at 2:31 PM, Vivek Mishra <mishra.v...@gmail.com>wrote: >>> >>>> create table file(id text , parentid text,contenttype text,version >>>> timestamp, descr text, name text, PRIMARY KEY(id,version) ) WITH CLUSTERING >>>> ORDER BY (version DESC); >>>> >>>> insert into file (id, parentid, version, contenttype, descr, name) >>>> values ('f2', 'd1', '2011-03-06', 'pdf', 'f2 file', 'file1'); >>>> insert into file (id, parentid, version, contenttype, descr, name) >>>> values ('f2', 'd1', '2011-03-05', 'pdf', 'f2 file', 'file1'); >>>> insert into file (id, parentid, version, contenttype, descr, name) >>>> values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); >>>> insert into file (id, parentid, version, contenttype, descr, name) >>>> values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1'); >>>> create index on file(parentid); >>>> >>>> >>>> select * from file where id='f1' and parentid='d1' limit 1; >>>> >>>> select * from file where parentid='d1' limit 1; >>>> >>>> >>>> Will it work for you? >>>> >>>> -Vivek >>>> >>>> >>>> >>>> >>>> On Tue, Sep 3, 2013 at 11:29 PM, Vivek Mishra <mishra.v...@gmail.com>wrote: >>>> >>>>> My bad. I did miss out to read "latest version" part. >>>>> >>>>> -Vivek >>>>> >>>>> >>>>> On Tue, Sep 3, 2013 at 11:20 PM, dawood abdullah < >>>>> muhammed.daw...@gmail.com> wrote: >>>>> >>>>>> I have tried with both the options creating secondary index and also >>>>>> tried adding parentid to primary key, but I am getting all the files with >>>>>> parentid 'yyy', what I want is the latest version of file with the >>>>>> combination of parentid, fileid. Say below are the records inserted in >>>>>> the >>>>>> file table: >>>>>> >>>>>> insert into file (id, parentid, version, contenttype, description, >>>>>> name) values ('f1', 'd1', '2011-03-04', 'pdf', 'f1 file', 'file1'); >>>>>> insert into file (id, parentid, version, contenttype, description, >>>>>> name) values ('f1', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); >>>>>> insert into file (id, parentid, version, contenttype, description, >>>>>> name) values ('f2', 'd1', '2011-03-05', 'pdf', 'f1 file', 'file1'); >>>>>> insert into file (id, parentid, version, contenttype, description, >>>>>> name) values ('f2', 'd1', '2011-03-06', 'pdf', 'f1 file', 'file1'); >>>>>> >>>>>> I want to write a query which returns me second and last record and >>>>>> not the first and third record, because for the first and third record >>>>>> there exists a latest version, for the combination of id and parentid. >>>>>> >>>>>> I am confused If at all this is achievable, please suggest. >>>>>> >>>>>> Dawood >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Sep 3, 2013 at 10:58 PM, Vivek Mishra >>>>>> <mishra.v...@gmail.com>wrote: >>>>>> >>>>>>> create secondary index over parentid. >>>>>>> OR >>>>>>> make it part of clustering key >>>>>>> >>>>>>> -Vivek >>>>>>> >>>>>>> >>>>>>> On Tue, Sep 3, 2013 at 10:42 PM, dawood abdullah < >>>>>>> muhammed.daw...@gmail.com> wrote: >>>>>>> >>>>>>>> Jan, >>>>>>>> >>>>>>>> The solution you gave works spot on, but there is one more >>>>>>>> requirement I forgot to mention. Following is my table structure >>>>>>>> >>>>>>>> CREATE TABLE file ( >>>>>>>> id text, >>>>>>>> contenttype text, >>>>>>>> createdby text, >>>>>>>> createdtime timestamp, >>>>>>>> description text, >>>>>>>> name text, >>>>>>>> parentid text, >>>>>>>> version timestamp, >>>>>>>> PRIMARY KEY (id, version) >>>>>>>> >>>>>>>> ) WITH CLUSTERING ORDER BY (version DESC); >>>>>>>> >>>>>>>> >>>>>>>> The query (select * from file where id = 'xxx' limit 1;) provided >>>>>>>> solves the problem of finding the latest version file. But I have one >>>>>>>> more >>>>>>>> requirement of finding all the latest version files having parentid say >>>>>>>> 'yyy'. >>>>>>>> >>>>>>>> Please suggest how can this query be achieved. >>>>>>>> >>>>>>>> Dawood >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Sep 3, 2013 at 12:43 AM, dawood abdullah < >>>>>>>> muhammed.daw...@gmail.com> wrote: >>>>>>>> >>>>>>>>> In my case version can be timestamp as well. What do you suggest >>>>>>>>> version number to be, do you see any problems if I keep version as >>>>>>>>> counter >>>>>>>>> / timestamp ? >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Sep 3, 2013 at 12:22 AM, Jan Algermissen < >>>>>>>>> jan.algermis...@nordsc.com> wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 02.09.2013, at 20:44, dawood abdullah < >>>>>>>>>> muhammed.daw...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>> > Requirement is like I have a column family say File >>>>>>>>>> > >>>>>>>>>> > create table file(id text primary key, fname text, version int, >>>>>>>>>> mimetype text, content text); >>>>>>>>>> > >>>>>>>>>> > Say, I have few records inserted, when I modify an existing >>>>>>>>>> record (content is modified) a new version needs to be created. As I >>>>>>>>>> need >>>>>>>>>> to have provision to revert to back any old version whenever >>>>>>>>>> required. >>>>>>>>>> > >>>>>>>>>> >>>>>>>>>> So, can version be a timestamp? Or does it need to be an integer? >>>>>>>>>> >>>>>>>>>> In the former case, make use of C*'s ordering like so: >>>>>>>>>> >>>>>>>>>> CREATE TABLE file ( >>>>>>>>>> file_id text, >>>>>>>>>> version timestamp, >>>>>>>>>> fname text, >>>>>>>>>> .... >>>>>>>>>> PRIMARY KEY (file_id,version) >>>>>>>>>> ) WITH CLUSTERING ORDER BY (version DESC); >>>>>>>>>> >>>>>>>>>> Get the latest file version with >>>>>>>>>> >>>>>>>>>> select * from file where file_id = 'xxx' limit 1; >>>>>>>>>> >>>>>>>>>> If it has to be an integer, use counter columns. >>>>>>>>>> >>>>>>>>>> Jan >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> > Regards, >>>>>>>>>> > Dawood >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > On Mon, Sep 2, 2013 at 10:47 PM, Jan Algermissen < >>>>>>>>>> jan.algermis...@nordsc.com> wrote: >>>>>>>>>> > Hi Dawood, >>>>>>>>>> > >>>>>>>>>> > On 02.09.2013, at 16:36, dawood abdullah < >>>>>>>>>> muhammed.daw...@gmail.com> wrote: >>>>>>>>>> > >>>>>>>>>> > > Hi >>>>>>>>>> > > I have a requirement of versioning to be done in Cassandra. >>>>>>>>>> > > >>>>>>>>>> > > Following is my column family definition >>>>>>>>>> > > >>>>>>>>>> > > create table file_details(id text primary key, fname text, >>>>>>>>>> version int, mimetype text); >>>>>>>>>> > > >>>>>>>>>> > > I have a secondary index created on fname column. >>>>>>>>>> > > >>>>>>>>>> > > Whenever I do an insert for the same 'fname', the version >>>>>>>>>> should be incremented. And when I retrieve a row with fname it should >>>>>>>>>> return me the latest version row. >>>>>>>>>> > > >>>>>>>>>> > > Is there a better way to do in Cassandra? Please suggest what >>>>>>>>>> approach needs to be taken. >>>>>>>>>> > >>>>>>>>>> > Can you explain more about your use case? >>>>>>>>>> > >>>>>>>>>> > If the version need not be a small number, but could be a >>>>>>>>>> timestamp, you could make use of C*'s ordering feature , have the >>>>>>>>>> database >>>>>>>>>> set the new version as a timestamp and retrieve the latest one with a >>>>>>>>>> simple LIMIT 1 query. (I'll explain more when this is an option for >>>>>>>>>> you). >>>>>>>>>> > >>>>>>>>>> > Jan >>>>>>>>>> > >>>>>>>>>> > P.S. Me being a REST/HTTP head, an alarm rings when I see >>>>>>>>>> 'version' next to 'mimetype' :-) What exactly are you versioning >>>>>>>>>> here? >>>>>>>>>> Maybe we can even change the situation from a functional POV? >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > > >>>>>>>>>> > > Regards, >>>>>>>>>> > > >>>>>>>>>> > > Dawood >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >