If last_modified is a clustering column, it needs a partitioning column, which 
is what date is for (although I should have named it day, and I also forgot to 
add the order by desc clause). This is essentially what I came up with. Still 
not liking how easy it is to get duplicates.

On Jul 21, 2015, at 9:31 AM, Jack Krupansky 
<jack.krupan...@gmail.com<mailto:jack.krupan...@gmail.com>> wrote:

Keep the original document base table, but then the query table should have the 
PK as last_modified, docId, with last_modified descending, so that a query can 
get the n most recently modified documents.

Yes, you still need to manually delete the old entry for the document in the 
query table if duplicates are a problem for you.

Yeah, a TTL would be good if you don't care about documents modified a month or 
a week ago.

-- Jack Krupansky

On Tue, Jul 21, 2015 at 11:13 AM, Carlos Alonso 
<i...@mrcalonso.com<mailto:i...@mrcalonso.com>> wrote:
Hi Robert,

What about modelling it as a time serie?

CREATE TABLE document (
  docId UUID,
  doc TEXT,
  last_modified TIMESTAMP
  PRIMARY KEY(docId, last_modified)
) WITH CLUSTERING ORDER BY (last_modified DESC);

This way, you the lastest modification will always be the first record in the 
row, therefore accessing it should be as easy as:

SELECT * FROM document WHERE docId == <the docId> LIMIT 1;

And, if you experience diskspace issues due to very long rows, then you can 
always expire old ones using TTL or on a batch job. Tombstones will never be a 
problem in this case as, due to the specified clustering order, the latest 
modification will always be first record in the row.

Hope it helps.

Carlos Alonso | Software Engineer | @calonso<https://twitter.com/calonso>

On 21 July 2015 at 05:59, Robert Wille 
<rwi...@fold3.com<mailto:rwi...@fold3.com>> wrote:
Data structures that have a recently-modified access pattern seem to be a poor 
fit for Cassandra. I’m wondering if any of you smart guys can provide 
suggestions.

For the sake of discussion, lets assume I have the following tables:

CREATE TABLE document (
        docId UUID,
        doc TEXT,
        last_modified TIMEUUID,
        PRIMARY KEY ((docid))
)

CREATE TABLE doc_by_last_modified (
        date TEXT,
        last_modified TIMEUUID,
        docId UUID,
        PRIMARY KEY ((date), last_modified)
)

When I update a document, I retrieve its last_modified time, delete the current 
record from doc_by_last_modified, and add a new one. Unfortunately, if you’d 
like each document to appear at most once in the doc_by_last_modified table, 
then this doesn’t work so well.

Documents can get into the doc_by_last_modified table multiple times if there 
is concurrent access, or if there is a consistency issue.

Any thoughts out there on how to efficiently provide recently-modified access 
to a table? This problem exists for many types of data structures, not just 
recently-modified. Any ordered data structure that can be dynamically reordered 
suffers from the same problems. As I’ve been doing schema design, this pattern 
keeps recurring. A nice way to address this problem has lots of applications.

Thanks in advance for your thoughts

Robert




Reply via email to