Data Model Review

Adam Venturella Mon, 17 Dec 2012 07:35:57 -0800

I am capturing some information about Instagram photos from the API, and I
have 2 use cases. One, I need to capture all of the media data for an
account, and two, I need to be able to privately annotate that data. There
is some nuance in this, multiple HTTP queries for example, but ignoring
that, and assuming I have obtained all of the data surrounding an account's
photos, here is how I was thinking of storing that information for use case
1.


ColumnFamily: InstagramPhotos

Row Key: <account_username>

Columns:
Column Name: <date_posted_timestamp>
Column Value: JSON representing the data for the individual photo (filter,
comments, likes, etc., not the binary photo data).



So the idea would be to keep adding columns to the row that contain that
serialized data (in JSON) with their timestamps as the name. Timestamps as
the column names, I figure, should help to perform range queries, where the
1st column in the row has the earliest timestamp and the last column has
the most recent. I could probably also use TimeUUIDs here, since I will
have things ordered prior to inserting.
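
To make the range-query idea concrete, here is a minimal in-memory sketch in
Python. It assumes integer Unix timestamps as column names. The WideRow
class, its methods, and the sample data are hypothetical; they only imitate
a row whose columns stay sorted by name and are not a Cassandra client API.

```python
import bisect
import json


class WideRow:
    """Hypothetical stand-in for one row whose columns stay sorted by name."""

    def __init__(self):
        self._names = []   # column names (timestamps), kept sorted
        self._values = {}  # column name -> JSON string

    def insert(self, timestamp, photo):
        # A column with the same name is overwritten, not duplicated.
        if timestamp not in self._values:
            bisect.insort(self._names, timestamp)
        self._values[timestamp] = json.dumps(photo)

    def slice(self, start, finish):
        """Return (name, decoded value) pairs with start <= name <= finish."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(n, json.loads(self._values[n])) for n in self._names[lo:hi]]


# Row key is the account username; usernames and photo data are made up.
instagram_photos = {"some_account": WideRow()}
row = instagram_photos["some_account"]
row.insert(1355700000, {"filter": "Amaro", "likes": 3})
row.insert(1355600000, {"filter": "Rise", "likes": 10})
row.insert(1355800000, {"filter": "Normal", "likes": 0})

# Range query: everything posted inside the given window, oldest first.
recent = row.slice(1355650000, 1355800000)
```

Note that in this sketch two photos with the same timestamp would share a
column name and the second would overwrite the first, which may be one more
reason to consider TimeUUIDs for the column names.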

The question here: does this approach make sense? Is it common to store
JSON in columns like this? I know there are super columns as well, so I
suppose I could use those instead of JSON. The extra level of indexing
would probably be useful to query specific photos for use case 2. I have
heard it is best to try to avoid the use of super columns for now. I have
no information to back that claim up other than some time spent in IRC, so
feel free to debunk that statement if it is false.

So that is use case one. Use case two covers the private annotations.

I figured here:

ColumnFamily: InstagramAnnotations
Row Key: Canonical Media Id

Column Name: TimeUUID
Column Value: JSON representing an annotation/internal comment
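
As a rough illustration of this second layout, the Python sketch below uses
the standard library's uuid.uuid1() to produce version 1 (time-based) UUIDs
as column names. The dictionary standing in for the column family, the
helper functions, and the media id are all hypothetical and only model the
shape of the data.

```python
import json
import uuid

# Row key: canonical media id; columns: TimeUUID -> JSON annotation.
instagram_annotations = {}


def annotate(media_id, text):
    """Add one annotation column to the row for media_id."""
    column_name = uuid.uuid1()  # version 1 UUID, embeds a timestamp
    row = instagram_annotations.setdefault(media_id, {})
    row[column_name] = json.dumps({"text": text})
    return column_name


def annotations_in_time_order(media_id):
    """Return decoded annotations sorted by the UUID's embedded timestamp."""
    row = instagram_annotations.get(media_id, {})
    # Sort on the timestamp field, not on the raw UUID value, because the
    # low bits of the timestamp come first in the UUID layout.
    names = sorted(row, key=lambda u: u.time)
    return [json.loads(row[n]) for n in names]


first = annotate("media-123", "check licensing")
annotate("media-123", "approved for reuse")
notes = annotations_in_time_order("media-123")
```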


Writing out the above, I can actually see where I might need to tighten
some things up around how I store the photos. I am clearly missing an
obvious connection between InstagramPhotos and InstagramAnnotations. Maybe
super columns would help with the photos instead of JSON? Otherwise I would
need to build an index row where I tie the canonical photo id to a
timestamp (column name) in InstagramPhotos. I could also try to figure out
how to make a TimeUUID of my own that can double as the media's canonical
id, or look further at Instagram's canonical id for photos and see if it
already counts up, in which case I could use that in place of a timestamp.
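
For the index row option, a minimal Python sketch might look like the
following. It assumes each photo's JSON carries its canonical id under an
"id" key; that key, the photo_index structure, the helper names, and the
sample data are all hypothetical.

```python
import json

# InstagramPhotos: row key = username, column name = posted timestamp.
instagram_photos = {
    "some_account": {
        1355600000: json.dumps({"id": "media-1", "filter": "Rise"}),
        1355700000: json.dumps({"id": "media-2", "filter": "Amaro"}),
    }
}

# Hypothetical index: row key = username, column name = canonical media id,
# column value = the timestamp used as the column name in InstagramPhotos.
photo_index = {}


def index_photos(username):
    """Build (or refresh) the index row for one account."""
    row = photo_index.setdefault(username, {})
    for timestamp, value in instagram_photos[username].items():
        row[json.loads(value)["id"]] = timestamp


def photo_by_media_id(username, media_id):
    """Resolve a canonical media id to its (timestamp, photo data) pair."""
    timestamp = photo_index[username][media_id]
    return timestamp, json.loads(instagram_photos[username][timestamp])


index_photos("some_account")
found = photo_by_media_id("some_account", "media-2")
```

The cost of this design is a second write per photo and two reads to go
from an annotation's media id back to the photo data.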

Anyway, I figured I would see if anyone might help flesh out other
potential pitfalls in the above. I am definitely new to Cassandra and I am
using this project as a way to learn some more about assembling systems
using it.
