Folks--

(Apologies up front for the length of this.)

Is Riak a good fit for the simple, not-quite-key-value scenario described 
below? MongoDB or (say) PostgreSQL seems a more natural fit conceptually, but 
I really, really like Riak's distribution strategy.

## context

The basic overview is this: 

50K devices push data once a second to web services which need to store that 
data in short-term storage (Riak). Once an hour, a sweeper needs to take an 
hour's worth of data per device (if there is any) and ship it off to long term 
storage, then delete it from short-term storage. Ideally, there'd only ever be 
slightly more than 1 hour's worth of data still in short-term storage for any 
given device. The goal is to write down the data as simply and safely as 
possible, with little or no processing on that data.

Each second's worth of data is:

* A device identifier
* A timestamp (epoch seconds, integer) for the slice of time the data represents
* An opaque blob of binary data (2 to 4 KB)
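
To put rough numbers on the load, here is a back-of-the-envelope calculation using only the figures above. It assumes "2 to 4 KB" means 2048 to 4096 bytes and ignores replication and per-object overhead; the variable names are just for illustration.

```python
# Rough sizing from the figures in this post (replication and
# per-object overhead are NOT included).
DEVICES = 50_000
WRITES_PER_DEVICE_PER_SEC = 1
BLOB_BYTES_MIN = 2 * 1024   # assumption: "2k" = 2048 bytes
BLOB_BYTES_MAX = 4 * 1024   # assumption: "4k" = 4096 bytes
SECONDS_PER_HOUR = 3600

writes_per_sec = DEVICES * WRITES_PER_DEVICE_PER_SEC
objects_per_hour = writes_per_sec * SECONDS_PER_HOUR
bytes_per_hour_min = objects_per_hour * BLOB_BYTES_MIN
bytes_per_hour_max = objects_per_hour * BLOB_BYTES_MAX

GIB = 1024 ** 3
print(f"writes/sec:       {writes_per_sec:,}")
print(f"objects per hour: {objects_per_hour:,}")
print(f"data per hour:    {bytes_per_hour_min / GIB:.0f} to "
      f"{bytes_per_hour_max / GIB:.0f} GiB")
```

So short-term storage holds on the order of 180 million small objects, a few hundred GiB, at any given time.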

Once an hour, I'd like to do something like:

* For each device:
    * Find (and concat) all the data between time1 and time2 (an hour).
    * Move that data to long-term storage (not Riak) as a single blob.
    * Delete that data from Riak.
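
The sweep steps above can be sketched independently of any database. This is only a sketch: a plain dict keyed by `(device_id, timestamp)` stands in for short-term storage, and `sweep` is a hypothetical name. The returned dict is what would be shipped to long-term storage.

```python
from collections import defaultdict

def sweep(store, start, end):
    """Roll up and remove every device's data with start <= timestamp < end.

    `store` is an in-memory stand-in for short-term storage:
    a dict mapping (device_id, timestamp) -> bytes.
    Returns {device_id: concatenated_blob}.
    """
    # Group the matching samples by device.
    by_device = defaultdict(list)
    for (device_id, ts), blob in store.items():
        if start <= ts < end:
            by_device[device_id].append((ts, blob))

    archives = {}
    for device_id, rows in by_device.items():
        rows.sort(key=lambda row: row[0])  # order by timestamp
        archives[device_id] = b"".join(blob for _, blob in rows)
        # Delete only what was just rolled up.
        for ts, _ in rows:
            del store[(device_id, ts)]
    return archives
```

In a real sweeper, the delete would presumably happen only after the long-term write is confirmed.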

For an SQL db, this is a really simple problem, conceptually. You can have a 
table with three columns: device-id, timestamp, blob. You can index the first 
two columns, roll up the data easily enough, and then delete it with a single 
SQL statement per device (or in batches as needed). The harder part is 
partitioning, replication, and so on.
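
For comparison, here is a minimal sketch of the SQL version, using Python's built-in `sqlite3` purely as a stand-in for a real RDBMS. The table name `samples`, its column names, and the `roll_up` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The composite primary key doubles as the (device_id, ts) index.
conn.execute(
    """
    CREATE TABLE samples (
        device_id TEXT    NOT NULL,
        ts        INTEGER NOT NULL,  -- epoch seconds
        data      BLOB    NOT NULL,
        PRIMARY KEY (device_id, ts)
    )
    """
)

def roll_up(conn, device_id, start, end):
    """Return one device's blobs for start <= ts < end, concatenated in
    timestamp order, and delete those rows."""
    window = (device_id, start, end)
    rows = conn.execute(
        "SELECT data FROM samples "
        "WHERE device_id = ? AND ts >= ? AND ts < ? ORDER BY ts",
        window,
    ).fetchall()
    conn.execute(
        "DELETE FROM samples WHERE device_id = ? AND ts >= ? AND ts < ?",
        window,
    )
    conn.commit()
    return b"".join(row[0] for row in rows)
```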

For MongoDB, it's also fairly simple. Just use a document with the same 
device-id, timestamp and binary data fields (stored as BSON), make sure 
indexes are declared, and query/delete just as in SQL. MongoDB provides 
sharding, replica-sets, recovery, etc. Setup, while less complicated than for 
an RDBMS, still seems far more complicated than necessary.

These solutions also provide sorting (which, while nice, isn't a requirement 
for my case).

## question

I've been reading the Riak docs, and I'm just not sure this simple 
"queryable" case fits Riak all that well. I'm not so concerned about 
having to send 50K "deletes" to delete data. I'm more concerned about being 
able to find it. Given what I've written above, I may be so stuck in the 
index/query mentality that I'm just not seeing the Riak way of doing things.

Anyway, I can "tag" (via the secondary index feature) each blob of data with 
the device-id and the timestamp. I could then do a range query similar to:

    GET /buckets/devices/index/timestamp_int/<start>/<end>

However, this doesn't allow me to group based on device-id. I could create a 
separate bucket for every device, such that I could do:

    GET /buckets/<device-id>/index/timestamp_int/<start>/<end>

but if I do this, how can I get a list of the device-ids I need so that I can 
create that specific URL? The docs say listing buckets and keys is problematic.
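
One variation I can imagine, sketched here only to make the question concrete: fold the device-id and timestamp into a single binary index value, so that one range query in one bucket covers one device. Everything in this sketch is an assumption: the index name `device_ts_bin`, the `:` separator, and the 10-digit zero padding (so that string order matches numeric order) are all made up, and I haven't tried it against a real cluster. It also does nothing to solve the problem of finding the list of device-ids.

```python
def composite_value(device_id: str, ts: int) -> str:
    """Hypothetical composite index value: device-id plus a zero-padded
    timestamp, so lexicographic order matches timestamp order."""
    return f"{device_id}:{ts:010d}"

def range_query_path(bucket: str, device_id: str, start: int, end: int,
                     index: str = "device_ts_bin") -> str:
    """Build the path for a secondary-index range query covering one
    device's samples from `start` to `end`.

    Device-ids containing '/' or other reserved characters would need
    URL-encoding, which this sketch does not do.
    """
    lo = composite_value(device_id, start)
    hi = composite_value(device_id, end)
    return f"/buckets/{bucket}/index/{index}/{lo}/{hi}"

print(range_query_path("devices", "dev42", 1000, 4599))
```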

It may be that Riak just isn't a good fit for this sort of thing, especially 
given that I want to use it for short-term, transient data, and that's fine. 
But I wanted to ask you all just to make sure that I'm not missing something 
somewhere.

For instance, might link walking help? How about a map/reduce to find a unique 
list of device-ids within a given time-horizon, and a streaming map job to 
gather the data for export? Does that seem pretty reasonable?

Thanks!

Keith
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com