Keith, you are pretty close!

Everything could go into one bucket; that's not really an issue.  About this:

> 
> This I don't know how to do based on my reading of the docs. Something like:
> 
>    get /buckets/mydata/index/device_bin/FF345678912
> 
> which would return a list of .... what, device-timestamp compound keys? And 
> then would I feed a potentially huge list of "bucket/key" pairs into a 
> gigantic javascript query for the map-reduce phase?


You wouldn't do a GET on it; you would initiate a map-reduce operation by 
issuing a POST to the /mapred endpoint.  You can see an example here:  
http://wiki.basho.com/Secondary-Indexes.html  Look for the section titled 
"Exact Match Query".

You just need a simple map-phase function that emits only the objects whose 
timestamp falls in the desired range.  It's a lot faster than you might think 
because the map function executes on all nodes in the cluster simultaneously.
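
Something like this, completely untested and assuming the <device>-<timestamp> 
key scheme from your example, so treat it as a starting point rather than 
working code:

   // Map phase: keep only objects whose timestamp (parsed from the
   // <device>-<timestamp> key) falls in the [start, end) range passed via arg.
   function(value, keyData, arg) {
     var ts = parseInt(value.key.split("-")[1], 10);
     if (ts >= arg.start && ts < arg.end) {
       // Emit [key, blob] pairs for the sweeper to concatenate.
       return [[value.key, value.values[0].data]];
     }
     return [];
   }

You'd ship that as the "source" of the map phase and pass the hour boundaries 
in "arg", e.g. {"arg": {"start": 1321113600, "end": 1321117200}}.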

And for best performance you would use a map function written in Erlang rather 
than JavaScript, since JavaScript incurs some extra overhead that you don't 
have with a native Erlang function.

I do believe Riak can handle what your application requires very well.

Give me a shout off-list if you like and I'll put together a working 
example to get you started.

--gordon


On Nov 12, 2011, at 17:43, Keith Irwin wrote:

> On Nov 12, 2011, at 2:32 PM, Gordon Tillman wrote:
> 
>> Keith, I have an idea that might work for you.  This is a bit vague, but I 
>> would be glad to put together a more concrete example if you like.
> 
> Okay, thanks! Not sure I understand everything, though.
> 
>> Use secondary indexes to tag each entry with the device id.
> 
> I get the tagging part, but I'm not sure what the bucket and key being tagged 
> would look like. Are you talking about a single bucket for all data?
> 
> put /buckets/mydata/keys/<device>-<timestamp>
> x-riak-index-device_bin: FF06541287AB
> 
> Something like that?
> 
>> You can then find all of the entries for a given device by using the 
>> secondary index to feed into a simple map phase operation that returns only 
>> the entries that you want; i.e., those that are in a given time range.
> 
> This I don't know how to do based on my reading of the docs. Something like:
> 
>    get /buckets/mydata/index/device_bin/FF345678912
> 
> which would return a list of .... what, device-timestamp compound keys? And 
> then would I feed a potentially huge list of "bucket/key" pairs into a 
> gigantic javascript query for the map-reduce phase?
> 
>> In addition, to easily find all of the registered device ids you can 
>> create one entry for each device.  The key can be most anything (even the 
>> device id if you encode it properly -- hash it), and you could tag each of 
>> those entries with a secondary index whose field is something like "type" or 
>> whatever and whose value is "deviceid".  The value for each entry could be 
>> just a simple text/plain value whose contents is just the device id of the 
>> registered device.
> 
> Okay, I think I get this:
> 
> When a device comes in, just do something like:
> 
> put /buckets/devices/keys/<device-id>
> x-riak-index-type_bin: "device"
> 
> When I want a list of device IDs, I can:
> 
> get /buckets/devices/index/type_bin/device
> 
> and get them all, right? This is more efficient than the various list 
> functions? That makes sense to me.
> 
> I guess I'll have to try a few examples and see what happens. What you're 
> telling me is that what I want to do is possible, or is at least not pressing 
> against Riak's particular trade-offs too much. Or at least I hope that's what 
> you're telling me. ;)
> 
> Keith
> 
> 
>> 
>> --gordon
>> 
>> On Nov 12, 2011, at 16:19, Keith Irwin wrote:
>> 
>>> Folks--
>>> 
>>> (Apologies up front for the length of this.)
>>> 
>>> I'm wondering if you can let me know if Riak is a good fit for a simple 
>>> not-quite-key-value scenario described below. MongoDB or (say) Postgresql 
>>> seem a more natural fit conceptually, but I really, really like Riak's 
>>> distribution strategy.
>>> 
>>> ## context
>>> 
>>> The basic overview is this: 
>>> 
>>> 50K devices push data once a second to web services which need to store 
>>> that data in short-term storage (Riak). Once an hour, a sweeper needs to 
>>> take an hour's worth of data per device (if there is any) and ship it off 
>>> to long-term storage, then delete it from short-term storage. Ideally, 
>>> there'd only ever be slightly more than 1 hour's worth of data still in 
>>> short-term storage for any given device. The goal is to write down the data 
>>> as simply and safely as possible, with little or no processing on that data.
>>> 
>>> Each second's worth of data is:
>>> 
>>> * A device identifier
>>> * A timestamp (epoch seconds, integer) for the slice of time the data 
>>> represents
>>> * An opaque blob of binary data (2 to 4k)
>>> 
>>> Once an hour, I'd like to do something like:
>>> 
>>> * For each device:
>>>     * Find (and concat) all the data between time1 and time2 (an hour).
>>>     * Move that data to long-term storage (not Riak) as a single blob.
>>>     * Delete that data from Riak.
>>> 
>>> For an SQL db, this is a really simple problem, conceptually. You can have 
>>> a table with three columns: device-id, timestamp, blob. You can index the 
>>> first two columns and roll up the data easily enough and then delete it via 
>>> single SQL statements (or buffer as needed). The harder part is 
>>> partitioning, replication, etc, etc.
>>> 
>>> For MongoDB, it's also fairly simple. Just use a document with the same 
>>> device-id, timestamp and binary-array data (as JSON), make sure indexes are 
>>> declared, and query/delete just as in SQL. MongoDB provides sharding, 
>>> replica-sets, recovery, etc. Setup, while less complicated than an RDBMS, 
>>> still seems way more complicated than necessary.
>>> 
>>> These solutions also provide sorting (which, while nice, isn't a 
>>> requirement for my case).
>>> 
>>> ## question
>>> 
>>> I've been reading the Riak docs, and I'm just not sure if this simple 
>>> "queryable" case can really fit all that well. I'm not so concerned about 
>>> having to send 50K "deletes" to delete data. I'm more concerned about being 
>>> able to find it. Given what I've written above, I may be blocked 
>>> conceptually by that index/query mentality such that I'm just not 
>>> seeing the Riak way of doing things.
>>> 
>>> Anyway, I can "tag" (via the secondary index feature) each blob of data 
>>> with the device-id and the timestamp. I could then do a range query similar 
>>> to:
>>> 
>>>  GET /buckets/devices/index/timestamp/start/end
>>> 
>>> However, this doesn't allow me to group based on device-id. I could create 
>>> a separate bucket for every device, such that I could do:
>>> 
>>>  GET /buckets/device-id/index/timestamp/start/end
>>> 
>>> but if I do this, how can I get a list of the device-ids I need so that I 
>>> can create that specific URL? The docs say listing buckets and keys is 
>>> problematic.
>>> 
>>> Might be that Riak just isn't a good fit for this sort of thing, 
>>> especially given I want to use it for short-term transient data, and that's 
>>> fine. But I wanted to ask you all just to make sure that I'm not missing 
>>> something somewhere.
>>> 
>>> For instance, might link walking help? How about a map/reduce to find a 
>>> unique list of device-ids within a given time-horizon, and a streaming map 
>>> job to gather the data for export? Does that seem pretty reasonable?
>>> 
>>> Thanks!
>>> 
>>> Keith
>> 
> 
> 

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
