Keith, you are pretty close! Everything could go into one bucket -- that's not really an issue. About this:
> This I don't know how to do based on my reading of the docs. Something like:
>
>     get /buckets/mydata/index/device_bin/FF345678912
>
> which would return a list of ... what, device-timestamp compound keys? And
> then would I feed a potentially huge list of "bucket/key" pairs into a
> gigantic JavaScript query for the map-reduce phase?

You wouldn't do a GET on it; you would initiate a map-reduce operation by
issuing a POST to the /mapred endpoint. You can see an example here:

    http://wiki.basho.com/Secondary-Indexes.html

Look for the section titled "Exact Match Query". You just need a simple
map-phase function that emits only the objects whose timestamp is in the
desired range. It's a lot faster than you might think, because the map
function executes on all nodes in the cluster simultaneously. For best
performance you would write the map function in Erlang rather than
JavaScript, because JavaScript carries some extra overhead that an Erlang
function avoids.

I do believe that you can use Riak very well to handle what your
application requires. I've appended a rough, untested sketch at the bottom
of this message (below the quoted thread) to show the shape of it. Give me
a shout off-list if you like and I'll put together a complete working
example to get you started.

--gordon

On Nov 12, 2011, at 17:43, Keith Irwin wrote:

> On Nov 12, 2011, at 2:32 PM, Gordon Tillman wrote:
>
>> Keith I have an idea that might work for you. This is a bit vague, but I
>> would be glad to put together a more concrete example if you like.
>
> Okay, thanks! Not sure I understand everything, though.
>
>> Use secondary indexes to tag each entry with the device id.
>
> I get the tagging part, but I'm not sure what the bucket and key being
> tagged would look like. Are you assuming a single bucket for all data?
>
>     put /buckets/mydata/keys/<device>-<timestamp>
>     x-riak-index-device_bin: FF06541287AB
>
> Something like that?
>
>> You can then find all of the entries for a given device by using the
>> secondary index to feed into a simple map-phase operation that returns
>> only the entries that you want; i.e., those that are in a given time
>> range.
>
> This I don't know how to do based on my reading of the docs. Something
> like:
>
>     get /buckets/mydata/index/device_bin/FF345678912
>
> which would return a list of ... what, device-timestamp compound keys?
> And then would I feed a potentially huge list of "bucket/key" pairs into
> a gigantic JavaScript query for the map-reduce phase?
>
>> In addition, to find all of the registered device ids easily, you can
>> create one entry for each device. The key can be most anything (even the
>> device id if you encode it properly -- hash it), and you could tag each
>> of those entries with a secondary index whose field is something like
>> "type" and whose value is "deviceid". The value for each entry could be
>> just a simple text/plain value whose contents is the device id of the
>> registered device.
>
> Okay, I think I get this:
>
> When a device comes in, just do something like:
>
>     put /buckets/devices/<device-id>
>     x-riak-index-type_bin: "device"
>
> When I want a list of device IDs, I can:
>
>     get /buckets/devices/index/type_bin/device
>
> and get them all, right? This is more efficient than the various list
> functions? That makes sense to me.
>
> I guess I'll have to try a few examples and see what happens. What you're
> telling me is that what I want to do is possible, or at least is not
> pressing against Riak's particular trade-offs too much. Or at least I
> hope that's what you're telling me.
> ;)
>
> Keith
>
>
>> --gordon
>>
>> On Nov 12, 2011, at 16:19, Keith Irwin wrote:
>>
>>> Folks--
>>>
>>> (Apologies up front for the length of this.)
>>>
>>> I'm wondering if you can let me know if Riak is a good fit for a simple
>>> not-quite-key-value scenario described below. MongoDB or (say)
>>> PostgreSQL seem a more natural fit conceptually, but I really, really
>>> like Riak's distribution strategy.
>>>
>>> ## context
>>>
>>> The basic overview is this:
>>>
>>> 50K devices push data once a second to web services which need to store
>>> that data in short-term storage (Riak). Once an hour, a sweeper needs
>>> to take an hour's worth of data per device (if there is any) and ship
>>> it off to long-term storage, then delete it from short-term storage.
>>> Ideally, there'd only ever be slightly more than 1 hour's worth of data
>>> still in short-term storage for any given device. The goal is to write
>>> down the data as simply and safely as possible, with little or no
>>> processing on that data.
>>>
>>> Each second's worth of data is:
>>>
>>> * A device identifier
>>> * A timestamp (epoch seconds, integer) for the slice of time the data
>>>   represents
>>> * An opaque blob of binary data (2 to 4k)
>>>
>>> Once an hour, I'd like to do something like:
>>>
>>> * For each device:
>>>   * Find (and concat) all the data between time1 and time2 (an hour).
>>>   * Move that data to long-term storage (not Riak) as a single blob.
>>>   * Delete that data from Riak.
>>>
>>> For an SQL db, this is a really simple problem, conceptually. You can
>>> have a table with three columns: device-id, timestamp, blob. You can
>>> index the first two columns, roll up the data easily enough, and then
>>> delete it via single SQL statements (or buffer as needed). The harder
>>> part is partitioning, replication, etc.
>>>
>>> For MongoDB, it's also fairly simple. Just use a document with the same
>>> device-id, timestamp, and binary-array data (as JSON), make sure
>>> indexes are declared, and query/delete just as in SQL. MongoDB provides
>>> sharding, replica sets, recovery, etc. Setup, while less complicated
>>> than an RDBMS, still seems way more complicated than necessary.
>>>
>>> These solutions also provide sorting (which, while nice, isn't a
>>> requirement for my case).
>>>
>>> ## question
>>>
>>> I've been reading the Riak docs, and I'm just not sure if this simple
>>> "queryable" case can really fit all that well. I'm not so concerned
>>> about having to send 50K "deletes" to delete data. I'm more concerned
>>> about being able to find it. Given what I've written above, I may be
>>> blocked conceptually by the above index/query mentality such that I'm
>>> just not seeing the Riak way of doing things.
>>>
>>> Anyway, I can "tag" (via the secondary index feature) each blob of data
>>> with the device-id and the timestamp. I could then do a range query
>>> similar to:
>>>
>>>     GET /buckets/devices/index/timestamp/start/end
>>>
>>> However, this doesn't allow me to group based on device-id. I could
>>> create a separate bucket for every device, such that I could do:
>>>
>>>     GET /buckets/device-id/index/timestamp/start/end
>>>
>>> but if I do this, how can I get a list of the device-ids I need so that
>>> I can create that specific URL? The docs say listing buckets and keys
>>> is problematic.
>>>
>>> Might be that Riak just isn't a good fit for this sort of thing,
>>> especially given I want to use it for short-term transient data, and
>>> that's fine.
>>> But I wanted to ask you all just to make sure that I'm not missing
>>> something somewhere.
>>>
>>> For instance, might link walking help? How about a map/reduce to find a
>>> unique list of device-ids within a given time horizon, and a streaming
>>> map job to gather the data for export? Does that seem pretty
>>> reasonable?
>>>
>>> Thanks!
>>>
>>> Keith
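P.S. Keith -- here is the rough, untested sketch I promised up top, just to
make the shape of it concrete. I'm assuming a single bucket named "mydata",
keys of the form <device>-<timestamp>, index fields named device_bin and
type_bin (pick whatever names you like), a node listening on
localhost:8098, and a cluster configured with the eLevelDB backend
(secondary indexes require it). The JavaScript map function just parses the
timestamp off the end of the key; you can swap in an Erlang equivalent
later if you need the extra speed.

    # 1. Write one second of data, tagged with the device id.
    #    (blob.bin stands in for the 2-4k opaque payload.)
    curl -X PUT \
      -H "Content-Type: application/octet-stream" \
      -H "x-riak-index-device_bin: FF06541287AB" \
      --data-binary @blob.bin \
      http://localhost:8098/buckets/mydata/keys/FF06541287AB-1321142400

    # 2. Register the device once so you can enumerate device ids later,
    #    then list them all with an exact-match index query.
    curl -X PUT \
      -H "Content-Type: text/plain" \
      -H "x-riak-index-type_bin: device" \
      -d "FF06541287AB" \
      http://localhost:8098/buckets/devices/keys/FF06541287AB

    curl http://localhost:8098/buckets/devices/index/type_bin/device
    # -> {"keys":["FF06541287AB", ...]}

    # 3. Hourly sweep for one device: the exact-match index query feeds a
    #    map phase that keeps only the keys whose timestamp falls inside
    #    the hour you care about. Your sweeper can then GET, concatenate,
    #    ship, and DELETE those keys.
    curl -X POST http://localhost:8098/mapred \
      -H "Content-Type: application/json" \
      -d '{
        "inputs": {"bucket": "mydata",
                   "index": "device_bin",
                   "key": "FF06541287AB"},
        "query": [
          {"map": {
             "language": "javascript",
             "source": "function(v, keyData, arg) { var ts = parseInt(v.key.split(\"-\").pop(), 10); return (ts >= arg.start && ts < arg.end) ? [v.key] : []; }",
             "arg": {"start": 1321142400, "end": 1321146000},
             "keep": true}}
        ]
      }'

Again, that is only meant to show the shape of it; when we talk off-list I
can put together something complete against your actual key scheme.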
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com