Expected vs Actual Bucket Behavior

2010-07-20 Thread Eric Filson
*Potatoes*: First a hello to the list ;)

*Meat*: I recently became interested in NoSQL solutions, so my following
statements may come from ignorance of this new type of DB schema design;
however, I thought it was worth mentioning...

To preface, I'm looking at NoSQL solutions to solve the "Big Data" problem
for a limited data set, rather than using Riak exclusively for storage.  My
proposed schema for Riak consists of one bucket per collection, per user.

The current behavior of Riak, when retrieving the contents of any given
bucket, requires examining all objects to determine which bucket they
belong to, effectively m/r'ing them down to your result set. This seems to
me to be quite a costly operation, and the logical choice is to store a
separate k/v pair that contains an index of the keys in a bucket. I would
think this requirement, retrieving all objects in a bucket, to be a _very_
commonplace occurrence in modern web development and perhaps (depending on
requirements) _the_ most common function aside from retrieving a single
k/v pair.
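
For concreteness, here is a rough Python sketch of what "fetch everything
in a bucket" looks like against the HTTP interface today: a map/reduce job
whose input is a whole bucket, so the full key space has to be walked to
find the members.  The node address and the bucket name are just
placeholders I made up:

    import json
    import urllib.request

    MAPRED_URL = "http://127.0.0.1:8098/mapred"   # assumed local Riak node

    # A job whose input is an entire (made-up) bucket; Riak must examine
    # every key in the system to decide which ones belong to this bucket
    # before the map function ever runs.
    job = {
        "inputs": "users_eric_photos",
        "query": [
            {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}}
        ],
    }

    req = urllib.request.Request(
        MAPRED_URL,
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        objects = json.loads(resp.read())   # every object in the bucket
    print(len(objects), "objects in bucket")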

In my mind, this seems to leave the only advantage to buckets in this
application to be namespacing... While certainly important, I'm fuzzy on
what the downside would be to allowing buckets to exist as a separate
partition/pseudo-table/etc... so that retrieving all objects in a bucket
would not need to read all objects in the entire system; especially
considering how common that usage is... It also seems to me that this
would more closely mimic a real-world "bucket", and expected behavior
would line up much more closely with actual behavior given the
terminology. If I'm examining a bucket, I'm looking at that one bucket,
never at all objects to see which bucket they belong to. I wouldn't use
the term "bucket" for the functionality as it currently stands, because
the keys aren't in buckets at all; they're all global. For all intents and
purposes, bucket == namespace in Riak, while "bucket" implies (to me)
something more than just a namespace.

This idea/concept may stem from my limited knowledge of nosql storage
engines but I do feel it has some merit. Especially when trying to garner
support from the development community.

Rather than changing Riak to fit this proposed model, and because querying
for the contents of a single bucket is so common, I might recommend a
hybrid solution (based on my limited knowledge of Riak)... What about
allowing a bucket property named something like "key_index" that points to
a key whose value is the list of keys in the bucket?  Then, when calling
GET /riak/bucket, Riak would use the key_index to immediately reduce its
result set before applying m/r funcs.  While I understand this is
essentially what a developer would do anyway, it would certainly alleviate
some code requirements (application side) as well as make the behavior of
retrieving a bucket's contents more "expected" and efficient.
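
To make the hybrid concrete, here is roughly the same thing done by hand
on the application side today, which is what the proposal would fold into
Riak.  The bucket name, the "key_index" object, and the node address are
assumptions for illustration; the index object is assumed to hold a plain
JSON list of the keys in the bucket:

    import json
    import urllib.request

    BASE = "http://127.0.0.1:8098"      # assumed local Riak node
    BUCKET = "users_eric_photos"        # hypothetical bucket
    INDEX_KEY = "key_index"             # hypothetical index object

    # 1. Fetch the index object: a JSON list such as ["photo1", "photo2"].
    with urllib.request.urlopen(f"{BASE}/riak/{BUCKET}/{INDEX_KEY}") as resp:
        keys = json.loads(resp.read())

    # 2. Run the m/r job over just those bucket/key pairs instead of the
    #    whole key space.
    job = {
        "inputs": [[BUCKET, k] for k in keys],
        "query": [
            {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}}
        ],
    }
    req = urllib.request.Request(
        f"{BASE}/mapred",
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        objects = json.loads(resp.read())   # only this bucket's members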

Anyway, information on Riak is pretty limited right now, seeing as it's so
new, but talk in my development circles is very positive and lively.  I
thought this might be the best place to pose my question/suggestion and
get some feedback.

-Eric


Re: Expected vs Actual Bucket Behavior

2010-07-20 Thread Eric Filson
On Tue, Jul 20, 2010 at 3:02 PM, Justin Sheehy  wrote:

> Hi, Eric!  Thanks for your thoughts.
>
> On Tue, Jul 20, 2010 at 12:39 PM, Eric Filson  wrote:
>
> > I would think this requirement, retrieving all objects in a bucket, to
> > be a _very_ commonplace occurrence in modern web development and perhaps
> > (depending on requirements) _the_ most common function aside from
> > retrieving a single k/v pair.
>
> I tend to see people that mostly try to write applications that don't
> select everything from a whole bucket/table/whatever as a very
> frequent occurrence, but different people have different requirements.
>  Certainly, it is sometimes unavoidable.
>

Indeed, in my case it is :(


>
> > In my mind, this seems to leave the only advantage to buckets in this
> > application to be namespacing... While certainly important, I'm fuzzy on
> > what the downside would be to allowing buckets to exist as a separate
> > partition/pseudo-table/etc... so that retrieving all objects in a bucket
> > would not need to read all objects in the entire system
>
> The namespacing aspect is a huge advantage for many people.  Besides
> the obvious way in which that allows people to avoid collisions, it is
> a powerful tool for data modeling.  For example, sets of 1-to-1
> relationships can be very nicely represented as something like
> "bucket1/keyA, bucket2/keyA, bucket3/keyA", which allows related items
> to be fetched without any intermediate queries at all.
>

I agree; however, the same thing can be accomplished by prefixing your
keys with a "namespace":

bucket_1_keyA, bucket_2_keyA, bucket_3_keyA

Obviously, buckets in Riak have additional functionality and allow for
more complex but easier-to-use m/r functions across multiple buckets,
etc...
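
As a tiny sketch of that equivalence (all names made up), fetching the
pieces of a 1-to-1 relationship by bucket versus by key prefix is the same
number of GETs either way:

    import urllib.request

    BASE = "http://127.0.0.1:8098/riak"   # assumed local Riak node

    def fetch(bucket, key):
        """Plain HTTP GET of a single object; returns the raw body."""
        with urllib.request.urlopen(f"{BASE}/{bucket}/{key}") as resp:
            return resp.read()

    user = "keyA"

    # Buckets as namespaces: one bucket per piece of the relationship.
    by_bucket = [fetch(b, user) for b in ("profiles", "settings", "avatars")]

    # One bucket, with the "namespace" folded into the key instead.
    by_prefix = [fetch("data", f"{p}_{user}")
                 for p in ("profiles", "settings", "avatars")]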


>
> One of the things that many users have become happily used to is that
> buckets in Riak are generally "free"; they come into existence on
> demand, and you can use as many of them as you want in the above or
> any other fashion.  This is in essence what conflicts with your
> desire.  Making buckets more fundamentally isolated from each other
> would be difficult without incurring some incremental cost per bucket.
>

For me, I am more than willing to add a small amount of overhead to the
storage engine in exchange for increased functionality and reduced
overhead at the application layer.  Again, this is obviously application
specific, and I'm not saying every implementation should be converted over
to buckets existing in their own space; but a different storage engine, or
a configuration option to allow this level/type of access, would certainly
be nice :)


> > I might recommend a hybrid solution (based on my limited knowledge of
> > Riak)... What about allowing a bucket property named something like
> > "key_index" that points to a key whose value is the list of keys in the
> > bucket?  Then, when calling GET /riak/bucket, Riak would use the
> > key_index to immediately reduce its result set before applying m/r
> > funcs.  While I understand this is essentially what a developer would do
> > anyway, it would certainly alleviate some code requirements (application
> > side) as well as make the behavior of retrieving a bucket's contents
> > more "expected" and efficient.
>
> A much earlier incarnation of Riak actually stored bucket keylists
> explicitly in a fashion somewhat like what you describe.  We removed
> this as one of our biggest goals is predictable and understandable
> behavior in a distributed systems sense, and a model like this one
> turns each write operation into at least two operations.  This isn't
> just a performance issue, but also adds complexity.  For instance, it
> is not immediately obvious what should be returned to the client if a
> data item write succeeds, but the read/write of the index fails?
>

Haha, these are the exact reasons I would cite as a developer for wanting
a similar method on Riak's side... Without the option of automatic bucket
indexing, this double write is simply pushed to the application side,
where it costs more cycles and more data across the wire.  Instead of
doing a single write from the application and letting Riak handle the
rest, you have to GET index_key, UPDATE index_key, ADD new_key... So
rather than a single transaction with Riak, you have three transactions
with Riak plus the application-side bookkeeping.  Inherently, this adds
another level of complexity to the application code base for something
that could be done more efficiently by the DB engine itself.
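
Spelled out, those three round trips look something like this (the node
address, bucket, and key names below are placeholders; a real client would
also carry the object's vclock when rewriting the index, and has to decide
what to do if step 2 or 3 fails):

    import json
    import urllib.request

    BASE = "http://127.0.0.1:8098/riak"   # assumed local Riak node
    BUCKET = "users_eric_photos"          # hypothetical bucket
    INDEX_KEY = "key_index"               # hypothetical index object

    def put(bucket, key, body):
        """HTTP PUT of a JSON value to bucket/key."""
        req = urllib.request.Request(
            f"{BASE}/{bucket}/{key}", data=body,
            headers={"Content-Type": "application/json"}, method="PUT")
        urllib.request.urlopen(req)

    new_key = "photo_42"

    # 1. GET the index object (a JSON list of the keys in the bucket).
    with urllib.request.urlopen(f"{BASE}/{BUCKET}/{INDEX_KEY}") as resp:
        keys = json.loads(resp.read())

    # 2. UPDATE the index with the new key and write it back.
    if new_key not in keys:
        keys.append(new_key)
        put(BUCKET, INDEX_KEY, json.dumps(keys).encode("utf-8"))

    # 3. ADD the new object itself.
    put(BUCKET, new_key, json.dumps({"caption": "..."}).encode("utf-8"))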

I would think a separate error number and message would suffice as a return
error, obviously though, this would requ