Hmm, but if the username (or email) must be unique then I think this problem 
may be more than just indexing.  There's also an id reservation issue.

At least in my case, I did not think 2i would work for me, because I needed 
one of the indexes (email) to be unique and reservable on a first-come basis, 
yet I wanted people to be able to change email addresses.  The resulting 
algorithm is a bit disgusting but it essentially works like this:

1) All keys are tagged with extra metadata:
  a) creation date (Riak only tracks a last-modified date)
  b) a random "value" id (I'm using a UUID for this).  This is meant to
indicate a particular generation of a key's value.  Updates to a key should
retain this id, but overwrites should get new ones.

2) User keys (where the indexes point) must contain extra metadata: a set of 
index references (index creation date / value id pairs).

The idea with #1 and #2 here is that indexes point to users and users point to 
indexes.  Notably, users point to specific index key *versions* thanks to the 
extra metadata.  This aids in conflict resolution (more on that later).
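
To make the shape of this concrete, here's a rough sketch in Python of what
the two kinds of keys carry (plain dicts with hypothetical field names, not
a real client API; in practice this would live in Riak object metadata):

    # Sketch of the metadata layout described above.  Field names are
    # hypothetical.
    import time
    import uuid

    def new_value_id():
        # A fresh id marks a new "generation" of a key's value.
        return str(uuid.uuid4())

    # Index key: bucket "email_index", key name = the email address.
    index_obj = {
        "created": time.time(),      # creation date (Riak only tracks last-modified)
        "value_id": new_value_id(),  # retained on updates, replaced on overwrites
        "user_key": "user/1234",     # link to the user data key
    }

    # User key: its own metadata plus references to the specific index
    # *versions* it claims.
    user_obj = {
        "created": time.time(),
        "value_id": new_value_id(),
        "index_refs": [
            # (index creation date, index value id) pairs
            (index_obj["created"], index_obj["value_id"]),
        ],
    }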

3) If you want to claim an index, then you first check to see if a key
already exists with that name.  If it does, then you can't claim it
("account already exists").  If a key with that name does not exist, then
you make the index key with a link pointing to the user data key.  Both the
index key and the user key need to have creation date and value id
metadata.  Additionally, the user key needs to add a reference to this
index to its set of references.  Of course, a conflict can occur if two
clients try to claim indexes simultaneously.  That's fine: you'll end up
with two values (siblings) at the same index with different creation dates
and value ids, and conflict resolution will deal with that.
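
A claim might then look roughly like this (hypothetical get/put helpers
standing in for whatever client you use; note that the check-then-create is
not atomic, which is exactly why siblings can appear):

    # Sketch of the claim procedure.  get(bucket, key) returns a value
    # or None; put(bucket, key, value) writes.  Both are hypothetical
    # stand-ins for a real Riak client.
    import time
    import uuid

    def claim_index(get, put, email, user_key):
        if get("email_index", email) is not None:
            raise ValueError("account already exists")
        idx = {
            "created": time.time(),
            "value_id": str(uuid.uuid4()),
            "user_key": user_key,
        }
        put("email_index", email, idx)
        # The user key must also record a reference to this exact index
        # version; a racing client may create siblings here, and the
        # read-time checks in step 4 sort that out.
        user = get("users", user_key)
        user["index_refs"].append((idx["created"], idx["value_id"]))
        put("users", user_key, user)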

4) When retrieving a user, you must confirm the relationship between the
index and the object.  Essentially, a valid pair is the earliest-dated
index sibling that points to a user key whose latest-dated index reference
points back to it.  Determining this validity requires retrieving both
keys, regardless of which one you start from, and in certain conflict cases
it may require crawling through to other keys.  If you cannot verify the
relationship between index and object, then you must treat the case as if
neither exists.
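
In code, the read-time check might look something like this (a sketch over
plain values, with an index key's siblings passed in as a list):

    # Sketch of the validity check.  index_siblings is the list of
    # sibling values at one index key; fetch_user is a hypothetical
    # lookup returning a user value or None.
    def resolve_index(index_siblings, fetch_user):
        # The earliest-created sibling wins the reservation...
        for idx in sorted(index_siblings, key=lambda s: s["created"]):
            user = fetch_user(idx["user_key"])
            if user is None:
                continue
            # ...but only if the user's *latest* index reference points
            # back at this exact index version.  (Certain conflict cases
            # may require crawling further, as noted above.)
            refs = sorted(user["index_refs"])  # (created, value_id) pairs
            if refs and refs[-1] == (idx["created"], idx["value_id"]):
                return idx, user
        return None  # treat as if neither the index nor the user exists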

What does this all mean?

You can create users, unique by some value, on a first-come basis.  If two
clients try to create accounts with the same name at the same time, then an
arbitrary one will win.  If two existing users try to change to the same
name, an arbitrary one will win, and the loser *will not lose his original
reservation*.  I mapped out some pretty awful race conditions.  For
example: three users (A, B, and C) exist; A and B suddenly try to change to
the same name at the same moment C tries to change to B's original name.
If B loses to A, then B is guaranteed to get his index back, and C's claim
becomes invalid.

Oh, and finally, whichever index key is valid becomes the authoritative
value of that data field.  For example, in my case I'm indexing by email.
This means that if I want to look up a user's email address based on the
direct user ID, I first look up the user object key directly, then
determine its valid index (which needs to be done anyway just to confirm
the user is valid at all), and then use *that* index key's name as the
user's email address.  You would *not* use some email field value in the
JSON blob.  In fact, it's better not to have your index field in the JSON
blob at all, since conflict resolution won't be able to fix it.
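
For instance, getting a user's email from the user ID could go like this
(sketch; resolve_index is the check from the previous sketch, and
find_index_names is a hypothetical helper, e.g. following a Riak link from
the user object back to its index key(s)):

    # Sketch: read the email via the index key name, never via a JSON
    # field.  get_siblings returns all sibling values at an index key;
    # it and find_index_names are hypothetical helpers.
    def email_for_user(get, get_siblings, find_index_names, user_key):
        for email in find_index_names(user_key):
            siblings = get_siblings("email_index", email)
            valid = resolve_index(siblings, lambda k: get("users", k))
            if valid is not None and valid[0]["user_key"] == user_key:
                return email  # the index key name *is* the email address
        return None  # no valid index: treat the user as nonexistent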

Oh, and one last thing.  There was concern about writing a user requiring
two writes, opening up the potential of leaving behind orphan keys in the
event of failure.  I've come to the conclusion that this is acceptable, and
you just need to sweep it up later.  The above-described scheme will work
fine even if you leave dirty keys all over the place and never resolve
siblings.  You'll never have half of a user or some dangling index or
anything; the validity checks at read time ensure this.  But some
periodically run task that cleans up your DB with MapReduce operations
would be smart.
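
A cleanup pass could be as simple as the sketch below (hypothetical helpers
again; in practice you'd drive this from a MapReduce job over the index
bucket rather than listing keys):

    # Sketch: collapse conflicted index keys down to their valid
    # sibling, and delete index keys that no valid user points back to.
    # list_keys, get_siblings, get, put, and delete are hypothetical.
    def sweep_indexes(list_keys, get_siblings, get, put, delete):
        for email in list_keys("email_index"):
            siblings = get_siblings("email_index", email)
            valid = resolve_index(siblings, lambda k: get("users", k))
            if valid is None:
                delete("email_index", email)         # orphaned index
            elif len(siblings) > 1:
                put("email_index", email, valid[0])  # resolve siblings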

Maybe there's a better way to do this, but I thought I'd share.

Justin

On Monday, November 07, 2011 06:19:48 PM Nate Lawson wrote:
> Ok, then 2I will work fine for what you're doing if you're going to keep
> adding fields like that. You can just use the binary type for the 2I keys
> (_bin).
> 
> http://basho.com/blog/technical/2011/09/14/Secondary-Indexes-in-Riak/
> 
> You first said you were concerned about the query performance due to the
> covering set access and that you only had 2 unique fields. That's why I
> suggested you create a manual index (either email or userid) to lookup the
> other key. But with this additional info you mention, I think you should
> just use 2I since that's one of its main purposes.
> 
> -Nate
> 
> On Nov 7, 2011, at 6:11 PM, Greg Pascale wrote:
> > Hi Nate,
> > 
> > There are only 2 secondary keys for now (in addition to the primary key),
> > but this number will grow to 5 or more pretty soon.
> > 
> > I think when you say "insert each separately", you mean create 2
> > duplicate objects, one keyed by username and one keyed by email. Or do
> > you mean create one object keyed by username, and then another object
> > containing the username and keyed by email (a manual index if you will)?
> > Code complexity is the main reason I'd like to avoid a solution like
> > this. Suddenly a user create operation requires n writes to be
> > considered a success. If one fails, I need to delete the others, etc… It
> > quickly becomes a pain.
> > 
> > I don't know what you mean by "some relationship between the keys"
> > 
> >> On Nov 7, 2011, at 5:45 PM, Greg Pascale wrote:
> >>> Hi,
> >>> 
> >>> I'm thinking about using 2i for a certain piece of my system, but I'm
> >>> worried that the document-based partitioning may make it suboptimal.
> >>> 
> >>> The issue is that the secondary fields I want to query over (email and
> >>> username) are unique, so each will only ever map to one value. Since
> >>> 2i queries a coverage set, but I'm only ever looking for one result,
> >>> it's going to be hitting n-1 machines needlessly.
> >>> 
> >>> So, what I'm looking to understand is how much overhead a single-result
> >>> 2i lookup like this will incur vs. a primary-key lookup, or even vs.
> >>> search. Search doesn't intuitively feel like the right tool here, but
> >>> I wonder if it may actually be preferable since it uses term-based
> >>> partitioning.
> >>> 
> >>> Thanks,
> >> 
> >> If it's only 2 keys, why not insert each separately? You will double
> >> your total number of keys in the db. But both search and 2I are
> >> creating extra keys anyway in their private indices, so it has the same
> >> or worse effect on total storage as doubling your primary keys. And
> >> query efficiency is worse, as you point out.
> >> 
> >> 2I and search are more useful where there's some relationship between
> >> the keys, not when they're fully independent as you point out.
> >> 
> >> -Nate