Kresten, all great questions.  I've attempted to give an answer to all
inline.  I focused on Search since that's what I know best.  I'll leave 2I
clarifications to Rusty.

On Wed, Oct 26, 2011 at 11:42 AM, Kresten Krab Thorup <k...@trifork.com>wrote:

> Perhaps as follow-up FAQ-style questions, you could also answer these
> questions:
>
> 1. What is the typical execution profile of 2i vs solr queries?  Do all
> queries go to all nodes in the cluster, or does it depend on the query?  I
> imagine that range-queries need to go to all vnodes, whereas simple
> "prop=value" queries may be more directly to a node holding the given index.
>

First, I want to introduce Riak vernacular "coverage" to mean an operation
that must query a covering set of partitions.  That is, to make sure the
operation considers all keys it must visit enough vnodes to cover the entire
key space [1].  In the case of N=3 this would be 1/3 of the vnodes, 1/N in
the general case.  For example, the list keys operation uses a coverage
operation.  This makes sense because in order to list all keys you must
visit all keys but no more (i.e. no need to visit all nodes if a portion of
them will get the same job done).

In the case of 2I all queries are executed via a coverage operation.  This
is because 2I uses document-based partitioning of it's index, i.e. each
partition holds a _piece_ of the index [2].

Search uses term-based partitioning, i.e. each partition holds the entire
index for a set of terms.  E.g. the index for term "basho" is stored
entirely on one partition (with replicas on other partitions of course).
 Therefore, a query for one term in Search will only hit one vnode (it uses
R=1).  If you perform union or intersection with M terms you are looking at
max of M vnodes but maybe less than M depending on how they hash and some
randomness in Search's coordinator.  If you apply this knowledge to range
queries then it should be plain to see that it requires visiting a covering
set of partitions.  This is because we're searching a range of terms and
each term in that range may be on a different partition.


>
> 2. What is the reliability N,R,W / replication / failover properties of
> these indexes?  How do the two different index types reconcile after a
> netsplit / crash?
>

Search honors N just like KV.  It will make make N replicas of your index
entries.

R is hardcoded to 1 currently for performance reasons (although this might
be worth revisiting).  To be totally correct, during query time Search is
really N=1 & R=1 in the sense that it picks just one partition to query for
a given term.  I.e. it's not like KV where you send N requests and wait for
R replies.  It sends 1 request and waits for 1 reply.  Furthermore, Search
prefers local data and if it can't get it locally it will choose a partition
at random to help distributed load.  This has the consequence that replica
loss can cause nondeterministic behavior.

Without looking at the code I can't answer about W.  I can't remember if it
waits or just does fire and forget.

Search should handle node failures fine, i.e. when a node goes down for a
while.  However, when you start loosing replicas that's when Search will
start returning nondeterministic results.  There is no form of anti-entropy
(active or passive) for Search indexes.  If you lose replicas you must
reindex.  Yes, this sucks.  Yes, it should be made better.  The only thing
stopping it from getting fixed is time and resources.  If this hurts you
(you being anyone reading this) then please yell as much as necessary until
we do something about it.  If you feel adventurous enough to patch it
yourself then send in a PR and I will be happy to look at it.  Please note
that you can get active anti-entropy for search by using
our replication support that comes with the enterprise version (but it
requires two clusters).


>
> 3. What are the edge conditions if nodes crash in the middle of a put?
> I.e., how dependable are the indexes?  Is the index updated before or after
> the "riak put", or as part of the "same transaction" somehow?
>

In search if you are indexing KV objects then the KV write and index write
are not atomic.  One can succeed while the other fails.  Once again
reindexing needs to occur or use replication support that comes with our
enterprise version.


>
> 4. What is the storage cost for indexes?  Some of it go into the leveldb
> storage, and some go in merge indexes?
>

For 2I everything goes in leveldb, but that's an implementation detail.

For Search your objects go into whatever storage solution you pick and your
indexes _always_ go into merge_index.  There are costs in terms of memory
and ETS tables (for those that don't know what ETS tables are it's a limit
resource in Erlang, but the limitedness of it is configurable) for
merge_index.  merge_index stores a buffer in memory (approximately equal to
bufer_rollover_size) for _every_ partition.  So if you have a ring_size of
128 with 6 nodes then you're looking at ~ (128/6)*<buffer_rollover_size> in
terms of memory usage.  It also keeps file offsets in memory to reduce disk
seek during query time but I'm not sure what their general footprint looks
like [3].

There are cases where merge_index can become overwhelmed and eat up all
available ETS tables causing Search to fall hard with system limit errors.
 This mainly has to do with it's configured buffer_rollover_size.  Raising
that increases memory usage but can save greatly in ETS usage.

-Ryan

[1]: If you really want to dig in checkout the coverage behavior:
https://github.com/basho/riak_core/blob/master/src/riak_core_coverage_fsm.erl


[2]: For more information on document vs. term based partitioning see:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.62.1613
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.7337

[3]: Rusty did a pretty thorough writeup on merge_index in it's README,
https://github.com/basho/merge_index
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Reply via email to