Kresten, all great questions. I've attempted to answer each of them inline. I focused on Search since that's what I know best; I'll leave clarifications on 2I to Rusty.
On Wed, Oct 26, 2011 at 11:42 AM, Kresten Krab Thorup <k...@trifork.com> wrote:

> Perhaps as follow-up FAQ-style questions, you could also answer these
> questions:
>
> 1. What is the typical execution profile of 2i vs solr queries? Do all
> queries go to all nodes in the cluster, or does it depend on the query? I
> imagine that range-queries need to go to all vnodes, whereas simple
> "prop=value" queries may be more directly to a node holding the given index.

First, I want to introduce the Riak vernacular "coverage" to mean an operation that must query a covering set of partitions. That is, to make sure the operation considers all keys, it must visit enough vnodes to cover the entire key space [1]. In the case of N=3 this would be 1/3 of the vnodes; 1/N in the general case. For example, the list-keys operation uses a coverage operation. This makes sense because in order to list all keys you must visit all keys, but no more (i.e. there is no need to visit every vnode if a subset will get the same job done).

In the case of 2I, all queries are executed via a coverage operation. This is because 2I uses document-based partitioning of its index, i.e. each partition holds a _piece_ of the index [2]. Search uses term-based partitioning, i.e. each partition holds the entire index for a set of terms. E.g. the index for the term "basho" is stored entirely on one partition (with replicas on other partitions, of course). Therefore, a query for a single term in Search will hit only one vnode (it uses R=1). If you perform a union or intersection of M terms you are looking at a maximum of M vnodes, but possibly fewer depending on how the terms hash and on some randomness in Search's coordinator.

If you apply this knowledge to range queries, it should be plain to see that they require visiting a covering set of partitions. This is because we're searching a range of terms, and each term in that range may be on a different partition.
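To make the single-term vs. range-query difference concrete, here is a small Python sketch of term-based partitioning. The RING_SIZE and N values and the partition_for helper are purely illustrative (Riak actually uses consistent hashing with SHA-1 onto a 160-bit ring; this just mods the digest), but the shape of the argument is the same:

```python
import hashlib

RING_SIZE = 64   # number of partitions (vnodes); hypothetical value
N = 3            # replication factor

def partition_for(term: str) -> int:
    # Illustrative hash of a term onto the ring. Riak really hashes with
    # SHA-1 onto a 160-bit ring; here we just mod the digest.
    digest = hashlib.sha1(term.encode()).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

# Term-based partitioning: a single-term query needs exactly one partition.
print("partition for 'basho':", partition_for("basho"))

# A union/intersection of M terms touches at most M partitions,
# fewer if some terms hash to the same partition.
terms = ["basho", "riak", "erlang"]
touched = {partition_for(t) for t in terms}
print(f"{len(terms)} terms touch {len(touched)} partition(s)")

# A range query may match any term, so it falls back to a coverage plan:
# with replicas on the next N-1 vnodes, every N-th vnode covers the ring.
covering_set = list(range(0, RING_SIZE, N))
print(f"coverage needs {len(covering_set)} of {RING_SIZE} vnodes (~1/{N})")
```

Note that 22 of 64 is slightly more than 1/3 because 64 is not divisible by 3; 1/N is the general proportion.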
> 2. What is the reliability N,R,W / replication / failover properties of
> these indexes? How do the two different index types reconcile after a
> netsplit / crash?

Search honors N just like KV: it makes N replicas of your index entries. R is currently hardcoded to 1 for performance reasons (although this might be worth revisiting). To be totally correct, at query time Search is really N=1 & R=1 in the sense that it picks just one partition to query for a given term. I.e. it's not like KV, where you send N requests and wait for R replies; it sends one request and waits for one reply. Furthermore, Search prefers local data, and if it can't get the data locally it will choose a partition at random to help distribute load. This has the consequence that replica loss can cause nondeterministic behavior. Without looking at the code I can't answer about W; I can't remember if it waits or just fires and forgets.

Search should handle node failures fine, i.e. when a node goes down for a while. However, once you start losing replicas, Search will start returning nondeterministic results. There is no form of anti-entropy (active or passive) for Search indexes. If you lose replicas you must reindex. Yes, this sucks. Yes, it should be made better. The only thing stopping it from getting fixed is time and resources. If this hurts you (you being anyone reading this) then please yell as much as necessary until we do something about it. If you feel adventurous enough to patch it yourself, send in a PR and I will be happy to look at it. Please note that you can get active anti-entropy for Search by using the replication support that comes with our enterprise version (but it requires two clusters).

> 3. What are the edge conditions if nodes crash in the middle of a put?
> I.e., how dependable are the indexes? Is the index updated before or after
> the "riak put", or as part of the "same transaction" somehow?
In Search, if you are indexing KV objects, the KV write and the index write are not atomic: one can succeed while the other fails. Once again, you either need to reindex or use the replication support that comes with our enterprise version.

> 4. What is the storage cost for indexes? Some of it go into the leveldb
> storage, and some go in merge indexes?

For 2I everything goes into leveldb, but that's an implementation detail. For Search, your objects go into whatever storage backend you pick, and your indexes _always_ go into merge_index. merge_index has costs in terms of memory and ETS tables (for those who don't know, ETS tables are a limited resource in Erlang, though the limit is configurable). merge_index keeps a buffer in memory (approximately equal to buffer_rollover_size) for _every_ partition. So if you have a ring_size of 128 on 6 nodes, you're looking at roughly (128/6) * <buffer_rollover_size> in memory usage per node. It also keeps file offsets in memory to reduce disk seeks at query time, but I'm not sure what their general footprint looks like [3].

There are cases where merge_index can become overwhelmed and eat up all available ETS tables, causing Search to fall over hard with system limit errors. This mainly has to do with its configured buffer_rollover_size: raising it increases memory usage but can save greatly on ETS usage.

-Ryan

[1]: If you really want to dig in, check out the coverage behavior:
https://github.com/basho/riak_core/blob/master/src/riak_core_coverage_fsm.erl

[2]: For more information on document- vs. term-based partitioning, see:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.62.1613
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.7337

[3]: Rusty did a pretty thorough writeup on merge_index in its README:
https://github.com/basho/merge_index
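P.S. The back-of-the-envelope merge_index buffer math from my answer to question 4 works out like this. The buffer_rollover_size value below is purely hypothetical; check your actual app.config setting:

```python
# One in-memory buffer of roughly buffer_rollover_size per vnode on a node,
# so buffer memory scales with ring_size / num_nodes.
ring_size = 128
num_nodes = 6
buffer_rollover_size = 1 * 1024 * 1024  # 1 MiB; hypothetical setting

vnodes_per_node = ring_size / num_nodes          # ~21.3 vnodes on each node
buffer_memory = vnodes_per_node * buffer_rollover_size
print(f"~{buffer_memory / 2**20:.1f} MiB of buffer memory per node")
```

So with these numbers each node spends roughly 21 MiB on merge_index buffers alone, and doubling buffer_rollover_size doubles that figure while reducing how often buffers roll over into ETS-backed segments.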
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com