Correct, there is a key in LevelDB for each Riak key that has the index term attached. This is somewhat mitigated by Snappy compression (600K records might very well compress into a single block), but it is nowhere near the storage efficiency of something like Solr's indexes. It still has to scan.
On Fri, Aug 29, 2014 at 11:01 AM, Bryan <br...@go-factory.net> wrote: > Hi Sean, > > Sweet! Thanks for the explanation. Much appreciated and very helpful. Just a > bit more clarification, on an equality lookup, where the ‘foobar’ key has a > value ‘barfoo’ that is the very low-cardinality, are those indexed objects > individually stored as a key/value term which then is enumerated over, versus > a key/list_of_values term? In other words, if I have 600K records where > foobar=barfoo, will this be 600K reads, or 1 read with a list of 600K entries > which are then enumerated over in memory? > > Cheers, > Bryan > ---- > > Bryan Hughes > Go Factory > > http://www.go-factory.net > > On Aug 29, 2014, at 6:46 AM, Sean Cribbs <s...@basho.com> wrote: > >> I made a minor mistake in my example, the PrimaryKey is part of the >> index key, whereas the value contains nothing. It's more like this: >> >> {i, IndexName, IndexTerm, PrimaryKey} => <<>> >> >> So for the initial seek, we construct a key like so: >> >> {i, <<"foobar_bin">>, <<"baz">>, <<>>} >> >> On Fri, Aug 29, 2014 at 8:44 AM, Sean Cribbs <s...@basho.com> wrote: >>> Hi Bryan, >>> >>> Index entries are just keys in LevelDB like normal values are. So, >>> performance is relatively constant at write time but is O(N) at read >>> (because you are scanning the index). The high-cardinality term will >>> definitely be expensive to enumerate, but the low-cardinality terms >>> will be much less so. The index entries in LevelDB look roughly like >>> this: >>> >>> {i, IndexName, IndexTerm} => PrimaryKey >>> >>> Since the keys are encoded with sext, we encode the start key of the >>> range (or the equality query), start an iterator and then seek to that >>> start key and start reading values. When a key is reached that exceeds >>> the range or equality query, we stop iterating. >>> >>> Of course, that is an oversimplification, there are issues with >>> backpressure from the request coordinator process against the >>> iterators, streaming merge sort if you want the results back in order >>> or paginated, etc. Also, all 2I queries are run via coverage, which >>> ensures that the entire keyspace is "covered" but hits most if not all >>> nodes in the cluster. >>> >>> On Thu, Aug 28, 2014 at 9:18 PM, Bryan <br...@go-factory.net> wrote: >>>> Hi Everyone, >>>> >>>> Apologies as this has probably been asked before. Unfortunately I have not >>>> been able to parse through the list serve to find a reasonable answer and >>>> the Basho wiki docs seem to be missing this information. I have read up on >>>> the secondary index docs. >>>> >>>> I am interested to better understand how the secondary indexes perform when >>>> there is a very low distribution of values that are indexed. For example, >>>> lets say I have a bucket with 1 million objects that I create a secondary >>>> index on. Now lets say the index is on a value that has an uneven >>>> distribution where one of the values is not selective while the others are, >>>> such that 60% of the values fall into a single indexed value, while the >>>> remaining 40% have a good distribution. >>>> >>>> For example, I have a record (i.e. object) where the indexed field is >>>> ‘foobar_bin'. I have 1 million objects in the bucket that have 100 unique >>>> ‘foobar’ values distributed over the 1 million objects. One of the values >>>> repeats for 60% of the records (600K) and the rest have an even >>>> distribution >>>> of about 4%. >>>> >>>> How will the secondary indexes perform with this and is this an appropriate >>>> use of the secondary indexes? Finally, what I have read is not completely >>>> clear on what happens if the indexed value is updated when the value has >>>> such a low degree of selectivity? >>>> >>>> We have less than 512 partitions and are using the erlang client. >>>> >>>> Thanks in advance - any insights will be much appreciated! >>>> >>>> >>>> Cheers, >>>> Bryan >>>> >>>> ---- >>>> >>>> Bryan Hughes >>>> Go Factory >>>> http://www.go-factory.net >>>> >>>> _______________________________________________ >>>> riak-users mailing list >>>> riak-users@lists.basho.com >>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com >>>> >>> >>> >>> >>> -- >>> Sean Cribbs <s...@basho.com> >>> Software Engineer >>> Basho Technologies, Inc. >>> http://basho.com/ >> >> >> >> -- >> Sean Cribbs <s...@basho.com> >> Software Engineer >> Basho Technologies, Inc. >> http://basho.com/ > -- Sean Cribbs <s...@basho.com> Software Engineer Basho Technologies, Inc. http://basho.com/ _______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com