Geoff, comments inline.

On Nov 13, 2014, at 3:13 PM, Geoff Garbers <ge...@totalsend.com> wrote:

> Hi all.
> 
> I've been looking around for a bit with some sort of guidelines as to how 
> best to structure search indexes within Riak 2.0 - and have yet to come up 
> with anything that satisfies my questions.
> 
> I came across https://github.com/basho/yokozuna/blob/develop/docs/ADMIN.md, 
> where it talks about the one-to-one and many-to-one ways of indexing. It 
> mentions in passing the potential for lower latency of queries and efficient 
> deletion of index data when using the one-to-one method - without really 
> mentioning too much about when one method could significantly outweigh the 
> other in performance.
> 
> However, something I'm still not sure on is when is it considered a good idea 
> to use multiple indexes, versus one massive index.
> 
> If you'll bear with me, I'll use this simple scenario:
> I have lists, and I have contacts within these lists. In total, I have 100 
> million contacts that I am dealing with. Each of them not more than 20KB in 
> size, and they all follow the exact same JSON object structure. Ignoring 
> application design for simplicity's sake, let's say I could choose between 
> the following two ways of storing lists and contacts:
> 
> Having two buckets: lists and contacts.
> All 100 million contacts are stored in the contacts bucket. Each contact 
> object is linked to its corresponding list through a list_key property, and 
> all the contacts are stored in the same single search index.
> 
> Having multiple buckets: lists, and for each list, having a separate bucket 
> contacts_{listkey}.
> Using this structure, each contact_{listkey} bucket would have its own search 
> index.
> With these two scenarios in mind; and making the assumption that we're 
> dealing with 100 million contacts:
> Which would be the better method of implementing the search indexes?
If you have 100M contacts, giving each list its own index might be fine, but 
note that indexes carry their own overhead in both Solr and Riak cluster 
metadata. I wouldn't go this route if the number of contact_{listkey} indexes 
measures in the hundreds or thousands.
> At which point would one solution be far better than the other?
If your cluster has 100M objects, note that a single Solr shard wouldn't hold 
all 100M. Instead, on a, say, 10-node cluster, depending on your replication 
value, a single Solr node would hold around 30M (with the default n_val of 3, 
that's 100M objects x 3 replicas / 10 nodes).
> How much does Yokozuna differ from stock-standard Solr? All the search 
> results I could find on Solr specifically weren't talking about indexes 
> greater than 60,000 objects, yet Riak is required to be able to deal with 
> 100's of millions of rows.

Solr can manage far more than 60k objects (I've run 10M on my laptop, 100M 
per shard is safe, and I hear the tip-top limit per shard is 2 billion unique 
terms per index segment due to Lucene's implementation). I think you'll have to 
experiment with your use case and hardware, but you shouldn't have a problem.

> Any help at all with this is really appreciated.
> At some point, I do realise that I will need to set this up for myself, and 
> perform my own tests on it. However, I was hoping that those currently 
> using Riak in production might have some more insight into this.
> 
> Regards,
> Geoff
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

