Hey Eric. Awesome! Thanks so much for the feedback. I really appreciate the help.
I guess I'll have to test each method on our infrastructure to really know for
myself.

Cheers,
Geoff

Eric Redmond <eredm...@basho.com> wrote:

>Geoff, comments inline.
>
>On Nov 13, 2014, at 3:13 PM, Geoff Garbers <ge...@totalsend.com> wrote:
>
>>Hi all.
>>
>>I've been looking around for a bit for some sort of guidelines as to how best
>>to structure search indexes within Riak 2.0 - and have yet to come up with
>>anything that answers my questions.
>>
>>I came across https://github.com/basho/yokozuna/blob/develop/docs/ADMIN.md,
>>where it talks about the one-to-one and many-to-one ways of indexing. It
>>mentions in passing the potential for lower query latency and efficient
>>deletion of index data when using the one-to-one method - without really
>>saying much about when one method could significantly outperform the other.
>>
>>However, something I'm still not sure about is when it is considered a good
>>idea to use multiple indexes, versus one massive index.
>>
>>If you'll bear with me, I'll use this simple scenario:
>>
>>I have lists, and I have contacts within these lists. In total, I have 100
>>million contacts that I am dealing with. Each of them is no more than 20KB in
>>size, and they all follow the exact same JSON object structure. Ignoring
>>application design for simplicity's sake, let's say I could choose between the
>>following two ways of storing lists and contacts:
>>
>>Having two buckets: lists and contacts.
>>All 100 million contacts are stored in the contacts bucket. Each contact
>>object is linked to its corresponding list through a list_key property, and
>>all the contacts are stored in the same single search index.
>>
>>Having multiple buckets: lists, and for each list, having a separate bucket
>>contacts_{listkey}.
>>Using this structure, each contacts_{listkey} bucket would have its own search
>>index.
>>
>>With these two scenarios in mind, and making the assumption that we're dealing
>>with 100 million contacts:
>>
>>Which would be the better method of implementing the search indexes?
>
>If you have 100M contacts, giving each contacts bucket its own index might be
>fine, but note that indexes have their own overhead in both Solr and Riak
>cluster metadata. I wouldn't go this route if your contacts_{listkey} buckets
>number in the hundreds or thousands.
>
>>At which point would one solution be far better than the other?
>
>If your cluster has 100M objects, note that a Solr shard wouldn't have 100M
>objects. Instead, if you had, say, a 10-node cluster, then depending on your
>replication value, a single Solr node would have around 30M (for example,
>100M objects x 3 replicas / 10 nodes).
>
>>How much does Yokozuna differ from stock-standard Solr? All the search results
>>I could find on Solr specifically weren't talking about indexes greater than
>>60,000 objects, yet Riak is required to be able to deal with hundreds of
>>millions of rows.
>
>Solr can manage far more than 60k objects (I've run 10M on my laptop, 100M
>per shard is safe, and I hear the tip-top limit per shard is 2 billion unique
>terms per index segment due to Lucene's implementation). I think you'll have
>to experiment with your use-case and hardware, but you shouldn't have a
>problem.
>
>>Any help at all with this is really appreciated.
>>
>>At some point, I do realise that I will need to set this up for myself, and
>>perform my own tests on it. However, I was hoping that those currently
>>using Riak in production might have some more insight into this.
>>
>>Regards,
>>
>>Geoff
>>
>>_______________________________________________
>>riak-users mailing list
>>riak-users@lists.basho.com
>>http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
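
For reference, the per-node figure in Eric's reply can be reproduced with a rough back-of-the-envelope calculation. This is only a sketch: it assumes replicas are spread evenly across nodes and that each replica is indexed by the Solr instance on the node that holds it. The function name docs_per_solr_node and the n_val of 3 are illustrative assumptions, not something stated in the thread.

```python
def docs_per_solr_node(total_objects: int, n_val: int, nodes: int) -> float:
    """Estimate the average number of documents held by each node's Solr instance.

    Assumes an even spread of replicas across the cluster (an idealisation;
    real clusters will vary somewhat from node to node).
    """
    if total_objects < 0 or n_val < 1 or nodes < 1:
        raise ValueError("total_objects must be >= 0, n_val and nodes must be >= 1")
    # Every object is stored n_val times, so the cluster indexes
    # total_objects * n_val documents in total, shared between the nodes.
    return total_objects * n_val / nodes


if __name__ == "__main__":
    # Figures from the thread: 100M contacts on a 10-node cluster.
    # n_val=3 is an assumption that matches the 30M figure quoted.
    estimate = docs_per_solr_node(100_000_000, n_val=3, nodes=10)
    print(f"{estimate:,.0f} documents per Solr node")
```

Changing nodes or n_val gives a quick feel for how far a single node sits from the per-shard figures mentioned above, though only testing on real hardware will show how queries actually behave.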