Hi Alexander - thanks again for your inputs here. I believe the problem was the sloppy quorum kicking in when I brought down one node. A fallback node would become active and, from what I understand of the documentation, since the fallback did not yet have the key and could respond faster than the primaries, my fetches randomly returned nothing until all the read repairs had completed.
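For anyone finding this thread later, here is a minimal sketch in Python of the back-off retry pattern Alexander suggested (fail once at the quorum read, then retry at r=1). The `fetch` callable is a hypothetical stand-in for a Riak client get such as the official client's `bucket.get(key, r=...)`; the r values, retry count, and delays are illustrative assumptions, not prescribed settings.

```python
import time

class NotFound(Exception):
    """Raised when a read returns no object (e.g. a fallback node answered first)."""

def get_with_backoff(fetch, key, retries=3, base_delay=0.05):
    """Try a quorum read first; on a miss, back off and retry at r=1.

    `fetch(key, r)` is a hypothetical client call that returns the value or
    raises NotFound. Retrying at r=1 accepts the first replica that answers,
    which gives read repair time to restore copies on surviving nodes.
    """
    try:
        return fetch(key, r=2)  # initial read at a quorum-like r value
    except NotFound:
        pass
    for attempt in range(retries):
        time.sleep(base_delay * (2 ** attempt))  # exponential back-off
        try:
            return fetch(key, r=1)  # relaxed read: first available copy wins
        except NotFound:
            continue
    raise NotFound(key)
```

The same shape works with any client library: the pattern is independent of the transport, only the `fetch` implementation changes.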
Understand this much better now - many thanks once again.

On Tue, May 24, 2016 at 4:24 PM, Alexander Sicular <sicul...@gmail.com> wrote:
> Hi Vikram,
>
> If you're using the defaults, two of the copies may be on the same
> machine. When using the default values (ring_size=64, n_val=3) you are
> not guaranteed copies on distinct physical machines. Implement a
> back-off retry design pattern: i.e., fail once, then try again with r=1.
> Also, a read will trigger a read repair operation, which will then copy
> your data n_val times to the surviving members of the cluster.
>
> Have you tried that?
> -Alexander
>
> Read these blog posts for more info:
>
> http://basho.com/posts/technical/understanding-riaks-configurable-behaviors-part-1/
> http://basho.com/posts/technical/riaks-config-behaviors-part-2/
> http://basho.com/posts/technical/riaks-config-behaviors-part-3/
> http://basho.com/posts/technical/riaks-config-behaviors-part-4/
>
> On Tue, May 24, 2016 at 3:08 PM, Vikram Lalit <vikramla...@gmail.com> wrote:
> > It's returning no object at all for the relevant key. That too is
> > random - every few calls it returns, but then it doesn't.
> >
> > On May 24, 2016 4:06 PM, "Sargun Dhillon" <sar...@sargun.me> wrote:
> >> What do you mean it's not returning? Is it returning stale data, or
> >> is it erroring?
> >>
> >> On Tue, May 24, 2016 at 7:34 AM, Vikram Lalit <vikramla...@gmail.com> wrote:
> >> > Hi - I'd appreciate it if someone could opine on the below behavior
> >> > of Riak that I am observing... is it expected, or is something wrong
> >> > in my set-up / understanding?
> >> >
> >> > To summarize, I have a 3-node Riak cluster (separate AWS EC2
> >> > instances) with a separate chat server connecting to them. When I
> >> > write data to the Riak nodes, the writes succeed and I can read all
> >> > the data back correctly. However, as part of my testing, if I
> >> > deliberately bring down one node (and then remove it from the
> >> > cluster using riak-admin cluster force-remove / plan / commit), the
> >> > client API is not able to fetch all the written data. In fact,
> >> > success and failure alternate rather randomly.
> >> >
> >> > My initial suspicion was that this would happen only while the
> >> > rebalancing was in progress (i.e. while riak-admin ring-status was
> >> > not fully settled), but I've seen the sporadic behavior after it
> >> > settled too.
> >> >
> >> > Does this have to do with the n and r values for the cluster -
> >> > given that 1 node is down, does the cluster fail to return results
> >> > reliably? Also, does this mean that while a cluster is being
> >> > rebalanced (even including the addition of new nodes), results can
> >> > be arbitrary? That doesn't sound right to me.
> >> >
> >> > I'd appreciate it if someone could throw some light here. Also, I
> >> > couldn't locate the HTTP API calls to retrieve and set the
> >> > n / r / w values for a specific bucket!
> >> >
> >> > Thanks much!
> >> > Vikram
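On the last question in the quoted thread: bucket properties such as n_val, r, and w are exposed over Riak's HTTP interface at /buckets/<bucket>/props. The host, port, bucket name, and property values below are placeholders; adjust to your cluster.

```shell
# Read the current properties (n_val, r, w, etc.) for a bucket -- returns JSON
curl http://127.0.0.1:8098/buckets/mybucket/props

# Set properties; only the keys you include in the body are changed
curl -XPUT http://127.0.0.1:8098/buckets/mybucket/props \
     -H 'Content-Type: application/json' \
     -d '{"props": {"n_val": 3, "r": 2}}'
```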
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com