On May 24, 2012, at 11:43 AM, Steve Warren wrote:

> Thanks Reid, that's a very clear explanation. Are there any logs created under the cited circumstances? I'm not seeing any errors or logs that indicate any of the below conditions are in fact happening and would like to confirm the exact condition.

One thing you can do is look at riak-admin transfers while the _deletes_ are happening, as this will show you whether any fallbacks are being used. If the partitions are very short-lived, it's possible that you won't catch them here in time, though. The leveldb logs might also prove useful, as we've seen times when leveldb gets backed up during compaction events, which could also cause the async GET to time out.
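For example, something along these lines can be left running while the delete job executes (just a sketch; it assumes riak-admin is on the PATH of whichever node you run it from):

    # Poll transfer/handoff status once a second while the deletes run;
    # any fallback partitions in use should show up in this output.
    while true; do
        date
        riak-admin transfers
        sleep 1
    done

If watch(1) is available, "watch -n 1 riak-admin transfers" does the same thing.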
> I'm also not clear why doing this "PR = PW = R = W = all" would eliminate the issue if it is about the reap failing. Isn't reaping always independent of the success of the operation? In other words, isn't creating a successful tombstone the definition of a successful delete? Or does using "all" bypass the tombstone process altogether?

Yes, I didn't mean to imply that these settings would eliminate the issue, just potentially lessen its likelihood.

> I will test out this workaround; it is probably worth doing to ensure the indices are always correct rather than pepper the code to handle not-found conditions. It's still early in my evaluation, though, so I may find a reason to pepper the code anyway. :)

> Finally, is there a process to create an issue that will track the eventual resolution? If I do the workaround, I'd like to take it out when it is finally resolved (if there is a final resolution).

I'll be filing a bug in the riak_kv repo, but feel free to do so before me, and I'll add my notes there.

> Regards
> Steve

> On Thu, May 24, 2012 at 8:27 AM, Reid Draper <reiddra...@gmail.com> wrote:

> I have a pretty good idea what is causing this problem.

> Riak uses "tombstone" values to denote that an object has been deleted. Under normal conditions, this tombstone value (really, the key/value pair) will be deleted three (3) seconds after the delete. The delete_mode config lets you change the time from three seconds, or set it to immediate or keep.

> Regardless of the value of delete_mode, the key will continue to show up in list-keys and 2i $key queries as long as the tombstone is still around. This is because those calls simply iterate through the keys and don't inspect the values for tombstones (which could potentially be quite costly, depending on the backend).

> For a more detailed explanation of deletes in Riak, I highly suggest you read Jon Meredith's ML post [1].

> Since in both of these cases you have delete_mode set to either 3s or immediate, we are seeing a case where the first attempt to reap the tombstone fails. The reap isn't attempted again until you do a GET on the object, which is why the key no longer appears in 2i and list-keys queries afterward: a read-repair-like mechanism runs and sees that the tombstone needs to be reaped.

> So why is the tombstone reap failing the first time? There could be several reasons. It's important to know, first, that the reaping process requires all N _primary_ replicas to be up and responding. Here are some potential reasons:

> 1. One of the primaries is temporarily unreachable.
> 2. The original tombstone writes didn't go to all N primaries, for any reason.
> 3. The async GET that starts the tombstone reaping afterward times out, for any reason.

> I don't have a silver-bullet recommendation for this problem at the moment. If you'd like to favor having deleted keys _not_ show up in 2i/list-keys requests over delete availability, you can make your delete requests with PR = PW = R = W = all (note that 'all' is equivalent to whatever your bucket's N value is).

> We'll also be exploring how we can fix or mitigate this situation for an upcoming release.

> [1]: http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-October/006048.html

> Thanks,
> Reid Draper
> Software Engineer
> Basho
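For illustration, the PR = PW = R = W = all workaround quoted above looks roughly like this over the HTTP interface (only a sketch: the bucket and key names are hypothetical, and it assumes the HTTP listener on localhost:8098 and a Riak version that accepts per-request quorum query parameters on DELETE):

    # Hypothetical example: require every primary replica for both the read
    # and write phases of the delete ('all' resolves to the bucket's n_val).
    curl -v -X DELETE \
      'http://localhost:8098/buckets/mybucket/keys/somekey?r=all&w=all&pr=all&pw=all'

As noted in the explanation above, a plain GET on any key that still shows up afterward will also trigger the read-repair-style reap and make it disappear from subsequent 2i/list-keys results.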
> On May 23, 2012, at 4:42 PM, Steve Warren wrote:

>> I have a 5 node cluster and, given a successful delete call, I expect to get the latest data back given the bucket properties (as shown below)...

>> Bucket properties:

>> {"props":{"name":"mybucket","allow_mult":false,"basic_quorum":false,"big_vclock":50,"chash_keyfun":{"mod":"riak_core_util","fun":"chash_std_keyfun"},"dw":"quorum","last_write_wins":false,"linkfun":{"mod":"riak_kv_wm_link_walker","fun":"mapreduce_linkfun"},"n_val":4,"notfound_ok":true,"old_vclock":86400,"postcommit":[],"pr":"quorum","precommit":[],"pw":"quorum","r":"quorum","rw":"quorum","small_vclock":50,"w":"quorum","young_vclock":20}}

>> Is my understanding not correct here? (The important properties to me are the pw/pr settings, to ensure good distribution and consistency.)

>> Regards
>> Steve

>> On Wed, May 23, 2012 at 1:31 PM, Shuhao Wu <ad...@thekks.net> wrote:

>> Riak is eventually consistent. A delete doesn't show up immediately. There is an option like delete_immediate.

>> Shuhao

>> On May 23, 2012 4:08 PM, "Steve Warren" <swar...@myvest.com> wrote:

>> I'm seeing this pretty consistently and have no explanation for it. I delete a large number of keys (20k to 100k), but when I then search on the keys ($key/0/g), anywhere from 0 to 200 or so of the deleted keys show up in the results. It doesn't matter how long I wait after completing the deletion step; the keys stay in the list until I try to access the object, and then it goes away. I'm using 1.1.2 and the riak-java client, and getting no errors on the deletion step.

>> On Tue, May 22, 2012 at 9:34 AM, Steve Warren <swar...@myvest.com> wrote:

>> Thank you for the reply. My observation does not quite match up with this, though, so I'm still a bit confused. The deleted keys appeared to stay long past the 3 seconds described in the post you referenced. In fact, I don't know if they ever "went away". I'll run some more tests to see if I can narrow down the exact behavior; for example, not all key deletions exhibited this behavior (the test I ran resulted in 118 residual keys out of around 20K deletes). If I directly queried any of the keys, it would respond with "not found" and immediately stop showing up in the key list or $key index query.

>> I'm still running a bunch of tests just to learn the behavior of the system, so I'll keep plugging away at it. For example, I'm observing that $key index queries halt inserts into the same bucket while the query is running; I don't know yet whether this halts all server activity or just the inserts for that bucket.

>> Regards
>> Steve

>> On Tue, May 22, 2012 at 8:58 AM, Kelly McLaughlin <ke...@basho.com> wrote:

>> Hi Steve. There is no caching of key lists in Riak. What you are seeing is likely the fact that listing keys or running index queries can pick up deleted keys, because Riak keeps tombstone markers around for deleted objects for some period. For a really good explanation of Riak's delete behavior, check out this writeup by Jon Meredith: http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-October/006048.html. You can set delete_mode to immediate as described in that post, and you will most likely not see any deleted keys when you do an index query or key list.
>> The tradeoff is that you may see the unexpected behavior with concurrent updates to the same set of keys that the delete_mode changes were designed to address, as Jon also indicates in that post. We are considering different options on this front, but at this time no actual changes have been made to address this.

>> Kelly

>> On May 20, 2012, at 10:13 AM, Steve Warren wrote:

>>> The last message I saw on this (from a year ago) says the caching of key lists will be removed. I just ran into it while running a $key index range search. I then ran a ?key=stream search on the bucket and the same stale key list appeared (I had created a bunch of data and then deleted it as a test). Did the caching removal not happen? I'm running 1.1.2.

>>> The query:

>>> curl 'localhost:8098/buckets/testbucket/index/$key/0/g'

>>> As others have noted, this behavior is quite disconcerting, and I don't want to pepper the application with otherwise unnecessary checks for stale keys, even on 2i range queries. Or is that unavoidable?

>>> Regards
>>> Steve
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com