> Would be good to know the riak version

Riak 2.1.1
Riak CS 2.1.0
Stanchion 2.1.0

> why the dvv_enabled bucket property is set to false, please?

Looks like that's the default <http://docs.basho.com/riak/kv/2.2.0/learn/concepts/buckets/#dvv-enabled>. I haven't changed it.

> Also, is there multi-datacentre replication involved?

no

> Do you re-use your keys, for example, have the keys in question been created, deleted, and then re-created?

no

Thank you for the prompt follow-up.

Daniel

On Mon, Mar 6, 2017 at 10:38 AM, Russell Brown <russell.br...@icloud.com> wrote:
> Hi,
> Would be good to know the riak version, and why the dvv_enabled bucket property is set to false, please? Also, is there multi-datacentre replication involved? Do you re-use your keys, for example, have the keys in question been created, deleted, and then re-created?
>
> Cheers
>
> Russell
>
> On 6 Mar 2017, at 15:07, Daniel Miller <dmil...@dimagi.com> wrote:
>
> > I recently had another case of a disappearing object. This time the object was successfully PUT, and (unlike the previous cases reported in this thread) for a period of time GETs were also successful. Then GETs started 404ing for no apparent reason. There are no errors in the logs to indicate that anything unusual happened. This is quite disconcerting. Is it normal that Riak CS just loses track of objects? At this point we are using CS as primary object storage, meaning we do not have the data stored in another database, so it's critical that the data is not randomly lost.
> >
> > In the CS access logs I see:
> >
> > # all prior GET requests for this object succeeded, like this one. This is the last successful GET request:
> > [28/Feb/2017:14:42:35 +0000] "GET /buckets/blobdb/objects/commcarehq__apps%2F3d2b... HTTP/1.0" 200 14923 "" "Boto3/1.4.0 Python/2.7.6 Linux/3.13.0-86-generic Botocore/1.4.53 Resource"
> > ...
> > # all GET requests for this object are now failing like this one (the first 404):
> > [02/Mar/2017:08:36:11 +0000] "GET /buckets/blobdb/objects/commcarehq__apps%2F3d2b... HTTP/1.0" 404 240 "" "Boto3/1.4.0 Python/2.7.6 Linux/3.13.0-86-generic Botocore/1.4.53 Resource"
> >
> > The object name has been elided for readability. I do not know when this object was PUT into the cluster because I only have logs for the past month. Is there any way to dig further into Riak or Riak CS data to determine whether the object content is actually completely lost, or whether there are any other details that might explain why it is now missing? Could I increase some logging parameters to get more information about what is going wrong when something like this happens?
> >
> > I have searched the logs for other 404 responses but found none (other than the two reported earlier), so this is the 3rd known missing object in the cluster. We retain logs for one month only (I'm increasing this now because of this issue), so it is possible that other objects have also gone missing, but I cannot see them since the logs have been truncated.
> >
> > The cluster now has 7 nodes instead of 9 (see earlier emails in this thread), and the riak storage backend is now leveldb instead of multi. I have attached config file templates for riak, riak-cs and stanchion (these are deployed with ansible).
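> >
> > The bucket properties below can be re-read at any time from Riak's HTTP interface; here is a minimal sketch of that check (assuming the default HTTP listener on 127.0.0.1:8098, and Python 2.7 to match the clients in the access logs above):
> >
> > import json
> > import urllib2  # Python 2.7
> >
> > # Assumption: Riak's HTTP listener is on the default host/port; adjust as needed.
> > resp = urllib2.urlopen("http://127.0.0.1:8098/buckets/blobdb/props")
> > props = json.load(resp)["props"]
> > print json.dumps(props, indent=2, sort_keys=True)
> > print "dvv_enabled:", props["dvv_enabled"]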
> >
> > Bucket properties:
> >
> > {
> >   "props": {
> >     "notfound_ok": true,
> >     "n_val": 3,
> >     "last_write_wins": false,
> >     "allow_mult": true,
> >     "dvv_enabled": false,
> >     "name": "blobdb",
> >     "r": "quorum",
> >     "precommit": [],
> >     "old_vclock": 86400,
> >     "dw": "quorum",
> >     "rw": "quorum",
> >     "small_vclock": 50,
> >     "write_once": false,
> >     "basic_quorum": false,
> >     "big_vclock": 50,
> >     "chash_keyfun": {
> >       "fun": "chash_std_keyfun",
> >       "mod": "riak_core_util"
> >     },
> >     "postcommit": [],
> >     "pw": 0,
> >     "w": "quorum",
> >     "young_vclock": 20,
> >     "pr": 0,
> >     "linkfun": {
> >       "fun": "mapreduce_linkfun",
> >       "mod": "riak_kv_wm_link_walker"
> >     }
> >   }
> > }
> >
> > I'll be happy to provide more context to help troubleshoot this issue.
> >
> > Thanks in advance for any help you can provide.
> >
> > Daniel
> >
> >
> > On Tue, Feb 14, 2017 at 11:52 AM, Daniel Miller <dmil...@dimagi.com> wrote:
> > Hi Luke,
> >
> > Sorry for the late response and thanks for following up. I haven't seen it happen since. At this point I'm going to wait and see if it happens again and hopefully get more details about what might be causing it.
> >
> > Daniel
> >
> > On Thu, Feb 9, 2017 at 1:02 PM, Luke Bakken <lbak...@basho.com> wrote:
> > Hi Daniel -
> >
> > I don't have any ideas at this point. Has this scenario happened again?
> >
> > --
> > Luke Bakken
> > Engineer
> > lbak...@basho.com
> >
> >
> > On Wed, Jan 25, 2017 at 2:11 PM, Daniel Miller <dmil...@dimagi.com> wrote:
> > > Thanks for the quick response, Luke.
> > >
> > > There is nothing unusual about the keys. The format is a name + UUID + some other random URL-encoded characters, like most other keys in our cluster.
> > >
> > > There are no errors near the time of the incident in any of the logs (the last [error] is from over a month before).
> > > I see lots of messages like this in console.log:
> > >
> > > /var/log/riak/console.log
> > > 2017-01-20 15:38:10.184 [info] <0.22902.1193>@riak_kv_exchange_fsm:key_exchange:263 Repaired 2 keys during active anti-entropy exchange of {776422744832042175295707567380525354192214163456,3} between {776422744832042175295707567380525354192214163456,'riak-fa...@fake3.fake.com'} and {822094670998632891489572718402909198556462055424,'riak-fa...@fake9.fake.com'}
> > > 2017-01-20 15:40:39.640 [info] <0.21789.1193>@riak_kv_exchange_fsm:key_exchange:263 Repaired 1 keys during active anti-entropy exchange of {936274486415109681974235595958868809467081785344,3} between {959110449498405040071168171470060731649205731328,'riak-fa...@fake3.fake.com'} and {981946412581700398168100746981252653831329677312,'riak-fa...@fake5.fake.com'}
> > > 2017-01-20 15:46:40.918 [info] <0.13986.1193>@riak_kv_exchange_fsm:key_exchange:263 Repaired 2 keys during active anti-entropy exchange of {662242929415565384811044689824565743281594433536,3} between {685078892498860742907977265335757665463718379520,'riak-fa...@fake3.fake.com'} and {707914855582156101004909840846949587645842325504,'riak-fa...@fake6.fake.com'}
> > > 2017-01-20 15:48:25.597 [info] <0.29943.1193>@riak_kv_exchange_fsm:key_exchange:263 Repaired 2 keys during active anti-entropy exchange of {776422744832042175295707567380525354192214163456,3} between {776422744832042175295707567380525354192214163456,'riak-fa...@fake3.fake.com'} and {799258707915337533392640142891717276374338109440,'riak-fa...@fake0.fake.com'}
> > >
> > > Thanks!
> > > Daniel
> > >
> > >
> > > On Wed, Jan 25, 2017 at 9:45 AM, Luke Bakken <lbak...@basho.com> wrote:
> > >>
> > >> Hi Daniel -
> > >>
> > >> This is a strange scenario. I recommend looking at all of the log files for "[error]" or other entries at about the same time as these PUTs or 404 responses.
> > >>
> > >> Is there anything unusual about the key being used?
> > >> --
> > >> Luke Bakken
> > >> Engineer
> > >> lbak...@basho.com
> > >>
> > >>
> > >> On Wed, Jan 25, 2017 at 6:40 AM, Daniel Miller <dmil...@dimagi.com> wrote:
> > >> > I have a 9-node Riak CS cluster that has been working flawlessly for about 3 months. The cluster configuration, including backend and bucket parameters such as N-value, is using default settings. I'm using the S3 API to communicate with the cluster.
> > >> >
> > >> > Within the past week I had an issue where two objects were PUT resulting in a 200 (success) response, but all subsequent GET requests for those two keys return a status of 404 (not found). Other than the fact that they are now missing, there was nothing out of the ordinary with these particular PUTs. Maybe I'm missing something, but this seems like a scenario that should never happen. All information included here about PUTs and GETs comes from reviewing the CS access logs. Both objects were PUT on the same node, however GET requests returning 404 have been observed on all nodes. There is plenty of other traffic on the cluster involving GETs and PUTs that are not failing.
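> > >> >
> > >> > From the client side, the only checks available to me are a HEAD on the key and a listing of the key as a prefix; a rough sketch of those checks (boto3 resource API; the endpoint and key below are placeholders, and credentials come from the usual boto3 config) is:
> > >> >
> > >> > import boto3
> > >> > from botocore.exceptions import ClientError
> > >> >
> > >> > # Placeholders: the real endpoint and key are elided here.
> > >> > ENDPOINT = "http://riak-cs.example:8080"
> > >> > BUCKET = "blobdb"
> > >> > KEY = "example-key"
> > >> >
> > >> > s3 = boto3.resource("s3", endpoint_url=ENDPOINT)
> > >> > obj = s3.Object(BUCKET, KEY)
> > >> > try:
> > >> >     obj.load()  # HEAD request against the key
> > >> >     print obj.content_length, obj.e_tag, obj.last_modified
> > >> > except ClientError as err:
> > >> >     print err.response["Error"]  # the missing objects come back as 404 here
> > >> >
> > >> > # Does the key still appear in a bucket listing?
> > >> > for summary in s3.Bucket(BUCKET).objects.filter(Prefix=KEY):
> > >> >     print summary.key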
> > >> >
> > >> > I'm unsure of how to troubleshoot further to find out what may have happened to those objects and why they are now missing. What is the best approach to figure out why an object that was successfully PUT seems to be missing?
> > >> >
> > >> > Thanks!
> > >> > Daniel Miller
> >
> > <config-files.zip>

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com