riak key sharding
Hi, I have noticed that Riak-CS can shard (that is, split) large keys automatically across nodes. I would like to achieve a similar outcome with Riak itself. Is there any best practice for this? Could a portion of Riak-CS be reused, or should I just bite the bullet and use Riak-CS? Latency is key for my application, and I wanted to avoid the additional layer Riak-CS introduces.
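A minimal sketch of the kind of client-side chunking being asked about here, assuming the riak-erlang-client (riakc); the "<Key>/<N>" chunk naming and the manifest format are illustrative choices, not Riak-CS internals. Because Riak hashes each key onto the ring independently, the chunks end up spread across the cluster:

%% Client-side chunking sketch (assumes the riakc Erlang client is available).
-module(chunk_store).
-export([put_large/4]).

-define(CHUNK_SIZE, 1048576).  %% 1 MB per chunk (arbitrary choice)

put_large(Pid, Bucket, Key, Bin) ->
    Chunks = split(Bin, ?CHUNK_SIZE, []),
    N = length(Chunks),
    %% Store each chunk under "<Key>/<Index>".
    lists:foreach(
      fun({I, Chunk}) ->
              CKey = <<Key/binary, "/", (integer_to_binary(I))/binary>>,
              Obj = riakc_obj:new(Bucket, CKey, Chunk, "application/octet-stream"),
              ok = riakc_pb_socket:put(Pid, Obj)
      end,
      lists:zip(lists:seq(1, N), Chunks)),
    %% Store a small manifest so readers know how many chunks to fetch back.
    Manifest = riakc_obj:new(Bucket, Key, term_to_binary({chunks, N}), "application/x-erlang-binary"),
    riakc_pb_socket:put(Pid, Manifest).

split(Bin, Size, Acc) when byte_size(Bin) =< Size ->
    lists:reverse([Bin | Acc]);
split(Bin, Size, Acc) ->
    <<Chunk:Size/binary, Rest/binary>> = Bin,
    split(Rest, Size, [Chunk | Acc]).

A reader would fetch the manifest first and then the N chunk keys. Riak CS does essentially this (manifests plus block objects, with garbage collection and multipart handling on top), which is the extra layer mentioned above.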
Upgrade from 1.3.1 to 1.4.2 => high IO
Hi @list,

I'm trying to upgrade our Riak cluster from 1.3.1 to 1.4.2. After upgrading the first node (out of 12), that node seems to do many merges: the sst_* directories change in size rapidly and the node sits at 100% disk utilization all the time.

I know there is a note about this:

"The first execution of 1.4.0 leveldb using a 1.3.x or 1.2.x dataset will initiate an automatic conversion that could pause the startup of each node by 3 to 7 minutes. The leveldb data in "level #1" is being adjusted such that "level #1" can operate as an overlapped data level instead of as a sorted data level. The conversion is simply the reduction of the number of files in "level #1" to being less than eight via normal compaction of data from "level #1" into "level #2". This is a one time conversion."

but what I'm seeing looks much more invasive than described there, or may have nothing to do with the merges I'm (probably) seeing.

Is this "normal" behavior, or is there anything I can do about it?

At the moment I'm stuck in the upgrade procedure, because this high IO load would probably lead to high response times. Also, we have a lot of data (~950 GB per node).

Cheers
Simon
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Hi Matthew,

see inline..

On Tue, 10 Dec 2013 10:38:03 -0500, Matthew Von-Maszewski wrote:

> The sad truth is that you are not the first to see this problem. And yes, it has to do with your 950GB per node dataset. And no, nothing to do but sit through it at this time.
>
> While I did extensive testing around upgrade times before shipping 1.4, apparently there are data configurations I did not anticipate. You are likely seeing a cascade where a shift of one file from level-1 to level-2 is causing a shift of another file from level-2 to level-3, which causes a level-3 file to shift to level-4, etc … then the next file shifts from level-1.
>
> The bright side of this pain is that you will end up with better write throughput once all the compaction ends.

I have to deal with that.. but my problem now is: if I do this node by node, it looks like 2i searches aren't possible while 1.3 and 1.4 nodes both exist in the cluster. Is there anything here that will lead me to a 2i repair marathon, or can I simply wait a few hours for each node, until all merges are done, before I upgrade the next one? (2i searches can fail for some time.. the app has no problem with that, but are new inserts with 2i indexes processed successfully, or do I have to run a 2i repair?)

/s

one other good thing: saving disk space is an advantage ;)..

> Riak 2.0's leveldb has code to prevent/reduce compaction cascades, but that is not going to help you today.
>
> Matthew

--
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Re: Stalled handoffs on a prod cluster after server crash
I had something like that once, but with version 1.2 or 1.3.. a rolling restart helped in my case.

/s

On Mon, 9 Dec 2013 09:48:12 -0500, Ivaylo Panitchkov wrote:

> Hello,
>
> We have a prod cluster of four machines running riak (1.1.4 2012-06-19) Debian x86_64.
> Two days ago one of the servers went down because of a hardware failure.
> I force-removed the machine in question to re-balance the cluster before adding the new machine.
> Since then the cluster is operating properly, but I noticed some handoffs are stalled now.
> I had a similar situation a while ago that was solved by simply forcing the handoffs, but this time the same approach didn't work.
> Any ideas, solutions or just hints are greatly appreciated.
> Below are the cluster statuses. I replaced the IP addresses for security reasons.
>
> ~# riak-admin member_status
> Attempting to restart script through sudo -u riak
> ================================= Membership ==================================
> Status     Ring     Pending    Node
> -------------------------------------------------------------------------------
> valid      45.3%    34.4%      'r...@aaa.aaa.aaa.aaa'
> valid      26.6%    32.8%      'r...@bbb.bbb.bbb.bbb'
> valid      28.1%    32.8%      'r...@ccc.ccc.ccc.ccc'
> -------------------------------------------------------------------------------
> Valid:3 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
>
> ~# riak-admin ring_status
> Attempting to restart script through sudo -u riak
> ================================== Claimant ===================================
> Claimant:  'r...@aaa.aaa.aaa.aaa'
> Status:     up
> Ring Ready: true
>
> ============================== Ownership Handoff ==============================
> Owner:      r...@aaa.aaa.aaa.aaa
> Next Owner: r...@bbb.bbb.bbb.bbb
>
> Index: 22835963083295358096932575511191922182123945984
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
>
> Index: 570899077082383952423314387779798054553098649600
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
>
> Index: 1118962191081472546749696200048404186924073353216
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
>
> Index: 1392993748081016843912887106182707253109560705024
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
>
> -------------------------------------------------------------------------------
> Owner:      r...@aaa.aaa.aaa.aaa
> Next Owner: r...@ccc.ccc.ccc.ccc
>
> Index: 114179815416476790484662877555959610910619729920
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
>
> Index: 662242929415565384811044689824565743281594433536
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
>
> Index: 1210306043414653979137426502093171875652569137152
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
>
> -------------------------------------------------------------------------------
>
> ============================== Unreachable Nodes ==============================
> All nodes are up and reachable
>
> Thanks in advance,
> Ivaylo

--
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
The sad truth is that you are not the first to see this problem. And yes, it has to do with your 950GB per node dataset. And no, nothing to do but sit through it at this time.

While I did extensive testing around upgrade times before shipping 1.4, apparently there are data configurations I did not anticipate. You are likely seeing a cascade where a shift of one file from level-1 to level-2 is causing a shift of another file from level-2 to level-3, which causes a level-3 file to shift to level-4, etc … then the next file shifts from level-1.

The bright side of this pain is that you will end up with better write throughput once all the compaction ends.

Riak 2.0's leveldb has code to prevent/reduce compaction cascades, but that is not going to help you today.

Matthew
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
2i is not my expertise, so I had to discuss your concerns with another Basho developer. He says:

Between 1.3 and 1.4, the 2i query did change but not the 2i on-disk format. You must wait for all nodes to update if you desire to use the new 2i query. The 2i data will properly write/update on both 1.3 and 1.4 machines during the migration.

Does that answer your question?

And yes, you might see available disk space increase during the upgrade compactions if your dataset contains numerous delete "tombstones". The Riak 2.0 code includes a new feature called "aggressive delete" for leveldb. This feature is more proactive in pushing delete tombstones through the levels to free up disk space much more quickly (especially if you perform block deletes every now and then).

Matthew
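For readers following along, the "2i data will properly write/update on both 1.3 and 1.4 machines" point can be pictured with a small riak-erlang-client (riakc) sketch; the bucket, key, and index names here are made up:

%% Sketch only: store an object with a secondary index, then query it.
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),

Obj0 = riakc_obj:new(<<"users">>, <<"simon">>, <<"{}">>, "application/json"),
MD0  = riakc_obj:get_update_metadata(Obj0),
MD1  = riakc_obj:set_secondary_index(MD0, [{{binary_index, "team"}, [<<"siteops">>]}]),
ok   = riakc_pb_socket:put(Pid, riakc_obj:update_metadata(Obj0, MD1)),

%% Per the thread above, index writes like the put succeed against both 1.3
%% and 1.4 nodes during the rolling upgrade, while the new-style 2i query
%% below should only be relied on once every node runs 1.4.x.
{ok, Results} =
    riakc_pb_socket:get_index_eq(Pid, <<"users">>, {binary_index, "team"}, <<"siteops">>).

This matches Matthew's point that no 2i repair run is needed just because writes happened mid-migration.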
Re: Upgrade from 1.3.1 to 1.4.2 => high IO
Hi Matthew,

thanks! That answers my questions.

Cheers
Simon

--
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
2i stopped working on LevelDB with multi backend
We just rebuilt our test environment (something we do every month) and suddenly we get the following error when trying to use 2i:

{error,{error,{indexes_not_supported,riak_kv_multi_backend}}}

But looking at the properties of the bucket, it's set to use leveldb:

# curl -k https://localhost:8069/riak/eleveldb/ | jq .
{
  "props": {
    "young_vclock": 20,
    "w": "quorum",
    "small_vclock": 50,
    "rw": "quorum",
    "r": "quorum",
    "linkfun": {
      "fun": "mapreduce_linkfun",
      "mod": "riak_kv_wm_link_walker"
    },
    "last_write_wins": false,
    "dw": "quorum",
    "chash_keyfun": {
      "fun": "chash_std_keyfun",
      "mod": "riak_core_util"
    },
    "big_vclock": 50,
    "basic_quorum": false,
    "backend": "eleveldb_data",
    "allow_mult": false,
    "n_val": 3,
    "name": "eleveldb",
    "notfound_ok": true,
    "old_vclock": 86400,
    "postcommit": [],
    "pr": 0,
    "precommit": [],
    "pw": 0
  }
}

Here's the relevant app.config snippet:

{storage_backend, riak_kv_multi_backend},
{multi_backend_default, <<"bitcask_data">>},
{multi_backend, [
  {<<"bitcask_data">>, riak_kv_bitcask_backend, [
    {data_root, "/srv/riak/data/bitcask/data"},
    %%{io_mode, nif},
    {max_file_size, 2147483648},            %% 2G
    {merge_window, always},
    {frag_merge_trigger, 30},               %% Merge at 30% dead keys
    {dead_bytes_merge_trigger, 134217728},  %% Merge files that have more than 128MB dead
    {frag_threshold, 25},                   %% Files that have 25% dead keys will be merged too
    {dead_bytes_threshold, 67108864},       %% Include files that have 64MB of dead space in merges
    {small_file_threshold, 10485760},       %% Files smaller than 10MB will not be merged
    {log_needs_merge, true},                %% Log when we need to merge...
    {sync_strategy, none}
  ]},
  {<<"eleveldb_data">>, riak_kv_eleveldb_backend, [
    {data_root, "/srv/riak/data/eleveldb/files"},
    {write_buffer_size_min, 31457280},      %% 30 MB in bytes
    {write_buffer_size_max, 62914560},      %% 60 MB in bytes
    {max_open_files, 20},                   %% Maximum number of files open at once per partition
    {sst_block_size, 4096},                 %% 4K blocks
    {cache_size, 8388608}                   %% 8MB default cache size per-partition
  ]}
]},

Anyone have any ideas? We're using Ubuntu 12.04 with the Basho Riak 1.4.2 .deb. The only change to this environment has been to upgrade the kernel from 3.5.0-26 to 3.8.0-31-generic, but I'd be very surprised if that broke 2i...

Thanks,
Chris
Re: riak nagios script
Hello Kathleen,

Have you executed the `make encrypt` target to build the `check_node` binary? [0]

From there, I copied it to the Riak node and invoked it like this:

$ /usr/lib/riak/erts-5.9.1/bin/escript check_node --node riak@127.0.0.1 riak_kv_up
OKAY: riak_kv is running on riak@127.0.0.1

I used the entire path to escript because the bin directory under erts was not in my PATH by default.

--
Hector

[0] https://github.com/basho/riak_nagios#building

On Mon, Dec 9, 2013 at 7:35 PM, kzhang wrote:

> Also, when running
> https://github.com/basho/riak_nagios/blob/master/src/check_node.erl
> I ran into the error:
>
> ** exception error: undefined function getopt:parse/2
>      in function check_node:main/2 (check_node.erl, line 15)
Re: riak nagios script
Thanks Hector.

Here is how I executed the script. I downloaded and installed the Erlang shell from
http://www.erlang.org/documentation/doc-5.3/doc/getting_started/getting_started.html
and started Erlang/OTP:

[root@MYRIAKNODE otp_src_R16B02]# erl -s toolbar
Erlang R16B02 (erts-5.10.3) [source] [64-bit] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V5.10.3  (abort with ^G)

I grabbed the source code (https://github.com/basho/riak_nagios/blob/master/src/check_node.erl) and compiled it:

c(check_node).

ran it:

check_node:main([{node, 'xx.xx.xx.xx'}]).

and got:

** exception error: undefined function getopt:parse/2
     in function check_node:main/2 (check_node.erl, line 15)

Here is where I am now. I found this: https://github.com/jcomellas/getopt
I grabbed that source code, compiled it under otp_src_R16B02, and ran it again:

2> check_node:main([{node, 'xx.xx.xx.xx'}]).
UNKNOWN: invalid_option_arg {check,{node,'xx.xx.xx.xx'}}

Am I on the right path?

Thanks,
Kathleen
Riak Recap for December 4 - 9
Morning, Afternoon, Evening to All -

Here's today's Recap. Enjoy. Also, if you're around Raleigh/Durham and want to have drinks next week, let me know.

Mark
twitter.com/pharkmillups

---

Riak Recap for December 4 - 9
==

The recording of last Friday's Riak Community Hangout is now available. This one is all about Riak Security and the exciting history behind "allow_mult=false". It's well worth your time.
- http://www.youtube.com/watch?v=n8m8xlizekg

John Daily et al. are talking about Riak 2.0 tomorrow night at the Braintree offices in Chicago. This is not to be missed.
- www.meetup.com/Chicago-Riak-Meetup/events/151516252/

Tom Santero and I will be at the West End Ruby Meetup next week in Durham, NC to talk about Riak.
- http://www.meetup.com/raleighrb/events/154001722/

Riakpbc, nlf's Node.js protocol buffers client for Riak, hit version 1.0.5. (Also, h/t to nlf for cranking out bug fixes.)
- https://npmjs.org/package/riakpbc

Riaks, Noah Isaacson's Riak client, just hit 2.0.2.
- https://npmjs.org/package/riaks

We wrote up some details on how the team at CityMaps is using Riak in production.
- https://basho.com/social-map-innovator-and-riak-user-citymaps-predicts-where-you-want-to-go/

Vincent Chinedu Okonkwo open sourced a Lager backend for Mozilla's Heka.
- https://github.com/codmajik/herlka

Vic Iglesias wrote a great post about getting Riak CS and Eucalyptus running together.
- http://testingclouds.wordpress.com/2013/12/10/testing-riak-cs-with-eucalyptus/

Q & A
- http://stackoverflow.com/questions/20366695/truncate-a-riak-database
- http://stackoverflow.com/questions/20440450/riak-databse-and-spring-mvc
- http://stackoverflow.com/questions/20461280/are-single-client-writes-strongly-ordered-in-dynamodb-or-riak
Re: riak nagios script
Hi Kathleen,

If you'd like to run riak_nagios from the erl command line, you'll need to compile everything in src and include it in the path along with the getopt library. You can compile everything with a simple call to make, and then include it in the path with "erl -pa deps/*/ebin ebin". Once everything is loaded, you can call "check_node:main(["--node", "dev1@127.0.0.1", "riak_kv_up"])." or something similar to run it. The last parameter in the Args array will be the check to make.

Is there a reason you're running it this way instead of compiling it to an escript and running it from bash?

Thanks,
Alex Moore
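Pulling those steps into one place, a session along these lines should work (the directory name, node name, and the riak_kv_up check are assumptions carried over from the earlier messages):

$ cd riak_nagios
$ make
$ erl -pa deps/*/ebin ebin
1> check_node:main(["--node", "riak@127.0.0.1", "riak_kv_up"]).

This should print the same one-line status that the escript run produced earlier (e.g. "OKAY: riak_kv is running on riak@127.0.0.1").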
Re: Stalled handoffs on a prod cluster after server crash
What does "riak-admin transfers" tell you? Are there any transfers in progress? You can try to set the amount of allowed transfers per host to 0 and then back to 2 (the default) or whatever you want, in order to restart any transfers which may be in progress. You can do that with the "riak-admin transfer-limit " command. -- Jeppe Fihl Toustrup Operations Engineer Falcon Social On 9 December 2013 15:48, Ivaylo Panitchkov wrote: > > > Hello, > > We have a prod cluster of four machines running riak (1.1.4 2012-06-19) > Debian x86_64. > Two days ago one of the servers went down because of a hardware failure. > I force-removed the machine in question to re-balance the cluster before > adding the new machine. > Since then the cluster is operating properly, but I noticed some handoffs are > stalled now. > I had similar situation awhile ago that was solved by simply forcing the > handoffs, but this time the same approach didn't work. > Any ideas, solutions or just hints are greatly appreciated. ___ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Re: Stalled handoffs on a prod cluster after server crash
Hello,

Below is the transfers info:

~# riak-admin transfers
Attempting to restart script through sudo -u riak
'r...@ccc.ccc.ccc.ccc' waiting to handoff 7 partitions
'r...@bbb.bbb.bbb.bbb' waiting to handoff 7 partitions
'r...@aaa.aaa.aaa.aaa' waiting to handoff 5 partitions

~# riak-admin member_status
Attempting to restart script through sudo -u riak
================================= Membership ==================================
Status     Ring     Pending    Node
-------------------------------------------------------------------------------
valid      45.3%    34.4%      'r...@aaa.aaa.aaa.aaa'
valid      26.6%    32.8%      'r...@bbb.bbb.bbb.bbb'
valid      28.1%    32.8%      'r...@ccc.ccc.ccc.ccc'
-------------------------------------------------------------------------------

It has been stuck with all those handoffs for a few days now. riak-admin ring_status gives me the same info as the one I mentioned when I opened the case. I noticed AAA.AAA.AAA.AAA experiences more load than the other servers, as it's responsible for almost half of the data. Is it safe to add another machine to the cluster in order to relieve AAA.AAA.AAA.AAA even while the issue with the handoffs is not yet resolved?

Thanks,
Ivaylo
Re: Stalled handoffs on a prod cluster after server crash
Try to take a look at this thread from November, where I experienced a similar problem:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-November/014027.html

The following mails in the thread mention things you can try to correct the problem, and what I ended up doing with the help of Basho employees.

--
Jeppe Fihl Toustrup
Operations Engineer
Falcon Social
Re: riak nagios script
Hi Alex,

Thanks. I am completely new to Erlang. When googling how to run an Erlang program, I came across
http://www.erlang.org/documentation/doc-5.3/doc/getting_started/getting_started.html .
That's how I got started.

To run the script using escript, based on http://www.erlang.org/doc/man/escript.html, it looks like I don't need to compile the scripts, so I ran:

/usr/local/bin/escript check_node --node riak@127.0.0.1 check_riak_repl

and got:

escript: Failed to open file: check_node
Re: Stalled handoffs on a prod cluster after server crash
Hi Ivaylo,

Is there anything useful in console.log on any (or all) of the nodes? If so, throw it in a gist and we'll take a look at it.

Mark