Leveldb segfault during Riak startup

Antti Kuusela Fri, 18 Dec 2015 06:16:28 -0800

Hi,

I have been testing Riak and Riak CS as a possible solution for our future storage needs. I have a five server cluster running CentOS 7. Riak version is 2.1.3 (first installed as 2.1.1, updated twice via Basho repo) and Riak CS version is 2.1.0. The servers each have 64GB RAM and six 4TB disks in RAID 6 using btrfs.

I have been pushing random data into Riak CS via s3cmd to see how the system behaves. The smallest objects have been 2000 bytes, the largest 100MB. I have also been making btrfs snapshots of the entire platform data dir nightly for backup purposes: stop Riak CS, wait 10 seconds, stop Riak, wait 10 seconds, make the snapshot, start Riak, wait 180 seconds, start Riak CS. This is performed on each of the servers in turn, with a five minute wait in between. I have added the waits to try to spread the startup load and allow the system time to get things running. New data is constantly pushed to the S3 API, but restarting the nodes in rotation causes by far the highest stress on the system.
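For reference, the per-node sequence described above could be sketched roughly as follows. This is only an illustration, not the actual backup script: the helper names `snapshot_steps` and `execute`, the snapshot destination, the `-r` (read-only) flag, and the assumption that the data dir is itself a btrfs subvolume are all hypothetical. The `run` and `sleep` parameters exist so the order of steps can be checked without touching a real node.

```python
import subprocess
import time

def snapshot_steps(data_dir, snapshot_path):
    """Return the per-node sequence as (kind, value) pairs, in order."""
    return [
        ("run", ["riak-cs", "stop"]),
        ("sleep", 10),
        ("run", ["riak", "stop"]),
        ("sleep", 10),
        # Assumption: data_dir is a btrfs subvolume; -r makes the snapshot read-only.
        ("run", ["btrfs", "subvolume", "snapshot", "-r", data_dir, snapshot_path]),
        ("run", ["riak", "start"]),
        ("sleep", 180),
        ("run", ["riak-cs", "start"]),
    ]

def execute(steps, run=subprocess.check_call, sleep=time.sleep):
    """Carry out the steps; a failing command raises and aborts the sequence."""
    for kind, value in steps:
        if kind == "run":
            run(value)
        else:
            sleep(value)

# Hypothetical usage on one node (paths are placeholders):
# execute(snapshot_steps("/data/riak", "/data/snapshots/riak-2015-12-16"))
```

Aborting on the first failed command means Riak is not restarted automatically if the snapshot fails, which is a deliberate choice in this sketch so that a broken node is noticed rather than papered over.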

I have encountered one problem in particular. Quite often one of the Riak nodes starts up but after a couple of minutes it just drops, all processes exited except for epmd.

The following is from /var/log/riak/console, with most of the lines skipped for the sake of brevity. Normal startup stuff, as far as I can see:

2015-12-16 00:26:04.446 [info] <0.7.0> Application lager started on node 'riak@192.168.50.32'

...

2015-12-16 00:26:04.490 [info] <0.72.0> alarm_handler:{set,{system_memory_high_watermark,[]}}

...

2015-12-16 00:26:04.781 [info] <0.206.0>@riak_core_capability:process_capability_changes:555 New capability: {riak_core,vnode_routing} = proxy

...

2015-12-16 00:26:04.869 [info] <0.7.0> Application riak_core started on node 'riak@192.168.50.32'

...

2015-12-16 00:26:04.969 [info] <0.407.0>@riak_kv_env:doc_env:46 Environment and OS variables:

2015-12-16 00:26:05.124 [warning] <0.6.0> lager_error_logger_h dropped 9 messages in the last second that exceeded the limit of 100 messages/sec

2015-12-16 00:26:05.124 [info] <0.407.0> riak_kv_env: Open file limit: 65536

2015-12-16 00:26:05.124 [warning] <0.407.0> riak_kv_env: Cores are disabled, this may hinder debugging

2015-12-16 00:26:05.124 [info] <0.407.0> riak_kv_env: Erlang process limit: 262144

2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Erlang ports limit: 65536

2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: ETS table count limit: 256000

2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Thread pool size: 64

2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Generations before full sweep: 0

2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Schedulers: 12 for 12 cores

2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: sysctl vm.swappiness is 0 greater than or equal to 0)

2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: sysctl net.core.wmem_default is 8388608 lesser than or equal to 8388608)

...

2015-12-16 00:26:05.139 [info] <0.478.0>@riak_core:wait_for_service:504 Waiting for service riak_kv to start (0 seconds)

2015-12-16 00:26:05.158 [info] <0.495.0>@riak_kv_entropy_manager:set_aae_throttle_limits:790 Setting AAE throttle limits: [{-1,0},{200,10},{500,50},{750,250},{900,1000},{1100,5000}]

...

2015-12-16 00:26:30.160 [info] <0.495.0>@riak_kv_entropy_manager:perhaps_log_throttle_change:853 Changing AAE throttle from undefined -> 5000 msec/key, based on maximum vnode mailbox size {unknown_mailbox_sizes,node_list,['riak@192.168.50.32']} from ['riak@192.168.50.32']

2015-12-16 00:27:12.053 [info] <0.478.0>@riak_core:wait_for_service:504 Waiting for service riak_kv to start (60 seconds)

2015-12-16 00:28:25.057 [info] <0.478.0>@riak_core:wait_for_service:504 Waiting for service riak_kv to start (120 seconds)


And then nothing.

From /var/log/messages:

Dec 16 00:26:02 storage2 su: (to riak) root on none
Dec 16 00:26:04 storage2 riak[48174]: Starting up

Dec 16 00:28:59 storage2 kernel: traps: beam.smp[48492] general protection ip:7fcaf9402f16 sp:7fca6affcdd0 error:0 in eleveldb.so[7fcaf93b5000+93000]

Dec 16 00:28:59 storage2 run_erl[48172]: Erlang closed the connection.

On another node at a different time /var/log/riak/console.log had similar messages, and also some warnings about invalid hint files, such as:

2015-12-13 00:15:41.232 [warning] <0.815.0> Hintfile '/data/riak/bitcask/570899077082383952423314387779798054553098649600/56.bitcask.hint' invalid

In this latter example riak was started with "systemctl start riak" rather than "riak start". From /var/log/messages:


Dec 13 00:15:29 storage1 riak: Starting riak: [  OK  ]

Dec 13 00:15:29 storage1 systemd: Started SYSV: Riak is a distributed data store.

Dec 13 00:15:56 storage1 kernel: beam.smp[131820]: segfault at 160 ip 00007f24c0902ce6 sp 00007f24337fddd0 error 4 in eleveldb.so[7f24c08b5000+93000]

Dec 13 00:15:56 storage1 run_erl[131501]: Erlang closed the connection.
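Both kernel messages give the faulting instruction pointer and the load address of eleveldb.so, so the offset of the crash inside the library can be worked out by subtraction. Below is a minimal sketch of that arithmetic; the function name `fault_offset` and the regular expression are my own, and the log lines are copied from above with the wrapped-line spacing restored.

```python
import re

# Matches both kernel formats seen above: "ip:<hex>" and "ip <hex>",
# followed later by "in <library>[<base>+<size>]".
FAULT_RE = re.compile(
    r"\bip[: ]\s*(?P<ip>[0-9a-f]+).*?\bin (?P<lib>[\w.+-]+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def fault_offset(line):
    """Return (library, offset of the faulting ip within it), or None."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    offset = ip - base
    # The ip must fall inside the mapped range for the offset to make sense.
    if not 0 <= offset < size:
        return None
    return m.group("lib"), offset

storage2 = ("traps: beam.smp[48492] general protection ip:7fcaf9402f16 "
            "sp:7fca6affcdd0 error:0 in eleveldb.so[7fcaf93b5000+93000]")
storage1 = ("beam.smp[131820]: segfault at 160 ip 00007f24c0902ce6 "
            "sp 00007f24337fddd0 error 4 in eleveldb.so[7f24c08b5000+93000]")

for line in (storage2, storage1):
    lib, offset = fault_offset(line)
    print(lib, hex(offset))
# prints "eleveldb.so 0x4df16" and then "eleveldb.so 0x4dce6"
```

The two offsets are close to each other but not identical. If a build of eleveldb.so with debug symbols is available, offsets like these could presumably be resolved to a function name with a tool such as addr2line, but I have not tried that here.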

Of reported Riak bugs, this is similar to https://github.com/basho/riak/issues/790 . However, the poster of that issue reported that his problem was fixed by repairing leveldb partitions. I looked at this following http://docs.basho.com/riak/latest/ops/running/recovery/repairing-leveldb/ but didn't find any errors.

Incidentally, I started having problems with btrfs as well. On one node btrfs caused a kernel crash, and on another the kernel killed the beam.smp process after it stopped responding for over 120 seconds while syncing to btrfs. The kernel in CentOS 7 probably isn't best suited for working with btrfs. The advertised version is 3.10.0.

So, my question is what is your take on this? Is this a bug in the leveldb library? The same one already reported? What log data would help debug or reproduce it? Or is there potentially some problem with my setup? Or could this be caused by a bug in btrfs? What is your take on using Riak with btrfs?


--
Antti Kuusela, M.Sc
Senior Software Developer
Firstbeat Technologies Ltd.


_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com