Leveldb segfault during Riak startup

2015-12-18 Thread Antti Kuusela
sp 7f24337fddd0 error 4 in eleveldb.so[7f24c08b5000+93000]

Dec 13 00:15:56 storage1 run_erl[131501]: Erlang closed the connection.

Of the reported Riak bugs, this looks similar to
https://github.com/basho/riak/issues/790 . However, the poster of that
issue reported that his problem was fixed by repairing the leveldb
partitions. I checked ours following
http://docs.basho.com/riak/latest/ops/running/recovery/repairing-leveldb/
but didn't find any errors.
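
(For reference, a single-partition check/repair along the lines of that
guide boils down to roughly the following, run from an Erlang shell that
has Riak's bundled eleveldb on its code path while the node is stopped.
The paths below are placeholders, not the actual ones from this cluster.)

%% Rough sketch only: paths are placeholders, not real values.
%% Start a shell with eleveldb loadable, for example:
%%   erl -pa /usr/lib64/riak/lib/eleveldb-*/ebin
Partition = "/var/lib/riak/leveldb/<partition_id>",
case eleveldb:repair(Partition, []) of
    ok              -> io:format("repaired ~s~n", [Partition]);
    {error, Reason} -> io:format("repair of ~s failed: ~p~n",
                                 [Partition, Reason])
end.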


Incidentally, I started having problems with btrfs as well. On one node
btrfs caused a kernel crash, and on another the kernel killed the beam.smp
process after it had stopped responding for over 120 seconds while syncing
to btrfs. The kernel in CentOS 7 (reported version 3.10.0) probably isn't
best suited for working with btrfs.


So, what is your take on this? Is this a bug in the leveldb library,
perhaps the same one already reported? What log data would help debug or
reproduce it? Is there potentially some problem with my setup, or could
this be caused by a bug in btrfs? And what is your take on using Riak
with btrfs?


--
Antti Kuusela, M.Sc
Senior Software Developer
Firstbeat Technologies Ltd.




Re: Leveldb segfault during Riak startup

2015-12-31 Thread Antti Kuusela

Hi Luke,

We erased the btrfs file system, replaced it with xfs on lvm with 
thinly-provisioned volumes and continued testing with a new database 
from scratch. The same problem continues, though. From /var/log/messages:


Dec 31 03:35:31 storage5 riak[66419]: Starting up
Dec 31 03:35:45 storage5 kernel: traps: beam.smp[66731] general 
protection ip:7f02280b3f16 sp:7f0197ad8dd0 error:0 in 
eleveldb.so[7f0228066000+93000]

Dec 31 03:35:46 storage5 run_erl[66417]: Erlang closed the connection.



On 18.12.2015, 16:45, Luke Bakken wrote:

Hi Antti,

Riak is not tested on btrfs and the file system is not officially
supported. We recommend ext4 or xfs for Linux. ZFS is an option on
Solaris derivatives and FreeBSD.

--
Luke Bakken
Engineer
lbak...@basho.com


On Fri, Dec 18, 2015 at 6:14 AM, Antti Kuusela
 wrote:

Hi,

I have been testing Riak and Riak CS as a possible solution for our future
storage needs. I have a five-server cluster running CentOS 7. The Riak version
is 2.1.3 (first installed as 2.1.1, updated twice via the Basho repo) and the
Riak CS version is 2.1.0. The servers each have 64GB RAM and six 4TB disks in
RAID 6 using btrfs.

I have been pushing random data into Riak CS via s3cmd to see how the system
behaves. The smallest objects have been 2000 bytes, the largest 100MB. I have
also been making btrfs snapshots of the entire platform data dir nightly for
backup purposes: stop Riak CS, wait 10 seconds, stop Riak, wait 10, make the
snapshot, start Riak, wait 180 seconds, start Riak CS. This is performed on
each of the servers in turn with a five-minute wait in between. I have added
the waits to try to spread the startup load and allow the system time to get
things running. New data is constantly pushed to the S3 API, but restarting
the nodes in rotation causes by far the highest stress on the system.
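
(For illustration, the per-node rotation above could be scripted roughly
like the escript below; it is not the actual script in use, and the data
directory and snapshot target are assumptions.)

#!/usr/bin/env escript
%% Illustrative sketch of the nightly rotation described above; not the
%% actual script in use. The data directory and snapshot target are
%% placeholders.
main(_) ->
    run("riak-cs stop"),
    timer:sleep(10 * 1000),
    run("riak stop"),
    timer:sleep(10 * 1000),
    %% read-only btrfs snapshot of the platform data dir
    run("btrfs subvolume snapshot -r /var/lib/riak /snapshots/riak-"
        ++ date_tag()),
    run("riak start"),
    timer:sleep(180 * 1000),
    run("riak-cs start").

%% Run a shell command and echo it together with its output.
run(Cmd) ->
    io:format("~s~n~s", [Cmd, os:cmd(Cmd)]).

%% YYYY-MM-DD tag for the snapshot name.
date_tag() ->
    {{Y, M, D}, _} = calendar:local_time(),
    lists:flatten(io_lib:format("~4..0B-~2..0B-~2..0B", [Y, M, D])).

It would run on one server at a time, with the five-minute gap between
servers handled outside the script.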

I have encountered one problem in particular. Quite often one of the Riak
nodes starts up, but after a couple of minutes it just drops: all processes
have exited except for epmd.

The following is from /var/log/riak/console, with most of the lines skipped
for the sake of brevity. Normal startup stuff, as far as I can see:

2015-12-16 00:26:04.446 [info] <0.7.0> Application lager started on node
'riak@192.168.50.32'
...
2015-12-16 00:26:04.490 [info] <0.72.0> alarm_handler:
{set,{system_memory_high_watermark,[]}}
...
2015-12-16 00:26:04.781 [info]
<0.206.0>@riak_core_capability:process_capability_changes:555 New
capability: {riak_core,vnode_routing} = proxy
...
2015-12-16 00:26:04.869 [info] <0.7.0> Application riak_core started on node
'riak@192.168.50.32'
...
2015-12-16 00:26:04.969 [info] <0.407.0>@riak_kv_env:doc_env:46 Environment
and OS variables:
2015-12-16 00:26:05.124 [warning] <0.6.0> lager_error_logger_h dropped 9
messages in the last second that exceeded the limit of 100 messages/sec
2015-12-16 00:26:05.124 [info] <0.407.0> riak_kv_env: Open file limit: 65536
2015-12-16 00:26:05.124 [warning] <0.407.0> riak_kv_env: Cores are disabled,
this may hinder debugging
2015-12-16 00:26:05.124 [info] <0.407.0> riak_kv_env: Erlang process limit:
262144
2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Erlang ports limit:
65536
2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: ETS table count limit:
256000
2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Thread pool size: 64
2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Generations before
full sweep: 0
2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: Schedulers: 12 for 12
cores
2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: sysctl vm.swappiness
is 0 greater than or equal to 0)
2015-12-16 00:26:05.125 [info] <0.407.0> riak_kv_env: sysctl
net.core.wmem_default is 8388608 lesser than or equal to 8388608)
...
2015-12-16 00:26:05.139 [info] <0.478.0>@riak_core:wait_for_service:504
Waiting for service riak_kv to start (0 seconds)
2015-12-16 00:26:05.158 [info]
<0.495.0>@riak_kv_entropy_manager:set_aae_throttle_limits:790 Setting AAE
throttle limits: [{-1,0},{200,10},{500,50},{750,250},{900,1000},{1100,5000}]
...
2015-12-16 00:26:30.160 [info]
<0.495.0>@riak_kv_entropy_manager:perhaps_log_throttle_change:853 Changing
AAE throttle from undefined -> 5000 msec/key, based on maximum vnode mailbox
size {unknown_mailbox_sizes,node_list,['riak@192.168.50.32']} from
['riak@192.168.50.32']
2015-12-16 00:27:12.053 [info] <0.478.0>@riak_core:wait_for_service:504
Waiting for service riak_kv to start (60 seconds)
2015-12-16 00:28:25.057 [info] <0.478.0>@riak_core:wait_for_service:504
Waiting for service riak_kv to start (120 seconds)

And then nothing.

From /var/log/messages:

Dec 16 00:26:02 storage2 su: (to riak) root on none
Dec 16 00:26:04 storage2 riak[48174]: Starting up
Dec 16 00:28:59 storage2 kernel: traps: beam.smp[48492] general protection
ip:7fcaf9402f16 sp:7fca6affcdd0 error:0 in eleveldb.so[7fcaf93b50

Re: Leveldb segfault during Riak startup

2016-01-04 Thread Antti Kuusela

Hi Matthew,

This phenomenon started after the upgrade to 2.1.2. I downgraded the servers
to 2.1.1 and am now waiting to see whether it makes a difference.

I installed the binary packages from the Basho repo.




From: Matthew Von-Maszewski 
Sent: 31 December 2015 18:25
To: Antti Kuusela
Cc: Luke Bakken; riak-users
Subject: Re: Leveldb segfault during Riak startup

I also failed to ask two basic questions:

1. Did this failure start after your upgrade to 2.1.3, or did it also happen
prior to the upgrade?

2. Did you use a Basho package for CentOS 7, or did you build from source?

Matthew


> On Dec 31, 2015, at 6:06 AM, Antti Kuusela  wrote:
>
> Hi Luke,
>
> We erased the btrfs file system, replaced it with xfs on lvm with 
> thinly-provisioned volumes and continued testing with a new database from 
> scratch. The same problem continues, though. From /var/log/messages:
>
> Dec 31 03:35:31 storage5 riak[66419]: Starting up
> Dec 31 03:35:45 storage5 kernel: traps: beam.smp[66731] general protection 
> ip:7f02280b3f16 sp:7f0197ad8dd0 error:0 in eleveldb.so[7f0228066000+93000]
> Dec 31 03:35:46 storage5 run_erl[66417]: Erlang closed the connection.
>