Greetings,

Thanks for the response. The recommendation to simply move the corrupted buffer file out of the way worked! There was indeed a single buffer file (buffer.13) which apparently was causing the crash. Once it was renamed, the node stopped hanging on new data inserts. Thanks for the interpretation of the issue.

As for the 'binary' brackets in the data dump, I can't tell exactly what that is, as it doesn't directly match any of the data I am writing (JSON objects that are mainly composed of string sequences of printable ASCII text). I can share the corrupted buffer file if that'd be helpful in investigating the root cause.
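
In case it helps anyone else, the rename step can be scripted. Below is a minimal sketch in Python, not something shipped with Riak Search: the function name quarantine_buffers and its "only" parameter are made up for illustration, and the "corrupt-" prefix follows the naming Ryan suggested in the message quoted below. It assumes the partition directory holds files named buffer.N, as on my node, and I would run it with the node stopped (my own precaution, not a documented requirement).

```python
import os


def quarantine_buffers(partition_dir, prefix="corrupt-", only=None):
    """Rename buffer.* files in partition_dir by adding a prefix.

    partition_dir: a merge_index partition directory (hypothetical helper input).
    only: optional set of file names (e.g. {"buffer.13"}) to restrict the rename;
          when None, every file whose name starts with "buffer." is renamed.
    Returns a list of (old_name, new_name) pairs for the files that were renamed.
    """
    renamed = []
    # Snapshot the listing first so files renamed in the loop are not revisited.
    for name in sorted(os.listdir(partition_dir)):
        if not name.startswith("buffer."):
            continue
        if only is not None and name not in only:
            continue
        new_name = prefix + name
        src = os.path.join(partition_dir, name)
        dst = os.path.join(partition_dir, new_name)
        # Refuse to overwrite an earlier quarantined copy.
        if os.path.exists(dst):
            raise FileExistsError(dst)
        os.rename(src, dst)
        renamed.append((name, new_name))
    return renamed
```

Calling quarantine_buffers(path, only={"buffer.13"}) moves just the one suspect file, which keeps the loss of indexed data as small as possible; leaving "only" unset renames every buffer file in the directory.
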

As for the cause of the truncation/corruption: the machine and the riaksearch node have been up continuously for more than 3 months, with two occasional crashes, which looked like this in syslog:

node kernel: [ 2282.474990] beam[2396]: segfault at 24a5d ip 081016a2 sp bfd68600 error 4 in beam[8048000+15c000]
...

I suspect the corrupted segments might have been caused by the beam segfaults. Anyway, thanks a lot for the pointers to the corrupted buffer files. This indeed resolved the issue.

Per Rusty's request, here's detailed information on the platform where the issue was observed:

What platform are you running?
Debian 5.3 (Lenny)

What version of Riak Search are you using?
At the time the problem appeared, riak-search_0.14.2-1_i386.deb was installed on the system. The node had been upgraded from riak-search_0.14.0-1_i386.deb (the update procedure was: riaksearch stop; dpkg -i ..; riaksearch start)

Did you install Riak Search from our pre-built binaries, or did you compile from source?
Pre-built binaries (.deb packages)

If you compiled from source, what version of Erlang are you running?
Erlang R14A (erts-5.8)

What interface are you using to index the files? (Solr or KV?)
The data is indexed by the riaksearch precommit hook when stored in the bucket (if that was the question)

I'll send an email regarding the data sampling off-list.

On Tue, Jul 5, 2011 at 9:15 PM, Ryan Zezeski <rzeze...@basho.com> wrote:
> Fyodor,
>
> I can't tell you exactly what caused this to happen but I can tell you how
> to move past it. Search uses two data structures to store the index:
> buffers and segments. A buffer is an in-memory structure backed by a file
> on disk. Over time buffers are converted to segments. All segments live on
> disk but there is an in-memory offset table to perform lookups. During a
> request if the vnode to handle that request is not already up it will be
> started. During the vnode's initialization it will read all buffers and
> segment tables into memory.
> In your case, each time the vnode is started it
> crashes while trying to read the buffer file. Looking at the binary in your
> trace it looks like somehow the data became corrupted. First off, I'm
> confused by the syntax of the binary in your stack trace. I.e. what's up
> with the brackets surrounding that binary data? That aside, I see two terms
> in that data, i.e. there are two occurrences of the byte '131' which
> indicates the start of a term. The second term is valid:
>
> [{{<<"logs">>,<<"text">>,<<"SEQ=1">>},
> <<"ae2b12ae-a155-11e0-9e33-00219bfc3293">>,
> -1309244813808575,
> [{p,[14]}]}]
>
> However, the first term seems to have been truncated/corrupted somehow.
> Why? I'm not sure. My immediate guess would be that a write failed at some
> point, writing bad data to the buffer file, the vnode crashed, and then when
> it started back up it couldn't read back the buffer file. The code to read
> the buffer data expects correct data or it will simply crash, as you see.
> This will cause a perpetual series of crashes until the problem is manually
> resolved. In this case you can just move your buffer files, for the
> crashing vnodes, one at a time until the problem goes away. This will cause
> you to lose some of your indexed data. For example, in your case the
> crashing vnode is for
> partition 433883298582611803841718934712646521460354973696. You can cd to
> riak_search/data/merge_index/433883298582611803841718934712646521460354973696
> and then mv your buffer.* files to something like corrupt-buffer.*.
>
> TL;DR - For one reason or another a buffer file became corrupted. As a
> workaround you can move your buffer files out of the way.
>
> -Ryan
>
> On Sat, Jul 2, 2011 at 6:40 AM, Fyodor Yarochkin <fyodo...@armorize.com>
> wrote:
>>
>> Greetings,
>>
>> I've been running a single node riaksearch instance, and came
>> across this problem: after inserting roughly 200Mb of data every
>> subsequent insert (into any bucket) would start to time out with a
>> sequence of error logs that point to a riak_search_vnode_master crash:
>>
>> =SUPERVISOR REPORT==== 2-Jul-2011::06:04:57 ===
>> Supervisor: {local,riak_search_sup}
>> Context: child_terminated
>> Reason:
>>
>> {{badmatch,{error,{{badmatch,{error,{badarg,[{erlang,binary_to_term,[<<[131,108,0,0,0,2,104,4,104,3,109,0,0,0,4,108,111,103,115,109,0,0,0,4,116,101,120,116,109,0,0,0,16,91,49,50,49,49,49,56,48,46,55,49,54,51,55,52,93,109,0,0,0,36,97,97,54,55,53,52,53,99,45,97,49,53,53,45,49,49,101,48,45,57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,110,7,1,112,21,181,79,192,166,4,108,0,0,0,1,104,0,0,0,106,131,108,0,0,0,1,104,4,104,3,109,0,0,0,4,108,111,103,115,109,0,0,0,4,116,101,120,116,109,0,0,0,5,83,69,81,61,49,109,0,0,0,36,97,101,50,98,49,50,97,101,45,97,49,53,53,45,49,49,101,48,45,57,101,51,51,45,48,48,50,49,57,98,102,99,51,50,57,51,110,7,1,191,19,13,80,192,166,4,108,0,0,0,1,104,2,100,0,1,112,107,0,1,14,106,106]>>]},{mi_buffer,read_value,1},{mi_buffer,open_inner,2},{mi_buffer,new,1},{mi_server,read_buffers,4},{mi_server,read_buf_and_seg,1},{mi_server,init,1},{gen_server,init_it,6}]}}},[{merge_index_backend,start,2},{riak_search_vnode,init,1},{riak_core_vnode,init,1},{gen_fsm,init_it,6},{proc_lib,init_p_do_apply,3}]}}},[{riak_core_vnode_master,get_vnode,2},{riak_core_vnode_master,handle_call,3},{gen_server,handle_msg,5},{proc_lib,init_p_do_apply,3}]}
>> Offender:
>>
>> [{pid,<0.754.0>},{name,riak_search_vnode_master},{mfa,{riak_core_vnode_master,start_link,[riak_search_vnode]}},{restart_type,permanent},{shutdown,5000},{child_type,worker}]
>>
>>
>> (the full paste of error dump log is here http://pastebin.com/0Bj5cJAQ)
>>
>> Reads still work and I am slightly confused about the reason for the crash.
>> The availability of RAM is one of the things I suspect here:
>> "mem_total":1059192832,"mem_allocated":893632512,". There is no
>> shortage of disk space or other resources on the system. I am
>> a bit stuck as to where to start troubleshooting this issue. Any
>> pointers or hints would be appreciated greatly! :)
>>
>>
>> regards,
>> -F
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users@lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com