problem with rolling upgrade 0.14 -> 1.0

2011-10-09 Thread Tomer Naor

I'm trying to upgrade my Riak 0.14 nodes and have run into some problems.
After installing the riak-1.0 rpm and trying to start the node, I get the 
following error message:

riak start
Attempting to restart script through sudo -u riak
pthread/ethr_event.c:98: Fatal error in wait__(): Function not implemented (38)
pthread/ethr_event.c:98: Fatal error in wait__(): Function not implemented (38)
Error reading /etc/riak/app.config

When installing the riak-1.0 rpm I'm getting the following output:

sudo rpm -Uvh riak-1.0.0-1.el5.x86_64.rpm
Preparing...### [100%]
   1:riak   warning: /etc/riak/app.config created as 
warning: /etc/riak/vm.args created as /etc/riak/vm.args.rpmnew
### [100%]
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled
chcon: couldn't compute security context from unlabeled

In addition, I noticed that app.config was not updated as it should have been 
during installation.

Can you please advise what could be the cause of this error?

Re: Timeout when storing

2011-10-09 Thread Jim Adler
Thanks David - I'll try that on my single-node instance, but I'm working on 
another Riak issue in another thread. 


Hi Jim, 

Sorry for the slow response -- email is like a running battle at times. :) 

How many partitions are you running? 

Also, take down the node and then remove any *.lock files. 
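
The two checks discussed in this thread (zero-byte files and leftover *.lock files) could be scripted roughly as follows. This is only a sketch: it assumes the data directory is /var/lib/riak/bitcask, as shown in the crash log, and scan_bitcask is a hypothetical helper name, not part of Riak. It only reports what it finds; removing lock files should still be done by hand with the node stopped.

```python
import os


def scan_bitcask(data_root):
    """Return (zero_byte_files, lock_files) found anywhere under data_root."""
    zero_byte, locks = [], []
    for dirpath, _dirnames, filenames in os.walk(data_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Lock files are reported separately so they can be reviewed
            # before anything is deleted.
            if name.endswith(".lock"):
                locks.append(path)
            try:
                # An empty lock file is listed in both results.
                if os.path.getsize(path) == 0:
                    zero_byte.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(zero_byte), sorted(locks)


if __name__ == "__main__":
    # Assumed default location, taken from the crash log in this thread.
    zero_byte, locks = scan_bitcask("/var/lib/riak/bitcask")
    print("zero-byte files:", len(zero_byte))
    print("lock files:", len(locks))
```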



> About 90 out of 3000 are zero-bytes. 
> Jim 
> Jim, 
> If you look at your bitcask directories, do you have a large number of 
> zero-byte files, perchance? 
> D. 
>> After upgrading my single-node instance to 1.0, I'm still seeing the 
>> "timeout when storing" issue. Here are the changes I made based on 
>> everyone's suggestions (much appreciated!): 
>> - Ubuntu 11.04 (natty) 32-bit 
>> - Python client 1.3.0 
>> - /etc/riak/vm.args: -env ERL_MAX_PORTS 32768 
>> - /etc/default/riak: ulimit -n 32768 
>> Here's the /var/log/crash.log report: 
>> 2011-10-01 12:31:03 =ERROR REPORT 
>> ** State machine <0.3452.0> terminating 
>> ** Last event in was 
>> {'riak_vnode_req_v1',1136089163393944065322395631681798128560666312704 
>> ,{fsm,undefined,<0.3451.0>},{'riak_kv_put_req_v1',{<<"nodes">>,<<"user 
>> _id-17527747-info">>},{r_object,<<"nodes">>,<<"user_id-17527747-info"> 
>> >,[{r_content,{dict,4,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[], 
>> [],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[[<<"content-type">>,9 
>> 7,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110],[<<"X-Riak 
>> -VTag">>,49,88,88,75,75,51,90,88,68,117,90,122,85,53,57,85,53,101,107, 
>> 89,115,110]],[[<<"index">>]],[],[[<<"X-Riak-Last-Modified">>|{1317,497 
>> 463,847242}]],[],[]}}},<<"{DATA 
>> DELETED}">>}],[],{dict,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[], 
>> [],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[[clean 
>> |true]],[]}}},undefined},51456853,63484716663,[coord]}} 
>> ** When State == active 
>> ** Data == 
>> {state,1136089163393944065322395631681798128560666312704,riak_kv_vnode 
>> ,{state,1136089163393944065322395631681798128560666312704,false,riak_k 
>> v_bitcask_backend,{state,#Ref<>,"11360891633939440653223956 
>> 31681798128560666312704",[{async_folds,true},[{vnode_vclocks,true},{in 
>> cluded_applications,[]},{add_paths,[]},{allow_strfun,false},{storage_b 
>> ackend,riak_kv_bitcask_backend},{legacy_keylisting,false},{reduce_js_v 
>> m_count,6},{js_thread_stack,16},{pb_ip,""},{riak_kv_stat,true}, 
>> {map_js_vm_count,8},{mapred_system,pipe},{js_max_vm_mem,8},{pb_port,80 
>> 87},{legacy_stats,true},{mapred_name,"mapred"},{stats_urlpath,"stats"} 
>> ,{http_url_encoding,on},{hook_js_vm_count,2}],{read_write,true}],11360 
>> 89163393944065322395631681798128560666312704,"/var/lib/riak/bitcask"}, 
>> {dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[] 
>> },{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},<<35,9,254,249, 
>> 78,135,82,106>>,3000,1000,100,100,true,false},undefined,undefined,none 
>> ,undefined,<0.3454.0>,6} 
>> ** Reason for termination = 
>> ** {bad_return_value,{error,{write_locked,emfile}}} 
>> 2011-10-01 12:31:03 =CRASH REPORT 
>> crasher: 
>> initial call: riak_core_vnode:init/1 
>> pid: <0.3452.0> 
>> registered_name: [] 
>> exception exit: {bad_return_value,{error,{write_locked,emfile}}} 
>> in function gen_fsm:terminate/7 
>> in call from proc_lib:init_p_do_apply/3 
>> ancestors: [riak_core_vnode_sup,riak_core_sup,<0.92.0>] 
>> messages: [{'EXIT',<0.3454.0>,shutdown}] 
>> links: [<0.96.0>] 
>> dictionary: [] 
>> trap_exit: true 
>> status: running 
>> heap_size: 6765 
>> stack_size: 24 
>> reductions: 160650 
>> neighbours: 
>> 2011-10-01 12:31:03 =SUPERVISOR REPORT 
>> Supervisor: {local,riak_core_vnode_sup} 
>> Context: child_terminated 
>> Reason: {bad_return_value,{error,{write_locked,emfile}}} 
>> Offender: 
>> [{pid,<0.3452.0>},{name,undefined},{mfargs,{riak_core_vnode,start_link 
>> ,undefined}},{restart_type,temporary},{shutdown,30},{child_type,wo 
>> rker}] 
>> 2011-10-01 12:45:28 =ERROR REPORT 
>> Failed to merge 
> "/var/lib/riak/bitcask/605153021707326989568713251046585937826284568576/var/ 
> lib/riak/bitcask/605153021707326989568713251046585937826284568576/1315770213 
> 4568576/ 
> 3251046585937826284568576/ 
> 30217073269895687132

Re: Riak 1.0 pre2 legacy_keylisting crash

2011-10-09 Thread Jim Adler
I'm seeing the same behavior and logs on a bucket with about 8M keys. Fyodor, 
any luck with any of Bryan's suggestions? 


On Fri, Oct 7, 2011 at 1:50 AM, Fyodor Yarochkin wrote: 
> Here's one of the queries that consistently generates series of 
> 'fitting_died' log messages: 
> { 
> "inputs":{ 
> "bucket":"test", 
> "index":"integer_int", 
> }, 
> "query":[ 
> {"map":{"language":"javascript", 
> }, 
> {"reduce":{"language":"javascript", 
> {"reduce":{"language":"javascript", 
> ],"timeout": 9000 
> } 
> produces over a hundred "Supervisor riak_pipe_vnode_worker_sup had 
> child at module undefined at <0.28835.0> exit with reason fitting_died 
> in context child_terminated" entries in log file and returns 'timeout' 

My interpretation of your report is that 9 seconds is not long enough 
to finish your MapReduce query. I'll explain how I arrived at this 
interpretation. 

The log message you're seeing says that many processes that 
riak_pipe_vnode_worker_sup was monitoring exited abnormally. That 
supervisor only monitors Riak Pipe worker processes, the processes 
that do the work for Riak 1.0's MapReduce phases. 

The reason those workers gave for exiting abnormally was 
'fitting_died'. This means that the pipeline they were working for 
closed before they were finished with their work. 

The result you received was 'timeout'. The way timeouts work in 
Riak-Pipe-based MapReduce is that a timer triggers a message at the 
given time, causing a monitoring process to cease waiting for results, 
tear down the pipe, and return a timeout message to your client. 

The "tear down the pipe" step in the timeout process is what causes 
all of those 'fitting_died' messages you see. They're normal, and are 
intended to aid in analysis like the above. 
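
The sequence described above (timer fires, the waiting process gives up, the pipe is torn down, unfinished workers exit with 'fitting_died') can be imitated in a few lines. This is a toy simulation using Python threads, not Riak Pipe code; run_pipeline and its parameters are made up for illustration.

```python
import threading
import time


def run_pipeline(work_seconds, timeout_seconds):
    """Simulate one worker per entry in work_seconds.

    Returns (result, outcomes): result is "ok" or "timeout", and outcomes
    holds "done" or "fitting_died" for each worker.
    """
    pipe_closed = threading.Event()
    outcomes = [None] * len(work_seconds)

    def worker(index, duration):
        # wait() returns True if the pipe was torn down before the
        # simulated work finished, and False if the work completed first.
        torn_down = pipe_closed.wait(timeout=duration)
        outcomes[index] = "fitting_died" if torn_down else "done"

    threads = [
        threading.Thread(target=worker, args=(i, d))
        for i, d in enumerate(work_seconds)
    ]
    for t in threads:
        t.start()

    # The monitoring side waits for results only until the deadline.
    deadline = time.monotonic() + timeout_seconds
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))
    timed_out = any(t.is_alive() for t in threads)

    # Tearing down the pipe is what makes unfinished workers exit.
    pipe_closed.set()
    for t in threads:
        t.join()
    return ("timeout" if timed_out else "ok"), outcomes


if __name__ == "__main__":
    print(run_pipeline([0.01, 5.0, 5.0], 0.5))
```

In the simulation, as in the explanation above, the 'fitting_died' outcomes are a side effect of the timeout rather than a separate failure.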

With that behind us, though, the question remains: why isn't 9 seconds 
long enough to finish this query? To figure that out, I'd start from 
the beginning: 

1. Is 9 seconds long enough to just finish the index query (using the 
index API outside of MapReduce)? If not, then the next people to jump 
in with help here will want to know more about the types, sizes, and 
counts of data you have indexed. 

2. Assuming the bare index query finishes fast enough, is 9 seconds 
long enough to get through just the index and map phase (no reduce 
phases)? If not, it's likely that either it takes longer than 9 
seconds to pull every object matching your index query out of KV, or 
that contention for Javascript VMs prohibits the throughput needed. 

2a. Try switching to an Erlang map phase. 
should do exactly what your Javascript function does, without 
contending for a JS VM. 

2b. Try increasing the number of JS VMs available for map phases. In 
your app.config, find the 'map_js_vm_count' setting, and increase it. 

3. Assuming just the map phase also makes it through, is 9 seconds 
long enough to get through just the index, map, and first reduce phase 
(leave off the second)? Your first reduce phase looks like it doesn't 
do anything … is it needed? Try removing it. 

4. If you get all the way to the final phase before hitting the 9 
second timeout, then it may be that the re-reduce behavior of Riak 
KV's MapReduce causes your function to be too expensive. This will be 
especially true if you expect that phase to receive thousands of 
inputs. A sort function such as yours probably doesn't benefit from 
re-reduce, so I would recommend disabling it by adding 
"arg":{"reduce_phase_only_1":true} to that reduce phase's 
specification. With that in place, your function should be evaluated 
only once, with all the inputs it will receive. This may still fail 
because of the time it can take to encode/decode a large set of 
inputs/outputs to/from JSON, but doing it only once may be enough to 
get you finished. 
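
Steps 2 through 4 amount to re-running the same job with one more phase each time, then adding the "arg" to the final reduce phase. A sketch of how those job variants could be generated is below. The function names are hypothetical, and because the original query in this thread is elided, EXAMPLE_JOB uses placeholder index bounds and placeholder JavaScript sources; only the bucket, index name, timeout, and the reduce_phase_only_1 argument come from the thread.

```python
import copy
import json

# Placeholder job: "start", "end", and every "source" are illustrative
# stand-ins, not the query from the original report.
EXAMPLE_JOB = {
    "inputs": {"bucket": "test", "index": "integer_int", "start": 0, "end": 100},
    "query": [
        {"map": {"language": "javascript",
                 "source": "function(v) { return [v.key]; }"}},
        {"reduce": {"language": "javascript",
                    "source": "function(values) { return values; }"}},
        {"reduce": {"language": "javascript",
                    "source": "function(values) { return values.sort(); }"}},
    ],
    "timeout": 9000,
}


def phase_prefixes(job):
    """Return copies of job that keep the first 1, 2, ... n phases."""
    variants = []
    for count in range(1, len(job["query"]) + 1):
        partial = copy.deepcopy(job)
        partial["query"] = partial["query"][:count]
        variants.append(partial)
    return variants


def disable_rereduce(job, phase_index):
    """Return a copy of job with reduce_phase_only_1 set on one reduce phase.

    Any existing "arg" on that phase is replaced.
    """
    updated = copy.deepcopy(job)
    phase = updated["query"][phase_index]
    if "reduce" not in phase:
        raise ValueError("phase %d is not a reduce phase" % phase_index)
    phase["reduce"]["arg"] = {"reduce_phase_only_1": True}
    return updated


if __name__ == "__main__":
    for variant in phase_prefixes(EXAMPLE_JOB):
        print(len(variant["query"]), "phase(s):", json.dumps(variant))
    print(json.dumps(disable_rereduce(EXAMPLE_JOB, 2)))
```

Each printed JSON body could then be submitted and timed separately to see which phase pushes the query past the 9 second limit.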

Hope that helps, 

riak-admin test | problems

2011-10-09 Thread Roberto Calero
I'm able to start Riak (riak start) and run riak-admin status, which dumps 
lots of info about the local node. When I run riak-admin test, however, I 
always get the following:

Attempting to restart script through sudo -u riak
Failed to read test value: 

Is something wrong with the installation? I haven't been able to find much 
info on the net.


