Re: riak java client causing OOMs on high latency [resolution]

2013-01-10 Thread Dietrich Featherston
same socket, and hence reading "length headers" out of the middle of messages. We went digging for culprits, found the missing finally block, then noticed this change downstream. Will post more info if this doesn't do the trick. On Thu, Jan 10, 2013 at 11:44 AM, Dietrich Fea
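
For readers unfamiliar with the failure mode described above, here is a minimal sketch of why reading a "length header" from the middle of a message can trigger huge allocations. It assumes a simple framing of a 4-byte big-endian length followed by the payload. The helper names `frame` and `read_length` are hypothetical and are not part of the riak java client.

```python
import struct

def frame(payload: bytes) -> bytes:
    # Prefix the payload with its length as a 4-byte big-endian integer.
    return struct.pack(">I", len(payload)) + payload

def read_length(buf: bytes, offset: int) -> int:
    # Interpret the 4 bytes at `offset` as a length header.
    return struct.unpack_from(">I", buf, offset)[0]

stream = frame(b"hello riak")

# A reader positioned at the start of the frame sees the real length.
aligned = read_length(stream, 0)

# A second reader sharing the socket may start mid-message and
# interpret the payload bytes b"hell" as a length header.
misaligned = read_length(stream, 4)

print(aligned, misaligned)
```

A reader that trusts the misaligned value would try to allocate a buffer of well over a gigabyte, which would be consistent with a heap dominated by large byte[] instances.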

riak java client causing OOMs on high latency

2013-01-07 Thread Dietrich Featherston
We're seeing instances of a JVM app which talks to riak run out of memory when riak operations rise in latency or riak becomes otherwise unresponsive. A heap dump of the JVM at the time of the OOM shows that 91% of the 1G (active) heap is consumed by large byte[] instances. In our case 3 of those by

Re: strange behavior upgrading from riak-java-client 1.0.5 -> 1.0.6

2012-12-28 Thread Dietrich Featherston
I don't believe allow_mult is enabled. It shouldn't be at least! On Dec 28, 2012, at 1:23 PM, Brian Roach wrote: > On Fri, Dec 28, 2012 at 12:34 PM, Dietrich Featherston > wrote: >> Primarily stores but I did see one case of socket timeouts simply building a >> n

Re: strange behavior upgrading from riak-java-client 1.0.5 -> 1.0.6

2012-12-28 Thread Dietrich Featherston
On Dec 28, 2012, at 11:57 AM, Brian Roach wrote: > On Fri, Dec 28, 2012 at 11:37 AM, Dietrich Featherston > wrote: >> >> All socket operations. It looks as though those that open a new socket are >> especially >> impacted. We are running 1.2.1 with the le

Re: strange behavior upgrading from riak-java-client 1.0.5 -> 1.0.6

2012-12-28 Thread Dietrich Featherston
fter changing your code to use the new > 'withoutFetch()' ? > > Thanks, > Brian Roach > > On Wed, Dec 26, 2012 at 7:28 PM, Dietrich Featherston wrote: >> I had rolled out an upgrade to a JVM app that uses rjc 1.0.5. We had >> upgraded to 1.0.6 to take advantage of n

Re: strange behavior upgrading from riak-java-client 1.0.5 -> 1.0.6

2012-12-28 Thread Dietrich Featherston
you see this simply dropping in the 1.0.6 client to your existing > application, or is this after changing your code to use the new > 'withoutFetch()' ? 1.0.6 with the inclusion of withoutFetch(). Haven't tried with just the driver alone. > > Thanks, > Brian Roach

strange behavior upgrading from riak-java-client 1.0.5 -> 1.0.6

2012-12-26 Thread Dietrich Featherston
I had rolled out an upgrade to a JVM app that uses rjc 1.0.5. We had upgraded to 1.0.6 to take advantage of newly added abilities to do a put without preceding it with a fetch in order to reduce operational load on the cluster. However, after rolling out this change we frequently see large rises in

Re: avg write io wait time regression in 1.2.1

2012-11-06 Thread Dietrich Featherston
i wrote: > Would you paste the data for one core from /proc/cpuinfo? And do you know > the brand of controller running the SSD drives? > > Thank you, > Matthew > > > On Nov 6, 2012, at 7:04 PM, Dietrich Featherston wrote: > > Thanks for the feedback. We haven't

Re: avg write io wait time regression in 1.2.1

2012-11-06 Thread Dietrich Featherston
ainst you in 1.2.x. I personally introduced a bug > that hurts performance on that setting. My apologies. I recommend you > take it below 100M until release notes publish that the bug is fixed. > > Matthew > > > On Nov 6, 2012, at 7:04 PM, Dietrich Featherston wrote: > >

Re: avg write io wait time regression in 1.2.1

2012-11-06 Thread Dietrich Featherston
Erlang have given us some suggested > changes in our Erlang to leveldb interface this past weekend. That > information could also give leveldb a throughput boost if proven valid. > Keep you posted. > > But at this time I see nothing that yells "massive slow down". I am of >

2i error when half cluster is 1.1, other half is 1.2.1

2012-11-02 Thread Dietrich Featherston
Seeing the following reported by rjc when doing a 2i scan. Has only started happening since upgrading half of this cluster's nodes from 1.1 to 1.2.1. Should we presume this is an incompatibility that will go away when upgrading the remaining nodes? com.basho.riak.client.RiakRetryFailedException:

Re: avg write io wait time regression in 1.2.1

2012-11-01 Thread Dietrich Featherston
avy times where the > throttle saves the user experience. > > Matthew > > > On Nov 1, 2012, at 8:54 PM, Dietrich Featherston wrote: > > Thanks. The amortized stalls may very well describe what we are seeing. If > I combine leveldb logs from all partitions on one of the up

Re: avg write io wait time regression in 1.2.1

2012-11-01 Thread Dietrich Featherston
G files, combined them, then compare > compaction activity to your graph. > > Write stalls are detailed here: > http://basho.com/blog/technical/2012/10/30/leveldb-in-riak-1p2/ > > How can I better assist you at this point? > > Matthew > > > On Nov 1, 2012, at 8:13 PM, Die

avg write io wait time regression in 1.2.1

2012-11-01 Thread Dietrich Featherston
We've just gone through the process of upgrading two riak clusters from 1.1 to 1.2.1. Both are on the leveldb backend backed by RAID0'd SSDs. The process has gone smoothly and we see that latencies as measured at the gen_fsm level are largely unaffected. However, we are seeing some troubling disk

high 99.9% latencies with leveldb backend

2012-10-30 Thread Dietrich Featherston
Seeing 99th percentile put latencies at around 30-40 ms with 99.9th percentile jumping all the way up to 3-4s. This is riak 1.1 with the eleveldb backend on a 9-node cluster, N = 2, W = 1. Lots of free iops, but CPU is consistently burning 30-40% across all 8 cores. Wondering if this could be caus

Re: Riak significant downtime

2012-08-01 Thread Dietrich Featherston
What is max_open_files set to in the eleveldb section of app.config? If unspecified I think the limit is 20. Remember that this number is per vnode. The process limit specified by ulimit -n must be greater than max_open_files * num_vnodes / num_nodes allowing room for vnode multiplexing and fallbac
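
The rule of thumb above can be worked through as a small calculation. This is only a sketch: the `headroom` factor is an assumed allowance for fallback vnodes, not a documented riak setting, and the ring size of 64 in the example is hypothetical.

```python
import math

def min_fd_limit(max_open_files: int, num_vnodes: int, num_nodes: int,
                 headroom: float = 1.5) -> int:
    # Vnodes hosted per node, rounded up because the ring rarely divides evenly.
    vnodes_per_node = math.ceil(num_vnodes / num_nodes)
    # max_open_files applies per vnode, so multiply, then leave room
    # for fallback vnodes (headroom is an assumed safety factor).
    return math.ceil(max_open_files * vnodes_per_node * headroom)

# Example: max_open_files of 20, a 64-partition ring, 9 nodes.
limit = min_fd_limit(20, 64, 9)
print(limit)
```

The result is a lower bound for leveldb files only. The process also needs descriptors for client sockets and other files, so ulimit -n should sit comfortably above it.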

Re: Large numbers of keys

2012-06-27 Thread Dietrich Featherston
LevelDB is a nice option with a key space that will not fit in memory. Whether or not bitcask will work for you depends on total memory capacity of the cluster and N value. Recommend using the bitcask capacity planner to see if it is a suitable backend for your hardware+data combination. http://

Re: bucket level authentication

2012-06-05 Thread Dietrich Featherston
Suggest implementing security outside of riak. The interface to applications which use riak for storage should not be riak-dependent. In addition, it would be wise to avoid exposing storage-level details like bucket choice in the security model for your applications. For more details on why ria

Re: How to process batch of events in N seconds after latest

2012-05-19 Thread Dietrich Featherston
You might try coordinating this activity outside of riak if at all possible. If there is a single point of origin of these events (i.e., a dedicated master for each partition of writes) then you could maintain reasonable guarantees that you don't need sibling processing on the riak end since data is b

busy_dist_port

2012-05-17 Thread Dietrich Featherston
Been seeing some busy_dist_port errors lately in our riak logs and curious if it's anything to worry about. Here is a specific instance of the log msg https://gist.github.com/881598dc29ca168dba34 More info on our setup * 9 node cluster (dedicated) * Ubuntu 10.04 * 32GB ram * SSDs * 256K block siz

debugging riak behavior by looking at the network

2012-04-19 Thread Dietrich Featherston
Hey guys, I just wrote a new blog post debugging some issues I'm seeing with riak by looking at the network. Lots of words and pretty pictures here: http://blog.boundary.com/2012/04/19/hungry-kobayashi-pt1/ What seems to be happening is that cleanup tasks in our app eventually become the primary

Re: True HA for RiakClient

2012-03-03 Thread Dietrich Featherston
The haproxy approach tends to work well so long as haproxy is located on the machine from which the riak client is establishing connections to the riak cluster. This way the client always talks over localhost and haproxy is unlikely to be a failure point. So in your service tier, each machine ru

Re: Possibility of a CAS API

2012-02-24 Thread Dietrich Featherston
If you need CAS semantics, then coordinate that outside of riak. Any check-then-act type of operation where atomicity is important is going to leave some room for a data race in a system with the distribution semantics of riak. Would suggest thinking about the problem in such a way that handling of

Re: efficient fetching of values falling in a key range

2012-01-30 Thread Dietrich Featherston
ou'll see it is a bit > of mess, what with the Query types and Index types and all that. I'm happy > to keep going and finish. > > Cheers > > Russell > > On 30 Jan 2012, at 18:58, Dietrich Featherston wrote: > > Hey Russel. Any thoughts on when you'd get

Re: efficient fetching of values falling in a key range

2012-01-30 Thread Dietrich Featherston
sell Brown wrote: > > On 30 Jan 2012, at 18:12, Dietrich Featherston wrote: > > I'm using a leveldb-backed riak 1.0.2 and looking for some suggestions to > fetch a block of data by key range. I have control over the keys and all > reads out of this setup will involve at minimu

efficient fetching of values falling in a key range

2012-01-30 Thread Dietrich Featherston
I'm using a leveldb-backed riak 1.0.2 and looking for some suggestions to fetch a block of data by key range. I have control over the keys and all reads out of this setup will involve at minimum a key range. It seems that leveldb is an ideal candidate for this kind of access pattern so long as I