Re: embedding riak_core into RabbitMQ

2011-05-24 Thread Jon Brisbin
- Original Message -

> From: "Ryan Zezeski" 
> To: "Jon Brisbin" 
> Cc: "riak-users Users" 
> Sent: Monday, May 23, 2011 12:47:09 PM
> Subject: Re: embedding riak_core into RabbitMQ

> Jon,

> Sounds like a neat project. Out of curiosity, what use cases do you
> imagine for something like this?
We already have RabbitMQ integration with a riak_core application where vnodes 
are called in response to messages. This embedding would simply be an extension 
of that. 

> In regards to riak_err, you shouldn't have any trouble leaving it
> out. Its purpose is to protect your VM from OOM issues related to
> error logging. What's the warning you see?
I thought it would be okay. 

It says: *WARNING* Undefined function riak_err_handler:limited_fmt/4 

jb 

> -Ryan

> On Thu, May 19, 2011 at 2:49 PM, Jon Brisbin < j...@jbrisbin.com >
> wrote:

> > I've mentioned this a couple places, but I'm toying with embedding
> > riak_core into RabbitMQ as a plugin. I already have a consumer that
> > invokes vnodes but I'm wondering (in code) whether or not it would
> > be more efficient to implement this behaviour as a custom exchange.
> > This is also an exercise for my Erlang Factory presentation.
> 

> > To that end, I've got riak_core running as a plugin with a special
> > script I call from my Makefile to patch the riak_core distro so it
> > will go into RabbitMQ's VM without complaining about extra .beam
> > files (for things already inside RabbitMQ).
> 

> > I'm getting a couple warnings related to cluster_info and riak_err
> > not being available. I've included cluster_info to get rid of that
> > warning, but I'm not sure I want to include riak_err because it
> > seems to put my logging output back into the console (I run my
> > RabbitMQ broker in a terminal so I can get an Erlang shell with
> > it).
> > It makes a mess in my terminal window, so I'm considering leaving
> > that out of my plugins directory.
> 

> > Will leaving riak_err out cause problems with riak_core? Other than
> > this, riak_core was relatively easy to embed.
> 

> > Thanks!
> 
> > Jon Brisbin
> > http://jbrisbin.com
> > Twitter: @j_brisbin

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Mapreduce crosstalk

2011-05-24 Thread Kelly McLaughlin
Kyle,

Just wanted to let you know that this bug is fixed in master now. If you are
interested, I describe some details about the issue and the fix in the
commit message here:
https://github.com/basho/riak_kv/commit/f7e09d54c932f24d1d06fb595ce74e686657810f

Kelly

On Tue, May 17, 2011 at 1:35 PM, Aphyr  wrote:

> I was writing a new mapreduce query to look at users over time, and ran it
> over a single user in production. After that, other mapreduce jobs over
> users started returning results from my new map phase, some of the time.
> After five minutes of this, I had to restart every node in the cluster to
> get it to stop.
>
> Every node has {map_cache_size, 0} in riak_kv.
>
> The map phase that screwed things up was:
>
> function(v) {
>  o = JSON.parse(v.values[0].data);
>
>  // Age of account in days
>  age = Math.round(
>    (Date.now() - Date.iso8601(o.created_at)) /
>    (1000 * 60 * 60 * 24)
>  );
>
>  return [['t_user_scores', v.key, age]];
> }
>
> It looks like one node started running that phase instead of the requested
> phase for subsequent jobs. It *should* have run this one, but didn't.
>
> function(v) {
>  o = JSON.parse(v.values[0].data);
>  return [{
>    key: v.key,
>    name: o.name,
>    thumbnail: o.thumbnail
>  }];
> }
>
> Now I'm scared to run MR jobs. Could it be an issue with returning keydata?
> Anybody else seen this before?
>
> --Kyle
>



-- 
Kelly McLaughlin
Engineer
Basho Technologies, Inc.
ke...@basho.com
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: General Memory/Performance Tuning Guidelines

2011-05-24 Thread Rusty Klophaus
Hi Gordon,

I have limited knowledge of configuring Innostore but can help answer some
of your merge_index questions.

The most important merge_index setting in terms of memory usage is
'buffer_rollover_size'. This affects how large the buffer is allowed to
grow, in bytes, before getting converted to an on-disk segment. Each
partition maintains a separate buffer, so any increases to this number will
be multiplied by the number of partitions in your system. The higher this
number, the less frequently merge_index will need to perform compactions.

The second most important settings for memory usage are a combination of
'segment_full_read_size' and 'max_compact_segments'. During compaction, the
system will completely page any segments smaller than the
'segment_full_read_size' value into memory. This should generally be at
least as large as the 'buffer_rollover_size'. The higher this number, the
quicker each compaction will be. 'max_compact_segments' is the maximum
number of segments to compact at one time. The higher this number, the more
segments merge_index can involve in each compaction. In the worst case, a
compaction could take ('segment_full_read_size' * 'max_compact_segments')
bytes of RAM.

The rest of the settings have a much smaller impact on performance and
memory usage, and exist mainly for tweaking and special cases.

This is a completely unscientific estimate based on observing other Riak
Search applications, but I'd set buffer_rollover_size so that (# Partitions
* buffer_rollover_size) is about one-half the memory you wish for
merge_index to consume, hopefully somewhere between 1M and 10M. The rest of
the memory will be used by in-memory offset tables, compaction processes,
and during query operations.
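
To make that concrete, here's roughly what the merge_index section of
app.config might look like for, say, a 64-partition ring with ~128MB
budgeted for merge_index (illustrative values only, not recommendations):

    %% app.config, merge_index section -- example values, tune for your hardware
    {merge_index, [
        {data_root, "data/merge_index"},

        %% 64 partitions * 1MB rollover ~= 64MB of buffers, i.e. roughly half
        %% of the ~128MB budget; the rest goes to offset tables, compaction
        %% processes, and query operations.
        {buffer_rollover_size, 1048576},

        %% Worst-case compaction memory is roughly
        %% segment_full_read_size * max_compact_segments (~20MB here).
        {segment_full_read_size, 1048576},
        {max_compact_segments, 20}
    ]},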

Hope that helps.

Best,
Rusty


On Mon, May 23, 2011 at 2:05 PM, Gordon Tillman  wrote:

> Greetings!
>
> We are working with a riaksearch cluster that uses innostore as the primary
> backend in tandem with merge_index, which is required by search. From reading
> the Basho wiki, it looks like the following are the most important factors
> affecting memory and performance:
>
> • innostore
>   • put data_home_dir and log_group_home_dir on different spindles
>   • noatime
>   • buffer_pool_size
>   • flush_method
> • merge_index
>   • data_root
>   • buffer_rollover_size
>   • max_compact_segments
>   • segment_file_buffer_size
>   • segment_full_read_size
>   • segment_block_size
>
> Ideally, data_home_dir, log_group_home_dir, and data_root would all be on
> different spindles, but if you had just 2 disks available what would you
> recommend?  Would it be best to have data_home_dir and data_root on one and
> then log_group_home_dir on the other?
>
> In calculating the proper setting for buffer_pool_size you are directed to
> allocate 60-80 percent of available RAM. So let's assume you want to take
> the remaining 20-40% of available RAM and split it up between innostore and
> merge_index.
>
> Would it be best to give each of them half of that value?
>
> Determining the approximate memory requirements for merge_index isn't (to
> me) really obvious. It looks like the following all have an effect:
>
>  * buffer_rollover_size
>  * buffer_delayed_write_size
>  * max_compact_segments
>  * segment_query_read_ahead_size
>  * segment_compaction_read_ahead_size
>  * segment_full_read_size
>  * segment_block_size
>  * segment_values_staging_size
>
> Is there a formula for determining the (approximate) proper values to use
> given a certain amount of available RAM?
>
> Thanks in advance for any advice.  Sorry for all the questions!
>
> --gordon
>
>
>
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Issues with capacity planning pages on wiki

2011-05-24 Thread Anthony Molinaro
Just curious if anyone has any ideas. For the moment, I'm just taking
the RAM calculation and multiplying by 2, and the disk calculation and
multiplying by 8, based on my findings with my current cluster. But
I would like to know why my values are so much higher than those I should
be getting.

Also, I'd still like to know how the wiki's forms calculate things, as the
disk calculation there matches neither reality nor the formula.

Also, waiting to hear if there is any way to force a merge to run so I can
more accurately gauge whether multiple copies are affecting disk usage.
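
As a sanity check, here's the per-entry estimate I work through in the
quoted message below, written out as a throwaway Erlang snippet (just my
reading of the bitcask record format, not an official formula):

    %% Rough bitcask disk estimate: 14 bytes of per-record overhead
    %% (4 CRC + 4 timestamp + 2 keysize + 4 valsize) plus key and value,
    %% times the number of entries, times n_val.
    DiskBytes = fun(KeySize, ValueSize, NumEntries, NVal) ->
                    (14 + KeySize + ValueSize) * NumEntries * NVal
                end,
    %% My numbers: 36-byte keys and values, ~184M entries, n_val of 3
    %% gives ~47.5e9 bytes (~44 GB), nowhere near the ~341 GB I'm seeing.
    DiskBytes(36, 36, 183915891, 3).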

Thanks,

-Anthony

On Mon, May 23, 2011 at 11:06:31PM -0700, Anthony Molinaro wrote:
> 
> On Mon, May 23, 2011 at 10:53:29PM -0700, Anthony Molinaro wrote:
> > 
> > On Mon, May 23, 2011 at 09:57:25PM -0600, David Smith wrote:
> > > On Mon, May 23, 2011 at 9:39 PM, Anthony Molinaro
> > > Thus, depending on
> > > your merge triggers, more space can be used than is strictly necessary
> > > to store the data.
> > 
> > So the lack of any overhead in the calculation is expected?  I mean
> > according to http://wiki.basho.com/Cluster-Capacity-Planning.html
> > 
> > Disk = Estimated Total Objects * Average Object Size * n_val
> > 
> > Which just seems wrong, doesn't it?  I don't quite understand the
> > bitcask code well enough yet to see what data it actually stores,
> > but the whitepaper suggested several things were involved in the
> > on-disk representation.
> 
> Okay, I finally found the code for this part. I kept looking in the NIF,
> but that's only the keydir, not the data files. It looks like:
> 
>     %% Setup io_list for writing -- avoid merging binaries if we can help it
>     Bytes0 = [<<Tstamp:?TSTAMPFIELD>>, <<KeySz:?KEYSIZEFIELD>>,
>               <<ValueSz:?VALSIZEFIELD>>, Key, Value],
>     Bytes  = [<<(erlang:crc32(Bytes0)):?CRCSIZEFIELD>> | Bytes0],
> 
> And looking at the header, it seems that there's 14 bytes of overhead
> (4 for CRC, 4 for timestamp, 2 for keysize, 4 for valsize).
> 
> So disk calculation should be
> 
> ( 14 + Key + Value ) * Num Entries * N_Val
> 
> So using my numbers from before that gives
> 
> ( 14 + 36 + 36 ) * 183915891 * 3 = 47450299878 = 44.1 GB
> 
> which actually isn't much closer to 341 GB than the previous calculation :(
> 
> So all my questions from the previous email still apply.
> 
> -Anthony
> 
> -- 
> 
> Anthony Molinaro   

-- 

Anthony Molinaro   

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


riaksearch: using index docs in place of real objects

2011-05-24 Thread Greg Pascale
Hi,

In our data model, our riak objects are flat JSON objects, and thus their
corresponding index documents are nearly identical - the only difference is
that a few fields which are ints in the riak objects are strings in the
index doc.

Since they are so similar, we are directly using the index docs returned
from our search call, skipping the second step of doing gets on the returned
keys to retrieve the real objects.

Is this advisable? Are there any circumstances under which we might run into
consistency issues?

Thanks,
-Greg
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Best practice for using erlang modules in riak?

2011-05-24 Thread Sylvain Niles
So I've seen a few well-written examples of Erlang map or reduce
functions in the contrib section of the wiki/github, but the missing
piece of glue for me is: where do I compile from? I've done a lot of
ejabberd development, and generally I just throw my module in the src
directory, add a config param to ejabberd.conf to load it at startup,
run make install, and I'm done. Should I be deploying my modules to
/deps/riak_core/src/?

The wiki has this note:

Distributing Erlang MapReduce Code

Any modules and functions you use in your Erlang MapReduce calls must
be available on all nodes in the cluster. You can add them in Erlang
applications by specifying the -pz option in vm.args or by adding the
path to the add_paths setting in app.config.

But the vm.args page does not list the valid config format/options for
the -pz option. Is the add_paths behavior such that a valid beam in
that dir will automatically be loaded on startup and all exported
functions available? What about live code updates of internally
developed modules?
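
For what it's worth, here's my best guess at the shape of those two options
(an unverified sketch with a made-up path, so please correct me if the
format is off). In vm.args I assume it's just the standard erl flag on its
own line, e.g. "-pz /var/lib/riak/custom_beams", and in app.config something
like:

    %% app.config -- guessing add_paths lives in the riak_kv section and
    %% takes a list of directories whose .beam files are added to the code
    %% path at startup (the path below is made up):
    {riak_kv, [
        %% ...existing riak_kv settings...
        {add_paths, ["/var/lib/riak/custom_beams"]}
    ]},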

Thanks in advance!

-Sylvain

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com