That was one hell of a response. You need to post that as a Wiki article or such, after all that work :-O
Jonathan Langevin
Systems Administrator
Loom Inc.
Wilmington, NC: (910) 241-0433 - - - Skype: intel352

On Wed, May 25, 2011 at 12:22 PM, Nico Meyer wrote:
> Hi Anthony,
>
> I think I can explain at least a big chunk of the difference in RAM and
> disk consumption you see.
>
> Let's start with RAM. I could of course be wrong here, but I believe the
> 'static bitcask per key overhead' is just plainly too small. Let me
> explain why. The bitcask_keydir_entry struct for each entry looks like this:
>
>     typedef struct
>     {
>         uint32_t file_id;
>         uint32_t total_sz;
>         uint64_t offset;
>         uint32_t tstamp;
>         uint16_t key_sz;
>         char     key[0];
>     } bitcask_keydir_entry;
>
> This has indeed a size of 22 bytes (the array 'key' has zero entries
> because the key is written to the memory address directly after the keydir
> entry).
> As is done in the capacity planner, you need to add the size of the bucket
> and key to get the size of the keydir entry, but that is not the whole
> story.
>
> The thing that is actually stored in key is the result of this Erlang
> expression:
>
>     erlang:term_to_binary( {<<"bucket">>, <<"key">>} )
>
> that is, a tuple of two binaries converted to the Erlang external term
> format.
>
> So let's see:
>
> 1> term_to_binary({<<>>,<<>>}).
> <<131,104,2,109,0,0,0,0,109,0,0,0,0>>
> 2> iolist_size(term_to_binary({<<>>,<<>>})).
> 13
> 3> iolist_size(term_to_binary({<<"a">>,<<"b">>})).
> 15
> 4> iolist_size(term_to_binary({<<"aa">>,<<"b">>})).
> 16
> 5> iolist_size(term_to_binary({<<"aa">>,<<"bb">>})).
> 17
>
> So even an empty bucket/key pair takes 13 bytes to store.
>
> Also, since the hashtable storing the keydir entries is essentially an
> array of pointers to bitcask_keydir_entry objects, there is another 8 bytes
> of overhead per key, assuming you are running a 64-bit system.
>
> So the real static overhead per key is not 22 but 22 + 13 + 8 = 43 bytes.
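For anyone who wants to check the 13-byte figure without an Erlang shell handy, here is a minimal sketch in Python. It hand-encodes only this one shape of term (a 2-tuple of binaries) in the external term format; the function names are made up for illustration and this is not how bitcask itself computes anything. The 22 and the 8 are simply the struct size and the 64-bit pointer size quoted above.

```python
import struct

VERSION = 131          # external term format version byte
SMALL_TUPLE_EXT = 104  # tag for a tuple with fewer than 256 elements
BINARY_EXT = 109       # tag for a binary, followed by a 4-byte big-endian length


def encode_binary(data: bytes) -> bytes:
    """Encode one Erlang binary: tag, 32-bit length, then the raw bytes."""
    return bytes([BINARY_EXT]) + struct.pack(">I", len(data)) + data


def encode_bucket_key(bucket: bytes, key: bytes) -> bytes:
    """Mimic term_to_binary({Bucket, Key}) for two binaries (illustrative helper)."""
    return bytes([VERSION, SMALL_TUPLE_EXT, 2]) + encode_binary(bucket) + encode_binary(key)


def keydir_bytes_per_key(bucket: bytes, key: bytes, pointer_size: int = 8) -> int:
    """Struct (22 bytes) + serialized bucket/key + one hashtable pointer."""
    struct_size = 22  # size of bitcask_keydir_entry as quoted in the message
    return struct_size + len(encode_bucket_key(bucket, key)) + pointer_size


print(list(encode_bucket_key(b"", b"")))      # same bytes as the shell output above
print(len(encode_bucket_key(b"aa", b"bb")))   # 17
print(keydir_bytes_per_key(b"", b""))         # 43
```

Each binary costs 5 bytes of framing and the tuple costs 3, which is where the constant 13 comes from.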
>
> Let's run the numbers for your predicted memory consumption again:
>
> ( 43 + 10 + 36 ) * 183915891 * 3 = 49105542897 bytes = 45.7 GB
>
> Your actual RAM consumption of 70 GB seems to be at odds with the output of
> erlang:memory/0 that you sent:
>
> {total,7281790968} => RAM: 7281790968 * 8 = 54.3 GB
>
> So that is much closer, within about 20 percent. Some additional overhead
> is to be expected, but it is hard to say how much of that is due to Erlang's
> internal usage and how much is due to bitcask.
>
> So let's examine the disk consumption next.
> As you rightly concluded, the equation here is somewhat simplified, and you
> are also right that the real equation would be
>
> ( 14 + Key + Value ) * Num Entries * N_Val
>
> On the other hand, 14 bytes + key size might be quite irrelevant if your
> values have a size of at least 2 KB (as in the example), which seems to be
> the general assumption in some aspects of the design of riak and bitcask.
> As you also noticed, this additional small overhead brings you nowhere near
> the disk usage that you observe.
>
> First, the key that is stored in the bitcask files is not the key part of
> the bucket/key pair that riak calls a key, but the serialized bucket/key
> pair described above, so the calculation becomes:
>
> ( 14 + ( 13 + Bucket + Key ) + Value ) * Num Entries * N_Val
>
> ( 14 + ( 13 + 10 + 36 ) + 36 ) * 183915891 * 3 = 56 GB
>
> Still not enough :-/.
> So next let's examine what is actually stored as the value in bitcask. It is
> not simply the data you provide, but a riak object (r_object record), which
> is again serialized by the erlang:term_to_binary/1 function. So let's see. I
> create a new riak object with zero-byte bucket, key and value:
>
> 3> Obj = riak_object:new(<<>>,<<>>,<<>>).
> {r_object,<<>>,<<>>,
>     [{r_content,{dict,0,16,16,8,80,48,
>         {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>         {{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
>       <<>>}],
>     [],
>     {dict,1,16,16,8,80,48,
>         {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>         {{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
>     undefined}
> 4> iolist_size(erlang:term_to_binary(Obj)).
> 205
>
> Also, bucket and key are contained in the riak object itself (and
> therefore in the bitcask notion of the value). So with this information the
> predicted disk usage becomes:
>
> ( 14 + ( 13 + Bucket + Key ) + ( 205 + Bucket + Key + Value ) ) * Num Entries * N_Val
>
> ( 14 + ( 13 + 10 + 36 ) + ( 205 + 10 + 36 + 36 ) ) * 183915891 * 3 = 185.0 GB
>
> which is much closer to the 341 GB you observe.
>
> But we can get even closer, although the details become somewhat more
> fuzzy. But bear with me.
> I again create a riak object, but this time with a non-empty bucket/key so
> I can store it in riak:
>
> (ctag@> Obj = riak_object:new(<<"a">>,<<"a">>,<<>>).
> {r_object,<<"a">>,<<"a">>,
>     [{r_content,{dict,0,16,16,8,80,48,
>         {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>         {{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
>       <<>>}],
>     [],
>     {dict,1,16,16,8,80,48,
>         {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>         {{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
>     undefined}
>
> (ctag@> iolist_size(erlang:term_to_binary(Obj)).
> 207
>
> (ctag@> {ok,C}=riak:local_client().
> {ok,{riak_client,'ctag@',<<2,123,179,255>>}}
> (ctag@> C:put(Obj,1,1).
> ok
>
> (ctag@> {ok,ObjStored} = C:get(<<"a">>,<<"a">>, 1).
> {ok,{r_object,<<"a">>,<<"a">>,
>     [{r_content,{dict,2,16,16,8,80,48,
>         {[],[],[],[],[],[],[],[],[],[],[],[],...},
>         {{[],[],[],[],[],[],[],[],[],[],...}}},
>       <<>>}],
>     [{<<2,123,179,255>>,{1,63473554112}}],
>     {dict,1,16,16,8,80,48,
>         {[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>         {{[],[],[],[],[],[],[],[],[],[],[],...}}},
>     undefined}}
> (ctag@> iolist_size(erlang:term_to_binary(ObjStored)).
> 358
>
> Ok?
> What happened? The object we retrieved is considerably larger than the
> one we stored. One culprit is the vector clock data, which was an empty list
> for Obj, and now has one entry:
>
> (ctag@> riak_object:vclock(Obj).
> []
> (ctag@> riak_object:vclock(ObjStored).
> [{<<2,123,179,255>>,{1,63473554112}}]
> (ctag@> iolist_size(term_to_binary(riak_object:vclock(Obj))).
> 2
> (ctag@> iolist_size(term_to_binary(riak_object:vclock(ObjStored))).
> 30
>
> So that's 28 bytes each time the object is updated with a new client ID (so
> always use a meaningful client ID!), until the vclock pruning sets in. The
> default bucket property is {big_vclock,50}, so in the worst case this could
> account for 28 * 50 = 1400 bytes!
> But every object that has been stored, by whatever means, has at least one
> entry in the vclock, so that is another 28 bytes of overhead.
>
> The other part of the growth stems from some standard entries, which are
> added to the object metadata during the put operation:
>
> (ctag@> dict:to_list(riak_object:get_metadata(Obj)).
> []
> (ctag@> iolist_size(term_to_binary(riak_object:get_metadata(Obj))).
> 60
>
> (ctag@> dict:to_list(riak_object:get_metadata(ObjStored)).
> [{<<"X-Riak-VTag">>,"7PoD9FEMUBzNmQeMnjUbas"},
>  {<<"X-Riak-Last-Modified">>,{1306,334912,424099}}]
> (ctag@> iolist_size(term_to_binary(riak_object:get_metadata(ObjStored))).
> 183
>
> So there are the other 123 bytes.
>
> In total, this 356-byte overhead per object (358 minus the 2 bytes of bucket
> and key, which are already accounted for) leads us to the following
> calculation:
>
> ( 14 + ( 13 + Bucket + Key ) + ( 356 + Bucket + Key + Value ) ) * Num Entries * N_Val
>
> ( 14 + ( 13 + 10 + 36 ) + ( 356 + 10 + 36 + 36 ) ) * 183915891 * 3 = 262.6 GB
>
> We are getting closer!
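The arithmetic in this part is easy to get wrong by hand, so here is a small sketch that redoes it in Python. It only restates the formula from the message; the function and parameter names are mine, and the sizes (10-byte bucket, 36-byte key, 36-byte value, 183915891 entries, n_val 3) are the ones used in this thread. Note that the Value term sits inside the serialized object, so it is counted there.

```python
GIB = 2 ** 30  # the "GB" figures in this thread are binary gigabytes


def data_file_bytes(bucket, key, value, object_overhead, num_entries, n_val):
    """Predicted bitcask data file size, following the formula in the message."""
    record_header = 14                       # fixed per-record overhead in the data file
    stored_key = 13 + bucket + key           # term_to_binary({Bucket, Key})
    stored_value = object_overhead + bucket + key + value  # serialized r_object
    return (record_header + stored_key + stored_value) * num_entries * n_val


# Growth of the stored object, taken from the shell sessions above.
vclock_entry = 30 - 2     # one vclock entry versus an empty vclock
put_metadata = 183 - 60   # metadata added by the put versus empty metadata
assert 207 + vclock_entry + put_metadata == 358

# 205 = freshly created object, 356 = object after a put through the Erlang client.
for overhead in (205, 356):
    total = data_file_bytes(10, 36, 36, overhead, 183915891, 3)
    print(overhead, round(total / GIB, 1))
```

Swapping in your own bucket, key and value sizes shows quickly how much the fixed per-object overhead dominates when the values are small.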
> If you loaded the data via the REST API the overhead is somewhat larger
> still, since the object will also contain 'content-type', 'X-Riak-Meta' and
> 'Links' metadata entries:
>
> xxxx@node2:~$ curl -v -d '' -H "Content-Type: text/plain"
>
> (ctag@> {ok,ObjStored} = C:get(<<"a">>,<<"a">>, 1).
> {ok,{r_object,<<"a">>,<<"a">>,
>     [{r_content,{dict,5,16,16,8,80,48,
>         {[],[],[],[],[],[],[],[],[],[],[],[],...},
>         {{[],[],[[<<"Links">>]],[],[],[],[],[],[],[],...}}},
>       <<>>}],
>     [{<<5,134,53,93>>,{1,63473557230}}],
>     {dict,1,16,16,8,80,48,
>         {[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>         {{[],[],[],[],[],[],[],[],[],[],[],...}}},
>     undefined}}
> (ctag@> dict:to_list(riak_object:get_metadata(ObjStored)).
> [{<<"Links">>,[]},
>  {<<"X-Riak-VTag">>,"3TQzJznzXXWtZefntWXPDR"},
>  {<<"content-type">>,"text/plain"},
>  {<<"X-Riak-Last-Modified">>,{1306,338030,682871}},
>  {<<"X-Riak-Meta">>,[]}]
>
> (ctag@> iolist_size(erlang:term_to_binary(ObjStored)).
> 449
>
> Which leads to (remember again to subtract 2 bytes):
>
> ( 14 + ( 13 + Bucket + Key ) + ( 447 + Bucket + Key + Value ) ) * Num Entries * N_Val
>
> ( 14 + ( 13 + 10 + 36 ) + ( 447 + 10 + 36 + 36 ) ) * 183915891 * 3 = 309.3 GB
>
> Nearly there!
>
> Now there are also the hintfiles, which are a kind of index into the
> bitcask data files to speed up the start of a riak node. The hintfiles
> contain one entry per key, and the code that creates one entry looks like
> this:
>
>     [<<Tstamp:?TSTAMPFIELD>>, <<KeySz:?KEYSIZEFIELD>>,
>      <<TotalSz:?TOTALSIZEFIELD>>, <<Offset:?OFFSETFIELD>>, Key].
>
> So that's 4 + 2 + 4 + 8 + KeySize (= 18 + KeySize) additional bytes per key.
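To make the 18 + KeySize figure concrete, here is a rough Python equivalent of that entry layout. The field widths (32, 16, 32 and 64 bits) are taken from the 4 + 2 + 4 + 8 sum above; big-endian byte order is my assumption (it is the default for Erlang binaries, but I have not checked the macro definitions), and the function name is made up.

```python
import struct


def hintfile_entry(tstamp: int, total_sz: int, offset: int, key: bytes) -> bytes:
    """Build one hintfile entry: fixed 18-byte header followed by the key.

    Layout assumed here: tstamp (32 bits), key size (16 bits),
    total size (32 bits), offset (64 bits), big-endian, no padding.
    """
    header = struct.pack(">IHIQ", tstamp, len(key), total_sz, offset)
    return header + key


# The key stored here is the serialized bucket/key pair, i.e. 13 + Bucket + Key
# bytes; with a 10-byte bucket and a 36-byte key that is 59 bytes.
serialized_key = bytes(13 + 10 + 36)  # placeholder bytes, only the length matters
entry = hintfile_entry(tstamp=0, total_sz=0, offset=0, key=serialized_key)
print(len(entry))  # 18 + 59 = 77
```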
> So the final result if you inserted the keys via the REST API is:
>
> ( 14 + ( 13 + Bucket + Key ) + ( 447 + Bucket + Key + Value ) + ( 18 + ( 13 + Bucket + Key ) ) ) * Num Entries * N_Val
>   = ( 505 + 3 * ( Bucket + Key ) + Value ) * Num Entries * N_Val
>
> ( 505 + 3 * ( 10 + 36 ) + 36 ) * 183915891 * 3 = 374636669967 bytes = 348.9 GB
>
> And if you used Erlang (or probably any ProtocolBuffers client):
>
> ( 14 + ( 13 + Bucket + Key ) + ( 356 + Bucket + Key + Value ) + ( 18 + ( 13 + Bucket + Key ) ) ) * Num Entries * N_Val
>   = ( 414 + 3 * ( Bucket + Key ) + Value ) * Num Entries * N_Val
>
> ( 414 + 3 * ( 10 + 36 ) + 36 ) * 183915891 * 3 = 324427631724 bytes = 302.1 GB
>
> So the truth is somewhere in between. But as David wrote, there can be
> additional overhead due to the append-only nature of bitcask.
>
> Cheers,
> Nico
>
> On 24.05.2011 at 23:48, Anthony Molinaro wrote:
>
> Just curious if anyone has any ideas. For the moment, I'm just taking
> the RAM calculation and multiplying by 2 and the Disk calculation and
> multiplying by 8, based on my findings with my current cluster. But
> I would like to know why my values are so much higher than those I should
> be getting.
>
> Also, I'd still like to know how the forms calculate things, as the disk
> calculation there does not match reality or the formula.
>
> Also, waiting to hear if there is any way to force merge to run so I can
> more accurately gauge whether multiple copies are affecting disk usage.
>
> Thanks,
>
> -Anthony
>
> On Mon, May 23, 2011 at 11:06:31PM -0700, Anthony Molinaro wrote:
>
> On Mon, May 23, 2011 at 10:53:29PM -0700, Anthony Molinaro wrote:
>
> On Mon, May 23, 2011 at 09:57:25PM -0600, David Smith wrote:
>
> On Mon, May 23, 2011 at 9:39 PM, Anthony Molinaro
> Thus, depending on
> your merge triggers, more space can be used than is strictly necessary
> to store the data.
>
> So the lack of any overhead in the calculation is expected?
> I mean according to
>
> Disk = Estimated Total Objects * Average Object Size * n_val
>
> Which just seems wrong, doesn't it? I don't quite understand the
> bitcask code well enough yet to see what the actual data it stores is,
> but the whitepaper suggested several things were involved in the
> on-disk representation.
>
> Okay, finally found the code for this part. I kept looking in the NIF,
> but that's only the keydir, not the data files.