Bryan,

We saw that Riak was using much more memory than expected at the end of
the handoffs.  Using `riak-admin top` we could see that this wasn't process
memory, but binaries.  I first did some work via attach, looping over
processes and running GC, to confirm that this wasn't a failure to collect
garbage - the references to memory were real.  I then wrote some functions
in attach to analyse process_info/2 for each process (looking at binary and
memory), and discovered that there were penciller processes holding lots of
references to large binaries, where the penciller was the only process with
a reference to each binary.  This accounted for all the unexpected memory
use.  It made no sense initially, as the penciller should only hold small
binaries (metadata).  I then looked at the running state of the penciller
processes and could see no large binaries in the state, but could see that
many of the active keys in the penciller were keys known to have large
object values (but small amounts of metadata) - and that the sizes of the
object values were the same as the sizes of the binary references found on
the penciller process via process_info/2.
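
A minimal sketch of the kind of check described above, not the actual
functions used at the time.  The module and function names are hypothetical,
and the {Id, Size, RefCount} shape of the `binary` entries is an assumption
based on observed behaviour: the OTP documentation says that item may change
without notice.

```erlang
-module(binary_refs).
-export([gc_all/0, top_binary_holders/1]).

%% Force a garbage collection on every local process, to rule out
%% binaries that are merely waiting to be collected.
gc_all() ->
    lists:foreach(fun(Pid) -> erlang:garbage_collect(Pid) end,
                  erlang:processes()),
    ok.

%% Return the N processes referencing the most binary bytes, as
%% {Pid, BinaryBytes, BinaryCount, ProcessMemoryBytes} tuples.
top_binary_holders(N) ->
    Rows = lists:filtermap(fun summarise/1, erlang:processes()),
    Sorted = lists:reverse(lists:keysort(2, Rows)),
    lists:sublist(Sorted, N).

summarise(Pid) ->
    case erlang:process_info(Pid, [binary, memory]) of
        [{binary, Bins}, {memory, Mem}] ->
            %% Assumed entry shape: {Id, Size, RefCount}; entries of any
            %% other shape are skipped by the generator pattern.
            Bytes = lists:sum([Size || {_Id, Size, _RefCount} <- Bins]),
            {true, {Pid, Bytes, length(Bins), Mem}};
        undefined ->
            %% The process exited between processes/0 and process_info/2.
            false
    end.
```

From an attached shell, calling binary_refs:gc_all() and then
binary_refs:top_binary_holders(10) should list the processes to look at
first.  If the binary totals stay high after the forced GC, the references
are real rather than uncollected garbage.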

I then recalled the first part of this:
https://dieswaytoofast.blogspot.com/2012/12/erlang-binaries-and-garbage-collection.html.
It was obvious that in extracting the metadata the beam was naturally
retaining a reference to the whole binary, for as long as the sub-binary was
retained by a process (the Penciller).  Forcing a binary copy resolved
this referencing issue.  It was nice that the same tools used to detect the
issue made it quite easy to write a test to confirm the resolution:
https://github.com/martinsumner/leveled/blob/master/test/end_to_end/riak_SUITE.erl#L1214-L1239
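
The effect can be illustrated with a small sketch.  This is not the leveled
code: the module name, the 2MB object and the 16-byte metadata header are
assumptions chosen for illustration.  It uses binary:referenced_byte_size/1
to show how much memory a sub-binary keeps alive, before and after
binary:copy/1.

```erlang
-module(subbin_demo).
-export([run/0]).

run() ->
    %% Assumed sizes: a 2MB object carrying a 16-byte metadata header.
    Object = binary:copy(<<0>>, 2 * 1024 * 1024),
    %% Matching out the header creates a sub-binary that still points
    %% into (and so keeps alive) the whole 2MB binary.
    <<Meta:16/binary, _Rest/binary>> = Object,
    %% binary:copy/1 builds a fresh 16-byte binary with no link to Object.
    Copied = binary:copy(Meta),
    {binary:referenced_byte_size(Meta),
     binary:referenced_byte_size(Copied)}.
```

subbin_demo:run() should return a pair in which the first element is
roughly the size of the whole object and the second is the size of the
metadata alone.  A process that stores Meta pins the full object in memory,
while one that stores Copied does not.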

The memory leak section of Fred Hebert's http://www.erlang-in-anger.com/ is
great reading for helping with these types of issues.

Thanks

Martin


On Fri, 28 Jun 2019 at 09:46, b h <bryanhuntwit...@gmail.com> wrote:

> Nice work - I've read issue / PR - how did you discover / track it down -
> tools or just reading the code ?
>
> On Fri, 28 Jun 2019 at 09:35, Martin Sumner <martin.sum...@adaptip.co.uk>
> wrote:
>
>> There is now a second update available for 2.9.0:
>> https://github.com/basho/riak/tree/riak-2.9.0p2.
>>
>> This patch, like the patch before, resolves a memory management issue in
>> leveled, which this time could be triggered by sending many large objects
>> in a short period of time.  The underlying problem is described a bit
>> further here https://github.com/martinsumner/leveled/issues/285, and is
>> resolved by leveled working more sympathetically with the beam binary
>> memory management.
>>
>> Switching to the patched version is not urgent unless you are using the
>> leveled backend, and may send a large number of large objects in a burst.
>>
>> Updated packages are available (thanks to Nick Adams at TI Tokyo) -
>> https://files.tiot.jp/riak/kv/2.9/2.9.0p2/
>>
>> Thanks again to the testing team at the NHS Spine project, Aaron Gibbon
>> (BJSS) and Ramen Sen, who discovered the problem.  The issue was discovered
>> in a handoff scenario where there were tens of thousands of 2MB objects
>> stored in a portion of the keyspace at the end of the handoff - which led
>> to memory issues until either more PUTs were received (to force a persist
>> to disk) or a restart occurred.
>>
>> Regards
>>
>>
>> On Sat, 25 May 2019 at 09:35, Martin Sumner <martin.sum...@adaptip.co.uk>
>> wrote:
>>
>>> Unfortunately, Riak 2.9.0 was released with an issue whereby a race
>>> condition in heavy-PUT scenarios (e.g. handoffs) could cause a leak of
>>> file descriptors.
>>>
>>> The issue is described here -
>>> https://github.com/basho/riak_kv/issues/1699, and the underlying issue
>>> here - https://github.com/martinsumner/leveled/issues/278.
>>>
>>> There is a new patched version of the release available (2.9.0p1) at
>>> https://github.com/basho/riak/tree/riak-2.9.0p1.  This should be used
>>> in preference to the original release of 2.9.0.
>>>
>>> Updated packages are available (thanks to Nick Adams at TI Tokyo) -
>>> https://files.tiot.jp/riak/kv/2.9/2.9.0p1/
>>>
>>> Thanks also to the testing team at the NHS Spine project, Aaron Gibbon
>>> (BJSS) and Ramen Sen, who discovered the problem.
>>>
>>> Regards
>>>
>>> Martin
>>>
>>>
>>>
>>>
>>> _______________________________________________
>> riak-users mailing list
>> riak-users@lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>