Since you were able to create and write these objects in the first place, you probably had enough RAM at one point to load and save them. I would try bringing each node up in isolation, issuing a delete request against the local node, and then restarting the node in normal, talking-to-the-ring mode. If there are any local processes you can stop to free up memory, try that too.

When I encountered this problem, I was able to use riak:local_client at the Erlang shell to delete my huge objects--so long as other processes weren't hammering the node with requests.
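A minimal sketch of that console session, assuming Riak 0.14's parameterized riak_client module (the bucket and key names are placeholders, not from the original thread):

```erlang
%% Run from the node's Erlang console (`riak console` or `riak attach`).
%% local_client/0 returns a client bound to this node, so the delete
%% bypasses the HTTP interface entirely.
{ok, Client} = riak:local_client(),

%% Delete the oversized object with an RW quorum of 1, so only this
%% node has to acknowledge; <<"my_bucket">> / <<"huge_key">> are
%% placeholder names for the object you want gone.
Client:delete(<<"my_bucket">>, <<"huge_key">>, 1).
```

If the delete still times out, freeing memory first (as above) gives the VM more headroom before you retry.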

--Kyle

On 07/05/2011 09:28 PM, Jeff Pollard wrote:
Thanks to some help from Aphyr + Sean Cribbs on IRC, we narrowed the
issue down to us having several multiple-hundred-megabyte sized
documents and one 1.1 GB document.  Deletion of those documents has
kept the cluster running quite happily for 3+ hours, where before
nodes were crashing after 15 minutes.

I've managed to delete most of the large documents, but there are still
a handful (3) that I am unable to delete.  Attempts to curl -X DELETE
them result in a 503 error from Riak:

    < HTTP/1.1 503 Service Unavailable
    < Server: MochiWeb/1.1 WebMachine/1.7.3 (participate in the frantic)
    < Date: Wed, 06 Jul 2011 04:20:15 GMT
    < Content-Type: text/plain
    < Content-Length: 18

    <
    request timed out


In the erlang.log, I see this right before the timeout comes back:

    =INFO REPORT==== 5-Jul-2011::21:26:35 ===
    [{alarm_handler,{set,{process_memory_high_watermark,<0.10425.0>}}}]


Anyone have any help/ideas on what's going on here and how to fix it?

On Tue, Jul 5, 2011 at 8:58 AM, Jeff Pollard <jeff.poll...@gmail.com> wrote:

    Over the last few days we've had random nodes in our 5-node cluster
    crash with "eheap_alloc: Cannot allocate xxxx bytes of memory"
    errors in the erl_crash.dump file.  In general, the crashes seem
    to happen while trying to allocate 13-20 GB of memory (our boxes
    have 32 GB total).  As far as I can tell, crashing doesn't seem to
    coincide with any particular requests to Riak.  I've tried to make
    some sense of the erl_crash.dump file but haven't had any luck.  I'm
    also in the process of restoring our Riak backups to our staging
    cluster in hopes of more accurately reproducing the issue in a less
    noisy environment.

    My questions for the list are:

       1. Any clue how to further diagnose the issue? I can attach my
          erl_crash.dump if needed.
       2. Is it possible/likely this is due to large m/r requests?  We
          have a couple of m/r requests.  One goes over no more than 4
          documents at a time, while the other goes over anywhere
          between 60 and 10,000 documents, though usually closer to
          the smaller end.  We use 16 JS VMs, with the VM max memory
          and stack each set to 32 MB.
       3. We're running riak 0.14.1.  Would upgrading to 0.14.2 help?

    Thanks!




_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
