Since you were able to create and write these objects in the first
place, you probably had enough RAM at one point to load and save them. I
would try bringing each node up in isolation, then issuing a delete
request against the local node, then restarting the node in normal,
talking-to-the-ring mode. If there are any local processes you can stop
to free up memory, try that too.
When I encountered this problem, I was able to use riak:local_client
at the Erlang shell to delete my huge objects, so long as other
processes weren't hammering the node with requests.
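For reference, here's roughly what that looked like for me. The bucket
and key are placeholders, and I'm recalling the old client API from
memory, so double-check the function arity against your version:

    $ riak attach
    %% At the Erlang shell on the node that holds the object:
    {ok, C} = riak:local_client().
    %% rw=1 so only one replica has to acknowledge the delete:
    C:delete(<<"my_bucket">>, <<"huge_key">>, 1).
    %% Don't call q() here, that would stop the node itself.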
--Kyle
On 07/05/2011 09:28 PM, Jeff Pollard wrote:
Thanks to some help from Aphyr + Sean Cribbs on IRC, we narrowed the
issue down to us having several documents that were multiple hundred
megabytes each, plus one 1.1 GB document. Deleting those documents has
kept the cluster running quite happily for 3+ hours now, where before
nodes were crashing after 15 minutes.
I've managed to delete most of the large documents, but there are still
a handful (3) that I am unable to delete. Attempts to curl -X DELETE
them result in a 503 error from Riak:
< HTTP/1.1 503 Service Unavailable
< Server: MochiWeb/1.1 WebMachine/1.7.3 (participate in the frantic)
< Date: Wed, 06 Jul 2011 04:20:15 GMT
< Content-Type: text/plain
< Content-Length: 18
<
request timed out
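For reference, the delete requests look like this (bucket and key
replaced with placeholders; 8098 is the default HTTP port):

    curl -v -X DELETE http://127.0.0.1:8098/riak/my_bucket/huge_key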
In the erlang.log, I see this right before the timeout comes back:
=INFO REPORT==== 5-Jul-2011::21:26:35 ===
[{alarm_handler,{set,{process_memory_high_watermark,<0.10425.0>}}}]
Anyone have any help/ideas on what's going on here and how to fix it?
On Tue, Jul 5, 2011 at 8:58 AM, Jeff Pollard <jeff.poll...@gmail.com> wrote:
Over the last few days we've had random nodes in our 5-node cluster
crash with "eheap_alloc: Cannot allocate xxxx bytes of memory"
errors in the erl_crash.dump file. In general, the crashes happen
while trying to allocate 13-20 GB of memory (our boxes have 32 GB
total). As far as I can tell the crashes don't seem to coincide with
any particular requests to Riak. I've tried to make some sense of the
erl_crash.dump file but haven't had any luck. I'm also in the process
of restoring our Riak backups to our staging
cluster in hopes of more accurately reproducing the issue in a less
noisy environment.
My questions for the list are:
1. Any clue how to further diagnose the issue? I can attach my
erl_crash.dump if needed.
2. Is it possible/likely this is due to large m/r requests? We
have a couple of m/r requests. One goes over no more than 4
documents at a time, while the other goes over anywhere between
60 and 10,000 documents, though usually toward the smaller end
of that range. We use 16 JS VMs, each with a max VM memory and
stack size of 32 MB (relevant config excerpt below).
3. We're running Riak 0.14.1. Would upgrading to 0.14.2 help?
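For reference, the JS settings in our app.config look roughly like this
(other riak_kv settings omitted; js_max_vm_mem and js_thread_stack are
in MB):

    {riak_kv, [
        %% ... other settings elided ...
        {js_vm_count, 16},
        {js_max_vm_mem, 32},
        {js_thread_stack, 32}
    ]},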
Thanks!
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com