Matthew,

I did some testing with your code and I was able to reproduce what you were
seeing. I would occasionally see an error similar to the following:

    Failed to fetch item 460 err Object not found

This behavior is a result of the trade-offs of using an eventually
consistent database like Riak. Your inserts are not failing to write and
the data is not being lost. What is actually happening is that a quick
read after a write with the default request options does not guarantee
that you will read your writes. When you make the read, the replicas that
respond to your request have not seen the latest value yet, so you end up
with "Not Found" as the response. If you made another read attempt for one
of the objects reported missing, it would succeed because Riak's
read-repair would have kicked in to make sure each replica has the value.

To increase the likelihood of reading your writes, you should set the
optional request parameters pr and pw to ensure that all of the primary
replicas are available prior to performing a read or write request. I
altered your code to use those options and put the updates in a gist [1]
(it's my first stab at Go, so my changes may not be very idiomatic).
Additionally, I changed the Riak driver so that NotFoundOk was false
instead of true. With these changes I was able to run the test 50 times in
a row with no errors, where previously I would see at least one error
every 10 iterations or so. Hope that helps.
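
To make the read-your-writes point concrete, here is a toy model in Go. It
is only a sketch under stated assumptions: it is not Riak code, it is not
taken from the gist, and it does not use the goriakpbc API. The names
(replica, write, read, readsOwnWrite) and the values of n, w and r are
hypothetical and chosen for illustration. The model treats w as the number
of replicas that hold the value when the write is acknowledged and r as the
number of replicas consulted on a read. It ignores the distinction between
primary and fallback replicas that pr and pw add on top of this.

```go
package main

import "fmt"

// replica is a toy stand-in for one copy of the data; it is not Riak code.
type replica map[string]string

// write stores the value on the first w replicas only, modelling a write
// that is acknowledged before the remaining replicas have caught up.
func write(replicas []replica, w int, key, value string) {
	for i := 0; i < w; i++ {
		replicas[i][key] = value
	}
}

// read consults the last r replicas (the worst case for overlap with the
// write set) and reports whether any of them already holds the key.
func read(replicas []replica, r int, key string) (string, bool) {
	for i := len(replicas) - r; i < len(replicas); i++ {
		if v, ok := replicas[i][key]; ok {
			return v, true
		}
	}
	return "", false
}

// readsOwnWrite performs one write followed by one read on n fresh replicas
// and reports whether the read found the value that was just written.
func readsOwnWrite(n, w, r int) bool {
	replicas := make([]replica, n)
	for i := range replicas {
		replicas[i] = replica{}
	}
	write(replicas, w, "item-460", "payload")
	_, found := read(replicas, r, "item-460")
	return found
}

func main() {
	// w + r <= n: the read set can miss every replica that has the value.
	fmt.Println("n=3 w=1 r=1 found:", readsOwnWrite(3, 1, 1))
	// w + r > n: the read set must overlap the write set.
	fmt.Println("n=3 w=2 r=2 found:", readsOwnWrite(3, 2, 2))
}
```

In this model the first call reports a miss and the second a hit, which is
the same overlap argument behind requiring more replicas to take part in
each request. A real cluster has more moving parts (fallback replicas,
read-repair, NotFoundOk), so treat this as intuition rather than a
description of what Riak does internally.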

[1] : https://gist.github.com/kellymclaughlin/6041109

Kelly


On Wed, Jul 17, 2013 at 4:07 PM, Matthew Dawson <matt...@mjdsystems.ca> wrote:

> On July 17, 2013 08:45:01 AM Kelly McLaughlin wrote:
> > Matthew,
> >
> > I find it really surprising that you don't see any difference in behavior
> > when you set delete_mode to keep. I think it would be helpful if you could
> > outline your specific setup and give the steps to reproduce what you're
> > seeing to be able to make a determination if this represents a bug or not.
> > Thanks.
> >
> > Kelly
> Hi Kelly,
>
> Sure, no problem.  Hardware-wise, I have:
>  - An AMD Phenom II X6 Desktop with 16G memory, and an HDD with an SSD cache.
>  - An Intel Ivy Bridge Dual Core (+HT) Laptop with 16G memory and SSD.
> Both have lots of free memory + disk space for running my tests, and my
> Desktop never seems to be IO bound.  Both machines are connected over
> Ethernet on the same LAN.
>
> On top of that hardware, both are running two instances of Riak each, all
> forming one 4 node cluster.  I'm using the default ring size of 64.  I've
> also upgraded all the nodes to the latest release, 1.4, using the 1.4 tag
> from Git.  I'm not using this to seriously benchmark Riak, so I don't think
> this setup should cause any issues.  I'm also going to set up a real
> cluster for production use, so ring size is not a concern.
> Each Riak instance uses LevelDB as the datastore; Riak Search is disabled.
> I'm using Riak's PB API for access, and I've bumped up the backlog
> parameter to 1024 for now.  Originally my program would connect to a single
> node, but recently I've been playing with HAProxy locally, and now I use
> that to connect to all four instances.  The problem existed before I
> implemented HAProxy.  Riak Control is also enabled on one node per computer.
>
> My application effectively stores two pieces of information in Riak.
> First it stores a list of keys associated with an object, and then it
> stores an individual item at each key.  I limit the number of keys to
> 10000 per object.
>
> For my test suite, I automatically clean up after each test by listing all
> the keys associated with a bucket and then deleting each key individually.
> I only store items in two buckets, so this cleans the slate before each run.
>
> The test with the highest chance of failing tests how the system deals
> with inserting 10000 items against one object.  The key list remains below
> 1M.  Occasionally I see other tests fail, but I think this one fails more
> often as it stresses the entire system the most.  If I stop the automatic
> cleanup, the key reported as not found cannot be found with Curl either.
>
> Before posting, I would delete and insert keys without using a vclock.  I
> had figured this was safe as I ran with allow_mult=true on both buckets,
> and I implemented conflict resolution first.  As suggested on this list, I
> now have the 10000 item test suite use vclocks from start to finish.
> However, I still see this behaviour.
>
> I've attached a program (written in Go, as that is what I'm using) to this
> email which triggers the behaviour.  As far as I understand Riak, it is
> properly fetching vclocks whenever possible.  The library I'm using
> (located at: github.com/tpjg/goriakpbc ) was just recently updated to
> ensure that vclocks are fetched, even if the item is deleted.  I am using
> an up to date version of the library.  The program acts similarly to my
> app, but pared down as far as possible.  Note that this behaviour is
> unpredictable, and this program will sometimes execute fine.
> I only tested this program against the default delete_mode setting.  Also,
> using HAProxy seems to trigger the issue far more readily, but it also
> happens without it.
>
>
> If there is any other information I can provide to help, let me know.
>
> Thanks,
> --
> Matthew
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>