Re: Core Claim and Property-Based Tests

2017-05-17 Thread Russell Brown
Back to the original post, the important point for me is that this is not 
really about riak-core, but Riak, the database.

The OP in TL;DR form:

1. A thorough report of a long-lived bug in claim that means many node/ring 
combos end up with multiple replicas on one physical node, silently!
2. A proposed fix (which I consider very important for anyone running Riak.)
3. The most important question: If the OP fixes this, how can everyone benefit?

I had some data-loss fixes merged by Basho in March/April; will they ever be 
released?
Will each of the major users hard-fork Riak and struggle to benefit from each 
other's work?

Cheers

Russell

On 16 May 2017, at 21:02, DeadZen  wrote:

> I'd like to keep the core project going, just depends on how much interest 
> there is.
> There are a lot of separate issues and stalled initiatives, if anyone likes 
> to discuss them. Some have to do simply with scaling Distributed Erlang. 
> There's a riak core mailing list as well that probably could use some fresh 
> air. 
> 
> Thanks,
> Pedram
> 
> On Tue, May 16, 2017 at 3:29 PM Christopher Meiklejohn 
>  wrote:
> We're looking at mainly leveraging partisan for changing the
> underlying communication structure -- we hope to have via support in
> Partisan soon along with connection multiplexing, so we hope to avoid
> bottlenecks related to head-of-line-blocking in distributed Erlang, be
> able to support SSL/TLS easier for intra-cluster communication and
> have more robust visibility into how the cluster is operating.
> 
> One thing we learned from Riak MDC is that the single connections
> used in distributed Erlang are a bottleneck and difficult to apply
> flow and congestion control to, whereas we believe a solution based
> completely on gen_tcp would be more flexible.
> 
> [Keep in mind this is a ~1 year vision at the moment.]
> 
> Thanks,
> - Christopher
> 
> On Tue, May 16, 2017 at 9:20 PM, Martin Sumner
>  wrote:
> > Chris,
> >
> > Is this only the communications part, so the core concepts like the Ring,
> > preflists, the Claimant role, the claim algo etc will remain the same?
> >
> > Where's the best place to start reading about Partisan, I'm interested in
> > the motivation for changing that part of Core.  Is there a special use case
> > or problem you're focused on (e.g. gossip problems in much larger clusters)?
> >
> > Ta
> >
> > Martin
> >
> > On 16 May 2017 at 20:06, Christopher Meiklejohn
> >  wrote:
> >>
> >> For what it's worth, the Lasp community is looking at doing a fork of
> >> Riak Core replacing all communication with our Partisan library and
> >> moving it completely off of distributed Erlang.  We'd love to hear
> >> from more folks that are interested in this work.
> >>
> >> - Christopher
> >>
> >> On Tue, May 16, 2017 at 6:53 PM, Tom Santero  wrote:
> >> > I'm aware of a few other companies and individuals who are interested in
> >> > continued development and support in a post-Basho world. Ideally the
> >> > community can come together and contribute to a single, canonical fork.
> >> >
> >> > Semi-related, there's a good chance this mailing list won't last much
> >> > longer, either. I'm happy to personally contribute time and resources to
> >> > help maintain the community.
> >> >
> >> > Tom
> >> >
> >> > On Tue, May 16, 2017 at 11:51 AM, Martin Sumner
> >> >  wrote:
> >> >>
> >> >>
> >> >> I've raised an issue with Core today
> >> >> (https://github.com/basho/riak_core/issues/908), related to the claim
> >> >> algorithms.
> >> >>
> >> >> There's a long-read associated with this, which provides a broader
> >> >> analysis of how claim works with the ring:
> >> >>
> >> >>
> >> >>
> >> >> https://github.com/martinsumner/riak_core/blob/mas-claimv2issues/docs/ring_claim.md
> >> >>
> >> >> I believe the long-read explains some of the common mysterious issues
> >> >> which can occur with claim.
> >> >>
> >> >> We're in the process of fixing up the property-based tests for
> >> >> riak_core_claim.erl, and will then be looking to make some improvements
> >> >> to
> >> >> claim v2 to try and pass the improved tests.
> >> >>
> >> >> Big question is though, how can we progress any contribution we make
> >> >> into
> >> >> the Riak codebase?  What is the plan going forward for open-source
> >> >> contributions to Riak?  Do Basho have any contingency plans for
> >> >> smoothly
> >> >> handing over open-source code to the community, before the list of
> >> >> Basho's
> >> >> Github people (https://github.com/orgs/basho/people) who still work at
> >> >> Basho
> >> >> is reduced to zero?
> >> >>
> >> >> Is this something of concern to others?
> >> >>
> >> >> Regards
> >> >>
> >> >> Martin
> >> >>
> >> >>
> >> >> ___
> >> >> riak-users mailing list
> >> >> riak-users@lists.basho.com
> >> >> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> >> >>
> >> >
> >> >

Re: Core Claim and Property-Based Tests

2017-05-17 Thread Martin.Cox
Apologies in advance if this doesn't quite submit correctly to the list.

We [bet365] are very much interested in the continued development of Riak in 
its current incarnation, with Core continuing to be underpinned by distributed 
Erlang. We are very keen to help to build / shape / support the community 
around the project. Internally, we have assembled a team to continue the 
development of Riak, along a roadmap, and are also looking to bring more 
expertise into the business to help support this. Whilst the Lasp / Partisan 
project sounds really interesting, and something that could probably be of 
interest to us in the future, our immediate focus is around stabilising and 
securing the project in its current form. We’re looking to take Riak forward by 
contributing to a renewed community effort.

In summary, we're committed to continuing the development of Riak (we've 
already assembled /  growing a team to do so) and are happy to engage with, and 
support, the community in order to move the project forward.

Thanks

Martin Cox
Software Developer
Hillside (Technology) Limited
e: martin@bet365.com
bet365.com
This email and any files transmitted with it are confidential and contain 
information which may be privileged or confidential and are intended solely to 
be for the use of the individual(s) or entity to which they are addressed. If 
you are not the intended recipient be aware that any disclosure, copying, 
distribution or use of the contents of this information is strictly prohibited 
and may be illegal. If you have received this email in error, please notify us 
by telephone or email immediately and delete it from your system. Activity and 
use of our email system is monitored to secure its effective operation and for 
other lawful business purposes. Communications using this system will also be 
monitored and may be recorded to secure effective operation and for other 
lawful business purposes. Internet emails are not necessarily secure. We do not 
accept responsibility for changes made to this message after it was sent. You 
are advised to scan this message for viruses and we cannot accept liability for 
any loss or damage which may be caused as a result of any computer virus.

This email is sent by a bet365 group entity. The bet365 group includes the 
following entities: Hillside (Shared Services) Limited (registration no. 
3958393), Hillside (Spain New Media) Plc (registration no. 07833226), bet365 
Group Limited (registration no. 4241161), Hillside (Technology) Limited 
(registration no. 8273456), Hillside (Media Services) Limited (registration no. 
9171710), Hillside (Trader Services) Limited (registration no. 9171598) each 
registered in England and Wales with a registered office address at bet365 
House, Media Way, Stoke-on-Trent, ST1 5SZ, United Kingdom; Hillside (Gibraltar) 
Limited (registration no. 97927), Hillside (Sports) GP Limited (registration 
no. 111829) and Hillside (Gaming) GP Limited (registered no. 111830) each 
registered in Gibraltar with a registered office address at Unit 1.1, First 
Floor, Waterport Place, 2 Europort Avenue, Gibraltar; Hillside (UK Sports) LP 
(registration no. 117), Hillside (Sports) LP (registration no. 118), Hillside 
(International Sports) LP (registration no. 119), Hillside (Gaming) LP 
(registration no. 120) and Hillside (International Gaming) LP (registration no. 
121) each registered in Gibraltar with a principal place of business at Unit 
1.1, First Floor, Waterport Place, 2 Europort Avenue, Gibraltar; Hillside 
España Leisure S.A (CIF no. A86340270) registered in Spain with a registered 
office address at C/ Conde de Aranda nº20, 2º, 28001 Madrid, Spain; Hillside 
(Australia New Media) Pty Limited (registration no. 148 920 665) registered in 
Australia with a registered office address at Level 4, 90 Arthur Street, North 
Sydney, NSW 2060, Australia; Hillside (New Media Malta) Limited, (registration 
no c.66039) registered in Malta with a registered office address at Office 
1/2373, Level G, Quantum House, 75 Abate Rigord Street, Ta’ Xbiex XBX 1120, 
Malta and Hillside (New Media Cyprus) Limited, (registration no. HE 361612) 
registered in Cyprus with a registered office address at Omrania Centre, 313, 
28th October Avenue, 3105 Limassol, Cyprus. Hillside (Shared Services) Limited, 
Hillside (Spain New Media) Plc and Hillside (New Media Malta) Limited also have 
places of business at Unit 1.1, First Floor, Waterport Place, 2 Europort 
Avenue, Gibraltar. For residents of Greece, this email is sent on behalf of B2B 
Gaming Services (Malta) Limited (registration number C41936) organised under 
the laws of Malta with a registered office at Apartment 21, Suite 41, Charles 
Court, St. Luke's Road, Pietà, Malta.




Re: Core Claim and Property-Based Tests

2017-05-17 Thread Daniel Abrahamsson
Thanks for the writeup and detailed investigation, Martin.

We ran into these issues a few months ago when we expanded a 5-node cluster
into an 8-node cluster. We ended up rebuilding the cluster and writing a
small escript to verify that the generated riak ring lived up to our
requirements (which were 1: to survive an AZ outage, and 2: to survive any
2 nodes going down at the same time).
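
A check of this kind can be sketched in a few lines. The following is not the escript mentioned above, only a minimal Python illustration under stated assumptions: the ring is given as a list of owning nodes in partition order, `zone_of` is a hypothetical node-to-AZ map, n_val is 3, and "surviving" is taken to mean that at least one replica of every preflist remains reachable.

```python
from typing import Dict, List, Tuple


def preflists(owners: List[str], n_val: int) -> List[List[str]]:
    """Owners of every run of n_val consecutive partitions, wrapping at the end."""
    q = len(owners)
    return [[owners[(i + k) % q] for k in range(n_val)] for i in range(q)]


def check_ring(
    owners: List[str],
    zone_of: Dict[str, str],
    n_val: int = 3,
    max_node_failures: int = 2,
) -> List[Tuple[int, List[str], str]]:
    """Return (index, preflist, reason) for every preflist that breaks a rule.

    Assumed rules (not taken from any Riak tool):
      * more than max_node_failures distinct nodes per preflist, so a replica
        is left when that many nodes are down;
      * replicas spread over at least two AZs, so one AZ outage leaves a replica.
    """
    problems = []
    for i, pl in enumerate(preflists(owners, n_val)):
        if len(set(pl)) <= max_node_failures:
            problems.append((i, pl, "too few distinct nodes"))
        elif len({zone_of[node] for node in pl}) < 2:
            problems.append((i, pl, "all replicas in one AZ"))
    return problems


# Hypothetical 4-partition ring on 3 nodes that wraps badly.
zones = {"n1": "az1", "n2": "az1", "n3": "az2"}
for index, pl, reason in check_ring(["n1", "n2", "n3", "n1"], zones):
    print(index, pl, reason)
```

In practice the owner list would have to be exported from the running cluster; how that is done is outside this sketch.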

This will be a great document to refer to when explaining the subtleties of
setting up a Riak cluster.

//Daniel


Re: Core Claim and Property-Based Tests

2017-05-17 Thread Jon Meredith
Thanks for the excellent writeup.

I have a few notes on your writeup and then a little history to help
explain the motivation for the v3 work.

The Claiming Problem

  One other property of the broader claim algorithm + claimant + handoff
  manager group of processes that's worth mentioning is safety during
  transition.  The cluster should ensure that target N-val copies
  are always available even during transitions.  Much earlier in Riak's
  life the claim would just execute and ownership transfer immediately,
  without putting the data in place (fine, it's eventually consistent, right?),
  but that meant if more than two vnodes in a preference list changed
  ownership then clients would read not found until at least one of the
  objects it was receiving had transferred. The claimant now shepherds those
  transitions so it should be safe.  The solution of transferring the
  data before ownership has fixed the notfound problem, but Riak lost
  agility in adding capacity to the cluster - existing data has to transfer
  to new nodes before they are freed up, and they continue to grow
  while waiting.  In hindsight, Ryan Zezeski's plan of just adding new
  capacity and proxying back to the original vnode is probably a better
  option.

  Predicting load on the cluster is also difficult with the single
  ring with a target n-val set at creation time being used for all
  buckets despite their n-value.  To compute the operations sent to
  each vnode you need to know the proportion of access to each N-value.

  There's also the problem that if a bucket is created with an N-value
  larger than target N all bets are off about the number of physical nodes
  values are written to (*cough* strong consistency N=5).

  Having a partitioning-scheme-per-N-value is one way of sidestepping the
  load prediction and max-N problems.

Proximity of Vnodes

  An alternate solution to the target_n_val problem is to change the way
  fallback partitions are added and apply an additional uniqueness constraint
  as target nodes are added.  That provides safety against multiple node
  failures (although it can potentially cause loading problems).  I think
  you imply this at a couple of points when you talk about 'at runtime'.

Proximity of vnodes as the partition list wraps

  One kludge I considered for solving the wraparound problem is to go from
  a ring to a 'spiral', where you add target_n_val-1 additional
  vnodes that alias the first few vnodes in the ring.

  Using the pathologically bad (vnodes) Q=4, (nodes) S=3, (nval) N=3:
```
  v0 | v1 | v2 | v3
  nA | nB | nC | nA

  p0 = [ {v1, nB} {v2, nC} {v3, nA} ]
  p1 = [ {v2, nC} {v3, nA} {v0, nA} ] <<< Bad
  p2 = [ {v3, nA} {v0, nA} {v1, nB} ] <<< Bad
  p3 = [ {v0, nA} {v1, nB} {v2, nC} ]
```
  You get 2/4 preflists violating target_n_val=3.
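
  The count above can be reproduced with a short Python sketch. It assumes the
  same convention as the listing (preflist p_i is the N vnodes following
  position i, wrapping at the end of the ring); the function names are
  illustrative and are not riak_core APIs.

```python
from typing import List, Tuple


def preflist(owners: List[str], i: int, n_val: int) -> List[Tuple[int, str]]:
    """The n_val (vnode index, node) pairs that follow position i, with wraparound."""
    q = len(owners)
    return [((i + k) % q, owners[(i + k) % q]) for k in range(1, n_val + 1)]


def violations(owners: List[str], n_val: int) -> List[int]:
    """Indexes i whose preflist p_i places two or more replicas on one node."""
    bad = []
    for i in range(len(owners)):
        nodes = [node for _, node in preflist(owners, i, n_val)]
        if len(set(nodes)) < n_val:
            bad.append(i)
    return bad


# Q=4, S=3, N=3 as in the listing: v0..v3 owned by nA, nB, nC, nA.
owners = ["nA", "nB", "nC", "nA"]
print(violations(owners, 3))  # [1, 2], i.e. p1 and p2
```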

  If you extend the ring to allow aliasing (i.e. go beyond 2^160) but
  only use it for assignment:

```
  v0 | v1 | v2 | v3 | v0' | v1'
  nA | nB | nC | nA | nB  | nC

  p0 = [ {v1, nB} {v2, nC}  {v3, nA} ]
  p1 = [ {v2, nC} {v3, nA}  {v0', nB} ]
  p2 = [ {v3, nA} {v0', nB} {v1', nC} ]
  p3 = [ {v0, nA} {v1, nB}  {v2, nC} ]
```
  The additional vnodes can never be hashed directly, just during
  wraparound.
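
  One possible reading of the 'spiral' can be sketched in Python as follows.
  This is only an illustration of the idea, not an implemented riak_core
  feature. The greedy rule for choosing alias owners (the first node not seen
  in the previous N-1 slots) and both function names are assumptions made for
  the example.

```python
from typing import List


def extend_ring(owners: List[str], nodes: List[str], n_val: int) -> List[str]:
    """Append n_val-1 alias slots (v0', v1', ...) after the real ring.

    Each alias slot is given to the first node in `nodes` that does not own
    any of the previous n_val-1 slots. Raises StopIteration if no such node
    exists, which needs at least n_val nodes to avoid.
    """
    extended = list(owners)
    for _ in range(n_val - 1):
        recent = set(extended[-(n_val - 1):])
        extended.append(next(n for n in nodes if n not in recent))
    return extended


def spiral_preflist(extended: List[str], q: int, i: int, n_val: int) -> List[str]:
    """Owners for p_i: start after position i and run into the alias slots
    instead of wrapping. Only the start position wraps, so aliases are never
    hashed to directly."""
    start = (i + 1) % q
    return extended[start:start + n_val]


owners = ["nA", "nB", "nC", "nA"]  # v0..v3 from the listing
extended = extend_ring(owners, ["nA", "nB", "nC"], 3)
print(extended)  # ['nA', 'nB', 'nC', 'nA', 'nB', 'nC']
for i in range(len(owners)):
    print(i, spiral_preflist(extended, len(owners), i, 3))
```

  With these assumptions the alias slots come out as nB and nC, matching the
  listing, and every preflist lands on three distinct nodes.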


As you say, the v3 algorithm was written (by me) a long time ago and
never made it to production.  That was due to a few factors: partly
the non-determinism, partly because I didn't like the (very stupid)
optimization system tying up the claimant node for multiple seconds,
but more troublingly, when we did some commissioning tests for a large
customer that ran with a ring size of 256 with 60 nodes, we experienced
a performance drop of around 5% when the cluster was maxed out for
reads.  The diversity measurements were much 'better', in that the
v3-claimed cluster was far more diverse and performed better during
node failures, but the unexplained drop, and the (unproven) fear that it
came from having a greater number of saturated disterl connections
between nodes, stopped me from promoting it to default.

The reason the v3 algorithm was created was to resolve problems with
longer-lived clusters created with the v2 claim that had had nodes
added and removed over time.  I don't remember all the details now,
but I think the cluster had a ring size of 1024 (to future proof,
as no 2I/listkey on that cluster) and somewhere between 15-30 nodes.

In that particular configuration, the v2 algorithm had left the original
sequential node assignment (n1, n2, ..., n15, n1, n2, ...) and assigned
new nodes in place, but that left many places where the original sequential
assignments still existed.

What we hadn't realized at the time is that sequential node assignment
is the *worst* possible plan for handling fallback load.

With N=3, if a node goes down, all of the responsibility for that
node is shifted to a single other node in the cluster.

n1 | n2 | n3 | n4 | n1 | n2 | n3 | n4    (Q=8, S=4, TargetN=4)

Partition   All Up n4 down
(position)
0   n2 n3 n4   n2

Re: Core Claim and Property-Based Tests

2017-05-17 Thread andrei zavada
> ... before the list of Basho's Github people 
> (https://github.com/orgs/basho/people) who still work at Basho is reduced to 
> zero?

Just a note on that list: these are the (few) people who took the
trouble to flip the visibility of their membership in their profiles.
Github seems to have changed the default to be "private", which means
others simply won't see the organisation(s) a user belongs to.



Re: Core Claim and Property-Based Tests

2017-05-17 Thread Martin Sumner
Jon,

Many thanks for taking the time to look at this.  You've given me lots to
think about, so I will take some time before updating my write-up to take
account of your feedback.

I need to go back and look at the safe transfers issues.  I spent some time
trying to work out how the claimant transitions from having a plan to
committing the plan, and just kept getting lost in the code ... and put it
to one side.  I will be brave and dive back in, and have a look at the
simulator as well.

The time it takes to expand a cluster, and the cost of that expansion on
the existing nodes in the cluster (both during the fold, and the legacy of
the page-cache impact after the fold) is something we're worried about.
One of the motivations behind the leveled backend was perhaps to have an
alternative to handoff/fold_objects when moving vnodes, whereby the backend
could just ship the WAL files instead (and the receiving vnode on the
joining node would rebuild the KeyStore from the shipped WAL files).
Perhaps a vnode proxy solution might be better.

With the physical run-time promises I wasn't thinking of anything clever
with fallback that necessarily guaranteed in more cases that it would be
written to two physical nodes, but just that the writing client could be
aware when it has been written to two physical nodes even when two primary
vnodes are unavailable in the preflist.  Currently the NHS uses pw=2 as a
proxy for guaranteeing something has been written to two physical nodes,
but this kills availability on two nodes failing, even though as
target_n_val is 4 the PUT coordinator may be able to wait and confirm that
it was written to two physical nodes in all cases.
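
The difference between the two promises can be sketched as follows. This is a minimal Python illustration, not Riak's PUT coordinator: the ack tuples `(partition, node, is_primary)` and both function names are hypothetical, chosen only to show a pw-style check next to a physical-node check.

```python
from typing import Iterable, Tuple

Ack = Tuple[int, str, bool]  # (partition index, physical node, is_primary)


def satisfies_pw(acks: Iterable[Ack], pw: int = 2) -> bool:
    """True if at least pw acknowledgements came from primary vnodes."""
    return sum(1 for _, _, is_primary in acks if is_primary) >= pw


def on_distinct_nodes(acks: Iterable[Ack], required_nodes: int = 2) -> bool:
    """True if the acknowledgements cover at least required_nodes physical
    nodes, whether they came from primaries or fallbacks."""
    return len({node for _, node, _ in acks}) >= required_nodes


# Two primaries are down: one primary ack plus two fallback acks.
acks = [(0, "n3", True), (1, "n4", False), (2, "n1", False)]
print(satisfies_pw(acks))       # False: only one primary answered
print(on_distinct_nodes(acks))  # True: three physical nodes hold the write
```

In this example a pw=2 request would fail even though the value has reached more than two physical nodes, which is the availability cost described above.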

I'm going to read the jump consistent hash paper - it looks interesting.
Thoughts about radical changes that will support things like AZ-awareness
are shoved to the back of my mind at present, as even if they are possible,
it feels like transitioning old clusters to a radically changed ring (or
even ringless) algorithm would just be too hard.  One thing related to
this is that we're assuming any future open-source full-sync type
replication will need to be de-coupled from the need to have consistent
ring-sizes (or perhaps even consistent ways of calculating the ring).  As
the in-cluster ring-resizing is now dropped as a documented feature, having
only ever been an experimental one, the only real option will be to
migrate to a new cluster with a different ring-size through replication.


Thanks again

Martin

On 17 May 2017 at 16:34, Jon Meredith  wrote:

>
> Thanks for the excellent writeup.
>
> I have a few notes on your writeup and then a little history to help
> explain the motivation for the v3 work.

Re: Core Claim and Property-Based Tests

2017-05-17 Thread Matt Davis
I don't contribute to this list as much as I lurk in #riak (craque), but
it's really great to see this kind of community support somewhere,
especially at a large place that is heavily invested in riak itself.

I have considered posting some of the operational lessons I've learned over
the past five years on riak-based systems. If there will be an organized
effort around these types of things, I'm here to help and would love to be
involved as well.

-matt


On Wed, May 17, 2017 at 3:19 AM,  wrote:

> Apologies in advance if this doesn't quite submit correctly to the list.
>
> We [bet365] are very much interested in the continued development of Riak
> in its current incarnation, with Core continuing to be underpinned by
> distributed Erlang. We are very keen to help to build / shape / support the
> community around the project. Internally, we have assembled a team to
> continue the development of Riak, along a roadmap, and are also looking to
> bring more expertise into the business to help support this. Whilst the
> Lasp / Partisan project sounds really interesting, and something that could
> probably be of interest to us in the future, our immediate focus is around
> stabilising and securing the project in its current form. We’re looking to
> take Riak forward by contributing to a renewed community effort.
>
> In summary, we're committed to continuing the development of Riak (we've
> already assembled /  growing a team to do so) and are happy to engage with,
> and support, the community in order to move the project forward.
>
> Thanks
>
> Martin Cox
> Software Developer
> Hillside (Technology) Limited
> e: martin@bet365.com
> bet365.com