It seems to me a matter of definition of "available replicas". At Wed, 16 Mar 2016 14:13:48 +1300, Thomas Munro <thomas.mu...@enterprisedb.com> wrote in <CAEepm=3Ye+Ax_5=mzehmkx9dfn25qorzs362sqgnvgcvwx+...@mail.gmail.com> > <para> > Synchronous replication offers the ability to confirm that all changes > - made by a transaction have been transferred to one synchronous standby > - server. This extends the standard level of durability > + made by a transaction have been transferred to one or more > synchronous standby > + server. This extends that standard level of durability > offered by a transaction commit. This level of protection is referred > - to as 2-safe replication in computer science theory. > + to as group-safe replication in computer science theory. > </para> > > A message on the -general list today pointed me to some earlier > discussion[1] which quoted and referenced definitions of these > academic terms[2]. I think the above documentation should say: > > "This level of protection is referred to as 2-safe replication in > computer science literature when <variable>synchronous_commit</> is > set to <literal>on</>, and group-1-safe (group-safe and 1-safe) when > <variable>synchronous_commit</> is set to <literal>remote_write</>."
I suppose that the "available replica" on the paper is equivalent to "one choosen synchronous server" at the top of the queue of living standbys specified by s_s_names. The original description is true based on this interpretation. > By my reading, the situation doesn't actually change with this patch. > It doesn't matter whether you need 1 or 42 synchronous standbys to > make a quorum: 2-safe means durable (fsync) on all of them, > group-1-safe means durable on one server and received (implied by > remote_write) by all of them. Likewise, "the first two of the living standbys" (2[r01, ..r42]) and the master is translated to "three replicas". So it keeps 2-safe for the case. > I think we should be using those definitions because Gray's earlier > definition of 2-safe from Transaction Processing 12.6.3 doesn't really > fit: It can optionally mean remote receipt or remote durable storage, > but it doesn't wait if the 'backup' is down, so it's not the same type > of guarantee. (He also has 'very safe' which might describe our > syncrep, I'm not sure.) If the discussion above is true, the description doesn't seem to need to be amended in the view of the safe-criteria. > <para> > Synchronous replication offers the ability to confirm that all changes > - made by a transaction have been transferred to one synchronous standby > - server. This extends the standard level of durability > + made by a transaction have been transferred to one or more synchronous > standby > + server. This extends that standard level of durability > offered by a transaction commit. This level of protection is referred > to as 2-safe replication in computer science theory. > </para> But some additional explanation might be needed. For the true quorum commit, a client will be notified when the master and any n of all standbys have committed. This won't fit exactly to the criterias in the paper. In regard to Gray's definition, "2-safe" looks to be PG's syncrep with automatic release mechanism, such like what pgsql-RA offers. And "high availability" doesn't seem to fit to PostgreSQL's behavior because the master virtually commits a transaction before making an agreement to commit among all of replicas. # I'm reading it in Japanese so some words may be incorrect. Thoughts? > [1] > http://www.postgresql.org/message-id/603c8f070812132142n5408e7ddk899e83cddd4cb...@mail.gmail.com > [2] http://infoscience.epfl.ch/record/33053/files/EPFL_TH2577.pdf page 76 > > On Thu, Mar 10, 2016 at 11:21 PM, Fujii Masao <masao.fu...@gmail.com> wrote: > > On Fri, Mar 4, 2016 at 3:40 AM, Masahiko Sawada <sawada.m...@gmail.com> > > wrote: > >> On Thu, Mar 3, 2016 at 11:30 PM, Masahiko Sawada <sawada.m...@gmail.com> > >> wrote: > >>> Hi, > >>> > >>> Thank you so much for reviewing this patch! > >>> > >>> All review comments regarding document and comment are fixed. > >>> Attached latest v14 patch. > >>> > >>>> This accepts 'abc^Id' as a name, which is wrong behavior (but > >>>> such appliction names are not allowed anyway. If you assume so, > >>>> I'd like to see a comment for that.). > >>> > >>> 'abc^Id' is accepted as application_name, no? > >>> postgres(1)=# set application_name to 'abc^Id'; > >>> SET > >>> postgres(1)=# show application_name ; > >>> application_name > >>> ------------------ > >>> abc^Id > >>> (1 row) > >>> > >>>> addlit_xd_string(char *ytext) and addlitchar_xd_string(unsigned > >>>> char ychar) requires differnt character types. Is there any reason > >>>> for that? > >>> > >>> Because addlit_xd_string() is for adding string(char *) to xd_string, > >>> OTOH addlit_xd_char() is for adding just one character to xd_string. > >>> > >>>> I personally don't like addlit*string() things for such simple > >>>> syntax but itself is acceptble enough for me. However it uses > >>>> StringInfo to hold double-quoted names, which pallocs 1024 bytes > >>>> of memory chunk for every double-quoted name. The chunks are > >>>> finally stacked up left uncollected until the current > >>>> memorycontext is deleted or reset (It is deleted just after > >>>> finishing config file processing). Addition to that, setting > >>>> s_s_names runs the parser twice. It seems to me too greedy and > >>>> seems that static char [NAMEDATALEN] is enough using the v12 way > >>>> without palloc/repalloc. > >>> > >>> I though that length of group name could be more than NAMEDATALEN, so > >>> I use StringInfo. > >>> Is it not necessary? > >>> > >>>> I found that the name SyncGroupName.wait_num is not > >>>> instinctive. How about sync_num, sync_member_num or > >>>> sync_standby_num? If the last is preferable, .members also should > >>>> be .standbys . > >>> > >>> Thanks, sync_num is preferable to me. > >>> > >>> === > >>>> I am quite uncomfortable with the existence of > >>>> WanSnd.sync_standby_priority. It represented the pirority in the > >>>> old linear s_s_names format but nested groups or even > >>>> single-level quarum list obviously doesn't fit it. Can we get rid > >>>> of sync_standby_priority, even though we realize atmost > >>>> n-priority for now? > >>> > >>> We could get rid of sync_standby_priority. > >>> But if so, we will not be able to see the next sync standby in > >>> pg_stat_replication system view. > >>> Regarding each node priority, I was thinking that standbys in quorum > >>> list have same priority, and in nested group each standbys are given > >>> the priority starting from 1. > >>> > >>> === > >>>> The function SyncRepGetSyncedLsnsUsingPriority doesn't seem to > >>>> have specific code for every prioritizing method (which are > >>>> priority, quorum, nested and so). Is there any reson to use it as > >>>> a callback of SyncGroupNode? > >>> > >>> The reason why the current code is so is that current code is for only > >>> priority method supporting. > >>> At first version of this feature, I'd like to implement it more simple. > >>> > >>> Aside from this, of course I'm planning to have specific code for nested > >>> design. > >>> - The group can have some name nodes or group nodes. > >>> - The group can use either 2 types of method: priority or quorum. > >>> - The group has SyncRepGetSyncedLsnFn() and SyncRepGetStandbysFn() > >>> - SyncRepGetSyncedLsnsFn() function recursively determine synced LSN > >>> at that moment using group's method. > >>> - SyncRepGetStandbysFn() function returns standbys of its group, > >>> which are considered as sync using group's method. > >>> > >>> For example, s_s_name = '3(a, b, 2[c,d]::group1)', SyncRepStandbys > >>> memory structure will be, > >>> > >>> "main(quorum)" --- "a" > >>> | > >>> -- "b" > >>> | > >>> -- "group1(priority)" --- "c" > >>> | > >>> -- "d" > >>> > >>> When determine synced LSNs, we need to consider group1's LSN using by > >>> priority method at first, and then we can determine main's LSN using > >>> by quorum method with "a" LSNs, "b" LSNs and "group1" LSNs. > >>> So SyncRepGetSyncedLsnsUsingPriority() function would be, > >>> > >>> bool > >>> SyncRepGetSyncedLsnsUsingPriority(*group, *write_lsn, *flush_lsn) > >>> { > >>> sync_num = group->SynRepGetSyncstandbysFn(group, sync_list); > >>> > >>> if (sync_num < group->sync_num) > >>> return false; > >>> > >>> for (each member of sync_list) > >>> { > >>> if (member->type == group node) > >>> call SyncRepGetSyncedLsnsFn(member, w, f) and store w and > >>> f into lsn_list. > >>> else > >>> Store name node LSNs into lsn_list. > >>> } > >>> > >>> Determine synced LSNs of this group using lsn_list and priority > >>> method. > >>> Store synced LSNs into write_lsn and flush_lsn. > >>> return true; > >>> } > >>> > >>>> SyncRepClearStandbyGroupList is defined in syncrep.c but the > >>>> other related functions are defined in syncgroup_gram.y. It would > >>>> be better to place them together. > >>> > >>> SyncRepClearStandbyGroupList() is used by > >>> check_synchronous_standby_names(), so I put this function syncrep.c. > >>> > >>>> SyncRepStandbys are to be in multilevel and the struct is > >>>> naturally allowed to be so but SyncRepClearStandbyGroupList > >>>> assumes it in single level. > >>> > >>> Because I think that we don't need to implement to fully support > >>> nested style at first version. > >>> We have to carefully design this feature while considering > >>> expandability, but overkill implementation could be cause of crash. > >>> Consider remaining time for 9.6, I feel we could implement quorum > >>> method at best. > >>> > >>>> This is a comment from the aspect of abstractness of objects. > >>>> The callers of SyncRepGetSyncStandbysUsingPriority() need to care > >>>> the inside of SyncGroupNode but what the function should just > >>>> return seems to be the list of wansnds element. Element number is > >>>> useless when the SyncGroupNode nests. > >>>> > int > >>>> > SyncRepGetSyncStandbysUsingPriority(SyncGroupNode *group, volatile > >>>> > WalSnd **sync_list) > >>>> This might need to expose 'volatile WalSnd*' (only pointer type) > >>>> outside of walsender. > >>>> Or it should return the list of index number of > >>>> *WalSndCtl->walsnds*. > >>> > >>> SyncRepGetSyncStandbysUsingPriority() already returns the list of > >>> index number of "WalSndCtl->walsnd" as sync_list, no? > >>> As I mentioned above, SyncRepGetSyncStandbysFn() doesn't need care the > >>> inside of SyncGroupNode in my design. > >>> Selecting sync nodes from its group doesn't depend on the type of node. > >>> What SyncRepGetSyncStandbyFn() should do is to select sync node from > >>> *its* group. > >>> > >> > >> Previous patch has bug around GUC parameter handling. > >> Attached updated version. > > > > Thanks for updating the patch! > > > > Now I'm fixing some problems (e.g., current patch doesn't work > > with EXEC_BACKEND environment) and revising the patch. > > I will post the revised version this weekend or the first half > > of next week. -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers