Background

Pulsar has two relevant features:

   - The first feature allows users to set group and rack information for
     bookies using pulsar-admin bookies set-bookie-rack.

Here, users assign bookie1 through bookie5 to the default group and bookie6
through bookie10 to the share group. They do not care about rack
information; they only care about which group each bookie belongs to.

default={bookie1:3181=BookieInfoImpl(rack=default-rack,
hostname=null), bookie2:3181=BookieInfoImpl(rack=default-rack,
hostname=null), bookie3:3181=BookieInfoImpl(rack=default-rack,
hostname=null), bookie4:3181=BookieInfoImpl(rack=default-rack,
hostname=null), bookie5:3181=BookieInfoImpl(rack=default-rack,
hostname=null)}

_shared_={bookie6:3181=BookieInfoImpl(rack=default-rack,
hostname=null), bookie7:3181=BookieInfoImpl(rack=default-rack,
hostname=null), bookie8:3181=BookieInfoImpl(rack=default-rack,
hostname=null), bookie9:3181=BookieInfoImpl(rack=default-rack,
hostname=null), bookie10:3181=BookieInfoImpl(rack=default-rack,
hostname=null)}


   - The second feature allows users to set the traffic priority for a
     namespace: traffic is directed to the primary group first and then to
     the secondary group. Users set this priority using pulsar-admin
     ns-isolation-policy set --namespaces public/default --primary "group"
     --secondary "group".

Here, users set the primary group of the /public/default namespace to
"share" using a command.

{
  "bookkeeperAffinityGroupPrimary" : "share"
}

Once this is configured, users expect all traffic under the /public/default
namespace to be directed to bookie6 through bookie10 in the "share" group.

Drawbacks

After a period of time, users added new bookies [bk11, bk12, bk13, bk14,
bk15] to the bookie cluster and found that some traffic under the
/public/default namespace was directed to the newly added machines. After
investigation, we found that this was caused by a defect in the working
mechanism of bookkeeperAffinityGroupPrimary.

*How bookkeeperAffinityGroupPrimary works*

All bookies in the cluster: bk1-bk15.

The broker picks bookies in the following steps:

   1. Get the bookie rack info config: default: [bk1, bk2, bk3, bk4, bk5];
      share: [bk6, bk7, bk8, bk9, bk10].
   2. Exclude the bookies that are in a group other than the
      bookkeeperAffinityGroupPrimary (share).
   3. Exclude the default group bookies [bk1, bk2, bk3, bk4, bk5].
   4. Pick bookies from the remaining bookies [bk6, bk7, bk8, bk9, bk10,
      bk11, bk12, bk13, bk14, bk15].

Therefore, some traffic may go to bk11-bk15, which is not what the users
expect. The reason is that the new bookies, bk11 to bk15, did not have rack
information set and were not part of any group.
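The steps above can be illustrated with a minimal Python sketch. It is a
simplified model of the exclusion logic as described in this email, not the
broker's actual code; the function name pick_candidates and the data layout
are assumptions made for illustration.

```python
from typing import Dict, List


def pick_candidates(
    all_bookies: List[str],
    rack_info: Dict[str, List[str]],
    primary_group: str,
) -> List[str]:
    """Model of the current (non-strict) behavior described above.

    Hypothetical helper: only bookies that belong to a *known* group other
    than the primary group are excluded. Bookies with no rack information
    are in no group, so they are never excluded.
    """
    excluded = set()
    for group, members in rack_info.items():
        if group != primary_group:
            excluded.update(members)
    return [b for b in all_bookies if b not in excluded]


# Cluster after bk11-bk15 were added without rack information.
all_bookies = [f"bk{i}" for i in range(1, 16)]
rack_info = {
    "default": [f"bk{i}" for i in range(1, 6)],
    "share": [f"bk{i}" for i in range(6, 11)],
}

candidates = pick_candidates(all_bookies, rack_info, "share")
print(candidates)  # bk6 through bk15: the ungrouped bookies remain eligible
```

Because the filter works by exclusion rather than inclusion, any bookie the
rack configuration does not mention stays in the candidate list.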

As a workaround, we advised users to set the rack information for bk11 to
bk15 with pulsar-admin bookies set-bookie-rack before starting them. After
users adopted this workaround, traffic was routed as expected.

For users, this is inconvenient because they must set rack information
before bringing new bookies online. In scenarios with strict limitations on
traffic, problems can occur if the bookie operations and maintenance
personnel overlook this step.

Improvement

I would like to introduce a new configuration option, strict, for
bookkeeperAffinityGroupPrimary. Its default value is false, which means that
for existing users upgrading to the new version the logic remains the same
and bookies without rack information are not constrained.

Users can set strict to true with the command pulsar-admin
ns-isolation-policy set --namespaces public/default --primary "group"
--secondary "group" --strict true. When strict is true and the broker
selects bookies, it will only choose from the bookies in the primary group.
If there are not enough bookies in the primary group, it will choose from
the bookies in the secondary group. If there are not enough bookies in
either group, an exception will be thrown. Bookies whose rack information
has not been set with pulsar-admin bookies set-bookie-rack will not be
selected.
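The proposed strict behavior can be sketched in Python as follows. This is
only an illustration under stated assumptions: the names pick_strict and
NotEnoughBookiesError are hypothetical, the sketch assumes that the secondary
group tops up the primary group when the primary group alone is too small,
and it picks the first eligible bookies deterministically, whereas a real
placement policy would apply its own selection rules.

```python
from typing import Dict, List, Optional, Set


class NotEnoughBookiesError(Exception):
    """Hypothetical exception for the 'not enough bookies' case."""


def pick_strict(
    rack_info: Dict[str, List[str]],
    available: Set[str],
    primary: str,
    secondary: Optional[str],
    needed: int,
) -> List[str]:
    """Select `needed` bookies using inclusion instead of exclusion.

    Only bookies listed in the primary or secondary group are eligible, so
    bookies without rack information can never be selected.
    """
    pool = [b for b in rack_info.get(primary, []) if b in available]
    if len(pool) < needed and secondary is not None:
        # Assumption: the secondary group supplements the primary group.
        pool += [b for b in rack_info.get(secondary, []) if b in available]
    if len(pool) < needed:
        raise NotEnoughBookiesError(
            f"need {needed} bookies, only {len(pool)} eligible"
        )
    return pool[:needed]


rack_info = {
    "default": [f"bk{i}" for i in range(1, 6)],
    "share": [f"bk{i}" for i in range(6, 11)],
}
available = {f"bk{i}" for i in range(1, 16)}  # bk11-bk15 have no rack info

from_primary = pick_strict(rack_info, available, "share", "default", 3)
with_secondary = pick_strict(rack_info, available, "share", "default", 7)
print(from_primary)
print(with_secondary)
```

With these inputs, a request for more bookies than the two groups contain
raises the exception instead of falling back to bk11-bk15.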

Compatibility

When users upgrade from the old version to the new version, the working
mechanism of bookkeeperAffinityGroupPrimary remains the same as before,
because strict defaults to false. If users upgrade to the new version, set
strict to true using the command pulsar-admin ns-isolation-policy set
--namespaces public/default --primary "group" --secondary "group" --strict
true, and then roll back to the old version, the broker should still be able
to parse the ns-isolation-policy configuration correctly.
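The rollback requirement amounts to the old version tolerating a field it
does not know. A minimal Python sketch of such tolerant parsing is shown
below. The JSON key "strict" and the key "bookkeeperAffinityGroupSecondary"
are assumptions made for illustration, as are the class and function names;
the stored format of the new field is not specified in this email.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class AffinityGroupPolicy:
    """Fields an old broker version is assumed to know about."""

    primary: Optional[str] = None
    secondary: Optional[str] = None


def parse_policy_old_version(raw: str) -> AffinityGroupPolicy:
    """Read only the known keys and silently ignore everything else."""
    data = json.loads(raw)
    return AffinityGroupPolicy(
        primary=data.get("bookkeeperAffinityGroupPrimary"),
        secondary=data.get("bookkeeperAffinityGroupSecondary"),
    )


# Policy written by the new version, including the hypothetical "strict" key.
new_version_json = '{"bookkeeperAffinityGroupPrimary": "share", "strict": true}'

policy = parse_policy_old_version(new_version_json)
print(policy)
```

After a rollback, the policy parses without error, but the strict behavior
is no longer enforced because the old logic does not read that field.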
