Hi Robert,

Please see below.

On 06.04.2020 17:23, Robert Altnoeder wrote:
> On 06 Apr 2020, at 10:17, Volodymyr Litovka <[email protected]> wrote:

>> To avoid this, I'd propose to add an additional layer, like a proxy, which would:
>>
>> - reside on every satellite
>> - receive data over unicast
>>   ** thus, the DRBD code would need only minimal changes (now it sends unicast
>>   data to 1+ neighbors; after the change it would send the same unicast to a
>>   single neighbor)
>>   ** to minimize delay - use local sockets
>> - resend it over multicast
>> - but manage control traffic (e.g. acknowledgments from remote peers)
>>   over unicast
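For illustration only, a minimal user-space sketch of that relay idea in Python. The addresses and ports are made-up placeholders, DRBD itself speaks TCP rather than UDP, and a real relay would also need the control-traffic handling described above:

```python
import socket

# Hypothetical addresses -- placeholders, not part of any real DRBD setup.
LOCAL_ADDR = ("127.0.0.1", 17000)   # where the local peer sends unicast data
MCAST_GROUP = ("239.1.1.1", 7000)   # multicast group the other satellites join

def make_relay_sockets(local_addr=LOCAL_ADDR, ttl=1):
    """Create the unicast receive socket and the multicast send socket."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(local_addr)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return rx, tx

def relay_one(rx, tx, dst=MCAST_GROUP, bufsize=65536):
    """Receive one unicast datagram and re-send it to the multicast group."""
    data, _peer = rx.recvfrom(bufsize)
    tx.sendto(data, dst)
    return len(data)
```

The sketch also makes the structural cost visible: every datagram crosses the kernel/user-space boundary twice more than it would if DRBD sent multicast itself.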
> This would probably still require many changes in the DRBD kernel module, add
> another layer of complexity and another component that can fail independently,
> and make the system as a whole harder to maintain and troubleshoot.
>
> Delay would probably also be rather unpredictable, because different threads in
> kernel and user space must be activated and paused frequently for the IPC to
> work, and Linux, as a monolithic kernel, does not offer any specialized
> mechanisms for direct low-latency context switches/thread activation in a chain
> of I/O servers, like the mechanisms found in most microkernels, or at least
> something in the same general direction, e.g. "door calls" in the SunOS kernel
> (the kernel of the Solaris OS).
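For what it's worth, the cost and jitter of such chained context switches can be felt with a crude ping-pong micro-benchmark between two processes (the round count is arbitrary; the results vary heavily with scheduler load, which is exactly the unpredictability issue):

```python
import os
import socket
import time

def socketpair_rtt(rounds=1000):
    """Mean round-trip time of a 1-byte ping-pong between two processes over a
    Unix socketpair; each round trip forces at least two context switches."""
    parent, child = socket.socketpair()
    pid = os.fork()
    if pid == 0:                       # child: echo every byte back
        parent.close()
        for _ in range(rounds):
            child.sendall(child.recv(1))
        child.close()
        os._exit(0)
    child.close()
    start = time.perf_counter()
    for _ in range(rounds):            # parent: send, then wait for the echo
        parent.sendall(b"x")
        parent.recv(1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    parent.close()
    return elapsed / rounds            # seconds per round trip
```

On an otherwise idle machine this tends to land in the microseconds range per round trip, but tail latencies under load are much worse, and each hop through a proxy adds at least one such round trip to the write path.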

Well, I fully believe what you're saying, and my attempt to find "a better solution" doesn't look too convincing :-)

> Multicast in DRBD would certainly make sense in various scenarios, but it would
> probably have to be implemented directly in DRBD.

Nice to hear this ;-)

> Anyway, I don’t see that much difference between diskless nodes and nodes with
> storage. Any one of these nodes always sends write requests to all connected
> storage nodes; the only difference with diskless nodes is that they also use
> the replication link for reading data, which storage nodes rather do locally
> (load balancing may cause read requests over the network too). So the only
> thing that would make write performance on a diskless node worse than write
> performance on a node with local storage would be network saturation due to
> lots of read requests putting load on the network.

I see a somewhat different picture, in fact. I'm using a VM (its disk is /dev/drbd/by-res/m1/0) to produce load. I launch the test VM (using virsh) on different nodes and get the corresponding resource usage like this:

# linstor resource list
╭─────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊    State ┊
╞═════════════════════════════════════════════════════════╡
┊ m1           ┊ stor1 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊
┊ m1           ┊ stor2 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊
┊ m1           ┊ stor3 ┊ 7000 ┊ InUse  ┊ Ok    ┊ Diskless ┊
╰─────────────────────────────────────────────────────────╯
# linstor resource list-volumes
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node  ┊ Resource ┊ StoragePool          ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊    State ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ stor1 ┊ m1       ┊ drbdpool             ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 50.01 GiB ┊ Unused ┊ UpToDate ┊
┊ stor2 ┊ m1       ┊ drbdpool             ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 50.01 GiB ┊ Unused ┊ UpToDate ┊
┊ stor3 ┊ m1       ┊ DfltDisklessStorPool ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊           ┊ InUse  ┊ Diskless ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯

When m1 is InUse on stor2 (the "disk" node) and I launch 'dd' there, I see the following tcpdump output on the host (stor2); note that tcpdump resolves port 7000 to its /etc/services name, afs3-fileserver:

# tcpdump -i eno4 'src host stor2 and dst port 7000'
[ ... lots of similar packets to stor1 and no packets to stor3 ... ]
18:59:22.640966 IP stor2.39897 > stor1.afs3-fileserver: Flags [.], seq 1073298560:1073363720, ack 1, win 18449, options [nop,nop,TS val 2730290186 ecr 2958297760], length 65160
18:59:22.641495 IP stor2.39897 > stor1.afs3-fileserver: Flags [P.], seq 1073363720:1073426344, ack 1, win 18449, options [nop,nop,TS val 2730290186 ecr 2958297761], length 62624
18:59:22.642053 IP stor2.39897 > stor1.afs3-fileserver: Flags [.], seq 1073426344:1073491504, ack 1, win 18449, options [nop,nop,TS val 2730290187 ecr 2958297761], length 65160
18:59:22.642606 IP stor2.39897 > stor1.afs3-fileserver: Flags [.], seq 1073491504:1073556664, ack 1, win 18449, options [nop,nop,TS val 2730290187 ecr 2958297761], length 65160

When m1 is InUse on stor3 (the "diskless" node) and I launch 'dd' there, I see the following tcpdump output on the host (stor3):

# tcpdump -i eno4 'src host stor3 and dst port 7000'
[ ... lots of similar packets to both stor1 and stor2 ... ]
19:05:56.451425 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069351304:1069416464, ack 16425, win 11444, options [nop,nop,TS val 3958888538 ecr 1765301734], length 65160
19:05:56.452077 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069416464:1069481624, ack 16425, win 11444, options [nop,nop,TS val 3958888539 ecr 1765301735], length 65160
19:05:56.452664 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069481624:1069546784, ack 16425, win 11444, options [nop,nop,TS val 3958888540 ecr 1765301736], length 65160
19:05:56.453324 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1071808472:1071873632, ack 61481, win 6365, options [nop,nop,TS val 1547141177 ecr 2878616029], length 65160
19:05:56.454142 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1071873632:1071938792, ack 61481, win 6365, options [nop,nop,TS val 1547141177 ecr 2878616029], length 65160
19:05:56.454926 IP stor3.40171 > stor2.afs3-fileserver: Flags [P.], seq 1071938792:1072002920, ack 61481, win 6365, options [nop,nop,TS val 1547141178 ecr 2878616030], length 64128
19:05:56.455700 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072002920:1072068080, ack 61481, win 6365, options [nop,nop,TS val 1547141179 ecr 2878616031], length 65160
19:05:56.456490 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069546784:1069611944, ack 16425, win 11444, options [nop,nop,TS val 3958888543 ecr 1765301739], length 65160
19:05:56.457121 IP stor3.59577 > stor1.afs3-fileserver: Flags [P.], seq 1069611944:1069676072, ack 16425, win 11444, options [nop,nop,TS val 3958888544 ecr 1765301740], length 64128
19:05:56.457730 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069676072:1069741232, ack 16425, win 11444, options [nop,nop,TS val 3958888545 ecr 1765301741], length 65160
19:05:56.458292 IP stor3.59577 > stor1.afs3-fileserver: Flags [.], seq 1069741232:1069806392, ack 16425, win 11444, options [nop,nop,TS val 3958888546 ecr 1765301741], length 65160
19:05:56.458939 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072068080:1072133240, ack 61481, win 6365, options [nop,nop,TS val 1547141182 ecr 2878616034], length 65160
19:05:56.459735 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072133240:1072198400, ack 61481, win 6365, options [nop,nop,TS val 1547141182 ecr 2878616034], length 65160
19:05:56.460598 IP stor3.40171 > stor2.afs3-fileserver: Flags [P.], seq 1072198400:1072261048, ack 61481, win 6365, options [nop,nop,TS val 1547141183 ecr 2878616035], length 62648
19:05:56.461097 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072261048:1072326208, ack 61481, win 6365, options [nop,nop,TS val 1547141184 ecr 2878616035], length 65160
19:05:56.461633 IP stor3.40171 > stor2.afs3-fileserver: Flags [.], seq 1072326208:1072391368, ack 61481, win 6365, options [nop,nop,TS val 1547141184 ecr 2878616036], length 65160

And, using bmon (a realtime traffic-rate monitor for Linux), I always see about 1 Gbps on the originating host and:

- in the 1st case (originator is the "disk" node): about 1 Gbps on the receiving host
- in the 2nd case (originator is the "diskless" node): about 500 Mbps on each of the receiving hosts

From what I see I conclude that in the 1st case (originator is the "disk" node) a single copy of the replicated data travels through the network, while in the 2nd case (originator is the "diskless" node) two copies travel through the network.
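That matches simple arithmetic: a primary pushes one full copy of every write to each connected storage peer, so all the copies share the originating host's uplink. A back-of-the-envelope check (the 1000 Mbps figure is just the observed link speed from above):

```python
def per_peer_write_rate(link_mbps, storage_peers):
    """Writes on a DRBD primary fan out to every connected storage peer,
    so the peers share the originating node's link bandwidth."""
    return link_mbps / storage_peers

# Disk node stor2 replicates over the network only to stor1
# (its own copy is written locally):
print(per_peer_write_rate(1000, 1))   # -> 1000.0 Mbps to the single peer

# Diskless node stor3 replicates to both stor1 and stor2:
print(per_peer_write_rate(1000, 2))   # -> 500.0 Mbps to each peer
```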

Thank you.


--
Volodymyr Litovka
  "Vision without Execution is Hallucination." -- Thomas Edison

_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
[email protected]
https://lists.linbit.com/mailman/listinfo/drbd-user
