Thanks!

Does that mean that occasional iSCSI path drop-outs are somewhat expected? We 
are already using SSDs for WAL/DB on each OSD server, so at least that part is covered.
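
(As far as I understand, something like

    ceph osd metadata 0 | grep -E 'bluefs_dedicated_db|bluefs_db_rotational'

should show whether the DB/WAL really ended up on the SSD for a given OSD; 
osd.0 here is just an example.)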

Do you think that buying an additional 6 or 12 HDDs would help with the IOPS 
for the VMs?

Regards,
Martin



> On 4 Oct 2020, at 15:17, Martin Verges <martin.ver...@croit.io> wrote:
> 
> Hello,
> 
> No, iSCSI + VMware works without such problems.
> 
> > We are on latest Nautilus, 12 x 10 TB OSDs (4 servers), 25 Gbit/s Ethernet, 
> > erasure coded rbd pool with 128 PGs, around 200 PGs per OSD total.
> 
> Nautilus is a good choice
> 12*10TB HDD is not good for VMs
> 25 Gbit/s on HDD is way too much for that system
> 200 PGs per OSD is too much, I would suggest 75-100 PGs per OSD
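> For example, "ceph osd df tree" shows the current PG count per OSD, and since 
> Nautilus you can also lower pg_num on a pool again (pool name and target value 
> are just placeholders here):
> 
>     ceph osd df tree
>     ceph osd pool set rbd_ec pg_num 64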
> 
> You can improve latency on HDD clusters using external DB/WAL on NVMe. That 
> might help you.
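> Something along these lines (device names are just placeholders) creates an 
> OSD with its DB/WAL on a separate NVMe partition:
> 
>     ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1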
> 
> --
> Martin Verges
> Managing director
> 
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> 
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
> 
> 
> On Sun, 4 Oct 2020 at 14:37, Golasowski Martin <martin.golasow...@vsb.cz> wrote:
> Hi,
> does anyone here use CEPH iSCSI with VMware ESXi? It seems that we are 
> hitting the 5-second timeout limit on the software iSCSI HBA in ESXi. It happens 
> whenever there is increased load on the cluster, like a deep scrub or 
> rebalance. Is this normal behaviour in production? Or is there something 
> special we need to tune?
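> Would throttling recovery and scrub, e.g. something along these lines (values 
> just as an illustration), be the right direction?
> 
>     ceph config set osd osd_max_backfills 1
>     ceph config set osd osd_recovery_max_active 1
>     ceph config set osd osd_scrub_sleep 0.1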
> 
> We are on latest Nautilus, 12 x 10 TB OSDs (4 servers), 25 Gbit/s Ethernet, 
> erasure coded rbd pool with 128 PGs, around 200 PGs per OSD total.
> 
> 
> ESXi Log:
> 
> 2020-10-04T01:57:04.314Z cpu34:2098959)WARNING: iscsi_vmk: 
> iscsivmk_ConnReceiveAtomic:517: vmhba64:CH:1 T:0 CN:0: Failed to receive 
> data: Connection closed by peer
> 2020-10-04T01:57:04.314Z cpu34:2098959)iscsi_vmk: 
> iscsivmk_ConnRxNotifyFailure:1235: vmhba64:CH:1 T:0 CN:0: Connection rx 
> notifying failure: Failed to Receive. State=Bound
> 2020-10-04T01:57:04.566Z cpu19:2098979)WARNING: iscsi_vmk: 
> iscsivmk_StopConnection:741: vmhba64:CH:1 T:0 CN:0: iSCSI connection is being 
> marked "OFFLINE" (Event:4)
> 2020-10-04T01:57:04.654Z cpu7:2097866)WARNING: VMW_SATP_ALUA: 
> satp_alua_issueCommandOnPath:788: Probe cmd 0xa3 failed for path 
> "vmhba64:C2:T0:L0" (0x5/0x20/0x0). Check if failover mode is still ALUA.
> 
> 
> OSD Log:
> 
> [303088.450088] Did not receive response to NOPIN on CID: 0, failing 
> connection for I_T Nexus 
> iqn.1994-05.com.redhat:esxi1,i,0x00023d000002,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
> [324926.694077] Did not receive response to NOPIN on CID: 0, failing 
> connection for I_T Nexus 
> iqn.1994-05.com.redhat:esxi2,i,0x00023d000001,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
> [407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891
> [407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 5891
> [411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722
> [411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6722
> [481459.755876] ABORT_TASK: Found referenced iSCSI task_tag: 7930
> [481460.787968] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 7930
> 
> Cheers,
> Martin
