Xiubo,

Thank you for all the help so far. I was finally able to figure out what the 
trigger for the issue was and how to make sure it doesn’t happen - at least not 
in a steady state. There is still the possibility of running into the bug in a 
failover scenario of some kind, but at least for now I think I’m stable.

I now have two iSCSI gateways running, and I’m no longer seeing the locks flap 
back and forth between the two after making a change on the ESXi cluster that 
I’ll describe below.

I have 50 ESXi hosts communicating with the Ceph cluster. What happened was 
that for some reason, some of the hosts did not see the full list of paths to 
all the iSCSI gateways. In my case, each host should have seen a total of 44 
paths for all the LUNs but some were only seeing 32 or 37 (or some other 
number). This meant that if one of the paths a host wasn’t seeing happened to be 
the primary path, the host used another path instead. This 
appears to be what was causing the images to flap back and forth between the two 
gateways. Once I went through each host and manually rescanned the adapter to 
discover all the available paths after adding the second iSCSI gateway, 
everything stabilized. If even one host in the environment doesn’t see all the 
paths, this flapping occurs.
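
A minimal sketch of the check described above, assuming the per-host path 
counts have already been collected by some means. The host names, the sample 
counts, and the function name are hypothetical; only the expected total of 44 
paths comes from this environment.

```python
EXPECTED_PATHS = 44  # total paths each host should see in this setup


def hosts_missing_paths(path_counts, expected=EXPECTED_PATHS):
    """Return {host: number_of_missing_paths} for hosts below `expected`."""
    return {
        host: expected - count
        for host, count in sorted(path_counts.items())
        if count < expected
    }


if __name__ == "__main__":
    # Hypothetical sample data: host name -> number of paths it currently sees.
    sample = {"esxi-01": 44, "esxi-02": 32, "esxi-03": 37}
    for host, missing in hosts_missing_paths(sample).items():
        print(f"{host}: missing {missing} path(s); rescan the adapter")
```

Any host this reports is a candidate for a manual adapter rescan, since a 
single host with an incomplete path list is enough to trigger the flapping.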

Am I right to assume that the iSCSI gateways automatically determine which LUN 
they will advertise being primary for? Is there a command that lets me view 
which gateway is primary for which LUN? I’m guessing that when another gateway 
gets added, the primary for each LUN is recalculated and 
advertised out to the clients?

-Paul




I did a quick test where I re-enabled a second iSCSI gateway to take a closer 
look at the paths on the ESXi hosts and I definitely see that when the second 
path becomes available, different hosts are pointing to different gateways for 
the Active I/O Path.

I was reading up on how ALUA works and, as far as I can tell, isn’t Ceph supposed 
to indicate to the ESXi hosts which iSCSI gateway “owns” a given LUN at any 
point so that the hosts know which path to make active?

Yeah, the ceph-iscsi/tcmu-runner services will do that. They will report this to 
the clients.


Could there be something wrong where more than one iSCSI gateway is advertising 
that it owns the LUN to the ESXi hosts?


This has been tested and works well on Linux in production, and the logic has 
not changed for several years.

I am not sure how ESXi handles this internally, but it should comply with the 
iSCSI protocol. On Linux, multipath can successfully detect which path is 
active and will choose it.

-Paul


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
