Dear All,

In the past months, we have encountered several episodes of Lustre I/O slowing down abnormally. It is quite mysterious: there seems to be no problem with the network hardware, nor with Lustre itself, since there are no error messages at all on the MDT, OST, or client sides.
Recently we probably found a way to reproduce it, and now have some suspicions. We found that if we continuously perform I/O on a client without stopping, then after some time threshold (probably more than 24 hours), the bandwidth of additional file I/O on that client shrinks dramatically.

Our configuration is the following:
- One MDT and one OST server, based on ZFS + Lustre-2.12.4.
- The OST is served by a RAID 5 system with 15 SAS hard disks.
- Some clients connect to the MDT/OST through InfiniBand, some through gigabit Ethernet.

Our test focused on the clients using InfiniBand, and went as follows. We have a huge amount of data (several TB) stored in the Lustre file system, to be transferred to an outside network. In order not to exhaust our institute's network bandwidth, we transfer the data at a limited rate via the following command:

rsync -av --bwlimit=1000 <data_in_Lustre> <out_side_server>:/<out_side_path>/

That is, the transfer rate is 1 MB per second, which is relatively low. The client reads the data from Lustre through InfiniBand, so during the transfer there should presumably be no problem doing other I/O on the same client.

On average, copying a 600 MB file from one directory to another (both in the same Lustre file system) took about 1.0 - 2.0 seconds, even while the rsync process was still working. But after about 24 hours of continuously sending data via rsync, additional I/O on the same client slowed down dramatically: it took more than one minute to copy a 600 MB file from one place to another (both in the same Lustre file system) while rsync was still running. We then stopped the rsync process and waited for a while (about one hour), after which the performance of copying that 600 MB file returned to normal.

Based on this observation, we are wondering whether there is a hidden QoS mechanism built into Lustre.
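For reference, the copy-timing test above can be sketched as a small script. The paths and the file size here are hypothetical placeholders so the sketch runs anywhere; in the real test both files lived on the same Lustre file system and the file was 600 MB:

```shell
#!/bin/sh
# Minimal sketch of the copy-timing test described above.
# The /tmp paths and the 10 MB size are placeholders for illustration;
# the original test copied a 600 MB file between two directories on
# the same Lustre file system while the throttled rsync was running.
SRC=/tmp/lustre_io_test
DST=/tmp/lustre_io_test.copy

# Create the test file (10 MB of zeros).
dd if=/dev/zero of="$SRC" bs=1M count=10 2>/dev/null

# Time the copy in milliseconds (GNU date's %N gives nanoseconds).
start=$(date +%s%N)
cp "$SRC" "$DST"
end=$(date +%s%N)
echo "copy took $(( (end - start) / 1000000 )) ms"

rm -f "$SRC" "$DST"
```

Running such a timing loop periodically alongside the throttled rsync is how we observed the jump from a 1-2 second copy to a copy of more than one minute.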
When a process occupies the I/O bandwidth for a long time and exceeds some limit, does Lustre automatically shrink the I/O bandwidth for all processes running on the same client? I am not against such a QoS design, if it does exist, but the amount of throttling seems too large for InfiniBand (QDR and above). I also wonder whether this could instead be due to our system mixing clients that have InfiniBand with clients that do not.

Could anyone help to fix this problem? Any suggestions would be very much appreciated.

Thanks very much.

T.H.Hsieh
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
