Simon, Tung-Han,

You may also want to watch these tickets, which seem to be related to the
issue you describe:

https://jira.whamcloud.com/browse/LU-14124
https://jira.whamcloud.com/browse/LU-14125
-Nathan

On Mon, Nov 2, 2020 at 7:18 AM Simon Guilbault <[email protected]> wrote:

> Hi,
>
> If you set it on the MGS, it will become the new default for all the
> clients and for new mounts of the FS. The problem is that you need
> LU-12759 (fixed in 2.12.4) on your clients, since there was a bug on
> older clients and the setting did not work correctly there.
>
> On Mon, Nov 2, 2020 at 12:38 AM Tung-Han Hsieh <[email protected]> wrote:
>
>> Dear Simon,
>>
>> Following your suggestions, we have now confirmed that the problem of
>> a client's I/O performance dropping when there is continuous I/O in
>> the background is solved. It works like a charm. Thank you so much!
>>
>> Here is a final question. We found that this command:
>>
>> lctl set_param osc.*.grant_shrink=0
>>
>> can be run on the client, which fixed the value of "cur_grant_bytes"
>> at its highest value, 1880752127, and thereby fixed the problem.
>> Whenever we remount the file system (I mean, explicitly umount and
>> mount it again), we need to execute this command again to set it to
>> zero.
>>
>> But this command:
>>
>> lctl set_param -P osc.*.grant_shrink=0
>>
>> has to be run on the MGS node. When it is set only on the MGS and not
>> on the client, the "cur_grant_bytes" of the testing client still drops
>> under the background continuous I/O. So I am asking what the meaning
>> of this setting on the MGS node is.
>>
>> Thank you very much.
>>
>>
>> T.H.Hsieh
>>
>> On Fri, Oct 30, 2020 at 01:37:01PM +0800, Tung-Han Hsieh wrote:
>> > Dear Simon,
>> >
>> > Thank you very much for your useful information. We are now
>> > arranging the system maintenance date in order to upgrade to
>> > Lustre 2.12.5. Then we will follow your suggestion to see whether
>> > this problem is fixed.
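For reference, here is a minimal sketch of the two ways of applying the
grant_shrink setting discussed above. The set_param commands are the ones
from this thread; the get_param checks at the end are only an assumed way
to verify the result, and -P propagation requires clients with the
LU-12759 fix (2.12.4 or later), as Simon notes.

    # On a client: takes effect immediately, but is lost on umount, so
    # it has to be repeated after every remount of the filesystem.
    lctl set_param osc.*.grant_shrink=0

    # On the MGS: with -P the setting is stored persistently and becomes
    # the default pushed to the clients of this filesystem.
    lctl set_param -P osc.*.grant_shrink=0

    # On a client: verify the setting and watch the grant.
    lctl get_param osc.*.grant_shrink
    lctl get_param osc.*.cur_grant_bytes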
>> >
>> > Here I report a test of how cur_grant_bytes changed over time under
>> > continuous I/O. Again, the client runs the following script for
>> > continuous reading in the background:
>> >
>> > # The Lustre file system was mounted under /home
>> > while [ 1 ]; do
>> >     tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
>> > done
>> >
>> > And every 20 minutes, on the same client, we copied a 600 MB file
>> > from one directory to another within Lustre and checked
>> > "cur_grant_bytes" with the following command, also run on the same
>> > client:
>> >
>> > /opt/lustre/sbin/lctl get_param osc.*.cur_grant_bytes
>> >
>> > The result is (consecutive lines are separated by around 20 minutes):
>> >
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1880752127
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1410564096
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1059201024
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=794400768
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=595800576
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=446850432
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=335137824
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=251353368
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=188515026
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=141386270
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=106039703
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=79529778
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=59647334
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=44735501
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=33551626
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=25163720
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=18872790
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=14154593
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=10615945
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=7961959
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=5971470
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=4478603
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=3358953
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=2519215
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1889412
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1417059
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=1062795
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=797097
>> > osc.chome-OST0000-osc-ffff88a039150000.cur_grant_bytes=797097
>> > ....
>> >
>> > The value 797097 seems to be the minimum. When it dropped to 1062795,
>> > the time of the cp increased dramatically from around 1 second to
>> > 1 minute. In addition, during the test the cluster was completely
>> > idle, and this test clearly does not saturate either the network or
>> > the MDT/OST hardware (they have almost no load).
>> >
>> > I am wondering whether this could be a bug to report to the
>> > development team.
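The measurement above could be scripted roughly as follows (a sketch
only: the log path, the test file under /home/dir1, and the fixed
20-minute interval are placeholder choices, not part of the original
test):

    #!/bin/bash
    # Every ~20 minutes, record the current grant and time a 600 MB copy
    # within the Lustre filesystem, appending both to a log file.
    while true; do
        date >> /tmp/grant-log.txt
        lctl get_param osc.*.cur_grant_bytes >> /tmp/grant-log.txt
        /usr/bin/time -f "cp took %e seconds" \
            cp /home/dir1/file /home/dir2/ 2>> /tmp/grant-log.txt
        sleep 1200
    done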
>> >
>> > Best Regards,
>> >
>> > T.H.Hsieh
>> >
>> > On Thu, Oct 29, 2020 at 09:49:42AM -0400, Simon Guilbault wrote:
>> > > Our current workaround was to use the following command on the MGS
>> > > with Lustre 2.12.5, which includes the patches from LU-12651 and
>> > > LU-12759 (we were using a patched 2.12.4 a few months ago):
>> > > lctl set_param -P osc.*.grant_shrink=0
>> > >
>> > > We could not find the root cause of the underlying problem; dynamic
>> > > grant shrinking seems to be useful when the OSTs are running out of
>> > > free space.
>> > >
>> > > On Wed, Oct 28, 2020 at 11:47 PM Tung-Han Hsieh <[email protected]> wrote:
>> > >
>> > > > Dear Simon,
>> > > >
>> > > > Thank you very much for your hint. Yes, you are right. We compared
>> > > > the grant size of two clients by running on each of them:
>> > > >
>> > > > lctl get_param osc.*.cur_grant_bytes
>> > > >
>> > > > - Client A: it has been running the following large data transfer
>> > > >   loop for over 36 hours.
>> > > >
>> > > >   while [ 1 ]; do
>> > > >       tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
>> > > >   done
>> > > >
>> > > >   The value of "cur_grant_bytes" is 796134.
>> > > >
>> > > > - Client B: it is almost idle while Client A is doing this.
>> > > >
>> > > >   The value of "cur_grant_bytes" is 1715863552.
>> > > >
>> > > > If this is what hit the I/O performance of Client A so seriously,
>> > > > is it possible to keep this value constant, at least for the head
>> > > > node (since the head node is the most likely one in the whole
>> > > > cluster to have large, long-running data I/O, especially in a data
>> > > > center)?
>> > > >
>> > > > I would also like to ask: why does this value have to be
>> > > > dynamically adjusted?
>> > > >
>> > > > Thank you very much for your comments in advance.
>> > > >
>> > > > Best Regards,
>> > > >
>> > > > T.H.Hsieh
>> > > >
>> > > > On Wed, Oct 28, 2020 at 02:00:21PM -0400, Simon Guilbault wrote:
>> > > > > Hi, we had a similar performance problem on our login/DTN nodes
>> > > > > a few months ago: the grant size was shrinking and getting stuck
>> > > > > under 1 MB. Once under 1 MB, the client had to send every
>> > > > > request to the OST using sync I/O.
>> > > > >
>> > > > > Check the output of the following command:
>> > > > > lctl get_param osc.*.cur_grant_bytes
>> > > > >
>> > > > > On Wed, Oct 28, 2020 at 12:08 AM Tung-Han Hsieh <[email protected]> wrote:
>> > > > >
>> > > > > > Dear All,
>> > > > > >
>> > > > > > Sorry, I am not sure whether this mail was successfully posted
>> > > > > > to the lustre-discuss mailing list, so I am resending it.
>> > > > > > Please ignore it if you have already read it.
>> > > > > >
>> > > > > > ===========================================================================
>> > > > > >
>> > > > > > Dear Andreas,
>> > > > > >
>> > > > > > Thank you very much for your kind suggestions. These days I
>> > > > > > got a chance to follow them for a test. This email reports the
>> > > > > > results I have so far. What I have done:
>> > > > > >
>> > > > > > 1. Upgraded one client (with InfiniBand) to Lustre
>> > > > > >    2.13.56_44_gf8a8d3f (obtained from GitHub). The build
>> > > > > >    configuration is:
>> > > > > >
>> > > > > >    - Linux kernel 4.19.123.
>> > > > > >    - InfiniBand MLNX_OFED_SRC-4.6-1.0.1.1.
>> > > > > >    - ./configure --prefix=/opt/lustre \
>> > > > > >        --with-o2ib=/path/of/mlnx-ofed-kernel-4.6 \
>> > > > > >        --disable-server --enable-mpitests=no
>> > > > > >    - make
>> > > > > >    - make install
>> > > > > >
>> > > > > > 2. We mounted the Lustre file system (Lustre MDT/OST servers:
>> > > > > >    version 2.12.4, with InfiniBand and a ZFS backend) with
>> > > > > >    this command:
>> > > > > >
>> > > > > >    - mount -t lustre -o flock mdt@o2ib:/chome /home
>> > > > > >
>> > > > > > 3. The script to simulate a large data transfer is the
>> > > > > >    following (the directory "/home/large/data" contains 758
>> > > > > >    files, each of size 600 MB):
>> > > > > >
>> > > > > >    while [ 1 ]; do
>> > > > > >        tar cf - /home/large/data | ssh remote_host "cat > /dev/null"
>> > > > > >    done
>> > > > > >
>> > > > > >    P.S. Note that this scenario is common in a large data
>> > > > > >    center: some users transfer large data sets out of the data
>> > > > > >    center through the head node, while other users copy files
>> > > > > >    and do their normal work on the same head node.
>> > > > > >
>> > > > > > 4. While the data transfer ran in the background, I
>> > > > > >    occasionally ran this command on the same client to test
>> > > > > >    whether there was any abnormality in I/O performance
>> > > > > >    (/home/dir1/file has size 600 MB):
>> > > > > >
>> > > > > >    cp /home/dir1/file /home/dir2/
>> > > > > >
>> > > > > >    In the beginning this command completed in about 1 second.
>> > > > > >    But after around 18 hours (not exact, because the test ran
>> > > > > >    overnight while I was sleeping), the problem appeared: the
>> > > > > >    same cp command took more than 1 minute to complete.
>> > > > > >
>> > > > > >    During the test, I am sure the whole cluster was idle. The
>> > > > > >    MDT and OST servers had no other load, and the CPU usage of
>> > > > > >    the testing client was below 0.3.
>> > > > > >
>> > > > > >    Then I stopped the test and let the whole system idle
>> > > > > >    completely. Even after 3 hours, the I/O abnormality of the
>> > > > > >    same "cp" command was still there. Only after I unmounted
>> > > > > >    and remounted /home did "cp" return to normal.
>> > > > > >
>> > > > > > Before and after remounting /home (which I call "reset"), I
>> > > > > > did the following tests:
>> > > > > >
>> > > > > > 1. Using "top" to check the memory usage:
>> > > > > >
>> > > > > >    Before reset:
>> > > > > >    =====================================
>> > > > > >    top - 10:43:15 up 35 days, 52 min, 3 users, load average: 0.00, 0.00, 0.00
>> > > > > >    Tasks: 404 total, 1 running, 162 sleeping, 0 stopped, 0 zombie
>> > > > > >    %Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>> > > > > >    KiB Mem : 13232632+total, 13000131+free, 647784 used, 1677220 buff/cache
>> > > > > >    KiB Swap: 15631240 total, 15631240 free, 0 used. 13076376+avail Mem
>> > > > > >
>> > > > > >    After reset:
>> > > > > >    =====================================
>> > > > > >    top - 10:48:02 up 35 days, 57 min, 3 users, load average: 0.04, 0.01, 0.00
>> > > > > >    Tasks: 395 total, 1 running, 159 sleeping, 0 stopped, 0 zombie
>> > > > > >    %Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>> > > > > >    KiB Mem : 13232632+total, 12946539+free, 675948 used, 2184976 buff/cache
>> > > > > >    KiB Swap: 15631240 total, 15631240 free, 0 used. 13073571+avail Mem
>> > > > > >
>> > > > > >    It seems that most of the memory was in the "free" state.
>> > > > > >    The amount of hidden memory was negligible, so I did not
>> > > > > >    investigate the slab memory further.
>> > > > > >
>> > > > > > 2. Using "strace" with the following commands:
>> > > > > >
>> > > > > >    - Before reset (each cp took about 1 min):
>> > > > > >      strace -c -o /tmp/log2-err.txt cp /home/dir1/file /home/dir2/
>> > > > > >
>> > > > > >    - After reset (each cp took about 1 sec):
>> > > > > >      strace -c -o /tmp/log2-reset.txt cp /home/dir1/file /home/dir2/
>> > > > > >
>> > > > > >    From the log files, the major time consumers were the read
>> > > > > >    and write syscalls. The others are negligible.
>> > > > > >
>> > > > > >    % time     seconds  usecs/call     calls    errors syscall
>> > > > > >    ------ ----------- ----------- --------- --------- ----------------
>> > > > > >    (Before reset)
>> > > > > >     71.46    0.278424        1920       145           write
>> > > > > >     28.06    0.109322         705       155           read
>> > > > > >    (After reset)
>> > > > > >     52.92    0.299091        2063       145           write
>> > > > > >     46.85    0.264777        1708       155           read
>> > > > > >
>> > > > > >    Before the reset, since we had already done the cp test on
>> > > > > >    the same file a few times, the file was already cached. So
>> > > > > >    the reading time is smaller before the reset than after it
>> > > > > >    (since the reset remounted /home).
>> > > > > >
>> > > > > >    Hence, from this result the syscall times look normal. The
>> > > > > >    performance drop seems to occur somewhere else.
>> > > > > >
>> > > > > > I have not yet investigated the Lustre kernel debug log by
>> > > > > > enabling Lustre debug=-1. We will find another chance to do
>> > > > > > that.
>> > > > > >
>> > > > > > In the meantime, any comments or suggestions are very welcome.
>> > > > > >
>> > > > > > Thanks for your help in advance.
>> > > > > >
>> > > > > >
>> > > > > > Best Regards,
>> > > > > >
>> > > > > > T.H.Hsieh
>> > > > > >
>> > > > > >
>> > > > > > On Thu, Oct 08, 2020 at 01:32:53PM -0600, Andreas Dilger wrote:
>> > > > > > > On Oct 8, 2020, at 10:37 AM, Tung-Han Hsieh <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > Dear All,
>> > > > > > > >
>> > > > > > > > In the past months, we have encountered several episodes
>> > > > > > > > of Lustre I/O slowing down abnormally. It is quite
>> > > > > > > > mysterious, since there seems to be no problem with the
>> > > > > > > > network hardware, nor with Lustre itself: there are no
>> > > > > > > > error messages at all on the MDT, OST, or client sides.
>> > > > > > > >
>> > > > > > > > Recently we probably found a way to reproduce it, and we
>> > > > > > > > now have some suspicions.
>> > > > > > > > We found that if we perform I/O on a client continuously,
>> > > > > > > > without stopping, then after some time threshold (probably
>> > > > > > > > more than 24 hours) the bandwidth available for additional
>> > > > > > > > file I/O on that client shrinks dramatically.
>> > > > > > > >
>> > > > > > > > Our configuration is the following:
>> > > > > > > > - One MDT and one OST server, based on ZFS + Lustre 2.12.4.
>> > > > > > > > - The OST is served by a RAID 5 system with 15 SAS hard disks.
>> > > > > > > > - Some clients connect to the MDT/OST through InfiniBand,
>> > > > > > > >   some through gigabit ethernet.
>> > > > > > > >
>> > > > > > > > Our test focused on the clients using InfiniBand, and is
>> > > > > > > > described in the following:
>> > > > > > > >
>> > > > > > > > We have a huge amount (several TB) of data stored in the
>> > > > > > > > Lustre file system to be transferred to an outside network.
>> > > > > > > > In order not to exhaust the network bandwidth of our
>> > > > > > > > institute, we transfer the data with limited bandwidth via
>> > > > > > > > the following command:
>> > > > > > > >
>> > > > > > > > rsync -av --bwlimit=1000 <data_in_Lustre> <out_side_server>:/<out_side_path>/
>> > > > > > > >
>> > > > > > > > That is, the transfer rate is 1 MB per second, which is
>> > > > > > > > relatively low. The client reads the data from Lustre
>> > > > > > > > through InfiniBand, so during the transfer there should
>> > > > > > > > presumably be no problem doing other data I/O on the same
>> > > > > > > > client. On average, copying a 600 MB file from one
>> > > > > > > > directory to another (both in the same Lustre file system)
>> > > > > > > > took about 1.0 - 2.0 seconds, even while the rsync process
>> > > > > > > > was still working.
>> > > > > > > >
>> > > > > > > > But after about 24 hours of continuously sending data via
>> > > > > > > > rsync, the additional I/O on the same client shrank
>> > > > > > > > dramatically. When this happens, it takes more than
>> > > > > > > > 1 minute to copy a 600 MB file from one place to another
>> > > > > > > > (both in the same Lustre file system) while rsync is still
>> > > > > > > > running.
>> > > > > > > >
>> > > > > > > > Then we stopped the rsync process and waited for a while
>> > > > > > > > (about one hour). The I/O performance of copying that
>> > > > > > > > 600 MB file returned to normal.
>> > > > > > > >
>> > > > > > > > Based on this observation, we suspect there may be a
>> > > > > > > > hidden QoS mechanism built into Lustre. When a process
>> > > > > > > > occupies the I/O bandwidth for a long time and exceeds
>> > > > > > > > some limit, does Lustre automatically shrink the I/O
>> > > > > > > > bandwidth for all processes running on the same client?
>> > > > > > > >
>> > > > > > > > I am not against such a QoS design, if it does exist. But
>> > > > > > > > the amount of shrinking seems too large for InfiniBand
>> > > > > > > > (QDR and above). I also wonder whether this is because our
>> > > > > > > > system mixes clients, some of which have InfiniBand and
>> > > > > > > > some of which do not.
>> > > > > > > >
>> > > > > > > > Could anyone help fix this problem? Any suggestions would
>> > > > > > > > be very much appreciated.
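The workload described above can be condensed into a small reproduction
sketch (the paths and the remote host below are placeholders; only the
rsync options and the 600 MB copy are taken from the description):

    # Background: bandwidth-limited transfer (about 1 MB/s) of a large
    # data set out of the Lustre filesystem.
    rsync -av --bwlimit=1000 /home/large/data user@outside_server:/some/path/ &

    # Foreground, repeated occasionally: time a 600 MB copy between two
    # directories on the same Lustre filesystem.
    time cp /home/dir1/file /home/dir2/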
>> > > > > > >
>> > > > > > > There is no "hidden QOS", unless it is so well hidden that
>> > > > > > > I don't know about it.
>> > > > > > >
>> > > > > > > You could investigate several different things to isolate
>> > > > > > > the problem:
>> > > > > > > - try with a 2.13.56 client to see if the problem is already fixed
>> > > > > > > - check if the client is using a lot of CPU when it becomes slow
>> > > > > > > - run strace on your copy process to see which syscalls are slow
>> > > > > > > - check memory/slab usage
>> > > > > > > - enable Lustre debug=-1 and dump the kernel debug log to see
>> > > > > > >   where the process is taking a long time to complete a request
>> > > > > > >
>> > > > > > > It is definitely possible that there is some kind of
>> > > > > > > problem, since continuously writing to the same file
>> > > > > > > descriptor for over a day is not a very common workload.
>> > > > > > > You'll have to do the investigation on your system to
>> > > > > > > isolate the source of the problem.
>> > > > > > >
>> > > > > > > Cheers, Andreas
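Andreas's suggestions above roughly translate into the following steps,
run on the affected client while the slow copy is being reproduced (a
sketch only; the output file names are just examples):

    # CPU and memory/slab usage while the copy is slow.
    top -b -n 1 | head -20
    grep -i slab /proc/meminfo

    # Trace the copy to see which syscalls are slow.
    strace -c -o /tmp/cp-strace.txt cp /home/dir1/file /home/dir2/

    # Capture a full Lustre kernel debug log around one slow copy.
    lctl set_param debug=-1
    lctl clear
    cp /home/dir1/file /home/dir2/
    lctl dk /tmp/lustre-debug.log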
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
