On Sun, Mar 3, 2024 at 4:28 PM Rick Macklem <rick.mack...@gmail.com> wrote:
>
> On Sun, Mar 3, 2024 at 3:27 PM Rick Macklem <rick.mack...@gmail.com> wrote:
> >
> > On Sun, Mar 3, 2024 at 1:17 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> > >
> > > On Sat, Mar 2, 2024 at 8:28 PM Garrett Wollman <woll...@bimajority.org>
> > > wrote:
> > > >
> > > > I wrote previously:
> > > > > PID TID COMM TDNAME KSTACK
> > > > > 997 108481 nfsd nfsd: master mi_switch
> > > > > sleepq_timedwait _sleep nfsv4_lock nfsrvd_dorpc nfssvc_program
> > > > > svc_run_internal svc_run nfsrvd_nfsd nfssvc_nfsd sys_nfssvc
> > > > > amd64_syscall fast_syscall_common
> > > > > 997 960918 nfsd nfsd: service mi_switch
> > > > > sleepq_timedwait _sleep nfsv4_lock nfsrv_setclient nfsrvd_exchangeid
> > > > > nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start
> > > > > fork_exit fork_trampoline
> > > > > 997 962232 nfsd nfsd: service mi_switch _cv_wait
> > > > > txg_wait_synced_impl txg_wait_synced dmu_offset_next zfs_holey
> > > > > zfs_freebsd_ioctl vn_generic_copy_file_range vop_stdcopy_file_range
> > > > > VOP_COPY_FILE_RANGE vn_copy_file_range nfsrvd_copy_file_range
> > > > > nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start
> > > > > fork_exit fork_trampoline
> > > >
> > > > I spent some time this evening looking at this last stack trace, and
> > > > stumbled across the following comment in
> > > > sys/contrib/openzfs/module/zfs/dmu.c:
> > > >
> > > > | /*
> > > > |  * Enable/disable forcing txg sync when dirty checking for holes with lseek().
> > > > |  * By default this is enabled to ensure accurate hole reporting, it can result
> > > > |  * in a significant performance penalty for lseek(SEEK_HOLE) heavy workloads.
> > > > |  * Disabling this option will result in holes never being reported in dirty
> > > > |  * files which is always safe.
> > > > |  */
> > > > | int zfs_dmu_offset_next_sync = 1;
> > > >
> > > > I believe this explains why vn_copy_file_range sometimes takes much
> > > > longer than a second: our servers often have lots of data waiting to
> > > > be written to disk, and if the file being copied was recently modified
> > > > (and so is dirty), this might take several seconds. I've set
> > > > vfs.zfs.dmu_offset_next_sync=0 on the server that was hurting the most
> > > > and am watching to see if we have more freezes.
> > > >
> > > > If this does the trick, then I can delay deploying a new kernel until
> > > > April, after my upcoming vacation.
> > > Interesting. Please let us know how it goes.
> > Btw, I just tried this for my trivial test and it worked very well.
> > A 1Gbyte file was copied in two Copy RPCs of 1sec and slightly less
> > than 1sec.
> Oops, I spoke too soon.
> The Copy RPCs worked fine (as above) but the Commit RPCs took
> a long time, so it still looks like you may need the patches.
And I should mention that my test is done on a laptop without a ZIL,
so a ZIL on a separate device might generate different results.
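
For anyone who wants to observe the stall described above without involving
nfsd at all, here is a minimal, hypothetical userland sketch (the program
name, file path, and sizes are arbitrary; the file just needs to live on a
ZFS dataset). It dirties a file with buffered writes and then times an
immediate lseek(SEEK_HOLE), which should exercise the same
zfs_holey/dmu_offset_next path shown in the third stack trace. Per the
dmu.c comment quoted above, with vfs.zfs.dmu_offset_next_sync=1 the probe
can block in txg_wait_synced until the dirty data reaches disk, while with
it set to 0 it should return right away (no holes are reported in dirty
files). Build with "cc -o holetest holetest.c" and run it twice, toggling
the sysctl in between.

/*
 * holetest.c: time lseek(SEEK_HOLE) on a freshly dirtied file
 * that lives on a ZFS dataset.
 */
#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/tank/holetest.dat";
	static char buf[128 * 1024];
	struct timespec t0, t1;
	intmax_t off;
	int fd, i;

	memset(buf, 'x', sizeof(buf));
	if ((fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644)) == -1)
		err(1, "open %s", path);

	/* Dirty the file: 128MB of buffered writes, not yet synced. */
	for (i = 0; i < 1024; i++)
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			err(1, "write");

	/* Probe for a hole while the file is still dirty. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	off = (intmax_t)lseek(fd, 0, SEEK_HOLE);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	if (off == -1)
		err(1, "lseek(SEEK_HOLE)");

	printf("SEEK_HOLE returned %jd after %.3f seconds\n", off,
	    (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	close(fd);
	return (0);
}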
rick
>
> rick
> >
> > > So, your vacation may be looking better, rick
> >
> > And enjoy your vacation, rick
> >
> > > > -GAWollman
> >