Relatively rarely, across a 200-machine cluster, I get an LBUG on the
clients. It seems to be triggered by specific access patterns (most jobs
do not trigger it) and looks quite similar to:

  https://jira.whamcloud.com/browse/LU-16637

  http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2023-April/011016.html

  https://git.whamcloud.com/?p=fs/lustre-release.git;a=commit;h=7bb1e211d217d5a82ac2d5e4edad5ae018090761

Since the LBUG is fatal all I get is the backtrace from the crash dump:

  lbug_with_loc.cold.8+0x18
  ll_truncate_inode_pages_final+0xab
  vvp_prune+0x181
  cl_object_prune+0x58
  lov_layout_change.isra.49+0x1ba
  lov_conf_set+0x391
  cl_conf_set+0x60
  ll_layout_conf+0x14b
  ? _ptlrpc_req_finished+0x54d
  ll_layout_lock_set+0x3df
  ? ll_take_md_lock+0x148
  ll_layout_refresh+0x1cc
  vvp_io_init+0x22e
  cl_io_init0.isra.14+0x86
  ll_file_io_generic+0x388
  ? file_update_time+0x62
  ? srso_return_thunk+0x5
  ? __generic_file_write_iter+0x102
  ll_file_write_iter+0x558
  ? kmem_cache_free+0x116
  new_sync_write+0x112
  vfs_write+0x5a

If this is a manifestation of LU-16637 there is a fix, but checking the
changelogs shows LU-16637 listed as applied in 2.16.0 and not in any of
the 2.15.[1-6] changelogs.
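For what it's worth, a changelog entry can be missing even when a patch
was cherry-picked, so searching commit subjects for the ticket ID in a
clone of fs/lustre-release.git may be a more reliable check. A minimal
sketch of the technique, demonstrated here against a throwaway repo (the
tag name `v_demo` and the commit are placeholders; against a real clone
you would substitute the release tag you care about, and the exact tag
naming is an assumption on my part):

```shell
# Sketch: check whether a fix landed in a given tag by searching the
# subjects of all commits reachable from that tag for the ticket ID.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
# Placeholder commit standing in for the real backport.
git -c user.email=a@b -c user.name=t commit -q --allow-empty \
    -m "LU-16637 llite: fix example"
git tag v_demo
# The actual check: does any commit reachable from the tag mention it?
if git log --oneline --grep='LU-16637' v_demo | grep -q 'LU-16637'; then
  echo "fix present"
else
  echo "fix absent"
fi
```

Against the real repository the check would be the same `git log
--oneline --grep='LU-16637' <release-tag>` invocation.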
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
