On Mon Feb 3, 2025 at 3:27 PM CET, Alexander Zeidler wrote:
> Existing information is slightly modified and retained.
>
> Add information:
> * List which logs are usually helpful for troubleshooting
> * Explain how to acknowledge listed Ceph crashes and view details
> * List common causes of Ceph problems and link to recommendations for a
>   healthy cluster
> * Briefly describe the common problem "OSDs down/crashed"
>
> Signed-off-by: Alexander Zeidler <a.zeid...@proxmox.com>
> ---
>  pveceph.adoc | 72 ++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 64 insertions(+), 8 deletions(-)
>
> diff --git a/pveceph.adoc b/pveceph.adoc
> index 90bb975..4e1c1e2 100644
> --- a/pveceph.adoc
> +++ b/pveceph.adoc
> @@ -1150,22 +1150,78 @@ The following Ceph commands can be used to see if the
>  cluster is healthy
>  ('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
>  ('HEALTH_ERR'). If the cluster is in an unhealthy state, the status commands
>  below will also give you an overview of the current events and actions to
>  take.
> +To stop their execution, press CTRL-C.
>
>  ----
> -# single time output
> -pve# ceph -s
> -# continuously output status changes (press CTRL+C to stop)
> -pve# ceph -w
> +# Continuously watch the cluster status
> +pve# watch ceph --status
> +
> +# Print the cluster status once (not being updated)
> +# and continuously append lines of status events
> +pve# ceph --watch
>  ----
>
> +[[pve_ceph_ts]]
> +Troubleshooting
> +~~~~~~~~~~~~~~~
> +
> +This section includes frequently used troubleshooting information.
> +More information can be found on the official Ceph website under
> +Troubleshooting
> +footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/].
> +
> +[[pve_ceph_ts_logs]]
> +.Relevant Logs on Affected Node
> +
> +* xref:_disk_health_monitoring[Disk Health Monitoring]
For some reason, the "_disk_health_monitoring" anchor above breaks building
the docs for me -- "make update" exits with an error, complaining that it
can't find the anchor. The one-page docs ("pve-admin-guide.html") seem to
build just fine, though. The anchor works there too, so I'm not sure what's
going wrong there exactly.

> +* __System -> System Log__ (or, for example,
> + `journalctl --since "2 days ago"`)
> +* IPMI and RAID controller logs
> +
> +Ceph service crashes can be listed and viewed in detail by running
> +`ceph crash ls` and `ceph crash info <crash_id>`. Crashes marked as
> +new can be acknowledged by running, for example,
> +`ceph crash archive-all`.
> +
>  To get a more detailed view, every Ceph service has a log file under
>  `/var/log/ceph/`. If more detail is required, the log level can be
>  adjusted footnote:[Ceph log and debugging
>  {cephdocs-url}/rados/troubleshooting/log-and-debug/].
>
> -You can find more information about troubleshooting
> -footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
> -a Ceph cluster on the official website.
> -
> +[[pve_ceph_ts_causes]]
> +.Common Causes of Ceph Problems
> +
> +* Network problems like congestion, a faulty switch, a shut down
> +interface or a blocking firewall. Check whether all {pve} nodes are
> +reliably reachable on the xref:_cluster_network[corosync] network and

Would personally prefer "xref:_cluster_network[corosync network]" above, but
no hard opinions there.

> +on the xref:pve_ceph_install_wizard[configured] Ceph public and
> +cluster network.

Would also prefer [configured Ceph public and cluster network] as a whole
here.

> +
> +* Disk or connection parts which are:
> +** defective
> +** not firmly mounted
> +** lacking I/O performance under higher load (e.g. when using HDDs,
> +consumer hardware or xref:pve_ceph_recommendation_raid[inadvisable]
> +RAID controllers)

Same here; I would prefer to highlight [inadvisable RAID controllers] as a
whole.

> +
> +* Not fulfilling the xref:pve_ceph_recommendation[recommendations] for
> +a healthy Ceph cluster.
> +
> +[[pve_ceph_ts_problems]]
> +.Common Ceph Problems
> + ::
> +
> +OSDs `down`/crashed:::
> +A faulty OSD will be reported as `down` and mostly (auto) `out` 10
> +minutes later. Depending on the cause, it can also automatically
> +become `up` and `in` again. To try a manual activation via web
> +interface, go to __Any node -> Ceph -> OSD__, select the OSD and click
> +on **Start**, **In** and **Reload**. When using the shell, run on the
> +affected node `ceph-volume lvm activate --all`.
> ++
> +To activate a failed OSD, it may be necessary to
> +xref:ha_manager_node_maintenance[safely] reboot the respective node

And again here: would personally prefer [safely reboot] in the anchor ref.

> +or, as a last resort, to
> +xref:pve_ceph_osd_replace[recreate or replace] the OSD.
>
>  ifdef::manvolnum[]
>  include::pve-copyright.adoc[]

Note: The only thing that really stood out to me was the
"_disk_health_monitoring" anchor refusing to build on my system; the other
comments here are just tiny style suggestions. If you disagree with them, no
hard feelings at all! :P
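P.S. To make the xref nits above a bit more concrete, this is roughly the
phrasing I had in mind. Just a sketch, not a demand -- the anchor ids are
taken straight from the patch, and the surrounding text is shortened with
"...":

----
// suggestion only: put the whole phrase inside the xref label
... reliably reachable on the xref:_cluster_network[corosync network] and
on the xref:pve_ceph_install_wizard[configured Ceph public and cluster network].

... (e.g. when using HDDs, consumer hardware or
xref:pve_ceph_recommendation_raid[inadvisable RAID controllers])

... it may be necessary to xref:ha_manager_node_maintenance[safely reboot]
the respective node ...
----

That way the whole phrase becomes the rendered link text, which in my opinion
reads a bit more naturally than a single highlighted word. Again, feel free to
ignore.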