Existing information is slightly modified and retained.

Add information:

* List which logs are usually helpful for troubleshooting
* Explain how to acknowledge listed Ceph crashes and view details
* List common causes of Ceph problems and link to recommendations for
  a healthy cluster
* Briefly describe the common problem "OSDs down/crashed"
Signed-off-by: Alexander Zeidler <a.zeid...@proxmox.com>
---
 pveceph.adoc | 72 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 64 insertions(+), 8 deletions(-)

diff --git a/pveceph.adoc b/pveceph.adoc
index 90bb975..4e1c1e2 100644
--- a/pveceph.adoc
+++ b/pveceph.adoc
@@ -1150,22 +1150,78 @@ The following Ceph commands can be used to see if the cluster is healthy
 ('HEALTH_OK'), if there are warnings ('HEALTH_WARN'), or even errors
 ('HEALTH_ERR'). If the cluster is in an unhealthy state, the status
 commands below will also give you an overview of the current events and
 actions to take.
+To stop their execution, press CTRL-C.
 
 ----
-# single time output
-pve# ceph -s
-# continuously output status changes (press CTRL+C to stop)
-pve# ceph -w
+# Continuously watch the cluster status
+pve# watch ceph --status
+
+# Print the cluster status once (without further updates)
+# and continuously append lines of status events
+pve# ceph --watch
 ----
 
+[[pve_ceph_ts]]
+Troubleshooting
+~~~~~~~~~~~~~~~
+
+This section includes frequently used troubleshooting information.
+More information can be found on the official Ceph website under
+Troubleshooting
+footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/].
+
+[[pve_ceph_ts_logs]]
+.Relevant Logs on Affected Node
+
+* xref:_disk_health_monitoring[Disk Health Monitoring]
+* __System -> System Log__ (or, for example,
+  `journalctl --since "2 days ago"`)
+* IPMI and RAID controller logs
+
+Ceph service crashes can be listed and viewed in detail by running
+`ceph crash ls` and `ceph crash info <crash_id>`. Crashes marked as
+new can be acknowledged by running, for example,
+`ceph crash archive-all`.
+
 To get a more detailed view, every Ceph service has a log file under
 `/var/log/ceph/`. If more detail is required, the log level can be
 adjusted footnote:[Ceph log and debugging {cephdocs-url}/rados/troubleshooting/log-and-debug/].
 
-You can find more information about troubleshooting
-footnote:[Ceph troubleshooting {cephdocs-url}/rados/troubleshooting/]
-a Ceph cluster on the official website.
-
+[[pve_ceph_ts_causes]]
+.Common Causes of Ceph Problems
+
+* Network problems like congestion, a faulty switch, a shut-down
+interface, or a blocking firewall. Check whether all {pve} nodes are
+reliably reachable on the xref:_cluster_network[corosync] network and
+on the xref:pve_ceph_install_wizard[configured] Ceph public and
+cluster network.
+
+* Disks or connection components which are:
+** defective
+** not firmly mounted
+** lacking I/O performance under higher load (e.g. when using HDDs,
+consumer hardware or xref:pve_ceph_recommendation_raid[inadvisable]
+RAID controllers)
+
+* Not fulfilling the xref:pve_ceph_recommendation[recommendations] for
+a healthy Ceph cluster.
+
+[[pve_ceph_ts_problems]]
+.Common Ceph Problems
+ ::
+
+OSDs `down`/crashed:::
+A faulty OSD will be reported as `down` and is usually marked `out`
+automatically 10 minutes later. Depending on the cause, it can also
+automatically become `up` and `in` again. To try a manual activation
+via the web interface, go to __Any node -> Ceph -> OSD__, select the
+OSD and click on **Start**, **In** and **Reload**. When using the
+shell, run `ceph-volume lvm activate --all` on the affected node.
++
+To activate a failed OSD, it may be necessary to
+xref:ha_manager_node_maintenance[safely] reboot the respective node
+or, as a last resort, to
+xref:pve_ceph_osd_replace[recreate or replace] the OSD.
 ifdef::manvolnum[]
 include::pve-copyright.adoc[]
-- 
2.39.5

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
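
To complement the new troubleshooting section, here is a minimal shell
sketch of how the crash-handling and OSD-reactivation commands mentioned
above can be combined on the affected node. The crash id and the OSD id
`12` are placeholders, and `ceph osd tree down` as well as
`systemctl restart ceph-osd@<id>.service` are supplementary commands not
introduced by this patch; actual ids, service names and output will
differ per cluster.

----
# List recorded crashes, show details of one, then acknowledge
# all crashes marked as new (the crash id is a placeholder)
pve# ceph crash ls
pve# ceph crash info 2024-05-01T10:00:00.000000Z_00000000-0000-0000-0000-000000000000
pve# ceph crash archive-all

# Show OSDs currently reported as down, then try to bring one back
# (OSD id 12 is a placeholder)
pve# ceph osd tree down
pve# systemctl restart ceph-osd@12.service
pve# ceph-volume lvm activate --all
----

If the OSD still stays `down` after this, the node reboot or the
recreate/replace steps described in the section remain the fallback.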