What are the specs of the machines?
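If it helps to gather that, something like the following on each node will show RAM, CPU count and what each OSD daemon is currently holding in memory (a quick sketch; replace osd.1 with an OSD id that actually runs on the host you are checking):

    free -h                                  # total / available RAM on the node
    nproc                                    # CPU core count
    sudo ceph osd df tree                    # which OSDs sit on which host, and their utilisation
    sudo ceph daemon osd.1 dump_mempools     # memory currently held by that OSD daemon, by pool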
Recovery work will use more memory than normal, clean operation, and it looks like you're maxing out the available memory on the machines while Ceph is trying to recover. (One way to take some of that pressure off is sketched after the quoted thread below.)

---- On Tue, 10 Sep 2019 18:10:50 +0800 amudha...@gmail.com wrote ----

I have also found the below errors in dmesg.

[332884.028810] systemd-journald[6240]: Failed to parse kernel command line, ignoring: Cannot allocate memory
[332885.054147] systemd-journald[6240]: Out of memory.
[332894.844765] systemd[1]: systemd-journald.service: Main process exited, code=exited, status=1/FAILURE
[332897.199736] systemd[1]: systemd-journald.service: Failed with result 'exit-code'.
[332906.503076] systemd[1]: Failed to start Journal Service.
[332937.909198] systemd[1]: ceph-crash.service: Main process exited, code=exited, status=1/FAILURE
[332939.308341] systemd[1]: ceph-crash.service: Failed with result 'exit-code'.
[332949.545907] systemd[1]: systemd-journald.service: Service has no hold-off time, scheduling restart.
[332949.546631] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 7.
[332949.546781] systemd[1]: Stopped Journal Service.
[332949.566402] systemd[1]: Starting Journal Service...
[332950.190332] systemd[1]: ceph-osd@1.service: Main process exited, code=killed, status=6/ABRT
[332950.190477] systemd[1]: ceph-osd@1.service: Failed with result 'signal'.
[332950.842297] systemd-journald[6249]: File /var/log/journal/8f2559099bf54865adc95e5340d04447/system.journal corrupted or uncleanly shut down, renaming and replacing.
[332951.019531] systemd[1]: Started Journal Service.

On Tue, Sep 10, 2019 at 3:04 PM Amudhan P <amudha...@gmail.com> wrote:

Hi,

I am using ceph version 13.2.6 (mimic) on a test setup, trying out CephFS.

My current setup: 3 nodes; one node contains two bricks and the other two nodes contain a single brick each. The volume is 3-replica.

I am trying to simulate a node failure. I powered down one host and started getting the message "-bash: fork: Cannot allocate memory" on the other systems when running any command, and the systems stopped responding to commands. What could be the reason for this?

At this stage, I am able to read some of the data stored in the volume, while other reads are just waiting on IO.

Output from "sudo ceph -s":

  cluster:
    id:     7c138e13-7b98-4309-b591-d4091a1742b4
    health: HEALTH_WARN
            1 osds down
            2 hosts (3 osds) down
            Degraded data redundancy: 5313488/7970232 objects degraded (66.667%), 64 pgs degraded

  services:
    mon: 1 daemons, quorum mon01
    mgr: mon01(active)
    mds: cephfs-tst-1/1/1 up {0=mon01=up:active}
    osd: 4 osds: 1 up, 2 in

  data:
    pools:   2 pools, 64 pgs
    objects: 2.66 M objects, 206 GiB
    usage:   421 GiB used, 3.2 TiB / 3.6 TiB avail
    pgs:     5313488/7970232 objects degraded (66.667%)
             64 active+undersized+degraded

  io:
    client:   79 MiB/s rd, 24 op/s rd, 0 op/s wr

Output from "sudo ceph osd df":

ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
 0   hdd 1.81940        0     0 B     0 B     0 B      0    0   0
 3   hdd 1.81940        0     0 B     0 B     0 B      0    0   0
 1   hdd 1.81940  1.00000 1.8 TiB 211 GiB 1.6 TiB  11.34 1.00   0
 2   hdd 1.81940  1.00000 1.8 TiB 210 GiB 1.6 TiB  11.28 1.00  64
                    TOTAL 3.6 TiB 421 GiB 3.2 TiB  11.31
MIN/MAX VAR: 1.00/1.00  STDDEV: 0.03

regards
Amudhan
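A rough sketch of what I mean by taking pressure off while it recovers (example values only, not tuned recommendations; these are standard Mimic options, but the cache change may need an OSD restart to fully take effect):

    # slow recovery/backfill down so each OSD keeps less work (and memory) in flight
    sudo ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-sleep 0.1'

    # cap the BlueStore cache per OSD (example: 512 MiB; the HDD default is 1 GiB)
    sudo ceph tell osd.* injectargs '--bluestore-cache-size 536870912'

If your Mimic point release has osd_memory_target (check with "ceph daemon osd.1 config show | grep memory_target"), that is a cleaner way to cap per-OSD memory than the raw cache size.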
_______________________________________________ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io