On Wed, Feb 18, 2015 at 10:27 PM, Florian Haas <flor...@hastexo.com> wrote:
> On Wed, Feb 18, 2015 at 9:32 PM, Mark Nelson <mnel...@redhat.com> wrote:
>> On 02/18/2015 02:19 PM, Florian Haas wrote:
>>>
>>> Hey everyone,
>>>
>>> I must confess I'm still not fully understanding this problem and
>>> don't exactly know where to start digging deeper, but perhaps other
>>> users have seen this and/or it rings a bell.
>>>
>>> System info: Ceph giant on CentOS 7; approx. 240 OSDs, 6 pools using 2
>>> different rulesets, where the problem applies to hosts and PGs using a
>>> bog-standard default crushmap.
>>>
>>> Symptom: out of the blue, ceph-osd processes on a single OSD node
>>> start going to 100% CPU utilization. The problem gets so bad that
>>> the machine effectively becomes CPU bound and can't cope with any
>>> client requests anymore. Stopping and restarting all OSDs brings the
>>> problem right back, as does rebooting the machine: right after
>>> ceph-osd processes start, CPU utilization shoots up again. Stopping
>>> and marking out several OSDs on the machine makes the problem go away,
>>> but obviously causes massive backfilling. All the logs show while CPU
>>> utilization is implausibly high is slow requests (which would be
>>> expected in a system that can barely do anything).
>>>
>>> Now I've seen issues like this before on dumpling and firefly, but
>>> besides the fact that they have all been addressed and should now be
>>> fixed, they always involved the prior mass removal of RBD snapshots.
>>> This system only used a handful of snapshots in testing, and is
>>> presently not using any snapshots at all.
>>>
>>> I'll be spending some time looking for clues in the log files of the
>>> OSDs whose shutdown made the problem go away, but if this sounds
>>> familiar to anyone willing to offer clues, I'd be more than
>>> interested. :) Thanks!
>>
>>
>> Hi Florian,
>>
>> Does a quick perf top tell you anything useful?
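Mark's "quick perf top" could look something like the sketch below. This is my illustration, not a command given in the thread, and the pid lookup via pidof is an assumption about the host; pinning perf to a single ceph-osd pid keeps the output focused on the suspect process.

```shell
# Hedged sketch: briefly profile one ceph-osd process with perf top.
# Assumes perf and a running ceph-osd; both checks degrade gracefully.
if command -v perf >/dev/null 2>&1 && OSD_PID=$(pidof ceph-osd 2>/dev/null); then
    # --stdio avoids the ncurses UI; ${OSD_PID%% *} picks the first pid.
    # timeout bounds the run so a wedged profiler cannot hold the terminal.
    timeout 10 perf top --stdio -p "${OSD_PID%% *}" | head -25
    RESULT=profiled
else
    echo "perf or ceph-osd not available; skipping"
    RESULT=skipped
fi
```

Hot symbols near the top of that output (spinlocks, CRC routines, allocator internals) would normally show where the CPU time is going.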
>
> Hi Mark,
>
> Unfortunately, quite the contrary -- but this might actually provide a
> clue to the underlying issue.
>
> The CPU pegging issue isn't currently present, so the perf top data
> wouldn't be conclusive until the issue is reproduced. But: merely
> running perf top on this host, which currently only has 2 active OSDs,
> renders the host unresponsive.
>
> Corresponding dmesg snippet:
>
> [Wed Feb 18 20:53:42 2015] hrtimer: interrupt took 2243820 ns
> [Wed Feb 18 20:53:49 2015] ------------[ cut here ]------------
> [Wed Feb 18 20:53:49 2015] WARNING: at arch/x86/kernel/cpu/perf_event.c:1074 x86_pmu_start+0xc6/0x100()
> [Wed Feb 18 20:53:49 2015] Modules linked in: ipmi_si binfmt_misc mpt3sas mptctl mptbase dell_rbu 8021q garp stp mrp llc sg ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables xfs vfat fat iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich mfd_core mei_me mei ipmi_devintf shpchp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm bnx2x drm mpt2sas i2c_core raid_class mdio scsi_transport_sas libcrc32c dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]
>
> [Wed Feb 18 20:53:49 2015] CPU: 0 PID: 12381 Comm: dsm_sa_datamgrd Not tainted 3.10.0-123.20.1.el7.x86_64 #1
> [Wed Feb 18 20:53:49 2015] Hardware name: Dell Inc. PowerEdge R720xd/0020HJ, BIOS 2.2.2 01/16/2014
> [Wed Feb 18 20:53:49 2015] 0000000000000000 0000000050de8931 ffff880fef003d40 ffffffff815e2b0c
> [Wed Feb 18 20:53:49 2015] ffff880fef003d78 ffffffff8105dee1 ffff880c316a7400 ffff880fef00b9e0
> [Wed Feb 18 20:53:49 2015] 0000000000000000 ffff880fef016db0 ffff880dbaa896c0 ffff880fef003d88
> [Wed Feb 18 20:53:49 2015] Call Trace:
> [Wed Feb 18 20:53:49 2015] <IRQ> [<ffffffff815e2b0c>] dump_stack+0x19/0x1b
> [Wed Feb 18 20:53:49 2015] [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
> [Wed Feb 18 20:53:49 2015] [<ffffffff8105e00a>] warn_slowpath_null+0x1a/0x20
> [Wed Feb 18 20:53:49 2015] [<ffffffff81023706>] x86_pmu_start+0xc6/0x100
> [Wed Feb 18 20:53:49 2015] [<ffffffff81136128>] perf_adjust_freq_unthr_context.part.79+0x198/0x1b0
> [Wed Feb 18 20:53:49 2015] [<ffffffff811363d6>] perf_event_task_tick+0xb6/0xf0
> [Wed Feb 18 20:53:49 2015] [<ffffffff810967e5>] scheduler_tick+0xd5/0x150
> [Wed Feb 18 20:53:49 2015] [<ffffffff8106fe86>] update_process_times+0x66/0x80
> [Wed Feb 18 20:53:49 2015] [<ffffffff810be055>] tick_sched_handle.isra.16+0x25/0x60
> [Wed Feb 18 20:53:49 2015] [<ffffffff810be0d1>] tick_sched_timer+0x41/0x60
> [Wed Feb 18 20:53:49 2015] [<ffffffff81089a57>] __run_hrtimer+0x77/0x1d0
> [Wed Feb 18 20:53:49 2015] [<ffffffff810be090>] ? tick_sched_handle.isra.16+0x60/0x60
> [Wed Feb 18 20:53:49 2015] [<ffffffff8108a297>] hrtimer_interrupt+0xf7/0x240
> [Wed Feb 18 20:53:49 2015] [<ffffffff81039717>] local_apic_timer_interrupt+0x37/0x60
> [Wed Feb 18 20:53:49 2015] [<ffffffff815f552f>] smp_apic_timer_interrupt+0x3f/0x60
> [Wed Feb 18 20:53:49 2015] [<ffffffff815f3e9d>] apic_timer_interrupt+0x6d/0x80
> [Wed Feb 18 20:53:49 2015] <EOI> [<ffffffff81092d46>] ? finish_task_switch+0x56/0x170
> [Wed Feb 18 20:53:49 2015] [<ffffffff815e7db5>] __schedule+0x2c5/0x790
> [Wed Feb 18 20:53:49 2015] [<ffffffff815e82a9>] schedule+0x29/0x70
> [Wed Feb 18 20:53:49 2015] [<ffffffff815e73b5>] do_nanosleep+0xc5/0x130
> [Wed Feb 18 20:53:49 2015] [<ffffffff8108a78d>] hrtimer_nanosleep+0xad/0x170
> [Wed Feb 18 20:53:49 2015] [<ffffffff81089720>] ? hrtimer_get_res+0x50/0x50
> [Wed Feb 18 20:53:49 2015] [<ffffffff815e7361>] ? do_nanosleep+0x71/0x130
> [Wed Feb 18 20:53:49 2015] [<ffffffff8108a8ce>] SyS_nanosleep+0x7e/0x90
> [Wed Feb 18 20:53:49 2015] [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b
> [Wed Feb 18 20:53:49 2015] ---[ end trace bbcc155358ec2b47 ]---
> [Wed Feb 18 21:00:35 2015] ------------[ cut here ]------------
> [Wed Feb 18 21:00:35 2015] WARNING: at arch/x86/kernel/cpu/perf_event_intel.c:1205 intel_pmu_handle_irq+0x3f7/0x410()
> [Wed Feb 18 21:00:35 2015] perfevents: irq loop stuck!
> [Wed Feb 18 21:00:35 2015] Modules linked in: ipmi_si binfmt_misc mpt3sas mptctl mptbase dell_rbu 8021q garp stp mrp llc sg ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables xfs vfat fat iTCO_wdt iTCO_vendor_support dcdbas coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core lpc_ich mfd_core mei_me mei ipmi_devintf shpchp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq mperf nfsd auth_rpcgss nfs_acl lockd sunrpc ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm bnx2x drm mpt2sas i2c_core raid_class mdio scsi_transport_sas libcrc32c dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]
>
> [Wed Feb 18 21:00:35 2015] CPU: 9 PID: 4608 Comm: ceph-osd Tainted: G W -------------- 3.10.0-123.20.1.el7.x86_64 #1
> [Wed Feb 18 21:00:35 2015] Hardware name: Dell Inc. PowerEdge R720xd/0020HJ, BIOS 2.2.2 01/16/2014
> [Wed Feb 18 21:00:35 2015] ffff880fef125d10 000000000bc39f12 ffff880fef125cc8 ffffffff815e2b0c
> [Wed Feb 18 21:00:35 2015] ffff880fef125d00 ffffffff8105dee1 0000000000000064 ffff880fef12b9e0
> [Wed Feb 18 21:00:35 2015] ffff880c316a4400 0000000000000040 ffff880fef12bbe0 ffff880fef125d68
> [Wed Feb 18 21:00:35 2015] Call Trace:
> [Wed Feb 18 21:00:35 2015] <NMI> [<ffffffff815e2b0c>] dump_stack+0x19/0x1b
> [Wed Feb 18 21:00:35 2015] [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
> [Wed Feb 18 21:00:35 2015] [<ffffffff8105df5c>] warn_slowpath_fmt+0x5c/0x80
> [Wed Feb 18 21:00:35 2015] [<ffffffff810294c7>] intel_pmu_handle_irq+0x3f7/0x410
> [Wed Feb 18 21:00:35 2015] [<ffffffff81356ff4>] ? ghes_copy_tofrom_phys+0x124/0x210
> [Wed Feb 18 21:00:35 2015] [<ffffffff8101a603>] ? native_sched_clock+0x13/0x80
> [Wed Feb 18 21:00:35 2015] [<ffffffff8101a679>] ? sched_clock+0x9/0x10
> [Wed Feb 18 21:00:35 2015] [<ffffffff8109a045>] ? sched_clock_cpu+0xb5/0x100
> [Wed Feb 18 21:00:35 2015] [<ffffffff815ec08b>] perf_event_nmi_handler+0x2b/0x50
> [Wed Feb 18 21:00:35 2015] [<ffffffff815eb7d9>] nmi_handle.isra.0+0x69/0xb0
> [Wed Feb 18 21:00:35 2015] [<ffffffff815eb989>] do_nmi+0x169/0x340
> [Wed Feb 18 21:00:35 2015] [<ffffffff815eac31>] end_repeat_nmi+0x1e/0x2e
> [Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
> [Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
> [Wed Feb 18 21:00:35 2015] [<ffffffff8104620a>] ? native_write_msr_safe+0xa/0x10
> [Wed Feb 18 21:00:35 2015] <<EOE>> <IRQ> [<ffffffff81029bc1>] intel_pmu_disable_event+0xa1/0x160
> [Wed Feb 18 21:00:35 2015] [<ffffffff81022956>] x86_pmu_stop+0x76/0xd0
> [Wed Feb 18 21:00:35 2015] [<ffffffff81136094>] perf_adjust_freq_unthr_context.part.79+0x104/0x1b0
> [Wed Feb 18 21:00:35 2015] [<ffffffff811363d6>] perf_event_task_tick+0xb6/0xf0
> [Wed Feb 18 21:00:35 2015] [<ffffffff810967e5>] scheduler_tick+0xd5/0x150
> [Wed Feb 18 21:00:35 2015] [<ffffffff8106fe86>] update_process_times+0x66/0x80
> [Wed Feb 18 21:00:35 2015] [<ffffffff810be055>] tick_sched_handle.isra.16+0x25/0x60
> [Wed Feb 18 21:00:35 2015] [<ffffffff810be0d1>] tick_sched_timer+0x41/0x60
> [Wed Feb 18 21:00:35 2015] [<ffffffff81089a57>] __run_hrtimer+0x77/0x1d0
> [Wed Feb 18 21:00:35 2015] [<ffffffff810be090>] ? tick_sched_handle.isra.16+0x60/0x60
> [Wed Feb 18 21:00:35 2015] [<ffffffff8108a297>] hrtimer_interrupt+0xf7/0x240
> [Wed Feb 18 21:00:35 2015] [<ffffffff81039717>] local_apic_timer_interrupt+0x37/0x60
> [Wed Feb 18 21:00:35 2015] [<ffffffff815f552f>] smp_apic_timer_interrupt+0x3f/0x60
> [Wed Feb 18 21:00:35 2015] [<ffffffff815f3e9d>] apic_timer_interrupt+0x6d/0x80
> [Wed Feb 18 21:00:35 2015] <EOI> [<ffffffff811ca796>] ? file_update_time+0xe6/0xf0
> [Wed Feb 18 21:00:35 2015] [<ffffffff811ca70f>] ? file_update_time+0x5f/0xf0
> [Wed Feb 18 21:00:35 2015] [<ffffffff811437a8>] __generic_file_aio_write+0x198/0x3e0
> [Wed Feb 18 21:00:35 2015] [<ffffffff81143a4d>] generic_file_aio_write+0x5d/0xc0
> [Wed Feb 18 21:00:35 2015] [<ffffffffa02a51a9>] ext4_file_write+0xa9/0x450 [ext4]
> [Wed Feb 18 21:00:35 2015] [<ffffffff810976b6>] ? try_to_wake_up+0x1b6/0x280
> [Wed Feb 18 21:00:35 2015] [<ffffffff810bfd68>] ? get_futex_key+0x1c8/0x2a0
> [Wed Feb 18 21:00:35 2015] [<ffffffff811af0cd>] do_sync_write+0x8d/0xd0
> [Wed Feb 18 21:00:35 2015] [<ffffffff811af86d>] vfs_write+0xbd/0x1e0
> [Wed Feb 18 21:00:35 2015] [<ffffffff811b02b8>] SyS_write+0x58/0xb0
> [Wed Feb 18 21:00:35 2015] [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b
> [Wed Feb 18 21:00:35 2015] ---[ end trace bbcc155358ec2b48 ]---
>
> [Wed Feb 18 21:00:35 2015] CPU#9: ctrl: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: status: 0000000000000001
> [Wed Feb 18 21:00:35 2015] CPU#9: overflow: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: fixed: 00000000000000b0
> [Wed Feb 18 21:00:35 2015] CPU#9: pebs: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: active: 0000000200000000
> [Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC0 ctrl: 000000000013003c
> [Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC0 count: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC0 left: 0000000000007a1c
> [Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC1 ctrl: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC1 count: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC1 left: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC2 ctrl: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC2 count: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC2 left: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC3 ctrl: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC3 count: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: gen-PMC3 left: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC0 count: 0000000000000000
> [Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC1 count: 0000fffbe795d19b
> [Wed Feb 18 21:00:35 2015] CPU#9: fixed-PMC2 count: 0000000000000000
> [Wed Feb 18 21:00:35 2015] perf_event_intel: clearing PMU state on CPU#9
> [Wed Feb 18 21:00:35 2015] perf samples too long (64102 > 62500), lowering kernel.perf_event_max_sample_rate to 2000
>
> In case you're wondering, no, the ext4 file write does not come from an
> OSD. All OSDs use XFS, and the node does not host a mon either.
>
> The kernel build is the latest currently available for CentOS 7, with
> matching perf obviously.
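Given that the last dmesg line shows the kernel itself throttling kernel.perf_event_max_sample_rate because samples were taking too long, one thing worth trying (my suggestion, not something proposed in the thread) is sampling offline with perf record at a fixed, low frequency instead of relying on perf top's adaptive default; that keeps PMU interrupt load bounded and leaves a file to analyze after the fact. A rough sketch:

```shell
# Show the ceiling the kernel enforces on perf's sampling rate; the
# dmesg above shows the kernel dropping this to 2000 on its own.
MAX_RATE=$(cat /proc/sys/kernel/perf_event_max_sample_rate 2>/dev/null || echo unknown)
echo "perf_event_max_sample_rate: $MAX_RATE"

# Record at a fixed 99 Hz with call graphs (-g), then inspect offline.
# Assumes perf is installed and a ceph-osd process exists; skips otherwise.
if command -v perf >/dev/null 2>&1 && OSD_PID=$(pidof ceph-osd 2>/dev/null); then
    perf record -F 99 -g -p "${OSD_PID%% *}" -o /tmp/osd-perf.data -- sleep 30
    perf report --stdio -i /tmp/osd-perf.data | head -40
    STATUS=recorded
else
    echo "perf or ceph-osd not present; nothing to record"
    STATUS=skipped
fi
```

If even this proves too heavy, lowering the sysctl further beforehand (e.g. sysctl kernel.perf_event_max_sample_rate=1000) is another knob, though on a kernel already logging "perfevents: irq loop stuck!" the PMU handling itself looks suspect.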
I wonder, has anyone else seen this? I'm trying to determine whether

- this is just a generally fishy system;
- something in either the kernel or its matching perf-tools could be causing this; or
- the root cause lies somewhere in Ceph.

As it is, a simple "perf top" basically hosing the system is not something that is generally considered expected behavior.

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com