Hello Jan

On Wed, Sep 9, 2015 at 11:59 AM, Jan Schermer <j...@schermer.cz> wrote:

> Just to recapitulate - the nodes are doing "nothing" when it drops to
> zero? Not flushing something to drives (iostat)? Not cleaning pagecache
> (kswapd and similar)? Not out of any type of memory (slab,
> min_free_kbytes)? Not network link errors, no bad checksums (those are hard
> to spot, though)?
>
> Unless you find something I suggest you try disabling offloads on the NICs
> and see if the problem goes away.
>

Could you please elaborate on this point: how do you disable offloads on the
NIC? What does that mean, how is it done, and how would it help?

Sorry, I don't know about this.
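
From what I could find, is it something along these lines with ethtool? (Just
guessing here, and assuming eth0 is the cluster-facing interface.)

# ethtool -k eth0                                     (show current offload settings)
# ethtool -K eth0 tso off gso off gro off lro off     (disable TSO/GSO/GRO/LRO)
# ethtool -S eth0 | grep -i err                       (check for NIC-level error counters)

Please correct me if that is not what you meant.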

- Vickey -



>
> Jan
>
> > On 08 Sep 2015, at 18:26, Lincoln Bryant <linco...@uchicago.edu> wrote:
> >
> > For whatever it’s worth, my problem has returned and is very similar to
> yours. Still trying to figure out what’s going on over here.
> >
> > Performance is nice for a few seconds, then goes to 0. This is a similar
> setup to yours (12 OSDs per box, Scientific Linux 6, Ceph 0.94.3, etc)
> >
> >  384      16     29520     29504   307.287      1188 0.0492006  0.208259
> >  385      16     29813     29797   309.532      1172 0.0469708  0.206731
> >  386      16     30105     30089   311.756      1168 0.0375764  0.205189
> >  387      16     30401     30385   314.009      1184  0.036142  0.203791
> >  388      16     30695     30679   316.231      1176 0.0372316  0.202355
> >  389      16     30987     30971    318.42      1168 0.0660476  0.200962
> >  390      16     31282     31266   320.628      1180 0.0358611  0.199548
> >  391      16     31568     31552   322.734      1144 0.0405166  0.198132
> >  392      16     31857     31841   324.859      1156 0.0360826  0.196679
> >  393      16     32090     32074   326.404       932 0.0416869   0.19549
> >  394      16     32205     32189   326.743       460 0.0251877  0.194896
> >  395      16     32302     32286   326.897       388 0.0280574  0.194395
> >  396      16     32348     32332   326.537       184 0.0256821  0.194157
> >  397      16     32385     32369   326.087       148 0.0254342  0.193965
> >  398      16     32424     32408   325.659       156 0.0263006  0.193763
> >  399      16     32445     32429   325.054        84 0.0233839  0.193655
> > 2015-09-08 11:22:31.940164 min lat: 0.0165045 max lat: 67.6184 avg lat: 0.193655
> >  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >  400      16     32445     32429   324.241         0         -  0.193655
> >  401      16     32445     32429   323.433         0         -  0.193655
> >  402      16     32445     32429   322.628         0         -  0.193655
> >  403      16     32445     32429   321.828         0         -  0.193655
> >  404      16     32445     32429   321.031         0         -  0.193655
> >  405      16     32445     32429   320.238         0         -  0.193655
> >  406      16     32445     32429    319.45         0         -  0.193655
> >  407      16     32445     32429   318.665         0         -  0.193655
> >
> > needless to say, very strange.
> >
> > —Lincoln
> >
> >
> >> On Sep 7, 2015, at 3:35 PM, Vickey Singh <vickey.singh22...@gmail.com>
> wrote:
> >>
> >> Adding ceph-users.
> >>
> >> On Mon, Sep 7, 2015 at 11:31 PM, Vickey Singh <
> vickey.singh22...@gmail.com> wrote:
> >>
> >>
> >> On Mon, Sep 7, 2015 at 10:04 PM, Udo Lembke <ulem...@polarzone.de>
> wrote:
> >> Hi Vickey,
> >> Thanks for your time in replying to my problem.
> >>
> >> I had the same rados bench output after changing the motherboard of the monitor node with the lowest IP...
> >> Due to the new mainboard, I assume the hardware clock was wrong during startup. Ceph health showed no errors, but none of the VMs were able to do IO (very high load on the VMs, but no traffic).
> >> I stopped that mon, but it didn't change anything. I had to restart all the other mons to get IO again. After that I started the first mon again (with the correct time now) and everything worked fine...
> >>
> >> Thanks, I will try restarting all OSDs/MONs and report back if it solves my problem.
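> >>
> >> I assume on CentOS 6 / Hammer with sysvinit that would be something like the following on each node (the mon and osd names are just examples -- please correct me if wrong):
> >>
> >> # service ceph restart mon.stor0111
> >> # service ceph restart osd.12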
> >>
> >> Another possibility:
> >> Do you use journals on SSDs? Perhaps the SSDs can't keep up with writes because of garbage collection?
> >>
> >> No, I don't have journals on SSDs; they are on the same disks as the OSDs.
> >>
> >>
> >>
> >> Udo
> >>
> >>
> >> On 07.09.2015 16:36, Vickey Singh wrote:
> >>> Dear Experts
> >>>
> >>> Can someone please help me figure out why my cluster is not able to write data?
> >>>
> >>> See the output below: cur MB/s is 0 and avg MB/s keeps decreasing.
> >>>
> >>>
> >>> Ceph Hammer  0.94.2
> >>> CentOS 6 (3.10.69-1)
> >>>
> >>> The Ceph status says ops are blocked. I have tried checking everything I know:
> >>>
> >>> - System resources (CPU, network, disk, memory) -- all normal
> >>> - 10G network for public and cluster traffic -- no saturation
> >>> - All disks are physically healthy
> >>> - No messages in /var/log/messages or dmesg
> >>> - Tried restarting the OSDs that are blocking operations, but no luck
> >>> - Tried writing through RBD and rados bench; both show the same problem
> >>>
> >>> Please help me to fix this problem.
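> >>>
> >>> If it helps, I can also share what the usual blocked-request checks show. Roughly this (osd.3 is just an example id):
> >>>
> >>> # ceph health detail                       (lists which OSDs the blocked requests are on)
> >>> # ceph daemon osd.3 dump_ops_in_flight     (run on that OSD's node, shows the stuck ops)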
> >>>
> >>> #  rados bench -p rbd 60 write
> >>> Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds or 0 objects
> >>> Object prefix: benchmark_data_stor1_1791844
> >>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >>>     0       0         0         0         0         0         -         0
> >>>     1      16       125       109   435.873       436  0.022076 0.0697864
> >>>     2      16       139       123   245.948        56  0.246578 0.0674407
> >>>     3      16       139       123   163.969         0         - 0.0674407
> >>>     4      16       139       123   122.978         0         - 0.0674407
> >>>     5      16       139       123    98.383         0         - 0.0674407
> >>>     6      16       139       123   81.9865         0         - 0.0674407
> >>>     7      16       139       123   70.2747         0         - 0.0674407
> >>>     8      16       139       123   61.4903         0         - 0.0674407
> >>>     9      16       139       123   54.6582         0         - 0.0674407
> >>>    10      16       139       123   49.1924         0         - 0.0674407
> >>>    11      16       139       123   44.7201         0         - 0.0674407
> >>>    12      16       139       123   40.9934         0         - 0.0674407
> >>>    13      16       139       123   37.8401         0         - 0.0674407
> >>>    14      16       139       123   35.1373         0         - 0.0674407
> >>>    15      16       139       123   32.7949         0         - 0.0674407
> >>>    16      16       139       123   30.7451         0         - 0.0674407
> >>>    17      16       139       123   28.9364         0         - 0.0674407
> >>>    18      16       139       123   27.3289         0         - 0.0674407
> >>>    19      16       139       123   25.8905         0         - 0.0674407
> >>> 2015-09-07 15:54:52.694071 min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407
> >>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >>>    20      16       139       123    24.596         0         - 0.0674407
> >>>    21      16       139       123   23.4247         0         - 0.0674407
> >>>    22      16       139       123     22.36         0         - 0.0674407
> >>>    23      16       139       123   21.3878         0         - 0.0674407
> >>>    24      16       139       123   20.4966         0         - 0.0674407
> >>>    25      16       139       123   19.6768         0         - 0.0674407
> >>>    26      16       139       123     18.92         0         - 0.0674407
> >>>    27      16       139       123   18.2192         0         - 0.0674407
> >>>    28      16       139       123   17.5686         0         - 0.0674407
> >>>    29      16       139       123   16.9628         0         - 0.0674407
> >>>    30      16       139       123   16.3973         0         - 0.0674407
> >>>    31      16       139       123   15.8684         0         - 0.0674407
> >>>    32      16       139       123   15.3725         0         - 0.0674407
> >>>    33      16       139       123   14.9067         0         - 0.0674407
> >>>    34      16       139       123   14.4683         0         - 0.0674407
> >>>    35      16       139       123   14.0549         0         - 0.0674407
> >>>    36      16       139       123   13.6645         0         - 0.0674407
> >>>    37      16       139       123   13.2952         0         - 0.0674407
> >>>    38      16       139       123   12.9453         0         - 0.0674407
> >>>    39      16       139       123   12.6134         0         - 0.0674407
> >>> 2015-09-07 15:55:12.697124 min lat: 0.022076 max lat: 0.46117 avg lat: 0.0674407
> >>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> >>>    40      16       139       123   12.2981         0         - 0.0674407
> >>>    41      16       139       123   11.9981         0         - 0.0674407
> >>>
> >>>
> >>>
> >>>
> >>>    cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8
> >>>     health HEALTH_WARN
> >>>            1 requests are blocked > 32 sec
> >>>     monmap e3: 3 mons at {stor0111=10.100.1.111:6789/0,stor0113=10.100.1.113:6789/0,stor0115=10.100.1.115:6789/0}
> >>>            election epoch 32, quorum 0,1,2 stor0111,stor0113,stor0115
> >>>     osdmap e19536: 50 osds: 50 up, 50 in
> >>>      pgmap v928610: 2752 pgs, 9 pools, 30476 GB data, 4183 kobjects
> >>>            91513 GB used, 47642 GB / 135 TB avail
> >>>                2752 active+clean
> >>>
> >>>
> >>> Tried using RBD
> >>>
> >>>
> >>> # dd if=/dev/zero of=file1 bs=4K count=10000 oflag=direct
> >>> 10000+0 records in
> >>> 10000+0 records out
> >>> 40960000 bytes (41 MB) copied, 24.5529 s, 1.7 MB/s
> >>>
> >>> # dd if=/dev/zero of=file1 bs=1M count=100 oflag=direct
> >>> 100+0 records in
> >>> 100+0 records out
> >>> 104857600 bytes (105 MB) copied, 1.05602 s, 9.3 MB/s
> >>>
> >>> # dd if=/dev/zero of=file1 bs=1G count=1 oflag=direct
> >>> 1+0 records in
> >>> 1+0 records out
> >>> 1073741824 bytes (1.1 GB) copied, 293.551 s, 3.7 MB/s
> >>> ]#
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >>
> >
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
