Re: [ceph-users] All OSD fails after few requests to RGW

Anton Dmitriev Wed, 10 May 2017 21:29:23 -0700

"recent enough version of the ceph-objectstore-tool" - sounds very interesting. Will it be released in one of the next Jewel minor releases?


On 10.05.2017 19:03, David Turner wrote:

PG subfolder splitting is the primary reason people are going to be deploying Luminous and Bluestore much faster than any other major release of Ceph. Bluestore removes the concept of subfolders in PGs.

I have had clusters that reached what seemed a hardcoded maximum of 12,800 objects in a subfolder. It would take an osd_heartbeat_grace of 240 or 300 to let them finish splitting their subfolders without being marked down. Recently I came across a cluster that had a setting of 240 objects per subfolder before splitting, so it was splitting all the time, and several of the OSDs took longer than 30 seconds to finish splitting into subfolders. That led to more problems as we started adding backfilling to everything and we lost a significant amount of throughput on the cluster.
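
For reference, the per-subfolder limits discussed in this thread follow the documented FileStore rule: a subfolder splits once it holds more than filestore_split_multiple * abs(filestore_merge_threshold) * 16 files. A minimal Python sketch of that arithmetic follows; the function name is made up for illustration, and the inputs are the split/merge values quoted further down in this thread:

```python
def split_threshold(split_multiple: int, merge_threshold: int) -> int:
    """Files a FileStore subfolder may hold before the OSD splits it.

    Hypothetical helper name; the formula itself is
    filestore_split_multiple * abs(filestore_merge_threshold) * 16.
    """
    # abs() because a negative merge threshold only disables merging;
    # its magnitude still counts toward the split limit.
    return split_multiple * abs(merge_threshold) * 16


if __name__ == "__main__":
    # split=32, merge=40: the original (mistaken) settings from this thread
    print(split_threshold(32, 40))
    # split=8, merge=40: the settings after the change
    print(split_threshold(8, 40))
```

With split lowered to 8 and merge left at 40, the limit is still 5120 files per subfolder, so directories holding ~20 000 files stay oversized until something actually triggers a split on them.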

I have yet to manage a cluster with a recent enough version of the ceph-objectstore-tool (hopefully I'll have one this month) that includes the ability to take an OSD offline, split the subfolders, then bring it back online. If you set up a way to monitor how big your subfolders are getting, you can leave the ceph settings as high as you want, and then go in and perform maintenance on your cluster 1 failure domain at a time, splitting all of the PG subfolders on the OSDs. This approach would prevent this from ever happening in the wild.
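
One possible way to do the monitoring described above is to walk an OSD's data directory and report every subfolder that holds more files than a chosen limit. This is only a sketch: the function names are hypothetical, the limit of 5120 is just the split=8/merge=40 value from this thread, and the path assumes the usual FileStore layout where PG directories live under /var/lib/ceph/osd/ceph-<id>/current.

```python
import os


def oversized_dirs(osd_current_dir, limit):
    """Yield (path, file_count) for each directory holding more than `limit` files."""
    for dirpath, _dirnames, filenames in os.walk(osd_current_dir):
        if len(filenames) > limit:
            yield dirpath, len(filenames)


def report(osd_current_dir, limit):
    """Print oversized directories, largest first."""
    found = sorted(oversized_dirs(osd_current_dir, limit), key=lambda item: -item[1])
    for path, count in found:
        print(f"{count:>8} {path}")


if __name__ == "__main__":
    # Assumed example path and limit; adjust both for the OSD being checked.
    report("/var/lib/ceph/osd/ceph-0/current", 5120)
```

Walking millions of files adds load of its own, so a check like this is probably better run one OSD at a time and during quiet hours.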

On Wed, May 10, 2017 at 5:37 AM Piotr Nowosielski <piotr.nowosiel...@allegrogroup.com> wrote:


    It is difficult for me to say clearly why some PGs have not been
    migrated. Crushmap settings? Weight of the OSDs?

    One thing is certain - you will not find any information about the
    split process in the logs ...

    pn

    -----Original Message-----
    From: Anton Dmitriev [mailto:t...@enumnet.ru]
    Sent: Wednesday, May 10, 2017 10:14 AM
    To: Piotr Nowosielski <piotr.nowosiel...@allegrogroup.com>;
    ceph-users@lists.ceph.com
    Subject: Re: [ceph-users] All OSD fails after few requests to RGW

    When I created the cluster, I made a mistake in the configuration and set
    the split parameter to 32 and merge to 40, so 32*40*16 = 20480 files per
    folder. After that I changed split to 8 and increased the number of PGs
    and PGPs from 2048 to 4096 for the pool where the problem occurs. While it
    was backfilling, I observed that placement groups were backfilling from
    one set of 3 OSDs to another set of 3 OSDs (replicated size = 3), so I
    concluded that PGs are completely recreated when the PG and PGP counts are
    increased for a pool, and that after this process the number of files per
    directory must be OK. But when backfilling finished, I found many
    directories in this pool with ~20 000 files. Why did increasing the PG
    number not help? Or maybe some files will be deleted after this process
    with some delay?

    I couldn't find any information about the directory split process in the
    logs, even with osd and filestore debug 20. What pattern should I grep
    for, and in which log, to find it?

    On 10.05.2017 10:36, Piotr Nowosielski wrote:
    > You can:
    > - change these parameters and use ceph-objectstore-tool
    > - add OSD host - rebuild the cluster will reduce the number of files
    > in the directories
    > - wait until "split" operations are over ;-)
    >
    > In our case, we could afford to wait until the "split" operation is
    > over (we have 2 clusters in slightly different configurations
    > storing the same data)
    >
    > hint:
    > When creating a new pool, use the parameter "expected_num_objects"
    >
    > https://www.suse.com/documentation/ses-4/book_storage_admin/data/ceph_pools_operate.html
    >
    > Piotr Nowosielski
    > Senior Systems Engineer
    > Zespół Infrastruktury 5
    > Grupa Allegro sp. z o.o.
    > Tel: +48 512 08 55 92
    >
    >
    > -----Original Message-----
    > From: Anton Dmitriev [mailto:t...@enumnet.ru
    <mailto:t...@enumnet.ru>]
    > Sent: Wednesday, May 10, 2017 9:19 AM
    > To: Piotr Nowosielski <piotr.nowosiel...@allegrogroup.com
    <mailto:piotr.nowosiel...@allegrogroup.com>>;
    > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
    > Subject: Re: [ceph-users] All OSD fails after few requests to RGW
    >
    > How did you solved it? Set new split/merge thresholds, and manually
    > applied it by ceph-objectstore-tool --data-path
    > /var/lib/ceph/osd/ceph-${osd_num} --journal-path
    > /var/lib/ceph/osd/ceph-${osd_num}/journal
    > --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log --op
    > apply-layout-settings --pool default.rgw.buckets.data
    >
    > on each OSD?
    >
    > How I can see in logs, that split occurs?
    >
    > On 10.05.2017 10:13, Piotr Nowosielski wrote:
    >> Hey,
    >> We had similar problems. Look for information on "Filestore
    >> merge and split".
    >>
    >> Some explanation:
    >> After a directory reaches a certain number of files (it depends on the
    >> 'filestore merge threshold' and 'filestore split multiple' parameters),
    >> the OSD rebuilds the structure of that directory.
    >> As files arrive, the OSD creates new subdirectories and moves
    >> some of the files there.
    >> As files are removed, the OSD reduces the number of
    >> subdirectories.
    >>
    >>
    >> --
    >> Piotr Nowosielski
    >> Senior Systems Engineer
    >> Zespół Infrastruktury 5
    >> Grupa Allegro sp. z o.o.
    >> Tel: +48 512 08 55 92
    >>
    >> Grupa Allegro Sp. z o.o. z siedzibą w Poznaniu, 60-166 Poznań,
    przy ul.
    >> Grunwaldzka 182, wpisana do rejestru przedsiębiorców prowadzonego
    >> przez Sąd Rejonowy Poznań - Nowe Miasto i Wilda, Wydział VIII
    >> Gospodarczy Krajowego Rejestru Sądowego pod numerem KRS
    0000268796, o
    >> kapitale zakładowym w wysokości 33 976 500,00 zł, posiadająca numer
    >> identyfikacji podatkowej NIP: 5272525995.
    >>
    >>
    >>
    >> -----Original Message-----
    >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com
    <mailto:ceph-users-boun...@lists.ceph.com>] On Behalf
    >> Of Anton Dmitriev
    >> Sent: Wednesday, May 10, 2017 8:14 AM
    >> To: ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
    >> Subject: Re: [ceph-users] All OSD fails after few requests to RGW
    >>
    >> Hi!
    >>
    >> I increased pg_num and pgp_num for pool
    >> default.rgw.buckets.data from 2048 to 4096, and the situation seems to
    >> have become a bit better: the cluster dies after 20-30 PUTs, not after 1.
    >> Could someone please give me some recommendations on how to rescue the
    >> cluster?
    >>
    >> On 27.04.2017 09:59, Anton Dmitriev wrote:
    >>> Cluster was going well for a long time, but on the previous week
    >>> osds start to fail.
    >>> We use cluster like image storage for Opennebula with small
    >>> load and like object storage with high load.
    >>> Sometimes the disks of some OSDs are 100% utilized and iostat shows
    >>> avgqu-sz over 1000 while reading or writing only a few kilobytes per
    >>> second; the OSDs on these disks become unresponsive and the cluster
    >>> marks them down.
    >>> We lowered the load on object storage and the situation became better.
    >>>
    >>> Yesterday situation became worse:
    >>> If RGWs are disabled and there is no requests to object storage
    >>> cluster performing well, but if enable RGWs and make a few PUTs or
    >>> GETs all not SSD osds on all storages become in the same
    >>> situation described above.
    >>> iotop shows that xfsaild/<disk> burns the disks.
    >>>
    >>> trace-cmd record -e xfs\* for 10 seconds shows 10 million objects;
    >>> as I understand it, that means ~360 000 objects to push per OSD
    >>> in 10 seconds
    >>>      $ wc -l t.t
    >>> 10256873 t.t
    >>>
    >>> fragmentation on one of these disks is about 3%
    >>>
    >>> more information about cluster:
    >>>
    >>> https://yadi.sk/d/Y63mXQhl3HPvwt
    >>>
    >>> also debug logs for osd.33 while problem occurs
    >>>
    >>> https://yadi.sk/d/kiqsMF9L3HPvte
    >>>
    >>> debug_osd = 20/20
    >>> debug_filestore = 20/20
    >>> debug_tp = 20/20
    >>>
    >>>
    >>>
    >>> Ubuntu 14.04
    >>> $ uname -a
    >>> Linux storage01 4.2.0-42-generic #49~14.04.1-Ubuntu SMP Wed Jun 29
    >>> 20:22:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
    >>>
    >>> Ceph 10.2.7
    >>>
    >>> 7 storages: Supermicro 28 osd 4tb 7200 JBOD + journal raid10 4 ssd
    >>> intel 3510 800gb + 2 osd SSD intel 3710 400gb for rgw meta and
    >>> index. One of these storages differs only in the number of OSDs: it has
    >>> 26 OSDs on 4tb, instead of 28 on the others
    >>>
    >>> Storages connect to each other by bonded 2x10gbit. Clients connect to
    >>> storages by bonded 2x1gbit
    >>>
    >>> in 5 storages 2 x CPU E5-2650v2 and 256 gb RAM; in 2 storages 2 x CPU
    >>> E5-2690v3 and 512 gb RAM
    >>>
    >>> 7 mons
    >>> 3 rgw
    >>>
    >>> Help me please to rescue the cluster.
    >>>
    >>>
    >> --
    >> Dmitriev Anton
    >>
    >> _______________________________________________
    >> ceph-users mailing list
    >> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
    >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
    >
    > --
    > Dmitriev Anton


    --
    Dmitriev Anton



--
Dmitriev Anton

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
