Using ceph-objectstore-tool apply-layout-settings I applied the new layout on all storage nodes with:
  filestore_merge_threshold = 40
  filestore_split_multiple = 8
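
Roughly, for each OSD the procedure looked like this (a sketch only; the
service commands assume the upstart jobs used on Ubuntu 14.04 and that
$osd_num holds the OSD id):

  # stop the OSD and flush its journal before touching the store
  stop ceph-osd id=${osd_num}
  ceph-osd -i ${osd_num} --flush-journal
  # apply the new merge/split layout to the pool's PGs on this OSD
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osd_num} \
      --journal-path /var/lib/ceph/osd/ceph-${osd_num}/journal \
      --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log \
      --op apply-layout-settings --pool default.rgw.buckets.data
  # bring the OSD back
  start ceph-osd id=${osd_num}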

I checked some directories on the OSDs; there were 1200-2000 files per folder.
A split will occur at 5120 files per folder.
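
If I understand the filestore formula correctly, that figure comes from:

  filestore_split_multiple * abs(filestore_merge_threshold) * 16
  = 8 * 40 * 16
  = 5120 files per subfolder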

But the problem still exists: after I PUT 25-50 objects to RGW, one of the OSD disks becomes 100% busy, iotop shows that xfsaild is keeping it busy, and the number of slow requests increases to 800-1000. After some time the busy OSD hits the suicide timeout and restarts, and the cluster works well until the next write to RGW.

0> 2017-05-21 12:02:26.105597 7f23bd1fa700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f23bd1fa700 time 2017-05-21 12:02:26.050994
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5620cbbb56db]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2b1) [0x5620cbafba91]
 3: (ceph::HeartbeatMap::is_healthy()+0xc6) [0x5620cbafc256]
 4: (OSD::handle_osd_ping(MOSDPing*)+0x8e2) [0x5620cb557752]
 5: (OSD::heartbeat_dispatch(Message*)+0x3cb) [0x5620cb55899b]
 6: (DispatchQueue::fast_dispatch(Message*)+0x76) [0x5620cbc71906]
 7: (Pipe::reader()+0x1d38) [0x5620cbcaceb8]
 8: (Pipe::Reader::entry()+0xd) [0x5620cbcb4a0d]
 9: (()+0x8184) [0x7f244a0ea184]
 10: (clone()+0x6d) [0x7f2448215bed]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


2017-05-21 12:02:26.161763 7f23bd1fa700 -1 *** Caught signal (Aborted) **
 in thread 7f23bd1fa700 thread_name:ms_pipe_read



On 11.05.2017 20:11, David Turner wrote:

I honestly haven't investigated the command line structure that it would need, but that looks about what I'd expect.


On Thu, May 11, 2017, 7:58 AM Anton Dmitriev <t...@enumnet.ru> wrote:

    I'm on Jewel 10.2.7.
    Do you mean this:
    ceph-objectstore-tool --data-path
    /var/lib/ceph/osd/ceph-${osd_num} --journal-path
    /var/lib/ceph/osd/ceph-${osd_num}/journal
    --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log --op
    apply-layout-settings --pool default.rgw.buckets.data --debug

    ?
    And before running it I need to stop the OSD and flush its journal?


    On 11.05.2017 14:52, David Turner wrote:

    If you are on the current release of Ceph Hammer 0.94.10 or Jewel
    10.2.7, you have it already. I don't remember which release it
    came out in, but it's definitely in the current releases.


    On Thu, May 11, 2017, 12:24 AM Anton Dmitriev <t...@enumnet.ru> wrote:

        "recent enough version of the ceph-objectstore-tool" - sounds
        very interesting. Would it be released in one of next Jewel
        minor releases?


        On 10.05.2017 19:03, David Turner wrote:
        PG subfolder splitting is the primary reason people are
        going to be deploying Luminous and Bluestore much faster
        than any other major release of Ceph.  Bluestore removes the
        concept of subfolders in PGs.

        I have had clusters that reached what seemed to be a hardcoded
        maximum of 12,800 objects in a subfolder.  It would take an
        osd_heartbeat_grace of 240 or 300 to let them finish splitting
        their subfolders without being marked down.  Recently I came
        across a cluster that had a setting of 240 objects per subfolder
        before splitting, so it was splitting all the time, and several
        of the OSDs took longer than 30 seconds to finish splitting into
        subfolders.  That led to more problems as we started adding
        backfilling to everything and we lost a significant amount of
        throughput on the cluster.
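
        (For illustration only, not a command from this thread: such a
        grace value can be injected at runtime on the OSDs; it is not
        persistent across restarts.)

        ceph tell osd.* injectargs '--osd-heartbeat-grace 240'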

        I have yet to manage a cluster with a recent enough version
        of the ceph-objectstore-tool (hopefully I'll have one this
        month) that includes the ability to take an OSD offline,
        split the subfolders, then bring it back online.  If you set
        up a way to monitor how big your subfolders are getting, you
        can leave the ceph settings as high as you want, and then go
        in and perform maintenance on your cluster 1 failure domain
        at a time, splitting all of the PG subfolders on the OSDs.
        That would prevent this from ever happening in the wild.
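
        (As a rough illustration of such monitoring; the OSD path, the
        DIR_* naming and the 5000-file threshold here are assumptions,
        not values given in this thread:)

        # count files directly inside each filestore subfolder of one OSD
        find /var/lib/ceph/osd/ceph-0/current -type d -name 'DIR_*' |
        while read -r d; do
            n=$(find "$d" -maxdepth 1 -type f | wc -l)
            # report subfolders approaching the split point
            [ "$n" -gt 5000 ] && echo "$n $d"
        done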

        On Wed, May 10, 2017 at 5:37 AM Piotr Nowosielski
        <piotr.nowosiel...@allegrogroup.com> wrote:

            It is difficult for me to clearly state why some PGs
            have not been migrated.
            crushmap settings? Weight of OSD?

            One thing is certain - you will not find any information
            about the split
            process in the logs ...

            pn

            -----Original Message-----
            From: Anton Dmitriev [mailto:t...@enumnet.ru]
            Sent: Wednesday, May 10, 2017 10:14 AM
            To: Piotr Nowosielski <piotr.nowosiel...@allegrogroup.com>;
            ceph-users@lists.ceph.com
            Subject: Re: [ceph-users] All OSD fails after few
            requests to RGW

            When I created the cluster I made a mistake in the
            configuration and set the split parameter to 32 and merge
            to 40, so 32*40*16 = 20480 files per folder. After that I
            changed split to 8 and increased pg_num and pgp_num from
            2048 to 4096 for the pool where the problem occurs. While
            it was backfilling I observed that placement groups were
            backfilling from one set of 3 OSDs to another set of 3 OSDs
            (replicated size = 3), so I concluded that PGs are
            completely recreated while increasing pg_num and pgp_num
            for a pool, and that after this process the number of files
            per directory should be OK. But when backfilling finished I
            found many directories in this pool with ~20 000 files.
            Why did increasing the PG num not help? Or maybe after this
            process some files will be deleted with some delay?

            I couldn't find any information about the directory split
            process in the logs, even with osd and filestore debug at 20.
            What pattern, and in which log, do I need to grep for to
            find it?

            On 10.05.2017 10:36, Piotr Nowosielski wrote:
            > You can:
            > - change these parameters and use ceph-objectstore-tool
            > - add an OSD host - rebuilding the cluster will reduce the
            >   number of files in the directories
            > - wait until "split" operations are over ;-)
            >
            > In our case, we could afford to wait until the "split"
            operation is
            > over (we have 2 clusters in slightly different
            configurations storing
            > the same data)
            >
            > hint:
            > When creating a new pool, use the parameter
            "expected_num_objects"
            >
            
            > https://www.suse.com/documentation/ses-4/book_storage_admin/data/ceph_pools_operate.html
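            >
            > (For illustration, a made-up example of that hint using the
            > Jewel-era syntax, where the last argument is
            > expected_num_objects; the pool name, PG count and object
            > count are placeholders:)
            >
            >   ceph osd pool create mypool 4096 4096 replicated replicated_ruleset 100000000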
            >
            > Piotr Nowosielski
            > Senior Systems Engineer
            > Zespół Infrastruktury 5
            > Grupa Allegro sp. z o.o.
            > Tel: +48 512 08 55 92
            >
            >
            > -----Original Message-----
            > From: Anton Dmitriev [mailto:t...@enumnet.ru]
            > Sent: Wednesday, May 10, 2017 9:19 AM
            > To: Piotr Nowosielski <piotr.nowosiel...@allegrogroup.com>;
            > ceph-users@lists.ceph.com
            > Subject: Re: [ceph-users] All OSD fails after few
            requests to RGW
            >
            > How did you solve it? Did you set new split/merge
            > thresholds and manually apply them with
            > ceph-objectstore-tool --data-path
            > /var/lib/ceph/osd/ceph-${osd_num} --journal-path
            > /var/lib/ceph/osd/ceph-${osd_num}/journal
            >
            --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log
            --op
            > apply-layout-settings --pool default.rgw.buckets.data
            >
            > on each OSD?
            >
            > How can I see in the logs that a split occurred?
            >
            > On 10.05.2017 10:13, Piotr Nowosielski wrote:
            >> Hey,
            >> We had similar problems. Look for information on
            "Filestore merge and
            >> split".
            >>
            >> Some explanation:
            >> After reaching a certain number of files in a directory
            >> (which depends on the 'filestore merge threshold' and
            >> 'filestore split multiple' parameters), the OSD rebuilds
            >> the structure of that directory.
            >> If files keep arriving, the OSD creates new subdirectories
            >> and moves some of the files there.
            >> If files are removed, the OSD reduces the number of
            >> subdirectories.
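            >>
            >> (For illustration, one way to check what an OSD is
            >> actually running with, via its admin socket on the local
            >> host; osd.0 is just an example id:)
            >>
            >>   ceph daemon osd.0 config get filestore_merge_threshold
            >>   ceph daemon osd.0 config get filestore_split_multiple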
            >>
            >>
            >> --
            >> Piotr Nowosielski
            >> Senior Systems Engineer
            >> Zespół Infrastruktury 5
            >> Grupa Allegro sp. z o.o.
            >> Tel: +48 512 08 55 92
            >>
            >> Grupa Allegro Sp. z o.o. with its registered office in
            >> Poznań, 60-166 Poznań, ul. Grunwaldzka 182, entered in the
            >> register of entrepreneurs kept by the District Court
            >> Poznań - Nowe Miasto i Wilda, 8th Commercial Division of
            >> the National Court Register, under KRS number 0000268796,
            >> with share capital of PLN 33,976,500.00, tax
            >> identification number NIP: 5272525995.
            >>
            >>
            >>
            >> -----Original Message-----
            >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
            >> On Behalf Of Anton Dmitriev
            >> Sent: Wednesday, May 10, 2017 8:14 AM
            >> To: ceph-users@lists.ceph.com
            >> Subject: Re: [ceph-users] All OSD fails after few
            requests to RGW
            >>
            >> Hi!
            >>
            >> I increased pg_num and pgp_num for the pool
            >> default.rgw.buckets.data from 2048 to 4096, and it seems
            >> the situation became a bit better: the cluster dies after
            >> 20-30 PUTs, not after 1. Could someone please give me some
            >> recommendations on how to rescue the cluster?
            >>
            >> On 27.04.2017 09:59, Anton Dmitriev wrote:
            >>> The cluster was running well for a long time, but last
            >>> week OSDs started to fail.
            >>> We use the cluster as image storage for OpenNebula with a
            >>> small load, and as object storage with a high load.
            >>> Sometimes the disks of some OSDs are utilized at 100%;
            >>> iostat shows avgqu-sz over 1000 while reading or writing
            >>> only a few kilobytes per second, the OSDs on these disks
            >>> become unresponsive, and the cluster marks them down.
            >>> We lowered the load on the object storage and the
            >>> situation became better.
            >>>
            >>> Yesterday the situation became worse:
            >>> If the RGWs are disabled and there are no requests to the
            >>> object storage, the cluster performs well, but if we
            >>> enable the RGWs and make a few PUTs or GETs, all non-SSD
            >>> OSDs on all storage nodes end up in the same situation
            >>> described above.
            >>> iotop shows that xfsaild/<disk> burns the disks.
            >>>
            >>> trace-cmd record -e xfs\* for 10 seconds shows 10 million
            >>> objects; as I understand it, that means ~360 000 objects
            >>> to push per OSD in 10 seconds:
            >>>      $ wc -l t.t
            >>> 10256873 t.t
            >>>
            >>> fragmentation on one of these disks is about 3%
            >>>
            >>> more information about cluster:
            >>>
            >>> https://yadi.sk/d/Y63mXQhl3HPvwt
            >>>
            >>> also debug logs for osd.33 while the problem occurs
            >>>
            >>> https://yadi.sk/d/kiqsMF9L3HPvte
            >>>
            >>> debug_osd = 20/20
            >>> debug_filestore = 20/20
            >>> debug_tp = 20/20
            >>>
            >>>
            >>>
            >>> Ubuntu 14.04
            >>> $ uname -a
            >>> Linux storage01 4.2.0-42-generic #49~14.04.1-Ubuntu
            SMP Wed Jun 29
            >>> 20:22:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
            >>>
            >>> Ceph 10.2.7
            >>>
            >>> 7 storage nodes: Supermicro, 28 OSDs on 4 TB 7200 rpm
            >>> disks (JBOD) + journals on a RAID10 of 4 Intel 3510
            >>> 800 GB SSDs + 2 SSD OSDs (Intel 3710 400 GB) for RGW meta
            >>> and index.
            >>> One of these nodes differs only in the number of OSDs:
            >>> it has 26 OSDs on 4 TB instead of 28 like the others.
            >>>
            >>> Storage nodes connect to each other via bonded 2x10 Gbit;
            >>> clients connect to the storage nodes via bonded 2x1 Gbit.
            >>>
            >>> 5 nodes have 2 x CPU E5-2650v2 and 256 GB RAM; 2 nodes
            >>> have 2 x CPU E5-2690v3 and 512 GB RAM.
            >>>
            >>> 7 mons
            >>> 3 rgw
            >>>
            >>> Please help me rescue the cluster.
            >>>
            >>>
            >> --
            >> Dmitriev Anton
            >>
            >> _______________________________________________
            >> ceph-users mailing list
            >> ceph-users@lists.ceph.com
            >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
            >
            > --
            > Dmitriev Anton


            --
            Dmitriev Anton
            _______________________________________________
            ceph-users mailing list
            ceph-users@lists.ceph.com
            http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Dmitriev Anton

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
