[ceph-users] After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

Andras Pataki Tue, 16 Jan 2018 12:51:24 -0800

Dear Cephers,

We've upgraded the back end of our cluster from Jewel (10.2.10) to Luminous (12.2.2). The upgrade went smoothly for the most part, except we seem to be hitting an issue with cephfs. After about a day or two of use, the MDS starts complaining about clients failing to respond to cache pressure:

   [root@cephmon00 ~]# ceph -s
      cluster:
        id:     d7b33135-0940-4e48-8aa6-1d2026597c2f
        health: HEALTH_WARN
                1 MDSs have many clients failing to respond to cache pressure
                noout flag(s) set
                1 osds down

      services:
        mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
        mgr: cephmon00(active), standbys: cephmon01, cephmon02
        mds: cephfs-1/1/1 up  {0=cephmon00=up:active}, 2 up:standby
        osd: 2208 osds: 2207 up, 2208 in
             flags noout

      data:
        pools:   6 pools, 42496 pgs
        objects: 919M objects, 3062 TB
        usage:   9203 TB used, 4618 TB / 13822 TB avail
        pgs:     42470 active+clean
                 22    active+clean+scrubbing+deep
                 4     active+clean+scrubbing

      io:
        client:   56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr

   [root@cephmon00 ~]# ceph health detail
   HEALTH_WARN 1 MDSs have many clients failing to respond to cache pressure; noout flag(s) set; 1 osds down
   MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache pressure
       mdscephmon00(mds.0): Many clients (103) failing to respond to cache pressureclient_count: 103
   OSDMAP_FLAGS noout flag(s) set
   OSD_DOWN 1 osds down
        osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down

We are using exclusively the 12.2.2 fuse client on about 350 nodes or so (out of which it seems 100 are not responding to cache pressure in this log). When this happens, clients also appear pretty sluggish (listing directories, etc.). After bouncing the MDS, everything returns to normal after the failover for a while. Ignore the message about 1 OSD down; that corresponds to a failed drive, and all data has been re-replicated since.

We were also using the 12.2.2 fuse client with the Jewel back end before the upgrade, and did not see this issue then.

We are running with a larger MDS cache than usual: we have mds_cache_size set to 4 million. All other MDS configs are the defaults.

Is this a known issue? If not, any hints on how to further diagnose the problem?
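
One possible starting point for narrowing this down is to see which client sessions hold the most capabilities. The sketch below is illustrative only: it assumes the session list has been dumped as JSON from the MDS admin socket (for example with "ceph daemon mds.cephmon00 session ls"), and that each entry carries "id", "num_caps" and "client_metadata.hostname" fields. The helper name top_cap_holders and the sample data are hypothetical, not taken from our cluster.

```python
import json


def top_cap_holders(sessions, n=10):
    """Return (hostname, client id, num_caps) for the n sessions holding the most caps.

    `sessions` is assumed to be the parsed JSON list from the MDS session dump;
    the field names used here are assumptions and missing fields are tolerated.
    """
    rows = []
    for session in sessions:
        meta = session.get("client_metadata", {})
        rows.append(
            (meta.get("hostname", "unknown"), session.get("id"), session.get("num_caps", 0))
        )
    # Largest cap holders first; these are the clients most likely to matter
    # when the MDS is asking for caps to be released.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:n]


# Hypothetical sample in the assumed shape; real input would come from the
# saved session dump, e.g. json.load(open("sessions.json")).
sample = json.loads(
    '[{"id": 101, "num_caps": 1200, "client_metadata": {"hostname": "node-a"}},'
    ' {"id": 102, "num_caps": 900000, "client_metadata": {"hostname": "node-b"}},'
    ' {"id": 103, "num_caps": 1500}]'
)

for hostname, client_id, caps in top_cap_holders(sample, n=2):
    print(hostname, client_id, caps)
```

Comparing the largest holders against the list of nodes flagged in the health warning may show whether a few clients or all of them are pinning the cache.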

Andras

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
