Hi Jason,

I tried to follow the instructions. Setting the debug level to 15 via the
admin socket worked OK, but the daemon appeared to silently ignore the restart
command (nothing indicating a restart showed up in the log). So I set the log
level to 15 in the config file instead and restarted the rbd-mirror daemon.

The output surprised me though; my previous perception of the issue might be
completely wrong. The log is full of "image_replayer::BootstrapRequest: ....
failed to create local image: (2) No such file or directory" and
":ImageReplayer: .... replay encountered an error: (42) No message of desired
type".

https://pastebin.com/1bTETNGs
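
For reference, the config change was basically just the debug setting, roughly
like this in ceph.conf on the rbd-mirror node (a sketch; the section we
actually use may be more specific than [client]):

    [client]
    debug_rbd_mirror = 15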

Best regards
/Magnus

On Tue, 9 Apr 2019 at 18:35, Jason Dillaman <jdill...@redhat.com> wrote:

> Can you pastebin the results from running the following on your backup
> site rbd-mirror daemon node?
>
> ceph --admin-socket /path/to/asok config set debug_rbd_mirror 15
> ceph --admin-socket /path/to/asok rbd mirror restart nova
> .... wait a minute to let some logs accumulate ...
> ceph --admin-socket /path/to/asok config set debug_rbd_mirror 0/5
>
> ... and collect the rbd-mirror log from /var/log/ceph/ (should have
> lots of "rbd::mirror"-like log entries.
>
>
> On Tue, Apr 9, 2019 at 12:23 PM Magnus Grönlund <mag...@gronlund.se>
> wrote:
> >
> >
> >
> > On Tue, 9 Apr 2019 at 17:48, Jason Dillaman <jdill...@redhat.com> wrote:
> >>
> >> Any chance your rbd-mirror daemon has the admin sockets available
> >> (defaults to /var/run/ceph/cephdr-client.<id>.<pid>.<random>.asok)? If
> >> so, you can run "ceph --admin-daemon /path/to/asok rbd mirror status".
> >
> >
> > {
> >     "pool_replayers": [
> >         {
> >             "pool": "glance",
> >             "peer": "uuid: df30fb21-d1de-4c3a-9c00-10eaa4b30e00 cluster:
> production client: client.productionbackup",
> >             "instance_id": "869081",
> >             "leader_instance_id": "869081",
> >             "leader": true,
> >             "instances": [],
> >             "local_cluster_admin_socket":
> "/var/run/ceph/client.backup.1936211.backup.94225674131712.asok",
> >             "remote_cluster_admin_socket":
> "/var/run/ceph/client.productionbackup.1936211.production.94225675210000.asok",
> >             "sync_throttler": {
> >                 "max_parallel_syncs": 5,
> >                 "running_syncs": 0,
> >                 "waiting_syncs": 0
> >             },
> >             "image_replayers": [
> >                 {
> >                     "name":
> "glance/ea5e4ad2-090a-4665-b142-5c7a095963e0",
> >                     "state": "Replaying"
> >                 },
> >                 {
> >                     "name":
> "glance/d7095183-45ef-40b5-80ef-f7c9d3bb1e62",
> >                     "state": "Replaying"
> >                 },
> > -------------------cut----------
> >                 {
> >                     "name":
> "cinder/volume-bcb41f46-3716-4ee2-aa19-6fbc241fbf05",
> >                     "state": "Replaying"
> >                 }
> >             ]
> >         },
> >          {
> >             "pool": "nova",
> >             "peer": "uuid: 1fc7fefc-9bcb-4f36-a259-66c3d8086702 cluster:
> production client: client.productionbackup",
> >             "instance_id": "889074",
> >             "leader_instance_id": "889074",
> >             "leader": true,
> >             "instances": [],
> >             "local_cluster_admin_socket":
> "/var/run/ceph/client.backup.1936211.backup.94225678548048.asok",
> >             "remote_cluster_admin_socket":
> "/var/run/ceph/client.productionbackup.1936211.production.94225679621728.asok",
> >             "sync_throttler": {
> >                 "max_parallel_syncs": 5,
> >                 "running_syncs": 0,
> >                 "waiting_syncs": 0
> >             },
> >             "image_replayers": []
> >         }
> >     ],
> >     "image_deleter": {
> >         "image_deleter_status": {
> >             "delete_images_queue": [
> >                 {
> >                     "local_pool_id": 3,
> >                     "global_image_id":
> "ff531159-de6f-4324-a022-50c079dedd45"
> >                 }
> >             ],
> >             "failed_deletes_queue": []
> >         }
> >     }
> > }
> >>
> >>
> >> On Tue, Apr 9, 2019 at 11:26 AM Magnus Grönlund <mag...@gronlund.se>
> wrote:
> >> >
> >> >
> >> >
> >> > On Tue, 9 Apr 2019 at 17:14, Jason Dillaman <jdill...@redhat.com> wrote:
> >> >>
> >> >> On Tue, Apr 9, 2019 at 11:08 AM Magnus Grönlund <mag...@gronlund.se>
> wrote:
> >> >> >
> >> >> > >On Tue, Apr 9, 2019 at 10:40 AM Magnus Grönlund <
> mag...@gronlund.se> wrote:
> >> >> > >>
> >> >> > >> Hi,
> >> >> > >> We have configured one-way replication of pools between a
> >> >> > >> production cluster and a backup cluster. Unfortunately, the
> >> >> > >> rbd-mirror daemon or the backup cluster is unable to keep up
> >> >> > >> with the production cluster, so the replication fails to
> >> >> > >> reach the replaying state.
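> >> >> > >>
> >> >> > >> For context, the mirroring was set up the usual journal-based
> >> >> > >> way, roughly like this on the backup cluster (a sketch, showing
> >> >> > >> only the nova pool; the peer name is taken from our setup):
> >> >> > >>
> >> >> > >>   rbd mirror pool enable nova pool
> >> >> > >>   rbd mirror pool peer add nova client.productionbackup@production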
> >> >> > >
> >> >> > >Hmm, it's odd that they don't at least reach the replaying state.
> Are
> >> >> > >they still performing the initial sync?
> >> >> >
> >> >> > There are three pools we try to mirror (glance, cinder, and
> >> >> > nova; no points for guessing what the cluster is used for :) ).
> >> >> > The glance and cinder pools are smaller and see limited write
> >> >> > activity, and their mirroring works. The nova pool, which is the
> >> >> > largest and has 90% of the write activity, never leaves the
> >> >> > "unknown" state.
> >> >> >
> >> >> > # rbd mirror pool status cinder
> >> >> > health: OK
> >> >> > images: 892 total
> >> >> >     890 replaying
> >> >> >     2 stopped
> >> >> > #
> >> >> > # rbd mirror pool status nova
> >> >> > health: WARNING
> >> >> > images: 2479 total
> >> >> >     2479 unknown
> >> >> > #
> >> >> > The production cluster has 5k writes/s on average and the backup
> >> >> > cluster has 1-2k writes/s on average. The production cluster is
> >> >> > bigger and has better specs. I thought that the backup cluster
> >> >> > would be able to keep up, but it looks like I was wrong.
> >> >>
> >> >> The fact that they are in the unknown state just means that the
> remote
> >> >> "rbd-mirror" daemon hasn't started any journal replayers against the
> >> >> images. If it couldn't keep up, it would still report a status of
> >> >> "up+replaying". What Ceph release are you running on your backup
> >> >> cluster?
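> >> >>
> >> >> If it helps, a per-image view should be available with something
> >> >> like the following on the backup cluster:
> >> >>
> >> >>   rbd mirror pool status nova --verbose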
> >> >>
> >> > The backup cluster is running Luminous 12.2.11 (the production
> cluster 12.2.10)
> >> >
> >> >>
> >> >> > >> And the journals on the rbd volumes keep growing...
> >> >> > >>
> >> >> > >> Is it enough to simply disable mirroring of the pool
> >> >> > >> (rbd mirror pool disable <pool>), so that the lagging reader
> >> >> > >> is removed from the journals and they can shrink, or does
> >> >> > >> anything else have to be done?
> >> >> > >
> >> >> > >You can either disable the journaling feature on the image(s)
> since
> >> >> > >there is no point to leave it on if you aren't using mirroring,
> or run
> >> >> > >"rbd mirror pool disable <pool>" to purge the journals.
> >> >> >
> >> >> > Thanks for the confirmation.
> >> >> > I will stop the mirror of the nova pool and try to figure out if
> there is anything we can do to get the backup cluster to keep up.
> >> >> >
> >> >> > >> Best regards
> >> >> > >> /Magnus
> >> >> > >
> >> >> > >--
> >> >> > >Jason
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Jason
> >>
> >>
> >>
> >> --
> >> Jason
>
>
>
> --
> Jason
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
