[ovirt-users] Re: Fix corrupt self-hosted engine

Alex K Tue, 24 Nov 2020 02:40:08 -0800

On Mon, Nov 23, 2020 at 10:09 AM Yedidyah Bar David <d...@redhat.com> wrote:


> On Mon, Nov 23, 2020 at 9:54 AM Alex K <rightkickt...@gmail.com> wrote:
> >
> >
> >
> > On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David <d...@redhat.com>
> wrote:
> >>
> >> On Thu, Nov 19, 2020 at 9:43 PM Alex K <rightkickt...@gmail.com> wrote:
> >>>
> >>>
> >>>
> >>> On Thu, Nov 19, 2020 at 5:31 PM Alex K <rightkickt...@gmail.com>
> wrote:
> >>>>
> >>>> Hi Didi,
> >>>>
> >>>> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David <d...@redhat.com>
> wrote:
> >>>>>
> >>>>> On Thu, Nov 19, 2020 at 4:37 PM Alex K <rightkickt...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I have a corrupt self-hosted engine (with several file system
> errors, postgres not able to start) and thus it does not give access to the
> web UI. This happened following an unlucky split brain resolution (I am
> running 2 nodes). The two hosts are running VMs also which I would like to
> keep running as they are needed.
> >>>>>>
> >>>>>> When trying to boot into rescue mode (using
> systemd.unit=emergency.target boot parameter) I get a cursor and nothing
> else.
> >>>>>
> >>>>>
> >>>>> This means that more than just the DB is corrupt...
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> I have backups of engine files with scope all (using the
> engine-backup tool).
> >>>>>> What is the best approach to try and fix the engine or redeploy?
> >>>>>
> >>>>>
> >>>>> If you are careful, and know what you are doing, you can try
> something like the following. I am not giving many details, hopefully you
> can find on the net tutorials about how to use the things I suggest:
> >>>>>
> >>>>> 1. Move to global maintenance
> >>>>>
> >>>>> 2. Stop the current dead vm (if needed)
> >>>>>
> >>>>> 3. Find current vm conf, edit it to boot from a rescue iso image of
> your preference or from net/PXE etc., and start the vm with '--vm-conf'
> pointing to your edited file.
> >>>>>
> >>>>> 4. Connect a console (hosted-engine --console, or 'virsh console',
> or use '--add-console-password' and remote viewer, if needed)
> >>>>>
> >>>>> 5. Clean the disk and install the OS, oVirt, etc.
> >>>>>
> >>>>> 6. Copy your backup into the vm and restore with engine-backup
> >>>>>
> >>>>> 7. Then cleanly stop the machine, exit global maint, and let HA
> start it (or start it yourself with --vm-start).
> >>>>>
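A rough shell sketch of steps 1 to 4 and 7 above, assuming the standard hosted-engine CLI options (--set-maintenance, --vm-shutdown, --vm-start with --vm-conf, --console). The run helper, the function names and the example path are made up for illustration and do not come from this thread. By default it only prints the commands; editing vm.conf, reinstalling the OS and restoring the backup (steps 3, 5 and 6) remain manual.

```shell
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Print a command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Steps 1-4: enter global maintenance and boot the engine VM from an edited vm.conf.
# $1 = path to your edited copy of vm.conf (hypothetical path, created by you).
boot_engine_vm_for_rescue() {
    local edited_conf="$1"
    run hosted-engine --set-maintenance --mode=global       # step 1
    run hosted-engine --vm-shutdown                         # step 2, only if the VM still runs
    run hosted-engine --vm-start --vm-conf="$edited_conf"   # step 3
    run hosted-engine --console                             # step 4
}

# Step 7: stop the rebuilt VM cleanly and leave global maintenance,
# so that the HA agents start it again.
finish_rescue() {
    run hosted-engine --vm-shutdown
    run hosted-engine --set-maintenance --mode=none
}
```

For example, boot_engine_vm_for_rescue /root/vm-rescue.conf prints the four commands in order; with DRY_RUN=0 it runs them on the host.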
> >>>>> At the time, we had a bug [1] to document this. The result is [2].
> It does not detail how to boot/reinstall os/etc., only restore (if e.g. db
> is dead but fs is ok).
> >>>>> For something somewhat similar to what you want, see also [3], which
> uses guestfish. Might be useful, depending on how badly your disk is
> corrupted.
> >>>>
> >>>> I went with the guestfish approach. It fixed some fs issues, and
> now yum etc. seem fine, apart from postgres.
> >>>> I had previously tried to uninstall/install packages, so I ended up
> installing them again with yum install ovirt\*setup\*.
> >>>> Now I think I have to run engine-setup but I get the error:
> >>>>
> >>>>  Failed to execute stage 'Environment setup': Cannot connect to
> Engine database using existing credentials: engine@localhost:5432
> >>>
> >>> Seems that I need to have psql running to be able to run engine-backup
> --mode=restore. Are there any steps how one could manually prepare pgsql
> for ovirt so as to attempt restoration?
> >>
> >>
> >> Replying again, also to conclude this part of your episode: Generally
> speaking, that's not needed. restore --provision-all-databases should do
> that for you.
> >
> > Seems that when pgsql is down nothing can be done. You need at least
> pgsql up and running (a clean state will do) so as to be able to proceed
> with restoration.
>
> Do you still have logs from this? Both engine-backup's (default to
> /var/log/ovirt-engine-backup/something if you do not pass --log) and
> ovirt-engine-provisiondb which it runs (at
> /var/log/ovirt-engine/setup).
>
I was using --provision-all-databases flag when trying to restore. I might
retest to double check. When the pgsql was down, I was getting:

2020-11-19 22:06:35 4947: Start of engine-backup mode restore scope all
file /var/backup/daily.0/engine-backup.gz
2020-11-19 22:06:35 4947: OUTPUT: Start of engine-backup with mode 'restore'
2020-11-19 22:06:35 4947: OUTPUT: scope: all
2020-11-19 22:06:35 4947: OUTPUT: archive file:
/var/backup/daily.0/engine-backup.gz
2020-11-19 22:06:35 4947: OUTPUT: log file: restore.log
2020-11-19 22:06:35 4947: Setting scl env for rh-postgresql10
2020-11-19 22:06:35 4947: OUTPUT: Preparing to restore:
2020-11-19 22:06:35 4947: OUTPUT: - Unpacking file
'/var/backup/daily.0/engine-backup.gz'
2020-11-19 22:06:35 4947: Opening tarball
/var/backup/daily.0/engine-backup.gz to /tmp/engine-backup.63eeNqt4NH
2020-11-19 22:06:35 4947: Verifying hash
2020-11-19 22:06:35 4947: Verifying version
2020-11-19 22:06:35 4947: Reading config
2020-11-19 22:06:35 4947: OUTPUT: Restoring:
2020-11-19 22:06:35 4947: OUTPUT: - Files
2020-11-19 22:06:35 4947: Restoring files
2020-11-19 22:06:36 4947: Reloading configuration
2020-11-19 22:06:36 4947: Generating pgpass
2020-11-19 22:06:36 4947: Verifying connection
2020-11-19 22:06:36 4947: pg_cmd running: psql -w -U engine -h localhost -p
5432  engine -c select 1
psql: FATAL:  Ident authentication failed for user "engine"
2020-11-19 22:06:36 4947: FATAL: Can't connect to database 'engine'. Please
see '/usr/bin/engine-backup --help'.
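
To pull the relevant lines out of a longer restore log, a small helper like the following may be handy. It is only a sketch: the function name is hypothetical, and it assumes the log format shown above (a "pg_cmd running" line followed by FATAL lines). For what it is worth, psql normally prints "Ident authentication failed" when a running server rejects the login under its pg_hba.conf rules, whereas a stopped server usually produces a connection error instead.

```shell
# Print the database command that engine-backup ran and any fatal errors.
# $1 = path to an engine-backup log file (e.g. the one given with --log).
show_restore_failures() {
    grep -E 'pg_cmd running|FATAL' "$1"
}
```

Usage: show_restore_failures restore.log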


> Not sure what you mean in "a clean state will do". If you just install
> PG, it is not enabled by default, so is not "up and running".
>
I mean pgsql re-installed and its data store cleaned, as below:

rm -rf /var/opt/rh/rh-postgresql10/lib/pgsql/data/*
/opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup --initdb
systemctl restart rh-postgresql10-postgresql.service
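
The three commands above could be wrapped with a couple of guards, roughly as below. This is a sketch rather than a recommended procedure: the paths and unit name are the ones quoted above for the rh-postgresql10 software collection, while run, reinit_scl_postgres, the extra "systemctl stop" and the empty-variable check are additions for illustration. It prints the commands unless DRY_RUN=0 is set, since wiping the data directory destroys whatever database is left in it.

```shell
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
PGDATA_DIR="/var/opt/rh/rh-postgresql10/lib/pgsql/data"
PG_UNIT="rh-postgresql10-postgresql.service"

# Print a command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Wipe the PostgreSQL data directory, re-initialise it and start the service.
reinit_scl_postgres() {
    # Guard (an addition): never run the delete with an empty directory variable.
    if [ -z "$PGDATA_DIR" ]; then
        echo "PGDATA_DIR is empty, refusing to continue" >&2
        return 1
    fi
    run systemctl stop "$PG_UNIT"               # addition: stop before wiping
    # Like 'rm -rf data/*', but also removes dotfiles and avoids shell globbing.
    run find "$PGDATA_DIR" -mindepth 1 -delete
    run /opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup --initdb
    run systemctl restart "$PG_UNIT"
}
```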

>
> Generally speaking:
>
> If you never started/inited PG (e.g. on a clean machine), restore,
> with --provision-all-databases, does this for you. Are you sure you
> passed this?
>
I am pretty sure I used that flag but might be able to repeat for testing.

>
> If you did, and created DB/user with the same name it wants to restore
> to, but left the DB empty, it will use it.
>
> If you populated the DB, it will fail with a suitable error message.
>
Confirmed. When I created the DB and users it was failing. So I cleaned
everything, started pgsql and left the tool to do its job.
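
With PostgreSQL started on an empty data store, the restore that "left the tool to do its job" would look roughly like this. A sketch only: --mode=restore and --provision-all-databases are the options discussed in this thread, --file and --log are the usual engine-backup options for the archive and log paths, and the function name and the DRY_RUN switch are hypothetical.

```shell
# Run the restore, or just print the command line when DRY_RUN is 1 (the default).
# $1 = backup archive, $2 = log file to write.
restore_engine_backup() {
    if [ "$#" -ne 2 ]; then
        echo "usage: restore_engine_backup ARCHIVE LOGFILE" >&2
        return 2
    fi
    local archive="$1" log="$2"
    # Rebuild the positional parameters as the command to run.
    set -- engine-backup --mode=restore --file="$archive" --log="$log" \
        --provision-all-databases
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}
```

For example, restore_engine_backup /var/backup/daily.0/engine-backup.gz restore.log prints the command line; with DRY_RUN=0 it executes it.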

>
> These are the states that are intended to be supported.
>
> Anything else might break it in other ways.
>
> >>
> >>
> >> I replied to all your interim emails in private, since you replied in
> private.
> >
> > Did not notice I was replying in private :)
>
> NP :-)
>
> >>
> >>
> >> Thanks for the final message to the list.
> >>
> >> It would be nice if you send another summary of the main obstacles you
> ran into, what worked and didn't work, and especially what ideas you can
> think of to improve the code/doc for the next time something similar
> happens (also to you :-) ).
> >>
> >> If you feel like that, and have time, it sounds like a nice opportunity
> for a blog post :-) (I know I (almost?) never wrote any myself, sorry, but
> I like reading them - and they are much more approachable and useful, over
> the long run, compared to just posting to the list).
> >
> > Noted. Will check to put this in a blog.  Generally the missing part
> from the docs was that one cannot proceed with the restoration if pgsql is
> not able to start. So I had to clean re-install pgsql and initialize its
> data store before proceeding with the restoration.
>
> Well, I'd definitely not want a blog post saying you must manually
> init PG - if you indeed must, that's a bug, so I'd rather fix it
> first.
>
Noted.

>
> Thanks and best regards,
>
> >>
> >>
> >> Best regards,
> >>
> >>>>
> >>>>
> >>>> So I guess I need to follow [2]. What do you think?
> >>>>
> >>>>>
> >>>>> How did you run into a split brain? There is a lock on the shared
> storage that should prevent this.
> >>>>>
> >>>>> Good luck and best regards,
> >>>>>
> >>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710
> >>>>> [2]
> https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_Self-Hosted_Engine
> >>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4
> >>>>> --
> >>>>> Didi
> >>
> >>
> >>
> >> --
> >> Didi
> >
> > _______________________________________________
> > Users mailing list -- users@ovirt.org
> > To unsubscribe send an email to users-le...@ovirt.org
> > Privacy Statement: https://www.ovirt.org/privacy-policy.html
> > oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> > List Archives:
> https://lists.ovirt.org/archives/list/users@ovirt.org/message/6QZ4OKZTHPE7LLOHNKGJC2HMMBK662GN/
>
>
>
> --
> Didi
>
>

_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/JNVSANQLRINOLRARMRT4QBIS4XRDN4RR/
