On Mon, Nov 23, 2020 at 10:09 AM Yedidyah Bar David <d...@redhat.com> wrote:
> On Mon, Nov 23, 2020 at 9:54 AM Alex K <rightkickt...@gmail.com> wrote:
> >
> > On Sun, Nov 22, 2020 at 8:57 AM Yedidyah Bar David <d...@redhat.com> wrote:
> >>
> >> On Thu, Nov 19, 2020 at 9:43 PM Alex K <rightkickt...@gmail.com> wrote:
> >>>
> >>> On Thu, Nov 19, 2020 at 5:31 PM Alex K <rightkickt...@gmail.com> wrote:
> >>>>
> >>>> Hi Didi,
> >>>>
> >>>> On Thu, Nov 19, 2020 at 5:13 PM Yedidyah Bar David <d...@redhat.com> wrote:
> >>>>>
> >>>>> On Thu, Nov 19, 2020 at 4:37 PM Alex K <rightkickt...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I have a corrupt self-hosted engine (with several file system errors, postgres not able to start) and thus it does not give access to the web UI. This happened following an unlucky split brain resolution (I am running 2 nodes). The two hosts are running VMs also which I would like to keep running as they are needed.
> >>>>>>
> >>>>>> When trying to boot into rescue mode (using systemd.unit=emergency.target boot parameter) I get a cursor and nothing else.
> >>>>>
> >>>>> This means that more than just the DB is corrupt...
> >>>>>
> >>>>>> I have backups of engine files with scope all (using the engine-backup tool).
> >>>>>> What is the best approach to try and fix the engine or redeploy.
> >>>>>
> >>>>> If you are careful, and know what you are doing, you can try something like the following. I am not giving many details, hopefully you can find on the net tutorials about how to use the things I suggest:
> >>>>>
> >>>>> 1. Move to global maintenance
> >>>>>
> >>>>> 2. Stop the current dead vm (if needed)
> >>>>>
> >>>>> 3. Find current vm conf, edit it to boot from a rescue iso image of your preference or from net/PXE etc., and start the vm with '--vm-conf' pointing to your edited file.
> >>>>>
> >>>>> 4. Connect a console (hosted-engine --console, or 'virsh console', or use '--add-console-password' and remote viewer, if needed)
> >>>>>
> >>>>> 5. Clean the disk and install the OS, oVirt, etc.
> >>>>>
> >>>>> 6. Copy your backup into the vm and restore with engine-backup
> >>>>>
> >>>>> 7. Then cleanly stop the machine, exit global maint, and let HA start it (or start it yourself with --vm-start).
> >>>>>
> >>>>> At the time, we had a bug [1] to document this. The result is [2]. It does not detail how to boot/reinstall os/etc., only restore (if e.g. db is dead but fs is ok).
> >>>>> For something somewhat similar to what you want, see also [3], which uses guestfish. Might be useful, depending on how badly your disk is corrupted.
> >>>>
> >>>> I went with the guestfish approach. It has fixed some fs issues and now the yum etc seem fine apart from postgres.
> >>>> I had tried previously to uninstall/install packages so I ended up installing them again with yum install ovirt\*setup\*.
> >>>> Now I think I have to run engine-setup but I get the error:
> >>>>
> >>>> Failed to execute stage 'Environment setup': Cannot connect to Engine database using existing credentials: engine@localhost:5432
> >>>
> >>> Seems that I need to have psql running to be able to run engine-backup --mode=restore. Are there any steps how one could manually prepare pgsql for ovirt so as to attempt restoration?
> >>
> >> Replying again, also to conclude this part of your episode: Generally speaking, that's not needed. restore --provision-all-databases should do that for you.
> >
> > Seems that when pgsql is down nothing can be done. You need at least pgsql up and running (a clean state will do) so as to be able to proceed with restoration.
>
> Do you still have logs from this? Both engine-backup's (default to /var/log/ovirt-engine-backup/something if you do not pass --log) and ovirt-engine-provisiondb which it runs (at /var/log/ovirt-engine/setup).
>

I was using the --provision-all-databases flag when trying to restore. I might retest to double check.
When pgsql was down, I was getting:

2020-11-19 22:06:35 4947: Start of engine-backup mode restore scope all file /var/backup/daily.0/engine-backup.gz
2020-11-19 22:06:35 4947: OUTPUT: Start of engine-backup with mode 'restore'
2020-11-19 22:06:35 4947: OUTPUT: scope: all
2020-11-19 22:06:35 4947: OUTPUT: archive file: /var/backup/daily.0/engine-backup.gz
2020-11-19 22:06:35 4947: OUTPUT: log file: restore.log
2020-11-19 22:06:35 4947: Setting scl env for rh-postgresql10
2020-11-19 22:06:35 4947: OUTPUT: Preparing to restore:
2020-11-19 22:06:35 4947: OUTPUT: - Unpacking file '/var/backup/daily.0/engine-backup.gz'
2020-11-19 22:06:35 4947: Opening tarball /var/backup/daily.0/engine-backup.gz to /tmp/engine-backup.63eeNqt4NH
2020-11-19 22:06:35 4947: Verifying hash
2020-11-19 22:06:35 4947: Verifying version
2020-11-19 22:06:35 4947: Reading config
2020-11-19 22:06:35 4947: OUTPUT: Restoring:
2020-11-19 22:06:35 4947: OUTPUT: - Files
2020-11-19 22:06:35 4947: Restoring files
2020-11-19 22:06:36 4947: Reloading configuration
2020-11-19 22:06:36 4947: Generating pgpass
2020-11-19 22:06:36 4947: Verifying connection
2020-11-19 22:06:36 4947: pg_cmd running: psql -w -U engine -h localhost -p 5432 engine -c select 1
psql: FATAL: Ident authentication failed for user "engine"
2020-11-19 22:06:36 4947: FATAL: Can't connect to database 'engine'. Please see '/usr/bin/engine-backup --help'.

> Not sure what you mean in "a clean state will do". If you just install PG, it is not enabled by default, so is not "up and running".
>

I mean pgsql re-installed and its data store cleaned as below:

    rm -rf /var/opt/rh/rh-postgresql10/lib/pgsql/data/*
    /opt/rh/rh-postgresql10/root/usr/bin/postgresql-setup --initdb
    systemctl restart rh-postgresql10-postgresql.service

>
> Generally speaking:
>
> If you never started/inited PG (e.g. on a clean machine), restore, with --provision-all-databases, does this for you. Are you sure you passed this?
>

I am pretty sure I used that flag but might be able to repeat for testing.

>
> If you did, and created DB/user with the same name it wants to restore to, but left the DB empty, it will use it.
>
> If you populated the DB, it will fail with a suitable error message.
>

Confirmed. When I created the DB and users it was failing. So I cleaned everything, started pgsql and left the tool to do its job.

>
> These are the states that are intended to be supported.
>
> Anything else might break it in other ways.
>
> >>
> >> I replied to all your interim emails in private, since you replied in private.
> >
> > Did not notice I was replying in private :)
>
> NP :-)
>
> >>
> >> Thanks for the final message to the list.
> >>
> >> It would be nice if you send another summary of the main obstacles you ran into, what worked and didn't work, and especially what ideas you can think of to improve the code/doc for the next time something similar happens (also to you :-) ).
> >>
> >> If you feel like that, and have time, it sounds like a nice opportunity for a blog post :-) (I know I (almost?) never wrote any myself, sorry, but I like reading them - and they are much more approachable and useful, over the long run, compared to just posting to the list).
> >
> > Noted. Will check to put this in a blog. Generally the missing part from the docs was that one cannot proceed with the restoration if pgsql is not able to start. So I had to clean re-install pgsql and initialize its data store before proceeding with the restoration.
>
> Well, I'd definitely not want a blog post saying you must manually init PG - if you indeed must, that's a bug, so I'd rather fix it first.
>

Noted.

>
> Thanks and best regards,
>
> >>
> >> Best regards,
> >>
> >>>>
> >>>> So I guess I need to follow [2]. What do you think?
> >>>>
> >>>>>
> >>>>> How did you run into a split brain? There is a lock on the shared storage that should prevent this.
> >>>>>
> >>>>> Good luck and best regards,
> >>>>>
> >>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1482710
> >>>>> [2] https://www.ovirt.org/documentation/administration_guide/#Overwriting_a_Self-Hosted_Engine
> >>>>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1569827#c4
> >>>>> --
> >>>>> Didi
> >>
> >> --
> >> Didi
> >
> > _______________________________________________
> > Users mailing list -- users@ovirt.org
> > To unsubscribe send an email to users-le...@ovirt.org
> > Privacy Statement: https://www.ovirt.org/privacy-policy.html
> > oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> > List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/6QZ4OKZTHPE7LLOHNKGJC2HMMBK662GN/
>
> --
> Didi
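
For readers following the thread, here is a minimal sketch of how the steps discussed above might be scripted. It is not a tested procedure. The archive path is the one from the log in this thread; the log path, the edited vm.conf path, the function names and the DRY_RUN switch are assumptions made for illustration. By default every function only prints the command it would run. Other engine-backup options may be needed depending on the version, so check engine-backup --help first.

```shell
#!/bin/sh
# Sketch only: commands are printed unless DRY_RUN=0 is set.
# ARCHIVE is the path seen in the thread's log; LOG and VM_CONF are
# hypothetical locations chosen for illustration.
ARCHIVE="${ARCHIVE:-/var/backup/daily.0/engine-backup.gz}"
LOG="${LOG:-/var/log/ovirt-engine-backup/restore.log}"
VM_CONF="${VM_CONF:-/root/engine-rescue-vm.conf}"
DRY_RUN="${DRY_RUN:-1}"

# run: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# On a host (steps 1-3 of the quoted list): enter global maintenance,
# stop the engine VM, start it from an edited copy of its vm.conf.
boot_rescue_vm() {
    run hosted-engine --set-maintenance --mode=global
    run hosted-engine --vm-shutdown
    run hosted-engine --vm-start --vm-conf="$VM_CONF"
}

# Inside the reinstalled engine VM (step 6): restore from the archive.
# PostgreSQL is deliberately not initialised by hand here;
# --provision-all-databases is expected to take care of that.
restore_engine() {
    run engine-backup --mode=restore \
        --file="$ARCHIVE" \
        --log="$LOG" \
        --provision-all-databases
    run engine-setup
}

# On a host (step 7), after the engine VM was shut down cleanly.
leave_maintenance() {
    run hosted-engine --set-maintenance --mode=none
}
```

Source the file and call boot_rescue_vm, restore_engine or leave_maintenance on the appropriate machine; review the printed commands before setting DRY_RUN=0.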
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/JNVSANQLRINOLRARMRT4QBIS4XRDN4RR/