Thanks Asif, I have re-verified the reported issues. Except for standby backup, the others are fixed.
Thanks & Regards,
Rajkumar Raghuwanshi


On Fri, Mar 27, 2020 at 11:04 PM Asif Rehman <asifr.reh...@gmail.com> wrote:

>
>
> On Wed, Mar 25, 2020 at 12:22 PM Rajkumar Raghuwanshi <
> rajkumar.raghuwan...@enterprisedb.com> wrote:
>
>> Hi Asif,
>>
>> While testing further I observed that parallel backup is not able to take
>> a backup of a standby server.
>>
>> mkdir /tmp/archive_dir
>> echo "archive_mode='on'">> data/postgresql.conf
>> echo "archive_command='cp %p /tmp/archive_dir/%f'">> data/postgresql.conf
>>
>> ./pg_ctl -D data -l logs start
>> ./pg_basebackup -p 5432 -Fp -R -D /tmp/slave
>>
>> echo "primary_conninfo='host=127.0.0.1 port=5432 user=edb'">> /tmp/slave/postgresql.conf
>> echo "restore_command='cp /tmp/archive_dir/%f %p'">> /tmp/slave/postgresql.conf
>> echo "promote_trigger_file='/tmp/failover.log'">> /tmp/slave/postgresql.conf
>>
>> ./pg_ctl -D /tmp/slave -l /tmp/slave_logs -o "-p 5433" start -c
>>
>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "select pg_is_in_recovery();"
>>  pg_is_in_recovery
>> -------------------
>>  f
>> (1 row)
>>
>> [edb@localhost bin]$ ./psql postgres -p 5433 -c "select pg_is_in_recovery();"
>>  pg_is_in_recovery
>> -------------------
>>  t
>> (1 row)
>>
>> [edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 6
>> pg_basebackup: error: could not list backup files: ERROR: the standby was promoted during online backup
>> HINT: This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
>> pg_basebackup: removing data directory "/tmp/bkp_s"
>>
>> #same is working fine without parallel backup
>> [edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 1
>> [edb@localhost bin]$ ls /tmp/bkp_s/PG_VERSION
>> /tmp/bkp_s/PG_VERSION
>>
>> Thanks & Regards,
>> Rajkumar Raghuwanshi
>>
>>
>> On Thu, Mar 19, 2020 at 4:11 PM Rajkumar Raghuwanshi <
>> rajkumar.raghuwan...@enterprisedb.com> wrote:
>>
>>> Hi Asif,
>>>
>>> In another scenario, bkp data is corrupted for tablespace. Again, this
>>> is not reproducible every time, but if I run the same set of commands
>>> I get the same error.
>>>
>>> [edb@localhost bin]$ ./pg_ctl -D data -l logfile start
>>> waiting for server to start.... done
>>> server started
>>> [edb@localhost bin]$
>>> [edb@localhost bin]$ mkdir /tmp/tblsp
>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "create tablespace tblsp location '/tmp/tblsp';"
>>> CREATE TABLESPACE
>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "create database testdb tablespace tblsp;"
>>> CREATE DATABASE
>>> [edb@localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl (a text);"
>>> CREATE TABLE
>>> [edb@localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl values ('parallel_backup with tablespace');"
>>> INSERT 0 1
>>> [edb@localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
>>> [edb@localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p 5555" start
>>> waiting for server to start.... done
>>> server started
>>> [edb@localhost bin]$ ./psql postgres -p 5555 -c "select * from pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
>>>   oid  |  spcname   | spcowner | spcacl | spcoptions
>>> -------+------------+----------+--------+------------
>>>   1663 | pg_default |       10 |        |
>>>  16384 | tblsp      |       10 |        |
>>> (2 rows)
>>>
>>> [edb@localhost bin]$ ./psql testdb -p 5555 -c "select * from testtbl";
>>> psql: error: could not connect to server: FATAL: "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
>>> DETAIL: File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is missing.
>>> [edb@localhost bin]$
>>> [edb@localhost bin]$ ls data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>> data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>> [edb@localhost bin]$ ls /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
>>> ls: cannot access /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or directory
>>>
>>>
>>> Thanks & Regards,
>>> Rajkumar Raghuwanshi
>>>
>>>
>>> On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <
>>> rajkumar.raghuwan...@enterprisedb.com> wrote:
>>>
>>>> Hi Asif,
>>>>
>>>> On testing further, I found that when taking a backup with -R,
>>>> pg_basebackup crashed. This crash is not consistently reproducible.
>>>>
>>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "create table test (a text);"
>>>> CREATE TABLE
>>>> [edb@localhost bin]$ ./psql postgres -p 5432 -c "insert into test values ('parallel_backup with -R recovery-conf');"
>>>> INSERT 0 1
>>>> [edb@localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R
>>>> Segmentation fault (core dumped)
>>>>
>>>> The stack trace looks the same as in the earlier reported crash with
>>>> tablespace.
>>>> --stack trace
>>>> [edb@localhost bin]$ gdb -q -c core.37915 pg_basebackup
>>>> Loaded symbols for /lib64/libnss_files.so.2
>>>> Core was generated by `./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R'.
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at pg_basebackup.c:3175
>>>> 3175 backupinfo->curr = fetchfile->next;
>>>> Missing separate debuginfos, use: debuginfo-install
>>>> keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
>>>> libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
>>>> openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
>>>> (gdb) bt
>>>> #0 0x00000000004099ee in worker_get_files (wstate=0xc1e458) at pg_basebackup.c:3175
>>>> #1 0x0000000000408a9e in worker_run (arg=0xc1e458) at pg_basebackup.c:2715
>>>> #2 0x0000003921a07aa1 in start_thread (arg=0x7f72207c0700) at pthread_create.c:301
>>>> #3 0x00000039212e8c4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>>> (gdb)
>>>>
>>>> Thanks & Regards,
>>>> Rajkumar Raghuwanshi
>>>>
>>>>
>>>> On Mon, Mar 16, 2020 at 2:14 PM Jeevan Chalke <
>>>> jeevan.cha...@enterprisedb.com> wrote:
>>>>
>>>>> Hi Asif,
>>>>>
>>>>>
>>>>>> Thanks Rajkumar. I have fixed the above issues and have rebased the
>>>>>> patch to the latest master (b7f64c64).
>>>>>> (V9 of the patches are attached).
>>>>>>
>>>>>
>>>>> I had a further review of the patches and here are my few observations:
>>>>>
>>>>> 1.
>>>>> +/*
>>>>> + * stop_backup() - ends an online backup
>>>>> + *
>>>>> + * The function is called at the end of an online backup. It sends out pg_control
>>>>> + * file, optionally WAL segments and ending WAL location.
>>>>> + */
>>>>>
>>>>> Comments seem out-dated.
>>>>>
>
> Fixed.
>
>
>>>>> 2. With parallel jobs, maxrate is now not supported. Since we are now
>>>>> asking for data in multiple threads, throttling seems important here.
>>>>> Can you please explain why you have disabled that?
>>>>>
>>>>> 3. As we are always fetching a single file and as Robert suggested,
>>>>> let's rename SEND_FILES to SEND_FILE instead.
>>>>>
>
> Yes, we are fetching a single file. However, SEND_FILES is still capable
> of fetching multiple files in one go, that's why the name.
>
>
>>>>> 4. Does this work on Windows? I mean does pthread_create() work on
>>>>> Windows?
>>>>> I asked this as I see that pgbench has its own implementation for
>>>>> pthread_create() for WIN32 but this patch doesn't.
>>>>>
>
> The patch is updated to add support for the Windows platform.
>
>
>>>>> 5. Typos:
>>>>> tablspace => tablespace
>>>>> safly => safely
>>>>>
>
> Done.
>
>
>>>>> 6. parallel_backup_run() needs some comments explaining the PB_*
>>>>> states it goes through.
>>>>>
>>>>> 7.
>>>>> +        case PB_FETCH_REL_FILES:    /* fetch files from server */
>>>>> +            if (backupinfo->activeworkers == 0)
>>>>> +            {
>>>>> +                backupinfo->backupstate = PB_STOP_BACKUP;
>>>>> +                free_filelist(backupinfo);
>>>>> +            }
>>>>> +            break;
>>>>> +        case PB_FETCH_WAL_FILES:    /* fetch WAL files from server */
>>>>> +            if (backupinfo->activeworkers == 0)
>>>>> +            {
>>>>> +                backupinfo->backupstate = PB_BACKUP_COMPLETE;
>>>>> +            }
>>>>> +            break;
>>>>>
>
> Done.
>
>
>>>>> Why is free_filelist() not called in the PB_FETCH_WAL_FILES case?
>>>>>
>
> Done.
>
> The corrupted tablespace and crash, reported by Rajkumar, have been fixed.
> A pointer variable remained uninitialized which in turn caused the system
> to misbehave.
>
> Attached is the updated set of patches. AFAIK, to complete parallel backup
> feature set, there remain three sub-features:
>
> 1- parallel backup does not work with a standby server. In parallel
> backup, the server spawns multiple processes and there is no shared state
> being maintained. So currently, there is no way to tell multiple processes
> if the standby was promoted during the backup since the START_BACKUP was
> called.
>
> 2- throttling. Robert previously suggested that we implement throttling on
> the client-side. However, I found a previous discussion where it was
> advocated to be added to the backend instead[1].
>
> So, it was better to have a consensus before moving the throttle function
> to the client. That’s why for the time being I have disabled it and have
> asked for suggestions on it to move forward.
>
> It seems to me that we have to maintain a shared state in order to support
> taking backup from standby. Also, there is a new feature recently committed
> for backup progress reporting in the backend (pg_stat_progress_basebackup).
> This functionality was recently added via this commit ID: e65497df. For
> parallel backup to update these stats, a shared state will be required.
>
> Since multiple pg_basebackup can be running at the same time, maintaining
> a shared state can become a little complex, unless we disallow taking
> multiple parallel backups.
>
> So proceeding on with this patch, I will be working on:
> - throttling to be implemented on the client-side.
> - adding a shared state to handle backup from the standby.
>
>
>
> [1]
> https://www.postgresql.org/message-id/flat/521B4B29.20009%402ndquadrant.com#189bf840c87de5908c0b4467d31b50af
>
>
> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
>
>
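
On the client-side throttling item above: as a rough illustration only, here is one way several worker threads could share a single rate limit. This is not taken from the patch. The names SharedThrottle, throttle_init and throttle_account are hypothetical, and the sketch assumes each worker can read a monotonic clock in microseconds and passes that reading in.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical shared throttle state; not taken from the patch. */
typedef struct SharedThrottle
{
	pthread_mutex_t lock;
	int64_t		max_rate;	/* bytes/second across all workers; <= 0 disables */
	int64_t		start_usec;	/* clock reading when the transfer began */
	int64_t		bytes_done;	/* bytes accounted so far by all workers */
} SharedThrottle;

void
throttle_init(SharedThrottle *t, int64_t max_rate, int64_t now_usec)
{
	pthread_mutex_init(&t->lock, NULL);
	t->max_rate = max_rate;
	t->start_usec = now_usec;
	t->bytes_done = 0;
}

/*
 * Account for nbytes just received by one worker and return how many
 * microseconds that worker should sleep so that the combined rate of all
 * workers stays at or below max_rate. The caller sleeps after this returns,
 * so the mutex is never held while sleeping.
 */
int64_t
throttle_account(SharedThrottle *t, int64_t nbytes, int64_t now_usec)
{
	int64_t		due_usec;
	int64_t		elapsed_usec;

	if (t->max_rate <= 0)
		return 0;

	pthread_mutex_lock(&t->lock);
	t->bytes_done += nbytes;
	/* Time offset at which bytes_done becomes allowed; split to avoid overflow. */
	due_usec = (t->bytes_done / t->max_rate) * 1000000
		+ (t->bytes_done % t->max_rate) * 1000000 / t->max_rate;
	elapsed_usec = now_usec - t->start_usec;
	pthread_mutex_unlock(&t->lock);

	return (due_usec > elapsed_usec) ? due_usec - elapsed_usec : 0;
}
```

Each worker would call throttle_account() after every chunk it receives and then sleep for the returned interval. Because the byte counter is shared, the limit applies to the backup as a whole rather than per connection, which is presumably what a user passing a max rate would expect.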