Mingw task for Cirrus CI

2022-02-25 Thread Melih Mutlu
Hi All,

I've been working on adding a Windows+MinGW environment to the cirrus-ci tasks
(the discussion about CI is here [1]).
It uses MSYS2 to set up the environment. UCRT is chosen as the C standard
library, instead of MSVCRT.
The task configures postgres with the features that are available in MSYS2 (see
available packages [2]), and TAP tests are enabled.
I already added the necessary docker image; you can find the related PR at
[3] and a successful cirrus-ci run with these changes at [4].
The attached patch adds a cirrus-ci task that runs on Windows with MinGW.

However, I cannot run configure with --with-python, --with-perl or
--with-tcl.
There are two issues I encountered while trying to enable these, e.g. for
--with-python:

1-  python_ldlibrary is set to "libpython3.9.dll.a", so the related line in
configure cannot retrieve "ldlibrary".
2-  src/pl/plpython/Makefile looks under "C:/Windows/system32/" for
PYTHONDLL. gendef cannot open that file and gives an error like "failed to
open()" when creating a .def file.

To overcome those, I added the correct pattern to extract ldlibrary by
appending "-e 's/\.dll.a$//'" at the end of the related line.
Then, I tried to use the python dll from MSYS instead of windows/system32 by
changing the PYTHONDLL path to "/ucrt64/bin/libpython3.9.dll". I'm not sure
if this is the correct dll.
Here is the diff of these two changes:

diff --git a/configure b/configure
index f3cb5c2b51..42ea580442 100755
--- a/configure
+++ b/configure
@@ -10536,7 +10536,7 @@ python_libdir=`${PYTHON} -c "import sysconfig; print(' '.join(filter(None,sysconfig.get_config_vars('LIBDIR'))))"`
 python_ldlibrary=`${PYTHON} -c "import sysconfig; print(' '.join(filter(None,sysconfig.get_config_vars('LDLIBRARY'))))"`
 
 # If LDLIBRARY exists and has a shlib extension, use it verbatim.
-ldlibrary=`echo "${python_ldlibrary}" | sed -e 's/\.so$//' -e 's/\.dll$//' -e 's/\.dylib$//' -e 's/\.sl$//'`
+ldlibrary=`echo "${python_ldlibrary}" | sed -e 's/\.so$//' -e 's/\.dll$//' -e 's/\.dylib$//' -e 's/\.sl$//' -e 's/\.dll.a$//'`
 if test -e "${python_libdir}/${python_ldlibrary}" -a x"${python_ldlibrary}" != x"${ldlibrary}"
 then
   ldlibrary=`echo "${ldlibrary}" | sed "s/^lib//"`
diff --git a/src/pl/plpython/Makefile b/src/pl/plpython/Makefile
index a83ae8865c..4254ef94d7 100644
--- a/src/pl/plpython/Makefile
+++ b/src/pl/plpython/Makefile
@@ -61,7 +61,8 @@ INCS = plpython.h \
 ifeq ($(PORTNAME), win32)
 
 pytverstr=$(subst .,,${python_version})
-PYTHONDLL=$(subst \,/,$(WINDIR))/system32/python${pytverstr}.dll
+#PYTHONDLL=$(subst \,/,$(WINDIR))/system32/python${pytverstr}.dll
+PYTHONDLL=/ucrt64/bin/libpython3.9.dll
 
 OBJS += libpython${pytverstr}.a



In the end, make check-world still fails, even though I was able to run
configure and make without any obvious error.
I see a bunch of errors in the tests like:
+ERROR:  language "plpython3u" does not exist
+HINT:  Use CREATE EXTENSION to load the language into the database.

Here are the logs from the failed ci run:
https://api.cirrus-ci.com/v1/artifact/task/4645682031099904/log/build/src/pl/plpython/regression.diffs


Any thoughts on how postgres can be built with --with-python etc. on mingw?

Best,
Melih


[1]
https://www.postgresql.org/message-id/flat/20211001222752.wrz7erzh4cajvgp6%40alap3.anarazel.de

[2] https://packages.msys2.org/package/

[3] https://github.com/anarazel/pg-vm-images/pull/8

[4] https://cirrus-ci.com/build/4999469182746624


Re: Mingw task for Cirrus CI

2022-03-03 Thread Melih Mutlu
Hi Andres,


> This presumably is due to using mingw's python rather than python.org's
> python. Seems like a reasonable thing to support for the mingw build.
>
> Melih, you could try to build against the python.org python (I installed
> it in the CI container).
>

I tried to use the installed python from python.org in the container.
The problem with this is that the "LIBDIR" and "LDLIBRARY" configs of
python.org's python for Windows are empty. Therefore, python_libdir and
other related variables in the configure script are not set correctly.


> Seems we're doing something wrong and end up with a 4-byte off_t, whereas
> python ends up with an 8-byte one. We probably need to fix that. But it's
> not the cause of this problem.
>
>
> You could take out -s from the make flags and see whether plpython3.dll is
> being built and installed, and where to.
>

Here is a run without -s: https://cirrus-ci.com/task/4569363104661504
To be honest, I couldn't tell what's wrong from these logs. But I do see
that plpython3.dll exists under src/pl/plpython after the build when I run
these steps locally.

Best,
Melih


Re: Mingw task for Cirrus CI

2022-03-22 Thread Melih Mutlu
Hi Andres,

Rebased it.
I also removed the temp installation task and used the
NoDefaultCurrentDirectoryInExePath env variable instead.

Best,
Melih
From e8a1dae0ec10efd8a967070e0d412d2bf875fa93 Mon Sep 17 00:00:00 2001
From: Melih Mutlu 
Date: Mon, 21 Feb 2022 14:46:05 +0300
Subject: [PATCH] Added Windows with MinGW environment in Cirrus CI

---
 .cirrus.yml | 79 +++--
 1 file changed, 65 insertions(+), 14 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index e5335fede7..1ed40347cf 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -23,7 +23,6 @@ env:
   CHECKFLAGS: -Otarget
   PROVE_FLAGS: --timer
   PGCTLTIMEOUT: 120 # avoids spurious failures during parallel tests
-  TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
   PG_TEST_EXTRA: kerberos ldap ssl
 
 
@@ -334,13 +333,30 @@ task:
   cores_script: src/tools/ci/cores_backtrace.sh macos "${HOME}/cores"
 
 
+WINDOWS_ENVIRONMENT_BASE: &WINDOWS_ENVIRONMENT_BASE
+  env:
+    # Half the allowed per-user CPU cores
+    CPUS: 4
+    # The default working dir is in a directory msbuild complains about
+    CIRRUS_WORKING_DIR: "c:/cirrus"
+    TEMP_CONFIG: ${CIRRUS_WORKING_DIR}/src/tools/ci/pg_ci_base.conf
+
+    # Avoids port conflicts between concurrent tap test runs
+    PG_TEST_USE_UNIX_SOCKETS: 1
+
+  only_if: $CIRRUS_CHANGE_MESSAGE !=~ '.*\nci-os-only:.*' || $CIRRUS_CHANGE_MESSAGE =~ '.*\nci-os-only:[^\n]*windows.*'
+
+  sysinfo_script: |
+    chcp
+    systeminfo
+    powershell -Command get-psdrive -psprovider filesystem
+    set
+
 task:
+  << : *WINDOWS_ENVIRONMENT_BASE
   name: Windows - Server 2019, VS 2019
 
   env:
-    # Half the allowed per-user CPU cores
-    CPUS: 4
-
     # Our windows infrastructure doesn't have test concurrency above the level
     # of a single vcregress test target. Due to that, it's useful to run prove
     # with multiple jobs. For the other tasks it isn't, because two sources
@@ -350,15 +366,11 @@ task:
     # likely can be improved upon further.
     PROVE_FLAGS: -j10 --timer
 
-    # The default cirrus working dir is in a directory msbuild complains about
-    CIRRUS_WORKING_DIR: "c:/cirrus"
     # Avoid re-installing over and over
     NO_TEMP_INSTALL: 1
     # git's tar doesn't deal with drive letters, see
     # https://postgr.es/m/b6782dc3-a7b0-ed56-175f-f8f54cb08d67%40dunslane.net
     TAR: "c:/windows/system32/tar.exe"
-    # Avoids port conflicts between concurrent tap test runs
-    PG_TEST_USE_UNIX_SOCKETS: 1
     PG_REGRESS_SOCK_DIR: "c:/cirrus/"
     # -m enables parallelism
     # verbosity:minimal + Summary reduce verbosity, while keeping a summary of
@@ -389,12 +401,6 @@ task:
     cpu: $CPUS
     memory: 4G
 
-  sysinfo_script: |
-    chcp
-    systeminfo
-    powershell -Command get-psdrive -psprovider filesystem
-    set
-
   setup_additional_packages_script: |
     REM choco install -y --no-progress ...
 
@@ -454,6 +460,51 @@ task:
       path: "crashlog-*.txt"
       type: text/plain
 
+task:
+  << : *WINDOWS_ENVIRONMENT_BASE
+  name: Windows - Server 2019, MinGW64
+  windows_container:
+    image: $CONTAINER_REPO/windows_ci_mingw64:latest
+    cpu: $CPUS
+    memory: 4G
+  env:
+    CCACHE_DIR: C:/msys64/ccache
+    BUILD_DIR: "%CIRRUS_WORKING_DIR%/build"
+
+  ccache_cache:
+    folder: ${CCACHE_DIR}
+
+  mingw_info_script:
+    - C:\msys64\usr\bin\bash.exe -lc "where gcc"
+    - C:\msys64\usr\bin\bash.exe -lc "gcc --version"
+    - C:\msys64\usr\bin\bash.exe -lc "where perl"
+    - C:\msys64\usr\bin\bash.exe -lc "perl --version"
+
+  configure_script:
+    - C:\msys64\usr\bin\bash.exe -lc "mkdir %BUILD_DIR% &&
+      cd %BUILD_DIR% &&
+      %CIRRUS_WORKING_DIR%/configure
+        --host=x86_64-w64-mingw32
+        --enable-cassert
+        --enable-tap-tests
+        --with-icu
+        --with-libxml
+        --with-libxslt
+        --with-lz4
+        --enable-debug
+        CC='ccache gcc'
+        CXX='ccache g++'"
+
+  build_script:
+    C:\msys64\usr\bin\bash.exe -lc "cd %BUILD_DIR% && make -s world-bin -j${CPUS}"
+
+  upload_caches: ccache
+
+  tests_script:
+    - set "NoDefaultCurrentDirectoryInExePath=0"
+    - C:\msys64\usr\bin\bash.exe -lc "cd %BUILD_DIR% && make -s ${CHECK} ${CHECKFLAGS} -j${CPUS} TMPDIR=%BUILD_DIR%/tmp_install"
+
+  on_failure: *on_failure
 
 task:
   name: CompilerWarnings
-- 
2.35.1.windows.2



Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-01-16 Thread Melih Mutlu
Hi,

Thanks for reviewing.

torikoshia wrote on Wed, 10 Jan 2024 at 09:37:

> > + <row>
> > +  <entry role="catalog_table_entry"><para role="column_definition">
> > +   <structfield>context_id</structfield> <type>int4</type>
> > +  </para>
> > +  <para>
> > +   Current context id. Note that the context id is a temporary id and may
> > +   change in each invocation
> > +  </para></entry>
> > + </row>
> > +
> > + <row>
> > +  <entry role="catalog_table_entry"><para role="column_definition">
> > +   <structfield>path</structfield> <type>int4[]</type>
> > +  </para>
> > +  <para>
> > +   Path to reach the current context from TopMemoryContext. Context ids in
> > +   this list represents all parents of the current context. This can be
> > +   used to build the parent and child relation
> > +  </para></entry>
> > + </row>
> > +
> > + <row>
> > +  <entry role="catalog_table_entry"><para role="column_definition">
> > +   <structfield>total_bytes_including_children</structfield> <type>int8</type>
> > +  </para>
> > +  <para>
> > +   Total bytes allocated for this memory context including its children
> > +  </para></entry>
> > + </row>
>
> These columns are currently added to the bottom of the table, but it may
> be better to put semantically similar items close together and change
> the insertion position with reference to other system views. For
> example,
>
> - In pg_group and pg_user, 'id' is placed on the line following 'name',
>   so 'context_id' should be placed on the line following 'name'.
> - 'path' is similar to 'parent' and 'level' in that these are
>   information about the location of the context, so 'path' should be
>   placed next to them.
>
> If we do this, orders of columns in the system view should be the same,
> I think.
>

I've done what you suggested. Also moved "total_bytes_including_children"
right after "total_bytes".


> 14dd0f27d has introduced the new macro foreach_int.
> It seems to be able to make the code a bit simpler and the commit log
> says this macro is primarily intended for use in new code. For example:
>

Makes sense. Done.

Thanks,
-- 
Melih Mutlu
Microsoft


v5-0001-Adding-id-parent_id-into-pg_backend_memory_contex.patch
Description: Binary data


Re: Improve WALRead() to suck data directly from WAL buffers when possible

2024-01-22 Thread Melih Mutlu
Hi Bharath,

Thanks for working on this. It seems like a nice improvement to have.

Here are some comments on the 0001 patch.

1-  xlog.c
+ /*
+ * Fast paths for the following reasons: 1) WAL buffers aren't in use when
+ * server is in recovery. 2) WAL is inserted into WAL buffers on current
+ * server's insertion TLI. 3) Invalid starting WAL location.
+ */

Shouldn't the comment be something like "2) WAL is *not* inserted into WAL
buffers on current server's insertion TLI", since the condition to break is
tli != GetWALInsertionTimeLine()?

2-
+ /*
+ * WAL being read doesn't yet exist i.e. past the current insert position.
+ */
+ if ((startptr + count) > reservedUpto)
+ return ntotal;

This question may not even make sense, but I wonder whether we can read from
startptr only up to reservedUpto in case startptr + count exceeds
reservedUpto? Something like the sketch below.
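
To make that concrete, here is a minimal sketch of what I mean, reusing the
variable names from the quoted hunk (purely illustrative, not a drop-in
change):

/* Hypothetical: clamp the request instead of giving up entirely. */
if ((startptr + count) > reservedUpto)
{
    /* Nothing in the requested range exists yet. */
    if (startptr >= reservedUpto)
        return ntotal;

    /* Read only the portion that has already been reserved. */
    count = reservedUpto - startptr;
}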

3-
+ /* Wait for any in-progress WAL insertions to WAL buffers to finish. */
+ if ((startptr + count) > LogwrtResult.Write &&
+ (startptr + count) <= reservedUpto)
+ WaitXLogInsertionsToFinish(startptr + count);

Do we need to check if (startptr + count) <= reservedUpto as we already
verified this condition a few lines above?

4-
+ Assert(nread > 0);
+ memcpy(dst, data, nread);
+
+ /*
+ * Make sure we don't read xlblocks down below before the page
+ * contents up above.
+ */
+ pg_read_barrier();
+
+ /* Recheck if the read page still exists in WAL buffers. */
+ endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+ /* Return if the page got initalized while we were reading it. */
+ if (expectedEndPtr != endptr)
+ break;
+
+ /*
+ * Typically, we must not read a WAL buffer page that just got
+ * initialized. Because we waited enough for the in-progress WAL
+ * insertions to finish above. However, there can exist a slight
+ * window after the above wait finishes in which the read buffer page
+ * can get replaced especially under high WAL generation rates. After
+ * all, we are reading from WAL buffers without any locks here. So,
+ * let's not count such a page in.
+ */
+ phdr = (XLogPageHeader) page;
+ if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+   phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+   phdr->xlp_tli == tli))
+ break;

I see that you recheck whether the page still exists in WAL buffers at the
end. What would you think about memcpy'ing only after being sure that we
will need and use the recently read data? If we break out of the loop during
the recheck, we simply discard the data read in the latest attempt. I guess
this may not be a big deal, but the data would be unnecessarily copied into
the destination in such a case.
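
Roughly, the ordering I have in mind looks like the sketch below (based on
the quoted code; one caveat is that, since the buffers are read without
locks, the page could in principle still be replaced right after these
checks, so it may not be as simple as just moving the memcpy):

/* Read the page locally first, but don't copy it into dst yet. */
pg_read_barrier();      /* don't read xlblocks before the page contents */

/* Recheck if the read page still exists in WAL buffers. */
endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
if (expectedEndPtr != endptr)
    break;              /* page was replaced; dst was never written */

phdr = (XLogPageHeader) page;
if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
      phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
      phdr->xlp_tli == tli))
    break;              /* likewise, nothing was copied */

/* Copy out only now that the page looked valid. */
Assert(nread > 0);
memcpy(dst, data, nread);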

5- xlogreader.c
+ nread = XLogReadFromBuffers(startptr, tli, count, buf);
+
+ if (nread > 0)
+ {
+ /*
+ * Check if its a full read, short read or no read from WAL buffers.
+ * For short read or no read, continue to read the remaining bytes
+ * from WAL file.
+ *
+ * XXX: It might be worth to expose WAL buffer read stats.
+ */
+ if (nread == count) /* full read */
+ return true;
+ else if (nread < count) /* short read */
+ {
+ buf += nread;
+ startptr += nread;
+ count -= nread;
+ }

Typo in the comment. Should be like "Check if *it's* a full read, short
read or no read from WAL buffers."

Also, I don't think XLogReadFromBuffers() returns anything less than 0 or
more than count. Is verifying nread > 0 necessary? I think if nread does not
equal count, we can simply assume that it's a short read (or no read at all
in case nread is 0, which we don't need to handle specifically).
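
In other words, something along these lines (just a sketch, assuming the
0 <= nread <= count invariant holds):

nread = XLogReadFromBuffers(startptr, tli, count, buf);
Assert(nread >= 0 && nread <= count);

if (nread == count)
    return true;        /* full read from WAL buffers, done */

/*
 * Short read (0 < nread < count) or no read (nread == 0): either way,
 * advance by nread and fall through to read the remaining bytes from
 * the WAL file.
 */
buf += nread;
startptr += nread;
count -= nread;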

6-
+ /*
+ * We determined how much of the page can be validly read as 'count', read
+ * that much only, not the entire page. Since WALRead() can read the page
+ * from WAL buffers, in which case, the page is not guaranteed to be
+ * zero-padded up to the page boundary because of the concurrent
+ * insertions.
+ */

I'm not sure about pasting this into most of the places we call WALRead().
Wouldn't it be better if we mention this somewhere around WALRead() only
once?


Best,
-- 
Melih Mutlu
Microsoft


Re: Flushing large data immediately in pqcomm

2024-01-30 Thread Melih Mutlu
Hi Heikki,

Heikki Linnakangas wrote on Mon, 29 Jan 2024 at 19:12:

> > Proposed change modifies socket_putmessage to send any data larger than
> > 8K immediately without copying it into the send buffer. Assuming that
> > the send buffer would be flushed anyway due to reaching its limit, the
> > patch just gets rid of the copy part which seems unnecessary and sends
> > data without waiting.
>
> If there's already some data in PqSendBuffer, I wonder if it would be
> better to fill it up with data, flush it, and then send the rest of the
> data directly. Instead of flushing the partial data first. I'm afraid
> that you'll make a tiny call to secure_write(), followed by a large one,
> then a tiny one again, and so forth. Especially when socket_putmessage
> itself writes the msgtype and len, which are tiny, before the payload.
>

I agree that I could do better there without flushing twice for both
PqSendBuffer and the input data. PqSendBuffer always has some data, even if
it's tiny, since msgtype and len are added.


> Perhaps we should invent a new pq_putmessage() function that would take
> an input buffer with 5 bytes of space reserved before the payload.
> pq_putmessage() could then fill in the msgtype and len bytes in the
> input buffer and send that directly. (Not wedded to that particular API,
> but something that would have the same effect)
>

I thought about doing this. The reason I didn't is that such a change would
require adjusting the input buffers wherever pq_putmessage is called, and I
did not want to touch that many different places. The places where we need
pq_putmessage might not be that many though, I'm not sure.
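
For reference, the API shape I understood from your suggestion would be
something like the sketch below. The function name and exact signature are
invented for illustration; internal_putbytes() is the existing pqcomm.c
helper:

/*
 * Hypothetical variant of pq_putmessage(): the caller reserves 5 bytes in
 * front of the payload, so the header can be filled in place and the whole
 * message handed to the socket code as one contiguous chunk.
 */
#define PQ_MSG_HDR_LEN 5        /* 1-byte msgtype + 4-byte length */

static int
pq_putmessage_reserved(char msgtype, char *buf, size_t payload_len)
{
    uint32      n32 = pg_hton32((uint32) (payload_len + 4));

    buf[0] = msgtype;           /* caller left buf[0..4] unused */
    memcpy(buf + 1, &n32, 4);

    /* header and payload go out together, no tiny write first */
    return internal_putbytes(buf, payload_len + PQ_MSG_HDR_LEN);
}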


>
> > This change affects places where pq_putmessage is used such as
> > pg_basebackup, COPY TO, walsender etc.
> >
> > I did some experiments to see how the patch performs.
> > Firstly, I loaded ~5GB data into a table [1], then ran "COPY test TO
> > STDOUT". Here are perf results of both the patch and HEAD > ...
> > The patch brings a ~5% gain in socket_putmessage.
> >
> > [1]
> > CREATE TABLE test(id int, name text, time TIMESTAMP);
> > INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100)
> > AS name, NOW() AS time FROM generate_series(1, 1) AS i;
>
> I'm surprised by these results, because each row in that table is < 600
> bytes. PqSendBufferSize is 8kB, so the optimization shouldn't kick in in
> that test. Am I missing something?
>

You're absolutely right. I made a silly mistake there. I also think that the
way I did the perf analysis does not make much sense, even if one row of the
table is greater than 8kB.
Here are some quick timing results after making sure that the test triggers
this patch's optimization. I need to think more about how to profile this
with perf. I hope to share proper results soon.

I just added a bit more zeros [1] and ran [2] (hopefully measuring the
correct thing).

HEAD:
real    2m48,938s
user    0m9,226s
sys     1m35,342s

Patch:
real    2m40,690s
user    0m8,492s
sys     1m31,001s

[1]
 INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 1)
AS name, NOW() AS time FROM generate_series(1, 100) AS i;

[2]
 rm /tmp/dummy && echo 3 | sudo tee /proc/sys/vm/drop_caches && time psql
-d postgres -c "COPY test TO STDOUT;" > /tmp/dummy

Thanks,
-- 
Melih Mutlu
Microsoft


Re: Flushing large data immediately in pqcomm

2024-01-30 Thread Melih Mutlu
Hi Robert,

Robert Haas wrote on Mon, 29 Jan 2024 at 20:48:

> > If there's already some data in PqSendBuffer, I wonder if it would be
> > better to fill it up with data, flush it, and then send the rest of the
> > data directly. Instead of flushing the partial data first. I'm afraid
> > that you'll make a tiny call to secure_write(), followed by a large one,
> > then a tiny one again, and so forth. Especially when socket_putmessage
> > itself writes the msgtype and len, which are tiny, before the payload.
> >
> > Perhaps we should invent a new pq_putmessage() function that would take
> > an input buffer with 5 bytes of space reserved before the payload.
> > pq_putmessage() could then fill in the msgtype and len bytes in the
> > input buffer and send that directly. (Not wedded to that particular API,
> > but something that would have the same effect)
>
> I share the concern; I'm not sure about the best solution. I wonder if
> it would be useful to have pq_putmessagev() in the style of writev()
> et al. Or maybe what we need is secure_writev().
>

I thought about using writev() for not only pq_putmessage() but
pq_putmessage_noblock() too. Currently, pq_putmessage_noblock() repallocs
PqSendBuffer and copies the input buffer, which can easily be larger than
8kB, into PqSendBuffer. I also discussed it with Thomas off-list. The thing
is that I believe we would need secure_writev() with the SSL/GSS cases
handled properly. I'm just not sure whether the effort would be worthwhile
considering what we gain from it.


> I also wonder if the threshold for sending data directly should be
> smaller than the buffer size, and/or whether it should depend on the
> buffer being empty.


You might be right. I'm not sure what the ideal threshold would be.


> If we have an 8kB buffer that currently has
> nothing in it, and somebody writes 2kB, I suspect it might be wrong to
> copy that into the buffer. If the same buffer had 5kB used and 3kB
> free, copying sounds a lot more likely to work out. The goal here is
> probably to condense sequences of short messages into a single
> transmission while sending long messages individually. I'm just not
> quite sure what heuristic would do that most effectively.
>

Sounds like it's difficult to come up with a heuristic that would work well
enough for most cases.
One issue with sending data directly whenever the buffer is empty is that
the buffer starts out empty; if we never copy anything while it is empty, it
will stay empty forever. We can maybe simply set the threshold to half the
buffer size (4kB) and hope that works better. Or copy the data only if it
fits into the remaining space in the buffer. What do you think?
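
As a sketch, the "copy only if it fits" variant could look roughly like the
following inside socket_putmessage (internal_flush() exists today;
internal_flush_buffer() stands for a hypothetical helper that writes a
caller-supplied buffer directly to the socket):

if (len <= (size_t) (PqSendBufferSize - PqSendPointer))
{
    /* fits into the remaining buffer space: coalesce as we do today */
    memcpy(PqSendBuffer + PqSendPointer, s, len);
    PqSendPointer += len;
}
else
{
    /* flush whatever is buffered, then send this payload directly */
    if (internal_flush())
        return EOF;
    if (internal_flush_buffer(s, len))      /* hypothetical helper */
        return EOF;
}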


An additional note, since I mentioned pq_putmessage_noblock(): I've been
testing sending the input data immediately in pq_putmessage_noblock() and
copying the data into PqSendBuffer only if the socket would block and cannot
send it. Unfortunately, I don't have strong numbers to demonstrate any
improvement in perf or timing yet. But I'd still like to know what you think
about it.

Thanks,
-- 
Melih Mutlu
Microsoft


Re: Flushing large data immediately in pqcomm

2024-01-31 Thread Melih Mutlu
Robert Haas wrote on Wed, 31 Jan 2024 at 20:23:

> On Tue, Jan 30, 2024 at 6:39 PM Jelte Fennema-Nio 
> wrote:
> > I agree that it's hard to prove that such heuristics will always be
> > better in practice than the status quo. But I feel like we shouldn't
> > let perfect be the enemy of good here.
>
> Sure, I agree.
>
> > I think one approach that is a clear
> > improvement over the status quo is:
> > 1. If the buffer is empty AND the data we are trying to send is larger
> > than the buffer size, then don't use the buffer.
> > 2. If not, fill up the buffer first (just like we do now) then send
> > that. And if the left over data is then still larger than the buffer,
> > then now the buffer is empty so 1. applies.
>
> That seems like it might be a useful refinement of Melih Mutlu's
> original proposal, but consider a message stream that consists of
> messages exactly 8kB in size. If that message stream begins when the
> buffer is empty, all messages are sent directly. If it begins when
> there are any number of bytes in the buffer, we buffer every message
> forever. That's kind of an odd artifact, but maybe it's fine in
> practice. I say again that it's good to test out a bunch of scenarios
> and see what shakes out.
>

Isn't this already the case? Imagine sending exactly 8kB messages: the first
pq_putmessage() call will buffer 8kB. Any call after this point simply sends
the 8kB message already buffered from the previous call and buffers a new
8kB message. The only difference is that we keep each message in the buffer
for a while instead of sending it directly. In theory, the proposed idea
should not change the number of flushes or the amount of data sent each
time, but it can remove the unnecessary copies into the buffer in this case.
I guess the behaviour is also the same with or without the patch in case the
buffer already has some bytes.
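
A rough trace of my understanding, for back-to-back 8kB messages and an 8kB
buffer (simplified; header bytes ignored):

/*
 * HEAD:   put(msg1) -> copy msg1 into PqSendBuffer
 *         put(msg2) -> buffer full: flush msg1, then copy msg2
 *         put(msg3) -> buffer full: flush msg2, then copy msg3
 *         ...         one flush per call, plus one memcpy per call
 *
 * Patch:  put(msg1) -> buffer empty, message >= limit: send msg1 directly
 *         put(msg2) -> send msg2 directly
 *         ...         one flush per call, no memcpy
 */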

Robert Haas wrote on Wed, 31 Jan 2024 at 21:28:

> Personally, I don't think it's likely that anything will get committed
> here without someone doing more legwork than I've seen on the thread
> so far. I don't have any plan to pick up this patch anyway, but if I
> were thinking about it, I would abandon the idea unless I were
> prepared to go test a bunch of stuff myself. I agree with the core
> idea of this work, but not with the idea that the bar is as low as "if
> it can't lose relative to today, it's good enough."
>

You're right, and I'm open to doing more legwork. I'd also appreciate any
suggestions about how to test this properly and/or useful scenarios to test.
That would be really helpful.

I understand that I should provide more and better analysis around this
change to prove that it (hopefully) doesn't hurt anything but improves some
cases, even if not all of them. That may even help us find a better approach
than the one already proposed. Just to clarify, I don't think anyone here
suggests that the bar should be at "if it can't lose relative to today, it's
good enough". IMHO, "a change that improves some cases, but regresses
nowhere" does not translate to that.

Thanks,
-- 
Melih Mutlu
Microsoft


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-22 Thread Melih Mutlu
Hi,

Melih Mutlu wrote on Thu, 16 Feb 2023 at 14:37:

> I see that setting originname in the catalog before actually creating it
> causes issues. My concern with setting originname when setting the state to
> FINISHEDCOPY is that if the worker waits until FINISHEDCOPY to update the
> slot/origin name but fails somewhere before reaching FINISHEDCOPY and
> after creating the slot/origin, then this newly created slot/origin will be left
> behind. It wouldn't be possible to find and drop them since their names are
> not stored in the catalog. Eventually, this might also cause hitting
> the max_replication_slots limit in case of such failures between origin
> creation and FINISHEDCOPY.
>
> One fix I can think of is to update the catalog right after creating a new
> origin. But this would also require committing the current transaction to
> actually persist the originname. I guess this action of committing the
> transaction in the middle of the initial sync could hurt the copy process.
>

Here are more thoughts on this:
I still believe that updating originname when setting the state
to FINISHEDCOPY is not a good idea, since any failure
before FINISHEDCOPY prevents us from storing originname in the catalog. If
an origin or slot is not in the catalog, it's not easily possible to find
and drop origins/slots that are left behind. And we definitely do not want
to keep unnecessary origins/slots, since we would hit the
max_replication_slots limit in such cases.
It's better to be safe and update origin/slot names when setting the state
to DATASYNC. At this point, the worker must be sure that it writes the
correct origin/slot names into the catalog.
The following part actually cleans up the catalog if a table is left behind
in DATASYNC state and its slot and origin cannot be used for sync.

> ReplicationSlotDropAtPubNode(LogRepWorkerWalRcvConn, prev_slotname, true);
>
> StartTransactionCommand();
> /* Replication origin might still exist. Try to drop */
> replorigin_drop_by_name(originname, true, false);
>
> /*
>  * Remove replication slot and origin name from the relation's
>  * catalog record
>  */
> UpdateSubscriptionRel(MyLogicalRepWorker->subid,
>                       MyLogicalRepWorker->relid,
>                       MyLogicalRepWorker->relstate,
>                       MyLogicalRepWorker->relstate_lsn,
>                       NULL,
>                       NULL);
>

The patch needs to refresh the origin name before it begins copying the
table. It will try to read from the catalog but won't find any slot/origin,
since they have been cleaned up. Then, it will move on with the correct
origin name, which is the one created (or to be created) for the current
sync worker.

I tested refetching originname, and it seems to fix the errors you reported.

Thanks,
-- 
Melih Mutlu
Microsoft


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-22 Thread Melih Mutlu
Hi Shveta,

Thanks for reviewing.
Please see attached patches.

shveta malik wrote on Thu, 2 Feb 2023 at 14:31:

> On Wed, Feb 1, 2023 at 5:37 PM Melih Mutlu  wrote:
> for (int64 i = 1; i <= lastusedid; i++)
> {
>     char originname_to_drop[NAMEDATALEN] = {0};
>     snprintf(originname_to_drop, sizeof(originname_to_drop),
>              "pg_%u_%lld", subid, (long long) i);
>     ...
> }
>
> --Is it better to use the function
> 'ReplicationOriginNameForLogicalRep' here instead of sprintf, just to
> be consistent everywhere to construct origin-name?
>

ReplicationOriginNameForLogicalRep creates an origin name with the current
"lastusedid" and doesn't accept that id as a parameter. Here the patch needs
to check all possible ids.


> /* Drop replication origin */
> replorigin_drop_by_name(originname, true, false);
> }
>
> --Are we passing missing_ok as true (second arg) intentionally here in
> replorigin_drop_by_name? Once we fix the issue reported  in my earlier
> email (ASSERT), do you think it makes sense to  pass missing_ok as
> false here?
>

Yes, missing_ok is intentional. The user might be concurrently refreshing
the sub, or the apply worker might have already dropped the origin at that
point. So, missing_ok is set to true.
This is also how origin drops before the worker exits are handled on HEAD.
I only followed the same approach.


> --Do we need to palloc for each relation separately? Shall we do it
> once outside the loop and reuse it? Also pfree is not done for rstate
> here.
>

Removed palloc from the loop. No need to pfree here since the memory
context will be deleted with the next CommitTransactionCommand call.


> Can you please review the above flow (I have given line# along with),
> I think it could be problematic. We alloced prev_slotname, assigned it
> to slotname, freed prev_slotname and used slotname after freeing the
> prev_slotname.
> Also slotname is allocated some memory too, that is not freed.
>

Right. I used memcpy instead of assigning prev_slotname to slotname.
slotname is returned in the end and pfree'd later [1].

I also addressed your other review comments that I didn't explicitly mention
in this email; I simply applied the changes you pointed out. Also added some
more logs as well. I hope it's more useful now.

[1]
https://github.com/postgres/postgres/blob/master/src/backend/replication/logical/worker.c#L4359


Thanks,
-- 
Melih Mutlu
Microsoft


v8-0001-Add-replication-protocol-cmd-to-create-a-snapsho.patch
Description: Binary data


v11-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-22 Thread Melih Mutlu
Hi Wang,

Thanks for reviewing.
Please see updated patches. [1]

wangw.f...@fujitsu.com wrote on Tue, 7 Feb 2023 at 10:28:

> 1. In the function ApplyWorkerMain.
> I think we need to keep the worker name as "leader apply worker" in the
> comment
> like the current HEAD.
>

Done.


> I think in this case we also need to pop the error context stack before
> returning. Otherwise, I think we might use the wrong callback
> (apply error_callback) after we return from this function.
>

Done.


> 3. About the function UpdateSubscriptionRelReplicationSlot.
> This newly introduced function UpdateSubscriptionRelReplicationSlot does
> not
> seem to be invoked. Do we need this function?


Removed.

> I think if 'need_full_snapshot' is false, it means we will create a snapshot
> that can read only catalogs. (see SnapBuild->building_full_snapshot)


Fixed.

> ```
> 'use' will use the snapshot for the current transaction executing the
> command.
> This option must be used in a transaction, and CREATE_REPLICATION_SLOT
> must be
> the first command run in that transaction.
> ```

> So I think in the function CreateDecodingContext, when "need_full_snapshot"
> is true, we seem to need the following check, just like in the function
> CreateInitDecodingContext:

> ```
> if (IsTransactionState() &&
>     GetTopTransactionIdIfAny() != InvalidTransactionId)
>     ereport(ERROR,
>             (errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
>              errmsg("cannot create logical replication slot in transaction that has performed writes")));
> ```


You're right that to "use" the snapshot, it must be the first command in the
transaction. And that check happens here [2]. CreateReplicationSnapshot has
a similar check too.
I think the check you're referring to is needed to actually create a
replication slot, and it is performed whether the snapshot will be "used" or
"exported". This is not the case for CreateReplicationSnapshot.

> It seems that we also need to invoke the function
> CheckLogicalDecodingRequirements in the new function
> CreateReplicationSnapshot,
> just like the function CreateReplicationSlot and the function
> StartLogicalReplication.


Added this check.

> 3. The invocation of startup_cb_wrapper in the function
> CreateDecodingContext.
> I think we should change the third input parameter to true when invoke
> function
> startup_cb_wrapper for CREATE_REPLICATION_SNAPSHOT. BTW, after applying
> patch
> v10-0002*, these settings will be inconsistent when sync workers use
> "CREATE_REPLICATION_SLOT" and "CREATE_REPLICATION_SNAPSHOT" to take
> snapshots.
> This input parameter (true) will let us disable streaming and two-phase
> transactions in function pgoutput_startup. See the last paragraph of the
> commit
> message for 4648243 for more details.


I'm not sure whether "is_init" should be set to true. CreateDecodingContext
only creates a context for an already existing logical slot and does not
initialize a new one.
I think inconsistencies between "CREATE_REPLICATION_SLOT" and
"CREATE_REPLICATION_SNAPSHOT" are expected, since one creates a new
replication slot and the other does not.
CreateDecodingContext is also used in other places. I'm not sure how this
change would affect those places; I'll look into this more. Please let me
know if I'm missing something.


[1]
https://www.postgresql.org/message-id/CAGPVpCQmEE8BygXr%3DHi2N2t2kOE%3DPJwofn9TX0J9J4crjoXarQ%40mail.gmail.com
[2]
https://github.com/postgres/postgres/blob/master/src/backend/replication/walsender.c#L1108

Thanks,
-- 
Melih Mutlu
Microsoft


Re: Allow logical replication to copy tables in binary format

2023-02-23 Thread Melih Mutlu
shiy.f...@fujitsu.com wrote on Thu, 23 Feb 2023 at 12:29:

> On Thu, Feb 23, 2023 12:40 PM Kuroda, Hayato/黒田 隼人 <
> kuroda.hay...@fujitsu.com> wrote:
> >
> > > > I'm not sure the combination of "copy_format = binary" and
> "copy_data =
> > false"
> > > > should be accepted or not. How do you think?
> > >
> > > It seems quite useless indeed to specify the format of a copy that
> won't
> > happen.
> >
> > I understood that the conbination of "copy_format = binary" and
> "copy_data =
> > false"
> > should be rejected in parse_subscription_options() and
> AlterSubscription(). Is it
> > right?
> > I'm expecting that is done in next version.
> >
>
> The copy_data option only takes effect once, in the CREATE SUBSCRIPTION or
> ALTER SUBSCRIPTION REFRESH PUBLICATION command, but the copy_format option
> can take effect multiple times if the subscription is refreshed multiple
> times. Even if the subscription is created with copy_data=false,
> copy_format can take effect when executing ALTER SUBSCRIPTION REFRESH
> PUBLICATION. So, I am not sure we want to reject this usage.
>

I believe the places where copy_data and copy_format are needed are pretty
much the same. I couldn't think of a case where copy_format is needed but
copy_data isn't. Please let me know if I'm missing something.
CREATE SUBSCRIPTION and ALTER SUBSCRIPTION SET/ADD/REFRESH PUBLICATION are
all the places where initial sync can happen. For all these commands,
copy_data needs to be given as a parameter, or it will be set to the default
value, which is true. Even if copy_data=false when the sub was created,
REFRESH PUBLICATION (without an explicit copy_data parameter) will copy some
tables if needed, regardless of what copy_data was in CREATE SUBSCRIPTION.
This is because copy_data is not something stored in pg_subscription or
another catalog. But this is not an issue for copy_format, since its value
will be stored in the catalog. This can allow users to set the format even
if copy_data=false and no initial sync is needed at that moment, so that
future initial syncs, which can be triggered by ALTER SUBSCRIPTION, will be
performed in the correct format.
So, I also think we should allow setting copy_format even if
copy_data=false.

Another way to deal with this issue could be to expect the user to specify
the format every time copy_format is needed, similar to the case for
copy_data, and to move on with the text (default) format if it's not
specified for the current CREATE/ALTER SUBSCRIPTION execution. But I don't
think this would make things easier.

Best,
-- 
Melih Mutlu
Microsoft


Re: Allow logical replication to copy tables in binary format

2023-02-27 Thread Melih Mutlu
Hi,

Thanks for all of your reviews!

So, I made some changes in v10 according to your comments.

1- copy_format is changed to a boolean parameter, copy_binary.
It sounds easier to use a boolean to enable/disable binary copy. The default
value is false, so nothing changes in the current behaviour if copy_binary
is not specified.
It's still persisted into the catalog. This is needed since its value will
be needed by tablesync workers later, and tablesync workers fetch
subscription configurations from the catalog. In the copy_data case, it is
not directly stored anywhere, but it affects the state of the table, which
is stored in the catalog. So, I guess this is the convenient way if we
decide to go with a new parameter.

2- copy_binary is not tied to another parameter.
The patch does not disallow any combination of copy_binary with the binary
and copy_data options. I tried to explain what binary copy can and cannot do
in the docs. The rest is up to the user now.

3- Removed the version check for copy_binary.
HEAD already does not have any check for the binary option. Making binary
copy work only if publisher and subscriber are on the same major version
would be too restrictive.

4- Added a separate test file.
Although I believe 002_types.pl and 014_binary.pl can be improved further
for binary-enabled subscription cases, this patch might not be the best
place to make those changes.
032_binary_copy.pl now has the tests for the binary copy option. There are
also some regression tests in subscription.sql.

Finally, some other small fixes were done according to the reviews.

Also, thanks Bharath for the performance testing [1]. It can be seen that
the patch has some benefits.

I appreciate further thoughts/reviews.


[1]
https://www.postgresql.org/message-id/CALj2ACUfE08ZNjKK-nK9JiwGhwUMRLM%2BqRhNKTVM9HipFk7Fow%40mail.gmail.com


Thanks,
-- 
Melih Mutlu
Microsoft


v10-0001-Allow-logical-replication-to-copy-table-in-binar.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-02-28 Thread Melih Mutlu
Hi,

Attached v11.

vignesh C wrote on Tue, 28 Feb 2023 at 13:59:

> Thanks for the patch, Few comments:
> 1) Are primary key required for the tables, if not required we could
> remove the primary key which will speed up the test by not creating
> the indexes and inserting in the indexes. Even if required just create
> for one of the tables:
>

I think having a primary key in the tables for logical replication tests is
good for checking whether logical replication duplicates any rows. Other
tests also have primary keys in almost all tables.

Bharath Rupireddy wrote on Tue, 28 Feb 2023 at 15:27:

> 1.
> +  <para>
> +   If true, initial data synchronization will be performed in binary format
> +  </para>
> It's not just the initial table sync right? The table sync can happen
> at any other point of time when ALTER SUBSCRIPTION ... REFRESH
> PUBLICATION WITH (copy = true) is run.
> How about - "If true, the subscriber requests publication for
> pre-existing data in binary format"?
>

I changed it as you suggested.
I sometimes feel like the phrase "initial sync" is used for the initial sync
of a table, not a subscription. Or table syncs triggered by ALTER
SUBSCRIPTION are ignored in some places where "initial sync" is used.

> 2.
> Perhaps, this should cover the recommended cases for enabling this new
> option - something like below (may not need to have exact wording, but
> the recommended cases?):
> "It is recommended to enable this option only when 1) the column data
> types have appropriate binary send/receive functions, 2) not
> replicating between different major versions or different platforms,
> 3) both publisher and subscriber tables have the exact same column
> types (not when replicating from smallint to int or numeric to int8
> and so on), 4) both publisher and subscriber supports COPY with binary
> option, otherwise the table copy can fail."
>

I added a line stating that binary format is less portable across machine
architectures and versions, as stated in COPY [1].
I don't think we should add a line saying "recommended", but rather state
the restrictions clearly instead. This is also similar to the COPY docs.
I think the explanation now covers all your points, right?

Jelte Fennema wrote on Tue, 28 Feb 2023 at 16:25:

> > 3. I think the newly added tests must verify if the binary COPY is
> > picked up when enabled. Perhaps, looking at the publisher's server log
> > for 'COPY ... WITH BINARY format'? Maybe it's an overkill, otherwise,
> > we have no way of testing that the option took effect.
>
> Another way to test that BINARY is enabled could be to trigger one
> of the failure cases.


Yes, there is already a failure case for binary copy which resolves by
switching copy_binary to false.
But I have now also added checks on the publisher logs too.


[1] https://www.postgresql.org/docs/devel/sql-copy.html

Thanks,
-- 
Melih Mutlu
Microsoft


v11-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-03-01 Thread Melih Mutlu
Hi,

Bharath Rupireddy wrote on Wed, 1 Mar 2023 at 15:02:

> On Wed, Mar 1, 2023 at 4:47 PM Dilip Kumar  wrote:
> > I agree with this thought, basically adding an extra option will
> > always complicate things for the user.  And logically it doesn't make
> > much sense to copy data in text mode and then stream in binary mode
> > (except in some exception cases and for that, we can always alter the
> > subscription).  So IMHO it makes more sense that if the binary option
> > is selected then ideally it should choose to do the initial sync also
> > in the binary mode.
>

I agree that copying in text and then streaming in binary does not have a
good use case.

I think I was suggesting earlier to use a separate option for binary
> table sync copy based on my initial knowledge of binary COPY. Now that
> I have a bit more understanding of binary COPY and subscription's
> existing binary option, +1 for using the same option for table sync
> too.
>
> If used the existing subscription binary option for the table sync,
> there can be following possibilities for the users:
> 1. users might want to enable the binary option for table sync and
> disable it for subsequent replication
> 2. users might want to enable the binary option for both table sync
> and for subsequent replication
> 3. users might want to disable the binary option for table sync and
> enable it for subsequent replication
> 4. users might want to disable binary option for both table sync and
> for subsequent replication
>
> Binary copy use-cases are a bit narrower compared to the existing
> subscription binary option, it works only if:
> a) the column data types have appropriate binary send/receive functions
> b) not replicating between different major versions or different platforms
> c) both publisher and subscriber tables have the exact same column
> types (not when replicating from smallint to int or numeric to int8
> and so on)
> d) both publisher and subscriber supports COPY with binary option
>
> Now if one enabled the binary option for table sync, that means, they
> must have ensured all (a), (b), (c), and (d) are met. The point is if
> one decides to use binary copy for table sync, it means that the
> subsequent binary replication works too without any problem. If
> required, one can disable it for normal replication i.e. post-table
> sync.
>

That was my intention with this patch in the beginning. Then the new option
also made some sense at some point, and I added the copy_binary option
according to the reviews.
The earlier versions of the patch didn't have it. Without the new option,
this patch would also be smaller.

But before changing things back to the point where these are all tied to the
binary option without a new option, I think we should decide whether that's
really the ideal way to do it.
I believe that the patch is all good now with the copy_binary option, which
is not tied to anything, explanations in the docs, separate tests, etc.
But I also agree that binary=true should make everything binary and
binary=false should do everything in text format. It makes more sense.

Best,
-- 
Melih Mutlu
Microsoft


Re: Allow logical replication to copy tables in binary format

2023-03-01 Thread Melih Mutlu
Hi,

Hayato Kuroda (Fujitsu) wrote on Wed, 1 Mar 2023 at 18:40:

> Dear Melih,
>
> If we do not have to treat the case Shi pointed out[1] as code-level, I
> agreed to
> same option binary because it is simpler.


How is this an issue if we let the binary option do binary copy, but not an
issue if we have a separate copy_binary option?
You can easily get similar errors when you set copy_binary=true if a type is
missing binary send/receive functions.
And also, as Amit mentioned, the same issue can easily be avoided by keeping
binary=false until the initial sync is done. It can be set to true later.


> I read the use-cases addressed by Bharath[2]
> and I cannot find an advantage for cases 1 and 3 except the case where
> binary functions are not implemented.
>

Note that case 3 is already how it works on HEAD. Its advantage, as you
already mentioned, is when some types are missing the binary functions.
I think that's why case 3 should be allowed whether or not a new option is
added.

Previously I said that the combination of "copy_format = binary" and
> "copy_data = false"
> seemed strange[3], but this approach could solve it and other related ones
> automatically.
>

I think it is quite similar to the case where binary=true and
enabled=false. In that case, the format is set but the subscription does not
replicate anything. And this is allowed.
The copy_binary=true and copy_data=false combination also sets the copy
format but does not copy anything. Even if no table will be copied at that
moment, tables which might be added later may need to be copied (by ALTER
SUBSCRIPTION). And setting the copy format beforehand can be useful in such
cases.


> I think we should add a description to the doc that the initial copy is
> more likely to fail, and that the user should enable binary format after
> synchronization if tables have original datatypes.
>

I tried to explain when binary copy can cause failures in the doc. What
exactly do you think is missing?

Best,
-- 
Melih Mutlu
Microsoft


Re: Allow logical replication to copy tables in binary format

2023-03-08 Thread Melih Mutlu
Hi,

On Tue, 7 Mar 2023 at 04:10, Amit Kapila wrote:

> As per what I could read in this thread, most people prefer to use the
> existing binary option rather than inventing a new way (option) to
> binary copy in the initial sync phase. Do you agree?


I agree.
What do you think about the version checks? I removed any kind of check,
since it's currently a different option. Should we check the publisher
version before doing binary copy, to ensure that the publisher node supports
the binary option of the COPY command?

Thanks,
-- 
Melih Mutlu
Microsoft


Re: Allow logical replication to copy tables in binary format

2023-03-13 Thread Melih Mutlu
Hi,

Attached v12 with a unified option.

Setting binary = true now allows the initial sync to happen in binary
format.

Thanks,
-- 
Melih Mutlu
Microsoft


v12-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-03-14 Thread Melih Mutlu
Hi,

Attached v13.

Peter Smith wrote on Tue, 14 Mar 2023 at 03:07:

> Here are some review comments for patch v12-0001
>

Thanks for reviewing. I tried to make the explanations in the docs better
according to your comments.
What do you think?

Amit Kapila wrote on Tue, 14 Mar 2023 at 06:17:

> I think it would better to write the tests for this feature in the
> existing test file 014_binary as that would save some time for node
> setup/shutdown and also that would be a more appropriate place for
> these tests.


I removed 032_binary_copy.pl and added those tests into 014_binary.pl.
Also added the case with different column order as Peter suggested.

Best,
-- 
Melih Mutlu
Microsoft


v13-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-03-15 Thread Melih Mutlu
Hi,

Please see the attached patch.

Takamichi Osumi (Fujitsu) wrote on Tue, 14 Mar 2023 at 18:20:

> (1) create_subscription.sgml
>
> +  column types as opposed to text format. Even when this option
> is enabled,
> +  only data types having binary send and receive functions will be
> +  transferred in binary. Note that the initial synchronization
> requires
>
> (1-1)
>
> I think it's helpful to add a reference for the description about send and
> receive functions (e.g. to the page of CREATE TYPE).
>

Done.


>
> (1-2)
>
> Also, it would be better to have a cross reference from there to this doc
> as one paragraph probably in "Notes". I suggested this, because send and
> receive functions are described as "optional" there and missing them leads
> to error in the context of binary table synchronization.
>

I'm not sure whether this is necessary. In case of missing send/receive
functions, the error logs are already clear about what's wrong, and the
logical replication docs also explain what could go wrong with binary.


> (3) copy_table()
>
> +   /*
> +* If the publisher version is earlier than v14, it COPY command
> doesn't
> +* support the binary option.
> +*/
>
> This sentence doesn't look correct grammatically. We can replace "it COPY
> command" with "subscription" for example. Kindly please fix it.
>

Changed this per Amit's suggestion [1].


[1]
https://www.postgresql.org/message-id/CAA4eK1%2BC7ykvdBxh_t1BdbX5Da1bM1BgsE%3D-i2koPkd3pSid0A%40mail.gmail.com

Thanks,
-- 
Melih Mutlu
Microsoft


v14-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-03-15 Thread Melih Mutlu
Hi,

Please see v14 [1].

Peter Smith wrote on Wed, 15 Mar 2023 at 09:22:

> Here are some review comments for v13-0001
>
> ==
> doc/src/sgml/logical-replication.sgml
>
> 1.
> That's why in the previous v12 review [1] (comment #3) I suggested
> that this page should just say something quite generic like "However,
> replication in binary format is more restrictive", and link back to
> the other page which has all the gory details.
>

You're right. Changed it to what you suggested.


> 2.
> My previous v12 review [1] (comment #6) suggested maybe updating this
> page. But it was not done in v13. Did you accidentally miss the review
> comment, or chose not to do it?
>

Sorry, I missed this. Added a line linking to the CREATE SUBSCRIPTION doc.


> ==
> doc/src/sgml/ref/create_subscription.sgml
>
> 3.
> BEFORE
> Binary format is also very data type specific, it will not allow
> copying between different column types as opposed to text format.
>
> SUGGESTION (worded more similar to what is already on the COPY page [2])
> Binary format is very data type specific; for example, it will not
> allow copying from a smallint column to an integer column, even though
> that would work fine in text format.
>

Done.


> 4.
> SUGGESTION
> If the publisher is a PostgreSQL version
> before 14, then any initial table synchronization will use text format
> even if binary = true.
>

Done.


> SUGGESTION
> If the publisher version is earlier than v14, we use text format COPY.
> Note - In fact COPY syntax "WITH (FORMAT binary)" has existed since
> v9, but since the logical replication binary mode transfer was not
> introduced until v14 it was decided to check using the later version.
>

Changed it as suggested here [2].

[1]
https://www.postgresql.org/message-id/CAGPVpCTaXYctCUp3z%3D_BstonHiZcC5Jj7584i7B8jeZQq4RJkw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/CAA4eK1%2BC7ykvdBxh_t1BdbX5Da1bM1BgsE%3D-i2koPkd3pSid0A%40mail.gmail.com


Thanks,
-- 
Melih Mutlu
Microsoft


Re: Allow logical replication to copy tables in binary format

2023-03-15 Thread Melih Mutlu
Amit Kapila wrote on Wed, 15 Mar 2023 at 12:31:

> On Tue, Mar 14, 2023 at 4:32 PM Melih Mutlu 
> wrote:
>


> What purpose does this test serve w.r.t this patch? Before checking
> the sync for different column orders, the patch has already changed
> binary to false, so it doesn't seem to test the functionality of this
> patch. Am, I missing something?
>

I missed that binary had been changed to false before testing column orders.
I moved that test case up, before binary is changed to false.
Please see v14 [1].

[1]
https://www.postgresql.org/message-id/CAGPVpCTaXYctCUp3z%3D_BstonHiZcC5Jj7584i7B8jeZQq4RJkw%40mail.gmail.com

Thanks,
-- 
Melih Mutlu
Microsoft


Re: Allow logical replication to copy tables in binary format

2023-03-15 Thread Melih Mutlu
Hi,

vignesh C wrote on Wed, 15 Mar 2023 at 18:12:

> One comment:
> 1) There might be a chance the result order of select may vary as
> "ORDER BY" is not specified,  Should we add "ORDER BY" as the table
> has multiple rows:
> +# Check the synced data on the subscriber
> +$result = $node_subscriber->safe_psql('postgres', 'SELECT a,b FROM
> +public.test_col_order;');
> +
> +is( $result, '1|2
> +3|4', 'check synced data on subscriber for different column order');
>

Right, it needs to be ordered. Fixed.

Thanks,
-- 
Melih Mutlu
Microsoft


v15-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-03-16 Thread Melih Mutlu
Hi,

Please see the attached v16.

Peter Smith wrote on Thu, 16 Mar 2023 at 03:03:

> Here are some review comments for v15-0001
>

I applied your comments in the updated patch.

shiy.f...@fujitsu.com wrote on Thu, 16 Mar 2023 at 05:35:

> On Thu, Mar 16, 2023 2:26 AM Melih Mutlu  wrote:
> >
> > Right, it needs to be ordered. Fixed.
> >
>
> Hi,
>
> Thanks for updating the patch. I tested some cases like toast data,
> combination
> of row filter and column lists, and it works well.
>

Thanks for testing. I changed the wait_for_log lines as you suggested.

houzj.f...@fujitsu.com wrote on Thu, 16 Mar 2023 at 05:55:

> 1.
> +   {
> +   appendStringInfo(&cmd, " WITH (FORMAT binary)");
>
> We could use appendStringInfoString here.
>

Done.


> 2.
> I think it would be better to pass the log offset when using wait_for_log,
> because otherwise it will check the whole log file to find the target
> message. This might not be a big problem, but it has a risk of matching an
> unexpected log message which was generated by previous commands.
>

You're right. I added offsets for the wait_for_log calls.

Thanks,
-- 
Melih Mutlu
Microsoft


v16-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-03-16 Thread Melih Mutlu
Amit Kapila wrote on Thu, 16 Mar 2023 at 06:25:

> On Thu, Mar 16, 2023 at 8:27 AM Euler Taveira  wrote:
> >
> > On Wed, Mar 8, 2023, at 11:50 PM, Amit Kapila wrote:
> >
> > It is not clear to me which version check you wanted to add because we
> > seem to have a binary option in COPY from the time prior to logical
> > replication. I feel we need a publisher version 14 check as that is
> > where we start to support binary mode transfer in logical replication.
> > See the check in function libpqrcv_startstreaming().
> >
> > ... then you are breaking existent cases. Even if you have a convincing
> > argument, you are introducing a behavior change in prior versions (commit
> > messages should always indicate that you are breaking backward
> compatibility).
> >
> > +
> > +   /*
> > +    * The binary option for replication is supported since v14
> > +    */
> > +   if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 14 &&
> > +       MySubscription->binary)
> > +   {
> > +       appendStringInfo(&cmd, " WITH (FORMAT binary)");
> > +       options = lappend(options, makeDefElem("format", (Node *) makeString("binary"), -1));
> > +   }
> > +
> >
> > What are the arguments to support since v14 instead of the to-be-released
> > version? I read the thread but it is not clear. It was said about the
> > restrictive nature of this feature and it will be frustrating to see
> that the
> > same setup (with the same commands) works with v14 and v15 but it
> doesn't with
> > v16.
> >
>
> If the failure has to happen it will anyway happen later when the
> publisher will be upgraded to v16. The reason for the version checks
> as v14 was to allow the initial sync from the same version where the
> binary mode for replication was introduced. However, if we expect
> failures in the existing setup, I am fine with supporting this for >=
> v16.
>

Upgrading the subscriber to v16 while keeping the publisher on v14 could
break existing subscriptions. I don't know how likely such a case is.

I don't have a strong preference on this. What do you think? Should we
change it to >=v16 or keep it as it is?

Best,
-- 
Melih Mutlu
Microsoft


Re: Allow logical replication to copy tables in binary format

2023-03-17 Thread Melih Mutlu
Hi,

Sharing v17.

On Fri, 17 Mar 2023 at 03:02, Amit Kapila  wrote:

> I think to reduce the risk of breakage, let's change the check to
> >=v16. Also, accordingly, update the doc and commit message.
>

Done.

On Fri, 17 Mar 2023 at 04:58, Peter Smith  wrote:

> IMO the sentence "However, logical replication in binary format is
> more restrictive." should just be plain text.
>

Done.

On Fri, 17 Mar 2023 at 05:26, shiy.f...@fujitsu.com  wrote:

> It looks that you forgot to pass `offset` into wait_for_log().


Yes, I somehow didn't include those lines in the patch. Thanks for
noticing. Fixed them now.


Thanks,
-- 
Melih Mutlu
Microsoft


v17-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-03-20 Thread Melih Mutlu
Hi,

Please see the attached patch.

On Sat, 18 Mar 2023 at 12:03, vignesh C  wrote:

> On Fri, 17 Mar 2023 at 17:55, Melih Mutlu  wrote:
> 1) Currently we refer the link to the beginning of create subscription
> page, this can be changed to refer to binary option contents in create
> subscription:
>

Done.


> 2) Running pgperltidy shows the test script 014_binary.pl could be
> slightly improved as in the attachment.
>

I couldn't apply the patch you attached, but I ran pgperltidy myself and I
guess the result should be similar.


On Sat, 18 Mar 2023 at 12:41, Peter Smith  wrote:

> Commit message
>
> 1.
> Binary copy is supported for v16 or later.
>

Done as you suggested.

On Sat, 18 Mar 2023 at 13:11, Amit Kapila  wrote:

> On Sat, Mar 18, 2023 at 3:11 PM Peter Smith  wrote:
> >
> > SUGGESTION
> > If the publisher is v16 or later, then any initial table
> > synchronization will use the same format as specified by the
> > subscription binary mode. If the publisher is before v16, then any
> > initial table synchronization will use text format regardless of the
> > subscription binary mode.
> >
>
> I agree that the previous comment should be updated but I would prefer
> something along the lines: "Prior to v16, initial table
> synchronization will use text format even if the binary option is
> enabled for a subscription."
>

Done.

On Mon, 20 Mar 2023 at 05:13, Amit Kapila  wrote:

> On Mon, Mar 20, 2023 at 3:37 AM Peter Smith  wrote:
> >
> > There are a couple of TAP tests where the copy binary is expected to
> > fail. And when it fails, you do binary=false (changing the format back
> > to 'text') so the test is then expected to be able to proceed.
> >
> > I don't know if this happens in practice, but IIUC in theory, if the
> > timing is extremely bad, the tablesync could relaunch in binary mode
> multiple times (and fail multiple times?) before your binary=false
> > change takes effect.
> >
> > So, I was wondering if it could help to use the subscription
> > 'disable_on_error=true' parameter for those cases so that the
> > tablesync won't needlessly attempt to relaunch until you have set
> > binary=false and then re-enabled the subscription.
> >
>
> +1. That would make tests more reliable.
>

Done.

On Mon, 20 Mar 2023 at 07:13, Hayato Kuroda (Fujitsu)  wrote:

> Dear Melih,
>
> Thank you for updating the patch.
> I checked your added description about initial data sync and I think it's
> OK.
>
> Few minor comments:
>

Fixed both comments.

Thanks,
-- 
Melih Mutlu
Microsoft


v18-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-03-21 Thread Melih Mutlu
On Tue, 21 Mar 2023 at 09:03, Amit Kapila  wrote:

> On Tue, Mar 21, 2023 at 7:03 AM Peter Smith  wrote:
> >
> >
> > ==
> > src/test/subscription/t/014_binary.pl
> >
> > 4.
> > # ------------------------------------------------------
> > # Test mismatched column types with/without binary mode
> > # ------------------------------------------------------
> >
> > # Test syncing tables with mismatching column types
> > $node_publisher->safe_psql(
> > 'postgres', qq(
> > CREATE TABLE public.test_mismatching_types (
> > a bigint PRIMARY KEY
> > );
> > INSERT INTO public.test_mismatching_types (a)
> > VALUES (1), (2);
> > ));
> >
> > # Check the subscriber log from now on.
> > $offset = -s $node_subscriber->logfile;
> >
> > # Ensure the subscription is enabled. disable_on_error is still true,
> > # so the subscription can be disabled due to a missing relation until
> > # the test_mismatching_types table is created.
> > $node_subscriber->safe_psql(
> > 'postgres', qq(
> > CREATE TABLE public.test_mismatching_types (
> > a int PRIMARY KEY
> > );
> > ALTER SUBSCRIPTION tsub ENABLE;
> > ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> > ));
> >
> > ~~
> >
> > I found the "Ensure the subscription is enabled..." comment and the
> > necessity for enabling the subscription to be confusing.
> >
> > Can't some complications all be eliminated just by creating the table
> > on the subscriber side first?
> >
>
> Hmm, that would make this test inconsistent with other tests and
> probably difficult to understand and extend. I don't like to say this
> but I think introducing disable_on_error has introduced more
> complexities in the patch due to the requirement of enabling
> subscription again and again. I feel it would be better without using
> disable_on_error option in these tests.
>

While this change would make the test inconsistent, I think it would also
make it more confusing.
Explaining the issue explicitly with a comment seems better to me than the
trick of changing the order of table creation just for some test cases.
But I'm also OK with removing the use of disable_on_error if that's what
you agree on.

Thanks,
-- 
Melih Mutlu
Microsoft


Re: Allow logical replication to copy tables in binary format

2023-03-21 Thread Melih Mutlu
Hi,

On Tue, 21 Mar 2023 at 04:33, Peter Smith  wrote:

> Here are my review comments for v18-0001
>
> ==
> doc/src/sgml/logical-replication.sgml
>
> 1.
> +   target table. However, logical replication in binary format is more
> +   restrictive. See the binary option of
> +   CREATE
> SUBSCRIPTION
> +   for details.
>
>
> Because you've changed the linkend to be the binary option, IMO now
> the  part also needs to be modified. Otherwise, this page has
> multiple "CREATE SUBSCRIPTION" links which jump to different places,
> which just seems wrong to me.
>

Makes sense. I changed it as you suggested.


> 3.
> I think there can only be 0 or 1 list element in 'options'.
>
> So, why does the code here use lappend(options,...) instead of just
> using list_make1(...)?
>

Changed it to list_make1.

On Tue, 21 Mar 2023 at 12:27, Amit Kapila  wrote:

> > Explaining the issue explicitly with a comment seems better to me than
> the trick of changing order of table creation just for some test cases.
> > But I'm also ok with removing the use of disable_on_error if that's what
> you agree on.
> >
>
> Let's do that way for now.
>

Done.

Thanks,
-- 
Melih Mutlu
Microsoft
From 8d94a5f18a433efaeae907b5dc0225272b7442aa Mon Sep 17 00:00:00 2001
From: Melih Mutlu 
Date: Mon, 8 Aug 2022 14:14:44 +0300
Subject: [PATCH v19] Allow logical replication to copy table in binary

If binary option is enabled in a subscription, then copy tables in binary format
during table synchronization.

Without this patch, tables are copied in text format even if the
subscription is created with binary option enabled. This patch allows logical replication
to perform in binary format starting from initial sync. Copying
tables in binary format may reduce the time spent depending on column
types.

Binary copy for initial table synchronization is supported
only when both publisher and subscriber are v16 or later.

Discussion: https://postgr.es/m/CAGPVpCQvAziCLknEnygY0v1-KBtg%2BOm-9JHJYZOnNPKFJPompw%40mail.gmail.com
---
 doc/src/sgml/logical-replication.sgml   |   4 +-
 doc/src/sgml/ref/alter_subscription.sgml|   5 +
 doc/src/sgml/ref/create_subscription.sgml   |  28 ++-
 src/backend/replication/logical/tablesync.c |  17 +-
 src/test/subscription/t/014_binary.pl   | 180 ++--
 5 files changed, 216 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/logical-replication.sgml b/doc/src/sgml/logical-replication.sgml
index 6b0e300adc..3836d13ad3 100644
--- a/doc/src/sgml/logical-replication.sgml
+++ b/doc/src/sgml/logical-replication.sgml
@@ -251,7 +251,9 @@
column of type bigint.  The target table can also have
additional columns not provided by the published table.  Any such columns
will be filled with the default value as specified in the definition of the
-   target table.
+   target table. However, logical replication in binary format is more
+   restrictive. See the binary
+   option of CREATE SUBSCRIPTION for details.
   
 
   
diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml
index 964fcbb8ff..e92346edef 100644
--- a/doc/src/sgml/ref/alter_subscription.sgml
+++ b/doc/src/sgml/ref/alter_subscription.sgml
@@ -177,6 +177,11 @@ ALTER SUBSCRIPTION name RENAME TO <
   how copy_data = true can interact with the
   origin parameter.
  
+ 
+  See the binary
+  option of CREATE SUBSCRIPTION for details
+  about copying pre-existing data in binary format.
+ 
 

   
diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml
index 51c45f17c7..9d4b9d4e33 100644
--- a/doc/src/sgml/ref/create_subscription.sgml
+++ b/doc/src/sgml/ref/create_subscription.sgml
@@ -185,15 +185,25 @@ CREATE SUBSCRIPTION subscription_name
 
-   
+   
 binary (boolean)
 
  
-  Specifies whether the subscription will request the publisher to
-  send the data in binary format (as opposed to text).
-  The default is false.
-  Even when this option is enabled, only data types having
-  binary send and receive functions will be transferred in binary.
+  Specifies whether the subscription will request the publisher to send
+  the data in binary format (as opposed to text). The default is
+  false. Any initial table synchronization copy
+  (see copy_data) also uses the same format. Binary
+  format can be faster than the text format, but it is less portable
+  across machine architectures and PostgreSQL
+  versions. Binary format is very data type specific; for example, it
+  will not allow copying from a smallint column to an
+  integer column, even though that woul

Re: Allow logical replication to copy tables in binary format

2023-03-23 Thread Melih Mutlu
Hi,

On Thu, 23 Mar 2023 at 08:48, Amit Kapila  wrote:

> Pushed the 0001. It may be better to start a separate thread for 0002.
>

Great! Thanks.

Best,
-- 
Melih Mutlu
Microsoft


Re: Flushing large data immediately in pqcomm

2024-03-14 Thread Melih Mutlu
Hi hackers,

I did some experiments with this patch, after the previous discussions.
This probably does not answer all questions, but I would be happy to do
more if needed.

First, I updated the patch according to what was suggested here [1]. PSA v2.
I tweaked the master branch a bit to not allow any buffering, and I compared
HEAD, this patch, and no buffering at all.
I also added a simple GUC to control PqSendBufferSize; this change only
allows modifying the buffer size and should not have any impact on
performance.

I again ran the COPY TO STDOUT command and timed it. AFAIU COPY sends data
row by row, and I tried running the command under different scenarios with
different # of rows and row sizes. You can find the test script attached
(see test.sh).
All timings are in ms.

1- row size = 100 bytes, # of rows = 100
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 1036       │ 998  │ 940  │ 910  │ 894  │ 874  │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 1107       │ 1032 │ 980  │ 957  │ 917  │ 909  │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 6230       │ 6125 │ 6282 │ 6279 │ 6255 │ 6221 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

2-  row size = half of the rows are 1KB and rest is 10KB, # of rows = 100
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │ 2KB   │ 4KB   │ 8KB   │ 16KB  │ 32KB  │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │ 25197      │ 23414 │ 20612 │ 19206 │ 18334 │ 18033 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch     │ 19843      │ 19889 │ 19798 │ 19129 │ 18578 │ 18260 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 23752      │ 23565 │ 23602 │ 23622 │ 23541 │ 23599 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

3-  row size = half of the rows are 1KB and rest is 1MB, # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 3137       │ 2937 │ 2687 │ 2551 │ 2456 │ 2465 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 2399       │ 2390 │ 2402 │ 2415 │ 2417 │ 2422 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 2417       │ 2414 │ 2429 │ 2418 │ 2435 │ 2404 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

4-  row size = all rows are 1MB, # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 6113       │ 5764 │ 5281 │ 5009 │ 4885 │ 4872 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 4759       │ 4754 │ 4754 │ 4758 │ 4782 │ 4805 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 4756       │ 4774 │ 4793 │ 4766 │ 4770 │ 4774 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

Some quick observations:
1- Even though I expected the patch and HEAD to behave similarly in the case
of small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.
2- In cases where the data does not fit into the buffer, the patch starts
performing better than HEAD. For example, in case 2, the patch seems faster
until the buffer size exceeds the data length. When the buffer size is set
to something larger than 10KB (16KB/32KB in this case), there is again a
slight performance loss with the patch, as in case 1.
3- With large row sizes (i.e. sizes that do not fit into the buffer), not
buffering at all starts performing better than HEAD. Similarly, the patch
performs better too, as it stops buffering if the data does not fit into
the buffer.



[1]
https://www.postgresql.org/message-id/CAGECzQTYUhnC1bO%3DzNiSpUgCs%3DhCYxVHvLD2doXNx3My6ZAC2w%40mail.gmail.com


Thanks,
-- 
Melih Mutlu
Microsoft
From 7f1b72a0948156f8e35ce3b07b5e763a5578d641 Mon Sep 17 00:00:00 2001
From: Melih Mutlu 
Date: Mon, 26 Feb 2024 14:41:35 +0300
Subject: [PATCH] Added pq_send_buffer_size GUC

---
 src/backend/libpq/pqcomm.c  |  2 +-
 src/backend/utils/misc/guc_tables.c | 11 +++
 src/include/libpq/libpq.h   |  1 +
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index c606bf3447..92708e46e6 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -116,7 +116,7 @@ static List *sock_paths = NIL;
  * enlarged by pq_putmessage_noblock() if the message doesn't fit otherwise.
  */
 
-#define PQ_SEND_BUFFER_SIZE 8192
+int PQ_SEND_BUFFER_SIZE = 8192;
 #define PQ_RECV_BUFFER_SIZE 8192
 
 s

Re: Flushing large data immediately in pqcomm

2024-03-20 Thread Melih Mutlu
On Thu, 21 Mar 2024 at 00:54, David Rowley  wrote:

> On Fri, 15 Mar 2024 at 02:03, Jelte Fennema-Nio 
> wrote:
> >
> > On Thu, 14 Mar 2024 at 13:12, Robert Haas  wrote:
> > >
> > > On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu 
> wrote:
> > > > 1- Even though I expect both the patch and HEAD behave similarly in
> case of small data (case 1: 100 bytes), the patch runs slightly slower than
> HEAD.
> > >
> > > I wonder why this happens. It seems like maybe something that could be
> fixed.
> >
> > some wild guesses:
> > 1. maybe it's the extra call overhead of the new internal_flush
> > implementation. What happens if you make that an inline function?
> > 2. maybe swap these conditions around (the call seems heavier than a
> > simple comparison): !pq_is_send_pending() && len >= PqSendBufferSize
>
> I agree these are both worth trying.  For #2, I wonder if the
> pq_is_send_pending() call is even worth checking at all. It seems to
> me that the internal_flush_buffer() code will just do nothing if
> nothing is pending.  Also, isn't there almost always going to be
> something pending when the "len >= PqSendBufferSize" condition is met?
>  We've just added the msgtype and number of bytes to the buffer which
> is 5 bytes. If the previous message was also more than
> PqSendBufferSize, then the buffer is likely to have 5 bytes due to the
> previous flush, otherwise isn't it a 1 in 8192 chance that the buffer
> is empty?
>
> If that fails to resolve the regression, maybe it's worth memcpy()ing
> enough bytes out of the message to fill the buffer then flush it and
> check if we still have > PqSendBufferSize remaining and skip the
> memcpy() for the rest.  That way there are no small flushes of just 5
> bytes and only ever the possibility of reducing the flushes as no
> pattern should cause the number of flushes to increase.
>

In the len > PqSendBufferSize case, the buffer should be filled as much as
possible if we're sure that it will be flushed at some point. Otherwise we
might end up with small flushes. The case where we're sure that the buffer
will be flushed is when the buffer is not empty. If it's empty, there is no
need to fill it unnecessarily, as that might cause an additional flush.
AFAIU from what you said, we shouldn't be worried about such a case since
it's unlikely to have the buffer empty due to the first 5 bytes. I guess
the only case where the buffer can be empty is when the buffer has
PqSendBufferSize-5 bytes from previous messages and adding 5 bytes of the
current message will flush the buffer. I'm not sure if removing the check
may cause any regression in any case, but it's just there to be safe.

What if I do a simple comparison like PqSendStart == PqSendPointer instead
of calling pq_is_send_pending(), as Heikki suggested? Then this check
should not hurt that much, right? Does that make sense?
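
To be concrete, here is a minimal standalone sketch of the check I have in
mind. The names mirror the pqcomm globals, but this is only an
illustration, not the actual patch:

#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the pqcomm globals. */
static size_t PqSendStart;              /* offset of first unsent byte */
static size_t PqSendPointer;            /* offset of first free byte */
static size_t PqSendBufferSize = 8192;

/*
 * Cheap emptiness test: just a comparison of two globals, no function
 * call, so it can stay inline in the hot path.
 */
static inline bool
send_buffer_is_empty(void)
{
    return PqSendStart == PqSendPointer;
}

/* Bypass the copy into the buffer only for large writes on an empty buffer. */
static inline bool
should_send_directly(size_t len)
{
    return send_buffer_is_empty() && len >= PqSendBufferSize;
}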

-- 
Melih Mutlu
Microsoft


Re: Flushing large data immediately in pqcomm

2024-03-21 Thread Melih Mutlu
On Thu, 14 Mar 2024 at 15:46, Heikki Linnakangas  wrote:

> On 14/03/2024 13:22, Melih Mutlu wrote:
> > @@ -1282,14 +1283,32 @@ internal_putbytes(const char *s, size_t len)
> >   if (internal_flush())
> >   return EOF;
> >   }
> > - amount = PqSendBufferSize - PqSendPointer;
> > - if (amount > len)
> > - amount = len;
> > - memcpy(PqSendBuffer + PqSendPointer, s, amount);
> > - PqSendPointer += amount;
> > - s += amount;
> > - len -= amount;
> > +
> > + /*
> > +  * If the buffer is empty and data length is larger than
> the buffer
> > +  * size, send it without buffering. Otherwise, put as much
> data as
> > +  * possible into the buffer.
> > +  */
> > + if (!pq_is_send_pending() && len >= PqSendBufferSize)
> > + {
> > + int start = 0;
> > +
> > + socket_set_nonblocking(false);
> > + if (internal_flush_buffer(s, &start, (int *)&len))
> > + return EOF;
> > + }
> > + else
> > + {
> > + amount = PqSendBufferSize - PqSendPointer;
> > + if (amount > len)
> > + amount = len;
> > + memcpy(PqSendBuffer + PqSendPointer, s, amount);
> > + PqSendPointer += amount;
> > + s += amount;
> > + len -= amount;
> > + }
> >   }
> > +
> >   return 0;
> >  }
>
> Two small bugs:
>
> - the "(int *) &len)" cast is not ok, and will break visibly on
> big-endian systems where sizeof(int) != sizeof(size_t).
>
> - If internal_flush_buffer() cannot write all the data in one call, it
> updates 'start' for how much it wrote, and leaves 'end' unchanged. You
> throw the updated 'start' value away, and will send the same data again
> on next iteration.
>

There are two possible outcomes for internal_flush_buffer() in the
internal_putbytes() case:
1- It writes all the data and returns 0. We don't need the start or end of
the data in this case.
2- It cannot write all the data and returns EOF. In this case
internal_putbytes() also returns EOF immediately and does not retry. There
will be no next iteration.

If it were non-blocking, then we might need to keep the new value. But I
think we do not need the updated start value in either case here. What do
you think?

Thanks,
-- 
Melih Mutlu
Microsoft


Re: Flushing large data immediately in pqcomm

2024-03-21 Thread Melih Mutlu
Hi,

PSA v3.

On Thu, 21 Mar 2024 at 12:58, Jelte Fennema-Nio  wrote:

> On Thu, 21 Mar 2024 at 01:24, Melih Mutlu  wrote:
> > What if I do a simple comparison like PqSendStart == PqSendPointer
> instead of calling pq_is_send_pending()
>
> Yeah, that sounds worth trying out. So the new suggestions to fix the
> perf issues on small message sizes would be:
>
> 1. add "inline" to internal_flush function
> 2. replace pq_is_send_pending() with PqSendStart == PqSendPointer
> 3. (optional) swap the order of PqSendStart == PqSendPointer and len
> >= PqSendBufferSize
>

I did all of the above changes, and it seems like those resolved the
regression issue.
Since the previous results were with unix sockets, I'm sharing here the
results of v3 using unix sockets for comparison.
I'm sharing only the case where all messages are 100 bytes, since this was
when the regression was most visible.

row size = 100 bytes, # of rows = 100
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 1106       │ 1006 │ 947  │ 920  │ 899  │ 888  │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 1094       │ 997  │ 943  │ 913  │ 894  │ 881  │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 6389       │ 6195 │ 6214 │ 6271 │ 6325 │ 6211 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

On Thu, 21 Mar 2024 at 00:57, David Rowley  wrote:

> On Fri, 15 Mar 2024 at 01:46, Heikki Linnakangas  wrote:
> > - the "(int *) &len)" cast is not ok, and will break visibly on
> > big-endian systems where sizeof(int) != sizeof(size_t).
>
> I think fixing this requires adjusting the signature of
> internal_flush_buffer() to use size_t instead of int.   That also
> means that PqSendStart and PqSendPointer must also become size_t, or
> internal_flush() must add local size_t variables to pass to
> internal_flush_buffer and assign these back again to the global after
> the call.  Upgrading the globals might be the cleaner option.
>
> David


This is done too.

I actually tried to test it over a real network for a while. However, I
couldn't get reliable-enough numbers with both HEAD and the patch due to
network-related issues.
I've decided to go with Jelte's suggestion [1] of decreasing the MTU of the
loopback interface to 1500 and using localhost.

Here are the results:

1- row size = 100 bytes, # of rows = 100
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │ 2KB   │ 4KB   │ 8KB   │ 16KB  │ 32KB  │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │ 1351       │ 1233  │ 1074  │ 988   │ 944   │ 916   │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch     │ 1369       │ 1232  │ 1073  │ 981   │ 928   │ 907   │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 14949      │ 14533 │ 14791 │ 14864 │ 14612 │ 14751 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

2-  row size = half of the rows are 1KB and rest is 10KB, # of rows = 100
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │ 2KB   │ 4KB   │ 8KB   │ 16KB  │ 32KB  │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │ 37212      │ 31372 │ 25520 │ 21980 │ 20311 │ 18864 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch     │ 23006      │ 23127 │ 23147 │ 9     │ 20367 │ 19155 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 30725      │ 31090 │ 30917 │ 30796 │ 30984 │ 30813 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

3-  row size = half of the rows are 1KB and rest is 1MB, # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 4296       │ 3713 │ 3040 │ 2711 │ 2528 │ 2449 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 2401       │ 2411 │ 2404 │ 2374 │ 2395 │ 2408 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 2399       │ 2403 │ 2408 │ 2389 │ 2402 │ 2403 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

4-  row size = all rows are 1MB, # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 8335       │ 7370 │ 6017 │ 5368 │ 5009 │ 4843 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 4711       │ 4722 │ 4708 │ 4693 │ 4724

Re: Flushing large data immediately in pqcomm

2024-03-28 Thread Melih Mutlu
On Wed, Mar 27, 2024 at 14:39 David Rowley  wrote:

> On Fri, 22 Mar 2024 at 12:46, Melih Mutlu  wrote:
> > I did all of the above changes and it seems like those resolved the
> regression issue.
>
> Thanks for adjusting the patch.   The numbers do look better, but on
> looking at your test.sh script from [1], I see:
>
> meson setup --buildtype debug -Dcassert=true
> --prefix="$DESTDIR/usr/local/pgsql" $DESTDIR && \
>
> can you confirm if the test was done in debug with casserts on?   If
> so, it would be much better to have asserts off and have
> -Dbuildtype=release.


Yes, the previous numbers were with --buildtype debug -Dcassert=true. I can
share new numbers with a release build and asserts off soon.

Thanks,
Melih


Re: Flushing large data immediately in pqcomm

2024-03-28 Thread Melih Mutlu
On Wed, Mar 27, 2024 at 18:54 Robert Haas  wrote:

> On Wed, Mar 27, 2024 at 7:39 AM David Rowley  wrote:
> > Robert, I understand you'd like a bit more from this patch. I'm
> > wondering if you planning on blocking another committer from going
> > ahead with this? Or if you have a reason why the current state of the
> > patch is not a meaningful enough improvement that would justify
> > possibly not getting any improvements in this area for PG17?
>
> So, I think that the first version of the patch, when it got a big
> chunk of data, would just flush whatever was already in the buffer and
> then send the rest without copying.


Correct.

> The current version, as I understand it, only does that if the buffer is
> empty; otherwise, it copies as much data as it can into the
> partially-filled buffer.


Yes, currently it should fill and flush the buffer first, if it’s not
already empty. Only then does it send the rest without copying.

Thanks,
Melih


Re: Separate memory contexts for relcache and catcache

2024-04-03 Thread Melih Mutlu
On Sat, 27 Jan 2024 at 06:01, vignesh C  wrote:

> On Wed, 3 Jan 2024 at 16:56, Melih Mutlu  wrote:
> CFBot shows that the patch does not apply anymore as in [1]:
> === Applying patches on top of PostgreSQL commit ID
> 729439607ad210dbb446e31754e8627d7e3f7dda ===
> === applying patch
> ./v2-0001-Separate-memory-contexts-for-relcache-and-catcach.patch
> patching file src/backend/utils/cache/catcache.c
> ...
> Hunk #8 FAILED at 1933.
> Hunk #9 succeeded at 2253 (offset 84 lines).
> 1 out of 9 hunks FAILED -- saving rejects to file
> src/backend/utils/cache/catcache.c.rej
>
> Please post an updated version for the same.
>
> [1] - http://cfbot.cputube.org/patch_46_4554.log
>
> Regards,
> Vignesh
>

Rebased. PSA.


-- 
Melih Mutlu
Microsoft


v3-0001-Separate-memory-contexts-for-relcache-and-catcach.patch
Description: Binary data


Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-04-03 Thread Melih Mutlu
Hi,

On Wed, 14 Feb 2024 at 10:23, Michael Paquier  wrote:

> On Fri, Jan 19, 2024 at 05:41:45PM +0900, torikoshia wrote:
> > We already have additional description below the table which explains
> each
> > column of the system view. For example pg_locks:
> > https://www.postgresql.org/docs/devel/view-pg-locks.html
>
> I was reading the patch, and using int[] as a representation of the
> path of context IDs up to the top-most parent looks a bit strange to
> me, with the relationship between each parent -> child being
> preserved, visibly, based on the order of the elements in this array
> made of temporary IDs compiled on-the-fly during the function
> execution.  Am I the only one finding that a bit strange?  Could it be
> better to use a different data type for this path and perhaps switch
> to the names of the contexts involved?
>

Do you find having the path column strange altogether? Or only using
temporary IDs to generate that column? The reason why I avoid using context
names is that there can be multiple contexts with the same name. This makes
it difficult to figure out which context, among those with that particular
name, is actually included in the path. I couldn't find any other
information that is unique to each context.
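
To illustrate the idea (with simplified types instead of the actual
MemoryContext structures, so this is not the patch code), the path can be
computed by walking parent links up to the top-most context; a standalone
sketch:

#include <stdio.h>

/* Simplified stand-in for a memory context: parent link plus the
 * transient id assigned while scanning the context tree. */
typedef struct Ctx
{
    struct Ctx *parent;
    int         id;
} Ctx;

/*
 * Fill 'out' with the ids from the top-most context down to the parent
 * of 'c' (root first), which is how the path column is ordered.
 * Returns the number of entries written.
 */
static int
build_path(const Ctx *c, int *out, int max)
{
    int         depth = 0;
    const Ctx  *p;

    for (p = c->parent; p != NULL; p = p->parent)
        depth++;
    if (depth > max)
        depth = max;

    int         i = depth;

    for (p = c->parent; p != NULL && i > 0; p = p->parent)
        out[--i] = p->id;
    return depth;
}

int
main(void)
{
    Ctx         top = {NULL, 0};        /* e.g. TopMemoryContext */
    Ctx         portal = {&top, 17};    /* e.g. PortalContext */
    Ctx         exec = {&portal, 18};   /* e.g. ExecutorState */
    int         path[8];
    int         n = build_path(&exec, path, 8);

    for (int i = 0; i < n; i++)
        printf("%d%s", path[i], i + 1 < n ? "," : "\n");    /* prints 0,17 */
    return 0;
}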

Thanks,
-- 
Melih Mutlu
Microsoft


Re: Flushing large data immediately in pqcomm

2024-04-04 Thread Melih Mutlu
Hi,

On Thu, 28 Mar 2024 at 22:44, Melih Mutlu  wrote:
>
> On Wed, Mar 27, 2024 at 14:39 David Rowley  wrote:
>>
>> On Fri, 22 Mar 2024 at 12:46, Melih Mutlu  wrote:
>> can you confirm if the test was done in debug with casserts on?   If
>> so, it would be much better to have asserts off and have
>> -Dbuildtype=release.
>
>
> Yes, previous numbers were with --buildtype debug -Dcassert=true. I can
share new numbers with release build and asserts off soon.

While testing the patch without --buildtype debug -Dcassert=true, I felt
like there was still a slight regression. I changed internal_flush() to an
inline function, and the results look better this way.


1- row size = 100 bytes, # of rows = 100
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │ 2KB   │ 4KB   │ 8KB   │ 16KB  │ 32KB  │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │ 861        │ 765   │ 612   │ 521   │ 477   │ 480   │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch     │ 869        │ 766   │ 612   │ 519   │ 482   │ 467   │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 13978      │ 13746 │ 13909 │ 13956 │ 13920 │ 13895 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

2-  row size = half of the rows are 1KB and rest is 10KB, # of rows = 100
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │ 2KB   │ 4KB   │ 8KB   │ 16KB  │ 32KB  │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │ 30195      │ 26455 │ 17338 │ 14562 │ 12844 │ 11652 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch     │ 14744      │ 15830 │ 15697 │ 14273 │ 12794 │ 11652 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 24054      │ 23992 │ 24162 │ 23951 │ 23901 │ 23925 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

3-  row size = half of the rows are 1KB and rest is 1MB, # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 3546       │ 3029 │ 2373 │ 2032 │ 1873 │ 1806 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 1715       │ 1723 │ 1724 │ 1731 │ 1729 │ 1709 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 1749       │ 1748 │ 1742 │ 1744 │ 1757 │ 1744 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

4-  row size = all rows are 1MB, # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 7089       │ 5987 │ 4697 │ 4048 │ 3737 │ 3523 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 3438       │ 3411 │ 3400 │ 3416 │ 3399 │ 3429 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 3432       │ 3432 │ 3416 │ 3424 │ 3378 │ 3429 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

Thanks,
-- 
Melih Mutlu
Microsoft


v4-0001-Flush-large-data-immediately-in-pqcomm.patch
Description: Binary data


Re: Flushing large data immediately in pqcomm

2024-04-04 Thread Melih Mutlu
On Thu, 4 Apr 2024 at 16:34, Jelte Fennema-Nio  wrote:

> On Thu, 4 Apr 2024 at 13:08, Melih Mutlu  wrote:
> > I changed internal_flush() to an inline function, results look better
> this way.
>
> It seems you also change internal_flush_buffer to be inline (but only
> in the actual function definition, not declaration at the top). I
> don't think inlining internal_flush_buffer should be necessary to
> avoid the perf regressions, i.e. internal_flush is adding extra
> indirection compared to master and is only a single line, so that one
> makes sense to inline.
>

Right. It was a mistake; I forgot to remove that. Fixed it in v5.



> Other than that the code looks good to me.
>
> The new results look great.
>
> One thing that is quite interesting about these results is that
> increasing the buffer size results in even better performance (by
> quite a bit). I don't think we can easily choose a perfect number, as
> it seems to be a trade-off between memory usage and perf. But allowing
> people to configure it through a GUC like in your second patchset
> would be quite useful I think, especially because larger buffers could
> be configured for connections that would benefit most for it (e.g.
> replication connections or big COPYs).
>
> I think your add-pq_send_buffer_size-GUC.patch is essentially what we
> would need there but it would need some extra changes to actually be
> merge-able:
> 1. needs docs
> 2. rename PQ_SEND_BUFFER_SIZE (at least make it not UPPER_CASE, but
> maybe also remove the PQ_ prefix)
> 3. It's marked as PGC_USERSET, but there's no logic to grow/shrink it
> after initial allocation
>

I agree that the GUC patch requires more work to be in good shape. I
created it for testing purposes. But if we decide to make the buffer size
customizable, then I'll start polishing up that patch and address your
suggestions.

One case that could benefit from increased COPY performance is the table
sync phase of logical replication. It might make sense to let users
configure the buffer size to speed up table sync. I'm not sure what kind of
problems this GUC would bring though.

Thanks,
-- 
Melih Mutlu
Microsoft


v5-0001-Flush-large-data-immediately-in-pqcomm.patch
Description: Binary data


Re: Flushing large data immediately in pqcomm

2024-04-07 Thread Melih Mutlu
On Sat, 6 Apr 2024 at 04:34, David Rowley  wrote:

> Does anyone else want to try the attached script on the v5 patch to
> see if their numbers are better?
>

I'm seeing the results below with your script on my machine. I ran it
several times; the results were very similar each time.

master:
Run 100 100 500: 1.627905512
Run 1024 10240 20: 1.603231684
Run 1024 1048576 2000: 2.962812352
Run 1048576 1048576 1000: 2.940766748

v5:
Run 100 100 500: 1.611508155
Run 1024 10240 20: 1.603505596
Run 1024 1048576 2000: 2.727241937
Run 1048576 1048576 1000: 2.721268988


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-12-15 Thread Melih Mutlu
Hi,

Attached are new versions of the patch with some changes/fixes.

Here are also some numbers comparing the performance of logical replication
with this patch against the current master branch.

My benchmarking method is the same as what I did earlier in this thread
(on a different environment, so do not compare the results from this email
with the ones from earlier emails).

> With those changes, I did some benchmarking to see if it improves anything.
> These results compare this patch with the latest version of the master branch.
> "max_sync_workers_per_subscription" is set to 2 as default.
> Got some results simply averaging timings from 5 consecutive runs for each
> branch.


Since this patch is expected to improve logical replication of
empty/close-to-empty tables, I started with measuring performance with
empty tables.

        |  10 tables   |  100 tables    |  1000 tables
--------+--------------+----------------+-----------------
 master |  283.430 ms  |  22739.107 ms  |  105226.177 ms
--------+--------------+----------------+-----------------
 patch  |  189.139 ms  |  1554.802 ms   |  23091.434 ms

> After the changes discussed here [1], concurrent replication origin drops
> by apply worker and tablesync workers may hold each other on wait due to
> locks taken by replorigin_drop_by_name.
> I see that this harms the performance of logical replication quite a bit
> in terms of speed.
> [1]
> https://www.postgresql.org/message-id/flat/20220714115155.GA5439%40depesz.com


Firstly, as I mentioned, replication origin drops made things worse for the
master branch.
Locks become a more serious issue as the number of tables increases.
The patch reuses the origin, so it does not need to drop origins in each
iteration. That's why the difference between master and the patch is
more significant now than it was when I first sent the patch.

To show that the improvement is not only the result of reusing origins, but
also of reusing replication slots and workers, I reverted the commits which
cause the origin drop issue.

          |  10 tables   |  100 tables   |  1000 tables
----------+--------------+---------------+----------------
 reverted |  270.012 ms  |  2483.907 ms  |  31660.758 ms
----------+--------------+---------------+----------------
 patch    |  189.139 ms  |  1554.802 ms  |  23091.434 ms

With this patch, logical replication is still faster, even if we didn't
have an issue with replication origin drops.

Also, here are some numbers with 10 tables loaded with some data:

         |  10 MB        |  100 MB
---------+---------------+----------------
 master  |  2868.524 ms  |  14281.711 ms
---------+---------------+----------------
 patch   |  1750.226 ms  |  14592.800 ms

The difference between master and the patch narrows as the size of the
tables increases, as expected.


I would appreciate any feedback/thoughts on the approach/patch/numbers etc.

Thanks,
-- 
Melih Mutlu
Microsoft


v2-0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch
Description: Binary data


v5-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-12-20 Thread Melih Mutlu
Hi Amit,

On Fri, 16 Dec 2022 at 05:46, Amit Kapila  wrote:

> Right, but when the size is 100MB, it seems to be taking a bit more
> time. Do we want to evaluate with different sizes to see how it looks?
> Other than that all the numbers are good.
>

I did similar testing, this time with 100MB and also 1GB.

         |  100 MB        |  1 GB
---------+----------------+-----------------
 master  |  14761.425 ms  |  160932.982 ms
---------+----------------+-----------------
 patch   |  14398.408 ms  |  160593.078 ms

This time, it seems like the patch is slightly faster than master.
I'm not sure we can say the patch slows things down (or speeds them up) as
the size of the tables increases.
The difference may be something arbitrary or caused by other factors. What
do you think?

I also wondered what happens when "max_sync_workers_per_subscription" is
set to 1, which means tablesync will be done sequentially in both cases,
but the patch will use only one worker and one replication slot during the
whole tablesync process.
Here are the numbers for that case:

         |  100 MB        |  1 GB
---------+----------------+-----------------
 master  |  27751.463 ms  |  312424.999 ms
---------+----------------+-----------------
 patch   |  27342.760 ms  |  310021.767 ms

Best,
-- 
Melih Mutlu
Microsoft


Apply worker fails if a relation is missing on subscriber even if refresh publication has not been refreshed yet

2022-12-22 Thread Melih Mutlu
Hi hackers,

I noticed a behaviour of logical replication that seems unexpected to me,
but I'm not totally sure.

Let's say a new table is created and added to a publication, but not
created on the subscriber yet. Also, "ALTER SUBSCRIPTION ... REFRESH
PUBLICATION" has not been called yet.
What I would expect in that case is that logical replication continues to
work as it was working before the new table was created. The new table does
not get replicated until "REFRESH PUBLICATION", as stated here [1].
This is indeed how it actually seems to work, until we insert a row into
the new table.

After a new row is inserted into the new table, the apply worker gets this
change and tries to apply it. As expected, it fails since the table does
not exist on the subscriber yet. And the worker keeps crashing and can't
apply any changes for any table.
The obvious way to resolve this is to create the table on the subscriber as
well. After that, the apply worker will be back to work, skip changes for
the new table, and move on to other changes.
Since REFRESH PUBLICATION is not called yet, any change for the new table
will not be replicated.

If replication of the new table will not start anyway (until REFRESH
PUBLICATION), do we really need to have that table on the subscriber for
the apply worker to work?
AFAIU any change on the publication does not affect the logical replication
setup until the publication gets refreshed on the subscriber. If this
understanding is correct, then the apply worker should be able to run
without needing the new tables.
What do you think?

Also, if you agree, the attached patch attempts to fix this issue.
It relies on the info from pg_subscription_rel so that the apply worker
only applies changes for relations that exist in pg_subscription_rel.
Since new tables won't be in there until the next REFRESH PUBLICATION,
missing those tables won't be a problem for existing subscriptions.
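
To make the intended behaviour concrete, here is a rough standalone sketch
of the decision the patch makes. rel_known_to_subscription() is a
hypothetical stand-in for the pg_subscription_rel lookup; this is not the
actual patch code:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int Oid;

/*
 * Stand-in for the catalog check: in the patch this consults
 * pg_subscription_rel, where a relation only appears after
 * ALTER SUBSCRIPTION ... REFRESH PUBLICATION.
 */
static bool
rel_known_to_subscription(Oid relid)
{
    return false;           /* pretend the publication was never refreshed */
}

static void
apply_change(Oid relid)
{
    if (!rel_known_to_subscription(relid))
    {
        /* Skip instead of ERROR, so the worker does not crash-loop. */
        fprintf(stderr,
                "WARNING: skipping change for relation %u not in pg_subscription_rel\n",
                relid);
        return;
    }
    /* ... apply the change as usual ... */
}

int
main(void)
{
    apply_change(16384);    /* arbitrary example OID */
    return 0;
}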

[1] https://www.postgresql.org/docs/current/sql-altersubscription.html

Thanks,
-- 
Melih Mutlu
Microsoft


0001-Do-not-apply-for-tables-which-not-exist-on-catalog.patch
Description: Binary data


Re: Apply worker fails if a relation is missing on subscriber even if refresh publication has not been refreshed yet

2022-12-26 Thread Melih Mutlu
Hi Amit,

On Fri, 23 Dec 2022 at 09:39, Amit Kapila  wrote:

> I also have the same understanding but I think if we skip replicating
> some table due to the reason that the corresponding publication has
> not been refreshed then it is better to LOG that information instead
> of silently skipping it.


By skipping it, I mean the apply worker does not try to do anything with
the changes for the missing table, since the worker simply cannot apply
them and only fails.
But I agree with you about logging it; the patch currently logs such cases
as warnings instead of errors.
I can make it LOG instead of WARNING; I just wanted something other than
ERROR.

Do you have any use case in mind where the user has added a table to
> the publication even though she doesn't want it to be replicated? One
> thing that came to my mind is that due to some reason after adding a
> table to the publication, there is some delay in creating the table on
> the subscriber and then refreshing the publication and during that
> time user expects replication to proceed smoothly. But for that isn't
> it better that the user completes the setup on the subscriber before
> performing operations on such a table? Because say there is some error
> in the subscriber-side setup that the user misses then it would be a
> surprise for a user to not see the table data. In such a case, an
> ERROR/LOG information could be helpful for users.
>

I don't really see a specific use case for this. The delay between creating
a table on the publisher and then on the subscriber is usually not long
enough to hurt anything. It just seems unnecessary to me that the apply
worker goes into a failure loop until someone creates the table on the
subscriber, even though the table will not be replicated immediately.


Users also shouldn't expect such tables to be replicated if they did
not refresh the publication. That will not happen with or without this
change. So I don't think it would be a surprise when they see their new
table has not been replicated yet. This issue will also be visible in the
logs, just not as an error.
And if users decide/remember to refresh the publication, they cannot do
that anyway if the table is still missing on the subscriber. So the REFRESH
PUBLICATION command will fail and then users will see an error log.

Best,
-- 
Melih Mutlu
Microsoft


Re: [feature]COPY FROM enable FORCE_NULL/FORCE_NOT_NULL on all columns

2022-12-27 Thread Melih Mutlu
Hi,

Having FORCE_NULL(*) and FORCE_NOT_NULL(*) sounds good, since Postgres
already has FORCE_QUOTE(*).

I just quickly tried out your patch. It worked for me as expected.

 One little suggestion:

> + if (cstate->opts.force_notnull_all)
> + {
> +     int i;
> +     for (i = 0; i < num_phys_attrs; i++)
> +         cstate->opts.force_notnull_flags[i] = true;
> + }


Instead of setting the force_null/force_notnull flags for all columns, what
about simply setting the "attnums" list to cstate->attnumlist?
Something like the following should be enough:

> if (cstate->opts.force_null_all)
>     attnums = cstate->attnumlist;
> else
>     attnums = CopyGetAttnums(tupDesc, cstate->rel, cstate->opts.force_null);


Thanks,
-- 
Melih Mutlu
Microsoft


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-01-03 Thread Melih Mutlu
Hi hackers,

Sending an updated version of this patch to get rid of compiler warnings.

I would highly appreciate any feedback.

Thanks,
-- 
Melih Mutlu
Microsoft


v2-0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch
Description: Binary data


v6-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-01-11 Thread Melih Mutlu
Hi hackers,

Rebased the patch to resolve conflicts.

Best,
-- 
Melih Mutlu
Microsoft


v3-0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch
Description: Binary data


v7-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-01-11 Thread Melih Mutlu
Hi,

Thanks for your review.

On Wed, 11 Jan 2023 at 11:56, shiy.f...@fujitsu.com  wrote:

> On Mon, Nov 14, 2022 8:08 PM Melih Mutlu  wrote:
> 1.
> +# Binary enabled subscription should fail
> +$node_subscriber_binary->wait_for_log("ERROR:  insufficient data left in
> message");
>
> Should it be changed to "ERROR: ( [A-Z0-9]+:)? ", like other subscription
> tests.
>

Done.


> 2.
> +# Binary disabled subscription should succeed
> +$node_publisher->wait_for_catchup('tap_sub');
>
> If we want to wait for table synchronization to finish, should we call
> wait_for_subscription_sync()?
>

Done.


> 3.
> I also think it might be better to support copy binary only for publishers
> of
> v16 or later. Do you plan to implement it in the patch?
>

Done.

Thanks,
-- 
Melih Mutlu
Microsoft


v5-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-01-12 Thread Melih Mutlu
Hi,

Thanks for your reviews.

On Thu, 12 Jan 2023 at 06:07, Takamichi Osumi (Fujitsu)  wrote:

> On Wednesday, January 11, 2023 7:45 PM Melih Mutlu 
> wrote:
> (1-1) There is no need to log the version and the query by elog here.
> (1-2) Also, I suggest we can remove the server_version variable itself,
>   because we have only one actual reference for it.
>   There is a style that we call walrcv_server_version in the
>   'if condition' directly like existing codes in
> fetch_remote_table_info().
> (1-3) Suggestion to improve comments.
>   FROM:
>   /* If the publisher is v16 or later, copy in binary format.*/
>   TO:
>   /* If the publisher is v16 or later, copy data in the required data
> format. */
>

I forgot to remove that LOG line. I removed it now and applied the other
suggestions too.


> I think if we change the order of this part of the check like below, then
> it would look more aligned with the other existing test code introduced by
> this patch.
>

Right. Changed it to make it more aligned with the rest.

Thanks,
-- 
Melih Mutlu
Microsoft


v6-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Time delayed LR (WAS Re: logical replication restrictions)

2023-01-12 Thread Melih Mutlu
Hi,

I have a question about 032_apply_delay.pl.

+# Test ALTER SUBSCRIPTION. Delay 86460 seconds (1 day 1 minute).
> +$node_subscriber->safe_psql('postgres',
> +"ALTER SUBSCRIPTION tap_sub SET (min_apply_delay = 8646)"
> +);
> +
> +# New row to trigger apply delay.
> +$node_publisher->safe_psql('postgres',
> +"INSERT INTO test_tab VALUES (0, 'foobar')");
> +


I couldn't quite see how these lines test whether ALTER SUBSCRIPTION
successfully worked.
Don't we need to check that min_apply_delay really changed as a result?

But I also see that subscription.sql already tests this ALTER SUBSCRIPTION
behaviour.

Best,
-- 
Melih Mutlu
Microsoft


Re: pg_stat_io for the startup process

2023-04-26 Thread Melih Mutlu
Hi all,

On Wed, 26 Apr 2023 at 15:34, Robert Haas  wrote:

> On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi
>  wrote:
> > 3. When should we call pgstat_report_stats on the startup process?
> >
> > During recovery, I think we can call pgstat_report_stats() (or a
> > subset of it) right before invoking WaitLatch and at segment
> > boundaries.
>
> I think this kind of idea is worth exploring. Andres mentioned timers,
> but it seems to me that something where we just do it at certain
> "convenient" points might be good enough and simpler. I'd much rather
> have statistics that were up to date as of the last time we finished a
> segment than have nothing at all.
>

I created a rough prototype of a timer-based approach for comparison.
Please see attached.

I registered a new timeout named "STARTUP_STAT_FLUSH_TIMEOUT". The
timeout's handler sets a flag indicating that IO stats need to be flushed,
and HandleStartupProcInterrupts checks the flag to flush stats.
The timeout is enabled whenever a WAL record is read (in the main loop of
PerformWalRecovery), and it's disabled before WaitLatch to avoid
unnecessary wakeups if the startup process decides to sleep.

With those changes, I see that the startup-related rows in pg_stat_io are
updated without waiting for the startup process to exit.

> postgres=# select context, reads, extends, hits from pg_stat_io where
> backend_type = 'startup';
>   context  | reads | extends | hits
> -----------+-------+---------+------
>  bulkread  |     0 |         |    0
>  bulkwrite |     0 |       0 |    0
>  normal    |     6 |       1 |   41
>  vacuum    |     0 |       0 |    0
> (4 rows)


I'm not sure this is the correct way to implement this approach though. I'd
appreciate any comments.

Also, some questions about this implementation, if you think it's worth
discussing:
1- I set an arbitrary timeout (1 sec) for testing. I don't know what the
correct value should be. Does it make sense to use
PGSTAT_MIN/MAX/IDLE_INTERVAL instead?
2- I'm also not sure whether this timeout should be registered at the
beginning of StartupProcessMain, or whether it even makes any difference. I
tried to do this just before the main redo loop in PerformWalRecovery, but
that made the CI red.
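
For reference, the prototype follows the usual timeout-flag pattern. Below
is a standalone sketch of that pattern using raw POSIX signals; the actual
patch uses the timeout.c machinery via RegisterTimeout() instead, so treat
this as an illustration only:

#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

/* The handler only sets a flag; the real work happens at safe points. */
static volatile sig_atomic_t stats_flush_pending = 0;

static void
flush_timeout_handler(int signo)
{
    stats_flush_pending = 1;
}

int
main(void)
{
    struct itimerval t = {{1, 0}, {1, 0}};  /* fire every second */

    signal(SIGALRM, flush_timeout_handler);
    setitimer(ITIMER_REAL, &t, NULL);

    for (int i = 0; i < 5; i++)     /* stand-in for the main redo loop */
    {
        sleep(1);                   /* "replay a WAL record" */
        if (stats_flush_pending)    /* cf. HandleStartupProcInterrupts */
        {
            stats_flush_pending = 0;
            printf("flush pg_stat_io stats here\n");    /* outside the handler */
        }
    }
    return 0;
}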

Best,
-- 
Melih Mutlu
Microsoft


0001-Add-timeout-to-flush-stats-during-startup-s-main-replay-loop.patch
Description: Binary data


Re: pg_stat_io for the startup process

2023-05-03 Thread Melih Mutlu
Hi,

On Thu, 27 Apr 2023 at 19:27, Andres Freund  wrote:

> Huh - do you have a link to the failure? That's how I would expect it to be
> done.


I think I should have registered it at the beginning of
PerformWalRecovery() and not just before the main redo loop,
since HandleStartupProcInterrupts is called before the loop too.
Anyway, here's the error log [1].

>  #ifdef WAL_DEBUG
> >   if (XLOG_DEBUG ||
> >   (record->xl_rmid == RM_XACT_ID &&
> trace_recovery_messages <= DEBUG2) ||
> > @@ -3617,6 +3621,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr,
> bool randAccess,
> >   /* Do background tasks
> that might benefit us later. */
> >
>  KnownAssignedTransactionIdsIdleMaintenance();
> >
> > + /*
> > +  * Need to disable flush
> timeout to avoid unnecessary
> > +  * wakeups. Enable it
> again after a WAL record is read
> > +  * in PerformWalRecovery.
> > +  */
> > +
>  disable_startup_stat_flush_timeout();
> > +
> >   (void)
> WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
> >
>   WL_LATCH_SET | WL_TIMEOUT |
> >
>   WL_EXIT_ON_PM_DEATH,
>
> I think always disabling the timer here isn't quite right - we want to
> wake up
> *once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending
> stats before waiting - potentially for a long time - for WAL. One way
> would be
> just explicitly report before the WaitLatch().
>
> Another approach, I think better, would be to not use
> enable_timeout_every(),
> and to explicitly rearm the timer in HandleStartupProcInterrupts(). When
> called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do
> so
> at the end of WaitForWALToBecomeAvailable().  That way we also wouldn't
> repeatedly fire between calls to HandleStartupProcInterrupts().
>

The attached patch is probably not doing what you asked. IIUC,
HandleStartupProcInterrupts should arm the timer, but shouldn't arm it if
it's called from WaitForWALToBecomeAvailable; instead, the timer should be
armed again at the very end of WaitForWALToBecomeAvailable. I've been
thinking about how to do this properly. Should HandleStartupProcInterrupts
take a parameter to decide whether the timer needs to be armed? Or do we
need an additional global flag to rearm the timer? Any thoughts?

[1]
https://api.cirrus-ci.com/v1/artifact/task/5282291971260416/testrun/build/testrun/recovery/010_logical_decoding_timelines/log/010_logical_decoding_timelines_replica.log

Best,
-- 
Melih Mutlu
Microsoft


v2-0001-Add-timeout-to-flush-stats-during-startup-s-main-replay-loop.patch
Description: Binary data


Re: Subscription statistics are not dropped at DROP SUBSCRIPTION in some cases

2023-05-10 Thread Melih Mutlu
Hi,

On Mon, 8 May 2023 at 10:24, Masahiko Sawada  wrote:

> I've attached the patch. Feedback is very welcome.
>

Thanks for the patch, nice catch.
I can confirm that the issue exists on HEAD and gets resolved by this
patch. Also, it looks like stats are really not affected if the transaction
fails for some reason, as you explained.
IMO, the patch will be OK after a commit message is added.

Thanks,
-- 
Melih Mutlu
Microsoft


Re: Mingw task for Cirrus CI

2022-04-08 Thread Melih Mutlu
Hi Andrew,

You should set MSYSTEM=UCRT64 in the environment section. Given that,
> there should be no need to specify a --host= setting for configure.
>

It's set to UCRT64 on the docker image side [1]. I didn't know --host isn't
necessary in a UCRT64 environment. I'll remove it then.

[1]
https://github.com/anarazel/pg-vm-images/blob/main/docker/windows_ci_mingw64#L11


Best,
Melih


Re: Mingw task for Cirrus CI

2022-04-08 Thread Melih Mutlu
> On windows that makes prep_buildtree go from 42.4s to 5.8s for me.
>

I applied Andres's faster prep_buildtree changes and triggered some cirrus
runs.

Without these changes, preparing the build tree was taking around 42.3s
(sometimes even more) [1].
It seems like with these changes it drops to around 8s [2].

[1] https://cirrus-ci.com/task/6562493345562624
[2] https://cirrus-ci.com/task/4836843802853376

Best,
Melih


Building Postgres with lz4 on Visual Studio

2022-04-13 Thread Melih Mutlu
Hi all,

I tried to build Postgres from source using Visual Studio 19. It worked
fine.
Then I wanted to build it with some dependencies, starting with the ones
listed here [1]. But I'm having some issues with lz4.

First, I downloaded the latest release of lz4 from this link [2] and
modified the src\tools\msvc\config.pl file as follows:

> use strict;
> use warnings;
>
> our $config;
> $config->{"tap_tests"} = 1;
> $config->{"asserts"} = 1;
>
> $config->{"lz4"} = "c:/lz4/";
> $config->{"openssl"} = "c:/openssl/1.1/";
> $config->{"perl"} = "c:/strawberry/$ENV{DEFAULT_PERL_VERSION}/perl/";
> $config->{"python"} = "c:/python/";
>
> 1;

based on /src/tools/ci/windows_build_config.pl

Then I ran the following commands:

> vcvarsall x64
> perl src/tools/msvc/mkvcbuild.pl
> msbuild  -m -verbosity:minimal /p:IncludePath="C:\lz4" pgsql.sln


The msbuild command fails with the following error:
"LINK : fatal error LNK1181: cannot open input file
'c:\lz4\\lib\liblz4.lib' [C:\postgres\libpgtypes.vcxproj]"

What I realized is that c:\lz4\lib\liblz4.lib actually does not exist.
The latest versions of lz4, downloaded from [2], do not contain liblz4.lib
anymore, as opposed to what's written here [3]. Also, there isn't a lib
folder either.

Given those changes on the lz4 side, AFAIU these lines in the Solution.pm
file [4] add the library from the wrong path:

> $proj->AddIncludeDir($self->{options}->{lz4} . '\include');
> $proj->AddLibrary($self->{options}->{lz4} . '\lib\liblz4.lib');


Even though I spent some time on this problem and tried to fix some places,
I'm not able to get a successful build yet.
This is also the case for zstd; enabling it gives the same error.

Has anyone had this issue before? Is this something anyone is aware of and
has somehow made work?
I would appreciate any comments/suggestions.

Thanks,
Melih


[1]
https://www.postgresql.org/docs/current/install-windows-full.html#id-1.6.5.8.8
[2] https://github.com/lz4/lz4/releases/tag/v1.9.3
[3] https://github.com/lz4/lz4/tree/dev/lib/dll/example
[4]
https://github.com/postgres/postgres/blob/c1932e542863f0f646f005b3492452acc57c7e66/src/tools/msvc/Solution.pm#L1092


Re: Parent/child context relation in pg_get_backend_memory_contexts()

2023-10-23 Thread Melih Mutlu
Hi,

Thanks for reviewing.
Attached is the updated patch v3.

On Thu, 12 Oct 2023 at 19:23, Andres Freund  wrote:

> > Here is how pg_backend_memory_contexts would look with this patch:
> >
> > postgres=# SELECT name, id, parent, parent_id, path
> > FROM pg_backend_memory_contexts
> > ORDER BY total_bytes DESC LIMIT 10;
> >            name            | id  |      parent      | parent_id | path
> >  --------------------------+-----+------------------+-----------+--------------
> >   CacheMemoryContext       |  27 | TopMemoryContext |         0 | {0}
> >   Timezones                | 124 | TopMemoryContext |         0 | {0}
> >   TopMemoryContext         |   0 |                  |           |
> >   MessageContext           |   8 | TopMemoryContext |         0 | {0}
> >   WAL record construction  | 118 | TopMemoryContext |         0 | {0}
> >   ExecutorState            |  18 | PortalContext    |        17 | {0,16,17}
> >   TupleSort main           |  19 | ExecutorState    |        18 | {0,16,17,18}
> >   TransactionAbortContext  |  14 | TopMemoryContext |         0 | {0}
> >   smgr relation table      |  10 | TopMemoryContext |         0 | {0}
> >   GUC hash table           | 123 | GUCMemoryContext |       122 | {0,122}
> > (10 rows)
>
> Would we still need the parent_id column?
>

I guess not. Since the path column is ordered from TopMemoryContext down to
the parent one level above, parent_id can be derived from the path column
if needed.
Removed parent_id.
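
For instance, with this patch applied, the immediate parent could be
derived from path alone, along these lines (a sketch, assuming the column
names from the example output above, where the last path element is the
immediate parent):

SELECT name, id, path[array_length(path, 1)] AS parent_id
FROM pg_backend_memory_contexts;

TopMemoryContext would simply get a NULL parent_id, since its path is
empty.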


> > +
> > + 
> > +  
> > +   context_id int4
> > +  
> > +  
> > +   Current context id
> > +  
> > + 
>
> I think the docs here need to warn that the id is ephemeral and will likely
> differ in the next invocation.
>

Done.

> + 
> > +  
> > +   parent_id int4
> > +  
> > +  
> > +   Parent context id
> > +  
> > + 
> > +
> > + 
> > +  
> > +   path int4
> > +  
> > +  
> > +   Path to reach the current context from TopMemoryContext
> > +  
> > + 
>
> Perhaps we should include some hint here how it could be used?
>

I added more explanation, but I'm not sure if that is what you asked for.
Do you want a hint tied to a more specific use case?

> + length = list_length(path);
> > + datum_array = (Datum *) palloc(length * sizeof(Datum));
> > + length = 0;
> > + foreach(lc, path)
> > + {
> > + datum_array[length++] = Int32GetDatum((int)
> lfirst_int(lc));
>
> The "(int)" in front of lfirst_int() seems redundant?
>

Removed.

> I think it'd be good to have some minimal test for this. E.g. checking
> that there's multiple contexts below cache memory context or such.
>

Added new tests in sysview.sql.
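
For instance, one of them checks that CacheMemoryContext has child
contexts, roughly like this (a sketch only, using the same
array-containment idiom as the query below and the id/path columns proposed
in this patch):

select count(*) > 0
from pg_backend_memory_contexts
where array[(select id
             from pg_backend_memory_contexts
             where name = 'CacheMemoryContext')] <@ path;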


Stephen Frost wrote on Wed, 18 Oct 2023 at 22:53:

> I wonder if we should perhaps just include
> "total_bytes_including_children" as another column?  Certainly seems
> like a very useful thing that folks would like to see.  We could do that
> either with C, or even something as simple as changing the view to do
> something like:
>
> WITH contexts AS MATERIALIZED (
>   SELECT * FROM pg_get_backend_memory_contexts()
> )
> SELECT
>   *,
>   coalesce
>   (
> (
>   (SELECT sum(total_bytes) FROM contexts WHERE ARRAY[a.id] <@ path)
>   + total_bytes
> ),
> total_bytes
>   ) AS total_bytes_including_children
> FROM contexts a;
>

I added a "total_bytes_including_children" column as you suggested. Did
that with C since it seemed faster than doing it by changing the view.

-- Calculating total_bytes_including_children by modifying the view
postgres=# select * from pg_backend_memory_contexts ;
Time: 30.462 ms

-- Calculating total_bytes_including_children with C
postgres=# select * from pg_backend_memory_contexts ;
Time: 1.511 ms


Thanks,
-- 
Melih Mutlu
Microsoft


v3-0001-Adding-id-parent_id-into-pg_backend_memory_contex.patch
Description: Binary data


Flushing large data immediately in pqcomm

2023-11-20 Thread Melih Mutlu
Hi hackers

I've been looking into ways to reduce the overhead we're having in pqcomm
and I'd like to propose a small patch to modify how socket_putmessage works.

Currently socket_putmessage copies any input data into the pqcomm send
buffer (PqSendBuffer), whose size is 8K. When the send buffer gets full,
it's flushed, and we continue to copy more data into the send buffer until
there is no data left to be sent.
Since the send buffer is flushed whenever it's full, I think it's safe to
say that if the input data is larger than the buffer size, which is 8K,
the buffer will be flushed at least once (or more, depending on the input
size) while storing all the input data.

The proposed change modifies socket_putmessage to send any data larger than
8K immediately, without copying it into the send buffer. Since the send
buffer would be flushed anyway upon reaching its limit, the patch just gets
rid of the copy step, which seems unnecessary, and sends the data without
waiting.

This change affects places where pq_putmessage is used, such as
pg_basebackup, COPY TO, walsender, etc.

I did some experiments to see how the patch performs.
First, I loaded ~5GB of data into a table [1], then ran "COPY test TO
STDOUT". Here are the perf results of both the patch and HEAD:

HEAD:
-   94,13% 0,22%  postgres  postgres   [.] DoCopyTo
  - 93,90% DoCopyTo
  - 91,80% CopyOneRowTo
 + 47,35% CopyAttributeOutText
 - 26,49% CopySendEndOfRow
- 25,97% socket_putmessage
   - internal_putbytes
  - 24,38% internal_flush
 + secure_write
  + 1,47% memcpy (inlined)
 + 14,69% FunctionCall1Coll
 + 1,94% appendBinaryStringInfo
 + 0,75% MemoryContextResetOnly
  + 1,54% table_scan_getnextslot (inlined)

Patch:
-   94,40% 0,30%  postgres  postgres   [.] DoCopyTo
   - 94,11% DoCopyTo
  - 92,41% CopyOneRowTo
 + 51,20% CopyAttributeOutText
 - 20,87% CopySendEndOfRow
- 20,45% socket_putmessage
   - internal_putbytes
  - 18,50% internal_flush (inlined)
   internal_flush_buffer
 + secure_write
  + 1,61% memcpy (inlined)
 + 17,36% FunctionCall1Coll
 + 1,33% appendBinaryStringInfo
 + 0,93% MemoryContextResetOnly
  + 1,36% table_scan_getnextslot (inlined)

The patch brings a ~5% gain in socket_putmessage.

I also timed pg_basebackup like this:
time pg_basebackup -p 5432 -U replica_user -X none -c fast --no-manifest
-D test

HEAD:
real0m10,040s
user0m0,768s
sys 0m7,758s

Patch:
real0m8,882s
user0m0,699s
sys 0m6,980s

It seems ~11% faster in this specific case.

I'd appreciate any feedback/thoughts.


[1]
CREATE TABLE test(id int, name text, time TIMESTAMP);
INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100) AS
name, NOW() AS time FROM generate_series(1, 1) AS i;


Thanks,
-- 
Melih Mutlu
Microsoft


0001-Flush-large-data-immediately-in-pqcomm.patch
Description: Binary data


Re: Separate memory contexts for relcache and catcache

2024-01-03 Thread Melih Mutlu
Hi,

torikoshia wrote on Mon, 4 Dec 2023 at 07:59:

> Hi,
>
> I also think this change would be helpful.
>
> I imagine you're working on the Andres's comments and you already notice
> this, but v1 patch cannot be applied to HEAD.
> For the convenience of other reviewers, I marked it 'Waiting on Author'.
>

Thanks for letting me know. I rebased the patch. PFA new version.




Andres Freund wrote on Thu, 12 Oct 2023 at 20:01:

> Hi,
>
> Have you checked what the source of the remaining allocations in
> CacheMemoryContext are?
>

It's mostly typecache, around 2K. Do you think typecache also needs a
separate context?

Given the size of both CatCacheMemoryContext and RelCacheMemoryContext in a
> new backend, I think it might be worth using non-default aset parameters. A
> bit ridiculous to increase block sizes from 8k upwards in every single
> connection made to postgres ever.
>

Considering it starts at ~262K, what would be a better init size?
256K?

> +static void
> > +CreateCatCacheMemoryContext()
>
> We typically use (void) to differentiate from an older way of function
> declarations that didn't have argument types.
>
Done.

> +{
> > + if (!CacheMemoryContext)
> > + CreateCacheMemoryContext();
>
> I wish we just made sure that cache memory context were created in the
> right
> place, instead of spreading this check everywhere...
>

That would be nice. Do you have a suggestion about where that right place
would be?

I'd just delete these comments, they're just pointlessly restating the code.
>

Done.

Thanks,
-- 
Melih Mutlu
Microsoft


v2-0001-Separate-memory-contexts-for-relcache-and-catcach.patch
Description: Binary data


Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-01-03 Thread Melih Mutlu
Hi,

Thanks for reviewing. Please find the updated patch attached.

torikoshia wrote on Mon, 4 Dec 2023 at 07:43:

> I reviewed v3 patch and here are some minor comments:
>
> > + 
> > +   > role="column_definition">
> > +   path int4
>
> Should 'int4' be 'int4[]'?
> Other system catalog columns such as pg_groups.grolist distinguish
> whether the type is an array or not.
>

Right! Done.


>
> > +   Path to reach the current context from TopMemoryContext.
> > Context ids in
> > +   this list represents all parents of the current context. This
> > can be
> > +   used to build the parent and child relation.
>
> It seems last "." is not necessary considering other explanations for
> each field end without it.
>

Done.


> +const char *parent, int level, int
> *context_id,
> +List *path, Size
> *total_bytes_inc_chidlren)
>
> 'chidlren' -> 'children'
>

Done.


> +   elog(LOG, "pg_get_backend_memory_contexts called");
>
> Is this message necessary?
>

I guess I added this line for debugging and then forgot to remove. Now
removed.

There was warning when applying the patch:
>
>% git apply
>
> ../patch/pg_backend_memory_context_refine/v3-0001-Adding-id-parent_id-into-pg_backend_memory_contex.patch
>
> ../patch/pg_backend_memory_context_refine/v3-0001-Adding-id-parent_id-into-pg_backend_memory_contex.patch:282:
>
> trailing whitespace.
>select count(*) > 0
>
> ../patch/pg_backend_memory_context_refine/v3-0001-Adding-id-parent_id-into-pg_backend_memory_contex.patch:283:
>
> trailing whitespace.
>from contexts
>warning: 2 lines add whitespace errors.
>

Fixed.

Thanks,
-- 
Melih Mutlu
Microsoft


v4-0001-Adding-id-parent_id-into-pg_backend_memory_contex.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2022-11-14 Thread Melih Mutlu
Hello,

osumi.takami...@fujitsu.com wrote on Wed, 12 Oct 2022 at 04:36:

> >   I agree with the direction to support binary copy for v16 and
> later.
> >
> >   IIUC, the binary format replication with different data types
> fails even
> > during apply phase on HEAD.
> >   I thought that means, the upgrade concern only applies to a
> scenario
> > that the user executes
> >   only initial table synchronizations between the publisher and
> subscriber
> >   and doesn't replicate any data at apply phase after that. I would
> say
> >   this isn't a valid scenario and your proposal makes sense.
> >
> > No, logical replication in binary does not fail on apply phase if data
> types are
> > different.
> With HEAD, I observe in some case we fail at apply phase because of
> different data types like
> integer vs. bigint as written scenario in [1]. In short, I think we can
> slightly
> adjust your documentation and make it more general so that the description
> applies to
> both table sync phase and apply phase.
>

Yes, you're right. I somehow had the impression that HEAD supports
replication between different types in binary.
But as shown in the scenario you mentioned, it does not work.

I'll suggest a below change for your sentence of logical-replication.sgml.
> FROM:
> In binary case, it is not allowed to replicate data between different
> types due to restrictions inherited from COPY.
> TO:
> Binary format is type specific and does not allow to replicate data
> between different types according to its
> restrictions.
>

In this case, this change makes sense, since this patch does not actually
introduce this issue. It already exists on HEAD too.


> If my idea above is correct, then I feel we can remove all the fixes for
> create_subscription.sgml.
> I'm not sure if I should pursue this perspective of the document
> improvement
> any further after this email, since this isn't essentially because of this
> patch.
>

I'm only keeping the following change in create_subscription.sgml, to
indicate that the binary option now also copies in binary format.

> -  Specifies whether the subscription will request the publisher to
> -  send the data in binary format (as opposed to text).
> +  Specifies whether the subscription will copy the initial data to
> +  synchronize relations in binary format and also request the
> publisher
> +  to send the data in binary format too (as opposed to text).




> > The concern with upgrade (if data types are not the same) would be not
> being
> > able to create a new subscription with binary enabled or replicate new
> tables
> > added into publication.
> > Replication of tables from existing subscriptions would not be affected
> by this
> > change since they will already be in the apply phase, not tablesync.
> > Do you think this would still be an issue?
> Okay, thanks for explaining this. I understand that
> the upgrade concern applies to the table sync that is executed
> between text format (before the patch) and binary format (after the patch).
>

I was thinking that apply would work with different types in binary format.
Since apply does not work either, the scenario that I tried to explain
earlier is not a concern anymore.


Attached is the updated version of this patch.

Thanks,
Melih


v4-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-12-05 Thread Melih Mutlu
Hi hackers,

I've been working on (and struggling with) this patch for a while, but I
haven't updated this thread regularly.
So here is what I've done with this patch so far.

> Amit Kapila wrote on Sat, 6 Aug 2022 at 16:01:
> >>
> >> I think there is some basic flaw in slot reuse design. Currently, we
> >> copy the table by starting a repeatable read transaction (BEGIN READ
> >> ONLY ISOLATION LEVEL REPEATABLE READ) and create a slot that
> >> establishes a snapshot which is first used for copy and then LSN
> >> returned by it is used in the catchup phase after the copy is done.
> >> The patch won't establish such a snapshot before a table copy as it
> >> won't create a slot each time. If this understanding is correct, I
> >> think we need to use ExportSnapshot/ImportSnapshot functionality to
> >> achieve it or do something else to avoid the problem mentioned.
>

This issue that Amit mentioned causes some problems, such as duplicated
rows on the subscriber.
Basically, with this patch, a tablesync worker creates a replication slot
only in its first run. To ensure table copy and sync are consistent with
each other, the worker needs the correct snapshot and LSN, which are both
returned by the slot create operation.
Since this patch does not create a replication slot for each table copy and
instead reuses the one created in the beginning, we do not get a new
snapshot and LSN for each table anymore. The snapshot gets lost right after
the transaction is committed, but the patch continues to use the same LSN
for subsequent tables without the proper snapshot.
In the end, for example, the worker might first copy some rows, then apply
changes from the replication slot and insert those rows again for the
tables in later iterations.

I discussed some possible ways to resolve this with Amit offline:
1- Copy all tables in one transaction so that we wouldn't need to deal with
snapshots.
Not easy to keep track of the progress. If the transaction fails, we would
need to start all over again.

2- Don't lose the first snapshot (by keeping a transaction open with the
snapshot imported or some other way) and use the same snapshot and LSN for
all tables.
I'm not sure about the side effects of keeping a transaction open that long
or using a snapshot that might be too old after some time.
Still seems like it might work.

3- For each table, get a new snapshot and LSN by using an existing
replication slot.
Even though this approach wouldn't create a new replication slot, preparing
the slot for a snapshot and then taking the snapshot may be costly.
We need some numbers here to see how much this approach would improve
performance.

I decided to go with approach 3 (creating a new snapshot with an existing
replication slot) for now since it would require less change in the
tablesync worker logic than the other options would.
To achieve this, this patch introduces a new command for Streaming
Replication Protocol.
The new REPLICATION_SLOT_SNAPSHOT command basically mimics how
CREATE_REPLICATION_SLOT creates a snapshot, but without actually creating a
new replication slot.
The tablesync worker then calls this command whenever it decides not to
create a new replication slot, and uses the snapshot created and the LSN
returned by the command.
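
For illustration, the intended flow over a replication connection would
look roughly like this (a sketch only; the slot name is made up, and the
exact command syntax is the one defined in the attached 0001 patch):

-- first run: create the slot, which also returns a snapshot and LSN
CREATE_REPLICATION_SLOT mysub_sync_1 LOGICAL pgoutput;
-- later runs: reuse the slot and only take a fresh snapshot and LSN
REPLICATION_SLOT_SNAPSHOT mysub_sync_1;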

Also:
After the changes discussed here [1], concurrent replication origin drops
by the apply worker and tablesync workers may block each other waiting on
locks taken by replorigin_drop_by_name.
I see that this hurts the performance of logical replication quite a bit in
terms of speed.
Even though reusing replication origins was discussed in this thread
before, the patch didn't include any change to do so.
The updated version of the patch now reuses replication origins too.
It seems that even the changes to reuse origins, by themselves, improve
performance significantly.

Attached two patches:
0001: adds REPLICATION_SLOT_SNAPSHOT command for replication protocol.
0002: Reuses workers/replication slots and origins for tablesync

I would appreciate any feedback/reviews/thoughts on the approach and both
patches.
I will also share some numbers comparing the performance of the patch and
the master branch soon.

[1]
https://www.postgresql.org/message-id/flat/20220714115155.GA5439%40depesz.com

Best,
--
Melih Mutlu
Microsoft


0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch
Description: Binary data


v4-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: Improve tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE

2022-12-06 Thread Melih Mutlu
Hi Vignesh,

Looks like the patch needs a rebase.

Also one little suggestion:

+ if (ends_with(prev_wd, ')'))
> + COMPLETE_WITH(Alter_routine_options, "CALLED ON NULL INPUT",
> +  "RETURNS NULL ON NULL INPUT", "STRICT", "SUPPORT");


What do you think about gathering the FUNCTION options as you did with the
ROUTINE options?
Something like the following would seem nicer, I think:

#define Alter_function_options \
> Alter_routine_options, "CALLED ON NULL INPUT", \

"RETURNS NULL ON NULL INPUT", "STRICT", "SUPPORT"


Best,
--
Melih Mutlu
Microsoft


Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-06 Thread Melih Mutlu
Hi Nathan,

@@ -410,6 +411,12 @@ ExecRenameStmt(RenameStmt *stmt)
> stmt->newname);
>   table_close(catalog, RowExclusiveLock);
>
> + /*
> + * Wake up the logical replication workers to handle this
> + * change quickly.
> + */
> + LogicalRepWorkersWakeupAtCommit(address.objectId);


Is it really necessary to wake logical workers up when renaming something
other than a subscription or publication? address.objectId will be a valid
subid only when renaming a subscription.

@@ -322,6 +323,9 @@ UpdateSubscriptionRelState(Oid subid, Oid relid, char
> state,
>
>   /* Cleanup. */
>   table_close(rel, NoLock);
> +
> + /* Wake up the logical replication workers to handle this change
> quickly. */
> + LogicalRepWorkersWakeupAtCommit(subid);


I wonder why a wakeup call is needed every time a subscription relation is
updated.
It seems to me that there are two places where UpdateSubscriptionRelState
is called and we need another worker to wake up:
- When a relation is in SYNCWAIT state, it waits for the apply worker to
wake up and change the relation state to CATCHUP. Then the tablesync worker
needs to wake up to continue from the CATCHUP state.
- When the state is SYNCDONE, the apply worker has to wake up to change
the state to READY.

I think we already call logicalrep_worker_wakeup_ptr wherever it's needed
for the above cases? What am I missing here?

Best,
--
Melih Mutlu
Microsoft


Re: Improve tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE

2022-12-07 Thread Melih Mutlu
Hi,

vignesh C wrote on Tue, 6 Dec 2022 at 22:12:

> I did not make it as a macro for alter function options as it is used
> only in one place whereas the others were required in more than one
> place.
>

Okay, makes sense.

I tested the patch and it worked for me.

Best,
--
Melih Mutlu
Microsoft


Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-07 Thread Melih Mutlu
Hi,


> Actually, that's not quite right.  The sync worker will wake up the apply
> worker to change the state from SYNCDONE to READY.  AllTablesyncsReady()
> checks that all tables are READY, so we need to wake up all the workers
> when an apply worker changes the state to READY.  Each worker will then
> evaluate whether to restart for two_phase mode.
>

Right. I didn't think the two-phase case through thoroughly. Waking up all
apply workers can help.

Do we also need to wake up all sync workers? Even if not, I'm not
actually sure whether doing that would harm anything.
Just asking since the patch currently wakes up all workers, including sync
workers, if any still exist.

Best,
--
Melih Mutlu
Microsoft


Re: [PATCH] psql: Add tab-complete for optional view parameters

2022-12-08 Thread Melih Mutlu
Hi Christoph,

I just took a quick look at your patch.
Some suggestions:

+   else if (Matches("ALTER", "VIEW", MatchAny, "SET", "("))
> +   COMPLETE_WITH_LIST(view_optional_parameters);
> +   /* ALTER VIEW xxx RESET ( yyy , ... ) */
> +   else if (Matches("ALTER", "VIEW", MatchAny, "RESET", "("))
> +   COMPLETE_WITH_LIST(view_optional_parameters);


What about combining these two cases into one like Matches("ALTER", "VIEW",
MatchAny, "SET|RESET", "(") ?

/* ALTER VIEW  */
> else if (Matches("ALTER", "VIEW", MatchAny))
> COMPLETE_WITH("ALTER COLUMN", "OWNER TO", "RENAME",
>   "SET SCHEMA");


Also, it seems SET and RESET don't get auto-completed for "ALTER VIEW
<name>".
I think it would be nice to include those missing words.

Thanks,
--
Melih Mutlu
Microsoft


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-01-23 Thread Melih Mutlu
Hi,

Thanks for your review.
Attached updated versions of the patches.

wangw.f...@fujitsu.com wrote on Tue, 17 Jan 2023 at 14:15:

> On Wed, Jan 11, 2023 4:31 PM Melih Mutlu  wrote:
> v3-0001* patch
> ===
>
> 1. typedefs.list
> I think we also need to add "walrcv_slot_snapshot_fn" to this file.
>

Done.


> v7-0002* patch
> ===
> 1. About function ReplicationOriginNameForLogicalRep()
> Do we need to modify the API of this function? I think the original API
> could
> also meet the current needs. Since this is not a static function, I think
> it
> seems better to keep the original API if there is no reason. Please let me
> know
> if I'm missing something.
>

You're right.
I still need to modify ReplicationOriginNameForLogicalRep. Origin names
are not tied to relations anymore, so their names don't need to include the
relation id.
But I didn't really need to change the function signature. I reverted that
part of the change in the updated version of the patch.


> 2. Comment atop the function GetSubscriptionRelReplicationSlot
>

Done


> 3. typo
> +* At this point, there shouldn't be any existing
> replication
> +* origin wit the same name.
>

Done.


> 4. In function CreateSubscription
> +   values[Anum_pg_subscription_sublastusedid - 1] = Int64GetDatum(1);
>
> I think it might be better to initialize this field to NULL or 0 here.
> Because in the patch, we always ignore the initialized value when launching
> the sync worker in the function process_syncing_tables_for_apply. And I
> think
> we could document in pg-doc that this value means that no tables have been
> synced yet.
>

I changed it to start from 0 and added a line into the related doc to
indicate that 0 means that no table has been synced yet.


> 5. New member "created_slot" in structure LogicalRepWorker
> +   /*
> +* Indicates if the sync worker created a replication slot or it
> reuses an
> +* existing one created by another worker.
> +*/
> +   boolcreated_slot;
>
> I think the second half of the sentence looks inaccurate.
> Because I think this flag could be false even when we reuse an existing
> slot
> created by another worker. Assuming the first run for the worker tries to
> sync
> a table which is synced by another sync worker before, and the relstate is
> set
> to SUBREL_STATE_FINISHEDCOPY by another sync worker, I think this flag
> will not
> be set to true. (see function LogicalRepSyncTableStart)
>
> So, what if we simplify the description here and just say that this worker
> already has it's default slot?
>
> If I'm not missing something and you agree with this, please also kindly
> modify
> the relevant comment atop the if-statement
> (!MyLogicalRepWorker->created_slot)
> in the function LogicalRepSyncTableStart.
>

This "created_slot" indicates whether the current worker has created a
replication slot for its own use. If so, created_slot will be true,
otherwise false.
Let's say the tablesync worker has not created its own slot yet in its
previous runs or this is its first run. And the worker decides to reuse an
existing replication slot (which created by another tablesync worker). Then
created_slot is expected to be false. Because this particular
tablesync worker has not created its own slot yet in either of its runs.

In your example, the worker is in its first run and begin to sync a table
whose state is FINISHEDCOPY. If the table's state is FINISHEDCOPY then the
table should already have a replication slot created for its own sync
process. The worker will want to reuse that existing replication slot for
this particular table and it will not create a new replication slot. So
created_slot will be false, because the worker has not actually created any
replication slot yet.

Basically, created_slot is set to true only if "walrcv_create_slot" is
called by the tablesync worker any time during its lifetime. Otherwise,
it's possible that the worker can use existing replication slots for each
table it syncs. (e.g. if all the tables that the worker has synced were in
FINISHEDCOPY  state, then the worker will not need to create a new slot).

Does it make sense now? Maybe I need to improve comments to make it clearer.

Best,
-- 
Melih Mutlu
Microsoft


v4-0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch
Description: Binary data


v8-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-01-26 Thread Melih Mutlu
Hi Shveta,

Thanks for reviewing.

shveta malik wrote on Wed, 25 Jan 2023 at 16:02:

> On Mon, Jan 23, 2023 at 6:30 PM Melih Mutlu 
> wrote:
> --I see initial data copied, but new catalog columns srrelslotname
> and srreloriginname are not updated:
> postgres=# select sublastusedid from pg_subscription;
>  sublastusedid
> ---
>  2
>
> postgres=# select * from pg_subscription_rel;
>  srsubid | srrelid | srsubstate | srsublsn  | srrelslotname |
> srreloriginname
>
> -+-++---+---+-
>16409 |   16384 | r  | 0/15219E0 |   |
>16409 |   16389 | r  | 0/15219E0 |   |
>16409 |   16396 | r  | 0/15219E0 |   |
>
> When are these supposed to be updated? I thought the slotname created
> will be updated here. Am I missing something here?
>

If a relation is currently being synced by a tablesync worker and uses a
replication slot/origin for that operation, then the srrelslotname and
srreloriginname fields will have values.
When a relation is done with its replication slot/origin, their info gets
removed from the related catalog row, so that the slot/origin can be reused
for another table, or dropped if not needed anymore.
In your case, all relations are in READY state, so it's expected that
srrelslotname and srreloriginname are empty. READY relations do not need a
replication slot/origin anymore.

Tables are probably synced so quickly that you're missing the moments when
a tablesync worker copies a relation and stores its replication slot/origin
in the catalog.
If the initial sync takes long enough, you should be able to see the
columns get updated. I follow [1] to make it take longer and to test
whether the patch really updates the catalog.



> Also the v8 patch does not apply on HEAD, giving merge conflicts.
>

Rebased and resolved conflicts. Please check the new version.

---
[1]
publisher:
SELECT 'CREATE TABLE table_'||i||'(i int);' FROM generate_series(1, 100)
g(i) \gexec
SELECT 'INSERT INTO table_'||i||' SELECT x FROM generate_series(1, 1)
x' FROM generate_series(1, 100) g(i) \gexec
CREATE PUBLICATION mypub FOR ALL TABLES;

subscriber:
SELECT 'CREATE TABLE table_'||i||'(i int);' FROM generate_series(1, 100)
g(i) \gexec
CREATE SUBSCRIPTION mysub CONNECTION 'dbname=postgres port=5432 '
PUBLICATION mypub;
select * from pg_subscription_rel where srrelslotname <> ''; \watch 0.5


Thanks,
-- 
Melih Mutlu
Microsoft


v9-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-01-30 Thread Melih Mutlu
Hi Bharath,

Thanks for reviewing.

Bharath Rupireddy wrote on Wed, 18 Jan 2023 at 10:17:

> On Thu, Jan 12, 2023 at 1:53 PM Melih Mutlu 
> wrote:
> 1. The performance numbers posted upthread [1] look impressive for the
> use-case tried, that's a 2.25X improvement or 55.6% reduction in
> execution times. However, it'll be great to run a few more varied
> tests to confirm the benefit.
>

Sure, do you have any specific test case or suggestion in mind?


> 2. It'll be great to capture the perf report to see if the time spent
> in copy_table() is reduced with the patch.
>

Will provide that in another email soon.


> 3. I think blending initial table sync's binary copy option with
> data-type level binary send/receive is not good. Moreover, data-type
> level binary send/receive has its own restrictions per 9de77b5453.
> IMO, it'll be good to have a new option, say copy_data_format synonyms
> with COPY command's FORMAT option but with only text (default) and
> binary values.
>

Added a "copy_format" option for subscriptions with text as default value.
So it would be possible to create a binary subscription but copy tables in
text format to avoid restrictions that you're concerned about.
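
For example, with this patch it should be possible to create a
subscription like the following (option names per this patch; copy_format
defaults to 'text'):

CREATE SUBSCRIPTION mysub
    CONNECTION 'dbname=postgres port=5432'
    PUBLICATION mypub
    WITH (binary = true, copy_format = 'text');

This applies changes in binary format while the initial table sync stays
in text format.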


> 4. Why to add tests to existing 002_types.pl? Can we add a new file
> with all the data types covered?
>

Since 002_types.pl is where replication of data types is covered, I
thought it would be okay to test replication with the binary option in that
file.
Sure, we can add a new file with different data types for testing
subscriptions with the binary option. But then I feel like it would have
too many lines duplicated from 002_types.pl.
If you think that 002_types.pl lacks some data types that need to be
tested, then we should add those to 002_types.pl anyway, whether we test
subscriptions with the binary option in that file or somewhere else, right?


> 5. It's not clear to me as to why these rows were removed in the patch?
> -(1, '{1, 2, 3}'),
> -(2, '{2, 3, 1}'),
>  (3, '{3, 2, 1}'),
>  (4, '{4, 3, 2}'),
>  (5, '{5, NULL, 3}');
>
>  -- test_tbl_arrays
>  INSERT INTO tst_arrays (a, b, c, d) VALUES
> -('{1, 2, 3}', '{"a", "b", "c"}', '{1.1, 2.2, 3.3}', '{"1
> day", "2 days", "3 days"}'),
> -('{2, 3, 1}', '{"b", "c", "a"}', '{2.2, 3.3, 1.1}', '{"2
> minutes", "3 minutes", "1 minute"}'),
>

Previously, it wasn't actually testing the initial table sync, since all
tables were empty when the subscription was created.
I simply split the initially inserted data so that the initial table sync
is tested as well.

With this patch, it inserts the first two rows for all data types before
subscriptions get created.
You can see these lines:

> +# Insert initial test data
> +$node_publisher->safe_psql(
> +   'postgres', qq(
> +   -- test_tbl_one_array_col
> +   INSERT INTO tst_one_array (a, b) VALUES
> +   (1, '{1, 2, 3}'),
> +   (2, '{2, 3, 1}');
> +
> +   -- test_tbl_arrays
> +   INSERT INTO tst_arrays (a, b, c, d) VALUES




> 6. BTW, the subbinary description is missing in pg_subscription docs
> https://www.postgresql.org/docs/devel/catalog-pg-subscription.html?
> -  Specifies whether the subscription will request the publisher to
> -  send the data in binary format (as opposed to text).
> +  Specifies whether the subscription will copy the initial data to
> +  synchronize relations in binary format and also request the
> publisher
> +  to send the data in binary format too (as opposed to text).
>

Done.


> 7. A nitpick - space is needed after >= before 16.
> +if (walrcv_server_version(LogRepWorkerWalRcvConn) >=16 &&
>

Done.


> 8. Note that the COPY binary format isn't portable across platforms
> (Windows to Linux for instance) or major versions
> https://www.postgresql.org/docs/devel/sql-copy.html, whereas, logical
> replication is
> https://www.postgresql.org/docs/devel/logical-replication.html.
> I don't see any handling of such cases in copy_table but only a check
> for the publisher version. I think we need to account for all the
> cases - allow binary COPY only when publisher and subscriber are of
> same versions, architectures, platforms. The trick here is how we
> identify if the subscriber is of the same type and taste
> (architectures and platforms) as the publisher. Looking for
> PG_VERSION_STR of publisher and subscriber might be naive, but I'm not
> sure if there's a better way to do it.

Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-01 Thread Melih Mutlu
Hi,

Please see attached patches for the below changes.

shveta malik wrote on Fri, 27 Jan 2023 at 13:12:

> On Thu, Jan 26, 2023 at 7:53 PM Melih Mutlu 
> wrote:
> 1.
> --Can we please add a few more points to the Summary to make it more clear.
> a) something telling that reusability of workers is for tables under
> one subscription and not across multiple subscriptions.
> b) Since we are reusing both workers and slots, can we add:
> --when do we actually end up spawning a new worker
> --when do we actually end up creating a new slot in a worker rather
> than using existing one.
> --if we reuse existing slots, what happens to the snapshot?
>

I modified the commit message if that's what you mean by the Summary.


> 2.
> +   The last used ID for tablesync workers. This ID is used to
> +   create replication slots. The last used ID needs to be stored
> +   to make logical replication can safely proceed after any
> interruption.
> +   If sublastusedid is 0, then no table has been synced yet.
>
> --typo:
>  to make logical replication can safely proceed ==> to make logical
> replication safely proceed
>

Done


> 3.
> is_first_run;
> move_to_next_rel;
> --Do you think one variable is enough here as we do not get any extra
> info by using 2 variables? Can we have 1 which is more generic like
> 'ready_to_reuse'. Otherwise, please let me know if we must use 2.
>

Right. Removed is_first_run and renamed move_to_next_rel as ready_to_reuse.


> 4.
> /* missin_ok = true, since the origin might be already dropped. */
> typo: missing_ok
>

Done.


> 5. GetReplicationSlotNamesBySubId:
> errmsg("not tuple returned."));
>
> Can we have a better error msg:
> ereport(ERROR,
> errmsg("could not receive list of slots
> associated with subscription %d, error: %s", subid, res->err));
>

Done.


> 6.
> static void
> clean_sync_worker(void)
>
> --can we please add introductory comment for this function.
>

Done.


> 7.
> /*
>  * Pick the table for the next run if there is not another worker
>  * already picked that table.
>  */
> Pick the table for the next run if it is not already picked up by
> another worker.
>

Done.


> 8.
> process_syncing_tables_for_sync()
>
> /* Cleanup before next run or ending the worker. */
> --can we please improve this comment:
> if there is no more work left for this worker, stop the worker
> gracefully, else do clean-up and make it ready for the next
> relation/run.
>

Done


> 9.
> LogicalRepSyncTableStart:
>  * Read previous slot name from the catalog, if exists.
>  */
> prev_slotname = (char *) palloc0(NAMEDATALEN);
> Do we need to free this at the end?
>

Pfree'd prev_slotname after we're done with it.


> 10.
> if (strlen(prev_slotname) == 0)
>     {
> elog(ERROR, "Replication slot could not be
> found for relation %u",
>  MyLogicalRepWorker->relid);
> }
> shall we mention subid also in error msg.
>

Done.

Thanks for reviewing,
-- 
Melih Mutlu
Microsoft


v5-0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch
Description: Binary data


v9-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-01 Thread Melih Mutlu
Hi,

I mistakenly attached v9 in my previous email.
Please see the attached v6 and v10 for the previous and the below changes.

shveta malik wrote on Tue, 31 Jan 2023 at 12:59:

> On Fri, Jan 27, 2023 at 3:41 PM shveta malik 
> wrote:
> 1)
> REPLICATION_SLOT_SNAPSHOT
> --Do we need 'CREATE' prefix with it i.e. CREATE_REPLICATION_SNAPSHOT
> (or some other brief one with CREATE?).  'REPLICATION_SLOT_SNAPSHOT'
> does not look like a command/action and thus is confusing.
>

Renamed it as CREATE_REPLICATION_SNAPSHOT


> 2)
> is used in the currenct transaction. This command is currently only
> supported
> for logical replication.
> slots.
> --typo: currenct-->current
> --slots can be moved to previous line
>

Done.


> 3)
> /*
>  * Signal that we don't need the timeout mechanism. We're just creating
>  * the replication slot and don't yet accept feedback messages or send
>  * keepalives. As we possibly need to wait for further WAL the walsender
>  * would otherwise possibly be killed too soon.
>  */
> We're just creating the replication slot --> We're just creating the
> replication snapshot
>

Done.


> 4)
> I see XactReadOnly check in CreateReplicationSlot, do we need the same
> in ReplicationSlotSnapshot() as well?
>

Added this check too.


> ===
> v9-0002:
>
> 5)
>   /* We are safe to drop the replication trackin origin after this
> --typo: tracking
>

Done.


> 6)
> slot->data.catalog_xmin = xmin_horizon;
> slot->effective_xmin = xmin_horizon;
> SpinLockRelease(&slot->mutex);
> xmin_horizon =
> GetOldestSafeDecodingTransactionId(!need_full_snapshot);
> ReplicationSlotsComputeRequiredXmin(true);
>
> --do we need to set xmin_horizon in slot after
> 'GetOldestSafeDecodingTransactionId' call, otherwise it will be set to
> InvalidId in slot. Is that intentional? I see that we do set this
> correct xmin_horizon in builder->initial_xmin_horizon but the slot is
> carrying Invalid one.
>

I think you're right. Moved GetOldestSafeDecodingTransactionId call before
xmin_horizon assignment.

Thanks,
-- 
Melih Mutlu
Microsoft


v6-0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch
Description: Binary data


v10-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-01 Thread Melih Mutlu
Hi,

wangw.f...@fujitsu.com wrote on Tue, 31 Jan 2023 at 13:40:

> Sorry, I forgot to add the link to the email. Please refer to [1].
>
> [1] -
> https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com


Thanks for pointing out this review. I somehow skipped that, sorry.

Please see attached patches.

shiy.f...@fujitsu.com wrote on Tue, 17 Jan 2023 at 10:46:

> On Wed, Jan 11, 2023 4:31 PM Melih Mutlu  wrote:
> 0001 patch
> 
> 1. walsender.c
> +   /* Create a tuple to send consisten WAL location */
>
> "consisten" should be "consistent" I think.
>

Done.


> 2. logical.c
> +   if (need_full_snapshot)
> +   {
> +   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> +
> +   SpinLockAcquire(&slot->mutex);
> +   slot->effective_catalog_xmin = xmin_horizon;
> +   slot->data.catalog_xmin = xmin_horizon;
> +   slot->effective_xmin = xmin_horizon;
> +   SpinLockRelease(&slot->mutex);
> +
> +   xmin_horizon =
> GetOldestSafeDecodingTransactionId(!need_full_snapshot);
> +   ReplicationSlotsComputeRequiredXmin(true);
> +
> +   LWLockRelease(ProcArrayLock);
> +   }
>
> It seems that we should first get the safe decoding xid, then inform the
> slot
> machinery about the new limit, right? Otherwise the limit will be
> InvalidTransactionId and that seems inconsistent with the comment.
>

You're right. Moved that call before assigning xmin_horizon.


> 3. doc/src/sgml/protocol.sgml
> +   is used in the currenct transaction. This command is currently
> only supported
> +   for logical replication.
> +   slots.
>
> We don't need to put "slots" in a new line.
>

Done.


> 0002 patch
> 
> 1.
> In pg_subscription_rel.h, I think the type of "srrelslotname" can be
> changed to
> NameData, see "subslotname" in pg_subscription.h.
>
> 2.
> +* Find the logical replication sync
> worker if exists store
> +* the slot number for dropping associated
> replication slots
> +* later.
>
> Should we add comma after "if exists"?
>

Done.

> 3.
> +   PG_FINALLY();
> +   {
> +   pfree(cmd.data);
> +   }
> +   PG_END_TRY();
> +   \
> +   return tablelist;
> +}
>
> Do we need the backslash?
>

Removed it.


> 4.
> +   /*
> +* Advance to the LSN got from walrcv_create_slot. This is WAL
> +* logged for the purpose of recovery. Locks are to prevent the
> +* replication origin from vanishing while advancing.
>
> "walrcv_create_slot" should be changed to
> "walrcv_create_slot/walrcv_slot_snapshot" I think.


Right, done.


> 5.
> +   /* Replication drop might still exist. Try to drop
> */
> +   replorigin_drop_by_name(originname, true, false);
>
> Should "Replication drop" be "Replication origin"?
>

Done.


> 6.
> I saw an assertion failure in the following case, could you please look
> into it?
> The backtrace is attached.
>
> -- pub
> CREATE TABLE tbl1 (a int, b text);
> CREATE TABLE tbl2 (a int primary key, b text);
> create publication pub for table tbl1, tbl2;
> insert into tbl1 values (1, 'a');
> insert into tbl1 values (1, 'a');
>
> -- sub
> CREATE TABLE tbl1 (a int primary key, b text);
> CREATE TABLE tbl2 (a int primary key, b text);
> create subscription sub connection 'dbname=postgres port=5432' publication
> pub;
>
> Subscriber log:
> 2023-01-17 14:47:10.054 CST [1980841] LOG:  logical replication apply
> worker for subscription "sub" has started
> 2023-01-17 14:47:10.060 CST [1980843] LOG:  logical replication table
> synchronization worker for subscription "sub", table "tbl1" has started
> 2023-01-17 14:47:10.070 CST [1980845] LOG:  logical replication table
> synchronization worker for subscription "sub", table "tbl2" has started
> 2023-01-17 14:47:10.073 CST [1980843] ERROR:  duplicate key value violates
> unique constraint "tbl1_pkey"
> 2023-01-17 14:47:10.073 CST [1980843] DETAIL:  Key (a)=(1) already exists.
> 2023-01-17 14:47:10.073 CST [1980843] CONTEXT:  COPY tbl1, line 2
> 2023-01-17 14:47:10.074 CST [1980821] LOG:  background worker "logical
> replication worker" (PID 1980843) exited with exit code 1
> 2023-

Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-01 Thread Melih Mutlu
Hi Shveta,

shveta malik wrote on Wed, 1 Feb 2023 at 15:01:

> On Wed, Feb 1, 2023 at 5:05 PM Melih Mutlu  wrote:
> 2) I found a crash in the previous patch (v9), but have not tested it
> on the latest yet. Crash happens when all the replication slots are
> consumed and we are trying to create new. I tweaked the settings like
> below so that it can be reproduced easily:
> max_sync_workers_per_subscription=3
> max_replication_slots = 2
> and then ran the test case shared by you. I think there is some memory
> corruption happening. (I did test in debug mode, have not tried in
> release mode). I tried to put some traces to identify the root-cause.
> I observed that worker_1 keeps on moving from 1 table to another table
> correctly, but at some point, it gets corrupted i.e. origin-name
> obtained for it is wrong and it tries to advance that and since that
> origin does not exist, it  asserts and then something else crashes.
> From log: (new trace lines added by me are prefixed by shveta, also
> tweaked code to have my comment 1 fixed to have clarity on worker-id).
>
> form below traces, it is clear that worker_1 was moving from one
> relation to another, always getting correct origin 'pg_16688_1', but
> at the end it got 'pg_16688_49' which does not exist. Second part of
> trace shows who updated 'pg_16688_49', it was done by worker_49 which
> even did not get chance to create this origin due to max_rep_slot
> reached.
>

Thanks for investigating this error. I think it's the same error as the
one Shi reported earlier [1].
I couldn't reproduce it yet but will apply your tweaks and try again.
Looking into this.

[1]
https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com


Thanks,
-- 
Melih Mutlu
Microsoft


Re: Allow logical replication to copy tables in binary format

2023-02-16 Thread Melih Mutlu
Hi,

Please see the attached patch for following changes.

Bharath Rupireddy wrote on Mon, 30 Jan 2023 at 15:34:

> On Mon, Jan 30, 2023 at 4:19 PM Melih Mutlu 
> wrote:

It'd be better and clearer to have a separate TAP test file IMO since
> the focus of the feature here isn't to just test for data types. With
> separate tests, you can verify "ERROR:  incorrect binary data format
> in logical replication column 1" cases.
>

Moved some tests from 002_types.pl to 014_binary.pl, since that is where
most binary features are tested. It now covers "incorrect data format"
cases too.
Also added some regression tests for the copy_format parameter.


> With the above said, do you think checking for publisher versions is
> needed? The user can decide to enable/disable binary COPY based on the
> publisher's version no?
> +/* If the publisher is v16 or later, specify the format to copy data.
> */
> +if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 16)
> +{
>

If the user decides to enable it, then it might be nice not to allow it
for previous versions.
But I'm not sure. I'm okay with removing it if you all agree.


> Few more comments on v7:
> 1.
> +  Specifies the format in which pre-existing data on the
> publisher will
> +  copied to the subscriber. Supported formats are
> +  text and binary. The
> default is
> +  text.
> It'll be good to call out the cases in the documentation as to where
> copy_format can be enabled and needs to be disabled.
>

Modified that description a bit. Can you check if that's okay now?


> 2.
> + errmsg("%s value should be either \"text\" or \"binary\"",
> How about "value must be either "?
>

Done


> 3.
> Why should the subscription's binary option and copy_format option be
> tied at all? Tying these two options hurts usability. Is there a
> fundamental reason? I think they both are/must be independent. One
> deals with data types and another deals with how initial table data is
> copied.
>

My initial purpose with this patch was just to make subscriptions with the
binary option enabled fully binary, from initial copy to apply. Things
changed a bit when we decided to move binary copy behind a parameter.
I didn't actually think there would be any use case where a user wants the
initial copy to be in binary format for a sub with binary = false. Do you
think it would be useful to copy in binary even for a sub with binary
disabled?

Thanks,
-- 
Melih Mutlu
Microsoft


v8-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-02-16 Thread Melih Mutlu
Hi,

Thanks for reviewing. Please see the v8 here [1]

Takamichi Osumi (Fujitsu) wrote on Wed, 1 Feb 2023 at 06:05:

> (1) general comment
>
> I wondered if the addition of the new option/parameter can introduce some
> confusion to the users.
>
> case 1. When binary = true and copy_format = text, the table sync is
> conducted by text.
> case 2. When binary = false and copy_format = binary, the table sync is
> conducted by binary.
> (Note that the case 2 can be accepted after addressing the 3rd comment of
> Bharath-san in [1].
> I agree with the 3rd comment by itself.)
>

I replied to Bharath's comment [1]; can you please check whether that
makes sense?


> The name of the new subscription parameter looks confusing.
> How about "init_sync_format" or something ?
>

The option that enables the initial sync is named "copy_data", so I named
the new parameter "copy_format" to refer to the copy meant by "copy_data".
Maybe "copy_data_format" would be better. I can change it if you think
that's clearer.


> (2) The commit message doesn't get updated.
>

Done


> (3) whitespace errors.
>

Done


> (4) pg_dump.c
>

Done


> (5) describe.c
>

Done


> (6)
>
> + * Extract the copy format value from a DefElem.
> + */
> +char
> +defGetCopyFormat(DefElem *def)
>
> Shouldn't this function be static and remove the change of
> subscriptioncmds.h ?
>

I wanted to make "defGetCopyFormat" be consistent with
"defGetStreamingMode" since they're basically doing the same work for
different parameters. And that function isn't static, so I didn't make
"defGetCopyFormat" static too.


> (7) catalogs.sgml
>

Done

(8) create_subscription.sgml
>

Done

Also;

The latest patch doesn't get updated for more than two weeks
> after some review comments. If you don't have time,
> I would like to help updating the patch for you and other reviewers.
>
> Kindly let me know your status.
>

 Sorry for the delay. This patch is currently one of my priorities.
Hopefully I will share quicker updates from now on.

[1]
https://www.postgresql.org/message-id/CAGPVpCQYi9AYQSS%3DRmGgVNjz5ZEnLB8mACwd9aioVhLmbgiAMA%40mail.gmail.com


Thanks,
-- 
Melih Mutlu
Microsoft


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-02-16 Thread Melih Mutlu
Hi Shveta and Shi,

Thanks for your investigations.

shveta malik wrote on Wed, 8 Feb 2023 at 16:49:

> On Tue, Feb 7, 2023 at 8:18 AM shiy.f...@fujitsu.com
>  wrote:
> > I reproduced the problem I reported before with latest patch (v7-0001,
> > v10-0002), and looked into this problem. It is caused by a similar
> reason. Here
> > is some analysis for the problem I reported [1].#6.
> >
> > First, a tablesync worker (worker-1) started for "tbl1", its originname
> is
> > "pg_16398_1". And it exited because of unique constraint. In
> > LogicalRepSyncTableStart(), originname in pg_subscription_rel is updated
> when
> > updating table state to DATASYNC, and the origin is created when
> updating table
> > state to FINISHEDCOPY. So when it exited with state DATASYNC , the
> origin is not
> > created but the originname has been updated in pg_subscription_rel.
> >
> > Then a tablesync worker (worker-2) started for "tbl2", its originname is
> > "pg_16398_2". After tablesync of "tbl2" finished, this worker moved to
> sync
> > table "tbl1". In LogicalRepSyncTableStart(), it got the originname of
> "tbl1" -
> > "pg_16398_1", by calling ReplicationOriginNameForLogicalRep(), and tried
> to drop
> > the origin (although it is not actually created before). After that, it
> called
> > replorigin_by_name to get the originid whose name is "pg_16398_1" and
> the result
> > is InvalidOid. Origin won't be created in this case because the sync
> worker has
> > created a replication slot (when it synced tbl2), so the originid was
> still
> > invalid and it caused an assertion failure when calling
> replorigin_advance().
> >
> > It seems we don't need to drop previous origin in worker-2 because the
> previous
> > origin was not created in worker-1. I think one way to fix it is to not
> update
> > originname of pg_subscription_rel when setting state to DATASYNC, and
> only do
> > that when setting state to FINISHEDCOPY. If so, the originname in
> > pg_subscription_rel will be set at the same time the origin is created.
>
> +1. Update of system-catalog needs to be done carefully and only when
> origin is created.
>

I see that setting originname in the catalog before actually creating the
origin causes issues. My concern with setting originname when setting the
state to FINISHEDCOPY is that, if the worker waits until FINISHEDCOPY to
update the slot/origin names but fails somewhere after creating the
slot/origin and before reaching FINISHEDCOPY, then the newly created
slot/origin will be left behind. It wouldn't be possible to find and drop
them, since their names are not stored in the catalog. Eventually, this
might also mean hitting the max_replication_slots limit in case of such
failures between origin creation and FINISHEDCOPY.

One fix I can think of is to update the catalog right after creating a new
origin. But this would also require committing the current transaction to
actually persist the originname. I guess committing the transaction in the
middle of the initial sync could hurt the copy process.

What do you think?

Also, I'm working on an updated patch to address your other comments.
Thanks again.

Best,
-- 
Melih Mutlu
Microsoft


Re: Allow logical replication to copy tables in binary format

2023-02-20 Thread Melih Mutlu
Hi,

Hayato Kuroda (Fujitsu) wrote on Mon, 20 Feb 2023 at 10:12:

> Dear Melih,
>
> Thank you for updating the patch. Before reviewing, I found that
> cfbot have not accepted v8 patch [1].
>

Thanks for letting me know.
Attached the fixed version of the patch.

Best,
-- 
Melih Mutlu
Microsoft


v9-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2023-02-20 Thread Melih Mutlu
Amit Kapila wrote on Thu, 16 Feb 2023 at 15:47:

> On Mon, Jan 30, 2023 at 4:19 PM Melih Mutlu 
> wrote:
> >> 8. Note that the COPY binary format isn't portable across platforms
> >> (Windows to Linux for instance) or major versions
> >> https://www.postgresql.org/docs/devel/sql-copy.html, whereas, logical
> >> replication is
> https://www.postgresql.org/docs/devel/logical-replication.html.
> >> I don't see any handling of such cases in copy_table but only a check
> >> for the publisher version. I think we need to account for all the
> >> cases - allow binary COPY only when publisher and subscriber are of
> >> same versions, architectures, platforms. The trick here is how we
> >> identify if the subscriber is of the same type and taste
> >> (architectures and platforms) as the publisher. Looking for
> >> PG_VERSION_STR of publisher and subscriber might be naive, but I'm not
> >> sure if there's a better way to do it.
> >
> >
> > I think having the "copy_format" option with text as default, like I
> replied to your 3rd review above, will keep things as they are now.
> > The patch now will only allow users to choose binary copy as well, if
> they want it and acknowledge the limitations that come with binary copy.
> > COPY command's portability issues shouldn't be an issue right now, since
> the patch still supports text format. Right?
> >
>
> One thing that is not completely clear from above is whether we will
> have any problem if the subscription uses binary mode for copying
> across the server versions. Do we use binary file during the copy used
> in logical replication?
>

Since binary copy relies on the COPY command, we may have problems across
different server versions in cases where COPY is not portable.
From what I understand from [1], COPY works across server versions later
than 7.4, so this shouldn't be a problem for logical replication.
Currently the patch also does not allow binary copy if the publisher
version is older than 16.

> > Logical replication between different types like int and smallint is
already not
> working properly on HEAD too.
> > Yes, the scenario you shared looks like working. But you didn't create
the
> subscription with binary=true. The patch did not change subscription with
> binary=false case. I believe what you should experiment is binary=true
case
> which already fails in the apply phase on HEAD.
> >
> > Well, with this patch, it will begin to fail in the table copy phase.
But I don't
> think this is a problem because logical replication in binary format is
already
> broken for replications between different data types.
> >
>

> So, doesn't this mean that there is no separate failure mode during
> the initial copy? I am clarifying this to see if the patch really
> needs a separate copy_format option for initial sync?
>

It will fail in a case such as [2], while it would work on HEAD.
What I meant by my comment above was that binary-enabled subscriptions
already don't work properly if they replicate between different types. So
the failure caused by replicating, for example, from smallint to int is not
really introduced by this patch; such subscriptions would fail in the apply
phase anyway. With this patch, they will fail during the binary copy
instead.
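
To make that concrete, a minimal sketch of such a case (hypothetical
names; based on the smallint/int example referenced above):

-- publisher
CREATE TABLE tab (a smallint);
CREATE PUBLICATION pub FOR TABLE tab;

-- subscriber: text-compatible but different type
CREATE TABLE tab (a integer);
CREATE SUBSCRIPTION sub
    CONNECTION 'dbname=postgres port=5432'
    PUBLICATION pub
    WITH (binary = true, copy_format = 'binary');

On HEAD (text copy), the initial sync would succeed and only the binary
apply of later changes would fail; with binary copy, the failure simply
moves to the initial COPY.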

[1] https://www.postgresql.org/docs/current/sql-copy.html
[2]
https://www.postgresql.org/message-id/OSZPR01MB6310B58F069FF8E148B247FDFDA49%40OSZPR01MB6310.jpnprd01.prod.outlook.com

Best,
-- 
Melih Mutlu
Microsoft


Re: Summary function for pg_buffercache

2022-09-23 Thread Melih Mutlu
Hi Andres,

Adjusted the patch so that it will work with meson now.

I also addressed your other review comments.
I hope the explanations in the comments/docs are better now.

Best,
Melih


v11-0001-Added-pg_buffercache_summary-function.patch
Description: Binary data


Re: Summary function for pg_buffercache

2022-09-28 Thread Melih Mutlu
Hi all,

The patch needed a rebase due to recent changes to pg_buffercache.
You can find the updated version attached.

Best,
Melih


v12-0001-Added-pg_buffercache_summary-function.patch
Description: Binary data


Re: Summary function for pg_buffercache

2022-09-28 Thread Melih Mutlu
Zhang Mingli wrote on Wed, 28 Sep 2022 at 17:31:

> Why compute usagecount_avg twice?
>

I should have removed the first one, but I think I missed it.
Nice catch.

Attached an updated version.

Thanks,
Melih


v13-0001-Added-pg_buffercache_summary-function.patch
Description: Binary data


Re: Summary function for pg_buffercache

2022-09-28 Thread Melih Mutlu
Hi,

Seems like commit a448e49bcbe40fb72e1ed85af910dd216d45bad8 reverts the
changes to pg_buffercache.

> Why compute usagecount_avg twice?

Then, I'm going back to v11 + the fix for this.

Thanks,
Melih


v14-0001-Added-pg_buffercache_summary-function.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2022-10-03 Thread Melih Mutlu
Hi Takamichi,

Thanks for your reviews.

I addressed your reviews, please find the attached patch.

osumi.takami...@fujitsu.com wrote on Fri, 16 Sep 2022 at 16:51:

> (1) whitespace issues
>

Fixed

> (2) Suggestion to update another general description about the subscription
>
> Kindly have a look at doc/src/sgml/logical-replication.sgml.
>
> "The data types of the columns do not need to match,
> as long as the text representation of the data can be converted to the
> target type.
> For example, you can replicate from a column of type integer to a column
> of type bigint."
>
> With the patch, I think we have an impact about those descriptions
> since enabling the binary option for a subscription and executing the
> initial synchronization requires the same data types for binary format.
>
> I suggest that we update those descriptions as well.
>

You're right, this needs to be stated in the docs. Modified descriptions
accordingly.


> (3) shouldn't test that we fail expectedly with binary copy for different
> types ?
>
> How about having a test that we correctly fail with different data types
> between the publisher and the subscriber, for instance ?
>

Modified the 002_types.pl test such that it now tests replication between
different data types.
It's expected to fail if binary is enabled, and succeed if not.


> (4) Error message of the binary format copy
>
> I've gotten below message from data type contradiction (between integer
> and bigint).
> Probably, this is unclear for the users to understand the direct cause
> and needs to be adjusted ?
> This might be a similar comment Euler mentioned in [1].
>
> 2022-09-16 11:54:54.835 UTC [4570] ERROR:  insufficient data left in
> message
> 2022-09-16 11:54:54.835 UTC [4570] CONTEXT:  COPY tab, line 1, column id
>

It's already unclear to users what the issue is if they're copying data
between different column types via the COPY command.
This error comes from COPY, and logical replication just utilizes COPY.
I don't think it would make sense to adjust an error message from
functionality which logical replication only uses and has no direct impact
on.
It might be better to do this in a separate patch. What do you think?


> (5) Minor adjustment of the test comment in 002_types.pl.
>
> +is( $result, $sync_result, 'check initial sync on subscriber');
> +is( $result_binary, $sync_result, 'check initial sync on subscriber in
> binary');
>
>  # Insert initial test data
>
> There are two same comments which say "Insert initial test data" in this
> file.
> We need to update them, one for the initial table sync and
> the other for the application of changes.
>

Fixed.

> I agree with the direction to support binary copy for v16 and later.
>
> IIUC, the binary format replication with different data types fails even
> during apply phase on HEAD.
> I thought that means, the upgrade concern only applies to a scenario that
> the user executes
> only initial table synchronizations between the publisher and subscriber
> and doesn't replicate any data at apply phase after that. I would say
> this isn't a valid scenario and your proposal makes sense.
>

No, logical replication in binary does not fail in the apply phase if data
types are different.
The concern with upgrades (if data types are not the same) would be not
being able to create a new subscription with binary enabled, or not being
able to replicate new tables added into the publication.
Replication of tables from existing subscriptions would not be affected by
this change since they will already be in the apply phase, not tablesync.
Do you think this would still be an issue?


Thanks,
Melih


v3-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Mingw task for Cirrus CI

2022-10-07 Thread Melih Mutlu
Hi hackers.

Andres Freund wrote on Thu, 22 Sep 2022 at 20:03:

> Now that meson has been merged, could you try to adjust this patch so it
> builds with meson?
>

Attached patch is the adjusted version to build with meson.

> Add a commented-out trigger-type: manual and note in the commit message
> that the committer needs to adjust that piece.
>

Done.

> > Also: your original patch said --host=x86_64-w64-mingw32, but the task
> > is called MinGW64.  The most recent patches don't use --host at all, and
> > were building for a 32 bit environment, even though the OS image says
> > MSYSTEM=UCRT64.
>
> I don't think you should need to use --host, that indicates cross
> compiling, which we shouldn't do. I don't think the patch without --host
> targets 32bit - the last time CI ran the patch, it built a 64bit PG:


I also removed this with the meson changes. I can confirm that it builds
64-bit.

Justin Pryzby wrote on Mon, 5 Sep 2022 at 14:50:

> I don't know what direction that idea is going, but it makes working
> with this patch a bit easier when configure is less slow.  Fine with me
> to split it into a separate patch :)


If it's okay, then I would suggest not including that piece in this patch,
since I don't know what direction this is going either.


The patch has been changed quite a bit due to meson integration.
I would appreciate any feedback.

Best,
Melih


v6-0001-Added-Windows-with-MinGW-environment-in-Cirrus-CI.patch
Description: Binary data


Re: Mingw task for Cirrus CI

2022-10-18 Thread Melih Mutlu
Hi Andres,

Andres Freund wrote on Tue, 11 Oct 2022 at 21:23:

> I think it might be easier to just set MSYS=winjitdebug
>
> https://www.msys2.org/wiki/JIT-Debugging/#native-windows-processes-started-from-msys2
> and then rely on the existing windows crash reporting stuff.
>

Done.


> You removed a bunch of newlines here and nearby - I assume that wasn't
> intentional?
>

Yes, that was my mistake. Thanks for pointing it out. Fixed.


> There's no need to use dash anymore, the amount of shell script run is
> minimal.
>

Switched back to bash.


> I think the "cd"s here are superfluous given the ninja -C %BUILD_DIR% etc.
>

You're right. Those were needed when it was building with autoconf. Not
anymore. Removed.

> Only remembered that just after sending my email: When using b_pch=true
> (which
> saves a lot of time on mingw) ccache won't cache much (IIRC just the pch
> files
> themselves) unless you do something like
> export CCACHE_SLOPPINESS=pch_defines,time_macros
>

Added the CCACHE_SLOPPINESS and CCACHE_DEPEND variables.

You can find the updated patch attached.

Best,
Melih


v7-0001-Added-Windows-with-MinGW-environment-in-Cirrus-CI.patch
Description: Binary data


Re: Mingw task for Cirrus CI

2022-10-18 Thread Melih Mutlu
Hi Bilal,

Nazir Bilal Yavuz wrote on Tue, 18 Oct 2022 at 15:37:

> env:
> ...
> CHERE_INVOKING: 1
> BASH_EXE: C:\msys64\usr\bin\bash.exe
>
> 'CHERE_INVOKING: 1' will cause bash.exe to start from current working
> directory(%CIRRUS_WORKING_DIR%).
>
> In this way, there is no need for
> 'cd %CIRRUS_WORKING_DIR%' when using bash.exe,
> also no need for 'BUILD_DIR' environment variable.
>

Right, setting CHERE_INVOKING and removing all the cd's works and looks
better. Thanks for the suggestion.
Thanks for the suggestion.

Here's the updated patch.

Best,
Melih


v8-0001-Added-Windows-with-MinGW-environment-in-Cirrus-CI.patch
Description: Binary data


Re: Mingw task for Cirrus CI

2022-10-19 Thread Melih Mutlu
Hi,

Andres Freund wrote on Wed, 19 Oct 2022 at 06:19:

> It's a bit odd to separate the CCACHE_* variables from each other
> (e.g. BUILD_DIR is inbetween them)...
>
> Perhaps add a comment explaining that otherwise plpython tests fail?
>
> With this src/tools/ci/cores_backtrace.sh shouldn't need to be modified
> anymore afaict?
>
> I'd replace --buildtype debug with -Ddebug=true -Doptimization=g.
>
> Seems like this could use %MTEST_ARGS%?

All the things you mentioned above are done.



> > +CCACHE_SLOPPINESS: pch_defines,time_macros
> > +CCACHE_DEPEND: 1
>
> I experimented a bit and it looks like ccache doesn't yet quite work in CI,
> but only because the ccache needs to be larger. Looks like we need about
> ~400MB.
>
> A fully cached build is ~2min
>

Then I should increase CCACHE_MAXSIZE, right? Made it 500MB for MinGW.

Sharing the updated version of the patch.

Thanks,
Melih


v9-0001-Added-Windows-with-MinGW-environment-in-Cirrus-CI.patch
Description: Binary data


Re: Mingw task for Cirrus CI

2022-10-24 Thread Melih Mutlu
Hi,


> > +
> > +  on_failure:
> > +<<: *on_failure_meson
> > +cores_script: |
> > +  %BASH_EXE% -lc "cd build src/tools/ci/cores_backtrace.sh msys
> build/tmp_install"
> > +
>
> This is wrong - it should just archive the same files that the current
> windows task does.
>

I replaced it with the on_failure from the other Windows task.


> Other than that, I think this is basically ready?
>

If you say so, then I think it's ready.

Best,
Melih


v10-0001-Added-Windows-with-MinGW-environment-in-Cirrus-CI.patch
Description: Binary data


[PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-07-05 Thread Melih Mutlu
Hi hackers,

I created a patch to reuse tablesync workers and their replication slots
for further tables that are not synced yet, so that the overhead of
creating and dropping workers/replication slots can be reduced.

The current version of logical replication has two steps: tablesync and
apply.
In the tablesync step, the apply worker creates a tablesync worker for each
table, and those tablesync workers exit when they're done with their
associated table. (The number of tablesync workers running at the same time
is limited by "max_sync_workers_per_subscription".)
Each tablesync worker also creates a replication slot on the publisher
during its lifetime and drops the slot before exiting.

The purpose of this patch is to get rid of the overhead of creating/killing
a new worker (and replication slot) for each table.
It aims to reuse tablesync workers and their replication slots so that
tablesync workers can copy multiple tables from the publisher to the
subscriber during their lifetime.

The benefits of reusing tablesync workers can be significant if tables are
empty or close to empty.
In the empty-table case, spawning tablesync workers and handling
replication slots are where most of the time is spent, since the actual
copy phase takes very little time.


The changes in the behaviour of tablesync workers with this patch are as
follows (see the sketch after this list):
1- After the tablesync worker is done syncing the current table, it takes
a lock and fetches the tables in init state
2- It looks for a table that is not already being synced by another worker
among the tables in init state
3- If it finds one, it updates its state for the new table and loops back
to the beginning to start syncing
4- If no table is found, it drops the replication slot and exits
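
As referenced above, a sketch of step 1 in catalog terms (the patch does
this in C with the appropriate locking; 16394 is a made-up subscription
OID):

-- Tables of subscription 16394 still in init ('i') state, i.e. candidates
-- a reused tablesync worker could pick up next.
SELECT srrelid::regclass
FROM pg_subscription_rel
WHERE srsubid = 16394
  AND srsubstate = 'i';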


With those changes, I did some benchmarking to see if it improves anything.
These results compare this patch with the latest version of the master
branch. "max_sync_workers_per_subscription" is set to its default of 2.
The numbers are averages of 5 consecutive runs for each branch.

First, I tested logical replication with empty tables.

10 tables
- master: 286.964 ms
- the patch: 116.852 ms

100 tables
- master: 2785.328 ms
- the patch: 706.817 ms

10K tables
- master: 39612.349 ms
- the patch: 12526.981 ms

Then I tried replicating tables with some data.

10 tables loaded with 10MB data
- master: 1517.714 ms
- the patch: 1399.965 ms

100 tables loaded with 10MB data
- master: 16327.229 ms
- the patch: 11963.696 ms

Then I loaded more data.

10 tables loaded with 100MB data
- master: 13910.189 ms
- the patch: 14770.982 ms

100 tables loaded with 100MB data
- master: 146281.457 ms
- the patch: 156957.512 ms


If tables are mostly empty, the improvement can be significant - up to 3x
faster logical replication.
With some data loaded, it can still be faster to some extent.
When the table size increases more, the advantage of reusing workers
becomes insignificant.


I would appreciate your comments and suggestions. Thanks in advance for
reviewing.

Best,
Melih


0001-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-07-08 Thread Melih Mutlu
Hi Amit and Dilip,

Thanks for the replies.


> > I had a quick look into the patch and it seems it is using the worker
> > array index instead of relid while forming the slot name
>

Yes, I changed the slot names so they include slot index instead of
relation id.
This was needed because I aimed to separate replication slots from
relations.

> I think that won't work because each time on restart the slot won't be
> fixed. Now, it is possible that we may drop the wrong slot if that
> state of copying rel is SUBREL_STATE_DATASYNC. Also, it is possible
> that while creating a slot, we fail because the same name slot already
> exists due to some other worker which has created that slot has been
> restarted. Also, what about origin_name, won't that have similar
> problems? Also, if the state is already SUBREL_STATE_FINISHEDCOPY, if
> the slot is not the same as we have used in the previous run of a
> particular worker, it may start WAL streaming from a different point
> based on the slot's confirmed_flush_location.
>

You're right, Amit. In case of a failure, the tablesync phase of a relation
may continue with a different worker and replication slot due to this
change in naming.
It seems the same replication slot should be used from start to end for a
relation during tablesync. However, creating/dropping replication slots can
be a major overhead in some cases.
It would be nice if these slots were somehow reused.

To overcome this issue, I've been thinking about making some changes in my
patch.
So far, my proposal would be as follows:

Slot naming can be like pg_<subid>_<worker_pid> instead of
pg_<subid>_<relid>. This way each worker can use the same replication slot
during its lifetime.
But if a worker is restarted, then it will switch to a new replication slot
since its pid has changed.
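
Illustratively (using the placeholders above; 16394 is a made-up
subscription OID):

-- A slot name built from the subscription OID and the worker PID. A
-- restarted worker gets a new PID and therefore a new slot name.
SELECT format('pg_%s_%s', 16394, pg_backend_pid()) AS slot_name;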

The pg_subscription_rel catalog can store the replication slot name for
each non-ready relation. Then we can find the slot needed for that
particular relation to complete tablesync.
If a worker syncs a relation without any error, everything works well and
this new replication slot column from the catalog will not be needed.
However, if a worker is restarted due to a failure, the previous run of
that worker left its slot behind since it did not exit properly.
The restarted worker (with a different pid) will see that the relation is
actually in SUBREL_STATE_FINISHEDCOPY and want to proceed to the catchup
step.
Then the worker can look up that particular relation's replication slot in
the pg_subscription_rel catalog (the slot name should be there since the
relation state is not ready), and tablesync can proceed with that slot.
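
In catalog terms, the proposed recovery path would be something like the
following (srrelslotname is the column this proposal would add; it does not
exist in current releases, and the OIDs are made up):

-- Recover the slot left behind for a FINISHEDCOPY relation by the
-- previous, failed worker.
SELECT srrelslotname
FROM pg_subscription_rel
WHERE srsubid = 16394
  AND srrelid = 'public.tab'::regclass;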

There might be cases where some replication slots are left behind. An
example of such a case would be when the slot is removed from the
pg_subscription_rel catalog and detached from any relation, but the slot
itself couldn't be dropped for some reason. For such cases, slot cleanup
logic is needed. This cleanup can also be done by tablesync workers.
Whenever a tablesync worker is created, it can look for existing
replication slots that do not belong to any relation or any worker (the
slot name includes the pid for that), and drop those slots if it finds any.

What do you think about this new way of handling slots? Do you see any
points of concern?

I'm currently working on adding this change into the patch and would
appreciate any comments.

Thanks,
Melih


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-07-08 Thread Melih Mutlu
>
> It seems from your results that performance degrades for large
> relations. Did you try to investigate the reasons for the same?
>

I have not tried to investigate the performance degradation for large
relations yet.
Once I'm done with the changes for slot usage, I'll look into this and come
back with more findings.

Thanks,
Melih


Re: logical replication restrictions

2022-07-13 Thread Melih Mutlu
Hi Euler,

I've some comments/questions about the latest version (v4) of your patch.

Firstly, I think the patch needs a rebase. CI currently cannot apply it [1].

> 22. src/test/subscription/t/032_apply_delay.pl
>
> I received the following error when trying to run these 'subscription'
> tests:
>
> t/032_apply_delay.pl ... No such class log_location at
> t/032_apply_delay.pl line 49, near "my log_location"
> syntax error at t/032_apply_delay.pl line 49, near "my log_location ="
>

I'm having these errors too. Seems like some declarations are missing.


> +  specified amount of time. If this value is specified without units,
> +  it is taken as milliseconds. The default is zero, adding no delay.
> + 
> I'm also having an issue when I give min_apply_delay parameter without
units.
I expect that if I set  min_apply_delay to 5000 (without any unit), it will
be interpreted as 5000 ms.

I tried:
postgres=# CREATE SUBSCRIPTION testsub CONNECTION 'dbname=postgres
port=5432' PUBLICATION testpub WITH (min_apply_delay=5000);

And logs showed:
2022-07-13 20:26:52.231 +03 [5422] LOG:  logical replication apply delay:
499 ms
2022-07-13 20:26:52.231 +03 [5422] CONTEXT:  processing remote data for
replication origin "pg_18126" during "BEGIN" in transaction 3152 finished
at 0/465D7A0

Looks like it starts from 500 ms instead of 5000 ms for me. If I state
the unit as ms, then it works correctly.


Lastly, I have a question about this delay during tablesync.
It's stated here that the apply delay does not apply to the initial
tablesync:

 
>
> +  The delay occurs only on WAL records for transaction begins and after
> +  the initial table synchronization. It is possible that the
> +  replication delay between publisher and subscriber exceeds the value
> +  of this parameter, in which case no delay is added. Note that the
> +  delay is calculated between the WAL time stamp as written on
> +  publisher and the current time on the subscriber. Delays in logical
> +  decoding and in transfer the transaction may reduce the actual wait
> +  time. If the system clocks on publisher and subscriber are not
> +  synchronized, this may lead to apply changes earlier than expected.
> +  This is not a major issue because a typical setting of this parameter
> +  are much larger than typical time deviations between servers.
> + 
>
>
There might be a case where tablesync workers are in SYNCWAIT state,
waiting for the apply worker to tell them to CATCHUP.
If the apply worker is waiting in the apply_delay function, tablesync
workers will be stuck in SYNCWAIT state, and this might delay tablesync by
at least "min_apply_delay" or more.
Is that something we would want? What do you think?


[1] http://cfbot.cputube.org/patch_38_3581.log


Best,
Melih


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-07-27 Thread Melih Mutlu
Hi Amit,

I updated the patch in order to prevent the problems that might be caused
by using different replication slots for syncing a table.
As suggested in previous emails, replication slot names are stored in the
catalog. So slot names can be retrieved later, and it is ensured that the
same replication slot is used during the tablesync step of a table.

With the current version of the patch:

- A "srrelslotname" column is introduced into the pg_subscription_rel
catalog. It stores the slot name for tablesync.

- Tablesync worker logic is now as follows (see the sketch after this
list):
1. Tablesync worker is launched by the apply worker for a table.
2. The worker generates a default replication slot name for itself. The
slot name includes the subid and worker pid for tracking purposes.
3. If the table has a slot name value in the catalog:

i. If the table state is DATASYNC, drop the replication slot from the
catalog and proceed with tablesync using a new slot.

ii. If the table state is FINISHEDCOPY, use the replication slot from the
catalog; do not create a new slot.

4. Before the worker moves to a new table, drop any replication slot that
was retrieved from the catalog and used.
5. If no table is left to sync, drop the replication slot of that sync
worker (the one with its worker pid) if it exists. (It's possible that a
sync worker does not create a replication slot for itself but uses slots
read from the catalog in each iteration.)
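
As referenced above, steps 3-i and 4 amount to something like the following
on the publisher (the worker actually does this over the replication
connection; the slot name is illustrative):

-- Drop a leftover tablesync slot recorded in the catalog before creating
-- or reusing another one.
SELECT pg_drop_replication_slot('pg_16394_4821');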


> I think using worker_pid also has similar risks of mixing slots from
> different workers because after restart same worker_pid could be
> assigned to a totally different worker. Can we think of using a unique
> 64-bit number instead? This will be allocated when each workers
> started for the very first time and after that we can refer catalog to
> find it as suggested in the idea below.
>

I'm not sure how likely it is to have colliding pids for different
tablesync workers in the same subscription.
Though, having the pid in the slot name makes it easier to track which slot
belongs to which worker. That's why I kept using pids in slot names.
But it should be simple to switch to using a unique 64-bit number.
So I can remove pids from slot names if you think that would be better.


> We should use the same for the origin name as well.
>

I did not really change anything related to origin names. Origin names are
still the same and include relation id. What do you think would be an issue
with origin names in this patch?


> This sounds tricky. Why not first drop slot/origin and then detach it
> from pg_subscription_rel? On restarts, it is possible that we may
> error out after dropping the slot or origin but before updating the
> catalog entry but in such case we can ignore missing slot/origin and
> detach them from pg_subscription_rel. Also, if we use the unique
> number as suggested above, I think even if we don't remove it after
> the relation state is ready, it should be okay.
>

Right, I did not add an additional slot cleanup step. The patch now drops
the slot when we're done with it and then removes it from the catalog.


Thanks,
Melih


v2-0001-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Re: Mingw task for Cirrus CI

2022-07-28 Thread Melih Mutlu
Hi hackers,

I'm sharing the rebased version of this patch, if you're still interested.

I would appreciate any feedback or concerns.

Best,
Melih


Andrew Dunstan , 9 Nis 2022 Cmt, 19:34 tarihinde şunu
yazdı:

>
> On 4/8/22 21:02, Andres Freund wrote:
> > Hi,
> >
> > On 2022-04-08 19:27:58 -0500, Justin Pryzby wrote:
> >> On Thu, Apr 07, 2022 at 10:10:21AM -0700, Andres Freund wrote:
> >>> On 2022-04-06 11:03:37 -0400, Andrew Dunstan wrote:
>  On 3/30/22 20:26, Andres Freund wrote:
> > Could you try using dash to invoke configure here, and whether it
> makes configure faster?
>  I got weird failures re libxml/parser.h when I tried with dash. See
>   (It would be nice if we
>  could see config.log on failure.)
> >>> Since dash won't help us to get the build time down sufficiently, and
> the
> >>> tests don't pass without a separate build tree, I looked at what makes
> >>> config/prep_buildtree so slow.
> >>>
> >>> It's largely just bad code. The slowest part are spawning one expr and
> mkdir
> >>> -p for each directory. One 'cmp' for each makefile doesn't help either.
> >>>
> >>> The expr can be replaced with
> >>>   subdir=${item#$sourcetree}
> >>> that's afaics posix syntax ([1]), not bash.
> >>>
> >>> Spawning one mkdir for each directory can be replaced by a single mkdir
> >>> invocation with all the directories. On my linux workstation that gets
> the
> >>> time for the first loop down from 1005ms to 38ms, really.
> >> Even better?
> >>
> >> (cd "$sourcetree" && find . -print |grep -E '/Makefile$|/GNUmakefile$'
> |grep -v "$sourcetree/doc/src/sgml/images/" |xargs tar c) |
> >> (cd "$buildtree" && tar x)
> > Don't think we depend on tar for building, at the moment. But yes, it'd
> be
> > faster... Tar is certainly a smaller dependency than perl, not sure if
> there's
> > any relevant platform without it?
> >
> >
>
>
> Couple of things here.
>
>
> 1. The second grep should lose "$sourcetree" I think.
>
> 2. Isn't this going to be a change in behaviour in that it will copy
> rather than symlinking the Makefiles? That is in fact the default
> behaviour of msys2's 'ln -s', as I pointed out upthread, but I don't
> think it's what we really want, especially in the general case. If you
> modify the Makefile and you're using a vpath you want to see the change
> reflected in your vpath.
>
>
> cheers
>
>
> andrew
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>


v3-0001-Added-Windows-with-MinGW-environment-in-Cirrus-CI.patch
Description: Binary data


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-07-28 Thread Melih Mutlu
>
> Why after step 4, do you need to drop the replication slot? Won't just
> clearing the required info from the catalog be sufficient?
>

The replication slots that we read from the catalog will not be used for
anything else after we're done with syncing the table which the rep slot
belongs to.
It's removed from the catalog when the sync is completed and it basically
becomes a slot that is not linked to any table or worker. That's why I
think it should be dropped rather than left behind.

Note that if a worker dies and its replication slot continues to exist,
that slot will only be used to complete the sync process of the one table
that the dead worker was syncing but couldn't finish.
When that particular table is synced and becomes ready, the replication
slot has no use anymore.


> Hmm, I think even if there is an iota of a chance which I think is
> there, we can't use worker_pid. Assume, that if the same worker_pid is
> assigned to another worker once the worker using it got an error out,
> the new worker will fail as soon as it will try to create a
> replication slot.
>

Right. If something like that happens, the worker will fail without doing
anything. Then a new one will be launched, and that one will continue the
work.
The worst case would be hitting conflicting pids over and over again while
replication slots whose names include one of those pids still exist.
It seems unlikely, but possible, yes.


> I feel it would be better or maybe we need to think of some other
> identifier but one thing we need to think about before using a 64-bit
> unique identifier here is how will we retrieve its last used value
> after restart of server. We may need to store it in a persistent way
> somewhere.
>

We might consider storing this info in a catalog again. Since this last
used value will be different for each subscription, pg_subscription can be
a good place to keep that.


> The problems will be similar to the slot name. The origin is used to
> track the progress of replication, so, if we use the wrong origin name
> after the restart, it can send the wrong start_streaming position to
> the publisher.
>

I understand. But the origin naming logic is still the same; its format is
like pg_<subid>_<relid>.
I did not need to change this since it seems to me origins should belong to
only one table. The patch does not reuse origins.
So I don't think this change introduces an issue with origins. What do you
think?

Thanks,
Melih


Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2022-08-05 Thread Melih Mutlu
Hi Amit,

> >> Why after step 4, do you need to drop the replication slot? Won't just
> >> clearing the required info from the catalog be sufficient?
> >
> > The replication slots that we read from the catalog will not be used
> > for anything else after we're done with syncing the table which the
> > rep slot belongs to.
> > It's removed from the catalog when the sync is completed and it
> > basically becomes a slot that is not linked to any table or worker.
> > That's why I think it should be dropped rather than left behind.
> >
> > Note that if a worker dies and its replication slot continues to exist,
> > that slot will only be used to complete the sync process of the one
> > table that the dead worker was syncing but couldn't finish.
> > When that particular table is synced and becomes ready, the replication
> > slot has no use anymore.
> >
>
> Why can't it be used to sync the other tables if any?
>

It can be used. But I thought it would be better not to, for example in the
following case:
Let's say a sync worker starts with a table in INIT state. The worker
creates a new replication slot to sync that table.
When sync of the table is completed, it will move to the next one. This
time the new table may be in FINISHEDCOPY state, so the worker may need to
use the new table's existing replication slot.
Before the worker moves to the next table again, there will be two
replication slots used by the worker. We might want to keep one and drop
the other.
At this point, I thought it would be better to keep the replication slot
created by this worker in the first place. I think it's easier to track
slots this way, since we know how to generate the slot's name.
Otherwise we would need to store the replication slot name somewhere too.



> This sounds reasonable. Let's do this unless we get some better idea.
>

I updated the patch to use a unique id for replication slot names and to
store the last used id in the catalog.
Can you look into it again please?
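
Conceptually, the catalog-backed counter works like the following (the
sublastusedid column name is an assumption for illustration; the patch
implements this in C, and the OID is made up):

-- Claim the next unique id for a new tablesync slot, so slot names
-- survive restarts without depending on PIDs.
UPDATE pg_subscription
SET sublastusedid = sublastusedid + 1
WHERE oid = 16394
RETURNING sublastusedid;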


> There is no such restriction that origins should belong to only one
> table. What makes you think like that?
>

I did not reuse origins since I didn't think it would improve performance
as significantly as reusing replication slots does.
So I just kept the origins as they were, even though it was possible to
reuse them. Does that make sense?

Best,
Melih


v3-0001-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data


Allow logical replication to copy tables in binary format

2022-08-10 Thread Melih Mutlu
Hey hackers,

I see that logical replication subscriptions have an option to enable
binary mode [1].
When it's enabled, the subscription requests that the publisher send data
in binary format.
But this is only the case for the apply phase. In tablesync, tables are
still copied as text.

To copy tables, the COPY command is used, and that command supports copying
in binary. So it seemed to me possible to copy in binary for tablesync too.
I'm not sure if there is a reason to always copy tables in text format, but
I couldn't see why not to do it in binary if the option is enabled.
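
Concretely, the change boils down to the tablesync worker issuing something
like the following on the publisher (a sketch; the exact internal command
is built by the tablesync code):

-- Today's initial copy is equivalent to:
COPY public.tab TO STDOUT;
-- With binary = true, the patch requests:
COPY public.tab TO STDOUT WITH (FORMAT binary);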

You can find the small patch that only enables binary copy attached.

What do you think about this change? Does it make sense? Am I missing
something?

[1] https://www.postgresql.org/docs/15/sql-createsubscription.html

Best,
Melih


0001-Allow-logical-replication-to-copy-table-in-binary.patch
Description: Binary data


Re: Allow logical replication to copy tables in binary format

2022-08-11 Thread Melih Mutlu
Euler Taveira , 11 Ağu 2022 Per, 16:27 tarihinde şunu
yazdı:

> On Thu, Aug 11, 2022, at 8:04 AM, Amit Kapila wrote:
>
> On Thu, Aug 11, 2022 at 7:34 AM Euler Taveira  wrote:
> >
> > The reason to use text format is that it is error prone. There are
> restrictions
> > while using the binary format. For example, if your schema has different
> data
> > types for a certain column, the copy will fail.
> >
>
> Won't such restrictions hold true even during replication?
>
> I expect that the COPY code matches the proto.c code. The point is that
> table
> sync is decoupled from the logical replication. Hence, we should emphasize
> in
> the documentation that the restrictions *also* apply to the initial table
> synchronization.
>

If such restrictions already apply to the replication phase after the
initial table sync, then they shouldn't prevent us from enabling the binary
option for table sync. Right?
But yes, it needs to be stated somewhere.

Euler Taveira wrote on Thu, 11 Aug 2022 at 05:03:

> I have a few points about your implementation.
>
> * Are we considering to support prior Postgres versions too? These releases
>   support binary mode but it could be an unexpected behavior (initial sync
> in
>   binary mode) for a publisher using 14 or 15 and a subscriber using 16.
> IMO
>   you should only allow it for publisher on 16 or later.
>

How is any issue that might occur due to a version mismatch being handled
right now in replication after table sync?
What I understand from the documentation is that if replication can fail
due to using different PG versions, it just fails, so the binary option
cannot be used [1].
Do you think this is more serious for table sync, such that we need to
restrict the binary option for different publisher and subscriber versions
there, but not for replication?

> * Docs should say that the binary option also applies to initial table
>   synchronization and possibly emphasize some of the restrictions.
> * Tests. Are the current tests enough? 014_binary.pl.
>

You're right on both points. I just wanted to know your opinions on this
first. Then the patch will need some tests and proper documentation.

[1] https://www.postgresql.org/docs/15/sql-createsubscription.html


Best,
Melih

