Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Artemiy Ryabinkov
Why does the backend send buffer use exactly 8KB? 
(https://github.com/postgres/postgres/blob/249d64999615802752940e017ee5166e726bc7cd/src/backend/libpq/pqcomm.c#L134) 



I had this question when I tried to measure the speed of reading data. The 
bottleneck was the read syscall. With strace I found that in most cases 
read returns 8192 bytes (https://pastebin.com/LU10BdBJ). With tcpdump we 
can confirm that the network packets have size 8192 
(https://pastebin.com/FD8abbiA).


So, with a well-tuned networking stack, the limit is 8KB. The reason is 
the hardcoded size of the Postgres write buffer.
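
For reference, the definitions behind that link look roughly like this in 
src/backend/libpq/pqcomm.c (quoted from memory, so details may differ slightly 
between versions):

#define PQ_SEND_BUFFER_SIZE 8192
#define PQ_RECV_BUFFER_SIZE 8192

static char *PqSendBuffer;
static int   PqSendBufferSize;  /* allocated size of PqSendBuffer */
static char  PqRecvBuffer[PQ_RECV_BUFFER_SIZE];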


I found a discussion where Tom Lane says that the reason for this limit is 
the size of pipe buffers on Unix machines: 
https://www.postgresql.org/message-id/9426.1388761242%40sss.pgh.pa.us


> Traditionally, at least, that was the size of pipe buffers in Unix
> machines, so in principle this is the most optimal chunk size for
> sending data across a Unix socket.  I have no idea though if that's
> still true in kernels in common use today. For TCP communication it
> might be marginally better to find out the MTU size and use that; but
> it's unclear that it's worth the trouble, or indeed that we can
> know the end-to-end MTU size with any reliability.

Does it make sense to make this parameter configurable?

--
Artemiy Ryabinkov
getlag(at)ya(dot)ru




Re: Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Tom Lane
Artemiy Ryabinkov  writes:
> Does it make sense to make this parameter configurable?

Not without some proof that it makes a performance difference on
common setups (which you've not provided).

Even with some proof, I'm not sure I'd bother with exposing a
user-tunable knob, as opposed to just making the buffer bigger.
We have far too many GUCs already.

regards, tom lane




RE: Hardware for writing/updating 12,000,000 rows per hour

2019-07-27 Thread farjad . farid
With this kind of design requirement it is worth considering hardware "failure 
& recovery". Even SSDs can and do fail. 

It is not just a matter of speed. A RAID array of some kind, depending on 
the budget, is worth the effort.  



-Original Message-
From: Alvaro Herrera  
Sent: 2019 July 26 22:39
To: Arya F 
Cc: Tom Lane ; Ron ; 
pgsql-general@lists.postgresql.org
Subject: Re: Hardware for writing/updating 12,000,000 rows per hour

On 2019-Jul-26, Arya F wrote:

> I think I can modify my application to do a batch update. Right now 
> the server has an HDD and it really can't handle a lot of updates and 
> inserts per second. Would changing to a regular SSD be able to easily 
> do 3000 updates per second?

That's a pretty hard question in isolation -- you need to consider how many 
indexes are there to update, whether the updated columns are indexed or not, 
what the datatypes are, how much locality of access you'll have ... I'm 
probably missing some other important factors.  (Of course, you'll have to tune 
various PG server settings to find your sweet spot.)

I suggest that you should be measuring instead of trying to guess.  A reasonably 
cheap way is to rent a machine somewhere with the type of hardware you think 
you'll need, and run your workload there for long enough, making sure to 
carefully observe important metrics such as table size, accumulated bloat, 
checkpoint regime, overall I/O activity, and so on.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services






Re: Hardware for writing/updating 12,000,000 rows per hour

2019-07-27 Thread Arya F
On Sat, Jul 27, 2019 at 11:49 AM farjad.farid <
farjad.fa...@checknetworks.com> wrote:

> With this kind of design requirement it is worth considering hardware
> "failure & recovery". Even SSDs can and do fail.
>
> It is not just a matter of speed. A RAID array of some kind, depending
> on the budget, is worth the effort.
>
>
>
> -Original Message-
> From: Alvaro Herrera 
> Sent: 2019 July 26 22:39
> To: Arya F 
> Cc: Tom Lane ; Ron ;
> pgsql-general@lists.postgresql.org
> Subject: Re: Hardware for writing/updating 12,000,000 rows per hour
>
> On 2019-Jul-26, Arya F wrote:
>
> > I think I can modify my application to do a batch update. Right now
> > the server has an HDD and it really can't handle a lot of updates and
> > inserts per second. Would changing to a regular SSD be able to easily
> > do 3000 updates per second?
>
> That's a pretty hard question in isolation -- you need to consider how
> many indexes are there to update, whether the updated columns are indexed
> or not, what the datatypes are, how much locality of access you'll have ...
> I'm probably missing some other important factors.  (Of course, you'll have
> to tune various PG server settings to find your sweet spot.)
>
> I suggest that you should be measuring instead of trying to guess.  A
> reasonably cheap way is to rent a machine somewhere with the type of
> hardware you think you'll need, and run your workload there for long
> enough, making sure to carefully observe important metrics such as table
> size, accumulated bloat, checkpoint regime, overall I/O activity, and so on.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
>

Hi Farjad

I was thinking of having physical or logical replication. Or is having RAID
a must if I don't want to lose data?


Re: Default ordering option

2019-07-27 Thread Peter J. Holzer
On 2019-07-26 09:52:52 +0200, Cyril Champier wrote:
> No, the code I pasted was an existing production bug: the last_name
> should have been unique, so the selected patient would always be the
> same. This should have been detected in tests, but since the order was
> "almost always the same", our test was green 99% of the time, so we
> discarded it as flaky.
> 
> Fuzzy testing could be an option, but this would go too far, as for
> Peter extension suggestion. We have huge existing codebase with more
> than 10K tests, and I do not want to modify our whole testing
> strategy.
> 
> Meanwhile, I went for an ORM patch (ActiveRecord) and forbid usages
> that can workaround it.

Another idea:

How do you prepare your test data? Do you have a (possibly large) test
database or do you populate a test database with test-specific data in a
fixture?

If you do the latter, you might be able to insert the data in random order.

hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | h...@hjp.at        | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson




Re: Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Andres Freund
On 2019-07-27 14:43:54 +0300, Artemiy Ryabinkov wrote:
> Why does the backend send buffer use exactly 8KB? 
> (https://github.com/postgres/postgres/blob/249d64999615802752940e017ee5166e726bc7cd/src/backend/libpq/pqcomm.c#L134)
> 
> 
> I had this question when I tried to measure the speed of reading data. The
> bottleneck was the read syscall. With strace I found that in most cases read
> returns 8192 bytes (https://pastebin.com/LU10BdBJ). With tcpdump we can
> confirm that the network packets have size 8192 (https://pastebin.com/FD8abbiA)

Well, in most setups you can't have frames that large. The most common
limit is 1500 +- some overheads. Using jumbo frames isn't that uncommon,
but it has enough problems that I don't think it's that widely used with
postgres.


> So, with a well-tuned networking stack, the limit is 8KB. The reason is the
> hardcoded size of the Postgres write buffer.

Well, jumbo frames are limited to 9000 bytes.



But the reason you're seeing 8192-byte packets isn't just that we have
an 8kB buffer, I think it's also that we unconditionally set
TCP_NODELAY:

#ifdef TCP_NODELAY
    on = 1;
    if (setsockopt(port->sock, IPPROTO_TCP, TCP_NODELAY,
                   (char *) &on, sizeof(on)) < 0)
    {
        elog(LOG, "setsockopt(%s) failed: %m", "TCP_NODELAY");
        return STATUS_ERROR;
    }
#endif

With an 8KB send size, we'll often unnecessarily send some smaller packets
(both for 1500 and 9000 MTUs), because 8kB doesn't divide evenly into
MTU-sized payloads. Here are, e.g., the IP packet sizes for a query
returning maybe 18kB:

1500
1500
1500
1500
1500
1004
1500
1500
1500
1500
1500
1004
1500
414

The dips are where our 8KB buffer, plus Nagle being disabled, implies a
packet boundary.
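
To make the dips concrete: assuming a 1500-byte MTU with 20 bytes of IPv4
header, 20 bytes of TCP header and a 12-byte timestamp option per segment
(an assumption about this particular trace), the arithmetic works out to:

1500 - 20 (IP) - 20 (TCP) - 12 (timestamps) = 1448 bytes of payload per full packet
5 * 1448                                    = 7240 bytes carried in the five 1500-byte packets
8192 - 7240                                 =  952 bytes left from the 8kB buffer flush
952 + 52 bytes of headers                   = 1004, the short packet in the list above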


I wonder if we ought to pass MSG_MORE (which overrides TCP_NODELAY by
basically having TCP_CORK behaviour for that call) in cases we know
there's more data to send. Which we pretty much know, although we'd need
to pass that knowledge from pqcomm.c to be-secure.c
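
As a rough illustration of the MSG_MORE idea, a standalone sketch with
made-up names (not the actual pqcomm.c/be-secure.c interfaces):

#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Send one chunk, hinting to the kernel (where MSG_MORE exists, e.g. Linux)
 * that more data follows, so it doesn't emit a short trailing segment
 * despite TCP_NODELAY.  The caller still handles short writes and errors.
 */
static ssize_t
send_chunk(int sock, const char *buf, size_t len, bool more_to_come)
{
    int     flags = 0;

#ifdef MSG_MORE
    if (more_to_come)
        flags |= MSG_MORE;
#endif
    return send(sock, buf, len, flags);
}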


It might be better to just use larger send sizes however. I think most
kernels are going to be better than us knowing how to chop up the send
size. We're using much larger limits when sending data from the client
(no limit for !win32, 65k for windows), and I don't recall seeing any
problem reports about that.


OTOH, I'm not quite convinced that you're going to see much of a
performance difference in most scenarios. As soon as the connection is
actually congested, the kernel will coalesce packets regardless of the
send() size.


> Does it make sense to make this parameter configurable?

I'd much rather not. It's going to be too hard to tune, and I don't see
any tradeoffs actually requiring that.

Greetings,

Andres Freund




Re: Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Andres Freund
Hi,

On 2019-07-27 11:09:06 -0400, Tom Lane wrote:
> Artemiy Ryabinkov  writes:
> > Does it make sense to make this parameter configurable?
>
> Not without some proof that it makes a performance difference on
> common setups (which you've not provided).

I think us unnecessarily fragmenting into some smaller packets every time
we send a full 8kB buffer, unless there's already network congestion, is
kind of evidence enough? The combination of a relatively small send
buffer + TCP_NODELAY isn't great.

I'm not quite sure what the smaller buffer is supposed to achieve, at
least these days. In blocking mode (emulated in PG code, using latches,
so we can accept interrupts) we'll always just loop back to another
send() in internal_flush(). In non-blocking mode, we'll fall out of the
loop as soon as the kernel didn't send any data. Isn't the outcome of
using such a small send buffer that we end up performing a) more
syscalls, which has gotten a lot worse in the last two years due to all the
CPU vulnerability mitigations making syscalls a *lot* more expensive, and b)
unnecessary fragmentation?

The situation for receiving data is a bit different. For one, we don't
cause unnecessary fragmentation by using a buffer of a relatively
limited size. But more importantly, copying data into the buffer takes
time, and we could actually be responding to queries earlier in the
data. In contrast to the send case we don't loop around recv() until all
the data has been received.

I suspect we could still do with a bigger buffer, just to reduce the
number of syscalls in bulk loading cases, however.

Greetings,

Andres Freund




RE: Hardware for writing/updating 12,000,000 rows per hour

2019-07-27 Thread farjad . farid
Hi Arya,

It is not clear what the budget is or why there is so much data. Is this a 
real-time system, e.g. 24/7 operation? Even if each row takes up just 50 bytes, 
that is a lot of data in/out of your CPUs/memory/hard disk, any one of which 
could fail.

Personally I would recommend analyzing the software for any pattern that might 
help you reduce the use of the hard disk. Push the data as much as possible 
into memory, with only the overflow going to the hard disk (if at all possible). 
I know it might be difficult, but it could save you a lot of downtime and 
maintenance. Also double-check the reliability of hard disks vs. SSDs; not all 
SSDs, motherboards, or memory are server grade. Use a hardware RAID card, not a 
software-based one, as the latter degrades overall performance.

As a minimum I would recommend a RAID configuration, bearing in mind the budget. 
Also check the card manufacturer’s reliability figures; the reliability of all 
components matters. Short-term cost savings could cost a lot more in this kind 
of situation. At least make the request; if they reject it and things go wrong, 
you can point out that you had made the request.

What about backing up the data?

Good luck.




From: Arya F 
Sent: 2019 July 27 17:56
To: farjad.farid 
Cc: Alvaro Herrera ; Tom Lane ; 
Ron ; pgsql-general@lists.postgresql.org
Subject: Re: Hardware for writing/updating 12,000,000 rows per hour



On Sat, Jul 27, 2019 at 11:49 AM farjad.farid 
<farjad.fa...@checknetworks.com> wrote:
With this kind of design requirement it is worth considering hardware "failure 
& recovery". Even SSDs can and do fail.

It is not just a matter of speed. A RAID array of some kind, depending on 
the budget, is worth the effort.



-Original Message-
From: Alvaro Herrera <alvhe...@2ndquadrant.com>
Sent: 2019 July 26 22:39
To: Arya F <arya6...@gmail.com>
Cc: Tom Lane <t...@sss.pgh.pa.us>; Ron <ronljohnso...@gmail.com>;
pgsql-general@lists.postgresql.org
Subject: Re: Hardware for writing/updating 12,000,000 rows per hour

On 2019-Jul-26, Arya F wrote:

> I think I can modify my application to do a batch update. Right now
> the server has an HDD and it really can't handle a lot of updates and
> inserts per second. Would changing to a regular SSD be able to easily
> do 3000 updates per second?

That's a pretty hard question in isolation -- you need to consider how many 
indexes are there to update, whether the updated columns are indexed or not, 
what the datatypes are, how much locality of access you'll have ... I'm 
probably missing some other important factors.  (Of course, you'll have to tune 
various PG server settings to find your sweet spot.)

I suggest that you should be measuring instead of trying to guess.  A reasonably 
cheap way is to rent a machine somewhere with the type of hardware you think 
you'll need, and run your workload there for long enough, making sure to 
carefully observe important metrics such as table size, accumulated bloat, 
checkpoint regime, overall I/O activity, and so on.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hi Farjad
I was thinking of having physical or logical replication. Or is having RAID a 
must if I don't want to lose data?


Re: Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Tom Lane
Andres Freund  writes:
> It might be better to just use larger send sizes however. I think most
> kernels are going to be better than us knowing how to chop up the send
> size.

Yeah.  The existing commentary about that is basically justifying 8K
as being large enough to avoid performance issues; if somebody can
show that that's not true, I wouldn't have any hesitation about
kicking it up.

(Might be worth malloc'ing it rather than having it as part of the
static process image if we do so, but that's a trivial change.)

regards, tom lane




Re: Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Andres Freund
Hi,

On 2019-07-27 18:34:50 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > It might be better to just use larger send sizes however. I think most
> > kernels are going to be better than us knowing how to chop up the send
> > size.

> Yeah.  The existing commentary about that is basically justifying 8K
> as being large enough to avoid performance issues; if somebody can
> show that that's not true, I wouldn't have any hesitation about
> kicking it up.

You think that unnecessary fragmentation, which I did show, isn't good
enough? That does have a cost at the network level, even if it possibly
doesn't show up that much in timing.


I wonder if we ought to just query SO_SNDBUF/SO_RCVBUF or such, and use
those (although that's not quite perfect, because there's some added
overhead before data ends up in SNDBUF). Probably with some clamping, to
defend against a crazy sysadmin setting it extremely high.
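
A minimal sketch of that idea, with made-up clamp bounds (nothing Postgres
does today):

#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

#define MIN_SEND_BUFFER     (8 * 1024)      /* historical default */
#define MAX_SEND_BUFFER     (256 * 1024)    /* arbitrary upper clamp */

static size_t
choose_send_buffer_size(int sock)
{
    int         sndbuf = 0;
    socklen_t   optlen = sizeof(sndbuf);

    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &optlen) < 0 ||
        sndbuf <= 0)
        return MIN_SEND_BUFFER;     /* fall back to the old 8kB */
    if (sndbuf < MIN_SEND_BUFFER)
        return MIN_SEND_BUFFER;
    if (sndbuf > MAX_SEND_BUFFER)
        return MAX_SEND_BUFFER;     /* defend against an extreme setting */
    return (size_t) sndbuf;
}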


Additionally we perhaps ought to just not use the send buffer when
internal_putbytes() is called with more data than can fit in the
buffer. We should fill it with as much data as fits in it (so the
pending data like the message header, or smaller previous messages, are
flushed out in the largest size), and then just call secure_write()
directly on the rest. It's not free to memcpy all that data around, when
we already have a buffer.
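
A standalone sketch of that fill-then-bypass idea, using plain send() in
place of secure_write() and made-up names (it also sidesteps the
fractional-tail question raised later in the thread):

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

#define SEND_BUF_SIZE 8192

typedef struct
{
    int     sock;
    char    data[SEND_BUF_SIZE];
    size_t  used;
} send_buf;

/* Send everything currently buffered. */
static int
flush_buf(send_buf *b)
{
    size_t  off = 0;

    while (off < b->used)
    {
        ssize_t n = send(b->sock, b->data + off, b->used - off, 0);

        if (n < 0)
            return -1;
        off += (size_t) n;
    }
    b->used = 0;
    return 0;
}

/*
 * Buffer small writes as usual.  For a payload larger than the remaining
 * space, top the buffer up (so the message header and any older pending
 * bytes go out in one large write), flush it, then send the tail directly
 * instead of memcpy'ing it through the buffer.
 */
static int
put_bytes(send_buf *b, const char *p, size_t len)
{
    size_t  room = SEND_BUF_SIZE - b->used;

    if (len <= room)
    {
        memcpy(b->data + b->used, p, len);
        b->used += len;
        return 0;
    }

    memcpy(b->data + b->used, p, room);
    b->used += room;
    if (flush_buf(b) < 0)
        return -1;

    p += room;
    len -= room;
    while (len > 0)
    {
        ssize_t n = send(b->sock, p, len, 0);

        if (n < 0)
            return -1;
        p += n;
        len -= (size_t) n;
    }
    return 0;
}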


> (Might be worth malloc'ing it rather than having it as part of the
> static process image if we do so, but that's a trivial change.)

We already do for the send buffer, because we repalloc it in
socket_putmessage_noblock(). Oddly enough we never reduce its size
after that...

While the receive side is statically allocated, I don't think it ends up
in the process image as-is - as the contents aren't initialized, it ends
up in .bss.

Greetings,

Andres Freund




Re: Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Tom Lane
Andres Freund  writes:
> On 2019-07-27 18:34:50 -0400, Tom Lane wrote:
>> Yeah.  The existing commentary about that is basically justifying 8K
>> as being large enough to avoid performance issues; if somebody can
>> show that that's not true, I wouldn't have any hesitation about
>> kicking it up.

> You think that unnecessary fragmentation, which I did show, isn't good
> enough? That does have cost on the network level, even if it possibly
> doesn't show up that much in timing.

I think it is worth doing some testing, rather than just blindly changing
buffer size, because we don't know how much we'd have to change it to
have any useful effect.

> Additionally we perhaps ought to just not use the send buffer when
> internal_putbytes() is called with more data than can fit in the
> buffer. We should fill it with as much data as fits in it (so the
> pending data like the message header, or smaller previous messages, are
> flushed out in the largest size), and then just call secure_write()
> directly on the rest. It's not free to memcpy all that data around, when
> we already have a buffer.

Maybe, but how often does a single putbytes call transfer more than 16K?
(If you fill the existing buffer, but don't have a full bufferload
left to transfer, I doubt you want to shove the fractional bufferload
directly to the kernel.)  Perhaps this added complexity will pay for
itself, but I don't think we should just assume that.

> While the receive side is statically allocated, I don't think it ends up
> in the process image as-is - as the contents aren't initialized, it ends
> up in .bss.

Right, but then we pay for COW when a child process first touches it,
no?  Maybe the kernel is smart about pages that started as BSS, but
I wouldn't bet on it.

regards, tom lane




Re: Why does backend send buffer size hardcoded at 8KB?

2019-07-27 Thread Andres Freund
Hi,

On 2019-07-27 19:10:22 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Additionally we perhaps ought to just not use the send buffer when
> > internal_putbytes() is called with more data than can fit in the
> > buffer. We should fill it with as much data as fits in it (so the
> > pending data like the message header, or smaller previous messages, are
> > flushed out in the largest size), and then just call secure_write()
> > directly on the rest. It's not free to memcpy all that data around, when
> > we already have a buffer.
> 
> Maybe, but how often does a single putbytes call transfer more than
> 16K?

I don't think it's that rare. COPY produces entire rows and sends them
at once, printtup also does, walsender can send pretty large chunks? I
think with several columns after text conversion it's pretty easy to
exceed 16k, not even taking large toasted columns into account.


> (If you fill the existing buffer, but don't have a full bufferload
> left to transfer, I doubt you want to shove the fractional bufferload
> directly to the kernel.)  Perhaps this added complexity will pay for
> itself, but I don't think we should just assume that.

Yea, I'm not certain either. One way to deal with the partially filled
buffer issue would be to use sendmsg() - and have two iovs (one pointing
to the filled buffer, one to the actual data). Wonder if it'd be
worthwhile to do in more scenarios, to avoid unnecessarily copying
memory around.
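
For illustration, a minimal two-iovec sketch with made-up names (a real
version would loop to handle short writes):

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Hand the kernel the partially filled send buffer and the new payload in
 * a single sendmsg() call, so the pending bytes and the large payload go
 * out together without first copying the payload into the buffer.
 */
static ssize_t
send_pending_and_payload(int sock,
                         const char *pending, size_t pending_len,
                         const char *payload, size_t payload_len)
{
    struct iovec    iov[2];
    struct msghdr   msg;

    iov[0].iov_base = (void *) pending;
    iov[0].iov_len = pending_len;
    iov[1].iov_base = (void *) payload;
    iov[1].iov_len = payload_len;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    return sendmsg(sock, &msg, 0);
}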


> > While the receive side is statically allocated, I don't think it ends up
> > in the process image as-is - as the contents aren't initialized, it ends
> > up in .bss.
> 
> Right, but then we pay for COW when a child process first touches it,
> no?  Maybe the kernel is smart about pages that started as BSS, but
> I wouldn't bet on it.

Well, they'll not exist as pages at that point, because postmaster won't
have used the send buffer to a meaningful degree? And I think that's the
same for >4k/pagesize blocks with malloc.  I think there could be a
benefit if we started the buffer pretty small with malloc, and only went
up as needed.
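
A sketch of that start-small-and-grow approach, using malloc/realloc for
brevity where the backend would use palloc/repalloc:

#include <stdlib.h>

typedef struct
{
    char   *data;
    size_t  size;
} grow_buf;

/*
 * Make sure the buffer can hold "needed" bytes: start small and grow
 * geometrically, so idle backends keep a small footprint while big
 * messages still get a big buffer.  (Shrinking later is a separate issue.)
 */
static int
ensure_capacity(grow_buf *b, size_t needed)
{
    size_t  newsize = (b->size > 0) ? b->size : 1024;

    while (newsize < needed)
        newsize *= 2;

    if (newsize != b->size)
    {
        char   *p = realloc(b->data, newsize);

        if (p == NULL)
            return -1;
        b->data = p;
        b->size = newsize;
    }
    return 0;
}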

Greetings,

Andres Freund




Re: Hardware for writing/updating 12,000,000 rows per hour

2019-07-27 Thread Neil

> On Jul 27, 2019, at 11:55 AM, Arya F  wrote:
> 
> On Sat, Jul 27, 2019 at 11:49 AM farjad.farid <farjad.fa...@checknetworks.com> wrote:
> With this kind of design requirement it is worth considering hardware
> "failure & recovery". Even SSDs can and do fail.
> 
> It is not just a matter of speed. A RAID array of some kind, depending on
> the budget, is worth the effort.
> 
> -Original Message-
> From: Alvaro Herrera <alvhe...@2ndquadrant.com>
> Sent: 2019 July 26 22:39
> To: Arya F <arya6...@gmail.com>
> Cc: Tom Lane <t...@sss.pgh.pa.us>; Ron <ronljohnso...@gmail.com>;
> pgsql-general@lists.postgresql.org
> Subject: Re: Hardware for writing/updating 12,000,000 rows per hour
> 
> On 2019-Jul-26, Arya F wrote:
> 
> > I think I can modify my application to do a batch update. Right now 
> > the server has an HDD and it really can't handle a lot of updates and 
> > inserts per second. Would changing to a regular SSD be able to easily 
> > do 3000 updates per second?
> 
> That's a pretty hard question in isolation -- you need to consider how many 
> indexes are there to update, whether the updated columns are indexed or not, 
> what the datatypes are, how much locality of access you'll have ... I'm 
> probably missing some other important factors.  (Of course, you'll have to 
> tune various PG server settings to find your sweet spot.)
> 
> I suggest that you should be measuring instead of trying to guess.  A reasonably 
> cheap way is to rent a machine somewhere with the type of hardware you think 
> you'll need, and run your workload there for long enough, making sure to 
> carefully observe important metrics such as table size, accumulated bloat, 
> checkpoint regime, overall I/O activity, and so on.
> 
> -- 
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 
> Hi Farjad
> 
> I was thinking of having physical or logical replication. Or is having RAID a 
> must if I don’t want to lose data? 

Arya,

Answering the question of what hardware would work better requires a lot of 
thought. I would suggest that you hire someone who can do the calculations to 
get you in the ballpark and lay out the other pros and cons, and then test with 
rented or cloud equipment like Álvaro suggested.  The following are some basic 
back-of-the-envelope calculations to show you the thought process; the numbers 
presented here are probably way off and need to be based on your own hardware 
and research.

The first thing you need to understand is how much data you are going to be 
moving around.  You mentioned ’12,000,000 writes/updates per hour’ and the size 
of the tuples getting updated at about 2k.  That is about 3,333 per second (as 
someone else already noted).  That translates to about 3,333 * 2,000 = 6.6 
MB/sec of basic data movement if they are all inserts, not including index 
updates, replication, vacuum, etc.  If they are all updates then you can double 
that.  So let's say they are 1/2 updates; that means basic data movement is 2x 
for the updates and 1x for the inserts, which changes it to 9.9 MB/sec.  If the 
index keys total 200 bytes, we have 3,333 * 200 * 1.5 (half updates, half 
inserts) = 1.0 MB/sec.  If you have an existing system, all of these things can 
be measured.
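
Restating that arithmetic in one place (same assumed 2,000-byte rows and
200-byte index keys as above):

12,000,000 rows/hour / 3,600 sec        ~= 3,333 rows/sec
3,333 rows/sec * 2,000 bytes            ~= 6.6 MB/sec   (if all inserts)
6.6 MB/sec * 1.5 (half are 2x updates)  ~= 9.9 MB/sec   (base data movement)
3,333 * 200 bytes * 1.5 (index keys)    ~= 1.0 MB/sec
9.9 + 1.0                               ~= 10.9 MB/sec  sustained, before WAL, OS, vacuum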

This brings the total to a minimum of 10.9 MB/sec of sustained operation, not 
including WAL, OS, vacuum processing, etc., and it assumes the data is processed 
evenly over time, which it never is.  This will start to tax a standard HDD, 
since it has to handle the OS, the WAL log (about the same I/O as the database), 
and the database itself, which probably puts the sustained usage above 22 MB/sec 
against a single drive handling a little over 40 MB/sec (and most drives do not 
perform at their rating).  Because data does not arrive evenly over time, you 
also need to consider peaks in the processing and multiply this data rate by 
some factor (which is impossible to predict without knowing your data access 
pattern).  So, based on the fact that your HDD is not keeping up, we already 
suspect that there is more going on than you have described and that a single 
hard drive is nowhere near adequate.

Now let's think about architecture.

Best practice would say to put the OS on one drive, the WAL log on another 
drive, and the database on a third.  The reason is that you can probably get 
40+ MB/sec on each HDD; server-grade hardware with 15,000 RPM HDDs would of 
course perform better.  If you only have one drive, then the OS, the WAL log, 
and the database are all competing for that one resource.  With 3 drives, 
depending on the I/O architecture (not a laptop), you can probably get 3 times 
the I/O throughput, i.e. 40+ MB/sec on each HDD in our contrived example.

The other major help is memory. If you can fit the wh