On Thu, Feb 11, 2021 at 8:09 AM Daniil Zakhlystov <usernam...@yandex-team.ru> wrote:
> [ benchmark results ]
So, if I read these results correctly, on the "pg_restore of IMDB database" test, we get 88% of the RX bytes reduction and 99.8% of the TX bytes reduction for 90% of the CPU cost. On the "pgbench" test, which probably has much smaller packets, chunked compression gives us no bandwidth reduction and in fact consumes slightly more network bandwidth -- which seems like it has to be an implementation defect, since we should always be able to fall back to sending the uncompressed packet if the compressed one is larger, or will be after adding the wrapper overhead. But with the current code, at least, we pay about a 30% CPU tax, and there's no improvement. The permanent compression imposes a whopping 90% CPU tax, but we save about 33% on TX bytes and about 14% on RX bytes.

If that's an accurate reading of the results, then I would say the "pg_restore of IMDB database" test is pretty much a wash. Some people might prefer to incur the extra CPU cost to get the extra bandwidth savings, and other people might not. But neither group of people really has cause for complaint if the other approach is selected, because the costs and benefits are similar in both cases. But in the pgbench test case, the chunked compression looks horrible. Sure, it's less costly from a CPU perspective than the other approach, but since you don't get any benefit, you'd be far better off disabling compression altogether than using the chunked approach.

However, I feel like some of this has almost got to be an implementation deficiency in the "chunked" version of the patch. Now, I haven't looked at that patch. But, there are certainly a number of things that it might be failing to do that could make a big difference (a rough sketch of the first three follows the list):

1. As I mentioned above, we need to fall back to sending the uncompressed message if compression fails to reduce the size, or if it doesn't reduce the size by enough to compensate for the header we have to add to the packet (I assume this is 5 bytes, perhaps 6 if you allow a byte to mention the compression type).

2. Refining this further, if we notice that we are failing to compress messages regularly, maybe we should adaptively give up. The simplest idea would be something like: keep track of what percentage of the time compression succeeds in reducing the message size. If in the last 100 attempts we got a benefit fewer than 75 times, then conclude the data isn't very compressible and switch to only attempting to compress every twentieth packet or so. If the data changes and becomes more compressible again, the statistics will eventually tilt back in favor of compressing every packet again; if not, we'll only be paying 5% of the overhead.

3. There should be some minimum size before we attempt compression. pglz gives up right away if the input is less than 32 bytes; I don't know if that's the right limit, but presumably it'd be very difficult to save 5 or 6 bytes out of a message smaller than that, and maybe it's not worth trying even for slightly larger messages.

4. It might be important to compress multiple packets at a time. I can even imagine having two different compressed protocol messages, one saying 'here is a compressed message' and the other saying 'here are a bunch of compressed messages rolled up into one packet'.
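Here's that rough sketch of points 1-3. To be clear, this is only my illustration, not code from either patch: the function names, the 6-byte header figure, and the choice of zlib are all stand-ins.

/*
 * Hypothetical sketch (not from either patch) of the send-side decision
 * described in points 1-3: skip tiny messages, adaptively give up when
 * compression keeps failing to pay off, and always fall back to sending
 * the raw message when compression does not cover the added header.
 */
#include <stdbool.h>
#include <string.h>
#include <zlib.h>

#define COMPRESSED_HEADER_BYTES 6   /* assumed: type byte + length + method byte */
#define MIN_COMPRESS_SIZE       32  /* below this, don't even try (pglz uses 32) */

/* rolling stats over the last 100 attempts; per-connection state in a real patch */
static int attempts_in_window = 0;
static int wins_in_window = 0;
static bool compression_paused = false;
static int packets_since_pause = 0;

static bool
should_try_compression(size_t msglen)
{
    if (msglen < MIN_COMPRESS_SIZE)
        return false;
    if (compression_paused)
    {
        /* while "paused", only probe every twentieth packet */
        if (++packets_since_pause % 20 != 0)
            return false;
    }
    return true;
}

static void
record_attempt(bool win)
{
    attempts_in_window++;
    if (win)
        wins_in_window++;
    if (attempts_in_window == 100)
    {
        /* fewer than 75 wins out of 100: data looks incompressible */
        compression_paused = (wins_in_window < 75);
        attempts_in_window = wins_in_window = 0;
        packets_since_pause = 0;
    }
}

/*
 * Prepare one outgoing message.  The caller supplies 'out' with room for
 * at least compressBound(msglen) bytes (and at least msglen, for the
 * fallback copy).  Returns the number of bytes written into 'out' and
 * sets *compressed to say whether that is compressed data or a verbatim
 * copy of 'msg'.
 */
static size_t
prepare_payload(const char *msg, size_t msglen,
                char *out, size_t outcap, bool *compressed)
{
    *compressed = false;

    if (should_try_compression(msglen))
    {
        uLongf clen = (uLongf) outcap;

        if (compress2((Bytef *) out, &clen, (const Bytef *) msg,
                      (uLong) msglen, Z_DEFAULT_COMPRESSION) == Z_OK &&
            clen + COMPRESSED_HEADER_BYTES < msglen)
        {
            record_attempt(true);
            *compressed = true;
            return (size_t) clen;
        }
        record_attempt(false);          /* didn't pay off; fall through */
    }

    memcpy(out, msg, msglen);           /* point 1: just send it uncompressed */
    return msglen;
}

The 75-out-of-100 and every-twentieth-packet numbers are just the arbitrary values from point 2; something smarter could surely be devised.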
But there's a subtler way in which the permanent compression approach could be winning, which is that the compressor can retain state over long time periods. In a single pgbench response, there's doubtless some opportunity for the compressor to find savings, but an individual response doesn't likely include all that much duplication. But just think about how much duplication there is from one response to the next. The entire RowDescription message is going to be exactly the same for every query. If you can represent that in just a couple of bytes, I think that figures to be a pretty big win. If I had to guess, that's likely why the permanent compression approach seems to deliver a significant bandwidth savings even on the pgbench test, while the chunked approach doesn't. Likewise in the other direction: the query doesn't necessarily contain a lot of internal duplication, but it duplicates the previous query to a very large extent. It would be interesting to know whether this theory is correct, and whether anyone can spot a flaw in my reasoning.

If it is, that doesn't necessarily mean we can't use the chunked approach, but it certainly makes it less appealing. I can see two ways to go. One would be to just accept that it won't get much benefit in cases like the pgbench example, and mitigate the downsides as well as we can. A version of this patch that caused a 3% CPU overhead in cases where it can't compress would be far more appealing than one that causes a 30% overhead in such cases (which seems to be where we are now).

Alternatively, we could imagine the compressed-message packets as carrying a single continuous compressed stream of bytes, so that the compressor state is retained from one compressed message to the next. Any number of uncompressed messages could be sent in between, without doing anything to the compression state, but when you send the next compressed message, both the sender and receiver feel like the bytes they're now being given are appended onto whatever bytes they saw last. This would presumably recoup a lot of the compression benefit that the permanent compression approach sees on the pgbench test, but it has some notable downsides. In particular, now you have to wonder what exactly you're gaining by not just compressing everything. Nobody snooping on the stream can decode an individual packet without having seen the whole history of compressed packets from the beginning of time, nor can some kind of middleware like pgbouncer decompress each payload packet just enough to see what the first byte may be. It's either got to decompress all of every packet to keep its compression state current, or just give up on knowing anything about what's going on inside those packets. And you've got to worry about all the same details about flushing the compressor state that we were worrying about with the compress-everything approach. Blech. Or we could compress everything, as Konstantin proposes.
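To make that "continuous stream" idea a bit more concrete, here's roughly the shape I have in mind, sketched against zstd's streaming API. Again, this is only an illustration -- the struct and function names are invented, and a real patch would need proper buffer management, error reporting, and a considered flush policy.

/*
 * Hypothetical sketch of the "continuous stream" variant: one long-lived
 * zstd context per direction, fed only the payloads of compressed-message
 * packets.  Uncompressed messages bypass it entirely, so they don't touch
 * its state.  Each compressed message ends with a flush so the receiver
 * can decode it as soon as it arrives, but the history is kept for the
 * next one.
 */
#include <stddef.h>
#include <zstd.h>

typedef struct CompressedStream
{
    ZSTD_CCtx  *cctx;           /* lives for the whole connection */
} CompressedStream;

static void
stream_init(CompressedStream *cs)
{
    cs->cctx = ZSTD_createCCtx();
}

/*
 * Compress one protocol message into 'dst' (caller supplies at least
 * ZSTD_compressBound(srclen) bytes).  Returns the payload length to put
 * in a compressed-message packet, or 0 on error.  Because the context is
 * reused, a RowDescription identical to the previous one compresses to
 * almost nothing even though it sits in a different packet.
 */
static size_t
stream_compress_message(CompressedStream *cs,
                        const void *src, size_t srclen,
                        void *dst, size_t dstcap)
{
    ZSTD_inBuffer   in = { src, srclen, 0 };
    ZSTD_outBuffer  out = { dst, dstcap, 0 };
    size_t          remaining;

    do
    {
        /* flush, so this message is decodable on its own arrival */
        remaining = ZSTD_compressStream2(cs->cctx, &out, &in, ZSTD_e_flush);
        if (ZSTD_isError(remaining))
            return 0;
    } while ((remaining != 0 || in.pos < in.size) && out.pos < out.size);

    return (remaining == 0 && in.pos == in.size) ? out.pos : 0;
}

The receiving side would hold a matching long-lived ZSTD_DCtx for the whole connection, which is exactly why something like pgbouncer can't decode one message in isolation.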
There's another point here that just occurred to me, though. In the pgbench test, we send and receive roughly equal quantities of data at a rate that works out, over the duration of the test, to about 4.8MB/s. On the data-loading test, one direction is insignificant, but the other direction transfers data at a far higher rate, close to 25MB/s. I'm having a little trouble matching these numbers, which I computed from the text results in the email, with the graphs, which seem to show much higher values, but the basic point is clear enough either way: the data load puts a LOT more strain on the network. The reason why that's relevant is that if you're not saturating the network anyway, then why do you want to use compression? It's bound to cost something in terms of CPU, and all you're saving is network bandwidth that wasn't the problem anyway. A bulk load is a lot more likely to hit the limits of your network. At the same time, it's not clear to me that a pgbench test *couldn't* saturate the network. This one was only getting ~2.4k TPS, which is not much. A read-write pgbench can go more than 10x that fast, and a read-only one can go more than 100x that fast, and suddenly that's a whole lot more network consumption.

So at the end of the day I'm not really quite sure what is best here. I agree with all of Craig's points about the advantages of packet-level compression, so I'd really prefer to make that approach work if we can. However, it also seems to me that there's room to be fairly concerned about what these test results are showing.

--
Robert Haas
EDB: http://www.enterprisedb.com