On 10 May 2016 at 05:33, Robert Haas <robertmh...@gmail.com> wrote:
> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior, since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.  I think that's
> probably right.  For example, consider a 5-table join where all of
> the joins are implemented as hash tables.  If this query plan is
> going to be run to completion, it would make much more sense to
> fetch, say, 100 tuples from the driving scan and then probe for all
> of those in the first hash table, and then probe for all of those in
> the second hash table, and so on.  What we do instead is fetch one
> tuple and probe for it in all 5 hash tables, and then repeat.  If
> one of those hash tables would fit in the CPU cache but all five
> together will not, that seems likely to be a lot worse.  But even
> just ignoring the CPU cache aspect of it for a minute, suppose you
> want to write a loop to perform a hash join.  The inner loop fetches
> the next tuple from the probe table and does a hash lookup.  Right
> now, fetching the next tuple from the probe table means calling a
> function which in turn calls another function which probably calls
> another function which probably calls another function and now about
> 4 layers down we actually get the next tuple.  If the scan returned
> a batch of tuples to the hash join, fetching the next tuple from the
> batch would probably be 0 or 1 function calls rather than ... more.
> Admittedly, you've got to consider the cost of marshaling the
> batches but I'm optimistic that there are cycles to be squeezed out
> here.
> We might also want to consider storing batches of tuples in a
> column-optimized rather than row-optimized format so that iterating
> through one or two attributes across every tuple in the batch
> touches the minimal number of cache lines.
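To make the batched hash-probe idea above concrete, here's a much-simplified sketch (not PostgreSQL's actual executor API; Batch, probe_batch and the open-addressed table are hypothetical) of probing one hash table for every tuple in a batch before moving to the next table, so that table's buckets stay hot in cache:

```c
#include <stddef.h>

#define BATCH_SIZE 100

/* Hypothetical, simplified batch of join keys fetched from the
 * driving scan -- not PostgreSQL's TupleTableSlot. */
typedef struct Batch
{
	int		ntuples;
	int		keys[BATCH_SIZE];
} Batch;

/*
 * Probe a single (toy, open-addressed) hash table for every tuple in
 * the batch before moving on to the next table.  Empty buckets hold -1.
 * Returns the number of matches, writing the matching keys to *matches.
 */
static int
probe_batch(const Batch *batch, const int *hashtab, int nbuckets,
			int *matches)
{
	int		nmatch = 0;

	for (int i = 0; i < batch->ntuples; i++)
	{
		int		bucket = batch->keys[i] % nbuckets;

		if (hashtab[bucket] == batch->keys[i])
			matches[nmatch++] = batch->keys[i];
	}
	return nmatch;
}
```

In a real 5-way join the surviving matches from this table would become the batch probed into the next table, rather than one tuple visiting all five tables in turn.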
It's interesting that you mention this. We identified this as a pain point during our work on column stores last year. Simply passing single tuples around the executor is really unfriendly towards the L1 instruction cache, plus also the points you mention about the L3 cache and hash tables and tuple stores. I really think that we're likely to see significant gains by processing more than one tuple at a time, so this topic very much interests me.

On researching this we've found that other people's research does indicate that there are gains to be had:

http://www.openlinksw.com/weblog/oerling/

In that blog there's a table which indicates that this row-store database saw a 4.4x performance improvement from changing from a tuple-at-a-time executor to a batch tuple executor:

Batch size 1 tuple     = 122 seconds
Batch size 10k tuples  = 27.7 seconds

When we start multiplying those increases with the increases from something like parallel query then we're starting to see very nice gains in performance.

Alvaro, Tomas and I had been discussing this, and late last year I did look into what would be required to allow this to happen in Postgres. Basically there are two sub-projects. I'll describe what I've managed to learn so far about each, and the rough plan that I have to implement them:

1. Batch Execution:

a. Modify the scan API to allow batch tuple fetching in predefined batch sizes.

b. Modify TupleTableSlot to allow more than one tuple to be stored. Add a flag to indicate whether the struct contains a single tuple or multiple tuples. Multiple tuples may need to be deformed in a non-lazy fashion in order to prevent too many buffers from having to be pinned at once. Tuples will be deformed into arrays of each column rather than arrays for each tuple (this part is important to support the next sub-project).

c. Modify some nodes (perhaps start with nodeAgg.c) to allow them to process a batch TupleTableSlot. This will require some tight loop to aggregate the entire TupleTableSlot at once before returning.

d. Add a function in execAmi.c which returns true or false depending on whether the node supports batch TupleTableSlots or not.

e. At executor startup, determine if the entire plan tree supports batch TupleTableSlots, and if so enable batch scan mode.

That at least is my idea for stage 1. There's still more to work out, e.g. should batch mode occur when the query has a LIMIT? We might not want to waste time gathering up extra tuples when we're just going to stop after the first one. So perhaps 'e' above should be up to the planner instead.

Further development work here might add a node type that de-batches a TupleTableSlot to allow nodes which don't support batching to be in the plan, i.e. "mixed execution mode". I'm less excited about this as it may be difficult to cost that operation; probably the time would be better spent just batch-enabling the other node types, which *may* not be all that difficult. I'm also assuming that batch mode (in all cases apart from queries with LIMIT or cursors) will always be faster than tuple-at-a-time, so it requires no costings from the planner.

2. Vector processing

(I admit that I've given this part much less thought so far, but here's what I have in mind.)

This depends on batch execution, and is intended to allow the executor to perform function calls on an entire batch at once, rather than tuple-at-a-time. For example, let's take the following query:

SELECT a+b FROM t;

Here (as of now) we'd scan "t" one row at a time and perform a+b after having deformed enough of the tuple to do that. We'd then go and get another tuple from the scan node and repeat until the scan gave us no more tuples. With batch execution we'd fetch multiple tuples from the scan and then perform the call to, say, int4_pl() multiple times, which still kinda sucks as it means calling int4_pl() possibly millions of times (once per tuple).
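To illustrate, a column-wise batch slot as sketched in (1b), evaluated tuple-at-a-time, might look roughly like this (BatchSlot, int4_pl_scalar and eval_a_plus_b are hypothetical names, and real Datums are elided for simplicity):

```c
#include <stdint.h>

#define BATCH_SIZE 1024

/* Hypothetical column-wise batch slot: one array per attribute, as
 * sketched in (1b) above -- not PostgreSQL's actual TupleTableSlot. */
typedef struct BatchSlot
{
	int		ntuples;
	int32_t	a[BATCH_SIZE];		/* deformed values of column a */
	int32_t	b[BATCH_SIZE];		/* deformed values of column b */
} BatchSlot;

/* Stand-in for int4_pl(): adds two int4 values. */
static int32_t
int4_pl_scalar(int32_t x, int32_t y)
{
	return x + y;
}

/* Even with batch execution, "a + b" still means one function call
 * per tuple -- the problem the vector mode below aims to remove. */
static void
eval_a_plus_b(const BatchSlot *slot, int32_t *result)
{
	for (int i = 0; i < slot->ntuples; i++)
		result[i] = int4_pl_scalar(slot->a[i], slot->b[i]);
}
```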
The vector mode here would require that we modify pg_operator to add a vector function for each operator, so that we can call the function passing in an array of Datums and a length and have SIMD operations perform the addition. We'd call something like int4_pl_vector() only once per batch of tuples, allowing the CPU to perform SIMD operations on those Datum arrays. This could be done in an incremental way, as the code could just fall back on the standard function in cases where a vectorised version of it is not available. Thought is needed here about when exactly this decision is made, as the user may not have permissions to execute the vector function, so it can't simply be a run-time check. These functions would simply return another vector of the results.

Aggregates could be given a vector transition function, where something like COUNT(*)'s vector_transfn would simply do current_count += vector_length.

This project does appear to require that we bloat the code with hundreds of vector versions of each function. I'm not quite sure if there's a better way to handle this. The problem is that the fmgr is pretty much a barrier to SIMD operations, and this was the only idea that I've had so far about breaking through that barrier. So further ideas here are very welcome.

The idea here is that these two projects help pave the way to bringing columnar storage into PostgreSQL. Without these we're unlikely to get much benefit from columnar storage, as we'd still be stuck processing rows one at a time. Adding columnar storage on top of the above should further increase performance, as we can skip the tuple-deform step and pull columnar arrays/vectors directly into a TupleTableSlot, although some trickery would be involved here when the scan has keys.

I just want to add that both of the above do require more thought.
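As a rough illustration of the shape such functions might take (the signatures are purely hypothetical; int4_pl_vector() is the name floated above, and real implementations would go through the fmgr with Datum arrays), note that a plain tight loop over whole arrays is exactly what compilers can auto-vectorise into SIMD instructions:

```c
#include <stdint.h>

/* Hypothetical vector variant of int4_pl(): one call per batch rather
 * than one per tuple; the simple loop is SIMD-friendly. */
static void
int4_pl_vector(const int32_t *a, const int32_t *b, int32_t *result, int n)
{
	for (int i = 0; i < n; i++)
		result[i] = a[i] + b[i];
}

/* Hypothetical vector transition function for COUNT(*): the whole
 * batch advances the aggregate state with a single addition. */
static int64_t
count_star_vector_transfn(int64_t current_count, int vector_length)
{
	return current_count + vector_length;
}
```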
We realised that this was required quite late in our column store work (which we've all now taken a break from to work on other things), so we've had little time to look much further into it. However, I should be starting work on this again in the next few months, in the hope of having something, even the most simple version of it, in 9.7.

Comments are welcome.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers