Added to TODO:

> * Consider being smarter about memory and external files used during
>   sorts
>
>   http://archives.postgresql.org/pgsql-hackers/2007-11/msg01101.php
>   http://archives.postgresql.org/pgsql-hackers/2007-12/msg00045.php
---------------------------------------------------------------------------

Simon Riggs wrote:
> Just wanted to review a few thoughts and ideas around improving external
> sorts, as Jim Nasby recently encouraged me to do.
>
> Current issues/opportunities are these:
>
> ISSUES
>
> a) Memory is always in short supply, so using what we have more
> effectively is going to be welcome.
>
> b) Heap sort has a reasonably strong anti-memory effect, meaning that
> there is an optimum amount of memory for any sort. This shows itself as
> CPU time increasing during run forming, making this stage of the sort
> CPU bound.
>
> c) Many sorts are performed prior to aggregation. It might be possible
> to aggregate prior to writing to disk, as a way of reducing the overall
> I/O cost. Benefit would occur when the total CPU cost was the same no
> matter when aggregation occurred; that would not apply in all cases, so
> we would need to sense when benefit was possible.
>
> d) Generally reducing the I/O cost of sorting may help the merging
> stages of a sort.
>
>
> SOLUTIONS
>
> The ideas that Greg Stark, Jim Nasby, Heikki and I have discussed to
> date are the following:
>
> 1. Sort I/O Compression
> 2. Aggregation during Sort
> 3. Memory Pools
> 4. Dynamic Heap Management
> 5. Dynamic Run Handling
>
> I've added (5) to the list as well, which hasn't yet been discussed.
>
> 1. SORT I/O COMPRESSION
>
> This idea is not dead yet; it just needs a full set of tests to confirm
> that there is benefit in all cases. If there's not benefit in all cases,
> we may be able to work out which cases those are, so we know when to use
> it.
>
>
> 2. AGGREGATION DURING SORT
>
> Many sorts are preliminary steps before aggregation. Aggregation during
> run forming would potentially reduce the size of the heap and the number
> of comparisons. For many types of aggregate this would theoretically not
> increase the number of ops, since sum(), avg(), min() and max() are all
> commutative with respect to their inputs. We would probably need to add
> another option to Aggregate Functions to indicate the possibility of
> calculating the aggregate in this way, since some aggregates might rely
> on the current behaviour of seeing all their inputs at once, in sorted
> order. (Windowed aggregates are unlikely to be this way.)
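As a rough illustration of the idea in section 2, and not something taken
from the original mail or from PostgreSQL itself, the toy C program below
folds tuples that share a key into partial sum()/count() states while the
in-memory run buffer fills, and only spills when nothing can be combined
and memory is exhausted. RunEntry, run_add(), run_spill(), the sorted-array
buffer and the tiny capacity are all inventions for the sketch.

    /*
     * Toy sketch: aggregate while forming runs, so that equal keys are
     * combined before anything is written out.  Only valid for transition
     * functions that can be applied in any order, e.g. sum()/count().
     */
    #include <stdio.h>
    #include <string.h>

    #define RUN_CAPACITY 4          /* tiny "work_mem" so spills are visible */

    typedef struct
    {
        int     key;                /* grouping key */
        long    sum;                /* partial state for sum(value) */
        long    count;              /* partial state for count(*) */
    } RunEntry;

    static RunEntry run[RUN_CAPACITY];
    static int      run_len = 0;

    /* Pretend to write the current (already sorted) run to a tape. */
    static void
    run_spill(void)
    {
        printf("spilling run of %d entries\n", run_len);
        for (int i = 0; i < run_len; i++)
            printf("  key=%d sum=%ld count=%ld\n",
                   run[i].key, run[i].sum, run[i].count);
        run_len = 0;
    }

    /* Add one input tuple, folding it into an existing entry if possible. */
    static void
    run_add(int key, long value)
    {
        int     lo = 0,
                hi = run_len;

        /* binary search for the existing key / insertion point */
        while (lo < hi)
        {
            int     mid = (lo + hi) / 2;

            if (run[mid].key < key)
                lo = mid + 1;
            else
                hi = mid;
        }

        if (lo < run_len && run[lo].key == key)
        {
            /* equal key: combine, so memory use does not grow at all */
            run[lo].sum += value;
            run[lo].count++;
            return;
        }

        if (run_len == RUN_CAPACITY)
        {
            run_spill();            /* out of memory: start a new run */
            lo = 0;
        }

        /* insert a fresh partial state, keeping the buffer sorted */
        memmove(&run[lo + 1], &run[lo], (run_len - lo) * sizeof(RunEntry));
        run[lo].key = key;
        run[lo].sum = value;
        run[lo].count = 1;
        run_len++;
    }

    int
    main(void)
    {
        int     keys[] = {5, 1, 5, 3, 1, 9, 7, 3, 5, 2};

        for (int i = 0; i < 10; i++)
            run_add(keys[i], keys[i] * 10);
        run_spill();
        return 0;
    }

Combining is only safe when the transition function can be applied in any
order, which is the property the mail proposes to record as a new aggregate
option, and partial states that end up in different runs still have to be
combined with each other at merge time.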
> 3. MEMORY POOLS
>
> Solving a) could be done by sensible management and allocation of
> resources. Discussed before, so not rehashed here.
>
>
> 4. DYNAMIC HEAP MANAGEMENT
>
> The size of the active heap required to produce the fewest runs varies
> as the sort progresses. For example, sorting an already sorted input
> needs a trivial heap size.
>
> Larger heap sizes simply avoid forming more runs, which is not
> necessarily a bad thing. More runs only become a bad thing when we go
> beyond our ability to perform a single final merge (see Dynamic Run
> Handling below).
>
> Smaller heap sizes reduce the number of comparisons required and improve
> L2+ cache efficiency. Those two things are the cause of the anti-memory
> effect.
>
> Because of b), optimising the size of the heap could potentially be a
> good thing. This can make a considerable difference for nearly sorted
> data (measurements required...).
>
> When we have M amount of memory available to us, we don't start by using
> it all. We start with m memory and only increase up to M if required.
> Runs are built with memory set at m. If a tuple arrives that would force
> the formation of a new run, we assess:
>
> i) do we care if another run is formed? Use our knowledge of the likely
> amount of data coming our way, compared with the number of runs formed
> so far, and see if we really care. If we don't care, allow the new run
> to be formed and carry on with a heap size of just m. (See Dynamic Run
> Handling later.)
>
> ii) if we do care about the number of runs, then allow the heap to grow
> by increments up to the full size of M. Increments would be at least x2
> and possibly x4. That way we always have work space to rearrange the
> heap.
>
> All of this dances too cleverly around the exact technique and potential
> costs of rearranging the heap. That is not to be ignored and is the next
> task in evaluating and accepting/dismissing this potential technique.
>
> In combination with memory pooling this technique might also allow
> memory to be better distributed to other users.
>
>
> 5. DYNAMIC RUN HANDLING (in Final Merge)
>
> Another way of addressing a) is simply to make better use of memory
> itself. Let's look at that in more detail:
>
> The number of runs that can be merged at once is currently fixed, based
> upon available memory. This has the underlying assumption that all runs
> will be concurrently active during final merging, which may not always
> be true.
>
> If we have random data then almost all runs will overlap with all other
> runs, i.e. the min and max values are sufficiently wide that the runs do
> all overlap. In many cases, data arrives in somewhat sorted order, e.g.
> financial data is fairly regular with some late payers, but not many,
> and those trail off with a fairly tight decay. In the somewhat sorted
> case we find that the actual overlap is less than total, so there are
> many later runs that don't overlap the earlier ones. In the best case we
> would find that runs 1 and 2 overlap, runs 2 and 3 overlap, then 3 and 4
> overlap.
>
> This is also the point where I suggest breaking away from Knuth
> completely. All of the main algorithms described by Knuth are tape
> sorts. A run is written to a particular tape and then stays there until
> "moved" to another tape. That means we have to get super-clever about
> how runs should be written and formed (see Knuth). If we realise that
> runs aren't fixed to particular tapes, that they are all just
> independent runs, we can radically rethink sorting. There is no need to
> implement Cascade Sort, but we do need to rethink merging from the
> ground up. (All of which is a relief, because Knuth et al are definitely
> smarter than me, but I've got disks and lots of memory and those guys
> had tapes.)
>
> If we track the min and max values for each run, then when run building
> is finished we will be able to build a merging plan that allows us to be
> smart about the runs we should bring together. We start with the run
> that has the lowest min value, together with all runs that overlap it.
> When that run is exhausted we move to the next lowest and at that point
> start merging all runs that overlap that one.
>
> This then means we may be able to begin final merging with more runs
> than the current cut-off. It's possible that we could merge an
> effectively unlimited number of runs in the final merge with fixed
> memory. If we *do* need to merge, we can work out which runs should be
> our best pre-merge candidates, based upon how big they are and which
> other runs they overlap. (That's much better than being forced to merge
> tapes 2, 7 and 17 because some bizarre math says so (see Knuth).)
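To make the merge-plan idea from section 5 concrete, here is a small
sketch, again invented for illustration rather than taken from PostgreSQL:
each run remembers its min and max key, the runs are sorted by min, and we
count how many runs ever cover the same key value. For somewhat-sorted
data that number, which is the fan-in the final merge really needs, can be
far below the total number of runs.

    /*
     * Toy sketch: plan the final merge from each run's [min,max] range.
     * A run only has to be opened once the merge output reaches its min,
     * so the required fan-in is the largest number of ranges covering any
     * single key, and that maximum always occurs at some run's min value.
     */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct
    {
        int     min;                /* smallest key written to this run */
        int     max;                /* largest key written to this run */
    } RunRange;

    static int
    cmp_by_min(const void *a, const void *b)
    {
        const RunRange *ra = a;
        const RunRange *rb = b;

        return (ra->min > rb->min) - (ra->min < rb->min);
    }

    int
    main(void)
    {
        /* somewhat-sorted input: each run barely overlaps its neighbours */
        RunRange    runs[] = {
            {0, 120}, {100, 230}, {210, 350}, {340, 460}, {450, 600}
        };
        int         nruns = sizeof(runs) / sizeof(runs[0]);
        int         max_open = 0;

        qsort(runs, nruns, sizeof(RunRange), cmp_by_min);

        for (int i = 0; i < nruns; i++)
        {
            int     open = 0;

            /* how many runs are active when run i's first tuple emerges? */
            for (int j = 0; j < nruns; j++)
                if (runs[j].min <= runs[i].min && runs[i].min <= runs[j].max)
                    open++;
            if (open > max_open)
                max_open = open;
        }

        printf("%d runs formed, but only %d need to be merged at once\n",
               nruns, max_open);
        return 0;
    }

A real plan would presumably be an ordered list of events of the form
"open run X once the output reaches key K" rather than a single number,
but the overlap test on the stored min/max values stays the same.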
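The grow-or-spill choice from section 4 can be sketched in the same
spirit. Nothing below exists in PostgreSQL; SortState, should_grow_heap()
and the crude projection of the final run count are made up purely to
show the shape of the i)/ii) decision.

    /*
     * Toy sketch: when a tuple cannot join the current run, either let a
     * new run form or grow the heap's memory budget from m towards M in
     * x2 steps, depending on whether the projected number of runs still
     * fits in a single final merge.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct
    {
        long    current_mem;        /* "m": what the heap may use right now */
        long    max_mem;            /* "M": the full memory budget */
        int     runs_formed;        /* runs already written out */
        int     final_merge_fan_in; /* runs one final merge can handle */
        double  input_seen;         /* fraction of the expected input read */
    } SortState;

    /*
     * true  -> double current_mem (capped at max_mem) and keep the run going
     * false -> simply start another run
     */
    static bool
    should_grow_heap(const SortState *s)
    {
        int     projected_runs;

        if (s->input_seen <= 0.0)
            return false;           /* no estimate yet: take the cheap option */

        /* if runs keep appearing at this rate, how many will we end up with? */
        projected_runs = (int) ((s->runs_formed + 1) / s->input_seen);

        /* i) we don't care: a single final merge can still absorb them all */
        if (projected_runs <= s->final_merge_fan_in)
            return false;

        /* ii) we do care, but growing only helps while memory remains */
        return s->current_mem < s->max_mem;
    }

    int
    main(void)
    {
        SortState   s = {4 * 1024 * 1024L, 64 * 1024 * 1024L, 5, 32, 0.10};

        /* 5 runs after 10% of the input projects to ~60 runs: grow the heap */
        printf("grow heap: %s\n", should_grow_heap(&s) ? "yes" : "no");
        return 0;
    }

Feeding the fan-in produced by a merge plan like the one above into
final_merge_fan_in is roughly the combination of techniques 4 and 5 that
the closing remarks below allude to.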
> Anyway, claiming to have found a better way than Knuth makes me feel a
> little nervous, so some searching questions on this are very welcome.
>
> Interestingly, if we combine this technique with dynamic heap management
> we may be able to allow a very large number of efficiently written runs
> to form without it causing any merging.
>
> mac_man recently noted the possibility that some runs don't overlap at
> all and so can be merged for free. That's true, though it doesn't
> actually improve the basic idea here, which is building a merge plan
> after runs have been formed, with an eye on minimizing and potentially
> eliminating the merge phase.
>
> There are probably some typos or thinkos above, so go easy on me Greg!
> They aren't there because I want to skim over anything.
>
> I'm not likely to get a chance to do all of this in the near future, so
> documenting it now should help others to carry things forward.
>
> --
> Simon Riggs
> 2ndQuadrant  http://www.2ndQuadrant.com

--
  Bruce Momjian  <[EMAIL PROTECTED]>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +