On Wed, 2019-12-04 at 17:24 -0800, Melanie Plageman wrote:
> It looks like the parameter input_tuples passed to hash_spill_init()
> in lookup_hash_entries() is the number of groups estimated by the
> planner. However, when reloading a spill file, if we run out of
> memory and re-spill, hash_spill_init() is passed batch->input_groups
> (which is actually set from input_ngroups, which is the number of
> tuples in the spill file). So, input_tuples is groups and
> input_groups is input_tuples. It may be helpful to rename this.
You're right; this is confusing. I will clarify this in the next
patch.

> So, it looks like the HASH_PARTITION_FACTOR is only used when
> re-spilling. The initial hashtable will use work_mem.
> It seems like the reason for using it when re-spilling is to be very
> conservative to avoid more than one re-spill and make sure each
> spill file fits in a hashtable in memory.

It's used any time a spill happens, even the first spill. I'm
flexible on the use of HASH_PARTITION_FACTOR though... it seems not
everyone thinks it's a good idea. To me it's just a knob to tune, and
I tend to think over-partitioning is the safer bet most of the time.

> The comment does seem to point to some other reason, though...

I have observed some anomalies where smaller work_mem values (for
already-low values of work_mem) result in faster runtime. The only
explanation I have is caching effects.

> 256 actually seems very large. hash_spill_npartitions() will be
> called for every respill, so HASH_MAX_PARTITIONS is not the total
> number of spill files permitted, but, actually, it is the number of
> respill files in a given spill (a spill set). So if you made X
> partitions initially and every partition re-spills, now you would
> have (at most) X * 256 partitions.

Right. Though I'm not sure there's any theoretical max... given
enough input tuples, it will just keep getting deeper.

If this is a serious concern, maybe I should make it depth-first
recursion by prepending new work items rather than appending. That
would still not bound the theoretical max, but it would slow the
growth.

> If HASH_MAX_PARTITIONS is 256, wouldn't the metadata from the spill
> files take up a lot of memory at that point?

Yes. Each file keeps a BLCKSZ buffer, plus some other metadata. And
it does create a file, so it's offloading some work to the OS to
manage that new file.

It's annoying to properly account for these costs because the memory
needs to be reserved at the time we are building the hash table, but
we don't know how many partitions we want until it comes time to
spill. And for that matter, we don't even know whether we will need
to spill or not.

There are two alternative approaches which sidestep this problem:

1. Reserve a fixed fraction of work_mem, say, 1/8, to make space for
   however many partitions that memory can handle. We would still
   have a min and max, but the logic for reserving the space would be
   easy, and so would choosing the number of partitions to create.
     * Pro: simple
     * Con: lose the ability to choose the number of partitions

2. Use logtape.c instead (suggestion from Heikki). Supporting more
   logical tapes doesn't impose costs on the OS, and we can
   potentially use a lot of logical tapes.
     * Pro: can use lots of partitions without making lots of files
     * Con: buffering still needs to happen somewhere, so we still
       need memory for each logical tape. Also, we risk losing
       locality of read access when reading the tapes, or perhaps
       confusing readahead. Fundamentally, logtape.c was designed for
       sequential write, random read; but we are going to do random
       write and sequential read.

Regards,
	Jeff Davis
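
P.S. To put a number on the buffer overhead: with the default 8kB
BLCKSZ, 256 open partitions is already about 2MB of buffer space,
before counting the per-file metadata. For concreteness, option 1
might look roughly like the sketch below. The function and constant
names and the per-partition cost here are hypothetical (not from the
patch), and it assumes work_mem is given in kilobytes as usual.

/*
 * Hypothetical sketch of option 1: reserve work_mem / 8 up front for
 * spill-partition buffers and derive the partition count from that,
 * rather than deciding how much to reserve at spill time.
 */
#include <stddef.h>

#ifndef BLCKSZ
#define BLCKSZ 8192					/* default PostgreSQL block size */
#endif

#define HASH_PARTITION_MEM_DIVISOR	8	/* reserve work_mem / 8 */
#define HASH_MIN_PARTITIONS			4	/* assumed lower bound */
#define HASH_MAX_PARTITIONS			256	/* upper bound discussed above */

static int
hash_choose_num_partitions_simple(int work_mem_kb)
{
	/* memory set aside for spill-partition buffers */
	size_t		reserved = (size_t) work_mem_kb * 1024 /
		HASH_PARTITION_MEM_DIVISOR;

	/* assume each partition costs one BLCKSZ buffer plus a bit of metadata */
	size_t		per_partition = BLCKSZ + 256;

	int			npartitions = (int) (reserved / per_partition);

	if (npartitions < HASH_MIN_PARTITIONS)
		npartitions = HASH_MIN_PARTITIONS;
	if (npartitions > HASH_MAX_PARTITIONS)
		npartitions = HASH_MAX_PARTITIONS;

	return npartitions;
}

For example, with work_mem = 4MB this reserves 512kB and allows about
62 partitions; the min/max bounds keep very small or very large
work_mem settings from producing degenerate partition counts.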