On Wed, Jul 28, 2021 at 2:10 PM Andres Freund <and...@anarazel.de> wrote: > On 2021-07-28 13:37:48 -0400, Melanie Plageman wrote: > > > Each IO will have its own TBMIterateResult allocated and returned by the > > PgStreamingRead helper and freed later by > > heapam_scan_bitmap_next_block() before requesting the next block. > > Previously it was allocated once and saved in the TBMIterator in the > > BitmapHeapScanState node and reused. Because of this, the table AM API > > routine, table_scan_bitmap_next_block() now defines the TBMIterateResult > > as an output parameter. > > > > I haven't made the GIN code reasonable yet either (it uses the TID > > bitmap functions that I've changed). > > I don't quite understand the need to change the tidbitmap interface, or > maybe rather I'm not convinced that pessimistically preallocating space > is a good idea? >
TBMIterator cannot contain a TBMIterateResult because it prefetches blocks and calls tbm_iterate() for each one, which would overwrite the relevant information in the TBMIterateResult before it has been returned to heapam_scan_bitmap_next_block().* Thus, we need at least as many TBMIterateResults as the size of the prefetch window at its largest. We could save some memory if we separated the data in TBMIterateResult and made a new struct, let's call it BitmapBlockState, with just the block number, buffer number, and recheck to be used and returned by bitmapheapscan_pgsr_next_single(). We need both block and buffer because we need to distinguish between hit_end, skip_fetch, and invalid block number conditions in the caller. We need recheck before initiating IO to determine if we should skip_fetch. Then a separate struct which is much the same as the existing TBMIterateResult could be maintained in the BitmapHeapScanState node and passed into heapam_scan_bitmap_next_block() along with the bitmap (a new parameter). In heapam_scan_bitmap_next_block(), after getting the BitmapBlockState from pg_streaming_read_get_next(), we could call tbm_find_pageentry() with the block number and bitmap. For a non-lossy page, we could then scrape the offsets and ntuples using the PageTableEntry. If it is lossy, we would set recheck and ntuples accordingly. (I do wonder if that allows us to distinguish between a lossy page and a block number that is erroneous and isn't in the bitmap--but maybe that can't happen.) However, we would still have as many palloc() calls (one for every block to create the BitmapBlockState. We would have less outstanding memory by limiting the number of offsets arrays created. We would still need to pass the recheck flag, ntuples, and buffer back up to BitmapHeapNext(), so, at that point we would still need a data structure that is basically the same as the existing TBMIterateResult. Alternatively, we could keep an array of TBMIterateResults the size of the prefetch window and reuse them -- though I'm not sure where to keep it and how to manage it when the window gets resized. In my current patch, I allocate and free one TBMIterateResult for each block. The amount of outstanding memory will be #ios in prefetch window * sizeof(TBMIterateResult). We don't want to always palloc() memory for the TBMIterateResult inside of tbm_iterate(), since other users (like GIN) still only need one TBMIterateResult So, if the TBMIterateResult is not inside of the TBMIterator and tbm_iterate() does not allocate the memory, we need to pass it in as an output parameter, and, if we do that, it felt odd to also return it -- hence the function signature change. One alternative I tried was having the TBMIterator have a pointer to the TBMIterateResult and then users of it can allocate the TBMIterateResult and set it in the TBMIterator before calling tbm_iterate(). But, then we need to expose the TBMIterator outside of the TID bitmap API. Also, it felt weird to have a member of the iterator which must not be NULL when tbm_iterate() is called but which isn't set up in tbm_begin_iterate(). > > > > static bool > > heapam_scan_bitmap_next_block(TableScanDesc scan, > > - TBMIterateResult *tbmres) > > + TBMIterateResult **tbmres) > > ISTM that we possibly shouldn't expose the TBMIterateResult outside of > the AM after this change? It feels somewhat like an implementation > detail now. It seems somewhat odd to expose a ** to set a pointer that > nodeBitmapHeapscan.c then doesn't really deal with itself. > All the members of the TBMIterateResult are populated in bitmapheapscan_pgsr_next_single() and then most of it is used by heapam_scan_bitmap_next_block() to - detect error conditions and done-ness - fill in the HeapScanDesc with the information needed by heapam_scan_bitmap_next_tuple() (rs_cbuf [which is basically redundant with TBMIterateResult->buffer] and rs_vistuples) However, some of the information is used up in BitmapHeapNext() and in heapam_scan_bitmap_next_tuple() and doesn't go in the HeapScanDesc: - BitmapHeapNext() uses the state of the TBMIterateResult to determine if the bitmap is exhausted, since the return value of table_scan_bitmap_next_block() indicates an error condition and not done-ness - BitmapHeapNext() uses recheck to determine whether or not to recheck qual conditions - heapam_scan_bitmap_next_tuple() uses the validity of the buffer to determine if it should return empty tuples - heapam_scan_bitmap_next_tuple() uses ntuples to determine how many empty tuples to return So, if we don't want to pass around a TBMIterateResult, we would have to 1) change the return value of heapam_scan_bitmap_next_block() and 2) find another appropriate place for the information above (or another way to represent the encoded information). It is also worth noting that heapam_scan_bitmap_next_tuple() took a TBMIterateResult before without using it, so I assume your foresaw other table AMs using it? Overall, the whole thing still feels a bit yucky to me. It doesn't quite feel like the right things are in the right places, but, I haven't put my finger on the culprit. I do think putting the buffer in the TBMIterateResult is an inappropriate addition to the TID Bitmap API. Also, I would like to move this code: if (node->tbmres->ntuples >= 0) node->exact_pages++; else node->lossy_pages++; from where it is in BitmapHeapNext(). It seems odd that that is the only part of BitmapHeapNext() that reaches inside of the TBMIterateResult. Also, as it is, it is incorrect--it doesn't count the first page. I could duplicate it under the first call to table_scan_bitmap_next_block(), but I wasn't looking forward to doing so. > > > @@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm) > > * Create the TBMIterator struct, with enough trailing space to serve the > > * needs of the TBMIterateResult sub-struct. > > */ > > - iterator = (TBMIterator *) palloc(sizeof(TBMIterator) + > > - MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber)); > > + iterator = (TBMIterator *) palloc(sizeof(TBMIterator)); > > iterator->tbm = tbm; > > Hm? > I removed the TBMIterateResult from the TBMIterator, so, we should no longer allocate memory for the offsets array when creating the TBMIterator. * I think that having TBMIterateResult inside of TBMIterator is not well-defined C language behavior. In [1], it says "Structures with flexible array members (or unions who have a recursive-possibly structure member with flexible array member) cannot appear as array elements or as members of other structures." [1] https://en.cppreference.com/w/c/language/struct