On Thu, Dec 14, 2017 at 11:45 AM, Andres Freund <and...@anarazel.de> wrote:
> +     bool        overflow;       /* Continuation of previous chunk? */
> +     char        data[FLEXIBLE_ARRAY_MEMBER];
> +} SharedTuplestoreChunk;
>
> Ah.  I was thinking we could have the 'overflow' variable be an int,
> indicating the remaining length of the oversized tuple.  That'd allow us
> to skip ahead to the end of the oversized tuple in concurrent processes
> after hitting it.

Right, that is a bit better: it avoids extra read-and-skip cycles when an
oversized tuple spans several overflow chunks.  Done that way.
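Sketched out, the idea looks something like this (a minimal model, not
the patch itself; the choice to count chunks rather than bytes, and the
exact comments, are this sketch's assumptions):

    /*
     * Minimal sketch of the chunk header with an int 'overflow'.  A
     * reader that lands on the continuation of an oversized tuple can
     * jump over all of its remaining chunks in one step instead of
     * testing them one by one.
     */
    #define FLEXIBLE_ARRAY_MEMBER   /* empty, as in c.h with C99 */

    typedef struct SharedTuplestoreChunk
    {
        int     ntuples;    /* number of tuples in this chunk */
        int     overflow;   /* 0 for a regular chunk, otherwise the
                             * number of chunks the oversized tuple
                             * still covers, counting this one */
        char    data[FLEXIBLE_ARRAY_MEMBER];
    } SharedTuplestoreChunk;

The skip then becomes a single addition to the shared read position
(some multiple of STS_CHUNK_PAGES) rather than a loop, as sketched again
on the read side further down.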
> +     if (accessor->write_pointer + accessor->sts->meta_data_size >=
> +             accessor->write_end)
> +             elog(ERROR, "meta-data too long");
>
> That seems more like an Assert than a proper elog?  Given that we're
> calculating size just a few lines above...

It's an error because the logic is not smart enough to split the optional
meta-data and tuple size over multiple chunks.  I have added comments
there to explain.  That error can be reached by CALL
test_sharedtuplestore(1, 1, 1, 32756, 1), but 32755 is OK.  My goal here
is to support arbitrarily large tuples, not arbitrarily large per-tuple
meta-data: for my use case I only need 4 bytes (a hash value).  This
could be improved if required by later features.  Anyone wanting more
general meta-data would probably want it to be variable sized, whereas
this is fixed, and it would also be nice if oversized tuples didn't have
to start at the beginning of a new chunk.

I fixed two nearby fencepost bugs: I made the limit that triggers that
error smaller by sizeof(uint32), and I fixed a problem when small tuples
appear after an oversized tuple in a final overflow chunk (found by
hacking the test module to create mixtures of different-sized tuples).

> +     if (accessor->sts->meta_data_size > 0)
> +             memcpy(accessor->write_pointer, meta_data,
> +                        accessor->sts->meta_data_size);
> +     written = accessor->write_end - accessor->write_pointer -
> +             accessor->sts->meta_data_size;
> +     memcpy(accessor->write_pointer + accessor->sts->meta_data_size,
> +                tuple, written);
>
> Also, shouldn't the same Assert() be here as well if you have it above?

No: when it comes to the tuple itself, we just write as much of it as
will fit and write the rest in the loop below.  Comments improved to
make that clear.
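To spell out that asymmetry, here is a standalone toy model of the
write-side rule (ToyWriter, toy_put_tuple_start and CHUNK_DATA_SIZE are
invented for illustration; this is not the patch): meta-data must land
in the current chunk in one piece, which is what the hard error above
protects, while the tuple is written as far as it fits and the caller
spills the remainder into overflow chunks.

    #include <string.h>

    #define CHUNK_DATA_SIZE 8192

    typedef struct ToyWriter
    {
        char    chunk[CHUNK_DATA_SIZE];
        size_t  used;               /* bytes used in the current chunk */
        size_t  meta_data_size;     /* fixed per-tuple meta-data size */
    } ToyWriter;

    /*
     * Write the fixed-size meta-data and the start of one tuple.
     * Returns the number of tuple bytes that fit, or (size_t) -1 if the
     * meta-data itself cannot fit; the caller loops, pushing the
     * remaining tuple bytes into fresh overflow chunks.
     */
    static size_t
    toy_put_tuple_start(ToyWriter *w, const char *meta,
                        const char *tuple, size_t tuple_size)
    {
        size_t  fit;

        /* Meta-data is never split across chunks, hence a hard error. */
        if (w->used + w->meta_data_size >= CHUNK_DATA_SIZE)
            return (size_t) -1;     /* "meta-data too long" in the patch */

        memcpy(w->chunk + w->used, meta, w->meta_data_size);
        w->used += w->meta_data_size;

        /* The tuple, by contrast, is written as far as it fits... */
        fit = CHUNK_DATA_SIZE - w->used;
        if (fit > tuple_size)
            fit = tuple_size;
        memcpy(w->chunk + w->used, tuple, fit);
        w->used += fit;

        /* ...and the rest goes into overflow chunks in a later loop. */
        return fit;
    }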
> +             ereport(ERROR,
> +                             (errcode_for_file_access(),
> +                              errmsg("could not read from shared tuplestore temporary file"),
> +                              errdetail("Short read while reading meta-data")));
>
> The errdetail doesn't follow the style guide (not a sentence ending with
> .), and seems internal-ish.  I'm ok with keeping it, but perhaps we
> should change all these to be errdetail_internal()?  Seems pointless to
> translate all of them.

Done.

> +     LWLockAcquire(&p->lock, LW_EXCLUSIVE);
> +     eof = p->read_page >= p->npages;
> +     if (!eof)
> +     {
> +             read_page = p->read_page;
> +             p->read_page += STS_CHUNK_PAGES;
> +     }
> +     LWLockRelease(&p->lock);
>
> So if we went to the world I'm suggesting, with overflow containing the
> length till the end of the tuple, this'd probably have to look a bit
> different.

Yeah.  I almost wanted to change it back to a spinlock, but now it's
grown bigger again...

> +     if (!eof)
> +     {
> +             SharedTuplestoreChunk chunk_header;
> +
> +             /* Make sure we have the file open. */
> +             if (accessor->read_file == NULL)
> +             {
> +                     char            name[MAXPGPATH];
> +
> +                     sts_filename(name, accessor, accessor->read_participant);
> +                     accessor->read_file =
> +                             BufFileOpenShared(accessor->fileset, name);
> +                     if (accessor->read_file == NULL)
> +                             elog(ERROR, "could not open temporary file %s", name);
>
> Isn't this more an Assert or just not anything?  There's no way
> BufFileOpenShared should ever return NULL, no?

Right.  As of commit 923e8dee this can no longer return NULL (instead it
raises an error), so I removed the redundant check.

> +                     /* If this is an overflow chunk, we skip it. */
> +                     if (chunk_header.overflow)
> +                             continue;
> +
> +                     accessor->read_ntuples = 0;
> +                     accessor->read_ntuples_available = chunk_header.ntuples;
> +                     accessor->read_bytes = offsetof(SharedTuplestoreChunk, data);
>
> Perhaps somewhere around here comment that we'll just loop around and
> call sts_read_tuple() in the next loop iteration?

Done.
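As an aside for the archives, here is a toy model of what the int
'overflow' buys the reader (single-process and standalone; ToyChunkHeader,
ToyParticipant and toy_next_data_chunk are invented names, and in the
real code the position is claimed under the lock shown above,
STS_CHUNK_PAGES at a time):

    typedef struct ToyChunkHeader
    {
        int     ntuples;    /* tuples beginning in this chunk */
        int     overflow;   /* 0, or chunks left in an oversized
                             * tuple, counting this one */
    } ToyChunkHeader;

    typedef struct ToyParticipant
    {
        int     read_chunk; /* next chunk to hand out */
        int     nchunks;    /* total chunks this participant wrote */
    } ToyParticipant;

    /*
     * Claim the next chunk that actually starts tuples, jumping over a
     * run of overflow chunks in one step rather than inspecting each in
     * turn.  Returns -1 at end of file; otherwise the caller loops
     * around and reads tuples from the returned chunk.
     */
    static int
    toy_next_data_chunk(ToyParticipant *p, const ToyChunkHeader *headers)
    {
        while (p->read_chunk < p->nchunks)
        {
            int     chunk = p->read_chunk++;

            if (headers[chunk].overflow > 0)
            {
                /* Continuation of an oversized tuple: skip past it. */
                p->read_chunk = chunk + headers[chunk].overflow;
                continue;
            }
            return chunk;
        }
        return -1;
    }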
--
Thomas Munro
http://www.enterprisedb.com

parallel-hash-v29.patchset.tgz