On Thu, Dec 14, 2017 at 11:45 AM, Andres Freund <and...@anarazel.de> wrote:
> +     bool        overflow;       /* Continuation of previous chunk? */
> +     char        data[FLEXIBLE_ARRAY_MEMBER];
> +} SharedTuplestoreChunk;
>
> Ah.  I was thinking we could have the 'overflow' variable be an int,
> indicating the remaining length of the oversized tuple.  That'd allow us
> to skip ahead to the end of the oversized tuple in concurrent processes
> after hitting it.

Right, that is a bit better: it avoids extra read-and-skip cycles when an
oversized tuple spans several overflow chunks.  Done that way.
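Sketched out, the idea looks something like this (a minimal model, not
the patch itself; the choice to count chunks rather than bytes, and the
exact comments, are this sketch's assumptions):

    /*
     * Minimal sketch of the chunk header with an int 'overflow'.  A
     * reader that lands on the continuation of an oversized tuple can
     * jump over all of its remaining chunks in one step instead of
     * testing them one by one.
     */
    #define FLEXIBLE_ARRAY_MEMBER   /* empty, as in c.h with C99 */

    typedef struct SharedTuplestoreChunk
    {
        int     ntuples;    /* number of tuples in this chunk */
        int     overflow;   /* 0 for a regular chunk, otherwise the
                             * number of chunks the oversized tuple
                             * still covers, counting this one */
        char    data[FLEXIBLE_ARRAY_MEMBER];
    } SharedTuplestoreChunk;

The skip then becomes a single addition to the shared read position
(some multiple of STS_CHUNK_PAGES) rather than a loop, as sketched again
on the read side further down.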
> +     if (accessor->write_pointer + accessor->sts->meta_data_size >=
> +             accessor->write_end)
> +             elog(ERROR, "meta-data too long");
>
> That seems more like an Assert than a proper elog?  Given that we're
> calculating size just a few lines above...

It's an error because the logic is not smart enough to split the optional
meta-data and tuple size over multiple chunks.  I have added comments
there to explain.  That error can be reached by CALL
test_sharedtuplestore(1, 1, 1, 32756, 1), but 32755 is OK.  My goal here
is to support arbitrarily large tuples, not arbitrarily large per-tuple
meta-data: for my use case I only need 4 bytes (a hash value).  This
could be improved if required by later features.  Anyone wanting more
general meta-data would probably want it to be variable sized, whereas
this is fixed, and it would also be nice if oversized tuples didn't have
to start at the beginning of a new chunk.

I fixed two nearby fencepost bugs: I made the limit that triggers that
error smaller by sizeof(uint32), and I fixed a problem when small tuples
appear after an oversized tuple in a final overflow chunk (found by
hacking the test module to create mixtures of different-sized tuples).

> +     if (accessor->sts->meta_data_size > 0)
> +             memcpy(accessor->write_pointer, meta_data,
> +                        accessor->sts->meta_data_size);
> +     written = accessor->write_end - accessor->write_pointer -
> +             accessor->sts->meta_data_size;
> +     memcpy(accessor->write_pointer + accessor->sts->meta_data_size,
> +                tuple, written);
>
> Also, shouldn't the same Assert() be here as well if you have it above?

No: when it comes to the tuple itself, we just write as much of it as
will fit and write the rest in the loop below.  Comments improved to
make that clear.
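To spell out that asymmetry, here is a standalone toy model of the
write-side rule (ToyWriter, toy_put_tuple_start and CHUNK_DATA_SIZE are
invented for illustration; this is not the patch): meta-data must land
in the current chunk in one piece, which is what the hard error above
protects, while the tuple is written as far as it fits and the caller
spills the remainder into overflow chunks.

    #include <string.h>

    #define CHUNK_DATA_SIZE 8192

    typedef struct ToyWriter
    {
        char    chunk[CHUNK_DATA_SIZE];
        size_t  used;               /* bytes used in the current chunk */
        size_t  meta_data_size;     /* fixed per-tuple meta-data size */
    } ToyWriter;

    /*
     * Write the fixed-size meta-data and the start of one tuple.
     * Returns the number of tuple bytes that fit, or (size_t) -1 if the
     * meta-data itself cannot fit; the caller loops, pushing the
     * remaining tuple bytes into fresh overflow chunks.
     */
    static size_t
    toy_put_tuple_start(ToyWriter *w, const char *meta,
                        const char *tuple, size_t tuple_size)
    {
        size_t  fit;

        /* Meta-data is never split across chunks, hence a hard error. */
        if (w->used + w->meta_data_size >= CHUNK_DATA_SIZE)
            return (size_t) -1;     /* "meta-data too long" in the patch */

        memcpy(w->chunk + w->used, meta, w->meta_data_size);
        w->used += w->meta_data_size;

        /* The tuple, by contrast, is written as far as it fits... */
        fit = CHUNK_DATA_SIZE - w->used;
        if (fit > tuple_size)
            fit = tuple_size;
        memcpy(w->chunk + w->used, tuple, fit);
        w->used += fit;

        /* ...and the rest goes into overflow chunks in a later loop. */
        return fit;
    }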
> +             ereport(ERROR,
> +                             (errcode_for_file_access(),
> +                              errmsg("could not read from shared tuplestore temporary file"),
> +                              errdetail("Short read while reading meta-data")));
>
> The errdetail doesn't follow the style guide (not a sentence ending with
> .), and seems internal-ish.  I'm ok with keeping it, but perhaps we
> should change all these to be errdetail_internal()?  Seems pointless to
> translate all of them.

Done.

> +     LWLockAcquire(&p->lock, LW_EXCLUSIVE);
> +     eof = p->read_page >= p->npages;
> +     if (!eof)
> +     {
> +             read_page = p->read_page;
> +             p->read_page += STS_CHUNK_PAGES;
> +     }
> +     LWLockRelease(&p->lock);
>
> So if we went to the world I'm suggesting, with overflow containing the
> length till the end of the tuple, this'd probably have to look a bit
> different.

Yeah.  I almost wanted to change it back to a spinlock, but now it's
grown bigger again...

> +     if (!eof)
> +     {
> +             SharedTuplestoreChunk chunk_header;
> +
> +             /* Make sure we have the file open. */
> +             if (accessor->read_file == NULL)
> +             {
> +                     char            name[MAXPGPATH];
> +
> +                     sts_filename(name, accessor, accessor->read_participant);
> +                     accessor->read_file =
> +                             BufFileOpenShared(accessor->fileset, name);
> +                     if (accessor->read_file == NULL)
> +                             elog(ERROR, "could not open temporary file %s", name);
>
> Isn't this more an Assert or just not anything?  There's no way
> BufFileOpenShared should ever return NULL, no?

Right.  As of commit 923e8dee this can no longer return NULL (instead it
raises an error), so I removed the redundant check.

> +                     /* If this is an overflow chunk, we skip it. */
> +                     if (chunk_header.overflow)
> +                             continue;
> +
> +                     accessor->read_ntuples = 0;
> +                     accessor->read_ntuples_available = chunk_header.ntuples;
> +                     accessor->read_bytes = offsetof(SharedTuplestoreChunk, data);
>
> Perhaps somewhere around here comment that we'll just loop around and
> call sts_read_tuple() in the next loop iteration?

Done.
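As an aside for the archives, here is a toy model of what the int
'overflow' buys the reader (single-process and standalone; ToyChunkHeader,
ToyParticipant and toy_next_data_chunk are invented names, and in the
real code the position is claimed under the lock shown above,
STS_CHUNK_PAGES at a time):

    typedef struct ToyChunkHeader
    {
        int     ntuples;    /* tuples beginning in this chunk */
        int     overflow;   /* 0, or chunks left in an oversized
                             * tuple, counting this one */
    } ToyChunkHeader;

    typedef struct ToyParticipant
    {
        int     read_chunk; /* next chunk to hand out */
        int     nchunks;    /* total chunks this participant wrote */
    } ToyParticipant;

    /*
     * Claim the next chunk that actually starts tuples, jumping over a
     * run of overflow chunks in one step rather than inspecting each in
     * turn.  Returns -1 at end of file; otherwise the caller loops
     * around and reads tuples from the returned chunk.
     */
    static int
    toy_next_data_chunk(ToyParticipant *p, const ToyChunkHeader *headers)
    {
        while (p->read_chunk < p->nchunks)
        {
            int     chunk = p->read_chunk++;

            if (headers[chunk].overflow > 0)
            {
                /* Continuation of an oversized tuple: skip past it. */
                p->read_chunk = chunk + headers[chunk].overflow;
                continue;
            }
            return chunk;
        }
        return -1;
    }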
--
Thomas Munro
http://www.enterprisedb.com

parallel-hash-v29.patchset.tgz