Hi Bharath, On Thu, Jul 9, 2020 at 7:33 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote: > Thanks Amit for buying into the idea. I agree that your patch looks > clean and simple compared to mine and I'm okay with your patch. > > I reviewed and tested your patch, below are few comments:
Thanks for checking it out. > I think we can remove(and delete the function from the code) the > CopyGetInt16() and have the code directly to save the function call > cost. It gets called for each attribute/column for each row/tuple to > just call CopyReadFromRawBuf() and set the byte order. From a > readability perspective it's okay to have this function, but cost wise > I feel no need for that function at all. In one of our earlier > work(parallel copy), we observed that having a new function or few > extra statements in this copy from path which gets hit for each row, > incurs noticeable execution cost. > > The same way, we can also avoid using CopyGetInt32() function call in > CopyReadBinaryAttribute() for the same reason stated above. I agree that removing the function call overhead in this case is worth the slight loss of readability. > In CopyReadFromRawBuf(), can the "saveTo" parameter be named "dest" > and use that with (char *) typecast directly, instead of having a > local variable? Though it may/may not be a standard practice, let's > have the parameter name all lower case to keep it consistent with > other function's parameters in the copy.c file. Agreed. > Seems like there's a bug in below part of the code. Use case is > simple, have some junk value at the end of the binary file, then with > your patch the query succeeds, but it should report the below error. > Here, on fld_count == -1 instead of reading from file, we must be > reading it from the buffer, as we would have already read all the data > from the file into the buffer. > if (cstate->copy_dest != COPY_OLD_FE && > CopyGetData(cstate, &dummy, 1, 1) > 0) > ereport(ERROR, > (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), > errmsg("received copy data after EOF marker"))); > > I also tried with some intentionally corrupted binary datasets, (apart > from above issue) patch works fine. Yeah, I see the bug. I should've checked all the call sites of CopyGetData() and made sure there is only one left, that is, CopyLoadRawBuffer(). > For the case where required nbytes may not fit into the buffer in > CopyReadFromRawBuf, I'm sure this can happen only for field data, > (field count , and field size are of fixed length and can fit in the > buffer), instead of reading them in parts of buff size into the buffer > (using CopyLoadRawBuf) and then DRAIN_COPY_RAW_BUF() to the > destination, I think we can detect this condition using requested > bytes and the buffer size and directly read from the file to the > destination buffer and then reload the raw_buffer for further > processing. I think this way, it will be good. Hmm, I'm afraid that this will make the code more complex for apparently small benefit. Is this really that much of a problem performance wise? > I have few synthesized test cases where fields can be of larger size. > I executed them on your patch, but didn't debug to see whether > actually we hit the code where required nbytes can't fit in the entire > buffer. I will try this on the next version of the patch. > > > > > HEAD patched > > foo5 8.5 6.5 > > foo10 14 10 > > foo20 25 18 > > > > Numbers might improve a bit, if we remove the extra function calls as > stated above. Here the numbers with the updated patch: HEAD patched (v2) foo5 8.5 6.1 foo10 14 9.4 foo20 25 16.7 -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
CopyFrom-binary-buffering_v2.patch
Description: Binary data