On Wed, Oct 30, 2024 at 4:54 AM Joel Jacobson <j...@compiler.org> wrote: > > On Wed, Oct 30, 2024, at 09:14, Joel Jacobson wrote: > > $ psql -f bench_result.sql > > Ops, I realized I benchmarked a debug build, > reran the benchmark with `meson setup build --buildtype=release`, > and also added benchmarking of HEAD: > > [plot image showing time distribution by format and version] > > > For those unable to see the image, here is the same data as a text table: > > select format, version, min(time_ms), round(avg(time_ms),2) avg, > max(time_ms), round(stddev(time_ms)) as stddev from bench group by 1,2 order > by 1,2; > > format | version | min | avg | max | stddev > --------+---------+----------+---------+----------+-------- > csv | HEAD | 2862.537 | 2895.91 | 2968.420 | 27 > csv | v16 | 2834.663 | 2869.02 | 2926.489 | 28 > csv | v17 | 2958.570 | 3002.71 | 3056.949 | 31 > raw | v16 | 1697.899 | 1723.48 | 1798.020 | 22 > raw | v17 | 1720.251 | 1753.56 | 1811.399 | 28 > text | HEAD | 2442.845 | 2476.60 | 2543.767 | 33 > text | v16 | 2358.062 | 2389.06 | 2472.382 | 32 > text | v17 | 2517.778 | 2552.65 | 2621.161 | 30 > (8 rows) > > I think the improvement for the 'text' format between HEAD and v16 could be > just noise, > possibly due to different binary layout when compiling. > > v17 (single function) seems slower than v16 (separate functions), also in > release build. >
Here are review comments on v17 patches: + data values. Unlike in other formats, the delimiter in + <literal>raw</literal> format can be any string, including multi-byte + characters. If no <literal>DELIMITER</literal> is specified, the entire + input or output is treated as a single data value. As I mentioned in a separate email, if we use the OS default EOL as the default EOL in raw format, it would not be necessary to allow it to be multi characters. I think it's worth considering it. --- * copyfromparse.c * Parse CSV/text/binary format for COPY FROM. We need to update here as well. -- - * 4. CopyReadAttributesText/CSV() function takes the input line from + * 4. CopyReadAttributesText/CSV/Raw() function takes the input line from I think we don't have CopyReadAttributesRaw() function now. --- I think it would be better to explain how to parse data in raw mode, especially which steps in the pipeline we skip, in the comment at the top of copyfromparse.c. --- - if (cstate->opts.format == COPY_FORMAT_CSV) + if (cstate->opts.format == COPY_FORMAT_RAW) { - /* - * If character is '\r', we may need to look ahead below. Force - * fetch of the next character if we don't already have it. We - * need to do this before changing CSV state, in case '\r' is also - * the quote or escape character. - */ - if (c == '\r') + if (delim_len == 0) { - IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0); + /* When reading entire file, consume all remaining bytes at once */ + input_buf_ptr = copy_buf_len; + continue; } + else I think that the code become much simpler if we do 'continue' at the end of 'if (cstate->opts.format == COPY_FORMAT_RAW)' branch, instead of doing 'if (cstate->opts.format == COPY_FORMAT_RAW) {} else ...'. --- @@ -1461,6 +1536,7 @@ CopyReadLineText(CopyFromState cstate) return result; } + /* * Return decimal value for a hexadecimal digit */ @@ -1937,7 +2013,6 @@ endfield: return fieldno; } - /* * Read a binary attribute */ Unnecessary line addition and removal. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com