Re: New "raw" COPY format

Masahiko Sawada Fri, 01 Nov 2024 14:29:30 -0700

On Wed, Oct 30, 2024 at 4:54 AM Joel Jacobson <j...@compiler.org> wrote:
>
> On Wed, Oct 30, 2024, at 09:14, Joel Jacobson wrote:
> > $ psql -f bench_result.sql
>
> Ops, I realized I benchmarked a debug build,
> reran the benchmark with `meson setup build --buildtype=release`,
> and also added benchmarking of HEAD:
>
> [plot image showing time distribution by format and version]
>
>
> For those unable to see the image, here is the same data as a text table:
>
> select format, version, min(time_ms), round(avg(time_ms),2) avg, 
> max(time_ms), round(stddev(time_ms)) as stddev from bench group by 1,2 order 
> by 1,2;
>
> format | version |   min    |   avg   |   max    | stddev
> --------+---------+----------+---------+----------+--------
> csv    | HEAD    | 2862.537 | 2895.91 | 2968.420 |     27
> csv    | v16     | 2834.663 | 2869.02 | 2926.489 |     28
> csv    | v17     | 2958.570 | 3002.71 | 3056.949 |     31
> raw    | v16     | 1697.899 | 1723.48 | 1798.020 |     22
> raw    | v17     | 1720.251 | 1753.56 | 1811.399 |     28
> text   | HEAD    | 2442.845 | 2476.60 | 2543.767 |     33
> text   | v16     | 2358.062 | 2389.06 | 2472.382 |     32
> text   | v17     | 2517.778 | 2552.65 | 2621.161 |     30
> (8 rows)
>
> I think the improvement for the 'text' format between HEAD and v16 could be 
> just noise,
> possibly due to different binary layout when compiling.
>
> v17 (single function) seems slower than v16 (separate functions), also in 
> release build.
>


Here are review comments on v17 patches:

+    data values. Unlike in other formats, the delimiter in
+    <literal>raw</literal> format can be any string, including multi-byte
+    characters. If no <literal>DELIMITER</literal> is specified, the entire
+    input or output is treated as a single data value.

As I mentioned in a separate email, if we use the OS default EOL as
the default EOL in raw format, it would not be necessary to allow it
to be multi characters. I think it's worth considering it.

---
 * copyfromparse.c
 *      Parse CSV/text/binary format for COPY FROM.

We need to update here as well.

--
- * 4. CopyReadAttributesText/CSV() function takes the input line from
+ * 4. CopyReadAttributesText/CSV/Raw() function takes the input line from

I think we don't have CopyReadAttributesRaw() function now.

---
I think it would be better to explain how to parse data in raw mode,
especially which steps in the pipeline we skip, in the comment at the
top of copyfromparse.c.

---
-       if (cstate->opts.format == COPY_FORMAT_CSV)
+       if (cstate->opts.format == COPY_FORMAT_RAW)
        {
-           /*
-            * If character is '\r', we may need to look ahead below.  Force
-            * fetch of the next character if we don't already have it.  We
-            * need to do this before changing CSV state, in case '\r' is also
-            * the quote or escape character.
-            */
-           if (c == '\r')
+           if (delim_len == 0)
            {
-               IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
+               /* When reading entire file, consume all remaining
bytes at once */
+               input_buf_ptr = copy_buf_len;
+               continue;
            }
+           else

I think that the code become much simpler if we do 'continue' at the
end of  'if (cstate->opts.format == COPY_FORMAT_RAW)' branch, instead
of doing 'if (cstate->opts.format == COPY_FORMAT_RAW) {} else ...'.

---
@@ -1461,6 +1536,7 @@ CopyReadLineText(CopyFromState cstate)
    return result;
 }

+
 /*
  * Return decimal value for a hexadecimal digit
  */
@@ -1937,7 +2013,6 @@ endfield:
    return fieldno;
 }

-
 /*
  * Read a binary attribute
  */

Unnecessary line addition and removal.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: New "raw" COPY format

Reply via email to