On Fri, Oct 18, 2024, at 19:24, Joel Jacobson wrote: > Attachments: > * v11-0001-Refactor-ProcessCopyOptions-introduce-CopyFormat-enu.patch > * v11-0002-Add-raw-format-to-COPY-command.patch
Here is a demo of a importing a decently sized real text file, that can't currently be imported without the CSV hack: $ head /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64 .package-cache-mutate devel/cargo bin admin/base-files bin/archdetect admin/ubiquity bin/autopartition admin/ubiquity bin/autopartition-crypto admin/ubiquity bin/autopartition-loop admin/ubiquity bin/autopartition-lvm admin/ubiquity bin/block-attr admin/ubiquity bin/blockdev-keygen admin/ubiquity bin/blockdev-wipe admin/ubiquity This file uses a combination of tabs and spaces, in between the two columns, so none of the existing formats are suitable to deal with this file. $ ls -lah /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64 -rw-r--r-- 1 root root 791M Apr 24 02:07 /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64 To import using the CSV hack, we first have find two bytes that don't exist anyway, which can be done using e.g. ripgrep. The below command verifies \x01 and \x02 don't exist anywhere: $ rg -uuu --multiline '(?-u)[\x01|\x02]' /var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64 $ Knowing these bytes don't exist anywhere, we can then safely use these as delimiter and quote characters, as a hack to disable these features: CREATE TABLE package_contents (raw_line text); COPY package_contents FROM '/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64' (FORMAT CSV, DELIMITER E'\x01', QUOTE E'\x02'); COPY 8443588 Time: 3882.100 ms (00:03.882) Time: 3552.991 ms (00:03.553) Time: 3748.038 ms (00:03.748) Time: 3775.947 ms (00:03.776) Time: 3729.020 ms (00:03.729) I tested writing a Rust program that would read the file line-by-line and INSERT each line instead. This is of course a lot slower, since it has to execute each insert separately: $ cargo run --release Compiling insert_package_contents v0.1.0 (/home/joel/insert_package_contents) Finished `release` profile [optimized] target(s) in 0.70s Running `target/release/insert_package_contents` Connecting to the PostgreSQL database... Successfully connected to the database. Starting to insert lines from the file... Successfully inserted 8443588 lines into package_contents in 134.65s. New approach using the RAW format: COPY package_contents FROM '/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64' (FORMAT RAW, DELIMITER E'\n'); COPY 8443588 Time: 2918.489 ms (00:02.918) Time: 3020.372 ms (00:03.020) Time: 3336.589 ms (00:03.337) Time: 3067.268 ms (00:03.067) Time: 3343.694 ms (00:03.344) Apart from the convenience improvement, it seems to be somewhat faster already. /Joel