On Fri, Oct 18, 2024, at 19:24, Joel Jacobson wrote:
> Attachments:
> * v11-0001-Refactor-ProcessCopyOptions-introduce-CopyFormat-enu.patch
> * v11-0002-Add-raw-format-to-COPY-command.patch

Here is a demo of a importing a decently sized real text file,
that can't currently be imported without the CSV hack:

$ head 
/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
.package-cache-mutate                                       devel/cargo
bin                                                         admin/base-files
bin/archdetect                                              admin/ubiquity
bin/autopartition                                           admin/ubiquity
bin/autopartition-crypto                                    admin/ubiquity
bin/autopartition-loop                                      admin/ubiquity
bin/autopartition-lvm                                       admin/ubiquity
bin/block-attr                                              admin/ubiquity
bin/blockdev-keygen                                         admin/ubiquity
bin/blockdev-wipe                                           admin/ubiquity

This file uses a combination of tabs and spaces, in between the two columns,
so none of the existing formats are suitable to deal with this file.

$ ls -lah 
/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
-rw-r--r-- 1 root root 791M Apr 24 02:07 
/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64

To import using the CSV hack, we first have find two bytes that don't exist 
anyway,
which can be done using e.g. ripgrep. The below command verifies \x01 and \x02
don't exist anywhere:

$ rg -uuu --multiline '(?-u)[\x01|\x02]' 
/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64
$

Knowing these bytes don't exist anywhere,
we can then safely use these as delimiter and quote characters,
as a hack to disable these features:

CREATE TABLE package_contents (raw_line text);

COPY package_contents FROM 
'/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64' 
(FORMAT CSV, DELIMITER E'\x01', QUOTE E'\x02');
COPY 8443588
Time: 3882.100 ms (00:03.882)
Time: 3552.991 ms (00:03.553)
Time: 3748.038 ms (00:03.748)
Time: 3775.947 ms (00:03.776)
Time: 3729.020 ms (00:03.729)

I tested writing a Rust program that would read the file line-by-line and 
INSERT each line instead.
This is of course a lot slower, since it has to execute each insert separately:

$ cargo run --release
   Compiling insert_package_contents v0.1.0 (/home/joel/insert_package_contents)
    Finished `release` profile [optimized] target(s) in 0.70s
     Running `target/release/insert_package_contents`
Connecting to the PostgreSQL database...
Successfully connected to the database.
Starting to insert lines from the file...
Successfully inserted 8443588 lines into package_contents in 134.65s.

New approach using the RAW format:

COPY package_contents FROM 
'/var/lib/apt/lists/se.archive.ubuntu.com_ubuntu_dists_noble_Contents-amd64' 
(FORMAT RAW, DELIMITER E'\n');
COPY 8443588
Time: 2918.489 ms (00:02.918)
Time: 3020.372 ms (00:03.020)
Time: 3336.589 ms (00:03.337)
Time: 3067.268 ms (00:03.067)
Time: 3343.694 ms (00:03.344)

Apart from the convenience improvement,
it seems to be somewhat faster already.

/Joel


Reply via email to