Hello list,

I am having performance issues inserting data into Postgres and would like to ask for help figuring out the problem, as I have run out of ideas.
I have a process that generates a CSV file with 1 million records every 5 minutes; each file is about 240 MB. I need this data inserted into a table in Postgres. One peculiarity is that the data in these files may be duplicated: a row in the first file might also be present in the second or third file, and so on. I don't care about the duplicates, so I have a unique constraint on the table to discard them.

The data in the CSV is pretty simple:

```
user_id,name,url
```

The table is defined like this:

```
create unlogged table users_no_dups (
    created_ts timestamp without time zone,
    user_id bigint not null,
    name text,
    url text,
    unique(user_id)
);
```

The table is created as `unlogged` to improve performance. I am aware that this can cause data loss.

My process for inserting data into the table is as follows:

* Create an intermediary table `users`:

```
create unlogged table users (
    created_ts timestamp without time zone default current_timestamp,
    user_id bigint,
    name text,
    url text
) with (autovacuum_enabled = false, toast.autovacuum_enabled = false);
```

* Use `COPY` to load the data from the CSV file into the intermediary table:

```
copy users(user_id, name, url)
from 'myfile.csv'
with (format csv, header true, delimiter ',', quote '"', escape '\\');
```

* Insert the data from the `users` table into the `users_no_dups` table:

```
insert into users_no_dups (
    created_ts,
    user_id,
    name,
    url
) (
    select
        created_ts,
        user_id,
        name,
        url
    from users
) on conflict do nothing;
```

* Drop the `users` table.
* Repeat the whole thing for the next file.

Running the above loop worked fine for about 12 hours. Each file took about 30 seconds to process: roughly 4 seconds to create the `users` table and load the CSV data into it, and anything between 20 and 30 seconds to insert the data from `users` into `users_no_dups`.

All of a sudden, inserting from `users` into `users_no_dups` started taking 20+ minutes. I recreated the table with a `fillfactor` of `30` and tried again, and things ran well again, at the same 30 seconds per file. After about another 12 hours, things got really slow again. Recreating the table now isn't providing any improvement. I tried recreating it with a `fillfactor` of `10`, but it took too long and too much space (the table was about 300 GB with the fillfactor set to 30; with it set to 10 it went up to almost 1 TB).

Watching iotop, the `INSERT` statement's `WRITE` speed now stays between 20 and 100 KB/s. When I first started inserting, the `WRITE` speed was consistently above 100 MB/s. If I copy the `users_no_dups` table to another table (say `users_no_dups_2`, with the same structure), the `WRITE` speed also goes to 100 MB/s or more until it gets to the last ~2 GB of data being copied; then the speed drops back to 20 to 100 KB/s and stays there (I know this from watching `iotop`).
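In case it helps, the copy test I mention above was essentially something like this (a rough sketch from memory; the exact statements may have differed slightly):

```
-- create an empty copy with the same structure, constraints and indexes
create unlogged table users_no_dups_2 (like users_no_dups including all);

-- copy everything over while watching WRITE throughput in iotop
insert into users_no_dups_2
select * from users_no_dups;
```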
I have the following custom configuration in my Postgres installation, which I set while trying to improve performance:

```
ssl = off
shared_buffers = 8GB
work_mem = 12GB
maintenance_work_mem = 12GB
max_stack_depth = 4MB
synchronous_commit = off
wal_writer_flush_after = 128MB
max_wal_size = 32GB
min_wal_size = 80MB
effective_cache_size = 96GB
```

Information about the machine:

```
Processor: 2x Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz (6 cores each, 12 threads)
RAM: 256GB
Disk1: 2TB SSD SATA-3 Samsung Evo 860
Disk2: 6TB HDD SATA-3 Seagate Exos Enterprise 7200RPM
Disk3: 8TB HDD SATA-3 Seagate Exos Enterprise 7200RPM

Disk1 and Disk2 are configured as a single logical volume.
Table users_no_dups is in a tablespace on Disk3.
The default tablespace is on the logical volume composed of Disk1 and Disk2.

OS: Ubuntu Linux 19.10
Postgres version: PostgreSQL 11.7 (Ubuntu 11.7-0ubuntu0.19.10.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008, 64-bit
```

Any ideas why I am seeing this decrease in insert performance, or any suggestions on how I can try to figure this out?

Sorry for the wall of text. Just trying to give as much info as I have.

Henrique