Alright, so it seems that partitioning, raising shared_buffers, and adding the order by helped to a certain extent, but the writes are still slow. Inserting a file with 1 million records takes almost 3 minutes (much better than the 20+ minutes it had degraded to, but still pretty slow compared to the 20 seconds it used to take).
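For reference, the insert with the `order by` added now looks roughly like this (a sketch based on the original statement quoted further down in this thread, ordering by user_id as Jeff suggested):

```
insert into users_no_dups (
    created_ts,
    user_id,
    name,
    url
) (
    select
        created_ts,
        user_id,
        name,
        url
    from
        users
    order by
        user_id  -- insert in key order so index leaf pages are touched sequentially instead of randomly
) on conflict do nothing;
```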
The interesting thing for me right now is: if I try to insert the data from a file that has already been inserted (meaning all of its rows get rejected by the unique constraint), the insertion finishes in 1 to 4 seconds. For regular files (which usually contain about 30% new unique records, i.e. about 300k new rows), it takes those 3 minutes.

**UPDATE** I started writing this email and then something occurred to me that I should try. I am leaving the information above for historical reasons. Basically, I went ahead and ran a `reindex` on all the partitions to see if it would improve performance, and it seems like that did it!

I used the following script to reindex all of the partitions (the names of my partitions all start with ubp_):

```
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN select indexname from pg_indexes where tablename like 'ubp_%' LOOP
        EXECUTE 'reindex index ' || r.indexname;
    END LOOP;
END$$;
```

After doing this, processing each file takes anywhere between 8 and 20 seconds (most of them seem to take 8 seconds, though). So, this is great!

In summary, what I ended up having to do was:

* Raise shared_buffers to 160GB
* Add an `order by` to the `select` subquery in the `insert` statement
* Partition the table
* Tune the postgres configuration as shown below:

~~~
ssl = off
shared_buffers = 160GB
work_mem = 12GB
maintenance_work_mem = 12GB
max_stack_depth = 4MB
synchronous_commit = off
wal_writer_flush_after = 128MB
max_wal_size = 32GB
min_wal_size = 80MB
effective_cache_size = 96GB
~~~

I can't tell whether raising `shared_buffers` or adding the `order by` was responsible for the performance gains; it doesn't hurt to do both anyway. I do know for a fact that the `reindex` of each partition made a huge difference in the end, as explained above (bringing insert time down from 3 minutes to 8 seconds).

I have about 1800 files in my backlog to be processed now (18 billion records). I have started processing them and will report back in case performance degrades once again.

Thanks everybody for the help so far! I really appreciate it.

Henrique

PS: I checked the `dirty` ratios for the OS:

~~~
$ sysctl vm.dirty_ratio
vm.dirty_ratio = 20
$ sysctl vm.dirty_background_ratio
vm.dirty_background_ratio = 10
$ sysctl vm.dirty_expire_centisecs
vm.dirty_expire_centisecs = 3000
~~~

These are the default values; if my understanding of them is right, they should be fine.

On Mon, Jul 13, 2020 at 9:02 PM Henrique Montenegro <typ...@gmail.com> wrote:

> On Mon, Jul 13, 2020 at 8:05 PM Jeff Janes <jeff.ja...@gmail.com> wrote:
>
>> On Mon, Jul 13, 2020 at 10:23 AM Henrique Montenegro <typ...@gmail.com>
>> wrote:
>>
>>> insert into users_no_dups (
>>>     created_ts,
>>>     user_id,
>>>     name,
>>>     url
>>> ) (
>>>     select
>>>         created_ts,
>>>         user_id,
>>>         name,
>>>         url
>>>     from
>>>         users
>>> ) on conflict do nothing
>>
>> Once the size of the only index exceeds shared_buffers by a bit (the
>> amount of "a bit" depends on your RAM, kernel version, settings for
>> dirty_background_ratio, dirty_expire_centisecs, and probably other
>> things, and is not easy to predict), the performance falls off a cliff
>> when inserting values in a random order. Every insert dirties a random
>> index leaf page, which quickly gets evicted from shared_buffers to make
>> room for other random leaf pages to be read in, and then turns into
>> flush calls when the kernel freaks out about the amount and age of dirty
>> pages held in memory.
>
> That is interesting to know. I will do some research on those things.
>
>> What happens if you add an "ORDER BY user_id" to your above select?
>
> I don't know. I will give it a try right now.
>
>>> shared_buffers = 8GB
>>> RAM: 256GB
>>
>> Or, crank up shared_buffers by a lot. Like, beyond the size of the
>> growing index, or up to 240GB if the index ever becomes larger than that.
>> And make the time between checkpoints longer. If the dirty buffers are
>> retained in shared_buffers longer, the chances of them getting dirtied
>> repeatedly between writes are much higher than if you just toss them to
>> the kernel and hope for the best.
>
> I cranked it up to 160GB to see how it goes.
>
>> Cheers,
>>
>> Jeff
>
> I created the partitions as well, as mentioned before. I was able to
> partition the table based on the user_id (found some logic to it). I was
> transferring the data from the original table (about 280 million records;
> 320GB) to the new partitioned table and things were going well with write
> speeds between 30MB/s and 50MB/s. After reading 270GB of the 320GB (in 4
> and a half hours) and writing it to the new partitioned table, write speed
> went down to 7KB/s. It is so frustrating.
>
> I will keep the partitions and try your suggestions to see how it goes.
>
> I apologize for the long time between replies; it is just that testing
> this stuff takes 4+ hours each run.
>
> If there are any other suggestions of things for me to look at in the
> meantime as well, please keep them coming.
>
> Thanks!
>
> Henrique