On Tue, Jan 13, 2015 at 5:42 PM, Andres Freund <and...@2ndquadrant.com> wrote:
> On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote:
>> On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> > On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
>> >> I'm inclined to think that this is a livelock, and so the problem
>> >> isn't evident from the structure of the B-Tree, but it can't hurt to
>> >> check.
>> >
>> > My guess is rather that it's contention on the freelist lock via
>> > StrategyGetBuffer's. I've seen profiles like this due to exactly that
>> > before - and it fits to parallel loading quite well.
>>
>> I think I've got it to pop again. s_lock is only showing 35%
>> (increasing very slowly if at all) but performance is mostly halted.
>> Frame pointer is compiled out. perf report attached.
>
>>  35.82%  postgres  [.] s_lock
>>  23.71%  postgres  [.] tas
>>  14.01%  postgres  [.] tas
>>   6.82%  postgres  [.] spin_delay
>>   5.93%  postgres  [.] LWLockRelease
>>   4.36%  postgres  [.] LWLockAcquireCommon
>
> Interesting. This profile looks quite different?

yep, it's very stable, and the database is approximately frozen.
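As an aside, if the freelist lock / StrategyGetBuffer theory is right, one crude
way to sanity-check it from SQL (just a sketch, not something I've captured for
the current stall) is to sample pg_stat_bgwriter a couple of times while the
workers are wedged; buffers_alloc counts buffer allocations, so diffing two
samples shows whether the buffer allocator is still being hammered while
throughput is flat:

    -- sample the cluster-wide buffer allocation counters; run it twice,
    -- a few seconds apart, and diff the numbers
    select now() as sampled_at,
           buffers_alloc,     -- buffers handed out by the allocator
           buffers_backend    -- buffers written out directly by backends
      from pg_stat_bgwriter;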
> What kind of hardware is this on?

linux on a crap workstation box, except I have a good ssd (intel 3500).

model name : Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz

top - 17:44:00 up 11 days,  6:25, 15 users,  load average: 6.94, 6.97, 5.73
Tasks: 259 total,   8 running, 250 sleeping,   0 stopped,   1 zombie
%Cpu0  : 95.7 us,  0.0 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 93.5 us,  2.2 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 95.7 us,  0.0 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 93.6 us,  2.1 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   8131028 total,  7977608 used,   153420 free,    21424 buffers
KiB Swap:  8340476 total,  1884900 used,  6455576 free,  5201188 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
32492 mmoncure  20   0  175m  12m 9040 R  64.0  0.2   9:28.42 postgres
32367 mmoncure  20   0  175m  12m 9128 R  57.3  0.2   9:35.87 postgres
31792 mmoncure  20   0  175m  92m  88m R  52.9  1.2   9:57.50 postgres
32725 mmoncure  20   0  175m  13m 9852 R  52.9  0.2   9:08.50 postgres
31899 mmoncure  20   0  179m 104m  97m R  50.7  1.3   9:49.51 postgres
32368 mmoncure  20   0  175m  12m 9200 R  50.7  0.2   9:33.77 postgres
31282 mmoncure  20   0  184m 152m 140m R  46.3  1.9  10:26.29 postgres
  666 mmoncure  20   0 20640 1684 1116 R   2.2  0.0   0:00.01 top

Some more information about what's happening: this is a ghetto logical
replication engine that migrates data from sql server to postgres,
consolidating a sharded database into a single set of tables (of which there
are only two). There is only one index on each destination table, and it's a
composite (int, int) in both cases. A single source table is replicated in a
single transaction, and it's staged to a set of two temp tables, one for
deletes and one for inserts.
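For reference, the staging step for each source table looks roughly like this
(a simplified sketch with made-up table and column names; the real schema is
different, and the changed rows are bulk-loaded into the temp tables between
the creates and the delete):

    begin;

    -- hypothetical names, the real ones differ
    create temp table stage_deletes (id1 int, id2 int) on commit drop;
    create temp table stage_inserts (like dest_table) on commit drop;

    -- ... COPY the rows pulled from the source shard into the temp tables ...

    -- apply deletions first, matching on the composite (int, int) key
    delete from dest_table d
     using stage_deletes s
     where d.id1 = s.id1
       and d.id2 = s.id2;

    -- then append the new and changed rows
    insert into dest_table
    select * from stage_inserts;

    commit;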
I log each step in the database. Here are the times:

cds2=# select cdsruntableid, finished - started, inserted from cdsruntable where cdsrunid = 62 order by started;
 cdsruntableid │    ?column?     │ inserted
───────────────┼─────────────────┼──────────
           833 │ 00:00:33.24044  │    11860
           834 │ 00:00:35.44981  │    19099
           835 │ 00:02:01.725186 │   530923
           836 │ 00:01:29.578811 │   211737
           838 │ 00:01:17.393461 │    64258
           837 │ 00:00:56.849106 │   227844
           839 │ 00:02:12.016504 │   630309
           840 │ 00:00:00.00111  │
           841 │ 00:01:09.058806 │   155309
           842 │ 00:01:54.723747 │   704422
           843 │ 00:01:14.965304 │
           844 │ 00:00:45.410953 │    59934
           845 │ 00:00:34.302632 │    14848
           846 │ 00:00:46.913832 │    89893
           848 │ 00:05:22.155189 │  2410622
           847 │ 00:01:25.199141 │   268157
           849 │ 00:01:16.168352 │   117511
           850 │ 00:00:34.809862 │     1175
           851 │ 00:01:12.565397 │    67214
           852 │ 00:01:03.734776 │    20129
           853 │ 00:00:41.038456 │    62765
           854 │ 00:01:03.478917 │    14967
           855 │ 00:00:33.88803  │     6901
           856 │ 00:00:36.381886 │     6670
           857 │ 00:00:36.626623 │     8268
           858 │ 00:01:14.391584 │   363448
           859 │ 00:01:44.619005 │   533395
           860 │ 00:01:02.042255 │   212202
           861 │ 00:00:00.001065 │
           863 │ 00:00:58.265506 │   215234
           862 │ 00:02:45.572635 │   827941
           865 │ 00:01:28.049165 │   241020
           864 │ 00:01:51.465643 │   531012
           866 │ 00:01:20.273391 │   419357
           868 │ 00:01:25.479443 │   294262
           869 │ 00:00:46.400825 │    46136
           870 │ 00:00:42.337286 │    25934
           871 │ 00:00:48.350152 │    91375
           872 │ 00:00:00.001743 │
           873 │ 00:01:04.596749 │   199744
           876 │ 00:01:15.210301 │   276276
           877 │ 00:02:36.447723 │  1693887
           878 │ 00:00:55.925104 │    36938
           879 │ 00:00:00.016874 │
           882 │ 00:00:50.479248 │   171901
           881 │                 │
           884 │                 │
           883 │                 │
           867 │                 │
           874 │                 │
           875 │                 │
           880 │                 │
(52 rows)

There is an 8-thread worker that walks that queue in id order and migrates one
source table's data at a time to the destination postgres server. It's been
stuck for quite some time now, and I'm certain that if I restarted the database
and reran the incremental processing routine, it would work through those last
seven tables with no issues (at least that's what happened last time). It's
been bogged down with no progress for half an hour now, so there's some kind of
weird bug here.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers