On Wed, Jan 28, 2015 at 12:47 PM, Tom Lane wrote:
> Merlin Moncure writes:
>> ...hm, I spoke too soon. So I deleted everything, and booted up a new
>> instance 9.4 vanilla with asserts on and took no other action.
>> Applying the script with no data activity fails an assertion every
>> single tim
Merlin Moncure writes:
> ...hm, I spoke too soon. So I deleted everything, and booted up a new
> instance 9.4 vanilla with asserts on and took no other action.
> Applying the script with no data activity fails an assertion every
> single time:
> TRAP: FailedAssertion("!(flags & 0x0010)", File: "d
On Wed, Jan 28, 2015 at 8:05 AM, Merlin Moncure wrote:
> On Thu, Jan 22, 2015 at 3:50 PM, Merlin Moncure wrote:
>> I still haven't categorically ruled out pl/sh yet; that's something to
>> keep in mind.
>
> Well, after bisection proved not to be fruitful, I replaced the pl/sh
> calls with dummy c
On Thu, Jan 22, 2015 at 3:50 PM, Merlin Moncure wrote:
> I still haven't categorically ruled out pl/sh yet; that's something to
> keep in mind.
Well, after bisection proved not to be fruitful, I replaced the pl/sh
calls with dummy calls that approximated the same behavior and the
problem went awa
On Thu, Jan 22, 2015 at 03:50:03PM -0600, Merlin Moncure wrote:
> Quick update: not done yet, but I'm making consistent progress, with
> several false starts. (for example, I had a .conf problem with the
> new dynamic shared memory setting and git merrily bisected down to the
> introduction of th
On Thu, Jan 22, 2015 at 1:50 PM, Merlin Moncure wrote:
> Quick update: not done yet, but I'm making consistent progress, with
> several false starts. (for example, I had a .conf problem with the
> new dynamic shared memory setting and git merrily bisected down to the
> introduction of the featur
On Thu, Jan 22, 2015 at 1:50 PM, Merlin Moncure wrote:
>
> So far, the 'nasty' damage seems to generally if not always follow a
> checksum failure and the checksum failures are always numerically
> adjacent. For example:
>
> [cds2 12707 2015-01-22 12:51:11.032 CST 2754]WARNING: page
> verificat
On Fri, Jan 16, 2015 at 5:20 PM, Peter Geoghegan wrote:
> On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure wrote:
>> ISTM the next step is to bisect the problem down over the weekend in
>> order to narrow the search. If that doesn't turn up anything
>> productive I'll look into taking other s
On Fri, Jan 16, 2015 at 6:21 AM, Heikki Linnakangas
wrote:
> It looks very much like a page has for some reason been moved to a
> different block number. And that's exactly what Peter found out in his
> investigation too; an index page was mysteriously copied to a different
> block with ident
On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure wrote:
> ISTM the next step is to bisect the problem down over the weekend in
> order to narrow the search. If that doesn't turn up anything
> productive I'll look into taking other steps.
That might be the quickest way to do it, provided you c
On Fri, Jan 16, 2015 at 8:22 AM, Andres Freund wrote:
> Hi,
>
> On 2015-01-16 08:05:07 -0600, Merlin Moncure wrote:
>> On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan wrote:
>> > On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure wrote:
>> >> Running this test on another set of hardware to verify
On Fri, Jan 16, 2015 at 8:22 AM, Andres Freund wrote:
> Is there any chance you can package this somehow so that others can run
> it locally? It looks hard to find the actual bug here without adding
> instrumentation to postgres.
That's possible but involves a lot of complexity in the setup be
Hi,
On 2015-01-16 08:05:07 -0600, Merlin Moncure wrote:
> On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan wrote:
> > On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure wrote:
> >> Running this test on another set of hardware to verify -- if this
> >> turns out to be a false alarm which it may very
On 01/16/2015 04:05 PM, Merlin Moncure wrote:
> On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan wrote:
>> On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure wrote:
>>> Running this test on another set of hardware to verify -- if this
>>> turns out to be a false alarm which it may very well be, I can only
>>> of
On Fri, Jan 16, 2015 at 8:05 AM, Merlin Moncure wrote:
> On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan wrote:
>> On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure wrote:
>>> Running this test on another set of hardware to verify -- if this
>>> turns out to be a false alarm which it may very wel
On Thu, Jan 15, 2015 at 5:10 PM, Peter Geoghegan wrote:
> On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure wrote:
>> Running this test on another set of hardware to verify -- if this
>> turns out to be a false alarm which it may very well be, I can only
>> offer my apologies! I've never had a new
On Thu, Jan 15, 2015 at 3:00 PM, Merlin Moncure wrote:
> Running this test on another set of hardware to verify -- if this
> turns out to be a false alarm which it may very well be, I can only
> offer my apologies! I've never had a new drive fail like that, in
> that manner. I'll burn the other
On Thu, Jan 15, 2015 at 4:03 PM, Merlin Moncure wrote:
> On Thu, Jan 15, 2015 at 1:32 PM, Merlin Moncure wrote:
>> Since it's possible the database is a loss, do you see any value in
>> bootstrapping it again with checksums turned on? One point of note
>> is that this is a brand spanking new SS
On Thu, Jan 15, 2015 at 1:32 PM, Merlin Moncure wrote:
> Since it's possible the database is a loss, do you see any value in
> bootstrapping it again with checksums turned on? One point of note
> is that this is a brand spanking new SSD, maybe we need to rule out
> hardware based corruption?
hm!
On Thu, Jan 15, 2015 at 1:15 PM, Andres Freund wrote:
> Hi,
>
>> The plot thickens! I looped the test, still stock 9.4 as of this time
>> and went to lunch. When I came back, the database was in recovery
>> mode. Here is the rough sequence of events.
>>
>
> Whoa. That looks scary. Did you see (s
On 2015-01-15 20:15:42 +0100, Andres Freund wrote:
> > WARNING: did not find subXID 14955 in MyProc
> > CONTEXT: PL/pgSQL function cdsreconcileruntable(bigint) line 35
> > during exception cleanup
> > WARNING: you don't own a lock of type RowExclusiveLock
> > CONTEXT: PL/pgSQL function cdsrecon
Hi,
> The plot thickens! I looped the test, still stock 9.4 as of this time
> and went to lunch. When I came back, the database was in recovery
> mode. Here is the rough sequence of events.
>
Whoa. That looks scary. Did you see (some of) those errors before? Most
of them should have been emitte
On Thu, Jan 15, 2015 at 8:02 AM, Merlin Moncure wrote:
> On Thu, Jan 15, 2015 at 6:04 AM, Heikki Linnakangas
> wrote:
>> On 01/15/2015 03:23 AM, Peter Geoghegan wrote:
>>>
>>> So now the question is: how did that inconsistency arise? It didn't
>>> necessarily arise at the time of the (presumed) s
On Thu, Jan 15, 2015 at 6:02 AM, Merlin Moncure wrote:
> Question: Coming in this morning I did an immediate restart and logged
> into the database and queried pg_class via index. Everything was
> fine, and the leftright verify returns nothing. How did it repair
> itself without a reindex?
May
On Thu, Jan 15, 2015 at 6:04 AM, Heikki Linnakangas
wrote:
> On 01/15/2015 03:23 AM, Peter Geoghegan wrote:
>>
>> So now the question is: how did that inconsistency arise? It didn't
>> necessarily arise at the time of the (presumed) split of block 2 to
>> create 9. It could be that the opaque area
On 01/15/2015 03:23 AM, Peter Geoghegan wrote:
> So now the question is: how did that inconsistency arise? It didn't
> necessarily arise at the time of the (presumed) split of block 2 to
> create 9. It could be that the opaque area was changed by something
> else, some time later. I'll investigate more.
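
The thread reports that blocks 2 and 9 of the pg_class oid index turned out to be fully identical. A minimal sketch of how such duplicated blocks could be spotted in a raw copy of a relation file follows. It is not a tool used by anyone in this thread: the function name, the 8192-byte block size (PostgreSQL's default page size), and the synthetic test data are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 8192  # assumption: PostgreSQL's default page size


def find_identical_blocks(data, block_size=BLOCK_SIZE):
    """Return groups of block numbers whose contents are byte-for-byte identical.

    `data` is the raw content of a relation file (hypothetical input here).
    """
    groups = defaultdict(list)
    for blkno in range(len(data) // block_size):
        block = data[blkno * block_size:(blkno + 1) * block_size]
        # Skip all-zero blocks: newly extended pages are legitimately identical.
        if block.count(0) == len(block):
            continue
        groups[hashlib.sha256(block).digest()].append(blkno)
    return [blknos for blknos in groups.values() if len(blknos) > 1]


if __name__ == "__main__":
    # Synthetic 12-block "relation" in which block 9 is a copy of block 2.
    blocks = [bytes([i + 1]) * BLOCK_SIZE for i in range(12)]
    blocks[9] = blocks[2]
    print(find_identical_blocks(b"".join(blocks)))
```

Hashing whole blocks only catches exact copies. Pages that share metadata but differ in some data items would need a field-by-field comparison instead.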
On Wed, Jan 14, 2015 at 8:50 PM, Peter Geoghegan wrote:
> I am mistaken on one detail here - blocks 2 and 9 are actually fully
> identical. I still have no idea why, though.
So, I've looked at it in more detail and it appears that the page of
block 2 split at some point, thereby creating a new pa
On Wed, Jan 14, 2015 at 5:23 PM, Peter Geoghegan wrote:
> My immediate observation here is that blocks 2 and 9 have identical
> metadata (from their page opaque area), but partially non-matching
> data items (however, the number of items on each block is consistent
> and correct according to that
On Wed, Jan 14, 2015 at 4:53 PM, Merlin Moncure wrote:
> yeah. via:
> cds2=# \copy (select s as page, (bt_page_items('pg_class_oid_index',
> s)).* from generate_series(1,12) s) to '/tmp/page_items.csv' csv
> header;
My immediate observation here is that blocks 2 and 9 have identical
metadata (fr
On Wed, Jan 14, 2015 at 6:50 PM, Peter Geoghegan wrote:
> This is great, but it's not exactly clear which bt_page_items() page
> is which - some are skipped, but I can't be sure which. Would you mind
> rewriting that query to indicate which block is under consideration by
> bt_page_items()?
yeah.
This is great, but it's not exactly clear which bt_page_items() page
is which - some are skipped, but I can't be sure which. Would you mind
rewriting that query to indicate which block is under consideration by
bt_page_items()?
Thanks
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (
On Wed, Jan 14, 2015 at 6:26 PM, Merlin Moncure wrote:
> On Wed, Jan 14, 2015 at 5:39 PM, Peter Geoghegan wrote:
>> On Wed, Jan 14, 2015 at 3:38 PM, Merlin Moncure wrote:
>>> (gdb) print BufferGetBlockNumber(buf)
>>> $15 = 9
>>>
>>> ..and it stays 9, continuing several times having set breakpoi
On Wed, Jan 14, 2015 at 4:26 PM, Merlin Moncure wrote:
> The index is the oid index on pg_class. Some more info:
>
> *) temp table churn is fairly high. Several dozen get spawned and
> destroyed at the start of a replication run, all at once, due to some
> dodgy coding via dblink. During the re
On Wed, Jan 14, 2015 at 5:39 PM, Peter Geoghegan wrote:
> On Wed, Jan 14, 2015 at 3:38 PM, Merlin Moncure wrote:
>> (gdb) print BufferGetBlockNumber(buf)
>> $15 = 9
>>
>> ..and it stays 9, continuing several times having set breakpoint.
>
>
> And the index involved? I'm pretty sure that this is an in
On Wed, Jan 14, 2015 at 3:38 PM, Merlin Moncure wrote:
> (gdb) print BufferGetBlockNumber(buf)
> $15 = 9
>
> ..and it stays 9, continuing several times having set breakpoint.
And the index involved? I'm pretty sure that this is an internal page, no?
--
Peter Geoghegan
On Wed, Jan 14, 2015 at 2:32 PM, Peter Geoghegan wrote:
> On Wed, Jan 14, 2015 at 12:24 PM, Peter Geoghegan wrote:
>> Could you write some code to print out the block number (i.e.
>> "BlockNumber blkno") if there are more than, say, 5 retries within
>> _bt_moveright()?
>
> Obviously I mean that t
On Wed, Jan 14, 2015 at 12:24 PM, Peter Geoghegan wrote:
> Could you write some code to print out the block number (i.e.
> "BlockNumber blkno") if there are more than, say, 5 retries within
> _bt_moveright()?
Obviously I mean that the block number should be printed, no matter
whether or not the P
On Wed, Jan 14, 2015 at 11:49 AM, Merlin Moncure wrote:
> so it looks like nobody ever exits from _bt_moveright. any last
> requests before I start bisecting down?
Could you write some code to print out the block number (i.e.
"BlockNumber blkno") if there are more than, say, 5 retries within
_b
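
The instrumentation requested above (report the block number once a move-right loop has retried more than about five times) can be illustrated with a toy model. This is a sketch under stated assumptions, not PostgreSQL code: the real `_bt_moveright()` is C inside the backend, and the page layout, function name, and parameters below are hypothetical.

```python
def move_right(pages, start_blkno, scan_key, log_after=5, max_steps=50):
    """Follow right-links until a page whose high key covers scan_key is found.

    `pages` maps block number -> (high_key, right_link); a high_key of None
    marks the rightmost page. Returns (final_blkno, logged), where `logged`
    lists the block numbers visited after more than `log_after` retries.
    final_blkno is None if the loop gave up after `max_steps` iterations.
    """
    blkno = start_blkno
    logged = []
    for retries in range(max_steps):
        high_key, right_link = pages[blkno]
        if retries > log_after:
            # This is the point where the real code would print blkno.
            logged.append(blkno)
        if high_key is None or scan_key <= high_key:
            return blkno, logged
        blkno = right_link
    return None, logged


if __name__ == "__main__":
    # Healthy chain: the scan key 15 is covered by block 2.
    healthy = {1: (10, 2), 2: (20, 3), 3: (None, 0)}
    print(move_right(healthy, 1, 15))

    # Hypothetical broken chain: block 9 points back to itself, so the
    # loop never exits and the same block number is logged repeatedly.
    stuck = {2: (10, 9), 9: (10, 9)}
    print(move_right(stuck, 2, 15, max_steps=10))
```

A block number that repeats in the log, as in the second call, is the pattern the gdb output elsewhere in the thread shows, with `BufferGetBlockNumber(buf)` staying at 9.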
On Wed, Jan 14, 2015 at 9:49 AM, Andres Freund wrote:
> On 2015-01-14 09:47:19 -0600, Merlin Moncure wrote:
>> On Wed, Jan 14, 2015 at 9:30 AM, Andres Freund
>> wrote:
>> > If you gdb in, and type 'fin' a couple times, to wait till the function
>> > finishes, is there actually any progress? I'm
On Wed, Jan 14, 2015 at 7:22 AM, Merlin Moncure wrote:
> I'll try to pull commits that Peter suggested and see if that helps
> (I'm getting ready to bring the database down). I can send the code
> off-list if you guys think it'd help.
Thanks for the code!
I think it would be interesting to see
On 2015-01-14 09:47:19 -0600, Merlin Moncure wrote:
> On Wed, Jan 14, 2015 at 9:30 AM, Andres Freund wrote:
> > If you gdb in, and type 'fin' a couple times, to wait till the function
> > finishes, is there actually any progress? I'm wondering whether it's
> > just many catalog accesses + contenti
On Wed, Jan 14, 2015 at 9:30 AM, Andres Freund wrote:
> If you gdb in, and type 'fin' a couple times, to wait till the function
> finishes, is there actually any progress? I'm wondering whether it's
> just many catalog accesses + contention, or some other
> problem. Alternatively set a breakpoint
On 2015-01-14 09:22:45 -0600, Merlin Moncure wrote:
> On Wed, Jan 14, 2015 at 9:11 AM, Andres Freund wrote:
> > On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
> >> Merlin Moncure writes:
> >> > On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane wrote:
> >> >> What are the autovac processes doing (accordin
On Wed, Jan 14, 2015 at 9:11 AM, Andres Freund wrote:
> On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
>> Merlin Moncure writes:
>> > On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane wrote:
>> >> What are the autovac processes doing (according to pg_stat_activity)?
>>
>> > pid,running,waiting,query
>> >
On 2015-01-14 10:13:32 -0500, Tom Lane wrote:
> Merlin Moncure writes:
> > Yes, it is pg_class is coming from LockBufferForCleanup (). As you
> > can see above, it has a shorter runtime. So it was killed off once
> > about a half hour ago which did not free up the logjam. However, AV
> > spaw
Andres Freund writes:
> On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
>> Hah, I suspected as much. Is that the one that's stuck in
>> LockBufferForCleanup, or the other one that's got a similar backtrace
>> to all the user processes?
> Do you have a theory? Right now it primarily looks like cont
Merlin Moncure writes:
> Yes, it is pg_class is coming from LockBufferForCleanup (). As you
> can see above, it has a shorter runtime. So it was killed off once
> about a half hour ago which did not free up the logjam. However, AV
> spawned it again and now it does not respond to cancel.
Int
On 2015-01-14 10:05:01 -0500, Tom Lane wrote:
> Merlin Moncure writes:
> > On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane wrote:
> >> What are the autovac processes doing (according to pg_stat_activity)?
>
> > pid,running,waiting,query
> > 7105,00:28:40.789221,f,autovacuum: VACUUM ANALYZE pg_catalog.
On Wed, Jan 14, 2015 at 9:05 AM, Tom Lane wrote:
> Merlin Moncure writes:
>> On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane wrote:
>>> What are the autovac processes doing (according to pg_stat_activity)?
>
>> pid,running,waiting,query
>> 7105,00:28:40.789221,f,autovacuum: VACUUM ANALYZE pg_catalog.p
Merlin Moncure writes:
> On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane wrote:
>> What are the autovac processes doing (according to pg_stat_activity)?
> pid,running,waiting,query
> 7105,00:28:40.789221,f,autovacuum: VACUUM ANALYZE pg_catalog.pg_class
Hah, I suspected as much. Is that the one that'
On Wed, Jan 14, 2015 at 8:41 AM, Tom Lane wrote:
> Merlin Moncure writes:
>> There were seven processes with that exact backtrace (except
>> that randomly they are sleeping in the spinloop). Something else
>> interesting: autovacuum has been running all night as well. Unlike
>> the oth
Merlin Moncure writes:
> There were seven processes with that exact backtrace (except
> that randomly they are sleeping in the spinloop). Something else
> interesting: autovacuum has been running all night as well. Unlike
> the other process however, cpu utilization does not register on
On Wed, Jan 14, 2015 at 8:03 AM, Merlin Moncure wrote:
> Here's a backtrace:
>
> #0 0x00750a97 in spin_delay ()
> #1 0x00750b19 in s_lock ()
> #2 0x00750844 in LWLockRelease ()
> #3 0x0073 in LockBuffer ()
> #4 0x004b2db4 in _bt_relandgetbuf ()
> #5
On Tue, Jan 13, 2015 at 7:24 PM, Peter Geoghegan wrote:
> On Tue, Jan 13, 2015 at 3:54 PM, Merlin Moncure wrote:
>> Some more information on what's happening:
>> This is a ghetto logical replication engine that migrates data from
>> SQL Server to postgres, consolidating a sharded database into a sing
On Tue, Jan 13, 2015 at 3:54 PM, Merlin Moncure wrote:
> Some more information on what's happening:
> This is a ghetto logical replication engine that migrates data from
> SQL Server to postgres, consolidating a sharded database into a single
> set of tables (of which there are only two). There is onl
On Tue, Jan 13, 2015 at 3:54 PM, Andres Freund wrote:
>> I don't remember seeing _bt_moveright() or _bt_compare() figuring so
>> prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
>> reference to _bt_binsrch() in either profile.
>
> Well, we do a _bt_moveright pretty early on,
On Tue, Jan 13, 2015 at 4:05 PM, Tom Lane wrote:
> I'm not convinced that Peter is barking up the right tree. I'm noticing
> that the profiles seem rather skewed towards parser/planner work; so I
> suspect the contention is probably on access to system catalogs. No
> idea exactly why though.
I
On 2015-01-13 19:05:10 -0500, Tom Lane wrote:
> Merlin Moncure writes:
> > On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan wrote:
> >> In case it isn't clear, I think that the proximate cause here may well
> >> be either one (or both) of commits
> >> efada2b8e920adfdf7418862e939925d2acd1b89 and/
Merlin Moncure writes:
> On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan wrote:
>> In case it isn't clear, I think that the proximate cause here may well
>> be either one (or both) of commits
>> efada2b8e920adfdf7418862e939925d2acd1b89 and/or
>> 40dae7ec537c5619fc93ad602c62f37be786d161. Probably
On Tue, Jan 13, 2015 at 5:54 PM, Peter Geoghegan wrote:
> On Tue, Jan 13, 2015 at 3:50 PM, Merlin Moncure wrote:
>>> I don't remember seeing _bt_moveright() or _bt_compare() figuring so
>>> prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
>>> reference to _bt_binsrch() in ei
On Tue, Jan 13, 2015 at 5:42 PM, Andres Freund wrote:
> On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote:
>> On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund
>> wrote:
>> > On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
>> >> I'm inclined to think that this is a livelock, and so the proble
On Tue, Jan 13, 2015 at 3:50 PM, Merlin Moncure wrote:
>> I don't remember seeing _bt_moveright() or _bt_compare() figuring so
>> prominently, where _bt_binsrch() is nowhere to be seen. I can't see a
>> reference to _bt_binsrch() in either profile.
>
> hm, this is hand compiled now, I bet the sym
On 2015-01-13 15:49:33 -0800, Peter Geoghegan wrote:
> On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund wrote:
> > My guess is rather that it's contention on the freelist lock via
> > StrategyGetBuffer's. I've seen profiles like this due to exactly that
> > before - and it fits to parallel loading q
On Tue, Jan 13, 2015 at 5:49 PM, Peter Geoghegan wrote:
> On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund wrote:
>> My guess is rather that it's contention on the freelist lock via
>> StrategyGetBuffer's. I've seen profiles like this due to exactly that
>> before - and it fits to parallel loading
On Tue, Jan 13, 2015 at 3:21 PM, Andres Freund wrote:
> My guess is rather that it's contention on the freelist lock via
> StrategyGetBuffer's. I've seen profiles like this due to exactly that
> before - and it fits to parallel loading quite well.
I'm not saying you're wrong, but the breakdown of
On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote:
> On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund wrote:
> > On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
> >> I'm inclined to think that this is a livelock, and so the problem
> >> isn't evident from the structure of the B-Tree, but it ca
On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund wrote:
> On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
>> I'm inclined to think that this is a livelock, and so the problem
>> isn't evident from the structure of the B-Tree, but it can't hurt to
>> check.
>
> My guess is rather that it's conte
On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
> I'm inclined to think that this is a livelock, and so the problem
> isn't evident from the structure of the B-Tree, but it can't hurt to
> check.
My guess is rather that it's contention on the freelist lock via
StrategyGetBuffer's. I've seen p
On Tue, Jan 13, 2015 at 2:29 PM, Merlin Moncure wrote:
> On my workstation today (running vanilla 9.4.0) I was testing some new
> code that does aggressive parallel loading to a couple of tables.
Could you give more details, please? For example, I'd like to see
representative data, or at least th
On Tue, Jan 13, 2015 at 4:33 PM, Andres Freund wrote:
> Hi,
>
> On 2015-01-13 16:29:51 -0600, Merlin Moncure wrote:
>> On my workstation today (running vanilla 9.4.0) I was testing some new
>> code that does aggressive parallel loading to a couple of tables. It
>> ran ok several dozen times and fr
Hi,
On 2015-01-13 16:29:51 -0600, Merlin Moncure wrote:
> On my workstation today (running vanilla 9.4.0) I was testing some new
> code that does aggressive parallel loading to a couple of tables. It
> ran ok several dozen times and froze up with no external trigger.
> There were at most 8 active
On my workstation today (running vanilla 9.4.0) I was testing some new
code that does aggressive parallel loading to a couple of tables. It
ran ok several dozen times and froze up with no external trigger.
There were at most 8 active backends that were stuck (the loader is
threaded to a cap) -- eac