Lars Ellenberg wrote:
On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
OK....

Since there was no ssh-as-root between the cluster nodes, I didn't
send all the logs along from every node in the cluster - and it
didn't occur to me to look at all of them.

However, the problem has gotten curiouser and curiouser - because ALL
the nodes in the cluster reported the same problem at the same
time...

That makes it a lot less likely to be a race condition with the disk
writing infrastructure...

I've attached the relevant lines from the various machines -
slightly processed (date stamp format changed and a few other minor
things).

Let me know if you want me to send all the system logs along...

There should be core files.
You should be able to get some interesting information out of there,
especially "the_cib" and "digest" at the point of abort().


Also, for my reference - what method are you using to compute the
digest of the file?  That is, what command should I execute to get
the same results?

It's an md5sum over the xml tree -- not over the formatted ascii buffer,
though, so "md5sum cib.xml" won't do.
I think it is the same as
 echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
But there is "cibadmin --md5-sum -x cib.xml",
to use the exact same code path.
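
For anyone who wants to check the one-liner above without a shell, here is a minimal Python sketch of what the perl/echo pipeline appears to do: strip leading and trailing whitespace from every line, join what is left, prepend one space, append one newline, and take the MD5 of the result. The function name normalized_digest is made up for illustration. It only mirrors the pipeline, not Pacemaker's internal digest code, so it is not guaranteed to agree with what "cibadmin --md5-sum -x cib.xml" reports.

```python
import hashlib


def normalized_digest(data: bytes) -> str:
    """MD5 of the input after the same normalization as the perl/echo pipeline.

    Hypothetical helper for illustration; not part of Pacemaker.
    """
    # perl -pe 's/^\s*(.*?)\s*\z/$1/g' removes leading and trailing
    # whitespace (including the newline) from each line, so the stripped
    # lines end up concatenated with nothing between them.
    stripped = b"".join(line.strip() for line in data.split(b"\n"))
    # echo " $(...)" adds one leading space and one trailing newline.
    return hashlib.md5(b" " + stripped + b"\n").hexdigest()
```

To try it on a file, read the file in binary mode and pass the bytes in, e.g. normalized_digest(open("cib.xml", "rb").read()). If the value differs from the cibadmin output, trust cibadmin, since that uses the exact same code path as the cib itself.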

This is a change from how it used to be (the last time I looked - at least according to my not-always-reliable memory). Thanks for the update.


2010/03/31_19:02:52     vhost0384       [13294]: ERROR: crm_abort:
write_cib_contents: Triggered fatal assert at io.c:624 :
retrieveCib(tmp1, tmp2, FALSE) != NULL

So it did not verify right after it was written.
Can you reproduce?

I have no idea. I didn't do anything much. Hopefully the test suite does a lot more strenuous things...

The core files may actually contain some hints,
so have a look there.

None of them verified. All the nodes in the cluster failed the test at the same time - and now I have no official CIBs on disk - on any cluster node... I sent Andrew all the CIBs, and all the core files, and basically everything under /var/lib/heartbeat/ from one machine. They're from the latest official release - so the binaries that match them are readily available.

        Thanks Lars!


--
    Alan Robertson <al...@unix.sh>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
