On 06/11/2012 04:31 PM, Florian Haas wrote:
On 06/11/12 22:14, Matthias Hensler wrote:
On Mon, Jun 11, 2012 at 06:35:18PM +0200, Matthias Hensler wrote:
[...]
I checked the changelog for 8.3.12, but nothing obvious struck me.
Diffing the source trees 8.3.11->8.3.12 did not turn up anything
obvious either.
Let me follow up on this myself. As suggested on IRC I tried to build
drbd from source, just to take the elrepo packages out of the equation.
So I started with DRBD 8.3.13, and as expected I got low performance.
Then I tried 8.3.11, and I also got low performance (although 8.3.11
from elrepo worked fine).
That left me puzzled for a while, until I examined the elrepo packages
more closely. As it turned out, all working drbd versions were built on
2.6.32-71, while all broken versions were built on 2.6.32-220.
So, I installed the old el6 2.6.32-71 kernel (took me a while to find
it, since it was removed from nearly all archives) and its devel
package, booted into that kernel and built two new versions from source:
8.3.11 and 8.3.13. Then I booted back to 2.6.32-220.
First try with my self-compiled 8.3.11 modules: everything is fine.
Second try with my self-compiled 8.3.13 modules: still everything is
fine.
Indeed, the problem lies within the kernel version used to build the
drbd.ko module. I double-checked by using all userland tools from the
8.3.13 elrepo build together with my drbd.ko built on 2.6.32-71 (but run
on 2.6.32-220).
Just to be clear: all tests were made with kernel 2.6.32-220, and the
userland version does not matter.
drbd.ko              | 8.3.11 | 8.3.13
---------------------+--------+-------
built on 2.6.32-71   | good   | good
built on 2.6.32-220  | bad    | bad
So, how do I debug this further? I suspect that looking at the symbols
of both modules might give a clue.
As a knee-jerk response based on a hunch -- you've been warned :) --,
this could be related to the BIO_RW_BARRIER vs. FLUSH/FUA dance that the
RHEL 6 kernel has been doing between the initial RHEL 6 release, and
more recent updates (when they've been backporting the "let's kill
barriers" upstream changes from post-2.6.32).
Try configuring your disk section with no-disk-barrier, no-disk-flushes
and no-md-flushes (in both configurations) and see if your kernel module
change still makes a difference.
Of course, in production you should only use those options if you have
no volatile caches involved in the I/O path.
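For reference, a minimal sketch of what such a disk section could look
like in DRBD 8.3 configuration syntax. The resource name r0 is only a
placeholder (an assumption, not taken from your setup), and the rest of
the resource definition is left out:

```
resource r0 {
  disk {
    # Testing only: these disable write-ordering safeguards and are
    # safe only when no volatile caches sit in the I/O path.
    no-disk-barrier;   # do not use barriers on the backing device
    no-disk-flushes;   # do not issue flushes to the backing device
    no-md-flushes;     # do not issue flushes to the meta-data device
  }
  # ... existing net, syncer and "on <host>" sections stay unchanged ...
}
```

The same section would need to go into the configuration on both nodes,
followed by "drbdadm adjust r0" so the running resource picks it up.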
Not sure if this is useful, but I sure hope it is. :)
Cheers,
Florian
Oh! Please let me know if this works. :)
digimer
--
Digimer
Papers and Projects: https://alteeve.com