Hi Andreas,

I saw almost exactly what you described when using ocfs2 on web servers. Some 
time late at night, the load would go through the roof on one web server because 
there were lots of apache processes in the uninterruptible "D" state. If I 
stopped apache on the problem server, the load dropped, but it went back up as 
soon as I started apache again.

Turns out I'd hit a free space fragmentation problem. While df reported I had 
heaps of free space (>50% from memory!), I couldn't write (echo >>) to the log 
files on the problem web server. Note that you'll find you can still create 
small files and append to small files, but not the larger apache log files.
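
If you want to check for this without waiting for the load to spike, a small 
append probe is one way to do it. This is only a sketch: try_append is a 
made-up helper name, and I'd expect the error to be ENOSPC, but I didn't note 
down which error I actually got at the time. Point it at one of the large 
apache log files, since small files may still accept writes.

```python
import errno
import os


def try_append(path, data=b"probe\n"):
    """Append data to path and force it to disk.

    Returns None on success, or the symbolic errno name (e.g. "ENOSPC")
    if the open, write, or fsync fails.
    """
    try:
        with open(path, "ab") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the write really reaches the filesystem
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    return None
```

Note that the probe really does append a line to the file, so use a log you 
don't mind having an extra line in.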

The fact that it happened late at night was very confusing, but eventually made 
sense. As the day goes on, the log files get bigger, and bigger pieces of 
contiguous free space are required to extend them. Eventually, a contiguous 
piece of free space of the required size cannot be found and your writes will 
start to fail.
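
To illustrate why df can look healthy while writes fail, here is a toy model. 
It is not how the ocfs2 allocator actually works; the bitmap layout and the 
function names are made up for the example. It only shows that the total 
amount of free space and the largest contiguous free run are different things.

```python
def free_fraction(bitmap):
    """Fraction of clusters that are free (True = used, False = free)."""
    return bitmap.count(False) / len(bitmap)


def largest_free_run(bitmap):
    """Length of the longest run of adjacent free clusters."""
    best = run = 0
    for used in bitmap:
        run = 0 if used else run + 1
        best = max(best, run)
    return best


def can_allocate(bitmap, clusters_needed):
    """True if a contiguous run of clusters_needed free clusters exists."""
    return largest_free_run(bitmap) >= clusters_needed


# Every other cluster in use: half the space is free, but it is all in
# single-cluster holes.
fragmented = [True, False] * 8

# Same amount of used space, packed together at the front.
compact = [True] * 8 + [False] * 8

print(free_fraction(fragmented), largest_free_run(fragmented))
print(free_fraction(compact), largest_free_run(compact))
print(can_allocate(fragmented, 4), can_allocate(compact, 4))
```

Both layouts report the same free fraction, which is all df can tell you, but 
only the compact one can satisfy a request for four contiguous clusters.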

A *partial* fix went into 2.6.33. It's partial because it doesn't fix the free 
space fragmentation issue but rather allows the problem node to steal some free 
space from the node that is still ok. All it does is postpone the problem a 
little, so that writes eventually start to fail on both nodes at the same time.

Another thing you can do that doesn't require a kernel upgrade is to reduce the 
number of node slots. The default is 8 (-N to mkfs.ocfs2) so reducing this will 
free up some *contiguous* free space. Unfortunately this is an offline 
operation.

This may not be your issue, but it certainly sounds familiar. I recall it was 
very frustrating trying to diagnose the issue.

Cheers,

Brad


On Wed, 3 Mar 2010 11:04:48 +0100
"Andreas Kossmann" <kossmann.andr...@gmx.de> wrote:

> Hello all,
> 
> I have an environment with 2 Debian 5.0 servers. 
> Kernel is 2.6.26-2-amd64. I have installed drbd-8.0.14 and ocfs2-tools 1.4.1.
> It is an Active/Active WebCluster with Apache.
> The 2 servers write to the same log files.
> 
> In my test environment everything works fine. In the production environment I 
> have the problem that after a few weeks the Apache servers go crazy and 
> get a very high load (>100).
> 
> At first I thought the problem might be drbd, but I have read about many 
> problems with ocfs2 and Apache load average.
> 
> The curious thing is that the load is often very high at times when there 
> are very few requests (e.g. 11:00 PM).
> 
> I've disconnected the second webserver from the network and checked the 
> filesystem. A few bitmap errors occurred and I repaired them. Then I changed 
> the drbd config so that only webserver 1 is primary and webserver 2 is 
> secondary, so webserver 2 cannot write to the device. 
> 
> After I connect webserver 2 to the network again and the sync from the 
> primary starts, the load on webserver 1 goes above 100.
> 
> I have also tested connecting webserver 2 with drbd disconnected, and I 
> found that the load on webserver 1 goes a little higher as well.
> 
> Is there any solution for the ocfs2 load problem with apache?
> If there is no solution I have to change from active/active to active/passive 
> with ext3 as filesystem.
> 
> Please, help me.
> 
> Thanks a lot
> 
> Andreas


_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
