Stan Hoeppner wrote:
> Dan Ritter wrote:
> > You can put cheap SATA disks in, instead of expensive SAS disks.
> > The performance may not be as good, but I suspect you are
> > looking at sheer capacity rather than IOPS.
>
> Stick with enterprise quality SATA disks. Throwing "drive of the week"
> consumer models, i.e. WD20EARS, in the chassis simply causes unnecessary
> heartache down the road.
There are inexpensive disks and then there are cheap disks. Those
"green" drives are definitely in the "cheap" category, and much harder
to deal with than the "inexpensive" category. I think there are three
big classifications of drives: good quality enterprise drives,
inexpensive drives, and cheap drives. The cheap drives are really
terrible!

> > Now, the next thing: I know it's tempting to make a single
> > filesystem over all these disks. Don't. The fsck times will be
> > horrendous. Make filesystems which are the size you need, plus a
> > little extra. It's rare to actually need a single gigantic fs.

Agreed. But for me it isn't about the fsck time. It is about the size
of the problem. If you have a full 100G filesystem and there is a
problem then you have a 100G problem. It is painful, but you can handle
it. If you have a full 10T filesystem and there is a problem then you
have a *HUGE* problem. It is so much more than painful. Therefore when
practical I like to compartmentalize things so that there is isolation
between problems, whether the problem is due to a hardware failure, a
software failure or a human failure, all of which are possible. Having
compartmentalization makes dealing with the problem smaller and easier.
(There is a rough sketch of what I mean below.)

> What? Are you talking crash recovery boot time "fsck"? With any
> modern journaled FS log recovery is instantaneous. If you're talking
> about an actual structure check, XFS is pretty quick regardless of
> inode count as the check is done in parallel. I can't speak to EXTx
> as I don't use them.

You should try an experiment: set up a terabyte ext3 and ext4
filesystem and then perform a few crash recovery reboots of the system.
It will change your mind. :-) (The experiment is sketched below.)

> For a multi terabyte backup server, XFS is the only way to go
> anyway. Using XFS also allows infinite growth without requiring
> array reshapes nor LVM, while maintaining striped write alignment
> and thus maintaining performance.

I agree that XFS is a superior filesystem for large filesystems. I have
used it there for years. XFS has one unfortunate missing feature: you
can't resize a filesystem to be smaller. You can resize them larger,
but not smaller. That is a feature I miss compared to other
filesystems. (Also sketched below.)

Unfortunately I have some recent FUD concerning xfs. I have had some
small idle xfs filesystems trigger kernel watchdog timer recoveries
recently. Emphasis on idle; active filesystems are always fine. I used
/tmp as a large xfs filesystem but swapped it to ext4 due to these
lockups. Squeeze, everything current. But when idle it would
periodically lock up, and the only messages in the syslog and on the
system console concerned xfs threads that had timed out. When the
kernel froze it always had these messages displayed[1]. It was simply
using /tmp as a hundred-gig-or-so xfs filesystem. Doing nothing but
changing /tmp from xfs to ext4 resolved the problem, and the machine
hasn't seen a kernel lockup since. I saw that problem on three
different machines, but they were effectively all mine, with very
similar software configurations. And by kernel lockup I mean
unresponsive; it took a power cycle to free it. I hesitated to say
anything for lack of real data, but it means I can't completely
recommend xfs today even though I have given it strong recommendations
in the past. I am thinking that recent kernels are not completely clean
specifically for idle xfs filesystems, while active ones seem to be
just fine.
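To make the compartmentalization point concrete, here is the sort of
thing I mean. Only a sketch; the volume group vg0 and the volume names
are made up:

    # Carve separate filesystems per data set instead of one giant one.
    # vg0 and the volume names are hypothetical.
    lvcreate -L 500G -n backup-host1 vg0
    lvcreate -L 500G -n backup-host2 vg0
    mkfs.xfs /dev/vg0/backup-host1
    mkfs.xfs /dev/vg0/backup-host2

That way a hardware, software or human failure in one 500G filesystem
stays a 500G problem instead of a 10T problem.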
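As for the crash recovery experiment, I mean something along these
lines. Untested sketch; /dev/sdX1 stands in for a scratch partition you
can afford to lose, and sysrq must be enabled:

    mkfs.ext3 /dev/sdX1             # repeat the whole run with mkfs.ext4
    mount /dev/sdX1 /mnt
    cp -a /usr /mnt                 # give the checker some inodes to walk
    echo b > /proc/sysrq-trigger    # simulate a crash: reboot, no sync
    # after the machine comes back up:
    time e2fsck -f /dev/sdX1        # force a full structure check

Obviously do not run that on a machine you care about.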
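On the resize point, the asymmetry looks like this. Device and mount
point names are again hypothetical:

    xfs_growfs /srv/backup          # XFS grows, even while mounted...
    # ...but there is no shrink. ext4, offline, can go both ways:
    umount /dev/sdX1
    e2fsck -f /dev/sdX1             # resize2fs wants a clean check first
    resize2fs /dev/sdX1 500G        # shrink to 500G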
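And the /tmp swap was nothing more exotic than a mkfs.ext4 over the
same volume and a one-word change in /etc/fstab, roughly like this
(device name hypothetical):

    # /etc/fstab -- before:
    /dev/vg0/tmp  /tmp  xfs   defaults  0  2
    # after reformatting the volume with mkfs.ext4:
    /dev/vg0/tmp  /tmp  ext4  defaults  0  2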
Would love to have this resolved one way or the other so I could go
back to recommending xfs again without reservations.

> There are hundreds of 30TB+ and dozens of 100TB+ XFS filesystems in
> production today, and I know of one over 300TB and one over 500TB,
> attached to NASA's two archival storage servers.

Definitely XFS can handle large filesystems. And when there is a good
version of everything all around it has definitely been a very good and
reliable performer for me. I wish my recent bad experiences were
resolved. But for large filesystems such as those I think you need a
very good and careful administrator to manage the disk farm. And that
includes disk use policies as much as it includes managing kernel
versions and disk hardware. Huge problems of any sort need more careful
management.

> When using correctly architected reliable hardware there's no reason
> one can't use a single 500TB XFS filesystem.

Although I am sure it would work, I would hate to have to deal with a
problem that large when there is a need for disaster recovery. I guess
that is why *I* don't manage storage farms that are that large. :-)

Bob

[1] Found an old log trace. Stock Squeeze, everything current. /tmp was
the only xfs filesystem on the machine. Most of the time the recovery
would work fine, but whenever the machine was locked up frozen this was
always displayed on the console. Doing nothing but replacing the xfs
/tmp with an ext4 /tmp made the system freeze problem disappear. I
could put it back and see if the kernel freeze reappears, but I don't
want to.

May 21 09:05:38 fs kernel: [3865560.844047] INFO: task xfssyncd:1794 blocked for more than 120 seconds.
May 21 09:05:38 fs kernel: [3865560.925322] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 21 09:05:38 fs kernel: [3865561.021188] xfssyncd D 00000010 0 1794 2 0x00000000
May 21 09:05:38 fs kernel: [3865561.021204] f618d940 00000046 f6189980 00000010 f8311020 c13fc000 c13fc000 c13f7604
May 21 09:05:38 fs kernel: [3865561.021234] f618dafc c3508000 00000001 00036b3f 00000000 00000000 c3554700 d21b3280
May 21 09:05:38 fs kernel: [3865561.021265] c3503604 f618dafc 3997fc05 c35039c8 c106ef0b 00000000 00000000 00000000
May 21 09:05:38 fs kernel: [3865561.021295] Call Trace:
May 21 09:05:38 fs kernel: [3865561.021326] [<c106ef0b>] ? rcu_process_gp_end+0x27/0x63
May 21 09:05:38 fs kernel: [3865561.021339] [<c125d891>] ? schedule_timeout+0x20/0xb0
May 21 09:05:38 fs kernel: [3865561.021352] [<c1132b9b>] ? __lookup_tag+0x8e/0xee
May 21 09:05:38 fs kernel: [3865561.021362] [<c125d79a>] ? wait_for_common+0xa4/0x100
May 21 09:05:38 fs kernel: [3865561.021374] [<c102daad>] ? default_wake_function+0x0/0x8
May 21 09:05:38 fs kernel: [3865561.021405] [<f8bad6d2>] ? xfs_reclaim_inode+0xca/0x117 [xfs]
May 21 09:05:38 fs kernel: [3865561.021425] [<f8bade3c>] ? xfs_inode_ag_walk+0x44/0x73 [xfs]
May 21 09:05:38 fs kernel: [3865561.021445] [<f8bad71f>] ? xfs_reclaim_inode_now+0x0/0x4c [xfs]
May 21 09:05:38 fs kernel: [3865561.021465] [<f8badea1>] ? xfs_inode_ag_iterator+0x36/0x58 [xfs]
May 21 09:05:38 fs kernel: [3865561.021484] [<f8bad71f>] ? xfs_reclaim_inode_now+0x0/0x4c [xfs]
May 21 09:05:38 fs kernel: [3865561.021504] [<f8baded1>] ? xfs_reclaim_inodes+0xe/0x10 [xfs]
May 21 09:05:38 fs kernel: [3865561.021530] [<f8badef6>] ? xfs_sync_worker+0x23/0x5c [xfs]
May 21 09:05:38 fs kernel: [3865561.021549] [<f8bad901>] ? xfssyncd+0x134/0x17d [xfs]
May 21 09:05:38 fs kernel: [3865561.021569] [<f8bad7cd>] ? xfssyncd+0x0/0x17d [xfs]
May 21 09:05:38 fs kernel: [3865561.021580] [<c10441e0>] ? kthread+0x61/0x66
May 21 09:05:38 fs kernel: [3865561.021590] [<c104417f>] ? kthread+0x0/0x66
May 21 09:05:38 fs kernel: [3865561.021601] [<c1003d47>] ? kernel_thread_helper+0x7/0x10