Mark Dotson said the following on 11/14/06 1:18 PM:
I've had continued problems with the 3ware series SATA cards and the
Tyan boards. Specifically, I have a "Tyan S5360-1U" and both a
9500S-4LP and an 8506 series 3ware card.
In my case the first error is different, but the 'resetting' over and
over is VERY familiar. This could be triggered by a simple file copy
from one part of a container to another, degrading the unit and
triggering the resetting crap. Note that the drives are fine; I tested
that first thing.
Sep 8 11:59:23 localhost kernel: 3w-9xxx: scsi0: WARNING:
(0x06:0x002C): Unit #1: Command (0x2a) timed out, resetting card.
Sep 8 11:59:41 localhost kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E):
Cache synchronized after power fail:unit=0.
Sep 8 11:59:41 localhost kernel: 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E):
Cache synchronized after power fail:unit=1.
I also found this problem to exist across platforms, not just FreeBSD.
For example, the excerpt above is from a CentOS box.
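For anyone who wants to check their own logs for the same pattern, here
is a rough Python sketch. It assumes each syslog message sits on a single
line (i.e. the mail wrapping in the excerpt above is undone) and that the
line layout matches the 3w-9xxx warning quoted above; the function and
pattern names are made up for illustration.

```python
import re

# Matches the 3w-9xxx timeout warning quoted above. Assumption: one
# message per line, in the same layout as the CentOS excerpt.
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+"
    r"3w-9xxx:\s+scsi(?P<host>\d+):\s+WARNING:\s+"
    r"\(0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+\):\s+"
    r"Unit #(?P<unit>\d+):\s+Command \((?P<opcode>0x[0-9A-Fa-f]+)\) timed out"
)


def find_timeouts(lines):
    """Return (timestamp, scsi host, unit, opcode) for each timeout warning."""
    hits = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            hits.append(
                (m["stamp"], int(m["host"]), int(m["unit"]), m["opcode"])
            )
    return hits


# Two lines from the excerpt above, re-joined onto single lines.
sample = [
    "Sep 8 11:59:23 localhost kernel: 3w-9xxx: scsi0: WARNING: "
    "(0x06:0x002C): Unit #1: Command (0x2a) timed out, resetting card.",
    "Sep 8 11:59:41 localhost kernel: 3w-9xxx: scsi0: AEN: INFO "
    "(0x04:0x005E): Cache synchronized after power fail:unit=0.",
]

for hit in find_timeouts(sample):
    print(hit)
```

Opcode 0x2a is the SCSI WRITE(10) command, which fits a reset triggered
by a plain file copy.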
All tests were done with newest firmware for both card and mobo, and
using the newest drivers provided by 3ware.
Once I removed the card and drives from the Tyan system and stuck them
in pretty much ANY other system, they worked fantastically.
I don't have an answer for the "resetting problem" as of yet... 3ware
and Tyan (And my system vendor "Appro") are still trying to find my
specific problem and solve it. I believe they are currently doing the
"replace everything" method of troubleshooting.
Mark, thank you.
It's good to know that the resetting problem exists on other platforms too.
We already found out that replacing the entire box with an identical one
doesn't help, so unfortunately we'll have to start swapping in components
of different brands or models.
I wouldn't like to touch the I/O subsystem (these are already loaded
production machines), so like you said, the safest bet would be to try
another motherboard.
However, I don't see many dual Opteron boards in 3ware's
compatibility list. The next one that comes to mind from that
list is the Supermicro H8DC8, but it looks more like a gamer's dream
(high-end PCI-e graphics, SLI, etc., but no on-board VGA) than a server
board.
I'm quite surprised that the top Opteron-based motherboard manufacturer
listed in the 3ware web site motherboard compatibility docs:
http://3ware.com/products/pdf/Motherboard_compatibility_list_9550SX_2006_06.pdf
makes 2 of the 5 boards marked as compatible, yet they perform so
badly with 3ware cards.
I know what happens here in this mailing list when somebody looks for
good SATA cards (Re: 3ware, 3ware, ...), I replied myself too.
So are there any success stories with the 3ware 9550SX (SATA II) and dual
AMD Opteron server boards, or is it time to go back to Intel?
Regards,
Atanas
Atanas wrote:
Has anyone experienced this:
twa0: ERROR: (0x05: 0x2018): Passthru request timed out!: request =
0xca839d20
twa0: INFO: (0x16: 0x1108): Resetting controller...:
twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=0
...
twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=7
twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1
twa0: INFO: (0x16: 0x1107): Controller reset done!:
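To tally these messages from a saved console log, something like the
following might work. It is only a sketch in Python: it assumes the lines
look exactly like the twa0 output quoted above, one message per line, and
the helper names are hypothetical.

```python
import re
from collections import Counter

# Pattern for the twa console lines quoted above, e.g.
#   twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=0
TWA_RE = re.compile(
    r"^twa(?P<ctrl>\d+):\s+(?P<severity>[A-Z]+):\s+"
    r"\((?P<source>0x[0-9A-Fa-f]+):\s*(?P<code>0x[0-9A-Fa-f]+)\):\s+"
    r"(?P<text>.*)$"
)


def parse_twa(lines):
    """Yield one dict per line that matches the twa message layout."""
    for line in lines:
        m = TWA_RE.match(line.strip())
        if m:
            yield m.groupdict()


def summarize(lines):
    """Count messages per (severity, code) pair."""
    return Counter((e["severity"], e["code"]) for e in parse_twa(lines))


# Lines taken from the output above, with the wrapped first line re-joined.
sample = [
    "twa0: ERROR: (0x05: 0x2018): Passthru request timed out!: request = 0xca839d20",
    "twa0: INFO: (0x16: 0x1108): Resetting controller...:",
    "twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=0",
    "twa0: INFO: (0x04: 0x005E): Cache synchronization completed: unit=7",
    "twa0: INFO: (0x04: 0x0001): Controller reset occurred: resets=1",
    "twa0: INFO: (0x16: 0x1107): Controller reset done!:",
]

for key, count in sorted(summarize(sample).items()):
    print(key, count)
```

Counting by severity and code makes it easier to compare how often the
timeout and reset messages show up on each box.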
This happens on 6.2-PRERELEASE i386 (and on 6.1 since its release) on
a number of machines with the following hardware configuration:
- Tyan K8SE 2892, 2 AMD Opteron 270 CPUs, 4GB RAM
- 3ware 9550SX-8LP, 8 500GB Seagate ST3500641AS SATA drives
(configured as 8 SINGLE DISK units, aka JBOD)
All hardware components, including the server chassis, are listed in
the 3ware hardware compatibility lists. It doesn't seem to be a
cabling or power issue. The controller and hard drives are already
flashed to the latest firmware revisions. I tried turning off NCQ, but
it didn't make any difference. I also tried switching the kernel from
PAE to non-PAE (reducing the usable memory to 3GB), but that didn't help
either.
I have other machines with similar I/O configurations (3ware), but
with Intel motherboards and running FreeBSD-5.5, and these have been
running fine for about a year. Now I'm thinking about swapping the
drives between a working Intel box and an AMD box, to see which one
the controller timeouts follow.
The problem happens sporadically, about once a month, and is very
hard to reproduce. Sometimes it takes several weeks until the next
crash happens, sometimes it crashes again in just a few hours.
When it happens, the kernel sometimes panics (most likely due
to the inconsistent filesystem state caused by the controller reset)
and sometimes just hangs. It can be interrupted (I have a serial console),
but the only usable thing after that seems to be "call cpu_reset()",
followed by a full (and sometimes painfully long) filesystem check.
Here are the diffs against the default GENERIC and PAE kernel
configurations:
< cpu I486_CPU
< ident GENERIC
< options INET6 # IPv6 communications protocols
< options SCSI_DELAY=5000 # Delay (in ms) before probing SCSI
> options QUOTA
> options SMP # Symmetric MultiProcessor Kernel
> options BREAK_TO_DEBUGGER
> options DDB
> options KDB
> options KDB_UNATTENDED
> options IPFIREWALL
> options DUMMYNET
I'm attaching the dmesg.boot following the latest crash.
Regards,
Atanas
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable