Dieter <openbsd <at> sopwith.solgatos.com> writes:

> 
> Recovering from Seagate's problematic 7200.11 firmware.
> 
> Most of you have read about the problems with Seagate's
> 7200.11 disks.  For those of you that haven't, the firmware
> on many of these drives is buggy, and can "brick" the drive
> when powering up or rebooting the system.  Thus far,
> Seagate's response has been less than wonderful.  We need
> a FLOSS solution.
> 
> Goals:
> 
>       1) Ability to read the number of log entries.
> 
>       2) Ability to change the number of log entries.

As far as I know, the drive's internal event counter can only be accessed
or changed at the firmware level (i.e. serial/PC-3000). Maybe disabling
S.M.A.R.T. automatic off-line data collection (and/or attribute
autosave) with smartctl could somehow prevent the internal event log
from reaching the magic value (320 or 320+x*256), because S.M.A.R.T. does
save data to the reserved drive area (in case of errors it even includes POH).

POH = Power On Hours
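
For what it's worth, here is a minimal sketch of the smartctl invocation I
have in mind, wrapped in Python so it stays portable. It only builds the
command line and does not run it. The -o and -S switches are the
smartmontools options for automatic off-line testing and attribute
autosave; the helper name and the device path are placeholders of mine,
and whether this really keeps the event log away from the magic value is
untested.

```python
import shlex


def smart_quiet_command(device):
    """Build (but do not run) a smartctl command line that turns off
    automatic off-line data collection and attribute autosave."""
    # -o off  is the short form of --offlineauto=off
    # -S off  is the short form of --saveauto=off
    return ["smartctl", "-o", "off", "-S", "off", device]


if __name__ == "__main__":
    # "/dev/sd0c" is only a placeholder; use the real device node.
    print(shlex.join(smart_quiet_command("/dev/sd0c")))
```

Run the printed command as root, then check the result with
"smartctl -c" on the same device.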

> 
>       3) Ability to install new firmware from Unix.
>

Drive firmware flashing at the (S)ATA interface level could be done
on UNIX, but doing so from a mounted file system (to avoid a reboot)
and/or without a controller reset might have catastrophic results (I would
venture to say it's even more critical than updating the system BIOS,
because there are more variables, e.g. different controllers, RAID, etc.).

> We need for this to work with any flavor of Unix,
> on any CPU arch, without reboot or power cycle.
> We need for this to work on one drive without affecting
> other drives.
> 
> I don't expect to be able to write FLOSS firmware for the drives, so
> this isn't listed as a goal.  If you think you can, please feel free.

I also think the firmware should be open source, with a portable (any arch)
update tool. This would allow many improvements and a much more reliable
bug tracking/testing process (e.g. there are many firmware bugs, like the
NCQ stuttering issue with some versions, self-test log holes, etc.).

Writing FLOSS firmware would require some degree of cooperation from Seagate.

> 
> The problem:
> 
> "IF the drive is powered down when there are 320 entries in this journal
> or log, then when it is powered back up, the drive errors out on init and
> won't boot properly - to the point that it won't even report it's
> information to the BIOS."
> 
>                       Maxtorman, slashdot discussion [2]
> 
> If Maxtorman is correct, then once the drive has been operating awhile,
> we have a 1 in 320 chance that the circular log is at entry 320.  We want
> to be able to find out how many log entries the disk currently has, and
> we want to be able to change the number of log entries away from 320,
> while we wait for Seagate to get its act together and release firmware
> that works properly.  Since Seagate's solution will require attaching
> the drive to an x86 system and booting a FreeDOS ISO from CD, if the log
> is at 320 that boot will brick the drive.
> 
> There are other firmware problems with the 7200.11 series, but this is
> the biggie.
> 
> Once Seagate releases working firmware, we want to be able to install
> it from Unix, on any CPU arch.  Seagate's release can only install
> on x86 using FreeDOS.
> 
> *ATA Commands that may be useful:
> 
> command name                  command code in hex   page [1] pdf page [1]
> Read Log Ext                  0x2F                    27      33
> S.M.A.R.T. Read Log Sector    0xB0 / 0xD5             28,34   34,40
> S.M.A.R.T. Write Log Sector   0xB0 / 0xD6             28,34   34,40
> Write Log Extended            0x3F                    28      34
> Download Microcode            0x92                    27      33
> 
> Questions:
> 
>       Is Maxtorman correct about the 320 log entries?
> 
>       Are the commands listed above the ones we need?
>       What is the difference between the "Log Extended"
>       and the S.M.A.R.T. Log Sector?
>       Is "Microcode" the same as "firmware"?  (Seagate uses
>       the term firmware elsewhere in the manual, but I don't
>       find any sort of "write firmware" command.)
> 
>       Where can we get more detailed info about these
>       commands and how to use them?

Maxtorman is right about the 320, but it's a bit more complicated. Here
is the detailed description of the failure's root cause (no NDA pets were hurt):

"The firmware issue is that the end boundary of the event log circular
buffer (320) was set incorrectly. During Event Log initialization, the
boundary condition that defines the end of the Event Log is off by one.
During power up, if the Event Log counter is at entry 320, or a multiple
of (320 + x*256), and if a particular data pattern (dependent on the type
of tester used during the drive manufacturing test process) had been present
in the reserved-area system tracks when the drive's reserved-area file 
system was created during manufacturing, firmware will increment the Event
Log pointer past the end of the event log data structure. This error is
detected and results in an "Assert Failure", which causes the drive to
hang as a failsafe measure. When the drive enters failsafe further updates
to the counter become impossible and the condition will remain through
subsequent power cycles. The problem only arises if a power cycle
initialization occurs when the Event Log is at 320 or some multiple of 256
thereafter. Once
a drive is in this state, there is no path to resolve/recover existing failed
drives without Seagate technical intervention."
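
To make the arithmetic concrete, here is a small sketch of the "magic
value" test as I read the statement above: a counter value is dangerous at
power-up if it is 320, or 320 plus a whole multiple of 256. The function
and constant names are mine, and the counter is just an input here; as
said, I know of no way to read it over the (S)ATA interface. The
data-pattern precondition from manufacturing is not modelled.

```python
LOG_END = 320   # first dangerous counter value, per the statement above
WRAP = 256      # spacing of the later dangerous values (320 + x*256)


def is_dangerous(counter):
    """True if a power cycle at this event-log counter value could hit
    the off-by-one described above."""
    return counter >= LOG_END and (counter - LOG_END) % WRAP == 0


def next_dangerous(counter):
    """Smallest dangerous counter value that is >= counter."""
    if counter <= LOG_END:
        return LOG_END
    steps = -(-(counter - LOG_END) // WRAP)  # ceiling division
    return LOG_END + steps * WRAP


if __name__ == "__main__":
    for value in (64, 320, 321, 576, 832):
        print(value, is_dangerous(value), next_dangerous(value))
```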
