Dieter <openbsd <at> sopwith.solgatos.com> writes:

> Recovering from Seagate's problematic 7200.11 firmware.
>
> Most of you have read about the problems with Seagate's
> 7200.11 disks. For those of you that haven't, the firmware
> on many of these drives is buggy, and can "brick" the drive
> when powering up or rebooting the system. Thus far,
> Seagate's response has been less than wonderful. We need
> a FLOSS solution.
>
> Goals:
>
> 1) Ability to read the number of log entries.
>
> 2) Ability to change the number of log entries.

As far as I know, the drive's internal event counter can only be accessed or changed at the firmware level (i.e. via serial or PC-3000). Maybe disabling S.M.A.R.T. automatic off-line data collection (and/or attribute autosave) with smartctl could somehow prevent the internal event log from reaching the magic value (320 or 320 + x*256), because these features save data to the reserved drive area (in case of errors the saved data even includes the POH). POH = Power On Hours.

> 3) Ability to install new firmware from Unix.

Drive firmware flashing at the (S)ATA interface level could be done on Unix, but doing so from a mounted file system (to avoid a reboot) and/or without a controller reset might have catastrophic results. I would venture to say it's even more critical than updating the system BIOS, because there are more variables (e.g. different controllers, RAID, etc.).

> We need for this to work with any flavor of Unix,
> on any CPU arch, without reboot or power cycle.
> We need for this to work on one drive without affecting
> other drives.
>
> I don't expect to be able to write FLOSS firmware for the drives, so
> this isn't listed as a goal. If you think you can, please feel free.

I also think the firmware should be open source, with a portable (any arch) update tool. This would allow many improvements and a much more reliable bug tracking/testing process (there are many firmware bugs, e.g. the NCQ stuttering issue in some versions, self-test log holes, etc.). Writing FLOSS firmware would require some degree of cooperation from Seagate.

> The problem:
>
> "IF the drive is powered down when there are 320 entries in this journal
> or log, then when it is powered back up, the drive errors out on init and
> won't boot properly - to the point that it won't even report its
> information to the BIOS."
>
> Maxtorman, slashdot discussion [2]
>
> If Maxtorman is correct, then once the drive has been operating awhile,
> we have a 1 in 320 chance that the circular log is at entry 320.
> We want to be able to find out how many log entries the disk currently has, and
> we want to be able to change the number of log entries away from 320,
> while we wait for Seagate to get its act together and release firmware
> that works properly. Since Seagate's solution will require attaching
> the drive to an x86 system and booting a FreeDOS ISO from CD, if the log
> is at 320 that boot will brick the drive.
>
> There are other firmware problems with the 7200.11 series, but this is
> the biggie.
>
> Once Seagate releases working firmware, we want to be able to install
> it from Unix, on any CPU arch. Seagate's release can only install
> on x86 using FreeDOS.
>
> ATA commands that may be useful:
>
> command name                  command code (hex)  page [1]  pdf page [1]
> Read Log Ext                  0x2F                27        33
> S.M.A.R.T. Read Log Sector    0xB0 / 0xD5         28,34     34,40
> S.M.A.R.T. Write Log Sector   0xB0 / 0xD6         28,34     34,40
> Write Log Extended            0x3F                28        34
> Download Microcode            0x92                27        33
>
> Questions:
>
> Is Maxtorman correct about the 320 log entries?
>
> Are the commands listed above the ones we need?
> What is the difference between the "Log Extended"
> and the S.M.A.R.T. Log Sector?
> Is "Microcode" the same as "firmware"? (Seagate uses
> the term firmware elsewhere in the manual, but I don't
> find any sort of "write firmware" command.)
>
> Where can we get more detailed info about these
> commands and how to use them?

Maxtorman is right about the 320, but it's a bit more complicated. Here is a detailed description of the failure's root cause (no NDA pets were hurt):

"The firmware issue is that the end boundary of the event log circular buffer (320) was set incorrectly. During Event Log initialization, the boundary condition that defines the end of the Event Log is off by one.
During power up, if the Event Log counter is at entry 320, or a multiple of (320 + x*256), and if a particular data pattern (dependent on the type of tester used during the drive manufacturing test process) had been present in the reserved-area system tracks when the drive's reserved-area file system was created during manufacturing, firmware will increment the Event Log pointer past the end of the event log data structure. This error is detected and results in an "Assert Failure", which causes the drive to hang as a failsafe measure. When the drive enters failsafe further updates to the counter become impossible and the condition will remain through subsequent power cycles. The problem only arises if a power cycle initialization occurs when the Event Log is at 320 or some multiple of 256 thereafter. Once a drive is in this state, there is no path to resolve/recover existing failed drives without Seagate technical intervention."
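
To make the counter half of that trigger condition concrete, here is a minimal Python sketch that flags the dangerous counter values. It assumes the counter can be treated as a plain non-negative integer and that the dangerous values are exactly 320 + x*256 for x >= 0. The names FIRST_TRIGGER, PERIOD, is_trigger_value and trigger_values are hypothetical. The sketch ignores the second requirement, the manufacturing data pattern in the reserved-area system tracks, so a True result means "at risk", not "will fail".

```python
# Hypothetical model of the counter condition in the quoted root cause.
# Assumptions: the counter is an unbounded non-negative integer, and the
# dangerous values are exactly 320 + x*256 for x >= 0. This is NOT
# Seagate's firmware logic, and it ignores the data-pattern requirement.

FIRST_TRIGGER = 320  # first counter value reported as dangerous
PERIOD = 256         # spacing of later dangerous values (320 + x*256)


def is_trigger_value(counter: int) -> bool:
    """Return True if counter == 320 + x*256 for some integer x >= 0."""
    if counter < FIRST_TRIGGER:
        return False
    return (counter - FIRST_TRIGGER) % PERIOD == 0


def trigger_values(limit: int) -> list[int]:
    """List every dangerous counter value up to and including limit."""
    return list(range(FIRST_TRIGGER, limit + 1, PERIOD))


if __name__ == "__main__":
    for n in (319, 320, 321, 576, 832):
        print(n, is_trigger_value(n))
    print(trigger_values(1200))  # [320, 576, 832, 1088]
```

Such a check would only be useful if the counter could actually be read, which, as noted above, seems to require firmware-level access (serial or PC-3000) rather than anything available through the normal (S)ATA command set.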