Update: This error is now occurring once or twice a day, usually while a "backup stgpool" of our primary disk pool is running.
There is nothing in any of our hardware/OS logs, including the PERC controller logs. There is a pending Dell PERC firmware upgrade labeled "Urgent" that we will pursue.

If this is another one of our "bad spots" in one of our disk volumes, can someone from IBM help decode the error to perhaps point to which stgpool volume has the "problem"? We ran an audit on one of the 20+ volumes in this stgpool but nothing showed up as "bad". With over 30TB to run audits on (and of course the volumes are always busy), it will take a while.

The latest message:

10/13/2017 11:55:53 AM ANR1330E The server has detected possible corruption in an object that is being restored or moved. The actual values for the incorrect frame are: magic 20890B50 hdr version 25350 hdr length 2320 sequence number 2114564210 data length D07D0F20 server ID 175174927 segment ID 9270951345039929524 crc 4C5AC0C.
10/13/2017 11:55:53 AM ANR1331E Invalid frame detected. Expected magic 53454652 sequence number 71 server id 0 segment id 2720204019.

On Wed, Oct 11, 2017 at 9:33 AM, Skylar Thompson <skyl...@u.washington.edu> wrote:
> I'm not aware of a fix for the problem (it's with Dell PERC H810s) but the
> problem manifested itself in lots and lots of media errors on a physical
> device, visible when you export the controller log. The symptoms for TSM
> included both CRC errors in the pool and also sporadically awful I/O
> throughput.
>
> The controller logs identified the slot with the media errors, and
> replacing the drive made all the above problems go away. Of course the real
> solution is going to be retiring these soon-to-be-EOSL'd devices, and I've
> finally got a budget to do it...
>
> I'm not actually aware of a fix for the problem, though I didn't spend a
> lot of time looking for one given that we'll be getting rid of the
> equipment in a few weeks. It could very well be an interaction between the
> RAID HBA and physical disk firmware.
> Unfortunately the system has a mix of disk vendors since Dell isn't
> consistent about which vendor they ship for replacements, but the drive I
> identified was a Fujitsu MBD2300RC.
>
> On Tue, Oct 10, 2017 at 02:18:01PM -0400, Zoltan Forray wrote:
> > Thank you for the info. We have started running AUDITs but with 30TB+ in
> > this disk stgpool, it will take a while. I am very interested in
> > additional details on the RAID firmware issue you mentioned - any specifics
> > would be very helpful. AFAIK, we are up-to-date on all Dell firmware (we
> > patch fairly regularly).
> >
> > Within the past 9 months, this server has had 3 diskpool volumes (all part
> > of RAID-5 arrays) suddenly become "bad", requiring full restores, with no
> > explanation since there was no sign of hardware problems. While I did open
> > a PMR with IBM, by the time they looked at my last failure, they said there
> > was nothing they could do to analyze the problem and to call them back the
> > next time it happens.
> >
> > On Tue, Oct 10, 2017 at 2:04 PM, Skylar Thompson <skyl...@u.washington.edu>
> > wrote:
> >
> > > Hi Zoltan,
> > >
> > > We ran into this recently, and it was caused by a firmware bug in a RAID
> > > adapter that caused it not to fail an obviously-failing disk in our disk
> > > spool. We followed the procedure here:
> > >
> > > https://www.ibm.com/support/knowledgecenter/en/SSGSG7_7.1.6/tshoot/r_pdg_1330_1331_msg.html
> > >
> > > It did take a few AUDIT VOLUME-MOVE DATA cycles to find everything but now
> > > it's happy. In a few cases, the file shown by SHOW INVO was obviously
> > > detritus, so we deleted it client-side with DELETE BACKUP instead of an
> > > audit, because it takes a long time to audit our disk volumes.
> > >
> > > On Tue, Oct 10, 2017 at 01:56:47PM -0400, Zoltan Forray wrote:
> > > > Recently we started seeing these errors on one of our servers:
> > > >
> > > > 10/10/2017 13:35:51 ANR1330E The server has detected possible corruption
> > > > in an object that is being restored or moved. The actual values for the
> > > > incorrect frame are: magic 53454652 hdr version 2 hdr length 32 sequence
> > > > number 22610 data length 3FFB0 server ID 0 segment ID 2720223190 crc 0.
> > > > (SESSION: 39218, PROCESS: 171)
> > > > 10/10/2017 13:35:51 ANR1331E Invalid frame detected. Expected magic
> > > > 53454652
> > > >
> > > > The Process ID points to a Backup Stgpool process (the only thing running),
> > > > not anything being "moved or restored". There are also a bunch of sessions
> > > > running/stuck/hung but that is a different problem.
> > > >
> > > > Any idea on how to determine what is causing this? We've seen the error
> > > > quite a few times within the past few days.
> > > >
> > > > --
> > > > *Zoltan Forray*
> > > > Spectrum Protect (p.k.a. TSM) Software & Hardware Administrator
> > > > Xymon Monitor Administrator
> > > > VMware Administrator
> > > > Virginia Commonwealth University
> > > > UCC/Office of Technology Services
> > > > www.ucc.vcu.edu
> > > > zfor...@vcu.edu - 804-828-4807
> > > > Don't be a phishing victim - VCU and other reputable organizations will
> > > > never use email to request that you reply with your password, social
> > > > security number or confidential personal information. For more details
> > > > visit http://phishing.vcu.edu/
> > >
> > > --
> > > -- Skylar Thompson (skyl...@u.washington.edu)
> > > -- Genome Sciences Department, System Administrator
> > > -- Foege Building S046, (206)-685-7354
> > > -- University of Washington School of Medicine

--
*Zoltan Forray*
Spectrum Protect (p.k.a. TSM) Software & Hardware Administrator
Xymon Monitor Administrator
VMware Administrator
Virginia Commonwealth University
UCC/Office of Technology Services
www.ucc.vcu.edu
zfor...@vcu.edu - 804-828-4807
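
Addendum on decoding the messages: the ANR1330E text lists the values the server actually read from the frame header, and ANR1331E lists what it expected, so the two can be compared field by field. The following is only a rough Python sketch, not an IBM tool. The field layout is inferred purely from the messages quoted in this thread, and it assumes that magic, data length and crc are printed in hex while the other fields are decimal. The names `ACTUAL_RE`, `EXPECTED_RE`, `parse_frames` and `mismatched_fields` are made up for illustration.

```python
import re

# Field labels copied from the ANR1330E / ANR1331E text quoted in this thread.
# Assumption: magic, data length and crc are hex; the other fields are decimal.
ACTUAL_RE = re.compile(
    r"magic\s+(?P<magic>[0-9A-Fa-f]+)\s+"
    r"hdr\s+version\s+(?P<hdr_version>\d+)\s+"
    r"hdr\s+length\s+(?P<hdr_length>\d+)\s+"
    r"sequence\s+number\s+(?P<sequence>\d+)\s+"
    r"data\s+length\s+(?P<data_length>[0-9A-Fa-f]+)\s+"
    r"server\s+ID\s+(?P<server_id>\d+)\s+"
    r"segment\s+ID\s+(?P<segment_id>\d+)\s+"
    r"crc\s+(?P<crc>[0-9A-Fa-f]+)"
)

# The 10/10 quote of ANR1331E stops after the magic value, so the
# remaining expected fields are optional here.
EXPECTED_RE = re.compile(
    r"Expected\s+magic\s+(?P<magic>[0-9A-Fa-f]+)"
    r"(?:\s+sequence\s+number\s+(?P<sequence>\d+)"
    r"\s+server\s+id\s+(?P<server_id>\d+)"
    r"\s+segment\s+id\s+(?P<segment_id>\d+))?"
)


def parse_frames(log_text):
    """Return (actual, expected) dicts of field name -> string value."""
    actual = ACTUAL_RE.search(log_text)
    expected = EXPECTED_RE.search(log_text)
    if actual is None or expected is None:
        raise ValueError("no ANR1330E/ANR1331E pair found in text")
    return actual.groupdict(), expected.groupdict()


def mismatched_fields(log_text):
    """List the expected fields whose actual value differs."""
    actual, expected = parse_frames(log_text)
    differing = []
    for name, want in expected.items():
        if want is None:  # field not present in the ANR1331E text
            continue
        base = 16 if name == "magic" else 10
        if int(actual[name], base) != int(want, base):
            differing.append(name)
    return differing


LATEST = (
    "ANR1330E The server has detected possible corruption in an object that "
    "is being restored or moved. The actual values for the incorrect frame "
    "are: magic 20890B50 hdr version 25350 hdr length 2320 sequence number "
    "2114564210 data length D07D0F20 server ID 175174927 segment ID "
    "9270951345039929524 crc 4C5AC0C. "
    "ANR1331E Invalid frame detected. Expected magic 53454652 sequence "
    "number 71 server id 0 segment id 2720204019."
)

FIRST = (
    "ANR1330E The server has detected possible corruption in an object that "
    "is being restored or moved. The actual values for the incorrect frame "
    "are: magic 53454652 hdr version 2 hdr length 32 sequence number 22610 "
    "data length 3FFB0 server ID 0 segment ID 2720223190 crc 0. "
    "(SESSION: 39218, PROCESS: 171) "
    "ANR1331E Invalid frame detected. Expected magic 53454652"
)

if __name__ == "__main__":
    print("10/13 mismatches:", mismatched_fields(LATEST))
    print("10/10 mismatches:", mismatched_fields(FIRST))
```

For the 10/13 message, every field that can be compared (magic, sequence number, server id, segment id) differs from the expected value, which would be consistent with the server reading unrelated data where a frame header should be. For the 10/10 message the magic matches, and the quoted ANR1331E text gives nothing else to compare. The expected magic 53454652 is simply the ASCII bytes "SEFR". None of this identifies a storage pool volume, because the messages as quoted carry no volume name.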